All Posts
Practical notes on GenAI system design, LLM serving, and ML infrastructure.
-
Case Study: Designing a GitHub Copilot-Style Code Completion Backend
A Staff+ GenAI system design case study. How do you build code autocomplete at 1B tokens per day, under 100ms p99 latency, across millions of developers?
-
Attention Mechanisms: A Backend Engineer's Guide
Attention variants (MHA, MQA, GQA, SWA) are not just an ML topic. The variant your model uses determines your KV cache budget, GPU tier, tensor parallelism constraints, and maximum batch size.
-
Case Study: Designing an AI-Powered Order Support Agent for an Enterprise Logistics Platform
A production deep dive into multi-agent orchestration, hybrid retrieval, inference framework tradeoffs, and why fine-tuning your policy knowledge base will come back to haunt you.
-
A Framework for GenAI System Design Case Studies
A 9-step framework for designing production systems around large language models. Covers requirements, architecture choices, data strategy, model selection, inference infrastructure, guardrails, evaluation, and deployment.
-
Logits, Sampling, and Token Selection in LLM Inference
Before a token leaves the model, it passes through logit processing and sampling. This step is where temperature, top-k, top-p, and structured output constraints all live. Here's how it works and why it matters for serving.
-
Test-Time Compute and LLM Serving
Models that 'think longer' break the assumptions your serving stack was built on. Latency becomes unpredictable, KV caches explode, batching gets harder, and speculative decoding loses its edge. Here's what changes.
-
Disaggregated Inference: Why Prefill and Decode Belong on Different Servers
Prefill and decode have completely different hardware profiles. Running them on the same GPU pool wastes resources in both directions. Disaggregated inference separates them, but introduces a hard distributed systems problem: migrating the KV cache across the network.
-
Vector Search: Role in RAG and GenAI
Why keyword search fails for AI apps, how embeddings map meaning to math, and how HNSW lets you search a billion documents without melting your servers. A deep-dive case study covering the full stack from embedding models to production filtering.
-
PagedAttention & vLLM: Fixing the KV Cache Memory Crisis
Why does the KV cache waste so much memory, and how does PagedAttention fix it? We break down the fragmentation problem, virtual memory for GPUs, and how block-level sharing enables massive batch sizes.
-
Understanding the KV Cache: The Memory Wall of LLM Inference
The KV cache is the most critical memory component in modern LLM serving. Without it, text generation is impossibly slow. With it, you hit a severe memory bottleneck. Let's break down what it actually is and why it exists.
-
Speculative Decoding: Making Large Models Generate Faster
Speculative decoding speeds up LLM text generation by using a small 'draft' model to guess several tokens ahead, then verifying those guesses in a single parallel pass of the large model. It attacks TPOT but carries a cost when guesses miss.
-
Prefill and Decode in LLM Inference
Prefill and decode are the two core stages of any LLM inference call. Knowing what happens in each helps you build a robust inference service, debug latency issues, observe performance, and make informed tradeoffs in system design.