All Posts
Practical notes on GenAI system design, LLM serving, and ML infrastructure.
-
Case Study: Designing a Document Intelligence Platform, From ML to GenAI to Hybrid
A senior engineer's walkthrough of the same two-capability document intelligence system built twice: first with traditional ML (BM25, collaborative filtering, learning to rank), then evolved with GenAI (dense retrieval, RAG, semantic ranking), and finally composed as a hybrid.
-
Case Study: Designing a Multi-Tenant LoRA Fine-Tuning and Serving Platform
A production deep dive into per-tenant adapter training pipelines, GPU memory management for shared base models with swappable LoRA adapters, heterogeneous batching, and adapter-aware routing at scale.
-
Case Study: Building a Domain-Specific Foundation Model for Healthcare
A production walkthrough of custom tokenizer design, transformer architecture decisions, distributed training across 256 GPUs, and the compute math behind pre-training a 7B medical language model from scratch.
-
Case Study: Post-Training a Foundation Model for Reasoning
A production walkthrough of supervised fine-tuning, reward modeling, RLHF vs DPO alignment, and how reinforcement learning teaches language models to reason through multi-step problems.
-
Case Study: Automating Mortgage Processing with LLM Agents in a Hybrid Cloud Bank
A production deep dive into hybrid cloud agent architecture, durable workflow orchestration, multi-country document processing, and why most of a mortgage pipeline should NOT be an agent.
-
Case Study: Building an Adaptive Code Review Agent with Learning Feedback Loops
A production deep dive into three-tier agentic memory architectures, adaptive routing with online feedback, confidence calibration for comment gating, and why your code review agent needs to forget as much as it remembers.
-
Model Quantization as a Hardware Bottleneck Problem
Why LLM inference is memory-bandwidth bound, and how quantization serves as the primary lever for fitting larger models onto fewer GPUs and cutting serving costs.
-
Case Study: Building an Autonomous CI/CD Pipeline Agent for a Large-Scale Monorepo
A production deep dive into planning loops with backtracking, blast radius estimation, event-driven DAG orchestration, and why autonomous agents need hard boundaries on what they're allowed to fix.
-
Case Study: Building a Financial Document Processing Pipeline with Transaction Safety
A production walkthrough of saga orchestration, durable execution for human-in-the-loop workflows, multi-agent handoff protocols, and why idempotency keys are non-negotiable when LLMs write to ledgers.
-
Case Study: Designing a GitHub Copilot-Style Code Completion Backend
A Staff+ GenAI system design case study. How do you build code autocomplete at 1B tokens per day, under 100ms p99 latency, across millions of developers?
-
Attention Mechanisms: A Backend Engineer's Guide
Understanding attention variants (MHA, MQA, GQA, SWA) is not just an ML topic. The variant your model uses determines your KV cache budget, GPU tier, tensor parallelism constraints, and maximum batch size.
-
Case Study: Designing an AI-Powered Order Support Agent for an Enterprise Logistics Platform
A production deep dive into multi-agent orchestration, hybrid retrieval, inference framework tradeoffs, and why fine-tuning your policy knowledge base will come back to haunt you.
-
A Framework for GenAI System Design Case Studies
A 9-step framework for designing production systems around large language models. Covers requirements, architecture choices, data strategy, model selection, inference infrastructure, guardrails, evaluation, and deployment.
-
Logits, Sampling, and Token Selection in LLM Inference
Before a token leaves the model, it passes through logit processing and sampling. This step is where temperature, top-k, top-p, and structured output constraints all live. Here's how it works and why it matters for serving.
-
Test-Time Compute and LLM Serving
Models that 'think longer' break the assumptions your serving stack was built on. Latency becomes unpredictable, KV caches explode, batching gets harder, and speculative decoding loses its edge. Here's what changes.
-
Disaggregated Inference: Why Prefill and Decode Belong on Different Servers
Prefill and decode have completely different hardware profiles. Running them on the same GPU pool wastes resources in both directions. Disaggregated inference separates them, but introduces a hard distributed systems problem: migrating the KV cache across the network.
-
Vector Search: Role in RAG and GenAI
Why keyword search fails for AI apps, how embeddings map meaning to math, and why HNSW is the only way to search a billion documents without melting your servers. A deep-dive case study covering the full stack from embedding models to production filtering.
-
PagedAttention & vLLM: Fixing the KV Cache Memory Crisis
Why does the KV cache waste so much memory, and how does PagedAttention fix it? We break down the fragmentation problem, virtual memory for GPUs, and how block-level sharing enables massive batch sizes.
-
Understanding the KV Cache: The Memory Wall of LLM Inference
The KV Cache is the most critical memory component in modern LLM serving. Without it, text generation is impossibly slow. With it, you hit a severe memory bottleneck. Let's break down what it actually is and why it exists.
-
Speculative Decoding: Making Large Models Generate Faster
Speculative decoding speeds up LLM text generation by using a small 'draft' model to guess ahead, then verifying those guesses in parallel with the large model. It attacks time per output token (TPOT) but carries a cost when guesses fail.
-
Prefill and Decode in LLM Inference
Prefill and decode are the two main stages of any LLM inference call. Knowing what happens in each stage helps you build a robust inference service, debug latency, observe performance, and make informed tradeoffs in system design.