GenAI System Design

Connect on LinkedIn View GitHub profile

All Posts

Practical notes on GenAI system design, LLM serving, and ML infrastructure.

  • Mar 7, 2026
    System Design · Interview

    Case Study: Designing a GitHub Copilot-Style Code Completion Backend

    A Staff+ GenAI system design case study. How do you build code autocomplete at 1B tokens per day, under 100ms p99 latency, across millions of developers?

  • Mar 6, 2026
    Transformers · LLM Serving · GenAI · Infrastructure

    Attention Mechanisms: A Backend Engineer's Guide

    Understanding attention variants (MHA, MQA, GQA, SWA) is not just an ML topic. The variant your model uses determines your KV cache budget, GPU tier, tensor parallelism constraints, and maximum batch size.

  • Mar 1, 2026
    AI Agents · System Design · GenAI · Interview

    Case Study: Designing an AI-Powered Order Support Agent for an Enterprise Logistics Platform

    A production deep dive into multi-agent orchestration, hybrid retrieval, inference framework tradeoffs, and why fine-tuning your policy knowledge base will come back to haunt you.

  • Jan 16, 2026
    System Design · GenAI · AI Engineering · Interview

    A Framework for GenAI System Design Case Studies

    A 9-step framework for designing production systems around large language models. Covers requirements, architecture choices, data strategy, model selection, inference infrastructure, guardrails, evaluation, and deployment.

  • Jan 14, 2026
    LLM Inference · AI Engineering · GenAI · Interview

    Logits, Sampling, and Token Selection in LLM Inference

    Before a token leaves the model, it passes through logit processing and sampling. This step is where temperature, top-k, top-p, and structured output constraints all live. Here's how it works and why it matters for serving.

  • Jan 12, 2026
    Reasoning Models · LLM Serving · AI Engineering · System Design

    Test-Time Compute and LLM Serving

    Models that 'think longer' break the assumptions your serving stack was built on. Latency becomes unpredictable, KV caches explode, batching gets harder, and speculative decoding loses its edge. Here's what changes.

  • Jan 11, 2026
    Distributed Systems · LLM Serving · AI Engineering · System Design

    Disaggregated Inference: Why Prefill and Decode Belong on Different Servers

    Prefill and decode have completely different hardware profiles. Running them on the same GPU pool wastes resources in both directions. Disaggregated inference separates them, but introduces a hard distributed systems problem: migrating the KV cache across the network.

  • Jan 10, 2026
    Embedding Models · RAG · AI Engineering · Interview

    Vector Search: Role in RAG and GenAI

    Why keyword search fails for AI apps, how embeddings map meaning to math, and why approximate indexes like HNSW are the only practical way to search a billion documents without melting your servers. A deep-dive case study covering the full stack from embedding models to production filtering.

  • Jan 8, 2026
    Memory Management · LLM Serving · AI Engineering · Interview

    PagedAttention & vLLM: Fixing the KV Cache Memory Crisis

    Why does the KV cache waste so much memory, and how does PagedAttention fix it? We break down the fragmentation problem, virtual memory for GPUs, and how block-level sharing enables massive batch sizes.

  • Jan 7, 2026
    Memory Optimization · LLM Inference · AI Engineering · Interview

    Understanding the KV Cache: The Memory Wall of LLM Inference

    The KV Cache is the most critical memory component in modern LLM serving. Without it, text generation is impossibly slow. With it, you hit a severe memory bottleneck. Let's break down what it actually is and why it exists.

  • Jan 6, 2026
    Inference Optimization · LLM Inference · AI Engineering · System Design

    Speculative Decoding: Making Large Models Generate Faster

    Speculative Decoding speeds up LLM text generation by using a small 'draft' model to guess ahead, then verifying those guesses in parallel with the big model. It attacks time per output token (TPOT) but carries a cost when guesses fail.

  • Jan 5, 2026
    LLM Inference · AI Engineering · System Design · Interview

    Prefill and Decode in LLM Inference

    Prefill and decode are the two core stages of every LLM inference call. Knowing what happens in each helps you build a robust inference service, debug latency, monitor performance, and make informed tradeoffs in system design.

© 2026 Ashish Bhutani. All rights reserved.

Practical GenAI system design for backend engineers.
