GenAI System Design

All Posts

Practical notes on GenAI system design, LLM serving, and ML infrastructure.

  • Apr 2, 2026
    System Design, AI Engineering, LLM Serving, Interview

    Case Study: Designing a Document Intelligence Platform, From ML to GenAI to Hybrid

    A senior engineer's walkthrough of the same two-capability document intelligence system built twice: first with traditional ML (BM25, collaborative filtering, learning to rank), then evolved with GenAI (dense retrieval, RAG, semantic ranking), and finally composed as a hybrid.

  • Mar 27, 2026
    Distributed Systems, LLM Serving, AI Engineering, Interview

    Case Study: Designing a Multi-Tenant LoRA Fine-Tuning and Serving Platform

    A production deep dive into per-tenant adapter training pipelines, GPU memory management for shared base models with swappable LoRA adapters, heterogeneous batching, and adapter-aware routing at scale.

  • Mar 17, 2026
    Distributed Systems, LLM Training, AI Engineering, Interview

    Case Study: Building a Domain-Specific Foundation Model for Healthcare

    A production walkthrough of custom tokenizer design, transformer architecture decisions, distributed training across 256 GPUs, and the compute math behind pre-training a 7B medical language model from scratch.

  • Mar 17, 2026
    LLM Training, Reasoning Models, AI Engineering, Interview

    Case Study: Post-Training a Foundation Model for Reasoning

    A production walkthrough of supervised fine-tuning, reward modeling, RLHF vs DPO alignment, and how reinforcement learning teaches language models to reason through multi-step problems.

  • Mar 15, 2026
    AI Agents, System Design, AI Engineering, Interview

    Case Study: Automating Mortgage Processing with LLM Agents in a Hybrid Cloud Bank

    A production deep dive into hybrid cloud agent architecture, durable workflow orchestration, multi-country document processing, and why most of a mortgage pipeline should NOT be an agent.

  • Mar 13, 2026
    AI Agents, System Design, AI Engineering, Interview

    Case Study: Building an Adaptive Code Review Agent with Learning Feedback Loops

    A production deep dive into three-tier agentic memory architectures, adaptive routing with online feedback, confidence calibration for comment gating, and why your code review agent needs to forget as much as it remembers.

  • Mar 12, 2026
    GenAI, System Design, LLM Serving, concept

    Model Quantization as a Hardware Bottleneck Problem

    Why LLM inference is memory-bandwidth bound, and how quantization functions as the primary lever to fit larger models into fewer GPUs to reduce serving costs.

  • Mar 11, 2026
    AI Agents, System Design, AI Engineering, Interview

    Case Study: Building an Autonomous CI/CD Pipeline Agent for a Large-Scale Monorepo

    A production deep dive into planning loops with backtracking, blast radius estimation, event-driven DAG orchestration, and why autonomous agents need hard boundaries on what they're allowed to fix.

  • Mar 10, 2026
    AI Agents, Distributed Systems, AI Engineering, Interview

    Case Study: Building a Financial Document Processing Pipeline with Transaction Safety

    A production walkthrough of saga orchestration, durable execution for human-in-the-loop workflows, multi-agent handoff protocols, and why idempotency keys are non-negotiable when LLMs write to ledgers.

  • Mar 7, 2026
    System Design, Interview

    Case Study: Designing a GitHub Copilot-Style Code Completion Backend

    A Staff+ GenAI system design case study. How do you build code autocomplete at 1B tokens per day, under 100ms p99 latency, across millions of developers?

  • Mar 6, 2026
    Transformers, LLM Serving, GenAI, Infrastructure

    Attention Mechanisms: A Backend Engineer's Guide

    Understanding attention variants (MHA, MQA, GQA, SWA) is not just an ML topic. The variant your model uses determines your KV cache budget, GPU tier, tensor parallelism constraints, and maximum batch size.

  • Mar 1, 2026
    AI Agents, System Design, GenAI, Interview

    Case Study: Designing an AI-Powered Order Support Agent for an Enterprise Logistics Platform

    A production deep dive into multi-agent orchestration, hybrid retrieval, inference framework tradeoffs, and why fine-tuning your policy knowledge base will come back to haunt you.

  • Jan 16, 2026
    System Design, GenAI, AI Engineering, Interview

    A Framework for GenAI System Design Case Studies

    A 9-step framework for designing production systems around large language models. Covers requirements, architecture choices, data strategy, model selection, inference infrastructure, guardrails, evaluation, and deployment.

  • Jan 14, 2026
    LLM Inference, AI Engineering, GenAI, Interview

    Logits, Sampling, and Token Selection in LLM Inference

    Before a token leaves the model, it passes through logit processing and sampling. This step is where temperature, top-k, top-p, and structured output constraints all live. Here's how it works and why it matters for serving.

  • Jan 12, 2026
    Reasoning Models, LLM Serving, AI Engineering, System Design

    Test-Time Compute and LLM Serving

    Models that 'think longer' break the assumptions your serving stack was built on. Latency becomes unpredictable, KV caches explode, batching gets harder, and speculative decoding loses its edge. Here's what changes.

  • Jan 11, 2026
    Distributed Systems, LLM Serving, AI Engineering, System Design

    Disaggregated Inference: Why Prefill and Decode Belong on Different Servers

    Prefill and decode have completely different hardware profiles. Running them on the same GPU pool wastes resources in both directions. Disaggregated inference separates them, but introduces a hard distributed systems problem: migrating the KV cache across the network.

  • Jan 10, 2026
    Embedding Models, RAG, AI Engineering, Interview

    Vector Search: Role in RAG and GenAI

    Why keyword search fails for AI apps, how embeddings map meaning to math, and how HNSW makes it practical to search a billion documents without melting your servers. A deep-dive case study covering the full stack from embedding models to production filtering.

  • Jan 8, 2026
    Memory Management, LLM Serving, AI Engineering, Interview

    PagedAttention & vLLM: Fixing the KV Cache Memory Crisis

    Why does the KV cache waste so much memory, and how does PagedAttention fix it? We break down the fragmentation problem, virtual memory for GPUs, and how block-level sharing enables massive batch sizes.

  • Jan 7, 2026
    Memory Optimization, LLM Inference, AI Engineering, Interview

    Understanding the KV Cache: The Memory Wall of LLM Inference

    The KV Cache is the most critical memory component in modern LLM serving. Without it, text generation is impossibly slow. With it, you hit a severe memory bottleneck. Let's break down what it actually is and why it exists.

  • Jan 6, 2026
    Inference Optimization, LLM Inference, AI Engineering, System Design

    Speculative Decoding: Making Large Models Generate Faster

    Speculative decoding speeds up LLM text generation by using a small 'draft' model to guess several tokens ahead, then verifying those guesses in parallel with the large model. It attacks time per output token (TPOT) but carries a cost when guesses fail.

  • Jan 5, 2026
    LLM Inference, AI Engineering, System Design, Interview

    Prefill and Decode in LLM Inference

    Prefill and decode are key stages in any LLM inference call. Knowing what happens in each stage helps you build a robust inference service, debug latency, observe performance, and make informed tradeoffs in system design.

© 2026 Ashish Bhutani. All rights reserved.

Practical GenAI system design for backend engineers.
