Ashish Bhutani · 6 min read

Prefill and Decode in LLM Inference

LLM Inference · AI Engineering · System Design · Interview

This post assumes you have a high-level understanding of what an LLM is (prompts go in, tokens come out) but want to know what actually happens on the GPU during that process. If you’re completely new to LLMs, I recommend checking out a basic primer first.

The 30-Second Version

LLM inference isn’t one continuous process. It’s split into two phases: Prefill (reading the prompt), which is math-heavy and limited by GPU compute, and Decode (writing the response), which is memory-heavy and limited by how fast the GPU can move data. If you don’t treat them differently in your system design, your application will be slow and crash under load.

Why this matters in production

Most backend engineers treat LLM inference as a single black-box API call. That works until things start breaking in confusing ways. Here are common production failures that start making sense once you understand the Prefill/Decode split:

  • The “Sluggish Start” — Users upload a 50-page PDF and wait 10 seconds for the first token to appear. This is a Prefill bottleneck — the GPU is buried in math processing the input.
  • The “Chatty” Bot Slowdown — Your bot is snappy at first, but crawls as the conversation gets longer. This is a KV Cache bandwidth wall during the Decode phase.
  • The “Stuttering” Stream — A user is getting a response, but it randomly pauses when someone else joins the chat. This is a scheduling problem — Continuous Batching and the “Noisy Neighbor” effect.
  • The “OOM” Crash — You have plenty of VRAM, but your system crashes after only two users. This is Memory Fragmentation — and why PagedAttention exists.

All of these trace back to one core misunderstanding about how models are actually served.


Why LLM speed is actually two different things

One thing that tends to trip up experienced backend engineers is that LLM serving isn’t one workload. It’s two very different ones hiding behind a single API call.

When you hit a “generate” endpoint, you’re triggering a two-act play. Act one is Prefill (reading the prompt). Act two is Decode (writing the response).

If you treat them the same in your monitoring or load balancing, the user experience degrades fast.

Let’s break down Act One first and see why big prompts cause so much trouble.


Prefill: Handling the input

This is the model gulping down your entire prompt in one go.

It does this in parallel. If you send a 2,000-word prompt, the GPU tries to crunch all those tokens at once. This phase is Compute-Bound. It’s raw, heavy math. The bottleneck here is the GPU’s TFLOPS (how fast the cores can multiply numbers).

If you’re seeing high “Time to First Token” (TTFT), your prompts are likely too big for your compute power. In practice, this is where RAG apps die. You stuff 20 documents into a prompt, and the user stares at a spinner for 5 seconds while the GPU sweats through the math.
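To build intuition for why long prompts hurt TTFT so much, here’s a back-of-envelope sketch. All the numbers are illustrative assumptions, not measurements: a 7B-parameter model and a GPU sustaining roughly 300 TFLOPS of usable compute, with the rule of thumb that prefill costs about 2 FLOPs per weight per prompt token.

```python
# Back-of-envelope prefill (TTFT) estimate. Assumed numbers:
# a 7B-parameter model, ~300 TFLOPS of sustained GPU compute.
def prefill_seconds(params_b: float, prompt_tokens: int, tflops: float) -> float:
    # Prefill does roughly 2 * params FLOPs per token (one multiply-add
    # per weight), for every token in the prompt, all in one pass.
    flops = 2 * params_b * 1e9 * prompt_tokens
    return flops / (tflops * 1e12)

# A 200-token chat message vs. a 32,000-token "50-page PDF" prompt:
print(f"{prefill_seconds(7, 200, 300):.3f}s")     # milliseconds
print(f"{prefill_seconds(7, 32_000, 300):.2f}s")  # whole seconds of spinner
</```>

The exact constants don’t matter; the point is that prefill cost scales linearly with prompt length, so a 160× longer prompt means a 160× longer wait before the first token.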

Ok, so the model has digested the prompt. What happens when it starts talking back?


Decode: Generating the output

Once the first word pops out, the model switches to Decode mode.

It generates one token at a time, sequentially. Surprisingly, the bottleneck here isn’t the math; it’s Memory Bandwidth.

To produce a single token, the GPU has to pull every single model weight and the entire conversation history—stored in the KV Cache—from its VRAM.

(Note: I’ll be doing a deep-dive post on the KV Cache soon, but for now, just think of it as the model’s short-term memory).

This is a big part of why H100s are so expensive: the massive “pipes” (HBM3 memory) move data fast enough that the cores aren’t just sitting there waiting.
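You can see the bandwidth ceiling with a quick estimate. Again, these are assumed round numbers (a 7B model in fp16 is ~14 GB of weights; ~3,350 GB/s is roughly H100-class HBM bandwidth), not benchmarks:

```python
# Rough decode-speed ceiling: every generated token re-reads all the
# weights plus the whole KV cache from VRAM. Assumed numbers: 14 GB of
# fp16 weights, ~3350 GB/s of HBM bandwidth.
def decode_tokens_per_sec(weight_gb: float, kv_cache_gb: float,
                          bandwidth_gbps: float) -> float:
    bytes_moved_gb = weight_gb + kv_cache_gb  # per token generated
    return bandwidth_gbps / bytes_moved_gb

print(round(decode_tokens_per_sec(14, 0.5, 3350)))  # short chat: ~231 tok/s
print(round(decode_tokens_per_sec(14, 8.0, 3350)))  # long chat: ~152 tok/s
</```>

This is exactly the “chatty bot slowdown” from earlier: the weights are a fixed cost, but the KV cache grows with the conversation, so the same GPU gets slower per token as the history piles up.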

Knowing this, the obvious solution seems to be packing more requests into a single GPU to increase throughput. But that introduces a whole new set of problems.


The problem with batching

Batching is how we make these systems affordable. Running one request per GPU is a fast way to go broke.

But here’s the catch: the more requests you pack into a batch (to increase throughput), the more you’re splitting that fixed memory bandwidth we just talked about in the Decode phase.

Your “typing” speed (TPOT, Time Per Output Token) will start to crawl.

I’ve seen teams tune for “90% utilization” only to realize their chatbot was typing at 2 words per second. It feels broken to the user.
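The tradeoff is easy to sketch with the same assumed numbers as before (14 GB of fp16 weights, ~3,350 GB/s bandwidth, ~2 GB of KV cache per active conversation). The key detail: weights are read once per decode step and shared across the batch, but each request’s KV cache must be read separately.

```python
# Per-user decode latency vs. batch size. Assumed numbers: 14 GB of
# weights (read once per step, shared), ~2 GB of KV cache per request
# (read per request), ~3350 GB/s of memory bandwidth.
def tpot_ms(batch_size: int, weight_gb=14.0, kv_gb=2.0, bw_gbps=3350.0) -> float:
    step_seconds = (weight_gb + batch_size * kv_gb) / bw_gbps
    return step_seconds * 1000  # every user waits this long per token

for b in (1, 8, 32, 64):
    print(b, round(tpot_ms(b), 1), "ms/token")
</```>

Throughput (total tokens/sec across the batch) keeps climbing, which is why the utilization graphs look great, but each individual user’s tokens arrive further and further apart.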

And it gets worse when you mix different types of requests together on the same server.


Scheduling and “noisy neighbors”

Standard load balancers are pretty useless here.

If User A sends a massive document for summary, that Prefill phase is going to hog the GPU’s compute for a solid second or two.

During that time, User B—who is just waiting for their next token in the Decode phase—will experience a stutter. The text just stops.

To fix this, you need a scheduler that supports Chunked Prefill (slicing the heavy prefill work into pieces so it doesn’t block the light decode work) or Prefix Caching (skipping repeated prefill work entirely).
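Here’s a toy sketch of the chunked-prefill idea. This is a hypothetical scheduler I made up for illustration, not the actual logic from vLLM or any real serving engine: instead of running User A’s long prefill in one blocking shot, it slices the prompt into fixed-size chunks and interleaves a decode step for everyone else between chunks.

```python
# Toy chunked-prefill scheduler (hypothetical, for illustration only).
from collections import deque

def schedule(prefill_tokens: int, chunk: int = 512):
    pending = deque(range(0, prefill_tokens, chunk))
    timeline = []
    while pending:
        start = pending.popleft()
        size = min(chunk, prefill_tokens - start)
        timeline.append(("prefill_chunk", size))
        # Decode requests get a turn between chunks, so nobody stutters.
        timeline.append(("decode_step", "all active requests emit 1 token"))
    return timeline

for step in schedule(2000, chunk=512):
    print(step)
</```>

With this interleaving, User B’s worst-case stutter drops from “the entire document’s prefill” to “one chunk’s worth of prefill,” at the cost of User A’s prefill finishing slightly later.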

(Note: We’ll talk about the mechanics of Prefix Caching in a later post, but essentially it’s just recycling old ‘memory’ to save time).

So, how do we avoid falling into these traps when sizing our infrastructure?


What to watch when scaling

Don’t scale on CPU or some generic “GPU Load” percentage.

Scale on KV Cache Occupancy.

If your memory is full, it doesn’t matter if your compute cores are idle—you can’t fit another word in. You either have to wait for someone else to finish or start “swapping” data out, which absolutely nukes your performance.

A common mistake is benchmarking with 10-token prompts and thinking you’re ready for production. It looks fast in dev. Then real users start pasting in long emails or code snippets, and everything falls apart because the prefill math on long contexts is a completely different load profile.
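A quick capacity estimate shows why short-prompt benchmarks lie to you. The numbers below are assumptions for a 7B-class fp16 model (an 80 GB card, 14 GB of weights, ~160 KB of KV cache per token); plug in your own model’s figures:

```python
# How many concurrent requests fit before the KV cache is full.
# Assumed numbers: 80 GB card, 14 GB of weights, ~160 KB of KV cache
# per token for a 7B-class model in fp16.
def max_concurrent_requests(vram_gb=80.0, weights_gb=14.0,
                            kv_kb_per_token=160.0, avg_context_tokens=4000):
    kv_budget_gb = vram_gb - weights_gb
    per_request_gb = kv_kb_per_token * avg_context_tokens / 1e6
    return int(kv_budget_gb // per_request_gb)

print(max_concurrent_requests())                         # real 4k contexts
print(max_concurrent_requests(avg_context_tokens=512))   # short dev prompts
</```>

Same GPU, same model: short dev prompts fit roughly 8× more concurrent users than realistic contexts do. That gap is exactly where the “it was fine in staging” OOM crash comes from.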


Next Steps

The next natural question from here is how to speed up the Decode phase itself. I’m planning to cover Speculative Decoding next — a technique that uses a smaller draft model to predict tokens ahead of the main model, attacking that memory bandwidth bottleneck we discussed. Or we could go into MQA/GQA architectures since we touched on the KV Cache problem above.


Note: This blog represents my technical views and production experience. I use AI-based tools to help with drafting and formatting to keep these posts coming daily.
