Ashish Bhutani · 6 min read

Speculative Decoding: Making Large Models Generate Faster

Inference Optimization · LLM Inference · AI Engineering · System Design

Assumes you know why decode is memory-bandwidth bound. If not, read this primer first.

The 30-Second Version

Decode phase is memory-bound, leaving GPU compute idle. We use a tiny “draft” model to quickly guess the next few tokens. We feed those guesses to the big model, which uses its idle compute to verify them all at once. If accurate, you generate multiple tokens in the time normally taken for one.

Why this matters

If you’re running a large model (say 70B parameters) and the generation speed feels sluggish, speculative decoding is one of the more interesting techniques to understand. Here are the kinds of problems it addresses:

  • The “Slow Typer”: Your 70B model streams at 8 tokens/s. Users are complaining. Speculative decoding can help speed this up without swapping to a smaller model.
  • The “Wasted GPU”: You deploy speculation and GPU utilization spikes, but throughput doesn’t improve. This usually means a low acceptance rate from a poorly matched draft model.
  • The “Which Draft?” Decision: Should you use a 1B or 7B draft model? There’s a real speed vs. accuracy trade-off here.
  • The Batch Size Concern: At high concurrency (200+ requests), speculation can actually hurt performance because the compute bottleneck shifts.

Why decode compute is idle

As discussed previously, during token generation (decode), the GPU must load the entire model’s weights from VRAM for every single token.

The actual matrix multiplication is trivial. The GPU spends almost all its time waiting for data to travel across the memory bus. Your expensive GPU ALUs (compute cores) are effectively idle, twiddling their thumbs.
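To put a rough number on that waiting, here is a back-of-the-envelope sketch. The 2,000 GB/s bandwidth figure is an illustrative assumption rather than a measurement of any particular GPU, and `min_decode_ms` is a name made up for this post. It only counts weight traffic and ignores KV cache reads.

```python
def min_decode_ms(params_billion: float, bytes_per_param: float,
                  bandwidth_gb_per_s: float) -> float:
    """Lower bound on per-token decode time from weight loading alone."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes each = GB
    return weight_gb * 1000.0 / bandwidth_gb_per_s


# Assumed setup: 70B parameters in FP16 (2 bytes each), 2,000 GB/s bandwidth.
# 140 GB of weights at 2,000 GB/s is a floor of about 70 ms per token.
floor_ms = min_decode_ms(70, 2, 2000)
```

Whatever the exact figures on your hardware, the floor is set by bytes moved, not by arithmetic, which is why the compute cores have time to spare.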

Speculative decoding is simply a trick to put that idle compute to work.


The core idea: guess and verify

Instead of asking the big model for the next token step-by-step, we use a cheap, tiny “draft” model (e.g., 1B parameters) to guess the next K tokens. This is lightning fast because the draft model barely taxes memory bandwidth.

Next, we feed all K guessed tokens to the big model simultaneously. The big model processes them in parallel in a single forward pass (like a small prefill). Because a decode step is memory-bound, the extra arithmetic for K positions runs on cores that would otherwise sit idle, so verifying the guesses takes roughly the same time as generating one token. The model was going to load its weights from memory anyway!

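You accept every guess up to the first wrong one, and at that position you keep the big model’s own token instead. If the draft guesses 5 tokens and the first 4 are right, one verification pass yields 5 tokens (4 accepted guesses plus the big model’s correction) in roughly the time it normally takes to generate 1, plus the small cost of drafting.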
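Here is a minimal sketch of one guess-and-verify step for greedy decoding. The `NextToken` type and `speculative_step` function are hypothetical stand-ins: a "model" is just a function from a token prefix to its next token. A real engine works on logits and runs the whole verification in one batched forward pass, and sampling-based variants use a probabilistic accept/reject rule instead of an exact match.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly emitted tokens."""
    # 1. Draft phase: the small model guesses k tokens one after another.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(prefix + guesses))

    # 2. Verify phase: compare the target's choice at each position.
    #    A real engine gets all k+1 choices from ONE forward pass;
    #    this loop only mimics the result.
    emitted: List[int] = []
    for i in range(k + 1):
        wanted = target(prefix + guesses[:i])
        emitted.append(wanted)  # always the target's own token
        if i == k or wanted != guesses[i]:
            break  # first mismatch, or all k accepted plus one extra token
    return emitted
```

Every emitted token is the big model's own choice, so under greedy decoding the output matches what the big model would have produced alone. The draft only decides how many tokens come out per step.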

For an industry example of this yielding massive speedups while maintaining identical outputs, check out how Databricks improved LLM serving with speculative decoding.


The math behind the speedup

Assume your big model takes 100ms per token (TPOT = 100ms), and the draft takes 5ms. We speculate K=5 tokens ahead.

Without speculation: 5 tokens takes 500ms.

With speculation: The draft generates 5 guesses in 25ms. The big model verifies all 5 in roughly 100ms. If all 5 are accepted, you generate 5 tokens in 125ms. That’s a 4x speedup.

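Even if only the first 3 guesses are right, you still come out ahead. The step costs the same ~125ms, because the draft produced all 5 guesses before verification started, and you keep the 3 accepted guesses plus the big model’s own token at position 4: 4 tokens in ~125ms instead of 400ms. It’s a huge win, but it relies entirely on one metric.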
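The arithmetic is easy to play with in code. This sketch uses a made-up helper, `speculative_speedup`, and assumes verification costs exactly one big-model step. It charges the draft for all K guesses, since they are produced before verification begins.

```python
def speculative_speedup(target_ms: float, draft_ms: float,
                        k: int, tokens_out: int) -> float:
    """Speedup of one speculative step over plain decoding.

    Assumes verifying k guesses costs one target step, which only
    holds while the GPU has spare compute (small batch sizes).
    """
    step_ms = k * draft_ms + target_ms    # draft all k guesses, verify once
    baseline_ms = tokens_out * target_ms  # same tokens, one target step each
    return baseline_ms / step_ms


full_win = speculative_speedup(100, 5, 5, tokens_out=5)   # 500ms / 125ms
total_miss = speculative_speedup(100, 5, 5, tokens_out=1)  # 100ms / 125ms
```

With the numbers above, five tokens per step gives 4x, while a step that yields only one token comes out slower than plain decoding because the 25ms of drafting bought nothing.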


The acceptance rate is everything

The acceptance rate (α) is the fraction of draft tokens the big model agrees with.

If α is 0.8, speculative decoding is magic. If α is 0.3, you are burning GPU cycles for nothing. You guessed 5, accepted 1, and wasted compute on the other 4.

Acceptance rate is heavily task-dependent:

  • Code completion: High α (code syntax is predictable).
  • Creative writing: Low α (too many valid word choices).
  • Structured JSON: High α (constrained by schema).

You can’t just deploy this blindly. It’s worth monitoring α and tuning K based on your workload.
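One way to reason about tuning K is a simple expected-value model. The sketch below assumes each draft token is accepted independently with probability α, which is a simplification (real acceptance is correlated with position and content), and that each step also yields one token from the big model itself. Under those assumptions the expected tokens per step is (1 - α^(K+1)) / (1 - α). The function names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per step, assuming i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(k + 1)  # every guess accepted, plus the target's token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def best_k(alpha: float, target_ms: float, draft_ms: float,
           max_k: int = 16) -> int:
    """Pick the K that minimises milliseconds per emitted token (0 = off)."""
    def ms_per_token(k: int) -> float:
        step_ms = k * draft_ms + target_ms  # assumes "free" verification
        return step_ms / expected_tokens_per_step(alpha, k)
    return min(range(max_k + 1), key=ms_per_token)
```

For α = 0.8 and K = 5 this model gives about 3.7 tokens per step. Treat `best_k` as a starting point for experiments, not a replacement for measuring on real traffic.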


The wasted compute trade-off

What happens when the draft model misses?

If it gets tokens 1 and 2 right, but misses token 3, you accept 1 and 2, and the big model generates the real token 3. You still got 3 tokens in one step.

However, the draft compute spent on tokens 4 and 5 was wasted. The big model’s verification of those wrong tokens was also wasted. Worst case? If the draft model constantly misses the very first token (α < 0.2), your system is actually slower than vanilla decoding.

Rule of thumb: If α < 0.5, turn off speculation. Only enable it if α > 0.6. The 0.5 to 0.6 range is a gray zone: measure on your own workload before deciding.
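The rule of thumb can be sketched as a small switch over a rolling window of recent steps. `SpeculationGate` is a hypothetical class, and reading the gap between 0.5 and 0.6 as hysteresis (keep the current state) is my assumption. Note that α can only be measured while speculation runs, so a real system would need occasional probe steps while the gate is off.

```python
from collections import deque
from typing import Deque, Tuple


class SpeculationGate:
    """Turns speculation off below `off_below` and back on above `on_above`."""

    def __init__(self, window: int = 200,
                 off_below: float = 0.5, on_above: float = 0.6) -> None:
        # Each entry is (accepted, proposed) for one speculative step.
        self.steps: Deque[Tuple[int, int]] = deque(maxlen=window)
        self.off_below = off_below
        self.on_above = on_above
        self.enabled = True

    def record(self, accepted: int, proposed: int) -> bool:
        """Log one step and return whether speculation should stay enabled."""
        self.steps.append((accepted, proposed))
        proposed_total = sum(p for _, p in self.steps)
        if proposed_total == 0:
            return self.enabled
        alpha = sum(a for a, _ in self.steps) / proposed_total
        if alpha < self.off_below:
            self.enabled = False
        elif alpha > self.on_above:
            self.enabled = True
        # Between the two thresholds the previous state is kept.
        return self.enabled
```

The two thresholds stop the system from flapping when α hovers around a single cutoff.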


The high batch size trap

Speculative decoding thrives when the big model’s GPU compute is underutilized (small batch sizes).

But as you pack more concurrent requests into a batch, GPU compute becomes the bottleneck. Suddenly, verifying K extra tokens per request costs real time. The “free verification” assumption dies.

At batch sizes of 32+, speculation’s speedup diminishes. At batches of 128+, it actively hurts throughput. Treat these as rough thresholds; the exact crossover depends on your GPU, model size, and K. Use it selectively: enable it for low-load/latency-sensitive paths, disable it for high-throughput batch processing.
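A toy roofline model shows why the free lunch ends. Every number here is a hypothetical placeholder (100ms to stream the weights, 0.5ms of compute per token position, 5ms per draft token), and the model ignores KV cache traffic and the fact that drafting also gets slower with batch size. The point is the shape of the curve, not the values.

```python
def step_ms(batch: int, positions: int,
            weight_load_ms: float = 100.0,
            compute_ms_per_position: float = 0.5) -> float:
    """Toy roofline: a step takes as long as the slower of memory and compute."""
    compute_ms = batch * positions * compute_ms_per_position
    return max(weight_load_ms, compute_ms)


def tokens_per_second(batch: int, k: int, tokens_per_step: float,
                      draft_ms: float = 5.0) -> float:
    """Whole-batch throughput; k=0 means plain decoding."""
    if k == 0:
        return batch * 1000.0 / step_ms(batch, 1)
    # Verification processes k guesses plus one extra position per request.
    total_ms = k * draft_ms + step_ms(batch, k + 1)
    return batch * tokens_per_step * 1000.0 / total_ms


# Assumed 3.69 tokens per step, roughly what alpha = 0.8 and K = 5 would give.
small_batch_gain = tokens_per_second(1, 5, 3.69) / tokens_per_second(1, 0, 1.0)
large_batch_gain = tokens_per_second(128, 5, 3.69) / tokens_per_second(128, 0, 1.0)
```

With these made-up constants the gain is large at batch size 1 and drops below 1.0 at batch size 128, because verification has become compute-bound while plain decoding has not.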

(Note: I’ll dive into Continuous Batching and dynamic scheduling soon to address high-throughput serving).


Picking the right draft model

Teams argue constantly over draft models. The main options:

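  1. Smaller same-family model (e.g., Llama-3.2-1B for Llama-3.1-70B): High acceptance rate, since the two share a tokenizer, but requires extra VRAM to load a second model.
  2. Quantized draft: Using an INT4/INT8 version of the big model as the draft. Great acceptance rate, but it costs a lot of VRAM and still reads every layer’s weights, so it is only a few times cheaper per token than the big model.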
  3. Shared layers (self-speculative): Skipping layers in the big model to generate a cheap draft. Saves VRAM but still experimental.

Your VRAM budget usually dictates the choice. You are already fighting for memory with the massive KV Cache; adding a draft model requires careful capacity planning.
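A quick capacity check makes the trade-off concrete. This sketch counts weight bytes only and ignores quantization metadata, activations, and the draft model's own KV cache. The 160 GB total is a hypothetical budget (for example, two 80 GB cards), and the helper names are invented for illustration.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight footprint in GB (1e9 params * bytes per param)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def kv_headroom_gb(total_vram_gb: float, target_gb: float,
                   draft_gb: float) -> float:
    """VRAM left for KV cache after both models are loaded (may be negative)."""
    return total_vram_gb - target_gb - draft_gb


target = weight_gb(70, "fp16")                                      # 140 GB
small_draft = kv_headroom_gb(160, target, weight_gb(1, "fp16"))     # 18 GB left
quant_draft = kv_headroom_gb(160, target, weight_gb(70, "int4"))    # does not fit
```

Under these assumptions a 1B draft leaves room for KV cache, while an INT4 copy of the 70B model overshoots the budget before a single request is served.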


Production Monitoring

If you deploy this, put these on your dashboard:

  • Acceptance rate (α): Track per-request and per-task.
  • Tokens generated per step: If it’s close to 1, speculation is failing.
  • Draft latency: The draft should be at least ~10x faster than the big model per token; otherwise drafting time eats into the gains.
  • VRAM headroom: Don’t let the draft model evict your KV cache entries.
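The first two metrics fall out of three counters per task. Below is a minimal sketch with hypothetical names (`SpecStats`, `record_step`); it assumes each step emits the accepted guesses plus one token from the big model. In production you would export these to whatever metrics system you already use.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict


@dataclass
class SpecStats:
    steps: int = 0
    proposed: int = 0
    accepted: int = 0
    emitted: int = 0

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.proposed if self.proposed else 0.0

    @property
    def tokens_per_step(self) -> float:
        return self.emitted / self.steps if self.steps else 0.0


stats_by_task: DefaultDict[str, SpecStats] = defaultdict(SpecStats)


def record_step(task: str, proposed: int, accepted: int) -> None:
    """Update counters for one speculative step of the given task type."""
    s = stats_by_task[task]
    s.steps += 1
    s.proposed += proposed
    s.accepted += accepted
    s.emitted += accepted + 1  # accepted guesses plus the target's own token
```

Keying by task type matters because a healthy global α can hide a workload, such as creative writing, where speculation is losing.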

Next Steps

Next up: KV Cache management and PagedAttention—how vLLM handles memory fragmentation at scale. Alternatively, we could tackle Quantization (INT8/INT4/AWQ) since we just mentioned it for drafting.


Note: This blog represents my technical views and production experience. I use AI-based tools to help with drafting and formatting to keep these posts coming daily.
