Anatomy of an LLM Request: Prefill, Decode, Latency & Cost

TL;DR

An LLM request has two phases that dominate latency and cost: prefill (processing the input) and decode (generating the output one token at a time). Prefill is compute-bound and scales with input length. Decode is memory-bandwidth-bound and scales with output length.

Total latency follows:

Total latency ≈ TTFT + (output_tokens × TPOT)

Output tokens cost 3-5x more than input tokens on commercial APIs. The biggest cost wins come from request design (shorter outputs, prefix caching, smaller RAG chunks), not infrastructure.

For interactive chat, target TPOT under 100ms (10 tokens/sec). Self-hosted LLM serving is cheaper than API at sustained utilisation above approximately 30%. [Industry surveys indicate that token cost increasingly dominates total system cost for LLM workloads, often exceeding the underlying Kubernetes infrastructure spend].

What is a token in an LLM?

A token is a chunk of text from a model's fixed vocabulary, typically 50,000 to 150,000 unique pieces. Models do not process characters or words directly. The tokenizer splits input text into tokens before the model sees anything.

Token-to-text conversion rates for English

1 token ≈ 4 characters
1 token ≈ 0.75 words
100 tokens ≈ a short paragraph
1,000 tokens ≈ one page of prose

Non-English languages, code, and JSON tokenize less efficiently. Chinese text typically uses 2-3x more tokens than equivalent English. JSON with many field names uses more tokens than plain prose.

Why tokenization matters for production

Every operational metric (context limits, API pricing, latency, KV cache size) is measured in tokens, not characters. Estimating in characters will mislead capacity planning.

Tokenizers also vary by model family. The same text might be 1,200 tokens for Llama and 1,400 for GPT-4. Switching models in production requires re-doing the token budget math.

What is the context window in an LLM?

The context window is the total number of tokens a model can attend to in one request. It is a shared budget across all components of a prompt.

A 32,000 token context window must accommodate:

System prompt (instructions, persona)
Tool / function definitions
Few-shot examples
Retrieved RAG chunks
Conversation history
The current user message
Reserved space for the response

Why large context windows are not always useful

Three constraints limit useful context length:

Quality degrades with length. Models attend reliably to the start and end of long contexts but get fuzzy in the middle. This is called the "lost in the middle" effect, [first documented by Liu and colleagues at Stanford in 2023].
Cost scales linearly or worse. Doubling input tokens at least doubles prefill cost.
Latency scales linearly. 32k of input takes approximately 8x longer to prefill than 4k.

Most production systems use 4k to 32k of context even on models that support 200k+ tokens.

What are the phases of an LLM request?

An LLM request passes through five phases. Two of them dominate latency and cost.

Phase	Description	Cost driver
Tokenization	Text converted to token IDs	Negligible
Prefill	Model processes all input tokens in one parallel pass	Input length
Decode	Model generates output tokens one at a time	Output length
Detokenization	Token IDs converted back to text	Negligible
Streaming	Tokens delivered to client	Network I/O

What is prefill in LLM inference?

Prefill is the phase where the model processes all input tokens together in one parallel matrix multiplication. It is compute-bound: the GPU is doing serious arithmetic at full utilisation.

Properties of prefill

Scales linearly with input length. 10,000 input tokens take approximately 10x longer than 1,000.
Compute-bound. GPU TFLOPS (compute throughput) is the bottleneck.
Determines TTFT. The user sees no response until prefill completes.
Hardware matters. H100 is significantly faster than A100 on prefill due to higher TFLOPS.

Prefill latency for common configurations

Input length	Model	GPU	Approximate prefill time
1k tokens	Qwen3.5-9B	L4	~200 ms
4k tokens	Qwen3.5-9B	L4	~700 ms
16k tokens	Qwen3.5-9B	L4	~2.8 s
16k tokens	Llama-3.1-70B	H100	~1.2 s

Prefill time can be hidden by streaming, but only the first token; users still wait for prefill to complete before seeing any response.

What is decode in LLM inference?

Decode is the phase where the model generates output tokens one at a time. Each token requires reading the entire model weights and KV cache from VRAM, performing a small computation, and producing one new token.

Properties of decode

Scales linearly with output length. 500 output tokens take approximately 5x longer than 100.
Memory-bandwidth-bound. The GPU is mostly waiting on VRAM reads, not computing.
Benefits from batching. Multiple users decoding in parallel share the model weight reads.
Determines TPOT. The gap between consecutive streamed tokens.

Decode latency targets

Use case	Target TPOT	Why
Interactive chat	< 100 ms (≥10 tok/sec)	Above human reading speed
Conversational agents	< 50 ms (≥20 tok/sec)	Feels instant
Batch processing	Not relevant	Throughput matters, not per-token

Human reading speed averages approximately 250 words per minute, or roughly 5 tokens per second. TPOT faster than this is perceived as instant.

Why are prefill and decode bottlenecked by different hardware?

Prefill processes thousands of tokens in parallel: the GPU's compute units are saturated. The bottleneck is TFLOPS (floating-point operations per second).

Decode processes one token at a time per user. The GPU's compute units are mostly idle; the bottleneck is how fast model weights can be read from VRAM. The bottleneck is memory bandwidth (GB/sec).

This means:

H100 vs A100 on prefill: H100 is approximately 3x faster (more TFLOPS).
H100 vs A100 on decode: difference is smaller (memory bandwidth gap is smaller).

The right GPU for a workload depends on whether the traffic is "long prompts, short outputs" (prefill-dominated) or "short prompts, long outputs" (decode-dominated).

How is total LLM request latency calculated?

Total latency is the sum of prefill time (TTFT) and the cumulative decode time across all output tokens:

Total latency ≈ TTFT + (output_tokens × TPOT)

Worked example

An 8,000 token input, 500 token output, on a Qwen3.5-9B model running on an L4:

TTFT (8k prefill): ~1.5 seconds
TPOT: ~25 ms/token
Output time: 500 × 0.025 = 12.5 seconds
Total latency: ~14 seconds

In this example, decode dominates total latency. For workloads with short outputs (classification, extraction), prefill dominates. Optimisation strategies differ for each case.

What are TTFT and TPOT in LLM serving?

TTFT (Time To First Token): The latency from request sent to first response token received. Dominated by prefill (input length).

TPOT (Time Per Output Token): The gap between successive output tokens during streaming. Also called ITL (Inter-Token Latency). Dominated by decode (memory bandwidth and batch contention).

Why track P50, P95, and P99 separately

Averages hide tail latency. A system with average latency of 2 seconds may have P99 latency of 30 seconds, meaning 1% of users experience a 30 second wait. [The Google SRE book has long emphasised that mean latency is "a misleading metric for user-facing services," with percentile distributions being the right operational signal].

Metric	What it represents
P50 (median)	The typical user experience
P95	The worst 5% of requests
P99	The worst 1% of requests; reflects preemptions, KV pressure, long-context spikes

Production LLM systems should track P50, P95, and P99 for TTFT, total latency, and tokens per response.

What is the throughput vs latency trade-off in LLM serving?

LLM serving has one fundamental trade-off: per-user latency versus aggregate throughput. The variable is batch size.

Batch size	Per-user TPOT	Aggregate throughput
1-2	Lowest (best per-user)	Lowest
8-16	Moderate	Moderate
32-64	Highest TPOT (worst per-user)	Highest

When to optimise for low latency

Consumer chat applications
Interactive coding assistants
Anything where a single user is waiting

When to optimise for throughput

Batch document processing
Async pipelines
Bulk content generation
Embedding generation

Both cannot be maximised simultaneously. The right choice depends on the workload.

What types of caching are used in LLM serving?

Four distinct mechanisms are commonly called "caching" in LLM systems. They are not equivalent.

Cache type	What it stores	Who controls it
KV cache	Model's internal state for an active conversation	Engine, automatic
Prefix caching	KV blocks reusable across requests with shared starting prefixes	Engine, must be enabled
Prompt caching (Anthropic / OpenAI APIs)	API-exposed version of prefix caching	User, with explicit markers
Semantic / response caching	Mapping similar prompts to past responses	Application layer, fully explicit

Which caching strategy provides the largest gains

Prefix caching produces the largest gains when prompts share a stable prefix (system prompt, tool definitions, RAG context). It can reduce TTFT by 5-20x with zero code changes. Enable with --enable-prefix-caching in vLLM.

[Anthropic's prompt caching documentation reports up to 90% cost reduction and up to 85% latency reduction for long prompts on cached calls]. The mechanism is the same as vLLM's prefix caching, exposed through the API.

Semantic caching saves entire LLM calls but risks serving stale responses to similar but distinct queries. Best for FAQ-style workloads with bounded query distributions.

How is LLM cost calculated for API usage?

Commercial LLM APIs charge separately for input and output tokens. Output is typically 3-5x more expensive than input.

Why output costs more than input

Input tokens are processed in one parallel prefill batch (compute-efficient). Output tokens are generated sequentially in decode (cannot parallelise within one user's response).

Worked example: RAG chatbot cost

Using GPT-4-class pricing ($3 per million input tokens, $15 per million output tokens):

Component	Tokens
System prompt + tools + RAG chunks	6,000 input
Conversation history	4,000 input
User question	200 input
Response	500 output
Total input	10,200 tokens
Total output	500 tokens

Cost per turn:

Input: 10,200 × $3/M = $0.0306
Output: 500 × $15/M = $0.0075
Total: ~$0.038 per turn

At 100,000 turns per day: $3,800 per day, or approximately $115,000 per month.

In this example, input dominates cost because of large RAG context. Halving RAG chunks saves approximately $1,500 per month from one configuration change.

How is LLM cost calculated for self-hosted serving?

Self-hosted serving uses GPU-hours instead of token pricing:

cost_per_token = GPU_$/hr ÷ aggregate_throughput_tokens_per_hour

Worked example

An L4 spot instance at $0.30/hr, sustaining 500 tokens/sec aggregate throughput:

Throughput: 500 tokens/sec × 3,600 sec/hr = 1.8M tokens/hr
Cost per token: $0.30 / 1,800,000 = $0.00000017 (or $0.17 per million tokens)

This is approximately 17x cheaper than GPT-4o input pricing and 90x cheaper than GPT-4o output pricing, provided utilisation is high.

When self-hosting wins on cost

Self-hosted LLM serving is cheaper than commercial APIs when sustained utilisation exceeds approximately 30%. Below this threshold, idle GPU time exceeds API token cost.

Above 60% sustained utilisation, self-hosting is typically 10-50x cheaper per token than commercial APIs. [LMSYS reported a 50% reduction in GPU count while serving 2-3x more requests after migrating to vLLM-based self-hosting], an example of what high-utilisation self-hosting economics look like in practice.

What latency targets should LLM applications meet?

Use case	TTFT target	TPOT target	Total latency target
Interactive chat	< 1 second	< 100 ms	< 10 seconds
Voice / real-time agent	< 500 ms	< 50 ms	< 5 seconds
Code completion	< 200 ms	< 30 ms	< 2 seconds
Background extraction	Not critical	Not critical	< 30 seconds
Batch processing	Not critical	Not critical	Throughput, not latency

Streaming should be enabled for any user-facing interactive use case. It does not change total latency but reduces perceived latency dramatically.

How can LLM request cost and latency be reduced?

In order of impact:

Reduce output length. Output costs 3-5x input. Prompting for concise responses can cut total cost by 30-50%.
Enable prefix caching. Free win for any workload with stable prefixes. 5-20x faster TTFT.
Reduce RAG context. Fewer or smaller chunks reduce input cost per request.
Use a smaller model where quality permits. A 9B model for synthesis may match a 70B at 10x lower cost.
Quantize. FP8 or AWQ-4bit reduces memory 2-4x with minor quality loss.
Cache at the application layer. Semantic or exact-match caching skips LLM calls for repeated queries.

Key terms

Token: Chunk of text from the model's vocabulary. Approximately 4 characters or 0.75 words in English.

Context window: Total tokens a model can attend to in one request, shared across system prompt, history, RAG, and response.

Prefill: Phase where the model processes all input tokens in parallel. Compute-bound. Sets TTFT.

Decode: Phase where the model generates output tokens one at a time. Memory-bandwidth-bound. Sets TPOT.

TTFT (Time To First Token): Latency from request sent to first response token received.

TPOT (Time Per Output Token): Gap between consecutive output tokens during streaming.

ITL (Inter-Token Latency): Synonym for TPOT.

Streaming: Sending output tokens to the client as generated, rather than waiting for full completion.

Throughput: Aggregate tokens per second across all in-flight requests.

Batch size: Number of requests processed concurrently. Trades off per-user latency for aggregate throughput.

Prefix caching: Reusing KV cache blocks across requests with shared starting prefixes.

Prompt caching: API-exposed prefix caching (Anthropic, OpenAI).

Semantic caching: Application-layer caching based on prompt similarity, not exact match.

Why is my LLM TTFT high even with a short user message?

TTFT is dominated by total input length, not user message length. If the system prompt, tool definitions, conversation history, and RAG chunks total 8,000 tokens, prefill processes all 8,000 tokens before producing the first output token. Enabling prefix caching can reduce TTFT 5-20x when prompts share stable prefixes.

Why does my LLM cost more for output than input?

Output tokens cost 3-5x more than input tokens on most commercial APIs because output is generated sequentially in decode while input is processed in parallel during prefill. Decode is harder to amortise across users.

What is a good TPOT for a chat application?

For interactive chat, target TPOT under 100ms (10 tokens per second or faster). This exceeds average human reading speed and feels instant. Above 200ms TPOT, the response feels sluggish.

Does streaming reduce total LLM latency?

No. Streaming does not change total latency. It changes perceived latency by showing the first token in approximately TTFT time instead of waiting for the entire response. For non-interactive workloads (extraction, classification, batch jobs), streaming provides no benefit.

Should I self-host or use an LLM API?

Use an LLM API at low utilisation (under 30% sustained GPU usage if self-hosted). Self-hosting becomes cheaper at higher utilisation, often by 10-50x per token at 60%+ sustained usage. Calculate the breakeven based on expected traffic before committing.

How do I reduce LLM API costs by 50%?

The highest-impact changes:

Cap output length with max_tokens and prompt for concise responses
Reduce RAG chunk count and chunk size
Truncate conversation history aggressively
Use a smaller model for routing/classification, large model only for synthesis
Enable prompt caching on supported providers

About NudgeBee

NudgeBee builds AI-powered SRE automation for cloud-native production systems. We use LLMs at scale and have made most of the mistakes documented in this series. Learn more at nudgebee.com.

Part 1: How to serve LLMs in production: GPU memory, KV cache, and sizing
Part 2: Anatomy of an LLM request
Part 3: How to build production applications on LLMs: RAG, agents, and tool calling
Part 4: LLM serving cheat sheet
Part 5: Common mistakes when serving LLMs in production

Anatomy of an LLM request: prefill, decode, latency, and cost

TL;DR

Token-to-text conversion rates for English

Why tokenization matters for production

What is the context window in an LLM?

Why large context windows are not always useful

What are the phases of an LLM request?

What is prefill in LLM inference?

Properties of prefill

Prefill latency for common configurations

What is decode in LLM inference?

Properties of decode

Decode latency targets

Why are prefill and decode bottlenecked by different hardware?

How is total LLM request latency calculated?

Worked example

What are TTFT and TPOT in LLM serving?

Why track P50, P95, and P99 separately

What is the throughput vs latency trade-off in LLM serving?

When to optimise for low latency

When to optimise for throughput

What types of caching are used in LLM serving?

Which caching strategy provides the largest gains

How is LLM cost calculated for API usage?

Why output costs more than input

Worked example: RAG chatbot cost

How is LLM cost calculated for self-hosted serving?

Worked example

When self-hosting wins on cost

What latency targets should LLM applications meet?

How can LLM request cost and latency be reduced?

Key terms

Why is my LLM TTFT high even with a short user message?

Why does my LLM cost more for output than input?

What is a good TPOT for a chat application?

Does streaming reduce total LLM latency?

Should I self-host or use an LLM API?

How do I reduce LLM API costs by 50%?

About NudgeBee

Series navigation