TL;DR
An LLM request has two phases that dominate latency and cost: prefill (processing the input) and decode (generating the output one token at a time). Prefill is compute-bound and scales with input length. Decode is memory-bandwidth-bound and scales with output length.
Total latency follows:
Total latency ≈ TTFT + (output_tokens × TPOT)
Output tokens cost 3-5x more than input tokens on commercial APIs. The biggest cost wins come from request design (shorter outputs, prefix caching, smaller RAG chunks), not infrastructure.
For interactive chat, target TPOT under 100ms (10 tokens/sec). Self-hosted LLM serving is cheaper than API at sustained utilisation above approximately 30%. [Industry surveys indicate that token cost increasingly dominates total system cost for LLM workloads, often exceeding the underlying Kubernetes infrastructure spend].
What is a token in an LLM?
A token is a chunk of text from a model's fixed vocabulary, typically 50,000 to 150,000 unique pieces. Models do not process characters or words directly. The tokenizer splits input text into tokens before the model sees anything.
Token-to-text conversion rates for English
- 1 token ≈ 4 characters
- 1 token ≈ 0.75 words
- 100 tokens ≈ a short paragraph
- 1,000 tokens ≈ one page of prose
Non-English languages, code, and JSON tokenize less efficiently. Chinese text typically uses 2-3x more tokens than equivalent English. JSON with many field names uses more tokens than plain prose.
Why tokenization matters for production
Every operational metric (context limits, API pricing, latency, KV cache size) is measured in tokens, not characters. Estimating in characters will mislead capacity planning.
Tokenizers also vary by model family. The same text might be 1,200 tokens for Llama and 1,400 for GPT-4. Switching models in production requires re-doing the token budget math.
What is the context window in an LLM?
The context window is the total number of tokens a model can attend to in one request. It is a shared budget across all components of a prompt.
A 32,000 token context window must accommodate:
- System prompt (instructions, persona)
- Tool / function definitions
- Few-shot examples
- Retrieved RAG chunks
- Conversation history
- The current user message
- Reserved space for the response
Why large context windows are not always useful
Three constraints limit useful context length:
- Quality degrades with length. Models attend reliably to the start and end of long contexts but get fuzzy in the middle. This is called the "lost in the middle" effect, [first documented by Liu and colleagues at Stanford in 2023].
- Cost scales linearly or worse. Doubling input tokens at least doubles prefill cost.
- Latency scales linearly. 32k of input takes approximately 8x longer to prefill than 4k.
Most production systems use 4k to 32k of context even on models that support 200k+ tokens.
What are the phases of an LLM request?
An LLM request passes through five phases. Two of them dominate latency and cost.
| Phase | Description | Cost driver |
|---|---|---|
| Tokenization | Text converted to token IDs | Negligible |
| Prefill | Model processes all input tokens in one parallel pass | Input length |
| Decode | Model generates output tokens one at a time | Output length |
| Detokenization | Token IDs converted back to text | Negligible |
| Streaming | Tokens delivered to client | Network I/O |
What is prefill in LLM inference?
Prefill is the phase where the model processes all input tokens together in one parallel matrix multiplication. It is compute-bound: the GPU is doing serious arithmetic at full utilisation.
Properties of prefill
- Scales linearly with input length. 10,000 input tokens take approximately 10x longer than 1,000.
- Compute-bound. GPU TFLOPS (compute throughput) is the bottleneck.
- Determines TTFT. The user sees no response until prefill completes.
- Hardware matters. H100 is significantly faster than A100 on prefill due to higher TFLOPS.
Prefill latency for common configurations
| Input length | Model | GPU | Approximate prefill time |
|---|---|---|---|
| 1k tokens | Qwen3.5-9B | L4 | ~200 ms |
| 4k tokens | Qwen3.5-9B | L4 | ~700 ms |
| 16k tokens | Qwen3.5-9B | L4 | ~2.8 s |
| 16k tokens | Llama-3.1-70B | H100 | ~1.2 s |
Prefill time can be hidden by streaming, but only the first token; users still wait for prefill to complete before seeing any response.
What is decode in LLM inference?
Decode is the phase where the model generates output tokens one at a time. Each token requires reading the entire model weights and KV cache from VRAM, performing a small computation, and producing one new token.
Properties of decode
- Scales linearly with output length. 500 output tokens take approximately 5x longer than 100.
- Memory-bandwidth-bound. The GPU is mostly waiting on VRAM reads, not computing.
- Benefits from batching. Multiple users decoding in parallel share the model weight reads.
- Determines TPOT. The gap between consecutive streamed tokens.
Decode latency targets
| Use case | Target TPOT | Why |
|---|---|---|
| Interactive chat | < 100 ms (≥10 tok/sec) | Above human reading speed |
| Conversational agents | < 50 ms (≥20 tok/sec) | Feels instant |
| Batch processing | Not relevant | Throughput matters, not per-token |
Human reading speed averages approximately 250 words per minute, or roughly 5 tokens per second. TPOT faster than this is perceived as instant.
Why are prefill and decode bottlenecked by different hardware?
Prefill processes thousands of tokens in parallel: the GPU's compute units are saturated. The bottleneck is TFLOPS (floating-point operations per second).
Decode processes one token at a time per user. The GPU's compute units are mostly idle; the bottleneck is how fast model weights can be read from VRAM. The bottleneck is memory bandwidth (GB/sec).
This means:
- H100 vs A100 on prefill: H100 is approximately 3x faster (more TFLOPS).
- H100 vs A100 on decode: difference is smaller (memory bandwidth gap is smaller).
The right GPU for a workload depends on whether the traffic is "long prompts, short outputs" (prefill-dominated) or "short prompts, long outputs" (decode-dominated).
How is total LLM request latency calculated?
Total latency is the sum of prefill time (TTFT) and the cumulative decode time across all output tokens:
Total latency ≈ TTFT + (output_tokens × TPOT)
Worked example
An 8,000 token input, 500 token output, on a Qwen3.5-9B model running on an L4:
- TTFT (8k prefill): ~1.5 seconds
- TPOT: ~25 ms/token
- Output time: 500 × 0.025 = 12.5 seconds
- Total latency: ~14 seconds
In this example, decode dominates total latency. For workloads with short outputs (classification, extraction), prefill dominates. Optimisation strategies differ for each case.
What are TTFT and TPOT in LLM serving?
TTFT (Time To First Token): The latency from request sent to first response token received. Dominated by prefill (input length).
TPOT (Time Per Output Token): The gap between successive output tokens during streaming. Also called ITL (Inter-Token Latency). Dominated by decode (memory bandwidth and batch contention).
Why track P50, P95, and P99 separately
Averages hide tail latency. A system with average latency of 2 seconds may have P99 latency of 30 seconds, meaning 1% of users experience a 30 second wait. [The Google SRE book has long emphasised that mean latency is "a misleading metric for user-facing services," with percentile distributions being the right operational signal].
| Metric | What it represents |
|---|---|
| P50 (median) | The typical user experience |
| P95 | The worst 5% of requests |
| P99 | The worst 1% of requests; reflects preemptions, KV pressure, long-context spikes |
Production LLM systems should track P50, P95, and P99 for TTFT, total latency, and tokens per response.
What is the throughput vs latency trade-off in LLM serving?
LLM serving has one fundamental trade-off: per-user latency versus aggregate throughput. The variable is batch size.
| Batch size | Per-user TPOT | Aggregate throughput |
|---|---|---|
| 1-2 | Lowest (best per-user) | Lowest |
| 8-16 | Moderate | Moderate |
| 32-64 | Highest TPOT (worst per-user) | Highest |
When to optimise for low latency
- Consumer chat applications
- Interactive coding assistants
- Anything where a single user is waiting
When to optimise for throughput
- Batch document processing
- Async pipelines
- Bulk content generation
- Embedding generation
Both cannot be maximised simultaneously. The right choice depends on the workload.
What types of caching are used in LLM serving?
Four distinct mechanisms are commonly called "caching" in LLM systems. They are not equivalent.
| Cache type | What it stores | Who controls it |
|---|---|---|
| KV cache | Model's internal state for an active conversation | Engine, automatic |
| Prefix caching | KV blocks reusable across requests with shared starting prefixes | Engine, must be enabled |
| Prompt caching (Anthropic / OpenAI APIs) | API-exposed version of prefix caching | User, with explicit markers |
| Semantic / response caching | Mapping similar prompts to past responses | Application layer, fully explicit |
Which caching strategy provides the largest gains
Prefix caching produces the largest gains when prompts share a stable prefix (system prompt, tool definitions, RAG context). It can reduce TTFT by 5-20x with zero code changes. Enable with --enable-prefix-caching in vLLM.
[Anthropic's prompt caching documentation reports up to 90% cost reduction and up to 85% latency reduction for long prompts on cached calls]. The mechanism is the same as vLLM's prefix caching, exposed through the API.
Semantic caching saves entire LLM calls but risks serving stale responses to similar but distinct queries. Best for FAQ-style workloads with bounded query distributions.
How is LLM cost calculated for API usage?
Commercial LLM APIs charge separately for input and output tokens. Output is typically 3-5x more expensive than input.
Why output costs more than input
Input tokens are processed in one parallel prefill batch (compute-efficient). Output tokens are generated sequentially in decode (cannot parallelise within one user's response).
Worked example: RAG chatbot cost
Using GPT-4-class pricing ($3 per million input tokens, $15 per million output tokens):
| Component | Tokens |
|---|---|
| System prompt + tools + RAG chunks | 6,000 input |
| Conversation history | 4,000 input |
| User question | 200 input |
| Response | 500 output |
| Total input | 10,200 tokens |
| Total output | 500 tokens |
Cost per turn:
- Input: 10,200 × $3/M = $0.0306
- Output: 500 × $15/M = $0.0075
- Total: ~$0.038 per turn
At 100,000 turns per day: $3,800 per day, or approximately $115,000 per month.
In this example, input dominates cost because of large RAG context. Halving RAG chunks saves approximately $1,500 per month from one configuration change.
How is LLM cost calculated for self-hosted serving?
Self-hosted serving uses GPU-hours instead of token pricing:
cost_per_token = GPU_$/hr ÷ aggregate_throughput_tokens_per_hour
Worked example
An L4 spot instance at $0.30/hr, sustaining 500 tokens/sec aggregate throughput:
- Throughput: 500 tokens/sec × 3,600 sec/hr = 1.8M tokens/hr
- Cost per token: $0.30 / 1,800,000 = $0.00000017 (or $0.17 per million tokens)
This is approximately 17x cheaper than GPT-4o input pricing and 90x cheaper than GPT-4o output pricing, provided utilisation is high.
When self-hosting wins on cost
Self-hosted LLM serving is cheaper than commercial APIs when sustained utilisation exceeds approximately 30%. Below this threshold, idle GPU time exceeds API token cost.
Above 60% sustained utilisation, self-hosting is typically 10-50x cheaper per token than commercial APIs. [LMSYS reported a 50% reduction in GPU count while serving 2-3x more requests after migrating to vLLM-based self-hosting], an example of what high-utilisation self-hosting economics look like in practice.
What latency targets should LLM applications meet?
| Use case | TTFT target | TPOT target | Total latency target |
|---|---|---|---|
| Interactive chat | < 1 second | < 100 ms | < 10 seconds |
| Voice / real-time agent | < 500 ms | < 50 ms | < 5 seconds |
| Code completion | < 200 ms | < 30 ms | < 2 seconds |
| Background extraction | Not critical | Not critical | < 30 seconds |
| Batch processing | Not critical | Not critical | Throughput, not latency |
Streaming should be enabled for any user-facing interactive use case. It does not change total latency but reduces perceived latency dramatically.
How can LLM request cost and latency be reduced?
In order of impact:
- Reduce output length. Output costs 3-5x input. Prompting for concise responses can cut total cost by 30-50%.
- Enable prefix caching. Free win for any workload with stable prefixes. 5-20x faster TTFT.
- Reduce RAG context. Fewer or smaller chunks reduce input cost per request.
- Use a smaller model where quality permits. A 9B model for synthesis may match a 70B at 10x lower cost.
- Quantize. FP8 or AWQ-4bit reduces memory 2-4x with minor quality loss.
- Cache at the application layer. Semantic or exact-match caching skips LLM calls for repeated queries.
Key terms
Token: Chunk of text from the model's vocabulary. Approximately 4 characters or 0.75 words in English.
Context window: Total tokens a model can attend to in one request, shared across system prompt, history, RAG, and response.
Prefill: Phase where the model processes all input tokens in parallel. Compute-bound. Sets TTFT.
Decode: Phase where the model generates output tokens one at a time. Memory-bandwidth-bound. Sets TPOT.
TTFT (Time To First Token): Latency from request sent to first response token received.
TPOT (Time Per Output Token): Gap between consecutive output tokens during streaming.
ITL (Inter-Token Latency): Synonym for TPOT.
Streaming: Sending output tokens to the client as generated, rather than waiting for full completion.
Throughput: Aggregate tokens per second across all in-flight requests.
Batch size: Number of requests processed concurrently. Trades off per-user latency for aggregate throughput.
Prefix caching: Reusing KV cache blocks across requests with shared starting prefixes.
Prompt caching: API-exposed prefix caching (Anthropic, OpenAI).
Semantic caching: Application-layer caching based on prompt similarity, not exact match.
Why is my LLM TTFT high even with a short user message?
TTFT is dominated by total input length, not user message length. If the system prompt, tool definitions, conversation history, and RAG chunks total 8,000 tokens, prefill processes all 8,000 tokens before producing the first output token. Enabling prefix caching can reduce TTFT 5-20x when prompts share stable prefixes.
Why does my LLM cost more for output than input?
Output tokens cost 3-5x more than input tokens on most commercial APIs because output is generated sequentially in decode while input is processed in parallel during prefill. Decode is harder to amortise across users.
What is a good TPOT for a chat application?
For interactive chat, target TPOT under 100ms (10 tokens per second or faster). This exceeds average human reading speed and feels instant. Above 200ms TPOT, the response feels sluggish.
Does streaming reduce total LLM latency?
No. Streaming does not change total latency. It changes perceived latency by showing the first token in approximately TTFT time instead of waiting for the entire response. For non-interactive workloads (extraction, classification, batch jobs), streaming provides no benefit.
Should I self-host or use an LLM API?
Use an LLM API at low utilisation (under 30% sustained GPU usage if self-hosted). Self-hosting becomes cheaper at higher utilisation, often by 10-50x per token at 60%+ sustained usage. Calculate the breakeven based on expected traffic before committing.
How do I reduce LLM API costs by 50%?
The highest-impact changes:
- Cap output length with max_tokens and prompt for concise responses
- Reduce RAG chunk count and chunk size
- Truncate conversation history aggressively
- Use a smaller model for routing/classification, large model only for synthesis
- Enable prompt caching on supported providers
About NudgeBee
NudgeBee builds AI-powered SRE automation for cloud-native production systems. We use LLMs at scale and have made most of the mistakes documented in this series. Learn more at nudgebee.com.
Series navigation
- Part 1: How to serve LLMs in production: GPU memory, KV cache, and sizing
- Part 2: Anatomy of an LLM request
- Part 3: How to build production applications on LLMs: RAG, agents, and tool calling
- Part 4: LLM serving cheat sheet
- Part 5: Common mistakes when serving LLMs in production