TL;DR
A single-page reference for serving LLMs in production. Covers:
- Vocabulary (token, KV cache, prefill, decode, TTFT, TPOT, prefix caching, MoE)
- Sizing formulas (VRAM, latency, cost per token)
- GPU reference table (T4, L4, A100, H100)
- Model reference table (weights and KV per token for Qwen3.5-4B, 9B, Llama-3.1-70B, Qwen3.5-35B-A3B MoE)
- Essential vLLM flags
- Decision shortcuts for common problems
- Cost reduction levers ranked by impact
LLM serving vocabulary
| Term | Definition |
|---|---|
| Token | Chunk of text the model processes. ~4 characters or 0.75 words in English. |
| Context window | Total tokens the model can attend to in one request. Shared across system prompt, tools, history, RAG, and response. |
| KV cache | Model's internal state for an active conversation. Grows linearly with tokens. |
| Prefill | Processing the input prompt. Compute-bound. Scales with input length. Sets TTFT. |
| Decode | Generating output one token at a time. Memory-bandwidth-bound. Scales with output length. |
| TTFT | Time To First Token. Dominated by prefill. Critical for streaming UX. |
| TPOT / ITL | Time Per Output Token / Inter-Token Latency. Dominated by decode and batch contention. |
| Total latency | TTFT + (output_tokens × TPOT) |
| Throughput | Aggregate tokens per second across all in-flight requests. Trades off against per-user latency. |
| Paged attention | vLLM technique storing KV in small blocks, allocated dynamically. Enables high concurrency. |
| Continuous batching | Requests join and leave the active batch every step. No fixed batch windows. |
| Prefix caching | Reuse KV blocks across requests sharing a starting prefix. Major TTFT win. |
| Prompt caching | User-controlled version of prefix caching exposed via API (Anthropic, OpenAI). |
| Quantization | Lower-precision weights or KV (FP8, INT4) to save memory. Small accuracy cost. |
| GQA | Grouped Query Attention. Multiple query heads share key-value heads. Smaller KV cache. |
| MoE | Mixture of Experts. Many total parameters, few active per token. Saves compute, not memory. |
| BF16 / FP8 / INT4 | Numeric formats. 2 / 1 / 0.5 bytes per parameter respectively. |
Three formulas to memorise
Memory needed
VRAM ≈ weights + 2 GB overhead + (peak_concurrent × p95_context × kv_per_token) × 1.15
The 1.15 factor is a safety margin for activation spikes.
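As a rough sketch, the memory formula can be turned into a small helper. The function name and the example workload (16 concurrent requests at a p95 context of 8,000 tokens) are illustrative assumptions, and decimal units (1 KB = 1,000 bytes, 1 GB = 1e9 bytes) are assumed throughout.

```python
def vram_needed_gb(weights_gb, peak_concurrent, p95_context_tokens,
                   kv_per_token_kb, overhead_gb=2.0, safety=1.15):
    """Estimate VRAM in GB: weights + overhead + KV pool scaled by a safety margin."""
    # Assumption: decimal units, so KB -> bytes is x1e3 and bytes -> GB is /1e9.
    kv_gb = peak_concurrent * p95_context_tokens * kv_per_token_kb * 1e3 / 1e9
    return weights_gb + overhead_gb + kv_gb * safety


# Illustrative inputs: 9 GB of FP8 weights, ~147 KB of KV per token,
# 16 concurrent requests, p95 context of 8,000 tokens.
estimate = vram_needed_gb(weights_gb=9, peak_concurrent=16,
                          p95_context_tokens=8_000, kv_per_token_kb=147)
print(f"{estimate:.1f} GB")  # prints "32.6 GB"
```

Under these assumptions the estimate is about 32.6 GB, which is more than a 24 GB card offers and would point to a 40 GB card or a smaller KV footprint.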
Latency
Total latency ≈ TTFT + (output_tokens × TPOT)
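A minimal sketch of the latency formula follows. The TTFT, TPOT, and output length used here are made-up inputs for illustration, not measurements.

```python
def total_latency_s(ttft_s, output_tokens, tpot_s):
    """Total request latency in seconds: time to first token plus decode time."""
    return ttft_s + output_tokens * tpot_s


# Hypothetical chat request: 0.8 s TTFT, 300 output tokens at 30 ms per token.
chat_latency = total_latency_s(ttft_s=0.8, output_tokens=300, tpot_s=0.03)
print(f"{chat_latency:.1f} s")  # prints "9.8 s"
```

With these inputs decode accounts for 9 of the 9.8 seconds, which is why output length tends to dominate total latency once TTFT is under control.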
Self-hosted cost per token
cost_per_token = GPU_$/hr ÷ aggregate_throughput_tokens_per_hour
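The same formula is easier to compare against API price lists when expressed per million tokens. The sketch below assumes a hypothetical GPU price of $2.00 per hour and a hypothetical aggregate throughput of 1,500 tokens per second; substitute measured values.

```python
def cost_per_million_tokens(gpu_usd_per_hour, tokens_per_second):
    """Self-hosted cost in USD per one million tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: $2.00/hr GPU sustaining 1,500 tokens/s across all requests.
usd_per_million = cost_per_million_tokens(gpu_usd_per_hour=2.00, tokens_per_second=1_500)
print(f"${usd_per_million:.2f} per million tokens")  # prints "$0.37 per million tokens"
```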
LLM model reference: weights and KV cache
| Model | BF16 weights | FP8 weights | INT4 weights | KV per token (FP8) |
|---|---|---|---|---|
| Qwen3.5-4B | 8 GB | 4 GB | 2.5 GB | ~74 KB |
| Qwen3.5-9B | 18 GB | 9 GB | 5.5 GB | ~147 KB |
| Llama-3.1-70B | 140 GB | 70 GB | 40 GB | ~330 KB |
| Qwen3.5-35B-A3B (MoE) | 70 GB | 35 GB | 20 GB | ~180 KB |
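KV per token figures are approximate; verify each model's config.json for the exact GQA shape. The Llama-3.1-70B figure corresponds to 16-bit KV (2 × 80 layers × 8 KV heads × 128 head dim × 2 bytes ≈ 328 KB), so FP8 KV would be roughly half that. MoE total parameters dominate VRAM cost; active parameters only help latency.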
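To check a KV-per-token figure against config.json, the usual calculation for a standard GQA transformer is sketched below. It assumes every layer uses full attention with the same number of KV heads; hybrid or sliding-window architectures need a different count. The shape values are the ones published for Llama-3.1-70B (num_hidden_layers 80, num_key_value_heads 8, head dim 128).

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV cache bytes stored per token for a standard GQA transformer."""
    # Factor 2: each layer stores one key vector and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Llama-3.1-70B shape: 80 layers, 8 KV heads, head dimension 128.
llama70b_16bit = kv_bytes_per_token(80, 8, 128, bytes_per_value=2)
llama70b_fp8 = kv_bytes_per_token(80, 8, 128, bytes_per_value=1)
print(llama70b_16bit, llama70b_fp8)  # prints "327680 163840"
```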
GPU reference for LLM serving
| GPU | VRAM | BF16 | FP8 | Production fit |
|---|---|---|---|---|
| NVIDIA T4 | 16 GB | No | No | Small models only (≤4B), single user |
| NVIDIA L4 | 24 GB | Yes | Yes | 4B-9B production serving |
| NVIDIA A100 (40 GB) | 40 GB | Yes | No (native) | 13B-35B production |
| NVIDIA A100 (80 GB) | 80 GB | Yes | No (native) | 70B production or high concurrency |
| NVIDIA H100 (80 GB) | 80 GB | Yes | Yes | 70B with FP8, maximum throughput |
Essential vLLM flags for production
| Flag | Effect | When to use |
|---|---|---|
| --enable-prefix-caching | Reuses KV blocks across shared prefixes | Always. Free latency win. |
| --kv-cache-dtype fp8 | Stores KV in FP8 instead of FP16 | Compute capability ≥ 8.9 (L4, H100). 2x KV capacity. |
| --max-num-seqs N | Caps concurrent in-flight requests | Match to realistic peak concurrency. Lower reduces OOM risk. |
| --max-model-len N | Caps context length per request | Tight cap frees memory for KV pool. |
| --gpu-memory-utilization 0.92 | Fraction of VRAM vLLM uses | 0.90-0.92 typical. Higher risks OOM. |
| --swap-space N | GB of CPU RAM for KV spillover | Smooths bursts above max-num-seqs. |
| --tensor-parallel-size N | Split model across N GPUs | When model plus KV does not fit on one GPU. |
| --quantization awq_marlin | Use AWQ 4-bit weights | When weights need to shrink. |
| --dtype bfloat16 | BF16 for weights and activations | Default for modern GPUs. Use float16 on T4. |
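One way to keep these flags reviewable is to assemble the launch command in a short script. The sketch below only builds and prints a `vllm serve` command line; the model ID and the numeric values are placeholder assumptions to be replaced with your own sizing results.

```python
import shlex

# Hypothetical model ID and limits; replace with values from your own sizing.
model = "Qwen/Qwen3.5-9B"

args = [
    "vllm", "serve", model,
    "--enable-prefix-caching",            # reuse KV blocks across shared prefixes
    "--kv-cache-dtype", "fp8",            # needs compute capability >= 8.9
    "--max-num-seqs", "32",               # cap on concurrent in-flight requests
    "--max-model-len", "16384",           # cap on context length per request
    "--gpu-memory-utilization", "0.90",   # fraction of VRAM vLLM may use
    "--dtype", "bfloat16",                # use float16 on T4
]

# shlex.join quotes each argument safely for pasting into a shell.
command = shlex.join(args)
print(command)
```

Building the argument list in code makes it straightforward to derive the caps from measured traffic rather than hand-editing a long command.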
Decision shortcuts for common LLM serving problems
Which inference engine should I use?
- Single user, laptop, development: Ollama or llama.cpp with GGUF
- Multi-user production: vLLM with safetensors. Benchmark SGLang as alternative.
- Maximum NVIDIA performance: TensorRT-LLM
How do I fix OOM at startup?
Lower --gpu-memory-utilization (try 0.88). Reduce --max-num-seqs. Switch to smaller weights (FP8 or AWQ).
How do I fix OOM mid-traffic?
Same fixes as startup. Additionally verify --max-model-len is not allowing requests that do not fit at peak concurrency.
How do I fix high TTFT?
Enable prefix caching (--enable-prefix-caching). Check whether prompts share prefixes. Consider a smaller model. Reduce input length.
How do I fix high TPOT or slow streaming?
Reduce concurrent batch size. Use a smaller model. Check GPU memory bandwidth.
How do I fix P99 latency much worse than P50?
The usual cause is KV-cache pressure triggering preemptions. Reduce --max-num-seqs or add --swap-space. Tail latency monitoring is a foundational SRE practice for any production service.
LLM cost reduction levers, ranked by impact
The order matters. Start at the top.
1. Reduce output length
Output tokens cost 3-5x input tokens. Concise prompting can cut cost 30-50%.
2. Enable prefix caching
Free. 5-20x faster TTFT on cached prefixes. Use --enable-prefix-caching in vLLM. Anthropic's prompt caching documentation reports up to 90% cost reduction on cached calls.
3. Reduce RAG context
Fewer or smaller chunks reduce input cost per request. 3-5 chunks beat 10 mediocre ones.
4. Use a smaller model where quality permits
4B for routing, 9B for synthesis. Do not default to 70B.
5. Quantize
FP8 or AWQ 4-bit cuts memory 2-4x with minor quality loss.
6. Self-host above 30% utilisation
Above 60% sustained utilisation, self-hosting is 10-50x cheaper per token than commercial APIs. Below 20%, APIs are cheaper. LMSYS publicly reported cutting GPU count by 50% while serving 2-3x more requests after migrating to vLLM-based self-hosting.
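The utilisation threshold for self-hosting can be estimated for a specific setup. The sketch below assumes a dedicated GPU billed per hour whether busy or idle, and all prices and throughput figures are hypothetical placeholders.

```python
def break_even_utilisation(gpu_usd_per_hour, peak_tokens_per_second, api_usd_per_million):
    """Fraction of peak throughput at which self-hosting matches the API price."""
    # Cost of the tokens one fully busy GPU-hour would produce, at API prices.
    api_usd_per_busy_hour = peak_tokens_per_second * 3600 * api_usd_per_million / 1_000_000
    return gpu_usd_per_hour / api_usd_per_busy_hour


# Hypothetical inputs: $4.00/hr GPU, 2,000 tokens/s at full load,
# compared against an API priced at $2.00 per million tokens.
threshold = break_even_utilisation(4.00, 2_000, 2.00)
print(f"{threshold:.0%}")  # prints "28%"
```

Below the computed fraction the idle GPU hours outweigh the per-token saving; above it, self-hosting is cheaper under these assumptions.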
LLM serving sizing checklist
Before provisioning a GPU, answer each question with a number.
- Exact model name and size (e.g., "Qwen3.5-9B")
- Precision (BF16, FP8, AWQ-4bit)
- Average concurrent in-flight requests
- Peak concurrent requests
- Average context length (measured from real traffic)
- P95 context length
- Prefix overlap fraction (cacheable proportion of each request)
- Latency budget (P50 and P95 TTFT)
Plug into:
VRAM_needed ≈ weights + 2 GB + (peak_concurrent × p95_context × kv_per_token) × 1.15
If the result exceeds the GPU's VRAM, five levers are available: smaller model, lower precision, fewer concurrent users, shorter contexts, or a bigger GPU.
Application-layer reference
RAG configuration starting points
- Chunk size: 500-800 tokens
- Chunk overlap: 50-100 tokens
- Top-k retrieved: 3-5
- Embedding model: voyage-3, text-embedding-3-large, or BGE-large (see the MTEB leaderboard)
- Reranker: Cohere rerank API
- Search: hybrid (BM25 + embedding)
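The chunk size and overlap settings above can be sketched as a sliding window over a token list. This is a minimal illustration that assumes the text is already tokenized; the defaults of 600 and 75 are simply values picked from inside the suggested ranges.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=75):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# Stand-in for a tokenized document of 2,000 tokens.
document = list(range(2_000))
chunks = chunk_tokens(document)
print(len(chunks), [len(c) for c in chunks])  # prints "4 [600, 600, 600, 425]"
```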
Tool calling rules
- Limit to under 10 tools per request
- Write tool descriptions the way you would write docstrings for a new colleague
- Execute parallel tool calls concurrently (asyncio.gather, Promise.all)
- Validate JSON outputs (Pydantic, Zod)
- Retry on validation failure with the error included in the prompt
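The validation and concurrency rules above might look like the following sketch. The two tools are hypothetical stand-ins, and validation uses only the standard library here; a schema library such as Pydantic would replace `parse_tool_call` in practice. The `ValueError` message is what would be included in the retry prompt.

```python
import asyncio
import json


async def get_weather(city):  # hypothetical tool
    await asyncio.sleep(0.01)  # stands in for network latency
    return {"city": city, "temp_c": 21}


async def get_time(city):  # hypothetical tool
    await asyncio.sleep(0.01)
    return {"city": city, "time": "12:00"}


TOOLS = {"get_weather": get_weather, "get_time": get_time}


def parse_tool_call(raw):
    """Validate one model-emitted tool call; the error text can be fed back on retry."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(call, dict) or call.get("name") not in TOOLS:
        raise ValueError(f"unknown or missing tool name in: {call!r}")
    if not isinstance(call.get("arguments"), dict):
        raise ValueError("'arguments' must be a JSON object")
    return call


async def run_tool_calls(raw_calls):
    """Validate every call first, then execute all of them concurrently."""
    calls = [parse_tool_call(raw) for raw in raw_calls]
    return await asyncio.gather(*(TOOLS[c["name"]](**c["arguments"]) for c in calls))


results = asyncio.run(run_tool_calls([
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}',
    '{"name": "get_time", "arguments": {"city": "Oslo"}}',
]))
print(results)
```

`asyncio.gather` returns results in the order the calls were passed, so they can be matched back to the model's tool-call list by position.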
Agent constraints
- Set max-steps hard limit
- Detect repeated tool calls
- Use small fast models for routing, large model only for synthesis
- Plan for compounding failure: 95% success per step over 5 steps gives 0.95^5 ≈ 77% overall
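A minimal sketch of the first two constraints, plus the compounding-failure arithmetic, is shown below. The `next_action` callback, the limits, and the stuck policy are hypothetical; a real agent would call a model and execute tools inside the loop.

```python
def run_agent(next_action, max_steps=8, max_repeats=2):
    """Run an agent loop with a hard step limit and repeated-call detection.

    `next_action(history)` returns a (tool_name, args_json) tuple, or None when done.
    """
    history = []
    seen = {}
    for _ in range(max_steps):
        action = next_action(history)
        if action is None:
            return "done", history
        seen[action] = seen.get(action, 0) + 1
        if seen[action] > max_repeats:
            return "stopped: repeated tool call", history
        history.append(action)
    return "stopped: max steps reached", history


def stuck_policy(history):
    """Hypothetical misbehaving policy that keeps issuing the same call."""
    return ("search", '{"q": "same query"}')


status, history = run_agent(stuck_policy)
print(status, len(history))  # prints "stopped: repeated tool call 2"

# Compounding failure: five steps that each succeed 95% of the time.
overall_success = 0.95 ** 5
print(f"{overall_success:.1%}")  # prints "77.4%"
```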
Thinking mode rules
- Enable for: math, multi-step reasoning, code debugging, complex planning
- Disable for: chat, summarisation, extraction, classification, real-time agents
- Latency cost: 5-20x non-thinking mode
Latency targets by use case
| Use case | TTFT target | TPOT target | Total latency target |
|---|---|---|---|
| Interactive chat | < 1 second | < 100 ms (≥10 tok/sec) | < 10 seconds |
| Voice / real-time | < 500 ms | < 50 ms (≥20 tok/sec) | < 5 seconds |
| Code completion | < 200 ms | < 30 ms | < 2 seconds |
| Background extraction | Not critical | Not critical | < 30 seconds |
| Batch processing | Not critical | Not critical | Throughput, not latency |
What is the minimum GPU for serving an LLM in production?
For multi-user serving of small models (4B-9B), an L4 (24 GB) is the practical minimum. T4 (16 GB) lacks BF16 and FP8 support and is suitable only for single-user small-model deployments.
How do I calculate how many concurrent users a GPU can serve?
Use the formula: VRAM_available_for_KV = VRAM - weights - 2 GB. Then concurrent_users = VRAM_available_for_KV / (p95_context × kv_per_token × 1.15).
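A sketch of that calculation follows. It assumes decimal units (1 KB = 1,000 bytes, 1 GB = 1e9 bytes) and that the whole card is available; the example workload (a 24 GB card, 9 GB of weights, about 147 KB of KV per token, p95 context of 4,000 tokens) is illustrative.

```python
import math


def max_concurrent_requests(vram_gb, weights_gb, p95_context_tokens,
                            kv_per_token_kb, overhead_gb=2.0, safety=1.15):
    """Number of p95-sized requests whose KV cache fits in the remaining VRAM."""
    kv_pool_gb = vram_gb - weights_gb - overhead_gb
    if kv_pool_gb <= 0:
        return 0  # weights and overhead alone do not fit
    # Assumption: decimal units, so KB -> bytes is x1e3 and bytes -> GB is /1e9.
    per_request_gb = p95_context_tokens * kv_per_token_kb * 1e3 / 1e9 * safety
    return math.floor(kv_pool_gb / per_request_gb)


# Illustrative inputs: 24 GB card, 9 GB weights, ~147 KB KV per token, 4,000-token p95.
users = max_concurrent_requests(24, 9, 4_000, 147)
print(users)  # prints "19"
```

In practice the usable pool is smaller because --gpu-memory-utilization reserves part of the card, so treat the result as an upper bound.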
Which is faster: vLLM or SGLang?
Depends on workload. vLLM is the default for general production serving. SGLang often outperforms on long-context and structured-output workloads. Benchmark with real traffic before choosing.
Can I run a 70B model on one GPU?
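Llama-3.1-70B in BF16 needs 140 GB, so one GPU is not sufficient. In FP8 on an H100, the 70 GB of weights fit, but after the 2 GB overhead only about 8 GB of the 80 GB remains for KV, so concurrency and context length are tightly constrained. In INT4, the 40 GB of weights fit on an A100 80 GB or H100 with a substantial KV pool remaining.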
How do I reduce LLM API costs?
In order: cap output length, enable prompt caching, reduce RAG chunks, truncate conversation history, use smaller models where possible, cache repeated queries at the application layer.
About NudgeBee
NudgeBee builds AI-powered SRE automation for cloud-native production systems. We use LLMs at scale and have made most of the mistakes documented in this series. Learn more at nudgebee.com.
Series navigation
- Part 1: How to serve LLMs in production: GPU memory, KV cache, and sizing
- Part 2: Anatomy of an LLM request: prefill, decode, latency, and cost
- Part 3: Building LLM applications: RAG, agents, tool calling, and reasoning
- Part 4: LLM serving cheat sheet
- Part 5: Common mistakes when serving LLMs in production