LLM Serving Cheat Sheet: GPUs, vLLM Flags & Cost Levers

TL;DR

A single-page reference for serving LLMs in production. Covers:

Vocabulary (token, KV cache, prefill, decode, TTFT, TPOT, prefix caching, MoE)
Sizing formulas (VRAM, latency, cost per token)
GPU reference table (T4, L4, A100, H100)
Model reference table (weights and KV per token for Qwen3.5-4B, 9B, Llama-3.1-70B, Qwen3.5-35B-A3B MoE)
Essential vLLM flags
Decision shortcuts for common problems
Cost reduction levers ranked by impact

LLM serving vocabulary

Term	Definition
Token	Chunk of text the model processes. ~4 characters or 0.75 words in English.
Context window	Total tokens the model can attend to in one request. Shared across system prompt, tools, history, RAG, and response.
KV cache	Model's internal state for an active conversation. Grows linearly with tokens.
Prefill	Processing the input prompt. Compute-bound. Scales with input length. Sets TTFT.
Decode	Generating output one token at a time. Memory-bandwidth-bound. Scales with output length.
TTFT	Time To First Token. Dominated by prefill. Critical for streaming UX.
TPOT / ITL	Time Per Output Token / Inter-Token Latency. Dominated by decode and batch contention.
Total latency	TTFT + (output_tokens × TPOT)
Throughput	Aggregate tokens per second across all in-flight requests. Trades off against per-user latency.
Paged attention	vLLM technique storing KV in small blocks, allocated dynamically. Enables high concurrency.
Continuous batching	Requests join and leave the active batch every step. No fixed batch windows.
Prefix caching	Reuse KV blocks across requests sharing a starting prefix. Major TTFT win.
Prompt caching	User-controlled version of prefix caching exposed via API (Anthropic, OpenAI).
Quantization	Lower-precision weights or KV (FP8, INT4) to save memory. Small accuracy cost.
GQA	Grouped Query Attention. Multiple query heads share key-value heads. Smaller KV cache.
MoE	Mixture of Experts. Many total parameters, few active per token. Saves compute, not memory.
BF16 / FP8 / INT4	Numeric formats. 2 / 1 / 0.5 bytes per parameter respectively.

Three formulas to memorise

Memory needed

VRAM ≈ weights + 2 GB overhead + (peak_concurrent × p95_context × kv_per_token) × 1.15

The 1.15 factor is a safety margin for activation spikes.
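As a worked instance of the formula, the sketch below plugs in the Qwen3.5-9B FP8 figures from the model reference table (9 GB weights, ~147 KB of KV per token). The traffic numbers (20 concurrent requests, 8,000-token P95 context) are hypothetical, and the code assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes).

```python
def estimate_vram_gb(weights_gb, peak_concurrent, p95_context_tokens,
                     kv_per_token_kb, overhead_gb=2.0, safety=1.15):
    """VRAM ≈ weights + overhead + (concurrent × context × KV per token) × safety."""
    # Convert the KV pool from KB to GB using decimal units (an assumption).
    kv_gb = peak_concurrent * p95_context_tokens * kv_per_token_kb * 1e3 / 1e9
    return weights_gb + overhead_gb + kv_gb * safety


# Hypothetical workload: 20 concurrent requests, 8,000-token P95 context.
needed = estimate_vram_gb(weights_gb=9, peak_concurrent=20,
                          p95_context_tokens=8000, kv_per_token_kb=147)
print(f"Estimated VRAM: {needed:.1f} GB")
```

Under these assumptions the estimate is about 38 GB: too large for an L4 (24 GB) and almost no headroom on an A100 40 GB.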

Latency

Total latency ≈ TTFT + (output_tokens × TPOT)
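A minimal sketch of the latency formula, using hypothetical measurements (0.8 s TTFT, 30 ms TPOT, 300 output tokens) rather than figures from any benchmark:

```python
def total_latency_s(ttft_s, output_tokens, tpot_s):
    """Total latency ≈ TTFT + (output_tokens × TPOT)."""
    return ttft_s + output_tokens * tpot_s


# Hypothetical chat request: 0.8 s to first token, then 300 tokens at 30 ms each.
latency = total_latency_s(ttft_s=0.8, output_tokens=300, tpot_s=0.03)
decode_share = (latency - 0.8) / latency
print(f"Total: {latency:.1f} s, decode share: {decode_share:.0%}")
```

With these numbers decode accounts for most of the total, which is why output length is usually the first thing to cut when total latency is the problem.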

Self-hosted cost per token

cost_per_token = GPU_$/hr ÷ aggregate_throughput_tokens_per_hour
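The same formula expressed per million tokens, which is how API prices are usually quoted. The GPU price ($2.00/hr) and throughput (1,500 tokens/sec aggregate) are hypothetical placeholders; substitute measured values.

```python
def cost_per_million_tokens(gpu_usd_per_hour, throughput_tokens_per_s):
    """cost_per_token = GPU $/hr ÷ aggregate tokens per hour, scaled to 1M tokens."""
    tokens_per_hour = throughput_tokens_per_s * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical: a $2.00/hr GPU sustaining 1,500 tokens/sec across all requests.
usd_per_million = cost_per_million_tokens(gpu_usd_per_hour=2.00,
                                          throughput_tokens_per_s=1500)
print(f"${usd_per_million:.2f} per 1M tokens")
```

The throughput must be the aggregate across all in-flight requests, not the per-user decode speed.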

LLM model reference: weights and KV cache

Model	BF16 weights	FP8 weights	INT4 weights	KV per token (FP8)
Qwen3.5-4B	8 GB	4 GB	2.5 GB	~74 KB
Qwen3.5-9B	18 GB	9 GB	5.5 GB	~147 KB
Llama-3.1-70B	140 GB	70 GB	40 GB	~164 KB
Qwen3.5-35B-A3B (MoE)	70 GB	35 GB	20 GB	~180 KB

KV per token figures are approximate. Verify config.json for exact GQA shape. MoE total parameters dominate VRAM cost; active parameters only help latency.
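To check a KV-per-token figure against config.json, the usual calculation for a dense GQA transformer is 2 (key and value) × layers × KV heads × head dimension × bytes per element. The sketch below assumes every layer keeps a full KV cache, so it does not apply as-is to hybrid or sliding-window architectures. The field names follow the Hugging Face config convention, and the shape values are what Llama-3.1-70B's config.json is assumed to contain; verify them against the actual file.

```python
def kv_bytes_per_token(config, bytes_per_element):
    """KV bytes per token for a dense GQA model with a full cache in every layer."""
    layers = config["num_hidden_layers"]
    kv_heads = config["num_key_value_heads"]
    # Some configs state head_dim directly; otherwise derive it.
    head_dim = config.get("head_dim") or (
        config["hidden_size"] // config["num_attention_heads"]
    )
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_element


# Assumed Llama-3.1-70B shape (check config.json before relying on it).
llama_70b_shape = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}

fp8_kb = kv_bytes_per_token(llama_70b_shape, bytes_per_element=1) / 1e3
bf16_kb = kv_bytes_per_token(llama_70b_shape, bytes_per_element=2) / 1e3
print(f"FP8 KV: {fp8_kb:.0f} KB/token, BF16 KV: {bf16_kb:.0f} KB/token")
```

With this shape the result is roughly 164 KB per token at 1 byte per element and 328 KB at 2 bytes.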

GPU reference for LLM serving

GPU	VRAM	BF16	FP8	Production fit
NVIDIA T4	16 GB	No	No	Small models only (≤4B), single user
NVIDIA L4	24 GB	Yes	Yes	4B-9B production serving
NVIDIA A100 (40 GB)	40 GB	Yes	No (native)	13B-35B production
NVIDIA A100 (80 GB)	80 GB	Yes	No (native)	70B production or high concurrency
NVIDIA H100 (80 GB)	80 GB	Yes	Yes	70B with FP8, maximum throughput

Essential vLLM flags for production

Flag	Effect	When to use
`--enable-prefix-caching`	Reuses KV blocks across shared prefixes	Always. Free latency win.
`--kv-cache-dtype fp8`	Stores KV in FP8 instead of FP16	Compute capability ≥ 8.9 (L4, H100). 2x KV capacity.
`--max-num-seqs N`	Caps concurrent in-flight requests	Match to realistic peak concurrency. Lower reduces OOM risk.
`--max-model-len N`	Caps context length per request	Tight cap frees memory for KV pool.
`--gpu-memory-utilization 0.92`	Fraction of VRAM vLLM uses	0.90-0.92 typical. Higher risks OOM.
`--swap-space N`	GB of CPU RAM for KV spillover	Holds KV blocks of preempted requests when the GPU KV pool fills during bursts.
`--tensor-parallel-size N`	Split model across N GPUs	When model plus KV does not fit on one GPU.
`--quantization awq_marlin`	Use AWQ 4-bit weights	When weights need to shrink.
`--dtype bfloat16`	BF16 for weights and activations	Default for modern GPUs. Use `float16` on T4.
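One way to keep these flags reviewable is to build the launch command from a small function rather than hand-editing a shell line. The sketch below only assembles the argument list for `vllm serve` using flags from the table above; the model name is a hypothetical placeholder, and flag availability and defaults vary across vLLM versions, so confirm with `vllm serve --help` for the installed release.

```python
import shlex


def build_vllm_command(model, *, max_num_seqs, max_model_len,
                       gpu_memory_utilization=0.92, kv_cache_dtype=None,
                       tensor_parallel_size=1, dtype="bfloat16",
                       prefix_caching=True):
    """Assemble a `vllm serve` argument list from the flags in the table."""
    args = [
        "vllm", "serve", model,
        "--dtype", dtype,
        "--max-num-seqs", str(max_num_seqs),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if prefix_caching:
        args.append("--enable-prefix-caching")
    if kv_cache_dtype:
        # FP8 KV needs a GPU that supports it (see the GPU table).
        args += ["--kv-cache-dtype", kv_cache_dtype]
    if tensor_parallel_size > 1:
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    return args


# "your-org/your-model" is a placeholder, not a real checkpoint name.
cmd = build_vllm_command("your-org/your-model", max_num_seqs=32,
                         max_model_len=8192, kv_cache_dtype="fp8")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell or passed as a list to `subprocess.run`.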

Decision shortcuts for common LLM serving problems

Which inference engine should I use?

Single user, laptop, development: Ollama or llama.cpp with GGUF
Multi-user production: vLLM with safetensors. Benchmark SGLang as alternative.
Maximum NVIDIA performance: TensorRT-LLM

How do I fix OOM at startup?

Lower --gpu-memory-utilization (try 0.88). Reduce --max-num-seqs. Switch to smaller weights (FP8 or AWQ).

How do I fix OOM mid-traffic?

Same fixes as startup. Additionally verify --max-model-len is not allowing requests that do not fit at peak concurrency.

How do I fix high TTFT?

Enable prefix caching (--enable-prefix-caching). Check whether prompts share prefixes. Consider a smaller model. Reduce input length.

How do I fix high TPOT or slow streaming?

Reduce concurrent batch size. Use a smaller model. Check GPU memory bandwidth.

How do I fix P99 latency that is much worse than P50?

The usual cause is KV pressure forcing preemptions. Reduce --max-num-seqs or add --swap-space. Tail latency monitoring is a foundational SRE practice for any production service.
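To quantify the gap before changing flags, compare P99 with P50 from logged request latencies. The sketch below uses a simple nearest-rank percentile and made-up samples in which a few requests were slowed down; it is an illustration, not a replacement for a metrics stack.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def tail_ratio(samples):
    """How many times worse P99 is than P50."""
    return percentile(samples, 99) / percentile(samples, 50)


# Hypothetical latencies in seconds: 97 normal requests, 3 slow ones.
latencies = [1.0] * 97 + [6.0] * 3
print(f"P50={percentile(latencies, 50)} s, P99={percentile(latencies, 99)} s, "
      f"ratio={tail_ratio(latencies):.1f}x")
```

Tracking this ratio before and after lowering --max-num-seqs shows whether preemption was the actual cause.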

LLM cost reduction levers, ranked by impact

The order matters. Start at the top.

1. Reduce output length

Output tokens cost 3-5x input tokens. Concise prompting can cut cost 30-50%.
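A quick way to see the effect: express request cost in units of one input token and weight output tokens by a multiplier. The 4x multiplier sits inside the 3-5x range above; the token counts are hypothetical.

```python
def relative_cost(input_tokens, output_tokens, output_price_multiplier=4.0):
    """Request cost in units of 'one input token'."""
    return input_tokens + output_tokens * output_price_multiplier


# Hypothetical request: 1,000 input tokens, output cut from 600 to 300 tokens.
before = relative_cost(1000, 600)
after = relative_cost(1000, 300)
saving = 1 - after / before
print(f"Cost falls from {before:.0f} to {after:.0f} units ({saving:.0%} saving)")
```

Halving the output here saves roughly a third of the request cost, even though the prompt is unchanged.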

2. Enable prefix caching

Free. 5-20x faster TTFT on cached prefixes. Set --enable-prefix-caching in vLLM. Anthropic's prompt caching documentation reports up to 90% cost reduction on cached calls.

3. Reduce RAG context

Fewer or smaller chunks reduce input cost per request. 3-5 chunks beat 10 mediocre ones.

4. Use a smaller model where quality permits

4B for routing, 9B for synthesis. Do not default to 70B.

5. Quantize

FP8 or AWQ 4-bit cuts memory 2-4x with minor quality loss.

6. Self-host above 30% utilisation

Above 60% sustained utilisation, self-hosting is 10-50x cheaper per token than commercial APIs. Below 20%, APIs are cheaper. LMSYS publicly reported cutting GPU count by 50% while serving 2-3x more requests after migrating to vLLM-based self-hosting.
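The reason utilisation matters is that a GPU costs the same per hour whether it is busy or idle. The sketch below computes the self-hosted cost at a given utilisation and the break-even point against an API price. All three inputs ($2.00/hr GPU, 1,500 tokens/sec peak throughput, $1.50 per 1M tokens API price) are hypothetical; the break-even for a real deployment depends entirely on measured values.

```python
def self_hosted_cost_per_million(gpu_usd_per_hour, peak_tokens_per_s, utilisation):
    """Cost per 1M tokens when the GPU is only busy for a fraction of each hour."""
    tokens_per_hour = peak_tokens_per_s * 3600 * utilisation
    return gpu_usd_per_hour / tokens_per_hour * 1e6


def break_even_utilisation(gpu_usd_per_hour, peak_tokens_per_s, api_usd_per_million):
    """Utilisation at which self-hosting costs the same per token as the API."""
    full_cost = self_hosted_cost_per_million(gpu_usd_per_hour, peak_tokens_per_s, 1.0)
    return full_cost / api_usd_per_million


# Hypothetical inputs; replace with real prices and measured throughput.
threshold = break_even_utilisation(2.00, 1500, api_usd_per_million=1.50)
at_60 = self_hosted_cost_per_million(2.00, 1500, utilisation=0.60)
print(f"Break-even at {threshold:.0%} utilisation; "
      f"${at_60:.2f} per 1M tokens at 60%")
```

This ignores engineering and on-call cost, which push the real break-even higher.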

LLM serving sizing checklist

Before provisioning a GPU, answer each question with a number.

Exact model name and size (e.g., "Qwen3.5-9B")
Precision (BF16, FP8, AWQ-4bit)
Average concurrent in-flight requests
Peak concurrent requests
Average context length (measured from real traffic)
P95 context length
Prefix overlap fraction (cacheable proportion of each request)
Latency budget (P50 and P95 TTFT)

Plug into:

VRAM_needed ≈ weights + 2 GB + (peak_concurrent × p95_context × kv_per_token) × 1.15

If the result exceeds the GPU's VRAM, five levers are available: smaller model, lower precision, fewer concurrent users, shorter contexts, or a bigger GPU.

Application-layer reference

RAG configuration starting points

Chunk size: 500-800 tokens
Chunk overlap: 50-100 tokens
Top-k retrieved: 3-5
Embedding model: voyage-3, text-embedding-3-large, or BGE-large (see the MTEB leaderboard)
Reranker: Cohere rerank API
Search: hybrid (BM25 + embedding)
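The chunk size and overlap settings can be sketched as a sliding window over a token list. The example below works on any list, so integers stand in for token IDs; a real pipeline would tokenize with the embedding model's own tokenizer and often split on sentence or section boundaries instead of fixed offsets.

```python
def chunk_tokens(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
        start += step
    return chunks


# Stand-in document of 1,200 "tokens" (integers instead of real token IDs).
document = list(range(1200))
chunks = chunk_tokens(document, chunk_size=500, overlap=50)
print([(c[0], c[-1], len(c)) for c in chunks])
```

Each chunk starts 450 tokens after the previous one, so the last 50 tokens of one chunk reappear at the start of the next.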

Tool calling rules

Limit to under 10 tools per request
Write descriptions as docstrings for new colleagues
Execute parallel tool calls concurrently (asyncio.gather, Promise.all)
Validate JSON outputs (Pydantic, Zod)
Retry on validation failure with the error included in the prompt
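A minimal sketch of two of these rules: running independent tool calls concurrently with asyncio.gather, and validating model-produced JSON so the error text can be fed back on retry. The two tools are hypothetical stand-ins, and the hand-written validator is a standard-library substitute for what Pydantic or Zod would do in a real application.

```python
import asyncio
import json


async def get_weather(city):
    """Hypothetical tool; the sleep stands in for a network call."""
    await asyncio.sleep(0.01)
    return {"city": city, "temp_c": 21}


async def get_time(city):
    """Hypothetical tool returning a fixed value for illustration."""
    await asyncio.sleep(0.01)
    return {"city": city, "local_time": "09:00"}


async def run_tool_calls(calls):
    """Run independent tool calls concurrently; results keep the call order."""
    return await asyncio.gather(*(fn(**kwargs) for fn, kwargs in calls))


def parse_tool_arguments(raw, required):
    """Validate JSON arguments; the ValueError message can go into the retry prompt."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = set(required) - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


args = parse_tool_arguments('{"city": "Pune"}', required={"city"})
results = asyncio.run(run_tool_calls([(get_weather, args), (get_time, args)]))
print(results)
```

Because the calls overlap, the batch takes about as long as the slowest tool rather than the sum of all of them.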

Agent constraints

Set max-steps hard limit
Detect repeated tool calls
Use small fast models for routing, large model only for synthesis
Plan for compounding failure: 95% success per step over 5 steps gives 0.95^5 ≈ 77% overall
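The 77% figure comes from multiplying per-step success rates, and the first two constraints can be enforced in the loop that drives the agent. The sketch below is an illustration under assumed interfaces: `next_action` is a hypothetical callback that returns a (tool_name, arguments_json) tuple, or None when the agent is finished.

```python
def overall_success(per_step_success, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


def run_agent(next_action, max_steps=8, max_repeats=2):
    """Drive an agent loop with a hard step limit and repeated-call detection."""
    history = []
    seen = {}
    for _ in range(max_steps):
        action = next_action(history)
        if action is None:
            return "done", history
        seen[action] = seen.get(action, 0) + 1
        if seen[action] > max_repeats:
            # Identical tool call issued too often: likely a loop.
            return "stopped: repeated tool call", history
        history.append(action)
    return "stopped: max steps reached", history


print(f"{overall_success(0.95, 5):.2%} chance that all 5 steps succeed")

# A hypothetical stuck agent that keeps issuing the same search.
status, history = run_agent(lambda h: ("search", '{"q": "gpu prices"}'))
print(status, len(history))
```

A real loop would also execute the tool and append its result to the history; that part is omitted here to keep the guards visible.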

Thinking mode rules

Enable for: math, multi-step reasoning, code debugging, complex planning
Disable for: chat, summarisation, extraction, classification, real-time agents
Latency cost: 5-20x non-thinking mode

Latency targets by use case

Use case	TTFT target	TPOT target	Total latency target
Interactive chat	< 1 second	< 100 ms (≥10 tok/sec)	< 10 seconds
Voice / real-time	< 500 ms	< 50 ms (≥20 tok/sec)	< 5 seconds
Code completion	< 200 ms	< 30 ms	< 2 seconds
Background extraction	Not critical	Not critical	< 30 seconds
Batch processing	Not critical	Not critical	Throughput, not latency

What is the minimum GPU for serving an LLM in production?

For multi-user serving of small models (4B-9B), an L4 (24 GB) is the practical minimum. T4 (16 GB) lacks BF16 and FP8 support and is suitable only for single-user small-model deployments.

How do I calculate how many concurrent users a GPU can serve?

Use the formula: VRAM_available_for_KV = VRAM - weights - 2 GB. Then concurrent_users = VRAM_available_for_KV / (p95_context × kv_per_token × 1.15).
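A worked instance of this calculation, using the L4 (24 GB) and the Qwen3.5-4B FP8 figures from the reference tables (4 GB weights, ~74 KB of KV per token). The 4,000-token P95 context is a hypothetical workload value, and the code assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes).

```python
import math


def max_concurrent_users(vram_gb, weights_gb, p95_context_tokens,
                         kv_per_token_kb, overhead_gb=2.0, safety=1.15):
    """concurrent_users = (VRAM - weights - overhead) / (context × KV per token × safety)."""
    kv_pool_gb = vram_gb - weights_gb - overhead_gb
    if kv_pool_gb <= 0:
        return 0  # the weights alone do not fit
    per_user_gb = p95_context_tokens * kv_per_token_kb * 1e3 / 1e9 * safety
    return math.floor(kv_pool_gb / per_user_gb)


# L4 with Qwen3.5-4B in FP8, hypothetical 4,000-token P95 context.
users = max_concurrent_users(vram_gb=24, weights_gb=4,
                             p95_context_tokens=4000, kv_per_token_kb=74)
print(f"About {users} concurrent users")
```

This uses the full 24 GB; with --gpu-memory-utilization at 0.92 the usable pool is smaller, so pass the reduced figure as vram_gb for a more conservative number.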

Which is faster: vLLM or SGLang?

Depends on workload. vLLM is the default for general production serving. SGLang often outperforms on long-context and structured-output workloads. Benchmark with real traffic before choosing.

Can I run a 70B model on one GPU?

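Llama-3.1-70B in BF16 needs 140 GB for weights alone, so one GPU is not sufficient. In FP8, the 70 GB of weights fit on one 80 GB H100, but after the 2 GB overhead only about 8 GB is left for KV, so concurrency and context length are tightly constrained. In INT4, the 40 GB of weights fit on an A100 80 GB or H100 with a substantial KV pool remaining.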

How do I reduce LLM API costs?

In order: cap output length, enable prompt caching, reduce RAG chunks, truncate conversation history, use smaller models where possible, cache repeated queries at the application layer.

About NudgeBee

NudgeBee builds AI-powered SRE automation for cloud-native production systems. We use LLMs at scale and have made most of the mistakes documented in this series. Learn more at nudgebee.com.

Part 1: How to serve LLMs in production: GPU memory, KV cache, and sizing
Part 2: Anatomy of an LLM request: prefill, decode, latency, and cost
Part 3: Building LLM applications: RAG, agents, tool calling, and reasoning
Part 4: LLM serving cheat sheet
Part 5: Common mistakes when serving LLMs in production
