LLM serving cheat sheet: GPUs, models, vLLM flags, and cost levers

Shiv, CTO, NudgeBee · 5 min read

TL;DR

A single-page reference for serving LLMs in production. Covers:

  • Vocabulary (token, KV cache, prefill, decode, TTFT, TPOT, prefix caching, MoE)
  • Sizing formulas (VRAM, latency, cost per token)
  • GPU reference table (T4, L4, A100, H100)
  • Model reference table (weights and KV per token for Qwen3.5-4B, 9B, Llama-3.1-70B, Qwen3.5-35B-A3B MoE)
  • Essential vLLM flags
  • Decision shortcuts for common problems
  • Cost reduction levers ranked by impact

LLM serving vocabulary

| Term | Definition |
|---|---|
| Token | Chunk of text the model processes. ~4 characters or 0.75 words in English. |
| Context window | Total tokens the model can attend to in one request. Shared across system prompt, tools, history, RAG, and response. |
| KV cache | Model's internal state for an active conversation. Grows linearly with tokens. |
| Prefill | Processing the input prompt. Compute-bound. Scales with input length. Sets TTFT. |
| Decode | Generating output one token at a time. Memory-bandwidth-bound. Scales with output length. |
| TTFT | Time To First Token. Dominated by prefill. Critical for streaming UX. |
| TPOT / ITL | Time Per Output Token / Inter-Token Latency. Dominated by decode and batch contention. |
| Total latency | TTFT + (output_tokens × TPOT) |
| Throughput | Aggregate tokens per second across all in-flight requests. Trades off against per-user latency. |
| Paged attention | vLLM technique storing KV in small blocks, allocated dynamically. Enables high concurrency. |
| Continuous batching | Requests join and leave the active batch every step. No fixed batch windows. |
| Prefix caching | Reuse KV blocks across requests sharing a starting prefix. Major TTFT win. |
| Prompt caching | Prefix caching exposed by hosted APIs (Anthropic, OpenAI). Depending on the provider, it is opt-in per request or applied automatically. |
| Quantization | Lower-precision weights or KV (FP8, INT4) to save memory. Small accuracy cost. |
| GQA | Grouped Query Attention. Multiple query heads share key-value heads. Smaller KV cache. |
| MoE | Mixture of Experts. Many total parameters, few active per token. Saves compute, not memory. |
| BF16 / FP8 / INT4 | Numeric formats. 2 / 1 / 0.5 bytes per parameter respectively. |

Three formulas to memorise

Memory needed

VRAM ≈ weights + 2 GB overhead + (peak_concurrent × p95_context × kv_per_token) × 1.15

The 1.15 factor is a safety margin for activation spikes.
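
As a rough worked example, the formula can be written as a small Python function. This is a sketch under stated assumptions: the function name, the decimal KB-to-GB conversion, and the traffic numbers (16 concurrent requests, 8,000-token P95 context) are illustrative, while the 9 GB weights and ~147 KB per token come from the Qwen3.5-9B FP8 row of the model table in this article.

```python
def vram_needed_gb(weights_gb, peak_concurrent, p95_context_tokens,
                   kv_per_token_kb, overhead_gb=2.0, safety=1.15):
    """Estimate VRAM in GB: weights + overhead + KV pool with a safety margin."""
    # KV pool in GB, using decimal units (1 GB = 1e6 KB) as an assumption.
    kv_pool_gb = peak_concurrent * p95_context_tokens * kv_per_token_kb / 1e6
    return weights_gb + overhead_gb + kv_pool_gb * safety


# Illustrative inputs: 9 GB FP8 weights, 16 concurrent requests,
# 8,000-token P95 context, 147 KB of KV per token.
estimate = vram_needed_gb(9, 16, 8000, 147)
print(f"{estimate:.1f} GB")  # about 32.6 GB, more than a 24 GB L4 offers
```

With these inputs the KV pool, not the weights, is what pushes the deployment past a 24 GB card.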

Latency

Total latency ≈ TTFT + (output_tokens × TPOT)
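
A quick sketch of the latency formula in Python, useful for checking a design against a budget before load testing. The function name and the sample figures (0.8 s TTFT, 300 output tokens, 30 ms TPOT) are assumptions chosen for illustration.

```python
def total_latency_s(ttft_s, output_tokens, tpot_s):
    """End-to-end latency: time to first token plus per-token decode time."""
    return ttft_s + output_tokens * tpot_s


# Illustrative request: 0.8 s TTFT, 300 output tokens at 30 ms each.
latency = total_latency_s(0.8, 300, 0.030)
print(f"{latency:.1f} s")  # 9.8 s, most of it spent in decode
```

Because the decode term scales with output length, trimming output tokens usually moves total latency more than shaving TTFT does.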

Self-hosted cost per token

cost_per_token = GPU_$/hr ÷ aggregate_throughput_tokens_per_hour
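
The same formula as a Python sketch, expressed per million tokens because that is how API prices are usually quoted. The GPU price ($2.00/hr) and throughput (1,500 tokens/sec) are hypothetical placeholders, not measurements.

```python
def cost_per_million_tokens(gpu_usd_per_hour, tokens_per_second):
    """Self-hosted cost per one million tokens at a given aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


# Hypothetical GPU at $2.00/hr sustaining 1,500 tokens/sec in aggregate.
print(f"${cost_per_million_tokens(2.00, 1500):.2f} per million tokens")
```

The throughput figure should be the sustained aggregate across all in-flight requests; idle hours still cost the full hourly rate.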

LLM model reference: weights and KV cache

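| Model | BF16 weights | FP8 weights | INT4 weights | KV per token (FP8) |
|---|---|---|---|---|
| Qwen3.5-4B | 8 GB | 4 GB | 2.5 GB | ~74 KB |
| Qwen3.5-9B | 18 GB | 9 GB | 5.5 GB | ~147 KB |
| Llama-3.1-70B | 140 GB | 70 GB | 40 GB | ~164 KB |
| Qwen3.5-35B-A3B (MoE) | 70 GB | 35 GB | 20 GB | ~180 KB |

KV per token figures are approximate. Verify config.json for exact GQA shape. For Llama-3.1-70B (80 layers, 8 KV heads, head dimension 128), FP8 KV works out to about 164 KB per token, or roughly 330 KB in BF16. MoE total parameters dominate VRAM cost; active parameters only help latency.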
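
To check a figure against config.json, the usual calculation for a standard GQA transformer (every layer holding full attention KV) is sketched below. The shape used in the example (36 layers, 8 KV heads, head dimension 128) is illustrative and not claimed to be any listed model's actual config; models with hybrid or non-standard attention layers need a different calculation.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """KV cache bytes per token: one K and one V vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Illustrative shape: 36 layers, 8 KV heads, head_dim 128.
fp8 = kv_bytes_per_token(36, 8, 128, 1)   # FP8 stores 1 byte per element
bf16 = kv_bytes_per_token(36, 8, 128, 2)  # BF16 stores 2 bytes per element
print(fp8, bf16)
```

The field names in config.json are typically `num_hidden_layers` and `num_key_value_heads`; the head dimension is either given directly or derived from the hidden size and attention head count.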

GPU reference for LLM serving

| GPU | VRAM | BF16 | FP8 | Production fit |
|---|---|---|---|---|
| NVIDIA T4 | 16 GB | No | No | Small models only (≤4B), single user |
| NVIDIA L4 | 24 GB | Yes | Yes | 4B-9B production serving |
| NVIDIA A100 (40 GB) | 40 GB | Yes | No (native) | 13B-35B production |
| NVIDIA A100 (80 GB) | 80 GB | Yes | No (native) | 70B production or high concurrency |
| NVIDIA H100 (80 GB) | 80 GB | Yes | Yes | 70B with FP8, maximum throughput |

Essential vLLM flags for production

| Flag | Effect | When to use |
|---|---|---|
| --enable-prefix-caching | Reuses KV blocks across shared prefixes | Always. Free latency win. |
| --kv-cache-dtype fp8 | Stores KV in FP8 instead of FP16 | Compute capability ≥ 8.9 (L4, H100). 2x KV capacity. |
| --max-num-seqs N | Caps concurrent in-flight requests | Match to realistic peak concurrency. Lower reduces OOM risk. |
| --max-model-len N | Caps context length per request | Tight cap frees memory for KV pool. |
| --gpu-memory-utilization 0.92 | Fraction of VRAM vLLM uses | 0.90-0.92 typical. Higher risks OOM. |
| --swap-space N | GB of CPU RAM for KV spillover | Smooths bursts when the KV pool fills and sequences are preempted. |
| --tensor-parallel-size N | Split model across N GPUs | When model plus KV does not fit on one GPU. |
| --quantization awq_marlin | Use AWQ 4-bit weights | When weights need to shrink. |
| --dtype bfloat16 | BF16 for weights and activations | Default for modern GPUs. Use float16 on T4. |
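
One way to keep these flags consistent across environments is to assemble the `vllm serve` command in code. The sketch below is illustrative only: `build_vllm_command` is a hypothetical helper, `your-org/your-model` is a placeholder model ID, and the numeric values are examples rather than recommendations. It uses a subset of the flags from the table above.

```python
import shlex


def build_vllm_command(model, max_num_seqs, max_model_len,
                       gpu_memory_utilization=0.90, fp8_kv=False,
                       dtype="bfloat16"):
    """Return the argument list for a `vllm serve` invocation."""
    args = [
        "vllm", "serve", model,
        "--enable-prefix-caching",
        "--max-num-seqs", str(max_num_seqs),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--dtype", dtype,
    ]
    if fp8_kv:
        # Only sensible on GPUs with FP8 support (compute capability >= 8.9).
        args += ["--kv-cache-dtype", "fp8"]
    return args


# Placeholder model ID and example limits.
command = build_vllm_command("your-org/your-model", max_num_seqs=16,
                             max_model_len=8192, fp8_kv=True)
print(shlex.join(command))
```

Returning a list rather than a string lets the same arguments be passed to `subprocess.run` without shell quoting problems.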

Decision shortcuts for common LLM serving problems

Which inference engine should I use?

  • Single user, laptop, development: Ollama or llama.cpp with GGUF
  • Multi-user production: vLLM with safetensors. Benchmark SGLang as alternative.
  • Maximum NVIDIA performance: TensorRT-LLM

How do I fix OOM at startup?

Lower --gpu-memory-utilization (try 0.88). Reduce --max-num-seqs. Switch to smaller weights (FP8 or AWQ).

How do I fix OOM mid-traffic?

Same fixes as startup. Additionally verify --max-model-len is not allowing requests that do not fit at peak concurrency.

How do I fix high TTFT?

Enable prefix caching (--enable-prefix-caching). Check whether prompts share prefixes. Consider a smaller model. Reduce input length.

How do I fix high TPOT or slow streaming?

Reduce concurrent batch size. Use a smaller model. Check GPU memory bandwidth.

How do I fix P99 latency much worse than P50?

KV pressure is causing preemptions. Reduce --max-num-seqs or add --swap-space. Tail latency monitoring is a foundational SRE practice for any production service.

LLM cost reduction levers, ranked by impact

The order matters. Start at the top.

1. Reduce output length

Output tokens cost 3-5x input tokens. Concise prompting can cut cost 30-50%.

2. Enable prefix caching

Free. 5-20x faster TTFT on cached prefixes. --enable-prefix-caching in vLLM. Anthropic's prompt caching documentation reports up to 90% cost reduction on cached calls.

3. Reduce RAG context

Fewer or smaller chunks reduce input cost per request. 3-5 chunks beat 10 mediocre ones.

4. Use a smaller model where quality permits

4B for routing, 9B for synthesis. Do not default to 70B.

5. Quantize

FP8 or AWQ 4-bit cuts memory 2-4x with minor quality loss.

6. Self-host above 30% utilisation

Above 60% sustained utilisation, self-hosting is 10-50x cheaper per token than commercial APIs. Below 20%, APIs are cheaper. Between the two, compare your measured cost_per_token against the API price before committing. LMSYS publicly reported cutting GPU count by 50% while serving 2-3x more requests after migrating to vLLM-based self-hosting.
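
The break-even point can be estimated from the cost-per-token formula by scaling peak throughput by utilisation. The sketch below is a simplification under loud assumptions: the function names are hypothetical, the GPU price ($2.00/hr), peak throughput (1,500 tokens/sec), and API price ($1.00 per million tokens) are placeholders, and engineering and operations costs are ignored.

```python
def self_hosted_usd_per_million(gpu_usd_per_hour, peak_tokens_per_second,
                                utilisation):
    """Cost per million tokens when the GPU is busy only part of the time."""
    tokens_per_hour = peak_tokens_per_second * utilisation * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def break_even_utilisation(gpu_usd_per_hour, peak_tokens_per_second,
                           api_usd_per_million):
    """Utilisation at which self-hosted cost equals the API price."""
    full_load_cost = self_hosted_usd_per_million(
        gpu_usd_per_hour, peak_tokens_per_second, 1.0)
    return full_load_cost / api_usd_per_million


# Placeholder prices: $2.00/hr GPU, 1,500 tok/s peak, $1.00 per million via API.
u = break_even_utilisation(2.00, 1500, 1.00)
print(f"break-even at {u:.0%} utilisation")
```

A result above 1.0 means the API is cheaper even at full load for those inputs.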

LLM serving sizing checklist

Before provisioning a GPU, pin down each item below, with a measured number wherever one applies.

  • Exact model name and size (e.g., "Qwen3.5-9B")
  • Precision (BF16, FP8, AWQ-4bit)
  • Average concurrent in-flight requests
  • Peak concurrent requests
  • Average context length (measured from real traffic)
  • P95 context length
  • Prefix overlap fraction (cacheable proportion of each request)
  • Latency budget (P50 and P95 TTFT)

Plug into:

VRAM_needed ≈ weights + 2 GB + (peak_concurrent × p95_context × kv_per_token) × 1.15

If the result exceeds the GPU's VRAM, five levers are available: smaller model, lower precision, fewer concurrent users, shorter contexts, or a bigger GPU.

Application-layer reference

RAG configuration starting points

  • Chunk size: 500-800 tokens
  • Chunk overlap: 50-100 tokens
  • Top-k retrieved: 3-5
  • Embedding model: voyage-3, text-embedding-3-large, or BGE-large (see the MTEB leaderboard)
  • Reranker: Cohere rerank API
  • Search: hybrid (BM25 + embedding)
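
The chunk size and overlap settings above amount to a sliding window over the token sequence. A minimal sketch, assuming the text has already been tokenised into a list (the tokenizer itself is out of scope here) and using 600 and 75 as example values from the suggested ranges:

```python
def chunk_tokens(tokens, chunk_size=600, overlap=75):
    """Split a token list into overlapping windows of at most chunk_size."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and below chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Integers stand in for token IDs in this illustration.
chunks = chunk_tokens(list(range(1500)))
print([len(c) for c in chunks])
```

Each chunk begins with the final `overlap` tokens of the previous one, so a sentence cut at a boundary still appears whole in one of the two chunks.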

Tool calling rules

  • Limit to under 10 tools per request
  • Write descriptions as docstrings for new colleagues
  • Execute parallel tool calls concurrently (asyncio.gather, Promise.all)
  • Validate JSON outputs (Pydantic, Zod)
  • Retry on validation failure with the error included in the prompt
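
For the concurrency rule, a minimal `asyncio.gather` sketch is shown below. `call_tool` is a hypothetical stand-in for a real tool executor (an HTTP call, a database query), and the tool names and arguments are made up for illustration.

```python
import asyncio


async def call_tool(name, arguments):
    """Hypothetical tool executor; the sleep stands in for network latency."""
    await asyncio.sleep(0.05)
    return {"tool": name, "arguments": arguments}


async def run_tool_calls(tool_calls):
    """Run all requested tool calls concurrently, preserving input order."""
    tasks = [call_tool(c["name"], c["arguments"]) for c in tool_calls]
    return await asyncio.gather(*tasks)


# Made-up tool calls, as a model might request them in one turn.
requested = [
    {"name": "get_pod_status", "arguments": {"namespace": "prod"}},
    {"name": "get_recent_deploys", "arguments": {"limit": 5}},
]
results = asyncio.run(run_tool_calls(requested))
print([r["tool"] for r in results])
```

`asyncio.gather` returns results in the order the awaitables were passed, so each result can be matched back to its tool call by position.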

Agent constraints

  • Set max-steps hard limit
  • Detect repeated tool calls
  • Use small fast models for routing, large model only for synthesis
  • Plan for compounding failure: 95% success per step over 5 steps gives 0.95^5 ≈ 77% overall
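
The first, second, and fourth constraints can be sketched in a few lines of Python. Both function names, the history format (a list of `(tool_name, serialized_arguments)` tuples), and the default limits are assumptions for illustration, not a prescribed design.

```python
def overall_success(per_step_success, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


def should_stop(history, max_steps=8, max_repeats=2):
    """Stop when the step budget is spent or the latest call keeps repeating.

    history is a list of (tool_name, serialized_arguments) tuples.
    """
    if len(history) >= max_steps:
        return True
    if history and history.count(history[-1]) > max_repeats:
        return True
    return False


print(f"{overall_success(0.95, 5):.0%}")  # about 77%
print(should_stop([("search", '{"q": "oom"}')] * 3))
```

Serializing arguments to a string before comparison keeps the repeat check simple; a production agent may want a looser notion of "same call".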

Thinking mode rules

  • Enable for: math, multi-step reasoning, code debugging, complex planning
  • Disable for: chat, summarisation, extraction, classification, real-time agents
  • Latency cost: 5-20x non-thinking mode

Latency targets by use case

| Use case | TTFT target | TPOT target | Total latency target |
|---|---|---|---|
| Interactive chat | < 1 second | < 100 ms (≥10 tok/sec) | < 10 seconds |
| Voice / real-time | < 500 ms | < 50 ms (≥20 tok/sec) | < 5 seconds |
| Code completion | < 200 ms | < 30 ms | < 2 seconds |
| Background extraction | Not critical | Not critical | < 30 seconds |
| Batch processing | Not critical | Not critical | Throughput, not latency |

What is the minimum GPU for serving an LLM in production?

For multi-user serving of small models (4B-9B), an L4 (24 GB) is the practical minimum. T4 (16 GB) lacks modern numeric format support and is suitable only for single-user small-model deployments.

How do I calculate how many concurrent users a GPU can serve?

Use the formula: VRAM_available_for_KV = VRAM - weights - 2 GB. Then concurrent_users = VRAM_available_for_KV / (p95_context × kv_per_token × 1.15).
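
A sketch of that calculation in Python, rounding down to whole users. The function name and the decimal KB-to-GB conversion are assumptions; the example plugs in a 24 GB L4, the 9 GB FP8 weights and ~147 KB per token listed for Qwen3.5-9B, and an illustrative 8,000-token P95 context.

```python
import math


def max_concurrent_users(vram_gb, weights_gb, p95_context_tokens,
                         kv_per_token_kb, overhead_gb=2.0, safety=1.15):
    """Whole number of concurrent users whose KV fits in the remaining VRAM."""
    available_gb = vram_gb - weights_gb - overhead_gb
    if available_gb <= 0:
        return 0  # the weights alone do not leave room for any KV
    # Per-user KV in GB (decimal units, 1 GB = 1e6 KB), with the safety margin.
    per_user_gb = p95_context_tokens * kv_per_token_kb / 1e6 * safety
    return math.floor(available_gb / per_user_gb)


# 24 GB L4, 9 GB FP8 weights, 8,000-token P95 context, 147 KB per token.
print(max_concurrent_users(24, 9, 8000, 147))
```

This treats all VRAM as usable; with --gpu-memory-utilization below 1.0, pass the reduced figure as `vram_gb` instead.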

Which is faster: vLLM or SGLang?

Depends on workload. vLLM is the default for general production serving. SGLang often outperforms on long-context and structured-output workloads. Benchmark with real traffic before choosing.

Can I run a 70B model on one GPU?

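Llama-3.1-70B in BF16 needs 140 GB of weights, so one GPU is not sufficient. In FP8, the 70 GB of weights fit on one 80 GB H100, but after the 2 GB overhead only about 8 GB remains for KV, so concurrency and context length are tightly limited. In INT4, the 40 GB of weights fit on an A100 80 GB or H100 with roughly 38 GB left for the KV pool.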

How do I reduce LLM API costs?

In order: cap output length, enable prompt caching, reduce RAG chunks, truncate conversation history, use smaller models where possible, cache repeated queries at the application layer.

About NudgeBee

NudgeBee builds AI-powered SRE automation for cloud-native production systems. We use LLMs at scale and have made most of the mistakes documented in this series. Learn more at nudgebee.com.

Series navigation

  • Part 1: How to serve LLMs in production: GPU memory, KV cache, and sizing
  • Part 2: Anatomy of an LLM request: prefill, decode, latency, and cost
  • Part 3: Building LLM applications: RAG, agents, tool calling, and reasoning
  • Part 4: LLM serving cheat sheet
  • Part 5: Common mistakes when serving LLMs in production