Memory Reality: What NVIDIA Marketing Won't Tell You

The Hidden Truth About LLM Memory
GPU marketing focuses on FLOPS. Real-world LLM inference is dominated by memory bandwidth and capacity. This page shows the actual numbers.

Part 1: Activation Memory Per Token

Memory Writes Per Token (Decode)

For Qwen2-0.5B (hidden_dim=896, intermediate=4864, 24 layers):

| Operation | Writes per Layer | 24 Layers Total |
|---|---|---|
| RMSNorm output | 896 × 4 = 3.5 KB | 84 KB |
| Q projection (14 heads × 64) | 896 × 4 = 3.5 KB | 84 KB |
| K, V projections (GQA: 2 heads × 64 each) | 2 × 128 × 4 = 1 KB | 24 KB |
| Attention output | 896 × 4 = 3.5 KB | 84 KB |
| O projection | 896 × 4 = 3.5 KB | 84 KB |
| MLP gate + up | 2 × 4864 × 4 = 38 KB | 912 KB |
| MLP down | 896 × 4 = 3.5 KB | 84 KB |
| KV cache (new token) | ~1 KB | 24 KB |
| Final logits (once) | - | ~600 KB |
| Total per token | ~57.5 KB | ~2.0 MB |
Key Insight
Decode is memory-bandwidth bound. Every generated token drives roughly 2 MB of activation memory traffic alone, plus a full read of the model weights (~1 GB for Qwen2-0.5B in FP16).
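The per-layer figures above can be reproduced in a few lines of Python. This is a back-of-the-envelope sketch, assuming FP32 activations and no kernel fusion (real runtimes fuse several of these writes):

```python
# Rough per-token activation-write estimate for one decode step of
# Qwen2-0.5B (hidden=896, intermediate=4864, 24 layers, 2 KV heads x 64 dim).
HIDDEN, INTER, LAYERS = 896, 4864, 24
KV_HEADS, HEAD_DIM, BYTES = 2, 64, 4
VOCAB = 151_936  # Qwen2 vocabulary size

per_layer = {
    "rmsnorm_out": HIDDEN * BYTES,
    "q_proj":      HIDDEN * BYTES,                   # 14 query heads x 64
    "kv_proj":     2 * KV_HEADS * HEAD_DIM * BYTES,  # GQA: K and V are narrow
    "attn_out":    HIDDEN * BYTES,
    "o_proj":      HIDDEN * BYTES,
    "mlp_gate_up": 2 * INTER * BYTES,
    "mlp_down":    HIDDEN * BYTES,
    "kv_cache":    2 * KV_HEADS * HEAD_DIM * BYTES,  # new K/V appended
}

layer_bytes = sum(per_layer.values())
total = LAYERS * layer_bytes + VOCAB * BYTES         # + final logits, once

print(f"per layer: {layer_bytes / 1024:.1f} KB")
print(f"per token: {total / 2**20:.2f} MB")
```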

Part 2: Context Length - The Memory Multiplier

The Elephant in the Room
Context length is the biggest memory factor that GPU marketing ignores. KV cache grows linearly with context, and it must stay resident for the entire generation.

KV Cache Size Formula

KV Cache Size = 2 × n_layers × n_kv_heads × head_dim × context_length × bytes_per_element

For Qwen2-0.5B (FP32 KV cache):
  = 2 × 24 × 2 × 64 × context_length × 4 bytes
  = 24,576 × context_length bytes
  = 24 KB per token of context

For Qwen2-0.5B (FP16 KV cache):
  = 12 KB per token of context
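The formula translates directly into code. A small helper (`kv_cache_bytes` is a hypothetical name, with Qwen2-0.5B's geometry as defaults) reproduces the numbers above:

```python
# KV cache size per the formula above. Defaults are Qwen2-0.5B's config
# (24 layers, 2 KV heads, head_dim 64).
def kv_cache_bytes(context_len, n_layers=24, n_kv_heads=2, head_dim=64,
                   dtype_bytes=4):
    # The factor of 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes

per_token_fp32 = kv_cache_bytes(1)                 # 24,576 B = 24 KB
per_token_fp16 = kv_cache_bytes(1, dtype_bytes=2)  # 12,288 B = 12 KB

for ctx in (2_048, 8_192, 32_768, 131_072, 1_048_576):
    print(f"{ctx:>9} tokens: {kv_cache_bytes(ctx) / 2**20:8.0f} MB (FP32)")
```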
    

KV Cache Scaling by Context Length

| Context Length | KV Cache (FP32) | KV Cache (FP16) | Fits in GPU VRAM? | Fits in CPU RAM? |
|---|---|---|---|---|
| 2K tokens | 48 MB | 24 MB | Yes (80 GB) | Yes (4 TB) |
| 8K tokens | 192 MB | 96 MB | Yes | Yes |
| 32K tokens | 768 MB | 384 MB | Yes | Yes |
| 128K tokens | 3 GB | 1.5 GB | Tight with model | Yes |
| 512K tokens | 12 GB | 6 GB | Leaves little for model | Yes |
| 1M tokens | 24 GB | 12 GB | No room for 70B model | Yes |

The Real Problem: Model + KV Cache Must Fit Together

70B Model (Llama 3.1 70B with GQA-8)

| Context | KV Cache (FP16) | + Model (140 GB) | Fits H100 (80 GB)? | Fits 2 TB Server? |
|---|---|---|---|---|
| 8K | 2.6 GB | 143 GB | No (need 2×) | Yes |
| 32K | 10 GB | 150 GB | No (need 2×) | Yes |
| 128K | 41 GB | 181 GB | No (need 3×) | Yes |
| 1M | 335 GB | 475 GB | No (need 6×) | Yes |
📐 Show the Math: 70B KV Cache Calculation

Llama 3.1 70B Architecture

| Parameter | Value | Explanation |
|---|---|---|
| n_layers | 80 | Number of transformer layers |
| n_attention_heads | 64 | Query heads per layer |
| n_kv_heads | 8 | KV heads (GQA: 8 groups, each serving 8 query heads) |
| hidden_dim | 8192 | Model hidden dimension |
| head_dim | 128 | = hidden_dim / n_attention_heads = 8192 / 64 |

KV Cache Per Token (FP16)

KV_per_token = 2 × n_layers × n_kv_heads × head_dim × bytes
             = 2 × 80 × 8 × 128 × 2
             = 327,680 bytes
             = 320 KB per token

Scale by Context

8K context:   8,192 × 320 KB   = 2.62 GB
32K context:  32,768 × 320 KB  = 10.5 GB
128K context: 131,072 × 320 KB = 41.9 GB
1M context:   1,048,576 × 320 KB = 335 GB
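The arithmetic above is easy to verify in code, under the same assumptions (FP16 KV, GQA-8):

```python
# Reproduce the Llama 3.1 70B KV-cache arithmetic (FP16, GQA-8).
n_layers, n_kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2

kv_per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes
assert kv_per_token == 327_680                    # 320 KB per token

for ctx in (8_192, 32_768, 131_072, 1_048_576):
    # Binary GiB; the decimal GB figures in the text land slightly higher.
    print(f"{ctx:>9} tokens -> {ctx * kv_per_token / 2**30:6.1f} GiB KV cache")
```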

Why GQA Reduces Memory 8×

Without GQA (Multi-Head Attention), 70B would need 64 KV heads instead of 8:

MHA (64 KV heads): 2.56 MB/token → 128K = 335 GB KV cache
GQA (8 KV heads):  320 KB/token  → 128K = 42 GB KV cache
                                          8× smaller!
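A short comparison makes the GQA saving concrete (same assumptions as above: FP16 KV, 128K context):

```python
# MHA vs GQA KV footprint for the 70B geometry (FP16, 128K context).
def kv_bytes(n_kv_heads, n_layers=80, head_dim=128, ctx=131_072, dtype=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * dtype

mha = kv_bytes(64)   # every query head keeps its own K/V
gqa = kv_bytes(8)    # 8 shared KV heads serve all 64 query heads

print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB, {mha // gqa}x smaller")
```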

KV Cache for Different Model Sizes

| Model | Parameters | KV per Token (FP16) | 128K Context KV | 1M Context KV |
|---|---|---|---|---|
| Qwen2-0.5B | 0.5B | 12 KB | 1.5 GB | 12 GB |
| Llama-3-8B | 8B | 128 KB | 16 GB | 128 GB |
| Llama-3-70B | 70B | 320 KB | 40 GB | 320 GB |
| Llama-3.1-405B | 405B | ~504 KB | ~66 GB | ~530 GB |

The 405B Reality
Llama-3.1-405B with 1M context needs over 500 GB just for KV cache - more than 6× the capacity of an H100, before the ~810 GB of FP16 weights even enter the picture. No amount of NVLink can solve this - you need actual memory, which CPUs have.
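The per-token figures follow from each model's geometry. The config triples below are assumptions taken from the models' published configs - double-check them against each model's config.json before relying on the output:

```python
# KV bytes per token (FP16) from (n_layers, n_kv_heads, head_dim).
configs = {
    "Qwen2-0.5B":     (24, 2, 64),
    "Llama-3-8B":     (32, 8, 128),
    "Llama-3-70B":    (80, 8, 128),
    "Llama-3.1-405B": (126, 8, 128),
}

per_token_kb = {name: 2 * L * H * D * 2 / 1024     # FP16 bytes -> KB
                for name, (L, H, D) in configs.items()}

for name, kb in per_token_kb.items():
    # At 1M (2^20) tokens, KB-per-token equals GiB of total KV cache.
    print(f"{name:>14}: {kb:6.1f} KB/token -> {kb:6.1f} GiB at 1M context")
```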

Part 3: Prefill vs Decode Memory Patterns

Decode (1 token at a time)

Memory pattern: Bandwidth-bound

  • Read: All model weights (~1GB for 0.5B model)
  • Read: Entire KV cache (grows with context)
  • Write: 1 new K, 1 new V per layer
  • Write: 1 output token logits

Bottleneck: Memory bandwidth, not compute
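Because decode must re-read the weights and KV cache for every token, bandwidth divided by bytes-moved gives a hard ceiling on tokens/sec. A sketch with illustrative bandwidth numbers, assuming Qwen2-0.5B in FP16 (~1 GB weights, 12 KB KV per context token):

```python
# Upper bound on decode speed: bandwidth / bytes moved per token.
def decode_tps_ceiling(weight_bytes, kv_per_token, ctx, bw_bytes_per_s):
    return bw_bytes_per_s / (weight_bytes + kv_per_token * ctx)

weights, kv = 1e9, 12 * 1024   # Qwen2-0.5B FP16, approximate

print(f"DDR5 @ 50 GB/s, 8K ctx:  {decode_tps_ceiling(weights, kv, 8192, 50e9):5.0f} tok/s max")
print(f"HBM3 @ 3.3 TB/s, 8K ctx: {decode_tps_ceiling(weights, kv, 8192, 3.3e12):5.0f} tok/s max")
```

Real throughput lands below these ceilings; the point is that the ceiling itself scales with bandwidth, not FLOPS.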

Prefill (N tokens at once)

Memory pattern: Compute + bandwidth

  • Read: All model weights (same as decode)
  • Write: N × K, N × V per layer to cache
  • Intermediate: N × hidden_dim per operation
  • Scales with N (context length)

Bottleneck: Both compute and memory

Prefill Activation Memory (Qwen2-0.5B)

| Prefill Length | Activation Memory | + KV Cache (FP32) |
|---|---|---|
| 256 tokens | ~400 MB | ~6 MB |
| 1K tokens | ~1.5 GB | ~24 MB |
| 4K tokens | ~6 GB | ~96 MB |
| 16K tokens | ~24 GB | ~384 MB |
Prefill vs Decode
Prefill processes all input tokens at once (compute-bound). Decode generates tokens one at a time (memory-bound). Long prompts can exceed GPU VRAM even when the model fits.

Part 4: Why FP16 KV Cache Matters for CPUs

FP16 KV Cache: Not About Bandwidth, About L3 Residency

The common misconception: "FP16 saves memory bandwidth."

The reality: FP16 doubles context that fits in L3 cache.

| L3 Cache Size | Max Hot Context (FP32 KV) | Max Hot Context (FP16 KV) | Benefit |
|---|---|---|---|
| 6 MB (laptop) | ~6K tokens | ~12K tokens | 2× more "hot" context |
| 32 MB (desktop) | ~32K tokens | ~64K tokens | 2× more "hot" context |
| 128 MB (server) | ~128K tokens | ~256K tokens | 2× more "hot" context |
| 384 MB (EPYC) | ~384K tokens | ~768K tokens | 2× more "hot" context |
The L3 Cache Insight
Decode streams through the model one layer at a time, so the "hot" working set is a single layer's KV slice - 1 KB per context token in FP32 for Qwen2-0.5B, which is what the table above assumes. When that slice fits in L3 (~100 GB/s and up), decode is fast. When it spills to DRAM (~20-50 GB/s), decode slows 2-5×. FP16 halves the slice, so twice the context stays L3-resident.
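The table's limits can be derived directly. This sketch assumes decode touches one layer's KV slice at a time (1 KB per token per layer in FP32 for Qwen2-0.5B); the function name is illustrative:

```python
# Longest context whose per-layer KV slice stays resident in L3.
def max_hot_context(l3_bytes, n_kv_heads=2, head_dim=64, dtype_bytes=4):
    per_layer_per_token = 2 * n_kv_heads * head_dim * dtype_bytes  # K and V
    return l3_bytes // per_layer_per_token

for l3_mb in (6, 32, 128, 384):
    fp32 = max_hot_context(l3_mb * 2**20)
    fp16 = max_hot_context(l3_mb * 2**20, dtype_bytes=2)
    print(f"{l3_mb:>3} MB L3: ~{fp32 // 1024}K tokens (FP32) vs ~{fp16 // 1024}K (FP16)")
```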

Part 5: The Complete Picture

Memory Requirements: Model + KV Cache + Activations

Total Memory = Model Weights + KV Cache + Activation Memory

For 70B model at 128K context (FP16 weights, FP16 KV):
  Model:       140 GB
  KV Cache:     40 GB
  Activations:  10 GB
  ──────────────────
  Total:       190 GB

GPU (H100 80GB):  DOESN'T FIT
GPU (3× H100):    Fits, but $120K+ and NVLink complexity

CPU (2TB server): FITS with 1.8TB headroom
                  Run 10× concurrent users!
                  $30K total cost
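The total above is simple addition; a small helper (hypothetical) makes it easy to try other model/context combinations:

```python
# Total footprint = weights + KV cache + activation allowance.
def total_memory_gb(model_gb, kv_per_token_kb, ctx, activations_gb):
    kv_gb = kv_per_token_kb * ctx / 2**20   # KB -> GiB (binary units)
    return model_gb + kv_gb + activations_gb

total = total_memory_gb(140, 320, 131_072, 10)   # 70B @ 128K, FP16
print(f"70B @ 128K: {total:.0f} GB -> fits one 80 GB H100? {total <= 80}")
```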
    

GPU Memory Crisis Points

  • 7B + 32K context: Tight on 24GB consumer GPU
  • 13B + 32K context: Needs 40GB+ GPU
  • 70B + 8K context: Needs 2× 80GB GPUs
  • 70B + 128K context: Needs 3× 80GB GPUs
  • 405B + any context: Needs 8+ GPUs minimum

Pattern: Every model × context combination hits a wall.
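Each crisis point follows from a one-line calculation. This is a lower bound - activations and framework overhead only make it worse:

```python
import math

# Minimum 80 GB GPUs needed just to hold weights + KV cache.
def min_gpus(model_gb, kv_gb, vram_gb=80):
    return math.ceil((model_gb + kv_gb) / vram_gb)

print(min_gpus(140, 2.6))   # 70B @ 8K   -> 2 GPUs
print(min_gpus(140, 41))    # 70B @ 128K -> 3 GPUs
```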

CPU Memory Advantages

  • Capacity: 2-4 TB per server
  • Cost: ~$10/GB vs ~$500/GB for HBM
  • No walls: Just add more RAM
  • Multi-model: Run many models in same RAM
  • Long context: 1M+ tokens? No problem.

Pattern: Memory scales with need, not with GPU count.

Summary: The Numbers Don't Lie

What NVIDIA Marketing Says vs Reality

| Claim | Marketing | Reality |
|---|---|---|
| "GPUs are faster" | Higher FLOPS | LLM inference is memory-bound, not compute-bound |
| "80GB is enough" | Fits 40B model | 70B + long context needs 200 GB+ |
| "Just add more GPUs" | Scale with NVLink | $40K per GPU, max 8 per node, then Ethernet anyway |
| "HBM is fast" | 3.3 TB/s bandwidth | Irrelevant when the model doesn't fit (0 × ∞ = 0) |
| "Enterprise ready" | DGX systems | $500K+, still limited by VRAM per model |
The Bottom Line
Real LLM deployment needs: large models + long context + multiple concurrent users. CPUs provide 10-100× more memory at 10× lower cost. The math is simple: 0 × ∞ = 0. If it doesn't fit, FLOPS don't matter.

Visual Guides

Memory Reality Infographic

GPU vs CPU memory comparison for LLM inference:

Memory Reality - GPU vs CPU comparison showing memory capacity, 70B model fit analysis, and cost comparison

