v8 MLA / Kimi Decode Cache

This page documents the v8 implementation contract for Kimi/DeepSeek-style Multi-Head Latent Attention (MLA): how the template describes the latent KV path, how lowering inserts explicit prefill/decode cache operations, and which C kernels currently back the graph.

Scope

This page covers compiler, template, and kernel-contract support. A full Kimi model smoke test is a high-memory job and is not expected to run on a constrained laptop.

Why MLA Exists

Standard attention stores full K and V vectors in the KV cache. MLA compresses the K/V path into a smaller latent representation, then reconstructs the pieces needed by attention. That makes the memory contract different from ordinary GQA/MQA, especially during decode, where every new token reads the historical cache state.

Key MLA dimensions: kv_lora_rank, qk_nope_head_dim, qk_rope_head_dim, v_head_dim.
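
To make the size difference concrete, the following sketch compares the per-token element count of fully expanded K/V heads with the compressed latent form. All numeric values are illustrative assumptions, not values read from kimi_vl.json, and the sketch assumes the DeepSeek-style convention of one RoPE key slice shared across heads. It does not say which form the v8 cache buffers hold; that is defined by the kernels.

```python
# Illustrative values only (assumed, not taken from the Kimi template).
num_heads = 16
kv_lora_rank = 512
qk_nope_head_dim = 128
qk_rope_head_dim = 64
v_head_dim = 128

# Each expanded K head is the no-position part plus the RoPE part.
k_head_dim = qk_nope_head_dim + qk_rope_head_dim

# Elements per token if every head's K and V were stored in full.
expanded_per_token = num_heads * (k_head_dim + v_head_dim)

# Elements per token for the latent vector plus one shared RoPE key slice.
latent_per_token = kv_lora_rank + qk_rope_head_dim

print(expanded_per_token, latent_per_token)  # prints: 5120 576
```

With these assumed numbers the latent form is roughly nine times smaller per token, which is the gap that motivates MLA.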

Implemented Surface

version/v8/templates/kimi_vl.json declares the MLA decode cache contract.
build_ir_v8.py inserts MLA cache store operations and switches decode attention to the MLA decode kernel.
deepseek_kernels.c owns the reference C MLA kernels.
Focused unit tests validate the template lowering without requiring the full Kimi artifact.

Figure: Kimi and DeepSeek-style MLA decode cache diagram showing low-rank KV compression, kv_lora_decompress, partial RoPE concat, explicit cache store, and decode attention.

The MLA Path

1 · Normalize

Block RMSNorm

The layer input is normalized before both the Q projection and compressed KV projection consume it.

2 · Project

Q + compressed KV

q_proj emits Q. kv_a_proj emits the compressed latent KV vector instead of full K and V heads.

3 · Expand

KV LoRA Decompress

kv_lora_decompress expands latent KV into no-position K and V scratch buffers using the manifest-selected FP32 or BF16 kernel.
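
A minimal pure-Python sketch of the expansion step follows. It is not the C kernel: the function name `kv_decompress`, the weight argument `w_kv_b`, and the row layout (per head, K-noPE rows followed by V rows) are assumptions made for illustration.

```python
def kv_decompress(latent, w_kv_b, num_heads, nope_dim, v_dim):
    """Expand one latent vector into per-head K-noPE and V rows.

    latent: list of kv_lora_rank floats.
    w_kv_b: num_heads * (nope_dim + v_dim) rows, each kv_lora_rank wide.
            Assumed layout: for each head, K-noPE rows then V rows.
    """
    per_head = nope_dim + v_dim
    assert len(w_kv_b) == num_heads * per_head
    # Plain matrix-vector product: one dot product per output row.
    out = [sum(w * x for w, x in zip(row, latent)) for row in w_kv_b]
    k_nope, v = [], []
    for h in range(num_heads):
        base = h * per_head
        k_nope.append(out[base:base + nope_dim])
        v.append(out[base + nope_dim:base + per_head])
    return k_nope, v


# Tiny example: rank 2, two heads, one K-noPE element and one V element each.
latent = [1.0, 2.0]
w_kv_b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
k_nope, v = kv_decompress(latent, w_kv_b, num_heads=2, nope_dim=1, v_dim=1)
print(k_nope, v)  # prints: [[1.0], [3.0]] [[2.0], [2.0]]
```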

4 · Position

Partial RoPE concat

The RoPE slice is positioned and concatenated with the no-position K part. This is why Kimi cannot be treated as plain full-head RoPE.
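
The K side of this step can be sketched as below. The rotation pairs adjacent elements and uses a base of 10000; both are assumptions, as are the shared-across-heads RoPE key slice and the `[k_nope | rope]` ordering. The packed layout the real kernel emits may differ, and the Q side is omitted.

```python
import math


def rope_rotate(vec, position, theta=10000.0):
    """Rotate adjacent (even, odd) pairs by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0
    out = []
    for i in range(0, dim, 2):
        # i is the even element index, so theta ** (-i / dim) is the
        # usual RoPE frequency for pair number i // 2.
        angle = position * theta ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return out


def partial_rope_concat(k_nope, k_rope, position):
    """Build per-head K rows as [k_nope_h | RoPE(k_rope)]."""
    rotated = rope_rotate(k_rope, position)
    return [head + rotated for head in k_nope]


# Only the RoPE slice depends on position; the no-position part is untouched.
k_at_0 = partial_rope_concat([[1.0], [3.0]], [1.0, 0.0], position=0)
k_at_5 = partial_rope_concat([[1.0], [3.0]], [1.0, 0.0], position=5)
print(k_at_0)  # prints: [[1.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
```

Because only part of each K head is rotated, a lowering that applies RoPE to the full head would corrupt the no-position part.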

5 · Cache

Explicit MLA KV store

Prefill writes a batch cache. Decode writes the current token cache. The cache operation is explicit in IR rather than hidden inside attention.
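
The difference between the two store operations can be sketched with a toy cache. `MlaCache` and its methods are hypothetical names for illustration; the real buffer layout is owned by the C kernels.

```python
class MlaCache:
    """Toy per-layer cache holding one K row and one V row per position."""

    def __init__(self, max_tokens):
        self.k = [None] * max_tokens
        self.v = [None] * max_tokens
        self.length = 0  # number of valid positions

    def store(self, position, k_row, v_row):
        """Decode-style store: write exactly one token."""
        if not 0 <= position < len(self.k):
            raise IndexError("position outside cache capacity")
        self.k[position] = k_row
        self.v[position] = v_row
        self.length = max(self.length, position + 1)

    def batch_store(self, start, k_rows, v_rows):
        """Prefill-style store: write a contiguous block of tokens."""
        assert len(k_rows) == len(v_rows)
        for offset, (k_row, v_row) in enumerate(zip(k_rows, v_rows)):
            self.store(start + offset, k_row, v_row)


cache = MlaCache(max_tokens=4)
cache.batch_store(0, [[1.0], [2.0]], [[10.0], [20.0]])  # prefill two tokens
cache.store(2, [3.0], [30.0])                           # decode one token
print(cache.length)  # prints: 3
```

Keeping the store as its own operation means a test can inspect cache contents after the store without running attention at all.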

6 · Attend

MLA decode attention

Decode attention reads the explicit MLA cache and uses deepseek_mla_attention_decode_f32.
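
As a reference for what a cache-reading decode step computes, here is a single-head sketch in pure Python. It is not the `deepseek_mla_attention_decode_f32` implementation: the `1/sqrt(head_dim)` scale and the row-per-position cache layout are assumptions.

```python
import math


def decode_attention(q, k_cache, v_cache, length):
    """Attend one query over the first `length` cached positions."""
    scale = 1.0 / math.sqrt(len(q))  # assumed softmax scale
    scores = [
        scale * sum(a * b for a, b in zip(q, k_cache[t]))
        for t in range(length)
    ]
    # Subtract the max score before exponentiating for numerical stability.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [0.0] * len(v_cache[0])
    for t in range(length):
        w = weights[t] / total
        for j, value in enumerate(v_cache[t]):
            out[j] += w * value
    return out


# A zero query gives uniform weights, so the output is the mean of V.
out = decode_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [3.0]], 2)
print(out)  # prints: [2.0]
```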

Kernel Contract

Template op	Kernel	Purpose
`kv_lora_decompress`	`deepseek_mla_kv_decompress_f32`, `deepseek_mla_kv_decompress_bf16`	Expand compressed latent KV into K-noPE and V scratch buffers. The BF16 path is selected when `mla_kv_b_proj` is BF16.
`partial_rope_concat`	`deepseek_mla_partial_rope_concat_packed_f32`	Apply positioned RoPE to the RoPE slice and pack the K representation expected by MLA attention.
`mla_kv_cache_batch_store`	`deepseek_mla_kv_cache_batch_store_f32`	Store prefill K/V scratch into the MLA decode cache for a token block.
`mla_kv_cache_store`	`deepseek_mla_kv_cache_store_f32`	Store the current decode token K/V scratch into the MLA decode cache.
`mla_attention`	`deepseek_mla_attention_f32`, `deepseek_mla_attention_decode_f32`	Prefill uses the reference MLA attention path. Decode switches to the cache-reading decode kernel.
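
The selection rules in the table can be restated as a small lookup. The function `resolve_kernel` and its parameters are hypothetical; in v8 the actual selection lives in the manifest and build_ir_v8.py. Only the op and kernel names come from the table.

```python
# Ops that map to a single kernel regardless of mode or dtype.
FIXED_KERNELS = {
    "partial_rope_concat": "deepseek_mla_partial_rope_concat_packed_f32",
    "mla_kv_cache_batch_store": "deepseek_mla_kv_cache_batch_store_f32",
    "mla_kv_cache_store": "deepseek_mla_kv_cache_store_f32",
}


def resolve_kernel(op, mode="prefill", kv_b_dtype="f32"):
    """Return the kernel name for a template op (illustrative helper)."""
    if op == "kv_lora_decompress":
        # The BF16 kernel is chosen when the kv_b projection weight is BF16.
        suffix = "bf16" if kv_b_dtype == "bf16" else "f32"
        return "deepseek_mla_kv_decompress_" + suffix
    if op == "mla_attention":
        # Decode switches to the cache-reading kernel.
        if mode == "decode":
            return "deepseek_mla_attention_decode_f32"
        return "deepseek_mla_attention_f32"
    return FIXED_KERNELS[op]


print(resolve_kernel("mla_attention", mode="decode"))
# prints: deepseek_mla_attention_decode_f32
```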

Why the Cache Is Explicit

Ordinary transformer lowering can often treat cache store as a generic post-RoPE helper. MLA needs a clearer contract because K and V are produced through a compressed latent path and partial-RoPE concat. If that cache behavior is hidden inside one opaque attention kernel, the IR visualizer cannot show what is stored, tests cannot independently validate cache placement, and future MPI/pipeline execution has no clean ownership boundary for the cache buffers.

prefill:
  x -> norm -> q_proj
  x -> norm -> kv_a_proj -> kv_a_norm -> kv_lora_decompress
  q + k_nope + rope -> partial_rope_concat
  k/v scratch -> mla_kv_cache_batch_store
  q + cache -> mla_attention

decode:
  x -> norm -> q_proj
  x -> norm -> kv_a_proj -> kv_a_norm -> kv_lora_decompress
  q + k_nope + rope(position=t) -> partial_rope_concat
  k/v scratch -> mla_kv_cache_store
  q + historical cache -> mla_attention_decode
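
One benefit of explicit cache ops is that ordering can be checked directly on the op sequence. The sketch below flattens the two flows above into plain lists and asserts that the store precedes attention. The list representation and the helper name are assumptions; the real IR emitted by build_ir_v8.py is richer than a list of names.

```python
# Op names taken from the prefill and decode flows, flattened in order.
PREFILL_OPS = [
    "norm", "q_proj", "kv_a_proj", "kv_a_norm", "kv_lora_decompress",
    "partial_rope_concat", "mla_kv_cache_batch_store", "mla_attention",
]
DECODE_OPS = [
    "norm", "q_proj", "kv_a_proj", "kv_a_norm", "kv_lora_decompress",
    "partial_rope_concat", "mla_kv_cache_store", "mla_attention_decode",
]


def store_precedes_attention(ops, store_op, attention_op):
    """True when both ops are present and the store comes first."""
    if store_op not in ops or attention_op not in ops:
        return False
    return ops.index(store_op) < ops.index(attention_op)


prefill_ok = store_precedes_attention(
    PREFILL_OPS, "mla_kv_cache_batch_store", "mla_attention")
decode_ok = store_precedes_attention(
    DECODE_OPS, "mla_kv_cache_store", "mla_attention_decode")
print(prefill_ok, decode_ok)  # prints: True True
```

A check like this would fail if the store were folded into an opaque attention kernel, since the store op would no longer appear in the sequence.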

Validation

The current laptop-safe validation focuses on contracts and kernels, not full Kimi inference:

.venv/bin/python -m py_compile version/v8/scripts/build_ir_v8.py
make build/libckernel_engine.so
.venv/bin/python unittest/test_deepseek_reference_kernels.py
.venv/bin/python -m unittest \
  tests.test_v8_kimi_template \
  tests.test_v8_model_contract_inspector \
  tests.test_v8_template_circuit_audit

A full Kimi run should be treated as a high-memory smoke test on a larger CPU host. The important local guarantees are that the template resolves to the intended MLA kernels, that the decode cache operations appear explicitly in IR, and that existing promoted v8 families do not regress.