v8 MLA / Kimi Decode Cache
This page documents the v8 implementation contract for Kimi/DeepSeek-style Multi-Head Latent Attention (MLA): how the template describes the latent KV path, how lowering inserts explicit prefill/decode cache operations, and which C kernels currently back the graph.
This page covers compiler, template, and kernel-contract support. A full Kimi model smoke test is a high-memory job and is not expected to run on a constrained laptop.
Why MLA Exists
Standard attention stores full K and V vectors in the KV cache. MLA compresses the K/V path into a smaller latent representation, then reconstructs the pieces needed by attention. That makes the memory contract different from ordinary GQA/MQA, especially during decode where every token reads historical cache state.
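To make the size difference concrete, the arithmetic below compares the per-token width of full K and V heads with the width of a compressed latent vector. It is a minimal sketch: every dimension is a hypothetical value chosen for illustration and is not read from any Kimi or DeepSeek configuration. It says nothing about the byte layout of the v8 decode cache itself.

```python
# Hypothetical dimensions, chosen only to make the arithmetic concrete.
NUM_HEADS = 16        # assumed number of K/V heads
HEAD_DIM = 128        # assumed per-head width
LATENT_DIM = 512      # assumed width of the compressed latent KV vector
BYTES_PER_VALUE = 4   # FP32


def full_kv_bytes_per_token(num_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes for one token when a full K and a full V vector are kept per head."""
    return 2 * num_heads * head_dim * bytes_per_value


def latent_bytes_per_token(latent_dim: int, bytes_per_value: int) -> int:
    """Bytes for one token when only the compressed latent vector is kept."""
    return latent_dim * bytes_per_value


full_bytes = full_kv_bytes_per_token(NUM_HEADS, HEAD_DIM, BYTES_PER_VALUE)
latent_bytes = latent_bytes_per_token(LATENT_DIM, BYTES_PER_VALUE)
print(full_bytes, latent_bytes, full_bytes / latent_bytes)
```

With these assumed numbers the latent representation is one eighth the width of the full K/V pair. The real ratio depends on the model's actual dimensions.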
Implemented Surface
- version/v8/templates/kimi_vl.json declares the MLA decode cache contract.
- build_ir_v8.py inserts MLA cache store operations and switches decode attention to the MLA decode kernel.
- deepseek_kernels.c owns the reference C MLA kernels.
- Focused unit tests validate the template lowering without requiring the full Kimi artifact.
The MLA Path
Block RMSNorm
The layer input is normalized before both the Q projection and compressed KV projection consume it.
Q + compressed KV
q_proj emits Q. kv_a_proj emits the compressed latent KV vector instead of full K and V heads.
KV LoRA Decompress
kv_lora_decompress expands latent KV into no-position K and V scratch buffers using the manifest-selected FP32 or BF16 kernel.
Partial RoPE concat
The RoPE slice is positioned and concatenated with the no-position K part. This is why Kimi cannot be treated as plain full-head RoPE.
Explicit MLA KV store
Prefill writes a batch cache. Decode writes the current token cache. The cache operation is explicit in IR rather than hidden inside attention.
MLA decode attention
Decode attention reads the explicit MLA cache and uses deepseek_mla_attention_decode_f32.
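The decompress and partial-RoPE steps above can be sketched in a few lines of plain Python. This is an illustration of the data flow only, not the C kernels. The per-head output layout of the up-projection, the interleaved RoPE pairing, the frequency base, the no-position-then-RoPE concat order, and the sharing of one RoPE slice across heads are all assumptions made for the example. The function names are hypothetical.

```python
import math


def matvec(weight, vec):
    """Dense matrix-vector product over nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def kv_decompress(latent, w_up, num_heads, nope_dim, v_dim):
    """Expand the latent vector into per-head k_nope and v rows.

    Assumes the projected output is laid out per head as [k_nope | v].
    """
    flat = matvec(w_up, latent)
    per_head = nope_dim + v_dim
    k_nope, v = [], []
    for h in range(num_heads):
        chunk = flat[h * per_head:(h + 1) * per_head]
        k_nope.append(chunk[:nope_dim])
        v.append(chunk[nope_dim:])
    return k_nope, v


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs by a position-dependent angle (assumed convention)."""
    if len(vec) % 2:
        raise ValueError("RoPE slice must have even length")
    out = list(vec)
    dim = len(vec)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def partial_rope_concat(k_nope, k_rope_raw, position):
    """Position only the RoPE slice, then append it to each head's k_nope part."""
    k_rope = rope(k_rope_raw, position)
    return [head + k_rope for head in k_nope]


# Tiny example: 2 heads, 1 no-position value and 1 V value per head.
latent = [1.0, 2.0]
w_up = [[1, 0], [0, 1], [1, 1], [1, -1]]
k_nope, v = kv_decompress(latent, w_up, num_heads=2, nope_dim=1, v_dim=1)
k_pos0 = partial_rope_concat(k_nope, [0.5, -0.5], position=0)
k_pos5 = partial_rope_concat(k_nope, [0.5, -0.5], position=5)
```

Only the trailing RoPE slice differs between k_pos0 and k_pos5. The k_nope part is position-independent, which is why the sketch treats the two parts separately instead of rotating the whole head.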
Kernel Contract
| Template op | Kernel | Purpose |
|---|---|---|
| kv_lora_decompress | deepseek_mla_kv_decompress_f32, deepseek_mla_kv_decompress_bf16 | Expand compressed latent KV into K-noPE and V scratch buffers. The BF16 path is selected when mla_kv_b_proj is BF16. |
| partial_rope_concat | deepseek_mla_partial_rope_concat_packed_f32 | Apply positioned RoPE to the RoPE slice and pack the K representation expected by MLA attention. |
| mla_kv_cache_batch_store | deepseek_mla_kv_cache_batch_store_f32 | Store prefill K/V scratch into the MLA decode cache for a token block. |
| mla_kv_cache_store | deepseek_mla_kv_cache_store_f32 | Store the current decode token K/V scratch into the MLA decode cache. |
| mla_attention | deepseek_mla_attention_f32, deepseek_mla_attention_decode_f32 | Prefill uses the reference MLA attention path. Decode switches to the cache-reading decode kernel. |
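The selection rules in the table can be written down as a small lookup. The sketch below uses the op and kernel names from the table, but the function resolve_kernel, its parameters, and the "prefill"/"decode" and "fp32"/"bf16" strings are hypothetical. It does not reproduce how build_ir_v8.py actually resolves kernels.

```python
# Ops that map to exactly one kernel regardless of mode or dtype.
FIXED_KERNELS = {
    "partial_rope_concat": "deepseek_mla_partial_rope_concat_packed_f32",
    "mla_kv_cache_batch_store": "deepseek_mla_kv_cache_batch_store_f32",
    "mla_kv_cache_store": "deepseek_mla_kv_cache_store_f32",
}


def resolve_kernel(op: str, mode: str = "prefill", kv_b_dtype: str = "fp32") -> str:
    """Return the kernel name for a template op (hypothetical helper).

    mode:       "prefill" or "decode"
    kv_b_dtype: dtype of mla_kv_b_proj, "fp32" or "bf16"
    """
    if op == "kv_lora_decompress":
        # The BF16 kernel is chosen only when mla_kv_b_proj is BF16.
        if kv_b_dtype == "bf16":
            return "deepseek_mla_kv_decompress_bf16"
        return "deepseek_mla_kv_decompress_f32"
    if op == "mla_attention":
        # Decode switches to the cache-reading kernel.
        if mode == "decode":
            return "deepseek_mla_attention_decode_f32"
        return "deepseek_mla_attention_f32"
    if op in FIXED_KERNELS:
        return FIXED_KERNELS[op]
    raise KeyError(f"unknown MLA template op: {op}")


print(resolve_kernel("mla_attention", mode="decode"))
print(resolve_kernel("kv_lora_decompress", kv_b_dtype="bf16"))
```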
Why the Cache Is Explicit
Ordinary transformer lowering can often treat cache store as a generic post-RoPE helper. MLA needs a clearer contract because K and V are produced through a compressed latent path and partial-RoPE concat. If that cache behavior is hidden inside one opaque attention kernel, the IR visualizer cannot show what is stored, tests cannot independently validate cache placement, and future MPI/pipeline execution has no clean ownership boundary for the cache buffers.
prefill:
x -> norm -> q_proj
x -> norm -> kv_a_proj -> kv_a_norm -> kv_lora_decompress
q + k_nope + rope -> partial_rope_concat
k/v scratch -> mla_kv_cache_batch_store
q + cache -> mla_attention
decode:
x -> norm -> q_proj
x -> norm -> kv_a_proj -> kv_a_norm -> kv_lora_decompress
q + k_nope + rope(position=t) -> partial_rope_concat
k/v scratch -> mla_kv_cache_store
q + historical cache -> mla_attention_decode
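The two flows above differ only in how the cache is written and which attention path reads it. A toy single-head version, assuming a hypothetical cache that holds one K row and one V row per token position, shows the ordering: prefill stores a block, decode stores the current token, and decode attention then reads everything stored so far. The class and function names are illustrative and do not mirror the C kernel signatures.

```python
import math


class MlaDecodeCache:
    """Toy cache: one K row and one V row per token position (single head)."""

    def __init__(self, max_tokens: int):
        self.k = [None] * max_tokens
        self.v = [None] * max_tokens
        self.length = 0  # number of valid positions


def cache_batch_store(cache, k_block, v_block, start=0):
    """Prefill: write a contiguous block of token rows into the cache."""
    for i, (k, v) in enumerate(zip(k_block, v_block)):
        cache.k[start + i] = list(k)
        cache.v[start + i] = list(v)
    cache.length = max(cache.length, start + len(k_block))


def cache_store(cache, k, v, position):
    """Decode: write the current token's rows at its position."""
    cache.k[position] = list(k)
    cache.v[position] = list(v)
    cache.length = max(cache.length, position + 1)


def attention_decode(q, cache):
    """Softmax attention of one query over all cached positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [
        scale * sum(a * b for a, b in zip(q, cache.k[t]))
        for t in range(cache.length)
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(cache.v[0])
    for w, row in zip(weights, cache.v[:cache.length]):
        for j, x in enumerate(row):
            out[j] += (w / total) * x
    return out


cache = MlaDecodeCache(max_tokens=8)
# Prefill two tokens, then decode one token at position 2.
cache_batch_store(cache, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
cache_store(cache, [1.0, 1.0], [1.0, 1.0], position=2)
decoded = attention_decode([1.0, 0.0], cache)
```

Because the store is a separate call, a test can inspect cache.k and cache.length directly before attention runs, which is the kind of independent check an opaque fused kernel would not allow.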
Validation
The current laptop-safe validation is focused on contracts and kernels, not full Kimi inference:
.venv/bin/python -m py_compile version/v8/scripts/build_ir_v8.py
make build/libckernel_engine.so
.venv/bin/python unittest/test_deepseek_reference_kernels.py
.venv/bin/python -m unittest \
tests.test_v8_kimi_template \
tests.test_v8_model_contract_inspector \
tests.test_v8_template_circuit_audit
A full Kimi run should be treated as a high-memory smoke on a larger CPU host. The important local guarantee is that the template resolves to the intended MLA kernels, the decode cache operations appear explicitly in IR, and existing promoted v8 families do not regress.
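The local guarantee described above can be expressed as a structural check. The sketch below assumes a hypothetical IR shape, a list of dicts with "op" and "kernel" keys, because the real IR emitted by build_ir_v8.py is not shown on this page. It only illustrates the kind of assertion a focused test might make.

```python
DECODE_ATTENTION_KERNEL = "deepseek_mla_attention_decode_f32"


def check_decode_ir(ops):
    """Return a list of problems found in a decode op sequence (hypothetical IR shape)."""
    problems = []
    names = [op["op"] for op in ops]
    attention_ops = [op for op in ops if op["op"] == "mla_attention"]

    if "mla_kv_cache_store" not in names:
        problems.append("missing explicit mla_kv_cache_store")
    if not attention_ops:
        problems.append("missing mla_attention")
    for op in attention_ops:
        if op.get("kernel") != DECODE_ATTENTION_KERNEL:
            problems.append("decode attention does not use the decode kernel")
    if "mla_kv_cache_store" in names and attention_ops:
        # The current token must be stored before attention reads the cache.
        if names.index("mla_kv_cache_store") > names.index("mla_attention"):
            problems.append("cache store must precede decode attention")
    return problems


good_ir = [
    {"op": "kv_lora_decompress", "kernel": "deepseek_mla_kv_decompress_f32"},
    {"op": "partial_rope_concat", "kernel": "deepseek_mla_partial_rope_concat_packed_f32"},
    {"op": "mla_kv_cache_store", "kernel": "deepseek_mla_kv_cache_store_f32"},
    {"op": "mla_attention", "kernel": DECODE_ATTENTION_KERNEL},
]
bad_ir = [
    {"op": "mla_attention", "kernel": "deepseek_mla_attention_f32"},
]
print(check_decode_ir(good_ir))
print(check_decode_ir(bad_ir))
```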