Research Tracker
Tracking new techniques from recent papers for potential implementation in C-Kernel-Engine, to stay current with the field and plan future kernel additions.
Updated as new papers are reviewed. Techniques marked "Planned" are candidates for implementation.
New: Deep Training Intuition Playbook
An infographic roadmap for gradient intuition: checkpoint-efficient logging, failure pattern triage, and a 12-experiment curriculum from tiny models to Qwen-scale runs.
New: v7 Cross-Entropy Parity Deep Dive
Detailed CE math and production numerics: the `p - one_hot` derivation, PyTorch reduction semantics, ignore-index behavior, long-horizon drift history, and final stabilization changes.
Implementation Status Legend
Statuses used below: Implemented · Planned · Researching · Archived
Attention Variants
| Technique | Source | Status | Notes |
|---|---|---|---|
| Multi-Head Attention | Transformer (2017) | Implemented | Base attention with causal mask |
| Grouped Query Attention (GQA) | Llama 2 | Implemented | Shared K/V heads, reduces KV cache |
| Multi-Query Attention (MQA) | PaLM, Falcon | Implemented | GQA with kv_heads=1 |
| Sliding Window Attention | Mistral, Gemma 2 | Planned | Local attention window, reduces memory for long context |
| Multi-Head Latent Attention (MLA) | DeepSeek-V2 | Researching | Low-rank KV compression, massive KV cache reduction |
| Native Sparse Attention (NSA) | DeepSeek-V3 | Researching | Hardware-aligned sparse attention patterns |
| Differential Attention | Microsoft (2024) | Researching | Subtracts two softmax attentions to reduce noise |
Deep Dive: Multi-Head Latent Attention (MLA)
The KV Cache Problem
Standard attention requires storing K and V tensors for each token in the sequence. For long contexts, this becomes massive:
KV cache bytes = 2 × num_layers × seq_len × num_heads × head_dim × bytes_per_element

Example (70B-class model, 128K context, full multi-head attention, FP16):
2 × 80 × 128,000 × 64 × 128 × 2 bytes ≈ 335 GB just for the KV cache!
GQA/MQA reduce this by sharing K/V heads, but MLA takes a different approach.
MLA: Low-Rank KV Compression
Instead of reducing heads, MLA compresses the KV cache into a low-rank latent space:
Standard attention caches the full K and V per layer:
- h → W_K → K (full size)
- h → W_V → V (full size)
- Cache: [K, V] per layer

MLA caches only a small latent:
- h → W_compress → c (small latent vector, e.g., 1024-dim)
- Cache: [c] per layer (≈93% smaller!)
- At attention time: c → W_decompress_K → K and c → W_decompress_V → V (reconstruct)
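A minimal per-token sketch of the two projections, assuming row-major weight matrices and simple matrix-vector products (the kernel names match the implementation plan below; everything else is illustrative):

```c
#include <stddef.h>

/* y = W @ x for a row-major (rows x cols) matrix W. */
static void matvec(const float *W, const float *x, float *y,
                   size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += W[r * cols + c] * x[c];
        y[r] = acc;
    }
}

/* h[d_model] -> c[d_latent]: this small latent is all that gets cached. */
static void mla_compress(const float *W_c, const float *h, float *c,
                         size_t d_latent, size_t d_model) {
    matvec(W_c, h, c, d_latent, d_model);
}

/* c[d_latent] -> k[d_model]: K (or V, with its own weights) is
 * reconstructed from the latent at attention time. */
static void mla_decompress(const float *W_dk, const float *c, float *k,
                           size_t d_model, size_t d_latent) {
    matvec(W_dk, c, k, d_model, d_latent);
}
```

The memory win comes entirely from caching `c` (d_latent floats) instead of K and V (2 × num_heads × head_dim floats), at the cost of an extra matmul per cached token during attention.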
MLA Results (DeepSeek-V2)
| Metric | Standard | MLA | Improvement |
|---|---|---|---|
| KV Cache Size | 100% | 6.7% | 93.3% reduction |
| Generation Throughput | 1x | 5.76x | 5.76x faster |
| Training Cost | 100% | 57.5% | 42.5% cheaper |
RoPE Compatibility Challenge
RoPE encodes position in both Q and K, but MLA compresses K into a latent space. DeepSeek's solution:
- Keep RoPE on a subset of dimensions (not compressed)
- Compress the remaining dimensions (NoPE - no positional encoding)
- Requires careful dimension partitioning
- Q = [Q_rope (with position), Q_nope (compressed)]
- K = [K_rope (with position), K_nope (from latent)]
Implementation Considerations for C-Kernel-Engine
- New kernel: `mla_compress` (h → latent, small matmul)
- New kernel: `mla_decompress` (latent → K, V, during attention)
- Modified attention: handle split RoPE/NoPE dimensions
- KV cache layout: Store latent vectors instead of full K/V
- Memory savings: Critical for long context on CPU (limited RAM vs GPU)
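One possible cache layout, storing a single latent vector per (layer, position) instead of full K/V tensors. All names here are hypothetical, not an existing C-Kernel-Engine API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Latent KV cache: one d_latent vector per (layer, position).
 * A full FP32 cache would store 2 * n_heads * head_dim floats per
 * slot instead. */
typedef struct {
    size_t n_layers, max_seq, d_latent;
    float *data;  /* [n_layers][max_seq][d_latent], row-major */
} LatentKVCache;

static int latent_cache_init(LatentKVCache *c, size_t n_layers,
                             size_t max_seq, size_t d_latent) {
    c->n_layers = n_layers;
    c->max_seq = max_seq;
    c->d_latent = d_latent;
    c->data = calloc(n_layers * max_seq * d_latent, sizeof(float));
    return c->data ? 0 : -1;
}

/* Pointer to the latent slot for one (layer, position). */
static float *latent_cache_slot(LatentKVCache *c, size_t layer, size_t pos) {
    return c->data + (layer * c->max_seq + pos) * c->d_latent;
}

static size_t latent_cache_bytes(const LatentKVCache *c) {
    return c->n_layers * c->max_seq * c->d_latent * sizeof(float);
}
```

Since the latent is contiguous per position, decompression at attention time is a cache-friendly sequential read, which matters on CPU.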
Priority: High for long-context inference. The 93% KV cache reduction is huge for CPU deployment.
References
- DeepSeek-V2 Paper - Original MLA introduction
- DeepSeek-V3 MLA Explained - Visual walkthrough
- Understanding MLA - Technical deep dive
- TransMLA - Converting GQA models to MLA
Position Encodings
| Technique | Source | Status | Notes |
|---|---|---|---|
| RoPE | RoFormer, Llama | Implemented | Rotary position embedding, relative position via rotation |
| ALiBi | BLOOM | Planned | Linear bias, no learned params, easy length extrapolation |
| YaRN (RoPE scaling) | Together AI | Planned | NTK-aware RoPE interpolation for context extension |
| LongRoPE | Microsoft | Researching | Progressive interpolation for very long context |
Normalization
| Technique | Source | Status | Notes |
|---|---|---|---|
| RMSNorm | Llama, Mistral | Implemented | Simpler than LayerNorm, no mean centering |
| LayerNorm | Original Transformer | Implemented | Full normalization with mean and variance |
| QK-Norm | Gemma 2 | Planned | RMSNorm on Q and K before attention, training stability |
| Deep Norm | Microsoft | Researching | Scaled residual connections for very deep models |
Activations & MLP
| Technique | Source | Status | Notes |
|---|---|---|---|
| SwiGLU | Llama, Mistral | Implemented | Gated activation: Swish(xW_gate) * (xW_up) |
| GELU | GPT-2, BERT | Implemented | Gaussian Error Linear Unit |
| GeGLU | GLU Variants | Planned | GELU-gated variant |
| Mixture of Experts (MoE) | Mixtral, DeepSeek | Planned | Sparse expert routing, key for scaling |
| Shared Expert MoE | DeepSeek-V2 | Researching | Some experts always active, rest routed |
Quantization & Efficiency
| Technique | Source | Status | Notes |
|---|---|---|---|
| FP8 Training | DeepSeek-V3 | Planned | 8-bit floating point for training efficiency |
| INT8 Inference | Various | Planned | Post-training quantization for inference |
| GPTQ | Frantar et al. | Researching | One-shot weight quantization |
| AWQ | MIT | Researching | Activation-aware weight quantization |
Training Techniques
| Technique | Source | Status | Notes |
|---|---|---|---|
| AdamW | Standard | Planned | Decoupled weight decay, next after SGD |
| Gradient Checkpointing | Various | Planned | Trade compute for memory during backprop |
| μP (Maximal Update) | Microsoft | Researching | Hyperparameter transfer across model sizes |
| Multi-Token Prediction | Meta (2024) | Researching | Predict N tokens at once, better representations |
| Auxiliary-Loss-Free Load Balancing | DeepSeek-V3 | Researching | MoE balancing without aux loss |
Paper Deep Dives
Detailed notes on key papers for implementation reference.
DeepSeek-V3 (Dec 2024)
Key innovations:
- MLA (Multi-Head Latent Attention) - Compresses KV cache via low-rank projection
- DeepSeekMoE - Fine-grained experts with shared experts
- FP8 Training - Mixed precision with FP8 for efficiency
- Auxiliary-Loss-Free Balancing - Expert load balancing without extra loss terms
- Multi-Token Prediction - Speculative decoding friendly
Relevance: MLA could significantly reduce memory for long context. MoE is key for scaling.
Gemma 2 (Jun 2024)
Key innovations:
- Sliding Window + Global Attention - Alternating layers
- QK-Norm - RMSNorm on Q/K for stability
- Logit Soft-Capping - Prevents extreme attention scores
- Knowledge Distillation - Smaller models trained from larger
Relevance: Sliding window is practical for CPU (local memory access). QK-Norm is simple to add.
Qwen2 (Jun 2024)
Key innovations:
- GQA - Standard grouped query attention
- SwiGLU - Standard gated activation
- RoPE with YaRN - Extended context via interpolation
- Dual Chunk Attention - For very long context variants
Relevance: Mostly implemented. YaRN scaling would extend context capability.
MiniMax-01 (Jan 2025)
Key innovations:
- Lightning Attention - Linear attention variant
- Mixture of Experts - Sparse activation
- 1M+ context - Very long context support
Relevance: Linear attention could be interesting for CPU (no N² memory). Needs investigation.
Implementation Priority
Based on impact and feasibility for C-Kernel-Engine:
High Priority (Next Up)
- AdamW optimizer - Required for real training
- Sliding Window Attention - Memory efficiency for long context
- QK-Norm - Simple addition, helps stability
- ALiBi - Alternative to RoPE, easy length extrapolation
Medium Priority
- YaRN (RoPE scaling)
- Gradient checkpointing
- Basic MoE routing
- INT8 inference kernels
Research/Long-term
- MLA (requires deeper architectural changes)
- FP8 training (CPU support unclear)
- Linear attention variants
- Multi-token prediction
Adding New Papers
When reviewing a new paper, add an entry with:
- Technique name and source
- Status: Researching → Planned → Implemented
- Implementation notes: which kernels are affected, complexity estimate
- Relevance to CPU training (some techniques are GPU-specific)