Research Tracker
Tracking new techniques from recent papers for potential implementation in C-Kernel-Engine, to stay current with the field and plan future kernel additions.
Updated as new papers are reviewed. Techniques marked "Planned" are candidates for implementation.
New: Deep Training Intuition Playbook
An infographic roadmap for gradient intuition: checkpoint-efficient logging, failure pattern triage, and a 12-experiment curriculum from tiny models to Qwen-scale runs.
New: v7 Cross-Entropy Parity Deep Dive
Detailed CE math and production numerics: the `p - one_hot` derivation, PyTorch reduction semantics, ignore-index behavior, long-horizon drift history, and final stabilization changes.
Implementation Status Legend
Statuses used below: Implemented · Planned · Researching · Archived
Attention Variants
| Technique | Source | Status | Notes |
|---|---|---|---|
| Multi-Head Attention | Transformer (2017) | Implemented | Base attention with causal mask |
| Grouped Query Attention (GQA) | Llama 2 | Implemented | Shared K/V heads, reduces KV cache |
| Multi-Query Attention (MQA) | PaLM, Falcon | Implemented | GQA with kv_heads=1 |
| Sliding Window Attention | Mistral, Gemma 2 | Planned | Local attention window, reduces memory for long context |
| Multi-Head Latent Attention (MLA) | DeepSeek-V2 | Researching | Low-rank KV compression, massive KV cache reduction |
| Native Sparse Attention (NSA) | DeepSeek-V3 | Researching | Hardware-aligned sparse attention patterns |
| Differential Attention | Microsoft (2024) | Researching | Subtracts two softmax attentions to reduce noise |
Deep Dive: Multi-Head Latent Attention (MLA)
The KV Cache Problem
Standard attention requires storing K and V tensors for each token in the sequence. For long contexts, this becomes massive:
KV cache bytes = 2 × num_layers × seq_len × num_heads × head_dim × bytes_per_element

Example (70B-class model, 128K context, full multi-head attention, FP16):
2 × 80 × 128,000 × 64 × 128 × 2 bytes ≈ 335 GB just for the KV cache!
GQA/MQA reduce this by sharing K/V heads, but MLA takes a different approach.
MLA: Low-Rank KV Compression
Instead of reducing heads, MLA compresses the KV cache into a low-rank latent space:
Standard attention caches the full K and V per layer:
- h → W_K → K (full size)
- h → W_V → V (full size)
- Cache: [K, V] per layer

MLA caches only a small latent:
- h → W_compress → c (small latent vector, e.g., 1024-dim)
- Cache: [c] per layer (≈93% smaller!)
- At attention time: c → W_decompress_K → K and c → W_decompress_V → V (reconstruct)
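A minimal per-token sketch of the two projections, assuming row-major weight matrices and simple matrix-vector products (the kernel names match the implementation plan below; everything else is illustrative):

```c
#include <stddef.h>

/* y = W @ x for a row-major (rows x cols) matrix W. */
static void matvec(const float *W, const float *x, float *y,
                   size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++)
            acc += W[r * cols + c] * x[c];
        y[r] = acc;
    }
}

/* h[d_model] -> c[d_latent]: this small latent is all that gets cached. */
static void mla_compress(const float *W_c, const float *h, float *c,
                         size_t d_latent, size_t d_model) {
    matvec(W_c, h, c, d_latent, d_model);
}

/* c[d_latent] -> k[d_model]: K (or V, with its own weights) is
 * reconstructed from the latent at attention time. */
static void mla_decompress(const float *W_dk, const float *c, float *k,
                           size_t d_model, size_t d_latent) {
    matvec(W_dk, c, k, d_model, d_latent);
}
```

The memory win comes entirely from caching `c` (d_latent floats) instead of K and V (2 × num_heads × head_dim floats), at the cost of an extra matmul per cached token during attention.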
MLA Results (DeepSeek-V2)
| Metric | Standard | MLA | Improvement |
|---|---|---|---|
| KV Cache Size | 100% | 6.7% | 93.3% reduction |
| Generation Throughput | 1x | 5.76x | 5.76x faster |
| Training Cost | 100% | 57.5% | 42.5% cheaper |
RoPE Compatibility Challenge
RoPE encodes position in both Q and K, but MLA compresses K into a latent space. DeepSeek's solution:
- Keep RoPE on a subset of dimensions (not compressed)
- Compress the remaining dimensions (NoPE - no positional encoding)
- Requires careful dimension partitioning
- Q = [Q_rope (with position), Q_nope (compressed)]
- K = [K_rope (with position), K_nope (from latent)]
Implementation Considerations for C-Kernel-Engine
- New kernel: `mla_compress` (h → latent, small matmul)
- New kernel: `mla_decompress` (latent → K, V, during attention)
- Modified attention: handle split RoPE/NoPE dimensions
- KV cache layout: Store latent vectors instead of full K/V
- Memory savings: Critical for long context on CPU (limited RAM vs GPU)
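One possible cache layout, storing a single latent vector per (layer, position) instead of full K/V tensors. All names here are hypothetical, not an existing C-Kernel-Engine API:

```c
#include <stddef.h>
#include <stdlib.h>

/* Latent KV cache: one d_latent vector per (layer, position).
 * A full FP32 cache would store 2 * n_heads * head_dim floats per
 * slot instead. */
typedef struct {
    size_t n_layers, max_seq, d_latent;
    float *data;  /* [n_layers][max_seq][d_latent], row-major */
} LatentKVCache;

static int latent_cache_init(LatentKVCache *c, size_t n_layers,
                             size_t max_seq, size_t d_latent) {
    c->n_layers = n_layers;
    c->max_seq = max_seq;
    c->d_latent = d_latent;
    c->data = calloc(n_layers * max_seq * d_latent, sizeof(float));
    return c->data ? 0 : -1;
}

/* Pointer to the latent slot for one (layer, position). */
static float *latent_cache_slot(LatentKVCache *c, size_t layer, size_t pos) {
    return c->data + (layer * c->max_seq + pos) * c->d_latent;
}

static size_t latent_cache_bytes(const LatentKVCache *c) {
    return c->n_layers * c->max_seq * c->d_latent * sizeof(float);
}
```

Since the latent is contiguous per position, decompression at attention time is a cache-friendly sequential read, which matters on CPU.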
Priority: High for long-context inference. The 93% KV cache reduction is huge for CPU deployment.
References
- DeepSeek-V2 Paper - Original MLA introduction
- DeepSeek-V3 MLA Explained - Visual walkthrough
- Understanding MLA - Technical deep dive
- TransMLA - Converting GQA models to MLA
Position Encodings
| Technique | Source | Status | Notes |
|---|---|---|---|
| RoPE | RoFormer, Llama | Implemented | Rotary position embedding, relative position via rotation |
| ALiBi | BLOOM | Planned | Linear bias, no learned params, easy length extrapolation |
| YaRN (RoPE scaling) | Together AI | Planned | NTK-aware RoPE interpolation for context extension |
| LongRoPE | Microsoft | Researching | Progressive interpolation for very long context |
Normalization
| Technique | Source | Status | Notes |
|---|---|---|---|
| RMSNorm | Llama, Mistral | Implemented | Simpler than LayerNorm, no mean centering |
| LayerNorm | Original Transformer | Implemented | Full normalization with mean and variance |
| QK-Norm | Gemma 2 | Planned | RMSNorm on Q and K before attention, training stability |
| Deep Norm | Microsoft | Researching | Scaled residual connections for very deep models |
Activations & MLP
| Technique | Source | Status | Notes |
|---|---|---|---|
| SwiGLU | Llama, Mistral | Implemented | Gated activation: Swish(xW_gate) * (xW_up) |
| GELU | GPT-2, BERT | Implemented | Gaussian Error Linear Unit |
| GeGLU | GLU Variants | Planned | GELU-gated variant |
| Mixture of Experts (MoE) | Mixtral, DeepSeek | Planned | Sparse expert routing, key for scaling |
| Shared Expert MoE | DeepSeek-V2 | Researching | Some experts always active, rest routed |
Quantization & Efficiency
| Technique | Source | Status | Notes |
|---|---|---|---|
| FP8 Training | DeepSeek-V3 | Planned | 8-bit floating point for training efficiency |
| INT8 Inference | Various | Planned | Post-training quantization for inference |
| GPTQ | Frantar et al. | Researching | One-shot weight quantization |
| AWQ | MIT | Researching | Activation-aware weight quantization |
Training Techniques
| Technique | Source | Status | Notes |
|---|---|---|---|
| AdamW | Standard | Planned | Decoupled weight decay, next after SGD |
| Gradient Checkpointing | Various | Planned | Trade compute for memory during backprop |
| μP (Maximal Update) | Microsoft | Researching | Hyperparameter transfer across model sizes |
| Multi-Token Prediction | Meta (2024) | Researching | Predict N tokens at once, better representations |
| Auxiliary-Loss-Free Load Balancing | DeepSeek-V3 | Researching | MoE balancing without aux loss |
Paper Deep Dives
Detailed notes on key papers for implementation reference.
DeepSeek-V3 (Dec 2024)
Key innovations:
- MLA (Multi-Head Latent Attention) - Compresses KV cache via low-rank projection
- DeepSeekMoE - Fine-grained experts with shared experts
- FP8 Training - Mixed precision with FP8 for efficiency
- Auxiliary-Loss-Free Balancing - Expert load balancing without extra loss terms
- Multi-Token Prediction - Speculative decoding friendly
Relevance: MLA could significantly reduce memory for long context. MoE is key for scaling.
Gemma 2 (Jun 2024)
Key innovations:
- Sliding Window + Global Attention - Alternating layers
- QK-Norm - RMSNorm on Q/K for stability
- Logit Soft-Capping - Prevents extreme attention scores
- Knowledge Distillation - Smaller models trained from larger
Relevance: Sliding window is practical for CPU (local memory access). QK-Norm is simple to add.
Qwen2 (Jun 2024)
Key innovations:
- GQA - Standard grouped query attention
- SwiGLU - Standard gated activation
- RoPE with YaRN - Extended context via interpolation
- Dual Chunk Attention - For very long context variants
Relevance: Mostly implemented. YaRN scaling would extend context capability.
MiniMax-01 (Jan 2025)
Key innovations:
- Lightning Attention - Linear attention variant
- Mixture of Experts - Sparse activation
- 1M+ context - Very long context support
Relevance: Linear attention could be interesting for CPU (no N² memory). Needs investigation.
Implementation Priority
Based on impact and feasibility for C-Kernel-Engine:
High Priority (Next Up)
- AdamW optimizer - Required for real training
- Sliding Window Attention - Memory efficiency for long context
- QK-Norm - Simple addition, helps stability
- ALiBi - Alternative to RoPE, easy length extrapolation
Medium Priority
- YaRN (RoPE scaling)
- Gradient checkpointing
- Basic MoE routing
- INT8 inference kernels
Research/Long-term
- MLA (requires deeper architectural changes)
- FP8 training (CPU support unclear)
- Linear attention variants
- Multi-token prediction
Adding New Papers
When reviewing a new paper, add an entry with:
- Technique name and source
- Status: Researching → Planned → Implemented
- Implementation notes: which kernels are affected, complexity estimate
- Relevance to CPU training (some techniques are GPU-specific)