Research Tracker

Tracking new techniques from recent papers for potential implementation in C-Kernel-Engine. The goal is to stay current with the field and to plan future kernel additions.

Living Document
Updated as new papers are reviewed. Techniques marked "Planned" are candidates for implementation.

New: Deep Training Intuition Playbook

An infographic roadmap for gradient intuition: checkpoint-efficient logging, failure pattern triage, and a 12-experiment curriculum from tiny models to Qwen-scale runs.

Open training-intuition.html ->

New: v7 Cross-Entropy Parity Deep Dive

Detailed CE math and production numerics: p - one_hot derivation, PyTorch reduction semantics, ignore-index behavior, long-horizon drift history, and final stabilization changes.

Open v7-cross-entropy-parity.html ->

Implementation Status Legend

Implemented | Planned | Researching | Archived


Attention Variants

| Technique | Source | Status | Notes |
| Multi-Head Attention | Transformer (2017) | Implemented | Base attention with causal mask |
| Grouped Query Attention (GQA) | Llama 2 | Implemented | Shared K/V heads, reduces KV cache |
| Multi-Query Attention (MQA) | PaLM, Falcon | Implemented | GQA with kv_heads=1 |
| Sliding Window Attention | Mistral, Gemma 2 | Planned | Local attention window, reduces memory for long context |
| Multi-Head Latent Attention (MLA) | DeepSeek-V2 | Researching | Low-rank KV compression, massive KV cache reduction |
| Native Sparse Attention (NSA) | DeepSeek (2025) | Researching | Hardware-aligned sparse attention patterns |
| Differential Attention | Microsoft (2024) | Researching | Subtracts two softmax attentions to reduce noise |
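Of the planned variants, sliding window attention needs only a change to the mask predicate. A minimal sketch, assuming a causal window counted in tokens (the function name is illustrative):

```c
#include <stdbool.h>

/* Sliding-window causal mask: query position i may attend key position j
   iff j <= i (causal) and i - j < window (locality). */
bool sw_attend(int i, int j, int window) {
    return j <= i && (i - j) < window;
}
```

With window=4096 (Mistral's choice), each row of the attention matrix touches at most 4096 keys regardless of sequence length, which bounds both compute and KV-cache reads per token.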

Deep Dive: Multi-Head Latent Attention (MLA)

The KV Cache Problem

Standard attention requires storing K and V tensors for each token in the sequence. For long contexts, this becomes massive:

KV Cache size = 2 × num_layers × seq_len × num_heads × head_dim × bytes_per_element

Example (70B-class model, 128K context, 80 layers, 64 heads, head_dim 128):
= 2 × 80 × 128,000 × 64 × 128 × 2 bytes (FP16)
≈ 336 GB just for the KV cache!
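The formula is easy to wrap in a helper for sizing experiments; `kv_cache_bytes` is a hypothetical name, and the parameters below are the 70B-class example from above:

```c
#include <stdint.h>

/* KV cache bytes = 2 (K and V) x layers x seq_len x kv_heads x head_dim x bytes/elem.
   A direct transcription of the formula above; 64-bit math avoids overflow. */
uint64_t kv_cache_bytes(uint64_t layers, uint64_t seq_len,
                        uint64_t kv_heads, uint64_t head_dim,
                        uint64_t bytes_per_elem) {
    return 2ULL * layers * seq_len * kv_heads * head_dim * bytes_per_elem;
}
```

Plugging in 80 layers, 128K context, 64 heads, head_dim 128, FP16 gives ~336 GB; GQA with 8 KV heads divides that by 8 to ~42 GB, which motivates the next section.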

GQA/MQA reduce this by sharing K/V heads, but MLA takes a different approach.

MLA: Low-Rank KV Compression

Instead of reducing heads, MLA compresses the KV cache into a low-rank latent space:

Standard Attention:
  h → W_K → K (full size)
  h → W_V → V (full size)
  Cache: [K, V] per layer

MLA:
  h → W_compress → c (small latent vector, e.g., 1024-dim)
  Cache: [c] per layer  ← 93% smaller!

  At attention time:
  c → W_decompress_K → K (reconstruct)
  c → W_decompress_V → V (reconstruct)
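The compress/decompress path above can be sketched with naive matvecs. This is a simplified sketch with hypothetical weight names (`W_down`, `W_up_k`, `W_up_v`); DeepSeek's actual parameterization also carries a decoupled rotary sub-dimension, omitted here:

```c
#include <stddef.h>

/* y[r] = sum_c W[r*cols + c] * x[c]  (row-major matrix-vector product) */
void matvec(const float *W, const float *x, float *y, size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; c++) acc += W[r * cols + c] * x[c];
        y[r] = acc;
    }
}

/* Cache only the latent: c = W_down * h, with d_latent << d_model. */
void mla_compress(const float *W_down, const float *h, float *latent,
                  size_t d_latent, size_t d_model) {
    matvec(W_down, h, latent, d_latent, d_model);
}

/* At attention time, reconstruct K and V from the cached latent. */
void mla_decompress(const float *W_up_k, const float *W_up_v,
                    const float *latent, float *k, float *v,
                    size_t d_kv, size_t d_latent) {
    matvec(W_up_k, latent, k, d_kv, d_latent);
    matvec(W_up_v, latent, v, d_kv, d_latent);
}
```

The cache saving comes entirely from storing `latent` (d_latent floats per token per layer) instead of K and V (2 × d_kv floats); the decompression matmuls can also be folded into the attention weights at inference time.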

MLA Results (DeepSeek-V2)

| Metric | Standard | MLA | Improvement |
| KV Cache Size | 100% | 6.7% | 93.3% reduction |
| Generation Throughput | 1x | 5.76x | 5.76x faster |
| Training Cost | 100% | 57.5% | 42.5% cheaper |

RoPE Compatibility Challenge

RoPE encodes position into both Q and K, but MLA reconstructs K from a latent that carries no position information. DeepSeek's solution is to decouple a small rotary component from the compressed remainder:

Q = [Q_rope (with position), Q_nope (from compressed path)]
K = [K_rope (with position), K_nope (from latent)]

Implementation Considerations for C-Kernel-Engine

Priority: High for long-context inference. The 93% KV cache reduction is huge for CPU deployment.

Position Encodings

| Technique | Source | Status | Notes |
| RoPE | RoFormer, Llama | Implemented | Rotary position embedding, relative position via rotation |
| ALiBi | BLOOM | Planned | Linear bias, no learned params, easy length extrapolation |
| YaRN (RoPE scaling) | Together AI | Planned | NTK-aware RoPE interpolation for context extension |
| LongRoPE | Microsoft | Researching | Progressive interpolation for very long context |

Normalization

| Technique | Source | Status | Notes |
| RMSNorm | Llama, Mistral | Implemented | Simpler than LayerNorm, no mean centering |
| LayerNorm | Original Transformer | Implemented | Full normalization with mean and variance |
| QK-Norm | Gemma 2 | Planned | RMSNorm on Q and K before attention, training stability |
| Deep Norm | Microsoft | Researching | Scaled residual connections for very deep models |

Activations & MLP

| Technique | Source | Status | Notes |
| SwiGLU | Llama, Mistral | Implemented | Gated activation: Swish(xW_gate) * (xW_up) |
| GELU | GPT-2, BERT | Implemented | Gaussian Error Linear Unit |
| GeGLU | GLU Variants | Planned | GELU-gated variant |
| Mixture of Experts (MoE) | Mixtral, DeepSeek | Planned | Sparse expert routing, key for scaling |
| Shared Expert MoE | DeepSeek-V2 | Researching | Some experts always active, rest routed |

Quantization & Efficiency

| Technique | Source | Status | Notes |
| FP8 Training | DeepSeek-V3 | Planned | 8-bit floating point for training efficiency |
| INT8 Inference | Various | Planned | Post-training quantization for inference |
| GPTQ | Frantar et al. | Researching | One-shot weight quantization |
| AWQ | MIT | Researching | Activation-aware weight quantization |

Training Techniques

| Technique | Source | Status | Notes |
| AdamW | Standard | Planned | Decoupled weight decay, next after SGD |
| Gradient Checkpointing | Various | Planned | Trade compute for memory during backprop |
| μP (Maximal Update) | Microsoft | Researching | Hyperparameter transfer across model sizes |
| Multi-Token Prediction | Meta (2024) | Researching | Predict N tokens at once, better representations |
| Auxiliary-Loss-Free Load Balancing | DeepSeek-V3 | Researching | MoE balancing without aux loss |

Paper Deep Dives

Detailed notes on key papers for implementation reference.

DeepSeek-V3 (Dec 2024)

Key innovations:

  - Multi-Head Latent Attention (introduced in DeepSeek-V2)
  - Auxiliary-loss-free MoE load balancing
  - Multi-token prediction training objective
  - FP8 mixed-precision training

Relevance: MLA could significantly reduce memory for long context. MoE is key for scaling.

Gemma 2 (Jun 2024)

Key innovations:

  - Sliding window attention interleaved with global attention
  - QK-Norm for training stability
  - Logit soft-capping in attention and the final layer

Relevance: Sliding window is practical for CPU (local memory access). QK-Norm is simple to add.

Qwen2 (Jun 2024)

Key innovations:

  - GQA across all model sizes
  - YaRN-style RoPE scaling for long-context extension

Relevance: Mostly implemented. YaRN scaling would extend context capability.

MiniMax-01 (Jan 2025)

Key innovations:

  - Hybrid attention: linear ("lightning") attention layers interleaved with softmax attention
  - MoE feed-forward layers at large scale

Relevance: Linear attention could be interesting for CPU (no N² memory). Needs investigation.


Implementation Priority

Based on impact and feasibility for C-Kernel-Engine:

High Priority (Next Up)

  1. AdamW optimizer - required for real training
  2. Sliding Window Attention - memory efficiency for long context
  3. QK-Norm - simple addition, helps stability
  4. ALiBi - alternative to RoPE, easy length extrapolation

Medium Priority

Planned items from the tables above: YaRN RoPE scaling, GeGLU, Mixture of Experts, gradient checkpointing, FP8 training, INT8 inference.

Research/Long-term

Researching items from the tables above: MLA, NSA, Differential Attention, LongRoPE, Deep Norm, shared-expert MoE, GPTQ, AWQ, μP, multi-token prediction, auxiliary-loss-free load balancing.

Adding New Papers

When reviewing a new paper, add an entry with:

  1. Technique name and source
  2. Status: Researching, Planned, or Implemented
  3. Implementation notes: what kernels affected, complexity estimate
  4. Relevance to CPU training (some techniques are GPU-specific)