Iteration Philosophy: v1→v6 and Beyond
Why rapid iteration, continuous learning, and bold architectural changes drive breakthrough performance in kernel optimization
1. Constraint-Driven vs Specification-Driven Development
Specification-Driven
Assumption: We can define requirements upfront
- Write detailed specifications
- Design architecture to match spec
- Implement according to design
- Test against requirements
Problem: Specifications become obsolete as we learn
Constraint-Driven
Reality: Constraints emerge during development
- Identify current constraints
- Design around known constraints
- Prototype to discover new constraints
- Iterate as constraints evolve
Advantage: Adapt to reality, not assumptions
Common Constraints in Kernel Development
Hardware
- AVX-512 availability
- Cache hierarchy sizes
- Memory bandwidth limits
- NUMA topology
Performance
- Inference latency targets
- Throughput requirements
- Context length limits
- Batch size trade-offs
Compatibility
- Model format support
- API stability
- Type system
- Platform support
Each iteration reveals new constraints and invalidates old assumptions. The architecture evolves to honor these discovered constraints.
2. Why Iteration Matters in Kernel Development
Kernel optimization is fundamentally experimental. Unlike application development, where requirements are relatively stable, kernel performance depends on:
- Hardware evolution: New CPU instructions (AVX-512, VNNI, AMX), cache hierarchies, memory bandwidth
- Model architectures: Transformer variants, quantization formats, context length trends
- Workload diversity: Real-world inference patterns are never uniform
- Hidden bottlenecks: What seems fast in microbenchmarks fails at scale
Perfect architecture is a myth. The only constant is change.
The Perfectionist Trap
- Spend months designing the "perfect" v1 architecture
- Implement everything at once with complex abstractions
- Discover fundamental flaws only after full implementation
- Painful, expensive refactoring required
The Iterative Approach
- Ship v1 with core ideas quickly (weeks, not months)
- Learn from real-world performance gaps
- Refactor aggressively in v2 based on learnings
- Each version builds on proven foundations
3. The v1→v6 Evolution: Lessons Learned
v1: Proof of Concept
Goal: Can we build custom kernels that beat BLAS?
Key Learning: Baseline performance achievable, but architectural debt accumulated
Kept: Core kernel structure, basic quantization support
Discarded: Over-engineered abstractions, premature optimization
v2: IR Revolution
Goal: Portable, optimized code generation
Key Learning: IR-based approach enables rapid experimentation with different optimizations
Kept: IR concept, kernel building blocks
Discarded: Hardcoded kernel paths, architecture-specific spaghetti
v3: Quantization Breakthrough
Goal: Multi-format quantization (Q4, Q5, Q6, Q8)
Key Learning: Type-specific kernels vastly outperform generic approaches
Kept: Block-based quantization, format abstraction
Discarded: FP32-only kernels, limited quantization support
v4: Memory Architecture
Goal: Deterministic, safe memory management
Key Learning: Memory safety doesn't sacrifice performance with proper design
Kept: Bump allocator, safety guarantees
Discarded: Manual memory management, undefined behavior
v5: Threading & Scheduling
Goal: Efficient multi-threading for attention kernels
Key Learning: Work partitioning strategy matters more than thread count
Kept: Chunk-based scheduling, OpenMP integration
Discarded: Poorly designed work stealing, race conditions
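The chunk-based scheduling that survived v5 can be sketched simply: split the work into more chunks than threads so that load imbalance evens out as faster threads pick up extra chunks. This is an illustrative sketch in Python (the function name and the `chunk_factor` heuristic are ours, not the project's API):

```python
def make_chunks(n_rows, n_threads, chunk_factor=4):
    """Partition n_rows into roughly equal (start, end) chunks.

    Creating chunk_factor x n_threads chunks (rather than one per
    thread) lets a dynamic scheduler balance uneven per-row cost.
    """
    n_chunks = max(1, n_threads * chunk_factor)
    base, rem = divmod(n_rows, n_chunks)
    chunks, start = [], 0
    # If there are fewer rows than chunks, emit one chunk per row.
    for i in range(min(n_chunks, n_rows)):
        size = base + (1 if i < rem else 0)  # spread the remainder
        chunks.append((start, start + size))
        start += size
    return chunks
```

The key observation from v5 holds here: the partitioning strategy (chunk sizes and count), not the raw thread count, determines how well the threads stay busy.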
v6: Flash Attention & Beyond
Goal: Memory-efficient attention with competitive performance
Key Learning: ggml's custom CPU approach beats BLAS for attention; there is real value in learning from the competition
Kept: All proven v5 foundations, IR system, quantization formats
Discarding: Naive attention implementation, sub-optimal memory patterns
New Direction: Hybrid approach - combine our strengths with learned techniques
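The v6 direction leans on the flash-attention idea of never materializing the full score row. A minimal numpy sketch of the underlying online-softmax trick for a single query vector (illustrative only; the real kernels are SIMD C, and `block` is an arbitrary tile size):

```python
import numpy as np

def flash_attention_row(q, K, V, block=64):
    """Attention for one query: process K/V in blocks, keeping a
    running max and running softmax denominator so the full score
    row is never stored (the online-softmax trick)."""
    scale = 1.0 / np.sqrt(q.shape[0])
    m = -np.inf                      # running max of scores
    l = 0.0                          # running softmax denominator
    acc = np.zeros(V.shape[1])      # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) * scale   # one block of scores
        m_new = max(m, float(s.max()))
        p = np.exp(s - m_new)
        correction = np.exp(m - m_new)             # rescale old state
        l = l * correction + float(p.sum())
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / l
```

Because each block only rescales the running accumulator, memory use is bounded by the block size rather than the context length, which is exactly the property the naive implementation lacked.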
4. Case Study: IR v2 - When Iteration Requires Starting Over
IR v1: The Original Approach
What it was: Tight coupling between IR and generated code
- Hardcoded dimension layouts
- Fixed buffer arrangements
- Limited portability across models
- Difficult to optimize for different hardware
IR v2: The Redesigned Approach
What it became: Symbolic, portable model representation
- Symbolic dimensions: {vocab, embed, head_dim} that resolve at compile-time
- Explicit buffer layouts: Clean separation of concerns
- Model-agnostic: Works with any transformer architecture
- Hardware-aware: Can optimize for different CPU features
How IR v2 Embodies Iteration Philosophy
1. Kept the Foundation
The fundamental idea of IR-based code generation was sound and worth preserving
2. Discarded the Implementation
Everything else (data structures, APIs, layouts) was redesigned from scratch based on what we had learned
3. Introduced Symbolic Dimension System
Breakthrough insight: Instead of hardcoding shapes like [151936, 896], use symbolic names like [vocab, embed] that resolve from model config at compile-time
The Symbolic Dimension Innovation
```json
{
  "dimensions": [
    { "id": 0,  "name": "tokens" },        // runtime: batch_size x seq_len
    { "id": 1,  "name": "embed" },         // from hidden_size in config.json
    { "id": 2,  "name": "aligned_embed" }, // 64-byte aligned for SIMD
    { "id": 10, "name": "vocab" }          // from vocab_size in config.json
  ],
  "buffers": [
    { "name": "token_emb", "shape": [{ "dim": 10 }, { "dim": 2 }] }  // symbolic!
  ]
}
```
This enables portability without sacrificing performance: dimensions are symbolic at the IR level but concrete at runtime.
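As an illustration of how such symbolic dimensions might resolve against a model's config.json, here is a hedged Python sketch. The resolver logic and function names are hypothetical; only the dimension ids and the example values (hidden_size 896, vocab_size 151936) come from the snippets above:

```python
def resolve_dimensions(dims, config, align=64):
    """Map symbolic dimension ids to concrete sizes from a model config."""
    values = {}
    for d in dims:
        name = d["name"]
        if name == "embed":
            values[d["id"]] = config["hidden_size"]
        elif name == "aligned_embed":
            # round hidden_size up to an alignment-friendly multiple
            h = config["hidden_size"]
            values[d["id"]] = (h + align - 1) // align * align
        elif name == "vocab":
            values[d["id"]] = config["vocab_size"]
        elif name == "tokens":
            values[d["id"]] = None  # resolved at runtime: batch x seq

    return values

def resolve_shape(shape, values):
    """Turn a symbolic shape like [{dim:10}, {dim:2}] into concrete sizes."""
    return [values[ref["dim"]] for ref in shape]
```

With the Qwen-like config from the example, the symbolic `token_emb` shape resolves to the concrete `[151936, 896]` that v1 used to hardcode.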
Learn more: See IR v2 Format Documentation for complete technical details, dimension mappings, and the full compilation pipeline from HuggingFace models to optimized C code.
5. The Art of Selective Preservation
5.1 What We Always Keep
Performance-Critical Code
- Hand-optimized SIMD kernels
- Proven quantization formats
- Efficient memory layouts
- Cache-friendly access patterns
Architectural Foundations
- IR-based code generation
- Type abstraction layers
- Safe memory management
- Test infrastructure
5.2 What We Aggressively Discard
Over-Engineering
- Abstractions that add overhead without benefits
- Premature generalization
- Complex code paths for edge cases
- Architectural dogmatism
Proven Inefficiencies
- Sub-optimal algorithms (even if "elegant")
- Poor work partitioning strategies
- Unnecessary dependencies
- Legacy compatibility layers
6. Fast Idea Testing: The v(next) Strategy
Every new version follows a predictable pattern:
- Identify the bottleneck: What prevents better performance?
- Generate hypotheses: 3-5 potential solutions
- Prototype rapidly: Build minimal proof-of-concept in days, not weeks
- Measure ruthlessly: Benchmark on real workloads
- Scale or discard: If promising → full implementation; if not → learn and move on
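The "measure ruthlessly" and "scale or discard" steps can be sketched as a tiny benchmark harness. This is an illustrative stand-in, not the project's actual tooling; names like `pick_winner` are ours:

```python
import statistics
import time

def benchmark(fn, *args, repeats=20, warmup=3):
    """Median wall time over several runs, after warm-up,
    so competing prototypes are compared on equal footing."""
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

def pick_winner(prototypes, *args):
    """Scale or discard: benchmark every hypothesis, keep the fastest."""
    results = {name: benchmark(fn, *args) for name, fn in prototypes.items()}
    return min(results, key=results.get), results
```

The losing prototypes are not wasted work: their measured numbers become the documented reason an approach was discarded.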
Example: Flash Attention v6
Problem: Our attention implementation was slower than ggml on several workloads
Hypothesis 1: Need custom CPU kernels (not BLAS)
Hypothesis 2: Different memory tiling strategy
Hypothesis 3: Work partitioning based on context length
Rapid Prototyping: Implemented all three approaches in parallel and benchmarked them
Result: A hybrid approach was needed, combining custom kernels with different strategies for different context lengths
v6.1: Full implementation with adaptive strategy selection
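A hedged sketch of what adaptive strategy selection might look like. The thresholds and kernel names here are placeholders for illustration, not measured cut-offs from the v6.1 work:

```python
def select_attention_kernel(context_len, n_threads):
    """Pick an attention strategy per workload (illustrative heuristics).

    Short contexts: a single-pass kernel avoids tiling overhead.
    Medium contexts: flash-style tiling keeps the working set in cache.
    Long contexts: additionally split KV blocks across threads.
    """
    if context_len <= 256:
        return "attention_small_ctx"
    if context_len <= 4096:
        return "attention_tiled"
    return "attention_tiled_parallel" if n_threads > 1 else "attention_tiled"
```

The point of the hybrid result is precisely this dispatch step: no single strategy won across all contexts, so the selection logic itself became part of the kernel.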
7. Compounding Knowledge
Each iteration compounds knowledge:
Performance Data
Benchmarks from v1-v5 reveal patterns about what works and what doesn't across different hardware and workloads
Technical Debt
What we discarded in each version becomes institutional knowledge about avoiding similar mistakes
Focus Areas
Each version narrows focus to the most impactful optimizations, avoiding scattered effort
8. Why This Works: The Mathematics of Iteration
Consider the velocity of learning:
| Approach | Time to Breakthrough | Learning Rate | Risk |
|---|---|---|---|
| Perfect v1 Architecture | 6-12 months | Slow (1x) | High (single point of failure) |
| Iterative (v1→v6) | 6 months (6 iterations) | Fast (6x faster learning) | Low (diversified experiments) |
9. Practical Guidelines for Iteration
9.1 When to Iterate vs. When to Persist
Iterate When:
- Fundamental bottleneck in architecture
- Performance plateau persists across 2-3 attempts
- Competitors demonstrate superior approaches
- Hardware support changes (new instructions, etc.)
Persist When:
- Approach is sound but implementation needs tuning
- Edge cases uncovered but core is solid
- Performance gains are marginal but consistent
- Infrastructure improvements unlock future gains
9.2 The 70% Rule
Ship when you're 70% happy with the architecture. The remaining 30% will be discovered through real-world usage and become the foundation for v(next).
Why 70%? It's the sweet spot where you've solved the fundamental problems but haven't invested so much time that you're unwilling to make bold changes.
10. Looking Forward: v7 and Beyond
The v6 plan already incorporates lessons from v1-v5:
- From v1: Keep IR-based approach but simplify code generation
- From v2: Portable but optimize for common cases
- From v3: Type-specific kernels remain critical
- From v4: Memory safety is non-negotiable
- From v5: Work partitioning strategy is the key to multi-threading
- From v6: Learn from competitors, don't be dogmatic
What's Next After v6?
Already identified potential v7 directions:
- Adaptive algorithms: Automatically select best kernel based on workload characteristics
- Hardware introspection: Detect CPU features at runtime, generate optimal code paths
- Distributed inference: Multi-node acceleration based on v6 learnings
Each direction will be prototyped in v6.1/v6.2 before becoming v7.
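The hardware-introspection direction could start as simply as reading CPU feature flags and building a dispatch table. A Linux-specific sketch with placeholder kernel names (the real path would also probe cache sizes, NUMA layout, and so on):

```python
def cpu_flags():
    """Read CPU feature flags from /proc/cpuinfo (Linux x86 only)."""
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # non-Linux or restricted environment
    return set()

def build_dispatch(flags):
    """Map detected features to the widest available kernel variant.
    Kernel names are illustrative placeholders."""
    if "avx512f" in flags:
        return {"matmul": "matmul_avx512", "dot": "dot_avx512"}
    if "avx2" in flags:
        return {"matmul": "matmul_avx2", "dot": "dot_avx2"}
    return {"matmul": "matmul_scalar", "dot": "dot_scalar"}
```

Doing this once at startup keeps the hot path free of feature checks: kernels are selected when the dispatch table is built, not on every call.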
11. The Culture of Iteration
Technical decisions are only part of the equation. Cultural adoption of iteration is equally critical:
Embrace Experimentation
- Celebrate learnings, not just successes
- Document why approaches were discarded
- Make it easy to try radical ideas
- Reward risk-taking within bounds
Build for Change
- Modular architecture enables swapping components
- Comprehensive tests catch regressions early
- Automated benchmarking tracks performance
- Version control preserves knowledge
12. Key Takeaways
The Iteration Mindset
- Speed of learning > Speed of implementation: 6 mediocre versions beat 1 perfect version
- Preserve foundations, discard inefficiencies: Keep what works, remove what doesn't
- Learn from everyone: Study competitors, read papers, analyze benchmarks
- Measure relentlessly: Data beats opinions, especially in performance optimization
- Ship the learning: Every version should advance the state of the art
Related Documentation
- System Architecture Overview - How all v6 components fit together
- Flash Attention Analysis - Case study in learning from competitors
- Testing Infrastructure - How we maintain correctness during rapid iteration
- Profiling Guide - Measuring performance to guide iteration
- Architecture Decision Records - Why we made specific choices