Flash Attention: Performance Comparison
Performance is workload-dependent: C-Kernel and llama.cpp each win 50% of tested cases. C-Kernel tends to win with long contexts (1K-8K tokens) and many heads (32+). llama.cpp tends to win with few heads (4-8) and medium contexts (256-512 tokens).
Executive Summary
This analysis compares attention implementations: C-Kernel-Engine's standard approach vs llama.cpp's ggml flash attention. Each implementation wins 50% of tested workloads, with clear patterns based on context length and head count.
Performance varies significantly by workload characteristics, and understanding these trade-offs helps guide optimization efforts.
1. Custom CPU Implementation (Not BLAS)
Philosophy: Own the Hot Path
Unlike PyTorch/TensorFlow which delegate to MKL/Accelerate, llama.cpp writes attention kernels specifically optimized for transformer inference:
- No BLAS dependency: Direct implementation in `ggml_compute_forward_flash_attn_ext_*`
- Inference-first design: Optimized for the specific attention patterns in decoder-only models
- Memory-layout aware: Designed for the head-major and token-major layouts common in LLM inference
```cpp
// ggml/src/ggml-cpu/ops.cpp
void ggml_compute_forward_flash_attn_ext(...) {
    // Custom CPU implementation, not MKL/Accelerate
    // Uses ggml_vec_dot/ggml_vec_mad directly
}
```
2. SIMD Vector Operations: ggml_vec_dot/ggml_vec_mad
Vector Dot Product (ggml_vec_dot)
Specialized for attention computation with AVX/AVX-512 intrinsics:
- AVX-512 path: 16 floats per iteration using `_mm512_*` intrinsics
- AVX2 path: 8 floats per iteration using `_mm256_*` intrinsics
- Accumulator pattern: Fused multiply-add for Q×Kᵀ computation
- Horizontal reduction: Efficient sum across vector lanes
```c
// Simplified AVX-512 version
__m512 acc = _mm512_setzero_ps();
for (int i = 0; i < n; i += 16) {
    __m512 va = _mm512_loadu_ps(&a[i]);
    __m512 vb = _mm512_loadu_ps(&b[i]);
    acc = _mm512_fmadd_ps(va, vb, acc); // fused multiply-add
}
float sum = _mm512_reduce_add_ps(acc); // horizontal reduction across lanes
```
Vector Multiply-Add (ggml_vec_mad)
For attention score normalization and value multiplication:
- Fused operations: (a * b) + c in single instruction
- In-place updates: Avoids temporary arrays
- Cache-friendly: Streamed memory access patterns
```c
// Softmax application (simplified)
const __m512 max_val = _mm512_set1_ps(max_scalar); // broadcast the row max once
for (int i = 0; i < n; i += 16) {
    __m512 scores = _mm512_loadu_ps(&score[i]);
    scores = _mm512_sub_ps(scores, max_val); // subtract max for numerical stability
    scores = _mm512_exp_ps(scores);          // SVML intrinsic; plain compilers need a vectorized expf approximation
    _mm512_storeu_ps(&score[i], scores);
}
```
3. Threadpool + Chunk Scheduler
Work Partitioning Strategy
ggml uses a sophisticated thread scheduling approach that differs from naive OpenMP:
3.1 Row-Based Distribution
```c
const int nth = params->nth;          // total threads
const int ith = params->ith;          // this thread's index

// Divide rows evenly among threads
const int nr  = ggml_nrows(src0);
const int dr  = (nr + nth - 1) / nth; // rows per thread (rounded up)
const int ir0 = dr * ith;             // start row for this thread
const int ir1 = MIN(ir0 + dr, nr);    // end row for this thread
```
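The ceil-divide split can be checked in isolation. This small standalone sketch (function and macro names are ours, not ggml's) shows that the `MIN` clamp on the last range makes every row land in exactly one thread's range, even when the row count is not a multiple of the thread count:

```c
/* Minimal sketch of ggml-style row partitioning: ceil-divide nr rows
   over nth threads; the clamp guarantees full, non-overlapping coverage. */
#define MY_MIN(a, b) ((a) < (b) ? (a) : (b))

static void row_range(int nr, int nth, int ith, int *ir0, int *ir1) {
    const int dr = (nr + nth - 1) / nth; /* rows per thread, rounded up */
    *ir0 = dr * ith;                     /* first row for thread ith   */
    *ir1 = MY_MIN(*ir0 + dr, nr);        /* clamp last range to nr     */
}
```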
3.2 Chunk-Based Attention
For long sequences, attention is computed in chunks to fit L2/L3 cache:
```c
// Process attention in cache-sized chunks
const int chunk_size = 256; // keys per chunk, sized to fit in L2 cache
for (int k_start = 0; k_start < T_k; k_start += chunk_size) {
    int k_end = MIN(k_start + chunk_size, T_k);
    // Each thread processes its row range for this chunk
    for (int i = ir0; i < ir1; ++i) {
        // Compute Q[i] × K[k_start:k_end]ᵀ
        compute_chunk(i, k_start, k_end);
    }
}
```
3.3 Benefits
- Load balancing: Even distribution of work across cores
- Cache locality: Chunks fit in L2/L3, reducing memory bandwidth
- NUMA-aware: Thread binding to specific cores reduces cross-socket traffic
4. Type-Specific Optimizations
4.1 FP16 Path
Half-precision optimization for memory bandwidth:
```c
case GGML_TYPE_F16:
    // Specialized FP16 SIMD path
    // Uses F16C instructions (_mm256_cvtph_ps)
    for (int i = 0; i < n; i += 8) {
        __m128i packed = _mm_loadu_si128((const __m128i *)&src[i]); // 8 FP16 values = 128 bits
        __m256  fp32   = _mm256_cvtph_ps(packed);                   // widen to 8 FP32 lanes
        // ... vector operations ...
    }
```
Benefits:
- 2x memory bandwidth efficiency
- Faster cache loading
- Reduced memory footprint (useful for large models)
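To make the conversion concrete, here is an illustrative software decoder for one IEEE-754 half-precision value; the F16C `_mm256_cvtph_ps` instruction above performs the same conversion in hardware, eight values at a time. Storing K/V in FP16 halves the bytes moved per attention step:

```c
#include <math.h>
#include <stdint.h>

/* Illustrative IEEE-754 binary16 -> binary32 decoder (1 sign bit,
   5 exponent bits, 10 mantissa bits). Hardware F16C does this faster. */
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) & 1u;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1Fu;
    uint32_t mant = (uint32_t)h & 0x3FFu;
    float val;
    if (exp == 0) {                 /* zero or subnormal */
        val = ldexpf((float)mant, -24);
    } else if (exp == 31) {         /* infinity or NaN */
        val = mant ? NAN : INFINITY;
    } else {                        /* normal: add the implicit leading 1 */
        val = ldexpf((float)(mant | 0x400u), (int)exp - 25);
    }
    return sign ? -val : val;
}
```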
4.2 Quantized Types
Optimized paths for Q4_0, Q4_1, Q5_0, Q6_K, Q8_0:
```c
case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
    // Dequantize on-the-fly during computation (sketch)
    for (int i = 0; i < n; i += block_size) {
        // Load one quantized block (scale + packed 4-bit values)
        __m128i q = _mm_loadu_si128((const __m128i *)&src[i]);
        // Unpack to FP32 lanes and apply the block scale
        __m256 f = dequantize_block(q);
        // Accumulate directly into the attention score
        acc = _mm256_fmadd_ps(f, weight, acc);
    }
```
Benefits:
- No separate dequantization pass
- Compute during conversion
- 4x (Q4) or 8x (Q8) memory savings
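The fused idea can be shown without intrinsics. This scalar sketch uses a simplified Q8_0-style block (one `float` scale plus 32 int8 values; real ggml stores the scale as FP16 and vectorizes the loop) and folds dequantization into the dot product, so no FP32 staging buffer is ever written:

```c
#include <stdint.h>

/* Simplified Q8_0-style block: 32 int8 values sharing one scale.
   Dequantization is folded into the dot product loop. */
#define QBLOCK 32

typedef struct {
    float  scale;     /* per-block scale (FP16 in real ggml) */
    int8_t q[QBLOCK]; /* quantized values */
} q8_block;

static float q8_dot(const q8_block *blk, const float *w) {
    float acc = 0.0f;
    for (int i = 0; i < QBLOCK; ++i)
        acc += (float)blk->q[i] * w[i]; /* dequant folded into the loop */
    return blk->scale * acc;            /* apply the scale once per block */
}
```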
5. Memory Access Patterns
Cache-Friendly Tiling
ggml's flash attention uses multi-level tiling to maximize cache utilization:
5.1 L1 Cache Tiling (32KB)
```c
// Tile sizes chosen so the working set fits in L1 cache (32KB)
const int TILE_Q = 32; // Query tokens per tile
const int TILE_K = 32; // Key tokens per tile

// Process tiles that fit in L1
for (int qt = 0; qt < T_q; qt += TILE_Q) {
    for (int kt = 0; kt < T_k; kt += TILE_K) {
        // Load Q[qt:qt+TILE_Q] and K[kt:kt+TILE_K] into L1
        compute_tile(qt, kt);
    }
}
```
5.2 L2/L3 Cache Tiling (256KB/1MB)
```c
// Larger tiles for L2/L3
const int BLOCK_HEAD = 8;  // Heads per block (fits in L2)
const int BLOCK_DIM  = 64; // Head dimension per block

// Block decomposition
for (int h0 = 0; h0 < H; h0 += BLOCK_HEAD) {
    for (int d0 = 0; d0 < D; d0 += BLOCK_DIM) {
        // Each block fits in L2/L3
        process_block(h0, d0);
    }
}
```
5.3 Benefits
- L1 hit rate > 90%: Frequent data stays in L1
- Reduced memory bandwidth: Each value loaded once, used multiple times
- NUMA locality: Data stays close to computing core
6. Comparison: ggml vs. C-Kernel-Engine
| Aspect | ggml (llama.cpp) | C-Kernel-Engine (Current) |
|---|---|---|
| Implementation | Custom CPU SIMD in `ops.cpp` | C with inline SIMD in `attention_kernels.c` |
| Threading | Threadpool with chunk scheduling, NUMA-aware | OpenMP parallel for (simpler) |
| SIMD Width | AVX-512 (16 floats), AVX2 (8 floats) | AVX-512, AVX2, generic fallback |
| Quantization | Specialized fused dequant+compute paths | Dequantize separately, then compute |
| Memory Layout | Head-major optimized, multiple variants | Head-major and token-major |
| Caching Strategy | Multi-level tiling (L1/L2/L3 aware) | Basic blocking, less aggressive tiling |
| FP16 Support | F16C intrinsics, memory-bandwidth optimized | Basic FP16 support |
7. Key Optimizations to Borrow
7.1 Fused Dequant+Compute
Instead of:
```c
// Separate dequantization
for (int i = 0; i < n; i++) {
    float f = dequantize(src[i]);
    acc += f * weight[i];
}
```
Use:
```c
// Fused path: dequantize a whole block and accumulate in one step
for (int i = 0; i < n; i += BLOCK) {
    vec_t q = load_quant_block(&src[i]);
    acc = vector_fmadd(q, &weight[i], acc);
}
```
Benefit: Saves memory bandwidth, reduces temp storage
7.2 Chunk-Based Threading
Replace naive OpenMP:
```c
#pragma omp parallel for
for (int i = 0; i < T; i++) {
    compute_attention_row(i);
}
```
With:
```c
// Each thread gets a contiguous chunk of rows
int chunk = T / nth;
int start = ith * chunk;
int end   = (ith == nth - 1) ? T : (ith + 1) * chunk;
for (int i = start; i < end; i += CACHE_TILE) {
    // Process in cache-sized tiles (clamp the final, partial tile)
    process_tile(i, MIN(CACHE_TILE, end - i));
}
```
Benefit: Better cache locality, reduced false sharing
7.3 NUMA-Aware Memory Allocation
```c
// Requires libnuma (<numa.h>) plus <sched.h> and <pthread.h>
// Allocate on the local NUMA node
float *data = numa_alloc_onnode(size * sizeof(float),
                                numa_node_of_cpu(sched_getcpu()));

// Bind this thread to a specific core
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(ith, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
```
Benefit: Eliminates cross-socket memory traffic on multi-socket systems
8. Performance Reality Check: When ggml is Faster (and When It's Not)
Performance varies significantly by workload. Our tests show C-Kernel is sometimes slower than ggml (0.66x-0.82x) and sometimes faster (1.66x-1.91x). Any blanket "2-3x faster" claim holds only under specific conditions.
8.1 Actual Performance Data
| Kernel | llama.cpp (µs) | C-Kernel (µs) | Speedup (C-Kernel vs llama.cpp) |
|---|---|---|---|
| decode_4h_512 (Tq=1, Tk=512, H=4, D=64) | 80.7 | 107.2 | 0.75x (slower) |
| decode_8h_512 (Tq=1, Tk=512, H=8, D=64) | 92.9 | 141.7 | 0.66x (slower) |
| decode_32h_512 (Tq=1, Tk=512, H=32, D=64) | 1301.1 | 720.7 | 1.81x (faster) |
| decode_4h_1k (Tq=1, Tk=1024, H=4, D=64) | 242.6 | 145.9 | 1.66x (faster) |
| decode_4h_8k (Tq=1, Tk=8192, H=4, D=64) | 3258.6 | 1701.7 | 1.91x (faster) |
| decode_4h_hd128 (Tq=1, Tk=256, H=4, D=128) | 59.1 | 72.5 | 0.82x (slower) |
8.2 Why Performance Varies
When C-Kernel is Faster (1.66x-1.91x)
- Long contexts (Tk >= 1024): OpenMP parallel for scales well with sequence length
- Many heads (H >= 32): More parallel work per token
- Medium head dims (D=64): Workload fits well with our approach
Our simpler OpenMP approach works well for these workloads. No thread scheduling overhead when work is substantial.
When llama.cpp is Faster (0.66x-0.82x)
- Few heads (H <= 8): ggml's threadpool handles this better
- Medium contexts (Tk <= 512): Custom scheduling wins here
- Large head dims (D=128): ggml's optimizations for larger dimensions
ggml's custom threadpool + chunking is advantageous for these specific patterns.
8.3 Key Insights
- 50/50 Split: C-Kernel and llama.cpp each win 3 out of 6 tested workloads
- Context Length Matters: C-Kernel wins with long contexts (1K-8K), llama.cpp wins with medium (256-512)
- Head Count Pattern: C-Kernel wins with many heads (32), llama.cpp wins with few heads (4-8)
- No Universal Winner: Performance depends on workload characteristics, not implementation quality
8.4 What This Means for C-Kernel-Engine
Where C-Kernel Wins
- Long Contexts: 1K-8K tokens benefit from our OpenMP approach
- Many Heads: 32+ heads show good parallelization
- Simplicity: OpenMP parallel for is easier to maintain
- No Thread Overhead: Works well for appropriate workloads
Where llama.cpp Wins
- Few Heads: 4-8 heads benefit from their threading strategy
- Medium Contexts: 256-512 tokens show ggml's strength
- Large Head Dims: 128-dim heads outperform our approach
Optimization Strategy
- Understand Your Workload: Choose approach based on context length and head count
- Target Weaknesses: Optimize for few-head, medium-context cases
- Learn from ggml: Their threading strategy works well for specific patterns
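One way to act on these patterns is a runtime dispatch between the two implementations. The thresholds below are hypothetical, derived only from the six workloads in section 8.1, and would need re-tuning on other hardware:

```c
#include <stdbool.h>

/* Hypothetical dispatch heuristic from the measured results in 8.1:
   C-Kernel won at long contexts (Tk >= 1024) and many heads (H >= 32);
   ggml won at few heads with short-to-medium contexts. */
static bool prefer_c_kernel(int context_len, int n_heads) {
    if (context_len >= 1024) return true; /* long contexts: C-Kernel won */
    if (n_heads >= 32)       return true; /* many heads: C-Kernel won   */
    return false;                         /* otherwise: ggml won         */
}
```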
9. Performance Breakdown: What Actually Matters
Memory Bandwidth (When It Matters)
- Long sequences (8K+): Tiling reduces DRAM bandwidth by 2-3x
- Quantized models: 4x less data = significant win
- Small sequences: Data fits in cache, bandwidth not the bottleneck
Threading Overhead (When It Hurts)
- Short sequences: Thread creation/scheduling overhead > compute time
- Few heads: Not enough parallel work
- Solution: Use fewer threads or stick with OpenMP for small cases
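A cheap mitigation is OpenMP's `if` clause, which keeps the loop serial below a work threshold so short sequences never pay thread startup cost. The threshold value here is an assumption to tune per machine; the loop body is a stand-in for real per-row attention work:

```c
/* Serial below PAR_THRESHOLD rows, parallel above it. Without OpenMP
   the pragma is ignored and the loop simply runs serially. */
#define PAR_THRESHOLD 256

static void scale_rows(float *x, int T, float s) {
    #pragma omp parallel for if(T >= PAR_THRESHOLD)
    for (int i = 0; i < T; ++i)
        x[i] *= s; /* stand-in for compute_attention_row(i) */
}
```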
Cache Utilization (Always Matters)
- L1 hit rate: Multi-level tiling improves locality
- Register pressure: Wider SIMD (AVX-512) needs careful register allocation
- Prefetch: Regular memory patterns help hardware prefetcher
SIMD Efficiency (Architecture-Dependent)
- AVX-512 CPUs: Can see 2x throughput vs AVX2
- Older CPUs: Less benefit from SIMD optimizations
- Fused ops: FMA instructions reduce latency
10. Implementation Roadmap for C-Kernel-Engine
Each implementation wins 50% of workloads. Optimize based on your target use case rather than chasing universal performance.
Phase 1: Address Weaknesses (Priority: Medium)
- Few Heads (4-8): Investigate why llama.cpp wins here
- Medium Contexts (256-512): Add specialized handling
- Large Head Dims (128): Optimize for this configuration
Phase 2: Learn from ggml (Priority: Low)
- Study their threadpool strategy for few-head cases
- Implement chunk-based distribution for medium contexts
- Test fused operations on quantized types
Phase 3: Strengthen Wins (Priority: Very Low)
- Further optimize for long contexts (1K-8K)
- Enhance many-head (32+) parallelization
- NUMA-aware allocation for server deployments
11. Conclusion
This analysis reveals that neither implementation universally wins. Key takeaways:
- 50/50 Split: C-Kernel and llama.cpp each win 3 out of 6 tested workloads
- Context Length Pattern: C-Kernel wins with long contexts (1K-8K), llama.cpp with medium (256-512)
- Head Count Matters: C-Kernel wins with many heads (32), llama.cpp with few heads (4-8)
- Workload-Specific Optimization: Choose approach based on your specific use case
- Learn from Each Other: Both implementations have strengths to learn from
Bottom line: Performance is workload-dependent, not implementation-dependent. Understand your context length and head count to choose the right approach. Neither is "better"; each is optimized for different scenarios.
References
- llama.cpp: ggml-cpu/ops.cpp - Flash attention implementation
- ggml: vec.h - SIMD vector operations
- ggml: CPU optimizations - Complete CPU kernel suite