Flash Attention: Performance Comparison

Key Finding
Performance is workload-dependent: C-Kernel and llama.cpp each win 50% of tested cases. C-Kernel tends to win with long contexts (1K-8K tokens) and many heads (32+). llama.cpp tends to win with few heads (4-8) and medium contexts (256-512 tokens).

Executive Summary

This analysis compares two attention implementations: C-Kernel-Engine's standard attention and llama.cpp's ggml flash attention. Each wins 50% of the tested workloads, with clear patterns based on context length and head count.

Performance varies significantly by workload characteristics, and understanding these trade-offs helps guide optimization efforts.

1. Custom CPU Implementation (Not BLAS)

Philosophy: Own the Hot Path

Unlike PyTorch/TensorFlow, which delegate to BLAS libraries such as MKL or Accelerate, llama.cpp writes its own attention kernels, optimized specifically for transformer inference:

// ggml/src/ggml-cpu/ops.cpp
void ggml_compute_forward_flash_attn_ext(...) {
    // Custom CPU implementation, not MKL/Accelerate
    // Uses ggml_vec_dot/ggml_vec_mad directly
}

2. SIMD Vector Operations: ggml_vec_dot/ggml_vec_mad

Vector Dot Product (ggml_vec_dot)

Specialized for attention computation with AVX/AVX-512 intrinsics:

  • AVX-512 path: 16 floats per iteration using _mm512_* intrinsics
  • AVX2 path: 8 floats per iteration using _mm256_* intrinsics
  • Accumulator pattern: Fused multiply-add for Q×Kᵀ computation
  • Horizontal reduction: Efficient sum across vector lanes

// Simplified AVX-512 version (assumes n is a multiple of 16;
// the real kernel handles the remainder separately)
__m512 acc = _mm512_setzero_ps();
for (int i = 0; i < n; i += 16) {
    __m512 va = _mm512_loadu_ps(&a[i]);
    __m512 vb = _mm512_loadu_ps(&b[i]);
    acc = _mm512_fmadd_ps(va, vb, acc); // fused multiply-add
}
float sum = _mm512_reduce_add_ps(acc);  // horizontal reduction across lanes

Vector Multiply-Add (ggml_vec_mad)

For attention score normalization and value multiplication:

  • Fused operations: (a * b) + c in a single instruction
  • In-place updates: Avoids temporary arrays
  • Cache-friendly: Streamed memory access patterns

// Softmax application (assumes n is a multiple of 16)
for (int i = 0; i < n; i += 16) {
    __m512 scores = _mm512_loadu_ps(&score[i]);
    __m512 max_val = _mm512_set1_ps(max_scalar);  // broadcast the running max
    scores = _mm512_sub_ps(scores, max_val);      // subtract max for numerical stability
    scores = _mm512_exp_ps(scores);               // SVML-only intrinsic; ggml supplies its own vectorized expf
    _mm512_storeu_ps(&score[i], scores);
}

3. Threadpool + Chunk Scheduler

Work Partitioning Strategy

ggml uses a sophisticated thread scheduling approach that differs from naive OpenMP:

3.1 Row-Based Distribution

const int nth = params->nth;  // total threads
const int ith = params->ith;  // this thread's index

// Divide rows evenly among threads
const int nr = ggml_nrows(src0);
const int dr = (nr + nth - 1) / nth;  // rows per thread
const int ir0 = dr * ith;              // start row for this thread
const int ir1 = MIN(ir0 + dr, nr);     // end row for this thread

3.2 Chunk-Based Attention

For long sequences, attention is computed in chunks to fit L2/L3 cache:

// Process attention in cache-sized chunks
const int chunk_size = 256;  // Fits in L2 cache
for (int k_start = 0; k_start < T_k; k_start += chunk_size) {
    int k_end = MIN(k_start + chunk_size, T_k);

    // Each thread processes its row range for this chunk
    for (int i = ir0; i < ir1; ++i) {
        // Compute Q[i] × K[k_start:k_end]ᵀ
        compute_chunk(i, k_start, k_end);
    }
}

3.3 Benefits

  • Load balance: Rows are split evenly, so no thread waits long on stragglers
  • Cache locality: Each chunk of K/V is reused while it is still hot in L2
  • Low overhead: The fixed partition avoids per-iteration scheduling decisions

4. Type-Specific Optimizations

4.1 FP16 Path

Half-precision optimization for memory bandwidth:

case GGML_TYPE_F16:
    // Specialized FP16 SIMD path
    // Uses F16C instructions (_mm256_cvtph_ps) to widen 8 halves per step
    for (int i = 0; i < n; i += 8) {
        __m128i packed = _mm_loadu_si128((const __m128i *)&src[i]);  // 8 x fp16 = 128 bits
        __m256  fp32   = _mm256_cvtph_ps(packed);                    // widen to 8 x fp32
        // ... vector operations ...
    }

Benefits:

  • 2x memory bandwidth efficiency
  • Faster cache loading
  • Reduced memory footprint (useful for large models)

4.2 Quantized Types

Optimized paths for Q4_0, Q4_1, Q5_0, Q6_K, Q8_0:

case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
    // Dequantize on-the-fly during computation
    ggml_to_float_t dequant_row = ggml_get_type_traits(type)->to_float;

    for (int i = 0; i < n; i += block_size) {
        // Load one quantized block (packed nibbles plus per-block scale)
        __m128i q = _mm_loadu_si128((const __m128i *)&src[i]);

        // Expand to 8 x FP32 (sketch: dequantize_block stands in for the
        // nibble-unpack + scale sequence in the real kernels)
        __m256 f = dequantize_block(q);

        // Accumulate into the attention score
        acc = _mm256_fmadd_ps(f, weight, acc);
    }

Benefits:

  • No separate dequantization pass
  • Compute during conversion
  • 8x (Q4) or 4x (Q8) memory savings vs. FP32 (before per-block scale overhead)

5. Memory Access Patterns

Cache-Friendly Tiling

ggml's flash attention uses multi-level tiling to maximize cache utilization:

5.1 L1 Cache Tiling (32KB)

// Tile size chosen to fit in L1 cache (32KB)
const int TILE_Q = 32;   // Query tokens per tile
const int TILE_K = 32;   // Key tokens per tile

// Process tiles that fit in L1
for (int qt = 0; qt < T_q; qt += TILE_Q) {
    for (int kt = 0; kt < T_k; kt += TILE_K) {
        // Load Q[qt:qt+TILE_Q] into L1
        // Load K[kt:kt+TILE_K] into L1
        compute_tile(qt, kt);
    }
}

5.2 L2/L3 Cache Tiling (256KB/1MB)

// Larger tiles for L2/L3
const int BLOCK_HEAD = 8;  // Heads per block (fits in L2)
const int BLOCK_DIM = 64;  // Head dimension per block

// Block decomposition
for (int h0 = 0; h0 < H; h0 += BLOCK_HEAD) {
    for (int d0 = 0; d0 < D; d0 += BLOCK_DIM) {
        // Each block fits in L2/L3
        process_block(h0, d0);
    }
}

5.3 Benefits

  • Higher L1 hit rate: Small Q/K tiles stay resident while they are reused
  • Reduced DRAM traffic: Blocking cuts bandwidth demand on long sequences
  • Prefetch-friendly: Regular tile strides suit the hardware prefetcher

6. Comparison: ggml vs. C-Kernel-Engine

| Aspect | ggml (llama.cpp) | C-Kernel-Engine (current) |
|---|---|---|
| Implementation | Custom CPU SIMD in ops.cpp | C with inline SIMD in attention_kernels.c |
| Threading | Threadpool with chunk scheduling, NUMA-aware | OpenMP parallel for (simpler) |
| SIMD width | AVX-512 (16 floats), AVX2 (8 floats) | AVX-512, AVX2, generic fallback |
| Quantization | Specialized fused dequant+compute paths | Dequantize separately, then compute |
| Memory layout | Head-major optimized, multiple variants | Head-major and token-major |
| Caching strategy | Multi-level tiling (L1/L2/L3 aware) | Basic blocking, less aggressive tiling |
| FP16 support | F16C intrinsics, memory-bandwidth optimized | Basic FP16 support |

7. Key Optimizations to Borrow

7.1 Fused Dequant+Compute

Instead of:

// Separate dequantization
for (int i = 0; i < n; i++) {
    float f = dequantize(src[i]);
    acc += f * weight[i];
}

Use:

// Fused path
for (int i = 0; i < n; i += BLOCK) {
    auto q = load_quant_block(src[i]);
    acc = vector_fmadd(q, weight[i], acc);
}

Benefit: Saves memory bandwidth, reduces temp storage

7.2 Chunk-Based Threading

Replace naive OpenMP:

#pragma omp parallel for
for (int i = 0; i < T; i++) {
    compute_attention_row(i);
}

With:

// Each thread gets chunk of rows
int chunk = T / nth;
int start = ith * chunk;
int end = (ith == nth-1) ? T : (ith+1) * chunk;

for (int i = start; i < end; i += CACHE_TILE) {
    // Process in cache-sized tiles
    process_tile(i, CACHE_TILE);
}

Benefit: Better cache locality, reduced false sharing

7.3 NUMA-Aware Memory Allocation

// Allocate on the local NUMA node (libnuma: <numa.h>, link with -lnuma)
float* data = numa_alloc_onnode(
    size * sizeof(float),
    numa_node_of_cpu(sched_getcpu())
);

// Thread binding
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(ith, &cpuset);
pthread_setaffinity_np(pthread_self(),
                       sizeof(cpu_set_t), &cpuset);

Benefit: Eliminates cross-socket memory traffic on multi-socket systems

8. Performance Reality Check: When ggml is Faster (and When It's Not)

Important Nuance
Performance varies significantly by workload! In our tests, C-Kernel's speedup over ggml ranges from 0.66x-0.82x (ggml faster) up to 1.66x-1.91x (C-Kernel faster). Claims that ggml's flash attention is "2-3x faster" hold only under specific conditions.

8.1 Actual Performance Data

| Kernel | llama.cpp (µs) | C-Kernel (µs) | C-Kernel speedup |
|---|---|---|---|
| decode_4h_512 (Tq=1, Tk=512, H=4, D=64) | 80.7 | 107.2 | 0.75x (slower) |
| decode_8h_512 (Tq=1, Tk=512, H=8, D=64) | 92.9 | 141.7 | 0.66x (slower) |
| decode_32h_512 (Tq=1, Tk=512, H=32, D=64) | 1301.1 | 720.7 | 1.81x (faster) |
| decode_4h_1k (Tq=1, Tk=1024, H=4, D=64) | 242.6 | 145.9 | 1.66x (faster) |
| decode_4h_8k (Tq=1, Tk=8192, H=4, D=64) | 3258.6 | 1701.7 | 1.91x (faster) |
| decode_4h_hd128 (Tq=1, Tk=256, H=4, D=128) | 59.1 | 72.5 | 0.82x (slower) |

Speedup is llama.cpp time divided by C-Kernel time, so values above 1.0x mean C-Kernel is faster.

8.2 Why Performance Varies

When C-Kernel is Faster (1.66x-1.91x)

  • Long contexts (Tk >= 1024): OpenMP parallel for scales well with sequence length
  • Many heads (H >= 32): More parallel work per token
  • Medium head dims (D=64): Workload fits well with our approach

Our simpler OpenMP approach works well for these workloads. No thread scheduling overhead when work is substantial.

When llama.cpp is Faster (0.66x-0.82x)

  • Few heads (H <= 8): ggml's threadpool handles this better
  • Medium contexts (Tk <= 512): Custom scheduling wins here
  • Large head dims (D=128): ggml's optimizations for larger dimensions

ggml's custom threadpool + chunking is advantageous for these specific patterns.

8.3 Key Insights

  1. 50/50 Split: C-Kernel and llama.cpp each win 3 out of 6 tested workloads
  2. Context Length Matters: C-Kernel wins with long contexts (1K-8K), llama.cpp wins with medium (256-512)
  3. Head Count Pattern: C-Kernel wins with many heads (32), llama.cpp wins with few heads (4-8)
  4. No Universal Winner: Performance depends on workload characteristics, not implementation quality

8.4 What This Means for C-Kernel-Engine

Where C-Kernel Wins

  • Long contexts (Tk = 1K-8K), where the OpenMP split scales with sequence length
  • Many heads (H >= 32), where there is ample parallel work per token

Where llama.cpp Wins

  • Few heads (H <= 8) combined with medium contexts (Tk <= 512)
  • Large head dimensions (D = 128)

Optimization Strategy

Target the losing configurations first (few heads, medium contexts, D=128), borrowing ggml's threadpool and chunking ideas there, and leave the already-winning long-context and many-head paths alone until profiling shows a need.

9. Performance Breakdown: What Actually Matters

Memory Bandwidth (When It Matters)

  • Long sequences (8K+): Tiling reduces DRAM bandwidth by 2-3x
  • Quantized models: 4x less data = significant win
  • Small sequences: Data fits in cache, bandwidth not the bottleneck

Threading Overhead (When It Hurts)

  • Short sequences: Thread creation/scheduling overhead > compute time
  • Few heads: Not enough parallel work
  • Solution: Use fewer threads or stick with OpenMP for small cases

Cache Utilization (Always Matters)

  • L1 hit rate: Multi-level tiling improves locality
  • Register pressure: Wider SIMD (AVX-512) needs careful register allocation
  • Prefetch: Regular memory patterns help hardware prefetcher

SIMD Efficiency (Architecture-Dependent)

  • AVX-512 CPUs: Can see 2x throughput vs AVX2
  • Older CPUs: Less benefit from SIMD optimizations
  • Fused ops: FMA instructions reduce latency

10. Implementation Roadmap for C-Kernel-Engine

Balanced Approach
Each implementation wins 50% of workloads. Optimize based on your target use case rather than chasing universal performance.

Phase 1: Address Weaknesses (Priority: Medium)

  1. Few Heads (4-8): Investigate why llama.cpp wins here
  2. Medium Contexts (256-512): Add specialized handling
  3. Large Head Dims (128): Optimize for this configuration

Phase 2: Learn from ggml (Priority: Low)

  1. Study their threadpool strategy for few-head cases
  2. Implement chunk-based distribution for medium contexts
  3. Test fused operations on quantized types

Phase 3: Strengthen Wins (Priority: Very Low)

  1. Further optimize for long contexts (1K-8K)
  2. Enhance many-head (32+) parallelization
  3. NUMA-aware allocation for server deployments

11. Conclusion

This analysis reveals that neither implementation universally wins. Key takeaways:

  1. 50/50 Split: C-Kernel and llama.cpp each win 3 out of 6 tested workloads
  2. Context Length Pattern: C-Kernel wins with long contexts (1K-8K), llama.cpp with medium (256-512)
  3. Head Count Matters: C-Kernel wins with many heads (32), llama.cpp with few heads (4-8)
  4. Workload-Specific Optimization: Choose approach based on your specific use case
  5. Learn from Each Other: Both implementations have strengths to learn from

Bottom line: Performance is workload-dependent; neither implementation is universally better. Understand your context length and head count to choose the right approach; each is optimized for different scenarios.
