Flash Attention: Performance Comparison

Key Finding
Performance is workload-dependent: C-Kernel and llama.cpp each win 50% of tested cases. C-Kernel tends to win with long contexts (1K-8K tokens) and many heads (32+). llama.cpp tends to win with few heads (4-8) and medium contexts (256-512 tokens).

Executive Summary

This analysis compares two attention implementations: C-Kernel-Engine's standard attention and llama.cpp's ggml flash attention. Each wins 50% of the tested workloads, with clear patterns based on context length and head count.

Performance varies significantly by workload characteristics, and understanding these trade-offs helps guide optimization efforts.

1. Custom CPU Implementation (Not BLAS)

Philosophy: Own the Hot Path

Unlike PyTorch/TensorFlow, which delegate to BLAS libraries such as MKL or Accelerate, llama.cpp writes its own attention kernels, optimized specifically for transformer inference:

// ggml/src/ggml-cpu/ops.cpp
void ggml_compute_forward_flash_attn_ext(...) {
    // Custom CPU implementation, not MKL/Accelerate
    // Uses ggml_vec_dot/ggml_vec_mad directly
}

2. SIMD Vector Operations: ggml_vec_dot/ggml_vec_mad

Vector Dot Product (ggml_vec_dot)

Specialized for attention computation with AVX/AVX-512 intrinsics:

  • AVX-512 path: 16 floats per iteration using _mm512_* intrinsics
  • AVX2 path: 8 floats per iteration using _mm256_* intrinsics
  • Accumulator pattern: Fused multiply-add for Q×Kᵀ computation
  • Horizontal reduction: Efficient sum across vector lanes

// Simplified AVX-512 version (assumes n is a multiple of 16;
// the real kernel handles the remainder separately)
__m512 acc = _mm512_setzero_ps();
for (int i = 0; i < n; i += 16) {
    __m512 va = _mm512_loadu_ps(&a[i]);
    __m512 vb = _mm512_loadu_ps(&b[i]);
    acc = _mm512_fmadd_ps(va, vb, acc); // fused multiply-add
}
float sum = _mm512_reduce_add_ps(acc);  // horizontal reduction across lanes

Vector Multiply-Add (ggml_vec_mad)

For attention score normalization and value multiplication:

  • Fused operations: (a * b) + c in a single instruction
  • In-place updates: Avoids temporary arrays
  • Cache-friendly: Streamed memory access patterns

// Softmax application (assumes n is a multiple of 16)
for (int i = 0; i < n; i += 16) {
    __m512 scores = _mm512_loadu_ps(&score[i]);
    __m512 max_val = _mm512_set1_ps(max_scalar);  // broadcast the running max
    scores = _mm512_sub_ps(scores, max_val);      // subtract max for numerical stability
    scores = _mm512_exp_ps(scores);               // SVML-only intrinsic; ggml supplies its own vectorized expf
    _mm512_storeu_ps(&score[i], scores);
}

3. Threadpool + Chunk Scheduler

Work Partitioning Strategy

ggml uses a sophisticated thread scheduling approach that differs from naive OpenMP:

3.1 Row-Based Distribution

const int nth = params->nth;  // total threads
const int ith = params->ith;  // this thread's index

// Divide rows evenly among threads
const int nr = ggml_nrows(src0);
const int dr = (nr + nth - 1) / nth;  // rows per thread
const int ir0 = dr * ith;              // start row for this thread
const int ir1 = MIN(ir0 + dr, nr);     // end row for this thread

3.2 Chunk-Based Attention

For long sequences, attention is computed in chunks to fit L2/L3 cache:

// Process attention in cache-sized chunks
const int chunk_size = 256;  // Fits in L2 cache
for (int k_start = 0; k_start < T_k; k_start += chunk_size) {
    int k_end = MIN(k_start + chunk_size, T_k);

    // Each thread processes its row range for this chunk
    for (int i = ir0; i < ir1; ++i) {
        // Compute Q[i] × K[k_start:k_end]ᵀ
        compute_chunk(i, k_start, k_end);
    }
}

3.3 Benefits

  • Load balance: Rows are split evenly, so no thread waits long on stragglers
  • Cache locality: Each chunk of K/V is reused while it is still hot in L2
  • Low overhead: The fixed partition avoids per-iteration scheduling decisions

4. Type-Specific Optimizations

4.1 FP16 Path

Half-precision optimization for memory bandwidth:

case GGML_TYPE_F16:
    // Specialized FP16 SIMD path
    // Uses F16C instructions (_mm256_cvtph_ps) to widen 8 halves per step
    for (int i = 0; i < n; i += 8) {
        __m128i packed = _mm_loadu_si128((const __m128i *)&src[i]);  // 8 x fp16 = 128 bits
        __m256  fp32   = _mm256_cvtph_ps(packed);                    // widen to 8 x fp32
        // ... vector operations ...
    }

Benefits:

  • 2x memory bandwidth efficiency
  • Faster cache loading
  • Reduced memory footprint (useful for large models)

4.2 Quantized Types

Optimized paths for Q4_0, Q4_1, Q5_0, Q6_K, Q8_0:

case GGML_TYPE_Q4_0:
case GGML_TYPE_Q4_1:
    // Dequantize on-the-fly during computation
    ggml_to_float_t dequant_row = ggml_get_type_traits(type)->to_float;

    for (int i = 0; i < n; i += block_size) {
        // Load one quantized block (packed nibbles plus per-block scale)
        __m128i q = _mm_loadu_si128((const __m128i *)&src[i]);

        // Expand to 8 x FP32 (sketch: dequantize_block stands in for the
        // nibble-unpack + scale sequence in the real kernels)
        __m256 f = dequantize_block(q);

        // Accumulate into the attention score
        acc = _mm256_fmadd_ps(f, weight, acc);
    }

Benefits:

  • No separate dequantization pass
  • Compute during conversion
  • 8x (Q4) or 4x (Q8) memory savings vs. FP32 (before per-block scale overhead)

5. Memory Access Patterns

Cache-Friendly Tiling

ggml's flash attention uses multi-level tiling to maximize cache utilization:

5.1 L1 Cache Tiling (32KB)

// Tile size chosen to fit in L1 cache (32KB)
const int TILE_Q = 32;   // Query tokens per tile
const int TILE_K = 32;   // Key tokens per tile

// Process tiles that fit in L1
for (int qt = 0; qt < T_q; qt += TILE_Q) {
    for (int kt = 0; kt < T_k; kt += TILE_K) {
        // Load Q[qt:qt+TILE_Q] into L1
        // Load K[kt:kt+TILE_K] into L1
        compute_tile(qt, kt);
    }
}

5.2 L2/L3 Cache Tiling (256KB/1MB)

// Larger tiles for L2/L3
const int BLOCK_HEAD = 8;  // Heads per block (fits in L2)
const int BLOCK_DIM = 64;  // Head dimension per block

// Block decomposition
for (int h0 = 0; h0 < H; h0 += BLOCK_HEAD) {
    for (int d0 = 0; d0 < D; d0 += BLOCK_DIM) {
        // Each block fits in L2/L3
        process_block(h0, d0);
    }
}

5.3 Benefits

  • Higher L1 hit rate: Small Q/K tiles stay resident while they are reused
  • Reduced DRAM traffic: Blocking cuts bandwidth demand on long sequences
  • Prefetch-friendly: Regular tile strides suit the hardware prefetcher

6. Comparison: ggml vs. C-Kernel-Engine

| Aspect | ggml (llama.cpp) | C-Kernel-Engine (current) |
|---|---|---|
| Implementation | Custom CPU SIMD in ops.cpp | C with inline SIMD in attention_kernels.c |
| Threading | Threadpool with chunk scheduling, NUMA-aware | OpenMP parallel for (simpler) |
| SIMD width | AVX-512 (16 floats), AVX2 (8 floats) | AVX-512, AVX2, generic fallback |
| Quantization | Specialized fused dequant+compute paths | Dequantize separately, then compute |
| Memory layout | Head-major optimized, multiple variants | Head-major and token-major |
| Caching strategy | Multi-level tiling (L1/L2/L3 aware) | Basic blocking, less aggressive tiling |
| FP16 support | F16C intrinsics, memory-bandwidth optimized | Basic FP16 support |

7. Key Optimizations to Borrow

7.1 Fused Dequant+Compute

Instead of:

// Separate dequantization
for (int i = 0; i < n; i++) {
    float f = dequantize(src[i]);
    acc += f * weight[i];
}

Use:

// Fused path
for (int i = 0; i < n; i += BLOCK) {
    auto q = load_quant_block(src[i]);
    acc = vector_fmadd(q, weight[i], acc);
}

Benefit: Saves memory bandwidth, reduces temp storage

7.2 Chunk-Based Threading

Replace naive OpenMP:

#pragma omp parallel for
for (int i = 0; i < T; i++) {
    compute_attention_row(i);
}

With:

// Each thread gets chunk of rows
int chunk = T / nth;
int start = ith * chunk;
int end = (ith == nth-1) ? T : (ith+1) * chunk;

for (int i = start; i < end; i += CACHE_TILE) {
    // Process in cache-sized tiles
    process_tile(i, CACHE_TILE);
}

Benefit: Better cache locality, reduced false sharing

7.3 NUMA-Aware Memory Allocation

// Allocate on the local NUMA node (libnuma: <numa.h>, link with -lnuma)
float* data = numa_alloc_onnode(
    size * sizeof(float),
    numa_node_of_cpu(sched_getcpu())
);

// Thread binding
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(ith, &cpuset);
pthread_setaffinity_np(pthread_self(),
                       sizeof(cpu_set_t), &cpuset);

Benefit: Eliminates cross-socket memory traffic on multi-socket systems

8. Performance Reality Check: When ggml is Faster (and When It's Not)

Important Nuance
Performance varies significantly by workload! In our tests, C-Kernel's speedup over ggml ranges from 0.66x-0.82x (ggml faster) up to 1.66x-1.91x (C-Kernel faster). Claims that ggml's flash attention is "2-3x faster" hold only under specific conditions.

8.1 Actual Performance Data

| Kernel | llama.cpp (µs) | C-Kernel (µs) | C-Kernel speedup |
|---|---|---|---|
| decode_4h_512 (Tq=1, Tk=512, H=4, D=64) | 80.7 | 107.2 | 0.75x (slower) |
| decode_8h_512 (Tq=1, Tk=512, H=8, D=64) | 92.9 | 141.7 | 0.66x (slower) |
| decode_32h_512 (Tq=1, Tk=512, H=32, D=64) | 1301.1 | 720.7 | 1.81x (faster) |
| decode_4h_1k (Tq=1, Tk=1024, H=4, D=64) | 242.6 | 145.9 | 1.66x (faster) |
| decode_4h_8k (Tq=1, Tk=8192, H=4, D=64) | 3258.6 | 1701.7 | 1.91x (faster) |
| decode_4h_hd128 (Tq=1, Tk=256, H=4, D=128) | 59.1 | 72.5 | 0.82x (slower) |

Speedup is llama.cpp time divided by C-Kernel time, so values above 1.0x mean C-Kernel is faster.

8.2 Why Performance Varies

When C-Kernel is Faster (1.66x-1.91x)

  • Long contexts (Tk >= 1024): OpenMP parallel for scales well with sequence length
  • Many heads (H >= 32): More parallel work per token
  • Medium head dims (D=64): Workload fits well with our approach

Our simpler OpenMP approach works well for these workloads. No thread scheduling overhead when work is substantial.

When llama.cpp is Faster (0.66x-0.82x)

  • Few heads (H <= 8): ggml's threadpool handles this better
  • Medium contexts (Tk <= 512): Custom scheduling wins here
  • Large head dims (D=128): ggml's optimizations for larger dimensions

ggml's custom threadpool + chunking is advantageous for these specific patterns.

8.3 Key Insights

  1. 50/50 Split: C-Kernel and llama.cpp each win 3 out of 6 tested workloads
  2. Context Length Matters: C-Kernel wins with long contexts (1K-8K), llama.cpp wins with medium (256-512)
  3. Head Count Pattern: C-Kernel wins with many heads (32), llama.cpp wins with few heads (4-8)
  4. No Universal Winner: Performance depends on workload characteristics, not implementation quality

8.4 What This Means for C-Kernel-Engine

Where C-Kernel Wins

  • Long contexts (Tk = 1K-8K), where the OpenMP split scales with sequence length
  • Many heads (H >= 32), where there is ample parallel work per token

Where llama.cpp Wins

  • Few heads (H <= 8) combined with medium contexts (Tk <= 512)
  • Large head dimensions (D = 128)

Optimization Strategy

Target the losing configurations first (few heads, medium contexts, D=128), borrowing ggml's threadpool and chunking ideas there, and leave the already-winning long-context and many-head paths alone until profiling shows a need.

9. Performance Breakdown: What Actually Matters

Memory Bandwidth (When It Matters)

  • Long sequences (8K+): Tiling reduces DRAM bandwidth by 2-3x
  • Quantized models: 4x less data = significant win
  • Small sequences: Data fits in cache, bandwidth not the bottleneck

Threading Overhead (When It Hurts)

  • Short sequences: Thread creation/scheduling overhead > compute time
  • Few heads: Not enough parallel work
  • Solution: Use fewer threads or stick with OpenMP for small cases

Cache Utilization (Always Matters)

  • L1 hit rate: Multi-level tiling improves locality
  • Register pressure: Wider SIMD (AVX-512) needs careful register allocation
  • Prefetch: Regular memory patterns help hardware prefetcher

SIMD Efficiency (Architecture-Dependent)

  • AVX-512 CPUs: Can see 2x throughput vs AVX2
  • Older CPUs: Less benefit from SIMD optimizations
  • Fused ops: FMA instructions reduce latency

10. Implementation Roadmap for C-Kernel-Engine

Balanced Approach
Each implementation wins 50% of workloads. Optimize based on your target use case rather than chasing universal performance.

Phase 1: Address Weaknesses (Priority: Medium)

  1. Few Heads (4-8): Investigate why llama.cpp wins here
  2. Medium Contexts (256-512): Add specialized handling
  3. Large Head Dims (128): Optimize for this configuration

Phase 2: Learn from ggml (Priority: Low)

  1. Study their threadpool strategy for few-head cases
  2. Implement chunk-based distribution for medium contexts
  3. Test fused operations on quantized types

Phase 3: Strengthen Wins (Priority: Very Low)

  1. Further optimize for long contexts (1K-8K)
  2. Enhance many-head (32+) parallelization
  3. NUMA-aware allocation for server deployments

11. Conclusion

This analysis reveals that neither implementation universally wins. Key takeaways:

  1. 50/50 Split: C-Kernel and llama.cpp each win 3 out of 6 tested workloads
  2. Context Length Pattern: C-Kernel wins with long contexts (1K-8K), llama.cpp with medium (256-512)
  3. Head Count Matters: C-Kernel wins with many heads (32), llama.cpp with few heads (4-8)
  4. Workload-Specific Optimization: Choose approach based on your specific use case
  5. Learn from Each Other: Both implementations have strengths to learn from

Bottom line: Performance is workload-dependent; neither implementation is universally better. Understand your context length and head count to choose the right approach; each is optimized for different scenarios.
