Profiling Guide

Measure, Don't Guess
This page covers memory and compute profiling for C-Kernel-Engine. Always profile before optimizing.

Working on v7 Training Performance?

Use the dedicated v7-profiling.html runbook for VTune (hotspots/memory-access/uarch-exploration), Advisor roofline, perf/flamegraph capture, and run-dir artifact flow.

Intel VTune / Advisor host-driver warning: VTune and Advisor are useful for CK-vs-llama.cpp hotspot, memory, uarch, and roofline work, but some collection modes load Intel out-of-tree kernel drivers such as sep5, vtsspp, socwatch2_16, and pax. On the local ThinkPad i7 / Linux 6.14 profiling host, a deeper collection triggered repeated "BUG: scheduling while atomic" traces in sep5, followed by unrelated runc/nmcli core dumps and an unclean reboot. That signature points to the profiler driver path, not CK user-space math. Prefer perf stat, perf record, flamegraphs, and CK-vs-llama timing gates for routine iteration. Run VTune/Advisor deep collections only on a disposable profiling boot or a server host, or after accepting that the whole machine may need a reboot.

Safe v8 Prefill Counter Lane

For CK-vs-llama.cpp prefill work, start with the safe perf stat lane instead of VTune deep counters:

make profile-v8-prefill-perf-stat

The target compares the cached v8 Qwen3.5 Q4_K_M runtime against the local llama.cpp build, using a fixed 128-token prefill followed by one decode token. It writes timestamped Markdown and JSON reports under profile_results/v8_prefill_perf_stat/ covering prefill/decode tok/s, milliseconds, cycles, instructions, IPC, cache misses, cache miss rate, and context switches.
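The derived columns in that report follow directly from the raw counters. The sketch below shows the arithmetic only; `ipc` and `tok_per_s` are hypothetical helper names, and the target's actual report generator is not shown on this page and may compute these values differently.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the CK Makefile): derive report metrics
# from raw perf stat counters and wall-clock milliseconds.

# ipc INSTRUCTIONS CYCLES -> instructions per cycle, two decimals
ipc() {
  awk -v i="$1" -v c="$2" \
    'BEGIN { if (c > 0) printf "%.2f\n", i / c; else print "n/a" }'
}

# tok_per_s TOKENS MILLISECONDS -> throughput in tokens per second
tok_per_s() {
  awk -v t="$1" -v ms="$2" \
    'BEGIN { if (ms > 0) printf "%.1f\n", t * 1000 / ms; else print "n/a" }'
}
```

For example, `tok_per_s 128 250` prints 512.0 for a 128-token prefill that took 250 ms, and `ipc 3000 2000` prints 1.50.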

The Q4_K packed-meta x8 prefill kernel is enabled by default, but only for the short/medium Qwen/Nemotron-family prefill shapes where it was measured to win (a shape gate). Use this escape hatch to disable it when validating a new CPU or bisecting a regression:

CK_DISABLE_Q4K_PACKED_META_X8_PREFILL=1 \
CK_V8_PREFILL_PERF_ENGINE=ck \
make profile-v8-prefill-perf-stat

On the local ThinkPad i7, the dispatch matrix showed the x8 layout winning at M=32 and M=64 Q4_K prefill shapes, while the longer M=128,N=896,K=4864 compact-down shape lost to the canonical pool path. That is why the default gate has both a minimum and maximum token tile size. Treat these as host-local engineering numbers, not a universal benchmark: the llama lane uses llama-bench, run-to-run variance exists, and the important signal is the shape-gated layout win.

Use these environment knobs when testing the older Q4_K packed-meta 2D scheduler without changing the default dispatcher:

CK_ENABLE_Q4K_PACKED_META_2D_PREFILL=1 \
CK_FORCE_Q4K_PACKED_META_2D_PREFILL=1 \
CK_Q4K_PACKED_META_TILE_M=8 \
CK_Q4K_PACKED_META_TILE_N=512 \
CK_V8_PREFILL_PERF_ENGINE=ck \
make profile-v8-prefill-perf-stat

The local ThinkPad i7 data showed only a small win from 2D packed-meta scheduling. The stronger win came from the x8 packed-meta layout, which groups eight Q4_K output rows per K block so the hot loop reuses each Q8 activation load across multiple output lanes. Force it explicitly for new sweeps with CK_FORCE_Q4K_PACKED_META_X8_PREFILL=1. See Prefill Performance Roadmap for the optimization ladder and promotion rules.
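A new sweep usually starts as an A/B pair: the x8 layout disabled versus forced. The sketch below is one way to script that under stated assumptions: `DRY_RUN` and the function names are hypothetical, and it assumes the force knob can be combined with CK_V8_PREFILL_PERF_ENGINE=ck in the same way the disable knob is. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: A/B the x8 packed-meta layout against the canonical path.
# DRY_RUN (a hypothetical knob of this script) defaults to 1 and only prints
# the commands; set DRY_RUN=0 inside the CK repo to actually run them.

x8_ab_commands() {
  for knob in CK_DISABLE_Q4K_PACKED_META_X8_PREFILL \
              CK_FORCE_Q4K_PACKED_META_X8_PREFILL; do
    echo "$knob=1 CK_V8_PREFILL_PERF_ENGINE=ck make profile-v8-prefill-perf-stat"
  done
}

x8_ab_run() {
  x8_ab_commands | while IFS= read -r cmd; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "$cmd"          # show what would run
    else
      sh -c "$cmd"         # run the lane with the knob set
    fi
  done
}

x8_ab_run
```

Each run writes its own timestamped report, so the two configurations can be compared afterwards. Repeat the pair a few times before drawing conclusions, since run-to-run variance exists.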

The token-tile/output-tile experiment is separate from the default x8 path:

CK_ENABLE_Q4K_PACKED_META_X8MT_PREFILL=1 \
CK_Q4K_PACKED_META_X8MT_TILE_M=2 \
CK_V8_PREFILL_PERF_ENGINE=ck \
make profile-v8-prefill-perf-stat

Use this to test the first two-dimensional Q4_K prefill path. It is intentionally opt-in until model-level timings are stable across hardware and prompt lengths.

Memory Profiling with Valgrind

Valgrind's memcheck tool detects memory errors: leaks, use-after-free, uninitialized reads, and buffer overflows.

Install Valgrind

# Ubuntu/Debian
sudo apt install valgrind

# Fedora/RHEL
sudo dnf install valgrind

# Arch
sudo pacman -S valgrind

Basic Memory Check

# Build with debug symbols (required for line numbers)
make clean
CFLAGS="-O0 -g" make

# Run memcheck on the tiny model
valgrind --leak-check=full \
         --show-leak-kinds=all \
         --track-origins=yes \
         ./build/tiny_model \
           --model-weights build/tiny_weights.bin \
           --tokens build/tiny_tokens.bin \
           --out-logits build/tiny_logits.bin

Actual Output from C-Kernel-Engine

# Real output from: make profile-memory
==537520== HEAP SUMMARY:
==537520==     in use at exit: 8 bytes in 1 blocks
==537520==   total heap usage: 11 allocs, 10 frees, 2,144,864 bytes allocated

==537520== LEAK SUMMARY:
==537520==    definitely lost: 0 bytes in 0 blocks
==537520==    indirectly lost: 0 bytes in 0 blocks
==537520==      possibly lost: 0 bytes in 0 blocks
==537520==    still reachable: 0 bytes in 0 blocks
==537520==         suppressed: 8 bytes in 1 blocks  # OpenMP internal

==537520== ERROR SUMMARY: 0 errors from 0 contexts

Key observations:

11 allocs, 10 frees - The 1 remaining is OpenMP internal (suppressed)
2MB allocated - Our bump allocator buffer, properly freed
0 definitely lost - No memory leaks!

Red flags: a non-zero "definitely lost", "Invalid read/write of size N", "Conditional jump or move depends on uninitialised value(s)"
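When memcheck output is saved to a file (for example with Valgrind's `--log-file=` option), the red flags can be checked mechanically. This is a minimal sketch; `memcheck_red_flags` is a hypothetical helper name, not an existing CK script.

```shell
#!/bin/sh
# Sketch: scan a saved memcheck log for the red flags listed above.
# Prints the offending lines; returns 1 if any are found, 0 otherwise.
memcheck_red_flags() {
  if grep -E 'Invalid (read|write)|depends on uninitialised value|definitely lost: [1-9]' "$1"; then
    return 1
  fi
  return 0
}
```

Usage: `memcheck_red_flags memcheck.log || echo "memcheck found problems"`. The `[1-9]` after "definitely lost:" matches any non-zero byte count, so the clean "0 bytes in 0 blocks" line passes.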

Valgrind Options Explained

Option	Purpose
`--leak-check=full`	Show where each leak was allocated
`--show-leak-kinds=all`	Report every leak kind, including "still reachable" and "possibly lost"
`--track-origins=yes`	Track where uninitialized values came from
`--error-exitcode=1`	Exit with 1 if errors found (for CI)
`--suppressions=file`	Ignore known false positives

Heap Profiling with Massif

Massif tracks memory usage over time, showing allocation patterns and peak usage.

Capture Heap Profile

# Run with massif
valgrind --tool=massif \
         --pages-as-heap=yes \
         --massif-out-file=massif.out \
         ./build/tiny_model \
           --model-weights build/tiny_weights.bin \
           --tokens build/tiny_tokens.bin \
           --out-logits build/tiny_logits.bin

# Visualize with ms_print
ms_print massif.out

Example Massif Output

    MB
4.194^                                                       #
     |                                                       #
     |                                                       #
     |                                                       #
     |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#
     |@                                                     @#
     |@                                                     @#
     |@                                                     @#
   0 +------------------------------------------------------->
     0                                                   100ms

# With bump allocator: one flat line (single allocation at startup)
# Bad pattern: sawtooth (repeated malloc/free) or stairs (leaks)
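The peak can also be read straight from the output file, because each snapshot in massif.out carries a `mem_heap_B=<bytes>` line. The helper below is a sketch; `massif_peak_bytes` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: report the largest mem_heap_B value recorded in a Massif output file.
# usage: massif_peak_bytes massif.out
massif_peak_bytes() {
  awk -F= '/^mem_heap_B=/ { if ($2 + 0 > max) max = $2 + 0 }
           END { printf "%.0f\n", max }' "$1"
}
```

Tracking this single number across commits is a cheap way to notice when peak usage grows.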

GUI Visualization with Massif-Visualizer

# Install
sudo apt install massif-visualizer

# Open profile
massif-visualizer massif.out

Shows interactive graphs with allocation call stacks.

CPU Profiling with perf

perf is the Linux profiler. It samples the CPU to find where time is spent.

Install perf

# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Fedora/RHEL
sudo dnf install perf

# Enable for non-root users
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

Record and Report

# Build with frame pointers for accurate call stacks
CFLAGS="-O3 -fno-omit-frame-pointer -g" make clean all

# Record profile (run your workload)
perf record -g -F 99 ./build/tiny_model \
  --model-weights build/tiny_weights.bin \
  --tokens build/tiny_tokens.bin \
  --out-logits build/tiny_logits.bin

# View top functions (entries are listed by descending overhead)
perf report --stdio --sort=comm,dso,symbol

# Interactive TUI
perf report

Example perf report Output

# Overhead  Command    Shared Object       Symbol
# ........  .........  ..................  ............................
    42.15%  tiny_model libckernel_engine   [.] gemm_forward
    18.23%  tiny_model libckernel_engine   [.] attention_forward_...
    12.07%  tiny_model libckernel_engine   [.] rmsnorm_forward
     8.45%  tiny_model libckernel_engine   [.] swiglu_forward
     6.21%  tiny_model libc.so.6           [.] __memcpy_avx512
     ...

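What to look for: GEMM should dominate (roughly 40-60% of samples in gemm_forward). If a non-GEMM kernel such as attention, RMSNorm, or SwiGLU takes a comparable share, it is a candidate for optimization.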
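The GEMM share can be pulled out of the text report instead of read by eye. This sketch assumes the flat layout shown above, with a single Overhead column and the symbol as the last field; `symbol_share` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: sum the Overhead column of `perf report --stdio` output for all
# symbols matching a pattern.
# usage: perf report --stdio --no-children | symbol_share gemm
symbol_share() {
  awk -v pat="$1" '
    $1 ~ /%$/ && $NF ~ pat { sub(/%/, "", $1); sum += $1 }  # data rows only
    END { printf "%.2f\n", sum }'
}
```

Header lines start with `#`, so they never match the percentage test and are skipped.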

Useful perf Commands

# Count hardware events (cache misses, branch mispredicts)
perf stat -e cache-misses,cache-references,branches,branch-misses \
  ./build/tiny_model ...

# Per-instruction hot spots inside one function (annotated disassembly)
perf annotate gemm_forward

# Compare two runs
perf diff perf.data.old perf.data

Flamegraphs

Flamegraphs visualize profiling data as interactive SVGs. Width = time spent.

Install FlameGraph Tools

# Clone Brendan Gregg's FlameGraph repo
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph

Generate CPU Flamegraph

# Record with perf
perf record -g -F 99 ./build/tiny_model \
  --model-weights build/tiny_weights.bin \
  --tokens build/tiny_tokens.bin \
  --out-logits build/tiny_logits.bin

# Convert to flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flame.svg

# Open in browser
firefox cpu_flame.svg

Reading Flamegraphs

Y-axis: Call stack depth (bottom = entry point, top = leaf functions)
X-axis: Share of samples (wider = more time); left-to-right order is alphabetical, not chronological
Colors: Random, just for visual distinction
Click: Zoom into a function
Search: Ctrl+F to highlight functions

What to look for: Wide plateaus at the top indicate hot functions to optimize.
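The same "wide plateau" signal can be read from the folded stacks that stackcollapse-perf.pl emits, where each line is `frame1;frame2;...;leaf count`. The sketch below totals samples per leaf frame; `leaf_totals` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: rank leaf frames (the top of each flamegraph tower) by sample count.
# Input: folded stacks on stdin, one "a;b;leaf <count>" per line.
leaf_totals() {
  awk '{
         count = $NF                      # last field is the sample count
         stack = $0
         sub(/ [0-9]+$/, "", stack)       # drop the trailing count
         n = split(stack, frames, ";")    # frames[n] is the leaf frame
         self[frames[n]] += count
       }
       END { for (f in self) print self[f], f }' | sort -rn
}
```

Usage: `perf script | stackcollapse-perf.pl | leaf_totals | head`. The functions at the top of this list are the ones that appear as wide plateaus in the SVG.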

Memory Allocation Flamegraph

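# Create uprobes on libc's malloc/free first (the libc path varies by distro)
sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc
sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 free

# Record malloc/free calls
sudo perf record -e probe_libc:malloc -e probe_libc:free -g ./build/tiny_model ...

# With bump allocator, this should be nearly empty!
# (Just one mmap at startup)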

Cache Profiling with Cachegrind

Cachegrind simulates CPU cache behavior to find cache-unfriendly code.

Run Cachegrind

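# Profile cache behavior
# (--cache-sim=yes is needed on recent Valgrind releases, where cache simulation is off by default)
valgrind --tool=cachegrind \
         --cache-sim=yes \
         --cachegrind-out-file=cachegrind.out \
         ./build/tiny_model \
           --model-weights build/tiny_weights.bin \
           --tokens build/tiny_tokens.bin \
           --out-logits build/tiny_logits.bin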

# Annotate source code with cache stats
cg_annotate cachegrind.out src/kernels/gemm_kernels.c

Understanding Cache Stats

--------------------------------------------------------------------------------
I1 cache:         32,768 B, 64 B, 8-way associative
D1 cache:         32,768 B, 64 B, 8-way associative
LL cache:         8,388,608 B, 64 B, 16-way associative
--------------------------------------------------------------------------------
        Ir          I1mr        ILmr          Dr          D1mr        DLmr
--------------------------------------------------------------------------------
 1,234,567,890      12,345       1,234   456,789,012     2,345,678     123,456

# Key metrics:
# D1mr = L1 data cache read misses
# DLmr = Last-level cache read misses (goes to RAM - expensive!)
# Lower is better
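Raw miss counts are easier to judge as rates. The sketch below divides misses by accesses, using the sample row above as input; `miss_rate_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: turn Cachegrind miss and access counts into a miss rate in percent.
# Commas in the counts are stripped before dividing.
# usage: miss_rate_pct MISSES ACCESSES
miss_rate_pct() {
  m=$(printf '%s' "$1" | tr -d ',')
  a=$(printf '%s' "$2" | tr -d ',')
  awk -v m="$m" -v a="$a" \
    'BEGIN { if (a > 0) printf "%.2f\n", 100 * m / a; else print "n/a" }'
}

miss_rate_pct 2,345,678 456,789,012   # D1 read miss rate (D1mr / Dr)
miss_rate_pct 123,456   456,789,012   # LL read miss rate (DLmr / Dr)
```

For the sample row this gives about 0.51% for D1 reads and about 0.03% for last-level reads.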

Per-Function Cache Analysis

# Sort by L1 data misses
cg_annotate --sort=D1mr cachegrind.out

# Annotate specific source file
cg_annotate cachegrind.out --auto=yes src/kernels/gemm_kernels.c

What to look for: High cache miss rates in inner loops indicate poor memory access patterns (consider tiling, prefetching).

Quick Reference

Makefile Targets

# Add these to your workflow
make profile-memory   # Valgrind memcheck
make profile-heap     # Massif heap profile
make profile-cpu      # perf record + report
make flamegraph       # Generate SVG flamegraph

Profiling Cheatsheet

Goal	Tool	Command
Find memory leaks	Valgrind	`valgrind --leak-check=full ./binary`
Track heap over time	Massif	`valgrind --tool=massif ./binary`
Find hot functions	perf	`perf record -g ./binary && perf report`
Visualize call stacks	Flamegraph	`perf script | stackcollapse-perf.pl | flamegraph.pl`
Find cache misses	Cachegrind	`valgrind --tool=cachegrind ./binary`
Count CPU events	perf stat	`perf stat -e cache-misses,cycles ./binary`

Build Flags for Profiling

Purpose	CFLAGS
Valgrind (accurate line numbers)	`-O0 -g`
perf (with frame pointers)	`-O3 -fno-omit-frame-pointer -g`
Production + debug symbols	`-O3 -g`

C-Kernel-Engine Profiling Tips

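Bump allocator: Valgrind should report 0 bytes lost in every leak category; the only block still in use at exit is the suppressed OpenMP-internal allocation, and the bump buffer itself is freed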
No mallocs in hot path: Memory allocation flamegraph should be nearly empty
GEMM dominates: In perf, gemm_forward should be 40-60% of time
Low cache misses: Cachegrind last-level read miss rate (DLmr / Dr) should be < 1% for well-tiled GEMM
Huge pages: Check /proc/PID/smaps for AnonHugePages usage
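For the huge-pages check, the per-mapping AnonHugePages lines in smaps can be summed into one figure. A minimal sketch, with `anon_huge_kb` as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: total the AnonHugePages lines (reported in kB) of a smaps file.
# usage: anon_huge_kb /proc/<PID>/smaps
anon_huge_kb() {
  awk '/^AnonHugePages:/ { kb += $2 } END { printf "%.0f\n", kb }' "$1"
}
```

A result of 0 while the model is running means none of the process's anonymous memory is currently backed by transparent huge pages.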

CI Integration

# .github/workflows/profiling.yml
- name: Memory check
  run: |
    make CFLAGS="-O0 -g"
    valgrind --error-exitcode=1 --leak-check=full \
      ./build/tiny_model --model-weights ... --tokens ...