Profiling Guide

Measure, Don't Guess
This page covers memory and compute profiling for C-Kernel-Engine. Always profile before optimizing.

Working on v7 Training Performance?

Use the dedicated v7-profiling.html runbook for VTune (hotspots/memory-access/uarch-exploration), Advisor roofline, perf/flamegraph capture, and run-dir artifact flow.

Memory Profiling with Valgrind

Valgrind's Memcheck tool detects memory errors: leaks, use-after-free, reads of uninitialized values, and heap buffer overflows. It does not bounds-check stack or global arrays.

Install Valgrind

# Ubuntu/Debian
sudo apt install valgrind

# Fedora/RHEL
sudo dnf install valgrind

# Arch
sudo pacman -S valgrind

Basic Memory Check

# Build with debug symbols (required for line numbers)
make clean
CFLAGS="-O0 -g" make

# Run memcheck on the tiny model
valgrind --leak-check=full \
         --show-leak-kinds=all \
         --track-origins=yes \
         ./build/tiny_model \
           --model-weights build/tiny_weights.bin \
           --tokens build/tiny_tokens.bin \
           --out-logits build/tiny_logits.bin

Actual Output from C-Kernel-Engine

# Real output from: make profile-memory
==537520== HEAP SUMMARY:
==537520==     in use at exit: 8 bytes in 1 blocks
==537520==   total heap usage: 11 allocs, 10 frees, 2,144,864 bytes allocated

==537520== LEAK SUMMARY:
==537520==    definitely lost: 0 bytes in 0 blocks
==537520==    indirectly lost: 0 bytes in 0 blocks
==537520==      possibly lost: 0 bytes in 0 blocks
==537520==    still reachable: 0 bytes in 0 blocks
==537520==         suppressed: 8 bytes in 1 blocks  # OpenMP internal

==537520== ERROR SUMMARY: 0 errors from 0 contexts

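Key observations: 10 of the 11 allocations were freed, nothing is definitely, indirectly, or possibly lost, and the 8 bytes still in use at exit belong to a suppressed OpenMP-internal block. The run reports 0 errors.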

Red flags: a non-zero "definitely lost" count, "Invalid read" or "Invalid write", and "Conditional jump or move depends on uninitialised value(s)".
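One way to apply these red flags automatically is to save the Memcheck report with --log-file and scan it afterwards. The sketch below is a minimal illustration and not part of C-Kernel-Engine: the function name check_memcheck_log and the log file name are hypothetical, and the patterns cover only the three messages listed above.

```shell
# Hypothetical helper: scan a saved Memcheck log for red-flag messages.
# Prints each matching line and returns 1 if any are found, 0 otherwise.
check_memcheck_log() {
    log_file="$1"
    # "definitely lost: 0 bytes" is the clean case, so only a non-zero
    # count is flagged. Memcheck's wording for uninitialised branches is
    # "Conditional jump or move depends on uninitialised value(s)".
    if grep -E 'definitely lost: *[1-9]|Invalid (read|write)|Conditional jump or move depends on uninitialised value' "$log_file"; then
        return 1
    fi
    return 0
}
```

After a run with valgrind --leak-check=full --log-file=memcheck.log, calling check_memcheck_log memcheck.log gives a pass/fail result that a script can act on.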

Valgrind Options Explained

Option                   Purpose
--leak-check=full        Show where each leak was allocated
--show-leak-kinds=all    Include "still reachable" (our bump buffer)
--track-origins=yes      Track where uninitialized values came from
--error-exitcode=1       Exit with 1 if errors found (for CI)
--suppressions=file      Ignore known false positives

Heap Profiling with Massif

Massif tracks memory usage over time, showing allocation patterns and peak usage.

Capture Heap Profile

# Run with massif
valgrind --tool=massif \
         --pages-as-heap=yes \
         --massif-out-file=massif.out \
         ./build/tiny_model \
           --model-weights build/tiny_weights.bin \
           --tokens build/tiny_tokens.bin \
           --out-logits build/tiny_logits.bin

# Visualize with ms_print
ms_print massif.out

Example Massif Output

    MB
4.194^                                                       #
     |                                                       #
     |                                                       #
     |                                                       #
     |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#
     |@                                                     @#
     |@                                                     @#
     |@                                                     @#
   0 +------------------------------------------------------->
     0                                                   100ms

# With bump allocator: one flat line (single allocation at startup)
# Bad pattern: sawtooth (repeated malloc/free) or stairs (leaks)
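To compare runs without reading the graph, the peak can be pulled straight from the Massif output file, which records one mem_heap_B=<bytes> line per snapshot. This is a minimal sketch: the function name massif_peak_bytes is hypothetical, and it assumes that with --pages-as-heap=yes the page-level totals are reported in mem_heap_B.

```shell
# Hypothetical helper: print the largest mem_heap_B value (in bytes)
# across all snapshots in a Massif output file.
massif_peak_bytes() {
    awk -F= '
        /^mem_heap_B=/ { if ($2 + 0 > peak) peak = $2 + 0 }
        END { printf "%.0f\n", peak }   # %.0f avoids exponent notation for large values
    ' "$1"
}
```

Running massif_peak_bytes massif.out before and after a change shows whether peak usage moved. With a bump allocator the peak should stay the same from the first snapshot after startup to the last.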

GUI Visualization with Massif-Visualizer

# Install
sudo apt install massif-visualizer

# Open profile
massif-visualizer massif.out

Massif-Visualizer shows interactive graphs of heap usage along with the call stacks responsible for each allocation.

CPU Profiling with perf

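perf is the standard Linux sampling profiler. It interrupts the program at a fixed rate and records where execution was, so the functions with the most samples are where time is spent.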

Install perf

# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Fedora/RHEL
sudo dnf install perf

# Enable for non-root users (the setting resets on reboot)
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

Record and Report

# Build with frame pointers for accurate call stacks
CFLAGS="-O3 -fno-omit-frame-pointer -g" make clean all

# Record profile (run your workload)
perf record -g -F 99 ./build/tiny_model \
  --model-weights build/tiny_weights.bin \
  --tokens build/tiny_tokens.bin \
  --out-logits build/tiny_logits.bin

# View top functions
perf report --stdio --sort=overhead

# Interactive TUI
perf report

Example perf report Output

# Overhead  Command    Shared Object       Symbol
# ........  .........  ..................  ............................
    42.15%  tiny_model libckernel_engine   [.] gemm_forward
    18.23%  tiny_model libckernel_engine   [.] attention_forward_...
    12.07%  tiny_model libckernel_engine   [.] rmsnorm_forward
     8.45%  tiny_model libckernel_engine   [.] swiglu_forward
     6.21%  tiny_model libc.so.6           [.] __memcpy_avx512
     ...

What to look for: GEMM should dominate the profile. If other kernels account for a large share of samples, they are candidates for optimization.
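The check described above can be sketched as a small filter over the text report. This is an illustration only: the function name non_gemm_hotspots is hypothetical, the 15% default threshold is arbitrary, and the parser assumes the single Overhead column layout shown in the example (for instance from perf report --stdio --no-children).

```shell
# Hypothetical helper: read `perf report --stdio` text on stdin and print
# every symbol whose overhead is at least MIN_PCT percent and whose name
# does not contain "gemm".
non_gemm_hotspots() {
    min_pct="${1:-15}"
    awk -v min="$min_pct" '
        /^#/ { next }                       # skip header and comment lines
        $1 ~ /^[0-9.]+%$/ {                 # data rows start with a percentage
            pct = $1
            sub(/%/, "", pct)
            sym = $NF                       # symbol name is the last field
            if (pct + 0 >= min && sym !~ /gemm/) print pct "% " sym
        }'
}
```

Piping the report through non_gemm_hotspots 15 lists the kernels worth a closer look. An empty result means no non-GEMM symbol reached the threshold.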

Useful perf Commands

# Count hardware events (cache misses, branch mispredicts)
perf stat -e cache-misses,cache-references,branches,branch-misses \
  ./build/tiny_model ...

# Per-instruction breakdown of one function (reads perf.data)
perf annotate gemm_forward

# Compare two runs
perf diff perf.data.old perf.data

Flamegraphs

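Flamegraphs visualize sampled call stacks as interactive SVGs. Each box is a function, and its width is the share of samples in which that function was on the stack. The x-axis is sorted alphabetically and does not represent the passage of time.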

Install FlameGraph Tools

# Clone Brendan Gregg's FlameGraph repo
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph

Generate CPU Flamegraph

# Record with perf
perf record -g -F 99 ./build/tiny_model \
  --model-weights build/tiny_weights.bin \
  --tokens build/tiny_tokens.bin \
  --out-logits build/tiny_logits.bin

# Convert to flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flame.svg

# Open in browser
firefox cpu_flame.svg

Reading Flamegraphs

What to look for: Wide plateaus at the top indicate hot functions to optimize.
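The same information is available as text from the folded stacks that stackcollapse-perf.pl emits, one "frame;frame;...;leaf count" line per unique stack. The sketch below sums samples per leaf frame, which corresponds to the plateaus at the top of the graph. The function name top_leaf_frames is hypothetical, and the parser assumes frame names without embedded semicolons.

```shell
# Hypothetical helper: read folded stacks on stdin and print the sample
# count per leaf function, largest first.
top_leaf_frames() {
    awk '{
        count = $NF                          # trailing field is the sample count
        stack = $0
        sub(/ [0-9]+$/, "", stack)           # drop the count, keep the frames
        n = split(stack, frames, ";")
        self[frames[n]] += count             # last frame was executing when sampled
    }
    END { for (f in self) print self[f], f }' | sort -rn
}
```

For example, perf script | stackcollapse-perf.pl | top_leaf_frames | head prints the ten functions with the most on-CPU samples.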

Memory Allocation Flamegraph

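# One-time setup: add uprobes on libc's malloc and free
# (libc path varies by distro; check with: ldd ./build/tiny_model)
sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc
sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 free

# Record malloc/free calls (libc uprobes are named probe_libc:*; usually needs root)
sudo perf record -e probe_libc:malloc -e probe_libc:free -g ./build/tiny_model ...

# With bump allocator, this should be nearly empty!
# (Just one mmap at startup)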

Cache Profiling with Cachegrind

Cachegrind simulates CPU cache behavior to find cache-unfriendly code.

Run Cachegrind

# Profile cache behavior
valgrind --tool=cachegrind \
         --cachegrind-out-file=cachegrind.out \
         ./build/tiny_model \
           --model-weights build/tiny_weights.bin \
           --tokens build/tiny_tokens.bin \
           --out-logits build/tiny_logits.bin

# Annotate source code with cache stats
cg_annotate cachegrind.out src/kernels/gemm_kernels.c

Understanding Cache Stats

--------------------------------------------------------------------------------
I1 cache:         32,768 B, 64 B, 8-way associative
D1 cache:         32,768 B, 64 B, 8-way associative
LL cache:         8,388,608 B, 64 B, 16-way associative
--------------------------------------------------------------------------------
        Ir          I1mr        ILmr          Dr          D1mr        DLmr
--------------------------------------------------------------------------------
 1,234,567,890      12,345       1,234   456,789,012     2,345,678     123,456

# Key metrics:
# D1mr = L1 data cache read misses
# DLmr = Last-level cache read misses (goes to RAM - expensive!)
# Lower is better
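Raw miss counts are easier to judge as rates relative to the number of reads. A minimal sketch of that arithmetic follows; the function name miss_rate is hypothetical, and the counts must be passed without thousands separators.

```shell
# Hypothetical helper: print misses as a percentage of accesses,
# rounded to two decimals. Usage: miss_rate MISSES ACCESSES
miss_rate() {
    awk -v m="$1" -v a="$2" 'BEGIN {
        if (a + 0 == 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", 100 * m / a
    }'
}
```

With the sample numbers above, miss_rate 2345678 456789012 gives the D1 read miss rate (D1mr / Dr) and miss_rate 123456 456789012 gives the last-level read miss rate (DLmr / Dr).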

Per-Function Cache Analysis

# Sort by L1 data misses
cg_annotate --sort=D1mr cachegrind.out

# Annotate specific source file
cg_annotate cachegrind.out --auto=yes src/kernels/gemm_kernels.c

What to look for: High cache miss rates in inner loops indicate poor memory access patterns (consider tiling, prefetching).

Quick Reference

Makefile Targets

# Add these to your workflow
make profile-memory   # Valgrind memcheck
make profile-heap     # Massif heap profile
make profile-cpu      # perf record + report
make flamegraph       # Generate SVG flamegraph

Profiling Cheatsheet

Goal                    Tool         Command
Find memory leaks       Valgrind     valgrind --leak-check=full ./binary
Track heap over time    Massif       valgrind --tool=massif ./binary
Find hot functions      perf         perf record -g ./binary && perf report
Visualize call stacks   Flamegraph   perf script | stackcollapse-perf.pl | flamegraph.pl
Find cache misses       Cachegrind   valgrind --tool=cachegrind ./binary
Count CPU events        perf stat    perf stat -e cache-misses,cycles ./binary

Build Flags for Profiling

Purpose                            CFLAGS
Valgrind (accurate line numbers)   -O0 -g
perf (with frame pointers)         -O3 -fno-omit-frame-pointer -g
Production + debug symbols         -O3 -g

C-Kernel-Engine Profiling Tips

CI Integration

# .github/workflows/profiling.yml
- name: Memory check
  run: |
    make CFLAGS="-O0 -g"
    valgrind --error-exitcode=1 --leak-check=full \
      ./build/tiny_model --model-weights ... --tokens ...

Further Reading
