Profiling Guide
This page covers memory and compute profiling for C-Kernel-Engine. Always profile before optimizing.
Working on v7 Training Performance?
Use the dedicated v7-profiling.html runbook for VTune (hotspots/memory-access/uarch-exploration), Advisor roofline, perf/flamegraph capture, and run-dir artifact flow.
Memory Profiling with Valgrind
Valgrind's memcheck tool detects memory errors: leaks, use-after-free, uninitialized reads, and heap buffer overflows.
Install Valgrind
# Ubuntu/Debian
sudo apt install valgrind
# Fedora/RHEL
sudo dnf install valgrind
# Arch
sudo pacman -S valgrind
Basic Memory Check
# Build with debug symbols (required for line numbers)
make clean
CFLAGS="-O0 -g" make
# Run memcheck on the tiny model
valgrind --leak-check=full \
--show-leak-kinds=all \
--track-origins=yes \
./build/tiny_model \
--model-weights build/tiny_weights.bin \
--tokens build/tiny_tokens.bin \
--out-logits build/tiny_logits.bin
Actual Output from C-Kernel-Engine
# Real output from: make profile-memory
==537520== HEAP SUMMARY:
==537520==     in use at exit: 8 bytes in 1 blocks
==537520==   total heap usage: 11 allocs, 10 frees, 2,144,864 bytes allocated
==537520== LEAK SUMMARY:
==537520==    definitely lost: 0 bytes in 0 blocks
==537520==    indirectly lost: 0 bytes in 0 blocks
==537520==      possibly lost: 0 bytes in 0 blocks
==537520==    still reachable: 0 bytes in 0 blocks
==537520==         suppressed: 8 bytes in 1 blocks   # OpenMP internal
==537520== ERROR SUMMARY: 0 errors from 0 contexts
Key observations:
- 11 allocs, 10 frees - The 1 remaining is OpenMP internal (suppressed)
- 2MB allocated - Our bump allocator buffer, properly freed
- 0 definitely lost - No memory leaks!
Red flags: "definitely lost", "Invalid read/write", "Conditional jump depends on uninitialised value"
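The red flags above can be checked mechanically once the memcheck output is saved to a file (for example with Valgrind's `--log-file=memcheck.log` option). The following is a minimal sketch, not part of the project's Makefile; the function name `memcheck_red_flags` and the log file name are assumptions made for illustration.

```shell
# Hypothetical helper: scan a saved memcheck log for the red flags listed above.
# Usage: memcheck_red_flags LOGFILE   (status 0 = clean, 1 = red flag found)
memcheck_red_flags() {
    log="$1"
    # Error kinds that should never appear in a clean run.
    if grep -E -q 'Invalid (read|write)|depends on uninitialised value' "$log"; then
        echo "memcheck: invalid access or uninitialised value" >&2
        return 1
    fi
    # "definitely lost: N bytes in M blocks" -> fail when N is non-zero.
    lost=$(sed -n 's/.*definitely lost: \([0-9,]*\) bytes.*/\1/p' "$log" \
        | tr -d ',' | tail -n 1)
    if [ "${lost:-0}" -ne 0 ]; then
        echo "memcheck: ${lost} bytes definitely lost" >&2
        return 1
    fi
    return 0
}
```

A log with no "definitely lost" line is treated as clean, so pair this with a check that Valgrind itself exited normally.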
Valgrind Options Explained
| Option | Purpose |
|---|---|
| --leak-check=full | Show where each leak was allocated |
| --show-leak-kinds=all | Include "still reachable" (our bump buffer) |
| --track-origins=yes | Track where uninitialized values came from |
| --error-exitcode=1 | Exit with 1 if errors found (for CI) |
| --suppressions=file | Ignore known false positives |
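For `--suppressions=file`, a suppression entry names the tool and error kind, then lists stack frames to match. The sketch below writes one such entry for an OpenMP runtime allocation. The entry name, the leak kinds, and the `libgomp` object pattern are assumptions, not the project's actual suppression file; running Valgrind with `--gen-suppressions=all` prints entries that match your real stacks.

```shell
# Hypothetical helper: write an example suppression file to the given path.
# The entry name and frames are placeholders; adjust them to your own stacks.
write_openmp_supp() {
    cat > "$1" <<'EOF'
{
   openmp_runtime_alloc
   Memcheck:Leak
   match-leak-kinds: reachable,possible
   fun:malloc
   ...
   obj:*/libgomp.so*
}
EOF
}
```

After `write_openmp_supp openmp.supp`, pass `--suppressions=openmp.supp` on the Valgrind command line.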
Heap Profiling with Massif
Massif tracks memory usage over time, showing allocation patterns and peak usage.
Capture Heap Profile
# Run with massif
valgrind --tool=massif \
--pages-as-heap=yes \
--massif-out-file=massif.out \
./build/tiny_model \
--model-weights build/tiny_weights.bin \
--tokens build/tiny_tokens.bin \
--out-logits build/tiny_logits.bin
# Visualize with ms_print
ms_print massif.out
Example Massif Output
MB
4.194^ #
| #
| #
| #
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#
|@ @#
|@ @#
|@ @#
0 +------------------------------------------------------->
0 100ms
# With bump allocator: one flat line (single allocation at startup)
# Bad pattern: sawtooth (repeated malloc/free) or stairs (leaks)
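To track peak usage without reading the graph, the peak can be pulled from the Massif output file, which stores one `mem_heap_B=<bytes>` line per snapshot. This is a small sketch under that assumption; `massif_peak_bytes` is a hypothetical name, not a project tool.

```shell
# Hypothetical helper: print the largest heap size (bytes) in a massif output file.
massif_peak_bytes() {
    # Each snapshot has a "mem_heap_B=<bytes>" line; keep the largest value seen.
    awk -F= '$1 == "mem_heap_B" { if ($2 + 0 > peak) peak = $2 + 0 }
             END { print peak + 0 }' "$1"
}
```

Usage: `massif_peak_bytes massif.out`. Comparing this number between two builds is a quick way to spot a regression in peak memory.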
GUI Visualization with Massif-Visualizer
# Install
sudo apt install massif-visualizer
# Open profile
massif-visualizer massif.out
Shows interactive graphs with allocation call stacks.
CPU Profiling with perf
perf is the Linux profiler. It samples the CPU to find where time is spent.
Install perf
# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-$(uname -r)
# Fedora/RHEL
sudo dnf install perf
# Enable for non-root users
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
Record and Report
# Build with frame pointers for accurate call stacks
CFLAGS="-O3 -fno-omit-frame-pointer -g" make clean all
# Record profile (run your workload)
perf record -g -F 99 ./build/tiny_model \
--model-weights build/tiny_weights.bin \
--tokens build/tiny_tokens.bin \
--out-logits build/tiny_logits.bin
# View top functions
perf report --stdio --sort=overhead
# Interactive TUI
perf report
Example perf report Output
# Overhead Command Shared Object Symbol
# ........ ......... .................. ............................
42.15% tiny_model libckernel_engine [.] gemm_forward
18.23% tiny_model libckernel_engine [.] attention_forward_...
12.07% tiny_model libckernel_engine [.] rmsnorm_forward
8.45% tiny_model libckernel_engine [.] swiglu_forward
6.21% tiny_model libc.so.6 [.] __memcpy_avx512
...
What to look for: GEMM should dominate. If other kernels are high, they may need optimization.
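The "GEMM should dominate" check can be scripted against the text report. The sketch below assumes the flat layout shown in the example output (overhead in the first column, symbol in the last), as produced by something like `perf report --stdio --no-children`. The function name `perf_overhead` is hypothetical.

```shell
# Hypothetical helper: print the overhead (percent, no % sign) of one symbol.
# Reads `perf report --stdio` text on stdin; assumes the symbol is the last column.
perf_overhead() {
    awk -v sym="$1" '$NF == sym && $1 ~ /%$/ {
        sub(/%$/, "", $1)   # strip the trailing percent sign
        print $1
        exit                # first match is enough
    }'
}
```

Usage: `perf report --stdio --no-children | perf_overhead gemm_forward`. Call-graph output adds extra lines and columns, so treat the parsing as a starting point.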
Useful perf Commands
# Count hardware events (cache misses, branch mispredicts)
perf stat -e cache-misses,cache-references,branches,branch-misses \
./build/tiny_model ...
# Per-instruction annotation of one function
perf annotate gemm_forward
# Compare two runs
perf diff perf.data.old perf.data
Flamegraphs
Flamegraphs visualize profiling data as interactive SVGs. Width = time spent.
Install FlameGraph Tools
# Clone Brendan Gregg's FlameGraph repo
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph
Generate CPU Flamegraph
# Record with perf
perf record -g -F 99 ./build/tiny_model \
--model-weights build/tiny_weights.bin \
--tokens build/tiny_tokens.bin \
--out-logits build/tiny_logits.bin
# Convert to flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flame.svg
# Open in browser
firefox cpu_flame.svg
Reading Flamegraphs
- Y-axis: Call stack depth (bottom = entry point, top = leaf functions)
- X-axis: Share of samples (wider = more time). Left-to-right order is alphabetical, not chronological
- Colors: Random, just for visual distinction
- Click: Zoom into a function
- Search: Ctrl+F to highlight functions
What to look for: Wide plateaus at the top indicate hot functions to optimize.
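The same "wide plateau" information is available as text from the folded stacks that `stackcollapse-perf.pl` emits, where each line is a semicolon-separated stack followed by a sample count. The sketch below sums samples per leaf frame. `top_leaves` is a hypothetical name, and the approach is an illustration rather than part of the FlameGraph tools.

```shell
# Hypothetical helper: rank leaf frames by sample count.
# Input (stdin): folded stacks, one per line: "frame1;frame2;...;leaf <count>".
# Optional argument: how many rows to print (default 10).
top_leaves() {
    awk '{
        count = $NF                    # last field is the sample count
        stack = $0
        sub(/ [0-9]+$/, "", stack)     # drop the trailing count
        n = split(stack, frames, ";")
        total[frames[n]] += count      # frames[n] is the leaf (top of the flame)
    }
    END { for (f in total) print total[f], f }' | sort -rn | head -n "${1:-10}"
}
```

Usage: `perf script | stackcollapse-perf.pl | top_leaves 5`. Leaf totals correspond to the plateaus at the top of the graph.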
Memory Allocation Flamegraph
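# Create uprobes on libc's malloc/free first (needs root; libc path varies by distro)
sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc
sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 free
# Record malloc/free calls
perf record -e probe_libc:malloc -e probe_libc:free -g ./build/tiny_model ...
# With bump allocator, this should be nearly empty!
# (Just one mmap at startup)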
Cache Profiling with Cachegrind
Cachegrind simulates CPU cache behavior to find cache-unfriendly code.
Run Cachegrind
# Profile cache behavior
valgrind --tool=cachegrind \
--cachegrind-out-file=cachegrind.out \
./build/tiny_model \
--model-weights build/tiny_weights.bin \
--tokens build/tiny_tokens.bin \
--out-logits build/tiny_logits.bin
# Annotate source code with cache stats
cg_annotate cachegrind.out src/kernels/gemm_kernels.c
Understanding Cache Stats
--------------------------------------------------------------------------------
I1 cache: 32,768 B, 64 B, 8-way associative
D1 cache: 32,768 B, 64 B, 8-way associative
LL cache: 8,388,608 B, 64 B, 16-way associative
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr
--------------------------------------------------------------------------------
1,234,567,890 12,345 1,234 456,789,012 2,345,678 123,456
# Key metrics:
# D1mr = L1 data cache read misses
# DLmr = Last-level cache read misses (goes to RAM - expensive!)
# Lower is better
Per-Function Cache Analysis
# Sort by L1 data misses
cg_annotate --sort=D1mr cachegrind.out
# Annotate specific source file (options go before the output file)
cg_annotate --auto=yes cachegrind.out src/kernels/gemm_kernels.c
What to look for: High cache miss rates in inner loops indicate poor memory access patterns (consider tiling, prefetching).
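Cachegrind prints raw counts, so a miss rate has to be computed as misses divided by accesses, for example D1mr / Dr for L1 data reads. A minimal sketch follows; `miss_rate` is a hypothetical helper name.

```shell
# Hypothetical helper: miss rate in percent from a miss count and an access count.
# Accepts the comma-grouped numbers that cg_annotate prints.
miss_rate() {
    misses=$(printf '%s' "$1" | tr -d ',')
    accesses=$(printf '%s' "$2" | tr -d ',')
    awk -v m="$misses" -v a="$accesses" 'BEGIN {
        if (a == 0) { print "n/a" }          # avoid dividing by zero
        else        { printf "%.3f\n", 100 * m / a }
    }'
}
```

With the example counts above, `miss_rate 2,345,678 456,789,012` prints 0.514 (L1 data read misses) and `miss_rate 123,456 456,789,012` prints 0.027 (last-level read misses).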
Quick Reference
Makefile Targets
# Add these to your workflow
make profile-memory   # Valgrind memcheck
make profile-heap     # Massif heap profile
make profile-cpu      # perf record + report
make flamegraph       # Generate SVG flamegraph
Profiling Cheatsheet
| Goal | Tool | Command |
|---|---|---|
| Find memory leaks | Valgrind | valgrind --leak-check=full ./binary |
| Track heap over time | Massif | valgrind --tool=massif ./binary |
| Find hot functions | perf | perf record -g ./binary && perf report |
| Visualize call stacks | Flamegraph | perf script \| stackcollapse-perf.pl \| flamegraph.pl > cpu_flame.svg |
| Find cache misses | Cachegrind | valgrind --tool=cachegrind ./binary |
| Count CPU events | perf stat | perf stat -e cache-misses,cycles ./binary |
Build Flags for Profiling
| Purpose | CFLAGS |
|---|---|
| Valgrind (accurate line numbers) | -O0 -g |
| perf (with frame pointers) | -O3 -fno-omit-frame-pointer -g |
| Production + debug symbols | -O3 -g |
C-Kernel-Engine Profiling Tips
- Bump allocator: Valgrind should show at most one "still reachable" block (our mmap'd buffer); the sample run above frees it before exit and reports 0 bytes
- No mallocs in hot path: Memory allocation flamegraph should be nearly empty
- GEMM dominates: In perf, gemm_forward should be 40-60% of time
- Low cache misses: Cachegrind DLmr should be < 1% for well-tiled GEMM
- Huge pages: Check /proc/PID/smaps for AnonHugePages usage
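For the huge pages tip, each mapping in `/proc/PID/smaps` has its own `AnonHugePages:` line in kB, so the per-mapping values need to be summed. A small sketch, with `anon_huge_kb` as a hypothetical helper name:

```shell
# Hypothetical helper: total AnonHugePages (kB) across all mappings.
# Reads smaps text on stdin.
anon_huge_kb() {
    awk '$1 == "AnonHugePages:" { kb += $2 } END { print kb + 0 }'
}
```

Usage: `anon_huge_kb < /proc/PID/smaps` while the model is running, with PID replaced by the process ID. A result of 0 suggests the buffer is not backed by transparent huge pages.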
CI Integration
# .github/workflows/profiling.yml
- name: Memory check
  run: |
    make CFLAGS="-O0 -g"
    valgrind --error-exitcode=1 --leak-check=full \
      ./build/tiny_model --model-weights ... --tokens ...
Further Reading
- Valgrind Manual
- Brendan Gregg's Flamegraph Page
- perf Wiki Tutorial
- Memory Safety - Our bump allocator design