v7 Profiling and Performance Runbook
This page documents the actual profiling workflow we use for v7 training kernels: perf, flamegraphs, Intel VTune, and Intel Advisor.
This is the v7 training-focused profiling page. For generic memory/debug profiling (valgrind, cachegrind basics), see profiling.html.
Why Performance Matters
Kernel performance is not a cosmetic metric in v7. It directly controls:
- Training throughput: tokens/second and wall-clock to convergence.
- Hardware efficiency: IPC, cache behavior, and branch quality.
- Scalability: thread utilization and memory-bandwidth headroom.
- Optimization priority: where to spend kernel engineering time first.
Rule of Operation
We optimize only after parity/memory correctness is green. Performance work without correctness gates is treated as invalid.
Tool Map
| Tool | Primary Question | Typical Outputs |
|---|---|---|
| perf stat | Are we using the CPU efficiently (IPC, miss rates)? | perf_stat_summary.txt, perf_stat_summary.json |
| perf record + FlameGraph | Which functions dominate runtime? | *.data, *.folded, *.svg, flamegraph_manifest.json |
| VTune (hotspots, memory-access, uarch-exploration) | Where are microarchitectural and memory bottlenecks? | vtune_summary.json + result directories |
| Advisor (roofline) | Are kernels compute-bound or memory-bound by the roofline model? | advisor_summary.json + Advisor project dir |
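The derived metrics in the table above come from simple ratios over raw counters. A minimal sketch of those derivations, assuming counter names that mirror common perf stat events (the actual field names in perf_stat_summary.json may differ):

```python
def derive_metrics(counters: dict) -> dict:
    """Compute IPC and miss rates from raw perf counters.

    Keys mirror common `perf stat` event names; the real
    perf_stat_summary.json schema may use different names.
    """
    ipc = counters["instructions"] / counters["cycles"]
    cache_miss_rate = counters["cache-misses"] / counters["cache-references"]
    branch_miss_rate = counters["branch-misses"] / counters["branches"]
    return {
        "ipc": round(ipc, 3),
        "cache_miss_rate": round(cache_miss_rate, 4),
        "branch_miss_rate": round(branch_miss_rate, 4),
    }

# Example with made-up counter values:
metrics = derive_metrics({
    "instructions": 8_000_000_000,
    "cycles": 4_000_000_000,
    "cache-references": 200_000_000,
    "cache-misses": 10_000_000,
    "branches": 1_000_000_000,
    "branch-misses": 5_000_000,
})
print(metrics)  # ipc=2.0, cache_miss_rate=0.05, branch_miss_rate=0.005
```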
Direct Training Capture
These are direct invocations that wrap the profiler around ck_run_v7.py train.
VTune memory-access on train loop
CK_NUM_THREADS=8 vtune -collect memory-access -result-dir /tmp/v7_vtune_mem -quiet -- \
python3 version/v7/scripts/ck_run_v7.py train \
--run /tmp/v7_ht_threads --backend ck \
--train-epochs 1 --train-seq-len 8 --train-total-tokens 2048 --train-grad-accum 8 \
--train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none \
--train-json-out /tmp/v7_vtune_mem_train.json
perf record + folded stacks for flamegraph
CK_NUM_THREADS=8 perf record --all-user -F 999 --call-graph dwarf -o /tmp/v7_train_perf.data -- \
python3 version/v7/scripts/ck_run_v7.py train \
--run /tmp/v7_ht_threads --backend ck \
--train-epochs 3 --train-seq-len 8 --train-total-tokens 4096 --train-grad-accum 8 \
--train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none \
--train-json-out /tmp/v7_perf_train.json
perf script -i /tmp/v7_train_perf.data | ./FlameGraph/stackcollapse-perf.pl > /tmp/v7_train_flame.folded
./FlameGraph/flamegraph.pl /tmp/v7_train_flame.folded > /tmp/v7_train_flame.svg
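Folded stacks are plain text, one "frame1;frame2;...;leaf count" line per unique stack, so they are easy to query directly. A sketch that ranks leaf frames by self samples, similar in spirit to the top-symbol listing in flamegraph_manifest.json (this helper is illustrative, not a repo script):

```python
from collections import Counter

def top_self_symbols(folded_lines, n=5):
    """Rank leaf frames by self samples from stackcollapse-perf.pl output.

    Each line looks like: "main;train_step;gemm_kernel 1234"
    where the trailing integer is the sample count for that stack.
    """
    self_samples = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        if not stack or not count.isdigit():
            continue  # skip malformed lines
        leaf = stack.split(";")[-1]  # self time belongs to the leaf frame
        self_samples[leaf] += int(count)
    return self_samples.most_common(n)

# Example with hypothetical stacks:
folded = [
    "main;train_step;gemm_kernel 900",
    "main;train_step;softmax 150",
    "main;train_step;gemm_kernel 300",
]
print(top_self_symbols(folded))
# [('gemm_kernel', 1200), ('softmax', 150)]
```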
VTune GUI
vtune-gui /tmp/v7_vtune_mem
# Example historical runs:
# vtune-gui /tmp/v7_gui/vtune_hotspots
# vtune-gui /tmp/v7_gui/vtune_memory
Advisor Roofline
CK_NUM_THREADS=8 advisor --collect=roofline --project-dir /tmp/v7_advisor_roofline -- \
python3 version/v7/scripts/ck_run_v7.py train \
--run /tmp/v7_ht_threads --backend ck \
--train-epochs 1 --train-seq-len 8 --train-total-tokens 2048 --train-grad-accum 8 \
--train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none
# Some installs expose "advixe-cl" instead of "advisor".
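When scripting around Advisor, you can probe which entry point is on PATH before launching a collection. A small sketch (shutil.which only inspects PATH, so source the oneAPI setvars.sh first):

```python
import shutil

def find_advisor_cli():
    """Return the first Advisor CLI found on PATH, or None.

    Prefers the modern "advisor" entry point and falls back to the
    legacy "advixe-cl" name exposed by some installs.
    """
    for name in ("advisor", "advixe-cl"):
        path = shutil.which(name)
        if path:
            return path
    return None

cli = find_advisor_cli()
print(cli or "Advisor CLI not found; did you source setvars.sh?")
```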
Makefile Workflow (Repeatable)
For CI-like reproducibility and normalized artifacts, use the built-in v7 profiling targets.
# Decode/prefill profile + perf counters + flamegraph + VTune (+ deep analyses)
make --no-print-directory profile-v7-full
# Or run the perf gate sequence (perf stat + flamegraph + VTune + budget evaluation)
make --no-print-directory v7-perf-gate
Important knobs
# Runtime mode for profile targets
V7_PERF_RUNTIME=cli|python   # default: cli
# VTune controls
V7_WITH_VTUNE=1|0            # default: 1
V7_VTUNE_DEEP=1|0            # default: 1 (hotspots + memory-access + uarch-exploration)
Native C-First Workflow
ck-cli-v7 also exposes direct profile capture for perf, vtune, and advisor.
# Build native CLI
make --no-print-directory ck-cli-v7
# Example token file (space-separated ints)
printf '1 2 3 4 5 6 7 8 9 10 11 12\n' > /tmp/v7_tokens.txt
# perf / vtune / advisor capture from native train subcommand
./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool perf \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8
./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool vtune \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8
./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool advisor \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8
This path updates run_index.json and the canonical profiling artifacts in the run directory.
Artifact Contract
| Artifact | What It Represents | Produced By |
|---|---|---|
| profile_summary.json | Top-level profile metadata / decode throughput context | ck_run_v7.py --profile, ck-cli-v7 profile |
| perf_stat_summary.json | Parsed counters and derived metrics (IPC, miss rates) | version/v7/scripts/perf_artifacts_v7.py |
| flamegraph_manifest.json | Links to perf.data, folded stacks, flamegraph SVG, top symbols | version/v7/scripts/perf_artifacts_v7.py |
| vtune_summary.json | VTune reports (hotspots + optional memory/uarch summaries) | version/v7/scripts/vtune_artifacts_v7.py |
| advisor_summary.json | Advisor project pointer for roofline analysis | ck-cli-v7 profile --tool advisor |
| perf_gate_report.json | Budget pass/fail status for throughput + hardware metrics | version/v7/scripts/perf_gate_v7.py |
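The gate logic boils down to comparing measured metrics against per-metric budgets. A simplified sketch of that shape (the actual perf_gate_v7.py schema and thresholds are not reproduced here; all names and numbers below are illustrative):

```python
def evaluate_budgets(metrics, budgets):
    """Return a per-metric pass/fail report given (bound, threshold) budgets.

    bound is "min" (metric must be >= threshold, e.g. tok/s, IPC)
    or "max" (metric must be <= threshold, e.g. miss rates).
    """
    report = {}
    for name, (bound, threshold) in budgets.items():
        value = metrics[name]
        ok = value >= threshold if bound == "min" else value <= threshold
        report[name] = {"value": value, "threshold": threshold, "pass": ok}
    # Overall gate passes only if every individual budget passes.
    report["pass"] = all(r["pass"] for r in report.values())
    return report

# Illustrative metrics and budgets (not the real v7 thresholds):
report = evaluate_budgets(
    {"train_tok_s": 5200.0, "ipc": 1.8, "cache_miss_rate": 0.07},
    {
        "train_tok_s": ("min", 5000.0),
        "ipc": ("min", 1.5),
        "cache_miss_rate": ("max", 0.05),
    },
)
print(report["pass"])  # False: cache_miss_rate 0.07 exceeds the 0.05 budget
```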
Viewer Integration
# Render existing artifacts python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_ht_threads --html-only # Active probe flow (perf + VTune + native run-dir Advisor when available) python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_ht_threads --with-probes --advisor --vtune
open_ir_visualizer.py consumes profile_summary.json, perf_stat_summary.json, flamegraph_manifest.json, vtune_summary.json, advisor_summary.json, and perf_gate_report.json when present.
How We Analyze Results
| Signal | What It Means | Typical Action |
|---|---|---|
| train_tok_s drop | Regression in end-to-end train-step throughput | Open flamegraph + VTune hotspots and isolate new wide stacks |
| Low IPC | Poor retirement efficiency (stalls/speculation/memory) | Use VTune uarch summary to split frontend/backend/speculation issues |
| High cache miss rate | Bandwidth/locality pressure in hot kernels | Improve layout blocking, prepack strategy, and memory reuse |
| Memory Bound high in VTune | Kernel limited by memory subsystem, not arithmetic | Target data movement reduction before instruction-level tuning |
| Hotspots outside expected GEMM/attention path | Unexpected overhead in runtime glue or copies | Inspect call chains and remove avoidable copies/dispatch overhead |
Regression Loop
- Pin workload and thread count (CK_NUM_THREADS, same run config).
- Capture baseline artifacts (perf + flamegraph + VTune + optional Advisor).
- Make one kernel/runtime change.
- Re-capture with the same command line.
- Compare: throughput, IPC/miss rates, and hotspot width changes.
- Update notes and keep artifacts linked in the run dir for reproducibility.
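The compare step can be mechanized as a diff of two metric dicts, such as values pulled from the baseline and current perf_stat_summary.json (field names assumed; the 3% regression threshold below is an arbitrary example, not the v7 budget):

```python
def compare_runs(baseline, candidate, regression_pct=3.0):
    """Report percent change per metric and flag throughput regressions."""
    deltas = {}
    for name, base in baseline.items():
        cur = candidate[name]
        deltas[name] = 100.0 * (cur - base) / base
    # Flag a regression when throughput drops by more than regression_pct.
    regressed = deltas.get("train_tok_s", 0.0) < -regression_pct
    return deltas, regressed

# Illustrative baseline vs. candidate metrics:
deltas, regressed = compare_runs(
    {"train_tok_s": 5000.0, "ipc": 2.0},
    {"train_tok_s": 4600.0, "ipc": 1.9},
)
print(f"train_tok_s: {deltas['train_tok_s']:+.1f}%  regression={regressed}")
# train_tok_s: -8.0%  regression=True
```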
VTune/Advisor require the oneAPI tooling to be installed (often activated via source /opt/intel/oneapi/setvars.sh). Non-root perf may also require lowering the kernel.perf_event_paranoid sysctl.