v7 Profiling and Performance Runbook
This page documents the actual profiling workflow we use for v7 training kernels: perf, flamegraphs, Intel VTune, and Intel Advisor.
This is the v7 training-focused profiling page. For generic memory/debug profiling (valgrind, cachegrind basics), see profiling.html.
Why Performance Matters
Kernel performance is not a cosmetic metric in v7. It directly controls:
- Training throughput: tokens/second and wall-clock to convergence.
- Hardware efficiency: IPC, cache behavior, and branch quality.
- Scalability: thread utilization and memory-bandwidth headroom.
- Optimization priority: where to spend kernel engineering time first.
Rule of Operation
We optimize only after parity/memory correctness is green. Performance work without correctness gates is treated as invalid.
Tool Map
| Tool | Primary Question | Typical Outputs |
|---|---|---|
| perf stat | Are we using the CPU efficiently (IPC, miss rates)? | perf_stat_summary.txt, perf_stat_summary.json |
| perf record + FlameGraph | Which functions dominate runtime? | *.data, *.folded, *.svg, flamegraph_manifest.json |
| VTune (hotspots, memory-access, uarch-exploration) | Where are microarchitectural and memory bottlenecks? | vtune_summary.json + result directories |
| Advisor (roofline) | Are kernels compute-bound or memory-bound by the roofline model? | advisor_summary.json + Advisor project dir |
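The derived metrics in the table above come from simple ratios over raw counters. A minimal sketch of those derivations, assuming counter names that mirror common perf stat events (the actual field names in perf_stat_summary.json may differ):

```python
def derive_metrics(counters: dict) -> dict:
    """Compute IPC and miss rates from raw perf counters.

    Keys mirror common `perf stat` event names; the real
    perf_stat_summary.json schema may use different names.
    """
    ipc = counters["instructions"] / counters["cycles"]
    cache_miss_rate = counters["cache-misses"] / counters["cache-references"]
    branch_miss_rate = counters["branch-misses"] / counters["branches"]
    return {
        "ipc": round(ipc, 3),
        "cache_miss_rate": round(cache_miss_rate, 4),
        "branch_miss_rate": round(branch_miss_rate, 4),
    }

# Example with made-up counter values:
metrics = derive_metrics({
    "instructions": 8_000_000_000,
    "cycles": 4_000_000_000,
    "cache-references": 200_000_000,
    "cache-misses": 10_000_000,
    "branches": 1_000_000_000,
    "branch-misses": 5_000_000,
})
print(metrics)  # ipc=2.0, cache_miss_rate=0.05, branch_miss_rate=0.005
```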
Direct Training Capture
These are direct invocations that wrap the profiler around ck_run_v7.py train.
VTune memory-access on train loop
CK_NUM_THREADS=8 vtune -collect memory-access -result-dir /tmp/v7_vtune_mem -quiet -- \
python3 version/v7/scripts/ck_run_v7.py train \
--run /tmp/v7_ht_threads --backend ck \
--train-epochs 1 --train-seq-len 8 --train-total-tokens 2048 --train-grad-accum 8 \
--train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none \
--train-json-out /tmp/v7_vtune_mem_train.json
perf record + folded stacks for flamegraph
CK_NUM_THREADS=8 perf record --all-user -F 999 --call-graph dwarf -o /tmp/v7_train_perf.data -- \
python3 version/v7/scripts/ck_run_v7.py train \
--run /tmp/v7_ht_threads --backend ck \
--train-epochs 3 --train-seq-len 8 --train-total-tokens 4096 --train-grad-accum 8 \
--train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none \
--train-json-out /tmp/v7_perf_train.json
perf script -i /tmp/v7_train_perf.data | ./FlameGraph/stackcollapse-perf.pl > /tmp/v7_train_flame.folded
./FlameGraph/flamegraph.pl /tmp/v7_train_flame.folded > /tmp/v7_train_flame.svg
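Folded stacks are plain text, one "frame1;frame2;...;leaf count" line per unique stack, so they are easy to query directly. A sketch that ranks leaf frames by self samples, similar in spirit to the top-symbol listing in flamegraph_manifest.json (this helper is illustrative, not a repo script):

```python
from collections import Counter

def top_self_symbols(folded_lines, n=5):
    """Rank leaf frames by self samples from stackcollapse-perf.pl output.

    Each line looks like: "main;train_step;gemm_kernel 1234"
    where the trailing integer is the sample count for that stack.
    """
    self_samples = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        if not stack or not count.isdigit():
            continue  # skip malformed lines
        leaf = stack.split(";")[-1]  # self time belongs to the leaf frame
        self_samples[leaf] += int(count)
    return self_samples.most_common(n)

# Example with hypothetical stacks:
folded = [
    "main;train_step;gemm_kernel 900",
    "main;train_step;softmax 150",
    "main;train_step;gemm_kernel 300",
]
print(top_self_symbols(folded))
# [('gemm_kernel', 1200), ('softmax', 150)]
```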
VTune GUI
vtune-gui /tmp/v7_vtune_mem
# Example historical runs:
# vtune-gui /tmp/v7_gui/vtune_hotspots
# vtune-gui /tmp/v7_gui/vtune_memory
Advisor Roofline
CK_NUM_THREADS=8 advisor --collect=roofline --project-dir /tmp/v7_advisor_roofline -- \
python3 version/v7/scripts/ck_run_v7.py train \
--run /tmp/v7_ht_threads --backend ck \
--train-epochs 1 --train-seq-len 8 --train-total-tokens 2048 --train-grad-accum 8 \
--train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none
# Some installs expose "advixe-cl" instead of "advisor".
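When scripting around Advisor, you can probe which entry point is on PATH before launching a collection. A small sketch (shutil.which only inspects PATH, so source the oneAPI setvars.sh first):

```python
import shutil

def find_advisor_cli():
    """Return the first Advisor CLI found on PATH, or None.

    Prefers the modern "advisor" entry point and falls back to the
    legacy "advixe-cl" name exposed by some installs.
    """
    for name in ("advisor", "advixe-cl"):
        path = shutil.which(name)
        if path:
            return path
    return None

cli = find_advisor_cli()
print(cli or "Advisor CLI not found; did you source setvars.sh?")
```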
Makefile Workflow (Repeatable)
For CI-like reproducibility and normalized artifacts, use the built-in v7 profiling targets.
# Decode/prefill profile + perf counters + flamegraph + VTune (+ deep analyses)
make --no-print-directory profile-v7-full
# Or run the perf gate sequence (perf stat + flamegraph + VTune + budget evaluation)
make --no-print-directory v7-perf-gate
Important knobs
# Runtime mode for profile targets
V7_PERF_RUNTIME=cli|python   # default: cli
# VTune controls
V7_WITH_VTUNE=1|0            # default: 1
V7_VTUNE_DEEP=1|0            # default: 1 (hotspots + memory-access + uarch-exploration)
Native C-First Workflow
ck-cli-v7 also exposes direct profile capture for perf, vtune, and advisor.
# Build native CLI
make --no-print-directory ck-cli-v7
# Example token file (space-separated ints)
printf '1 2 3 4 5 6 7 8 9 10 11 12\n' > /tmp/v7_tokens.txt
# perf / vtune / advisor capture from native train subcommand
./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool perf \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8
./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool vtune \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8
./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool advisor \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8
This path updates run_index.json and the canonical profiling artifacts in the run directory.
Artifact Contract
| Artifact | What It Represents | Produced By |
|---|---|---|
| profile_summary.json | Top-level profile metadata / decode throughput context | ck_run_v7.py --profile, ck-cli-v7 profile |
| perf_stat_summary.json | Parsed counters and derived metrics (IPC, miss rates) | version/v7/scripts/perf_artifacts_v7.py |
| flamegraph_manifest.json | Links to perf.data, folded stacks, flamegraph SVG, top symbols | version/v7/scripts/perf_artifacts_v7.py |
| vtune_summary.json | VTune reports (hotspots + optional memory/uarch summaries) | version/v7/scripts/vtune_artifacts_v7.py |
| advisor_summary.json | Advisor project pointer for roofline analysis | ck-cli-v7 profile --tool advisor |
| perf_gate_report.json | Budget pass/fail status for throughput + hardware metrics | version/v7/scripts/perf_gate_v7.py |
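The gate logic boils down to comparing measured metrics against per-metric budgets. A simplified sketch of that shape (the actual perf_gate_v7.py schema and thresholds are not reproduced here; all names and numbers below are illustrative):

```python
def evaluate_budgets(metrics, budgets):
    """Return a per-metric pass/fail report given (bound, threshold) budgets.

    bound is "min" (metric must be >= threshold, e.g. tok/s, IPC)
    or "max" (metric must be <= threshold, e.g. miss rates).
    """
    report = {}
    for name, (bound, threshold) in budgets.items():
        value = metrics[name]
        ok = value >= threshold if bound == "min" else value <= threshold
        report[name] = {"value": value, "threshold": threshold, "pass": ok}
    # Overall gate passes only if every individual budget passes.
    report["pass"] = all(r["pass"] for r in report.values())
    return report

# Illustrative metrics and budgets (not the real v7 thresholds):
report = evaluate_budgets(
    {"train_tok_s": 5200.0, "ipc": 1.8, "cache_miss_rate": 0.07},
    {
        "train_tok_s": ("min", 5000.0),
        "ipc": ("min", 1.5),
        "cache_miss_rate": ("max", 0.05),
    },
)
print(report["pass"])  # False: cache_miss_rate 0.07 exceeds the 0.05 budget
```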
Viewer Integration
# Render existing artifacts python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_ht_threads --html-only # Active probe flow (perf + VTune + native run-dir Advisor when available) python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_ht_threads --with-probes --advisor --vtune
open_ir_visualizer.py consumes profile_summary.json, perf_stat_summary.json, flamegraph_manifest.json, vtune_summary.json, advisor_summary.json, and perf_gate_report.json when present.
How We Analyze Results
| Signal | What It Means | Typical Action |
|---|---|---|
| train_tok_s drop | Regression in end-to-end train-step throughput | Open flamegraph + VTune hotspots and isolate new wide stacks |
| Low IPC | Poor retirement efficiency (stalls/speculation/memory) | Use VTune uarch summary to split frontend/backend/speculation issues |
| High cache miss rate | Bandwidth/locality pressure in hot kernels | Improve layout blocking, prepack strategy, and memory reuse |
| Memory Bound high in VTune | Kernel limited by memory subsystem, not arithmetic | Target data movement reduction before instruction-level tuning |
| Hotspots outside expected GEMM/attention path | Unexpected overhead in runtime glue or copies | Inspect call chains and remove avoidable copies/dispatch overhead |
Regression Loop
- Pin workload and thread count (CK_NUM_THREADS, same run config).
- Capture baseline artifacts (perf + flamegraph + VTune + optional Advisor).
- Make one kernel/runtime change.
- Re-capture with the same command line.
- Compare: throughput, IPC/miss rates, and hotspot width changes.
- Update notes and keep artifacts linked in the run dir for reproducibility.
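The compare step can be mechanized as a diff of two metric dicts, such as values pulled from the baseline and current perf_stat_summary.json (field names assumed; the 3% regression threshold below is an arbitrary example, not the v7 budget):

```python
def compare_runs(baseline, candidate, regression_pct=3.0):
    """Report percent change per metric and flag throughput regressions."""
    deltas = {}
    for name, base in baseline.items():
        cur = candidate[name]
        deltas[name] = 100.0 * (cur - base) / base
    # Flag a regression when throughput drops by more than regression_pct.
    regressed = deltas.get("train_tok_s", 0.0) < -regression_pct
    return deltas, regressed

# Illustrative baseline vs. candidate metrics:
deltas, regressed = compare_runs(
    {"train_tok_s": 5000.0, "ipc": 2.0},
    {"train_tok_s": 4600.0, "ipc": 1.9},
)
print(f"train_tok_s: {deltas['train_tok_s']:+.1f}%  regression={regressed}")
# train_tok_s: -8.0%  regression=True
```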
VTune/Advisor require the oneAPI tooling to be installed (often activated via source /opt/intel/oneapi/setvars.sh). Non-root perf may also require lowering the kernel.perf_event_paranoid sysctl.