v7 Profiling and Performance Runbook

This page documents the actual profiling workflow we use for v7 training kernels: perf, flamegraphs, Intel VTune, and Intel Advisor.

Scope

This is the v7 training-focused profiling page. For generic memory/debug profiling (valgrind, cachegrind basics), see profiling.html.

Why Performance Matters

Kernel performance is not a cosmetic metric in v7: it directly controls end-to-end train-step throughput (train_tok_s) and whether runs stay inside the perf-gate budgets.

Rule of Operation

We optimize only after parity/memory correctness is green. Performance work without correctness gates is treated as invalid.

Tool Map

perf stat
  Question: Are we using the CPU efficiently (IPC, miss rates)?
  Outputs:  perf_stat_summary.txt, perf_stat_summary.json

perf record + FlameGraph
  Question: Which functions dominate runtime?
  Outputs:  *.data, *.folded, *.svg, flamegraph_manifest.json

VTune (hotspots, memory-access, uarch-exploration)
  Question: Where are the microarchitectural and memory bottlenecks?
  Outputs:  vtune_summary.json + result directories

Advisor (roofline)
  Question: Are kernels compute-bound or memory-bound under the roofline model?
  Outputs:  advisor_summary.json + Advisor project dir
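The derived metrics that end up in perf_stat_summary.json (IPC, miss rates) are simple counter ratios; a minimal sketch of the arithmetic, assuming standard perf event names (the output keys here are illustrative, not the script's actual schema):

```python
def derive_metrics(counters):
    """Compute IPC and cache-miss rate from raw perf counters.

    `counters` maps standard perf event names to counts; the output
    keys below are illustrative, not the perf_stat_summary.json schema.
    """
    ipc = counters["instructions"] / counters["cycles"]
    miss_rate = counters["cache-misses"] / counters["cache-references"]
    return {"ipc": round(ipc, 3), "cache_miss_rate": round(miss_rate, 4)}

metrics = derive_metrics({
    "cycles": 4_000_000,
    "instructions": 6_000_000,
    "cache-references": 500_000,
    "cache-misses": 25_000,
})
# ipc = 1.5, cache_miss_rate = 0.05
```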

Direct Training Capture

These are direct command-line captures wrapped around ck_run_v7.py train.

VTune memory-access on train loop

CK_NUM_THREADS=8 vtune -collect memory-access -result-dir /tmp/v7_vtune_mem -quiet -- \
  python3 version/v7/scripts/ck_run_v7.py train \
    --run /tmp/v7_ht_threads --backend ck \
    --train-epochs 1 --train-seq-len 8 --train-total-tokens 2048 --train-grad-accum 8 \
    --train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none \
    --train-json-out /tmp/v7_vtune_mem_train.json
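The --train-json-out file provides the end-to-end throughput context for a capture; a minimal reader can be sketched as below, assuming hypothetical field names total_tokens and wall_time_s (check the actual schema emitted by ck_run_v7.py before relying on this):

```python
import json

def train_tok_s(path):
    # Field names "total_tokens" and "wall_time_s" are hypothetical;
    # adjust to the real schema of the --train-json-out file.
    with open(path) as f:
        stats = json.load(f)
    return stats["total_tokens"] / stats["wall_time_s"]
```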

perf record + folded stacks for flamegraph

CK_NUM_THREADS=8 perf record --all-user -F 999 --call-graph dwarf -o /tmp/v7_train_perf.data -- \
  python3 version/v7/scripts/ck_run_v7.py train \
    --run /tmp/v7_ht_threads --backend ck \
    --train-epochs 3 --train-seq-len 8 --train-total-tokens 4096 --train-grad-accum 8 \
    --train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none \
    --train-json-out /tmp/v7_perf_train.json

perf script -i /tmp/v7_train_perf.data | ./FlameGraph/stackcollapse-perf.pl > /tmp/v7_train_flame.folded
./FlameGraph/flamegraph.pl /tmp/v7_train_flame.folded > /tmp/v7_train_flame.svg
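stackcollapse-perf.pl folds the raw stacks from perf script into the one-line-per-stack "frame;frame;... count" format that flamegraph.pl consumes. Conceptually it does the following (a simplified sketch, not a replacement for the script):

```python
from collections import Counter

def collapse(stacks):
    """Fold call stacks (root-first tuples of frame names) into
    'frame;frame;... count' lines, the format flamegraph.pl reads."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{key} {n}" for key, n in sorted(counts.items())]

lines = collapse([
    ("main", "train_step", "gemm"),
    ("main", "train_step", "gemm"),
    ("main", "train_step", "attention"),
])
# ['main;train_step;attention 1', 'main;train_step;gemm 2']
```

Stack width in the SVG is proportional to these counts, which is why "new wide stacks" is the first thing to look for after a regression.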

VTune GUI

vtune-gui /tmp/v7_vtune_mem
# Example historical runs:
# vtune-gui /tmp/v7_gui/vtune_hotspots
# vtune-gui /tmp/v7_gui/vtune_memory

Advisor Roofline

CK_NUM_THREADS=8 advisor --collect=roofline --project-dir /tmp/v7_advisor_roofline -- \
  python3 version/v7/scripts/ck_run_v7.py train \
    --run /tmp/v7_ht_threads --backend ck \
    --train-epochs 1 --train-seq-len 8 --train-total-tokens 2048 --train-grad-accum 8 \
    --train-vocab 1024 --train-d-model 256 --train-hidden 1024 --profile-train none

# Some installs expose "advixe-cl" instead of "advisor".
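Advisor places each kernel on the roofline by its arithmetic intensity (FLOPs per byte moved); a kernel is memory-bound when that intensity falls below the machine balance point (peak FLOP/s divided by peak bandwidth). A sketch of the classification with illustrative machine numbers, not v7 measurements:

```python
def roofline_bound(flops, bytes_moved, peak_gflops, peak_gbs):
    """Classify a kernel as memory- or compute-bound under the roofline
    model. Peak numbers are illustrative machine parameters."""
    ai = flops / bytes_moved          # arithmetic intensity, FLOP/byte
    ridge = peak_gflops / peak_gbs    # machine balance point
    return ("memory-bound" if ai < ridge else "compute-bound"), ai

bound, ai = roofline_bound(flops=2e9, bytes_moved=1e9,
                           peak_gflops=500, peak_gbs=50)
# ai = 2.0 FLOP/byte, ridge = 10 FLOP/byte -> "memory-bound"
```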

Makefile Workflow (Repeatable)

For CI-like reproducibility and normalized artifacts, use the built-in v7 profiling targets.

# Decode/prefill profile + perf counters + flamegraph + VTune (+ deep analyses)
make --no-print-directory profile-v7-full

# Or run the perf gate sequence (perf stat + flamegraph + VTune + budget evaluation)
make --no-print-directory v7-perf-gate

Important knobs

# Runtime mode for profile targets
V7_PERF_RUNTIME=cli|python      # default: cli

# VTune controls
V7_WITH_VTUNE=1|0               # default: 1
V7_VTUNE_DEEP=1|0               # default: 1 (hotspots + memory-access + uarch-exploration)

Native C-First Workflow

ck-cli-v7 also exposes direct profile capture for perf, vtune, and advisor.

# Build native CLI
make --no-print-directory ck-cli-v7

# Example token file (space-separated ints)
printf '1 2 3 4 5 6 7 8 9 10 11 12\n' > /tmp/v7_tokens.txt

# perf / vtune / advisor capture from native train subcommand
./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool perf \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8

./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool vtune \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8

./build/ck-cli-v7 profile --run /tmp/v7_ht_threads --tool advisor \
  --train-token-file /tmp/v7_tokens.txt --train-epochs 1 --train-seq-len 8 \
  --train-total-tokens 2048 --train-grad-accum 8 --threads 8

This path updates run_index.json and the canonical profiling artifacts in the run directory.

Artifact Contract

profile_summary.json
  Represents:  Top-level profile metadata / decode throughput context
  Produced by: ck_run_v7.py --profile, ck-cli-v7 profile

perf_stat_summary.json
  Represents:  Parsed counters and derived metrics (IPC, miss rates)
  Produced by: version/v7/scripts/perf_artifacts_v7.py

flamegraph_manifest.json
  Represents:  Links to perf.data, folded stacks, flamegraph SVG, top symbols
  Produced by: version/v7/scripts/perf_artifacts_v7.py

vtune_summary.json
  Represents:  VTune reports (hotspots + optional memory/uarch summaries)
  Produced by: version/v7/scripts/vtune_artifacts_v7.py

advisor_summary.json
  Represents:  Advisor project pointer for roofline analysis
  Produced by: ck-cli-v7 profile --tool advisor

perf_gate_report.json
  Represents:  Budget pass/fail status for throughput + hardware metrics
  Produced by: version/v7/scripts/perf_gate_v7.py
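A quick way to audit a run directory against this contract is to check which canonical artifacts are present (this sketch assumes they sit at the top level of the run dir; adjust if the layout nests them):

```python
from pathlib import Path

# Canonical profiling artifacts from the contract table above.
CANONICAL = [
    "profile_summary.json",
    "perf_stat_summary.json",
    "flamegraph_manifest.json",
    "vtune_summary.json",
    "advisor_summary.json",
    "perf_gate_report.json",
]

def missing_artifacts(run_dir):
    """Return the canonical artifacts not yet present in a run directory."""
    run = Path(run_dir)
    return [name for name in CANONICAL if not (run / name).exists()]
```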

Viewer Integration

# Render existing artifacts
python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_ht_threads --html-only

# Active probe flow (perf + VTune + native run-dir Advisor when available)
python3 version/v7/tools/open_ir_visualizer.py --generate --run /tmp/v7_ht_threads --with-probes --advisor --vtune

open_ir_visualizer.py consumes profile_summary.json, perf_stat_summary.json, flamegraph_manifest.json, vtune_summary.json, advisor_summary.json, and perf_gate_report.json when present.

How We Analyze Results

train_tok_s drop
  Meaning: Regression in end-to-end train-step throughput
  Action:  Open the flamegraph and VTune hotspots; isolate new wide stacks

Low IPC
  Meaning: Poor retirement efficiency (stalls, speculation, memory)
  Action:  Use the VTune uarch summary to split frontend/backend/speculation issues

High cache miss rate
  Meaning: Bandwidth/locality pressure in hot kernels
  Action:  Improve layout blocking, prepack strategy, and memory reuse

"Memory Bound" high in VTune
  Meaning: Kernel limited by the memory subsystem, not arithmetic
  Action:  Target data-movement reduction before instruction-level tuning

Hotspots outside the expected GEMM/attention path
  Meaning: Unexpected overhead in runtime glue or copies
  Action:  Inspect call chains; remove avoidable copies and dispatch overhead
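The table above can be turned into a rough triage helper; the thresholds here are illustrative starting points, not project budgets:

```python
def triage(metrics):
    """Map profile signals to next actions, following the table above.
    Thresholds are illustrative starting points, not project budgets."""
    actions = []
    if metrics.get("ipc", float("inf")) < 1.0:
        actions.append("check VTune uarch summary: frontend/backend/speculation split")
    if metrics.get("cache_miss_rate", 0.0) > 0.10:
        actions.append("revisit layout blocking, prepacking, and memory reuse")
    if metrics.get("memory_bound_pct", 0.0) > 50:
        actions.append("reduce data movement before instruction-level tuning")
    return actions
```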

Regression Loop

  1. Pin workload and thread count (CK_NUM_THREADS, same run config).
  2. Capture baseline artifacts (perf + flamegraph + VTune + optional Advisor).
  3. Make one kernel/runtime change.
  4. Re-capture with same command line.
  5. Compare: throughput, IPC/miss rates, and hotspot width changes.
  6. Update notes and keep artifacts linked in run dir for reproducibility.
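Step 5's comparison can be sketched as a small helper; the metric names and the 2% tolerance are illustrative:

```python
def compare_runs(baseline, candidate, tol_pct=2.0):
    """Compare throughput/IPC between two captures.
    Returns per-metric relative change in percent; a drop beyond
    tol_pct flags a regression. Metric names are illustrative."""
    report = {}
    for key in ("train_tok_s", "ipc"):
        delta_pct = 100.0 * (candidate[key] - baseline[key]) / baseline[key]
        report[key] = {"delta_pct": round(delta_pct, 2),
                       "regressed": delta_pct < -tol_pct}
    return report

r = compare_runs({"train_tok_s": 1000.0, "ipc": 1.5},
                 {"train_tok_s": 940.0, "ipc": 1.45})
# train_tok_s: -6.0% -> regressed; ipc: -3.33% -> regressed
```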
Prerequisites

VTune and Advisor require the oneAPI tooling to be installed and sourced (typically via source /opt/intel/oneapi/setvars.sh). Unprivileged perf may also require lowering the kernel.perf_event_paranoid sysctl.
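A rough check of whether the current perf_event_paranoid level permits the captures above; the interpretation follows the kernel's documented levels, and the --all-user captures used in this runbook only need user-space profiling:

```python
def perf_paranoid_ok(level, want_kernel=False):
    """Rough interpretation of kernel.perf_event_paranoid for unprivileged
    perf; consult your kernel's documentation for the authoritative rules.
    Read the live value from /proc/sys/kernel/perf_event_paranoid."""
    if want_kernel:
        return level <= 1   # kernel-side samples need <= 1
    return level <= 2       # user-space-only profiling (--all-user) needs <= 2
```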