Getting Started
This page covers installation, building, and running C-Kernel-Engine from source.
Linux is the supported platform for build, parity, and profiling workflows. macOS and Windows/WSL may work for local smoke tests, but they are best-effort only and not part of the supported operator path.
Prerequisites
Required
```bash
# Linux toolchain + isolated Python env
sudo apt install build-essential git make python3 python3-venv python3-pip

# Supported bootstrap path
make v7-init
make v7-doctor
```
Use the repo-local `.venv` for all v7 Python commands. If you do not activate it, invoke scripts explicitly with `.venv/bin/python ...`.
Required Python packages (supported v7 workflows): numpy torch safetensors requests tqdm huggingface_hub gguf tokenizers transformers
This is the supported v7 bundle for run / train / parity workflows. Not every package is used on every execution path. If you prefer uv, conda, or manual pip, install that same package set into the interpreter you plan to use, then run make v7-doctor.
Manual Python Environment (pip example)
```bash
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements-v7.txt
make v7-doctor
```
Optional (for profiling)
```bash
# Valgrind (memory profiling)
sudo apt install valgrind

# perf (CPU profiling)
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Enable perf for non-root users
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# FlameGraph (visualization) - clone to ~/Programs
git clone https://github.com/brendangregg/FlameGraph.git ~/Programs/FlameGraph
```
Optional (for documentation)
```bash
# Doxygen (API docs generation)
sudo apt install doxygen
```
Building
Basic Build
```bash
# Clone the repository
git clone https://github.com/antshiv/C-Kernel-Engine.git
cd C-Kernel-Engine

# Build the main library
make

# Build the IR + codegen tool (HF config.json -> generated C)
make build/ck_ir_demo

# Optional: build the orchestrator CLI ("ck")
make ck-cli
```
Build Options
| Variable | Default | Description |
|---|---|---|
| `CC` | `gcc` | C compiler |
| `AVX_FLAGS` | auto-detected | SIMD flags (`-mavx512f`, `-mavx2`, `-mavx`, or empty) |
| `CFLAGS` | `-O3 -fPIC -fopenmp -Wall` | Compiler flags |
```bash
# Build without SIMD (scalar reference implementation)
make AVX_FLAGS=

# Force AVX2
make AVX_FLAGS="-mavx2"

# Debug build with symbols
make CFLAGS="-O0 -g -fPIC -fopenmp -Wall"
```
v7 Quick Runtime (Linux)
Plain Runtime Path
```bash
make v7-init
make v7-demo-runtime \
    V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
```
This is the shortest supported path to a working v7 runtime. It creates the repo-local virtualenv, installs Python dependencies, downloads/converts/compiles the model, and enters the local chat loop.
v8 Inference Quick Runtime (Linux)
Text Bring-Up
```bash
version/v8/scripts/cks-v8-run run \
    hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
    --context-len 1024 \
    --force-convert --force-compile \
    --generate-visualizer
```
This is the short text-family v8 path. The same runner surface now covers the promoted qwen2, qwen3, qwen3.5, gemma3, and NanBeige command shapes, with family-specific chat-template notes in the full runbook.
Scoped Multimodal Bring-Up
```bash
version/v8/scripts/cks-v8-run run \
    hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
    --mmproj ./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
    --image-path version/v8/test_assets/v8_vision_doc_card_72.png \
    --prompt "Explain this image." \
    --context-len 1024 \
    --force-convert --force-compile \
    --thinking-mode suppressed
```
Update `--mmproj` to the actual local path of your matching Qwen3-VL mmproj file.
This is the current v8 multimodal bring-up lane for the validated Qwen3-VL decoder/mmproj pair.
Full operator notes: see v8 Inference Runbook. Other multimodal families are not implied by this quick path.
Under the hood: see v8 Vision Encoder Architecture for how the encoder, bridge, and decoder are stitched together.
v7 Full Artifact Capture (Non-privileged)
Runtime + Report + Hub
```bash
make v7-init
make v7-capture-artifacts \
    V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
```
This captures the runtime plus the HTML operator artifacts without requiring privileged profiling tools.
Not included: perf / VTune / Advisor / flamegraph artifacts. Those remain separate Linux-only profiling steps and may require host tooling or elevated permissions.
v7 Profiling Dashboard (Linux)
Runtime + Artifacts + Profiling
```bash
make v7-init
make v7-profile-dashboard \
    V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
```
This is the one-command operator path for Linux hosts with profiling tools installed. It captures the runtime artifacts, then refreshes the IR report and Run Hub after perf/flamegraph collection.
Expected tooling: perf for perf-stat and FlameGraph-compatible decode profiling. If host tools are missing, the underlying profiling targets may skip portions of the capture.
Make Targets Reference
Building
| Target | Description |
|---|---|
| `make` or `make all` | Build the main library (`build/libckernel_engine.so`) |
| `make build/ck_ir_demo` | Build the IR + codegen tool |
| `make ck-cli` | Build the orchestrator CLI (`build/ck`) |
| `make test-libs` | Build per-kernel shared libraries for Python tests |
| `make clean` | Remove all built files in `build/` |
Testing
| Target | Description |
|---|---|
| `make test` | Run all Python kernel unit tests (GELU, RMSNorm, Attention, etc.) |
| `make test-bf16` | Run BF16 kernel unit tests (skips on unsupported CPUs) |
| `make tests-list` | Print the full list of Python kernel tests |
| `make test-quick` | Quick comprehensive tests (<1 min) - tiny models, basic configs |
| `make test-full` | Full comprehensive tests (5-10 min) - GQA, medium, deep, wide models |
| `make test-stress` | Stress tests (10+ min) - convergence and overfit tests |
| `make all-tests` | Run kernel tests + layer parity + tiny parity (safe defaults) |
| `make layer-parity` | Test single decoder layer forward pass vs PyTorch |
| `make tiny-parity` | Test full model training parity vs PyTorch |
End-to-End
| Target | Description |
|---|---|
| `make tiny-e2e` | Generate random weights/tokens, run tiny model forward pass |
| `make tiny-train` | Generate random data, run forward + backward + SGD update |
| `make small-e2e` | Same as tiny-e2e but with a ~10MB model |
Code Generation
| Target | Description |
|---|---|
| `make ck` | Run IR demo with default config, print forward/backward IR |
| `make emit CONFIG=path OUT=path` | Generate C runtime from config JSON |
| `make gen-specs` | Regenerate kernel specs from `kernel_maps/*.json` |
Profiling
| Target | Description |
|---|---|
| `make profile-memory` | Run Valgrind memcheck - detect memory leaks |
| `make profile-heap` | Run Valgrind massif - track heap usage over time |
| `make profile-cpu` | Run perf record - CPU profiling |
| `make profile-cache` | Run Valgrind cachegrind - cache miss analysis |
| `make flamegraph` | Generate SVG flamegraph from perf data |
Prefill + Decode (KV Cache)
For autoregressive inference, the generated runtime supports a fast prefill → decode mode using a per-layer KV cache.
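The mechanism can be illustrated with a toy single-head attention cache in Python. This is a minimal sketch with identity Q/K/V projections and a tiny head dimension, not the engine's C implementation: during prefill every prompt token's key/value is appended once, and each decode step then attends over the whole cache instead of recomputing all pairwise scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension

def attend(q, K, V):
    # Softmax attention of one query over all cached keys/values.
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def decode_step(x):
    # Append this token's key/value, then attend over the whole cache:
    # each step costs O(t * d) rather than recomputing all t^2 score pairs.
    global K_cache, V_cache
    K_cache = np.vstack([K_cache, x])
    V_cache = np.vstack([V_cache, x])
    return attend(x, K_cache, V_cache)

# Prefill: one pass over the prompt fills the cache.
prompt = rng.normal(size=(4, d))
for tok in prompt:
    decode_step(tok)

# Decode: one token per call, cached K/V are reused.
new_tok = rng.normal(size=d)
out = decode_step(new_tok)

# Sanity check: cached decode matches attention over all tokens at once.
all_toks = np.vstack([prompt, new_tok[None, :]])
assert np.allclose(out, attend(new_tok, all_toks, all_toks))
print("KV-cache decode matches full recompute")
```

The real runtime stores this cache per layer and per head; the invariant checked at the end is the same one the prefill → decode split relies on.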
Generate a libmodel.so (library mode)
```bash
# Generate a model runtime with an exported C ABI (dlopen/ctypes)
./build/ck_ir_demo path/to/config.json --emit build/model.c --emit-lib

# Compile into a self-contained shared library (link kernel sources from the manifest)
cc -O3 -fPIC -fopenmp -shared -Iinclude -o build/libmodel.so build/model.c $(cat build/model.c.kernels) -lm
```
Inference Call Sequence
```c
/* 1) Init weights */
ck_model_init("weights.bump");

/* 2) Enable KV cache (inference-only) */
ck_model_kv_cache_enable(context_window);

/* 3) Prefill prompt (full forward once) */
ck_model_embed_tokens(prompt_tokens, n);
ck_model_forward(NULL);

/* 4) Decode tokens (one token per step, uses KV cache) */
ck_model_decode(next_token, NULL);
```
Training/backprop uses the full forward+backward path and does not use KV-cache decode.
Your First Run
Step 1: Build Everything
make all test-libs
Step 2: Run Kernel Tests
make test
This runs Python unit tests comparing each kernel (GELU, RMSNorm, Attention, etc.) against PyTorch reference implementations. You should see output like:
```
Running unittest/test_gelu.py
Testing GELU forward... OK
Testing GELU backward... OK
...
All Python kernel tests completed.
```
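A parity check of this kind can be sketched in a few lines of pure Python. The snippet below compares the tanh-approximation GELU, a form commonly implemented in C kernels, against the exact erf-based definition; the tolerance is illustrative, not the repo's actual test threshold, and `math.erf` stands in for the PyTorch reference:

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi computed from the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation commonly used in C kernels.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

xs = [i / 10.0 for i in range(-50, 51)]
max_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-5, 5]: {max_diff:.2e}")
assert max_diff < 1e-3
```

The real tests do the same thing kernel-by-kernel: run the C implementation and the reference on the same inputs and assert the maximum elementwise difference is within tolerance.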
Step 3: Run End-to-End
make tiny-e2e
This:
- Generates a config for a tiny model (64 hidden, 2 layers, 256 vocab)
- Generates random weights and input tokens
- Runs the codegen to produce C code
- Compiles and runs forward pass
- Outputs logits to `build/tiny_logits.bin`
Step 4: Run Codegen Demo
make ck
This shows the IR (Intermediate Representation) that the engine generates:
```
=== Forward IR ===
CKIRGraph: layers=2, hidden_size=64
...
L0 N0 RMSNORM    outputs=[L0:N0:0] inputs=[IN]
L0 N1 LINEAR_QKV outputs=[L0:N1:0] inputs=[L0:N0]
L0 N2 ROPE       outputs=[L0:N2:0] inputs=[L0:N1]
L0 N3 ATTENTION  outputs=[L0:N3:0] inputs=[L0:N2]
...
```
Quick IR Visualizer
Generate + Open in 1 Step
```bash
RUN=$HOME/.cache/ck-engine-v7/models/train/v7_svg_assets_bpe_l24_full_e1

if [ ! -d "$RUN" ]; then
  echo "[ERROR] Run directory not found: $RUN"
  echo "Use v7-runbook.html to create a run first."
else
  .venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only
  echo "[OK] Visualizer generated: $RUN/ir_report.html"
  xdg-open "$RUN/ir_report.html" >/dev/null 2>&1 || true
fi
```
For full training+inference flow, see v7-runbook.html.
Training Parity Test
The ultimate test: verify that our C kernels produce the same gradients and weight updates as PyTorch.
Run Training Parity
make tiny-parity
This runs training steps in both C and PyTorch, comparing:
- Forward pass outputs (logits)
- Loss values (cross-entropy)
- Gradients (backward pass)
- Updated weights after SGD step
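The core idea behind checking gradients can be sketched independently of PyTorch with a finite-difference gradient check on a toy cross-entropy loss. This is illustrative, not the repo's harness; the function names and tolerance are made up for the example:

```python
import math

def softmax_xent(logits, target):
    # Numerically stable cross-entropy of softmax(logits) vs a target index.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return -math.log(exps[target] / s)

def analytic_grad(logits, target):
    # d(loss)/d(logit_i) = softmax_i - 1[i == target]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s - (1.0 if i == target else 0.0) for i, e in enumerate(exps)]

logits, target, eps = [0.2, -1.3, 0.7, 0.05], 2, 1e-5
g = analytic_grad(logits, target)
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric = (softmax_xent(bumped, target) - softmax_xent(logits, target)) / eps
    assert abs(numeric - g[i]) < 1e-4, (i, numeric, g[i])
print("gradient check passed")
```

The parity targets replace the numeric side with PyTorch autograd and the analytic side with the C backward kernels, but the pass criterion is the same: elementwise agreement within a small tolerance.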
Expected Output
```
$ make tiny-parity
...
Step 0: C loss=10.2345, PyTorch loss=10.2345, diff=1.2e-06
Step 1: C loss=9.8765, PyTorch loss=9.8765, diff=1.1e-06
...
Max weight diff: 1.2e-05
PASS: Training parity verified!
```
Quick Profiling
Memory Check
make profile-memory
With our bump allocator, you should see:
```
LEAK SUMMARY:
   definitely lost: 0 bytes in 0 blocks   <-- Clean!
```
CPU Profile + Flamegraph
```bash
# Enable perf first (one-time)
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# Run profiler
make profile-cpu

# Generate flamegraph
make flamegraph

# View in browser
firefox build/flamegraph.svg
```
What to Expect in Profile
For a transformer model:
- Short context (64 tokens): GEMM dominates (~95%) - matrix multiply is the bottleneck
- Long context (4K+ tokens): Attention dominates - O(n²) attention scaling takes over
This is normal and expected behavior for transformer architectures.
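A rough per-token FLOP model makes the crossover concrete. The constants below (a 1024-wide, 24-layer model, ~24·d² GEMM FLOPs per layer per token, ~4·n·d attention FLOPs per layer per token) are back-of-envelope assumptions for illustration, not measurements of this engine:

```python
def attention_flop_share(n_ctx, d_model=1024, n_layers=24):
    # Per-token GEMM FLOPs per layer: ~24 * d_model^2
    # (QKV + output projection + two MLP matrices at 4x width).
    gemm = n_layers * 24 * d_model ** 2
    # Per-token attention FLOPs per layer: ~4 * n_ctx * d_model
    # (scores q.K and weighted sum w.V, 2 FLOPs per multiply-add).
    attn = n_layers * 4 * n_ctx * d_model
    return attn / (attn + gemm)

for n in (64, 512, 4096, 32768):
    print(f"context {n:6d}: attention ~{100 * attention_flop_share(n):5.1f}% of FLOPs")
```

Because the GEMM term is fixed per token while the attention term grows with context length, the attention share rises from a few percent at 64 tokens toward dominance at tens of thousands, which matches the profile shape described above.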
Next Steps
- Architecture Overview - Understand the system design
- Kernel Reference - All available kernels
- Code Generation - How IR becomes C code
- PyTorch Parity - Detailed parity testing
- Profiling Guide - Advanced profiling techniques