Getting Started

Quick Start Guide
This page covers installation, building, and running C-Kernel-Engine from source.
Supported Platform
Linux is the supported platform for build, parity, and profiling workflows. macOS and Windows/WSL may work for local smoke tests, but they are best-effort only and not part of the supported operator path.

Prerequisites

Required

# Linux toolchain + isolated Python env
sudo apt install build-essential git make python3 python3-venv python3-pip

# Supported bootstrap path
make v7-init
make v7-doctor

Use the repo-local .venv for all v7 Python commands. If you do not activate it, invoke scripts explicitly with .venv/bin/python ....

Required Python packages (supported v7 workflows): numpy torch safetensors requests tqdm huggingface_hub gguf tokenizers transformers

This is the supported v7 bundle for run / train / parity workflows. Not every package is used on every execution path. If you prefer uv, conda, or manual pip, install that same package set into the interpreter you plan to use, then run make v7-doctor.
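To confirm the bundle is visible to the interpreter you plan to use, a quick standalone check (a sketch, not a replacement for make v7-doctor) can be:

```python
import importlib.util

# Supported v7 bundle from above; find_spec() probes availability without importing.
PKGS = ["numpy", "torch", "safetensors", "requests", "tqdm",
        "huggingface_hub", "gguf", "tokenizers", "transformers"]

missing = [p for p in PKGS if importlib.util.find_spec(p) is None]
if missing:
    print("missing:", ", ".join(missing))
else:
    print("all v7 packages found")
```

Run it with .venv/bin/python to check the repo-local environment specifically.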

Manual Python Environment (pip example)

python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements-v7.txt
make v7-doctor

Optional (for profiling)

# Valgrind (memory profiling)
sudo apt install valgrind

# perf (CPU profiling)
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Enable perf for non-root users
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# FlameGraph (visualization) - clone to ~/Programs
git clone https://github.com/brendangregg/FlameGraph.git ~/Programs/FlameGraph

Optional (for documentation)

# Doxygen (API docs generation)
sudo apt install doxygen

Building

Basic Build

# Clone the repository
git clone https://github.com/antshiv/C-Kernel-Engine.git
cd C-Kernel-Engine

# Build the main library
make

# Build the IR + codegen tool (HF config.json -> generated C)
make build/ck_ir_demo

# Optional: build the orchestrator CLI ("ck")
make ck-cli

Build Options

Variable    Default         Description
CC          gcc             C compiler
AVX_FLAGS   auto-detected   SIMD flags (-mavx512f, -mavx2, -mavx, or empty)
CFLAGS      -O3 -fPIC -fopenmp -Wall   Compiler flags

# Build without SIMD (scalar reference implementation)
make AVX_FLAGS=

# Force AVX2
make AVX_FLAGS="-mavx2"

# Debug build with symbols
make CFLAGS="-O0 -g -fPIC -fopenmp -Wall"

v7 Quick Runtime (Linux)

Plain Runtime Path

make v7-init
make v7-demo-runtime \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf

This is the shortest supported path to a working v7 runtime. It creates the repo-local virtualenv, installs Python dependencies, downloads/converts/compiles the model, and enters the local chat loop.

v8 Inference Quick Runtime (Linux)

Text Bring-Up

version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
  --context-len 1024 \
  --force-convert --force-compile \
  --generate-visualizer

This is the short text-family v8 path. The same runner surface now covers the promoted qwen2, qwen3, qwen3.5, gemma3, and NanBeige command shapes, with family-specific chat-template notes in the full runbook.

Scoped Multimodal Bring-Up

version/v8/scripts/cks-v8-run run \
  hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
  --mmproj ./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
  --image-path version/v8/test_assets/v8_vision_doc_card_72.png \
  --prompt "Explain this image." \
  --context-len 1024 \
  --force-convert --force-compile \
  --thinking-mode suppressed

Update --mmproj to the actual local path of your matching Qwen3-VL mmproj file.

This is the current v8 multimodal bring-up lane for the validated Qwen3-VL decoder/mmproj pair.

Full operator notes: see v8 Inference Runbook. Other multimodal families are not implied by this quick path.

Under the hood: see v8 Vision Encoder Architecture for how the encoder, bridge, and decoder are stitched together.

v7 Full Artifact Capture (Non-privileged)

Runtime + Report + Hub

make v7-init
make v7-capture-artifacts \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf

This captures the runtime plus the HTML operator artifacts without requiring privileged profiling tools.

Not included: perf / VTune / Advisor / flamegraph artifacts. Those remain separate Linux-only profiling steps and may require host tooling or elevated permissions.

v7 Profiling Dashboard (Linux)

Runtime + Artifacts + Profiling

make v7-init
make v7-profile-dashboard \
  V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf

This is the one-command operator path for Linux hosts with profiling tools installed. It captures the runtime artifacts, then refreshes the IR report and Run Hub after perf/flamegraph collection.

Expected tooling: perf for perf-stat and FlameGraph-compatible decode profiling. If host tools are missing, the underlying profiling targets may skip portions of the capture.

Make Targets Reference

Building

Target                 Description
make or make all       Build the main library (build/libckernel_engine.so)
make build/ck_ir_demo  Build the IR + codegen tool
make ck-cli            Build the orchestrator CLI (build/ck)
make test-libs         Build per-kernel shared libraries for Python tests
make clean             Remove all built files in build/

Testing

Target              Description
make test           Run all Python kernel unit tests (GELU, RMSNorm, Attention, etc.)
make test-bf16      Run BF16 kernel unit tests (skips on unsupported CPUs)
make tests-list     Print the full list of Python kernel tests
make test-quick     Quick comprehensive tests (<1 min): tiny models, basic configs
make test-full      Full comprehensive tests (5-10 min): GQA, medium, deep, wide models
make test-stress    Stress tests (10+ min): convergence and overfit tests
make all-tests      Run kernel tests + layer parity + tiny parity (safe defaults)
make layer-parity   Test single decoder layer forward pass vs PyTorch
make tiny-parity    Test full model training parity vs PyTorch

End-to-End

Target           Description
make tiny-e2e    Generate random weights/tokens, run tiny model forward pass
make tiny-train  Generate random data, run forward + backward + SGD update
make small-e2e   Same as tiny-e2e but with a ~10MB model

Code Generation

Target                        Description
make ck                       Run the IR demo with the default config, print forward/backward IR
make emit CONFIG=path OUT=path  Generate a C runtime from a config JSON
make gen-specs                Regenerate kernel specs from kernel_maps/*.json

Profiling

Target               Description
make profile-memory  Run Valgrind memcheck (detect memory leaks)
make profile-heap    Run Valgrind massif (track heap usage over time)
make profile-cpu     Run perf record (CPU profiling)
make profile-cache   Run Valgrind cachegrind (cache miss analysis)
make flamegraph      Generate an SVG flamegraph from perf data

Prefill + Decode (KV Cache)

For autoregressive inference, the generated runtime supports a fast prefill → decode mode using a per-layer KV cache.
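Conceptually, prefill populates the cache for every prompt position in one pass, and each decode step appends a single key/value pair and attends over everything cached so far. A minimal numpy sketch of the idea (illustrative only; not the engine's actual cache layout):

```python
import numpy as np

head_dim = 4
k_cache, v_cache = [], []          # per-layer cache: one entry per position

def prefill(prompt_kv):
    # Full forward over the prompt: cache K/V for every position at once.
    for k, v in prompt_kv:
        k_cache.append(k); v_cache.append(v)

def decode_step(q, k, v):
    # One new token: append its K/V, then attend over the whole cache.
    k_cache.append(k); v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(head_dim)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V                    # attention output for the new position

rng = np.random.default_rng(0)
prefill([(rng.standard_normal(head_dim), rng.standard_normal(head_dim))
         for _ in range(5)])
out = decode_step(*(rng.standard_normal(head_dim) for _ in range(3)))
print(len(k_cache), out.shape)     # cache grew by exactly one entry
```

The payoff is that decode never recomputes K/V for past positions, which is why the decode loop is much cheaper than re-running prefill per token.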

Generate a libmodel.so (library mode)

# Generate a model runtime with an exported C ABI (dlopen/ctypes)
./build/ck_ir_demo path/to/config.json --emit build/model.c --emit-lib

# Compile into a self-contained shared library (link kernel sources from the manifest)
cc -O3 -fPIC -fopenmp -shared -Iinclude -o build/libmodel.so build/model.c $(cat build/model.c.kernels) -lm

Inference Call Sequence

// 1) Init weights
ck_model_init("weights.bump");

// 2) Enable KV cache (inference-only)
ck_model_kv_cache_enable(context_window);

// 3) Prefill prompt (full forward once)
ck_model_embed_tokens(prompt_tokens, n);
ck_model_forward(NULL);

// 4) Decode tokens (one token per step, uses KV cache)
ck_model_decode(next_token, NULL);

Training/backprop uses the full forward+backward path and does not use KV-cache decode.
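From Python, the same sequence can be driven through ctypes. This is a sketch that skips cleanly when build/libmodel.so is absent; the argument shapes mirror the call sequence above and are assumptions, not a documented ABI contract:

```python
import ctypes
import os

LIB = "build/libmodel.so"   # produced by the cc command above

if not os.path.exists(LIB):
    status = f"[skip] {LIB} not found; build it first"
else:
    lib = ctypes.CDLL(LIB)
    lib.ck_model_init(b"weights.bump")                  # 1) init weights
    lib.ck_model_kv_cache_enable(ctypes.c_int(1024))    # 2) enable KV cache
    prompt = (ctypes.c_int * 3)(1, 2, 3)                # hypothetical token ids
    lib.ck_model_embed_tokens(prompt, ctypes.c_int(3))  # 3) prefill
    lib.ck_model_forward(None)
    lib.ck_model_decode(ctypes.c_int(42), None)         # 4) one decode step
    status = "[ok] decode step issued"

print(status)
```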

Your First Run

Step 1: Build Everything

make all test-libs

Step 2: Run Kernel Tests

make test

This runs Python unit tests comparing each kernel (GELU, RMSNorm, Attention, etc.) against PyTorch reference implementations. You should see output like:

Running unittest/test_gelu.py
Testing GELU forward...  OK
Testing GELU backward... OK
...
All Python kernel tests completed.

Step 3: Run End-to-End

make tiny-e2e

This:

  1. Generates a config for a tiny model (64 hidden, 2 layers, 256 vocab)
  2. Generates random weights and input tokens
  3. Runs the codegen to produce C code
  4. Compiles and runs forward pass
  5. Outputs logits to build/tiny_logits.bin
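The logits file can then be inspected from Python. The sketch below assumes a raw float32 layout of shape [seq_len, vocab] (an assumption about the file format, with seq_len=16 picked for illustration); it synthesizes a stand-in file so the parsing pattern runs anywhere:

```python
import numpy as np

seq_len, vocab = 16, 256   # vocab=256 matches the tiny config above; seq_len assumed

# Stand-in for build/tiny_logits.bin so the snippet is self-contained.
rng = np.random.default_rng(0)
rng.standard_normal((seq_len, vocab)).astype(np.float32).tofile("tiny_logits.bin")

logits = np.fromfile("tiny_logits.bin", dtype=np.float32).reshape(seq_len, vocab)
next_token = int(logits[-1].argmax())   # greedy pick from the last position
print(logits.shape, next_token)
```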

Step 4: Run Codegen Demo

make ck

This shows the IR (Intermediate Representation) that the engine generates:

=== Forward IR ===
CKIRGraph: layers=2, hidden_size=64 ...
  L0 N0 RMSNORM       outputs=[L0:N0:0]   inputs=[IN]
  L0 N1 LINEAR_QKV    outputs=[L0:N1:0]   inputs=[L0:N0]
  L0 N2 ROPE          outputs=[L0:N2:0]   inputs=[L0:N1]
  L0 N3 ATTENTION     outputs=[L0:N3:0]   inputs=[L0:N2]
  ...
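Each printed line is a graph node with a layer index, node index, op, and input edges. A toy reconstruction of that structure (illustrative only; not the real CKIRGraph types):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    layer: int
    idx: int
    op: str
    inputs: list = field(default_factory=list)

# Mirrors the first few nodes printed above.
graph = [
    Node(0, 0, "RMSNORM",    ["IN"]),
    Node(0, 1, "LINEAR_QKV", ["L0:N0"]),
    Node(0, 2, "ROPE",       ["L0:N1"]),
    Node(0, 3, "ATTENTION",  ["L0:N2"]),
]

for n in graph:
    print(f"L{n.layer} N{n.idx} {n.op:<13} "
          f"outputs=[L{n.layer}:N{n.idx}:0] inputs={n.inputs}")
```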

Quick IR Visualizer

Generate + Open in 1 Step

RUN=$HOME/.cache/ck-engine-v7/models/train/v7_svg_assets_bpe_l24_full_e1

if [ ! -d "$RUN" ]; then
  echo "[ERROR] Run directory not found: $RUN"
  echo "Use v7-runbook.html to create a run first."
else
  .venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only
  echo "[OK] Visualizer generated: $RUN/ir_report.html"
  xdg-open "$RUN/ir_report.html" >/dev/null 2>&1 || true
fi

For full training+inference flow, see v7-runbook.html.

Training Parity Test

The ultimate test: verify that our C kernels produce the same gradients and weight updates as PyTorch.

Run Training Parity

make tiny-parity

This runs the same training steps in both C and PyTorch, comparing the per-step losses and the resulting weights.

Expected Output

$ make tiny-parity
...
Step 0: C loss=10.2345, PyTorch loss=10.2345, diff=1.2e-06
Step 1: C loss=9.8765, PyTorch loss=9.8765, diff=1.1e-06
...
Max weight diff: 1.2e-05
PASS: Training parity verified!
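The comparison boils down to an element-wise max absolute difference against a tolerance. A minimal sketch of that check, with illustrative stand-in arrays and a tolerance matching the magnitudes shown above:

```python
import numpy as np

def parity_ok(c_out, torch_out, atol=1e-5):
    """Return (max_abs_diff, passed) for two weight/loss arrays."""
    diff = float(np.max(np.abs(np.asarray(c_out) - np.asarray(torch_out))))
    return diff, diff < atol

# Stand-ins playing the role of C and PyTorch weights after one step.
a = np.linspace(0.0, 1.0, 8)
b = a + 1e-6                      # tiny numerical drift between backends
diff, ok = parity_ok(a, b)
print(f"max diff={diff:.1e} pass={ok}")
```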

Quick Profiling

Memory Check

make profile-memory

With our bump allocator, you should see:

LEAK SUMMARY:
   definitely lost: 0 bytes in 0 blocks  <-- Clean!
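The clean summary follows from the allocation strategy: one arena is reserved up front, every tensor is carved out by advancing an offset, and a single free releases everything. A toy Python model of a bump allocator (conceptual; the engine's allocator is C):

```python
class BumpAllocator:
    def __init__(self, capacity):
        self.buf = bytearray(capacity)   # one upfront arena: one alloc, one free
        self.offset = 0

    def alloc(self, size, align=64):
        # Round the offset up to the alignment boundary, then bump it.
        self.offset = (self.offset + align - 1) // align * align
        start = self.offset
        if start + size > len(self.buf):
            raise MemoryError("arena exhausted")
        self.offset = start + size
        return memoryview(self.buf)[start:start + size]

arena = BumpAllocator(1 << 20)
w = arena.alloc(4096)
b = arena.alloc(100)
print(arena.offset)   # every allocation lives inside the single arena
```

Since nothing is individually malloc'd, there is nothing to individually leak.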

CPU Profile + Flamegraph

# Enable perf first (one-time)
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# Run profiler
make profile-cpu

# Generate flamegraph
make flamegraph

# View in browser
firefox build/flamegraph.svg

What to Expect in Profile

For a transformer model, expect the profile to be dominated by the matmul-heavy kernels: the QKV/output projections, attention, and MLP layers typically account for most of the CPU time, with normalization and softmax taking comparatively little. This is normal and expected behavior for transformer architectures.

Next Steps
