Getting Started
This page covers installation, building, and running C-Kernel-Engine from source.
Linux is the supported platform for build, parity, and profiling workflows. macOS and Windows/WSL may work for local smoke tests, but they are best-effort only and not part of the supported operator path.
Prerequisites
Required
```bash
# Linux toolchain + isolated Python env
sudo apt install build-essential git make python3 python3-venv python3-pip

# Supported bootstrap path
make v7-init
make v7-doctor
```
Use the repo-local `.venv` for all v7 Python commands. If you do not activate it, invoke scripts explicitly with `.venv/bin/python ...`.
Required Python packages (supported v7 workflows): numpy torch safetensors requests tqdm huggingface_hub gguf tokenizers transformers
This is the supported v7 bundle for run / train / parity workflows. Not every package is used on every execution path. If you prefer uv, conda, or manual pip, install that same package set into the interpreter you plan to use, then run make v7-doctor.
Manual Python Environment (pip example)
```bash
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements-v7.txt
make v7-doctor
```
Optional (for profiling)
```bash
# Valgrind (memory profiling)
sudo apt install valgrind

# perf (CPU profiling)
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Enable perf for non-root users
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# FlameGraph (visualization) - clone to ~/Programs
git clone https://github.com/brendangregg/FlameGraph.git ~/Programs/FlameGraph
```
Optional (for documentation)
```bash
# Doxygen (API docs generation)
sudo apt install doxygen
```
Building
Basic Build
```bash
# Clone the repository
git clone https://github.com/antshiv/C-Kernel-Engine.git
cd C-Kernel-Engine

# Build the main library
make

# Build the IR + codegen tool (HF config.json -> generated C)
make build/ck_ir_demo

# Optional: build the orchestrator CLI ("ck")
make ck-cli
```
Build Options
| Variable | Default | Description |
|---|---|---|
| `CC` | `gcc` | C compiler |
| `AVX_FLAGS` | auto-detected | SIMD flags (`-mavx512f`, `-mavx2`, `-mavx`, or empty) |
| `CFLAGS` | `-O3 -fPIC -fopenmp -Wall` | Compiler flags |
```bash
# Build without SIMD (scalar reference implementation)
make AVX_FLAGS=

# Force AVX2
make AVX_FLAGS="-mavx2"

# Debug build with symbols
make CFLAGS="-O0 -g -fPIC -fopenmp -Wall"
```
v7 Quick Runtime (Linux)
Plain Runtime Path
```bash
make v7-init
make v7-demo-runtime \
    V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
```
This is the shortest supported path to a working v7 runtime. It creates the repo-local virtualenv, installs Python dependencies, downloads/converts/compiles the model, and enters the local chat loop.
v8 Inference Quick Runtime (Linux)
Text Bring-Up
```bash
version/v8/scripts/cks-v8-run run \
    hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf \
    --context-len 1024 \
    --force-convert --force-compile \
    --generate-visualizer
```
This is the short text-family v8 path. The same runner surface now covers the promoted qwen2, qwen3, qwen3.5, gemma3, and NanBeige command shapes, with family-specific chat-template notes in the full runbook.
Scoped Multimodal Bring-Up
```bash
version/v8/scripts/cks-v8-run run \
    hf://Qwen/Qwen3-VL-8B-Instruct-GGUF/Qwen3VL-8B-Instruct-Q4_K_M.gguf \
    --mmproj ./mmproj-Qwen3VL-8B-Instruct-Q8_0.gguf \
    --image-path version/v8/test_assets/v8_vision_doc_card_72.png \
    --prompt "Explain this image." \
    --context-len 1024 \
    --force-convert --force-compile \
    --thinking-mode suppressed
```
Update `--mmproj` to the actual local path of your matching Qwen3-VL mmproj file.
This is the current v8 multimodal bring-up lane for the validated Qwen3-VL decoder/mmproj pair.
Full operator notes: see v8 Inference Runbook. Other multimodal families are not implied by this quick path.
Under the hood: see v8 Vision Encoder Architecture for how the encoder, bridge, and decoder are stitched together.
v7 Full Artifact Capture (Non-privileged)
Runtime + Report + Hub
```bash
make v7-init
make v7-capture-artifacts \
    V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
```
This captures the runtime plus the HTML operator artifacts without requiring privileged profiling tools.
Not included: perf / VTune / Advisor / flamegraph artifacts. Those remain separate Linux-only profiling steps and may require host tooling or elevated permissions.
v7 Profiling Dashboard (Linux)
Runtime + Artifacts + Profiling
```bash
make v7-init
make v7-profile-dashboard \
    V7_MODEL=hf://Qwen/Qwen3-0.6B-GGUF/Qwen3-0.6B-Q8_0.gguf
```
This is the one-command operator path for Linux hosts with profiling tools installed. It captures the runtime artifacts, then refreshes the IR report and Run Hub after perf/flamegraph collection.
Expected tooling: perf for perf-stat and FlameGraph-compatible decode profiling. If host tools are missing, the underlying profiling targets may skip portions of the capture.
Make Targets Reference
Building
| Target | Description |
|---|---|
| `make` or `make all` | Build the main library (`build/libckernel_engine.so`) |
| `make build/ck_ir_demo` | Build the IR + codegen tool |
| `make ck-cli` | Build the orchestrator CLI (`build/ck`) |
| `make test-libs` | Build per-kernel shared libraries for Python tests |
| `make clean` | Remove all built files in `build/` |
Testing
| Target | Description |
|---|---|
| `make test` | Run all Python kernel unit tests (GELU, RMSNorm, Attention, etc.) |
| `make test-bf16` | Run BF16 kernel unit tests (skips on unsupported CPUs) |
| `make tests-list` | Print the full list of Python kernel tests |
| `make test-quick` | Quick comprehensive tests (<1 min) - tiny models, basic configs |
| `make test-full` | Full comprehensive tests (5-10 min) - GQA, medium, deep, wide models |
| `make test-stress` | Stress tests (10+ min) - convergence and overfit tests |
| `make all-tests` | Run kernel tests + layer parity + tiny parity (safe defaults) |
| `make layer-parity` | Test single decoder layer forward pass vs PyTorch |
| `make tiny-parity` | Test full model training parity vs PyTorch |
End-to-End
| Target | Description |
|---|---|
| `make tiny-e2e` | Generate random weights/tokens, run tiny model forward pass |
| `make tiny-train` | Generate random data, run forward + backward + SGD update |
| `make small-e2e` | Same as tiny-e2e but with a ~10MB model |
Code Generation
| Target | Description |
|---|---|
| `make ck` | Run IR demo with default config, print forward/backward IR |
| `make emit CONFIG=path OUT=path` | Generate C runtime from config JSON |
| `make gen-specs` | Regenerate kernel specs from `kernel_maps/*.json` |
Profiling
| Target | Description |
|---|---|
| `make profile-memory` | Run Valgrind memcheck - detect memory leaks |
| `make profile-heap` | Run Valgrind massif - track heap usage over time |
| `make profile-cpu` | Run perf record - CPU profiling |
| `make profile-cache` | Run Valgrind cachegrind - cache miss analysis |
| `make flamegraph` | Generate SVG flamegraph from perf data |
Prefill + Decode (KV Cache)
For autoregressive inference, the generated runtime supports a fast prefill → decode mode using a per-layer KV cache.
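The mechanism can be illustrated with a toy single-head attention cache in Python. This is a minimal sketch with identity Q/K/V projections and a tiny head dimension, not the engine's C implementation: during prefill every prompt token's key/value is appended once, and each decode step then attends over the whole cache instead of recomputing all pairwise scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension

def attend(q, K, V):
    # Softmax attention of one query over all cached keys/values.
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

def decode_step(x):
    # Append this token's key/value, then attend over the whole cache:
    # each step costs O(t * d) rather than recomputing all t^2 score pairs.
    global K_cache, V_cache
    K_cache = np.vstack([K_cache, x])
    V_cache = np.vstack([V_cache, x])
    return attend(x, K_cache, V_cache)

# Prefill: one pass over the prompt fills the cache.
prompt = rng.normal(size=(4, d))
for tok in prompt:
    decode_step(tok)

# Decode: one token per call, cached K/V are reused.
new_tok = rng.normal(size=d)
out = decode_step(new_tok)

# Sanity check: cached decode matches attention over all tokens at once.
all_toks = np.vstack([prompt, new_tok[None, :]])
assert np.allclose(out, attend(new_tok, all_toks, all_toks))
print("KV-cache decode matches full recompute")
```

The real runtime stores this cache per layer and per head; the invariant checked at the end is the same one the prefill → decode split relies on.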
Generate a libmodel.so (library mode)
```bash
# Generate a model runtime with an exported C ABI (dlopen/ctypes)
./build/ck_ir_demo path/to/config.json --emit build/model.c --emit-lib

# Compile into a self-contained shared library (link kernel sources from the manifest)
cc -O3 -fPIC -fopenmp -shared -Iinclude -o build/libmodel.so build/model.c $(cat build/model.c.kernels) -lm
```
Inference Call Sequence
```c
/* 1) Init weights */
ck_model_init("weights.bump");

/* 2) Enable KV cache (inference-only) */
ck_model_kv_cache_enable(context_window);

/* 3) Prefill prompt (full forward once) */
ck_model_embed_tokens(prompt_tokens, n);
ck_model_forward(NULL);

/* 4) Decode tokens (one token per step, uses KV cache) */
ck_model_decode(next_token, NULL);
```
Training/backprop uses the full forward+backward path and does not use KV-cache decode.
Your First Run
Step 1: Build Everything
make all test-libs
Step 2: Run Kernel Tests
make test
This runs Python unit tests comparing each kernel (GELU, RMSNorm, Attention, etc.) against PyTorch reference implementations. You should see output like:
```
Running unittest/test_gelu.py
Testing GELU forward... OK
Testing GELU backward... OK
...
All Python kernel tests completed.
```
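A parity check of this kind can be sketched in a few lines of pure Python. The snippet below compares the tanh-approximation GELU, a form commonly implemented in C kernels, against the exact erf-based definition; the tolerance is illustrative, not the repo's actual test threshold, and `math.erf` stands in for the PyTorch reference:

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi computed from the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation commonly used in C kernels.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

xs = [i / 10.0 for i in range(-50, 51)]
max_diff = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - tanh| on [-5, 5]: {max_diff:.2e}")
assert max_diff < 1e-3
```

The real tests do the same thing kernel-by-kernel: run the C implementation and the reference on the same inputs and assert the maximum elementwise difference is within tolerance.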
Step 3: Run End-to-End
make tiny-e2e
This:
- Generates a config for a tiny model (64 hidden, 2 layers, 256 vocab)
- Generates random weights and input tokens
- Runs the codegen to produce C code
- Compiles and runs forward pass
- Outputs logits to `build/tiny_logits.bin`
Step 4: Run Codegen Demo
make ck
This shows the IR (Intermediate Representation) that the engine generates:
```
=== Forward IR ===
CKIRGraph: layers=2, hidden_size=64
...
L0 N0 RMSNORM    outputs=[L0:N0:0] inputs=[IN]
L0 N1 LINEAR_QKV outputs=[L0:N1:0] inputs=[L0:N0]
L0 N2 ROPE       outputs=[L0:N2:0] inputs=[L0:N1]
L0 N3 ATTENTION  outputs=[L0:N3:0] inputs=[L0:N2]
...
```
Quick IR Visualizer
Generate + Open in 1 Step
```bash
RUN=$HOME/.cache/ck-engine-v7/models/train/v7_svg_assets_bpe_l24_full_e1

if [ ! -d "$RUN" ]; then
  echo "[ERROR] Run directory not found: $RUN"
  echo "Use v7-runbook.html to create a run first."
else
  .venv/bin/python version/v7/tools/open_ir_visualizer.py --generate --run "$RUN" --html-only
  echo "[OK] Visualizer generated: $RUN/ir_report.html"
  xdg-open "$RUN/ir_report.html" >/dev/null 2>&1 || true
fi
```
For full training+inference flow, see v7-runbook.html.
Training Parity Test
The ultimate test: verify that our C kernels produce the same gradients and weight updates as PyTorch.
Run Training Parity
make tiny-parity
This runs training steps in both C and PyTorch, comparing:
- Forward pass outputs (logits)
- Loss values (cross-entropy)
- Gradients (backward pass)
- Updated weights after SGD step
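The core idea behind checking gradients can be sketched independently of PyTorch with a finite-difference gradient check on a toy cross-entropy loss. This is illustrative, not the repo's harness; the function names and tolerance are made up for the example:

```python
import math

def softmax_xent(logits, target):
    # Numerically stable cross-entropy of softmax(logits) vs a target index.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return -math.log(exps[target] / s)

def analytic_grad(logits, target):
    # d(loss)/d(logit_i) = softmax_i - 1[i == target]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s - (1.0 if i == target else 0.0) for i, e in enumerate(exps)]

logits, target, eps = [0.2, -1.3, 0.7, 0.05], 2, 1e-5
g = analytic_grad(logits, target)
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    numeric = (softmax_xent(bumped, target) - softmax_xent(logits, target)) / eps
    assert abs(numeric - g[i]) < 1e-4, (i, numeric, g[i])
print("gradient check passed")
```

The parity targets replace the numeric side with PyTorch autograd and the analytic side with the C backward kernels, but the pass criterion is the same: elementwise agreement within a small tolerance.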
Expected Output
```
$ make tiny-parity
...
Step 0: C loss=10.2345, PyTorch loss=10.2345, diff=1.2e-06
Step 1: C loss=9.8765, PyTorch loss=9.8765, diff=1.1e-06
...
Max weight diff: 1.2e-05
PASS: Training parity verified!
```
Quick Profiling
Memory Check
make profile-memory
With our bump allocator, you should see:
```
LEAK SUMMARY:
   definitely lost: 0 bytes in 0 blocks   <-- Clean!
```
CPU Profile + Flamegraph
```bash
# Enable perf first (one-time)
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid

# Run profiler
make profile-cpu

# Generate flamegraph
make flamegraph

# View in browser
firefox build/flamegraph.svg
```
What to Expect in Profile
For a transformer model:
- Short context (64 tokens): GEMM dominates (~95%) - matrix multiply is the bottleneck
- Long context (4K+ tokens): Attention dominates - O(n²) attention scaling takes over
This is normal and expected behavior for transformer architectures.
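A rough per-token FLOP model makes the crossover concrete. The constants below (a 1024-wide, 24-layer model, ~24·d² GEMM FLOPs per layer per token, ~4·n·d attention FLOPs per layer per token) are back-of-envelope assumptions for illustration, not measurements of this engine:

```python
def attention_flop_share(n_ctx, d_model=1024, n_layers=24):
    # Per-token GEMM FLOPs per layer: ~24 * d_model^2
    # (QKV + output projection + two MLP matrices at 4x width).
    gemm = n_layers * 24 * d_model ** 2
    # Per-token attention FLOPs per layer: ~4 * n_ctx * d_model
    # (scores q.K and weighted sum w.V, 2 FLOPs per multiply-add).
    attn = n_layers * 4 * n_ctx * d_model
    return attn / (attn + gemm)

for n in (64, 512, 4096, 32768):
    print(f"context {n:6d}: attention ~{100 * attention_flop_share(n):5.1f}% of FLOPs")
```

Because the GEMM term is fixed per token while the attention term grows with context length, the attention share rises from a few percent at 64 tokens toward dominance at tens of thousands, which matches the profile shape described above.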
Next Steps
- Architecture Overview - Understand the system design
- Kernel Reference - All available kernels
- Code Generation - How IR becomes C code
- PyTorch Parity - Detailed parity testing
- Profiling Guide - Advanced profiling techniques