Architecture Links
A comprehensive guide to all architecture documentation in C-Kernel-Engine.
Runbooks
v8 Qwen3-VL Runbook
Scoped operator runbook for the validated v8 multimodal inference path: Qwen3-VL decoder + matching mmproj.
v8 Vision Encoder Architecture
Design page for how the v8 vision encoder is derived from GGUF + template + lowering, then stitched into the decoder bridge.
v7 Inference + Training Runbook
Copy/paste workflow for HF GGUF inference plus true_bpe training, and the train → infer handoff.
v7 Profiling Runbook
Repeatable performance workflow for v7 training kernels with perf, VTune, flamegraphs, and Advisor.
v7 SVG Dataset Runbook
Operator workflow to generate Stage A pretraining and Stage B midtraining SVG corpora from docs/site/assets/*.svg, then hand off to v7 training.
Core
- System Overview: IR, Codegen, Kernels
- IR Pipeline v6.6: Templates → IR1 → IR2 → Lowering
- v7 Backprop IR Pipeline: Init → IR1 → IR2 → Layout → Codegen + canary diagnostics
- v7 Cross-Entropy Parity: p - one_hot derivation, PyTorch semantics, long-horizon drift fixes
- v7 Grad-Accum Windows: micro-batch vs effective batch, N vs K, CPU batch simulation
- v7 Train Data Pipeline: one-command dataset → tokenizer → train orchestration
- v7 Runtime Stitch Graph: function-level forward/backward/accum/optimizer stitching view
- v8 Qwen3-VL Runbook: current operator path for validated v8 multimodal inference
- v8 Vision Encoder Architecture: how GGUF intake, template lowering, memory planning, and bridge stitching power the working vision path
- v7 Inference + Training Runbook: copy/paste commands for HF GGUF inference + true_bpe training
- v7 SVG Dataset Runbook: dataset generation for Stage A pretrain and Stage B midtrain
- Model + Kernel Matrix: Qwen2/Qwen3/Gemma + kernel coverage
- Tokenizer: BPE, WordPiece, Trie
- Kernel Reference: forward/backward ops
- Code Generation: IR-to-C compilation
- Iteration Philosophy: why v1→v6 matters
- IR v2 Format: case study on symbolic dimensions
- Deep Dive Concepts: RoPE, Flash Attention, GQA
Quantization
- Quant Fundamentals: block formats, grouping
- Bit Manipulation Visuals: Q5_0, Q4_K, INT8 with spaced repetition
- Quant Format Reference: byte-level visualization
- GGUF to Bump: weight conversion
- GGUF Parsing: byte-level guide
Optimization
- GEMM Memory Layout: NN/NT layouts, offsets
- GEMM Optimization: AVX, MKL, blocking
- v7 Train Layout + Dispatch: IR3 memory + parallel execution plan
- Threadpool GEMM Playbook: split M/N/K policy for training
- SIMD Architecture: AVX-512, VNNI, AMX
- Flash Attention Analysis: why llama.cpp is faster
Infrastructure
- Memory Safety: bump allocator, canaries
- Deterministic Memory: RDMA, interpretability
- Profiling Guide: Valgrind, perf, flamegraphs
- v8 Qwen3-VL Runbook: validated decoder/mmproj workflow for multimodal inference
- v8 Vision Encoder Architecture: bridge and encoder design notes for the current multimodal inference lane
- v7 Profiling Runbook: VTune + Advisor + perf/flamegraph on train kernels
- v7 Inference + Training Runbook: operational workflow from dataset to chat output
- Testing: numerical parity verification
Temp / Work in Progress
These pages are work-in-progress and may be moved or updated.
Quantization Math Deep Dive
Explains Q5_0/Q8_0 block formats, dequantization math, and AVX-512 vectorization strategy.
GEMM Memory Layout
Covers quantized block storage, cache blocking strategies, and KV cache layouts.
Quick Navigation
By Task
| Task | Documentation |
|---|---|
| Understanding the system | System Overview, Concepts, v7 Backprop IR, v7 CE Parity, v7 Grad-Accum Windows, v7 Runbook |
| Implementing new kernels | Kernel Reference, Codegen |
| Quantization work | Quant Fundamentals, Bit Visuals, GGUF Parsing |
| Performance optimization | GEMM Layout, v7 Train Layout+Dispatch, Threadpool Playbook, SIMD |
| Debugging & profiling | Profiling, v7 Profiling Runbook, Testing, v7 CE Parity Deep Dive, v7 Runtime Stitch Graph, v7 Runbook |
| Operator training + compute workflow | v7 SVG Dataset Runbook, v7 Inference + Training Runbook, v7 Profiling Runbook |