Quantization Bit Manipulation Visuals

Interactive guide to internalize the math behind Q5_0, Q4_K, and INT8 kernels

Goal: Make Bit Manipulation Visceral

These visuals explain why the code computes qh >> (j + 12) rather than qh >> (j + 16). Step through each diagram to see exactly how bits are packed into and extracted from the quantized formats.

Quantization Overview
Q5_0 Bit Layout
Q4_K Superblock
INT8 Dot Product

Quantization Overview

  • Where quantization happens in a transformer layer
  • Bits per weight: FP32 (32) → Q8_0 (8.5) → Q5_0 (5.5) → Q4_K (4.5)
  • Kernel selection: Which kernel to use for each weight type
  • INT8 vs FP32: Why quantized activations are 4x faster
Keyboard shortcuts: 1-4 switch diagrams | +/- zoom | 0 reset | F fullscreen

Related Quantization Documentation

Quantization Deep Dive

BF16 vs FP16, block grouping, kernel dispatch strategy

Format Reference

Byte-level layout of Q4_0, Q5_0, Q8_0, K-quants

SIMD Architecture

AVX-512, VNNI, AMX for quantized inference

Kernel Catalog

All available GEMM and quantized kernels
