← Back to C-Kernel-Engine Docs Doxygen Source Documentation
qk_norm_kernels.c File Reference

Per-head RMSNorm on Q and K (Qwen3-style QK norm) More...

#include <stddef.h>

Go to the source code of this file.

Functions

void qk_norm_forward (float *q, float *k, const float *q_gamma, const float *k_gamma, int num_heads, int num_kv_heads, int num_tokens, int head_dim, float eps)
 
void rmsnorm_forward (const float *input, const float *gamma, float *output, float *rstd_cache, int tokens, int d_model, int aligned_embed_dim, float eps)
 

Detailed Description

Per-head RMSNorm on Q and K (Qwen3-style QK norm)

CK-ENGINE KERNEL RULES:

  1. NO malloc/free - memory via bump allocator, pointers passed in
  2. NO OpenMP - parallelization at orchestrator/codegen layer
  3. API must define: inputs, outputs, workspace, and memory layouts
  4. Pure computation - deterministic, no side effects

After changes: make test && python unittest/test_qk_norm.py

QK Norm normalizes each head's query/key vectors independently before RoPE is applied. This stabilizes the Q*K^T dot products that feed softmax, preventing attention collapse caused by large-magnitude vectors.

Why only Q and K, not V? V does not participate in the attention score computation (Q*K^T). The softmax saturation problem comes from large Q*K^T values, so only Q and K magnitudes matter. V is linearly combined after softmax weights are computed – normalizing it would change output scale but not fix attention stability.

Data layout after QKV projection (head-major):
Q: [num_heads, num_tokens, head_dim], contiguous.
K: [num_kv_heads, num_tokens, head_dim], contiguous.

We treat Q as [num_heads * num_tokens] rows of [head_dim] elements. rmsnorm_forward normalizes each row independently. The gamma weight [head_dim] is shared across all heads (Qwen3 design: one gamma per Q, one per K).

Definition in file qk_norm_kernels.c.

Function Documentation

◆ qk_norm_forward()

void qk_norm_forward ( float *  q,
float *  k,
const float *  q_gamma,
const float *  k_gamma,
int  num_heads,
int  num_kv_heads,
int  num_tokens,
int  head_dim,
float  eps 
)

Per-head RMSNorm on Q and K.

Parameters
qQ scratch buffer [num_heads * num_tokens * head_dim], in-place
kK scratch buffer [num_kv_heads * num_tokens * head_dim], in-place
q_gammaQ norm gamma weights [head_dim]
k_gammaK norm gamma weights [head_dim]
num_headsNumber of query heads (e.g. 32 for Qwen3-8B)
num_kv_headsNumber of KV heads (e.g. 8 for Qwen3-8B with GQA)
num_tokensNumber of tokens (1 for decode, T for prefill)
head_dimDimension per head (e.g. 128)
epsRMSNorm epsilon (e.g. 1e-6)
Test:
unittest/test_qk_norm.py

Definition at line 61 of file qk_norm_kernels.c.

65 {
66  /* Q norm: [num_heads * num_tokens] rows of [head_dim]
67  * Each row is one head's vector for one token. */
68  rmsnorm_forward(q, q_gamma, q, NULL,
69  num_heads * num_tokens, head_dim, head_dim, eps);
70 
71  /* K norm: [num_kv_heads * num_tokens] rows of [head_dim]
72  * Same logic, fewer rows when using GQA. */
73  rmsnorm_forward(k, k_gamma, k, NULL,
74  num_kv_heads * num_tokens, head_dim, head_dim, eps);
75 }
void rmsnorm_forward(const float *input, const float *gamma, float *output, float *rstd_cache, int tokens, int d_model, int aligned_embed_dim, float eps)

References rmsnorm_forward().

◆ rmsnorm_forward()

void rmsnorm_forward ( const float *  input,
const float *  gamma,
float *  output,
float *  rstd_cache,
int  tokens,
int  d_model,
int  aligned_embed_dim,
float  eps 
)

RMSNorm forward pass: normalizes each of `tokens` rows of `d_model` elements in place-or-copy, optionally caching the reciprocal standard deviation per row.

Test:

test_rmsnorm.py::TestRMSNormForward::test_fp32_tokens

test_rmsnorm.py::TestRMSNormForward::test_fp32_single

test_rmsnorm.py::TestRMSNormForward::test_perf_rolled

test_layernorm.py::TestLayerNormForward::test_rmsnorm_compat

test_parity.py::test_rmsnorm_parity

RMSNorm: y[i] = gamma[i] * x[i] / sqrt(mean(x^2) + eps)

After changes: make test && make llamacpp-parity-full

Definition at line 50 of file rmsnorm_kernels.c.

58 {
59  int T = tokens;
60  int D = d_model;
61  int aligned = aligned_embed_dim;
62 
63  for (int t = 0; t < T; ++t) {
64  const float *x = input + (size_t)t * aligned;
65  float *y = output + (size_t)t * aligned;
66 
67 #if defined(__AVX512F__)
68  // AVX-512: Process 16 floats at a time
69  __m512 sum_sq_vec = _mm512_setzero_ps();
70  int d = 0;
71 
72  // Vectorized sum of squares
73  for (; d + 16 <= D; d += 16) {
74  __m512 xv = _mm512_loadu_ps(&x[d]);
75  sum_sq_vec = _mm512_fmadd_ps(xv, xv, sum_sq_vec);
76  }
77  float sum_sq = _mm512_reduce_add_ps(sum_sq_vec);
78 
79  // Handle remaining elements
80  for (; d < D; ++d) {
81  sum_sq += x[d] * x[d];
82  }
83 
84  float mean_sq = sum_sq / (float)D;
85  float rstd = 1.0f / sqrtf(mean_sq + eps);
86  if (rstd_cache) {
87  rstd_cache[t] = rstd;
88  }
89 
90  // Apply normalization and scale (vectorized)
91  __m512 rstd_vec = _mm512_set1_ps(rstd);
92  d = 0;
93  for (; d + 16 <= D; d += 16) {
94  __m512 xv = _mm512_loadu_ps(&x[d]);
95  __m512 gv = _mm512_loadu_ps(&gamma[d]);
96  __m512 x_hat = _mm512_mul_ps(xv, rstd_vec);
97  __m512 yv = _mm512_mul_ps(x_hat, gv);
98  _mm512_storeu_ps(&y[d], yv);
99  }
100  // Handle remaining elements
101  for (; d < D; ++d) {
102  y[d] = x[d] * rstd * gamma[d];
103  }
104 
105 #elif defined(__AVX__)
106  // AVX: Process 8 floats at a time
107  __m256 sum_sq_vec = _mm256_setzero_ps();
108  int d = 0;
109 
110  // Vectorized sum of squares (no FMA in AVX1, use mul + add)
111  for (; d + 8 <= D; d += 8) {
112  __m256 xv = _mm256_loadu_ps(&x[d]);
113  __m256 xv_sq = _mm256_mul_ps(xv, xv);
114  sum_sq_vec = _mm256_add_ps(sum_sq_vec, xv_sq);
115  }
116  float sum_sq = hsum256_ps_rmsnorm(sum_sq_vec);
117 
118  // Handle remaining elements
119  for (; d < D; ++d) {
120  sum_sq += x[d] * x[d];
121  }
122 
123  float mean_sq = sum_sq / (float)D;
124  float rstd = 1.0f / sqrtf(mean_sq + eps);
125  if (rstd_cache) {
126  rstd_cache[t] = rstd;
127  }
128 
129  // Apply normalization and scale (vectorized)
130  __m256 rstd_vec = _mm256_set1_ps(rstd);
131  d = 0;
132  for (; d + 8 <= D; d += 8) {
133  __m256 xv = _mm256_loadu_ps(&x[d]);
134  __m256 gv = _mm256_loadu_ps(&gamma[d]);
135  __m256 x_hat = _mm256_mul_ps(xv, rstd_vec);
136  __m256 yv = _mm256_mul_ps(x_hat, gv);
137  _mm256_storeu_ps(&y[d], yv);
138  }
139  // Handle remaining elements
140  for (; d < D; ++d) {
141  y[d] = x[d] * rstd * gamma[d];
142  }
143 
144 #else
145  // Scalar fallback
146  double sum_sq = 0.0;
147  for (int d = 0; d < D; ++d) {
148  double v = (double)x[d];
149  sum_sq += v * v;
150  }
151  double mean_sq = sum_sq / (double)D;
152  double r = sqrt(mean_sq + (double)eps);
153  float rstd = (float)(1.0 / r);
154  if (rstd_cache) {
155  rstd_cache[t] = rstd;
156  }
157 
158  // Apply normalization and scale
159  for (int d = 0; d < D; ++d) {
160  float x_hat = x[d] * rstd;
161  y[d] = x_hat * gamma[d];
162  }
163 #endif
164 
165  // Zero padding (if any)
166  for (int d = D; d < aligned; ++d) {
167  y[d] = 0.0f;
168  }
169  }
170 }

Referenced by qk_norm_forward().