Fused RMSNorm + Q8_K Quantization kernel.
#include <immintrin.h>
#include <math.h>
#include <stdint.h>
#include <string.h>
#include "ckernel_quant.h"
Functions
| static float | hmax256_ps_fused (__m256 v) |
| static float | hsum256_ps_fused (__m256 v) |
| void | rmsnorm_q8_k_fused (const float *input, const float *gamma, void *vy, int tokens, int d_model, int aligned_embed_dim, float eps) |
Fused RMSNorm + Q8_K Quantization kernel.
After changes, run: make test && make llamacpp-parity-full
FUSION BENEFIT: Eliminates intermediate FP32 buffer between RMSNorm and quantization, keeping normalized values in registers/L1.
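As a rough guide to what is being fused, the two stages can be written per row of length $n$ = d_model and per Q8_K block of QK_K values. This is a simplified sketch, not taken from the source file: the symmetric scale of 127 and the round-to-nearest rule are assumptions, and the actual kernel may use a different sign or rounding convention to match llama.cpp.

```latex
\hat{x}_i = \frac{\gamma_i \, x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^2 + \varepsilon}},
\qquad
d = \frac{\max_{i \in \text{block}} |\hat{x}_i|}{127},
\qquad
q_i = \operatorname{round}\!\left(\frac{\hat{x}_i}{d}\right),
\qquad
\text{bsums}_g = \sum_{i \in \text{group } g} q_i
```

In the unfused form, every $\hat{x}_i$ is written to an FP32 buffer and read back by the quantizer. In the fused form, $\hat{x}_i$ is consumed as soon as it is computed.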
Definition in file rmsnorm_q8_k_fused.c.
| static float | hmax256_ps_fused (__m256 v) | inline static |
| static float | hsum256_ps_fused (__m256 v) | inline static |
Definition at line 27 of file rmsnorm_q8_k_fused.c.
Referenced by fused_rmsnorm_linear_q4k(), gemm_bias_gelu_fused(), gemm_bias_relu_fused(), gemm_bias_silu_fused(), gemm_swiglu_fused(), and rmsnorm_q8_k_fused().
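The bodies of the two helpers are not reproduced on this page. The following is a minimal sketch of how horizontal sum and horizontal max reductions over a `__m256` are commonly written with AVX intrinsics. All names ending in `_sketch` are hypothetical, and the `target("avx")` attribute assumes GCC or Clang; the real helpers may be implemented differently.

```c
#include <immintrin.h>

/* Hypothetical stand-in for hsum256_ps_fused: sum of all 8 lanes. */
__attribute__((target("avx")))
static inline float hsum256_sketch(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);         /* lanes 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);       /* lanes 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);                /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));        /* 2 partial sums in lanes 0,1 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55)); /* lane 0 += lane 1 */
    return _mm_cvtss_f32(s);
}

/* Hypothetical stand-in for hmax256_ps_fused: maximum of all 8 lanes. */
__attribute__((target("avx")))
static inline float hmax256_sketch(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 m  = _mm_max_ps(lo, hi);
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, 0x55));
    return _mm_cvtss_f32(m);
}

/* Convenience wrappers (also hypothetical) that load 8 floats from memory. */
__attribute__((target("avx")))
float sum8_sketch(const float *p) {
    return hsum256_sketch(_mm256_loadu_ps(p));
}

__attribute__((target("avx")))
float absmax8_sketch(const float *p) {
    const __m256 sign = _mm256_set1_ps(-0.0f);     /* only the sign bit set */
    /* andnot clears the sign bit, giving |x| per lane before the reduction */
    return hmax256_sketch(_mm256_andnot_ps(sign, _mm256_loadu_ps(p)));
}
```

A quantization kernel would typically use the sum reduction for the sum of squares in RMSNorm and the max reduction, applied to absolute values, to find the per-block scale.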
void rmsnorm_q8_k_fused (const float *input, const float *gamma, void *vy, int tokens, int d_model, int aligned_embed_dim, float eps)
Fused RMSNorm + Q8_K Quantization
Benefits:
Definition at line 54 of file rmsnorm_q8_k_fused.c.
References block_q8_K::bsums, block_q8_K::d, hmax256_ps_fused(), hsum256_ps_fused(), QK_K, and block_q8_K::qs.