RoPE (Rotary Position Embedding) kernels with SIMD.

#include <math.h>
#include <stddef.h>
Macros

#define M_PI 3.14159265358979323846

Functions

static void rope_apply_head(float *x, const float *cos_cache, const float *sin_cache, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
void rope_backward(const float *d_out, float *d_x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
void rope_backward_inplace(float *d_x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
void rope_backward_qk(const float *d_q_out, const float *d_k_out, float *d_q, float *d_k, const float *cos_cache, const float *sin_cache, int num_heads, int num_kv_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
void rope_forward(float *x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
void rope_forward_qk(float *q, float *k, const float *cos_cache, const float *sin_cache, int num_heads, int num_kv_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
void rope_forward_qk_strided(float *q, float *k, const float *cos_cache, const float *sin_cache, int num_heads, int num_kv_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset, int q_stride_tokens, int k_stride_tokens)
void rope_forward_strided(float *x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset, int head_stride_tokens)
void rope_precompute_cache(float *cos_cache, float *sin_cache, int max_seq_len, int head_dim, float base)
After changes: make test && make llamacpp-parity-full
Applies rotary position embeddings to query and key vectors. Used by Llama, SmolLM, and most modern transformer architectures.
Math (Llama-style rotate-half): split head_dim into two halves (0..half-1, half..head_dim-1). For each position m and index i in [0, half):

    x0 = x[i], x1 = x[i + half]
    x'[i]        = x0 * cos(m * theta_i) - x1 * sin(m * theta_i)
    x'[i + half] = x0 * sin(m * theta_i) + x1 * cos(m * theta_i)
Where theta_i = 1 / (base^(2i/d)), typically base=10000.
Layout:
    x: [num_heads, num_tokens, head_dim] head-major
    cos_cache, sin_cache: [max_seq_len, head_dim/2] precomputed
Definition in file rope_kernels.c.
#define M_PI 3.14159265358979323846
Definition at line 39 of file rope_kernels.c.
static void rope_apply_head(float *x, const float *cos_cache, const float *sin_cache, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)   [inline, static]
Definition at line 79 of file rope_kernels.c.
Referenced by rope_forward(), and rope_forward_strided().
void rope_backward(const float *d_out, float *d_x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
RoPE backward (inverse rotation)
test_rope.py::TestRoPEBackward::test_rope_backward
test_rope.py::TestRoPEBackward::test_rope_backward_vs_separate
RoPE backward: inverse rotation (rotate by -θ). Since cos(-θ) = cos(θ) and sin(-θ) = -sin(θ), with d0 = d_out[i] and d1 = d_out[i + half] in the rotate-half layout:

    d_x[i]        =  d0 * c + d1 * s
    d_x[i + half] = -d0 * s + d1 * c
After changes: make test
Definition at line 238 of file rope_kernels.c.
Referenced by rope_backward_bf16(), and rope_backward_qk().
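The inverse relationship can be checked on a single pair. This is an illustrative sketch (helper names are hypothetical): the backward rotation exactly undoes the forward one because the 2x2 rotation matrix is orthogonal.

```c
#include <math.h>

/* Forward rotation of one (a, b) pair by angle t, given c = cos(t), s = sin(t). */
static void rotate_pair(float *a, float *b, float c, float s) {
    float a0 = *a, b0 = *b;
    *a = a0 * c - b0 * s;
    *b = a0 * s + b0 * c;
}

/* Inverse rotation: same matrix with sin negated, per the identity above. */
static void rotate_pair_inv(float *a, float *b, float c, float s) {
    float a0 = *a, b0 = *b;
    *a =  a0 * c + b0 * s;
    *b = -a0 * s + b0 * c;
}
```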
void rope_backward_inplace(float *d_x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
RoPE backward in-place (overwrite with inverse rotation)
In-place backward: overwrite d_out with inverse-rotated gradients. Useful when d_x == d_out is acceptable (saves memory).
After changes: make test
Definition at line 345 of file rope_kernels.c.
void rope_backward_qk(const float *d_q_out, const float *d_k_out, float *d_q, float *d_k, const float *cos_cache, const float *sin_cache, int num_heads, int num_kv_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
RoPE backward for both dQ and dK
Combined RoPE backward for both dQ and dK gradients.
After changes: make test
Definition at line 497 of file rope_kernels.c.
References rope_backward().
Referenced by ck_layer_backward_rmsnorm_swiglu().
void rope_forward(float *x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
RoPE forward (head-major layout, in-place)
test_rope.py::TestRoPEForward::test_rope_forward
test_rope.py::TestRoPEForward::test_rope_vs_separate
test_parity.py::test_rope_parity
Applies rotary position embeddings in-place to a Q or K tensor.
    x: [num_heads, num_tokens, head_dim] head-major
After changes: make test && make llamacpp-parity-full
Definition at line 180 of file rope_kernels.c.
References rope_apply_head().
Referenced by model_layer_0_decode(), model_layer_10_decode(), model_layer_11_decode(), model_layer_12_decode(), model_layer_13_decode(), model_layer_14_decode(), model_layer_15_decode(), model_layer_16_decode(), model_layer_17_decode(), model_layer_18_decode(), model_layer_19_decode(), model_layer_1_decode(), model_layer_20_decode(), model_layer_21_decode(), model_layer_22_decode(), model_layer_23_decode(), model_layer_2_decode(), model_layer_3_decode(), model_layer_4_decode(), model_layer_5_decode(), model_layer_6_decode(), model_layer_7_decode(), model_layer_8_decode(), model_layer_9_decode(), qwen2_0_5b_decode_layer_0_decode(), qwen2_0_5b_decode_layer_10_decode(), qwen2_0_5b_decode_layer_11_decode(), qwen2_0_5b_decode_layer_12_decode(), qwen2_0_5b_decode_layer_13_decode(), qwen2_0_5b_decode_layer_14_decode(), qwen2_0_5b_decode_layer_15_decode(), qwen2_0_5b_decode_layer_16_decode(), qwen2_0_5b_decode_layer_17_decode(), qwen2_0_5b_decode_layer_18_decode(), qwen2_0_5b_decode_layer_19_decode(), qwen2_0_5b_decode_layer_1_decode(), qwen2_0_5b_decode_layer_20_decode(), qwen2_0_5b_decode_layer_21_decode(), qwen2_0_5b_decode_layer_22_decode(), qwen2_0_5b_decode_layer_23_decode(), qwen2_0_5b_decode_layer_2_decode(), qwen2_0_5b_decode_layer_3_decode(), qwen2_0_5b_decode_layer_4_decode(), qwen2_0_5b_decode_layer_5_decode(), qwen2_0_5b_decode_layer_6_decode(), qwen2_0_5b_decode_layer_7_decode(), qwen2_0_5b_decode_layer_8_decode(), qwen2_0_5b_decode_layer_9_decode(), rope_forward_bf16(), and rope_forward_qk().
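A hedged scalar sketch of the head-major traversal described above (ignoring aligned_head_dim padding and SIMD, both simplifications; `rope_forward_ref` is illustrative, not the kernel's actual code):

```c
#include <stddef.h>

/* Scalar sketch: walk heads, then tokens, rotating each head_dim vector
 * in place with the precomputed caches. The cache row for a token is
 * selected by pos_offset + t, which is how decode continues past prefill. */
static void rope_forward_ref(float *x, const float *cos_cache,
                             const float *sin_cache, int num_heads,
                             int num_tokens, int head_dim, int pos_offset) {
    int half = head_dim / 2;
    for (int h = 0; h < num_heads; h++) {
        for (int t = 0; t < num_tokens; t++) {
            float *v = x + ((size_t)h * num_tokens + t) * head_dim;
            const float *c = cos_cache + (size_t)(pos_offset + t) * half;
            const float *s = sin_cache + (size_t)(pos_offset + t) * half;
            for (int i = 0; i < half; i++) {
                float x0 = v[i], x1 = v[i + half];
                v[i]        = x0 * c[i] - x1 * s[i];
                v[i + half] = x0 * s[i] + x1 * c[i];
            }
        }
    }
}
```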
void rope_forward_qk(float *q, float *k, const float *cos_cache, const float *sin_cache, int num_heads, int num_kv_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset)
RoPE forward for both Q and K (common inference pattern)
test_rope.py::TestRoPEForward::test_rope_forward_qk
test_fused_attention_decode.py::TestFusedAttentionDecode::test_qk_rope
test_parity.py::test_rope_qk_parity
Combined RoPE forward for both Q and K in one call.
    q: [num_heads, num_tokens, head_dim]
    k: [num_kv_heads, num_tokens, head_dim]
After changes: make test && make llamacpp-parity-full
Definition at line 448 of file rope_kernels.c.
References rope_forward().
Referenced by ck_layer_forward_rmsnorm_swiglu(), ck_layer_forward_rmsnorm_swiglu_decode(), ck_layer_forward_rmsnorm_swiglu_decode_fused(), ck_layer_forward_rmsnorm_swiglu_decode_fused_attn_impl(), ck_layer_forward_rmsnorm_swiglu_decode_q4_k(), ck_layer_forward_rmsnorm_swiglu_decode_quant(), ck_layer_forward_rmsnorm_swiglu_q4_k(), ck_layer_forward_rmsnorm_swiglu_quant(), ck_layer_forward_rmsnorm_swiglu_ref(), ck_test_rope(), qwen2_0_5b_decode_layer_0_decode(), qwen2_0_5b_decode_layer_0_prefill(), qwen2_0_5b_decode_layer_10_decode(), qwen2_0_5b_decode_layer_10_prefill(), qwen2_0_5b_decode_layer_11_decode(), qwen2_0_5b_decode_layer_11_prefill(), qwen2_0_5b_decode_layer_12_decode(), qwen2_0_5b_decode_layer_12_prefill(), qwen2_0_5b_decode_layer_13_decode(), qwen2_0_5b_decode_layer_13_prefill(), qwen2_0_5b_decode_layer_14_decode(), qwen2_0_5b_decode_layer_14_prefill(), qwen2_0_5b_decode_layer_15_decode(), qwen2_0_5b_decode_layer_15_prefill(), qwen2_0_5b_decode_layer_16_decode(), qwen2_0_5b_decode_layer_16_prefill(), qwen2_0_5b_decode_layer_17_decode(), qwen2_0_5b_decode_layer_17_prefill(), qwen2_0_5b_decode_layer_18_decode(), qwen2_0_5b_decode_layer_18_prefill(), qwen2_0_5b_decode_layer_19_decode(), qwen2_0_5b_decode_layer_19_prefill(), qwen2_0_5b_decode_layer_1_decode(), qwen2_0_5b_decode_layer_1_prefill(), qwen2_0_5b_decode_layer_20_decode(), qwen2_0_5b_decode_layer_20_prefill(), qwen2_0_5b_decode_layer_21_decode(), qwen2_0_5b_decode_layer_21_prefill(), qwen2_0_5b_decode_layer_22_decode(), qwen2_0_5b_decode_layer_22_prefill(), qwen2_0_5b_decode_layer_23_decode(), qwen2_0_5b_decode_layer_23_prefill(), qwen2_0_5b_decode_layer_2_decode(), qwen2_0_5b_decode_layer_2_prefill(), qwen2_0_5b_decode_layer_3_decode(), qwen2_0_5b_decode_layer_3_prefill(), qwen2_0_5b_decode_layer_4_decode(), qwen2_0_5b_decode_layer_4_prefill(), qwen2_0_5b_decode_layer_5_decode(), qwen2_0_5b_decode_layer_5_prefill(), qwen2_0_5b_decode_layer_6_decode(), qwen2_0_5b_decode_layer_6_prefill(), 
qwen2_0_5b_decode_layer_7_decode(), qwen2_0_5b_decode_layer_7_prefill(), qwen2_0_5b_decode_layer_8_decode(), qwen2_0_5b_decode_layer_8_prefill(), qwen2_0_5b_decode_layer_9_decode(), and qwen2_0_5b_decode_layer_9_prefill().
void rope_forward_qk_strided(float *q, float *k, const float *cos_cache, const float *sin_cache, int num_heads, int num_kv_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset, int q_stride_tokens, int k_stride_tokens)
RoPE forward for both Q and K with custom strides (KV cache layouts)
test_rope.py::TestRoPEForward::test_rope_forward_qk_strided
test_kv_cache_attention.py::TestKVCacheAttention::test_qk_rope_strided
Combined QK RoPE with configurable strides for KV cache layouts.
After changes: make test
Definition at line 472 of file rope_kernels.c.
References rope_forward_strided().
Referenced by mega_fused_attention_prefill(), mega_fused_attention_prefill_q8_0(), model_layer_0_prefill(), model_layer_10_prefill(), model_layer_11_prefill(), model_layer_12_prefill(), model_layer_13_prefill(), model_layer_14_prefill(), model_layer_15_prefill(), model_layer_16_prefill(), model_layer_17_prefill(), model_layer_18_prefill(), model_layer_19_prefill(), model_layer_1_prefill(), model_layer_20_prefill(), model_layer_21_prefill(), model_layer_22_prefill(), model_layer_23_prefill(), model_layer_2_prefill(), model_layer_3_prefill(), model_layer_4_prefill(), model_layer_5_prefill(), model_layer_6_prefill(), model_layer_7_prefill(), model_layer_8_prefill(), model_layer_9_prefill(), qwen2_0_5b_decode_layer_0_prefill(), qwen2_0_5b_decode_layer_10_prefill(), qwen2_0_5b_decode_layer_11_prefill(), qwen2_0_5b_decode_layer_12_prefill(), qwen2_0_5b_decode_layer_13_prefill(), qwen2_0_5b_decode_layer_14_prefill(), qwen2_0_5b_decode_layer_15_prefill(), qwen2_0_5b_decode_layer_16_prefill(), qwen2_0_5b_decode_layer_17_prefill(), qwen2_0_5b_decode_layer_18_prefill(), qwen2_0_5b_decode_layer_19_prefill(), qwen2_0_5b_decode_layer_1_prefill(), qwen2_0_5b_decode_layer_20_prefill(), qwen2_0_5b_decode_layer_21_prefill(), qwen2_0_5b_decode_layer_22_prefill(), qwen2_0_5b_decode_layer_23_prefill(), qwen2_0_5b_decode_layer_2_prefill(), qwen2_0_5b_decode_layer_3_prefill(), qwen2_0_5b_decode_layer_4_prefill(), qwen2_0_5b_decode_layer_5_prefill(), qwen2_0_5b_decode_layer_6_prefill(), qwen2_0_5b_decode_layer_7_prefill(), qwen2_0_5b_decode_layer_8_prefill(), and qwen2_0_5b_decode_layer_9_prefill().
void rope_forward_strided(float *x, const float *cos_cache, const float *sin_cache, int num_heads, int num_tokens, int head_dim, int aligned_head_dim, int pos_offset, int head_stride_tokens)
RoPE forward with custom head stride (for KV cache layouts)
test_rope.py::TestRoPEForward::test_rope_strided
test_kv_cache_attention.py::TestKVCacheAttention::test_rope_decode
Variant with configurable head_stride_tokens for non-contiguous head layouts.
After changes: make test
Definition at line 207 of file rope_kernels.c.
References rope_apply_head().
Referenced by rope_forward_qk_strided().
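Assuming head_stride_tokens is the distance between consecutive heads measured in tokens (so the contiguous layout corresponds to head_stride_tokens == num_tokens — an assumption about the parameter, not confirmed by this page), the only addressing change versus the contiguous variant is the head base pointer:

```c
#include <stddef.h>

/* Illustrative addressing helper (hypothetical, not in rope_kernels.c):
 * head h starts head_stride_tokens token-rows after head h-1, so a KV
 * cache can reserve more token slots per head than are currently filled. */
static float *head_base(float *x, int h, int head_stride_tokens, int head_dim) {
    return x + (size_t)h * head_stride_tokens * head_dim;
}
```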
void rope_precompute_cache(float *cos_cache, float *sin_cache, int max_seq_len, int head_dim, float base)
Precompute RoPE cos/sin cache
test_rope.py::TestRoPECache::test_cache_computation
test_rope.py::TestRoPECache::test_cache_values
Precomputes cos(m * theta_i) and sin(m * theta_i) for positions 0..max_seq_len-1.
    cos_cache, sin_cache: [max_seq_len, head_dim/2]
After changes: make test
Definition at line 52 of file rope_kernels.c.
Referenced by ck_test_rope().