← Back to C-Kernel-Engine Docs Doxygen Source Documentation
axpy_kernels.c File Reference

AXPY kernels for FP32: y = y + alpha * x. More...

#include <stdint.h>
#include <stddef.h>
#include <string.h>

Go to the source code of this file.

Functions

void axpy_2d_f32 (float *Y, const float *X, float alpha, int num_tokens, int dim, int y_stride, int x_stride)
 Batched AXPY for 2D tensors: Y[t,:] += alpha * X[t,:]. More...
 
void axpy_f32 (float *y, const float *x, float alpha, int n)
 In-place AXPY: y += alpha * x. More...
 
void axpy_zero_f32 (float *y, const float *x, float alpha, int n)
 Zero output then accumulate: y = 0; y += alpha * x. More...
 
void moe_accumulate_expert_f32 (float *output, const float *expert_output, float routing_weight, int hidden_dim)
 Accumulate expert output: output += routing_weight * expert_output. More...
 
void scal_copy_f32 (float *y, const float *x, float alpha, int n)
 Scaled copy: y = alpha * x. More...
 
void weighted_sum_f32 (float *y, const float **vectors, const float *weights, int k, int n)
 Weighted sum of k vectors: y = sum_i(weights[i] * vectors[i]) More...
 

Detailed Description

AXPY kernels for FP32: y = y + alpha * x.

CK-ENGINE KERNEL RULES:

  1. NO malloc/free - memory via bump allocator, pointers passed in
  2. NO OpenMP - parallelization at orchestrator/codegen layer
  3. API must define: inputs, outputs, workspace, and memory layouts
  4. Pure computation - deterministic, no side effects

After changes: make test && make llamacpp-parity-full

Classic BLAS Level-1 operation used in MoE expert output accumulation. When gathering expert outputs: output += weight[i] * expert_output[i]

Operations:

  • axpy_f32: y += alpha * x (in-place)
  • axpy_zero_f32: y = alpha * x (destination zeroed first)
  • scal_copy_f32: y = alpha * x (scaled copy)
  • axpy_2d_f32: batched row-wise AXPY with per-row strides for non-contiguous 2D tensors
  • weighted_sum_f32: sum multiple vectors with weights
  • moe_accumulate_expert_f32: MoE expert-output accumulation wrapper

Definition in file axpy_kernels.c.

Function Documentation

◆ axpy_2d_f32()

void axpy_2d_f32 ( float *  Y,
const float *  X,
float  alpha,
int  num_tokens,
int  dim,
int  y_stride,
int  x_stride 
)

Batched AXPY for 2D tensors: Y[t,:] += alpha * X[t,:].

Parameters
Y           Output tensor [num_tokens, dim]
X           Input tensor [num_tokens, dim]
alpha       Scalar multiplier
num_tokens  Number of tokens
dim         Hidden dimension
y_stride    Stride between Y rows (for alignment); values <= 0 default to dim
x_stride    Stride between X rows; values <= 0 default to dim

Definition at line 221 of file axpy_kernels.c.

/**
 * Batched AXPY over rows of 2D tensors: Y[t,:] += alpha * X[t,:].
 *
 * @param Y          Output tensor [num_tokens, dim], accumulated in place
 * @param X          Input tensor [num_tokens, dim]
 * @param alpha      Scalar multiplier
 * @param num_tokens Number of rows (tokens)
 * @param dim        Elements per row
 * @param y_stride   Element stride between Y rows; <= 0 means densely packed (dim)
 * @param x_stride   Element stride between X rows; <= 0 means densely packed (dim)
 *
 * NULL pointers or non-positive sizes make this a no-op.
 */
void axpy_2d_f32(float *Y, const float *X, float alpha,
                 int num_tokens, int dim, int y_stride, int x_stride)
{
    if (!Y || !X || num_tokens <= 0 || dim <= 0) {
        return;
    }

    /* A non-positive stride means "rows are densely packed". */
    if (y_stride <= 0) y_stride = dim;
    if (x_stride <= 0) x_stride = dim;

    for (int t = 0; t < num_tokens; t++) {
        /* Widen to size_t BEFORE multiplying: t * stride is int * int and can
         * overflow for large tensors, which is undefined behavior. */
        axpy_f32(Y + (size_t)t * (size_t)y_stride,
                 X + (size_t)t * (size_t)x_stride,
                 alpha, dim);
    }
}
void axpy_f32(float *y, const float *x, float alpha, int n)
In-place AXPY: y += alpha * x.
Definition: axpy_kernels.c:54

References axpy_f32().

◆ axpy_f32()

void axpy_f32 ( float *  y,
const float *  x,
float  alpha,
int  n 
)

In-place AXPY: y += alpha * x.

Test:

test_axpy.py::TestAXPY::test_axpy_f32

test_axpy.py::TestAXPY::test_axpy_vs_naive

In-place scaled vector addition: y += alpha * x BLAS-like axpy operation.

After changes: make test

Definition at line 54 of file axpy_kernels.c.

/**
 * In-place AXPY: y[i] += alpha * x[i] for i in [0, n).
 *
 * Classic BLAS Level-1 saxpy. Vector tiers are tried widest-first; each
 * tier's leftover elements fall through to the next, ending in a scalar
 * tail loop. NULL pointers or n <= 0 make this a no-op.
 *
 * @param y     Accumulator vector [n], updated in place
 * @param x     Input vector [n]
 * @param alpha Scalar multiplier
 * @param n     Vector length
 */
void axpy_f32(float *y, const float *x, float alpha, int n)
{
    if (y == NULL || x == NULL || n <= 0) {
        return;
    }

    int idx = 0;

#ifdef __AVX512F__
    {
        const __m512 va16 = _mm512_set1_ps(alpha);
        while (idx + 16 <= n) {
            __m512 acc = _mm512_loadu_ps(y + idx);
            acc = _mm512_fmadd_ps(_mm512_loadu_ps(x + idx), va16, acc);
            _mm512_storeu_ps(y + idx, acc);
            idx += 16;
        }
    }
#endif

#ifdef __AVX2__
    {
        const __m256 va8 = _mm256_set1_ps(alpha);
        while (idx + 8 <= n) {
            __m256 acc = _mm256_loadu_ps(y + idx);
            acc = _mm256_fmadd_ps(_mm256_loadu_ps(x + idx), va8, acc);
            _mm256_storeu_ps(y + idx, acc);
            idx += 8;
        }
    }
#endif

    /* Scalar tail: elements not covered by a full SIMD lane. */
    while (idx < n) {
        y[idx] += alpha * x[idx];
        idx++;
    }
}

Referenced by axpy_2d_f32(), axpy_zero_f32(), moe_accumulate_expert_f32(), and weighted_sum_f32().

◆ axpy_zero_f32()

void axpy_zero_f32 ( float *  y,
const float *  x,
float  alpha,
int  n 
)

Zero output then accumulate: y = 0; y += alpha * x.

Parameters
y      Output vector [n], zeroed then accumulated
x      Input vector [n]; may be NULL, leaving y all zeros
alpha  Scalar multiplier
n      Vector length

Definition at line 188 of file axpy_kernels.c.

/**
 * Overwrite-then-accumulate: y = alpha * x, with y cleared first.
 *
 * Equivalent to memset(y, 0, ...) followed by axpy_f32(). A NULL x is
 * tolerated and leaves y as the zero vector; NULL y or n <= 0 is a no-op.
 *
 * @param y     Output vector [n], zeroed then accumulated into
 * @param x     Input vector [n] (optional; may be NULL)
 * @param alpha Scalar multiplier
 * @param n     Vector length
 */
void axpy_zero_f32(float *y, const float *x, float alpha, int n)
{
    if (y == NULL || n <= 0) {
        return;
    }

    /* Clear the destination; all-zero bytes are +0.0f in IEEE-754. */
    memset(y, 0, (size_t)n * sizeof *y);

    if (x != NULL) {
        axpy_f32(y, x, alpha, n);
    }
}

References axpy_f32().

◆ moe_accumulate_expert_f32()

void moe_accumulate_expert_f32 ( float *  output,
const float *  expert_output,
float  routing_weight,
int  hidden_dim 
)

Accumulate expert output: output += routing_weight * expert_output.

Parameters
output          Token output buffer [hidden_dim], accumulated in place
expert_output   Expert's output for this token [hidden_dim]
routing_weight  Softmax routing weight for this expert
hidden_dim      Hidden dimension

Definition at line 256 of file axpy_kernels.c.

/**
 * MoE gather step: output += routing_weight * expert_output.
 *
 * Thin, intention-revealing wrapper over axpy_f32(); all argument
 * validation (NULL pointers, non-positive length) happens there.
 *
 * @param output         Token output buffer [hidden_dim], accumulated in place
 * @param expert_output  Expert's output for this token [hidden_dim]
 * @param routing_weight Softmax routing weight for this expert
 * @param hidden_dim     Hidden dimension
 */
void moe_accumulate_expert_f32(float *output, const float *expert_output,
                               float routing_weight, int hidden_dim)
{
    axpy_f32(output, expert_output, routing_weight, hidden_dim);
}

References axpy_f32().

◆ scal_copy_f32()

void scal_copy_f32 ( float *  y,
const float *  x,
float  alpha,
int  n 
)

Scaled copy: y = alpha * x.

Parameters
y      Output vector [n]
x      Input vector [n]
alpha  Scalar multiplier
n      Vector length

Definition at line 105 of file axpy_kernels.c.

/**
 * Scaled copy: y[i] = alpha * x[i] for i in [0, n).
 *
 * Unlike axpy_f32 this overwrites y rather than accumulating into it.
 * Widest SIMD tier first, remainders cascade down to a scalar tail.
 * NULL pointers or n <= 0 make this a no-op.
 *
 * @param y     Output vector [n], overwritten
 * @param x     Input vector [n]
 * @param alpha Scalar multiplier
 * @param n     Vector length
 */
void scal_copy_f32(float *y, const float *x, float alpha, int n)
{
    if (y == NULL || x == NULL || n <= 0) {
        return;
    }

    int pos = 0;

#ifdef __AVX512F__
    {
        const __m512 scale16 = _mm512_set1_ps(alpha);
        while (pos + 16 <= n) {
            _mm512_storeu_ps(y + pos,
                             _mm512_mul_ps(_mm512_loadu_ps(x + pos), scale16));
            pos += 16;
        }
    }
#endif

#ifdef __AVX2__
    {
        const __m256 scale8 = _mm256_set1_ps(alpha);
        while (pos + 8 <= n) {
            _mm256_storeu_ps(y + pos,
                             _mm256_mul_ps(_mm256_loadu_ps(x + pos), scale8));
            pos += 8;
        }
    }
#endif

    /* Scalar tail for the remaining elements. */
    while (pos < n) {
        y[pos] = alpha * x[pos];
        pos++;
    }
}

Referenced by weighted_sum_f32().

◆ weighted_sum_f32()

void weighted_sum_f32 ( float *  y,
const float **  vectors,
const float *  weights,
int  k,
int  n 
)

Weighted sum of k vectors: y = sum_i(weights[i] * vectors[i])

Parameters
y        Output vector [n]
vectors  Array of k input vector pointers, each [n]
weights  Array of k scalar weights
k        Number of vectors to combine
n        Vector length

Definition at line 155 of file axpy_kernels.c.

160 {
161  if (!y || !vectors || !weights || k <= 0 || n <= 0) {
162  return;
163  }
164 
165  /* Initialize with first vector */
166  scal_copy_f32(y, vectors[0], weights[0], n);
167 
168  /* Accumulate rest */
169  for (int i = 1; i < k; i++) {
170  axpy_f32(y, vectors[i], weights[i], n);
171  }
172 }
void scal_copy_f32(float *y, const float *x, float alpha, int n)
Scaled copy: y = alpha * x.
Definition: axpy_kernels.c:105

References axpy_f32(), and scal_copy_f32().