AXPY kernels for FP32: y = y + alpha * x.
#include <stdint.h>
#include <stddef.h>
#include <string.h>
Functions
| void | axpy_2d_f32(float *Y, const float *X, float alpha, int num_tokens, int dim, int y_stride, int x_stride) |
| | Batched AXPY for 2D tensors: Y[t,:] += alpha * X[t,:]. |
| void | axpy_f32(float *y, const float *x, float alpha, int n) |
| | In-place AXPY: y += alpha * x. |
| void | axpy_zero_f32(float *y, const float *x, float alpha, int n) |
| | Zero output then accumulate: y = 0; y += alpha * x. |
| void | moe_accumulate_expert_f32(float *output, const float *expert_output, float routing_weight, int hidden_dim) |
| | Accumulate expert output: output += routing_weight * expert_output. |
| void | scal_copy_f32(float *y, const float *x, float alpha, int n) |
| | Scaled copy: y = alpha * x. |
| void | weighted_sum_f32(float *y, const float **vectors, const float *weights, int k, int n) |
| | Weighted sum of k vectors: y = sum_i(weights[i] * vectors[i]). |
AXPY kernels for FP32: y = y + alpha * x.
After changes, run: make test && make llamacpp-parity-full
Classic BLAS Level-1 operation, used here for MoE expert output accumulation: when gathering expert outputs, output += weight[i] * expert_output[i].
Operations:
- axpy_f32: y += alpha * x
- axpy_zero_f32: y = 0; y += alpha * x
- scal_copy_f32: y = alpha * x
- weighted_sum_f32: y = sum_i(weights[i] * vectors[i])
- axpy_2d_f32: Y[t,:] += alpha * X[t,:]
- moe_accumulate_expert_f32: output += routing_weight * expert_output
Definition in file axpy_kernels.c.
void axpy_2d_f32(float *Y, const float *X, float alpha, int num_tokens, int dim, int y_stride, int x_stride)
Batched AXPY for 2D tensors: Y[t,:] += alpha * X[t,:].
Parameters
| Y | Output tensor [num_tokens, dim] |
| X | Input tensor [num_tokens, dim] |
| alpha | Scalar multiplier |
| num_tokens | Number of tokens |
| dim | Hidden dimension |
| y_stride | Stride between Y rows (for alignment) |
| x_stride | Stride between X rows |
Definition at line 221 of file axpy_kernels.c.
References axpy_f32().
void axpy_f32(float *y, const float *x, float alpha, int n)
In-place AXPY: y += alpha * x.
Tests:
test_axpy.py::TestAXPY::test_axpy_f32
test_axpy.py::TestAXPY::test_axpy_vs_naive
In-place scaled vector addition (BLAS-like AXPY): y += alpha * x.
After changes, run: make test
Definition at line 54 of file axpy_kernels.c.
Referenced by axpy_2d_f32(), axpy_zero_f32(), moe_accumulate_expert_f32(), and weighted_sum_f32().
void axpy_zero_f32(float *y, const float *x, float alpha, int n)
Zero output then accumulate: y = 0; y += alpha * x.
Parameters
| y | Output vector [n], zeroed then accumulated |
| x | Input vector [n] |
| alpha | Scalar multiplier |
| n | Vector length |
Definition at line 188 of file axpy_kernels.c.
References axpy_f32().
void moe_accumulate_expert_f32(float *output, const float *expert_output, float routing_weight, int hidden_dim)
Accumulate expert output: output += routing_weight * expert_output.
Parameters
| output | Token output buffer [hidden_dim], accumulated in place |
| expert_output | Expert's output for this token [hidden_dim] |
| routing_weight | Softmax routing weight for this expert |
| hidden_dim | Hidden dimension |
Definition at line 256 of file axpy_kernels.c.
References axpy_f32().
void scal_copy_f32(float *y, const float *x, float alpha, int n)
Scaled copy: y = alpha * x.
Parameters
| y | Output vector [n] |
| x | Input vector [n] |
| alpha | Scalar multiplier |
| n | Vector length |
Definition at line 105 of file axpy_kernels.c.
Referenced by weighted_sum_f32().
void weighted_sum_f32(float *y, const float **vectors, const float *weights, int k, int n)
Weighted sum of k vectors: y = sum_i(weights[i] * vectors[i]).
Parameters
| y | Output vector [n] |
| vectors | Array of k input vector pointers, each [n] |
| weights | Array of k scalar weights |
| k | Number of vectors to combine |
| n | Vector length |
Definition at line 155 of file axpy_kernels.c.
References axpy_f32(), and scal_copy_f32().