GELU activation kernels with SIMD (SSE/AVX/AVX512)

#include <math.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include "bf16_utils.h"
Functions:
  void geglu_backward_fp32(const float *x, const float *d_out, float *d_x, int tokens, int dim)
  void geglu_forward_bf16(const uint16_t *x, uint16_t *out, int tokens, int dim, float *scratch)
  void geglu_forward_fp32(const float *x, float *out, int tokens, int dim)
  void gelu_backward_exact(const float *input, const float *d_output, float *d_input, size_t n)
  void gelu_backward_fast(const float *input, const float *d_output, float *d_input, size_t n)
  void gelu_backward_scalar(const float *input, const float *d_output, float *d_input, size_t n)
  void gelu_exact_inplace(float *data, size_t n)
  void gelu_fast_inplace(float *data, size_t n)
After changes: make test && make llamacpp-parity-full
GELU: y = x * 0.5 * (1 + erf(x / sqrt(2)))
Fast approx: y = x * sigmoid(1.702 * x)
Definition in file gelu_kernels.c.
void geglu_backward_fp32(const float *x, const float *d_out, float *d_x, int tokens, int dim)
GeGLU backward pass (fp32)
Computes dL/dx given dL/d(out), where out = GELU(a) * b.
Chain rule:
  dL/da = dL/dout * d(GELU)/da * b
  dL/db = dL/dout * GELU(a)
After changes: make test
Definition at line 843 of file gelu_kernels.c.
Referenced by ck_test_geglu_backward().
void geglu_forward_bf16(const uint16_t *x, uint16_t *out, int tokens, int dim, float *scratch)
GeGLU forward pass (bf16)
BF16 version: converts to FP32, computes, converts back. Caller provides scratch buffer of size 3 * tokens * dim * sizeof(float).
Layout:
Note: We need separate buffers for input and output to avoid overlap when tokens > 1. The input is 2*dim per token, output is dim per token.
After changes: make test
Definition at line 813 of file gelu_kernels.c.
References bf16_tensor_to_float(), float_tensor_to_bf16(), and geglu_forward_fp32().
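The exact scratch layout is not spelled out above; given the 3 * tokens * dim sizing and the note that the input is 2*dim per token while the output is dim per token, one plausible partition is the following (an assumption for illustration, not confirmed by the source):

```c
#include <stddef.h>

/* Hypothetical split of the 3 * tokens * dim float scratch buffer:
 * an fp32 copy of the bf16 input, followed by the fp32 output. */
static void partition_scratch(float *scratch, int tokens, int dim,
                              float **x_f32, float **out_f32) {
    *x_f32   = scratch;                            /* 2 * tokens * dim floats */
    *out_f32 = scratch + (size_t)2 * tokens * dim; /* tokens * dim floats     */
}
```

Keeping the regions disjoint is what avoids the input/output overlap the note warns about.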
void geglu_forward_fp32(const float *x, float *out, int tokens, int dim)
GeGLU forward pass (fp32)
Computes out = GELU(a) * b, where x = [a, b] along the last dimension.
Input shape: [tokens, 2 * dim]; output shape: [tokens, dim].
After changes: make test
Definition at line 623 of file gelu_kernels.c.
Referenced by ck_test_geglu(), and geglu_forward_bf16().
void gelu_backward_exact(const float *input, const float *d_output, float *d_input, size_t n)
void gelu_backward_fast(const float *input, const float *d_output, float *d_input, size_t n)
Definition at line 486 of file gelu_kernels.c.
Referenced by gelu_backward_fast_bf16().
void gelu_backward_scalar(const float *input, const float *d_output, float *d_input, size_t n)
void gelu_exact_inplace(float *data, size_t n)
Definition at line 446 of file gelu_kernels.c.
Referenced by gelu_fast_inplace_bf16(), and mlp_token_parallel_exact().
void gelu_fast_inplace(float *data, size_t n)
GELU activation forward (fast approximation, in-place)
test_gelu.py::TestGELUForward::test_gelu_fast_inplace
test_gelu.py::TestGELUForward::test_gelu_vs_exact
test_parity.py::test_gelu_parity
Fast GELU approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
Operates in place on a contiguous buffer.
After changes: make test && make llamacpp-parity-full
Definition at line 132 of file gelu_kernels.c.
Referenced by mlp_token_parallel().