What is the purpose of `GGML_F32_STEP` and `GGML_F16_STEP`? #386
-
I can tell they're used to size the main vectorized loops in the SIMD routines (with `GGML_F32_ARR = GGML_F32_STEP / GGML_F32_EPR` vector registers processed per iteration). As an example, let's look at `ggml_vec_dot_f32`:

```c
inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;

    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };

    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_STEP) {
        for (int j = 0; j < GGML_F32_ARR; j++) {
            ax[j] = GGML_F32_VEC_LOAD(x + i + j*GGML_F32_EPR);
            ay[j] = GGML_F32_VEC_LOAD(y + i + j*GGML_F32_EPR);

            sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
        }
    }

    // reduce sum0..sum3 to sum0
    GGML_F32_VEC_REDUCE(sumf, sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }

    *s = sumf;
}
```

For starters, we can flatten the two main loops into a single one and simplify the index computation:

```c
inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;

    const int np = (n & ~(GGML_F32_STEP - 1));

    GGML_F32_VEC sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };

    GGML_F32_VEC ax[GGML_F32_ARR];
    GGML_F32_VEC ay[GGML_F32_ARR];

    for (int i = 0; i < np; i += GGML_F32_EPR) {
        int j = i % GGML_F32_ARR;

        ax[j] = GGML_F32_VEC_LOAD(x + i);
        ay[j] = GGML_F32_VEC_LOAD(y + i);

        sum[j] = GGML_F32_VEC_FMA(sum[j], ax[j], ay[j]);
    }

    // reduce sum0..sum3 to sum0
    GGML_F32_VEC_REDUCE(sumf, sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }

    *s = sumf;
}
```

Now, it looks like we don't really need the array of accumulators (or `GGML_F32_STEP`) at all:

```c
inline static void ggml_vec_dot_f32(const int n, float * restrict s, const float * restrict x, const float * restrict y) {
    ggml_float sumf = 0.0;

    const int np = n - (n % GGML_F32_EPR);

    GGML_F32_VEC sum = GGML_F32_VEC_ZERO;

    for (int i = 0; i < np; i += GGML_F32_EPR) {
        GGML_F32_VEC ax = GGML_F32_VEC_LOAD(x + i);
        GGML_F32_VEC ay = GGML_F32_VEC_LOAD(y + i);

        sum = GGML_F32_VEC_FMA(sum, ax, ay);
    }

    // GGML_F32_VEC_REDUCE expects an array of GGML_F32_ARR vectors,
    // so wrap the single accumulator in one before reducing
    GGML_F32_VEC __temp_for_sum[GGML_F32_ARR] = { GGML_F32_VEC_ZERO };
    __temp_for_sum[0] = sum;
    GGML_F32_VEC_REDUCE(sumf, __temp_for_sum);

    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i]*y[i];
    }

    *s = sumf;
}
```

The results? Before:

After:

So using `GGML_F32_STEP` doesn't seem to make a measurable difference on my machine. So why don't we remove it? Is there a performance reason behind it that isn't visible on my system? I'm running an Intel Celeron N4120 with SSE3 and BLAS. I'd appreciate it if someone could test this on a PC that has better performance than a potato, unlike mine. A version of the code with the changes I made above can be found at https://github.com/abitofevrything/whisper.cpp/tree/remove_step. Note that I have not made the changes necessary for POWER9 as I couldn't find enough documentation online on how to reimplement GGML_F16_VEC_LOAD without the
-
I was working on `ggml_vec_dot_f16()` this morning. Did not get a significant improvement by flattening the nested loop. Here's what I came up with, but my reduction piece is not ready for prime time.

```c
inline static void ggml_vec_dot_f16(const int n, float * restrict s, ggml_fp16_t * restrict x, ggml_fp16_t * restrict y) {
#if defined(GGML_SIMD)
    // ...
#endif
}
```
-
I've just added support back for POWER9 (I think). @ggerganov, hope you don't mind the mention, but do you have any explanation for `GGML_F32_STEP` and `GGML_F16_STEP`?