This patch introduces erf, erfc, lgamma, tgamma and sinpi. #23
Conversation
…sinpi is just a by-product. It also fixes the following minor bugs:
* TRIGRANGEMAX3 should be used in sleefld.c and sleefqp.c.
* fma could return inf if the true return value is above 1e+303, instead of 1e+304.
src/arch/helperadvsimd.h
@@ -350,6 +351,28 @@ static INLINE vdouble vsel_vd_vo_vd_vd(vopmask mask, vdouble x, vdouble y) {
  return vbslq_f64(vreinterpretq_u64_u32(mask), x, y);
}

static INLINE CONST vdouble vsel_vd_vo_d_d(vopmask o, double d0, double d1) {
Why can't you use the same implementation as the one in the AVX target here? It seems very expensive to me to create the `idx` vector each time the function is called.
The idx vector is common to the same series of coefficients, so it is generated only once. You can see the assembly output from gcc-6.3.0 via the following link.
https://www.dropbox.com/s/n71fn42cscgzlbs/sleefsimddp.advsimd.s.txt?dl=0
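For readers following the thread, here is a minimal sketch of what a TBL-based two-way select can look like on AdvSIMD. This is an illustration of the idea, not the exact code from this patch; the `vopmask`/`vdouble` typedefs mirror the helper header shown above, and the function name is hypothetical.

```c
#include <arm_neon.h>

typedef uint32x4_t vopmask;   /* all-ones or all-zeros per 64-bit lane */
typedef float64x2_t vdouble;

/* Select d0 where the mask is set, d1 elsewhere, via one table lookup.
   The table holds both constants; the idx vector depends only on the mask,
   so it can be shared (via CSE) across a whole series of coefficient pairs. */
static inline vdouble vsel_vd_vo_d_d_tbl(vopmask o, double d0, double d1) {
  /* Bytes 0-7 of the table hold d0, bytes 8-15 hold d1. */
  uint8x16_t tab = vreinterpretq_u8_f64(vcombine_f64(vdup_n_f64(d0), vdup_n_f64(d1)));

  /* Byte offsets within one 64-bit lane, repeated for both output lanes. */
  static const uint8_t base[16] = {0,1,2,3,4,5,6,7, 0,1,2,3,4,5,6,7};
  uint8x16_t idx = vld1q_u8(base);

  /* Where the mask is clear, add 8 so the lookup lands in the d1 half. */
  idx = vorrq_u8(idx, vbicq_u8(vdupq_n_u8(8), vreinterpretq_u8_u32(o)));

  return vreinterpretq_f64_u8(vqtbl1q_u8(tab, idx));
}
```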
The assembly file is too long to check. :)
If I understood correctly, the idx is generated only once because the function is inlined. Is that the case? Even so, the generic implementation in the AVX header file seems to need fewer instructions. Am I missing something?
It's "common subexpression elimination." All modern compilers remove redundant code that computes the same result.
https://en.wikipedia.org/wiki/Common_subexpression_elimination
We need to do benchmarking to get the definitive answer as to which implementation is faster, but the generic implementation for choosing between 4 values requires 4 loads into 4 different registers and three blending instructions. The tbl implementation requires two loads and one tbl instruction, in addition to two adrp instructions.
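For context, the generic implementation being referred to has roughly this shape. This is a sketch, not a copy of the actual header; `vcast_vd_d` is an assumed broadcast helper, and the `_generic` suffix is only for illustration.

```c
#include <arm_neon.h>

typedef uint32x4_t vopmask;
typedef float64x2_t vdouble;

/* Bit-select, as in the diff above. */
static inline vdouble vsel_vd_vo_vd_vd(vopmask m, vdouble x, vdouble y) {
  return vbslq_f64(vreinterpretq_u64_u32(m), x, y);
}

/* Assumed broadcast helper. */
static inline vdouble vcast_vd_d(double d) { return vdupq_n_f64(d); }

/* Generic 4-way select: four broadcasts and three blends, one per mask. */
static inline vdouble vsel_vd_vo_vo_vo_d_d_d_d_generic(vopmask o0, vopmask o1, vopmask o2,
                                                       double d0, double d1, double d2, double d3) {
  return vsel_vd_vo_vd_vd(o0, vcast_vd_d(d0),
         vsel_vd_vo_vd_vd(o1, vcast_vd_d(d1),
         vsel_vd_vo_vd_vd(o2, vcast_vd_d(d2), vcast_vd_d(d3))));
}
```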
I benchmarked both implementations. It looks like the generic implementation is around 10% faster. The following values show the execution time of each function in microseconds.
TBL
Sleef_lgammad2_u10advsimd, 0.553044
Sleef_tgammad2_u10advsimd, 0.572347
Sleef_erfd2_u10advsimd, 0.21031
Sleef_erfcd2_u15advsimd, 0.286495
Generic
Sleef_lgammad2_u10advsimd, 0.50789
Sleef_tgammad2_u10advsimd, 0.519268
Sleef_erfd2_u10advsimd, 0.187486
Sleef_erfcd2_u15advsimd, 0.256248
  return (vdouble) vqtbl1q_u8(tab, idx);
}

static INLINE vdouble vsel_vd_vo_vo_vo_d_d_d_d(vopmask o0, vopmask o1, vopmask o2, double d0, double d1, double d2, double d3) {
Same here - I think this is not optimal. Isn't the AVX code better?
  t = vsel_vd_vo_vd_vd(o2, vrec_vd_vd(x.x), ddnormalize_vd2_vd2(ddadd2_vd2_vd2_vd(x, vsel_vd_vo_d_d(o0, -1, -2))).x);

  u = vsel_vd_vo_vo_d_d_d(o2, o0, -156.801412704022726379848862, +0.2947916772827614196e+2, +0.7074816000864609279e-7);
  u = vmla_vd_vd_vd_vd(u, t, vsel_vd_vo_vo_d_d_d(o2, o0, +1.120804464289911606838558160000, +0.1281459691827820109e+3, +0.4009244333008730443e-6));
This is going to be expensive on aarch64 (lines 1950 to 1971). It would be good to write an `#ifdef indexed_vmla` version for those architectures that can run the code with fewer loads by using indexed mla instructions (see e.g. `vfmad_lane_f64`).
I am fine with you leaving the code as it is for now, but please add a TODO comment in the source about the potential optimization.
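To illustrate what an indexed-MLA formulation could look like, here is a hypothetical sketch (not part of this patch). Note that the lane index of `vfmaq_laneq_f64` must be a compile-time constant, so adapting it to the mask-driven selection above would need some restructuring.

```c
#include <arm_neon.h>

/* One polynomial step: load a pair of coefficients with a single load and
   multiply by a chosen lane directly inside the FMA, instead of broadcasting
   each constant separately and blending with vsel. */
static inline float64x2_t poly_step(float64x2_t acc, float64x2_t t, const double c[2]) {
  float64x2_t coef = vld1q_f64(c);          /* one load for two coefficients        */
  return vfmaq_laneq_f64(acc, t, coef, 0);  /* acc + t * coef[0], FMLA (by element) */
}
```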
Do you mean using a gather instruction to choose a value from four values? Both SVE and AVX2 have permutation instructions, which should be less expensive than a gather instruction. So I think that's pointless.
Basically, a large part of the computation time is actually the latency of FMA. My intention for this part is to hide the time taken by load and permutation in this latency. Load is actually pretty fast. Permutation takes time if there is no hardware support.
If you are concerned about SVE with a 128-bit vector length, I would introduce a conditional branch that checks the vector length. If the vector length is 128 bits, the generic implementation should work well. If the vector length is 256 bits or longer, then the TBL implementation would be good. For each FMA, one 256-bit load and one TBL are required.
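A rough sketch of that vector-length dispatch (illustrative only; the function name and structure are not from the patch):

```c
#include <arm_sve.h>

/* Select d0 where the predicate is true, d1 elsewhere.  svcntd() returns the
   number of 64-bit lanes per SVE vector at run time. */
static svfloat64_t sel_d_d(svbool_t o, double d0, double d1) {
  if (svcntd() == 2) {
    /* 128-bit vectors: a plain predicated select works well. */
    return svsel_f64(o, svdup_n_f64(d0), svdup_n_f64(d1));
  }
  /* 256-bit or wider: table lookup.  The index vector depends only on the
     predicate, so it can be shared across a series of coefficient pairs. */
  svfloat64_t tab = svzip1_f64(svdup_n_f64(d0), svdup_n_f64(d1)); /* {d0, d1, d0, d1, ...} */
  svuint64_t  idx = svsel_u64(o, svdup_n_u64(0), svdup_n_u64(1)); /* 0 -> d0, 1 -> d1      */
  return svtbl_f64(tab, idx);
}
```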
Maybe it is easier to understand what I mean if I show a code snippet. The generated AVX2 assembly code for the corresponding part is as follows.
vpermd .LC20(%rip), %ymm6, %ymm2
vfmadd132pd %ymm0, %ymm2, %ymm1
vpermd .LC21(%rip), %ymm6, %ymm2
vfmadd231pd %ymm0, %ymm1, %ymm2
vpermd .LC22(%rip), %ymm6, %ymm1
vfmadd132pd %ymm0, %ymm1, %ymm2
vpermd .LC23(%rip), %ymm6, %ymm1
vfmadd231pd %ymm0, %ymm2, %ymm1
vpermd .LC24(%rip), %ymm6, %ymm2
vfmadd132pd %ymm0, %ymm2, %ymm1
However, my benchmarking result shows that the latency of vpermd is not completely hidden by the latency of vfmadd. The overall computation speed is not bad compared to SVML, though.
The code looks good, it is just that there is a potential optimization here for AArch64 because we have a vector-scalar FMA instruction, which allows loading the coefficients into vector registers and using fewer `vsel` operations. All I am asking is to add a TODO comment saying:
/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */
  u = vsel_vd_vo_vd_vd(o0, vmul_vd_vd_vd(a, a), a);

  t = vsel_vd_vo_vo_d_d_d(o0, o1, +0.6801072401395392157e-20, +0.2830954522087717660e-13, -0.5846750404269610493e-17);
  t = vmla_vd_vd_vd_vd(t, u, vsel_vd_vo_vo_d_d_d(o0, o1, -0.2161766247570056391e-18, -0.1509491946179481940e-11, +0.6076691048812607898e-15));
Same here, please add a TODO comment about the potential optimization that can be done for architectures with indexed mla instructions.
/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */
  u = vsel_vd2_vo_vd2_vd2(o0, ddmul_vd2_vd_vd(a, a), vsel_vd2_vo_vd2_vd2(o1, vcast_vd2_vd_vd(a, vcast_vd_d(0)), dddiv_vd2_vd2_vd2(vcast_vd2_d_d(1, 0), vcast_vd2_vd_vd(a, vcast_vd_d(0)))));

  t = vsel_vd_vo_vo_vo_d_d_d_d(o0, o1, o2, +0.6801072401395386139e-20, +0.3438010341362585303e-12, -0.5757819536420710449e+2, +0.2334249729638701319e+5);
Same here, please add a TODO comment for the potential indexed mla optimization.
/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */
@@ -130,11 +130,11 @@ static INLINE vfloat vsqrt_vf_vf(vfloat d) { return vsqrtq_f32(d); }

 // Multiply accumulate: z = z + x * y
 static INLINE vfloat vmla_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) {
-  return vmlaq_f32(z, x, y);
+  return vfmaq_f32(z, x, y);
What's the reason behind this change?
We can safely assume that FMA is faster than any other combination of multiplication and addition. I saw the assembly output from the compiler, and vmlaq is converted into a multiplication and an addition.
Oh! Strange. Could you please provide a minimal example that shows what seems to be an inconsistent behavior? You might have found a bug in gcc.
Multiply-accumulate instructions are all fused in aarch64, so this is not a bug.
This LGTM, I was just asking you to provide an example of code that generated a (separate) multiplication and addition from the `vmlaq_f32` intrinsic.
No worries if you don't have time.
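As a standalone illustration of the semantic difference the thread is discussing (not code from the patch): `vmlaq_f32` has non-fused multiply-then-add semantics, so the compiler may legally emit FMUL + FADD, while `vfmaq_f32` requests a single fused FMLA with no intermediate rounding.

```c
#include <arm_neon.h>

float32x4_t mla_nonfused(float32x4_t z, float32x4_t x, float32x4_t y) {
  return vmlaq_f32(z, x, y);   /* z + x*y, may round the product before the add */
}

float32x4_t mla_fused(float32x4_t z, float32x4_t x, float32x4_t y) {
  return vfmaq_f32(z, x, y);   /* z + x*y as one fused operation */
}
```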
 }
 // Multiply subtract: z = z - x * y
 static INLINE vfloat vmlanp_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) {
-  return vmlsq_f32(z, x, y);
+  return vfmsq_f32(z, x, y);
Why?
…generic implementation is faster on aarch64, so the generic implementation is now the default on aarch64.
* This patch fixes the issue of exp(1.0) not being correctly rounded, which occurred only when FMA is available.
* This patch also includes a small performance optimization for the expk and expk2 functions.
…F, which is probably faster than the generic implementation.