This patch introduces erf, erfc, lgamma, tgamma and sinpi. #23
Conversation
…sinpi is just a by-product. It also fixes the following minor bugs:
* TRIGRANGEMAX3 should be used in sleefld.c and sleefqp.c.
* fma could return inf if the true return value is above 1e+303, instead of 1e+304.
src/arch/helperadvsimd.h
@@ -350,6 +351,28 @@ static INLINE vdouble vsel_vd_vo_vd_vd(vopmask mask, vdouble x, vdouble y) {
  return vbslq_f64(vreinterpretq_u64_u32(mask), x, y);
}

static INLINE CONST vdouble vsel_vd_vo_d_d(vopmask o, double d0, double d1) {
Why can't you use the same implementation as the one in the AVX target here? It seems very expensive to me to create the `idx` vector each time the function is called.
The idx vector is common to the same series of coefficients, so it is generated only once. You can see the assembly output from gcc-6.3.0 via the following link.
https://www.dropbox.com/s/n71fn42cscgzlbs/sleefsimddp.advsimd.s.txt?dl=0
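For readers following the thread, here is a minimal sketch of what a TBL-based two-way select can look like on AdvSIMD. This is an illustration of the idea, not the exact code from this patch; the `vopmask`/`vdouble` typedefs mirror the helper header shown above, and the function name is hypothetical.

```c
#include <arm_neon.h>

typedef uint32x4_t vopmask;   /* all-ones or all-zeros per 64-bit lane */
typedef float64x2_t vdouble;

/* Select d0 where the mask is set, d1 elsewhere, via one table lookup.
   The table holds both constants; the idx vector depends only on the mask,
   so it can be shared (via CSE) across a whole series of coefficient pairs. */
static inline vdouble vsel_vd_vo_d_d_tbl(vopmask o, double d0, double d1) {
  /* Bytes 0-7 of the table hold d0, bytes 8-15 hold d1. */
  uint8x16_t tab = vreinterpretq_u8_f64(vcombine_f64(vdup_n_f64(d0), vdup_n_f64(d1)));

  /* Byte offsets within one 64-bit lane, repeated for both output lanes. */
  static const uint8_t base[16] = {0,1,2,3,4,5,6,7, 0,1,2,3,4,5,6,7};
  uint8x16_t idx = vld1q_u8(base);

  /* Where the mask is clear, add 8 so the lookup lands in the d1 half. */
  idx = vorrq_u8(idx, vbicq_u8(vdupq_n_u8(8), vreinterpretq_u8_u32(o)));

  return vreinterpretq_f64_u8(vqtbl1q_u8(tab, idx));
}
```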
The assembly file is too long to check. :)
If I understood correctly, the idx is generated only once because the function is inlined. Is that the case? Even so, the generic implementation in the AVX header file seems to need fewer instructions. Am I missing something?
It's "common subexpression elimination." All modern compilers remove redundant code that computes the same result.
https://en.wikipedia.org/wiki/Common_subexpression_elimination
We need to do benchmarking to get the definitive answer as to which implementation is faster, but the generic implementation for choosing between 4 values requires 4 loads into 4 different registers and three blending instructions. The tbl implementation requires two loads and one tbl instruction, in addition to two adrp instructions.
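For context, the generic implementation being referred to has roughly this shape. This is a sketch, not a copy of the actual header; `vcast_vd_d` is an assumed broadcast helper, and the `_generic` suffix is only for illustration.

```c
#include <arm_neon.h>

typedef uint32x4_t vopmask;
typedef float64x2_t vdouble;

/* Bit-select, as in the diff above. */
static inline vdouble vsel_vd_vo_vd_vd(vopmask m, vdouble x, vdouble y) {
  return vbslq_f64(vreinterpretq_u64_u32(m), x, y);
}

/* Assumed broadcast helper. */
static inline vdouble vcast_vd_d(double d) { return vdupq_n_f64(d); }

/* Generic 4-way select: four broadcasts and three blends, one per mask. */
static inline vdouble vsel_vd_vo_vo_vo_d_d_d_d_generic(vopmask o0, vopmask o1, vopmask o2,
                                                       double d0, double d1, double d2, double d3) {
  return vsel_vd_vo_vd_vd(o0, vcast_vd_d(d0),
         vsel_vd_vo_vd_vd(o1, vcast_vd_d(d1),
         vsel_vd_vo_vd_vd(o2, vcast_vd_d(d2), vcast_vd_d(d3))));
}
```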
I benchmarked both implementations. It looks like the generic implementation is around 10% faster. The following values show the execution time of each function in microseconds.
TBL
Sleef_lgammad2_u10advsimd, 0.553044
Sleef_tgammad2_u10advsimd, 0.572347
Sleef_erfd2_u10advsimd, 0.21031
Sleef_erfcd2_u15advsimd, 0.286495
Generic
Sleef_lgammad2_u10advsimd, 0.50789
Sleef_tgammad2_u10advsimd, 0.519268
Sleef_erfd2_u10advsimd, 0.187486
Sleef_erfcd2_u15advsimd, 0.256248
  return (vdouble) vqtbl1q_u8(tab, idx);
}

static INLINE vdouble vsel_vd_vo_vo_vo_d_d_d_d(vopmask o0, vopmask o1, vopmask o2, double d0, double d1, double d2, double d3) {
Same here - I think this is not optimal. Isn't the AVX code better?
  t = vsel_vd_vo_vd_vd(o2, vrec_vd_vd(x.x), ddnormalize_vd2_vd2(ddadd2_vd2_vd2_vd(x, vsel_vd_vo_d_d(o0, -1, -2))).x);

  u = vsel_vd_vo_vo_d_d_d(o2, o0, -156.801412704022726379848862, +0.2947916772827614196e+2, +0.7074816000864609279e-7);
  u = vmla_vd_vd_vd_vd(u, t, vsel_vd_vo_vo_d_d_d(o2, o0, +1.120804464289911606838558160000, +0.1281459691827820109e+3, +0.4009244333008730443e-6));
This is going to be expensive on aarch64 (lines 1950 to 1971). It would be good to write an `#ifdef indexed_vmla` version for those architectures that can run the code with fewer loads by using indexed mla instructions (see e.g. `vfmad_lane_f64`).
I am fine with you leaving the code as it is for now, but please add a TODO comment in the source about the potential optimization.
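To illustrate what an indexed-MLA formulation could look like, here is a hypothetical sketch (not part of this patch). Note that the lane index of `vfmaq_laneq_f64` must be a compile-time constant, so adapting it to the mask-driven selection above would need some restructuring.

```c
#include <arm_neon.h>

/* One polynomial step: load a pair of coefficients with a single load and
   multiply by a chosen lane directly inside the FMA, instead of broadcasting
   each constant separately and blending with vsel. */
static inline float64x2_t poly_step(float64x2_t acc, float64x2_t t, const double c[2]) {
  float64x2_t coef = vld1q_f64(c);          /* one load for two coefficients        */
  return vfmaq_laneq_f64(acc, t, coef, 0);  /* acc + t * coef[0], FMLA (by element) */
}
```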
Do you mean using a gather instruction to choose a value from four values? Both SVE and AVX2 have permutation instructions, which should be less expensive than a gather instruction. So I think that's pointless.
Basically, a large part of the computation time is actually the latency of FMA. My intention for this part is to hide the time taken by load and permutation in this latency. Load is actually pretty fast. Permutation takes time if there is no hardware support.
If you are concerned about SVE with a 128-bit vector length, I would introduce a conditional branch that checks the vector length. If the vector length is 128 bits, the generic implementation should work well. If the vector length is 256 bits or longer, then the TBL implementation would be good. For each FMA, one 256-bit load and one TBL are required.
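A rough sketch of that vector-length dispatch (illustrative only; the function name and structure are not from the patch):

```c
#include <arm_sve.h>

/* Select d0 where the predicate is true, d1 elsewhere.  svcntd() returns the
   number of 64-bit lanes per SVE vector at run time. */
static svfloat64_t sel_d_d(svbool_t o, double d0, double d1) {
  if (svcntd() == 2) {
    /* 128-bit vectors: a plain predicated select works well. */
    return svsel_f64(o, svdup_n_f64(d0), svdup_n_f64(d1));
  }
  /* 256-bit or wider: table lookup.  The index vector depends only on the
     predicate, so it can be shared across a series of coefficient pairs. */
  svfloat64_t tab = svzip1_f64(svdup_n_f64(d0), svdup_n_f64(d1)); /* {d0, d1, d0, d1, ...} */
  svuint64_t  idx = svsel_u64(o, svdup_n_u64(0), svdup_n_u64(1)); /* 0 -> d0, 1 -> d1      */
  return svtbl_f64(tab, idx);
}
```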
Maybe it is easier to understand what I mean if I show a code snippet. The generated AVX2 assembly code for the corresponding part is as follows.
vpermd .LC20(%rip), %ymm6, %ymm2
vfmadd132pd %ymm0, %ymm2, %ymm1
vpermd .LC21(%rip), %ymm6, %ymm2
vfmadd231pd %ymm0, %ymm1, %ymm2
vpermd .LC22(%rip), %ymm6, %ymm1
vfmadd132pd %ymm0, %ymm1, %ymm2
vpermd .LC23(%rip), %ymm6, %ymm1
vfmadd231pd %ymm0, %ymm2, %ymm1
vpermd .LC24(%rip), %ymm6, %ymm2
vfmadd132pd %ymm0, %ymm2, %ymm1
However, my benchmarking result shows that the latency of vpermd is not completely hidden by the latency of vfmadd. The overall computation speed is not bad compared to SVML, though.
The code looks good, it is just that there is a potential optimization here for AArch64 because we have a vector-scalar FMA instruction, which allows loading the coefficients into vector registers and using fewer `vsel` operations. All I am asking is to add a TODO comment saying:
/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */
  u = vsel_vd_vo_vd_vd(o0, vmul_vd_vd_vd(a, a), a);

  t = vsel_vd_vo_vo_d_d_d(o0, o1, +0.6801072401395392157e-20, +0.2830954522087717660e-13, -0.5846750404269610493e-17);
  t = vmla_vd_vd_vd_vd(t, u, vsel_vd_vo_vo_d_d_d(o0, o1, -0.2161766247570056391e-18, -0.1509491946179481940e-11, +0.6076691048812607898e-15));
Same here, please add a TODO comment about the potential optimization that can be done for architectures with indexed mla instructions.
/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */
  u = vsel_vd2_vo_vd2_vd2(o0, ddmul_vd2_vd_vd(a, a), vsel_vd2_vo_vd2_vd2(o1, vcast_vd2_vd_vd(a, vcast_vd_d(0)), dddiv_vd2_vd2_vd2(vcast_vd2_d_d(1, 0), vcast_vd2_vd_vd(a, vcast_vd_d(0)))));

  t = vsel_vd_vo_vo_vo_d_d_d_d(o0, o1, o2, +0.6801072401395386139e-20, +0.3438010341362585303e-12, -0.5757819536420710449e+2, +0.2334249729638701319e+5);
Same here, please add a TODO comment for the potential indexed mla optimization.
/* TODO AArch64: potential optimization by using `vfmad_lane_f64` */
@@ -130,11 +130,11 @@ static INLINE vfloat vsqrt_vf_vf(vfloat d) { return vsqrtq_f32(d); }

 // Multiply accumulate: z = z + x * y
 static INLINE vfloat vmla_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) {
-  return vmlaq_f32(z, x, y);
+  return vfmaq_f32(z, x, y);
What's the reason behind this change?
We can safely assume that FMA is faster than any other combination of multiplication and addition. I saw the assembly output from the compiler, and vmlaq is converted into a multiplication and an addition.
Oh! Strange. Could you please provide a minimal example that shows what seems to be an inconsistent behavior? You might have found a bug in gcc.
Multiply-accumulate instructions are all fused in aarch64, so this is not a bug.
This LGTM, I was just asking you to provide an example of code that generated a (separate) multiplication and addition from the `vmlaq_f32` intrinsic.
No worries if you don't have time.
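As a standalone illustration of the semantic difference the thread is discussing (not code from the patch): `vmlaq_f32` has non-fused multiply-then-add semantics, so the compiler may legally emit FMUL + FADD, while `vfmaq_f32` requests a single fused FMLA with no intermediate rounding.

```c
#include <arm_neon.h>

float32x4_t mla_nonfused(float32x4_t z, float32x4_t x, float32x4_t y) {
  return vmlaq_f32(z, x, y);   /* z + x*y, may round the product before the add */
}

float32x4_t mla_fused(float32x4_t z, float32x4_t x, float32x4_t y) {
  return vfmaq_f32(z, x, y);   /* z + x*y as one fused operation */
}
```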
 }
 // Multiply subtract: z = z - x * y
 static INLINE vfloat vmlanp_vf_vf_vf_vf(vfloat x, vfloat y, vfloat z) {
-  return vmlsq_f32(z, x, y);
+  return vfmsq_f32(z, x, y);
Why?
…generic implementation is faster on aarch64, so the generic implementation is now the default on aarch64.
* This patch fixes the issue of exp(1.0) not being correctly rounded, which occurred only when FMA is available.
* This patch also includes a small performance optimization for the expk and expk2 functions.
…F, which is probably faster than the generic implementation.