
[AArch64][CodeGen] Fix wrong operand order when creating vcmla intrinsic #65278

Merged (1 commit) on Sep 14, 2023

Conversation

daisy202309 (Contributor)

No description provided.

daisy202309 requested a review from a team as a code owner on September 4, 2023.
daisy202309 marked this pull request as draft, then ready for review, on September 4, 2023.
efriedma-quic (Collaborator) left a comment:

Please note in the commit message whether this is actually a visible miscompile; at first glance, multiplication is commutative.

Assuming it is a miscompile, please verify that using the intrinsics from clang produces the correct result.

Otherwise LGTM

daisy202309 (Contributor, Author) commented Sep 5, 2023:

> Please note in the commit message whether this is actually a visible miscompile; at first glance, multiplication is commutative.
>
> Assuming it is a miscompile, please verify that using the intrinsics from clang produces the correct result.
>
> Otherwise LGTM

For the fcmla instruction with a rotation of 90 or 270, swapping the positions of the last two operands produces different results.

Here is an example:

#include <arm_neon.h>
#include <stdio.h>

float64x2_t test_rot0(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

float64x2_t test_rot90(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_rot90_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

float64x2_t test_rot180(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_rot180_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

float64x2_t test_rot270(float64_t *acc, float64_t *lhs, float64_t *rhs) {
  return vcmlaq_rot270_f64(vld1q_dup_f64(acc), vld1q_dup_f64(lhs), vld1q_dup_f64(rhs));
}

int main() {
  float64_t acc[] = {100, 100};
  float64_t lhs[] = {10, 10};
  float64_t rhs[] = {77, 77};
  float64_t r0[] = {1000, 1000};
  float64_t r90[] = {1000, 1000};
  float64_t r180[] = {1000, 1000};
  float64_t r270[] = {1000, 1000};

  vst1q_lane_f64(r0, test_rot0(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r0[1], test_rot0(acc, rhs, lhs), 1);

  vst1q_lane_f64(r90, test_rot90(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r90[1], test_rot90(acc, rhs, lhs), 1);

  vst1q_lane_f64(r180, test_rot180(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r180[1], test_rot180(acc, rhs, lhs), 1);

  vst1q_lane_f64(r270, test_rot270(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r270[1], test_rot270(acc, rhs, lhs), 1);

  printf("r0: %lf %lf\n", r0[0], r0[1]);
  printf("r90: %lf %lf\n", r90[0], r90[1]);
  printf("r180: %lf %lf\n", r180[0], r180[1]);
  printf("r270: %lf %lf\n", r270[0], r270[1]);
}

The result:

r0: 870.000000 870.000000
r90: -670.000000 870.000000
r180: -670.000000 -670.000000
r270: 870.000000 -670.000000
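
For reference, here is a scalar model of the per-lane arithmetic behind these numbers (my own sketch of the FCMLA rotation behaviour as I understand it, so treat the exact sign/lane pairing as an assumption rather than a reference):

#include <stdio.h>

/* Scalar sketch of one FCMLA accumulation on a single complex element:
   (ar, ai) += rotation of (br + i*bi) * (cr + i*ci). */
static void fcmla_model(double *ar, double *ai, double br, double bi,
                        double cr, double ci, int rot) {
  switch (rot) {
    case 0:   *ar += br * cr; *ai += br * ci; break;
    case 90:  *ar -= bi * ci; *ai += bi * cr; break;
    case 180: *ar -= br * cr; *ai -= br * ci; break;
    case 270: *ar += bi * ci; *ai -= bi * cr; break;
  }
}

int main() {
  /* The dup loads above broadcast one scalar to both lanes, so the
     real and imaginary inputs are equal: acc = 100, lhs = 10, rhs = 77. */
  int rots[4] = {0, 90, 180, 270};
  for (int k = 0; k < 4; k++) {
    double ar = 100, ai = 100;
    fcmla_model(&ar, &ai, 10, 10, 77, 77, rots[k]);
    printf("rot%d: %lf %lf\n", rots[k], ar, ai);
  }
}

In this model swapping lhs and rhs changes nothing, because every lane holds the same scalar; the two different values per line come from lane 0 (real) versus lane 1 (imaginary).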

I'm not sure whether I need to add this test case to my patch.

efriedma-quic (Collaborator):

I don't think we need an end-to-end testcase; it should be sufficient to have testcases for clang -> LLVM IR and LLVM IR -> asm. So this should be fine as-is.
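
For illustration, a clang -> LLVM IR check might look roughly like this (a sketch of the general shape only: the RUN line, feature flags, and CHECK pattern are my guess at the usual convention, not copied from the actual patch):

// RUN: %clang_cc1 -triple arm64 -target-feature +neon -target-feature +v8.3a \
// RUN:   -S -emit-llvm -o - %s | FileCheck %s

#include <arm_neon.h>

// CHECK-LABEL: test_rot90
// CHECK: @llvm.aarch64.neon.vcmla.rot90
float64x2_t test_rot90(float64x2_t acc, float64x2_t lhs, float64x2_t rhs) {
  return vcmlaq_rot90_f64(acc, lhs, rhs);
}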

davemgreen (Collaborator):

The test case you have is testing different lanes from the vcmla intrinsics, not just commutativity. The different lanes would be expected to be different. The lane indices would need to be the same to test commutativity:

  vst1q_lane_f64(r0, test_rot0(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r0[1], test_rot0(acc, rhs, lhs), 0);

  vst1q_lane_f64(r90, test_rot90(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r90[1], test_rot90(acc, rhs, lhs), 0);

  vst1q_lane_f64(r180, test_rot180(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r180[1], test_rot180(acc, rhs, lhs), 0);

  vst1q_lane_f64(r270, test_rot270(acc, lhs, rhs), 0);
  vst1q_lane_f64(&r270[1], test_rot270(acc, rhs, lhs), 0);

When you say "fix wrong operand order", is there a bug here? If so, can you explain where? The same change may need to be applied for ARM too, but I was under the impression that enough testing had been done to catch problems like this (barring maybe some edge cases with NaNs and whatnot).

daisy202309 (Contributor, Author) commented Sep 8, 2023:

> The test case you have is testing different lanes from the vcmla intrinsics, not just commutativity. The different lanes would be expected to be different. The lane indices would need to be the same to test commutativity:
>
>   vst1q_lane_f64(r0, test_rot0(acc, lhs, rhs), 0);
>   vst1q_lane_f64(&r0[1], test_rot0(acc, rhs, lhs), 0);
>
>   vst1q_lane_f64(r90, test_rot90(acc, lhs, rhs), 0);
>   vst1q_lane_f64(&r90[1], test_rot90(acc, rhs, lhs), 0);
>
>   vst1q_lane_f64(r180, test_rot180(acc, lhs, rhs), 0);
>   vst1q_lane_f64(&r180[1], test_rot180(acc, rhs, lhs), 0);
>
>   vst1q_lane_f64(r270, test_rot270(acc, lhs, rhs), 0);
>   vst1q_lane_f64(&r270[1], test_rot270(acc, rhs, lhs), 0);
>
> When you say "fix wrong operand order", is there a bug here? If so, can you explain where? The same change may need to be applied for ARM too, but I was under the impression that enough testing had been done to catch problems like this (barring maybe some edge cases with NaNs and whatnot).

I hit a runtime error in SPEC CPU2006 433.milc caused by an incorrect calculation result, and I get the correct result either by applying this patch or by disabling the complex deinterleaving optimization.
The order (InputA, InputB) is used everywhere else in this function; only this one place uses (InputB, InputA), so I suspect it is a clerical error. I have tried to build a small example that reproduces the error, but I have not succeeded so far; the actual scenario is more complicated.

davemgreen (Collaborator):

Is milc compiled with -Ofast, or without fast math? I don't think we've seen the same thing here.

From what I can tell, this should just change from one complex multiply to another. It will be different, but I'm not sure whether it's better or worse. For example, with a normal complex mul:

vcmla A, B, C, #0
vcmla A, B, C, #90

would be:

Ar = ((Ar + Br.Cr) - Bi.Ci)
Ai = ((Ai + Br.Ci) + Bi.Cr)

vs., if the operands were the other way around:

Ar = ((Ar + Cr.Br) - Ci.Bi)
Ai = ((Ai + Cr.Bi) + Ci.Br)

The brackets show where the fused multiply-accumulates happen. So there will be slightly different rounding between the two versions, but neither matches the original exactly.
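
As a rough illustration (with arbitrary values of my own choosing): the real-part brackets pair the same products either way, since multiplication itself commutes, so it is the accumulation order of the imaginary part that can change the final bits:

#include <math.h>
#include <stdio.h>

/* The imaginary-part accumulation from the two bracketings above, each
   step computed as a single fused multiply-add (compile with -lm).
   For many inputs the two orders agree; they are not guaranteed to. */
int main() {
  double Ai = 1.0;
  double Br = 0.1, Bi = 0.3;
  double Cr = 0.7, Ci = 0.2;

  double v1 = fma(Bi, Cr, fma(Br, Ci, Ai)); /* ((Ai + Br.Ci) + Bi.Cr) */
  double v2 = fma(Ci, Br, fma(Cr, Bi, Ai)); /* ((Ai + Cr.Bi) + Ci.Br) */

  printf("%.17g\n%.17g\n", v1, v2); /* may differ in the last ulp */
}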

The new order sounds like a more natural order, so I'm not against it. It might just be that milc is a little susceptible to fast math rounding differences.
