feat: provide a f32x16 abstraction to make unrolling 256-bit code easier #1495

eddyxu · 2023-11-01T05:36:10Z

No description provided.

eddyxu · 2023-11-01T05:57:48Z

Also updated norm_l2 to use the SIMD library, for about a 3x speedup:

norm_l2(SIMD)           time:   [74.141 ms 74.443 ms 75.051 ms]
                        change: [-69.423% -69.099% -68.710%] (p = 0.00 < 0.10)

westonpace

Very cool

westonpace · 2023-11-01T14:12:40Z

rust/lance-linalg/src/simd/f32.rs

@@ -249,4 +498,23 @@ mod tests {
            format!("{:?}", simd_power)
        );
    }
+
+    #[test]
+    fn test_basic_f32x16_ops() {


It's probably tedious but given how technical these methods are, it might be a good idea to add tests for all the methods (e.g. add, add assign, sub, etc.)

Sure, I can add some.
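For illustration, per-operator tests along the lines suggested above could look roughly like the sketch below. `Lanes16` is a hypothetical scalar-backed stand-in written only for this example; it is not the `f32x16` type from this PR, and its helper names (`splat`, `from_fn`, `zip`) are assumptions. The idea is simply one small test per operator, each checked against a lane-by-lane scalar expectation.

```rust
use std::ops::{Add, AddAssign, Mul, Sub};

/// Hypothetical scalar-backed stand-in for a 16-lane f32 vector.
/// NOT the `f32x16` type from this PR; it only mirrors the operator surface.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Lanes16([f32; 16]);

impl Lanes16 {
    /// Fill every lane with the same value.
    fn splat(v: f32) -> Self {
        Lanes16([v; 16])
    }

    /// Build a vector whose lane `i` holds `f(i)`.
    fn from_fn(f: impl Fn(usize) -> f32) -> Self {
        let mut out = [0.0f32; 16];
        for (i, o) in out.iter_mut().enumerate() {
            *o = f(i);
        }
        Lanes16(out)
    }

    /// Apply `f` lane by lane to `self` and `rhs`.
    fn zip(self, rhs: Self, f: impl Fn(f32, f32) -> f32) -> Self {
        Self::from_fn(|i| f(self.0[i], rhs.0[i]))
    }

    /// Horizontal sum of all 16 lanes.
    fn reduce_sum(self) -> f32 {
        self.0.iter().sum()
    }
}

impl Add for Lanes16 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        self.zip(rhs, |a, b| a + b)
    }
}

impl Sub for Lanes16 {
    type Output = Self;
    fn sub(self, rhs: Self) -> Self {
        self.zip(rhs, |a, b| a - b)
    }
}

impl Mul for Lanes16 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        self.zip(rhs, |a, b| a * b)
    }
}

impl AddAssign for Lanes16 {
    fn add_assign(&mut self, rhs: Self) {
        *self = *self + rhs;
    }
}

#[test]
fn test_add_and_add_assign() {
    let a = Lanes16::from_fn(|i| i as f32);
    let b = Lanes16::splat(2.0);
    let expected = Lanes16::from_fn(|i| i as f32 + 2.0);
    assert_eq!(a + b, expected);
    // `+=` must agree with `+`.
    let mut c = a;
    c += b;
    assert_eq!(c, expected);
}

#[test]
fn test_sub_and_mul() {
    let a = Lanes16::from_fn(|i| i as f32);
    let b = Lanes16::splat(2.0);
    assert_eq!(a - b, Lanes16::from_fn(|i| i as f32 - 2.0));
    assert_eq!(a * b, Lanes16::from_fn(|i| i as f32 * 2.0));
}
```

Small integer-valued inputs keep every expected result exactly representable in `f32`, so the tests can use exact equality rather than a tolerance.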

westonpace · 2023-11-01T14:14:30Z

rust/lance-linalg/src/distance/norm_l2.rs

+        let dim = self.len();
+        if dim % 16 == 0 {
+            let mut sum = f32x16::zeros();
+            for i in (0..dim).step_by(16) {
+                let x = unsafe { f32x16::load_unaligned(self.as_ptr().add(i)) };
+                sum += x * x;
+            }
+            sum.reduce_sum().sqrt()
+        } else if dim % 8 == 0 {
+            let mut sum = f32x8::zeros();
+            for i in (0..dim).step_by(8) {
+                let x = unsafe { f32x8::load_unaligned(self.as_ptr().add(i)) };
+                sum += x * x;
            }
+            sum.reduce_sum().sqrt()
+        } else {
+            // Fallback to scalar
+            return self.iter().map(|v| v * v).sum::<f32>().sqrt();


I like how readable this is (well, readable for SIMD :)
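As a point of comparison, the same sum-of-squares pattern can be sketched in portable Rust without the SIMD types from this PR. The function name `norm_l2_chunked` and the array-of-accumulators layout are assumptions made for this example. Unlike the diff above, which falls back to a fully scalar loop when the length is not a multiple of 8 or 16, this sketch processes whole 16-element chunks and handles only the leftover tail with scalar code.

```rust
/// Hypothetical portable sketch of a chunked L2 norm.
/// This is not the implementation from this PR.
fn norm_l2_chunked(v: &[f32]) -> f32 {
    const LANES: usize = 16;

    // One accumulator per lane, standing in for a 16-lane vector register.
    let mut acc = [0.0f32; LANES];

    let chunks = v.chunks_exact(LANES);
    // Elements left over after the last full chunk (fewer than LANES).
    let tail = chunks.remainder();

    for chunk in chunks {
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += x * x;
        }
    }

    // Scalar sum of squares for the tail.
    let tail_sum: f32 = tail.iter().map(|x| x * x).sum();

    // Horizontal reduction of the lane accumulators, then the square root.
    (acc.iter().sum::<f32>() + tail_sum).sqrt()
}
```

Because the lanes are summed in a different order than a plain left-to-right scalar loop, results can differ from the scalar fallback in the last few bits for general inputs.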

chebbyChefNEQ · 2023-11-01T16:58:21Z

rust/lance-linalg/src/distance/l2.rs

-                    x2 -= y2;
-                    sum1.multiply_add(x1, x1);
-                    sum2.multiply_add(x2, x2);
+                    let mut x = f32x16::load_unaligned(self.as_ptr().add(i));


Do we want to unroll more than twice?

Yeah, we can add it later. I did not see an improvement last time on my CPU, though.
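For context on the unrolling question: unrolling a reduction usually means keeping several independent accumulators, so that consecutive multiply-adds do not wait on each other. The scalar sketch below makes the unroll factor a const generic `U` so different factors can be benchmarked. The name `l2_squared_unrolled` is hypothetical, and each accumulator here is a single `f32` rather than a SIMD register as in this PR. Whether a larger factor helps depends on the CPU, which matches the observation above.

```rust
/// Hypothetical sketch: squared L2 distance with `U` independent accumulators.
/// Scalar stand-in for illustration; not the code from this PR.
/// `U` must be at least 1, because `chunks_exact(0)` panics.
fn l2_squared_unrolled<const U: usize>(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len(), "inputs must have the same length");

    let mut acc = [0.0f32; U];
    let xc = x.chunks_exact(U);
    let yc = y.chunks_exact(U);
    // Leftover elements that do not fill a whole group of `U`.
    let (xt, yt) = (xc.remainder(), yc.remainder());

    for (cx, cy) in xc.zip(yc) {
        // Each accumulator only depends on its own previous value.
        for k in 0..U {
            let d = cx[k] - cy[k];
            acc[k] += d * d;
        }
    }

    // Scalar handling of the tail.
    let tail: f32 = xt.iter().zip(yt).map(|(a, b)| (a - b) * (a - b)).sum();

    acc.iter().sum::<f32>() + tail
}
```

Calling `l2_squared_unrolled::<2>(&x, &y)` and `l2_squared_unrolled::<4>(&x, &y)` on the same inputs gives a simple way to compare unroll factors under a benchmark harness.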

eddyxu added 5 commits October 31, 2023 21:46

add f32x16

1ae2b64

f32

143c02f

add f32x16 for 512 bit

7839a47

set zeros

4bd8c7f

use f32x8

68e9e51

eddyxu closed this Nov 1, 2023

eddyxu reopened this Nov 1, 2023

fix clippy;

c6b7c70

eddyxu requested review from wjones127, QianZhu, westonpace and chebbyChefNEQ November 1, 2023 06:03

fix mul

c6110d7

eddyxu self-assigned this Nov 1, 2023

westonpace approved these changes Nov 1, 2023

add test for basic

56791bd

chebbyChefNEQ reviewed Nov 1, 2023

eddyxu added 3 commits November 1, 2023 09:58

add test for basic

f00608b

add tests

c3d12a9

easier tests

836cf71

eddyxu merged commit cde1208 into main Nov 1, 2023
15 checks passed

eddyxu deleted the lei/f32x16 branch November 1, 2023 17:24
