feat: provide a f32x16 abstraction to make unrolling 256-bit code easier #1495

Merged
eddyxu merged 11 commits into main from lei/f32x16 on Nov 1, 2023

Conversation

eddyxu
Contributor

@eddyxu eddyxu commented Nov 1, 2023

No description provided.

@eddyxu
Contributor Author

eddyxu commented Nov 1, 2023

Also updated norm_l2 to use the SIMD lib, for about a 3x speedup:

norm_l2(SIMD)           time:   [74.141 ms 74.443 ms 75.051 ms]
                        change: [-69.423% -69.099% -68.710%] (p = 0.00 < 0.10)

@eddyxu eddyxu closed this Nov 1, 2023
@eddyxu eddyxu reopened this Nov 1, 2023
@eddyxu eddyxu self-assigned this Nov 1, 2023
Contributor

@westonpace westonpace left a comment

Very cool

@@ -249,4 +498,23 @@ mod tests {
format!("{:?}", simd_power)
);
}

#[test]
fn test_basic_f32x16_ops() {
Contributor

It's probably tedious but given how technical these methods are, it might be a good idea to add tests for all the methods (e.g. add, add assign, sub, etc.)

Contributor Author

Sure, I can add some.
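A rough sketch of what per-operator tests could look like. `F32x16`, `splat`, and `iota` here are hypothetical stand-ins built on a plain array, not the intrinsics-backed `f32x16` from this PR; only the test shape is meant to carry over. Small integer-valued lanes are used so every expected sum is exact in `f32`.

```rust
use std::ops::{Add, AddAssign, Mul, Sub, SubAssign};

/// Hypothetical stand-in for the PR's `f32x16`: a plain array wrapper,
/// NOT the intrinsics-backed type from the diff.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F32x16(pub [f32; 16]);

impl F32x16 {
    /// All 16 lanes set to `v`.
    pub fn splat(v: f32) -> Self {
        Self([v; 16])
    }

    /// Lanes hold 0.0, 1.0, ..., 15.0.
    pub fn iota() -> Self {
        let mut lanes = [0.0f32; 16];
        for (i, lane) in lanes.iter_mut().enumerate() {
            *lane = i as f32;
        }
        Self(lanes)
    }

    /// Horizontal sum of all lanes.
    pub fn reduce_sum(self) -> f32 {
        self.0.iter().sum()
    }

    /// Apply `f` lane by lane.
    fn zip_with(self, rhs: Self, f: impl Fn(f32, f32) -> f32) -> Self {
        let mut out = [0.0f32; 16];
        for i in 0..16 {
            out[i] = f(self.0[i], rhs.0[i]);
        }
        Self(out)
    }
}

impl Add for F32x16 {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        self.zip_with(rhs, |a, b| a + b)
    }
}

impl Sub for F32x16 {
    type Output = Self;
    fn sub(self, rhs: Self) -> Self {
        self.zip_with(rhs, |a, b| a - b)
    }
}

impl Mul for F32x16 {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        self.zip_with(rhs, |a, b| a * b)
    }
}

impl AddAssign for F32x16 {
    fn add_assign(&mut self, rhs: Self) {
        *self = *self + rhs;
    }
}

impl SubAssign for F32x16 {
    fn sub_assign(&mut self, rhs: Self) {
        *self = *self - rhs;
    }
}

#[test]
fn test_arithmetic_ops() {
    let (a, b) = (F32x16::iota(), F32x16::splat(2.0));
    assert_eq!((a + b).reduce_sum(), 152.0); // 120 + 16 * 2
    assert_eq!((a - b).reduce_sum(), 88.0); // 120 - 16 * 2
    assert_eq!((a * b).reduce_sum(), 240.0); // 120 * 2

    // The assigning forms should agree with the by-value operators.
    let mut c = a;
    c += b;
    assert_eq!(c, a + b);
    c -= b;
    assert_eq!(c, a);
}
```

One test per operator family keeps a failure message pointing at the specific method that regressed.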

Comment on lines +55 to +72
let dim = self.len();
if dim % 16 == 0 {
    let mut sum = f32x16::zeros();
    for i in (0..dim).step_by(16) {
        let x = unsafe { f32x16::load_unaligned(self.as_ptr().add(i)) };
        sum += x * x;
    }
    sum.reduce_sum().sqrt()
} else if dim % 8 == 0 {
    let mut sum = f32x8::zeros();
    for i in (0..dim).step_by(8) {
        let x = unsafe { f32x8::load_unaligned(self.as_ptr().add(i)) };
        sum += x * x;
    }
    sum.reduce_sum().sqrt()
} else {
    // Fallback to scalar
    return self.iter().map(|v| v * v).sum::<f32>().sqrt();
Contributor

I like how readable this is (well, readable for SIMD :)
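For comparison, the same 16-wide accumulation pattern can be sketched in safe, portable Rust. This is not the PR's implementation: `norm_l2_chunked` and `norm_l2_scalar` are hypothetical names, the lanes are a plain array rather than `f32x16`, and any vectorization is left to the compiler. Unlike the diff above, this variant handles a dimension that is not a multiple of 16 by summing the leftover elements in scalar code instead of falling back entirely.

```rust
/// L2 norm using 16 independent lane accumulators plus a scalar tail.
/// Hypothetical sketch; the PR uses explicit `f32x16` loads instead.
pub fn norm_l2_chunked(v: &[f32]) -> f32 {
    const LANES: usize = 16;
    let mut acc = [0.0f32; LANES];

    let chunks = v.chunks_exact(LANES);
    // Elements left over when the length is not a multiple of LANES.
    let tail = chunks.remainder();

    for chunk in chunks {
        // Each lane accumulates its own partial sum of squares.
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += x * x;
        }
    }

    let tail_sum: f32 = tail.iter().map(|x| x * x).sum();
    (acc.iter().sum::<f32>() + tail_sum).sqrt()
}

/// Scalar reference, the same expression as the fallback branch in the diff.
pub fn norm_l2_scalar(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}
```

Because the lanes are summed in a different order than the scalar loop, results can differ in the last few bits, so comparisons between the two should use a tolerance.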

x2 -= y2;
sum1.multiply_add(x1, x1);
sum2.multiply_add(x2, x2);
let mut x = f32x16::load_unaligned(self.as_ptr().add(i));
Contributor

Do we want to unroll more than twice?

Contributor Author

Yeah, we can add it later? I did not see an improvement last time on my CPU, though.
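To make the unroll factor easy to experiment with, one option is to parameterize it. The sketch below is a hypothetical, portable illustration (plain arrays, no intrinsics, the name `l2_squared_unrolled` is invented here) of a squared L2 distance with `UNROLL` independent 16-lane accumulators; `UNROLL = 2` corresponds to the `sum1`/`sum2` pattern in the diff. Whether a larger factor helps depends on the CPU, as noted above, so it would need benchmarking on the target hardware.

```rust
const LANES: usize = 16;

/// Squared L2 distance with `UNROLL` independent 16-lane accumulators.
/// Hypothetical sketch; not the code from this PR.
pub fn l2_squared_unrolled<const UNROLL: usize>(x: &[f32], y: &[f32]) -> f32 {
    assert!(UNROLL > 0, "need at least one accumulator");
    assert_eq!(x.len(), y.len(), "vectors must have the same length");

    let mut acc = [[0.0f32; LANES]; UNROLL];
    let step = LANES * UNROLL;
    // Largest prefix that splits evenly into `step`-sized blocks.
    let main_len = x.len() - x.len() % step;

    let x_blocks = x[..main_len].chunks_exact(step);
    let y_blocks = y[..main_len].chunks_exact(step);
    for (xb, yb) in x_blocks.zip(y_blocks) {
        // Each accumulator gets its own 16-lane slice of the block, so the
        // additions into different accumulators do not depend on each other.
        for u in 0..UNROLL {
            for l in 0..LANES {
                let d = xb[u * LANES + l] - yb[u * LANES + l];
                acc[u][l] += d * d;
            }
        }
    }

    // Scalar tail for whatever did not fit into a full block.
    let tail: f32 = x[main_len..]
        .iter()
        .zip(&y[main_len..])
        .map(|(a, b)| (a - b) * (a - b))
        .sum();

    acc.iter().flatten().sum::<f32>() + tail
}
```

Calling `l2_squared_unrolled::<2>(&x, &y)` and `l2_squared_unrolled::<4>(&x, &y)` from the same benchmark would give a like-for-like comparison of the two unroll factors.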

@eddyxu eddyxu merged commit cde1208 into main Nov 1, 2023
15 checks passed
@eddyxu eddyxu deleted the lei/f32x16 branch November 1, 2023 17:24