Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement sgemm and dgemm using fma #36

Merged
merged 2 commits into from
Dec 7, 2018
Merged

Conversation

SuperFluffy
Copy link
Contributor

@SuperFluffy SuperFluffy commented Dec 3, 2018

This uses fused multiply add via _mm256_fmadd_{ps,pd} to multiply and accumulate matrices in one go. The performance gains are impressive, as described in issue #35.

Fixes #31
Fixes #35
Fixes #38

@bluss
Copy link
Owner

bluss commented Dec 3, 2018

Can you update Travis and unit tests so that they still cover all kernels?

@bluss
Copy link
Owner

bluss commented Dec 3, 2018

Let's merge the other one then we rebase and fix this pr.

@SuperFluffy
Copy link
Contributor Author

SuperFluffy commented Dec 3, 2018 via email

@bluss
Copy link
Owner

bluss commented Dec 3, 2018

This PR is failing on travis for its own reason, so it should be investigated.

@SuperFluffy
Copy link
Contributor Author

SuperFluffy commented Dec 4, 2018

This PR is failing on travis for its own reason, so it should be investigated.

Could it be that it's failing due to there not being support for fma on travis?

On the other hand, the kernel unit tests with _fma are passing. On my local machine the integration tests are passing as well.

I haven't yet rebased everything off of master. The fallback kernel is still broken.

@bluss
Copy link
Owner

bluss commented Dec 4, 2018

@SuperFluffy The code should runtime detect if it can use fma or not, so then there is a bug. Also, are you sure the feature "fma" implies "avx"? I haven't reviewed this, so I'm not sure but I think we need to manually check for each intrinsic if it belongs to the correct feature (in this case the fma feature).

@bluss
Copy link
Owner

bluss commented Dec 4, 2018

This build is crashing with SIGILL, that sounds interesting. https://travis-ci.org/bluss/matrixmultiply/jobs/462848596

Potentially an aligned load/store on something not aligned? If it's not an instruction being used when not supported.

@bluss
Copy link
Owner

bluss commented Dec 4, 2018

I got the travis builder to spit out its available target features and indeed it doesn't support fma. But why did it crash? And how can we keep this tested if travis doesn't have it...

> rustc --print cfg -Ctarget-cpu=native
debug_assertions
target_arch="x86_64"
target_endian="little"
target_env="gnu"
target_family="unix"
target_feature="avx"
target_feature="fxsr"
target_feature="mmx"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsaveopt"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="cas"
target_has_atomic="ptr"
target_os="linux"
target_pointer_width="64"
target_thread_local
target_vendor="unknown"
unix

@bluss
Copy link
Owner

bluss commented Dec 4, 2018

Can put in the github keywords to close issues? https://help.github.com/articles/closing-issues-using-keywords/ In this case, just put "Fixes #35" in the PR description. The PR description is the best place to put this. Thanks :)

We need to resolve the massive code duplication and comment duplication. (It's almost exactly the same code, isn't it?). Is it likely to stay identical like this?

I'd propose to solve it by making exactly kernel_x86_avx a generic function. Make a simple trait and two marker types, so that you can call kernel_x86_avx::<Fma> and kernel_x86_avx::<Avx>. Using conditionals aided by those static types will make us generate two distinct functions at compile time.

@SuperFluffy
Copy link
Contributor Author

So, this feels like a bit of a hack, but I found this when googling: https://github.com/uclouvain/openjpeg/blob/master/.travis.yml#L29-L33

If you specify os: linux, sudo: true, and dist: trusty, you get a machine with avx2 and fma, apparently. The tests pass now....

Can you use github keywords to close issues? https://help.github.com/articles/closing-issues-using-keywords/ In this case, just put "Fixes #35" in the PR description. The PR description is the best place to put this.

Will do!

We need to resolve the massive code duplication and comment duplication. (It's almost exactly the same code, isn't it?). Is it likely to stay identical like this?

Yes, you are right, we should fix that.

@bluss
Copy link
Owner

bluss commented Dec 4, 2018

@SuperFluffy but tests should pass also on machines that don't have fma. I'm not sure why they were failing, can we understand that?

@SuperFluffy
Copy link
Contributor Author

@bluss: Yes, you are right once more. Looks like the macro isn't picking up on fma not being available?

Regarding what you said earlier:

Also, are you sure the feature "fma" implies "avx"?

From what I can tell, there is not a single procecessor out there that supports fma, but not avx. Since fma is acting on __m256 and __m256d vectors, and since the only way to load them is through functions like _mm256_load_p{s,d} introduced with avx, I think we can safely assume that fma => avx.

@bluss
Copy link
Owner

bluss commented Dec 4, 2018

The tests need to be updated so that they crash again. That they pass is indicative of one thing: We don't test all the kernels on this new fma setup 😄

So .travis.yml would need to be updated to make sure we reach all the different kernels.

@SuperFluffy
Copy link
Contributor Author

It turns out that you need to cargo clean in between benchmark runs to check that the correct code paths are taken. But I have run cargo build with no env vars, with MMNO_fma=1, and with MMNO_FMA=1 MMNO_avx=1, to test with fma enabled, only avx enabled, and neither enabled, and all results are consistent:

# fallback, MMNO_fma=1 MMNO_avx=1
test mat_mul_f64::m127   ... bench:     339,714 ns/iter (+/- 14,009)
# avx only, MMNO_fma=1
test mat_mul_f64::m127   ... bench:     188,678 ns/iter (+/- 40,136)
# fma
test mat_mul_f64::m127   ... bench:     112,323 ns/iter (+/- 7,942)

src/sgemm_kernel.rs Outdated Show resolved Hide resolved
@SuperFluffy SuperFluffy force-pushed the dgemm_fma branch 3 times, most recently from 33aa05e to cf549c4 Compare December 5, 2018 22:20
src/dgemm_kernel.rs Outdated Show resolved Hide resolved
@@ -95,25 +124,35 @@ pub unsafe fn kernel(k: usize, alpha: T, a: *const T, b: *const T,
#[inline]
#[target_feature(enable="avx")]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be feature "fma" here. As said, this is a directive how to compile the code and without the directive to use "fma" performance is absymal (because the fma instrinsics compile to function calls).

This introduces a new trait `DgemmMultiplyAdd` that selects
fused multiply add if available, and multiplication followed
by addition if now.

Tests for avx and fma kernels are disabled for now.
I do not know why this works, but it currently works.

In addition, extra travis targets are specified that
disable fma and avx to hit the tests for all kernels.
@bluss
Copy link
Owner

bluss commented Dec 7, 2018

Thanks! What massive performance improvement, when using this feature! Will add on the dedup of sgemm too.

@bluss bluss merged commit 20932b3 into bluss:master Dec 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants