Implement sgemm and dgemm using fma #36

SuperFluffy · 2018-12-03T15:10:59Z

This uses fused multiply add via _mm256_fmadd_{ps,pd} to multiply and accumulate matrices in one go. The performance gains are impressive, as described in issue #35.
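For readers unfamiliar with the intrinsic, here is a minimal sketch of one fused multiply-accumulate step on four `f64` lanes. The function names `mul_acc4` and `mul_acc4_fma` are made up for illustration and are not taken from this PR; the real kernels work on larger register tiles.

```rust
/// Computes a[i] * b[i] + acc[i] for four f64 lanes.
/// Hypothetical helper for illustration, not the PR's kernel.
pub fn mul_acc4(a: [f64; 4], b: [f64; 4], acc: [f64; 4]) -> [f64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            // Safety: both features were detected at runtime just above.
            return unsafe { mul_acc4_fma(a, b, acc) };
        }
    }
    // Portable fallback: f64::mul_add is also fused (a single rounding).
    let mut out = [0.0f64; 4];
    for i in 0..4 {
        out[i] = a[i].mul_add(b[i], acc[i]);
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
unsafe fn mul_acc4_fma(a: [f64; 4], b: [f64; 4], acc: [f64; 4]) -> [f64; 4] {
    use std::arch::x86_64::*;
    let mut out = [0.0f64; 4];
    unsafe {
        // Unaligned loads, so the arrays need no special alignment.
        let va = _mm256_loadu_pd(a.as_ptr());
        let vb = _mm256_loadu_pd(b.as_ptr());
        let vc = _mm256_loadu_pd(acc.as_ptr());
        // va * vb + vc in one instruction, rounded once.
        let r = _mm256_fmadd_pd(va, vb, vc);
        _mm256_storeu_pd(out.as_mut_ptr(), r);
    }
    out
}
```

Without fma, the same step needs a `_mm256_mul_pd` followed by a `_mm256_add_pd`, which is where the saving comes from.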

Fixes #31
Fixes #35
Fixes #38

bluss · 2018-12-03T15:56:24Z

Can you update Travis and unit tests so that they still cover all kernels?

bluss · 2018-12-03T16:01:41Z

Let's merge the other one then we rebase and fix this pr.

SuperFluffy · 2018-12-03T16:03:00Z

Let's do that! I haven't yet looked into why the tests in the other one, and hence this one, are failing.

bluss · 2018-12-03T18:03:08Z

This PR is failing on travis for its own reason, so it should be investigated.

SuperFluffy · 2018-12-04T20:09:53Z

> This PR is failing on travis for its own reason, so it should be investigated.

~~Could it be that it's failing due to there not being support for fma on travis?~~

~~On the other hand, the kernel unit tests with _fma are passing. On my local machine the integration tests are passing as well.~~

I haven't yet rebased everything off of master. The fallback kernel is still broken.

bluss · 2018-12-04T20:23:22Z

@SuperFluffy The code should runtime detect if it can use fma or not, so then there is a bug. Also, are you sure the feature "fma" implies "avx"? I haven't reviewed this, so I'm not sure but I think we need to manually check for each intrinsic if it belongs to the correct feature (in this case the fma feature).
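A rough sketch of what checking each feature separately could look like. The `Kernel` enum and the `choose`, `cpu_features`, and `detect_kernel` functions are hypothetical names, not the crate's actual dispatch code; the only real API used is `is_x86_feature_detected!` from std.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kernel {
    Fma,
    Avx,
    Fallback,
}

/// Pure selection logic: the FMA kernel is picked only when both
/// "avx" and "fma" are reported, so neither is assumed to imply the other.
pub fn choose(has_avx: bool, has_fma: bool) -> Kernel {
    match (has_avx, has_fma) {
        (true, true) => Kernel::Fma,
        (true, false) => Kernel::Avx,
        _ => Kernel::Fallback,
    }
}

/// Runtime feature query on x86 targets.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_features() -> (bool, bool) {
    (
        is_x86_feature_detected!("avx"),
        is_x86_feature_detected!("fma"),
    )
}

/// Other architectures never report these features.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_features() -> (bool, bool) {
    (false, false)
}

pub fn detect_kernel() -> Kernel {
    let (avx, fma) = cpu_features();
    choose(avx, fma)
}
```

Keeping the decision in a pure function makes the "no fma" path testable on any machine, including CI runners that do have fma.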

bluss · 2018-12-04T20:43:38Z

This build is crashing with SIGILL, that sounds interesting. https://travis-ci.org/bluss/matrixmultiply/jobs/462848596

Potentially an aligned load/store on something not aligned? If it's not an instruction being used when not supported.

bluss · 2018-12-04T21:08:32Z

I got the travis builder to spit out its available target features and indeed it doesn't support fma. But why did it crash? And how can we keep this tested if travis doesn't have it...

> rustc --print cfg -Ctarget-cpu=native
debug_assertions
target_arch="x86_64"
target_endian="little"
target_env="gnu"
target_family="unix"
target_feature="avx"
target_feature="fxsr"
target_feature="mmx"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsaveopt"
target_has_atomic="16"
target_has_atomic="32"
target_has_atomic="64"
target_has_atomic="8"
target_has_atomic="cas"
target_has_atomic="ptr"
target_os="linux"
target_pointer_width="64"
target_thread_local
target_vendor="unknown"
unix

bluss · 2018-12-04T21:37:40Z

Can you put in the GitHub keywords to close issues? https://help.github.com/articles/closing-issues-using-keywords/ In this case, just put "Fixes #35" in the PR description. The PR description is the best place to put this. Thanks :)

We need to resolve the massive code duplication and comment duplication. (It's almost exactly the same code, isn't it?). Is it likely to stay identical like this?

I'd propose to solve it by making kernel_x86_avx itself a generic function. Make a simple trait and two marker types, so that you can call kernel_x86_avx::<Fma> and kernel_x86_avx::<Avx>. Conditionals driven by those static types will make the compiler generate two distinct functions at compile time.
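A minimal sketch of that proposal, assuming a hypothetical trait `MultiplyAdd` with marker types `Avx` and `Fma`, and a toy four-lane kernel body instead of the real one. The shared body is `#[inline(always)]` so that it is compiled inside each `#[target_feature]` wrapper.

```rust
#[cfg(target_arch = "x86_64")]
mod x86 {
    use std::arch::x86_64::*;

    /// Hypothetical trait: computes a * b + c on four packed f64 lanes.
    pub trait MultiplyAdd {
        unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d;
    }

    /// Marker types selecting the strategy at compile time.
    pub struct Avx;
    pub struct Fma;

    impl MultiplyAdd for Avx {
        #[inline(always)]
        unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
            // Separate multiply and add (two roundings).
            unsafe { _mm256_add_pd(_mm256_mul_pd(a, b), c) }
        }
    }

    impl MultiplyAdd for Fma {
        #[inline(always)]
        unsafe fn multiply_add(a: __m256d, b: __m256d, c: __m256d) -> __m256d {
            // Fused multiply-add (one rounding).
            unsafe { _mm256_fmadd_pd(a, b, c) }
        }
    }

    /// Shared kernel body, written once and monomorphized per marker type.
    #[inline(always)]
    unsafe fn kernel<M: MultiplyAdd>(a: &[f64; 4], b: &[f64; 4], c: &mut [f64; 4]) {
        unsafe {
            let r = M::multiply_add(
                _mm256_loadu_pd(a.as_ptr()),
                _mm256_loadu_pd(b.as_ptr()),
                _mm256_loadu_pd(c.as_ptr()),
            );
            _mm256_storeu_pd(c.as_mut_ptr(), r);
        }
    }

    #[target_feature(enable = "avx")]
    pub unsafe fn kernel_avx(a: &[f64; 4], b: &[f64; 4], c: &mut [f64; 4]) {
        unsafe { kernel::<Avx>(a, b, c) }
    }

    #[target_feature(enable = "avx,fma")]
    pub unsafe fn kernel_fma(a: &[f64; 4], b: &[f64; 4], c: &mut [f64; 4]) {
        unsafe { kernel::<Fma>(a, b, c) }
    }
}

/// Safe entry point: c[i] += a[i] * b[i], using the best available kernel.
pub fn multiply_accumulate(a: &[f64; 4], b: &[f64; 4], c: &mut [f64; 4]) {
    #[cfg(target_arch = "x86_64")]
    {
        let avx = is_x86_feature_detected!("avx");
        if avx && is_x86_feature_detected!("fma") {
            // Safety: avx and fma were detected at runtime.
            unsafe { x86::kernel_fma(a, b, c) };
            return;
        }
        if avx {
            // Safety: avx was detected at runtime.
            unsafe { x86::kernel_avx(a, b, c) };
            return;
        }
    }
    // Scalar fallback for everything else.
    for i in 0..4 {
        c[i] += a[i] * b[i];
    }
}
```

The duplicated kernel code and comments then collapse into the single `kernel::<M>` body; only the thin wrappers differ.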

SuperFluffy · 2018-12-04T21:42:45Z

So, this feels like a bit of a hack, but I found this when googling: https://github.com/uclouvain/openjpeg/blob/master/.travis.yml#L29-L33

If you specify os: linux, sudo: true, and dist: trusty, you get a machine with avx2 and fma, apparently. The tests pass now....

> Can you use github keywords to close issues? https://help.github.com/articles/closing-issues-using-keywords/ In this case, just put "Fixes #35" in the PR description. The PR description is the best place to put this.

Will do!

> We need to resolve the massive code duplication and comment duplication. (It's almost exactly the same code, isn't it?). Is it likely to stay identical like this?

Yes, you are right, we should fix that.

bluss · 2018-12-04T21:43:44Z

@SuperFluffy but tests should pass also on machines that don't have fma. I'm not sure why they were failing, can we understand that?

SuperFluffy · 2018-12-04T21:49:48Z

@bluss: Yes, you are right once more. Looks like the macro isn't picking up on fma not being available?

Regarding what you said earlier:

> Also, are you sure the feature "fma" implies "avx"?

From what I can tell, there is not a single processor out there that supports fma but not avx. Since fma is acting on __m256 and __m256d vectors, and since the only way to load them is through functions like _mm256_load_p{s,d} introduced with avx, I think we can safely assume that fma => avx.

bluss · 2018-12-04T21:49:54Z

The tests need to be updated so that they crash again. That they pass is indicative of one thing: We don't test all the kernels on this new fma setup 😄

So .travis.yml would need to be updated to make sure we reach all the different kernels.

SuperFluffy · 2018-12-05T00:23:55Z

It turns out that you need to run cargo clean between benchmark runs to check that the correct code paths are taken. But I have run cargo build with no env vars, with MMNO_fma=1, and with MMNO_fma=1 MMNO_avx=1, to test with fma enabled, only avx enabled, and neither enabled, and all results are consistent:

# fallback, MMNO_fma=1 MMNO_avx=1
test mat_mul_f64::m127   ... bench:     339,714 ns/iter (+/- 14,009)
# avx only, MMNO_fma=1
test mat_mul_f64::m127   ... bench:     188,678 ns/iter (+/- 40,136)
# fma
test mat_mul_f64::m127   ... bench:     112,323 ns/iter (+/- 7,942)
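To put the three timings in relation, the ratios can be worked out with a small helper. This is plain arithmetic on the numbers posted above; the constant and function names are made up for illustration.

```rust
/// ns/iter for mat_mul_f64::m127, copied from the benchmark output above.
pub const FALLBACK_NS: f64 = 339_714.0;
pub const AVX_NS: f64 = 188_678.0;
pub const FMA_NS: f64 = 112_323.0;

/// Speedup factor of the faster run relative to the slower one.
pub fn speedup(old_ns: f64, new_ns: f64) -> f64 {
    old_ns / new_ns
}
```

This gives roughly 1.8x for avx over the fallback, 3.0x for fma over the fallback, and 1.7x for fma over avx. The reported spread (up to +/- 40,136 ns on the avx run) is large, so these ratios are only approximate.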

bluss · 2018-12-07T18:27:27Z

src/dgemm_kernel.rs

@@ -95,25 +124,35 @@ pub unsafe fn kernel(k: usize, alpha: T, a: *const T, b: *const T,
 #[inline]
 #[target_feature(enable="avx")]


Should be feature "fma" here. As said, this is a directive for how to compile the code, and without the directive to use "fma" performance is abysmal (because the fma intrinsics compile to function calls).
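A small sketch of the difference being pointed out, using two hypothetical functions (`madd_avx_only` and `madd_fma`) that differ only in the attribute. Both compile and return the same values on an fma-capable CPU; per the comment above, only the second gives the compiler permission to inline `_mm256_fmadd_ps` as a single instruction.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Attribute as in the hunk above: only "avx" is enabled for this function,
/// so the FMA intrinsic is expected to remain an out-of-line call.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
pub unsafe fn madd_avx_only(a: &[f32; 8], b: &[f32; 8], c: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    unsafe {
        let r = _mm256_fmadd_ps(
            _mm256_loadu_ps(a.as_ptr()),
            _mm256_loadu_ps(b.as_ptr()),
            _mm256_loadu_ps(c.as_ptr()),
        );
        _mm256_storeu_ps(out.as_mut_ptr(), r);
    }
    out
}

/// Suggested attribute: "fma" is enabled too, so the intrinsic can be
/// inlined into this function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,fma")]
pub unsafe fn madd_fma(a: &[f32; 8], b: &[f32; 8], c: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    unsafe {
        let r = _mm256_fmadd_ps(
            _mm256_loadu_ps(a.as_ptr()),
            _mm256_loadu_ps(b.as_ptr()),
            _mm256_loadu_ps(c.as_ptr()),
        );
        _mm256_storeu_ps(out.as_mut_ptr(), r);
    }
    out
}
```

Either function may only be called after checking `is_x86_feature_detected!("avx")` and `is_x86_feature_detected!("fma")`, since the intrinsic executes an fma instruction in both cases.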

bluss · 2018-12-07T19:40:08Z

Thanks! What a massive performance improvement when using this feature! Will add on the dedup of sgemm too.

SuperFluffy force-pushed the dgemm_fma branch from 6e28f79 to 5e90824 · December 4, 2018 21:10

SuperFluffy mentioned this pull request Dec 4, 2018: Test fma instructions on travis #38 (closed)

SuperFluffy force-pushed the dgemm_fma branch from 44ae0ee to 9f1a17c · December 4, 2018 23:21

bluss reviewed Dec 5, 2018: src/sgemm_kernel.rs (outdated)

SuperFluffy force-pushed the dgemm_fma branch 3 times, most recently from 33aa05e to cf549c4 · December 5, 2018 22:20

bluss reviewed Dec 5, 2018: src/dgemm_kernel.rs (outdated)

bluss reviewed Dec 5, 2018: src/dgemm_kernel.rs

SuperFluffy force-pushed the dgemm_fma branch from cf549c4 to 9cc56c5 · December 5, 2018 22:30

bluss reviewed Dec 7, 2018

SuperFluffy added 2 commits December 7, 2018 19:38

Implement sgemm and dgemm using fma; closes bluss#35

161715e

This introduces a new trait `DgemmMultiplyAdd` that selects fused multiply add if available, and multiplication followed by addition if not. Tests for avx and fma kernels are disabled for now.

FIX: Enable fma intrinsics on travis. Fixes bluss#38

6fdfa19

I do not know why this works, but it currently works. In addition, extra travis targets are specified that disable fma and avx to hit the tests for all kernels.

SuperFluffy force-pushed the dgemm_fma branch from a4ed764 to 6fdfa19 · December 7, 2018 18:42

bluss merged commit 20932b3 into bluss:master Dec 7, 2018
