Fix BLAKE3 freebsd aarch64 #14728
FWIW this looks good to me. Thanks!
I don't think it matters given that it was broken before, but I'm curious whether there's any noticeable performance impact on Linux or if it's just noise.
Same machine, but with Debian 11:
So everything seems fine :) |
Can we have the |
I've uncovered another problem in my local integration of the fpu changes... it turns out that we have one point that breaks the build that I haven't thought of how to work around. https://github.com/openzfs/zfs/blob/master/module/icp/algs/blake3/blake3_impl.c#L33 Here we have an escape hatch for x86_64 to avoid the SIMD implementations with At first I thought perhaps (edit to add: |
macOS results - unsure if not using This adds These are the changes to this PRs .S files to make them assemble: And here is I tend to get one of the following:
Something has certainly gone wrong there, sure is an interesting address
Which is where |
Oh I see, ok I can fix macOS with:
Ie, allocate it instead of using We implemented it with but clearly that isn't good enough. I thought |
Oh no, I forgot them again when I did the approval :/
Squashed together.
No need for this. You have to ensure that the ctx isn't shared between the CPUs. The part below can't be changed to
6236c39
to
a701efc
Compare
We should include the Edit: no, we should not... it's a bit more to do then, so it should be done in a separate PR.
Ah interesting. |
Ah I missed that, they are just empty on those platforms. |
The |
eb8fbe8
to
b9f7987
Compare
Ui, Edit: currently aarch64 is broken with these changes in the assembly.
I also attempted to use the C files... the performance of the highly optimized assembly (clang 13 -O3) isn't reached by older clang versions or by gcc, so we would see a big performance degradation. Also, some compilers will complain about the usage of xyz because they don't, or don't want to, understand some intrinsics; see #14624 for an example.
@lundman - Do we really need all these headers with nearly the same
In your openzfs-fork repo the definitions for aarch64 (linux/freebsd) are not there currently. |
I only went that way, as we (Solaris) started with But then, the SIMD files are done with I've fixed up the asm files like 18 times to be able to compile on macOS, each time upstream changes them, so it certainly is tiring. The hope is that can end soon though, but it's deflating when even more code goes in that needs fixing. |
I personally see the |
b9f7987
to
907b52a
Compare
907b52a
to
e9dd8a7
Compare
Rebased and tested with different clang versions (11..15); the best result was with clang-14 -O3. Edit: Speedup on Apple M1: 1575->1600, only around 2%, but consistent.
Starting to get the hang of the higher level of the blake3 work here, with the "stack" and pushing out work. Isn't this something that in ZFS would have traditionally been done with a |
@lundman - I think fixing the macOS implementation of This one fixes the x18 issue of the aarch64 assembly - I also added a link in the comment showing how this assembly gets generated. The C files and also my work on this are in the public domain.
Yep, skip over that extra, I'll handle that separately. |
e9dd8a7
to
ce09534
Compare
@lundman - please test again on macOS - I think it should work now (fix: x29 must be used as the frame pointer register)
ce09534
to
952fa9f
Compare
The x18 register isn't useable within FreeBSD kernel space, so we have to fix the BLAKE3 aarch64 assembly for not using it. The source files are here: https://github.com/mcmilk/BLAKE3-tests Signed-off-by: Tino Reichardt <[email protected]>
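A quick way to sanity-check a change like this is to scan the generated assembly for any remaining use of x18 (or its 32-bit view, w18). The snippet below is only a minimal sketch and not part of this PR: the helper name `find_x18_uses` and the sample assembly are hypothetical, and it assumes `//`-style comments only.

```python
import re

# Matches the 64-bit (x18) and 32-bit (w18) names of the same register.
X18_RE = re.compile(r"\b[xw]18\b")


def find_x18_uses(asm_text):
    """Return (line_number, stripped_line) pairs whose code references x18/w18.

    Hypothetical helper: only '//' comments are ignored; block comments
    and preprocessor macros are not handled.
    """
    hits = []
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("//", 1)[0]  # drop a trailing // comment
        if X18_RE.search(code):
            hits.append((lineno, line.strip()))
    return hits


# Made-up sample, not taken from the BLAKE3 sources.
sample = """\
    stp x29, x30, [sp, #-16]!
    mov x18, x0
    add w18, w18, #1
    ldr x17, [x1]               // x18 mentioned only in a comment
"""

for lineno, line in find_x18_uses(sample):
    print(f"line {lineno}: {line}")
```

Run against a real `.S` file, an empty result would suggest the register is no longer referenced, though it is no substitute for assembling and booting the module.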
what's the status here? |
I tested this on Linux and FreeBSD 13 in kernel space, but not FreeBSD 14 currently. |
Thanks for the update. Then I suggest we get this integrated unless you'd prefer we wait until it can also be tested on FreeBSD 14. |
It should work there also. I wanted to check this first, but I don't have the time for it.
This boots fine on main (14) w/ kfpu_allowed() flipped back to 1. |
@kevans91 thanks for test booting this! |
Yup! JFYI- I committed https://cgit.freebsd.org/src/commit/?id=ce5a210997da3c4 which is a little different than what we upstreamed, and I don't recall if we've mentioned the problem or not -- we have to special-case _STANDALONE on aarch64 because blake3_impl.c offers a way to avoid compiling it for x86 ( I'm hoping that we can fix that so that the bootloader doesn't need any of this code that will never run there, then we can just revert back to always providing kfpu_begin()/kfpu_end() to match simd_x86.h. @bsdimp has taken a bit of a look at it, but I don't think he has a patch to propose at the moment. |
Thanks for merging this! Somewhat unrelated, can someone help me understand the format of the vfs.zfs.blake3_impl sysctl?
I don't understand what "cycle" and "[fastest]" are trying to convey, they're always there and implementations are listed in the same order. My understanding is that one of them should be the fastest, but I can't tell that from this sysctl (ditto for the sha*_impl sysctls, it seems). |
Looking at the code for that, I suspect that's actually a bug and that one of generic, sse2, sse41 should have been [bracketed] while the rest appear with no styling and the words "cycle" / "[fastest]" not appearing at all. |
It works the same as fletcher_4_impl, or zfs_vdev_raidz_impl. cycle cycles through them all, fastest picks whichever one microbenchmarked fastest on module load (see chksum_bench), and explicitly setting one of the others will force it to use that one instead. It defaults to fastest for reasons which may be obvious. |
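The three modes described above can be sketched roughly as follows. This is an illustrative model only, not the OpenZFS code: the class `ImplSelector`, its methods, and the throughput numbers are all hypothetical, and the real module performs the benchmark at load time in C.

```python
import itertools


class ImplSelector:
    """Toy model of an *_impl tunable with 'cycle', 'fastest', or a fixed name."""

    def __init__(self, bench):
        # bench maps implementation name -> benchmarked throughput
        # (made-up units; higher is better).
        self.bench = dict(bench)
        self.setting = "fastest"  # default, as described in the thread
        self._round_robin = itertools.cycle(list(self.bench))

    def set(self, value):
        if value not in ("cycle", "fastest") and value not in self.bench:
            raise ValueError(f"unknown implementation: {value}")
        self.setting = value

    def choose(self):
        if self.setting == "fastest":
            # Pick whichever implementation benchmarked fastest.
            return max(self.bench, key=self.bench.get)
        if self.setting == "cycle":
            # Rotate through every implementation in turn.
            return next(self._round_robin)
        # An explicit name forces that implementation.
        return self.setting


# Hypothetical benchmark results, not measured values.
selector = ImplSelector({"generic": 400, "sse2": 900, "sse41": 1100})
print(selector.choose())
```

In this model the name that "fastest" resolves to is available from `choose()`, which is the piece of information the sysctl output is reported not to show.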
I think the complaint is that this does not explain which one benchmarked as the fastest, so you don't know which one you are using. Personally I can't be arsed to do anything here and I don't think it warrants work modulo terminal boredom, but a patched report would be nice. For example: also the _impl suffix really should not be there and blake3 should probably be under vfs.zfs.crypto. or similar to avoid namespace clashes. |
ahh, ok, thanks! yeah, as mjg noted my hope was that this sysctl would show me which one had benchmarked the fastest; information I don't really know how to get otherwise. The described behavior does make sense in hindsight. |
On Linux you can view the micro benchmark results under
The same is true for the raidz benchmark results which measure each of the parity generation and reconstruction implementations. In this case they're all the same, but that's not always the case.
The x18 register isn't useable within FreeBSD kernel space, so we have to fix the BLAKE3 aarch64 assembly for not using it. The source files are here: https://github.com/mcmilk/BLAKE3-tests Reviewed-by: Kyle Evans <[email protected]> Signed-off-by: Tino Reichardt <[email protected]> Closes openzfs#14728
Motivation and Context
See pull request #14715
How Has This Been Tested?
Compile + benchmarking on the same CPU: APM eMAG 8180 r3p2