Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vld1q_u32/vst1q_u32 etc. #429

Closed
jasondavies opened this issue Apr 18, 2018 · 11 comments
Closed

vld1q_u32/vst1q_u32 etc. #429

jasondavies opened this issue Apr 18, 2018 · 11 comments

Comments

@jasondavies
Copy link
Contributor

Would be great to have support for these.

I assume they simply need to be added under coresimd::arm::neon?

@jasondavies
Copy link
Contributor Author

Also, is there a simple way to locate the relevant LLVM intrinsic for a known instruction?

@TheIronBorn
Copy link
Contributor

TheIronBorn commented Apr 18, 2018

Perhaps #40 (comment)?

@gnzlbg
Copy link
Contributor

gnzlbg commented Apr 18, 2018

I'm closing this as a duplicate of #148.

I assume they simply need to be added under coresimd::arm::neon?

These are arm and aarch64 intrinsics, so they need to be added in coresimd::arm::neon but also work on aarch64.

is there a simple way to locate the relevant LLVM intrinsic for a known instruction?

The LLVM ARM neon intrinsics start with llvm.arm.neon... so you can try googling for llvm.arm.neon.vld1q or using the name of their cpu instruction on ARM llvm.arm.neon.ld1 or similar. The LLVM AArch64 neon intrinsics start with llvm.aarch64.neon.... The intrinsic that does the same in both is typically called differently because they typically use the name of the cpu instruction, which is different.

Otherwise you can try to check if clang has any code-gen tests for the vld1 intrinsics.

Check out the coresimd::arm::neon module for how things are exactly done.

One trick that is helpful if you don't have access to an arm machine is adding the intrinsic to point to some LLVM function, exporting an #[inline(never)] function that uses the intrinsic, and cross-compiling for --target=aarch64-unknown-linux-gnu. Often, but not always, compilation will fail if the name of the llvm intrinsic is incorrect.

@gnzlbg gnzlbg closed this as completed Apr 18, 2018
@jasondavies
Copy link
Contributor Author

@TheIronBorn Thanks, I hadn't seen those particular links (LLVM intrinsic dump).

@gnzlbg Thanks! I have access to a Cavium ThunderX machine so should get some real tests working. Hopefully I can just compile for both arm and aarch64 targets and that should be sufficient to test.

@gnzlbg
Copy link
Contributor

gnzlbg commented Apr 18, 2018

@jasondavies that should allow you to run the tests locally and speed things up! Typically when building coresimd no errors happen even if the llvm. intrinsics are incorrect because the intrinsics are inline and thus only get lowered when you actually use them. If you are able to compile the tests locally then this is exactly what the tests do so you should get the errors and be able to iterate quickly.

Just note that for the armv7-.. targets to work properly you need to enable NEON at compile-time (RUSTFLAGS="-C target-feature=neon" should do), for aarch64 this is not necessary and the test will run if the chip supports neon (asimd) on Linux. Run-time feature detection for other operating systems is currently not implemented for the arm and aarch64 targets but you can use our CI build bots to run the tests if you need too.

Also, if you get stuck, just open a PR so that we can look into it, and include the LLVM errors that you get in the comments, if any. You already know the drill :P

@jasondavies
Copy link
Contributor Author

jasondavies commented Apr 18, 2018

Hmm, perhaps I don't actually need these ld/st intrinsics for my purposes (though it would be nice to have them eventually). My reasoning is that if I simply use u32x4, LLVM should be clever enough to figure out any relevant optimisations for loading/storing to memory.

Quick question for you: is u32x4 the best-practice type to use, or should I use something else? This is for aarch64-specific code.

@gnzlbg
Copy link
Contributor

gnzlbg commented Apr 18, 2018

Quick question for you: is u32x4 the best-practice type to use, or should I use something else? This is for aarch64-specific code.

Write a small function that does that, cargo asm it, and see for yourself.

LLVM should be clever enough to figure out any relevant optimisations for loading/storing to memory.

LLVM is able to do amazing things here while at the same time also missing some of the most simplest cases. If that doesn't work as you expect. Fill it as a bug here, and I'll report it to LLVM upstream.

@jasondavies
Copy link
Contributor Author

I switched to using u32x4 and friends for all architectures (rather than arch-specific load/store intrinsics) and there is no slowdown, plus the code is simpler. Lovely!

@gnzlbg
Copy link
Contributor

gnzlbg commented Apr 18, 2018

Good to know, just keep in mind that those types are not hitting stable rust in the immediate future. There is an RFC but that's still under review and there is still a lot of work to be done.

@jasondavies
Copy link
Contributor Author

If you're interested, I just published my code for hardware-accelerated iterated SHA-256 here: https://github.com/plutomonkey/verify-beacon.

I expect/hope that eventually the hardware-accelerated portions can be moved entirely to the rust-crypto crates, but I just wanted to get the job done in the most straightforward way to start with.

The Intel/ARM implementations can probably be merged together quite easily too.

The only thing I wasn't quite sure about was instructing users about how to build the binary tool with the right features enabled. In both Intel/ARM cases I've put custom RUSTFLAGS but it would be nice if this was not required. Perhaps you can suggest a better way to do this?

As for use of u32x4, I'm personally OK with nightlies for now. As I mentioned, I hope rust-crypto will pick up the code anyway so the best approach for stable is better handled there.

One more thing: I'm using stdsimd directly as a dependency but IIUC this will no longer be required when the next nightly is available.

@gnzlbg
Copy link
Contributor

gnzlbg commented Apr 19, 2018

Perhaps you can suggest a better way to do this?

Specifying RUSTFLAGS is the job of the person compiling the final binary, so libraries shouldn't even try to do this.

What libraries can do is fail in some way if the features that they require are not available. For example, you could use run-time feature detection for std users, and panic! at run-time if the CPU does not support the features required, or you could use compile_error! for std and #[no_std] users to emit a compiler error if the features are required at compile-time. You can be as informative as you want in the error message, for example, stating which features the library needs.

You can also add a fallback algorithm that uses no features, and is invoked if the CPU does not support what the other algorithms need.

The is_sorted crate showcases how to do degrade gracefully depending on the build-type and features available: https://github.com/gnzlbg/is_sorted IIRC the aesni crate uses the compile_error! approach.

I'm using stdsimd directly as a dependency but IIUC this will no longer be required when the next nightly is available.

I don't know which stdsimd version is exactly in Rust nightly, but the 0.0.3-0.0.4 crate versions are really old, so if you can migrate to nightly at this point you probably should.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants