vld1q_u32/vst1q_u32 etc. #429

jasondavies · 2018-04-18T00:18:48Z

Would be great to have support for these.

I assume they simply need to be added under coresimd::arm::neon?

The text was updated successfully, but these errors were encountered:

jasondavies · 2018-04-18T00:31:36Z

Also, is there a simple way to locate the relevant LLVM intrinsic for a known instruction?

TheIronBorn · 2018-04-18T05:38:39Z

Perhaps #40 (comment)?

gnzlbg · 2018-04-18T06:18:40Z

I'm closing this as a duplicate of #148.

I assume they simply need to be added under coresimd::arm::neon?

These are arm and aarch64 intrinsics, so they need to be added in coresimd::arm::neon but also work on aarch64.

is there a simple way to locate the relevant LLVM intrinsic for a known instruction?

The LLVM ARM neon intrinsics start with llvm.arm.neon... so you can try googling for llvm.arm.neon.vld1q or using the name of their cpu instruction on ARM llvm.arm.neon.ld1 or similar. The LLVM AArch64 neon intrinsics start with llvm.aarch64.neon.... The intrinsic that does the same in both is typically called differently because they typically use the name of the cpu instruction, which is different.

Otherwise you can try to check if clang has any code-gen tests for the vld1 intrinsics.

Check out the coresimd::arm::neon module for how things are exactly done.

One trick that is helpful if you don't have access to an arm machine is adding the intrinsic to point to some LLVM function, exporting an #[inline(never)] function that uses the intrinsic, and cross-compiling for --target=aarch64-unknown-linux-gnu. Often, but not always, compilation will fail if the name of the llvm intrinsic is incorrect.

jasondavies · 2018-04-18T09:30:33Z

@TheIronBorn Thanks, I hadn't seen those particular links (LLVM intrinsic dump).

@gnzlbg Thanks! I have access to a Cavium ThunderX machine so should get some real tests working. Hopefully I can just compile for both arm and aarch64 targets and that should be sufficient to test.

gnzlbg · 2018-04-18T09:37:37Z

@jasondavies that should allow you to run the tests locally and speed things up! Typically when building coresimd no errors happen even if the llvm. intrinsics are incorrect because the intrinsics are inline and thus only get lowered when you actually use them. If you are able to compile the tests locally then this is exactly what the tests do so you should get the errors and be able to iterate quickly.

Just note that for the armv7-.. targets to work properly you need to enable NEON at compile-time (RUSTFLAGS="-C target-feature=neon" should do), for aarch64 this is not necessary and the test will run if the chip supports neon (asimd) on Linux. Run-time feature detection for other operating systems is currently not implemented for the arm and aarch64 targets but you can use our CI build bots to run the tests if you need too.

Also, if you get stuck, just open a PR so that we can look into it, and include the LLVM errors that you get in the comments, if any. You already know the drill :P

jasondavies · 2018-04-18T10:13:11Z

Hmm, perhaps I don't actually need these ld/st intrinsics for my purposes (though it would be nice to have them eventually). My reasoning is that if I simply use u32x4, LLVM should be clever enough to figure out any relevant optimisations for loading/storing to memory.

Quick question for you: is u32x4 the best-practice type to use, or should I use something else? This is for aarch64-specific code.

gnzlbg · 2018-04-18T10:54:04Z

Quick question for you: is u32x4 the best-practice type to use, or should I use something else? This is for aarch64-specific code.

Write a small function that does that, cargo asm it, and see for yourself.

LLVM should be clever enough to figure out any relevant optimisations for loading/storing to memory.

LLVM is able to do amazing things here while at the same time also missing some of the most simplest cases. If that doesn't work as you expect. Fill it as a bug here, and I'll report it to LLVM upstream.

jasondavies · 2018-04-18T11:30:55Z

I switched to using u32x4 and friends for all architectures (rather than arch-specific load/store intrinsics) and there is no slowdown, plus the code is simpler. Lovely!

gnzlbg · 2018-04-18T15:56:43Z

Good to know, just keep in mind that those types are not hitting stable rust in the immediate future. There is an RFC but that's still under review and there is still a lot of work to be done.

jasondavies · 2018-04-18T16:07:51Z

If you're interested, I just published my code for hardware-accelerated iterated SHA-256 here: https://github.com/plutomonkey/verify-beacon.

I expect/hope that eventually the hardware-accelerated portions can be moved entirely to the rust-crypto crates, but I just wanted to get the job done in the most straightforward way to start with.

The Intel/ARM implementations can probably be merged together quite easily too.

The only thing I wasn't quite sure about was instructing users about how to build the binary tool with the right features enabled. In both Intel/ARM cases I've put custom RUSTFLAGS but it would be nice if this was not required. Perhaps you can suggest a better way to do this?

As for use of u32x4, I'm personally OK with nightlies for now. As I mentioned, I hope rust-crypto will pick up the code anyway so the best approach for stable is better handled there.

One more thing: I'm using stdsimd directly as a dependency but IIUC this will no longer be required when the next nightly is available.

gnzlbg · 2018-04-19T08:54:16Z

Perhaps you can suggest a better way to do this?

Specifying RUSTFLAGS is the job of the person compiling the final binary, so libraries shouldn't even try to do this.

What libraries can do is fail in some way if the features that they require are not available. For example, you could use run-time feature detection for std users, and panic! at run-time if the CPU does not support the features required, or you could use compile_error! for std and #[no_std] users to emit a compiler error if the features are required at compile-time. You can be as informative as you want in the error message, for example, stating which features the library needs.

You can also add a fallback algorithm that uses no features, and is invoked if the CPU does not support what the other algorithms need.

The is_sorted crate showcases how to do degrade gracefully depending on the build-type and features available: https://github.com/gnzlbg/is_sorted IIRC the aesni crate uses the compile_error! approach.

I'm using stdsimd directly as a dependency but IIUC this will no longer be required when the next nightly is available.

I don't know which stdsimd version is exactly in Rust nightly, but the 0.0.3-0.0.4 crate versions are really old, so if you can migrate to nightly at this point you probably should.

gnzlbg closed this as completed Apr 18, 2018

tarcieri mentioned this issue Apr 3, 2021

Migrate to assembly from OpenSSL RustCrypto/asm-hashes#5

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vld1q_u32/vst1q_u32 etc. #429

vld1q_u32/vst1q_u32 etc. #429

jasondavies commented Apr 18, 2018

jasondavies commented Apr 18, 2018

TheIronBorn commented Apr 18, 2018 •

edited

Loading

gnzlbg commented Apr 18, 2018 •

edited

Loading

jasondavies commented Apr 18, 2018

gnzlbg commented Apr 18, 2018 •

edited

Loading

jasondavies commented Apr 18, 2018 •

edited

Loading

gnzlbg commented Apr 18, 2018

jasondavies commented Apr 18, 2018

gnzlbg commented Apr 18, 2018

jasondavies commented Apr 18, 2018

gnzlbg commented Apr 19, 2018 •

edited

Loading

vld1q_u32/vst1q_u32 etc. #429

vld1q_u32/vst1q_u32 etc. #429

Comments

jasondavies commented Apr 18, 2018

jasondavies commented Apr 18, 2018

TheIronBorn commented Apr 18, 2018 • edited Loading

gnzlbg commented Apr 18, 2018 • edited Loading

jasondavies commented Apr 18, 2018

gnzlbg commented Apr 18, 2018 • edited Loading

jasondavies commented Apr 18, 2018 • edited Loading

gnzlbg commented Apr 18, 2018

jasondavies commented Apr 18, 2018

gnzlbg commented Apr 18, 2018

jasondavies commented Apr 18, 2018

gnzlbg commented Apr 19, 2018 • edited Loading

TheIronBorn commented Apr 18, 2018 •

edited

Loading

gnzlbg commented Apr 18, 2018 •

edited

Loading

gnzlbg commented Apr 18, 2018 •

edited

Loading

jasondavies commented Apr 18, 2018 •

edited

Loading

gnzlbg commented Apr 19, 2018 •

edited

Loading