Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more ARM SIMD intrinsics #792

Merged
merged 1 commit into from
Apr 7, 2020

Conversation

Licenser
Copy link
Contributor

@Licenser Licenser commented Aug 1, 2019

I'm trying to add some more SIMD intrinsics for arm. It's still very much WIP - I'm also not sure how to test them locally.

@rust-highfive
Copy link

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @gnzlbg (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@Licenser
Copy link
Contributor Author

Licenser commented Aug 1, 2019

This is in relation to #148 to catch up a bit on the delta.

@bjorn3
Copy link
Member

bjorn3 commented Aug 1, 2019

I'm also not sure how to test them locally.

I believe it is:

$ TARGET=... ./ci/run.sh

@Licenser
Copy link
Contributor Author

Licenser commented Aug 1, 2019

that does fail with quite some errors sadly, not it did catch two syntax errors however! :D

@bjorn3
Copy link
Member

bjorn3 commented Aug 1, 2019

that does fail with quite some errors sadly

Did it before this PR? Those errors may be caused by the recent LLVM update.

@Licenser
Copy link
Contributor Author

Licenser commented Aug 1, 2019

Something goes wrong before that I get errors like:

error[E0463]: can't find crate for `core`
error[E0463]: can't find crate for `core`
  |
  |
  = note: the `aarch64-unknown-linux-gnu` target may not be installed

depsite rustup saying I have the toolchain installed:

heinz@Schrodinger ~/Projects/rustlang/stdarch (more-arm-intrinsics) $ rustup show
Default host: x86_64-apple-darwin

installed toolchains
--------------------

stable-x86_64-apple-darwin (default)
nightly-aarch64-unknown-linux-gnu
nightly-x86_64-apple-darwin

installed targets for active toolchain
--------------------------------------

aarch64-unknown-linux-gnu
x86_64-apple-darwin

active toolchain
----------------

stable-x86_64-apple-darwin (directory override for '/Users/heinz/Projects/rustlang/stdarch')
rustc 1.36.0 (a53f9df32 2019-07-03)

@Licenser
Copy link
Contributor Author

Licenser commented Aug 1, 2019

Don't get me wrong, I know I won't be able to run the tests locally, but I was hoping to be able to compile it before tossing it on something slow to test.

@Licenser
Copy link
Contributor Author

Licenser commented Aug 1, 2019

Ah I got that part it was:

  • rustup override set nightly
  • rustup target add aarch64-unknown-linux-gnu

@Licenser
Copy link
Contributor Author

Licenser commented Aug 2, 2019

A question, I'm planning to add some more intrinsics, namely this list (with it's variants): simd-lite/simd-json#32 (comment)

Is it preferable to do this in one large PR or in smaller somewhat topiced ones (like and and or here) ?

@Licenser Licenser force-pushed the more-arm-intrinsics branch 2 times, most recently from 09d1d12 to e31f55a Compare August 3, 2019 17:32
@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 5, 2019

Is it preferable to do this in one large PR or in smaller somewhat topiced ones (like and and or here) ?

As you wish.

@Licenser Licenser marked this pull request as ready for review August 9, 2019 18:15
@Licenser
Copy link
Contributor Author

Licenser commented Aug 9, 2019

The following tests are still failing on my tests - I'm entirely unsure why. All of them share one thing in common: they have neither simd_* intrinsics nor llvm intrinsics to link to. That means it's up the programmer to figure out what combination of code causes simd to spit out the right command based on the ARM docs. I will go into them below section by section for some details.

    core_arch::arm::neon::assert_vget_lane_u64_umov
    core_arch::arm::neon::assert_vget_lane_u8_umov
    core_arch::arm::neon::assert_vgetq_lane_u16_umov
    core_arch::arm::neon::assert_vgetq_lane_u32_umov
    core_arch::arm::neon::assert_vgetq_lane_u64_umov

The lane get commands - I brought them up earlier and every combiation of code I try ends up with something other then umov. I don't know why.

    core_arch::arm::neon::assert_vextq_s8_ext

This is quite the bugger, it combines two vectors but in simd* are structs. I tried a big match block but it doesn't like it a lot.

    core_arch::arm::neon::assert_vld1q_s8_ld1
    core_arch::arm::neon::assert_vld1q_u8_ld1
    core_arch::arm::neon::assert_vst1q_u8_st1

This translates to ldr instead of ld1, I got no clue :( (same for str and st1)

    core_arch::arm::neon::assert_vshrq_n_u8_ushr

For reasons unknown llvm changes ushr here to ushl, the ushl code however is correct ushl...

@Licenser
Copy link
Contributor Author

Licenser commented Aug 9, 2019

Experimenting around I made an other interesting observation:

https://godbolt.org/z/Qz27y8

using a array instead of a struct results in other (less) instructions.

@Licenser
Copy link
Contributor Author

Licenser commented Aug 9, 2019

iximeow on twitter came to the rescue with shr: https://twitter.com/iximeow/status/1159935494202908672

It looks like shr take only Imitate values not registers - that clears up why the code compiles to what it does. Sadly I'm still unsure how to fix that.

@Licenser
Copy link
Contributor Author

vextq_s8 seems to be a llvm limitation if I interpret the difference between clang and gcc on those two correctly:

The same goes for ushr rust seems to do a bit better then clang but it seems to not optimises the ushr when the shift value is only indirectly a constant value.

@bjorn3
Copy link
Member

bjorn3 commented Aug 10, 2019

You can take a look at

pub unsafe fn _m_pshufw(a: __m64, imm8: i32) -> __m64 {
macro_rules! call {
($imm8:expr) => {
pshufw(a, $imm8)
};
}
constify_imm8!(imm8, call)
}

for a fix for the constant value problem.

@Licenser
Copy link
Contributor Author

I've been experimenting with different approaches to the vget_lane code and what I get out of it is still quite odd. using the same code different targets return different. Am I missing something?

https://godbolt.org/z/DOJOoR

@Licenser
Copy link
Contributor Author

The CI currently fails with:

error: proc macro panicked
   --> crates/stdarch-verify/tests/arm.rs:170:1
    |
170 | stdarch_verify::arm_functions!(static FUNCTIONS);
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = help: message: unspported type: "poly64_t"

error: aborting due to previous error

can someone shed light on what that means?

@bjorn3
Copy link
Member

bjorn3 commented Aug 15, 2019

The type poly64_t is not in the list at

"poly64x1_t" => quote! { &POLY64X1 },

so it cant verify that the signatures of the intrinsics declared by you are correct.

As a side note I noticed a spell error in the error: unspported.

@Licenser
Copy link
Contributor Author

Note I've removed the following (failing) intrinsics as I'm stuck on them and don't want to hold up the PR unnessesarily since I think there are already some good and useful additions in it as it stands :)

  • vget_lane_u64
  • vget_lane_u8
  • vgetq_lane_u16
  • vgetq_lane_u32
  • vld1q_s8
  • vld1q_u8
  • vextq_s8

sunnygleason added a commit to simd-lite/simd-json that referenced this pull request Aug 16, 2019
* feat: neon support
* feat: temp stub replacements for neon intrinsics (pending rust-lang/stdarch#792)
* fix: drone CI rustup nightly
* feat: fix guards, use rust stdlib for bit count operations
* fix: remove double semicolon
* feat: fancy generic generator functions, thanks @Licenser
@Amanieu
Copy link
Member

Amanieu commented Mar 29, 2020

Looking at the CI errors, it seems that you're not handling the difference properly:

---- core_arch::arm::neon::generated::assert_vcgtq_u32_cmhi stdout ----
disassembly for stdarch_test_shim_vcgtq_u32_cmhi: 
	 0: ldr r0, [pc, #20] ; 31198 <stdarch_test_shim_vcgtq_u32_cmhi+0x1c>
	 1: vcgt.u32 q0, q0, q1
	 2: ldr r1, [pc, #16] ; 3119c <stdarch_test_shim_vcgtq_u32_cmhi+0x20>
	 3: add r0, pc, r0
	 4: ldr r1, [pc, r1]
	 5: str r0, [r1]
	 6: bx lr
	 7: .word 0x00091538
	 8: .word 0x000c6b6c
thread 'core_arch::arm::neon::generated::assert_vcgtq_u32_cmhi' panicked at 'failed to find instruction `cmhi` in the disassembly', crates/stdarch-test/src/lib.rs:157:9

Here you are looking for the AArch64 instruction (cmhi) in ARM code (vcgt.u32).

@Licenser
Copy link
Contributor Author

The code generation changed since the original PR, there were problems with wrong codes being generated so I suspect something fixed it don't ask me what or why no clue :) but I'll update the list.

@Licenser
Copy link
Contributor Author

Okay got some kind of local reproduction I'll dig through them :) might take a bit

error: aborting due to 174 previous errors

😭

@Amanieu
Copy link
Member

Amanieu commented Apr 1, 2020

By the way, you should include the code generator in the repository. Otherwise it will be difficult to modify the generated intrinsics or add new ones.

@Licenser
Copy link
Contributor Author

Licenser commented Apr 1, 2020

I agree it would be nice to have the code generator in the repo, I had it all set including a build.rs version and a manual re-generating version but was asked to take it out because generated code wasn't welcome :(.

Since I was burned with that a few times in the PR before (see the 200something comments above :/), so I want to make sure that what I'll put in is what is desired.

Can you take a look at the generator: https://github.com/simd-lite/simd-lite and say if you're OK with it?

Do you prefer it as a build.rs script that generates the file during compile time or would you rather have a sub crate for the generator that is called manually to update the generated code, or something entirely different?

@Amanieu
Copy link
Member

Amanieu commented Apr 1, 2020

I would prefer avoiding build.rs because core_arch is special: it's a submodule of libcore rather than a real crate. Simply including the generator crate and adding instructions at the top of the generated code should be sufficient.

@Licenser
Copy link
Contributor Author

Licenser commented Apr 1, 2020

Will do! not sure how much I time I get prior to the weekend but I'll start cleaning things up then, having the generator in create will make it easier!

@bors
Copy link
Contributor

bors commented Apr 4, 2020

☔ The latest upstream changes (presumably 1a577bd) made this pull request unmergeable. Please resolve the merge conflicts.

Copy link
Member

@Amanieu Amanieu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a header to the generated code saying something along the lines of:

// This code is automatically generated. DO NOT MODIFY.
//
// Instead, modify <path to neon.spec> and run the following command to re-generate this file:
// <command to re-generate>

crates/stdarch-gen/README.md Outdated Show resolved Hide resolved
crates/stdarch-gen/README.md Outdated Show resolved Hide resolved
crates/stdarch-gen/README.md Outdated Show resolved Hide resolved
crates/stdarch-gen/neon.spec Outdated Show resolved Hide resolved
crates/stdarch-gen/neon.spec Outdated Show resolved Hide resolved
crates/stdarch-gen/neon.spec Outdated Show resolved Hide resolved
crates/stdarch-gen/neon.spec Outdated Show resolved Hide resolved
@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

So I'm a bit stumped on this one:

pub unsafe fn vget_lane_u64(v: uint64x1_t, imm5: i32) -> u64 {
    if imm5 != 0 {
        unreachable_unchecked()
    }
    simd_extract(v, 0)
}

on aarch64 this [should] (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491c/BABJFCGC.html) generate a vmov instruction but it it seems to use fmov. I'm not sure there is a actual difference here given we move a 64 value into a 64 bit value but I wanted to double check.

---- core_arch::arm::neon::assert_vget_lane_u64_vmov_32 stdout ----
disassembly for stdarch_test_shim_vget_lane_u64_vmov_32: 
	 0: adrp x8, 155000 <anon.2e35cbf084f41f36b988b5d9ab4c4beb.3.llvm.14623789697366360962+0x1ab8>
	 1: ldr x8, [x8, #4040]
	 2: adrp x9, ed000 <anon.d04d5b1829f76b81fdd58e677cfb88c2.67.llvm.9826272335327430019+0x3a>
	 3: add x9, x9, #0xe7e
	 4: fmov x0, d0
	 5: str x9, [x8]
	 6: ret
thread 'core_arch::arm::neon::assert_vget_lane_u64_vmov_32' panicked at 'failed to find instruction `vmov.32` in the disassembly', crates/stdarch-test/src/lib.rs:157:9

@Amanieu
Copy link
Member

Amanieu commented Apr 5, 2020

The reference you are using is only giving you the names for the ARM instructions, not the AArch64 ones (which are completely different).

As a general rule you can tell the difference by looking at the first letter of the instruction: on ARM all NEON/VFP instructions start with a v. On AArch64 the equivalent instructions do not use the v prefix, but the float-related ones may use a f prefix instead.

@Amanieu
Copy link
Member

Amanieu commented Apr 5, 2020

In summary: vmov is an ARM instruction, fmov is an AArch64 instruction.

@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

for aarch64:
https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vget_lane_u64

which states vget_lane_u64 should be umov, but it turns out as fmov for some reason.

@Amanieu
Copy link
Member

Amanieu commented Apr 5, 2020

Yes, basically fmov d0, d1 is equivalent to umov d0, v1.d[0]. Both instructions have the same effect so the compiler can choose either one of them. Note that this only applies for lane 0. For other lanes it should generate a umov.

@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

awesome thanks :) I was hoping it was something along the line :D

@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

It looks like the issues are fixed :) I'd rebase and squash this so we don't blow up the repo w/ 110 commits is that a accepted practice for stdarch?

@Amanieu
Copy link
Member

Amanieu commented Apr 5, 2020

Sure that's fine.

How much of the ARM intrinsics is there still left to implement?

@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

Doing quick math based on https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?page=146 there are about 4355

A quick check on exported functions:

# rg 'pub unsafe fn' crates/core_arch/src/{aarch64,arm} | wc -l
528

so I think about 4000 😭

A arm simd and and orr

Improve data for test cses

Fix numbers picked for test cases

Remove boilerplate of over and over repeated impl

Add exclusive or operation

Add bitwise equality operations

Add gt and lt

Add lte and gte

Add vmul_p64

Add some uget intrinsics

Add some vdup commands

adding reinterpret and updating vget_lane

Add vld1q u8 and s8

Add vmovq_n_u8

add vpaddq_u8

Add vextq_s8

Add vqmovn_u64

add vqsubq_u8

add vshrq_n_u8 and vshlq_n_u8

add vst1q_u8

Add vscode to git ignore

Fix shr using constify

Move macros

Improve guard

Use imm5 for vget_lane - this solves vgetq_lane_u64

Fix incorrect types for compairiso operators

Fix poly64_t

Rmove vst1q_u8

Add poly64_t to stdarch-verify

Fix typo in unsupported type check

Add poly128_t to stdarch-verify

Update vextq_s8

Come cleanup

Fix up const values

Fix vsh*q_n_u8

Remove unused import

Remove failing intrinsics

Remove extra line

Remove now unused import

Add vextq_u8 and vextq_s8

Add vextq_u8

add vgetq_lane_u16

Add vget_lane_u8

Add missing documentaiton

Try using u32 for parameters

return arguments to i32

Fix test for vpaddq_u8

Update docs in macros

Add vget_lane_u64

Add code generation for neon intrinsics

Add vgetq_lane_u32 (fmov)

Skip generated modules for rustfmt

Add dummy files for cargo fmt

Don't re-generate files unless required.

Add documentation to spec file and update syntax

Add more docs for test variables in sepc

Add generation for vqsub* intrinsics to demonstrate use of links

Add vqadd

Add hadd

Fix missing test

Fix unused imports and test

Add a number of additional intrinsics adn move generation to an example

tag vgetq_lane_u32 as fmov instead of umov

Remove generator, it's all writen by hand, promised

Remove comment and unused example

Remove comments

Format generated files

Remove don't edit comment

Improve tests for vmul_f

Work around bug in simdarch-verify

Remove quadd for the time being

Add tests for vreinterpret

Fix bug in stdarch-test and nop intrinsics

feat: additional tests for comparison operations

feat: additional tests, move tests to non-generated file

chore: rustfmt, move tests to neon/mod.rs

feat: tests for conditionals and bitwise operators

feat: improved test coverage for ARM intrinsics

fix: removing 64-bit comparison ops (noticed they're in AARCH64)

fix: fix tests for removed comparison operators

feat: move test support into own module

feat: implementation of checks and test support for aarch64

Revert changes to generated files

Re-add tests that got lost in the merge

Fmt and fix test values

Add some negatives

Only run test_support for v7 and aarch cpus

Fix mul intrinsics

Include code generator

Fix first hive of intrinsic changes

escape intrinsics

fix more generated code

Update crates/stdarch-gen/neon.spec

Co-Authored-By: bjorn3 <[email protected]>

Update crates/stdarch-gen/neon.spec

Co-Authored-By: bjorn3 <[email protected]>

Update crates/stdarch-gen/neon.spec

Co-Authored-By: bjorn3 <[email protected]>

escape all intriniscs w/ a dot

Fix typo

Fix unsigned prefix i -> s

regenerate code

differentiate between signed and unsinged intriniscs

Start cleaning up aarch64

Fix bad spec

Fix imm passing

Fix more aarch intriniscs

Update more aarch64 intrinsics

Fix last aarch intriniscs, hopefully

Fix last armv7 intriniscs, hopefully

Fix last armv7 intriniscs, hopefully

Fix unused import in stdarch-gen
@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

hm the number is probably wrong, in x86 and x86_64 there are only 1122 pub unsafe fn functions, it wouldn't make sense that NEON has neary 4 times as many SIMD intrinsics as x86

@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491c/BABJFCGC.html seems to be a good list it might be a good idea to do it per topic, that way it's smaller more digestable chunks and not one mamuth task

topics would be (given the headlines in the link above):

  • Addition
  • Multiplication
  • Subtraction
  • Comparison
  • Absolute difference
  • Max/Min
  • Pairwise addition
  • Folding maximum
  • Folding minimum
  • Reciprocal/Sqrt
  • Shifts by signed variable
  • Shifts by a constant
  • Shifts with insert
  • Loads of a single vector or lane
  • Store a single vector or lane
  • Loads of an N‑element structure
  • Extract lanes from a vector and put into a register
  • Load a single lane of a vector from a literal
  • Initialize a vector from a literal bit pattern
  • Set all lanes to same value
  • Combining vectors
  • Splitting vectors
  • Converting vectors
  • Table look up
  • Extended table look up intrinsics
  • Operations with a scalar value
  • Vector extract
  • Reverse vector elements (swap endianness)
  • Other single operand arithmetic
  • Logical operations
  • Transposition operations
  • Vector reinterpret cast operations

@Licenser
Copy link
Contributor Author

Licenser commented Apr 5, 2020

Can we re-trigger the BSD build? it timed out at creating the instance.

@Amanieu Amanieu merged commit ca7f756 into rust-lang:master Apr 7, 2020
@lu-zero
Copy link
Contributor

lu-zero commented Sep 1, 2020

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491c/BABJFCGC.html seems to be a good list it might be a good idea to do it per topic, that way it's smaller more digestable chunks and not one mamuth task

Could you please open an or more issues about it?

@jack-signal
Copy link

hm the number is probably wrong, in x86 and x86_64 there are only 1122 pub unsafe fn functions, it wouldn't make sense that NEON has neary 4 times as many SIMD intrinsics as x86

GCC 9's arm_neon.h header has about 3700 inline functions plus ~200 function-like macros. In many cases several functions expose the same underlying instruction but operating on different types (eg vtbl3_{u8,s8,p8}) or vector sizes (nearly every instruction supports either 64-bit or 128-bit wide vectors, analogous to if every SSE2 or AVX2 instruction had an MMX variant).

@Lokathor
Copy link
Contributor

that figure is correct, here's a list of all neon intrinsics and which ones rust currently supports
https://docs.google.com/spreadsheets/d/1MqW1g8c7tlhdRWQixgdWvR4uJHNZzCYAf4V0oHjZkwA/edit?usp=drivesdk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.