Conversation
@tlively Please advise which opcodes could be used for these instructions.
We're basically out of opcode space, so you should just append these to the end for now. When we did the last renumbering, we thought we were essentially done adding new opcodes, so if we end up including all these newly proposed instructions, we'll probably have to do yet another renumbering.
Force-pushed from 4523f43 to c12a216.
@tlively There isn't enough space at the end, so they were inserted at 0x58–0x5F instead.
Opcodes are encoded as ULEB128s, so it's totally fine to use numbers above 0xff.
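For context, a SIMD instruction is encoded as the prefix byte `0xfd` followed by its opcode as a ULEB128, so an opcode does not have to fit in one byte. A byte-level sketch (`0x5c` is `v128.store8_lane` from the table reviewed below; the larger opcodes are hypothetical values for illustration only):

```wat
;; SIMD encoding: prefix byte 0xfd, then the opcode as a ULEB128.
;; v128.store8_lane (opcode 0x5c)  ->  fd 5c
;; hypothetical opcode 0xff        ->  fd ff 01   ;; ULEB128(0xff)  = ff 01
;; hypothetical opcode 0x100       ->  fd 80 02   ;; ULEB128(0x100) = 80 02
```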
@tlively Good point! Do you prefer to move them to the end, or is it fine to leave the opcodes as is?
No, that's ok. Their current position looks fine. Thanks!
Force-pushed from c12a216 to b65ecb2.
Note that for ARM, the codegen will require an extra register and an extra instruction compared to what's listed.
Prototyped for x64 in https://crrev.com/c/2444578; it should be in canary by tomorrow. @Maratyszcza
Prototype the newly proposed load_lane instructions, as specified in WebAssembly/simd#350. Since these instructions are not available to origin trial users on Chrome stable, make them opt-in by only selecting them from intrinsics rather than normal ISel patterns. Since we only need rough prototypes to measure performance right now, this commit does not implement all the load and store patterns that would be necessary to make full use of the offset immediate. However, the full suite of offset tests is included to make it easy to track improvements in the future. Since these are the first instructions to have a memarg immediate as well as an additional immediate, the disassembler needed some additional hacks to be able to parse them correctly. Making that code more principled is left as future work. Differential Revision: https://reviews.llvm.org/D89366
These instructions are proposed in WebAssembly/simd#350. This PR implements them throughout Binaryen except in the C/JS APIs and in the fuzzer, where it leaves TODOs instead. Right now these instructions are just being implemented for prototyping so adding them to the APIs isn't critical and they aren't generally available to be fuzzed in Wasm engines.
These instructions have now landed in both LLVM and Binaryen, so they will be ready to use in tip-of-tree Emscripten (usable via …).
I evaluated the performance impact of these instructions by modifying the WebAssembly SIMD microkernels for the Sigmoid operator in the XNNPACK library of neural-network operators to use the new load-lane instructions.
The code modifications can be seen in google/XNNPACK#1016 (for the baseline version with …).
Force-pushed from b65ecb2 to d53a173.
Attn @abrown
Force-pushed from d53a173 to 1e5d467.
Review comment on proposals/simd/BinarySIMD.md (outdated):
| `v128.store8_lane` | `0x5c` | m:memarg, i:ImmLaneIdx16 |
| `v128.store16_lane` | `0x5d` | m:memarg, i:ImmLaneIdx16 |
| `v128.store32_lane` | `0x5e` | m:memarg, i:ImmLaneIdx16 |
| `v128.store64_lane` | `0x5f` | m:memarg, i:ImmLaneIdx16 |
`ImmLaneIdx16` needs to be updated for `{load,store}_{16,32,64}` to `ImmLaneIdx8`, `ImmLaneIdx4`, and `ImmLaneIdx2` respectively.
Fixed
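For illustration, here is how one of these instructions encodes with both a memarg and a lane immediate, using the opcode from the table above (a sketch, not normative; the align/offset/lane values are made up):

```wat
;; v128.store32_lane offset=16 align=4 1
;; fd 5e   ;; SIMD prefix + opcode 0x5e
;; 02      ;; memarg: alignment, encoded as log2(4) = 2
;; 10      ;; memarg: offset = 16
;; 01      ;; lane index (ImmLaneIdx4)
```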
Force-pushed from 09f5bf5 to f93e5e8.
Please add the new operations to ImplementationStatus.md as well.
Force-pushed from 2060759 to 54b98cd.
@dtig Done.
Force-pushed from 54b98cd to 3dc15fe.
Rebased on top of merged PRs.
Force-pushed from 3dc15fe to 405f725.
Rebased once again.
Load lane and store lane instructions added in #350.
Introduction
Both the x86 SSE4.1 and ARM NEON instruction sets include instructions which load or store a single lane of a SIMD register, and this PR introduces equivalent instructions in WebAssembly SIMD. The single-lane load and store instructions cover several broad use cases, such as assembling a vector from values at non-contiguous memory locations.
Load-Lane instructions complement the Load-Zero instructions (#237), but with different performance effects: a non-contiguous load sequence built on Load-Zero instructions results in lower latency at the cost of throughput, while a sequence built on Load-Lane instructions trades that low latency for higher throughput. Moreover, even when Load-Lane instructions are used, the first element is typically loaded with Load-Zero and the others are inserted with Load-Lane instructions, as in the sketch below.
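For example, a gather of four `f32` values from four unrelated addresses might look like this in WebAssembly text format (a sketch; the function and parameter names are hypothetical, and `v128.load32_zero` is the Load-Zero instruction from #237):

```wat
(module
  (memory 1)
  ;; Gather four f32 values from four unrelated addresses into one v128.
  (func $gather4 (param $p0 i32) (param $p1 i32) (param $p2 i32) (param $p3 i32)
                 (result v128)
    (local $v v128)
    ;; First element via Load-Zero, which also clears the upper lanes...
    (local.set $v (v128.load32_zero (local.get $p0)))
    ;; ...remaining elements via Load-Lane.
    (local.set $v (v128.load32_lane 1 (local.get $p1) (local.get $v)))
    (local.set $v (v128.load32_lane 2 (local.get $p2) (local.get $v)))
    (v128.load32_lane 3 (local.get $p3) (local.get $v))))
```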
Load-Lane instructions can be emulated via a combination of scalar loads and `replace_lane` instructions, and Store-Lane instructions can be emulated via a combination of `extract_lane` instructions and scalar stores. However, these emulation sequences are substantially less efficient than direct lane loads/stores with a SIMD register: besides the `replace_lane`/`extract_lane` instructions themselves, they involve moving values between SIMD and general-purpose registers, which comes at a high latency and throughput cost. Explicit instructions for loading and storing lanes alleviate all these concerns.
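As a concrete illustration, here is a sketch of `v128.load64_lane` into lane 1 alongside its `replace_lane`-based emulation (hypothetical function names):

```wat
(module
  (memory 1)
  ;; Emulation: a scalar load feeding replace_lane routes the loaded value
  ;; through a general-purpose register before inserting it.
  (func $load_hi_emulated (param $mem i32) (param $x v128) (result v128)
    (i64x2.replace_lane 1
      (local.get $x)
      (i64.load (local.get $mem))))
  ;; Direct: the proposed instruction keeps the value in the SIMD unit.
  (func $load_hi (param $mem i32) (param $x v128) (result v128)
    (v128.load64_lane 1
      (local.get $mem)
      (local.get $x))))
```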
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX instruction set
- `v128.load8_lane`
  - `y = v128.load8_lane(mem, x, lane)` is lowered to `VPINSRB xmm_y, xmm_x, [mem], lane`
- `v128.load16_lane`
  - `y = v128.load16_lane(mem, x, lane)` is lowered to `VPINSRW xmm_y, xmm_x, [mem], lane`
- `v128.load32_lane`
  - `y = v128.load32_lane(mem, x, lane)` is lowered to `VINSERTPS xmm_y, xmm_x, [mem], (lane << 4)`
- `v128.load64_lane`
  - `y = v128.load64_lane(mem, x, lane)` is lowered to:
    - `VMOVLPS xmm_y, xmm_x, [mem]` when `lane == 0`
    - `VMOVHPS xmm_y, xmm_x, [mem]` when `lane == 1`
- `v128.store8_lane`
  - `v128.store8_lane(mem, v, lane)` is lowered to `VPEXTRB [mem], xmm_v, lane`
- `v128.store16_lane`
  - `v128.store16_lane(mem, v, lane)` is lowered to `VPEXTRW [mem], xmm_v, lane`
- `v128.store32_lane`
  - `v128.store32_lane(mem, v, lane)` is lowered to:
    - `VMOVSS [mem], xmm_v` when `lane == 0`
    - `VEXTRACTPS [mem], xmm_v, lane` otherwise
- `v128.store64_lane`
  - `v128.store64_lane(mem, v, lane)` is lowered to:
    - `VMOVLPS [mem], xmm_v` when `lane == 0`
    - `VMOVHPS [mem], xmm_v` when `lane == 1`
x86/x86-64 processors with SSE4.1 instruction set

- `v128.load8_lane`
  - `y = v128.load8_lane(mem, x, lane)` is lowered to `MOVDQA xmm_y, xmm_x` + `PINSRB xmm_y, [mem], lane`
- `v128.load32_lane`
  - `y = v128.load32_lane(mem, x, lane)` is lowered to `MOVAPS xmm_y, xmm_x` + `INSERTPS xmm_y, [mem], (lane << 4)`
- `v128.store8_lane`
  - `v128.store8_lane(mem, v, lane)` is lowered to `PEXTRB [mem], xmm_v, lane`
- `v128.store16_lane`
  - `v128.store16_lane(mem, v, lane)` is lowered to `PEXTRW [mem], xmm_v, lane`
- `v128.store32_lane`
  - `v128.store32_lane(mem, v, lane)` is lowered to:
    - `MOVSS [mem], xmm_v` when `lane == 0`
    - `EXTRACTPS [mem], xmm_v, lane` otherwise

x86/x86-64 processors with SSE2 instruction set
- `v128.load8_lane`
  - `y = v128.load8_lane(mem, x, lane)` is lowered to:
    - `MOVD eax, xmm_x` + `MOV al, byte [mem]` + `MOVDQA xmm_y, xmm_x` + `PINSRW xmm_y, eax, 0` when `lane == 0`
    - `MOVD eax, xmm_x` + `MOV ah, byte [mem]` + `MOVDQA xmm_y, xmm_x` + `PINSRW xmm_y, eax, 0` when `lane == 1`
    - `PEXTRW eax, xmm_x, (lane/2)` + `MOV al, byte [mem]` + `MOVDQA xmm_y, xmm_x` + `PINSRW xmm_y, eax, (lane/2)` when `lane` is even and `lane >= 2`
    - `PEXTRW eax, xmm_x, (lane/2)` + `MOV ah, byte [mem]` + `MOVDQA xmm_y, xmm_x` + `PINSRW xmm_y, eax, (lane/2)` when `lane` is odd and `lane >= 2`
- `v128.load16_lane`
  - `y = v128.load16_lane(mem, x, lane)` is lowered to `MOVDQA xmm_y, xmm_x` + `PINSRW xmm_y, [mem], lane`
- `v128.load32_lane`
  - `y = v128.load32_lane(mem, x, lane)` is lowered to:
    - `MOVAPS xmm_y, xmm_x` + `MOVSS xmm_tmp, [mem]` + `MOVSS xmm_y, xmm_tmp` when `lane == 0`
    - `MOVAPS xmm_y, xmm_x` + `PINSRW xmm_y, [mem], (lane*2)` + `PINSRW xmm_y, [mem+2], (lane*2+1)` otherwise
- `v128.load64_lane`
  - `y = v128.load64_lane(mem, x, lane)` is lowered to:
    - `MOVAPS xmm_y, xmm_x` + `MOVLPS xmm_y, [mem]` when `lane == 0`
    - `MOVAPS xmm_y, xmm_x` + `MOVHPS xmm_y, [mem]` when `lane == 1`
- `v128.store8_lane`
  - `v128.store8_lane(mem, v, lane)` is lowered to:
    - `MOVD eax, xmm_v` + `MOV byte [mem], al` when `lane == 0`
    - `MOVD eax, xmm_v` + `MOV byte [mem], ah` when `lane == 1`
    - `PEXTRW eax, xmm_v, (lane/2)` + `MOV byte [mem], al` when `lane` is even and `lane >= 2`
    - `PEXTRW eax, xmm_v, (lane/2)` + `MOV byte [mem], ah` when `lane` is odd and `lane >= 2`
- `v128.store16_lane`
  - `v128.store16_lane(mem, v, lane)` is lowered to `PEXTRW r_tmp, xmm_v, lane` + `MOV word [mem], r_tmp`
- `v128.store32_lane`
  - `v128.store32_lane(mem, v, lane)` is lowered to:
    - `MOVSS [mem], xmm_v` when `lane == 0`
    - `PSHUFD xmm_tmp, xmm_v, lane` + `MOVD [mem], xmm_tmp` otherwise
- `v128.store64_lane`
  - `v128.store64_lane(mem, v, lane)` is lowered to:
    - `MOVLPS [mem], xmm_v` when `lane == 0`
    - `MOVHPS [mem], xmm_v` when `lane == 1`
ARM64 processors

- `v128.load8_lane`
  - `y = v128.load8_lane(mem, x, lane)` is lowered to `MOV Qy, Qx` + `LD1 {Vy.B}[lane], [Xmem]`
- `v128.load16_lane`
  - `y = v128.load16_lane(mem, x, lane)` is lowered to `MOV Qy, Qx` + `LD1 {Vy.H}[lane], [Xmem]`
- `v128.load32_lane`
  - `y = v128.load32_lane(mem, x, lane)` is lowered to `MOV Qy, Qx` + `LD1 {Vy.S}[lane], [Xmem]`
- `v128.load64_lane`
  - `y = v128.load64_lane(mem, x, lane)` is lowered to `MOV Qy, Qx` + `LD1 {Vy.D}[lane], [Xmem]`
- `v128.store8_lane`
  - `v128.store8_lane(mem, v, lane)` is lowered to `ST1 {Vv.B}[lane], [Xmem]`
- `v128.store16_lane`
  - `v128.store16_lane(mem, v, lane)` is lowered to `ST1 {Vv.H}[lane], [Xmem]`
- `v128.store32_lane`
  - `v128.store32_lane(mem, v, lane)` is lowered to `ST1 {Vv.S}[lane], [Xmem]`
- `v128.store64_lane`
  - `v128.store64_lane(mem, v, lane)` is lowered to `ST1 {Vv.D}[lane], [Xmem]`
ARMv7 processors with NEON instruction set

- `v128.load8_lane`
  - `y = v128.load8_lane(mem, x, lane)` is lowered to:
    - `VMOV Qy, Qx` + `VLD1.8 {Dy_lo[lane]}, [Xmem]` when `lane < 8`
    - `VMOV Qy, Qx` + `VLD1.8 {Dy_hi[(lane-8)]}, [Xmem]` when `lane >= 8`
- `v128.load16_lane`
  - `y = v128.load16_lane(mem, x, lane)` is lowered to:
    - `VMOV Qy, Qx` + `VLD1.16 {Dy_lo[lane]}, [Xmem]` when `lane < 4`
    - `VMOV Qy, Qx` + `VLD1.16 {Dy_hi[(lane-4)]}, [Xmem]` when `lane >= 4`
- `v128.load32_lane`
  - `y = v128.load32_lane(mem, x, lane)` is lowered to:
    - `VMOV Qy, Qx` + `VLD1.32 {Dy_lo[lane]}, [Xmem]` when `lane < 2`
    - `VMOV Qy, Qx` + `VLD1.32 {Dy_hi[(lane-2)]}, [Xmem]` when `lane >= 2`
- `v128.load64_lane`
  - `y = v128.load64_lane(mem, x, lane)` is lowered to:
    - `VMOV Dy_hi, Dx_hi` + `VLD1.64 {Dy_lo}, [Xmem]` when `lane == 0`
    - `VMOV Dy_lo, Dx_lo` + `VLD1.64 {Dy_hi}, [Xmem]` when `lane == 1`
- `v128.store8_lane`
  - `v128.store8_lane(mem, v, lane)` is lowered to:
    - `VST1.8 {Dv_lo[lane]}, [Xmem]` when `lane < 8`
    - `VST1.8 {Dv_hi[(lane-8)]}, [Xmem]` when `lane >= 8`
- `v128.store16_lane`
  - `v128.store16_lane(mem, v, lane)` is lowered to:
    - `VST1.16 {Dv_lo[lane]}, [Xmem]` when `lane < 4`
    - `VST1.16 {Dv_hi[(lane-4)]}, [Xmem]` when `lane >= 4`
- `v128.store32_lane`
  - `v128.store32_lane(mem, v, lane)` is lowered to:
    - `VST1.32 {Dv_lo[lane]}, [Xmem]` when `lane < 2`
    - `VST1.32 {Dv_hi[(lane-2)]}, [Xmem]` when `lane >= 2`
- `v128.store64_lane`
  - `v128.store64_lane(mem, v, lane)` is lowered to:
    - `VST1.64 {Dv_lo}, [Xmem]` when `lane == 0`
    - `VST1.64 {Dv_hi}, [Xmem]` when `lane == 1`