
Load Lane and Store Lane instructions #350

Merged 1 commit into WebAssembly:master on Jan 12, 2021

Conversation

@Maratyszcza (Contributor) commented Sep 18, 2020

Introduction

Both the x86 SSE4.1 and ARM NEON instruction sets include instructions which load or store a single lane of a SIMD register, and this PR introduces equivalent instructions in WebAssembly SIMD. The single-lane load and store instructions cover two broad use-cases:

  1. Non-contiguous loads and stores, where we need to gather elements from disjoint memory locations into a single SIMD vector, or scatter elements of a single SIMD vector to disjoint locations.
  2. Processing fewer than 128 bits of data. Sometimes the algorithm or data structures simply don't expose enough data to fill all 128 bits of a SIMD vector, but the code would nevertheless benefit from processing the available elements in parallel (e.g. adding 8 bytes with one SIMD instruction rather than eight scalar instructions).

Load-Lane instructions complement the Load-Zero instructions (#237), but have different performance characteristics: a non-contiguous load sequence based on Load-Zero instructions results in lower latency at the cost of throughput, while a sequence based on Load-Lane instructions trades that low latency for higher throughput. Moreover, even when Load-Lane instructions are used, the first element is typically loaded with a Load-Zero instruction and the remaining elements are inserted with Load-Lane instructions, as sketched below.
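For illustration, a minimal C sketch of that gather pattern for four 32-bit elements is shown below. It assumes the wasm_v128_load32_zero / wasm_v128_load32_lane intrinsic spellings that wasm_simd128.h later adopted (at the time of this PR only prototype builtins existed), so treat the names as an assumption rather than part of the proposal:

```c
#include <wasm_simd128.h>

// Gather four floats from disjoint addresses into one v128.
// Lane indices must be compile-time constants.
static inline v128_t gather4_f32(const float *p0, const float *p1,
                                 const float *p2, const float *p3) {
  v128_t v = wasm_v128_load32_zero(p0);  // lane 0 loaded, lanes 1-3 zeroed
  v = wasm_v128_load32_lane(p1, v, 1);   // insert into lane 1
  v = wasm_v128_load32_lane(p2, v, 2);   // insert into lane 2
  v = wasm_v128_load32_lane(p3, v, 3);   // insert into lane 3
  return v;
}
```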

Load-Lane instructions can be emulated via a combination of scalar loads and replace_lane instructions, and Store-Lane instructions can be emulated via a combination of extract_lane instructions and scalar stores. However, these emulation sequences are substantially less efficient than direct lane loads/stores on a SIMD register:

  1. They need an extra general-purpose register for the intermediate scalar value.
  2. They use two or more instructions where one would suffice on both ARM and x86.
  3. For integer-typed replace_lane/extract_lane, they move values between SIMD and general-purpose registers, which comes at a high latency and throughput cost.

Explicit instructions for loading and storing lanes alleviate all these concerns.
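To make the comparison concrete, here is a hedged C sketch of the two forms for a 32-bit lane load. The replace_lane intrinsic already exists in wasm_simd128.h; the load32_lane spelling follows the intrinsic name that was standardized later and is an assumption here:

```c
#include <wasm_simd128.h>
#include <stdint.h>

// Emulated: scalar load into a GPR, then replace_lane (two or more
// instructions, plus a SIMD<->GPR transfer for integer lanes).
static inline v128_t load_lane1_emulated(const int32_t *p, v128_t v) {
  int32_t x = *p;  // scalar load
  return wasm_i32x4_replace_lane(v, 1, x);
}

// Direct: a single v128.load32_lane instruction.
static inline v128_t load_lane1_direct(const int32_t *p, v128_t v) {
  return wasm_v128_load32_lane(p, v, 1);
}
```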

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
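Before the per-ISA mappings, the intended semantics can be stated as a short scalar reference model. This is a minimal sketch with hypothetical type and helper names, assuming the little-endian lane layout of the WebAssembly memory model:

```c
#include <stdint.h>
#include <string.h>

// Hypothetical scalar model of a v128 value as 16 bytes in lane order
// (lane 0 is the lowest-addressed byte, matching Wasm's little-endian layout).
typedef struct { uint8_t bytes[16]; } v128_model;

// v128.loadN_lane: read N/8 bytes from memory into one lane, leaving every
// other lane of the input vector unchanged, and return the updated vector.
static v128_model load_lane_model(const void *mem, v128_model v,
                                  int lane, int lane_bytes) {
  memcpy(&v.bytes[lane * lane_bytes], mem, (size_t)lane_bytes);
  return v;
}

// v128.storeN_lane: write the N/8 bytes of one lane to memory; the vector and
// the rest of memory are untouched.
static void store_lane_model(void *mem, v128_model v, int lane, int lane_bytes) {
  memcpy(mem, &v.bytes[lane * lane_bytes], (size_t)lane_bytes);
}
```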

x86/x86-64 processors with AVX instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to VPINSRB xmm_y, xmm_x, [mem], lane
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to VPINSRW xmm_y, xmm_x, [mem], lane
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to VINSERTPS xmm_y, xmm_x, [mem], (lane << 4)
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to:
      • VMOVLPS xmm_y, xmm_x, [mem] when lane == 0
      • VMOVHPS xmm_y, xmm_x, [mem] when lane == 1
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to VPEXTRB [mem], xmm_v, lane
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to VPEXTRW [mem], xmm_v, lane
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to (see the selection sketch after this list):
      • VMOVSS [mem], xmm_v when lane == 0
      • VEXTRACTPS [mem], xmm_v, lane otherwise
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to:
      • VMOVLPS [mem], xmm_v when lane == 0
      • VMOVHPS [mem], xmm_v when lane == 1
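The lane-dependent cases above (v128.load32_lane's immediate, v128.store32_lane, and v128.load64_lane) are the only places where this AVX mapping branches on the lane immediate. The following C sketch merely restates that selection logic; the helper names are hypothetical and not taken from any real backend:

```c
#include <stdio.h>

// v128.load32_lane -> VINSERTPS packs the destination lane into bits 4-5 of
// its imm8, hence the (lane << 4) in the mapping above.
static int vinsertps_imm(int lane) { return lane << 4; }

// v128.store32_lane -> VMOVSS for lane 0, VEXTRACTPS otherwise.
static const char *store32_lane_insn(int lane) {
  return lane == 0 ? "VMOVSS [mem], xmm_v"
                   : "VEXTRACTPS [mem], xmm_v, lane";
}

// v128.load64_lane -> VMOVLPS for lane 0, VMOVHPS for lane 1.
static const char *load64_lane_insn(int lane) {
  return lane == 0 ? "VMOVLPS xmm_y, xmm_x, [mem]"
                   : "VMOVHPS xmm_y, xmm_x, [mem]";
}

int main(void) {
  printf("VINSERTPS imm for lane 2: 0x%02x\n", vinsertps_imm(2));  // 0x20
  printf("store32_lane, lane 3: %s\n", store32_lane_insn(3));
  printf("load64_lane,  lane 1: %s\n", load64_lane_insn(1));
  return 0;
}
```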

x86/x86-64 processors with SSE4.1 instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to MOVDQA xmm_y, xmm_x + PINSRB xmm_y, [mem], lane
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to MOVAPS xmm_y, xmm_x + INSERTPS xmm_y, [mem], (lane << 4)
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to PEXTRB [mem], xmm_v, lane
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to PEXTRW [mem], xmm_v, lane
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to:
      • MOVSS [mem], xmm_v when lane == 0
      • EXTRACTPS [mem], xmm_v, lane otherwise

x86/x86-64 processors with SSE2 instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to (a scalar sketch of the byte-merge trick follows after this list):
      • MOVD eax, xmm_x + MOV al, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, 0 when lane == 0
      • MOVD eax, xmm_x + MOV ah, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, 0 when lane == 1
      • PEXTRW eax, xmm_x, (lane/2) + MOV al, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, (lane/2) when lane is even and lane >= 2
      • PEXTRW eax, xmm_x, (lane/2) + MOV ah, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, (lane/2) when lane is odd and lane >= 2
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to MOVDQA xmm_y, xmm_x + PINSRW xmm_y, [mem], lane
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to:
      • MOVAPS xmm_y, xmm_x + MOVSS xmm_tmp, [mem] + MOVSS xmm_y, xmm_tmp when lane == 0
      • MOVAPS xmm_y, xmm_x + PINSRW xmm_y, [mem], (lane*2) + PINSRW xmm_y, [mem+2], (lane*2+1) otherwise
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to:
      • MOVAPS xmm_y, xmm_x + MOVLPS xmm_y, [mem] when lane == 0
      • MOVAPS xmm_y, xmm_x + MOVHPS xmm_y, [mem] when lane == 1
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to:
      • MOVD eax, xmm_v + MOV byte [mem], al when lane == 0
      • MOVD eax, xmm_v + MOV byte [mem], ah when lane == 1
      • PEXTRW eax, xmm_v, (lane/2) + MOV byte [mem], al when lane is even and lane >= 2
      • PEXTRW eax, xmm_v, (lane/2) + MOV byte [mem], ah when lane is odd and lane >= 2
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to PEXTRW r_tmp, xmm_v, lane + MOV word [mem], r_tmp
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to:
      • MOVSS [mem], xmm_v when lane == 0
      • PSHUFD xmm_tmp, xmm_v, lane + MOVD [mem], xmm_tmp otherwise
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to:
      • MOVLPS [mem], xmm_v when lane == 0
      • MOVHPS [mem], xmm_v when lane == 1
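Since SSE2 has no byte-granular insert or extract, the load8_lane and store8_lane sequences above go through the 16-bit word that contains the target byte, hence the al/ah tricks. A minimal scalar sketch of that byte-merge step, with hypothetical helper names, follows:

```c
#include <stdint.h>

// Replace the low (byte_in_word == 0) or high (byte_in_word == 1) byte of a
// 16-bit lane with a newly loaded byte, as the PINSRW-based load8_lane
// sequence does.
static uint16_t merge_byte_into_word(uint16_t word, uint8_t byte, int byte_in_word) {
  return byte_in_word ? (uint16_t)((word & 0x00ffu) | ((uint16_t)byte << 8))
                      : (uint16_t)((word & 0xff00u) | byte);
}

// Conversely, extract the low or high byte of a 16-bit lane, as the
// PEXTRW-based store8_lane sequence does before the scalar MOV.
static uint8_t extract_byte_from_word(uint16_t word, int byte_in_word) {
  return byte_in_word ? (uint8_t)(word >> 8) : (uint8_t)(word & 0xffu);
}
```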

ARM64 processors

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.B}[lane], [Xmem]
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.H}[lane], [Xmem]
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.S}[lane], [Xmem]
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.D}[lane], [Xmem]
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to ST1 {Vv.B}[lane], [Xmem]
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to ST1 {Vv.H}[lane], [Xmem]
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to ST1 {Vv.S}[lane], [Xmem]
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to ST1 {Vv.D}[lane], [Xmem]

ARMv7 processors with NEON instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to:
      • VMOV Qy, Qx + VLD1.8 {Dy_lo[lane]}, [Xmem] when lane < 8
      • VMOV Qy, Qx + VLD1.8 {Dy_hi[(lane-8)]}, [Xmem] when lane >= 8
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to:
      • VMOV Qy, Qx + VLD1.16 {Dy_lo[lane]}, [Xmem] when lane < 4
      • VMOV Qy, Qx + VLD1.16 {Dy_hi[(lane-4)]}, [Xmem] when lane >= 4
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to:
      • VMOV Qy, Qx + VLD1.32 {Dy_lo[lane]}, [Xmem] when lane < 2
      • VMOV Qy, Qx + VLD1.32 {Dy_hi[(lane-2)]}, [Xmem] when lane >= 2
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to:
      • VMOV Dy_hi, Dx_hi + VLD1.64 {Dy_lo}, [Xmem] when lane == 0
      • VMOV Dy_lo, Dx_lo + VLD1.64 {Dy_hi}, [Xmem] when lane == 1
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to:
      • VST1.8 {Dv_lo[lane]}, [Xmem] when lane < 8
      • VST1.8 {Dv_hi[(lane-8)]}, [Xmem] when lane >= 8
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to:
      • VST1.16 {Dv_lo[lane]}, [Xmem] when lane < 4
      • VST1.16 {Dv_hi[(lane-4)]}, [Xmem] when lane >= 4
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to:
      • VST1.32 {Dv_lo[lane]}, [Xmem] when lane < 2
      • VST1.32 {Dv_hi[(lane-2)]}, [Xmem] when lane >= 2
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to:
      • VST1.64 {Dv_lo}, [Xmem] when lane == 0
      • VST1.64 {Dv_hi}, [Xmem] when lane == 1

@Maratyszcza (Contributor, Author) commented

@tlively Please advise which opcodes could be used for these instructions.

@tlively (Member) commented Sep 20, 2020

We're basically out of opcode space, so you should just append these to the end for now. When we did the last renumbering, we thought we were essentially done adding new opcodes, so if we end up including all these newly proposed instructions, we'll probably have to do yet another renumbering.

@Maratyszcza (Contributor, Author) commented

@tlively There isn't enough space at the end, so I inserted them at 0x58-0x5f instead.

@tlively (Member) commented Sep 21, 2020

Opcodes are encoded as ULEB128s, so it's totally fine to use numbers above 0xff.
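For context, here is a minimal sketch of the standard unsigned LEB128 encoding (not code from any Wasm tool), which shows why opcode values above 0xff still encode compactly:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

// Encode an unsigned value as ULEB128; returns the number of bytes written.
static size_t uleb128_encode(uint64_t value, uint8_t *out) {
  size_t n = 0;
  do {
    uint8_t byte = (uint8_t)(value & 0x7f);
    value >>= 7;
    if (value != 0) byte |= 0x80;  // more bytes follow
    out[n++] = byte;
  } while (value != 0);
  return n;
}

int main(void) {
  uint8_t buf[10];
  // An opcode in the 0x58-0x5f range used by this PR fits in a single byte.
  size_t n = uleb128_encode(0x58, buf);
  printf("0x58  -> %zu byte(s): 0x%02x\n", n, buf[0]);
  // A value above 0xff, e.g. 0x100, still needs only two bytes: 0x80 0x02.
  n = uleb128_encode(0x100, buf);
  printf("0x100 -> %zu byte(s): 0x%02x 0x%02x\n", n, buf[0], buf[1]);
  return 0;
}
```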

@Maratyszcza (Contributor, Author) commented

@tlively Good point! Do you prefer to move them to the end, or is it fine to leave the opcodes as they are?

@tlively (Member) commented Sep 21, 2020

No, that's ok. Their current position looks fine. Thanks!

@ngzhian (Member) commented Oct 1, 2020

Note that for ARM, the codegen will require an extra register and an extra instruction compared to what's listed: ld1 supports only no-offset or post-index addressing, so we will have to add the base and offset ourselves before doing the load.

@ngzhian (Member) commented Oct 12, 2020

Prototyped for x64 in https://crrev.com/c/2444578, should see it in canary by tomorrow. @Maratyszcza

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 15, 2020
Prototype the newly proposed load_lane instructions, as specified in
WebAssembly/simd#350. Since these instructions are not
available to origin trial users on Chrome stable, make them opt-in by only
selecting them from intrinsics rather than normal ISel patterns. Since we only
need rough prototypes to measure performance right now, this commit does not
implement all the load and store patterns that would be necessary to make full
use of the offset immediate. However, the full suite of offset tests is included
to make it easy to track improvements in the future.

Since these are the first instructions to have a memarg immediate as well as an
additional immediate, the disassembler needed some additional hacks to be able
to parse them correctly. Making that code more principled is left as future
work.

Differential Revision: https://reviews.llvm.org/D89366
tlively added a commit to tlively/binaryen that referenced this pull request Oct 22, 2020
These instructions are proposed in WebAssembly/simd#350.
This PR implements them throughout Binaryen except in the C/JS APIs and in the
fuzzer, where it leaves TODOs instead. Right now these instructions are just
being implemented for prototyping so adding them to the APIs isn't critical and
they aren't generally available to be fuzzed in Wasm engines.
tlively added a commit to tlively/binaryen that referenced this pull request Oct 22, 2020 (same message as above)
tlively added a commit to WebAssembly/binaryen that referenced this pull request Oct 23, 2020 (same message as above)
@tlively (Member) commented Oct 23, 2020

These instructions have now landed in both LLVM and Binaryen, so they will be ready to use in tip-of-tree Emscripten (usable via emsdk install tot && emsdk activate tot) in a few hours. The builtin functions for these instructions are __builtin_wasm_{load,store}{8,16,32,64}_lane(lane_t*, vec_t, lane_index). Since these are still prototypes, there are no corresponding wasm_simd128.h intrinsics yet.
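A hedged usage sketch of those prototype builtins follows. The builtin names and argument order are taken from the comment above; the concrete pointer and vector typedefs are assumptions that may need to match whatever types the prototype toolchain expects, and the lane index presumably has to be a compile-time constant:

```c
#include <stdint.h>

// Assumed 128-bit vector type standing in for "vec_t" above.
typedef int32_t v128_i32 __attribute__((vector_size(16)));

// v128.load32_lane: insert the 32-bit value at *p into lane 1 of v.
static inline v128_i32 insert_lane1(int32_t *p, v128_i32 v) {
  return __builtin_wasm_load32_lane(p, v, 1);
}

// v128.store32_lane: write lane 2 of v to *p.
static inline void store_lane2(int32_t *p, v128_i32 v) {
  __builtin_wasm_store32_lane(p, v, 2);
}
```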

@Maratyszcza (Contributor, Author) commented

I evaluated the performance impact of these instructions by modifying the WebAssembly SIMD microkernels for the Sigmoid operator in the XNNPACK library of neural network operators to use the v128.load32_lane instruction. The results are below:

| Processor | WAsm SIMD + v128.load32_zero + v128.load32_lane (this PR) | WAsm SIMD + v128.load32_zero (PR #237) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 6.57 GB/s | 6.35 GB/s | 3% |
| AMD PRO A10-8700B | 3.19 GB/s | 3.10 GB/s | 3% |
| Snapdragon 670 (Pixel 3a) | 1.55 GB/s | 1.49 GB/s | 4% |

The code modifications can be seen in google/XNNPACK#1016 (for the baseline version with v128.load32_zero) and in google/XNNPACK#1199 (for the optimized version with both v128.load32_zero and v128.load32_lane).

@Maratyszcza (Contributor, Author) commented

Attn @abrown

@omnisip commented Dec 22, 2020

Discussed in (#402 12/22/2020 Sync Meeting).

Provisional Voting Results:
2 SF - 5 F - 1 N

Minutes are here.

| `v128.store8_lane` | `0x5c`| m:memarg, i:ImmLaneIdx16 |
| `v128.store16_lane` | `0x5d`| m:memarg, i:ImmLaneIdx16 |
| `v128.store32_lane` | `0x5e`| m:memarg, i:ImmLaneIdx16 |
| `v128.store64_lane` | `0x5f`| m:memarg, i:ImmLaneIdx16 |
@ngzhian (Member) commented on the lines above, Jan 11, 2021

ImmLaneIdx16 needs to be updated for {load,store}_{16,32,64} to ImmLaneIdx8, ImmLaneIdx4, and ImmLaneIdx2 respectively.

@Maratyszcza (Contributor, Author) replied:
Fixed

@Maratyszcza force-pushed the store-lane branch 2 times, most recently from 09f5bf5 to f93e5e8, on January 11, 2021 18:08
@dtig (Member) left a comment:

Please add the new operations to ImplementationStatus.md as well.

@Maratyszcza force-pushed the store-lane branch 2 times, most recently from 2060759 to 54b98cd, on January 11, 2021 19:51
@Maratyszcza (Contributor, Author) commented

@dtig Done

@Maratyszcza (Contributor, Author) commented

Rebased on top of merged PRs

@Maratyszcza (Contributor, Author) commented

Rebased once again

@dtig merged commit 32d700a into WebAssembly:master on Jan 12, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021
ngzhian added a commit that referenced this pull request Feb 9, 2021
Load lane and store lane instructions added in #350.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 24, 2021 (same message as the llvm/llvm-project commit above)