
Load Lane and Store Lane instructions #350

Merged 1 commit into WebAssembly:master on Jan 12, 2021

Conversation

@Maratyszcza (Contributor) commented Sep 18, 2020

Introduction

Both the x86 SSE4.1 and ARM NEON instruction sets include instructions which load or store a single lane of a SIMD register, and this PR introduces equivalent instructions in WebAssembly SIMD. The single-lane load and store instructions cover two broad use-cases:

  1. Non-contiguous loads and stores, where we need to gather elements from disjoint memory locations into a single SIMD vector, or scatter elements of a single SIMD vector to disjoint locations.
  2. Processing fewer than 128 bits of data. Sometimes the algorithm or data structures simply don't expose enough data to fill all 128 bits of a SIMD vector, but the code would nevertheless benefit from processing the available elements in parallel (e.g. adding 8 bytes with one SIMD instruction rather than eight scalar instructions).

Load-Lane instructions complement the Load-Zero instructions (#237), but have different performance characteristics: a non-contiguous load sequence based on Load-Zero instructions results in lower latency at the cost of throughput, while a sequence based on Load-Lane instructions trades that low latency for higher throughput. Moreover, even when Load-Lane instructions are used, the first element is typically loaded with a Load-Zero instruction and the remaining elements are inserted with Load-Lane instructions, as sketched below.
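For illustration, a minimal C sketch of that gather pattern for four 32-bit elements is shown below. It assumes the wasm_v128_load32_zero / wasm_v128_load32_lane intrinsic spellings that wasm_simd128.h later adopted (at the time of this PR only prototype builtins existed), so treat the names as an assumption rather than part of the proposal:

```c
#include <wasm_simd128.h>

// Gather four floats from disjoint addresses into one v128.
// Lane indices must be compile-time constants.
static inline v128_t gather4_f32(const float *p0, const float *p1,
                                 const float *p2, const float *p3) {
  v128_t v = wasm_v128_load32_zero(p0);  // lane 0 loaded, lanes 1-3 zeroed
  v = wasm_v128_load32_lane(p1, v, 1);   // insert into lane 1
  v = wasm_v128_load32_lane(p2, v, 2);   // insert into lane 2
  v = wasm_v128_load32_lane(p3, v, 3);   // insert into lane 3
  return v;
}
```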

Load-Lane instructions can be emulated via a combination of scalar loads and replace_lane instructions, and Store-Lane instructions can be emulated via a combination of extract_lane instructions and scalar stores. However, these emulation sequences are substantially less efficient than direct lane loads/stores on a SIMD register:

  1. They need an extra general-purpose register for the intermediate scalar value.
  2. They use two or more instructions where one would suffice on both ARM and x86.
  3. For integer-typed replace_lane/extract_lane, they move values between SIMD and general-purpose registers, which comes at a high latency and throughput cost.

Explicit instructions for loading and storing lanes alleviate all these concerns.
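To make the comparison concrete, here is a hedged C sketch of the two forms for a 32-bit lane load. The replace_lane intrinsic already exists in wasm_simd128.h; the load32_lane spelling follows the intrinsic name that was standardized later and is an assumption here:

```c
#include <wasm_simd128.h>
#include <stdint.h>

// Emulated: scalar load into a GPR, then replace_lane (two or more
// instructions, plus a SIMD<->GPR transfer for integer lanes).
static inline v128_t load_lane1_emulated(const int32_t *p, v128_t v) {
  int32_t x = *p;  // scalar load
  return wasm_i32x4_replace_lane(v, 1, x);
}

// Direct: a single v128.load32_lane instruction.
static inline v128_t load_lane1_direct(const int32_t *p, v128_t v) {
  return wasm_v128_load32_lane(p, v, 1);
}
```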

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
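Before the per-ISA mappings, the intended semantics can be stated as a short scalar reference model. This is a minimal sketch with hypothetical type and helper names, assuming the little-endian lane layout of the WebAssembly memory model:

```c
#include <stdint.h>
#include <string.h>

// Hypothetical scalar model of a v128 value as 16 bytes in lane order
// (lane 0 is the lowest-addressed byte, matching Wasm's little-endian layout).
typedef struct { uint8_t bytes[16]; } v128_model;

// v128.loadN_lane: read N/8 bytes from memory into one lane, leaving every
// other lane of the input vector unchanged, and return the updated vector.
static v128_model load_lane_model(const void *mem, v128_model v,
                                  int lane, int lane_bytes) {
  memcpy(&v.bytes[lane * lane_bytes], mem, (size_t)lane_bytes);
  return v;
}

// v128.storeN_lane: write the N/8 bytes of one lane to memory; the vector and
// the rest of memory are untouched.
static void store_lane_model(void *mem, v128_model v, int lane, int lane_bytes) {
  memcpy(mem, &v.bytes[lane * lane_bytes], (size_t)lane_bytes);
}
```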

x86/x86-64 processors with AVX instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to VPINSRB xmm_y, xmm_x, [mem], lane
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to VPINSRW xmm_y, xmm_x, [mem], lane
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to VINSERTPS xmm_y, xmm_x, [mem], (lane << 4)
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to:
      • VMOVLPS xmm_y, xmm_x, [mem] when lane == 0
      • VMOVHPS xmm_y, xmm_x, [mem] when lane == 1
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to VPEXTRB [mem], xmm_v, lane
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to VPEXTRW [mem], xmm_v, lane
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to (see the selection sketch after this list):
      • VMOVSS [mem], xmm_v when lane == 0
      • VEXTRACTPS [mem], xmm_v, lane otherwise
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to:
      • VMOVLPS [mem], xmm_v when lane == 0
      • VMOVHPS [mem], xmm_v when lane == 1
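The lane-dependent cases above (v128.load32_lane's immediate, v128.store32_lane, and v128.load64_lane) are the only places where this AVX mapping branches on the lane immediate. The following C sketch merely restates that selection logic; the helper names are hypothetical and not taken from any real backend:

```c
#include <stdio.h>

// v128.load32_lane -> VINSERTPS packs the destination lane into bits 4-5 of
// its imm8, hence the (lane << 4) in the mapping above.
static int vinsertps_imm(int lane) { return lane << 4; }

// v128.store32_lane -> VMOVSS for lane 0, VEXTRACTPS otherwise.
static const char *store32_lane_insn(int lane) {
  return lane == 0 ? "VMOVSS [mem], xmm_v"
                   : "VEXTRACTPS [mem], xmm_v, lane";
}

// v128.load64_lane -> VMOVLPS for lane 0, VMOVHPS for lane 1.
static const char *load64_lane_insn(int lane) {
  return lane == 0 ? "VMOVLPS xmm_y, xmm_x, [mem]"
                   : "VMOVHPS xmm_y, xmm_x, [mem]";
}

int main(void) {
  printf("VINSERTPS imm for lane 2: 0x%02x\n", vinsertps_imm(2));  // 0x20
  printf("store32_lane, lane 3: %s\n", store32_lane_insn(3));
  printf("load64_lane,  lane 1: %s\n", load64_lane_insn(1));
  return 0;
}
```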

x86/x86-64 processors with SSE4.1 instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to MOVDQA xmm_y, xmm_x + PINSRB xmm_y, [mem], lane
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to MOVAPS xmm_y, xmm_x + INSERTPS xmm_y, [mem], (lane << 4)
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to PEXTRB [mem], xmm_v, lane
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to PEXTRW [mem], xmm_v, lane
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to:
      • MOVSS [mem], xmm_v when lane == 0
      • EXTRACTPS [mem], xmm_v, lane otherwise

x86/x86-64 processors with SSE2 instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to (a scalar sketch of the byte-merge trick follows after this list):
      • MOVD eax, xmm_x + MOV al, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, 0 when lane == 0
      • MOVD eax, xmm_x + MOV ah, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, 0 when lane == 1
      • PEXTRW eax, xmm_x, (lane/2) + MOV al, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, (lane/2) when lane is even and lane >= 2
      • PEXTRW eax, xmm_x, (lane/2) + MOV ah, byte [mem] + MOVDQA xmm_y, xmm_x + PINSRW xmm_y, eax, (lane/2) when lane is odd and lane >= 2
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to MOVDQA xmm_y, xmm_x + PINSRW xmm_y, [mem], lane
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to:
      • MOVAPS xmm_y, xmm_x + MOVSS xmm_tmp, [mem] + MOVSS xmm_y, xmm_tmp when lane == 0
      • MOVAPS xmm_y, xmm_x + PINSRW xmm_y, [mem], (lane*2) + PINSRW xmm_y, [mem+2], (lane*2+1) otherwise
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to:
      • MOVAPS xmm_y, xmm_x + MOVLPS xmm_y, [mem] when lane == 0
      • MOVAPS xmm_y, xmm_x + MOVHPS xmm_y, [mem] when lane == 1
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to:
      • MOVD eax, xmm_v + MOV byte [mem], al when lane == 0
      • MOVD eax, xmm_v + MOV byte [mem], ah when lane == 1
      • PEXTRW eax, xmm_v, (lane/2) + MOV byte [mem], al when lane is even and lane >= 2
      • PEXTRW eax, xmm_v, (lane/2) + MOV byte [mem], ah when lane is odd and lane >= 2
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to PEXTRW r_tmp, xmm_v, lane + MOV word [mem], r_tmp
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to:
      • MOVSS [mem], xmm_v when lane == 0
      • PSHUFD xmm_tmp, xmm_v, lane + MOVD [mem], xmm_tmp otherwise
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to:
      • MOVLPS [mem], xmm_v when lane == 0
      • MOVHPS [mem], xmm_v when lane == 1
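Since SSE2 has no byte-granular insert or extract, the load8_lane and store8_lane sequences above go through the 16-bit word that contains the target byte, hence the al/ah tricks. A minimal scalar sketch of that byte-merge step, with hypothetical helper names, follows:

```c
#include <stdint.h>

// Replace the low (byte_in_word == 0) or high (byte_in_word == 1) byte of a
// 16-bit lane with a newly loaded byte, as the PINSRW-based load8_lane
// sequence does.
static uint16_t merge_byte_into_word(uint16_t word, uint8_t byte, int byte_in_word) {
  return byte_in_word ? (uint16_t)((word & 0x00ffu) | ((uint16_t)byte << 8))
                      : (uint16_t)((word & 0xff00u) | byte);
}

// Conversely, extract the low or high byte of a 16-bit lane, as the
// PEXTRW-based store8_lane sequence does before the scalar MOV.
static uint8_t extract_byte_from_word(uint16_t word, int byte_in_word) {
  return byte_in_word ? (uint8_t)(word >> 8) : (uint8_t)(word & 0xffu);
}
```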

ARM64 processors

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.B}[lane], [Xmem]
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.H}[lane], [Xmem]
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.S}[lane], [Xmem]
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to MOV Qy, Qx + LD1 {Vy.D}[lane], [Xmem]
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to ST1 {Vv.B}[lane], [Xmem]
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to ST1 {Vv.H}[lane], [Xmem]
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to ST1 {Vv.S}[lane], [Xmem]
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to ST1 {Vv.D}[lane], [Xmem]

ARMv7 processors with NEON instruction set

  • v128.load8_lane

    • y = v128.load8_lane(mem, x, lane) is lowered to:
      • VMOV Qy, Qx + VLD1.8 {Dy_lo[lane]}, [Xmem] when lane < 8
      • VMOV Qy, Qx + VLD1.8 {Dy_hi[(lane-8)]}, [Xmem] when lane >= 8
  • v128.load16_lane

    • y = v128.load16_lane(mem, x, lane) is lowered to:
      • VMOV Qy, Qx + VLD1.16 {Dy_lo[lane]}, [Xmem] when lane < 4
      • VMOV Qy, Qx + VLD1.16 {Dy_hi[(lane-4)]}, [Xmem] when lane >= 4
  • v128.load32_lane

    • y = v128.load32_lane(mem, x, lane) is lowered to:
      • VMOV Qy, Qx + VLD1.32 {Dy_lo[lane]}, [Xmem] when lane < 2
      • VMOV Qy, Qx + VLD1.32 {Dy_hi[(lane-2)]}, [Xmem] when lane >= 2
  • v128.load64_lane

    • y = v128.load64_lane(mem, x, lane) is lowered to:
      • VMOV Dy_hi, Dx_hi + VLD1.64 {Dy_lo}, [Xmem] when lane == 0
      • VMOV Dy_lo, Dx_lo + VLD1.64 {Dy_hi}, [Xmem] when lane == 1
  • v128.store8_lane

    • v128.store8_lane(mem, v, lane) is lowered to:
      • VST1.8 {Dv_lo[lane]}, [Xmem] when lane < 8
      • VST1.8 {Dv_hi[(lane-8)]}, [Xmem] when lane >= 8
  • v128.store16_lane

    • v128.store16_lane(mem, v, lane) is lowered to:
      • VST1.16 {Dv_lo[lane]}, [Xmem] when lane < 4
      • VST1.16 {Dv_hi[(lane-4)]}, [Xmem] when lane >= 4
  • v128.store32_lane

    • v128.store32_lane(mem, v, lane) is lowered to:
      • VST1.32 {Dv_lo[lane]}, [Xmem] when lane < 2
      • VST1.32 {Dv_hi[(lane-2)]}, [Xmem] when lane >= 2
  • v128.store64_lane

    • v128.store64_lane(mem, v, lane) is lowered to:
      • VST1.64 {Dv_lo}, [Xmem] when lane == 0
      • VST1.64 {Dv_hi}, [Xmem] when lane == 1

@Maratyszcza (Contributor, Author) commented

@tlively Please advise which opcodes could be used for these instructions.

@tlively (Member) commented Sep 20, 2020

We're basically out of opcode space, so you should just append these to the end for now. When we did the last renumbering, we thought we were essentially done adding new opcodes, so if we end up including all these newly proposed instructions, we'll probably have to do yet another renumbering.

@Maratyszcza (Contributor, Author) commented

@tlively There isn't enough space at the end, so I inserted them at 0x58-0x5f instead.

@tlively (Member) commented Sep 21, 2020

Opcodes are encoded as ULEB128s, so it's totally fine to use numbers above 0xff.
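For context, here is a minimal sketch of the standard unsigned LEB128 encoding (not code from any Wasm tool), which shows why opcode values above 0xff still encode compactly:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

// Encode an unsigned value as ULEB128; returns the number of bytes written.
static size_t uleb128_encode(uint64_t value, uint8_t *out) {
  size_t n = 0;
  do {
    uint8_t byte = (uint8_t)(value & 0x7f);
    value >>= 7;
    if (value != 0) byte |= 0x80;  // more bytes follow
    out[n++] = byte;
  } while (value != 0);
  return n;
}

int main(void) {
  uint8_t buf[10];
  // An opcode in the 0x58-0x5f range used by this PR fits in a single byte.
  size_t n = uleb128_encode(0x58, buf);
  printf("0x58  -> %zu byte(s): 0x%02x\n", n, buf[0]);
  // A value above 0xff, e.g. 0x100, still needs only two bytes: 0x80 0x02.
  n = uleb128_encode(0x100, buf);
  printf("0x100 -> %zu byte(s): 0x%02x 0x%02x\n", n, buf[0], buf[1]);
  return 0;
}
```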

@Maratyszcza (Contributor, Author) commented

@tlively Good point! Do you prefer to move them to the end, or is it fine to leave the opcodes as they are?

@tlively (Member) commented Sep 21, 2020

No, that's ok. Their current position looks fine. Thanks!

@ngzhian (Member) commented Oct 1, 2020

Note that for ARM, the codegen will require an extra register and an extra instruction compared to what's listed: ld1 supports only no-offset or post-index addressing, so we will have to add the base and offset ourselves before doing the load.

@ngzhian (Member) commented Oct 12, 2020

Prototyped for x64 in https://crrev.com/c/2444578, should see it in canary by tomorrow. @Maratyszcza

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 15, 2020
Prototype the newly proposed load_lane instructions, as specified in
WebAssembly/simd#350. Since these instructions are not
available to origin trial users on Chrome stable, make them opt-in by only
selecting them from intrinsics rather than normal ISel patterns. Since we only
need rough prototypes to measure performance right now, this commit does not
implement all the load and store patterns that would be necessary to make full
use of the offset immediate. However, the full suite of offset tests is included
to make it easy to track improvements in the future.

Since these are the first instructions to have a memarg immediate as well as an
additional immediate, the disassembler needed some additional hacks to be able
to parse them correctly. Making that code more principled is left as future
work.

Differential Revision: https://reviews.llvm.org/D89366
tlively added a commit to tlively/binaryen that referenced this pull request Oct 22, 2020
These instructions are proposed in WebAssembly/simd#350.
This PR implements them throughout Binaryen except in the C/JS APIs and in the
fuzzer, where it leaves TODOs instead. Right now these instructions are just
being implemented for prototyping so adding them to the APIs isn't critical and
they aren't generally available to be fuzzed in Wasm engines.
tlively added a commit to tlively/binaryen that referenced this pull request Oct 22, 2020 (same message as above)
tlively added a commit to WebAssembly/binaryen that referenced this pull request Oct 23, 2020 (same message as above)
@tlively (Member) commented Oct 23, 2020

These instructions have now landed in both LLVM and Binaryen, so they will be ready to use in tip-of-tree Emscripten (usable via emsdk install tot && emsdk activate tot) in a few hours. The builtin functions for these instructions are __builtin_wasm_{load,store}{8,16,32,64}_lane(lane_t*, vec_t, lane_index). Since these are still prototypes, there are no corresponding wasm_simd128.h intrinsics yet.
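A hedged usage sketch of those prototype builtins follows. The builtin names and argument order are taken from the comment above; the concrete pointer and vector typedefs are assumptions that may need to match whatever types the prototype toolchain expects, and the lane index presumably has to be a compile-time constant:

```c
#include <stdint.h>

// Assumed 128-bit vector type standing in for "vec_t" above.
typedef int32_t v128_i32 __attribute__((vector_size(16)));

// v128.load32_lane: insert the 32-bit value at *p into lane 1 of v.
static inline v128_i32 insert_lane1(int32_t *p, v128_i32 v) {
  return __builtin_wasm_load32_lane(p, v, 1);
}

// v128.store32_lane: write lane 2 of v to *p.
static inline void store_lane2(int32_t *p, v128_i32 v) {
  __builtin_wasm_store32_lane(p, v, 2);
}
```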

@Maratyszcza (Contributor, Author) commented

I evaluated the performance impact of these instructions by modifying the WebAssembly SIMD microkernels for the Sigmoid operator in the XNNPACK library of neural network operators to use the v128.load32_lane instruction. The results are below:

| Processor | WAsm SIMD + v128.load32_zero + v128.load32_lane (this PR) | WAsm SIMD + v128.load32_zero (PR #237) | Speedup |
|---|---|---|---|
| Intel Xeon W-2135 | 6.57 GB/s | 6.35 GB/s | 3% |
| AMD PRO A10-8700B | 3.19 GB/s | 3.10 GB/s | 3% |
| Snapdragon 670 (Pixel 3a) | 1.55 GB/s | 1.49 GB/s | 4% |

The code modifications can be seen in google/XNNPACK#1016 (for the baseline version with v128.load32_zero) and in google/XNNPACK#1199 (for the optimized version with both v128.load32_zero and v128.load32_lane).

@Maratyszcza (Contributor, Author) commented

Attn @abrown

@omnisip commented Dec 22, 2020

Discussed in (#402 12/22/2020 Sync Meeting).

Provisional Voting Results:
2 SF - 5 F - 1 N

Minutes are here.

| `v128.store8_lane` | `0x5c`| m:memarg, i:ImmLaneIdx16 |
| `v128.store16_lane` | `0x5d`| m:memarg, i:ImmLaneIdx16 |
| `v128.store32_lane` | `0x5e`| m:memarg, i:ImmLaneIdx16 |
| `v128.store64_lane` | `0x5f`| m:memarg, i:ImmLaneIdx16 |
@ngzhian (Member) commented on the lines above, Jan 11, 2021

ImmLaneIdx16 needs to be updated for {load,store}_{16,32,64} to ImmLaneIdx8, ImmLaneIdx4, and ImmLaneIdx2 respectively.

@Maratyszcza (Contributor, Author) replied:
Fixed

@Maratyszcza force-pushed the store-lane branch 2 times, most recently from 09f5bf5 to f93e5e8, on January 11, 2021 18:08
@dtig (Member) left a comment:

Please add the new operations to ImplementationStatus.md as well.

@Maratyszcza force-pushed the store-lane branch 2 times, most recently from 2060759 to 54b98cd, on January 11, 2021 19:51
@Maratyszcza (Contributor, Author) commented

@dtig Done

@Maratyszcza (Contributor, Author) commented

Rebased on top of merged PRs

@Maratyszcza (Contributor, Author) commented

Rebased once again

@dtig merged commit 32d700a into WebAssembly:master on Jan 12, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021
ngzhian added a commit that referenced this pull request Feb 9, 2021
Load lane and store lane instructions added in #350.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 24, 2021 (same message as the llvm/llvm-project commit above)