Attempt to optimize the Vector4 element constructor in mono interp #86782

kg · 2023-05-26T01:30:18Z

Looking at dotnet/perf-autofiling-issues#18031 in wasm+interp, the hot path of some of the failures boils down to foo.AsVector128(), where AsVector128 itself boils down to new Vector4(this, ...), which then translates to new Vector4(this.X, this.Y, ...). Because the Vector4 ctor just assigns 4 fields sequentially, this generates a bunch of interp opcodes.

If we initialize Vector4 as if it were a Vector128, we can tap into the interpreter's SIMD support and initialize it in (almost) one vectorized operation, and then on WASM that will use the native vector instruction set.

With these changes a Vector2.AsVector128() call looks roughly like this in the mono interpreter, annotated with my reading of the opcodes:

1b42dba ldc.r4  -> 192                 // 0f
1b42dc2 ldc.r4  -> 200                 // 0f
1b42dca newobj_vt_inlined 96 -> 112    // result = default(Vector4);
1b42dd2 mov.4 88 -> 176                // load src.X
1b42dd8 mov.4 92 -> 184                // load src.Y
1b42dde ldflda 112 -> 112              // &result.X
1b42de6 simd_v128_i4_create 176 -> 128 // Vector128.Create(src.X, src.Y, 0f, 0f)
1b42dec stobj.vt.noref 112, 128        // result = (Vector4)Vector128.Create(src.X, src.Y, 0f, 0f)

I don't know if we want to do this, or whether this is a good way to do it. But I figured a draft PR is a good way to start the discussion. I suspect if we optimize this one ctor a bunch of stuff will get faster in the wasm interp scenario (and maybe all interp targets, but it's hard to be certain).

ghost · 2023-05-26T01:30:53Z

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

Looking at dotnet/perf-autofiling-issues#18031 in wasm+interp, the hot path of some of the failures boils down to foo.AsVector128(), where AsVector128 itself boils down to new Vector4(this, ...), which then translates to new Vector4(this.X, this.Y, ...). Because the Vector4 ctor just assigns 4 fields sequentially, this generates a bunch of interp opcodes.

If we initialize Vector4 as if it were a Vector128, we can tap into the interpreter's SIMD support and initialize it in (almost) one vectorized operation, and then on WASM that will use the native vector instruction set.

With these changes a Vector2.AsVector128() call looks roughly like this in the mono interpreter, annotated with my reading of the opcodes:

1b42dba ldc.r4  -> 192                 // 0f
1b42dc2 ldc.r4  -> 200                 // 0f
1b42dca newobj_vt_inlined 96 -> 112    // result = default(Vector4);
1b42dd2 mov.4 88 -> 176                // load src.X
1b42dd8 mov.4 92 -> 184                // load src.Y
1b42dde ldflda 112 -> 112              // &result.X
1b42de6 simd_v128_i4_create 176 -> 128 // Vector128.Create(src.X, src.Y, 0f, 0f)
1b42dec stobj.vt.noref 112, 128        // result = (Vector4)Vector128.Create(src.X, src.Y, 0f, 0f)

I don't know if we want to do this, or whether this is a good way to do it. But I figured a draft PR is a good way to start the discussion. I suspect if we optimize this one ctor a bunch of stuff will get faster in the wasm interp scenario (and maybe all interp targets, but it's hard to be certain).

Author:	kg
Assignees:	-
Labels:	`NO-MERGE`, `area-System.Numerics`
Milestone:	-

kg · 2023-05-26T01:48:49Z

Learned today that 'ref this' is legal in structs now, I don't think it used to be? In any case, that eliminates the ldflda:

1b42c02 ldc.r4  -> 192
1b42c0a ldc.r4  -> 200
1b42c12 newobj_vt_inlined 96 -> 112
1b42c1a mov.4 88 -> 176
1b42c20 mov.4 92 -> 184
1b42c26 simd_v128_i4_create 176 -> 128
1b42c2c stobj.vt.noref 112, 128

BrzVlad · 2023-05-26T09:04:24Z

I'm not convinced having this on the C# side is overall good design. I think it would make more sense for the interpreter to recognize Vector methods and generate appropriate available opcodes.

Also the output doesn't seem properly optimized and I think doing this by intrinsifying the ctor method as a whole might be better.

1b42c02 ldc.r4  -> 192
1b42c0a ldc.r4  -> 200
1b42c12 newobj_vt_inlined 96 -> 112
1b42c1a mov.4 88 -> 176
1b42c20 mov.4 92 -> 184
1b42c26 simd_v128_i4_create 176 -> 128
1b42c2c stobj.vt.noref 112, 128

should be instead something like

1b42c02 ldc.r4  -> 192
1b42c0a ldc.r4  -> 200
1b42c1a mov.4 88 -> 176
1b42c20 mov.4 92 -> 184
1b42c26 simd_v128_i4_create 176 -> 112

I'm currently working on reusing vectorization for Vector structs

vargaz · 2023-05-26T09:54:29Z

The Vector ctor is marked as Intrinsic, so that suggests the runtime should optimize it.

tannergooding · 2023-05-26T13:26:18Z

The Vector ctor is marked as Intrinsic, so that suggests the runtime should optimize it.

Agreed and that's how RyuJIT handles it today.

-- As an aside, It was recently brought up if we could do more of these in managed code, so we don't have to duplicate logic between RyuJIT and Mono; but there is a general concern over how that will impact codegen and throughput in scenarios these are heavily used. It will certainly put more pressure on the inliner to do the right thing.

kg · 2023-05-26T15:57:01Z

The Vector ctor is marked as Intrinsic, so that suggests the runtime should optimize it.

Agreed and that's how RyuJIT handles it today.

-- As an aside, It was recently brought up if we could do more of these in managed code, so we don't have to duplicate logic between RyuJIT and Mono; but there is a general concern over how that will impact codegen and throughput in scenarios these are heavily used. It will certainly put more pressure on the inliner to do the right thing.

My thinking was that doing it in managed code would mean that AOT would automatically pick up the improvement, but it sounds like intrinsifying it in the interpreter will produce better code without stressing the inliner, so I'm not opposed to doing that instead. I'll give that a try later.

Attempt to optimize the Vector4 element constructor in mono interp

ff74278

kg added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label May 26, 2023

kg requested review from BrzVlad, tannergooding and kotlarmilos May 26, 2023 01:30

dotnet-issue-labeler bot added the area-System.Numerics label May 26, 2023

TIL: 'ref this' is legal in structs now

999477c

This was referenced May 26, 2023

Failed USB connection via port 54050, error 61, in tvOS arm64 Release AllSubsets_Mono #82637

Open

Assert failure in GC/API/NoGCRegion/Callback_Svr test #86612

Closed

runfoapp bot mentioned this pull request May 26, 2023

Infra improvements for Helix #68176

Closed

kg closed this May 26, 2023

ghost locked as resolved and limited conversation to collaborators Jun 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt to optimize the Vector4 element constructor in mono interp #86782

Attempt to optimize the Vector4 element constructor in mono interp #86782

kg commented May 26, 2023

ghost commented May 26, 2023

kg commented May 26, 2023 •

edited

Loading

BrzVlad commented May 26, 2023 •

edited

Loading

vargaz commented May 26, 2023

tannergooding commented May 26, 2023

kg commented May 26, 2023

Attempt to optimize the Vector4 element constructor in mono interp #86782

Attempt to optimize the Vector4 element constructor in mono interp #86782

Conversation

kg commented May 26, 2023

ghost commented May 26, 2023

kg commented May 26, 2023 • edited Loading

BrzVlad commented May 26, 2023 • edited Loading

vargaz commented May 26, 2023

tannergooding commented May 26, 2023

kg commented May 26, 2023

kg commented May 26, 2023 •

edited

Loading

BrzVlad commented May 26, 2023 •

edited

Loading