Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to optimize the Vector4 element constructor in mono interp #86782

Closed
wants to merge 2 commits into from

Conversation

kg
Copy link
Contributor

@kg kg commented May 26, 2023

Looking at dotnet/perf-autofiling-issues#18031 in wasm+interp, the hot path of some of the failures boils down to foo.AsVector128(), where AsVector128 itself boils down to new Vector4(this, ...), which then translates to new Vector4(this.X, this.Y, ...). Because the Vector4 ctor just assigns 4 fields sequentially, this generates a bunch of interp opcodes.

If we initialize Vector4 as if it were a Vector128, we can tap into the interpreter's SIMD support and initialize it in (almost) one vectorized operation, and then on WASM that will use the native vector instruction set.

With these changes a Vector2.AsVector128() call looks roughly like this in the mono interpreter, annotated with my reading of the opcodes:

1b42dba ldc.r4  -> 192                 // 0f
1b42dc2 ldc.r4  -> 200                 // 0f
1b42dca newobj_vt_inlined 96 -> 112    // result = default(Vector4);
1b42dd2 mov.4 88 -> 176                // load src.X
1b42dd8 mov.4 92 -> 184                // load src.Y
1b42dde ldflda 112 -> 112              // &result.X
1b42de6 simd_v128_i4_create 176 -> 128 // Vector128.Create(src.X, src.Y, 0f, 0f)
1b42dec stobj.vt.noref 112, 128        // result = (Vector4)Vector128.Create(src.X, src.Y, 0f, 0f)

I don't know if we want to do this, or whether this is a good way to do it. But I figured a draft PR is a good way to start the discussion. I suspect if we optimize this one ctor a bunch of stuff will get faster in the wasm interp scenario (and maybe all interp targets, but it's hard to be certain).

@kg kg added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label May 26, 2023
@ghost
Copy link

ghost commented May 26, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

Looking at dotnet/perf-autofiling-issues#18031 in wasm+interp, the hot path of some of the failures boils down to foo.AsVector128(), where AsVector128 itself boils down to new Vector4(this, ...), which then translates to new Vector4(this.X, this.Y, ...). Because the Vector4 ctor just assigns 4 fields sequentially, this generates a bunch of interp opcodes.

If we initialize Vector4 as if it were a Vector128, we can tap into the interpreter's SIMD support and initialize it in (almost) one vectorized operation, and then on WASM that will use the native vector instruction set.

With these changes a Vector2.AsVector128() call looks roughly like this in the mono interpreter, annotated with my reading of the opcodes:

1b42dba ldc.r4  -> 192                 // 0f
1b42dc2 ldc.r4  -> 200                 // 0f
1b42dca newobj_vt_inlined 96 -> 112    // result = default(Vector4);
1b42dd2 mov.4 88 -> 176                // load src.X
1b42dd8 mov.4 92 -> 184                // load src.Y
1b42dde ldflda 112 -> 112              // &result.X
1b42de6 simd_v128_i4_create 176 -> 128 // Vector128.Create(src.X, src.Y, 0f, 0f)
1b42dec stobj.vt.noref 112, 128        // result = (Vector4)Vector128.Create(src.X, src.Y, 0f, 0f)

I don't know if we want to do this, or whether this is a good way to do it. But I figured a draft PR is a good way to start the discussion. I suspect if we optimize this one ctor a bunch of stuff will get faster in the wasm interp scenario (and maybe all interp targets, but it's hard to be certain).

Author: kg
Assignees: -
Labels:

NO-MERGE, area-System.Numerics

Milestone: -

@kg
Copy link
Contributor Author

kg commented May 26, 2023

Learned today that 'ref this' is legal in structs now, I don't think it used to be? In any case, that eliminates the ldflda:

1b42c02 ldc.r4  -> 192
1b42c0a ldc.r4  -> 200
1b42c12 newobj_vt_inlined 96 -> 112
1b42c1a mov.4 88 -> 176
1b42c20 mov.4 92 -> 184
1b42c26 simd_v128_i4_create 176 -> 128
1b42c2c stobj.vt.noref 112, 128

@BrzVlad
Copy link
Member

BrzVlad commented May 26, 2023

I'm not convinced having this on the C# side is overall good design. I think it would make more sense for the interpreter to recognize Vector methods and generate appropriate available opcodes.

Also the output doesn't seem properly optimized and I think doing this by intrinsifying the ctor method as a whole might be better.

1b42c02 ldc.r4  -> 192
1b42c0a ldc.r4  -> 200
1b42c12 newobj_vt_inlined 96 -> 112
1b42c1a mov.4 88 -> 176
1b42c20 mov.4 92 -> 184
1b42c26 simd_v128_i4_create 176 -> 128
1b42c2c stobj.vt.noref 112, 128

should be instead something like

1b42c02 ldc.r4  -> 192
1b42c0a ldc.r4  -> 200
1b42c1a mov.4 88 -> 176
1b42c20 mov.4 92 -> 184
1b42c26 simd_v128_i4_create 176 -> 112

I'm currently working on reusing vectorization for Vector structs

@vargaz
Copy link
Contributor

vargaz commented May 26, 2023

The Vector ctor is marked as Intrinsic, so that suggests the runtime should optimize it.

@tannergooding
Copy link
Member

The Vector ctor is marked as Intrinsic, so that suggests the runtime should optimize it.

Agreed and that's how RyuJIT handles it today.

-- As an aside, It was recently brought up if we could do more of these in managed code, so we don't have to duplicate logic between RyuJIT and Mono; but there is a general concern over how that will impact codegen and throughput in scenarios these are heavily used. It will certainly put more pressure on the inliner to do the right thing.

@kg
Copy link
Contributor Author

kg commented May 26, 2023

The Vector ctor is marked as Intrinsic, so that suggests the runtime should optimize it.

Agreed and that's how RyuJIT handles it today.

-- As an aside, It was recently brought up if we could do more of these in managed code, so we don't have to duplicate logic between RyuJIT and Mono; but there is a general concern over how that will impact codegen and throughput in scenarios these are heavily used. It will certainly put more pressure on the inliner to do the right thing.

My thinking was that doing it in managed code would mean that AOT would automatically pick up the improvement, but it sounds like intrinsifying it in the interpreter will produce better code without stressing the inliner, so I'm not opposed to doing that instead. I'll give that a try later.

@kg kg closed this May 26, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Numerics NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants