Don't aggregate homogeneous floats in the Rust ABI #93564

Urgau · 2022-02-01T23:17:29Z

This PR removes the aggregate "optimization" done to small (<= ptr-size * 2) ~~and non-homogeneous~~ homogeneous floats arguments as a result of these problems:

This block auto-vectorization (especially in with floats) Basic vectorization performance regression from 1.48.0 onwards #85265 (comment)
This adds unnecessary steps for packing (in the function) and unpacking (at the caller) the arguments

The logic the replaced with a more inner types aware approach (integers vs floats, homogeneous vs heterogeneous).

Specifically the behavior before the this PR was to aggregate every-type (struct, array) that was pass or returned and was small enough (<= ptr-size * 2) into a single and unique integer, no matter the underline type. This would means for example that [f32; 3] (which is a very common representation for a vector in 3 dimensions) would be represented as a unique i96 in the LLVM-IR that would need to be pack to be returned and than unpack at the caller to be used. #91447 (comment)

The new behavior with this PR is now to consider homogeneous:

~~integer type for aggregating into a single and unique integer~~
floats: pass them a array if <= ptr-size and by pointer on array if > ptr-size as an optimization

Rust code:

#[no_mangle]
pub fn sum_f32(a: f32, b: f32) -> f32 {
    a + b
}

#[no_mangle]
pub fn sum_f32x2(a: [f32; 2], b: [f32; 2]) -> [f32; 2] {
    [
        a[0] + b[0],
        a[1] + b[1],
    ]
}

#[no_mangle]
pub fn sum_f32x4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    [
        a[0] + b[0],
        a[1] + b[1],
        a[2] + b[2],
        a[3] + b[3],
    ]
}

ASM Before:

sum_f32:
	addss	xmm0, xmm1
	ret

sum_f32x2:
	.cfi_startproc
	movd	xmm0, edi
	shr	rdi, 32
	movd	xmm1, edi
	movd	xmm2, esi
	addss	xmm2, xmm0
	shr	rsi, 32
	movd	xmm0, esi
	addss	xmm0, xmm1
	movd	eax, xmm2
	movd	ecx, xmm0
	shl	rcx, 32
	or	rax, rcx
	ret

sum_f32x4:
	movd	xmm0, esi
	mov	rax, rsi
	shld	rax, rdi, 32
	shr	rsi, 32
	movq	xmm1, rax
	movq	xmm2, rsi
	punpckldq	xmm2, xmm1
	movd	xmm1, edi
	movd	xmm3, edx
	addps	xmm3, xmm1
	movd	xmm1, ecx
	addss	xmm1, xmm0
	shrd	rdx, rcx, 32
	shr	rcx, 32
	movq	xmm0, rdx
	movq	xmm4, rcx
	punpckldq	xmm4, xmm0
	addps	xmm4, xmm2
	movd	eax, xmm3
	movd	ecx, xmm1
	movd	edx, xmm4
	shufps	xmm4, xmm4, 85
	movd	esi, xmm4
	shl	rsi, 32
	shl	rdx, 32
	or	rdx, rcx
	or	rax, rsi
	ret

ASM After:

sum_f32:
	addss	xmm0, xmm1
	ret

sum_f32x2:
	addss	xmm0, xmm2
	addss	xmm1, xmm3
	ret

sum_f32x4:
	mov	rax, rdi
	movups	xmm0, xmmword ptr [rsi]
	movups	xmm1, xmmword ptr [rdx]
	addps	xmm1, xmm0
	movups	xmmword ptr [rdi], xmm1
	ret

Note that this pull-request is a continuation of #93405 that I closed due to the performance regressions.

Fixes #85265
Fixes #91447
Fixes #93490

rust-highfive · 2022-02-01T23:17:33Z

r? @matthewjasper

(rust-highfive has picked a reviewer for you, use r? to override)

bjorn3 · 2022-02-01T23:27:07Z

compiler/rustc_middle/src/ty/layout.rs

+                                // We don't want to aggregate float as Integer because this will
+                                // hurt (#93490) the codegen and performance so instead use a
+                                // vector (this is similar to what clang is doing in C++).
+                                RegKind::Float => arg.cast_to(Reg { kind: RegKind::Vector, size }),


This is unsound. Different -Ctarget-features of different crates can cause ABI incompatibilities.

Given the comment below, that was also what I taught but trying my PR with a code that use different target features leads me to the correct output:

#![feature(bench_black_box)] #![allow(non_camel_case_types)] #[derive(Debug)] struct f32x4(f32, f32, f32, f32); #[target_feature(enable = "avx")] unsafe fn foo() -> f32x4 { std::hint::black_box(f32x4(0., 1., 2., std::hint::black_box(3.))) } #[target_feature(enable = "sse3")] unsafe fn bar(arg: f32x4) { println!("{:?} == f32x4(0.0, 1.0, 2.0, 3.0)", std::hint::black_box(arg)); } fn main() { unsafe { bar(std::hint::black_box(foo())); } }

$ rustc +stage1 -C opt-level=0 a.rs && ./a f32x4(0.0, 1.0, 2.0, 3.0) == f32x4(0.0, 1.0, 2.0, 3.0) $ rustc +stage1 -C opt-level=3 a.rs && ./a f32x4(0.0, 1.0, 2.0, 3.0) == f32x4(0.0, 1.0, 2.0, 3.0)

That's just a 128bit vector which fits into a single XMM register. And since that is the lower half of the YMM register in AVX code, that probably just happens to work by accident on x86_64.

I guess using an x86 target with code that has the SSE target feature enabled on only foo or bar might fail (careful, IIRC the i686 arch enables SSE2bby default). Or maybe some of the other supported archs can trigger a bug here as well.

The i586 targets disable SSE by default I believe. You can also manually disable it.

@dotdash Yes and no, above this code there is a check for the max size by value who is ptr-size * 2, so the created vector cannot be (at least on x86_64) upper than 128bits which as you said fits on the lower half of the YMM register. This is also consistent with clang++ who (accordingly to my test) never pass vectors more than ptr-size * 2 by value.

Right, looks like it doesn't error. It however does change the abi from returning the value in xmm versus on the stack. This means that this makes disabling SSE support UB where previously it wasn't, which is still a breaking change. See https://llvm.godbolt.org/z/YThqqs9Ts (left: SSE disabled, right: SSE enabled)

Noooo... Well this is bad, really bad. @bjorn3 Any recommendation(s) on how to proceed from now ?

Not sure. For vector types we "fix" this by forcing them to always be passed on the stack for the rust abi and hoping llvm's mem2reg pass will eliminate the stack usage after inlining.

@bjorn3 I've updated the PR to not modify the Reg representation of homogeneous floats. I think this now completely resolve your concern about ABI compatibility as the generated IR doesn't use a vector type and is the same as rust <= 1.47.
For information as before it doesn't compile with -Ctarget-feature=-sse but does compile with -C target-feature=-sse,+soft-float.

Are you satisfied with the changes, @bjorn3?

dotdash · 2022-02-02T11:24:36Z

Passing structs of four f32 values as a single vector isn't always what you want. For general purpose structs where you don't perform SIMD-like operations, the code might end up worse this way.

https://godbolt.org/z/Ecoz3ThWz shows a (made up) example where the single 4-element vector variant produces worse code than the one that uses two 2-element vectors.

There are trade-offs but be made here. I suppose that in the long run, the Rust ABI may have to become arch specific as well. There's a lot to consider.

Urgau · 2022-02-02T13:40:36Z

I have updated this PR to now only take in account homogeneous floats in account and only in x86_64 (for now) use a vector to represent them. On other archs their current representation is not modified instead of previously by aggregated as a unique Integer. I think this should resolve the concern raised by @bjorn3 and @dotdash.

Urgau · 2022-02-02T13:44:32Z

Passing structs of four f32 values as a single vector isn't always what you want. For general purpose structs where you don't perform SIMD-like operations, the code might end up worse this way.

I agree this is not ideal but this is definitely better than the current situation.

There are trade-offs but be made here. I suppose that in the long run, the Rust ABI may have to become arch specific as well. There's a lot to consider.

For now this PR does the Vector cast only in x86_64 per your concern. It may be extended in the future.

Urgau · 2022-02-03T00:08:33Z

Over the (new) concern of @bjorn3, I've once again updated the logic in this PR to not change the Reg representation at all.
The "new" logic in basically the same as Rust 1.47 and below (only for homogeneous floats): pass them a array if <= ptr-size and by pointer on array if > ptr-size as an optimization.

workingjubilee · 2022-02-20T20:47:34Z

If this is intended to fix a regression in the assembly emitted then this requires assembly tests before it can land.

This is basically is ripoff of src/test/ui/simd/target-feature-mixup.rs but for floats and without #[repr(simd)]

This is required because depending on the version of llvm FileCheck it could fail.

workingjubilee · 2022-02-21T01:14:25Z

Then I think I have the requisite investiture to do this...
@bors try @rust-timer queue

rust-timer · 2022-02-21T01:14:26Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2022-02-21T01:14:32Z

⌛ Trying commit 74a429e with merge d557e0f29befbe3f15344eb15c19364a660ce35b...

bors · 2022-02-21T02:50:07Z

☀️ Try build successful - checks-actions
Build commit: d557e0f29befbe3f15344eb15c19364a660ce35b (d557e0f29befbe3f15344eb15c19364a660ce35b)

rust-timer · 2022-02-21T02:50:09Z

Queued d557e0f29befbe3f15344eb15c19364a660ce35b with parent 45e2c28, future comparison URL.

rust-timer · 2022-02-21T04:47:26Z

Finished benchmarking commit (d557e0f29befbe3f15344eb15c19364a660ce35b): comparison url.

Summary: This benchmark run shows 4 relevant regressions 😿 to instruction counts.

Average relevant regression: 8.5%
Largest regression in instruction counts: 8.8% on full builds of await-call-tree opt

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

Urgau · 2022-02-21T12:50:17Z

The regressions are purely in the await-call-tree compiler benchmark. Looking closely at the performance data, the regressions are located in fn_abi_of_instance section which is expected as this PR touch this area. The time delta of these 4 regressions is well over 3000% from ~0.000s to ~0.001s (accordingly to rustc-perf).

Investigating locally await-call-tree with my changes, told me that there is now 21 calls to the homogeneous_aggregate function most of them are redurent (4 different types) but adding caching seems overkill and counterproductive as the cost of calculating the hash will most certainly take more time than just doing the homogeneous_aggregate call. Maybe adding #[inline] to the homogeneous_aggregate function could help ? (because some matching is unnecessary; we only pass Abi::Aggregate)

Anyway I don't think we should worry about these localized regressions because:

await-call-tree is a micro benchmark (not representative of any real world code)
the regressions are very small (in term of time)
the changes in this PR fixes major regressions with floats (re-enabling auto-vectorization in most cases)
and the rest of the benchmarks doesn't show regressions

@rustbot label: +perf-regression-triaged

lqd · 2022-02-21T13:05:55Z

await-call-tree is a micro benchmark (not representative of any real world code)

Note: this benchmark was reduced from real world code triggering #65147

workingjubilee · 2022-02-23T01:32:33Z

Yeah, the benchmark is artificial but the regression would hit real code. If it was a handful of tiny regressions, I wouldn't think anything of it, but it's a very telling regression on a single case, and I can only infer that this test case was added precisely to prevent such a regression. I don't know that this isn't worth the changes, either, I just am not sure I have sufficient proof for posterity, when e.g. someone files a regression report after this gets merged and files a PR to revert this change.

We also have the option to land #91719 instead.

workingjubilee · 2022-03-04T18:39:32Z

We landed the other PR #91719 in #94570, which is unfortunately incompatible with this approach. Thank you for exploring this issue with us, however.

rust-highfive assigned matthewjasper Feb 1, 2022

rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 1, 2022

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 1, 2022

Urgau mentioned this pull request Feb 1, 2022

Rust should pass vectors by vector register #93490

Closed

bjorn3 reviewed Feb 1, 2022

View reviewed changes

Urgau force-pushed the abi-arguments-optimization branch from 9150cc5 to bb38320 Compare February 2, 2022 13:36

Urgau changed the title ~~Only aggregate homogeneous types in the Rust ABI~~ Don't aggregate homogeneous floats in the Rust ABI Feb 2, 2022

Urgau force-pushed the abi-arguments-optimization branch from bb38320 to 5a71f32 Compare February 2, 2022 23:52

Urgau mentioned this pull request Feb 20, 2022

[Experiment] revert issue #26494 associated pulls #76986 and #79547 #91719

Closed

Simplify the code of fixup by making it's code flow more natural

2e374cf

Urgau force-pushed the abi-arguments-optimization branch from 5a71f32 to 5a9f3d9 Compare February 21, 2022 00:06

Don't aggregate homogeneous floats in the Rust ABI

7b69d21

Urgau force-pushed the abi-arguments-optimization branch from 5a9f3d9 to 7b69d21 Compare February 21, 2022 00:27

Urgau added 2 commits February 21, 2022 01:45

Test that target feature mix up with homogeneous floats is sound

fdc672a

This is basically is ripoff of src/test/ui/simd/target-feature-mixup.rs but for floats and without #[repr(simd)]

Escape outer [] in x86-64-homogeneous-floats.rs

74a429e

This is required because depending on the version of llvm FileCheck it could fail.

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2022

rustbot added the perf-regression Performance regression. label Feb 21, 2022

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2022

rustbot added the perf-regression-triaged The performance regression has been triaged. label Feb 21, 2022

workingjubilee closed this Mar 4, 2022

This was referenced May 30, 2022

Result that uses niches resulting in a final size of 16 bytes emits poor LLVM IR due to ABI #97540

Closed

Optimize layout of function arguments in the Rust ABI - take 2 #97559

Closed

Ayanekoji mentioned this pull request Oct 28, 2024

Stack overflow at runtime when using Rc<RefCell<T>> in cycle detection algorithm #132263

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't aggregate homogeneous floats in the Rust ABI #93564

Don't aggregate homogeneous floats in the Rust ABI #93564

Urgau commented Feb 1, 2022 •

edited

Loading

rust-highfive commented Feb 1, 2022

bjorn3 Feb 1, 2022

Urgau Feb 1, 2022

dotdash Feb 2, 2022

bjorn3 Feb 2, 2022

Urgau Feb 2, 2022

bjorn3 Feb 2, 2022 •

edited

Loading

Urgau Feb 2, 2022

bjorn3 Feb 2, 2022

Urgau Feb 3, 2022

workingjubilee Feb 21, 2022

dotdash commented Feb 2, 2022

Urgau commented Feb 2, 2022

Urgau commented Feb 2, 2022

Urgau commented Feb 3, 2022 •

edited

Loading

workingjubilee commented Feb 20, 2022

workingjubilee commented Feb 21, 2022

rust-timer commented Feb 21, 2022

bors commented Feb 21, 2022

bors commented Feb 21, 2022

rust-timer commented Feb 21, 2022

rust-timer commented Feb 21, 2022

Urgau commented Feb 21, 2022

lqd commented Feb 21, 2022

workingjubilee commented Feb 23, 2022 •

edited

Loading

workingjubilee commented Mar 4, 2022

Don't aggregate homogeneous floats in the Rust ABI #93564

Don't aggregate homogeneous floats in the Rust ABI #93564

Conversation

Urgau commented Feb 1, 2022 • edited Loading

rust-highfive commented Feb 1, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bjorn3 Feb 2, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dotdash commented Feb 2, 2022

Urgau commented Feb 2, 2022

Urgau commented Feb 2, 2022

Urgau commented Feb 3, 2022 • edited Loading

workingjubilee commented Feb 20, 2022

workingjubilee commented Feb 21, 2022

rust-timer commented Feb 21, 2022

bors commented Feb 21, 2022

bors commented Feb 21, 2022

rust-timer commented Feb 21, 2022

rust-timer commented Feb 21, 2022

Urgau commented Feb 21, 2022

lqd commented Feb 21, 2022

workingjubilee commented Feb 23, 2022 • edited Loading

workingjubilee commented Mar 4, 2022

Urgau commented Feb 1, 2022 •

edited

Loading

bjorn3 Feb 2, 2022 •

edited

Loading

Urgau commented Feb 3, 2022 •

edited

Loading

workingjubilee commented Feb 23, 2022 •

edited

Loading