Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the array::map codegen #107634

Merged
merged 4 commits into from
Feb 13, 2023
Merged

Improve the array::map codegen #107634

merged 4 commits into from
Feb 13, 2023

Conversation

scottmcm
Copy link
Member

@scottmcm scottmcm commented Feb 3, 2023

The map method on arrays is documented as sometimes performing poorly, and after a question on URLO prompted me to take another look at the core try_collect_into_array function, I had some ideas that ended up working better than I'd expected.

There's three main ideas in here, split over three commits:

  1. Don't use array::IntoIter when we can avoid it, since that seems to not get SRoA'd, meaning that every step writes things like loop counters into the stack unnecessarily
  2. Don't return arrays in Results unnecessarily, as that doesn't seem to optimize away even with unwrap_unchecked (perhaps because it needs to get moved into a new LLVM type to account for the discriminant)
  3. Don't distract LLVM with all the Option dances when we know for sure we have enough items (like in map and zip). This one's a larger commit as to do it I ended up adding a new pub(crate) trait, but hopefully those changes are still straight-forward.

(No libs-api changes; everything should be completely implementation-detail-internal.)

It's still not completely fixed -- I think it needs pcwalton's memcpy optimizations still (#103830) to get further -- but this seems to go much better than before. And the remaining memcpys are just transmute-equivalent ([T; N] -> ManuallyDrop<[T; N]> and [MaybeUninit<T>; N] -> [T; N]), so hopefully those will be easier to remove with LLVM16 than the previous subobject copies 🤞

r? @thomcc

As a simple example, this test

pub fn long_integer_map(x: [u32; 64]) -> [u32; 64] {
    x.map(|x| 13 * x + 7)
}

On nightly https://rust.godbolt.org/z/xK7548TGj takes sub rsp, 808

start:
  %array.i.i.i.i = alloca [64 x i32], align 4
  %_3.sroa.5.i.i.i = alloca [65 x i32], align 4
  %_5.i = alloca %"core::iter::adapters::map::Map<core::array::iter::IntoIter<u32, 64>, [closure@/app/example.rs:2:11: 2:14]>", align 8

(and yes, that's a 65-element array alloca despite 64-element input and output)

But with this PR it's only sub rsp, 520

start:
  %array.i.i.i.i.i.i = alloca [64 x i32], align 4
  %array1.i.i.i = alloca %"core::mem::manually_drop::ManuallyDrop<[u32; 64]>", align 4

Similarly, the loop it emits on nightly is scalar-only and horrifying

.LBB0_1:
        mov     esi, 64
        mov     edi, 0
        cmp     rdx, 64
        je      .LBB0_3
        lea     rsi, [rdx + 1]
        mov     qword ptr [rsp + 784], rsi
        mov     r8d, dword ptr [rsp + 4*rdx + 528]
        mov     edi, 1
        lea     edx, [r8 + 2*r8]
        lea     r8d, [r8 + 4*rdx]
        add     r8d, 7
.LBB0_3:
        test    edi, edi
        je      .LBB0_11
        mov     dword ptr [rsp + 4*rcx + 272], r8d
        cmp     rsi, 64
        jne     .LBB0_6
        xor     r8d, r8d
        mov     edx, 64
        test    r8d, r8d
        jne     .LBB0_8
        jmp     .LBB0_11
.LBB0_6:
        lea     rdx, [rsi + 1]
        mov     qword ptr [rsp + 784], rdx
        mov     edi, dword ptr [rsp + 4*rsi + 528]
        mov     r8d, 1
        lea     esi, [rdi + 2*rdi]
        lea     edi, [rdi + 4*rsi]
        add     edi, 7
        test    r8d, r8d
        je      .LBB0_11
.LBB0_8:
        mov     dword ptr [rsp + 4*rcx + 276], edi
        add     rcx, 2
        cmp     rcx, 64
        jne     .LBB0_1

whereas with this PR it's unrolled and vectorized

	vpmulld	ymm1, ymm0, ymmword ptr [rsp + 64]
	vpaddd	ymm1, ymm1, ymm2
	vmovdqu	ymmword ptr [rsp + 328], ymm1
	vpmulld	ymm1, ymm0, ymmword ptr [rsp + 96]
	vpaddd	ymm1, ymm1, ymm2
	vmovdqu	ymmword ptr [rsp + 360], ymm1

(though sadly still stack-to-stack)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 3, 2023
@rustbot

This comment was marked as resolved.

///
/// It currently requires `TrustedLen` because it's unclear whether it's
/// reasonably possible to depend on the `size_hint` of anything else.
pub(crate) trait UncheckedIterator: TrustedLen {
Copy link
Member

@the8472 the8472 Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like another take on TrustedRandomAccess, except that the source can drop its elements. I tried something similar in #93243 except by augmenting TRA directly, but it lead to compile time regressions due to the added traits and methods.
Ideally we would only have to maintain one unsafe iteration trait. But this is a lot simpler and sometimes the single-induction-variable that TRA provides isn't necessary, so maybe it's not so bad.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what's the best resource to read up on TrustedRandomAccessNoCoerce? TBH I've always ignored it 😬 Is https://stdrs.dev/nightly/x86_64-unknown-linux-gnu/std/iter/trait.TrustedRandomAccess.html#safety the go-to place?

I could try changing try_from_trusted_iterator to essentially try_from_fn(Iterator::__iterator_get_unchecked), rather needing next_unchecked at all...

(But next_unchecked has a much simpler safety condition than TRANC.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s https://stdrs.dev/nightly/x86_64-unknown-linux-gnu/std/iter/trait.TrustedRandomAccess.html#safety the go-to place?

Yes. But nothing actionable for this PR. The trait as it is currently isn't fully applicable for array::IntoIter anyway because it is an owning iterator which needs to drop items when it's not exhausted. So it can only be implemented for T: Copy (as we do for vec::IntoIter) as approximation for needs_drop() == false. To generalize it further would need some changes, as tried in the mentioned PR.

I was just pointing out that we now have two ways for unchecked access. TrustedLen is slightly more general since it can also apply to Chain or Flatten because it supports len > usize::MAX. TRA on the other hand is a bit more powerful when it applies because it allows external iteration with a single induction variable (this really helps Zip) and doesn't need to update internal fields on which the next loop iteration would depend.

(But next_unchecked has a much simpler safety condition than TRANC.)

Agreed, there's some simplicity to it. I have tried using next().unwrap_unchecked() before in conjunction with TrustedLen while it brought some improvements it couldn't match TRA.
I hope that maybe UncheckedIterator could get us closer (if we turn it into a specialization trait, which I thought it was 😅), perhaps even to parity. If we could do away with the TRA machinery that would be awesome because it's chernobyl-levels of unsafe.

But it doesn't cover all the same cases then we'd be left with two unsafe iteration traits, at least one which we might never be able to stabilize and that seems kind of ugly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to play with it in a future PR :)

What's the best way to make it a specialization trait? Make the method __iterator_next_unchecked instead, like __iterator_get_unchecked, so it's on Iterator instead and just bounded by the specialization trait TrustedLen? (TBH I'm not sure why the unsafe specialization trait keeps things from having methods, but presumably there's a good reason.)

Copy link
Member

@the8472 the8472 Feb 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specialization traits can have methods (TrustedRandomAccessNoCoerce has size()). __iterator_next_unchecked was moved to the iterator trait during the specialization -> min_specialization refactor for reasons I have forgotten (edit, reason was #73565 (comment)).

Slapping #[rustc_specialization_trait] on it might be enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see -- #[rustc_specialization_trait] can have methods, but TrustedLen is #[rustc_unsafe_specialization_marker], which can't. That's the distinction I'd missed. Thanks!

@scottmcm
Copy link
Member Author

scottmcm commented Feb 3, 2023

Since this doesn't impact for desugaring and doesn't change dyn Iterator it's hopefully perf-ok, but after the8472's message I'll check just in case:

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 3, 2023
@bors
Copy link
Contributor

bors commented Feb 3, 2023

⌛ Trying commit 2e20acd169c4fd9c1ee69c089b9c7439ef48a02f with merge 4d8ea29940107409a02030c3f43a0fd2bb7c567c...

@the8472
Copy link
Member

the8472 commented Feb 3, 2023

Ah, I wasn't thinking that this would have perf impacts. I was more thinking about whether it was a good idea to add more specialization traits for unchecked iteration. And the compiler isn't array-iteration-heavy so it it should affect things much.

@the8472
Copy link
Member

the8472 commented Feb 3, 2023

Did you run the std benches? There are a few that cover array mapping and collecting arrays for next_chunk.

@scottmcm
Copy link
Member Author

scottmcm commented Feb 3, 2023

I'm not sure whether this is the direction you meant, but I'll note that this PR doesn't actually do any specialization at all.

The methods are internal, so just only accept UncheckedIterator, as all the callers are places where we know for sure that we trust what's going on.

(I first tried to add next_unchecked to TrustedLen, instead of a new trait, but apparently the specialization attribute makes it a #[marker] trait, so I couldn't do that.)

@the8472
Copy link
Member

the8472 commented Feb 3, 2023

Ah, I saw TrustedLen and then glossed over the actual use. Then it should affect even fewer things.

@bors
Copy link
Contributor

bors commented Feb 3, 2023

☀️ Try build successful - checks-actions
Build commit: 4d8ea29940107409a02030c3f43a0fd2bb7c567c (4d8ea29940107409a02030c3f43a0fd2bb7c567c)

@rust-timer

This comment has been minimized.

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (4d8ea29940107409a02030c3f43a0fd2bb7c567c): comparison URL.

Overall result: ❌✅ regressions and improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
0.8% [0.6%, 0.9%] 4
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-3.7% [-6.5%, -1.0%] 2
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
2.6% [0.8%, 3.6%] 7
Regressions ❌
(secondary)
2.7% [0.7%, 4.6%] 39
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
- - 0
All ❌✅ (primary) 2.6% [0.8%, 3.6%] 7

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
- - 0
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-3.1% [-3.1%, -3.1%] 1
All ❌✅ (primary) - - 0

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 4, 2023
@scottmcm
Copy link
Member Author

scottmcm commented Feb 5, 2023

Benches are essentially unchanged:

Before:

running 1 test
test vec::bench_next_chunk                               ... bench:         342 ns/iter (+/- 14)
running 2 tests
test iter::bench_copied_chunks                                     ... bench:          31 ns/iter (+/- 0)
test iter::bench_trusted_random_access_chunks                      ... bench:          31 ns/iter (+/- 0)

After

running 1 test
test vec::bench_next_chunk                               ... bench:         337 ns/iter (+/- 15)
running 2 tests
test iter::bench_copied_chunks                                     ... bench:          30 ns/iter (+/- 0)
test iter::bench_trusted_random_access_chunks                      ... bench:          30 ns/iter (+/- 0)

(I couldn't find any for array map; let me know if I missed something.)

move |_| {
// SAFETY: We know that `from_fn` will call this at most N times,
// and we checked to ensure that we have at least that many items.
unsafe { iter.next_unchecked() }
Copy link
Member Author

@scottmcm scottmcm Feb 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To your point about unwrap_uncheck not working as well as you'd hoped, @the8472, that's what I saw here too that led me to the new trait. I've lost the intermediate version I'd originally tried, but was able to get back to something similar by swapping this iter.next_unchecked for iter.next().unwrap_unchecked().

That ends up leaving this in the IR for the long_integer_map test:

"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1": ; preds = %bb5.i.i.i.i.i.i.i
  %generator.sroa.4.07.i.i.i.i.ptr.i.i.i.1 = getelementptr inbounds i32, ptr %array1.i.i.i, i64 %generator.sroa.4.1.i.i.i.i.idx.i.i.i
  %generator.sroa.4.07.i.i.i.i.add.i.i.i.1 = add nsw i64 %generator.sroa.4.1.i.i.i.i.idx.i.i.i, 1
  %tmp.0.copyload.i.i.i.i.i.i.i.i.i.i.1 = load i32, ptr %generator.sroa.4.07.i.i.i.i.ptr.i.i.i.1, align 4, !noalias !31
  %7 = insertvalue { i32, i32 } { i32 1, i32 undef }, i32 %tmp.0.copyload.i.i.i.i.i.i.i.i.i.i.1, 1
  %_3.i.i.i.i.i.i.i.i.i.i.i.i.i.1 = mul i32 %tmp.0.copyload.i.i.i.i.i.i.i.i.i.i.1, 13
  %8 = add i32 %_3.i.i.i.i.i.i.i.i.i.i.i.i.i.1, 7
  br label %bb5.i.i.i.i.i.i.i.1

bb5.i.i.i.i.i.i.i.1:                              ; preds = %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1", %bb5.i.i.i.i.i.i.i
  %generator.sroa.4.1.i.i.i.i.idx.i.i.i.1 = phi i64 [ 64, %bb5.i.i.i.i.i.i.i ], [ %generator.sroa.4.07.i.i.i.i.add.i.i.i.1, %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1" ]
  %9 = phi { i32, i32 } [ { i32 0, i32 undef }, %bb5.i.i.i.i.i.i.i ], [ %7, %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1" ]
  %10 = phi i32 [ undef, %bb5.i.i.i.i.i.i.i ], [ %8, %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.1" ]
  %_3.0.i.i.i.i.i.i.i.i.1 = extractvalue { i32, i32 } %9, 0
  %switch.i.i.i.i.i.i.i.i.i.1 = icmp ne i32 %_3.0.i.i.i.i.i.i.i.i.1, 0
  tail call void @llvm.assume(i1 %switch.i.i.i.i.i.i.i.i.i.1)
  %11 = getelementptr inbounds i32, ptr %array.i.i.i.i.i.i, i64 %6
  store i32 %10, ptr %11, align 4, !alias.scope !28, !noalias !38
  %12 = or i64 %guard.sroa.7.08.i.i.i.i.i.i.i, 2
  %_10.i.i.i.i.i.i.i.i.i.i.i.2 = icmp eq i64 %generator.sroa.4.1.i.i.i.i.idx.i.i.i.1, 64
  br i1 %_10.i.i.i.i.i.i.i.i.i.i.i.2, label %bb5.i.i.i.i.i.i.i.2, label %"_ZN93_$LT$core..array..drain..Drain$LT$T$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h51881ffc2935de71E.exit.i.i.i.i.i.i.i.i.i.2"

Those variable names are a horrible mess, but from the instructions you can see that the Options just don't optimize away: it does insertvalue { i32, i32 } to make an option, gets phi'd to the next block, then there's the extractvalue { i32, i32 } to pull things out again, and assumes the discriminant.

Arguably the next_unchecked shouldn't be doing something different, but somehow it makes a huge difference.

There's way too much code here to file a bug, but I'll see whether I can at least find something simpler that I can file on the rust side to see if an expert can tell why LLVM can't eliminate stuff here.

EDIT: Oh, it was easier to find a simple example than I thought, #107681

{
let mut map = iter.map(NeverShortCircuit);
try_collect_into_array(&mut map).map(|NeverShortCircuit(arr)| arr)
pub(crate) fn iter_next_chunk<T, const N: usize>(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about what motivated your reasoning to change the type signature here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emilHof It's essentially the same; I didn't have to change it.

This is just a bit more consistent with how I did the other things in this file. (Like how try_from_fn_erased takes impl FnMut, not Fwhere F: FnMut.) I tend to think of APIT like this as "generic elision" -- for things like iterators and closures, I tend not to actually care about the specific type, so having a used-once generic for it where I have to read the bounds to follow the signature is kinda meh.

I like thinking of it as "this function takes a mutable reference to an iterator of Ts, and returns an array of Ts". Like how you might write a function taking Iterator<T> in Java, or an IEnumerable<T> in C#, just without the dyn indirection that that results in those languages. Thinking of it as "This takes an I, and returns an array of the item type of I, and BTW I needs to be an iterator" is just less direct in my brain. The item type is the important one, not the iterator's type, for me.

TL/DR: Entirely preference.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scottmcm That is a really explanation!

Your reasoning definitely makes sense. In a way, the T is maybe more relevant to the purpose of these methods, so having the T and not some "elusive" I in the signature definitely seems very sensible! At least, this is how I am interpreting what you are saying.

Thank you for the great response! :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that's a good phrasing of it!

@emilHof

This comment was marked as resolved.

@scottmcm
Copy link
Member Author

@thomcc There's been lots of conversation here, so in case it wasn't clear I think this is still ready to go as-is.

(I'll experiment with the various other future possibilities later, since I don't want to mix those in with this improvement.)

@thomcc
Copy link
Member

thomcc commented Feb 13, 2023

@thomcc There's been lots of conversation here, so in case it wasn't clear I think this is still ready to go as-is.

Yep, it looks good to me (it did before, but I hadn't read through all the conversations to determine that nobody else wanted anything). Thanks a ton, excited to see this.

@bors r+

@bors
Copy link
Contributor

bors commented Feb 13, 2023

📌 Commit bb77860 has been approved by thomcc

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 13, 2023
@bors
Copy link
Contributor

bors commented Feb 13, 2023

⌛ Testing commit bb77860 with merge 2d91939...

@bors
Copy link
Contributor

bors commented Feb 13, 2023

☀️ Test successful - checks-actions
Approved by: thomcc
Pushing 2d91939 to master...

@rust-timer
Copy link
Collaborator

Finished benchmarking commit (2d91939): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
- - 0
Regressions ❌
(secondary)
0.3% [0.3%, 0.4%] 6
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-0.2% [-0.3%, -0.2%] 4
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌
(primary)
1.1% [1.1%, 1.1%] 1
Regressions ❌
(secondary)
1.6% [1.6%, 1.6%] 1
Improvements ✅
(primary)
- - 0
Improvements ✅
(secondary)
-2.4% [-2.4%, -2.4%] 1
All ❌✅ (primary) 1.1% [1.1%, 1.1%] 1

Cycles

This benchmark run did not return any relevant results for this metric.

@rustbot rustbot added the perf-regression Performance regression. label Feb 13, 2023
@scottmcm scottmcm deleted the array-drain branch February 14, 2023 03:03
@rylev
Copy link
Member

rylev commented Feb 14, 2023

Perf is a wash - going to mark as triaged.

@rustbot label: perf-regression-triaged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants