Align destination in x86_64's mem* instructions. #474

Merged: 6 commits into rust-lang:master from the x86_64-mem-align-dest branch on Jul 28, 2022

Conversation

@Demindiro (Contributor) commented Jul 3, 2022

While misaligned reads are generally fast, misaligned writes aren't and can have severe penalties.

I don't know if LLVM 9 is still supported. I've used Intel syntax anyways since it's more readable IMO.

Benchmark results on a Ryzen 2700X:

master
test memset_rust_1048576              ... bench:      18,450 ns/iter (+/- 7,933) = 56833 MB/s
test memset_rust_1048576_offset       ... bench:      36,641 ns/iter (+/- 4,096) = 28617 MB/s
test memset_rust_4096                 ... bench:          93 ns/iter (+/- 3) = 44043 MB/s
test memset_rust_4096_offset          ... bench:         144 ns/iter (+/- 4) = 28444 MB/s
test memcpy_rust_1048576              ... bench:      56,233 ns/iter (+/- 8,037) = 18646 MB/s
test memcpy_rust_1048576_misalign     ... bench:      63,518 ns/iter (+/- 3,180) = 16508 MB/s
test memcpy_rust_1048576_offset       ... bench:      57,239 ns/iter (+/- 9,745) = 18319 MB/s
test memcpy_rust_4096                 ... bench:          87 ns/iter (+/- 12) = 47080 MB/s
test memcpy_rust_4096_misalign        ... bench:         161 ns/iter (+/- 10) = 25440 MB/s
test memcpy_rust_4096_offset          ... bench:         164 ns/iter (+/- 12) = 24975 MB/s
test memmove_rust_1048576             ... bench:      53,519 ns/iter (+/- 5,885) = 19592 MB/s
test memmove_rust_1048576_misalign    ... bench:     102,426 ns/iter (+/- 3,702) = 10237 MB/s
test memmove_rust_4096                ... bench:         234 ns/iter (+/- 31) = 17504 MB/s
test memmove_rust_4096_misalign       ... bench:         425 ns/iter (+/- 5) = 9637 MB/s
x86_64-mem-align-dest
test memset_rust_1048576              ... bench:      18,862 ns/iter (+/- 4,941) = 55591 MB/s
test memset_rust_1048576_offset       ... bench:      19,768 ns/iter (+/- 4,017) = 53044 MB/s
test memset_rust_4096                 ... bench:          93 ns/iter (+/- 8) = 44043 MB/s
test memset_rust_4096_offset          ... bench:         106 ns/iter (+/- 12) = 38641 MB/s
test memcpy_rust_1048576              ... bench:      54,790 ns/iter (+/- 5,362) = 19138 MB/s
test memcpy_rust_1048576_misalign     ... bench:      58,650 ns/iter (+/- 5,727) = 17878 MB/s
test memcpy_rust_1048576_offset       ... bench:      55,283 ns/iter (+/- 7,114) = 18967 MB/s
test memcpy_rust_4096                 ... bench:          87 ns/iter (+/- 23) = 47080 MB/s
test memcpy_rust_4096_misalign        ... bench:          99 ns/iter (+/- 20) = 41373 MB/s
test memcpy_rust_4096_offset          ... bench:         106 ns/iter (+/- 11) = 38641 MB/s
test memmove_rust_1048576             ... bench:      52,964 ns/iter (+/- 3,015) = 19797 MB/s
test memmove_rust_1048576_misalign    ... bench:      52,006 ns/iter (+/- 12,560) = 20162 MB/s
test memmove_rust_4096                ... bench:         223 ns/iter (+/- 12) = 18367 MB/s
test memmove_rust_4096_misalign       ... bench:         231 ns/iter (+/- 25) = 17731 MB/s
builtin
test memset_builtin_1048576           ... bench:      19,972 ns/iter (+/- 5,495) = 52502 MB/s
test memset_builtin_1048576_offset    ... bench:      16,349 ns/iter (+/- 6,467) = 64137 MB/s
test memset_builtin_4096              ... bench:          67 ns/iter (+/- 14) = 61134 MB/s
test memset_builtin_4096_offset       ... bench:          68 ns/iter (+/- 19) = 60235 MB/s
test memcpy_builtin_1048576           ... bench:      22,621 ns/iter (+/- 186) = 46354 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      31,836 ns/iter (+/- 6,460) = 32936 MB/s
test memcpy_builtin_1048576_offset    ... bench:      28,163 ns/iter (+/- 5,183) = 37232 MB/s
test memcpy_builtin_4096              ... bench:          68 ns/iter (+/- 10) = 60235 MB/s
test memcpy_builtin_4096_misalign     ... bench:          69 ns/iter (+/- 1) = 59362 MB/s
test memcpy_builtin_4096_offset       ... bench:          69 ns/iter (+/- 13) = 59362 MB/s
test memmove_builtin_1048576          ... bench:      28,480 ns/iter (+/- 7,325) = 36817 MB/s
test memmove_builtin_1048576_misalign ... bench:      34,861 ns/iter (+/- 3,024) = 30078 MB/s
test memmove_builtin_4096             ... bench:          66 ns/iter (+/- 10) = 62060 MB/s
test memmove_builtin_4096_misalign    ... bench:          72 ns/iter (+/- 15) = 56888 MB/s

@Demindiro (Contributor, Author) commented:

The PowerPC test failure is odd; I don't think it should be affected by any of the changes in this PR.

@Demindiro (Contributor, Author) commented Jul 3, 2022

I can't reproduce the PowerPC64 failure. I think it's a bug in LLVM that may have been fixed by now.

I run the tests like so:

CC_powerpc64_unknown_linux_gnu=powerpc64-linux-gnu-gcc CARGO_TARGET_POWERPC64_UNKNOWN_LINUX_GNU_LINKER=powerpc64-linux-gnu-gcc CARGO_TARGET_POWERPC64_UNKNOWN_LINUX_GNU_RUNNER=qemu-ppc64-static QEMU_LD_PREFIX=/usr/powerpc64-linux-gnu RUST_TEST_THREADS=1 cargo t --target powerpc64-unknown-linux-gnu

The versions of rustc, gcc and QEMU are:

rustc 1.64.0-nightly (2f3ddd9f5 2022-06-27)
powerpc64-linux-gnu-gcc (Debian 10.2.1-6) 10.2.1 20210110
qemu-ppc64 version 5.2.0 (Debian 1:5.2+dfsg-11+deb11u1)

EDIT: updated to rustc 1.64.0-nightly (f2d93935f 2022-07-02), still no luck even though it's the same as what the CI uses.

@bjorn3 (Member) commented Jul 3, 2022

As far as I understand it, the current implementation is actually faster for memcpy and memset on Intel. For backwards memmove, using unaligned accesses is indeed slower, and for that reason we already use 8-byte aligned operations. See https://docs.google.com/spreadsheets/d/1H-ubR-xCJWomUYDI9D2JH19BNUD7R9kfkl_OHSv6vMk/edit, which is linked from #365.

@Demindiro (Contributor, Author) commented:

> For backwards memmove using unaligned bytes is indeed slower and as such we already use 8 byte aligned operations

copy_backward does not align to 8 bytes before executing rep movsq.

From what I understand, recent Intel and Zen 3 processors have the ERMSB feature, which makes using only rep movsb fast. This is already special-cased if target_feature = "ermsb" is set.
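For reference, a minimal sketch of what such an ERMSB-only path can look like (a sketch under that cfg assumption, not necessarily the crate's exact code): with Enhanced REP MOVSB, a single rep movsb over the whole length is already fast, so no qword splitting or destination-alignment fixup is needed.

use core::arch::asm;

#[cfg(target_feature = "ermsb")]
pub unsafe fn copy_forward(dest: *mut u8, src: *const u8, count: usize) {
    // One rep movsb covers the entire copy; ERMSB makes this competitive
    // with the qword-based loop regardless of alignment.
    asm!(
        "rep movsb",
        inout("rcx") count => _,
        inout("rdi") dest => _,
        inout("rsi") src => _,
        options(nostack, preserves_flags)
    );
}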

I'll see if I can do a benchmark on an Intel processor which does not have the ERMSB feature.

@Demindiro (Contributor, Author) commented:

I don't seem to have any Intel CPU without ERMSB, but I figured I'd do a benchmark on one anyways.

Benchmark results for an i5-5287U (MacBook with macOS):

master
test memset_rust_1048576              ... bench:      35,380 ns/iter (+/- 5,678) = 29637 MB/s
test memset_rust_1048576_offset       ... bench:      34,749 ns/iter (+/- 938) = 30175 MB/s
test memset_rust_4096                 ... bench:          66 ns/iter (+/- 8) = 62060 MB/s
test memset_rust_4096_offset          ... bench:         109 ns/iter (+/- 13) = 37577 MB/s
test memcpy_rust_1048576              ... bench:      68,431 ns/iter (+/- 42,373) = 15323 MB/s
test memcpy_rust_1048576_misalign     ... bench:      99,567 ns/iter (+/- 35,795) = 10531 MB/s
test memcpy_rust_1048576_offset       ... bench:      71,338 ns/iter (+/- 50,645) = 14698 MB/s
test memcpy_rust_4096                 ... bench:          73 ns/iter (+/- 1) = 56109 MB/s
test memcpy_rust_4096_misalign        ... bench:         109 ns/iter (+/- 12) = 37577 MB/s
test memcpy_rust_4096_offset          ... bench:         126 ns/iter (+/- 186) = 32507 MB/s
test memmove_rust_1048576             ... bench:      78,784 ns/iter (+/- 8,777) = 13309 MB/s
test memmove_rust_1048576_misalign    ... bench:     168,847 ns/iter (+/- 8,877) = 6210 MB/s
test memmove_rust_4096                ... bench:         209 ns/iter (+/- 7) = 19598 MB/s
test memmove_rust_4096_misalign       ... bench:         250 ns/iter (+/- 45) = 16384 MB/s
x86_64-mem-align-dest
test memset_rust_1048576              ... bench:      35,106 ns/iter (+/- 3,536) = 29868 MB/s
test memset_rust_1048576_offset       ... bench:      35,186 ns/iter (+/- 997) = 29800 MB/s
test memset_rust_4096                 ... bench:          55 ns/iter (+/- 21) = 74472 MB/s
test memset_rust_4096_offset          ... bench:         106 ns/iter (+/- 3) = 38641 MB/s
test memcpy_rust_1048576              ... bench:      62,781 ns/iter (+/- 7,814) = 16702 MB/s
test memcpy_rust_1048576_misalign     ... bench:      83,916 ns/iter (+/- 4,891) = 12495 MB/s
test memcpy_rust_1048576_offset       ... bench:      64,644 ns/iter (+/- 8,535) = 16220 MB/s
test memcpy_rust_4096                 ... bench:          62 ns/iter (+/- 5) = 66064 MB/s
test memcpy_rust_4096_misalign        ... bench:         117 ns/iter (+/- 2) = 35008 MB/s
test memcpy_rust_4096_offset          ... bench:         122 ns/iter (+/- 20) = 33573 MB/s
test memmove_rust_1048576             ... bench:      78,933 ns/iter (+/- 30,644) = 13284 MB/s
test memmove_rust_1048576_misalign    ... bench:      82,033 ns/iter (+/- 15,301) = 12782 MB/s
test memmove_rust_4096                ... bench:         225 ns/iter (+/- 33) = 18204 MB/s
test memmove_rust_4096_misalign       ... bench:         245 ns/iter (+/- 50) = 16718 MB/s
builtin
test memset_builtin_1048576           ... bench:      34,678 ns/iter (+/- 1,803) = 30237 MB/s
test memset_builtin_1048576_offset    ... bench:      34,760 ns/iter (+/- 3,794) = 30166 MB/s
test memset_builtin_4096              ... bench:          44 ns/iter (+/- 11) = 93090 MB/s
test memset_builtin_4096_offset       ... bench:          44 ns/iter (+/- 12) = 93090 MB/s
test memcpy_builtin_1048576           ... bench:      61,564 ns/iter (+/- 8,961) = 17032 MB/s
test memcpy_builtin_1048576_misalign  ... bench:      76,861 ns/iter (+/- 32,761) = 13642 MB/s
test memcpy_builtin_1048576_offset    ... bench:      64,832 ns/iter (+/- 12,701) = 16173 MB/s
test memcpy_builtin_4096              ... bench:          48 ns/iter (+/- 2) = 85333 MB/s
test memcpy_builtin_4096_misalign     ... bench:          63 ns/iter (+/- 15) = 65015 MB/s
test memcpy_builtin_4096_offset       ... bench:          48 ns/iter (+/- 1) = 85333 MB/s
test memmove_builtin_1048576          ... bench:      70,972 ns/iter (+/- 9,459) = 14774 MB/s
test memmove_builtin_1048576_misalign ... bench:      74,613 ns/iter (+/- 5,180) = 14053 MB/s
test memmove_builtin_4096             ... bench:          45 ns/iter (+/- 3) = 91022 MB/s
test memmove_builtin_4096_misalign    ... bench:          58 ns/iter (+/- 6) = 70620 MB/s
-C target-cpu=native (uses ERMSB version)
test memset_rust_1048576              ... bench:      34,632 ns/iter (+/- 969) = 30277 MB/s
test memset_rust_1048576_offset       ... bench:      34,718 ns/iter (+/- 2,995) = 30202 MB/s
test memset_rust_4096                 ... bench:          44 ns/iter (+/- 3) = 93090 MB/s
test memset_rust_4096_offset          ... bench:          82 ns/iter (+/- 11) = 49951 MB/s
test memcpy_rust_1048576              ... bench:      66,104 ns/iter (+/- 8,826) = 15862 MB/s
test memcpy_rust_1048576_misalign     ... bench:     100,270 ns/iter (+/- 67,361) = 10457 MB/s
test memcpy_rust_1048576_offset       ... bench:      70,141 ns/iter (+/- 10,951) = 14949 MB/s
test memcpy_rust_4096                 ... bench:          53 ns/iter (+/- 4) = 77283 MB/s
test memcpy_rust_4096_misalign        ... bench:         100 ns/iter (+/- 36) = 40960 MB/s
test memcpy_rust_4096_offset          ... bench:          91 ns/iter (+/- 9) = 45010 MB/s
test memmove_rust_1048576             ... bench:      79,976 ns/iter (+/- 36,904) = 13111 MB/s
test memmove_rust_1048576_misalign    ... bench:      92,412 ns/iter (+/- 34,262) = 11346 MB/s
test memmove_rust_4096                ... bench:         227 ns/iter (+/- 25) = 18044 MB/s
test memmove_rust_4096_misalign       ... bench:         245 ns/iter (+/- 71) = 16718 MB/s

It seems only memmove_rust_1048576_misalign benefits; all other cases seem to be as fast as before this change. It is also clear that rep movsb with ERMSB is indeed faster for memcpy. Sadly, Zen 1 CPUs do not have it.

Overall, aligning the destination beforehand seems to have no measurable negative impact.

@josephlr (Contributor) left a comment:

Overall LGTM. However, should we add some unit tests for rep_param and rep_param_rev?

I would just want sanity checks that:
pre_byte_count + 8*qword_count + byte_count == count
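For illustration, a minimal sketch of such a sanity check, assuming rep_param were reachable from test code (it is not public in the PR as written):

#[cfg(test)]
mod rep_param_tests {
    use super::rep_param;

    // Sketch only: the three counts must always add back up to `count`,
    // and both byte parts must stay below one qword.
    #[test]
    fn counts_add_up() {
        for misalignment in 0..8usize {
            for count in 0..64usize {
                // Only the low address bits matter for the computation.
                let dest = (0x1000 + misalignment) as *mut u8;
                let (pre, qwords, post) = rep_param(dest, count);
                assert_eq!(pre + 8 * qwords + post, count);
                assert!(pre < 8 && post < 8);
            }
        }
    }
}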

"rep stosb",
inout("ecx") pre_byte_count => _,
inout("rdi") dest => dest,
in("al") c,
Review comment (Contributor):

Instead of passing in different values for al and rax, we can just pull the multiplication to the top and pass the same rax value to each asm block.

See: https://rust.godbolt.org/z/9hrv8eq1G

This seems to make it easier for the compiler to combine these blocks.
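For illustration, a sketch of that suggestion applied to the memset path (illustrative only, not the exact merged code; it reuses the rep_param helper quoted further down): the byte is broadcast into a u64 once, and the same rax is passed to every block, so rep stosb reads its fill value from al and rep stosq from the full register.

use core::arch::asm;

pub unsafe fn set_bytes(mut dest: *mut u8, c: u8, count: usize) {
    let (pre_byte_count, qword_count, byte_count) = rep_param(dest, count);
    // Broadcast the byte into all eight lanes once, up front.
    let c = c as u64 * 0x0101_0101_0101_0101;
    asm!(
        "rep stosb",
        inout("ecx") pre_byte_count => _,
        inout("rdi") dest => dest,
        in("rax") c,
        options(nostack, preserves_flags)
    );
    asm!(
        "rep stosq",
        inout("rcx") qword_count => _,
        inout("rdi") dest => dest,
        in("rax") c,
        options(nostack, preserves_flags)
    );
    asm!(
        "rep stosb",
        inout("ecx") byte_count => _,
        inout("rdi") dest => _,
        in("rax") c,
        options(nostack, preserves_flags)
    );
}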

Comment on lines 38 to 67
pub unsafe fn copy_forward(mut dest: *mut u8, mut src: *const u8, count: usize) {
    let (pre_byte_count, qword_count, byte_count) = rep_param(dest, count);
    // Separating the blocks gives the compiler more freedom to reorder instructions.
    // It also allows us to trivially skip the rep movsb, which is faster when memcpying
    // aligned data.
    if pre_byte_count > 0 {
        asm!(
            "rep movsb",
            inout("ecx") pre_byte_count => _,
            inout("rdi") dest => dest,
            inout("rsi") src => src,
            options(nostack, preserves_flags)
        );
    }
    asm!(
        "rep movsq",
        inout("rcx") qword_count => _,
        inout("rdi") dest => dest,
        inout("rsi") src => src,
        options(nostack, preserves_flags)
    );
    if byte_count > 0 {
        asm!(
            "rep movsb",
            inout("ecx") byte_count => _,
            inout("rdi") dest => _,
            inout("rsi") src => _,
            options(nostack, preserves_flags)
        );
    }
Review comment (Contributor):

Assembly looks reasonable here: https://rust.godbolt.org/z/Ejd8Kv6rb

Comment on lines 198 to 216
/// Determine optimal parameters for a `rep` instruction.
fn rep_param(dest: *mut u8, mut count: usize) -> (usize, usize, usize) {
    // Unaligned writes are still slow on modern processors, so align the destination address.
    let pre_byte_count = ((8 - (dest as usize & 0b111)) & 0b111).min(count);
    count -= pre_byte_count;
    let qword_count = count >> 3;
    let byte_count = count & 0b111;
    (pre_byte_count, qword_count, byte_count)
}

/// Determine optimal parameters for a reverse `rep` instruction (i.e. direction bit is set).
fn rep_param_rev(dest: *mut u8, mut count: usize) -> (usize, usize, usize) {
    // Unaligned writes are still slow on modern processors, so align the destination address.
    let pre_byte_count = ((dest as usize + count) & 0b111).min(count);
    count -= pre_byte_count;
    let qword_count = count >> 3;
    let byte_count = count & 0b111;
    (pre_byte_count, qword_count, byte_count)
}
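As a quick worked illustration of the forward variant, using a hypothetical address: for a destination ending in 0b101 and count = 20, three bytes bring the pointer to the next 8-byte boundary, two qwords cover the next 16 bytes, and one trailing byte remains.

// Hypothetical example, not from the PR: dest = 0x1005, count = 20.
// pre_byte_count = (8 - 5) & 7 = 3; remaining 17 -> qword_count = 2, byte_count = 1.
assert_eq!(rep_param(0x1005 as *mut u8, 20), (3, 2, 1));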
Review comment (Contributor):

I'm confused about why we need two of these functions; can't we just have one rep_param function and use its output in reverse order for copy_backward?

If we did that, it might be worth calling these: before_byte_count, qword_count, after_byte_count

@Demindiro (Contributor, Author) replied:

I actually just didn't think of that 😅

Comment on lines 72 to 92
    let (pre_byte_count, qword_count, byte_count) = rep_param_rev(dest, count);
    // We can't separate this block due to std/cld
    asm!(
        "std",
        "rep movsb",
        "sub rsi, 7",
        "sub rdi, 7",
        "mov rcx, {qword_count}",
        "rep movsq",
        "add rsi, 7",
        "add rdi, 7",
        "mov ecx, {byte_count:e}",
        "rep movsb",
        "cld",
        byte_count = in(reg) byte_count,
        qword_count = in(reg) qword_count,
        inout("ecx") pre_byte_count => _,
        inout("rdi") dest.add(count - 1) => _,
        inout("rsi") src.add(count - 1) => _,
        // We modify flags, but we restore it afterwards
        options(nostack, preserves_flags)
Review comment (Contributor):

ASM looks reasonable: https://rust.godbolt.org/z/EaGe1vM5b

"addq $7, %rdi",
"addq $7, %rsi",
"repe movsb (%rsi), (%rdi)",
"rep movsb",
Review comment (Contributor):

Do we want to skip the rep movsb if the count is zero (like we do for the other functions)?

@josephlr (Contributor) commented Jul 7, 2022

> I don't know if LLVM 9 is still supported. I've used Intel syntax anyways since it's more readable IMO.

rust-lang/rust#83387 made the minimum LLVM version 10, so I think this is fine.

Can we have the switch to Intel syntax be in a standalone commit/PR? We should switch all the asm to Intel style, rather than just the functions in this PR.

@Demindiro (Contributor, Author) commented:

> However, should we add some unit tests for rep_param and rep_param_rev?

Where should I put them? Putting them directly in src/mem/x86_64.rs doesn't work (testcrate doesn't run them, and running cargo t from the root causes an error). Adding them to the testcrate would work, but it would also be a bit ugly IMO, since you'd need a bunch of cfg(target_arch = "x86_64") attributes and would have to make rep_param public (if tests are enabled).

Comment on the following snippet:

    // Separating the blocks gives the compiler more freedom to reorder instructions.
    // It also allows us to trivially skip the rep stosb, which is faster when memcpying
    // aligned data.
    if pre_byte_count > 0 {
Review comment (Member):

I would expect a rep stosb with an ecx value of 0 to act like a no-op anyways. Is there a perf benefit to keeping the if here?

@Demindiro (Contributor, Author) replied Jul 28, 2022:

There is a measurable benefit for memcpy_rust_4096 on my machine at least:

with branch:

test memcpy_rust_1048576              ... bench:      53,173 ns/iter (+/- 644) = 19720 MB/s
test memcpy_rust_1048576_misalign     ... bench:      58,352 ns/iter (+/- 5,939) = 17969 MB/s
test memcpy_rust_1048576_offset       ... bench:      52,561 ns/iter (+/- 1,950) = 19949 MB/s
test memcpy_rust_4096                 ... bench:          84 ns/iter (+/- 20) = 48761 MB/s
test memcpy_rust_4096_misalign        ... bench:          96 ns/iter (+/- 2) = 42666 MB/s
test memcpy_rust_4096_offset          ... bench:          97 ns/iter (+/- 0) = 42226 MB/s

without branch:

test memcpy_rust_1048576              ... bench:      55,051 ns/iter (+/- 4,696) = 19047 MB/s
test memcpy_rust_1048576_misalign     ... bench:      57,791 ns/iter (+/- 545) = 18144 MB/s
test memcpy_rust_1048576_offset       ... bench:      53,902 ns/iter (+/- 1,893) = 19453 MB/s
test memcpy_rust_4096                 ... bench:          89 ns/iter (+/- 0) = 46022 MB/s
test memcpy_rust_4096_misalign        ... bench:          97 ns/iter (+/- 1) = 42226 MB/s
test memcpy_rust_4096_offset          ... bench:          97 ns/iter (+/- 0) = 42226 MB/s

(Ditto for memset)

It probably makes more sense to leave it out though.

@Amanieu (Member) left a comment:

LGTM!

The PowerPC failure is fixed in CI; you can rebase onto the latest master.

Commit messages (excerpts):

- While misaligned reads are generally fast, misaligned writes aren't and can have severe penalties.
- There is currently no measurable performance difference in benchmarks, but it likely will make a difference in real workloads.
- While it is measurably faster for older CPUs, removing them keeps the code smaller and is likely more beneficial for newer CPUs.
@Amanieu merged commit 9dfe467 into rust-lang:master on Jul 28, 2022.
@Demindiro deleted the x86_64-mem-align-dest branch on July 28, 2022 at 20:52.