Make ASCII case conversions more than 4× faster #59283
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
This looks good to me, modulo the tidy warnings.
I like that the explanation is so much longer than the code :)
Looks good to me. r=me as soon as CI passes.
Might this be slower for platforms without SIMD, which can't take advantage of auto-vectorization, or does that not matter?
It's probably still faster than the status quo on those platforms because it does the computation without branches. If one cared deeply about those platforms, then the pseudo-SIMD approach could be resurrected. However, I think this is a pretty good compromise.
I guess it depends on whether LLVM can auto-vectorize based on "classic" …

I also just realized that when doing one byte at a time, instead of the convoluted add-then-mask to emulate comparison, we can use an actual comparison to obtain the mask:

```rust
byte &= !(0x20 * (b'a' <= byte && byte <= b'z') as u8)
```

This even turns out to be slightly faster! I’ll update the PR.
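As a self-contained illustration of that one-byte-at-a-time trick (a minimal sketch; the function name and test harness below are mine, not the PR's):

```rust
// Branchless ASCII uppercasing: the bool-to-u8 cast is 0 or 1, so the
// multiplied mask is either 0x00 or 0x20, and bit 5 is cleared only
// for bytes in b'a'..=b'z'. No branch is emitted.
fn to_ascii_uppercase_branchless(mut byte: u8) -> u8 {
    byte &= !(0x20 * (b'a' <= byte && byte <= b'z') as u8);
    byte
}

fn main() {
    assert_eq!(to_ascii_uppercase_branchless(b'q'), b'Q');
    assert_eq!(to_ascii_uppercase_branchless(b'Q'), b'Q');
    assert_eq!(to_ascii_uppercase_branchless(b'3'), b'3');
}
```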
If instead of …

Benchmark results in GIF for "visual diff": (animated GIF not reproduced here)

Benchmark results in text:

Before:

```
test ascii::long::is_ascii ... bench: 187 ns/iter (+/- 0) = 37379 MB/s
test ascii::long::is_ascii_alphabetic ... bench: 94 ns/iter (+/- 0) = 74361 MB/s
test ascii::long::is_ascii_alphanumeric ... bench: 125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_control ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit ... bench: 125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_graphic ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation ... bench: 124 ns/iter (+/- 1) = 56370 MB/s
test ascii::long::is_ascii_uppercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::medium::is_ascii ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_alphanumeric ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_control ... bench: 23 ns/iter (+/- 1) = 1391 MB/s
test ascii::medium::is_ascii_digit ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_graphic ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_hexdigit ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_lowercase ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase ... bench: 22 ns/iter (+/- 2) = 1454 MB/s
test ascii::medium::is_ascii_whitespace ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::short::is_ascii ... bench: 23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_alphabetic ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_alphanumeric ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_control ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_digit ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_graphic ... bench: 25 ns/iter (+/- 0) = 280 MB/s
test ascii::short::is_ascii_hexdigit ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_lowercase ... bench: 23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase ... bench: 24 ns/iter (+/- 1) = 291 MB/s
test ascii::short::is_ascii_whitespace ... bench: 22 ns/iter (+/- 0) = 318 MB/s
```

After:

```
test ascii::long::is_ascii ... bench: 186 ns/iter (+/- 0) = 37580 MB/s
test ascii::long::is_ascii_alphabetic ... bench: 96 ns/iter (+/- 0) = 72812 MB/s
test ascii::long::is_ascii_alphanumeric ... bench: 119 ns/iter (+/- 0) = 58739 MB/s
test ascii::long::is_ascii_control ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_graphic ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_uppercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace ... bench: 134 ns/iter (+/- 0) = 52164 MB/s
test ascii::medium::is_ascii ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_alphanumeric ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_control ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_digit ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_graphic ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_hexdigit ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_lowercase ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase ... bench: 21 ns/iter (+/- 0) = 1523 MB/s
test ascii::medium::is_ascii_whitespace ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::short::is_ascii ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphabetic ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphanumeric ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_control ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_digit ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_graphic ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_hexdigit ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_lowercase ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase ... bench: 21 ns/iter (+/- 0) = 333 MB/s
test ascii::short::is_ascii_whitespace ... bench: 20 ns/iter (+/- 0) = 350 MB/s
```
Benchmark results from the original PR description, in case they end up being relevant:

```
6830 bytes string:
alloc_only ... bench: 109 ns/iter (+/- 0) = 62660 MB/s
black_box_read_each_byte ... bench: 1,708 ns/iter (+/- 5) = 3998 MB/s
lookup ... bench: 1,725 ns/iter (+/- 2) = 3959 MB/s
branch_and_subtract ... bench: 413 ns/iter (+/- 1) = 16537 MB/s
branch_and_mask ... bench: 411 ns/iter (+/- 2) = 16618 MB/s
branchless ... bench: 377 ns/iter (+/- 2) = 18116 MB/s
libcore ... bench: 378 ns/iter (+/- 2) = 18068 MB/s
fake_simd_u32 ... bench: 373 ns/iter (+/- 1) = 18310 MB/s
fake_simd_u64 ... bench: 374 ns/iter (+/- 0) = 18262 MB/s
32 bytes string:
alloc_only ... bench: 13 ns/iter (+/- 0) = 2461 MB/s
black_box_read_each_byte ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
lookup ... bench: 25 ns/iter (+/- 0) = 1280 MB/s
branch_and_subtract ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branchless ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
libcore ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
fake_simd_u32 ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64 ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
7 bytes string:
alloc_only ... bench: 13 ns/iter (+/- 0) = 538 MB/s
black_box_read_each_byte ... bench: 22 ns/iter (+/- 0) = 318 MB/s
lookup ... bench: 17 ns/iter (+/- 0) = 411 MB/s
branch_and_subtract ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask ... bench: 17 ns/iter (+/- 0) = 411 MB/s
branchless ... bench: 21 ns/iter (+/- 0) = 333 MB/s
libcore ... bench: 21 ns/iter (+/- 0) = 333 MB/s
fake_simd_u32 ... bench: 20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u64 ... bench: 23 ns/iter (+/- 0) = 304 MB/s
```
```diff
@@ -3794,7 +3794,8 @@ impl u8 {
     #[stable(feature = "ascii_methods_on_intrinsics", since = "1.23.0")]
     #[inline]
     pub fn to_ascii_uppercase(&self) -> u8 {
-        ASCII_UPPERCASE_MAP[*self as usize]
+        // Unset the fifth bit if this is a lowercase letter
+        *self & !((self.is_ascii_lowercase() as u8) << 5)
```
Suggested change:

```diff
-        *self & !((self.is_ascii_lowercase() as u8) << 5)
+        *self - ((self.is_ascii_lowercase() as u8) << 5)
```

Using subtract is slightly faster for me:

```
test long::case12_mask_shifted_bool_match_range ... bench: 776 ns/iter (+/- 26) = 9007 MB/s
test long::case13_sub_shifted_bool_match_range  ... bench: 734 ns/iter (+/- 49) = 9523 MB/s
```
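The two formulations are interchangeable for every input byte: a lowercase ASCII letter always has bit 5 (0x20) set, so clearing that bit with `&` and subtracting 0x20 produce the same result, and for any other byte the shifted bool is zero. A quick sketch (the function names are mine) verifying this exhaustively:

```rust
// Mask variant: clear bit 5 when the byte is a lowercase letter.
fn upper_mask(b: u8) -> u8 {
    b & !((b.is_ascii_lowercase() as u8) << 5)
}

// Subtract variant: subtract 0x20 when the byte is a lowercase letter.
fn upper_sub(b: u8) -> u8 {
    b - ((b.is_ascii_lowercase() as u8) << 5)
}

fn main() {
    // For b'a'..=b'z' bit 5 is set, so `& !0x20` equals `- 0x20`;
    // for every other byte both functions are the identity.
    for b in 0..=255u8 {
        assert_eq!(upper_mask(b), upper_sub(b));
    }
}
```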
This is also an improvement for me, but smaller:

```
test ascii::long::case12_mask_shifted_bool_match_range     ... bench: 352 ns/iter (+/- 0) = 19857 MB/s
test ascii::long::case13_subtract_shifted_bool_match_range ... bench: 350 ns/iter (+/- 1) = 19971 MB/s
test ascii::medium::case12_mask_shifted_bool_match_range     ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
test ascii::medium::case13_subtract_shifted_bool_match_range ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
test ascii::short::case12_mask_shifted_bool_match_range     ... bench: 19 ns/iter (+/- 0) = 368 MB/s
test ascii::short::case13_subtract_shifted_bool_match_range ... bench: 18 ns/iter (+/- 0) = 388 MB/s
```
A quick benchmark using … shows that this can be slower than the lookup for a target without SIMD.
What commit were these i586 results on? Because the …
I was just using a recent nightly, so that's why.
@joshtriplett I pushed several changes since your review, could you have another look?
@bors r+ |
📌 Commit 7fad370 has been approved by `joshtriplett`
…=joshtriplett

Make ASCII case conversions more than 4× faster

Reformatted output of `./x.py bench src/libcore --test-args ascii` below. The `libcore` benchmark calls `[u8]::make_ascii_lowercase`. `lookup` has code (effectively) identical to that before this PR, and ~~`branchless`~~ `mask_shifted_bool_match_range` after this PR.

~~See [code comments](rust-lang@ce933f7#diff-01076f91a26400b2db49663d787c2576R3796) in `u8::to_ascii_uppercase` in `src/libcore/num/mod.rs` for an explanation of the branchless algorithm.~~ **Update:** the algorithm was simplified while keeping the performance. See the `branchless` vs. `mask_shifted_bool_match_range` benchmarks.

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on `u32` to convert four bytes at a time. The `fake_simd_u32` benchmark implements this with [`let (before, aligned, after) = bytes.align_to_mut::<u32>()`](https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).

This could be fixed (to optimize `[u8]::make_ascii_lowercase` and `[u8]::make_ascii_uppercase` in `src/libcore/slice/mod.rs`) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this, however, because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.

Benchmark results on Linux x64 with Intel i7-7700K: (updated from rust-lang#59283 (comment))

```
6830 bytes string:
alloc_only                          ... bench:   112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte            ... bench: 1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                        ... bench: 1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                 ... bench:   417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                     ... bench:   401 ns/iter (+/- 1) = 17431 MB/s
branchless                          ... bench:   365 ns/iter (+/- 0) = 19150 MB/s
libcore                             ... bench:   367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                       ... bench:   361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                       ... bench:   361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table ... bench: 6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table         ... bench: 4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range          ... bench:   339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range       ... bench:   339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:
alloc_only                          ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte            ... bench: 29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                        ... bench: 24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                 ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                     ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branchless                          ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
libcore                             ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                       ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                       ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table ... bench: 42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table         ... bench: 35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range          ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range       ... bench: 16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:
alloc_only                          ... bench: 14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte            ... bench: 22 ns/iter (+/- 0) = 318 MB/s
lookup_table                        ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                 ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench: 19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench: 20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench: 18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench: 21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench: 20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench: 19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench: 19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench: 19 ns/iter (+/- 0) = 368 MB/s
```
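For concreteness, here is a sketch of that "fake SIMD" (SWAR) idea as I understand it. This is my own reconstruction, not the PR's code: it packs four bytes into a `u32` and does the per-lane range checks with wide additions, and it assumes every byte is ASCII, since otherwise the additions carry into neighboring lanes, which is exactly the bug described above.

```rust
/// Lowercases four ASCII bytes at once. Sketch only: assumes every byte
/// has its high bit clear; with non-ASCII input the additions below would
/// carry across byte lanes and corrupt neighbors.
fn lowercase_four_ascii_bytes(w: u32) -> u32 {
    debug_assert_eq!(w & 0x8080_8080, 0, "all bytes must be ASCII");
    // Bit 7 of each lane is set iff that byte is >= b'A' (0x80 - 0x41 = 0x3f).
    let ge_a = w.wrapping_add(0x3f3f_3f3f) & 0x8080_8080;
    // Bit 7 of each lane is set iff that byte is <= b'Z' (0x80 - 0x5b = 0x25).
    let le_z = !w.wrapping_add(0x2525_2525) & 0x8080_8080;
    // Move the per-lane flag from bit 7 down to bit 5 (the 0x20 case bit).
    let to_lower = (ge_a & le_z) >> 2;
    w | to_lower
}

fn main() {
    let w = u32::from_le_bytes(*b"Hi Z");
    assert_eq!(lowercase_four_ascii_bytes(w).to_le_bytes(), *b"hi z");
}
```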
Rollup of 18 pull requests

Successful merges:

- #57293 (Make some lints incremental)
- #57565 (syntax: Remove warning for unnecessary path disambiguators)
- #58253 (librustc_driver => 2018)
- #58837 (librustc_interface => 2018)
- #59268 (Add suggestion to use `&*var` when `&str: From<String>` is expected)
- #59283 (Make ASCII case conversions more than 4× faster)
- #59284 (adjust MaybeUninit API to discussions)
- #59372 (add rustfix-able suggestions to trim_{left,right} deprecations)
- #59390 (Make `ptr::eq` documentation mention fat-pointer behavior)
- #59393 (Refactor tuple comparison tests)
- #59420 ([CI] record docker image info for reuse)
- #59421 (Reject integer suffix when tuple indexing)
- #59430 (Renames `EvalContext` to `InterpretCx`)
- #59439 (Generalize diagnostic for `x = y` where `bool` is the expected type)
- #59449 (fix: Make incremental artifact deletion more robust)
- #59451 (Add `Default` to `std::alloc::System`)
- #59459 (Add some tests)
- #59460 (Include id in Thread's Debug implementation)

Failed merges:

r? @ghost
Version 1.35.0 (2019-05-23)
==========================

Language
--------
- [`FnOnce`, `FnMut`, and the `Fn` traits are now implemented for `Box<FnOnce>`, `Box<FnMut>`, and `Box<Fn>` respectively.][59500]
- [You can now coerce closures into unsafe function pointers.][59580] e.g.

  ```rust
  unsafe fn call_unsafe(func: unsafe fn()) {
      func()
  }

  pub fn main() {
      unsafe {
          call_unsafe(|| {});
      }
  }
  ```

Compiler
--------
- [Added the `armv6-unknown-freebsd-gnueabihf` and `armv7-unknown-freebsd-gnueabihf` targets.][58080]
- [Added the `wasm32-unknown-wasi` target.][59464]

Libraries
---------
- [`Thread` will now show its ID in `Debug` output.][59460]
- [`StdinLock`, `StdoutLock`, and `StderrLock` now implement `AsRawFd`.][59512]
- [`alloc::System` now implements `Default`.][59451]
- [Expanded `Debug` output (`{:#?}`) for structs now has a trailing comma on the last field.][59076]
- [`char::{ToLowercase, ToUppercase}` now implement `ExactSizeIterator`.][58778]
- [All `NonZero` numeric types now implement `FromStr`.][58717]
- [Removed the `Read` trait bounds on the `BufReader::{get_ref, get_mut, into_inner}` methods.][58423]
- [You can now call the `dbg!` macro without any parameters to print the file and line where it is called.][57847]
- [In place ASCII case conversions are now up to 4× faster.][59283] e.g. `str::make_ascii_lowercase`
- [`hash_map::{OccupiedEntry, VacantEntry}` now implement `Sync` and `Send`.][58369]

Stabilized APIs
---------------
- [`f32::copysign`]
- [`f64::copysign`]
- [`RefCell::replace_with`]
- [`RefCell::map_split`]
- [`ptr::hash`]
- [`Range::contains`]
- [`RangeFrom::contains`]
- [`RangeTo::contains`]
- [`RangeInclusive::contains`]
- [`RangeToInclusive::contains`]
- [`Option::copied`]

Cargo
-----
- [You can now set `cargo:rustc-cdylib-link-arg` at build time to pass custom linker arguments when building a `cdylib`.][cargo/6298] Its usage is highly platform specific.

Misc
----
- [The Rust toolchain is now available natively for musl based distros.][58575]

[59460]: rust-lang/rust#59460
[59464]: rust-lang/rust#59464
[59500]: rust-lang/rust#59500
[59512]: rust-lang/rust#59512
[59580]: rust-lang/rust#59580
[59283]: rust-lang/rust#59283
[59451]: rust-lang/rust#59451
[59076]: rust-lang/rust#59076
[58778]: rust-lang/rust#58778
[58717]: rust-lang/rust#58717
[58369]: rust-lang/rust#58369
[58423]: rust-lang/rust#58423
[58080]: rust-lang/rust#58080
[57847]: rust-lang/rust#57847
[58575]: rust-lang/rust#58575
[cargo/6298]: rust-lang/cargo#6298
[`f32::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f32.html#method.copysign
[`f64::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f64.html#method.copysign
[`RefCell::replace_with`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.replace_with
[`RefCell::map_split`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.map_split
[`ptr::hash`]: https://doc.rust-lang.org/stable/std/ptr/fn.hash.html
[`Range::contains`]: https://doc.rust-lang.org/std/ops/struct.Range.html#method.contains
[`RangeFrom::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeFrom.html#method.contains
[`RangeTo::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeTo.html#method.contains
[`RangeInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeInclusive.html#method.contains
[`RangeToInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeToInclusive.html#method.contains
[`Option::copied`]: https://doc.rust-lang.org/std/option/enum.Option.html#method.copied
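The "In place ASCII case conversions" entry above refers to the in-place methods, which mutate an existing buffer rather than allocating a new string the way `str::to_lowercase` does. A minimal usage example (standard library only):

```rust
fn main() {
    let mut s = String::from("Make ASCII FAST");
    s.make_ascii_lowercase(); // in place, no new allocation
    assert_eq!(s, "make ascii fast");

    let mut bytes = *b"Rust 1.35";
    bytes.make_ascii_uppercase(); // also available on [u8]
    assert_eq!(&bytes, b"RUST 1.35");
}
```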