Make ASCII case conversions more than 4× faster #59283
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
This looks good to me, modulo the tidy warnings.
I like that the explanation is so much longer than the code :)
Looks good to me. r=me as soon as CI passes.
Might this be slower for platforms without SIMD, which can't take advantage of auto-vectorization, or does that not matter?
It's probably still faster than the status quo on those platforms because it does the computation without branches. If one cared deeply about those platforms, then the pseudo-SIMD approach could be resurrected. However, I think this is a pretty good compromise.
I guess it depends on whether LLVM can auto-vectorize based on "classic" …

I also just realized that when doing one byte at a time, instead of the convoluted add-then-mask to emulate comparison, we can use an actual comparison to obtain the mask:

```rust
byte &= !(0x20 * (b'a' <= byte && byte <= b'z') as u8)
```

This even turns out to be slightly faster! I’ll update the PR.
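As a self-contained illustration of that one-byte-at-a-time trick (a minimal sketch; the function name and test harness below are mine, not the PR's):

```rust
// Branchless ASCII uppercasing: the bool-to-u8 cast is 0 or 1, so the
// multiplied mask is either 0x00 or 0x20, and bit 5 is cleared only
// for bytes in b'a'..=b'z'. No branch is emitted.
fn to_ascii_uppercase_branchless(mut byte: u8) -> u8 {
    byte &= !(0x20 * (b'a' <= byte && byte <= b'z') as u8);
    byte
}

fn main() {
    assert_eq!(to_ascii_uppercase_branchless(b'q'), b'Q');
    assert_eq!(to_ascii_uppercase_branchless(b'Q'), b'Q');
    assert_eq!(to_ascii_uppercase_branchless(b'3'), b'3');
}
```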
If instead of …

Benchmark results in GIF for "visual diff": (animated GIF not reproduced here)

Benchmark results in text:

Before:

```
test ascii::long::is_ascii ... bench: 187 ns/iter (+/- 0) = 37379 MB/s
test ascii::long::is_ascii_alphabetic ... bench: 94 ns/iter (+/- 0) = 74361 MB/s
test ascii::long::is_ascii_alphanumeric ... bench: 125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_control ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit ... bench: 125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_graphic ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation ... bench: 124 ns/iter (+/- 1) = 56370 MB/s
test ascii::long::is_ascii_uppercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::medium::is_ascii ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_alphanumeric ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_control ... bench: 23 ns/iter (+/- 1) = 1391 MB/s
test ascii::medium::is_ascii_digit ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_graphic ... bench: 24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_hexdigit ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_lowercase ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase ... bench: 22 ns/iter (+/- 2) = 1454 MB/s
test ascii::medium::is_ascii_whitespace ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::short::is_ascii ... bench: 23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_alphabetic ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_alphanumeric ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_control ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_digit ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_graphic ... bench: 25 ns/iter (+/- 0) = 280 MB/s
test ascii::short::is_ascii_hexdigit ... bench: 24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_lowercase ... bench: 23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase ... bench: 24 ns/iter (+/- 1) = 291 MB/s
test ascii::short::is_ascii_whitespace ... bench: 22 ns/iter (+/- 0) = 318 MB/s
```

After:

```
test ascii::long::is_ascii ... bench: 186 ns/iter (+/- 0) = 37580 MB/s
test ascii::long::is_ascii_alphabetic ... bench: 96 ns/iter (+/- 0) = 72812 MB/s
test ascii::long::is_ascii_alphanumeric ... bench: 119 ns/iter (+/- 0) = 58739 MB/s
test ascii::long::is_ascii_control ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_graphic ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_uppercase ... bench: 124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace ... bench: 134 ns/iter (+/- 0) = 52164 MB/s
test ascii::medium::is_ascii ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_alphanumeric ... bench: 23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_control ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_digit ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_graphic ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_hexdigit ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_lowercase ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase ... bench: 21 ns/iter (+/- 0) = 1523 MB/s
test ascii::medium::is_ascii_whitespace ... bench: 20 ns/iter (+/- 0) = 1600 MB/s
test ascii::short::is_ascii ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphabetic ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphanumeric ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_control ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_digit ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_graphic ... bench: 23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_hexdigit ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_lowercase ... bench: 20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_punctuation ... bench: 22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase ... bench: 21 ns/iter (+/- 0) = 333 MB/s
test ascii::short::is_ascii_whitespace ... bench: 20 ns/iter (+/- 0) = 350 MB/s
```
Benchmark results from the original PR description, in case they end up being relevant:

```
6830 bytes string:
alloc_only ... bench: 109 ns/iter (+/- 0) = 62660 MB/s
black_box_read_each_byte ... bench: 1,708 ns/iter (+/- 5) = 3998 MB/s
lookup ... bench: 1,725 ns/iter (+/- 2) = 3959 MB/s
branch_and_subtract ... bench: 413 ns/iter (+/- 1) = 16537 MB/s
branch_and_mask ... bench: 411 ns/iter (+/- 2) = 16618 MB/s
branchless ... bench: 377 ns/iter (+/- 2) = 18116 MB/s
libcore ... bench: 378 ns/iter (+/- 2) = 18068 MB/s
fake_simd_u32 ... bench: 373 ns/iter (+/- 1) = 18310 MB/s
fake_simd_u64 ... bench: 374 ns/iter (+/- 0) = 18262 MB/s
32 bytes string:
alloc_only ... bench: 13 ns/iter (+/- 0) = 2461 MB/s
black_box_read_each_byte ... bench: 28 ns/iter (+/- 0) = 1142 MB/s
lookup ... bench: 25 ns/iter (+/- 0) = 1280 MB/s
branch_and_subtract ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branchless ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
libcore ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
fake_simd_u32 ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64 ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
7 bytes string:
alloc_only ... bench: 13 ns/iter (+/- 0) = 538 MB/s
black_box_read_each_byte ... bench: 22 ns/iter (+/- 0) = 318 MB/s
lookup ... bench: 17 ns/iter (+/- 0) = 411 MB/s
branch_and_subtract ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask ... bench: 17 ns/iter (+/- 0) = 411 MB/s
branchless ... bench: 21 ns/iter (+/- 0) = 333 MB/s
libcore ... bench: 21 ns/iter (+/- 0) = 333 MB/s
fake_simd_u32 ... bench: 20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u64 ... bench: 23 ns/iter (+/- 0) = 304 MB/s
```
```diff
@@ -3794,7 +3794,8 @@ impl u8 {
     #[stable(feature = "ascii_methods_on_intrinsics", since = "1.23.0")]
     #[inline]
     pub fn to_ascii_uppercase(&self) -> u8 {
-        ASCII_UPPERCASE_MAP[*self as usize]
+        // Unset the fifth bit if this is a lowercase letter
+        *self & !((self.is_ascii_lowercase() as u8) << 5)
```
Suggested change:

```diff
-        *self & !((self.is_ascii_lowercase() as u8) << 5)
+        *self - ((self.is_ascii_lowercase() as u8) << 5)
```

Using subtract is slightly faster for me:

```
test long::case12_mask_shifted_bool_match_range ... bench: 776 ns/iter (+/- 26) = 9007 MB/s
test long::case13_sub_shifted_bool_match_range  ... bench: 734 ns/iter (+/- 49) = 9523 MB/s
```
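The two formulations are interchangeable for every input byte: a lowercase ASCII letter always has bit 5 (0x20) set, so clearing that bit with `&` and subtracting 0x20 produce the same result, and for any other byte the shifted bool is zero. A quick sketch (the function names are mine) verifying this exhaustively:

```rust
// Mask variant: clear bit 5 when the byte is a lowercase letter.
fn upper_mask(b: u8) -> u8 {
    b & !((b.is_ascii_lowercase() as u8) << 5)
}

// Subtract variant: subtract 0x20 when the byte is a lowercase letter.
fn upper_sub(b: u8) -> u8 {
    b - ((b.is_ascii_lowercase() as u8) << 5)
}

fn main() {
    // For b'a'..=b'z' bit 5 is set, so `& !0x20` equals `- 0x20`;
    // for every other byte both functions are the identity.
    for b in 0..=255u8 {
        assert_eq!(upper_mask(b), upper_sub(b));
    }
}
```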
This is also an improvement for me, but smaller:

```
test ascii::long::case12_mask_shifted_bool_match_range     ... bench: 352 ns/iter (+/- 0) = 19857 MB/s
test ascii::long::case13_subtract_shifted_bool_match_range ... bench: 350 ns/iter (+/- 1) = 19971 MB/s
test ascii::medium::case12_mask_shifted_bool_match_range     ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
test ascii::medium::case13_subtract_shifted_bool_match_range ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
test ascii::short::case12_mask_shifted_bool_match_range     ... bench: 19 ns/iter (+/- 0) = 368 MB/s
test ascii::short::case13_subtract_shifted_bool_match_range ... bench: 18 ns/iter (+/- 0) = 388 MB/s
```
A quick benchmark using … shows that this can be slower than the lookup for a target without SIMD.
What commit were these i586 results on? Because the …
I was just using a recent nightly, so that's why.
@joshtriplett I pushed several changes since your review, could you have another look?
@bors r+ |
📌 Commit 7fad370 has been approved by `joshtriplett`
…=joshtriplett

Make ASCII case conversions more than 4× faster

Reformatted output of `./x.py bench src/libcore --test-args ascii` below. The `libcore` benchmark calls `[u8]::make_ascii_lowercase`. `lookup` has code (effectively) identical to that before this PR, and ~~`branchless`~~ `mask_shifted_bool_match_range` after this PR.

~~See [code comments](rust-lang@ce933f7#diff-01076f91a26400b2db49663d787c2576R3796) in `u8::to_ascii_uppercase` in `src/libcore/num/mod.rs` for an explanation of the branchless algorithm.~~ **Update:** the algorithm was simplified while keeping the performance. See the `branchless` vs. `mask_shifted_bool_match_range` benchmarks.

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on `u32` to convert four bytes at a time. The `fake_simd_u32` benchmark implements this with [`let (before, aligned, after) = bytes.align_to_mut::<u32>()`](https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).

This could be fixed (to optimize `[u8]::make_ascii_lowercase` and `[u8]::make_ascii_uppercase` in `src/libcore/slice/mod.rs`) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this, however, because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.

Benchmark results on Linux x64 with Intel i7-7700K: (updated from rust-lang#59283 (comment))

```
6830 bytes string:
alloc_only                          ... bench:   112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte            ... bench: 1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                        ... bench: 1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                 ... bench:   417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                     ... bench:   401 ns/iter (+/- 1) = 17431 MB/s
branchless                          ... bench:   365 ns/iter (+/- 0) = 19150 MB/s
libcore                             ... bench:   367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                       ... bench:   361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                       ... bench:   361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table ... bench: 6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table         ... bench: 4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range          ... bench:   339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range       ... bench:   339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:
alloc_only                          ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte            ... bench: 29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                        ... bench: 24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                 ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                     ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
branchless                          ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
libcore                             ... bench: 15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                       ... bench: 17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                       ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table ... bench: 42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table         ... bench: 35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range          ... bench: 16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range       ... bench: 16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:
alloc_only                          ... bench: 14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte            ... bench: 22 ns/iter (+/- 0) = 318 MB/s
lookup_table                        ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                 ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench: 16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench: 19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench: 20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench: 18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench: 21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench: 20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench: 19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench: 19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench: 19 ns/iter (+/- 0) = 368 MB/s
```
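For concreteness, here is a sketch of that "fake SIMD" (SWAR) idea as I understand it. This is my own reconstruction, not the PR's code: it packs four bytes into a `u32` and does the per-lane range checks with wide additions, and it assumes every byte is ASCII, since otherwise the additions carry into neighboring lanes, which is exactly the bug described above.

```rust
/// Lowercases four ASCII bytes at once. Sketch only: assumes every byte
/// has its high bit clear; with non-ASCII input the additions below would
/// carry across byte lanes and corrupt neighbors.
fn lowercase_four_ascii_bytes(w: u32) -> u32 {
    debug_assert_eq!(w & 0x8080_8080, 0, "all bytes must be ASCII");
    // Bit 7 of each lane is set iff that byte is >= b'A' (0x80 - 0x41 = 0x3f).
    let ge_a = w.wrapping_add(0x3f3f_3f3f) & 0x8080_8080;
    // Bit 7 of each lane is set iff that byte is <= b'Z' (0x80 - 0x5b = 0x25).
    let le_z = !w.wrapping_add(0x2525_2525) & 0x8080_8080;
    // Move the per-lane flag from bit 7 down to bit 5 (the 0x20 case bit).
    let to_lower = (ge_a & le_z) >> 2;
    w | to_lower
}

fn main() {
    let w = u32::from_le_bytes(*b"Hi Z");
    assert_eq!(lowercase_four_ascii_bytes(w).to_le_bytes(), *b"hi z");
}
```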
Rollup of 18 pull requests

Successful merges:

- #57293 (Make some lints incremental)
- #57565 (syntax: Remove warning for unnecessary path disambiguators)
- #58253 (librustc_driver => 2018)
- #58837 (librustc_interface => 2018)
- #59268 (Add suggestion to use `&*var` when `&str: From<String>` is expected)
- #59283 (Make ASCII case conversions more than 4× faster)
- #59284 (adjust MaybeUninit API to discussions)
- #59372 (add rustfix-able suggestions to trim_{left,right} deprecations)
- #59390 (Make `ptr::eq` documentation mention fat-pointer behavior)
- #59393 (Refactor tuple comparison tests)
- #59420 ([CI] record docker image info for reuse)
- #59421 (Reject integer suffix when tuple indexing)
- #59430 (Renames `EvalContext` to `InterpretCx`)
- #59439 (Generalize diagnostic for `x = y` where `bool` is the expected type)
- #59449 (fix: Make incremental artifact deletion more robust)
- #59451 (Add `Default` to `std::alloc::System`)
- #59459 (Add some tests)
- #59460 (Include id in Thread's Debug implementation)

Failed merges:

r? @ghost
Version 1.35.0 (2019-05-23)
==========================

Language
--------
- [`FnOnce`, `FnMut`, and the `Fn` traits are now implemented for `Box<FnOnce>`, `Box<FnMut>`, and `Box<Fn>` respectively.][59500]
- [You can now coerce closures into unsafe function pointers.][59580] e.g.

  ```rust
  unsafe fn call_unsafe(func: unsafe fn()) {
      func()
  }

  pub fn main() {
      unsafe {
          call_unsafe(|| {});
      }
  }
  ```

Compiler
--------
- [Added the `armv6-unknown-freebsd-gnueabihf` and `armv7-unknown-freebsd-gnueabihf` targets.][58080]
- [Added the `wasm32-unknown-wasi` target.][59464]

Libraries
---------
- [`Thread` will now show its ID in `Debug` output.][59460]
- [`StdinLock`, `StdoutLock`, and `StderrLock` now implement `AsRawFd`.][59512]
- [`alloc::System` now implements `Default`.][59451]
- [Expanded `Debug` output (`{:#?}`) for structs now has a trailing comma on the last field.][59076]
- [`char::{ToLowercase, ToUppercase}` now implement `ExactSizeIterator`.][58778]
- [All `NonZero` numeric types now implement `FromStr`.][58717]
- [Removed the `Read` trait bounds on the `BufReader::{get_ref, get_mut, into_inner}` methods.][58423]
- [You can now call the `dbg!` macro without any parameters to print the file and line where it is called.][57847]
- [In place ASCII case conversions are now up to 4× faster.][59283] e.g. `str::make_ascii_lowercase`
- [`hash_map::{OccupiedEntry, VacantEntry}` now implement `Sync` and `Send`.][58369]

Stabilized APIs
---------------
- [`f32::copysign`]
- [`f64::copysign`]
- [`RefCell::replace_with`]
- [`RefCell::map_split`]
- [`ptr::hash`]
- [`Range::contains`]
- [`RangeFrom::contains`]
- [`RangeTo::contains`]
- [`RangeInclusive::contains`]
- [`RangeToInclusive::contains`]
- [`Option::copied`]

Cargo
-----
- [You can now set `cargo:rustc-cdylib-link-arg` at build time to pass custom linker arguments when building a `cdylib`.][cargo/6298] Its usage is highly platform specific.

Misc
----
- [The Rust toolchain is now available natively for musl based distros.][58575]

[59460]: rust-lang/rust#59460
[59464]: rust-lang/rust#59464
[59500]: rust-lang/rust#59500
[59512]: rust-lang/rust#59512
[59580]: rust-lang/rust#59580
[59283]: rust-lang/rust#59283
[59451]: rust-lang/rust#59451
[59076]: rust-lang/rust#59076
[58778]: rust-lang/rust#58778
[58717]: rust-lang/rust#58717
[58369]: rust-lang/rust#58369
[58423]: rust-lang/rust#58423
[58080]: rust-lang/rust#58080
[57847]: rust-lang/rust#57847
[58575]: rust-lang/rust#58575
[cargo/6298]: rust-lang/cargo#6298
[`f32::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f32.html#method.copysign
[`f64::copysign`]: https://doc.rust-lang.org/stable/std/primitive.f64.html#method.copysign
[`RefCell::replace_with`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.replace_with
[`RefCell::map_split`]: https://doc.rust-lang.org/stable/std/cell/struct.RefCell.html#method.map_split
[`ptr::hash`]: https://doc.rust-lang.org/stable/std/ptr/fn.hash.html
[`Range::contains`]: https://doc.rust-lang.org/std/ops/struct.Range.html#method.contains
[`RangeFrom::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeFrom.html#method.contains
[`RangeTo::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeTo.html#method.contains
[`RangeInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeInclusive.html#method.contains
[`RangeToInclusive::contains`]: https://doc.rust-lang.org/std/ops/struct.RangeToInclusive.html#method.contains
[`Option::copied`]: https://doc.rust-lang.org/std/option/enum.Option.html#method.copied
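The "In place ASCII case conversions" entry above refers to the in-place methods, which mutate an existing buffer rather than allocating a new string the way `str::to_lowercase` does. A minimal usage example (standard library only):

```rust
fn main() {
    let mut s = String::from("Make ASCII FAST");
    s.make_ascii_lowercase(); // in place, no new allocation
    assert_eq!(s, "make ascii fast");

    let mut bytes = *b"Rust 1.35";
    bytes.make_ascii_uppercase(); // also available on [u8]
    assert_eq!(&bytes, b"RUST 1.35");
}
```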