Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve non-SIMD with wordwise validation #123

Merged
merged 4 commits into from
Apr 18, 2023

Conversation

AaronO
Copy link
Contributor

@AaronO AaronO commented Sep 2, 2022

A 1st pass at improving non-SIMD perf, to land after #122

On aarch64/M1 we observe a ~5x asymptotic improvement in uri parsing and ~2.5x in header names & values.

We also observe a -40% time reduction in the req/req bench.

I briefly benched it on x86_64, we see a -20% time reduction in the req/req bench there for non-SIMD, needs more testing/tweaks on x86_64 and possibly only enabling these fastpaths if SIMD is off

Bench results

test req/req ... bench:         145 ns/iter (+/- 2)
test req_short/req_short ... bench:          30 ns/iter (+/- 3)
test resp/resp ... bench:         153 ns/iter (+/- 6)
test resp_short/resp_short ... bench:          35 ns/iter (+/- 0)
test uri/uri_1b ... bench:           1 ns/iter (+/- 0)
test uri/uri_2b ... bench:           1 ns/iter (+/- 0)
test uri/uri_4b ... bench:           3 ns/iter (+/- 0)
test uri/uri_8b ... bench:           5 ns/iter (+/- 0)
test uri/uri_16b ... bench:           7 ns/iter (+/- 0)
test uri/uri_32b ... bench:           8 ns/iter (+/- 0)
test uri/uri_64b ... bench:          13 ns/iter (+/- 0)
test uri/uri_128b ... bench:          15 ns/iter (+/- 0)
test uri/uri_256b ... bench:          25 ns/iter (+/- 1)
test uri/uri_512b ... bench:          49 ns/iter (+/- 0)
test uri/uri_1024b ... bench:         111 ns/iter (+/- 3)
test uri/uri_2048b ... bench:         193 ns/iter (+/- 8)
test uri/uri_4096b ... bench:         374 ns/iter (+/- 18)
test header/name_1b ... bench:           7 ns/iter (+/- 0)
test header/name_2b ... bench:           8 ns/iter (+/- 0)
test header/name_4b ... bench:           8 ns/iter (+/- 0)
test header/name_8b ... bench:           8 ns/iter (+/- 0)
test header/name_16b ... bench:          10 ns/iter (+/- 0)
test header/name_32b ... bench:          13 ns/iter (+/- 0)
test header/name_64b ... bench:          19 ns/iter (+/- 0)
test header/name_128b ... bench:          32 ns/iter (+/- 0)
test header/name_256b ... bench:          57 ns/iter (+/- 0)
test header/name_512b ... bench:         108 ns/iter (+/- 2)
test header/name_1024b ... bench:         221 ns/iter (+/- 25)
test header/name_2048b ... bench:         424 ns/iter (+/- 10)
test header/name_4096b ... bench:         845 ns/iter (+/- 20)
test header/value_1b ... bench:           8 ns/iter (+/- 0)
test header/value_2b ... bench:           8 ns/iter (+/- 0)
test header/value_4b ... bench:          10 ns/iter (+/- 0)
test header/value_8b ... bench:          15 ns/iter (+/- 0)
test header/value_16b ... bench:          16 ns/iter (+/- 0)
test header/value_32b ... bench:          17 ns/iter (+/- 0)
test header/value_64b ... bench:          18 ns/iter (+/- 0)
test header/value_128b ... bench:          22 ns/iter (+/- 0)
test header/value_256b ... bench:          33 ns/iter (+/- 0)
test header/value_512b ... bench:          53 ns/iter (+/- 0)
test header/value_1024b ... bench:         104 ns/iter (+/- 0)
test header/value_2048b ... bench:         185 ns/iter (+/- 1)
test header/value_4096b ... bench:         348 ns/iter (+/- 8)
test header/count_1 ... bench:           7 ns/iter (+/- 0)
test header/count_2 ... bench:          13 ns/iter (+/- 0)
test header/count_4 ... bench:          26 ns/iter (+/- 0)
test header/count_8 ... bench:          46 ns/iter (+/- 1)
test header/count_16 ... bench:          94 ns/iter (+/- 0)
test header/count_32 ... bench:         175 ns/iter (+/- 2)
test header/count_64 ... bench:         341 ns/iter (+/- 90)
test header/count_128 ... bench:         665 ns/iter (+/- 15)
test version/http10 ... bench:           0 ns/iter (+/- 0)
test version/http11 ... bench:           0 ns/iter (+/- 0)
test version/partial ... bench:           1 ns/iter (+/- 0)
test method/get ... bench:           0 ns/iter (+/- 0)
test method/head ... bench:           2 ns/iter (+/- 0)
test method/post ... bench:           1 ns/iter (+/- 0)
test method/put ... bench:           2 ns/iter (+/- 0)
test method/delete ... bench:           3 ns/iter (+/- 0)
test method/connect ... bench:           3 ns/iter (+/- 0)
test method/options ... bench:           3 ns/iter (+/- 0)
test method/trace ... bench:           2 ns/iter (+/- 0)
test method/patch ... bench:           3 ns/iter (+/- 0)
test method/custom ... bench:           3 ns/iter (+/- 0)

Prev baseline (from #122)

test req/req ... bench:                 223 ns/iter (+/- 8)
test req_short/req_short ... bench:     29 ns/iter (+/- 0)
test resp/resp ... bench:               212 ns/iter (+/- 4)
test resp_short/resp_short ... bench:   35 ns/iter (+/- 1)
test uri/uri_1b ... bench:              0 ns/iter (+/- 0)
test uri/uri_2b ... bench:              1 ns/iter (+/- 0)
test uri/uri_4b ... bench:              2 ns/iter (+/- 0)
test uri/uri_8b ... bench:              3 ns/iter (+/- 0)
test uri/uri_16b ... bench:             7 ns/iter (+/- 0)
test uri/uri_32b ... bench:             15 ns/iter (+/- 0)
test uri/uri_64b ... bench:             30 ns/iter (+/- 0)
test uri/uri_128b ... bench:            67 ns/iter (+/- 0)
test uri/uri_256b ... bench:            129 ns/iter (+/- 2)
test uri/uri_512b ... bench:            250 ns/iter (+/- 3)
test uri/uri_1024b ... bench:           494 ns/iter (+/- 5)
test uri/uri_2048b ... bench:           990 ns/iter (+/- 19)
test uri/uri_4096b ... bench:           2069 ns/iter (+/- 525)
test header/name_1b ... bench:          8 ns/iter (+/- 0)
test header/name_2b ... bench:          9 ns/iter (+/- 0)
test header/name_4b ... bench:          10 ns/iter (+/- 0)
test header/name_8b ... bench:          11 ns/iter (+/- 0)
test header/name_16b ... bench:         15 ns/iter (+/- 0)
test header/name_32b ... bench:         23 ns/iter (+/- 0)
test header/name_64b ... bench:         38 ns/iter (+/- 0)
test header/name_128b ... bench:        79 ns/iter (+/- 3)
test header/name_256b ... bench:        137 ns/iter (+/- 2)
test header/name_512b ... bench:        265 ns/iter (+/- 9)
test header/name_1024b ... bench:       506 ns/iter (+/- 10)
test header/name_2048b ... bench:       994 ns/iter (+/- 19)
test header/name_4096b ... bench:       1974 ns/iter (+/- 28)
test header/value_1b ... bench:         8 ns/iter (+/- 0)
test header/value_2b ... bench:         9 ns/iter (+/- 0)
test header/value_4b ... bench:         10 ns/iter (+/- 0)
test header/value_8b ... bench:         9 ns/iter (+/- 0)
test header/value_16b ... bench:        10 ns/iter (+/- 0)
test header/value_32b ... bench:        14 ns/iter (+/- 0)
test header/value_64b ... bench:        20 ns/iter (+/- 0)
test header/value_128b ... bench:       33 ns/iter (+/- 0)
test header/value_256b ... bench:       60 ns/iter (+/- 1)
test header/value_512b ... bench:       127 ns/iter (+/- 2)
test header/value_1024b ... bench:      233 ns/iter (+/- 5)
test header/value_2048b ... bench:      456 ns/iter (+/- 11)
test header/value_4096b ... bench:      893 ns/iter (+/- 23)
test header/count_1 ... bench:          8 ns/iter (+/- 0)
test header/count_2 ... bench:          13 ns/iter (+/- 0)
test header/count_4 ... bench:          25 ns/iter (+/- 0)
test header/count_8 ... bench:          45 ns/iter (+/- 0)
test header/count_16 ... bench:         92 ns/iter (+/- 0)
test header/count_32 ... bench:         175 ns/iter (+/- 2)
test header/count_64 ... bench:         342 ns/iter (+/- 8)
test header/count_128 ... bench:        673 ns/iter (+/- 19)
test version/http10 ... bench:          0 ns/iter (+/- 0)
test version/http11 ... bench:          0 ns/iter (+/- 0)
test version/partial ... bench:         1 ns/iter (+/- 0)
test method/get ... bench:              0 ns/iter (+/- 0)
test method/head ... bench:             2 ns/iter (+/- 0)
test method/post ... bench:             1 ns/iter (+/- 0)
test method/put ... bench:              2 ns/iter (+/- 0)
test method/delete ... bench:           3 ns/iter (+/- 0)
test method/connect ... bench:          3 ns/iter (+/- 0)
test method/options ... bench:          3 ns/iter (+/- 0)
test method/trace ... bench:            2 ns/iter (+/- 0)
test method/patch ... bench:            3 ns/iter (+/- 0)
test method/custom ... bench:           3 ns/iter (+/- 0)

@seanmonstar
Copy link
Owner

This is awesome! Do you know how new of a compiler it requires? The MSRV CI job is grumpy 🤷 .

@paolobarbolini
Copy link
Contributor

@AaronO
Copy link
Contributor Author

AaronO commented Sep 16, 2022

This is awesome! Do you know how new of a compiler it requires? The MSRV CI job is grumpy 🤷 .

Sorry for the delayed response, I've been traveling. As @paolobarbolini pointed out the new MSRV would be 1.44, though we could avoid an MSRV bump and stick to 1.36 with some extra or less-idomatic code (see latest commit)

@AaronO
Copy link
Contributor Author

AaronO commented Sep 16, 2022

@seanmonstar It would be great to land #122 (the benchmarks & baseline) first, then I can polish this a little more before merging and releasing

@AaronO
Copy link
Contributor Author

AaronO commented Apr 12, 2023

@seanmonstar Update here:

Conclusion

With the latest changes this PR nearly doubles throughput on ARM, with no real downsides.

On x86_64 non-SIMD is now roughly on-par with SIMD in fact faster for smaller reqs/resps.

Benchmarks

M1 Air

master

test req/req ... bench:                       219 ns/iter (+/- 1)
test req_short/req_short ... bench:           28 ns/iter (+/- 0)
test resp/resp ... bench:                     208 ns/iter (+/- 2)
test resp_short/resp_short ... bench:         35 ns/iter (+/- 0)

perf/wordwise-validation

test req/req ... bench:                       112 ns/iter (+/- 1)
test req_short/req_short ... bench:           26 ns/iter (+/- 0)
test resp/resp ... bench:                     117 ns/iter (+/- 8)
test resp_short/resp_short ... bench:         28 ns/iter (+/- 0)

GH Codespace 4-core SIMD

(Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz)

master

test req/req ... bench:                       309 ns/iter (+/- 12)
test req_short/req_short ... bench:           66 ns/iter (+/- 5)
test resp/resp ... bench:                     326 ns/iter (+/- 14)
test resp_short/resp_short ... bench:         66 ns/iter (+/- 5)

perf/wordwise-validation

test req/req ... bench:                       297 ns/iter (+/- 18)
test req_short/req_short ... bench:           67 ns/iter (+/- 3)
test resp/resp ... bench:                     317 ns/iter (+/- 100)
test resp_short/resp_short ... bench:         66 ns/iter (+/- 7)

GH Codespace 4-core no-SIMD

master

test req/req ... bench:                       491 ns/iter (+/- 25)
test req_short/req_short ... bench:           65 ns/iter (+/- 2)
test resp/resp ... bench:                     490 ns/iter (+/- 19)
test resp_short/resp_short ... bench:         82 ns/iter (+/- 3)

perf/wordwise-validation

test req/req ... bench:                       318 ns/iter (+/- 16)
test req_short/req_short ... bench:           54 ns/iter (+/- 2)
test resp/resp ... bench:                     324 ns/iter (+/- 13)
test resp_short/resp_short ... bench:         62 ns/iter (+/- 3)

@bartlomieju
Copy link

We're eagerly waiting for this PR in Deno 🙏

@AaronO
Copy link
Contributor Author

AaronO commented Apr 12, 2023

Revisiting this, I took a look at the SIMD code and noticed inefficiencies on larger inputs and with a few tweaks see 2-3x improvements for header-values/URIs greater than 128b, will submit another PR.

128b+ header-values are more common in production than you would expect (and obviously we would want to scale gracefully anyway), I don't have hard data on real-world distributions but anecdotally checked my Chrome headers on Google.com, Cookie is 1024b+ and a handful are 128b+

Summary table

(Disclaimer: aggregated by ChatGPT, which "computed" the ratio rows which aren't exactly correct but close enough)

Test 128b 256b 512b 1024b 2048b 4096b
Before
Header 38 66 123 263 484 946
URI 19 44 116 237 465 937
After
Header 30 39 55 88 193 300
URI 12 20 35 65 127 270
Improvement
Header Ratio ~1.5x ~1.5x ~2.0x ~3.0x ~2.5x ~3.0x
URI Ratio ~1.5x ~2.0x ~3.5x ~3.5x ~3.5x ~3.5x

Raw benches

before:
test header/value_128b ... bench:           38 ns/iter (+/- 3)
test header/value_256b ... bench:           66 ns/iter (+/- 0)
test header/value_512b ... bench:           123 ns/iter (+/- 2)
test header/value_1024b ... bench:          263 ns/iter (+/- 13)
test header/value_2048b ... bench:          484 ns/iter (+/- 19)
test header/value_4096b ... bench:          946 ns/iter (+/- 7)

test uri/uri_128b ... bench:          19 ns/iter (+/- 3)
test uri/uri_256b ... bench:          44 ns/iter (+/- 1)
test uri/uri_512b ... bench:         116 ns/iter (+/- 1)
test uri/uri_1024b ... bench:         237 ns/iter (+/- 3)
test uri/uri_2048b ... bench:         465 ns/iter (+/- 3)
test uri/uri_4096b ... bench:         937 ns/iter (+/- 58)

after:
test header/value_128b ... bench:           30 ns/iter (+/- 1)
test header/value_256b ... bench:           39 ns/iter (+/- 1)
test header/value_512b ... bench:           55 ns/iter (+/- 2)
test header/value_1024b ... bench:          88 ns/iter (+/- 4)
test header/value_2048b ... bench:          193 ns/iter (+/- 49)
test header/value_4096b ... bench:          300 ns/iter (+/- 4)

test uri/uri_128b ... bench:          12 ns/iter (+/- 3)
test uri/uri_256b ... bench:          20 ns/iter (+/- 0)
test uri/uri_512b ... bench:          35 ns/iter (+/- 1)
test uri/uri_1024b ... bench:          65 ns/iter (+/- 4)
test uri/uri_2048b ... bench:         127 ns/iter (+/- 2)
test uri/uri_4096b ... bench:         270 ns/iter (+/- 36)

Copy link
Owner

@seanmonstar seanmonstar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic news! I'm happy to merge this real soon, I have one last question I put inline to address.

src/lib.rs Outdated Show resolved Hide resolved
A 1st pass at improving non-SIMD perf.

On aarch64/M1 we observe a ~5x asymptotic improvement in uri parsing and ~2.5x in header names & values.

We also observe a -40% time reduction in the `req/req` bench.

I briefly benched it on x86_64, we see a -20% time reduction in the `req/req` bench there for non-SIMD, needs more testing on x86_64 and possibly only enable these fastpaths if SIMD is off
Return length of partial validations, reducing penalty of block misses
@AaronO
Copy link
Contributor Author

AaronO commented Apr 13, 2023

@seanmonstar I would consider this PR good to land now, I'll start another PR with the SIMD improvements.

I have identified 2-3 other improvements that will help with "latency" and a cleanup, will try to get to those time permitting.

@AaronO AaronO requested a review from seanmonstar April 13, 2023 21:43
@AaronO
Copy link
Contributor Author

AaronO commented Apr 13, 2023

@seanmonstar The SIMD-related PR is up here: #131

@AaronO
Copy link
Contributor Author

AaronO commented Apr 15, 2023

Small update, I've got a first-pass implementation of NEON support (based off this #123 and #132), the bench results are promising (M1 Air):

test req/req               ... bench:         110 ns/iter (+/- 1)
test req_short/req_short   ... bench:          26 ns/iter (+/- 0)
test resp/resp             ... bench:         116 ns/iter (+/- 0)
test resp_short/resp_short ... bench:          28 ns/iter (+/- 0)
test uri/uri_1b            ... bench:           1 ns/iter (+/- 0)
test uri/uri_2b            ... bench:           2 ns/iter (+/- 0)
test uri/uri_4b            ... bench:           3 ns/iter (+/- 0)
test uri/uri_8b            ... bench:           2 ns/iter (+/- 0)
test uri/uri_16b           ... bench:           2 ns/iter (+/- 0)
test uri/uri_32b           ... bench:           2 ns/iter (+/- 0)
test uri/uri_64b           ... bench:           4 ns/iter (+/- 0)
test uri/uri_128b          ... bench:           7 ns/iter (+/- 0)
test uri/uri_256b          ... bench:          14 ns/iter (+/- 0)
test uri/uri_512b          ... bench:          26 ns/iter (+/- 0)
test uri/uri_1024b         ... bench:          62 ns/iter (+/- 0)
test uri/uri_2048b         ... bench:         112 ns/iter (+/- 3)
test uri/uri_4096b         ... bench:         217 ns/iter (+/- 6)
test header/name_1b        ... bench:           9 ns/iter (+/- 0)
test header/name_2b        ... bench:          10 ns/iter (+/- 0)
test header/name_4b        ... bench:           9 ns/iter (+/- 1)
test header/name_8b        ... bench:           9 ns/iter (+/- 0)
test header/name_16b       ... bench:          11 ns/iter (+/- 0)
test header/name_32b       ... bench:          14 ns/iter (+/- 0)
test header/name_64b       ... bench:          20 ns/iter (+/- 0)
test header/name_128b      ... bench:          33 ns/iter (+/- 0)
test header/name_256b      ... bench:          59 ns/iter (+/- 4)
test header/name_512b      ... bench:         109 ns/iter (+/- 1)
test header/name_1024b     ... bench:         220 ns/iter (+/- 2)
test header/name_2048b     ... bench:         424 ns/iter (+/- 13)
test header/name_4096b     ... bench:         832 ns/iter (+/- 10)
test header/value_1b       ... bench:           9 ns/iter (+/- 0)
test header/value_2b       ... bench:          11 ns/iter (+/- 0)
test header/value_4b       ... bench:          13 ns/iter (+/- 0)
test header/value_8b       ... bench:           9 ns/iter (+/- 0)
test header/value_16b      ... bench:          10 ns/iter (+/- 0)
test header/value_32b      ... bench:          10 ns/iter (+/- 0)
test header/value_64b      ... bench:          13 ns/iter (+/- 3)
test header/value_128b     ... bench:          15 ns/iter (+/- 0)
test header/value_256b     ... bench:          20 ns/iter (+/- 0)
test header/value_512b     ... bench:          33 ns/iter (+/- 0)
test header/value_1024b    ... bench:          57 ns/iter (+/- 0)
test header/value_2048b    ... bench:         119 ns/iter (+/- 0)
test header/value_4096b    ... bench:         224 ns/iter (+/- 2)
test header/count_1        ... bench:           9 ns/iter (+/- 0)
test header/count_2        ... bench:          17 ns/iter (+/- 0)
test header/count_4        ... bench:          32 ns/iter (+/- 0)
test header/count_8        ... bench:          60 ns/iter (+/- 1)
test header/count_16       ... bench:         141 ns/iter (+/- 2)
test header/count_32       ... bench:         281 ns/iter (+/- 11)
test header/count_64       ... bench:         556 ns/iter (+/- 5)
test header/count_128      ... bench:        1112 ns/iter (+/- 27)
test version/http10        ... bench:           0 ns/iter (+/- 0)
test version/http11        ... bench:           0 ns/iter (+/- 0)
test version/partial       ... bench:           1 ns/iter (+/- 0)
test method/get            ... bench:           0 ns/iter (+/- 0)
test method/head           ... bench:           2 ns/iter (+/- 0)
test method/post           ... bench:           0 ns/iter (+/- 0)
test method/put            ... bench:           2 ns/iter (+/- 0)
test method/delete         ... bench:           3 ns/iter (+/- 0)
test method/connect        ... bench:           3 ns/iter (+/- 3)
test method/options        ... bench:           3 ns/iter (+/- 0)
test method/trace          ... bench:           2 ns/iter (+/- 0)
test method/patch          ... bench:           2 ns/iter (+/- 0)
test method/custom         ... bench:           3 ns/iter (+/- 0)

I implemented a vectorized header-name validator but without other changes it's a net-negative since in practice header-names are often less than 16 bytes.

Benches with vectorized header-name:

test req/req               ... bench:         174 ns/iter (+/- 23)
test req_short/req_short   ... bench:          34 ns/iter (+/- 0)
test resp/resp             ... bench:         187 ns/iter (+/- 3)
test resp_short/resp_short ... bench:          33 ns/iter (+/- 0)
test header/name_1b        ... bench:          10 ns/iter (+/- 0)
test header/name_2b        ... bench:          10 ns/iter (+/- 0)
test header/name_4b        ... bench:          10 ns/iter (+/- 0)
test header/name_8b        ... bench:          10 ns/iter (+/- 0)
test header/name_16b       ... bench:          10 ns/iter (+/- 3)
test header/name_32b       ... bench:          11 ns/iter (+/- 25)
test header/name_64b       ... bench:          13 ns/iter (+/- 0)
test header/name_128b      ... bench:          16 ns/iter (+/- 0)
test header/name_256b      ... bench:          23 ns/iter (+/- 5)
test header/name_512b      ... bench:          36 ns/iter (+/- 0)
test header/name_1024b     ... bench:          63 ns/iter (+/- 4)
test header/name_2048b     ... bench:         128 ns/iter (+/- 2)
test header/name_4096b     ... bench:         237 ns/iter (+/- 8)
test header/count_1        ... bench:          10 ns/iter (+/- 0)
test header/count_2        ... bench:          18 ns/iter (+/- 0)
test header/count_4        ... bench:          41 ns/iter (+/- 0)
test header/count_8        ... bench:         121 ns/iter (+/- 4)
test header/count_16       ... bench:         248 ns/iter (+/- 5)
test header/count_32       ... bench:         498 ns/iter (+/- 10)
test header/count_64       ... bench:        1003 ns/iter (+/- 18)
test header/count_128      ... bench:        2006 ns/iter (+/- 28)

I do think there's room for improvement, specifically the handoff between SIMD and non-SIMD is not ideal right now.
I'll investigate that if we can land these other PRs first

@AaronO AaronO mentioned this pull request Apr 15, 2023
@AaronO
Copy link
Contributor Author

AaronO commented Apr 15, 2023

@seanmonstar NEON PR is up: #133

The PRs mostly build off each other and can be landed/reviewed in series.

Copy link
Owner

@seanmonstar seanmonstar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phenomenal work. And I appreciate the code comments helping to explain the bit-fiddling going on. Thanks!

src/lib.rs Show resolved Hide resolved
@seanmonstar seanmonstar merged commit 5dd152e into seanmonstar:master Apr 18, 2023
@AaronO AaronO deleted the perf/wordwise-validation branch April 18, 2023 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants