-
-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: improve non-SIMD with wordwise validation #123
perf: improve non-SIMD with wordwise validation #123
Conversation
This is awesome! Do you know how new of a compiler it requires? The MSRV CI job is grumpy 🤷 . |
Sorry for the delayed response, I've been traveling. As @paolobarbolini pointed out the new MSRV would be 1.44, though we could avoid an MSRV bump and stick to 1.36 with some extra or less-idomatic code (see latest commit) |
@seanmonstar It would be great to land #122 (the benchmarks & baseline) first, then I can polish this a little more before merging and releasing |
873fa21
to
f9ccca3
Compare
@seanmonstar Update here: ConclusionWith the latest changes this PR nearly doubles throughput on ARM, with no real downsides. On x86_64 non-SIMD is now roughly on-par with SIMD in fact faster for smaller reqs/resps. BenchmarksM1 Airmaster
perf/wordwise-validation
GH Codespace 4-core SIMD(Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz) master
perf/wordwise-validation
GH Codespace 4-core no-SIMDmaster
perf/wordwise-validation
|
We're eagerly waiting for this PR in Deno 🙏 |
Revisiting this, I took a look at the SIMD code and noticed inefficiencies on larger inputs and with a few tweaks see 2-3x improvements for header-values/URIs greater than 128b, will submit another PR. 128b+ header-values are more common in production than you would expect (and obviously we would want to scale gracefully anyway), I don't have hard data on real-world distributions but anecdotally checked my Chrome headers on Google.com, Cookie is 1024b+ and a handful are 128b+ Summary table(Disclaimer: aggregated by ChatGPT, which "computed" the ratio rows which aren't exactly correct but close enough)
Raw benches
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic news! I'm happy to merge this real soon, I have one last question I put inline to address.
A 1st pass at improving non-SIMD perf. On aarch64/M1 we observe a ~5x asymptotic improvement in uri parsing and ~2.5x in header names & values. We also observe a -40% time reduction in the `req/req` bench. I briefly benched it on x86_64, we see a -20% time reduction in the `req/req` bench there for non-SIMD, needs more testing on x86_64 and possibly only enable these fastpaths if SIMD is off
cf7933e
to
2fe5407
Compare
@seanmonstar I would consider this PR good to land now, I'll start another PR with the SIMD improvements. I have identified 2-3 other improvements that will help with "latency" and a cleanup, will try to get to those time permitting. |
@seanmonstar The SIMD-related PR is up here: #131 |
Small update, I've got a first-pass implementation of NEON support (based off this #123 and #132), the bench results are promising (M1 Air):
I implemented a vectorized header-name validator but without other changes it's a net-negative since in practice header-names are often less than 16 bytes. Benches with vectorized header-name:
I do think there's room for improvement, specifically the handoff between SIMD and non-SIMD is not ideal right now. |
@seanmonstar NEON PR is up: #133 The PRs mostly build off each other and can be landed/reviewed in series. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phenomenal work. And I appreciate the code comments helping to explain the bit-fiddling going on. Thanks!
A 1st pass at improving non-SIMD perf, to land after #122
On aarch64/M1 we observe a ~5x asymptotic improvement in uri parsing and ~2.5x in header names & values.
We also observe a -40% time reduction in the
req/req
bench.I briefly benched it on x86_64, we see a -20% time reduction in the
req/req
bench there for non-SIMD, needs more testing/tweaks on x86_64 and possibly only enabling these fastpaths if SIMD is offBench results
Prev baseline (from #122)