simd: reapply and fix split cursor advancing #175

lucab · 2024-06-12T09:56:01Z

This reapplies #156 with an additional commit on top in order to fix #172.

AaronO · 2024-09-01T03:40:54Z

Note, cursory benches for this on M1 / arm64 show a ~10% perf regression vs main

AaronO · 2024-09-01T04:00:22Z

@seanmonstar I can take a deeper look at this before reconsidering landing it, haven't looked into it in depth, but potentially there could be gains on avx2 simd from this approach because it avoids reprocessing an entire/last block on SIMD miss, allowing the SWAR or byte-wise code to pick up.

Since avx2 operates on 32B/256bit blocks it possibly "misses" a lot on header-values, short URLs etc...

So definitely a variant of this is worth considering, but I think it should be rigorously benchmarked given those shared in #156 were unclear... @seanmonstar probably worth adding CI benches (which can be noisy, but probably not meaningfully so these pure compute bound flows)

AaronO · 2024-09-01T04:08:37Z

Also a big factor here could be avx2 being inlined or dispatched at runtime (when generically compiled for x64), the inlined version compiled for avx2 will obviously have less overhead.

@lucab Also are all the pub => pub(crate) changes necessary ?

AaronO · 2024-09-01T05:16:39Z

@seanmonstar probably worth adding CI benches (which can be noisy, but probably not meaningfully so these pure compute bound flows)

Did a first pass at CI benches: #179, this PR should show a measurable delta

lucab · 2024-09-01T13:33:28Z

@AaronO from #156 (comment):

There is a minor cleanup bundled in this PR (marking several functions as pub(crate)) which I did in order to make sure I wasn't changing public APIs. I can split that to a dedicated PR if you prefer.

…r#156)" This reverts commit 140faf3.

AaronO · 2024-09-01T14:37:39Z

@AaronO from #156 (comment):

There is a minor cleanup bundled in this PR (marking several functions as pub(crate)) which I did in order to make sure I wasn't changing public APIs. I can split that to a dedicated PR if you prefer.

@lucab Yes I think that should be split out. Especially since the original PR made substantial (AVX2) perf claims, it's best if we can focus on that and remove unrelated "noise". The alt refactor should be judged on its own merits. The focus here should be demonstrating, explaining clear perf benefits. From the benchmarks (local [arm64] & CI) that appears unclear.

AaronO · 2024-09-01T15:00:21Z

@lucab Modern CPUs with speculative execution can counterintuitively be faster "processing the same bytes twice" if both execution paths are independent. There can also be tradeoffs impacting inline-ability if you increase register usage in hot loop, etc...

I touched this code a while back so not 100% sure, but I think I explored your change (if I'm distilling it correctly to returning partial matches) and concluded (based off local M1 benches at the time) that it wasn't a net positive.

For example your original PR #156, explains the improvement with:

This refactors all SIMD modules in order to make the value-matching logic self-contained. Thus, all bytes-cursor manipulations are now grouped and performed once at the end, outside of SIMD logic.

I should double check but IIRC, minimizing the amount of times you update the bytes-cursor sounds like a plausible improvement but in practice it impacts lowering. What actually happens here is that the "stack-allocated" Bytes struct actually gets allocated into 3 registers (start, cursor, end), we're always reading from cursor, with your code I think it gets lowered as basepointer+offset or ends up requiring an extra register for that inner loop.

Long story short, we should really distill this perf improvement hypothesis to its essentials and fully test/understand that.

AaronO · 2024-09-01T15:37:47Z

@lucab Ok looking at the CI benches there does appear to be a beyind-noise improvement for sse42 (not avx2, but possibly I missed something in the CI benches).

It smells like it might boil down to inline-ability of sse42. Would need to double check, I'll take a deeper look tomorrow.

In part it's a perf bug vs purely raising the ceiling, especially for small values sse42 was substantially underperforming swar, possibly due to (not-)inlining (if sse42 is called non-inlined in the loop you pay some painful register init costs) or could be register alloc or something else.

lucab · 2024-09-01T16:11:02Z

@AaronO Ack, I'll split the pub cleanup as I was originally offering to.
I've marked this PR as a draft, as I think we aren't going to land this as-is anyway.

For context, I wasn't really looking at perf improvements when I opened this, but it happened as a local side effect:

My goal was to (eventually) use the bytes crate in this library, but then I realized that the required SIMD-related groundwork was actually providing performance improvements on its own.

This has massive implications on the default runtime perf, improving how the code is lowered/inlined. (Falling back to SSE4.2 for a handful of bytes was wasteful). Should supersede seanmonstar#175, seanmonstar#156

This has massive implications on the default runtime perf, improving how the code is lowered/inlined. (Falling back to SSE4.2 for a handful of bytes was wasteful). Should supersede #175, #156

lucab mentioned this pull request Jun 12, 2024

Fix no space after colon #174

Merged

bearcage mentioned this pull request Jun 15, 2024

Test and fix a regression in parsing partial requests in the 1.9.x series #176

Merged

lucab added 3 commits September 1, 2024 15:44

Reapply "simd: split cursor advancing from value matching (seanmonsta…

916b410

…r#156)" This reverts commit 140faf3.

fix: adapt seanmonstar#176

72d9192

fix: properly advance buffer cursor on header colon

2262ff1

lucab force-pushed the ups/simd-remove-bytes-cursor-2 branch from 7de9ddb to 2262ff1 Compare September 1, 2024 14:00

lucab marked this pull request as draft September 1, 2024 16:05

AaronO mentioned this pull request Sep 2, 2024

perf(simd): avx2 fallback to swar instead of sse4.2 #181

Merged

lucab mentioned this pull request Sep 2, 2024

simd: reduce pub functions visibility #182

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

simd: reapply and fix split cursor advancing #175

simd: reapply and fix split cursor advancing #175

lucab commented Jun 12, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024

lucab commented Sep 1, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024 •

edited

Loading

AaronO commented Sep 1, 2024 •

edited

Loading

lucab commented Sep 1, 2024 •

edited

Loading

simd: reapply and fix split cursor advancing #175

Are you sure you want to change the base?

simd: reapply and fix split cursor advancing #175

Conversation

lucab commented Jun 12, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024

lucab commented Sep 1, 2024

AaronO commented Sep 1, 2024

AaronO commented Sep 1, 2024 • edited Loading

AaronO commented Sep 1, 2024 • edited Loading

lucab commented Sep 1, 2024 • edited Loading

AaronO commented Sep 1, 2024 •

edited

Loading

AaronO commented Sep 1, 2024 •

edited

Loading

lucab commented Sep 1, 2024 •

edited

Loading