perf(parquet): use optimized bloom filter #4441
Conversation
cc @jimexist who contributed the existing implementation -- perhaps you can take a look?
parquet/Cargo.toml
paste = { version = "1.0" }
sbbf-rs-safe = "0.2"
Rather than using a new crate (that would add a new dependency and need to be maintained, etc) what do you think about inlining your implementation into this repository?
I would like to keep it separate since it can be used in a lot of other applications. There are a lot of bloom filter libraries for Rust, but none of them implement the split block variant afaik.
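For context, the split-block scheme that both implementations follow (from the Parquet spec and the Impala C++ code it mentions) can be sketched in plain Rust roughly like this. The `Sbbf` struct and function names here are an illustration, not the crate's actual API:

```rust
/// Salt constants from the Parquet split-block bloom filter specification.
const SALT: [u32; 8] = [
    0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
    0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31,
];

/// One 256-bit block: eight 32-bit words, exactly one bit set per word.
type Block = [u32; 8];

/// Derive the eight-bit mask for a hash: multiply by an odd salt,
/// keep the top 5 bits as the bit index within each word.
fn mask(x: u32) -> Block {
    let mut m = [0u32; 8];
    for i in 0..8 {
        m[i] = 1 << (x.wrapping_mul(SALT[i]) >> 27);
    }
    m
}

struct Sbbf {
    blocks: Vec<Block>,
}

impl Sbbf {
    fn new(num_blocks: usize) -> Self {
        Sbbf { blocks: vec![[0u32; 8]; num_blocks] }
    }

    /// Select a block via a multiply-shift range reduction of the
    /// upper hash bits (no `%`, result always < blocks.len()).
    fn block_index(&self, hash: u64) -> usize {
        (((hash >> 32) * self.blocks.len() as u64) >> 32) as usize
    }

    fn insert(&mut self, hash: u64) {
        let idx = self.block_index(hash);
        let m = mask(hash as u32);
        for i in 0..8 {
            self.blocks[idx][i] |= m[i];
        }
    }

    fn check(&self, hash: u64) -> bool {
        let idx = self.block_index(hash);
        let m = mask(hash as u32);
        (0..8).all(|i| self.blocks[idx][i] & m[i] != 0)
    }
}

fn main() {
    let mut filter = Sbbf::new(32);
    filter.insert(0x1234_5678_9abc_def0);
    assert!(filter.check(0x1234_5678_9abc_def0));
    // An empty filter can never report a hit: every mask has 8 set bits.
    assert!(!Sbbf::new(32).check(42));
    println!("ok");
}
```

The fixed `0..8` loop over one cache-line-sized block is exactly the shape that either auto-vectorizes or gets hand-written with SIMD intrinsics, which is what the rest of this thread debates.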
cc @JayjeetAtGithub and @crepererum (as I think you may be interested in this area of parquet)
I fixed the compilation issue on wasm and fixed fmt.
The results here look good and I think having this in a separate crate is totally OK. If this is really perf critical for anyone, I could imagine that the impl. gets reasonably complex (incl. assembly), which we probably don't want to have in core parquet-rs.
Illegal instruction issue should be fixed, could you run the test again?
Rerunning
I added CI for
I would actually expect the existing code to auto-vectorize and generate nearly the same assembly as with using intrinsics. Maybe there are more optimizations in the crate that I'm missing?
I didn't check the assembly code myself. My implementation is adapted from the C impl of Impala. There is a big difference in the benchmarks, so I don't think it optimizes to the same assembly, especially for AVX2. I'll try to make a fresh run of benchmarks at the sbbf-rs repo when I get access to my PC.
Also the compiler can't enable AVX unless compiled with a `target-cpu` flag.

benchmarks on 4 AMD EPYC cores

parquet2 insert   time: [21.260 ns 21.743 ns 22.242 ns]
parquet insert    time: [14.298 ns 14.501 ns 14.721 ns]
sbbf-rs insert    time: [3.5763 ns 3.6755 ns 3.8263 ns]
parquet2 contains time: [6.1688 ns 6.2463 ns 6.3320 ns]
parquet contains  time: [6.4125 ns 6.5060 ns 6.6049 ns]
sbbf-rs contains  time: [2.5523 ns 2.6273 ns 2.7069 ns]

benchmarks on 4 AMD EPYC cores (with `target-cpu=native`)

parquet insert    time: [8.9271 ns 9.1874 ns 9.4780 ns]
sbbf-rs insert    time: [3.5264 ns 3.5634 ns 3.6087 ns]
parquet2 contains time: [3.2520 ns 3.2906 ns 3.3314 ns]
parquet contains  time: [3.1553 ns 3.1971 ns 3.2399 ns]
sbbf-rs contains  time: [2.2696 ns 2.3210 ns 2.3780 ns]

The sbbf-rs code also has the disadvantage of dynamic dispatch, so the code can't get inlined by the compiler. This actually has a big effect since it is a microbenchmark, only calling one function.
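The dynamic-dispatch point can be illustrated with a small sketch (the trait and names here are illustrative, not sbbf-rs code): a call through a trait object goes through a vtable and is opaque to inlining, while a generic call is monomorphized and trivially inlined.

```rust
// Illustrative sketch of dynamic vs. static dispatch cost in a
// microbenchmark-style call site. Not actual sbbf-rs or parquet code.

trait Filter {
    fn contains(&self, hash: u64) -> bool;
}

struct AlwaysTrue;

impl Filter for AlwaysTrue {
    fn contains(&self, _hash: u64) -> bool {
        true
    }
}

// Dynamic dispatch: resolved through a vtable at runtime,
// so the compiler generally cannot inline `contains`.
fn check_dyn(f: &dyn Filter, hash: u64) -> bool {
    f.contains(hash)
}

// Static dispatch: monomorphized per concrete type,
// so `contains` inlines into the caller.
fn check_static<F: Filter>(f: &F, hash: u64) -> bool {
    f.contains(hash)
}

fn main() {
    let f = AlwaysTrue;
    assert!(check_dyn(&f, 1));
    assert!(check_static(&f, 1));
    println!("ok");
}
```

In a loop that only calls one tiny function, the un-inlinable indirect call dominates, which is why the dispatch strategy shows up so strongly in these numbers.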
If autovectorization / AVX is of concern, we could also bring back runtime feature detection. I know also of a few (cloud) users that are enabling a specific cpu target, e.g. by setting a `target-cpu` flag.
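Runtime feature detection of this kind can be sketched with the standard library's detection macro; the `pick_impl` function and the returned labels are illustrative, not an API from this PR:

```rust
// Sketch of runtime CPU feature detection as an alternative to
// compile-time `target-cpu` flags. Assumes nothing beyond std.

fn pick_impl() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // std's runtime check; only compiled on x86_64 targets.
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
        if is_x86_feature_detected!("sse4.1") {
            return "sse4.1";
        }
    }
    // Portable scalar path for other architectures or older CPUs.
    "fallback"
}

fn main() {
    println!("selected implementation: {}", pick_impl());
}
```

A real implementation would cache the chosen function pointer once at startup, which reintroduces the dynamic-dispatch trade-off discussed above.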
Also checking the old code, it seems like it loads the entire filter into memory, whereas both
We certainly do, especially in a cloud or server side environment we want to make the best use of the available hardware. So our use case does not benefit from runtime dispatching, but I understand this might not be the main use case.

I'm also a huge proponent of using the available instruction set, if necessary via intrinsics. In many cases though, the compiler generates just as good assembly without specialized instructions. Sometimes this requires a bit of restructuring or careful use of unsafe, but simple loops like 0..8 usually work fine, and the Rust code is then more maintainable than the intrinsics.

I don't have a strong opinion on the bloom filter code; using a crate would also make sense instead of implementing it in the arrow project itself. On the other hand, minimizing dependencies to a small, well-known set is also a good goal and something that some commercial users might care about.

Regarding benchmarks, I think the existing
Running `cargo bench` under the parquet folder, first with the master branch and then the sbbf branch.
So something like a 35% gain across the board.
About AVX on distributed code: ClickHouse/ClickHouse#40459.

One problem with compiler generated assembly is that the compiler doesn't seem to generate some of the expected instructions. Not sure why.

Also probably there is a huge loss because of this part: https://github.com/ozgrakkurt/sbbf-rs/blob/c63e9bbb4b4257d954c3c2a4333258f998ad10a0/benches/parquet_impl.rs#L139

It indexes into the slice, which might be generating branching (it actually does generate branching if I paste it into godbolt). I wouldn't trust the Rust compiler to optimize this kind of thing, even though it is supposed to be very good with this. In this case it doesn't understand fast modulo reduction, so it thinks the index can land outside of the slice.

So overall, if the version on master was to be optimized a bit more, and the pattern with
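The "fast modulo reduction" mentioned here is the multiply-shift range reduction: map a 32-bit hash onto `[0, n)` with a widening multiply and shift instead of `%`. The result is provably less than `n`, but the compiler cannot see that, so a plain slice index still emits a bounds check and a branch. A minimal sketch:

```rust
// Multiply-shift range reduction (a.k.a. fast modulo reduction):
// (hash * n) / 2^32 is always in [0, n) when hash fits in 32 bits.

fn fast_range(hash: u32, n: u32) -> u32 {
    (((hash as u64) * (n as u64)) >> 32) as u32
}

fn main() {
    let n = 8;
    for hash in [0u32, 1, 0xdead_beef, u32::MAX] {
        let idx = fast_range(hash, n);
        // Always within range, yet `blocks[idx]` would still be
        // bounds-checked unless written with get_unchecked or similar.
        assert!(idx < n);
        println!("hash {hash:#x} -> block {idx}");
    }
}
```

Since `hash < 2^32`, the product is below `n * 2^32`, so shifting right by 32 yields a value below `n`; hinting this to the compiler (e.g. an `assert!` before the index, or `get_unchecked`) is what removes the branch.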
self.0.index_mut(index)
}
}
use xxhash_rust::xxh64::xxh64;
Looks like (on OSX) this is also part of the speed up.
yeah, forgot about that
Also, in my insert bench I also call contains on the parquet impl. But there is a big diff in the integration tests of writing parquet with bloom filters.
Also realised I don't use the sbbf filter properly in this PR branch as well. It allocates a new buffer when Sbbf::new is called; this shouldn't be the case ideally. It can just take a borrow of the bitset, assuming it is aligned to 64 bits.
Can we also compare the current version to the fallback version in performance in
The fallback version is basically
@Dandandan I updated the fallback version and made a release, code is here: https://github.com/ozgrakkurt/sbbf-rs/blob/master/src/arch/fallback/parquet_impl.rs

Not sure how it will perform since I modified it to fit the api of the library. Seems like it doesn't generate very good assembly for
Thank you very much @ozgrakkurt. I think it makes sense to use the crate, as it seems we can't get similar performance relying on auto vectorization only in this case.
no problem, thanks as well!
I'm somewhat apprehensive about making our default impl be based off a crate with an unclear long-term maintenance story. Perhaps we could make sbbf-rs-safe optional, with it disabled by default and using the existing impl. This would allow people to explicitly opt in to a faster bloom filter if this is important to them and they're comfortable with the implications of this.
I am likewise apprehensive.
Having thought about this more, I think we need to make this an optional feature that people explicitly opt-in to
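An opt-in cargo feature along these lines might look roughly like the following sketch; the feature name and wiring are illustrative, not the PR's actual diff:

```toml
# Hypothetical Cargo.toml sketch for an opt-in bloom filter backend.
[dependencies]
sbbf-rs-safe = { version = "0.2", optional = true }

[features]
# Disabled by default; users opt in explicitly, e.g.:
#   cargo build -p parquet --features sbbf
sbbf = ["dep:sbbf-rs-safe"]
```

With the feature off, the crate would compile the existing implementation and pull in no new dependency.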
done, can you check?
Closes #4213
Uses the bloom filter implementation from a different crate. This crate uses CPU-specific optimized implementations, so it is faster than the old implementation. Benchmarks are on this repo.
I am the author of the library that this PR uses.
There are no user-facing changes.