cuco::bloom_filter #101

Closed
wants to merge 36 commits into from

Conversation

sleeepyjack
Collaborator

@sleeepyjack sleeepyjack commented Aug 9, 2021

Adds a new class called cuco::bloom_filter for approximate set membership queries.

It is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed; the more items added, the larger the probability of false positives.

The implementation used here is known as a "partitioned" or "pattern-blocked" Bloom filter.
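To make the idea concrete, here is a minimal host-side sketch of a blocked Bloom filter, assuming one 64-bit word per block and a splitmix64-style mixer as the hash. The class name `blocked_bloom_filter`, its members, and the hashing scheme are hypothetical and chosen for illustration; they are not the `cuco::bloom_filter` API or its device implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side sketch; NOT the cuco::bloom_filter implementation.
// All bits of a key land in one 64-bit slot, so a query touches one word.
class blocked_bloom_filter {
 public:
  blocked_bloom_filter(std::size_t num_slots, unsigned num_hashes)
    : slots_(num_slots, 0), num_hashes_(num_hashes) {}

  // Set the key's bit pattern in its slot.
  void insert(std::uint64_t key) {
    auto const h = mix(key);
    slots_[slot_index(h)] |= pattern(h);
  }

  // true  -> "possibly in set"; false -> "definitely not in set".
  bool contains(std::uint64_t key) const {
    auto const h = mix(key);
    auto const p = pattern(h);
    return (slots_[slot_index(h)] & p) == p;
  }

 private:
  // splitmix64-style finalizer, used here as an assumed stand-in hash.
  static std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  // Upper hash bits select the slot.
  std::size_t slot_index(std::uint64_t h) const {
    return static_cast<std::size_t>((h >> 32) % slots_.size());
  }

  // Build a pattern with up to num_hashes_ bits set inside one 64-bit word.
  std::uint64_t pattern(std::uint64_t h) const {
    std::uint64_t p = 0;
    for (unsigned i = 0; i < num_hashes_; ++i) {
      p |= std::uint64_t{1} << (h & 63u);
      h = mix(h);  // re-mix to obtain the next bit position
    }
    return p;
  }

  std::vector<std::uint64_t> slots_;
  unsigned num_hashes_;
};
```

Because `insert` only ever sets bits and `contains` checks that every bit of the pattern is set, an inserted key can never be reported as absent, which matches the "no false negatives" property described above.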

This PR comes with examples, benchmarks, as well as unit tests.

@GPUtester

Can one of the admins verify this patch?

@sleeepyjack sleeepyjack changed the title [WIP] Added bloom_filter with example and benchmarks [REVIEW] Add cuco::bloom_filter Aug 18, 2021
@sleeepyjack
Collaborator Author

ok to test

@sleeepyjack
Collaborator Author

Meh, forgot that I don't have the permissions to fire up the CI.

This PR is ready to test and ready for review.

@jrhemstad
Collaborator

add to whitelist

@jrhemstad
Collaborator

okay to test

@jrhemstad
Collaborator

ok to test

@dillon-cullinan
Contributor

add to whitelist

@sleeepyjack
Collaborator Author

rerun tests

@jrhemstad
Collaborator

rerun tests

* in the filter.
*
* @tparam block_size The size of the thread block
* @tparam InputIt Device accessible input iterator whose `value_type` is

As I understand it, input iterators don't enforce the equality_comparable property (unlike legacy input iterators or random access iterators). If I'm not mistaken, we might need to rewrite `(first + tid) < last` as `auto size = distance(first, last); tid < size`, or require legacy input iterators in the documentation. I'm not particularly strong on iterator concepts, so correct me if I'm wrong 😅

@jrhemstad
Collaborator

@sleeepyjack can you resolve conflicts?

@sleeepyjack
Collaborator Author

> @sleeepyjack can you resolve conflicts?

on it

@PointKernel PointKernel added the topic: build CMake build issue label Dec 3, 2021
@PointKernel PointKernel marked this pull request as draft July 28, 2022 16:34
@sleeepyjack
Collaborator Author

[Screenshot 2022-07-29 at 14:21:22]

Wider Slot types result in a better FPR. However, since cuda::atomic<__int128_t>::is_lock_free() == false, the query throughput drops drastically.

@sleeepyjack
Collaborator Author

I'm dropping the cuda::annotated_ptr/cuda::apply_access_policy strategy, as the access policy is apparently not applied correctly (virtually no performance difference between L2-persistent and non-persistent filters). Thus, I'm rolling back to the old strategy, i.e., using the CUDA driver API.

Here are some benchmark results on an A100 80GB, comparing L2-resident and non-resident filters:

| KeyType | SlotType | FilterOperation | FilterScope | DataScope | NumInputs | NumBits | NumHashes | nv/filter/fpr | nv/filter/size/mb | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | U64 | INSERT | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 773x | 656.308 us | 1.37% | 647.584 us | 0.22% | 15.442G | 123.536 GB/s | 6.38% | 814x | 641.942 us |
| I32 | U64 | INSERT | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 81x | 6.194 ms | 1.50% | 6.185 ms | 1.49% | 16.168G | 129.345 GB/s | 6.68% | 85x | 6.163 ms |
| I32 | U64 | INSERT | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1289x | 396.613 us | 2.47% | 387.908 us | 1.02% | 25.779G | 206.235 GB/s | 10.66% | 1374x | 380.201 us |
| I32 | U64 | INSERT | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 171x | 2.940 ms | 0.87% | 2.932 ms | 0.81% | 34.111G | 272.888 GB/s | 14.10% | 180x | 2.940 ms |
| I32 | U64 | INSERT | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1819x | 283.593 us | 3.21% | 274.877 us | 0.51% | 36.380G | 291.039 GB/s | 15.04% | 1896x | 269.990 us |
| I32 | U64 | INSERT | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 201x | 2.505 ms | 0.74% | 2.496 ms | 0.64% | 40.060G | 320.483 GB/s | 16.56% | 202x | 2.519 ms |
| I32 | U64 | INSERT | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1951x | 265.059 us | 3.47% | 256.315 us | 0.62% | 39.014G | 312.115 GB/s | 16.13% | 2031x | 251.270 us |
| I32 | U64 | INSERT | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 217x | 2.316 ms | 0.39% | 2.307 ms | 0.04% | 43.341G | 346.728 GB/s | 17.92% | 227x | 2.302 ms |
| I32 | U64 | CONTAINS | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1793x | 287.897 us | 3.22% | 278.999 us | 0.46% | 35.842G | 286.740 GB/s | 14.82% | 1906x | 262.407 us |
| I32 | U64 | CONTAINS | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 192x | 2.621 ms | 0.64% | 2.612 ms | 0.54% | 38.282G | 306.254 GB/s | 15.82% | 199x | 2.617 ms |
| I32 | U64 | CONTAINS | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1831x | 282.127 us | 3.39% | 273.182 us | 0.85% | 36.606G | 292.845 GB/s | 15.13% | 1946x | 258.885 us |
| I32 | U64 | CONTAINS | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 197x | 2.552 ms | 0.38% | 2.543 ms | 0.13% | 39.323G | 314.587 GB/s | 16.25% | 207x | 2.528 ms |
| I32 | U64 | CONTAINS | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1873x | 276.017 us | 3.42% | 266.967 us | 0.43% | 37.458G | 299.662 GB/s | 15.48% | 1961x | 260.320 us |
| I32 | U64 | CONTAINS | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 194x | 2.593 ms | 0.72% | 2.584 ms | 0.63% | 38.706G | 309.651 GB/s | 16.00% | 196x | 2.599 ms |
| I32 | U64 | CONTAINS | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1891x | 273.571 us | 3.51% | 264.540 us | 0.81% | 37.802G | 302.412 GB/s | 15.63% | 1939x | 258.965 us |
| I32 | U64 | CONTAINS | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 198x | 2.543 ms | 0.36% | 2.534 ms | 0.06% | 39.458G | 315.667 GB/s | 16.31% | 204x | 2.529 ms |
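
As a rough sanity check on the `nv/filter/fpr` column, the textbook estimate for a classic (non-blocked) Bloom filter with $m$ bits, $k$ hash functions, and $n$ inserted keys can be evaluated for the two configurations above. This assumes that `NumInputs` keys were inserted before the FPR was measured, which the results do not state explicitly.

$$
p \approx \left(1 - e^{-kn/m}\right)^{k}
$$

With $m = 3 \times 10^{8}$ and $k = 2$:

$$
n = 10^{7}: \quad p \approx \left(1 - e^{-1/15}\right)^{2} \approx 0.0645^{2} \approx 0.0042
$$

$$
n = 10^{8}: \quad p \approx \left(1 - e^{-2/3}\right)^{2} \approx 0.4866^{2} \approx 0.237
$$

The reported values (0.0059597 and 0.24194) are somewhat higher than this estimate. That is the expected direction for a blocked design, since confining each key's bits to a single block trades some accuracy for memory locality.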

@kkraus14

kkraus14 commented Aug 7, 2024

@sleeepyjack we would love to see this work pushed forward so we can utilize this. Is there anything that we can do to help here?

@sleeepyjack
Collaborator Author

@kkraus14 I can move this up on my task list and hammer out a new draft PR tomorrow so we can get started on discussing the last few design questions. I'll keep you posted.

@sleeepyjack
Collaborator Author

Superseded by #573

@sleeepyjack sleeepyjack closed this Aug 8, 2024
sleeepyjack added a commit that referenced this pull request Oct 2, 2024
Supersedes #101

Implementation of a GPU "Blocked Bloom Filter".

This PR is an updated/optimized version of #101 and features the
following improvements:

- Incorporate the new library design
- Improve performance by computing the key's bit pattern based on a
single hash value instead of using a double hashing derivative
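
To illustrate the difference between the two strategies named in the last bullet, here is a host-side sketch for a single 64-bit word. Both function names, the 64-bit word size, and the 6-bit slicing are assumptions made for illustration; they are not taken from #573 or from the library's actual pattern computation.

```cpp
#include <cstdint>

// Double hashing (assumed form): the i-th bit position is (h1 + i * h2) mod 64,
// which needs two hash values per key.
inline std::uint64_t pattern_double_hashing(std::uint64_t h1, std::uint64_t h2,
                                            unsigned k) {
  std::uint64_t pattern = 0;
  for (unsigned i = 0; i < k; ++i) {
    pattern |= std::uint64_t{1} << ((h1 + i * h2) & 63u);
  }
  return pattern;
}

// Single hash (assumed form): the i-th bit position is the i-th 6-bit slice
// of one 64-bit hash value. Only valid for k <= 10 (10 * 6 = 60 bits).
inline std::uint64_t pattern_single_hash(std::uint64_t h, unsigned k) {
  std::uint64_t pattern = 0;
  for (unsigned i = 0; i < k; ++i) {
    pattern |= std::uint64_t{1} << ((h >> (6u * i)) & 63u);
  }
  return pattern;
}
```

In this sketch the single-hash variant needs one hash evaluation per key and only shifts and masks afterwards, which is one plausible reason such a scheme can be faster.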

---------

Co-authored-by: Yunsong Wang <[email protected]>
Labels
In Progress Currently a work in progress topic: build CMake build issue topic: performance Performance related issue type: feature request New feature request