cuco::bloom_filter #101

sleeepyjack · 2021-08-09T01:59:39Z

Adds a new class called cuco::bloom_filter for approximate set membership queries.

It is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed; the more items added, the larger the probability of false positives.
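For intuition on how the false-positive rate grows with the number of inserted items, the textbook estimate for a classic (non-blocked) Bloom filter can serve as a rough guide. Here $m$ is the number of bits, $n$ the number of inserted keys, and $k$ the number of hash functions. The estimate assumes independent, uniformly distributed hashes, and a blocked filter would be expected to land somewhat above it.

```latex
p_{\mathrm{fp}} \approx \left(1 - e^{-kn/m}\right)^{k}
```

For example, with $m = 3 \cdot 10^{8}$, $k = 2$, and $n = 10^{7}$, this gives roughly $0.004$.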

The type of implementation used here is known as a "partitioned" or "pattern-blocked" bloom filter.
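As a rough host-side illustration of the blocked idea (not the code in this PR), the sketch below maps each key to exactly one 64-bit slot and sets all of its bits inside that slot, so a query touches a single word. The class name, the `mix64` hash, the seed constants, and the double-hashing bit selection are all assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 64-bit mixer (splitmix64 finalizer); a stand-in for the real hash.
inline std::uint64_t mix64(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Illustrative blocked Bloom filter: one 64-bit slot per key, k bits per slot.
class blocked_bloom_filter_sketch {
 public:
  // num_slots must be > 0; num_hashes should be in [1, 64].
  blocked_bloom_filter_sketch(std::size_t num_slots, unsigned num_hashes)
    : slots_(num_slots, 0), num_hashes_(num_hashes) {}

  // Set the key's bit pattern in its slot.
  void insert(std::uint64_t key) { slots_[slot_index(key)] |= pattern(key); }

  // True if every bit of the key's pattern is set ("possibly in set").
  bool contains(std::uint64_t key) const {
    std::uint64_t const p = pattern(key);
    return (slots_[slot_index(key)] & p) == p;
  }

 private:
  // First hash selects the slot.
  std::size_t slot_index(std::uint64_t key) const {
    return static_cast<std::size_t>(mix64(key) % slots_.size());
  }

  // Double hashing selects the bits: position i is (h1 + i * h2) mod 64.
  // Forcing h2 to be odd makes the positions distinct for i < 64.
  std::uint64_t pattern(std::uint64_t key) const {
    std::uint64_t const h1 = mix64(key ^ 0xa5a5a5a5a5a5a5a5ULL);
    std::uint64_t const h2 = mix64(key ^ 0x5a5a5a5a5a5a5a5aULL) | 1;
    std::uint64_t p = 0;
    for (unsigned i = 0; i < num_hashes_; ++i) {
      p |= std::uint64_t{1} << ((h1 + i * h2) & 63);
    }
    return p;
  }

  std::vector<std::uint64_t> slots_;
  unsigned num_hashes_;
};
```

Constructing `blocked_bloom_filter_sketch f(1024, 2)`, calling `f.insert(key)` for a few keys, and then `f.contains(key)` returns true for every inserted key. Keys that were never inserted usually return false, with an occasional false positive.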

This PR comes with examples, benchmarks, and unit tests.

GPUtester · 2021-08-09T01:59:40Z

Can one of the admins verify this patch?

sleeepyjack · 2021-08-18T19:42:16Z

ok to test

sleeepyjack · 2021-08-18T19:44:19Z

Meh, forgot that I don't have the permissions to fire up the CI.

This PR is ready to test and ready for review.

jrhemstad · 2021-08-19T17:08:49Z

add to whitelist

jrhemstad · 2021-08-19T17:09:00Z

okay to test

jrhemstad · 2021-08-19T17:12:40Z

ok to test

dillon-cullinan · 2021-08-19T17:14:57Z

add to whitelist

sleeepyjack · 2021-08-19T19:26:35Z

rerun tests

jrhemstad · 2021-08-24T19:33:45Z

rerun tests

include/cuco/detail/bloom_filter_kernels.cuh

gevtushenko · 2021-10-02T12:56:53Z

+ * in the filter.
+ *
+ * @tparam block_size The size of the thread block
+ * @tparam InputIt Device accessible input iterator whose `value_type` is


As I understand it, input iterators don't enforce the equality_comparable property (unlike legacy input iterators or random access iterators). If I'm not mistaken, we might need to rewrite `(first + tid) < last` as `auto size = distance(first, last); tid < size`, or else require legacy input iterators in the documentation. I'm not particularly strong in the field of iterator concepts, so correct me if I'm wrong 😅

jrhemstad · 2021-11-01T16:47:55Z

@sleeepyjack can you resolve conflicts?

sleeepyjack · 2021-11-14T13:04:45Z

> @sleeepyjack can you resolve conflicts?

on it

sleeepyjack · 2022-07-29T13:04:35Z

Wider Slot types result in a better FPR. However, since cuda::atomic<__int128_t>::is_lock_free() == false, the query throughput drops drastically.
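A quick way to see which slot widths are candidates is a compile-time lock-freedom check. The sketch below uses host-side `std::atomic` (C++17) as a stand-in for `cuda::atomic`, so its results describe the host platform, not the GPU. The `slot128` type and the `always_lock_free` helper are hypothetical names for this example.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 128-bit slot type; a plain struct keeps the sketch portable.
struct alignas(16) slot128 {
  std::uint64_t lo;
  std::uint64_t hi;
};

// True only if atomics of type T never fall back to a lock on this platform.
template <typename T>
constexpr bool always_lock_free = std::atomic<T>::is_always_lock_free;
```

Comparing `always_lock_free<std::uint64_t>` with `always_lock_free<slot128>` shows whether the wider slot would need a lock-based fallback on the host.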

sleeepyjack · 2022-08-01T17:32:07Z

I'm dropping the cuda::annotated_ptr/cuda::apply_access_policy strategy, as the access policy is apparently not applied correctly (virtually no performance difference between L2-persistent and non-persistent filters). Thus, I'm rolling back to the old strategy, i.e., using the CUDA driver API.

Here are some benchmark results on an A100 80GB for an L2-resident vs. a non-resident filter:

| KeyType | SlotType | FilterOperation | FilterScope | DataScope | NumInputs | NumBits | NumHashes | nv/filter/fpr | nv/filter/size/mb | Samples | CPU Time | CPU Noise | GPU Time | GPU Noise | Elem/s | GlobalMem BW | BWUtil | Batch Samples | Batch GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I32 | U64 | INSERT | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 773x | 656.308 us | 1.37% | 647.584 us | 0.22% | 15.442G | 123.536 GB/s | 6.38% | 814x | 641.942 us |
| I32 | U64 | INSERT | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 81x | 6.194 ms | 1.50% | 6.185 ms | 1.49% | 16.168G | 129.345 GB/s | 6.68% | 85x | 6.163 ms |
| I32 | U64 | INSERT | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1289x | 396.613 us | 2.47% | 387.908 us | 1.02% | 25.779G | 206.235 GB/s | 10.66% | 1374x | 380.201 us |
| I32 | U64 | INSERT | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 171x | 2.940 ms | 0.87% | 2.932 ms | 0.81% | 34.111G | 272.888 GB/s | 14.10% | 180x | 2.940 ms |
| I32 | U64 | INSERT | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1819x | 283.593 us | 3.21% | 274.877 us | 0.51% | 36.380G | 291.039 GB/s | 15.04% | 1896x | 269.990 us |
| I32 | U64 | INSERT | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 201x | 2.505 ms | 0.74% | 2.496 ms | 0.64% | 40.060G | 320.483 GB/s | 16.56% | 202x | 2.519 ms |
| I32 | U64 | INSERT | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1951x | 265.059 us | 3.47% | 256.315 us | 0.62% | 39.014G | 312.115 GB/s | 16.13% | 2031x | 251.270 us |
| I32 | U64 | INSERT | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 217x | 2.316 ms | 0.39% | 2.307 ms | 0.04% | 43.341G | 346.728 GB/s | 17.92% | 227x | 2.302 ms |
| I32 | U64 | CONTAINS | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1793x | 287.897 us | 3.22% | 278.999 us | 0.46% | 35.842G | 286.740 GB/s | 14.82% | 1906x | 262.407 us |
| I32 | U64 | CONTAINS | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 192x | 2.621 ms | 0.64% | 2.612 ms | 0.54% | 38.282G | 306.254 GB/s | 15.82% | 199x | 2.617 ms |
| I32 | U64 | CONTAINS | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1831x | 282.127 us | 3.39% | 273.182 us | 0.85% | 36.606G | 292.845 GB/s | 15.13% | 1946x | 258.885 us |
| I32 | U64 | CONTAINS | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 197x | 2.552 ms | 0.38% | 2.543 ms | 0.13% | 39.323G | 314.587 GB/s | 16.25% | 207x | 2.528 ms |
| I32 | U64 | CONTAINS | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1873x | 276.017 us | 3.42% | 266.967 us | 0.43% | 37.458G | 299.662 GB/s | 15.48% | 1961x | 260.320 us |
| I32 | U64 | CONTAINS | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 194x | 2.593 ms | 0.72% | 2.584 ms | 0.63% | 38.706G | 309.651 GB/s | 16.00% | 196x | 2.599 ms |
| I32 | U64 | CONTAINS | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1891x | 273.571 us | 3.51% | 264.540 us | 0.81% | 37.802G | 302.412 GB/s | 15.63% | 1939x | 258.965 us |
| I32 | U64 | CONTAINS | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 198x | 2.543 ms | 0.36% | 2.534 ms | 0.06% | 39.458G | 315.667 GB/s | 16.31% | 204x | 2.529 ms |

kkraus14 · 2024-08-07T01:04:29Z

@sleeepyjack we would love to see this work pushed forward so we can utilize this. Is there anything that we can do to help here?

sleeepyjack · 2024-08-07T02:23:46Z

@kkraus14 I can move this up on my task list and hammer out a new draft PR tomorrow so we can get started on discussing the last few design questions. I'll keep you posted.

sleeepyjack · 2024-08-08T01:28:42Z

Superseded by #573

Supersedes #101. Implementation of a GPU "Blocked Bloom Filter". This PR is an updated/optimized version of #101 and features the following improvements:
- Incorporate the new library design
- Improve performance by computing the key's bit pattern based on a single hash value instead of using a double hashing derivative

Co-authored-by: Yunsong Wang <[email protected]>
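To illustrate the difference between the two pattern strategies mentioned above, the sketch below contrasts a double-hashing derivation with one that slices a single 64-bit hash into 6-bit fields. Both function names and the exact bit-selection rules are assumptions for illustration only; the scheme actually used in #573 may differ.

```cpp
#include <cstdint>

// Double-hashing style: bit i comes from (h1 + i * h2) mod 64, so two hash values are needed.
inline std::uint64_t pattern_double_hashing(std::uint64_t h1, std::uint64_t h2, unsigned k) {
  std::uint64_t p = 0;
  for (unsigned i = 0; i < k; ++i) {
    p |= std::uint64_t{1} << ((h1 + i * h2) & 63);
  }
  return p;
}

// Single-hash style: consecutive 6-bit fields of one 64-bit hash pick the bits (valid for k <= 10).
inline std::uint64_t pattern_single_hash(std::uint64_t h, unsigned k) {
  std::uint64_t p = 0;
  for (unsigned i = 0; i < k; ++i) {
    p |= std::uint64_t{1} << ((h >> (6 * i)) & 63);
  }
  return p;
}
```

The single-hash variant needs only one hash evaluation per key, which is the kind of saving the description alludes to. The trade-off is that two fields can select the same bit, so fewer than k bits may end up set.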

Added bloom_filter with example and benchmarks.

2e68593

sleeepyjack force-pushed the bloom-filter branch from 1261426 to 2e68593 on August 9, 2021 02:02

sleeepyjack mentioned this pull request Aug 10, 2021

[FEA] On-demand size computation to solve #65 and #39 #102

Open

sleeepyjack added 5 commits August 18, 2021 05:31

Add function to (re-)initialize the filter.

f8a4722

Benchmarks refactored. Added benchmark for L2 resident filter.

ce83891

Unit tests for bloom filter added.

4899219

Add missing const specifier for device-side contains operation.

5301bb2

Fix num_bits and num_slots calculation.

aadfaab

sleeepyjack changed the title from "[WIP] Added bloom_filter with example and benchmarks" to "[REVIEW] Add cuco::bloom_filter" on Aug 18, 2021

Fix for key pattern computation. Reduces FPR by a factor of ~10.

2ca505f

sleeepyjack force-pushed the bloom-filter branch from b55e7da to 2ca505f on August 19, 2021 12:02

sleeepyjack added 4 commits August 24, 2021 22:32

Benchmark analysis notebook added.

0634ef7

Added helper functions for L2 residency control.

7bc30c9

Add function for determining optimal grid size.

8faf891

Extended bloom filter benchmarks.

370f11c

gevtushenko reviewed Oct 2, 2021

include/cuco/detail/bloom_filter_kernels.cuh (outdated, resolved)

PointKernel added the "topic: build" (CMake build issue) label on Dec 3, 2021

PointKernel marked this pull request as draft July 28, 2022 16:34

sleeepyjack added 4 commits July 29, 2022 16:45

New Bloom filter benchmarks.

f4fbfa4

Remove external definition of NVBENCH_MODULE.

a1f86f1

Remove outdated L2 residency control helper.

faa5e85

Merge branch 'dev' into bloom-filter

5578405

sleeepyjack force-pushed the bloom-filter branch from 60dcc65 to 5578405 on July 29, 2022 17:01

pre-commit-ci bot and others added 2 commits July 29, 2022 17:01

[pre-commit.ci] auto code formatting

602be1b

Replace grid.num_threads() with grid.size(). (Backport for CUDA 11.0)

4f9acd7

PointKernel reviewed Jul 29, 2022

include/cuco/detail/bloom_filter.inl (outdated, resolved)

More descriptive names for examples.

efc2998

sleeepyjack mentioned this pull request Aug 1, 2022

More descriptive names for examples #196

Merged

sleeepyjack added 7 commits August 1, 2022 11:54

Rename Bloom filter example to comply to the new naming scheme for examples.

51189e4

Add stream param to Bloom filter ctor.

75a127e

Use new CUCO_HAS_INDEPENDENT_THREADS macro.

2e5d6c3

Add helper functions for L2 residency control.

a1ea293

Add usage example for an L2-resident Bloom filter.

6b61279

Use L2 residency control helper functions in benchmark script.

d3f825f

Merge remote-tracking branch 'upstream/dev' into bloom-filter

5c726e7

sleeepyjack mentioned this pull request Aug 2, 2022

L2 residency control helper functions #197

Closed

sleeepyjack mentioned this pull request Aug 9, 2022

[FEA] Variadic ctor parameters #207

Open

4 tasks

PointKernel mentioned this pull request Apr 10, 2023

[FEA] Accelerate Bloom filtered joins NVIDIA/spark-rapids#7803

Closed

4 tasks

sleeepyjack mentioned this pull request Aug 8, 2024

Add cuco::bloom_filter #573

Merged

sleeepyjack closed this Aug 8, 2024
