Implement cub::DeviceFind::FindIf #2405
base: main

Conversation
I would love to see a benchmark comparison of `cub::DeviceFind::FindIf` …
Do I understand correctly, that `cub::DeviceFind::FindIf` …
The name … For your benchmark you could name this value …
@bernhardmgruber
🟨 CI finished in 1h 52m: Pass: 97%/259 | Total: 1d 08h | Avg: 7m 28s | Max: 35m 24s | Hits: 99%/20079

| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| pycuda | |
| CUDA C Core Library | |

Modifications in project or dependencies?

| Project | |
|---|---|
| CCCL Infrastructure | |
| libcu++ | |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | |
| pycuda | +/- |
| CUDA C Core Library | +/- |

🏃 Runner counts (total jobs: 259)

| # | Runner |
|---|---|
| 186 | linux-amd64-cpu16 |
| 42 | linux-amd64-gpu-v100-latest-1 |
| 16 | linux-arm64-cpu16 |
| 15 | windows-amd64-cpu16 |
Force-pushed ce3b44c to 6f9fba8
🟨 CI finished in 4h 42m: Pass: 94%/259 | Total: 5d 01h | Avg: 28m 13s | Max: 1h 39m | Hits: 99%/20079
Force-pushed f2dc70c to 612a9c2
🟨 CI finished in 8h 06m: Pass: 94%/259 | Total: 5d 02h | Avg: 28m 19s | Max: 1h 34m | Hits: 99%/20079
🟨 CI finished in 2h 35m: Pass: 97%/259 | Total: 5d 01h | Avg: 28m 11s | Max: 1h 41m | Hits: 99%/20079
Performance Results of thrust::count_if vs cub::DeviceFind::FindIf

Some long-awaited performance results on A6000 and H200 (extending @gevtushenko's work in #1870). (Benchmark plots: Search Operation, Equal Operation.)
General comments:
- You can make some more variables `const`.
- Your hand-written vectorized loading looks a lot like the `BLOCK_LOAD_VECTORIZE` algorithm of `cub::BlockLoad`. Why can't you use that one?
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
using VectorT = typename CubVector<InputT, _VECTOR_LOAD_LENGTH>::Type;
///
bool full_tile = (tile_offset + tile_size) <= num_items;
bool is_aligned = (size_t(d_in) & (sizeof(VectorT) - 1)) == 0;
```
I find a modulo operation easier to read, and it should compile to the same bitwise operation. Also, reinterpreting a pointer as an integer should use `uintptr_t`. However, that may be a different type than `size_t`, so:

```cpp
// before:
bool is_aligned = (size_t(d_in) & (sizeof(VectorT) - 1)) == 0;
// suggested:
const bool is_aligned = reinterpret_cast<::cuda::std::uintptr_t>(d_in) % uintptr_t{sizeof(VectorT)} == 0;
```
You may also need to handle the case when `Iterator` is not a pointer, but I guess that is planned for later, when the runtime numbers look good :) Given you have a second overload of `IsAlignedAndFullTile` below, you could just change `Iterator` to `T*` in the function signature, so any non-pointer iterator would just not match your overload here.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
InputT* d_in_unqualified = const_cast<InputT*>(begin) + tile_offset + (threadIdx.x * _VECTOR_LOAD_LENGTH);

cub::CacheModifiedInputIterator<cub::CacheLoadModifier::LOAD_LDG, VectorT> d_vec_in(
  reinterpret_cast<VectorT*>(d_in_unqualified));
```
Stripping `const` from the input pointer should not be necessary, since we only want to read from it. Is this a limitation of `cub::CacheModifiedInputIterator`?
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
// use d_temp_storage as the intermediate device result
// to read and write from. Then store the final result in the output iterator.
cuda_mem_set_async_dtemp_storage<<<1, 1>>>(int_temp_storage, num_items);

find_if<<<findif_grid_size, block_threads, 0, stream>>>(d_in, d_in + num_items, op, int_temp_storage, num_items);

write_final_result_in_output_iterator_already<int><<<1, 1>>>(int_temp_storage, d_out);
```
Idea: if `OutputIteratorT` is a pointer to some integer type (which I imagine is common), you could just use that memory location to fire the atomic minimum operations at. Then you would not need a second kernel and also no temporary storage.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
if (threadIdx.x == 0)
{
  sresult = atomicAdd(result, 0);
}
__syncthreads();
```
I think you can use an ordinary load: `sresult = *result;`. In the first loop iteration, only thread 0 reads, so there is no concurrency. In all subsequent iterations, `ConsumeRange` contains a `synchronize_or()` after writes to `result`, so no thread can be writing when thread 0 reads here again.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
InputT input_items[elements_per_thread];
VectorT* vec_items = reinterpret_cast<VectorT*>(input_items);
```
Strictly speaking, `input_items` has a smaller alignment guarantee than the one you assume when you access the data through `vec_items`, so you could run into misaligned stores.
I suggest declaring an array of vectors instead, and reinterpreting it as an array of scalars.
cub/cub/device/device_find_if.cuh (Outdated)

```cpp
int index = i % WORDS + (i / WORDS) * block_threads * WORDS + threadIdx.x * WORDS + tile_offset;
// i % WORDS                           = + 0 1 2 3, 0 1 2 3, 0 1 2 3, ... (static)
// (i / WORDS) * block_threads * WORDS = + 0, 64, 128, ...               (static)
// threadIdx.x * WORDS                 = + 0, 4, 8, ... offset of the thread within working tile
// tile_offset                         = + just start at the beginning of the block
```
The vectorization only concerns the loads, so the processing loop should no longer need to consider `WORDS`. It should just be `auto index = tile_offset + threadIdx.x + i * block_threads;`, like in the non-vectorized version. But please verify. Maybe I am missing something.
So, my suggestion above was wrong. I was somehow assuming both versions used the same access pattern, loading blocks of `elements_per_thread` elements. However, each version loads block-strided. That changes how the index is computed.
cub/cub/device/device_find_if.cuh
Outdated
#pragma unroll | ||
for (int i = 0; i < WORDS; ++i) | ||
{ | ||
vec_items[i] = d_vec_in[block_threads * i]; | ||
} | ||
//// vectorized loads end | ||
|
||
bool found = false; | ||
for (int i = 0; i < elements_per_thread; ++i) | ||
{ |
I wonder whether it would make sense to merge the two loops in order to load one vector, then evaluate the predicate on all scalars of that vector, and then continue to load the next vector. This way, you could avoid loading the rest of the tile in which the predicate turns true, and you can overlap data loading with compute more tightly. But if you want to try this, maybe try it last and stabilize your existing approach first!
Force-pushed 7ff1a1a to 0364cf3

Many thanks to @elstehle for helping figure out this index! (cccl/cub/cub/agent/agent_find.cuh, lines 213 to 217 in 0364cf3)
…ording to tile size
Force-pushed 0364cf3 to 6bb6985
Update: After refactoring the code by introducing Dispatch and Agent layers, the benchmark results look the same on my local A6000 machine. Docs to be added over the weekend.
Force-pushed 6bb6985 to eaa5b75
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
This is a draft to track the work progress on `cub::DeviceFind::FindIf`, which should ultimately be used to improve `thrust::all_of`.

Algorithm Description

The kernel is based on the concept of early cancellation through a global atomic `result` that works as a flag: when its default value (`input.size()`) is overwritten, it is atomically min-compared and then broadcast to the other CTAs, which then avoid proceeding to any further iterations. For that to happen, every block collects a local minimum index, `block_result` (when the predicate is satisfied), and at the end of the loop body the first thread of every block atomically min-compares and writes it to the global minimum index variable.

It is an algorithm that heavily favors cases where the "to-be-found" elements are ubiquitous.

Notes:
- At the beginning of every iteration, each block reads the global atomic into `sresult` and then checks whether it should exit.
- `block_result` should not be confused with `sresult`. Although both are shared memory variables that hold a local minimum, the former is used to write to the global atomic, while the latter is used to read from it.