Conversation
Can one of the admins verify this patch?
Thanks for this. Needs a few tweaks.
thrust/system/cuda/detail/unique.h (outdated)

```cpp
if (first == last) {
  return 0;
}
auto size = last - first;
```
This algorithm should work with forward iterators, but taking the difference of two iterators requires random access. You'll need to find a different way to implement this. @jrhemstad and @allisonvacanti suggest a possible implementation in #1619.
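For reference, a rough shape of such a count_if-based implementation (a minimal sketch under my own assumptions, not the code from the linked discussion; names are made up): count adjacent mismatches through a zip iterator, so the iterators are only copied and incremented.

```cpp
#include <thrust/advance.h>
#include <thrust/count.h>
#include <thrust/distance.h>
#include <thrust/iterator/iterator_traits.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

template <typename T>
struct adjacent_differ
{
  __host__ __device__ bool operator()(const thrust::tuple<T, T>& t) const
  {
    return thrust::get<0>(t) != thrust::get<1>(t);
  }
};

// Count runs of equal values: one run starts at the first element,
// plus one for every adjacent pair that differs. Only copies and
// increments the iterators, so forward iterators are enough.
template <typename ForwardIt>
int unique_count_sketch(ForwardIt first, ForwardIt last)
{
  if (first == last)
    return 0;
  using T = typename thrust::iterator_traits<ForwardIt>::value_type;
  ForwardIt second = first;
  ++second;
  auto pairs_first = thrust::make_zip_iterator(thrust::make_tuple(first, second));
  auto pairs_last  = pairs_first;
  thrust::advance(pairs_last, thrust::distance(first, last) - 1);
  return 1 + thrust::count_if(pairs_first, pairs_last, adjacent_differ<T>());
}
```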
I suppose thrust::distance would be appropriate then? I am having a hard time imagining how you would implement this parallel algorithm without random access iterators.
Regardless of that, I'll give DeviceRunLengthEncode::Encode a try, since it's simpler than I thought on my first look.
Seems like cub doesn't like discard_iterator, since it writes and reads the first key from each run. So I think this is not possible without O(size) additional temporary storage.
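For context, the formulation I tried looks roughly like this (a from-memory sketch, not the exact code; unique_count_via_encode is a made-up name):

```cpp
#include <cub/device/device_run_length_encode.cuh>
#include <cub/iterator/discard_output_iterator.cuh>
#include <thrust/device_vector.h>
#include <cstddef>

// Keep only the number of runs; discard the unique keys and the
// per-run lengths. This is the formulation that breaks, because
// Encode reads the first key of each run back through d_unique_out.
int unique_count_via_encode(const int* d_in, int num_items)
{
  thrust::device_vector<int> d_num_runs(1);
  cub::DiscardOutputIterator<> discard_keys;
  cub::DiscardOutputIterator<> discard_lengths;

  void* d_temp_storage = nullptr;
  std::size_t temp_storage_bytes = 0;
  // First call only queries the temporary storage size.
  cub::DeviceRunLengthEncode::Encode(d_temp_storage, temp_storage_bytes, d_in,
                                     discard_keys, discard_lengths,
                                     d_num_runs.data().get(), num_items);
  thrust::device_vector<char> temp(temp_storage_bytes);
  cub::DeviceRunLengthEncode::Encode(temp.data().get(), temp_storage_bytes, d_in,
                                     discard_keys, discard_lengths,
                                     d_num_runs.data().get(), num_items);
  return d_num_runs[0];
}
```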
Using thrust::distance is correct here.
Could you file a CUB issue describing the discard iterator bug?
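To spell out the fix for the snippet above (a sketch of the obvious replacement, not the exact patch):

```cpp
#include <thrust/distance.h>

// O(1) for random access iterators, repeated increment otherwise,
// so the algorithm keeps working for forward iterators.
auto size = thrust::distance(first, last);
```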
Turns out this has already been reported: https://github.com/NVIDIA/thrust/issues/1490
Okay, DeviceRunLengthEncode::Encode works with cub::DiscardOutputIterator, but (at least on the Titan X I tested on) it gives significantly worse performance. Comparison (GB/s bandwidth):

int

n | count_if | Encode
---|---|---
16M | 179.173 | 109.111
1B | 254.523 | 160.037

long

n | count_if | Encode
---|---|---
16M | 259.123 | 180.335
512M | 315.969 | 215.304

pair<long, int>

n | count_if | Encode
---|---|---
16M | 351.678 | 284.543
256M | 392.726 | 331.611
```cpp
ASSERT_EQUAL(div_10_count, 3);
}
DECLARE_INTEGRAL_VECTOR_UNITTEST(TestUniqueCountSimple);
```
unique_count claims to work with forward iterators. There should be a test using forward iterators.
Is there a nice way to wrap an iterator into a forward_iterator in Thrust? I wrote a small wrapper class and that seems to work and compile, but I suppose that problem has been solved elsewhere already?
@ericniebler is right, we should be testing this. Unfortunately we lack robust testing for these sorts of things.
Thanks for adding the new testing infrastructure! Please include it in this PR, ideally in the testing framework so we can reuse it from other tests later 👍
I added a test to unique_copy and unique. I am not 100% sure it does what we would expect: due to the missing iterator tag, it gets executed sequentially on the CPU, using device references for access.
You mean, forward iterators always dispatch to the CPU? @allisonvacanti can you comment on that? I mean, it seems reasonable to me.
This sounded odd to me, as I've never seen any logic in Thrust that would dispatch forward iterators to serial implementations. So I dug into it, and unfortunately this is due to a pretty nasty bug in Thrust's iterator traits.
The details are gory, but I've summarized them in a comment on #902.
Until that's fixed, I'm not comfortable merging this test using the forward_iterator_wrapper, since they only "do the right thing" because the iterator framework is broken.
I hate to churn on this PR even more, but I think we should remove the iterator wrappers for now and just test that the regular iterators work. We can re-introduce the wrapper tests as part of #55, after #902 is fixed and settled.
@ericniebler Can you review the two linked issues and see if you agree with my suggestion?
I don't think that forward iterators actually need to be dispatched to the sequential backend. They support multipass reading and should be usable in a parallel algorithm, so long as they're only copied and incremented. Is there something in the unique_count/count_if algorithms that would break them?
The main issue I see with parallel execution on forward iterators is that they introduce an essentially linear dependency chain: either every thread i starts from begin and increments it i times, or it waits until one of its predecessors j is done and writes its iterator somewhere, to then increment it i - j times. Neither option really seems useful for parallel execution to me.
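To make that concrete: with nothing but copy and increment available, distributing work over threads degenerates into something like this (illustrative sketch, hypothetical helper name):

```cpp
// Forward iterators only support copy and increment, so thread i must
// pay O(i) increments just to reach its element; summed over n threads
// that is O(n^2) increments before any "real" work starts.
template <typename ForwardIt>
__host__ __device__
ForwardIt advance_to(ForwardIt it, long i)
{
  for (long k = 0; k < i; ++k) {
    ++it;  // no it + i or it[i] without random access
  }
  return it;
}
```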
It will require more increments, but if the work-per-element is expensive compared to the cost of the iterator increment, it may still make sense to parallelize. I'd rather let the user make that call, since they can opt in to sequential execution by passing the sequential execution policy (thrust::seq).
More importantly, the sequential implementation executes on the CPU, and some types of device memory aren't accessible from the CPU's address space, so switching to seq really needs to be opt-in rather than the default.
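For example (a sketch, using the explicit-policy overload this PR adds):

```cpp
#include <thrust/unique.h>
#include <thrust/execution_policy.h>

// Opt in to sequential execution explicitly; the iterators must then
// be dereferenceable from the calling (CPU) thread.
auto n = thrust::unique_count(thrust::seq, first, last);
```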
Thanks for the clarification, that makes sense! I was only thinking of simple but massively parallel cases.
I'm suspicious of why this would be slower than the count_if version. @upsj, if you're interested, I'd be curious to see an Nsight Compute profile of those two kernels with emphasis on the memory movement summary. That would tell us if one of them is moving more memory than the other.
@jrhemstad Yes, I thought the same thing. I don't have Nsight Compute set up on my laptop, but I can give you some nvprof numbers on a Titan X (int with 1B elements). Occupancy, IPC, etc. are similar, so it seems like the decrease in achieved throughput comes from a higher number of DRAM accesses.
This is definitely an issue specific to the implementation of count_if. With a quick-and-dirty hand-written implementation, I get performance on par with plain memory bandwidth:

```cpp
#include <thrust/unique.h>
#include <thrust/count.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/sequence.h>
#include <chrono>
#include <iostream>
#include <string>

// Hand-written unique_count kernel: grid-stride loop plus a two-level
// (warp, then block) shuffle reduction; one partial count per block.
template <typename T>
__global__ __launch_bounds__(1024) void unique_count_raw(const T* data, int size, int* out_counts)
{
  const auto tidx = threadIdx.x + blockIdx.x * blockDim.x;
  const auto num_threads = gridDim.x * blockDim.x;
  __shared__ int counts[32];
  int count{};
  const auto lane = tidx % 32;
  for (auto i = tidx; i < size - 1; i += num_threads) {
    count += data[i] != data[i + 1];
  }
  // warp-level reduction
#pragma unroll
  for (int step = 1; step < 32; step *= 2) {
    count += __shfl_xor_sync(0xFFFFFFFFu, count, step);
  }
  if (lane == 0) {
    counts[threadIdx.x / 32] = count;
  }
  __syncthreads();
  if (threadIdx.x >= 32) {
    return;
  }
  // first warp reduces the per-warp partials
  count = counts[lane];
#pragma unroll
  for (int step = 1; step < 32; step *= 2) {
    count += __shfl_xor_sync(0xFFFFFFFFu, count, step);
  }
  if (lane == 0) {
    out_counts[blockIdx.x] = count;
  }
}

__device__ __host__ long get(int a) { return a; }
__device__ __host__ long get(long a) { return a; }
template <typename A, typename B>
__device__ __host__ long get(thrust::tuple<A, B> a) { return thrust::get<0>(a) + thrust::get<1>(a); }

// Times unique_count, an equivalent count_if, and the raw kernel,
// and reports effective bandwidth for each.
template <typename ForwardIt>
void benchmark(std::string desc, ForwardIt begin, ForwardIt end)
{
  thrust::unique_count(begin, end);  // warm-up
  cudaDeviceSynchronize();
  const auto reps = 100;
  auto start = std::chrono::high_resolution_clock::now();
  int count1;
  int count2;
  for (int i = 0; i < reps; i++) {
    count1 = thrust::unique_count(begin, end);
  }
  cudaDeviceSynchronize();
  auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(
                   std::chrono::high_resolution_clock::now() - start).count() / reps;
  cudaDeviceSynchronize();
  start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < reps; i++) {
    thrust::count_if(begin, end, [] __host__ __device__ (typename ForwardIt::value_type val) {
      return get(val) >= 0;
    });
  }
  cudaDeviceSynchronize();
  auto total2 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    std::chrono::high_resolution_clock::now() - start).count() / reps;
  cudaDeviceSynchronize();
  start = std::chrono::high_resolution_clock::now();
  thrust::device_vector<int> tmp_storage(1024);
  for (int i = 0; i < reps; i++) {
    unique_count_raw<<<1024, 1024>>>((&begin[0]).get(), end - begin, tmp_storage.data().get());
    count2 = 1 + thrust::reduce(tmp_storage.begin(), tmp_storage.end());
  }
  cudaDeviceSynchronize();
  auto total3 = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    std::chrono::high_resolution_clock::now() - start).count() / reps;
  if (count1 != count2) {
    std::cerr << "Invalid: " << count1 << ',' << count2 << '\n';
  }
  auto memory = sizeof(typename ForwardIt::value_type) * (end - begin) * 1.0;
  std::cout << desc << ',' << (memory / total) << ',' << (memory / total2) << ',' << (memory / total3) << '\n';
}

int main()
{
  {
    thrust::device_vector<int> data(1u << 30);
    thrust::sequence(data.begin(), data.end(), 0);
    for (long i = 1; i <= data.size(); i *= 2) {
      benchmark("int," + std::to_string(i), data.begin(), data.begin() + i);
    }
  }
  {
    thrust::device_vector<long> data(1u << 29);
    thrust::sequence(data.begin(), data.end(), 0);
    for (long i = 1; i <= data.size(); i *= 2) {
      benchmark("long," + std::to_string(i), data.begin(), data.begin() + i);
    }
  }
  /*{
    thrust::device_vector<int> data1(1u << 28);
    thrust::device_vector<long> data2(1u << 28);
    thrust::sequence(data1.begin(), data1.end(), 0);
    thrust::sequence(data2.begin(), data2.end(), 0);
    auto it = thrust::make_zip_iterator(thrust::make_tuple(data1.begin(), data2.begin()));
    for (long i = data1.size(); i <= data1.size(); i *= 2) {
      benchmark("pair<int long>," + std::to_string(i), it, it + i);
    }
  }*/
}
```
After taking a brief look over cub's Reduce implementation, I would guess that it is very well optimized for a particular setting: pointer-like iterators that read exactly as much memory as their value_type occupies.
Thanks for adding the test! I have a few more suggested changes to the new forward_iterator_wrapper class, and a question for @allisonvacanti.
@ericniebler Thanks for the feedback, it's been a while since I've written truly generic code!
run tests
@senior-zero any ideas about why this is slower? I suspect something is going on with strided accesses as a result of the CUB reduction implementation.
This should be possible. Thrust packs some extra info into its iterator category. See thrust/iterator/iterator_categories.h for more info.
The iterator wrapper will need to test the base iterator using thrust::iterator_system<BaseIterator>::type, and define its iterator category to forward_host_iterator_tag if the result is host_system_tag, or forward_device_iterator_tag if the result is device_system_tag.
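A sketch of what that could look like on the wrapper (assuming the test helper is called forward_iterator_wrapper as above; abbreviated, not the final code):

```cpp
#include <thrust/iterator/iterator_categories.h>
#include <thrust/iterator/iterator_traits.h>
#include <type_traits>

// Demote a random-access iterator to forward traversal while keeping
// Thrust's host/device system information in the category tag.
template <typename BaseIterator>
class forward_iterator_wrapper
{
  using system = typename thrust::iterator_system<BaseIterator>::type;

public:
  using iterator_category = typename std::conditional<
      std::is_same<system, thrust::host_system_tag>::value,
      thrust::forward_host_iterator_tag,
      thrust::forward_device_iterator_tag>::type;
  using value_type      = typename thrust::iterator_value<BaseIterator>::type;
  using difference_type = typename thrust::iterator_difference<BaseIterator>::type;
  using pointer         = void;
  using reference       = typename thrust::iterator_reference<BaseIterator>::type;

  __host__ __device__ explicit forward_iterator_wrapper(BaseIterator base) : base_(base) {}
  __host__ __device__ reference operator*() const { return *base_; }
  __host__ __device__ forward_iterator_wrapper& operator++() { ++base_; return *this; }

  // hidden friend, as suggested in review; post-increment and
  // operator!= omitted for brevity
  __host__ __device__ friend bool operator==(const forward_iterator_wrapper& a,
                                             const forward_iterator_wrapper& b)
  { return a.base_ == b.base_; }

private:
  BaseIterator base_;
};
```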
@jrhemstad I've profiled the latest version:

```cpp
const int n = 128 * 1024 * 1024;
thrust::device_vector<std::size_t> a(n);
thrust::sequence(a.begin(), a.end());
thrust::unique_count(a.begin(), a.end());
```

Here are the results:
Am I missing the reproducer that leads to 2x global loads?
@senior-zero good question! I haven't tried running it myself; I was going off of @upsj's comment here: #1619 (comment). @upsj, are you not seeing the perf regression anymore?
Add a counting equivalent to unique_* algorithms that can be used to allocate the correct amount of data before actually filling it. Addresses issue NVIDIA#1612
The interface specifies ForwardIterator, not InputIterator
* use iterator traits
* use hidden friend operators
* fix member access operator

Co-authored-by: Eric Niebler <[email protected]>
@senior-zero @jrhemstad I was talking about load instructions executed, not the amount of memory served by DRAM.
run tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes to block merging until the situation in https://github.com/NVIDIA/thrust/pull/1619/files#r866368806 is resolved.
@allisonvacanti I reverted the changes so they're still available in case you need them later. Does that work for you?
That's perfect, I'll note this in NVIDIA/cccl#679. Thanks!

run tests
Tests look good, merging. Thanks for the PR!
Add a counting equivalent to unique_* algorithms that can be used to allocate the correct amount of data before actually filling it.
The formatting is a bit all over the place; I just tried to fit the environment as well as possible.
I'll run a few benchmarks to see how this fares performance-wise. (I am hoping I won't need to use cub::DeviceRunLengthEncode::Encode, because that would be more work to integrate, and does more work internally.) I tested everything with the CUDA, OMP and CPU backends; I don't have a TBB installation on hand, but the code is equivalent to OpenMP, so hopefully I didn't break anything :)
Closes #1612
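For reference, the usage pattern this enables (the motivation in #1612) is: count first, allocate exactly, then fill. A minimal sketch:

```cpp
#include <thrust/device_vector.h>
#include <thrust/unique.h>

int main()
{
  const int h[] = {1, 1, 2, 3, 3, 3, 4};
  thrust::device_vector<int> data(h, h + 7);

  // Size the output exactly before filling it.
  const int num_unique = thrust::unique_count(data.begin(), data.end());
  thrust::device_vector<int> out(num_unique);
  thrust::unique_copy(data.begin(), data.end(), out.begin());
  // out now holds {1, 2, 3, 4}
}
```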