Added batch memset to memset data and validity buffers in parquet reader #16281
Conversation
This should use
Hi @sdrp713, thank you for your effort. If possible, can you please merge the two tables in the
+1 on using this instead of the new impl.
/ok to test
/ok to test
Approved trivial CMake changes
* @return The data in device spans all set to value
*/

namespace CUDF_EXPORT cudf {
Should this be exported as a public API? In my understanding, that is only needed if we use this API in our tests. Since we have a test, maybe this is correct. cc: @robertmaynard to check my assertions (still unsure about the rules; we need docs on CUDF_EXPORT).
I had the same thought, but yes, more info and confirmation are needed.
namespace CUDF_EXPORT cudf {
namespace io::detail {

void batched_memset(std::vector<cudf::device_span<uint64_t>>& bufs,
Why are the buffers typed as uint64_t? I thought this would be a generic function with support for any fixed-width types. Do we not need the ability to set buffers of std::byte to a value?
It seems like we're being too specific to the use case in Parquet if the only type we care about is uint64_t. If that's the case, should we move this to src/io/parquet or something like that?
I think this is okay, as one can always initialize a std::byte buffer by providing an 8-byte value for it (say, -1UL) to initialize all bytes as 0xFF. I am a bit concerned about __int128_t buffers if we aren't initializing with trivial 0s or -1ULs.
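For illustration, a small host-side sketch (hypothetical, not from this PR) of why an all-ones 64-bit fill value sets every byte to 0xFF:

#include <cstdint>
#include <cstring>

int main()
{
  // Back a 16-byte buffer with 8-byte words (assumes the allocation size
  // is a multiple of 8, as discussed in this thread).
  std::uint64_t words[2];
  std::uint64_t const fill = ~0ULL;  // i.e. -1UL: every byte of the word is 0xFF
  for (auto& w : words) { w = fill; }

  // Viewing the same storage byte by byte shows each byte is 0xFF.
  unsigned char bytes[16];
  std::memcpy(bytes, words, sizeof bytes);
  return (bytes[0] == 0xFF && bytes[15] == 0xFF) ? 0 : 1;
}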
This doesn’t work for std::byte buffers unless their size is a multiple of 8, right?
It relies on the buffer allocations being 8-byte aligned, as per the comment.
Can we design this as a templated function that supports arbitrary fixed-width types, or narrow the scope so this is more clearly a utility for only one specific use case in Parquet? Currently it is positioned as a generic IO utility, but it is not generic.
I think the goal here is to cast all buffers to a common type so we can use the CUB function to memset all of them at once. Casting to uint64_t does seem okay here as long as we can guarantee allocations are a multiple of 8 bytes (?) and we write only trivial all-zeros or all-ones values for int128 (templating may help here), and any value for all smaller types (not sure if we do this in any reader for now).
Alignment and length are different. Yes, 8-byte alignment would be required. But how would this work for a 4-byte allocation? Even if it is aligned (the starting address is on an 8-byte boundary), you can't write a uint64_t because it would write four bytes out of bounds.
For working with variable-length buffers like this, using std::byte as the common type seems like the only solution. Reinterpreting other allocations as uint64_t* is not safe (UB).
> Alignment and length are different. Yes, 8-byte alignment would be required. But how would this work for a 4-byte allocation? Even if it is aligned (the starting address is on an 8-byte boundary), you can't write a uint64_t because it would write four bytes out of bounds.
Definitely. By saying "8-byte aligned alloc" I meant that the allocation size is a multiple of 8 bytes.
// Using uint64_t as we are relying on the allocation padding for the buffers to be large enough
// that we can always write a multiple of 8 byte words
From this comment in reader_impl_preprocess.cu, it looks like we are relying on the assumption that rmm pads the buffer allocations so the allocated sizes are a multiple of 8 bytes, to avoid going out of bounds. (I could only find this info in rapidsai/rmm#1278, where it looks like that padding has been removed.) So it should be done explicitly in the column_buffers.hpp functions if we go that path.
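For illustration, making that padding explicit could look something like the following (a hypothetical helper; the integration into column_buffers.hpp is assumed, not taken from this PR):

#include <cstddef>

// Round a requested byte count up to the next multiple of 8 so that whole
// uint64_t words can be written without going out of bounds.
constexpr std::size_t pad_to_multiple_of_8(std::size_t bytes)
{
  return (bytes + 7) & ~std::size_t{7};
}

static_assert(pad_to_multiple_of_8(4) == 8);
static_assert(pad_to_multiple_of_8(16) == 16);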
> For working with variable-length buffers like this, using std::byte as the common type seems like the only solution. Reinterpreting other allocations as uint64_t* is not safe (UB).
Certainly, if we can ensure that the value to memset would always fit in 1 byte even if we are memsetting an int4 or int8 buffer - the same concern I had with int128_t under the current design. Though, with std::byte I am concerned we may lose any performance advantage we are currently seeing. Worth a try though, @sdrp713!
> Though, with std::byte I am concerned we may lose any performance advantage we are currently seeing.
I discussed this with @abellina and @sdrp713. CUB decides how many items-per-thread to use based on some heuristics. This should be efficient, and if it's not, we can file an issue with CUB.
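For reference, a minimal sketch of how such a batched fill can be expressed with cub::DeviceCopy::Batched (assuming CUB 2.x; the span stand-in and function names are illustrative, not the PR's actual code):

#include <cub/device/device_copy.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cstddef>
#include <cstdint>

// Minimal stand-in for cudf::device_span<uint64_t>.
struct span {
  std::uint64_t* data;
  std::size_t size;
};

// Functors so CUB can look up each buffer's pointer and size on the device.
struct get_ptr {
  span const* bufs;
  __host__ __device__ std::uint64_t* operator()(std::size_t i) const { return bufs[i].data; }
};
struct get_size {
  span const* bufs;
  __host__ __device__ std::size_t operator()(std::size_t i) const { return bufs[i].size; }
};

void batched_fill(thrust::device_vector<span> const& d_bufs,
                  std::uint64_t value,
                  cudaStream_t stream)
{
  auto const n   = d_bufs.size();
  auto const raw = thrust::raw_pointer_cast(d_bufs.data());

  // Every output buffer "copies" from the same infinite run of `value`.
  auto in  = thrust::make_constant_iterator(thrust::make_constant_iterator(value));
  auto out = thrust::make_transform_iterator(thrust::counting_iterator<std::size_t>(0), get_ptr{raw});
  auto sz  = thrust::make_transform_iterator(thrust::counting_iterator<std::size_t>(0), get_size{raw});

  // Standard two-phase CUB call: size the temp storage, then run the batched copy.
  std::size_t temp_bytes = 0;
  cub::DeviceCopy::Batched(nullptr, temp_bytes, in, out, sz, n, stream);
  thrust::device_vector<char> temp(temp_bytes);
  cub::DeviceCopy::Batched(thrust::raw_pointer_cast(temp.data()), temp_bytes, in, out, sz, n, stream);
}

CUB chooses its own items-per-thread internally, so a single launch services all buffers regardless of how uneven their sizes are.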
// list columns store a buffer of int32's as offsets to represent
// their individual rows
case type_id::LIST: _data = create_data(data_type{type_id::INT32}, size, stream, _mr); break;
case type_id::LIST:
  _data = create_data(data_type{type_id::INT32}, size, memset_data, stream, _mr);
This may not be correct with large strings, which use 64-bit offsets. @davidwendt Can you weigh in here?
List column offsets are only 32-bit (INT32). String column offsets can be either INT32 or INT64 depending on the size of the data.
Argh. I must have misread "list" as "string." Apologies. Either way, can we use data_type{type_to_id<size_type>()} instead of data_type{type_id::INT32}?
I think INT32 would be more correct, since changing size_type would make this no longer correct.
We use size_type and not int32 for list offsets across libcudf, in my understanding.
using offset_iterator = size_type const*; ///< Iterator type for offsets
Yes, those will have to change if we ever change size_type, since offsets could still be INT32 regardless of size_type. We get away with this since size_type always maps to INT32 for now. I think we'll be safe for the foreseeable future using size_type for list offsets. But I posit that INT32/int32_t would be more clear/correct.
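A hypothetical guard (not in the PR) could make that coupling explicit:

#include <cudf/types.hpp>
#include <cstdint>
#include <type_traits>

// If size_type ever stops being int32_t, code using it for list offsets breaks loudly.
static_assert(std::is_same_v<cudf::size_type, std::int32_t>,
              "list offset code assumes 32-bit size_type");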
A few comments from discussion with @abellina.
We discussed using a templated design like this:
template <typename T>
void batched_memset(std::vector<cudf::device_span<T>>& bufs,
                    T const value,
                    rmm::cuda_stream_view stream);
This will avoid the assumptions currently being made by using uint64_t about the safety of writing garbage data into allocation padding bytes if the type is not actually uint64_t. (See also the "Type aliasing" section of https://en.cppreference.com/w/cpp/language/reinterpret_cast -- the cast was UB anyway.)
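A hypothetical call site under that templated design (the gather_* helpers are assumed for illustration):

// Zero a set of INT32 offset buffers and set validity words to all-ones,
// one batched launch per element type.
std::vector<cudf::device_span<int32_t>> offset_bufs = gather_offset_buffers();        // assumed helper
std::vector<cudf::device_span<cudf::bitmask_type>> mask_bufs = gather_mask_buffers(); // assumed helper

cudf::io::detail::batched_memset<int32_t>(offset_bufs, 0, stream);
cudf::io::detail::batched_memset<cudf::bitmask_type>(mask_bufs, ~cudf::bitmask_type{0}, stream);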
{
  CUDF_EXPECTS(type.id() == type_id::STRING, "allocate_strings_data called for non-string column");
  // size + 1 for final offset. _string_data will be initialized later.
  _data = create_data(data_type{type_id::INT32}, size + 1, stream, _mr);
  _data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
Same question as below: https://github.com/rapidsai/cudf/pull/16281/files#r1699007201
Suggested change:
- _data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
+ _data = create_data(data_type{type_to_id<size_type>()}, size + 1, memset_data, stream, _mr);
I believe this is resolved in this utility function:
std::unique_ptr<column> cudf::io::detail::inline_column_buffer::make_string_column_impl(
There is some efficiency built in here: if the offsets fit, they do not need to be copied.
(Disclaimer: I'm not exactly sure how all of this works myself.)
@sdrp713 Can you also please update the performance improvement data with the new updates when ready?
A few small suggestions, otherwise LGTM.
/ok to test
/merge
Description
Under some situations in the Parquet reader (particularly with tables containing many columns or deeply nested columns) we burn a decent amount of time doing cudaMemset() operations on output buffers. A good amount of this overhead seems to stem from the fact that we're simply launching many tiny kernels. This PR adds a batched memset kernel that takes a list of device spans as a single input and does all the work under a single kernel launch, as sketched below. This PR addresses issue #15773.
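Conceptually, the batched approach replaces a loop of per-buffer cudaMemsetAsync calls with a single launch covering all buffers; a simplified sketch of the idea (illustrative only, not the PR's actual kernel):

#include <cstddef>
#include <cstdint>

// One block per buffer, threads striding within each buffer.
__global__ void batched_memset_kernel(std::uint64_t* const* ptrs,
                                      std::size_t const* sizes,
                                      std::size_t num_bufs,
                                      std::uint64_t value)
{
  for (std::size_t b = blockIdx.x; b < num_bufs; b += gridDim.x) {
    std::uint64_t* p = ptrs[b];
    for (std::size_t i = threadIdx.x; i < sizes[b]; i += blockDim.x) {
      p[i] = value;
    }
  }
}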
Improvements
Using our performance cluster, improvements of 2.39% were observed when running the overall NDS queries.
Additionally, benchmarks were added showing large improvements (around 20%), especially on fixed-width data types, as shown below.
Checklist