
Added batch memset to memset data and validity buffers in parquet reader #16281

Merged
42 commits merged into rapidsai:branch-24.10 on Aug 5, 2024

Conversation

@sdrp713 (Contributor) commented on Jul 15, 2024

Description

Under some situations in the Parquet reader (particularly with tables containing many columns or deeply nested columns) we burn a decent amount of time doing cudaMemset() operations on output buffers. A good amount of this overhead seems to stem from the fact that we are simply launching many tiny kernels. This PR adds a batched memset kernel that takes a list of device spans as a single input and does all the work under a single kernel launch. This PR addresses issue #15773.
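For illustration only, a minimal sketch of the batched-fill idea described above; this is not the PR's actual kernel, and the `buffer_desc` struct and grid layout are assumptions made for the example:

```cpp
// Illustrative sketch (not the PR's kernel): one launch walks an array of
// {pointer, element count} descriptors and fills every buffer, replacing a
// separate cudaMemsetAsync() call per buffer.
#include <cstddef>
#include <cstdint>

struct buffer_desc {
  uint64_t* data;
  std::size_t size;  // number of uint64_t elements
};

__global__ void batched_fill_kernel(buffer_desc const* bufs, int num_bufs, uint64_t value)
{
  // blockIdx.y selects a buffer; blockIdx.x / threadIdx.x stride over its elements.
  for (int b = blockIdx.y; b < num_bufs; b += gridDim.y) {
    auto const buf = bufs[b];
    for (std::size_t i = std::size_t{blockIdx.x} * blockDim.x + threadIdx.x; i < buf.size;
         i += std::size_t{blockDim.x} * gridDim.x) {
      buf.data[i] = value;
    }
  }
}
```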

Improvements

Using our performance cluster, an overall improvement of 2.39% was measured when running the NDS queries.
Additionally, benchmarks were added that show large improvements (around 20%), especially on fixed-width data types, as shown in the table below.

| data_type | num_cols | cardinality | run_length | bytes_per_second_before_this_pr | bytes_per_second_after_this_pr | speedup |
|---|---|---|---|---|---|---|
| INTEGRAL | 1000 | 0 | 1 | 36514934834 | 42756531566 | 1.170932709 |
| INTEGRAL | 1000 | 1000 | 1 | 35364061247 | 39112512476 | 1.105996062 |
| INTEGRAL | 1000 | 0 | 32 | 37349112510 | 39641370858 | 1.061373837 |
| INTEGRAL | 1000 | 1000 | 32 | 39167079622 | 43740824957 | 1.116775245 |
| FLOAT | 1000 | 0 | 1 | 51877322003 | 64083898838 | 1.235296973 |
| FLOAT | 1000 | 1000 | 1 | 48983612272 | 58705522023 | 1.198472699 |
| FLOAT | 1000 | 0 | 32 | 46544977658 | 53715018581 | 1.154045426 |
| FLOAT | 1000 | 1000 | 32 | 54493432148 | 66617609904 | 1.22248879 |
| DECIMAL | 1000 | 0 | 1 | 47616412888 | 57952310685 | 1.217065864 |
| DECIMAL | 1000 | 1000 | 1 | 47166138095 | 54283772484 | 1.1509056 |
| DECIMAL | 1000 | 0 | 32 | 45266163387 | 53770390830 | 1.18787162 |
| DECIMAL | 1000 | 1000 | 32 | 52292176603 | 58847723569 | 1.125363819 |
| TIMESTAMP | 1000 | 0 | 1 | 50245415328 | 60797982330 | 1.210020495 |
| TIMESTAMP | 1000 | 1000 | 1 | 50300238706 | 60810368331 | 1.208947908 |
| TIMESTAMP | 1000 | 0 | 32 | 55338354243 | 66786275739 | 1.206871376 |
| TIMESTAMP | 1000 | 1000 | 32 | 55680028082 | 69029227374 | 1.23974843 |
| DURATION | 1000 | 0 | 1 | 54680007758 | 66855201896 | 1.222662626 |
| DURATION | 1000 | 1000 | 1 | 54305832171 | 66602436269 | 1.226432477 |
| DURATION | 1000 | 0 | 32 | 60040760815 | 72663056969 | 1.210228784 |
| DURATION | 1000 | 1000 | 32 | 60212221703 | 75646396131 | 1.256329595 |
| STRING | 1000 | 0 | 1 | 29691707753 | 33388700976 | 1.12451265 |
| STRING | 1000 | 1000 | 1 | 31411129876 | 35407241037 | 1.127219593 |
| STRING | 1000 | 0 | 32 | 29680479388 | 33382478907 | 1.124728427 |
| STRING | 1000 | 1000 | 32 | 35476213777 | 40478389269 | 1.141000827 |
| LIST | 1000 | 0 | 1 | 6874253484 | 7370835717 | 1.072237987 |
| LIST | 1000 | 1000 | 1 | 6763426009 | 7253762966 | 1.07249831 |
| LIST | 1000 | 0 | 32 | 6981508808 | 7502741115 | 1.074658977 |
| LIST | 1000 | 1000 | 32 | 6989374761 | 7506418252 | 1.073975643 |
| STRUCT | 1000 | 0 | 1 | 2137525922 | 2189495762 | 1.024313081 |
| STRUCT | 1000 | 1000 | 1 | 1057923939 | 1078475980 | 1.019426766 |
| STRUCT | 1000 | 0 | 32 | 1637342446 | 1698913790 | 1.037604439 |
| STRUCT | 1000 | 1000 | 32 | 1057587701 | 1082539399 | 1.02359303 |

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@sdrp713 requested review from a team as code owners on July 15, 2024 15:52
@sdrp713 requested review from vyasr and bdice on July 15, 2024 15:52

copy-pr-bot bot commented Jul 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions (bot) added labels libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), and CMake (CMake build issue) on Jul 15, 2024
@jrhemstad (Contributor) commented on Jul 15, 2024

This should use cub::DeviceCopy::Batched instead of implementing a new algorithm in libcudf. One can use a constant iterator to make DeviceCopy work like a DeviceFill/DeviceMemset.

cub::DeviceCopy::Batched was added for this kind of use case as originally described here.
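For reference, a hedged sketch of how this suggestion could be wired up: every source range is a thrust::constant_iterator repeating the fill value, so the batched copy behaves as a batched fill. The helper name `batched_fill` and the exact containers are assumptions for the example, not the code that was ultimately merged:

```cpp
// Hedged sketch: use cub::DeviceCopy::Batched as a batched fill by making every
// input "buffer" a constant iterator. Not the merged implementation.
#include <cub/device/device_copy.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>

#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
void batched_fill(std::vector<cudf::device_span<T>> const& bufs,
                  T const value,
                  rmm::cuda_stream_view stream)
{
  auto const num_buffers = static_cast<std::uint32_t>(bufs.size());

  // Every source range is the same constant iterator yielding `value`.
  std::vector<thrust::constant_iterator<T>> src_h(num_buffers, thrust::make_constant_iterator(value));
  thrust::device_vector<thrust::constant_iterator<T>> srcs(src_h);

  // Destination pointers and element counts, one entry per buffer.
  std::vector<T*> dst_h;
  std::vector<std::size_t> len_h;
  for (auto const& b : bufs) {
    dst_h.push_back(b.data());
    len_h.push_back(b.size());
  }
  thrust::device_vector<T*> dsts(dst_h);
  thrust::device_vector<std::size_t> lens(len_h);

  // Standard two-phase CUB call: query temporary storage, then launch once for all buffers.
  std::size_t temp_bytes = 0;
  cub::DeviceCopy::Batched(
    nullptr, temp_bytes, srcs.begin(), dsts.begin(), lens.begin(), num_buffers, stream.value());
  rmm::device_buffer temp(temp_bytes, stream);
  cub::DeviceCopy::Batched(
    temp.data(), temp_bytes, srcs.begin(), dsts.begin(), lens.begin(), num_buffers, stream.value());
}
```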

@mhaseeb123 (Member) commented:

Hi @sdrp713, thank you for your effort. If possible, can you please merge the two tables in the Improvements section by renaming the bytes_per_second column in the first table to bytes_per_second_before_this_pr, adding a new column called bytes_per_second_after_this_pr next to it, and removing the second table altogether? Maybe a new column called percentage_improvement would also be cool. Thank you!

@mhaseeb123 (Member) commented:

> This should use cub::DeviceCopy::Batched instead of implementing a new algorithm in libcudf. One can use a constant iterator to make DeviceCopy work like a DeviceFill/DeviceMemset.
>
> cub::DeviceCopy::Batched was added for this kind of use case as originally described here.

+1 on using this instead of the new impl.

@mhaseeb123 (Member) commented:

/ok to test

@mhaseeb123 (Member) commented:

/ok to test

@lithomas1 removed the request for review from a team on July 31, 2024 16:32
@KyleFromNVIDIA (Contributor) left a comment:

Approved trivial CMake changes

* @return The data in device spans all set to value
*/

namespace CUDF_EXPORT cudf {
Contributor:

Should this be exported as a public API? In my understanding, that is only needed if we use this API in our tests. Since we have a test, maybe this is correct. cc: @robertmaynard to check my assertions (still unsure about rules, need docs on CUDF_EXPORT)

Member:

I have the same idea but yes, more info and confirmation needed.

namespace CUDF_EXPORT cudf {
namespace io::detail {

void batched_memset(std::vector<cudf::device_span<uint64_t>>& bufs,
Contributor:

Why are the buffers typed as uint64_t? I thought this would be a generic function with support for any fixed-width types. Do we not need the ability to set buffers of std::byte to a value?

It seems like we're being too specific to the use case in Parquet if the only type we care about is uint64_t. If that's the case, should we move this to src/io/parquet or something like that?

Member:

I think this is okay, as one can always initialize a std::byte buffer by providing an 8-byte value for it (say -1UL) to initialize all bytes as 0xFF. I am a bit concerned about __int128_t buffers if we aren't initializing them with trivial 0s or -1ULs.

Contributor:

This doesn’t work for std::byte buffers unless their size is a multiple of 8, right?

Member:

It relies on buffer allocation being 8 byte aligned as per the comment.

Contributor:

Can we design this as a templated function that supports arbitrary fixed width types, or narrow the scope so this is more clearly a utility for only one specific use case in Parquet? Currently it seems like it is positioned as a generic IO utility but it is not generic.

@mhaseeb123 (Member) commented on Jul 31, 2024:

I think the goal here is to cast all buffers to a common type so that we can use the cub function to memset all of them at once. Casting to uint64_t does seem okay here as long as we can guarantee allocations that are a multiple of 8 bytes (?) and we are writing trivial all-zeros or -1s for int128 (templating may help here), and any value for all smaller types (not sure if we do this in any reader for now).

@bdice (Contributor) commented on Jul 31, 2024:

Alignment and length are different. Yes, 8 byte alignment would be required. But how would this work for a 4-byte allocation? Even if it is aligned (the starting address is on an 8 byte boundary), you can’t write a uint64_t because it would write four bytes out of bounds.

@bdice (Contributor) commented on Jul 31, 2024:

For working with variable-length buffers like this, using std::byte as the common type seems like the only solution. Reinterpreting other allocations as uint64_t* is not safe (UB).

@mhaseeb123 (Member) commented on Jul 31, 2024:

> Alignment and length are different. Yes, 8 byte alignment would be required. But how would this work for a 4-byte allocation? Even if it is aligned (the starting address is on an 8 byte boundary), you can’t write a uint64_t because it would write four bytes out of bounds.

Definitely. By saying "8-byte-aligned alloc" I meant that the allocation size is a multiple of 8 bytes.

> // Using uint64_t as we are relying on the allocation padding for the buffers to be large enough
> // that we can always write a multiple of 8 byte words

From this comment in reader_impl_preprocess.cu, it looks like we are relying on the assumption that rmm pads the buffer allocations so that the allocated sizes are a multiple of 8 bytes and we do not go out of bounds. (I could only find this mentioned in rapidsai/rmm#1278, where it looks like it has since been removed.) So the padding should be done explicitly in the column_buffer.hpp functions if going that path.

> For working with variable-length buffers like this, using std::byte as the common type seems like the only solution. Reinterpreting other allocations as uint64_t* is not safe (UB).

Certainly, if we can ensure that the value to memset would always fit in 1 byte even if we are memsetting an int4 or int8 buffer (the same concern I had with int128_t in the current design). Though, with std::byte I am concerned we may lose any performance advantage we are currently seeing. Worth a try though @sdrp713!
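If the padding is made explicit rather than assumed from the allocator, a small hedged sketch of the computation; cudf::util::round_up_safe and the surrounding variables num_bytes, stream, and mr are assumptions for the example:

```cpp
// Hedged sketch: round the requested size up to a multiple of 8 bytes so that whole
// uint64_t words can be written without running past the allocation. Equivalent to
// (num_bytes + 7) / 8 * 8.
#include <cudf/utilities/integer_utils.hpp>
#include <rmm/device_buffer.hpp>

std::size_t const padded_bytes = cudf::util::round_up_safe(num_bytes, std::size_t{8});
rmm::device_buffer buf(padded_bytes, stream, mr);
```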

Contributor:

> Though, with std::byte I am concerned we may lose any performance advantage we are currently seeing.

I discussed this with @abellina and @sdrp713. CUB decides how many items-per-thread to use based on some heuristics. This should be efficient, and if it's not, we can file an issue with CUB.

Comment on lines 109 to 112
// list columns store a buffer of int32's as offsets to represent
// their individual rows
case type_id::LIST: _data = create_data(data_type{type_id::INT32}, size, stream, _mr); break;
case type_id::LIST:
_data = create_data(data_type{type_id::INT32}, size, memset_data, stream, _mr);
Contributor:

This may not be correct with large strings, which use 64 bit offsets. @davidwendt Can you weigh in here?

Contributor:

Lists columns offsets are only 32-bit (INT32). Strings columns offsets can be either INT32 or INT64 depending on the size of the data.

Contributor:

Argh. I must have misread "list" as "string." Apologies. Either way, can we use data_type{type_to_id<size_type>()} instead of data_type{type_id::INT32}?

Contributor:

I think INT32 would be more correct, since a change to size_type would make this incorrect.

Contributor:

We use size_type and not int32 for lists offsets across libcudf, in my understanding.

using offset_iterator = size_type const*; ///< Iterator type for offsets

Contributor:

Yes, those will have to change if we ever change size_type, since offsets could still be INT32 regardless of size_type. We get away with this since size_type always maps to INT32 for now. I think we'll be safe for the foreseeable future using size_type for list offsets. But I posit that INT32/int32_t would be clearer and more correct.

@bdice (Contributor) left a comment:

A few comments from discussion with @abellina.

We discussed using a templated design like this:

template<typename T>
void batched_memset(std::vector<cudf::device_span<T>>& bufs,
                    T const value,
                    rmm::cuda_stream_view stream);

This will avoid the assumptions currently being made by using uint64_t about the safety of writing garbage data into allocation padding bytes if the type is not actually uint64_t. (See also the "Type aliasing" section of https://en.cppreference.com/w/cpp/language/reinterpret_cast -- it was using UB anyway.)
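A hedged usage sketch of that proposed signature; how the spans are gathered is only a placeholder, and the namespace is taken from the header context shown earlier:

```cpp
// Hedged usage sketch of the proposed templated API; populating `out_buffers`
// is hypothetical and stands in for however the reader collects its output spans.
std::vector<cudf::device_span<cudf::size_type>> out_buffers;
// ... collect the offset/validity spans that need zero-initialization ...
cudf::io::detail::batched_memset<cudf::size_type>(out_buffers, cudf::size_type{0}, stream);
```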

{
CUDF_EXPECTS(type.id() == type_id::STRING, "allocate_strings_data called for non-string column");
// size + 1 for final offset. _string_data will be initialized later.
_data = create_data(data_type{type_id::INT32}, size + 1, stream, _mr);
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
Contributor:

Suggested change
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
_data = create_data(data_type{type_to_id<size_type>()}, size + 1, memset_data, stream, _mr);

Contributor:

I believe this is resolved in this utility function:

std::unique_ptr<column> cudf::io::detail::inline_column_buffer::make_string_column_impl(

There is some efficiency built in here: if the offsets fit, they do not need to be copied.
(Disclaimer: I'm not exactly sure how all of this works myself.)

@mhaseeb123 (Member) commented:

@sdrp713 Can you also please update the performance improvement data with the new updates when ready?

@bdice (Contributor) left a comment:

A few small suggestions, otherwise LGTM.

{
CUDF_EXPECTS(type.id() == type_id::STRING, "allocate_strings_data called for non-string column");
// size + 1 for final offset. _string_data will be initialized later.
_data = create_data(data_type{type_id::INT32}, size + 1, stream, _mr);
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
Contributor:

Suggested change
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
_data = create_data(data_type{type_to_id<size_type>()}, size + 1, memset_data, stream, _mr);

@mhaseeb123 (Member) commented:

/ok to test

@mhaseeb123 (Member) commented:

/merge

@rapids-bot (bot) merged commit 837dfe5 into rapidsai:branch-24.10 on Aug 5, 2024
78 checks passed
Labels: CMake (CMake build issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)
Projects: Done