
Added batch memset to memset data and validity buffers in parquet reader #16281

Merged
42 commits merged into rapidsai:branch-24.10 on Aug 5, 2024

Conversation

@sdrp713 (Contributor) commented on Jul 15, 2024

Description

Under some situations in the Parquet reader (particularly with tables containing many columns or deeply nested columns) we burn a decent amount of time doing cudaMemset() operations on output buffers. A good amount of this overhead seems to stem from the fact that we are simply launching many tiny kernels. This PR adds a batched memset kernel that takes a list of device spans as a single input and does all the work under a single kernel launch. This PR addresses issue #15773.
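For illustration only, a minimal sketch of the batched-fill idea described above; this is not the PR's actual kernel, and the `buffer_desc` struct and grid layout are assumptions made for the example:

```cpp
// Illustrative sketch (not the PR's kernel): one launch walks an array of
// {pointer, element count} descriptors and fills every buffer, replacing a
// separate cudaMemsetAsync() call per buffer.
#include <cstddef>
#include <cstdint>

struct buffer_desc {
  uint64_t* data;
  std::size_t size;  // number of uint64_t elements
};

__global__ void batched_fill_kernel(buffer_desc const* bufs, int num_bufs, uint64_t value)
{
  // blockIdx.y selects a buffer; blockIdx.x / threadIdx.x stride over its elements.
  for (int b = blockIdx.y; b < num_bufs; b += gridDim.y) {
    auto const buf = bufs[b];
    for (std::size_t i = std::size_t{blockIdx.x} * blockDim.x + threadIdx.x; i < buf.size;
         i += std::size_t{blockDim.x} * gridDim.x) {
      buf.data[i] = value;
    }
  }
}
```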

Improvements

Using our performance cluster, an overall improvement of 2.39% was measured when running the NDS queries.
Additionally, benchmarks were added that show large improvements (around 20%), especially on fixed-width data types, as shown in the table below.

| data_type | num_cols | cardinality | run_length | bytes_per_second_before_this_pr | bytes_per_second_after_this_pr | speedup |
|---|---|---|---|---|---|---|
| INTEGRAL | 1000 | 0 | 1 | 36514934834 | 42756531566 | 1.170932709 |
| INTEGRAL | 1000 | 1000 | 1 | 35364061247 | 39112512476 | 1.105996062 |
| INTEGRAL | 1000 | 0 | 32 | 37349112510 | 39641370858 | 1.061373837 |
| INTEGRAL | 1000 | 1000 | 32 | 39167079622 | 43740824957 | 1.116775245 |
| FLOAT | 1000 | 0 | 1 | 51877322003 | 64083898838 | 1.235296973 |
| FLOAT | 1000 | 1000 | 1 | 48983612272 | 58705522023 | 1.198472699 |
| FLOAT | 1000 | 0 | 32 | 46544977658 | 53715018581 | 1.154045426 |
| FLOAT | 1000 | 1000 | 32 | 54493432148 | 66617609904 | 1.22248879 |
| DECIMAL | 1000 | 0 | 1 | 47616412888 | 57952310685 | 1.217065864 |
| DECIMAL | 1000 | 1000 | 1 | 47166138095 | 54283772484 | 1.1509056 |
| DECIMAL | 1000 | 0 | 32 | 45266163387 | 53770390830 | 1.18787162 |
| DECIMAL | 1000 | 1000 | 32 | 52292176603 | 58847723569 | 1.125363819 |
| TIMESTAMP | 1000 | 0 | 1 | 50245415328 | 60797982330 | 1.210020495 |
| TIMESTAMP | 1000 | 1000 | 1 | 50300238706 | 60810368331 | 1.208947908 |
| TIMESTAMP | 1000 | 0 | 32 | 55338354243 | 66786275739 | 1.206871376 |
| TIMESTAMP | 1000 | 1000 | 32 | 55680028082 | 69029227374 | 1.23974843 |
| DURATION | 1000 | 0 | 1 | 54680007758 | 66855201896 | 1.222662626 |
| DURATION | 1000 | 1000 | 1 | 54305832171 | 66602436269 | 1.226432477 |
| DURATION | 1000 | 0 | 32 | 60040760815 | 72663056969 | 1.210228784 |
| DURATION | 1000 | 1000 | 32 | 60212221703 | 75646396131 | 1.256329595 |
| STRING | 1000 | 0 | 1 | 29691707753 | 33388700976 | 1.12451265 |
| STRING | 1000 | 1000 | 1 | 31411129876 | 35407241037 | 1.127219593 |
| STRING | 1000 | 0 | 32 | 29680479388 | 33382478907 | 1.124728427 |
| STRING | 1000 | 1000 | 32 | 35476213777 | 40478389269 | 1.141000827 |
| LIST | 1000 | 0 | 1 | 6874253484 | 7370835717 | 1.072237987 |
| LIST | 1000 | 1000 | 1 | 6763426009 | 7253762966 | 1.07249831 |
| LIST | 1000 | 0 | 32 | 6981508808 | 7502741115 | 1.074658977 |
| LIST | 1000 | 1000 | 32 | 6989374761 | 7506418252 | 1.073975643 |
| STRUCT | 1000 | 0 | 1 | 2137525922 | 2189495762 | 1.024313081 |
| STRUCT | 1000 | 1000 | 1 | 1057923939 | 1078475980 | 1.019426766 |
| STRUCT | 1000 | 0 | 32 | 1637342446 | 1698913790 | 1.037604439 |
| STRUCT | 1000 | 1000 | 32 | 1057587701 | 1082539399 | 1.02359303 |

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@sdrp713 requested review from a team as code owners on July 15, 2024 15:52
@sdrp713 requested review from vyasr and bdice on July 15, 2024 15:52

copy-pr-bot bot commented Jul 15, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions (bot) added labels libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), and CMake (CMake build issue) on Jul 15, 2024
@jrhemstad (Contributor) commented on Jul 15, 2024

This should use cub::DeviceCopy::Batched instead of implementing a new algorithm in libcudf. One can use a constant iterator to make DeviceCopy work like a DeviceFill/DeviceMemset.

cub::DeviceCopy::Batched was added for this kind of use case as originally described here.
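For reference, a hedged sketch of how this suggestion could be wired up: every source range is a thrust::constant_iterator repeating the fill value, so the batched copy behaves as a batched fill. The helper name `batched_fill` and the exact containers are assumptions for the example, not the code that was ultimately merged:

```cpp
// Hedged sketch: use cub::DeviceCopy::Batched as a batched fill by making every
// input "buffer" a constant iterator. Not the merged implementation.
#include <cub/device/device_copy.cuh>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>

#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <cstdint>
#include <vector>

template <typename T>
void batched_fill(std::vector<cudf::device_span<T>> const& bufs,
                  T const value,
                  rmm::cuda_stream_view stream)
{
  auto const num_buffers = static_cast<std::uint32_t>(bufs.size());

  // Every source range is the same constant iterator yielding `value`.
  std::vector<thrust::constant_iterator<T>> src_h(num_buffers, thrust::make_constant_iterator(value));
  thrust::device_vector<thrust::constant_iterator<T>> srcs(src_h);

  // Destination pointers and element counts, one entry per buffer.
  std::vector<T*> dst_h;
  std::vector<std::size_t> len_h;
  for (auto const& b : bufs) {
    dst_h.push_back(b.data());
    len_h.push_back(b.size());
  }
  thrust::device_vector<T*> dsts(dst_h);
  thrust::device_vector<std::size_t> lens(len_h);

  // Standard two-phase CUB call: query temporary storage, then launch once for all buffers.
  std::size_t temp_bytes = 0;
  cub::DeviceCopy::Batched(
    nullptr, temp_bytes, srcs.begin(), dsts.begin(), lens.begin(), num_buffers, stream.value());
  rmm::device_buffer temp(temp_bytes, stream);
  cub::DeviceCopy::Batched(
    temp.data(), temp_bytes, srcs.begin(), dsts.begin(), lens.begin(), num_buffers, stream.value());
}
```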

@mhaseeb123 (Member) commented:

Hi @sdrp713, thank you for your effort. If possible, can you please merge the two tables in the Improvements section by renaming the bytes_per_second column in the first table to bytes_per_second_before_this_pr, adding a new column called bytes_per_second_after_this_pr next to it, and removing the second table altogether? Maybe a new column called percentage_improvement would also be cool. Thank you!

@mhaseeb123 (Member) commented:

> This should use cub::DeviceCopy::Batched instead of implementing a new algorithm in libcudf. One can use a constant iterator to make DeviceCopy work like a DeviceFill/DeviceMemset.
>
> cub::DeviceCopy::Batched was added for this kind of use case as originally described here.

+1 on using this instead of the new impl.

@mhaseeb123 (Member) commented:

/ok to test

@mhaseeb123 (Member) commented:

/ok to test

@lithomas1 removed the request for review from a team on July 31, 2024 16:32
@KyleFromNVIDIA (Contributor) left a comment:

Approved trivial CMake changes

* @return The data in device spans all set to value
*/

namespace CUDF_EXPORT cudf {
Contributor:

Should this be exported as a public API? In my understanding, that is only needed if we use this API in our tests. Since we have a test, maybe this is correct. cc: @robertmaynard to check my assertions (still unsure about rules, need docs on CUDF_EXPORT)

Member:

I have the same idea but yes, more info and confirmation needed.

namespace CUDF_EXPORT cudf {
namespace io::detail {

void batched_memset(std::vector<cudf::device_span<uint64_t>>& bufs,
Contributor:

Why are the buffers typed as uint64_t? I thought this would be a generic function with support for any fixed-width types. Do we not need the ability to set buffers of std::byte to a value?

It seems like we're being too specific to the use case in Parquet if the only type we care about is uint64_t. If that's the case, should we move this to src/io/parquet or something like that?

Member:

I think this is okay, as one can always initialize a std::byte buffer by providing an 8-byte value for it (say -1UL) to initialize all bytes as 0xFF. I am a bit concerned about __int128_t buffers if we aren't initializing them with trivial 0s or -1ULs.

Contributor:

This doesn’t work for std::byte buffers unless their size is a multiple of 8, right?

Member:

It relies on buffer allocation being 8 byte aligned as per the comment.

Contributor:

Can we design this as a templated function that supports arbitrary fixed width types, or narrow the scope so this is more clearly a utility for only one specific use case in Parquet? Currently it seems like it is positioned as a generic IO utility but it is not generic.

@mhaseeb123 (Member) commented on Jul 31, 2024:

I think the goal here is to cast all buffers to a common type so that we can use the cub function to memset all of them at once. Casting to uint64_t does seem okay here as long as we can guarantee allocations that are a multiple of 8 bytes (?) and we are writing trivial all-zeros or -1s for int128 (templating may help here), and any value for all smaller types (not sure if we do this in any reader for now).

@bdice (Contributor) commented on Jul 31, 2024:

Alignment and length are different. Yes, 8 byte alignment would be required. But how would this work for a 4-byte allocation? Even if it is aligned (the starting address is on an 8 byte boundary), you can’t write a uint64_t because it would write four bytes out of bounds.

@bdice (Contributor) commented on Jul 31, 2024:

For working with variable-length buffers like this, using std::byte as the common type seems like the only solution. Reinterpreting other allocations as uint64_t* is not safe (UB).

@mhaseeb123 (Member) commented on Jul 31, 2024:

> Alignment and length are different. Yes, 8 byte alignment would be required. But how would this work for a 4-byte allocation? Even if it is aligned (the starting address is on an 8 byte boundary), you can’t write a uint64_t because it would write four bytes out of bounds.

Definitely. By saying "8-byte-aligned alloc" I meant that the allocation size is a multiple of 8 bytes.

> // Using uint64_t as we are relying on the allocation padding for the buffers to be large enough
> // that we can always write a multiple of 8 byte words

From this comment in reader_impl_preprocess.cu, it looks like we are relying on the assumption that rmm pads the buffer allocations so that the allocated sizes are a multiple of 8 bytes and we do not go out of bounds. (I could only find this mentioned in rapidsai/rmm#1278, where it looks like it has since been removed.) So the padding should be done explicitly in the column_buffer.hpp functions if going that path.

> For working with variable-length buffers like this, using std::byte as the common type seems like the only solution. Reinterpreting other allocations as uint64_t* is not safe (UB).

Certainly, if we can ensure that the value to memset would always fit in 1 byte even if we are memsetting an int4 or int8 buffer (the same concern I had with int128_t in the current design). Though, with std::byte I am concerned we may lose any performance advantage we are currently seeing. Worth a try though @sdrp713!
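If the padding is made explicit rather than assumed from the allocator, a small hedged sketch of the computation; cudf::util::round_up_safe and the surrounding variables num_bytes, stream, and mr are assumptions for the example:

```cpp
// Hedged sketch: round the requested size up to a multiple of 8 bytes so that whole
// uint64_t words can be written without running past the allocation. Equivalent to
// (num_bytes + 7) / 8 * 8.
#include <cudf/utilities/integer_utils.hpp>
#include <rmm/device_buffer.hpp>

std::size_t const padded_bytes = cudf::util::round_up_safe(num_bytes, std::size_t{8});
rmm::device_buffer buf(padded_bytes, stream, mr);
```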

Contributor:

> Though, with std::byte I am concerned we may lose any performance advantage we are currently seeing.

I discussed this with @abellina and @sdrp713. CUB decides how many items-per-thread to use based on some heuristics. This should be efficient, and if it's not, we can file an issue with CUB.

Comment on lines 109 to 112
// list columns store a buffer of int32's as offsets to represent
// their individual rows
case type_id::LIST: _data = create_data(data_type{type_id::INT32}, size, stream, _mr); break;
case type_id::LIST:
_data = create_data(data_type{type_id::INT32}, size, memset_data, stream, _mr);
Contributor:

This may not be correct with large strings, which use 64 bit offsets. @davidwendt Can you weigh in here?

Contributor:

Lists columns offsets are only 32-bit (INT32). Strings columns offsets can be either INT32 or INT64 depending on the size of the data.

Contributor:

Argh. I must have misread "list" as "string." Apologies. Either way, can we use data_type{type_to_id<size_type>()} instead of data_type{type_id::INT32}?

Contributor:

I think INT32 would be more correct, since a change to size_type would make this incorrect.

Contributor:

We use size_type and not int32 for lists offsets across libcudf, in my understanding.

using offset_iterator = size_type const*; ///< Iterator type for offsets

Contributor:

Yes, those will have to change if we ever change size_type, since offsets could still be INT32 regardless of size_type. We get away with this since size_type always maps to INT32 for now. I think we'll be safe for the foreseeable future using size_type for list offsets. But I posit that INT32/int32_t would be clearer and more correct.

@bdice (Contributor) left a comment:

A few comments from discussion with @abellina.

We discussed using a templated design like this:

template<typename T>
void batched_memset(std::vector<cudf::device_span<T>>& bufs,
                    T const value,
                    rmm::cuda_stream_view stream);

This will avoid the assumptions currently being made by using uint64_t about the safety of writing garbage data into allocation padding bytes if the type is not actually uint64_t. (See also the "Type aliasing" section of https://en.cppreference.com/w/cpp/language/reinterpret_cast -- it was using UB anyway.)
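A hedged usage sketch of that proposed signature; how the spans are gathered is only a placeholder, and the namespace is taken from the header context shown earlier:

```cpp
// Hedged usage sketch of the proposed templated API; populating `out_buffers`
// is hypothetical and stands in for however the reader collects its output spans.
std::vector<cudf::device_span<cudf::size_type>> out_buffers;
// ... collect the offset/validity spans that need zero-initialization ...
cudf::io::detail::batched_memset<cudf::size_type>(out_buffers, cudf::size_type{0}, stream);
```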

{
CUDF_EXPECTS(type.id() == type_id::STRING, "allocate_strings_data called for non-string column");
// size + 1 for final offset. _string_data will be initialized later.
_data = create_data(data_type{type_id::INT32}, size + 1, stream, _mr);
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
Contributor:

Suggested change
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
_data = create_data(data_type{type_to_id<size_type>()}, size + 1, memset_data, stream, _mr);

Contributor:

I believe this is resolved in this utility function:

std::unique_ptr<column> cudf::io::detail::inline_column_buffer::make_string_column_impl(

There is some efficiency built in here: if the offsets fit, they do not need to be copied.
(Disclaimer: I'm not exactly sure how all of this works myself.)

@mhaseeb123 (Member) commented:

@sdrp713 Can you also please update the performance improvement data with the new updates when ready?

@bdice (Contributor) left a comment:

A few small suggestions, otherwise LGTM.

{
CUDF_EXPECTS(type.id() == type_id::STRING, "allocate_strings_data called for non-string column");
// size + 1 for final offset. _string_data will be initialized later.
_data = create_data(data_type{type_id::INT32}, size + 1, stream, _mr);
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
Contributor:

Suggested change
_data = create_data(data_type{type_id::INT32}, size + 1, memset_data, stream, _mr);
_data = create_data(data_type{type_to_id<size_type>()}, size + 1, memset_data, stream, _mr);

@mhaseeb123 (Member) commented:

/ok to test

@mhaseeb123 (Member) commented:

/merge

@rapids-bot (bot) merged commit 837dfe5 into rapidsai:branch-24.10 on Aug 5, 2024
78 checks passed
Labels: CMake (CMake build issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Performance (Performance related issue), Spark (Functionality that helps Spark RAPIDS)
Projects: Done