List lexicographic comparator #11129

Merged Sep 12, 2022 (91 commits)
Changes from 64 commits
Commits
933c974
First commit
devavret Aug 26, 2021
a1636e5
testing and profiling deep single hierarchy struct
devavret Aug 27, 2021
d59f54c
Merge branch 'branch-22.02' into struct-row-comp
devavret Jan 12, 2022
765dd8d
Merge branch 'branch-22.02' into struct-row-comp
devavret Jan 12, 2022
3d21daf
Make the sandboxed test compile again
devavret Jan 14, 2022
9f32e6b
Update my row_comparator with nullate
devavret Jan 15, 2022
53d3c90
Merge branch 'branch-22.02' into struct-row-comp
devavret Jan 21, 2022
022e2a4
Basic verticalization utility and experimental namespace
devavret Jan 24, 2022
7fef643
clean up most of row operators that I didn't change.
devavret Jan 26, 2022
930d8de
Sliced column test
devavret Jan 27, 2022
0ecc4f8
column order and null precendence support
devavret Jan 28, 2022
ff36d2d
Manually managed stack
devavret Jan 28, 2022
cd0f938
New depth based method to avoid superimpose nulls
devavret Feb 2, 2022
7b8e060
Put sort2 impl in separate TU
devavret Feb 2, 2022
25eb237
Merge branch 'branch-22.04' into struct-row-comp
devavret Feb 2, 2022
d2937cf
Merge branch 'branch-22.04' into struct-row-comp
devavret Feb 10, 2022
d55c9c7
Move verticalization code to row_comparator.cpp
devavret Feb 15, 2022
3bd749e
Owning row lex operator
devavret Feb 22, 2022
613d664
merge fixes
devavret Feb 23, 2022
2ef3ac7
Move struct logic out of main row loop and into element_relational_co…
devavret Feb 24, 2022
5577431
pushing even more logic into element_relational_comparator
devavret Feb 24, 2022
f037bc0
More optimizations.
devavret Feb 24, 2022
8c54a85
review changes
devavret Feb 24, 2022
9d24a87
Checks to ensure tables can be compared
devavret Feb 24, 2022
a664c81
Super basic list lex working
devavret Mar 2, 2022
1ebd877
list test expansion and cleanups.
devavret Mar 2, 2022
3e6e9f4
Make struct comp work again
devavret Mar 2, 2022
facc031
List lex benchmark
devavret Mar 2, 2022
a19b2c3
Merge branch 'branch-22.08' into list-lex-comp
devavret Jun 7, 2022
11bcf16
Add back code from old lex comparator that had list flattening
devavret Jun 7, 2022
53f4418
Move list lex code to experimental header
devavret Jun 14, 2022
0e528f0
get list lex working on code ported to exp header
devavret Jun 14, 2022
f545af0
Add null handling
devavret Jun 15, 2022
a4190a0
handle empty lists
devavret Jun 16, 2022
9362b8d
Add sliced list test
devavret Jun 16, 2022
a7ec09b
Use progressive slicing to get leaf column
devavret Jun 16, 2022
d6ef822
Clean up old experiment files
devavret Jun 16, 2022
5f0d36e
Turn dremel data raw pointers to spans
devavret Jun 17, 2022
941b808
Replace bench with nvbench and fix destroyed dremel data issue
devavret Jun 20, 2022
16e11cb
Merge branch 'branch-22.08' into list-lex-comp
devavret Jun 22, 2022
2bd2b6b
More benchmark iterations
devavret Jun 22, 2022
4be403c
merge pointers to dremel data into a view class
devavret Jun 28, 2022
8cbd70c
Allow lhs and rhs dremel data
devavret Jun 29, 2022
4640383
Merge branch 'branch-22.08' into list-lex-comp
devavret Jul 15, 2022
3f43968
reduce test verbosity
devavret Jul 15, 2022
4b233dc
Remove debug prints
devavret Jul 18, 2022
6be6078
rename linked column header
devavret Jul 18, 2022
8e4c870
Move dremel specific code out into spearate files
devavret Jul 18, 2022
ee13936
pass _comparator to elem comt
devavret Jul 18, 2022
499a5bd
Remove lines that deal with dremel data as separate variables
devavret Jul 18, 2022
0c3c12e
remove requirement to pass d_nullability and allow dremel_device_view…
devavret Jul 19, 2022
b62d0a2
Let get_dremel_data work without nullability
devavret Jul 20, 2022
a180bcc
Merge remote-tracking branch 'origin/branch-22.08' into list-lex-comp
vyasr Jul 26, 2022
229ebe3
Update meta.yaml.
vyasr Jul 26, 2022
926e7ab
Consolidate and augment descriptions of Dremel encoding.
vyasr Jul 27, 2022
d1cea06
Fix style.
vyasr Jul 27, 2022
6030b7b
Remove unnecessary optionals around dremel_device_view.
vyasr Jul 27, 2022
fbb9dd3
Simplify list_lex_preprocess.
vyasr Jul 27, 2022
25c22f9
Add some extra comments and docstrings.
vyasr Jul 27, 2022
5305349
Add extensive comments explaining the list comparison algorithm.
vyasr Jul 27, 2022
9b5f8c1
Reorder declarations for improved readability and logical consistency.
vyasr Jul 27, 2022
b334d19
Address open PR comments.
vyasr Jul 27, 2022
1e31ac1
Enable previously disabled test.
vyasr Jul 27, 2022
7c77616
Clean up comment.
vyasr Jul 27, 2022
1f6b050
Address first round of PR comments.
vyasr Jul 28, 2022
c35a39a
Move dremel files to lists/detail.
vyasr Jul 28, 2022
b520e38
Fix header.
vyasr Jul 28, 2022
ff66bdb
Merge remote-tracking branch 'origin/branch-22.08' into list-lex-comp
vyasr Jul 29, 2022
5a637d4
Merge remote-tracking branch 'origin/branch-22.10' into list-lex-comp
vyasr Aug 3, 2022
e8ebcc4
Try separating out primitive comparison.
vyasr Aug 4, 2022
77c57bf
Revert "Try separating out primitive comparison."
vyasr Aug 4, 2022
46f234f
Address most simple review comments.
vyasr Aug 4, 2022
6d89799
Merge remote-tracking branch 'origin/branch-22.10' into list-lex-comp
vyasr Aug 4, 2022
6b4ce40
Merge remote-tracking branch 'origin/branch-22.10' into list-lex-comp
vyasr Aug 29, 2022
b32205d
Update benchmark for new data generation API.
vyasr Aug 30, 2022
d578c8b
Add method to check for nested columns in a table_view.
vyasr Aug 31, 2022
0671246
Template comparator on the presence of nested columns and propagate p…
vyasr Aug 31, 2022
8c0ae93
Only define the list/struct overloads in the specialization that coul…
vyasr Aug 31, 2022
d285df9
Move the specialization to a completely separate class.
vyasr Sep 1, 2022
31a9bfd
Revert "Move the specialization to a completely separate class."
vyasr Sep 1, 2022
36cc5f3
Fix typo.
vyasr Sep 1, 2022
8dd293a
Convert the Dremel members of the preprocessed_table to optionals.
vyasr Sep 1, 2022
7b0ae58
Propagate optionals down to the element comparator.
vyasr Sep 1, 2022
be3ab5e
Revert "Propagate optionals down to the element comparator."
vyasr Sep 2, 2022
4a735d3
Stop storing empty dremel views for non-list columns and use a thrust…
vyasr Sep 2, 2022
ab5a264
Some cleanup.
vyasr Sep 2, 2022
d58ad80
Remove unnecessary check.
vyasr Sep 2, 2022
1f6b7ab
Address PR comments.
vyasr Sep 7, 2022
a1a9655
Revert "Remove unnecessary check."
vyasr Sep 7, 2022
f5cee47
Address remaining TODOs.
vyasr Sep 8, 2022
f7b671a
Fix typo.
vyasr Sep 8, 2022
3 changes: 2 additions & 1 deletion conda/recipes/libcudf/meta.yaml
@@ -113,7 +113,8 @@ outputs:
- test -f $PREFIX/include/cudf/detail/transpose.hpp
- test -f $PREFIX/include/cudf/detail/unary.hpp
- test -f $PREFIX/include/cudf/detail/utilities/alignment.hpp
- - test -f $PREFIX/include/cudf/detail/utilities/column.hpp
+ - test -f $PREFIX/include/cudf/detail/utilities/dremel.hpp
+ - test -f $PREFIX/include/cudf/detail/utilities/linked_column.hpp
- test -f $PREFIX/include/cudf/detail/utilities/int_fastdiv.h
- test -f $PREFIX/include/cudf/detail/utilities/integer_utils.hpp
- test -f $PREFIX/include/cudf/detail/utilities/vector_factories.hpp
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -237,6 +237,7 @@ add_library(
src/column/column_factories.cpp
src/column/column_factories.cu
src/column/column_view.cpp
src/column/dremel.cu
src/copying/concatenate.cu
src/copying/contiguous_split.cu
src/copying/copy.cpp
2 changes: 1 addition & 1 deletion cpp/benchmarks/CMakeLists.txt
@@ -165,7 +165,7 @@ ConfigureNVBench(SEARCH_NVBENCH search/contains.cpp)
# ##################################################################################################
# * sort benchmark --------------------------------------------------------------------------------
ConfigureBench(SORT_BENCH sort/rank.cpp sort/sort.cpp sort/sort_strings.cpp)
- ConfigureNVBench(SORT_NVBENCH sort/sort_structs.cpp)
+ ConfigureNVBench(SORT_NVBENCH sort/sort_lists.cpp sort/sort_structs.cpp)

# ##################################################################################################
# * quantiles benchmark
49 changes: 49 additions & 0 deletions cpp/benchmarks/sort/sort_lists.cpp
@@ -0,0 +1,49 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmarks/common/generate_input.hpp>
#include <benchmarks/fixture/rmm_pool_raii.hpp>

#include <cudf/detail/sorting.hpp>

#include <nvbench/nvbench.cuh>

void nvbench_sort_lists(nvbench::state& state)
{
  cudf::rmm_pool_raii pool_raii;

  const size_t size_bytes(state.get_int64("size_bytes"));
  const cudf::size_type depth{static_cast<cudf::size_type>(state.get_int64("depth"))};
  const double null_frequency{state.get_float64("null_frequency")};

  data_profile table_profile;
  table_profile.set_distribution_params(cudf::type_id::LIST, distribution_id::UNIFORM, 0, 5);
  table_profile.set_list_depth(depth);
  table_profile.set_null_frequency(null_frequency);
  auto const table =
    create_random_table({cudf::type_id::LIST}, table_size_bytes{size_bytes}, table_profile);

  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    rmm::cuda_stream_view stream_view{launch.get_stream()};
    cudf::detail::sorted_order(*table, {}, {}, stream_view, rmm::mr::get_current_device_resource());
  });
}

NVBENCH_BENCH(nvbench_sort_lists)
  .set_name("sort_list")
  .add_int64_power_of_two_axis("size_bytes", {10, 18, 24, 28})
  .add_int64_axis("depth", {1, 4})
  .add_float64_axis("null_frequency", {0, 0.2});
195 changes: 195 additions & 0 deletions cpp/include/cudf/detail/utilities/dremel.hpp
@@ -0,0 +1,195 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cudf/column/column.hpp>

#include <rmm/device_uvector.hpp>

namespace cudf::detail {

/**
* @brief Device view for `dremel_data`.
*
* @see the `dremel_data` struct for more info.
*/
struct dremel_device_view {
  size_type* offsets;
  uint8_t* rep_levels;
  uint8_t* def_levels;
  size_type leaf_data_size;
  uint8_t max_def_level;
};

/**
* @brief Dremel data that describes one nested type column
*
* @see get_dremel_data() for more info.
*/
struct dremel_data {
  rmm::device_uvector<size_type> dremel_offsets;
  rmm::device_uvector<uint8_t> rep_level;
  rmm::device_uvector<uint8_t> def_level;

  size_type leaf_data_size;
  uint8_t max_def_level;

  operator dremel_device_view()
  {
    return dremel_device_view{
      dremel_offsets.data(), rep_level.data(), def_level.data(), leaf_data_size, max_def_level};
  }
};
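
The split between `dremel_data` and `dremel_device_view` is the usual owner/view pairing: the owner holds the buffers, and the view carries raw pointers that stay valid only while the owner is alive. A minimal host-side sketch of that pattern follows, with `std::vector` standing in for `rmm::device_uvector`. The names `host_dremel_data` and `host_dremel_view` are hypothetical and not part of libcudf.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side analogue of dremel_device_view: raw pointers, no ownership.
struct host_dremel_view {
  int const* offsets;
  std::uint8_t const* rep_levels;
  std::uint8_t const* def_levels;
  int leaf_data_size;
  std::uint8_t max_def_level;
};

// Hypothetical host-side analogue of dremel_data: owns the buffers.
struct host_dremel_data {
  std::vector<int> dremel_offsets;
  std::vector<std::uint8_t> rep_level;
  std::vector<std::uint8_t> def_level;
  int leaf_data_size;
  std::uint8_t max_def_level;

  // Implicit conversion to the non-owning view; the pointers alias this object's buffers.
  operator host_dremel_view() const
  {
    return host_dremel_view{
      dremel_offsets.data(), rep_level.data(), def_level.data(), leaf_data_size, max_def_level};
  }
};
```

A view obtained this way must not outlive the `host_dremel_data` it came from, which is the same constraint the device view places on `dremel_data`.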

/**
* @brief Get the dremel offsets and repetition and definition levels for a LIST column
*
* Dremel is a query system created by Google for ad hoc data analysis. The Dremel engine is
* described in depth in the paper "Dremel: Interactive Analysis of Web-Scale
* Datasets" (https://research.google/pubs/pub36632/). One of the key components of Dremel
* is an encoding that converts record-like data into a columnar store for efficient memory
* accesses. The Parquet file format uses Dremel encoding to handle nested data, so libcudf
* requires some facilities for working with this encoding. Furthermore, libcudf leverages
* Dremel encoding as a means for performing lexicographic comparisons of nested columns.
*
* Dremel encoding is built around two concepts, the repetition and definition levels.
* Since describing them thoroughly is out of scope for this docstring, here are a couple of
* blogs that provide useful background:
* http://www.goldsborough.me/distributed-systems/2019/05/18/21-09-00-a_look_at_dremel/
* https://akshays-blog.medium.com/wrapping-head-around-repetition-and-definition-levels-in-dremel-powering-bigquery-c1a33c9695da
*
* The remainder of this documentation assumes familiarity with the Dremel concepts.
*
* Dremel offsets are the per row offsets into the repetition and definition level arrays for a
* column.
* Example:
* ```
* col = {{1, 2, 3}, { }, {5, 6}}
* dremel_offsets = { 0, 3, 4, 6}
* rep_level = { 0, 1, 1, 0, 0, 1}
* def_level = { 1, 1, 1, 0, 1, 1}
* ```
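
As a host-side illustration of the example above, the following sketch computes the three arrays for a single-level list column without nulls. It is a serial stand-in, not the GPU implementation in this PR; `host_dremel` and `encode_single_level` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side container for the three Dremel arrays.
struct host_dremel {
  std::vector<int> offsets;
  std::vector<std::uint8_t> rep_level;
  std::vector<std::uint8_t> def_level;
};

// Encode a single-level list column that contains no nulls.
host_dremel encode_single_level(std::vector<std::vector<int>> const& col)
{
  host_dremel out;
  out.offsets.push_back(0);
  for (auto const& row : col) {
    if (row.empty()) {
      // An empty list still occupies one slot: rep 0 (new row), def 0 (no element defined).
      out.rep_level.push_back(0);
      out.def_level.push_back(0);
    } else {
      for (std::size_t i = 0; i < row.size(); ++i) {
        // The first element starts a new row (rep 0); the rest continue the list (rep 1).
        out.rep_level.push_back(i == 0 ? 0 : 1);
        out.def_level.push_back(1);
      }
    }
    // Per-row offset into the rep/def arrays.
    out.offsets.push_back(static_cast<int>(out.rep_level.size()));
  }
  return out;
}
```

For `{{1, 2, 3}, {}, {5, 6}}` this produces the offsets, repetition levels and definition levels listed in the example.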
*
* The repetition and definition level values are ideally computed using a recursive call over a
* nested structure but in order to better utilize GPU resources, this function calculates them
* with a bottom up merge method.
*
* Given a LIST column of type `List<List<int>>` like so:
* ```
* col = {
* [],
* [[], [1, 2, 3], [4, 5]],
* [[]]
* }
* ```
* We can represent it in cudf format with two levels of offsets like this:
* ```
* Level 0 offsets = {0, 0, 3, 5, 6}
* Level 1 offsets = {0, 0, 3, 5, 5}
* Values = {1, 2, 3, 4, 5}
* ```
Contributor comment:

Something is off here. Level 0 offsets can't go to 6 because there's only 5 indices in the level 1 offsets.
Level 0 should have N+1 values for N rows, so 3 rows means 4 values. The start, end offsets should be 0, 0 for [], 0, 3 for [[], [1, 2, 3], [4, 5]], and 3, 4 for [[]].

Suggested change
- * Level 0 offsets = {0, 0, 3, 5, 6}
- * Level 1 offsets = {0, 0, 3, 5, 5}
- * Values = {1, 2, 3, 4, 5}
+ * Level 0 offsets = {0, 0, 3, 4}
+ * Level 1 offsets = {0, 0, 3, 5, 5}
+ * Values = {1, 2, 3, 4, 5}

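
The offsets layout discussed here can be checked with a small host-side sketch that builds both offset arrays from nested `std::vector`s. The helper and type names are hypothetical; the only property relied on is that a level with N lists stores N + 1 offsets.

```cpp
#include <vector>

using inner_list = std::vector<int>;
using outer_list = std::vector<inner_list>;

// Hypothetical container for the two offset arrays and the flattened leaf values.
struct two_level_offsets {
  std::vector<int> level0;  // one entry per row, plus a trailing end offset
  std::vector<int> level1;  // one entry per inner list, plus a trailing end offset
  std::vector<int> values;  // flattened leaf data
};

two_level_offsets build_offsets(std::vector<outer_list> const& col)
{
  two_level_offsets out;
  out.level0.push_back(0);
  out.level1.push_back(0);
  for (auto const& row : col) {
    for (auto const& inner : row) {
      out.values.insert(out.values.end(), inner.begin(), inner.end());
      // End offset of this inner list within the leaf values.
      out.level1.push_back(static_cast<int>(out.values.size()));
    }
    // level1.size() - 1 is the number of inner lists emitted so far,
    // which is the end offset of this row within level 1.
    out.level0.push_back(static_cast<int>(out.level1.size()) - 1);
  }
  return out;
}
```

For the three-row column in this example the sketch gives level 0 offsets `{0, 0, 3, 4}` and level 1 offsets `{0, 0, 3, 5, 5}`, matching the suggested change.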
* The desired result of this function is the repetition and definition level values that
* correspond to the data values:
* ```
* col = {[], [[], [1, 2, 3], [4, 5]], [[]]}
* def = { 0, 1, 2, 2, 2, 2, 2, 1 }
* rep = { 0, 0, 1, 2, 2, 1, 2, 0 }
* ```
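
For comparison with the bottom-up merge described in this docstring, a direct nested traversal on the host can produce the levels for a `List<List<int>>` column. This sketch assumes no nulls and hard-codes two list levels; `encode_two_level` and `host_levels` are hypothetical names, not libcudf functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the two level arrays.
struct host_levels {
  std::vector<std::uint8_t> rep;
  std::vector<std::uint8_t> def;
};

host_levels encode_two_level(std::vector<std::vector<std::vector<int>>> const& col)
{
  host_levels out;
  for (auto const& row : col) {
    if (row.empty()) {
      // Empty outer list: one slot, nothing defined below level 0.
      out.rep.push_back(0);
      out.def.push_back(0);
      continue;
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
      // The first inner list starts a new row (rep 0); later ones repeat at level 1.
      std::uint8_t const list_rep = (i == 0) ? 0 : 1;
      if (row[i].empty()) {
        // Empty inner list: one slot, defined down to level 1 only.
        out.rep.push_back(list_rep);
        out.def.push_back(1);
        continue;
      }
      for (std::size_t j = 0; j < row[i].size(); ++j) {
        // The first element inherits the list's rep level; the rest repeat at level 2.
        out.rep.push_back(j == 0 ? list_rep : 2);
        out.def.push_back(2);
      }
    }
  }
  return out;
}
```

An empty list at either level emits a single slot whose definition level is the depth at which the data stops, and the first slot of an inner list that is not the first in its row gets repetition level 1.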
*
* Since the repetition and definition level arrays contain a value for each empty list, the size of
* the rep/def level array is given by
* ```
* rep_level.size() = size of leaf column + number of empty lists in level 0
* + number of empty lists in level 1 ...
* ```
Contributor comment:

Are we missing a core piece of the calculation with nulls at various levels?

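
A sketch of this size calculation on the host, assuming no nulls (null handling is the open question raised in the comment above) and assuming each level is given as a plain offsets array. The function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Count the lists of length zero described by one offsets array.
std::size_t count_empty_lists(std::vector<int> const& offsets)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    if (offsets[i] == offsets[i + 1]) { ++count; }
  }
  return count;
}

// Size of the rep/def arrays: one slot per leaf value plus one per empty list at any level.
std::size_t rep_def_size(std::vector<std::vector<int>> const& offsets_per_level,
                         std::size_t leaf_size)
{
  std::size_t total = leaf_size;
  for (auto const& offsets : offsets_per_level) { total += count_empty_lists(offsets); }
  return total;
}
```

With level 0 offsets `{0, 0, 3, 4}`, level 1 offsets `{0, 0, 3, 5, 5}` and five leaf values, the result is 8, the number of entries in the def and rep arrays of the example.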
*
* We start with finding the empty lists in the penultimate level and merging it with the indices
* of the leaf level. The values for the merge are the definition and repetition levels
* ```
* empties at level 1 = {0, 5}
* def values at 1 = {1, 1}
* rep values at 1 = {1, 1}
* indices at leaf = {0, 1, 2, 3, 4}
* def values at leaf = {2, 2, 2, 2, 2}
* rep values at leaf = {2, 2, 2, 2, 2}
* ```
*
* merged def values = {1, 2, 2, 2, 2, 2, 1}
* merged rep values = {1, 2, 2, 2, 2, 2, 1}
Contributor comment:

Suggested change
- * We start with finding the empty lists in the penultimate level and merging it with the indices
- * of the leaf level. The values for the merge are the definition and repetition levels
- * ```
- * empties at level 1 = {0, 5}
- * def values at 1 = {1, 1}
- * rep values at 1 = {1, 1}
- * indices at leaf = {0, 1, 2, 3, 4}
- * def values at leaf = {2, 2, 2, 2, 2}
- * rep values at leaf = {2, 2, 2, 2, 2}
- * ```
- *
- * merged def values = {1, 2, 2, 2, 2, 2, 1}
- * merged rep values = {1, 2, 2, 2, 2, 2, 1}
+ * We start with finding the empty lists in the penultimate level and merging it with the indices
+ * of the leaf level. The values for the merge are the definition and repetition levels:
+ * ```
+ * empty lists at level 1 = {0, 5}
+ * definition values at level 1 = {1, 1}
+ * repetition values at level 1 = {1, 1}
+ * indices at leaf = {0, 1, 2, 3, 4}
+ * definition values at leaf = {2, 2, 2, 2, 2}
+ * repetition values at leaf = {2, 2, 2, 2, 2}
+ *
+ * merged def values = {1, 2, 2, 2, 2, 2, 1}
+ * merged rep values = {1, 2, 2, 2, 2, 2, 1}
+ * ```

*
* The size of the rep/def values is now larger than the leaf values and the offsets need to be
* adjusted in order to point to the correct start indices. We do this with an exclusive scan over
* the indices of offsets of empty lists and adding to existing offsets.
* ```
* Level 1 new offsets = {0, 1, 4, 6, 7}
* ```
* Repetition values at the beginning of a list need to be decremented. We use the new offsets to
* scatter the rep value.
* ```
* merged rep values = {1, 2, 2, 2, 2, 2, 1}
* scatter (1, new offsets)
* new offsets = {0, 1, 4, 6, 7}
* new rep values = {1, 1, 2, 2, 1, 2, 1}
* ```
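
The three steps for the innermost level (merge, offset adjustment, scatter) can be sketched serially on the host. This is an illustration under the assumptions of the example (no nulls, leaf level 2, parent list level 1, offsets starting at 0), not the GPU implementation in the PR; `merge_leaf_level` and `merged_level` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result of merging the innermost list level with the leaf level.
struct merged_level {
  std::vector<std::uint8_t> def;
  std::vector<std::uint8_t> rep;
  std::vector<int> new_offsets;
};

merged_level merge_leaf_level(std::vector<int> const& offsets)
{
  assert(!offsets.empty() && offsets.front() == 0);
  std::size_t const num_lists = offsets.size() - 1;
  merged_level out;

  // Step 1: merge. An empty list contributes one slot at level 1;
  // every leaf element contributes one slot at level 2.
  for (std::size_t i = 0; i < num_lists; ++i) {
    int const length         = offsets[i + 1] - offsets[i];
    std::size_t const slots  = (length == 0) ? 1 : static_cast<std::size_t>(length);
    std::uint8_t const level = (length == 0) ? 1 : 2;
    out.def.insert(out.def.end(), slots, level);
    out.rep.insert(out.rep.end(), slots, level);
  }

  // Step 2: shift each offset by the number of empty lists before it (an exclusive scan).
  int empties_before = 0;
  for (std::size_t i = 0; i < num_lists; ++i) {
    out.new_offsets.push_back(offsets[i] + empties_before);
    if (offsets[i] == offsets[i + 1]) { ++empties_before; }
  }
  out.new_offsets.push_back(offsets[num_lists] + empties_before);

  // Step 3: scatter. The slot at the start of every list takes the parent level (1).
  for (std::size_t i = 0; i < num_lists; ++i) {
    out.rep[static_cast<std::size_t>(out.new_offsets[i])] = 1;
  }
  return out;
}
```

For level 1 offsets `{0, 0, 3, 5, 5}` the sketch returns new offsets `{0, 1, 4, 6, 7}` and repetition values `{1, 1, 2, 2, 1, 2, 1}`, as in the walkthrough.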
*
* Similarly, we merge upward all the way to the level 0 offsets.
*
* STRUCT COLUMNS :
* In case of struct columns, we don't have to merge struct levels with their children because a
* struct is the same size as its children. e.g. for a column `struct<int, float>`, if the row `i`
* is null, then the children columns `int` and `float` are also null at `i`. They also have the
* null entry represented in their respective null masks. So for any case of strictly struct based
* nesting, we can get the definition levels merely by iterating over the nesting for the same row.
*
* When structs and lists are intermixed, the definition levels of all the contiguous struct
* levels can be constructed using the aforementioned iterative method. Only when we reach a list
* level do we need to merge with the subsequent level.
*
* So, for a column like `struct<list<int>>`, we are going to merge between the levels `struct<list`
* and `int`.
* For a column like `list<struct<int>>`, we are going to merge between `list` and `struct<int>`.
*
* In general, one nesting level is the list level and any struct level that precedes it.
*
* A few more examples to visualize the partitioning of column hierarchy into nesting levels:
* (L is list, S is struct, i is integer(leaf data level), angle brackets omitted)
* ```
* 1. LSi = L Si
* - | --
*
* 2. LLSi = L L Si
* - | - | --
*
* 3. SSLi = SSL i
* --- | -
*
* 4. LLSLSSi = L L SL SSi
* - | - | -- | ---
* ```
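
The partitioning rule (a nesting level ends at each list, and trailing structs group with the leaf) is simple enough to express as a string split. The sketch below uses the same single-letter notation as the examples; `partition_nesting` is a hypothetical helper for illustration only.

```cpp
#include <string>
#include <vector>

// Split a hierarchy such as "LLSLSSi" (L = list, S = struct, i = leaf) into nesting levels.
// Each level is a list together with any structs that directly precede it; whatever
// remains after the last list (structs plus the leaf) forms the final level.
std::vector<std::string> partition_nesting(std::string const& hierarchy)
{
  std::vector<std::string> levels;
  std::string current;
  for (char const c : hierarchy) {
    current.push_back(c);
    if (c == 'L') {
      // A list closes the current nesting level.
      levels.push_back(current);
      current.clear();
    }
  }
  if (!current.empty()) { levels.push_back(current); }
  return levels;
}
```

`partition_nesting("LLSLSSi")` returns `L`, `L`, `SL`, `SSi`, the grouping shown in the fourth example.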
*
* @param h_col Column of LIST type
* @param nullability Pre-determined nullability at each list level. Empty means infer from
* `h_col`
* @param stream CUDA stream used for device memory operations and kernel launches.
*
* @return A struct containing dremel data
*/
dremel_data get_dremel_data(column_view h_col,
                            std::vector<uint8_t> nullability,
                            rmm::cuda_stream_view stream);

Contributor comment (on the `column_view h_col` parameter):

Should this require a lists_column_view?

Contributor comment:

Right now this function assumes that we have a column_view of list dtype. It should probably validate that the input column's dtype actually is list, and passing a lists_column_view would be one way to promise that. It definitely requires some minor changes to the rest of the code, but I think it could work. I'd say that's also a change for the next PR, but definitely worth investigating.

} // namespace cudf::detail