Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support sorting on lists of Strings #10184

Closed
revans2 opened this issue Feb 1, 2022 · 8 comments
Closed

[FEA] Support sorting on lists of Strings #10184

revans2 opened this issue Feb 1, 2022 · 8 comments
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Feb 1, 2022

Is your feature request related to a problem? Please describe.
This is from #10181.

Spark's sorting is similar to what we currently do for sorting structs. It also has the same problem in that Spark lets you define at the top level if NULLs come first or last, but then ignores it for nested types. It is okay if the null ordering is consistent at all levels, just like with structs. But that means we will fall back to the CPU if the user requests an ordering that we cannot guarantee will be sorted properly.

So ignoring the null ordering Spark will first compare the elements to each other in the order in which they are in the list. It does this from the first element to the minimum length of the two lists being compared. As soon as it finds a pair of elements that are not equal to each other it uses that as the ordering for the two lists. If all the elements are the same up to the min length of the lists, then the length of the elements is used, and longer lists are considered to come after shorter ones.

For reference the following code is used for doing this comparison. Please note that in Spark a LIST is called an Array https://github.com/apache/spark/blob/66b9087233bc0cb90cf3af07ec34ba74b9c32d5b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ArrayType.scala#L116-L149

for example: the following data is sorted ascending: (note that [] is an empty list)

+------+
|   c_0|
+------+
|    []|
|   [a]|
|[a, a]|
|[a, b]|
|   [b]|
|[b, a]|
|[b, b]|
+------+

and the same data descending:

+------+
|   c_0|
+------+
|[b, b]|
|[b, a]|
|   [b]|
|[a, b]|
|[a, a]|
|   [a]|
|    []|
+------+

Describe the solution you'd like
Ideally we would like to both sort the data and also merge sort the data. But merge sorting is lower priority and we can work around that.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Feb 1, 2022
@devavret
Copy link
Contributor

devavret commented Feb 1, 2022

Now this looks like a duplicate of #5890

@revans2
Copy link
Contributor Author

revans2 commented Feb 2, 2022

Yes this is a special case of #5890

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz
Copy link
Contributor

sameerz commented May 25, 2022

Sill needed

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 24, 2022
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@sameerz
Copy link
Contributor

sameerz commented Sep 22, 2022

Still needed

@davidwendt davidwendt self-assigned this Sep 23, 2022
@davidwendt
Copy link
Contributor

You can do this by calling existing libcudf functions. It appears the sort process described in the description is the same as for sorting strings so I think artificially creating a strings column using the lists column's offsets and calling sort would give the correct results.

The basic idea is to build a fake column of strings to sort and use that result to build the output lists column.
If the input column looks like this:

[b, b], [b, a], [a, a], [a], [b], [], [a, b]

The fake strings column can be built using the lists offsets and the strings chars and would look like:

"bb", "ba", "aa", "a", "b", "", "ab"

Sorting using cudf::sorted_order produces:

5, 3, 2, 6, 4, 1, 0   => virtual sort is: {"", "a", "aa", "ab", "b",  "ba", "bb"} (never materialized)

Finally, using cudf::gather with this on the original input produces:

[], [a], [a, a], [a, b], [b], [b, a], [b, b]

Assuming this is one-level deep lists column and nulls handling is not needed (as mentioned in the description) here is some example code that I think achieves this:

  using LCW = cudf::test::lists_column_wrapper<cudf::string_view>;
  LCW input(
    {LCW{"b", "b"}, LCW{"b", "a"}, LCW{"a", "a"}, LCW{"a"}, LCW{"b"}, LCW{}, LCW{"a", "b"}});

  cudf::lists_column_view lcv(input);
  cudf::strings_column_view scv(lcv.child());  // get the strings child

  // build an offsets column by mapping the lists' offsets to the strings' offsets;
  // this is necessary if any of the lists element rows could have strings with more than 1 character
  auto offsets = cudf::gather(cudf::table_view({scv.offsets()}), lcv.offsets());
  auto ocv     = offsets->view().column(0);

  // wrap the new offsets and existing chars into a column view (no data is copied/moved)
  auto ncv = cudf::column_view(lcv.child().type(),
                               lcv.size(),
                               nullptr,
                               lcv.null_mask(),  // these are not necessary here 
                               lcv.null_count(), // based on the issue description
                               0,
                               {ocv, scv.chars()}); // new offsets

  // use cudf::sorted_order() to get the mapping result of the sort;
  // this should sort the strings as described in the description
  auto order = cudf::sorted_order(cudf::table_view({ncv}));

  // finally use the 'order' as a map to build the resulting lists column
  auto result = cudf::gather(cudf::table_view({input}), order->view());

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Nov 19, 2022

I believe this was solved by #11129

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

5 participants