[FEA] Support sorting on lists of Strings #10184

revans2 · 2022-02-01T17:55:48Z

Is your feature request related to a problem? Please describe.
This is from #10181.

Spark's sorting is similar to what we currently do for sorting structs. It also has the same problem in that Spark lets you define at the top level if NULLs come first or last, but then ignores it for nested types. It is okay if the null ordering is consistent at all levels, just like with structs. But that means we will fall back to the CPU if the user requests an ordering that we cannot guarantee will be sorted properly.

So ignoring the null ordering Spark will first compare the elements to each other in the order in which they are in the list. It does this from the first element to the minimum length of the two lists being compared. As soon as it finds a pair of elements that are not equal to each other it uses that as the ordering for the two lists. If all the elements are the same up to the min length of the lists, then the length of the elements is used, and longer lists are considered to come after shorter ones.

For reference the following code is used for doing this comparison. Please note that in Spark a LIST is called an Array https://github.com/apache/spark/blob/66b9087233bc0cb90cf3af07ec34ba74b9c32d5b/sql/catalyst/src/main/scala/org/apache/spark/sql/types/ArrayType.scala#L116-L149

for example: the following data is sorted ascending: (note that [] is an empty list)

+------+
|   c_0|
+------+
|    []|
|   [a]|
|[a, a]|
|[a, b]|
|   [b]|
|[b, a]|
|[b, b]|
+------+

and the same data descending:

+------+
|   c_0|
+------+
|[b, b]|
|[b, a]|
|   [b]|
|[a, b]|
|[a, a]|
|   [a]|
|    []|
+------+

Describe the solution you'd like
Ideally we would like to both sort the data and also merge sort the data. But merge sorting is lower priority and we can work around that.

The text was updated successfully, but these errors were encountered:

devavret · 2022-02-01T19:15:20Z

Now this looks like a duplicate of #5890

revans2 · 2022-02-02T15:30:30Z

Yes this is a special case of #5890

github-actions · 2022-03-31T18:07:51Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

sameerz · 2022-05-25T05:37:09Z

Sill needed

github-actions · 2022-09-22T07:13:38Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

sameerz · 2022-09-22T17:52:54Z

Still needed

davidwendt · 2022-09-23T19:18:09Z

You can do this by calling existing libcudf functions. It appears the sort process described in the description is the same as for sorting strings so I think artificially creating a strings column using the lists column's offsets and calling sort would give the correct results.

The basic idea is to build a fake column of strings to sort and use that result to build the output lists column.
If the input column looks like this:

[b, b], [b, a], [a, a], [a], [b], [], [a, b]

The fake strings column can be built using the lists offsets and the strings chars and would look like:

"bb", "ba", "aa", "a", "b", "", "ab"

Sorting using cudf::sorted_order produces:

5, 3, 2, 6, 4, 1, 0   => virtual sort is: {"", "a", "aa", "ab", "b",  "ba", "bb"} (never materialized)

Finally, using cudf::gather with this on the original input produces:

[], [a], [a, a], [a, b], [b], [b, a], [b, b]

Assuming this is one-level deep lists column and nulls handling is not needed (as mentioned in the description) here is some example code that I think achieves this:

  using LCW = cudf::test::lists_column_wrapper<cudf::string_view>;
  LCW input(
    {LCW{"b", "b"}, LCW{"b", "a"}, LCW{"a", "a"}, LCW{"a"}, LCW{"b"}, LCW{}, LCW{"a", "b"}});

  cudf::lists_column_view lcv(input);
  cudf::strings_column_view scv(lcv.child());  // get the strings child

  // build an offsets column by mapping the lists' offsets to the strings' offsets;
  // this is necessary if any of the lists element rows could have strings with more than 1 character
  auto offsets = cudf::gather(cudf::table_view({scv.offsets()}), lcv.offsets());
  auto ocv     = offsets->view().column(0);

  // wrap the new offsets and existing chars into a column view (no data is copied/moved)
  auto ncv = cudf::column_view(lcv.child().type(),
                               lcv.size(),
                               nullptr,
                               lcv.null_mask(),  // these are not necessary here 
                               lcv.null_count(), // based on the issue description
                               0,
                               {ocv, scv.chars()}); // new offsets

  // use cudf::sorted_order() to get the mapping result of the sort;
  // this should sort the strings as described in the description
  auto order = cudf::sorted_order(cudf::table_view({ncv}));

  // finally use the 'order' as a map to build the resulting lists column
  auto result = cudf::gather(cudf::table_view({input}), order->view());

GregoryKimball · 2022-11-19T22:06:09Z

I believe this was solved by #11129

revans2 added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Feb 1, 2022

This was referenced Feb 1, 2022

[FEA] group by aggregations that include a list of strings as the grouping type #10181

Closed

[FEA] Support Group-By on Array[String] NVIDIA/spark-rapids#4656

Closed

devavret mentioned this issue Feb 1, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

github-actions bot added the inactive-30d label Mar 31, 2022

github-actions bot removed the inactive-30d label May 25, 2022

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 24, 2022

github-actions bot added the inactive-90d label Sep 22, 2022

github-actions bot removed the inactive-90d label Sep 22, 2022

davidwendt self-assigned this Sep 23, 2022

GregoryKimball closed this as completed Nov 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Support sorting on lists of Strings #10184

[FEA] Support sorting on lists of Strings #10184

revans2 commented Feb 1, 2022 •

edited by devavret

Loading

devavret commented Feb 1, 2022

revans2 commented Feb 2, 2022

github-actions bot commented Mar 31, 2022

sameerz commented May 25, 2022

github-actions bot commented Sep 22, 2022

sameerz commented Sep 22, 2022

davidwendt commented Sep 23, 2022

GregoryKimball commented Nov 19, 2022 •

edited

Loading

[FEA] Support sorting on lists of Strings #10184

[FEA] Support sorting on lists of Strings #10184

Comments

revans2 commented Feb 1, 2022 • edited by devavret Loading

devavret commented Feb 1, 2022

revans2 commented Feb 2, 2022

github-actions bot commented Mar 31, 2022

sameerz commented May 25, 2022

github-actions bot commented Sep 22, 2022

sameerz commented Sep 22, 2022

davidwendt commented Sep 23, 2022

GregoryKimball commented Nov 19, 2022 • edited Loading

revans2 commented Feb 1, 2022 •

edited by devavret

Loading

GregoryKimball commented Nov 19, 2022 •

edited

Loading