[FEA] collect_set on struct[Array] #5508

viadea · 2022-05-17T00:25:13Z

I would like collect_set to support struct[Array] as an input column.
The request is mainly for struct[Array(String)] and struct[Array(Long)].

For example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val arrayData = Seq(
    Row("John",List("apple","orange","banana"),1,List(100L,200L,300L)),
    Row("David",List("apple","orange","banana"),2,List(100L,200L,300L)),
    Row("Harry",List("apple","other"),1,List(100L,200L,300L))
)

val arraySchema = new StructType().add("name",StringType).add("fruits", ArrayType(StringType)).add("favorite",IntegerType).add("arraylong", ArrayType(LongType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.write.mode("overwrite").format("parquet").save("/tmp/testparquet")
val df2=spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")

val sqltext = """
select collect_set( CASE WHEN favorite IN (0,1,2,3) THEN struct(name,cast(favorite as string),fruits, arraylong) END) from df2 group by name
"""

spark.sql(sqltext).show()

The not-supported message:

          !Expression <CollectSet> collect_set(CASE WHEN favorite#115 IN (0,1,2,3) THEN struct(name, name#113, col2, cast(favorite#115 as string), fruits, fruits#114, arraylong, arraylong#116) END, 0, 0) cannot run on GPU because input expression CaseWhen CASE WHEN favorite#115 IN (0,1,2,3) THEN struct(name, name#113, col2, cast(favorite#115 as string), fruits, fruits#114, arraylong, arraylong#116) END (child ArrayType(StringType,true) is not supported, child ArrayType(LongType,true) is not supported); expression CollectSet collect_set(CASE WHEN favorite#115 IN (0,1,2,3) THEN struct(name, name#113, col2, cast(favorite#115 as string), fruits, fruits#114, arraylong, arraylong#116) END, 0, 0) produces an unsupported type ArrayType(StructType(StructField(name,StringType,true),StructField(col2,StringType,true),StructField(fruits,ArrayType(StringType,true),true),StructField(arraylong,ArrayType(LongType,true),true)),false)
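For anyone reproducing this, fallback messages like the one above are reported by the RAPIDS Accelerator when its explain output is turned on. A minimal sketch, assuming a spark-shell session (`spark`) with the plugin already loaded, and reusing `sqltext` and the `df2` view from the example above:

```scala
// Assumes the RAPIDS Accelerator plugin is already on the classpath and enabled.
// NOT_ON_GPU reports only the parts of the plan that fall back to the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

// Re-run the query; the fallback reasons are logged when the query is planned.
spark.sql(sqltext).show()
```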


sameerz · 2022-05-17T21:53:01Z

~~Depends on PR rapidsai/cudf#10730 from issue rapidsai/cudf#10508 and rapidsai/cudf#10883.~~

sameerz · 2022-05-19T00:04:53Z

Depends on rapidsai/cudf#10870

ttnghia · 2022-05-24T16:38:46Z

A note for the cudf implementation: the current cudf unit tests for drop_list_duplicates use hard-coded sorted lists as the expected output. Once the implementation switches to the new row comparator for nested types, the list elements will no longer be sorted, so the unit tests need to be updated as well.
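The test concern in this note can be illustrated with a minimal sketch in plain Scala. This is not cudf's actual C++ or Java test code; the helper name `sameSetsIgnoringOrder` and the sample values are hypothetical, chosen only to show why an expected value written in sorted order stops matching once the output order is arbitrary.

```scala
// Hypothetical helper, not part of cudf or spark-rapids: compares two
// collect_set-style results row by row while ignoring element order.
def sameSetsIgnoringOrder[T](actual: Seq[Seq[T]], expected: Seq[Seq[T]]): Boolean =
  actual.length == expected.length &&
    actual.zip(expected).forall { case (a, e) =>
      // Equal sizes plus equal element sets means the same elements,
      // assuming the expected list contains no duplicates.
      a.length == e.length && a.toSet == e.toSet
    }

// Expected values written in sorted order, as a sort-based implementation emits them.
val expected = Seq(Seq(1, 2, 3), Seq(10, 20))
// The same sets in a different order, as a hash-based implementation might emit them.
val actual = Seq(Seq(3, 1, 2), Seq(20, 10))

assert(actual != expected)                      // an order-sensitive comparison no longer matches
assert(sameSetsIgnoringOrder(actual, expected)) // an order-insensitive comparison still does
```

An alternative with the same effect is to sort both sides before comparing, which only works for element types that have an ordering.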

ttnghia · 2022-06-10T20:41:36Z

Now depends on:

Refactor collect_set to use cudf::distinct and cudf::lists::distinct rapidsai/cudf#11228

which in turn depends on:

Implement lists::distinct and cudf::detail::stable_distinct rapidsai/cudf#11149

…inct` (#11228)

The current groupby/reduction `collect_set` aggregations use `lists::drop_list_duplicates` to generate set(s) of distinct elements. This PR changes that to use `cudf::distinct` and `cudf::lists::distinct` instead, which have some advantages, including:

* Fully supporting nested types, and
* Achieving better performance (`O(n)` instead of `O(n log n)`) by internally using a hash table instead of a segmented sort.

This also enables nested types support for `collect_set` in spark-rapids (issue NVIDIA/spark-rapids#5508).

The changes in Java code here are only to fix unit tests. Previously, they were implemented with the assumption that the `collect_set` results are sorted; now they fail because the results are no longer sorted.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Jason Lowe (https://github.com/jlowe)
- David Wendt (https://github.com/davidwendt)
- MithunR (https://github.com/mythrocks)

URL: #11228
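The difference between the two deduplication strategies can be sketched on the CPU in plain Scala. This is only an illustration of the general idea, not cudf's GPU implementation; the function names `distinctBySort` and `distinctByHash` are hypothetical. The hash-based variant below happens to keep first-occurrence order, which is an assumption of this sketch and not a guarantee made by `cudf::distinct`.

```scala
import scala.collection.mutable

// Sort-based dedup: O(n log n). The output is sorted as a side effect,
// which is what order-sensitive tests may have been relying on.
def distinctBySort[T: Ordering](xs: Seq[T]): Seq[T] =
  xs.sorted.foldLeft(Vector.empty[T]) { (acc, x) =>
    // After sorting, duplicates are adjacent, so compare with the last kept element.
    if (acc.nonEmpty && acc.last == x) acc else acc :+ x
  }

// Hash-based dedup: expected O(n). Needs only hashing and equality, not an
// ordering, and the output is not sorted.
def distinctByHash[T](xs: Seq[T]): Seq[T] = {
  val seen = mutable.HashSet.empty[T]
  // HashSet.add returns true only when the element was not already present.
  xs.filter(x => seen.add(x))
}

val input = Seq(3, 1, 3, 2, 1)
println(distinctBySort(input)) // sorted distinct values
println(distinctByHash(input)) // distinct values in first-occurrence order
```

Because the hash-based variant needs no ordering, it also applies to element types such as nested sequences or tuples for which no ordering is readily available.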

NVnavkumar · 2022-08-05T18:00:53Z

Fixed via #6079

viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 17, 2022

sameerz added cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed ? - Needs Triage Need team to review and classify labels May 17, 2022

ttnghia mentioned this issue May 18, 2022

Strong index types for equality comparator rapidsai/cudf#10883

Merged

sameerz added cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed cudf_dependency An issue or PR with this label depends on a new feature in cudf labels May 18, 2022

razajafri self-assigned this May 20, 2022

ttnghia mentioned this issue Jun 10, 2022

[FEA] Fully support nested types in lists::drop_list_duplicates rapidsai/cudf#11093

Closed

ttnghia mentioned this issue Jul 8, 2022

Fully support nested types in lists::drop_list_duplicates rapidsai/cudf#11224

Closed

2 tasks

razajafri removed their assignment Jul 8, 2022

ttnghia mentioned this issue Jul 8, 2022

Refactor collect_set to use cudf::distinct and cudf::lists::distinct rapidsai/cudf#11228

Merged

NVnavkumar self-assigned this Jul 12, 2022

NVnavkumar mentioned this issue Jul 25, 2022

Add support for nested types to collect_set(...) on the GPU [databricks] #6079

Merged

NVnavkumar closed this as completed Aug 5, 2022
