Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] collect_set on struct[Array] #5508

Closed
viadea opened this issue May 17, 2022 · 5 comments · Fixed by #6079
Closed

[FEA] collect_set on struct[Array] #5508

viadea opened this issue May 17, 2022 · 5 comments · Fixed by #6079
Assignees
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request

Comments

@viadea
Copy link
Collaborator

viadea commented May 17, 2022

I wish collect_set can support struct[Array] as input column.
The ask is mainly for struct[Array(String)] and struct[Array(Long)].

For example:

import org.apache.spark.sql.types._
val arrayData = Seq(
    Row("John",List("apple","orange","banana"),1,List(100L,200L,300L)),
    Row("David",List("apple","orange","banana"),2,List(100L,200L,300L)),
    Row("Harry",List("apple","other"),1,List(100L,200L,300L))
)

val arraySchema = new StructType().add("name",StringType).add("fruits", ArrayType(StringType)).add("favorite",IntegerType).add("arraylong", ArrayType(LongType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayData),arraySchema)
df.write.mode("overwrite").format("parquet").save("/tmp/testparquet")
val df2=spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")

val sqltext = """
select collect_set( CASE WHEN favorite IN (0,1,2,3) THEN struct(name,cast(favorite as string),fruits, arraylong) END) from df2 group by name
"""

spark.sql(sqltext).show()

The not-supported-messages:

          !Expression <CollectSet> collect_set(CASE WHEN favorite#115 IN (0,1,2,3) THEN struct(name, name#113, col2, cast(favorite#115 as string), fruits, fruits#114, arraylong, arraylong#116) END, 0, 0) cannot run on GPU because input expression CaseWhen CASE WHEN favorite#115 IN (0,1,2,3) THEN struct(name, name#113, col2, cast(favorite#115 as string), fruits, fruits#114, arraylong, arraylong#116) END (child ArrayType(StringType,true) is not supported, child ArrayType(LongType,true) is not supported); expression CollectSet collect_set(CASE WHEN favorite#115 IN (0,1,2,3) THEN struct(name, name#113, col2, cast(favorite#115 as string), fruits, fruits#114, arraylong, arraylong#116) END, 0, 0) produces an unsupported type ArrayType(StructType(StructField(name,StringType,true),StructField(col2,StringType,true),StructField(fruits,ArrayType(StringType,true),true),StructField(arraylong,ArrayType(LongType,true),true)),false)
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 17, 2022
@sameerz
Copy link
Collaborator

sameerz commented May 17, 2022

Depends on PR rapidsai/cudf#10730 from issue rapidsai/cudf#10508 and rapidsai/cudf#10883.

@sameerz sameerz added cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed ? - Needs Triage Need team to review and classify labels May 17, 2022
@sameerz sameerz added cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed cudf_dependency An issue or PR with this label depends on a new feature in cudf labels May 18, 2022
@sameerz
Copy link
Collaborator

sameerz commented May 19, 2022

Depends on rapidsai/cudf#10870

@razajafri razajafri self-assigned this May 20, 2022
@ttnghia
Copy link
Collaborator

ttnghia commented May 24, 2022

A note for cudf implementation: The current cudf unit tests for drop_list_duplicates use hard-coded sorted lists as the expected lists for comparison with the output. When switched to use the new row comparator for nested types, the list elements will not be sorted thus we need to update unit tests as well.

@ttnghia
Copy link
Collaborator

ttnghia commented Jun 10, 2022

@razajafri razajafri removed their assignment Jul 8, 2022
@NVnavkumar NVnavkumar self-assigned this Jul 12, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Jul 15, 2022
…inct` (#11228)

The current groupby/reducttion `collect_set` aggregations use `lists::drop_list_duplicates` to generate set(s) of distinct elements. This PR changes that to use `cudf::distinct` and `cudf::lists::distinct` instead, which have some advantages including:
 * Fully supporting nested types, and:
 * Achieving better performance (`O(n)` instead of `O(nlogn)`) by internally using hash table instead of segmented sort.

This also enables nested types support for `collect_set` in spark-rapids (issue NVIDIA/spark-rapids#5508).

The changes in Java code here are only to fix unit tests. Previously, they were implemented with the assumption that the `collect_set` results are sorted, now they fail when the results are no longer sorted.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - David Wendt (https://github.com/davidwendt)
  - MithunR (https://github.com/mythrocks)

URL: #11228
@NVnavkumar
Copy link
Collaborator

Fixed via #6079

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants