[QST] Question about using UDF to implement operations. #5343

tregodev · 2021-05-28T08:11:22Z

Hey all!

As part of my thesis I am researching spark-rapids, comparing GPU and CPU processing for biological sequencing, essentially constructing De Bruijn graphs from a large text file. The part of the code I want to accelerate is fairly simple; the only complicated operations I need that are not already implemented are collect_set and zipWithIndex.

Is there any way to implement these with UDFs so that they are GPU-accelerated?

The data I want to apply both functions to has the schema (String, Long), so the code looks something like this:

val inFile = spark.read.schema("kmer STRING, source_seq LONG").csv(inPath).toDF
val collectedSourceSeqs = inFile.groupBy("kmer").agg(sort_array(collect_set("source_seq")).as("source_seqs"))
val collectedSets = collectedSourceSeqs.groupBy("source_seqs").agg(collect_set("kmer").as("kmers"))

As far as I can tell, collect_list is supported in windowing, but dropDuplicates is not supported on list types. zipWithIndex is not available on DataFrames, so I have been using a function that moves most of the operation to the DataFrame API:

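import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index"): DataFrame = {
    // Tag every row with its partition id and an id that increases within the partition.
    val dfWithPartitionId = df.withColumn("partition_id", spark_partition_id()).withColumn("inc_id", monotonically_increasing_id())
    // Per-partition offset: rows in earlier partitions, minus the partition's smallest inc_id, plus the start offset.
    val partitionOffsets = dfWithPartitionId
      .groupBy("partition_id")
      // min is deterministic, whereas first is not guaranteed to return the earliest row.
      .agg(count(lit(1)) as "cnt", min("inc_id") as "inc_id")
      .orderBy("partition_id")
      .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt" )
      .collect()
      .map(_.getLong(0))
    // Driver-side array lookup; assumes partition ids are contiguous from 0 (no empty partitions).
    val pof = udf((partitionId: Int) => partitionOffsets(partitionId))
    dfWithPartitionId
      .withColumn("partition_offset", pof(col("partition_id")))
      .withColumn(indexName, col("partition_offset") + col("inc_id"))
      .drop("partition_id", "partition_offset", "inc_id")
}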

Answered by revans2

revans2 (Maintainer) · 2021-05-28T11:44:38Z

We support UDFs (sort of). If the UDF is really simple and can be translated into a Catalyst expression, then we can do some things with it if you turn that on (very experimental). I don't think what you are doing is something we support yet for translation to Catalyst. The other option is that you can write your own UDFs, either using CUDA directly or using the Java cudf API. These give you a lot of control. But it looks like your UDF is really just a join. You have an array mapping partition ids to some other number, and you want to look that number up based on the partition id. That is a join.

As a side note, we are working on collect_list and collect_set for aggregations. It will probably be a few more releases before we can support them in Spark, but the cudf code does support them (we just cannot do it distributed). sort_array is another one that we do not officially support, but the latest cudf does (no Java APIs for it yet, though). cudf also does not support grouping by lists of things yet. So there is a lot of work to get this functioning. I'll see what I can come up with, though.
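
To illustrate the point that the UDF lookup is really a join, here is a minimal sketch that keeps the per-partition offsets as a small DataFrame and joins it back instead of collecting it to the driver. The name `zipWithIndexJoin` and the column names `first_id` and `partition_offset` are hypothetical, chosen only for this example. Which of these operators actually run on the GPU depends on the plugin release, so treat this as a shape to try rather than a guaranteed accelerated path.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical helper: not part of Spark or spark-rapids.
def zipWithIndexJoin(df: DataFrame, offset: Long = 1L, indexName: String = "index"): DataFrame = {
  // Tag every row with its partition id and a per-partition increasing id.
  val withIds = df
    .withColumn("partition_id", spark_partition_id())
    .withColumn("inc_id", monotonically_increasing_id())

  // One row per non-empty partition: number of rows in earlier partitions,
  // minus the partition's smallest inc_id, plus the requested start offset.
  val offsets = withIds
    .groupBy("partition_id")
    .agg(count(lit(1)).as("cnt"), min("inc_id").as("first_id"))
    .select(
      col("partition_id"),
      (sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("first_id") + lit(offset))
        .as("partition_offset"))

  // The array lookup from the UDF becomes a broadcast join on partition_id.
  withIds
    .join(broadcast(offsets), Seq("partition_id"))
    .withColumn(indexName, col("partition_offset") + col("inc_id"))
    .drop("partition_id", "partition_offset", "inc_id")
}
```

Because the offsets are keyed by `partition_id` rather than by array position, this version does not rely on partition ids being contiguous. Like the original, it assumes the input is partitioned the same way each time the plan is evaluated; caching `df` first is one way to make that assumption safer.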

tregodev (Author) · 2021-05-28T12:07:56Z

Thank you for the quick and detailed response!

How about collect_set in windowing? I am under the impression that the cudf library supports collect_set in its Java API.

revans2 (Maintainer) · 2021-05-28T12:31:07Z

cudf just did a code freeze for our next release, and we will be doing our own code freeze shortly, so remembering what is in previous releases gets a bit complicated. https://nvidia.github.io/spark-rapids/docs/supported_ops.html should list all of the operations for the current release on Apache Spark 3.0.0.

collect_set is not supported for window operations in our current release, 0.5; collect_list is. collect_set was added in the cudf release that just froze, so I am hopeful that it will be available for window operations in our upcoming release, 21.06.0 (we are moving to calendar versioning). But there is no PR up for it yet, so it might not make it in by the code freeze this Friday. For collect_set and collect_list to work the way Apache Spark wants them to, we need a group-by aggregation that concatenates lists and sets. That didn't make it into cudf before the code freeze, so it will probably be a few releases before we can support it.

@jlowe I don't think we support UDAFs yet for RapidsUDFs. Do we?

jlowe (Maintainer) · 2021-05-28T13:40:30Z

@revans2 correct, UDAFs are not yet supported.

sameerz (Maintainer) · 2021-07-16T02:15:16Z

collect_set for windowing will be supported in the upcoming 21.08 release. The overarching issue for collect aggregations is #2062. The collect_set for windowing PR is #2548.
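
For readers who want to try the windowed form of collect_set in place of the group-by aggregation from the question, one possible shape is sketched below. The helper name `collectSourceSeqsViaWindow` and the `rn` column are hypothetical; the `kmer` and `source_seq` columns follow the schema in the question. It avoids dropDuplicates on a list column by keeping one row per key with row_number. Whether each operator is accelerated depends on the plugin release, and sort_array in particular may still fall back to the CPU.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical helper: a window-based stand-in for
// groupBy("kmer").agg(sort_array(collect_set("source_seq"))).
def collectSourceSeqsViaWindow(inFile: DataFrame): DataFrame = {
  val byKmer = Window.partitionBy("kmer")
  inFile
    // Every row of a kmer gets the full set of source_seq values for that kmer.
    .withColumn("source_seqs", sort_array(collect_set("source_seq").over(byKmer)))
    // Keep exactly one row per kmer instead of calling dropDuplicates on a list column.
    .withColumn("rn", row_number().over(byKmer.orderBy("source_seq")))
    .where(col("rn") === 1)
    .select("kmer", "source_seqs")
}
```

The window version does more work than a true aggregation, since it materializes the set on every input row before discarding all but one, so it is only worth considering when the aggregation form is not accelerated.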

jlowe (Maintainer) · 2021-07-16T14:34:35Z

Closing this as answered. Feel free to reopen if there's more to discuss.
