
[FEA] Improve performance of from_json #10301

Closed
andygrove opened this issue Jan 26, 2024 · 7 comments · Fixed by #10306

Assignees: andygrove
Labels: performance (A performance related task/issue)

Comments

@andygrove (Contributor)

Is your feature request related to a problem? Please describe.
Using the following benchmark, I see that the performance of from_json on GPU can be 4x slower than native Spark on CPU.

Generate JSON Data

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, struct, to_json}

val t1 = spark.read.parquet("/mnt/bigdata/tpcds/sf10-parquet/web_sales.parquet")
// Serialize every row of web_sales into a single JSON string column
val df = t1.select(to_json(struct(t1.columns.map(col): _*)).alias("my_json"))
spark.time(df.write.mode(SaveMode.Overwrite).parquet("temp.parquet"))

from_json benchmark

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, from_json}

val t1 = spark.read.parquet("/mnt/bigdata/tpcds/sf10-parquet/web_sales.parquet")
val t2 = spark.read.parquet("temp.parquet").repartition(16)
val df = t2.select(from_json(col("my_json"), t1.schema))

// GPU run: enable the RAPIDS implementation of JsonToStructs
spark.conf.set("spark.rapids.sql.expression.JsonToStructs", true)
spark.time(df.write.mode(SaveMode.Overwrite).parquet("temp2.parquet"))

// CPU run: fall back to Spark's native JsonToStructs
spark.conf.set("spark.rapids.sql.expression.JsonToStructs", false)
spark.time(df.write.mode(SaveMode.Overwrite).parquet("temp2.parquet"))

Describe the solution you'd like
Improve performance.

Describe alternatives you've considered

Additional context

@andygrove added the feature request (New feature or request), ? - Needs Triage (Need team to review and classify), and performance (A performance related task/issue) labels on Jan 26, 2024
@andygrove self-assigned this on Jan 26, 2024
@andygrove (Contributor, Author)

Breakdown of time spent in each major stage:

PERF: cleanAndConcat=1376 millis
PERF: copyToHost=29 millis
PERF: readJSON=44 millis
PERF: cast=32 millis
PERF: postProcessing=17 millis
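
For reference, these PERF lines come from ad-hoc instrumentation around each stage. A minimal sketch of how such wall-clock timings can be captured in Scala (the timed helper below is hypothetical, not the plugin's actual instrumentation):

// Hypothetical helper for ad-hoc stage timing; not the plugin's
// actual instrumentation.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  try {
    block
  } finally {
    val millis = (System.nanoTime() - start) / 1000000
    println(s"PERF: $label=$millis millis")
  }
}

// Example usage around one stage:
// val concat = timed("cleanAndConcat") { cleanAndConcat(input) }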

@andygrove (Contributor, Author)

The expensive call in cleanAndConcat is this:

// joined is the single-row string column produced by joinStrings;
// this builds a one-row column holding "\n" and concatenates it
// onto the end of the whole document
val concat = withResource(joined) { _ =>
  withResource(ColumnVector.fromStrings("\n")) { newline =>
    ColumnVector.stringConcatenate(Array[ColumnView](joined, newline))
  }
}

@revans2 (Collaborator) commented Jan 26, 2024

Do we have a breakdown on cleanAndConcat? I see all kinds of things in there that we could do better, and it is really complicated too.

@andygrove (Contributor, Author)

@revans2 I just posted that as you were writing your comment.

@andygrove (Contributor, Author)

PERF: cleanAndConcat: strip: 14 millis
PERF: cleanAndConcat: isNullOrEmpty: 9 millis
PERF: cleanAndConcat: literalNull: 32 millis
PERF: cleanAndConcat: checkNewline: 7 millis
PERF: cleanAndConcat: joinStrings: 1 millis
PERF: cleanAndConcat: stringConcatenate: 1275 millis

@andygrove (Contributor, Author) commented Jan 27, 2024

I understand the issue better now.

We start with a string column containing all of the JSON lines (after some cleanup that we have already performed). We then call joinStrings to create a single-row string column containing one string, with each entry separated by a newline character.

// lineSep is the separator inserted between rows; emptyRow is the
// scalar substituted for null rows
val joined = withResource(cudf.Scalar.fromString("\n")) { lineSep =>
  cleaned.joinStrings(lineSep, emptyRow)
}

This is very fast.

joinStrings does not append a final newline character to the document, and that caused issues in some tests where the final JSON line was empty or invalid (producing the error "The input data didn't parse correctly and we read a different number of rows than was expected. Expected 512, but got 511"), so we have some code to add a final newline character:

val concat = withResource(joined) { _ =>
  withResource(ColumnVector.fromStrings("\n")) { newline =>
    ColumnVector.stringConcatenate(Array[ColumnView](joined, newline))
  }
}

This is the cause of the performance issue. If I remove this part, the benchmark in this issue performs well on the GPU (at least slightly faster than the CPU).

I will experiment with other approaches here, and file an issue against cuDF with a simple repro if I can't find another solution.
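
One possible alternative, sketched below: append the newline to every row first (a cheap element-wise operation), then join with an empty separator so the document already ends in a newline. This is only a sketch, assuming ColumnVector.fromScalar and the element-wise stringConcatenate overload behave as used above, and it glosses over null handling; it is not necessarily the approach taken in #10306.

// Sketch only: append "\n" to each row, then join with an empty
// separator so the resulting document already ends with a newline.
// Null rows would need extra care here: element-wise
// stringConcatenate yields null for a null input row.
val concat = withResource(cudf.Scalar.fromString("\n")) { newlineScalar =>
  withResource(ColumnVector.fromScalar(newlineScalar,
      cleaned.getRowCount.toInt)) { newlines =>
    withResource(ColumnVector.stringConcatenate(
        Array[ColumnView](cleaned, newlines))) { withNewlines =>
      withResource(cudf.Scalar.fromString("")) { emptySep =>
        withNewlines.joinStrings(emptySep, emptyRow)
      }
    }
  }
}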

@revans2 (Collaborator) commented Jan 29, 2024

At a minimum we can write a kernel just for this. The original joinStrings is not that complicated:

https://github.com/rapidsai/cudf/blob/d5db68e018d720094489325d5547a7ed82f22b0d/cpp/src/strings/combine/join.cu

We could copy it and make it append the trailing separator:

https://github.com/rapidsai/cudf/blob/d5db68e018d720094489325d5547a7ed82f22b0d/cpp/src/strings/combine/join.cu#L65

Or we could try to put up a patch to cuDF that lets us select whether or not it happens.
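
If cuDF did grow such an option, the plugin-side call could stay nearly identical to the current joinStrings call. The trailing-separator flag below is purely hypothetical; no such parameter exists in the cuDF Java API today:

// Hypothetical API: the appendTrailingSeparator flag does not exist
// in cuDF; this only sketches what the call site could look like if
// such a patch were accepted.
val joined = withResource(cudf.Scalar.fromString("\n")) { lineSep =>
  cleaned.joinStrings(lineSep, emptyRow, /* appendTrailingSeparator = */ true)
}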

@sameerz removed the feature request (New feature or request) and ? - Needs Triage (Need team to review and classify) labels on Feb 2, 2024