[FEA] Improve memory efficiency of from_json #10001

Open

andygrove opened this issue Dec 8, 2023 · 6 comments
Labels
performance A performance related task/issue reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@andygrove
Contributor

Is your feature request related to a problem? Please describe.
I tried running some benchmarks with to_json and from_json and ran into OOM (and split/retry) issues with from_json, even on relatively small inputs.

Describe the solution you'd like
See if we can improve memory efficiency.

Describe alternatives you've considered

Additional context

@andygrove andygrove added feature request New feature or request ? - Needs Triage Need team to review and classify labels Dec 8, 2023
@andygrove andygrove self-assigned this Dec 8, 2023
@andygrove andygrove added the performance A performance related task/issue label Dec 8, 2023
@mattahrens mattahrens added reliability Features to improve reliability or bugs that severely impact the reliability of the plugin and removed ? - Needs Triage Need team to review and classify labels Dec 14, 2023
@andygrove
Contributor Author

andygrove commented Dec 20, 2023

Here is one expensive workaround we currently have that we could remove with additional work in cuDF:

// if the last entry in a column is incomplete or invalid, then cuDF
// will drop the row rather than replace with null if there is no newline, so we
// add a newline here to prevent that
val joined = withResource(cleaned.joinStrings(lineSep, emptyRow)) { joined =>
  withResource(ColumnVector.fromStrings("\n")) { newline =>
    ColumnVector.stringConcatenate(Array[ColumnView](joined, newline))
  }
}

EDIT: It is just the stringConcatenate call that we could potentially remove. We would still have to call joinStrings, which is expensive, unless we can have cuDF parse a column of JSON entries rather than providing a "file" in one row.
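
If cuDF replaced a trailing incomplete or invalid row with a null instead of dropping it, the workaround could shrink to just the join (a sketch of the potential simplification, reusing the names from the snippet above; resource management elided):

// The trailing-newline stringConcatenate goes away, and with it a full
// extra copy of the joined buffer.
val joined = cleaned.joinStrings(lineSep, emptyRow)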

@andygrove
Contributor Author

I added some debug logging to show the size of the inputs being passed to readJSON in my perf test, and I see two tasks each passing an input of roughly 500 MB to readJSON and then running into OOM.

Table.readJSON start=0, length=528729598
Table.readJSON start=0, length=528884953
24/01/02 20:43:02 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2115539824 after a synchronize. Total RMM allocated is 6502158080 bytes.
24/01/02 20:43:02 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2115539824 after a synchronize. Total RMM allocated is 6500297984 bytes.
24/01/02 20:43:02 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2115539824 bytes. Total RMM allocated is 6500297984 bytes.
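
For scale, the failing allocation is roughly four times the readJSON input length; a quick check of the numbers from the log above:

val inputBytes  = 528884953L   // second readJSON input length
val failedAlloc = 2115539824L  // allocation that could not be satisfied
println(failedAlloc.toDouble / inputBytes) // ≈ 4.0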

@andygrove
Contributor Author

The earlier OOM happened when running on a workstation with an RTX 3080, which has only 10 GB of device memory, so I am not convinced that this is really an issue. I did not run into any OOM/retry when using a workstation with an RTX Quadro 6000.

The GPU version of from_json performed slightly better than running on CPU in this environment.

GPU: 176s
CPU: 213s

Here is the script that I use for testing.

## to_json

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

val t1 = spark.read.parquet("/home/andygrove/web_sales.parquet")
val df = t1.select(to_json(struct(t1.columns.map(col): _*)).alias("my_json"))

spark.conf.set("spark.rapids.sql.expression.StructsToJson", true)
spark.time(df.write.mode(SaveMode.Overwrite).parquet("temp.parquet"))

## from_json

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
val t1 = spark.read.parquet("/home/andygrove/web_sales.parquet")
val t2 = spark.read.parquet("temp.parquet")
val df = t2.select(from_json(col("my_json"), t1.schema))

spark.conf.set("spark.rapids.sql.expression.JsonToStructs", true)
spark.time(df.collect())

@andygrove
Contributor Author

@revans2 I could use a sanity check on my conclusions here before closing this issue. Also, let me know if there are other benchmarks that you would like to see.

@revans2
Collaborator

revans2 commented Jan 3, 2024

I think we still have problems, but the underlying problem is being masked by bugs in the retry framework. I tried to run on a 48 GiB GPU with concurrent set to 1. It failed if maxPartitionBytes was set to 256 MiB, but worked if it was set to 128 MiB. The amount of memory used by the 128 MiB case was very high, but not enough to risk using up all of the memory on the GPU. Instead I think we are hitting the limit of what a single string can hold in cuDF. This gets treated like a split-and-retry OOM exception, but the retry framework eats the original exception, so we cannot see what really caused the problem.
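
For reference, the setup described above maps roughly to these session settings (a sketch; spark.rapids.sql.concurrentGpuTasks is my assumption for the "concurrent" knob):

spark.conf.set("spark.sql.files.maxPartitionBytes", "128m") // failed when set to "256m"
spark.conf.set("spark.rapids.sql.concurrentGpuTasks", 1)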

I suspect that on your 10 GiB GPU you really did run out of memory, and it was mostly due to fragmentation that it could not finish. I don't think an RTX 3080 supports the ARENA allocator. But I could be wrong.

Either way, I think we would eventually need to solve this in a generic way in project and have it support splitting the input.
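
As an illustration of the "split the input" idea, here is an entirely hypothetical sketch reduced to plain Scala (parseWithSplit and its OOM handling are my invention; the real retry framework throws its own retry exceptions rather than java.lang.OutOfMemoryError):

// If parsing a chunk of JSON lines fails with an OOM, recursively
// retry on each half of the input instead of giving up.
def parseWithSplit[T](rows: Vector[String], parse: Vector[String] => T): Vector[T] =
  try Vector(parse(rows))
  catch {
    case _: OutOfMemoryError if rows.length > 1 =>
      val (left, right) = rows.splitAt(rows.length / 2)
      parseWithSplit(left, parse) ++ parseWithSplit(right, parse)
  }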

@revans2
Collaborator

revans2 commented Jan 10, 2024

To be clear, the ultimate right fix here is to implement #7866.

@sameerz sameerz removed the feature request New feature or request label Jan 23, 2024
@andygrove andygrove removed their assignment Mar 15, 2024