from_json throws exception when the json's structure only partially matches the provided schema #10901

Feng-Jiang28 · 2024-05-27T03:39:55Z

The from_json function parses a column containing JSON data into a StructType based on a provided schema. On the GPU it throws an exception when the JSON's structure only partially matches the schema, whereas on the CPU the same query returns null for the field that does not match.

CPU:

scala> import org.apache.spark.sql.types.{ArrayType, IntegerType, LongType, MapType, StringType, StructType}
scala> import org.apache.spark.sql.functions.{from_json, to_json}
scala> import org.apache.spark.sql.{Row}
scala> val st = new StructType().add("c1", LongType).add("c2", ArrayType(new StructType().add("c3", LongType).add("c4", StringType)))
scala> val df1 = Seq("""{"c2": [19], "c1": 123456}""").toDF("c0")
scala> df1.write.mode("OVERWRITE").parquet("TEMP")
scala> val df2 = spark.read.parquet("TEMP")
scala> df2.select(from_json($"c0", st)).show()
+--------------+
| from_json(c0)|
+--------------+
|{123456, null}|
+--------------+

GPU:

$SPARK_HOME/bin/spark-shell --master local[*] --jars ${SPARK_RAPIDS_PLUGIN_JAR} --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.rapids.sql.enabled=true --conf spark.rapids.sql.explain=ALL --driver-java-options '-ea -Duser.timezone=UTC ' --conf spark.rapids.sql.expression.JsonTuple=true --conf spark.rapids.sql.expression.GetJsonObject=true --conf spark.rapids.sql.expression.JsonToStructs=true --conf spark.rapids.sql.expression.StructsToJson=true

scala> import org.apache.spark.sql.types.{ArrayType, IntegerType, LongType, MapType, StringType, StructType}
scala> import org.apache.spark.sql.functions.{from_json, to_json}
scala> import org.apache.spark.sql.{Row}
scala> val st = new StructType().add("c1", LongType).add("c2", ArrayType(new StructType().add("c3", LongType).add("c4", StringType)))
scala> val df1 = Seq("""{"c2": [19], "c1": 123456}""").toDF("c0")
scala> df1.write.mode("OVERWRITE").parquet("TEMP")
scala> val df2 = spark.read.parquet("TEMP")
scala> df2.select(from_json($"c0", st)).show()
24/05/27 03:36:48 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(from_json(StructField(c1,LongType,true), StructField(c2,ArrayType(StructType(StructField(c3,LongType,true),StructField(c4,StringType,true)),true),true), c0#7, Some(UTC)) as string) AS from_json(c0)#13 will run on GPU
      *Expression <Cast> cast(from_json(StructField(c1,LongType,true), StructField(c2,ArrayType(StructType(StructField(c3,LongType,true),StructField(c4,StringType,true)),true),true), c0#7, Some(UTC)) as string) will run on GPU
        *Expression <JsonToStructs> from_json(StructField(c1,LongType,true), StructField(c2,ArrayType(StructType(StructField(c3,LongType,true),StructField(c4,StringType,true)),true),true), c0#7, Some(UTC)) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

24/05/27 03:36:49 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-295-cuda11/target/libcudf-install/include/cudf/column/column_factories.hpp:343: Invalid, non-fixed-width type.
	at ai.rapids.cudf.Table.readJSONFromDataSource(Native Method)
	at ai.rapids.cudf.Table.readJSON(Table.java:1441)
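
A possible way to sidestep the failure while it is open, sketched here as an assumption rather than a confirmed fix: turn off the GPU replacement for from_json so that the expression is planned on the CPU. The config key is the same one passed on the spark-shell command line above; that it can be changed at runtime through spark.conf.set is an assumption, and the snippet reuses df2 and st from the repro.

```scala
// Sketch only: assumes the expression toggle is honored when set at runtime.
// The key is the one enabled on the command line in the GPU repro above.
spark.conf.set("spark.rapids.sql.expression.JsonToStructs", "false")

// Re-run the failing query; with the toggle off, JsonToStructs should no
// longer be replaced by its GPU implementation.
df2.select(from_json($"c0", st)).show()
```

If the toggle takes effect, the spark.rapids.sql.explain=ALL output should report that JsonToStructs cannot run on GPU instead of "will run on GPU".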

Feng-Jiang28 mentioned this issue May 27, 2024

[BUG] Issues found by Spark UT Framework on RapidsJsonFunctionsSuite #10852

Open

5 tasks

revans2 self-assigned this May 28, 2024

revans2 added the bug and ? - Needs Triage labels May 28, 2024

mattahrens assigned Feng-Jiang28 and unassigned revans2 May 28, 2024

mattahrens removed the ? - Needs Triage label May 28, 2024

GaryShen2008 changed the title ~~from_json Exception, when the json's structure only partially matches the provided schema~~ from_json throws exception when the json's structure only partially matches the provided schema Jun 7, 2024

Feng-Jiang28 mentioned this issue Jul 15, 2024

from_json Json to Struct Exception Logging #11186

Merged

This was referenced Nov 15, 2024

[FEA] read_json should output all-nulls columns for the schema columns that do not match with the input JSON rapidsai/cudf#17341

Open

Execute from_json with struct schema using JSONUtils.fromJSONToStructs #11618

Open
