[BUG] `read_json` does not output null list when the input is a list with different data type than the specified schema #17349

ttnghia · 2024-11-16T15:25:36Z

When the input JSON contains a list whose element types do not match the specified schema, read_json does not produce a null list in the output lists column. For example:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123]}                 |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+----------------+
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|        {[null]}|             <========== wrong, should be {null}
+----------------+

In the example above, the list in the second row is [123], while the requested schema for that column is list<struct<string, string>>. The output for that row should therefore be a null list ({null}), not a list containing a null struct ({[null]}) as shown above.

ttnghia · 2024-11-16T15:48:18Z

Another example:
GPU:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123, {"b": "1"}]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123, {"b": "1"}]}     |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+-------------------+
|   from_json(value)|
+-------------------+
|         {[{1, 2}]}|
|{[null, {1, null}]}|              <================ wrong
+-------------------+

The correct output from Spark CPU is also a null list:

+----------------+
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|          {null}|
+----------------+
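
To make the expected behavior concrete, here is a minimal pure-Python sketch of the rule described in this issue: if any element of a list does not match the element type in the schema, the whole list becomes null. This is not how cuDF or Spark implement it. The schema encoding, the `MISMATCH` sentinel, and the `expected_from_json` helper are all hypothetical names made up for illustration, and the handling of string leaves is a simplifying assumption.

```python
import json

# Hypothetical schema encoding, used only in this sketch:
#   "string"                     -> a string leaf
#   ("array", elem_schema)       -> a list whose elements follow elem_schema
#   ("struct", {name: schema})   -> an object with the given fields
MISMATCH = object()  # sentinel: the value cannot be represented under the schema


def coerce(value, schema):
    """Return value shaped to schema, or MISMATCH if the types disagree."""
    if value is None:
        return None
    if isinstance(schema, str):
        if schema != "string":
            raise ValueError(f"unsupported leaf type: {schema}")
        # Simplifying assumption: only JSON strings match a string leaf.
        return value if isinstance(value, str) else MISMATCH
    kind, inner = schema
    if kind == "array":
        if not isinstance(value, list):
            return MISMATCH
        items = [coerce(v, inner) for v in value]
        # One mismatched element invalidates the whole list.
        return MISMATCH if any(i is MISMATCH for i in items) else items
    if kind == "struct":
        if not isinstance(value, dict):
            return MISMATCH
        out = {}
        for name, field_schema in inner.items():
            field = coerce(value.get(name), field_schema)
            # Assumption: a mismatched field becomes null, the struct survives.
            out[name] = None if field is MISMATCH else field
        return out
    raise ValueError(f"unsupported schema kind: {kind}")


def expected_from_json(text, schema):
    """Parse one JSON row and apply the null-on-mismatch rule."""
    result = coerce(json.loads(text), schema)
    return None if result is MISMATCH else result


# struct<a: array<struct<b: string, c: string>>>
SCHEMA = ("struct", {"a": ("array", ("struct", {"b": "string", "c": "string"}))})

print(expected_from_json('{"a": [{"b": "1", "c": "2"}]}', SCHEMA))
# {'a': [{'b': '1', 'c': '2'}]}
print(expected_from_json('{"a": [123]}', SCHEMA))
# {'a': None}
print(expected_from_json('{"a": [123, {"b": "1"}]}', SCHEMA))
# {'a': None}
```

Under this rule, both problem rows from the examples above yield a null list for `a`, which agrees with the Spark CPU output, while the GPU path currently keeps the list and nulls only the mismatched element.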

ttnghia added the bug Something isn't working label Nov 16, 2024

github-project-automation bot added this to cuDF/Dask/Numba/UCX Nov 16, 2024

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Nov 16, 2024

ttnghia added this to libcudf Nov 16, 2024

ttnghia moved this to Burndown in libcudf Nov 16, 2024

karthikeyann self-assigned this Nov 18, 2024
