[BUG] read_json does not output null list when the input is a list with different data type than the specified schema #17349

Open
ttnghia opened this issue Nov 16, 2024 · 1 comment

ttnghia commented Nov 16, 2024

When the input JSON contains a list whose elements do not match the element type in the specified schema, read_json does not output a null for that list in the output lists column. For example:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123]}                 |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+----------------+
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|        {[null]}|             <========== wrong, should be {null}
+----------------+

In the example above, the list in the second row is [123], while the requested schema for that field is array<struct<b: string, c: string>>. Because 123 is not a struct, the output for that row should be a null list ({null}), not a list containing a null struct ({[null]}) as shown above.

ttnghia added the bug label Nov 16, 2024

ttnghia commented Nov 16, 2024

Another example, where the list mixes a non-struct element with a struct. Output on GPU:

scala> val df = Seq("""{"a": [{"b": "1", "c": "2"}]}""", """{"a": [123, {"b": "1"}]}""").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+-----------------------------+
|value                        |
+-----------------------------+
|{"a": [{"b": "1", "c": "2"}]}|
|{"a": [123, {"b": "1"}]}     |
+-----------------------------+

scala> df.repartition(1).selectExpr("from_json(value, 'struct<a: array<struct<b: string, c: string>>>')").show()
+-------------------+
|   from_json(value)|
+-------------------+
|         {[{1, 2}]}|
|{[null, {1, null}]}|              <================ wrong
+-------------------+

Spark on CPU produces the correct output, which is again a null list for the second row:

+----------------+
|from_json(value)|
+----------------+
|      {[{1, 2}]}|
|          {null}|
+----------------+
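
For reference, the expected null propagation can be modeled in a few lines of plain Python. This is only a sketch of the behavior described in this issue, not the Spark or libcudf implementation. The names `conform`, `from_json_model`, and `Mismatch`, the dict/list schema notation, and the handling of non-string leaf values are all assumptions made for illustration.

```python
import json


class Mismatch(Exception):
    """A JSON value cannot be read as the requested type."""


def conform(value, schema):
    """Coerce a parsed JSON value to `schema`; raise Mismatch on a type conflict.

    Hypothetical schema notation for this sketch:
      "string"        -> string leaf
      [elem_schema]   -> array<elem_schema>
      {name: schema}  -> struct with the given fields
    """
    if value is None:
        return None
    if schema == "string":
        # Assumption of this sketch: non-string values keep their JSON text.
        return value if isinstance(value, str) else json.dumps(value)
    if isinstance(schema, list):
        if not isinstance(value, list):
            raise Mismatch
        # No try/except here: one bad element invalidates the whole list.
        return [conform(item, schema[0]) for item in value]
    # Otherwise the schema is a dict describing a struct.
    if not isinstance(value, dict):
        raise Mismatch
    row = {}
    for name, field_schema in schema.items():
        try:
            row[name] = conform(value.get(name), field_schema)
        except Mismatch:
            row[name] = None  # the mismatch nulls only this field
    return row


def from_json_model(text, schema):
    """Model of the expected result for one input row."""
    try:
        return conform(json.loads(text), schema)
    except Mismatch:
        return None


# struct<a: array<struct<b: string, c: string>>>
SCHEMA = {"a": [{"b": "string", "c": "string"}]}

rows = [
    '{"a": [{"b": "1", "c": "2"}]}',
    '{"a": [123]}',
    '{"a": [123, {"b": "1"}]}',
]
results = [from_json_model(row, SCHEMA) for row in rows]
for row, result in zip(rows, results):
    print(row, "->", result)
```

In this model the list level has no error handling of its own, so a single element of the wrong type makes the enclosing field `a` null instead of producing a per-element null.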

ttnghia added this to libcudf Nov 16, 2024
ttnghia moved this to Burndown in libcudf Nov 16, 2024
karthikeyann self-assigned this Nov 18, 2024