[FEA] JSON reader: support unquoted JSON field names. #10266

wbo4958 · 2022-02-10T09:10:57Z

This is part of the feature request NVIDIA/spark-rapids#9.
Suppose we have a JSON file containing:

{name: "Reynold Xin"}

Spark can parse it when allowUnquotedFieldNames is enabled.

cuDF throws an exception when parsing it.

We would like a configuration option, allowUnquotedFieldNames, to control this behavior.
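
To make the failure mode concrete, here is a minimal sketch that uses Python's standard json module as a stand-in for a strict JSON parser. It is not cuDF's reader and says nothing about cuDF's actual error; the helper name parses_as_strict_json is made up for this illustration.

```python
import json

# The record from the issue, and the same record with the field name quoted.
unquoted = '{name: "Reynold Xin"}'
quoted = '{"name": "Reynold Xin"}'


def parses_as_strict_json(text: str) -> bool:
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_strict_json(unquoted))  # strict parsers reject unquoted names
print(parses_as_strict_json(quoted))    # the quoted form is standard JSON
```

A strict parser rejects the first record and accepts the second, which is why an opt-in option is needed for the unquoted form.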


github-actions · 2022-03-13T17:03:56Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

revans2 · 2022-05-16T21:19:02Z

Setting this to P1 as it is off by default in Spark

github-actions · 2022-09-26T05:30:57Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

karthikeyann · 2022-10-11T11:35:10Z

Should we support whitespace in the unquoted field names?
What should the behaviour be if there is whitespace 1) at the beginning of the field name, 2) in the middle of the field name, or 3) after the field name but before the : ?
Similarly, what should the behaviour be if there is a newline in the above 3 cases?
Should we provide escape character (\) support?

revans2 · 2022-10-11T14:51:30Z

should we support whitespace in the unquoted field names?

No. see below for details

What should be the behaviour if there is whitespace 1) at beginning of field name, 2) in middle of field name 3) after field name, but before : ?

Whitespace at the beginning of the name, and after the name but before the :, should be stripped. Whitespace in the middle of the name is an error.

Similarly what should be behaviour if there is newline for above 3 cases?

That would only show up in a non-JSON-Lines (multi-line) use case. In those cases a newline is treated like any other whitespace: it is stripped from the beginning and end of the name, but it is an error in the middle.

Should we provide escape \ character support?

Not for unquoted names. An escape character in an unquoted field name is considered an error. This is true even when a second config (allowBackslashEscapingAnyCharacter) allows escaping any character.
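
The rules in the answers above can be sketched as a small validation helper. This is only an illustration of the described behaviour in plain Python, not code from Spark, Jackson, or cuDF. The function name normalize_unquoted_field_name is hypothetical, and rejecting an empty name is an assumption that the thread does not discuss.

```python
# Whitespace characters defined by the JSON grammar.
JSON_WHITESPACE = " \t\n\r"


def normalize_unquoted_field_name(raw: str) -> str:
    """Apply the rules described above to the text found before the ':'.

    Leading and trailing whitespace (including newlines) is stripped.
    Whitespace inside the name, or any backslash, is an error.
    """
    name = raw.strip(JSON_WHITESPACE)
    if not name:
        # Assumption: an empty unquoted name is rejected.
        raise ValueError("empty field name")
    if "\\" in name:
        raise ValueError(f"escape character in unquoted field name: {name!r}")
    if any(ch in JSON_WHITESPACE for ch in name):
        raise ValueError(f"whitespace inside unquoted field name: {name!r}")
    return name


print(normalize_unquoted_field_name("     value    "))  # value
print(normalize_unquoted_field_name(" a\n"))            # a
```

Inputs such as "name value", "value\ " and "va\lue" raise ValueError under these rules.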

Here are the test files and code that I used.


$ cat test.json
{name: "A", value : 100}
{"name": "B", "value ": 200}
{name: "C", value\ : 300}
{name: "D",     value    : 400}
{name: "E", name value: 500}
{name: "F", name N value: 600}
{name: "G", name\nvalue: 700}
{name: "H", "name\tvalue": 800}
{name: "I", "name\nvalue": 900}
{name: "J", va\lue: 1000}
{name: "K", "va\lue": 1100}
$ cat test_2.json
{name: "A", value : 100,
"other ": 200, a
:
"a",



ABC: "ABC"}
# spark-shell
scala> spark.read.option("allowUnquotedFieldNames", "true").option("multiLine", "true").json("./test_2.json").show(truncate = false)
+---+---+----+------+-----+
|ABC|a  |name|other |value|
+---+---+----+------+-----+
|ABC|a  |A   |200   |100  |
+---+---+----+------+-----+

scala> spark.read.options(Map("allowUnquotedFieldNames" -> "true", "allowBackslashEscapingAnyCharacter" -> "true")).json("./test.json").show(truncate = false)
+------------------------------+----+-----------+-----------+-----+------+
|_corrupt_record               |name|name\tvalue|name\nvalue|value|value |
+------------------------------+----+-----------+-----------+-----+------+
|null                          |A   |null       |null       |100  |null  |
|null                          |B   |null       |null       |null |200   |
|{name: "C", value\ : 300}     |null|null       |null       |null |null  |
|null                          |D   |null       |null       |400  |null  |
|{name: "E", name value: 500}  |null|null       |null       |null |null  |
|{name: "F", name N value: 600}|null|null       |null       |null |null  |
|{name: "G", name\nvalue: 700} |null|null       |null       |null |null  |
|null                          |H   |800        |null       |null |null  |
|null                          |I   |null       |900        |null |null  |
|{name: "J", va\lue: 1000}     |null|null       |null       |null |null  |
|null                          |K   |null       |null       |1100 |null  |
+------------------------------+----+-----------+-----------+-----+------+


scala> spark.read.option("allowUnquotedFieldNames", "true").json("./test.json").show(truncate = false)
+------------------------------+----+-----------+-----------+-----+------+
|_corrupt_record               |name|name\tvalue|name\nvalue|value|value |
+------------------------------+----+-----------+-----------+-----+------+
|null                          |A   |null       |null       |100  |null  |
|null                          |B   |null       |null       |null |200   |
|{name: "C", value\ : 300}     |null|null       |null       |null |null  |
|null                          |D   |null       |null       |400  |null  |
|{name: "E", name value: 500}  |null|null       |null       |null |null  |
|{name: "F", name N value: 600}|null|null       |null       |null |null  |
|{name: "G", name\nvalue: 700} |null|null       |null       |null |null  |
|null                          |H   |800        |null       |null |null  |
|null                          |I   |null       |900        |null |null  |
|{name: "J", va\lue: 1000}     |null|null       |null       |null |null  |
|{name: "K", "va\lue": 1100}   |null|null       |null       |null |null  |
+------------------------------+----+-----------+-----------+-----+------+

As in the other examples, you can ignore the "_corrupt_record" column. It is generally not used and we don't support it on the GPU, but it shows which lines had errors in them.
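
As a cross-check of the two tables above, here is a rough sketch that classifies each line of test.json with a regular expression. It is not Spark's or cuDF's parser: it only handles flat, single-line objects whose values are strings or integers. The set of characters accepted in an unquoted name is an assumption (anything except whitespace, quotes, structural characters, and backslash), and the names build_line_pattern and corrupt_rows are invented for this example.

```python
import re


def build_line_pattern(allow_any_escape: bool) -> re.Pattern:
    # Escapes accepted inside a quoted string: any character, or only the
    # standard JSON escapes.
    escape = r"\\." if allow_any_escape else r'\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4})'
    quoted = rf'"(?:[^"\\]|{escape})*"'
    # Assumption: an unquoted name is a run of characters with no whitespace,
    # quote, structural character, or backslash.
    unquoted = r'[^\s":,{}\[\]\\]+'
    value = rf"(?:{quoted}|-?\d+)"
    # Whitespace around the name is allowed; whitespace inside it is not.
    member = rf"\s*(?:{quoted}|{unquoted})\s*:\s*{value}\s*"
    return re.compile(rf"\s*\{{{member}(?:,{member})*\}}\s*")


# The lines of test.json, written as raw strings so backslashes stay literal.
TEST_LINES = [
    r'{name: "A", value : 100}',
    r'{"name": "B", "value ": 200}',
    r'{name: "C", value\ : 300}',
    r'{name: "D",     value    : 400}',
    r'{name: "E", name value: 500}',
    r'{name: "F", name N value: 600}',
    r'{name: "G", name\nvalue: 700}',
    r'{name: "H", "name\tvalue": 800}',
    r'{name: "I", "name\nvalue": 900}',
    r'{name: "J", va\lue: 1000}',
    r'{name: "K", "va\lue": 1100}',
]


def corrupt_rows(lines, allow_any_escape):
    """Return the lines that the sketch would flag as corrupt."""
    pattern = build_line_pattern(allow_any_escape)
    return [line for line in lines if pattern.fullmatch(line) is None]


for row in corrupt_rows(TEST_LINES, allow_any_escape=False):
    print(row)
```

With allow_any_escape=False the sketch flags rows C, E, F, G, J and K, which matches the run with only allowUnquotedFieldNames set. With allow_any_escape=True row K is accepted while J is still flagged, which matches the run that also sets allowBackslashEscapingAnyCharacter.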

wbo4958 added the feature request (New feature or request) and Needs Triage (Need team to review and classify) labels Feb 10, 2022

galipremsagar added the cuIO (cuIO issue) label Feb 11, 2022

github-actions bot added the inactive-30d label Mar 13, 2022

sameerz added the Spark (Functionality that helps Spark RAPIDS) label Mar 23, 2022

github-actions bot removed the inactive-30d label May 16, 2022

GregoryKimball removed the Needs Triage (Need team to review and classify) label Jun 28, 2022

GregoryKimball added this to the Nested JSON reader milestone Jun 28, 2022

github-actions bot added the inactive-90d label Sep 26, 2022

github-actions bot removed the inactive-90d label Oct 11, 2022

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) label Oct 26, 2022

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code) label Apr 2, 2023

This was referenced Mar 13, 2024

[FEA] Support allowUnquotedFieldNames for JsonToStructs and ScanJson NVIDIA/spark-rapids#10587

Open

[FEA] JSON input support NVIDIA/spark-rapids#9

Open
