[FEA] JSON reader: support unquoted JSON field names. #10266

Open
Tracked by #9
wbo4958 opened this issue Feb 10, 2022 · 5 comments
Labels
0 - Backlog In queue waiting for assignment cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@wbo4958
Contributor

wbo4958 commented Feb 10, 2022

This is part of the feature request NVIDIA/spark-rapids#9.
Suppose we have a JSON file containing:

{name: "Reynold Xin"}

Spark can parse it when allowUnquotedFieldNames is enabled.

cuDF throws an exception when parsing it.

We would like a configuration option, allowUnquotedFieldNames, to control this behavior.
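
For context, unquoted field names are outside standard JSON, so any strict reader rejects the record above. The sketch below shows this with Python's standard-library `json` module, used here only as a stand-in for a strict parser; it is not cuDF and says nothing about cuDF's actual error type or message.

```python
import json

# The record from the issue: the field name is not quoted.
unquoted_doc = '{name: "Reynold Xin"}'
# The same record written as standard JSON.
quoted_doc = '{"name": "Reynold Xin"}'

try:
    json.loads(unquoted_doc)
    strict_ok = True
except json.JSONDecodeError as err:
    # A strict parser requires the field name to be a quoted string.
    strict_ok = False
    print(f"strict parser rejected the record: {err.msg}")

quoted_record = json.loads(quoted_doc)
print(quoted_record["name"])
```

An opt-in option such as allowUnquotedFieldNames keeps this strict behavior as the default and relaxes it only when requested.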

@wbo4958 wbo4958 added feature request New feature or request Needs Triage Need team to review and classify labels Feb 10, 2022
@galipremsagar galipremsagar added the cuIO cuIO issue label Feb 11, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@sameerz sameerz added the Spark Functionality that helps Spark RAPIDS label Mar 23, 2022
@revans2
Contributor

revans2 commented May 16, 2022

Setting this to P1 as it is off by default in Spark

@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Jun 28, 2022
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Jun 28, 2022
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@karthikeyann
Contributor

Should we support whitespace in unquoted field names?
What should the behaviour be if there is whitespace 1) at the beginning of the field name, 2) in the middle of the field name, or 3) after the field name but before the : ?
Similarly, what should the behaviour be if there is a newline in each of the above 3 cases?
Should we support the \ escape character?

@revans2
Contributor

revans2 commented Oct 11, 2022

Should we support whitespace in unquoted field names?

No. See below for details.

What should the behaviour be if there is whitespace 1) at the beginning of the field name, 2) in the middle of the field name, or 3) after the field name but before the : ?

Whitespace at the beginning of the name, and after the name but before the :, should be stripped. Whitespace in the middle of the name is an error.

Similarly, what should the behaviour be if there is a newline in each of the above 3 cases?

That would only show up in a non-JSON-Lines (multiLine) use case. In those cases a newline is treated like any other whitespace: it is stripped from the beginning and end of the name, but it is an error in the middle.

Should we support the \ escape character?

Not for unquoted names. An escape character in an unquoted field name is considered an error. This is true even if escaping any character is allowed by a second config (allowBackslashEscapingAnyCharacter).
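
The rules in the answers above can be sketched as a small helper. This is only an illustration of the described behaviour, not cuDF or Spark code: the function name `normalize_unquoted_name` is hypothetical, and it assumes the caller has already isolated the raw text between the preceding `{` or `,` and the following `:`.

```python
# Whitespace characters that JSON treats as insignificant between tokens.
JSON_WHITESPACE = " \t\r\n"


def normalize_unquoted_name(raw: str) -> str:
    """Apply the unquoted-field-name rules described in this thread.

    `raw` is assumed to be the text found where a field name is expected,
    up to (but not including) the ':' separator.
    """
    # Leading and trailing whitespace (including newlines) is stripped.
    name = raw.strip(JSON_WHITESPACE)
    if not name:
        raise ValueError("empty field name")
    for ch in name:
        # Whitespace in the middle of the name is an error.
        if ch in JSON_WHITESPACE:
            raise ValueError("whitespace inside unquoted field name")
        # Escapes are never accepted in unquoted names.
        if ch == "\\":
            raise ValueError("escape character in unquoted field name")
    return name


print(normalize_unquoted_name("    value    "))
print(normalize_unquoted_name("\n\n\nABC"))
```

Inputs such as `name value`, `va\lue`, or `value\ ` raise `ValueError` under this sketch, which corresponds to the lines Spark reports as corrupt in the output below.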

Here are the test files and code that I used.


$ cat test.json
{name: "A", value : 100}
{"name": "B", "value ": 200}
{name: "C", value\ : 300}
{name: "D",     value    : 400}
{name: "E", name value: 500}
{name: "F", name N value: 600}
{name: "G", name\nvalue: 700}
{name: "H", "name\tvalue": 800}
{name: "I", "name\nvalue": 900}
{name: "J", va\lue: 1000}
{name: "K", "va\lue": 1100}
$ cat test_2.json
{name: "A", value : 100,
"other ": 200, a
:
"a",



ABC: "ABC"}
# spark-shell
scala> spark.read.option("allowUnquotedFieldNames", "true").option("multiLine", "true").json("./test_2.json").show(truncate = false)
+---+---+----+------+-----+
|ABC|a  |name|other |value|
+---+---+----+------+-----+
|ABC|a  |A   |200   |100  |
+---+---+----+------+-----+

scala> spark.read.options(Map("allowUnquotedFieldNames" -> "true", "allowBackslashEscapingAnyCharacter" -> "true")).json("./test.json").show(truncate = false)
+------------------------------+----+-----------+-----------+-----+------+
|_corrupt_record               |name|name\tvalue|name\nvalue|value|value |
+------------------------------+----+-----------+-----------+-----+------+
|null                          |A   |null       |null       |100  |null  |
|null                          |B   |null       |null       |null |200   |
|{name: "C", value\ : 300}     |null|null       |null       |null |null  |
|null                          |D   |null       |null       |400  |null  |
|{name: "E", name value: 500}  |null|null       |null       |null |null  |
|{name: "F", name N value: 600}|null|null       |null       |null |null  |
|{name: "G", name\nvalue: 700} |null|null       |null       |null |null  |
|null                          |H   |800        |null       |null |null  |
|null                          |I   |null       |900        |null |null  |
|{name: "J", va\lue: 1000}     |null|null       |null       |null |null  |
|null                          |K   |null       |null       |1100 |null  |
+------------------------------+----+-----------+-----------+-----+------+


scala> spark.read.option("allowUnquotedFieldNames", "true").json("./test.json").show(truncate = false)
+------------------------------+----+-----------+-----------+-----+------+
|_corrupt_record               |name|name\tvalue|name\nvalue|value|value |
+------------------------------+----+-----------+-----------+-----+------+
|null                          |A   |null       |null       |100  |null  |
|null                          |B   |null       |null       |null |200   |
|{name: "C", value\ : 300}     |null|null       |null       |null |null  |
|null                          |D   |null       |null       |400  |null  |
|{name: "E", name value: 500}  |null|null       |null       |null |null  |
|{name: "F", name N value: 600}|null|null       |null       |null |null  |
|{name: "G", name\nvalue: 700} |null|null       |null       |null |null  |
|null                          |H   |800        |null       |null |null  |
|null                          |I   |null       |900        |null |null  |
|{name: "J", va\lue: 1000}     |null|null       |null       |null |null  |
|{name: "K", "va\lue": 1100}   |null|null       |null       |null |null  |
+------------------------------+----+-----------+-----------+-----+------+

As in the other examples, you can ignore the "_corrupt_record" field. It is generally not used and we don't support it on the GPU, but it shows which lines had errors in them.
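
To make the table above easier to check, here is a rough reference sketch that classifies each line of test.json as parsed or corrupt. It is an assumption-laden toy, not the Spark or cuDF parser: it only handles flat, non-empty objects with string or integer values, the names `parse_flat_object` and `is_corrupt` are hypothetical, and the character set accepted for unquoted names (letters, digits, `_`, `$`) is a guess rather than something stated in this thread.

```python
import re
import string

# Escapes that standard JSON permits after a backslash (besides \uXXXX).
_SIMPLE_ESCAPES = {'"': '"', "\\": "\\", "/": "/", "b": "\b",
                   "f": "\f", "n": "\n", "r": "\r", "t": "\t"}

_TOKEN = re.compile(r"""
    \s*(?:
        "(?P<quoted>(?:[^"\\]|\\.)*)"          # quoted string, still escaped
      | (?P<bare>[A-Za-z_$][A-Za-z0-9_$]*)     # unquoted name, assumed charset
      | (?P<number>-?\d+)                      # integer literal
      | (?P<punct>[{}:,])                      # structural characters
    )\s*""", re.VERBOSE)


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:  # e.g. a backslash outside of a quoted string
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens + [("end", "")]


def unescape(raw, allow_any_escape):
    out, i = [], 0
    while i < len(raw):
        if raw[i] != "\\":
            out.append(raw[i])
            i += 1
            continue
        nxt = raw[i + 1]  # the token regex guarantees a following character
        if nxt in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[nxt])
            i += 2
        elif nxt == "u":
            digits = raw[i + 2:i + 6]
            if len(digits) != 4 or any(c not in string.hexdigits for c in digits):
                raise ValueError("bad \\u escape")
            out.append(chr(int(digits, 16)))
            i += 6
        elif allow_any_escape:  # mimics allowBackslashEscapingAnyCharacter
            out.append(nxt)
            i += 2
        else:
            raise ValueError(f"invalid escape \\{nxt}")
    return "".join(out)


def parse_flat_object(text, allow_any_escape=False):
    toks, i, record = tokenize(text.strip()), 0, {}
    if toks[i] != ("punct", "{"):
        raise ValueError("expected '{'")
    i += 1
    while True:
        kind, raw = toks[i]
        if kind == "quoted":
            key = unescape(raw, allow_any_escape)
        elif kind == "bare":
            key = raw  # surrounding whitespace was consumed by the tokenizer
        else:
            raise ValueError("expected a field name")
        if toks[i + 1] != ("punct", ":"):
            raise ValueError("expected ':' after field name")
        kind, raw = toks[i + 2]
        if kind == "quoted":
            record[key] = unescape(raw, allow_any_escape)
        elif kind == "number":
            record[key] = int(raw)
        else:
            raise ValueError("expected a string or integer value")
        i += 3
        if toks[i] == ("punct", "}"):
            break
        if toks[i] != ("punct", ","):
            raise ValueError("expected ',' or '}'")
        i += 1
    if toks[i + 1][0] != "end":
        raise ValueError("trailing content after '}'")
    return record


def is_corrupt(line, allow_any_escape=False):
    try:
        parse_flat_object(line, allow_any_escape)
        return False
    except ValueError:
        return True


print(is_corrupt(r'{name: "E", name value: 500}'))
print(is_corrupt(r'{name: "K", "va\lue": 1100}', allow_any_escape=True))
```

Under these assumptions the sketch marks lines C, E, F, G and J as corrupt in both configurations, and line K as corrupt only when `allow_any_escape` is left off.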
