[FEA] JSON input support #9

Open · 8 of 62 tasks
revans2 opened this issue May 28, 2020 · 5 comments
Labels: cudf_dependency, epic, feature request, SQL

revans2 (Collaborator) commented May 28, 2020

Is your feature request related to a problem? Please describe.

High Priority

GetJsonObject/get_json_object

get_json_object is being tracked by a separate epic. One epic is in the spark-rapids repo and another is in spark-rapids-jni. The JNI version is trying to write a new parser from scratch to match what Spark is doing.

JsonTuple/json_tuple

For now it is implemented in terms of multiple calls to get_json_object (see the sketch below). This is likely to change in the future, but for now it should be tracked with get_json_object.
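
For illustration only (this is plain Spark API usage, not plugin code), a minimal sketch of that equivalence, assuming a local SparkSession; the column name `js` and the sample JSON are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{get_json_object, json_tuple}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("js")

// One json_tuple call extracting two fields...
df.select(json_tuple($"js", "a", "b")).show()

// ...behaves like two get_json_object calls with the matching '$.field' paths.
df.select(get_json_object($"js", "$.a"), get_json_object($"js", "$.b")).show()
```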

JsonToStructs and ScanJson

JsonToStructs and ScanJson share a common backend, and most bugs in one are reflected in the other, with the exception of a few issues that are specific to how JsonToStructs prepares its input so that the CUDF JSON parser can handle it. The sketch below shows the user-facing form of each.
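
A minimal sketch of the two forms, assuming a local SparkSession; the schema, sample data, and file path are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))

// ScanJson: reading JSON lines directly through the data source.
val scanned = spark.read.schema(schema).json("/tmp/input.json")  // illustrative path

// JsonToStructs: the from_json expression applied to an existing string column.
val parsed = Seq("""{"a": 1, "b": "x"}""").toDF("js")
  .select(from_json($"js", schema).as("parsed"))
```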

Medium Priority

JsonToStructs and ScanJson

Low Priority

JsonToStructs and ScanJson

Testing

JsonToStructs and ScanJson

@revans2 added the feature request, ? - Needs Triage, and SQL labels on May 28, 2020
@sameerz added the P2 (Not required for release) label and removed ? - Needs Triage on Oct 13, 2020
revans2 (Collaborator, Author) commented Nov 3, 2021

We have had a customer ask about this, so we might bump up the priority. I have been looking at JSON parsing and how it relates to the existing CSV parsing. Just like CSV parsing it is not great, but I think we could do with JSON what we want to do with CSV: parse all of the atomic types as strings and then handle casting/parsing them ourselves (see the sketch below). This will make the code a lot more robust in terms of Spark compatibility. But there are still a number of issues that we have to look into and probably address.
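
A minimal sketch of that approach, assuming a local SparkSession; the schema, target types, and file path are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Step 1: ask the reader to treat every atomic field as a string.
val stringSchema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", StringType)))
val raw = spark.read.schema(stringSchema).json("/tmp/input.json")  // illustrative path

// Step 2: do the casting/parsing ourselves, where we control Spark compatibility.
val typed = raw.select(
  raw("a").cast(IntegerType).as("a"),
  raw("b").cast(DoubleType).as("b"))
```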

  1. The current CUDF code does not support nested types for parsing JSON. [FEA] Support nested types in JSON reader rapidsai/cudf#8827 should fix that, but it is not done yet.
  2. The current CUDF code parses types how it wants to, not how Spark wants them to be parsed. Like I said above, we might be able to ask CUDF to read all of the types as strings and then parse them ourselves. This will not fix everything, because looking at cast there are still a number of types that we do not fully support when casting from a string. But it will be a step in the right direction and should let us avoid most of the "enable this type for JSON" configs. We could at a minimum reuse the cast-from-string configs that already exist.
  3. The "allowSingleQuotes" option. This is set to true by default, so Spark allows parsing values with single quotes instead of just double quotes. The CUDF code does not. So we probably want to file an issue with CUDF to ask for this, and in the short term document that we cannot support it.
  4. There are also a number of configs in Spark that we will need to play around with in CUDF, but I am fairly sure that if they are enabled we will have to fall back to the CPU, or at least document the inconsistencies really well. These include:
    a. "allowComments": C/C++ style comments // and /* */
    b. "allowUnquotedFieldNames"
    c. "allowNumericLeadingZeros": Here the parser strips the leading zeros for numeric JSON types instead of marking them as corrupt entries, so if I had something like {"a": 0001, "b": 01.1000000000}, both entries would cause the row to be marked as corrupt. Even if we provide a schema and ask for both a and b to be strings, they show up as nulls because the values are technically corrupt. If I set the option to true they are parsed, but the leading zeros (and, in the case of the float, trailing zeros) are stripped from the resulting string.
    d. "allowNonNumericNumbers": The docs say that this means INF, -INF, and NaN, but I could not make INF work, just -INF. Not sure how widely used this is.
    e. "allowBackslashEscapingAnyCharacter": Typically only a few chars in the JSON standard are allowed. CUDF only supports ", \, \t, \r, and \b. Not sure if there are others in JSON or not. Also not sure what happens in CUDF if others are encountered vs. in Spark.
    f. "allowUnquotedControlChars": The docs say ASCII characters with a value less than 32. My guess is CUDF just allows these all the time.
  5. The "parseMode" and "columnNameOfCorruptRecord" configs: parse mode probably needs to always be lenient, and we need to fall back to the CPU, like with CSV, if we ever see the columnNameOfCorruptRecord in the schema. We just don't support that, and I don't know what it would take for CUDF to be able to do it.
  6. "ignoreNullFields" and "dropFieldIfAllNull" are things we might be able to do with post processing, but we need to play around with it a little.
  7. If we see multiLine we have to fall back to the CPU. We just cannot support it.
  8. We need to look at encodings and see if we have to fall back to the CPU or not. I hope that we can normalize things as we read them in, like we do with the line endings in CSV, but I am not 100% sure on that.
  9. We need to check the zoneId like with CSV and fall back if it is not UTC.
  10. We need to check the date/timestamp formats too and see if there is anything in there that we cannot support.
  11. We also need to play around with JSON parsing in general and see what falls out. It looks like Spark is fairly strict in parsing, and there can be odd corner cases. For example, it looks like unquoted number values, if they are going to be interpreted as a string, are parsed as numbers first and then cast to a string, so 3.00 becomes 3.0 but "3.00" remains unchanged (see the sketch after this list). I don't see any way we can match this behavior and probably will just need to document it.
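
A minimal sketch of the corner case in item 11, plus one of the options from item 4, assuming a local SparkSession; the field names and sample values are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The unquoted 3.00 read as a string comes back as "3.0" (parsed as a number,
// then cast), while the quoted "3.00" is kept verbatim.
val schema = StructType(Seq(StructField("b", StringType)))
spark.read.schema(schema)
  .json(Seq("""{"b": 3.00}""", """{"b": "3.00"}""").toDS())
  .show()

// With the strict default (allowNumericLeadingZeros=false), {"a": 0001}
// marks the row as corrupt; setting the option to true lets it parse.
spark.read.option("allowNumericLeadingZeros", "true")
  .json(Seq("""{"a": 0001}""").toDS())
  .show()
```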

revans2 (Collaborator, Author) commented Nov 3, 2021

Oh, also a single line can contain multiple entries if the top level is an array.

[{"a": 1, "b": 1}, {"a": 2}]
[{"a": 1, "b": 2.0}]
{"a": 1, "b": "inf"}

produces

+---+----+
|  a|   b|
+---+----+
|  1|   1|
|  2|null|
|  1| 2.0|
|  1| inf|
+---+----+

But if there is extra JSON-like stuff at the end of the line, it is ignored.

{"a": 1, "b": 1}, {"other": 2}
{"a": 1, "b": "inf"} garbage
{"a": 1, "b": 1} {"more": 2}

produces the following with no errors, which feels really odd to me.

+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|inf|
|  1|  1|
+---+---+

There may be a lot of other odd cases that we need to look into.

revans2 (Collaborator, Author) commented Nov 3, 2021

We might want to look at some of the Spark JSON tests, but they are not that complete.

revans2 (Collaborator, Author) commented Nov 3, 2021

Oh, that is interesting. The parsing of the JSON keys is case sensitive, but auto-detection of the schema is not entirely, so you can get errors if you let Spark detect the schema and there are keys with different cases, i.e. A vs a. So we should test whether we can select the keys in a case-sensitive way.

Also, if there are duplicate keys in a record, for Spark it looks like the last one wins. Not sure what CUDF does in those cases.

{"a": 1, "b": 1}
{"a": 1, "a": 2}
{"a": 1, "b": 1}

produces

+---+----+
|  a|   b|
+---+----+
|  1|   1|
|  2|null|
|  1|   1|
+---+----+

with no errors

@revans2 added the ? - Needs Triage label on Nov 3, 2021
revans2 (Collaborator, Author) commented Nov 3, 2021

Added Needs Triage back on so we can look at this again because I think most of the analysis of this is done.

@Salonijain27 added the cudf_dependency and epic labels and removed ? - Needs Triage on Nov 16, 2021
wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Dec 13, 2023
log for dup scan issue

Signed-off-by: Firestarman <[email protected]>
@sameerz removed the P2 (Not required for release) label on Feb 27, 2024
@revans2 changed the title from [FEA] JSON input parsing to [FEA] JSON input support on Mar 13, 2024
binmahone added a commit to binmahone/spark-rapids that referenced this issue Jun 3, 2024