[FEA] JSON input support #9

Open · 8 of 62 tasks
revans2 opened this issue May 28, 2020 · 5 comments
Labels: cudf_dependency, epic, feature request, SQL

revans2 (Collaborator) commented May 28, 2020

Is your feature request related to a problem? Please describe.

High Priority

GetJsonObject/get_json_object

get_json_object is being tracked by a separate epic. One epic is in the spark-rapids repo and another is in spark-rapids-jni. The JNI version is trying to write a new parser from scratch to match what Spark is doing.

JsonTuple/json_tuple

For now it is implemented in terms of multiple calls to get_json_object (see the sketch below). This is likely to change in the future, but for now it should be tracked with get_json_object.
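
For illustration only (this is plain Spark API usage, not plugin code), a minimal sketch of that equivalence, assuming a local SparkSession; the column name `js` and the sample JSON are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{get_json_object, json_tuple}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("""{"a": 1, "b": "x"}""").toDF("js")

// One json_tuple call extracting two fields...
df.select(json_tuple($"js", "a", "b")).show()

// ...behaves like two get_json_object calls with the matching '$.field' paths.
df.select(get_json_object($"js", "$.a"), get_json_object($"js", "$.b")).show()
```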

JsonToStructs and ScanJson

JsonToStructs and ScanJson share a common backend, and most bugs in one are reflected in the other, with the exception of a few issues that are specific to how JsonToStructs prepares its input so that the CUDF JSON parser can handle it. The sketch below shows the user-facing form of each.
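
A minimal sketch of the two forms, assuming a local SparkSession; the schema, sample data, and file path are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType)))

// ScanJson: reading JSON lines directly through the data source.
val scanned = spark.read.schema(schema).json("/tmp/input.json")  // illustrative path

// JsonToStructs: the from_json expression applied to an existing string column.
val parsed = Seq("""{"a": 1, "b": "x"}""").toDF("js")
  .select(from_json($"js", schema).as("parsed"))
```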

Medium Priority

JsonToStructs and ScanJson

Low Priority

JsonToStructs and ScanJson

Testing

JsonToStructs and ScanJson

@revans2 added the feature request, ? - Needs Triage, and SQL labels on May 28, 2020
@sameerz added the P2 (Not required for release) label and removed ? - Needs Triage on Oct 13, 2020
revans2 (Collaborator, Author) commented Nov 3, 2021

We have had a customer ask about this, so we might bump up the priority. I have been looking at JSON parsing and how it relates to the existing CSV parsing. Just like CSV parsing it is not great, but I think we could do with JSON what we want to do with CSV: parse all of the atomic types as strings and then handle casting/parsing them ourselves (see the sketch below). This will make the code a lot more robust in terms of Spark compatibility. But there are still a number of issues that we have to look into and probably address.
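
A minimal sketch of that approach, assuming a local SparkSession; the schema, target types, and file path are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Step 1: ask the reader to treat every atomic field as a string.
val stringSchema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", StringType)))
val raw = spark.read.schema(stringSchema).json("/tmp/input.json")  // illustrative path

// Step 2: do the casting/parsing ourselves, where we control Spark compatibility.
val typed = raw.select(
  raw("a").cast(IntegerType).as("a"),
  raw("b").cast(DoubleType).as("b"))
```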

  1. The current CUDF code does not support nested types for parsing JSON. [FEA] Support nested types in JSON reader rapidsai/cudf#8827 should fix that, but it is not done yet.
  2. The current CUDF code parses types how it wants to, not how Spark wants them to be parsed. Like I said above, we might be able to ask CUDF to read all of the types as strings and then parse them ourselves. This will not fix everything, because looking at cast there are still a number of types that we do not fully support when casting from a string. But it will be a step in the right direction and should let us avoid most of the "enable this type for JSON" configs. We could at a minimum reuse the cast-from-string configs that already exist.
  3. The "allowSingleQuotes" option. This is set to true by default, so Spark allows parsing values with single quotes instead of just double quotes. The CUDF code does not. So we probably want to file an issue with CUDF to ask for this, and in the short term document that we cannot support it.
  4. There are also a number of configs in Spark that we will need to play around with in CUDF, but I am fairly sure that if they are enabled we will have to fall back to the CPU, or at least document the inconsistencies really well. These include:
    a. "allowComments": C/C++ style comments // and /* */
    b. "allowUnquotedFieldNames"
    c. "allowNumericLeadingZeros": Here the parser strips the leading zeros for numeric JSON types instead of marking them as corrupt entries, so if I had something like {"a": 0001, "b": 01.1000000000}, both entries would cause the row to be marked as corrupt. Even if we provide a schema and ask for both a and b to be strings, they show up as nulls because the values are technically corrupt. If I set the option to true they are parsed, but the leading zeros (and, in the case of the float, trailing zeros) are stripped from the resulting string.
    d. "allowNonNumericNumbers": The docs say that this means INF, -INF, and NaN, but I could not make INF work, just -INF. Not sure how widely used this is.
    e. "allowBackslashEscapingAnyCharacter": Typically only a few chars in the JSON standard are allowed. CUDF only supports ", \, \t, \r, and \b. Not sure if there are others in JSON or not. Also not sure what happens in CUDF if others are encountered vs. in Spark.
    f. "allowUnquotedControlChars": The docs say ASCII characters with a value less than 32. My guess is CUDF just allows these all the time.
  5. The "parseMode" and "columnNameOfCorruptRecord" configs: parse mode probably needs to always be lenient, and we need to fall back to the CPU, like with CSV, if we ever see the columnNameOfCorruptRecord in the schema. We just don't support that, and I don't know what it would take for CUDF to be able to do it.
  6. "ignoreNullFields" and "dropFieldIfAllNull" are things we might be able to do with post processing, but we need to play around with it a little.
  7. If we see multiLine we have to fall back to the CPU. We just cannot support it.
  8. We need to look at encodings and see if we have to fall back to the CPU or not. I hope that we can normalize things as we read them in, like we do with the line endings in CSV, but I am not 100% sure on that.
  9. We need to check the zoneId like with CSV and fall back if it is not UTC.
  10. We need to check the date/timestamp formats too and see if there is anything in there that we cannot support.
  11. We also need to play around with JSON parsing in general and see what falls out. It looks like Spark is fairly strict in parsing, and there can be odd corner cases. For example, it looks like unquoted number values, if they are going to be interpreted as a string, are parsed as numbers first and then cast to a string, so 3.00 becomes 3.0 but "3.00" remains unchanged (see the sketch after this list). I don't see any way we can match this behavior and probably will just need to document it.
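
A minimal sketch of the corner case in item 11, plus one of the options from item 4, assuming a local SparkSession; the field names and sample values are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The unquoted 3.00 read as a string comes back as "3.0" (parsed as a number,
// then cast), while the quoted "3.00" is kept verbatim.
val schema = StructType(Seq(StructField("b", StringType)))
spark.read.schema(schema)
  .json(Seq("""{"b": 3.00}""", """{"b": "3.00"}""").toDS())
  .show()

// With the strict default (allowNumericLeadingZeros=false), {"a": 0001}
// marks the row as corrupt; setting the option to true lets it parse.
spark.read.option("allowNumericLeadingZeros", "true")
  .json(Seq("""{"a": 0001}""").toDS())
  .show()
```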

revans2 (Collaborator, Author) commented Nov 3, 2021

Oh, also a single line can contain multiple entries if the top level is an array.

[{"a": 1, "b": 1}, {"a": 2}]
[{"a": 1, "b": 2.0}]
{"a": 1, "b": "inf"}

produces

+---+----+
|  a|   b|
+---+----+
|  1|   1|
|  2|null|
|  1| 2.0|
|  1| inf|
+---+----+

But if there is extra JSON-like stuff at the end of the line, it is ignored.

{"a": 1, "b": 1}, {"other": 2}
{"a": 1, "b": "inf"} garbage
{"a": 1, "b": 1} {"more": 2}

produces the following with no errors, which feels really odd to me.

+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|inf|
|  1|  1|
+---+---+

There may be a lot of other odd cases that we need to look into.

revans2 (Collaborator, Author) commented Nov 3, 2021

We might want to look at some of the Spark JSON tests, but they are not that complete.

revans2 (Collaborator, Author) commented Nov 3, 2021

Oh, that is interesting. The parsing of the JSON keys is case sensitive, but auto-detection of the schema is not entirely, so you can get errors if you let Spark detect the schema and there are keys with different cases, i.e. A vs a. So we should test whether we can select the keys in a case-sensitive way.

Also, if there are duplicate keys in a record, for Spark it looks like the last one wins. Not sure what CUDF does in those cases.

{"a": 1, "b": 1}
{"a": 1, "a": 2}
{"a": 1, "b": 1}

produces

+---+----+
|  a|   b|
+---+----+
|  1|   1|
|  2|null|
|  1|   1|
+---+----+

with no errors

@revans2 added the ? - Needs Triage label on Nov 3, 2021
revans2 (Collaborator, Author) commented Nov 3, 2021

Added Needs Triage back on so we can look at this again because I think most of the analysis of this is done.

@Salonijain27 added the cudf_dependency and epic labels and removed ? - Needs Triage on Nov 16, 2021
wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Dec 13, 2023
log for dup scan issue

Signed-off-by: Firestarman <[email protected]>
@sameerz removed the P2 (Not required for release) label on Feb 27, 2024
@revans2 changed the title from [FEA] JSON input parsing to [FEA] JSON input support on Mar 13, 2024
binmahone added a commit to binmahone/spark-rapids that referenced this issue Jun 3, 2024