[FEA] Write Spark compatible JSON validation function #1957

Open
revans2 opened this issue Apr 11, 2024 · 0 comments
revans2 commented Apr 11, 2024

Is your feature request related to a problem? Please describe.
This is related to rapidsai/cudf#15222

We need a way to validate that the input JSON data is valid according to the JSON parser that Spark uses.

Describe the solution you'd like
After some offline discussions with the CUDF team, it was decided that spark-rapids should do the initial validation work, and then we will look at moving it into CUDF depending on the needs of other customers.

The idea is to use the detail cudf JSON APIs to tokenize the input data. Once we have tokens for the data, we will write a custom kernel to validate them. We will only validate the things that CUDF does not already validate, so we don't need to worry about things like matching brackets or quotes. The output of this will be a boolean column that indicates whether each row needs to be marked as invalid.

After that, CUDF will finish processing the original tokens into a table, which we can then update with nulls to match the validation results before returning it. A rough sketch of the validation pass is shown below.
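
To make this a little more concrete, here is a minimal sketch of what the row-wise validation pass could look like. This is not the actual implementation: the token_kind enum, the per-row token offset layout, and check_token are hypothetical stand-ins for whatever the cudf tokenizer really exposes (its actual token stream uses structural PDA tokens and single byte offsets), and the real checks would be the Spark-specific rules listed below.

```cuda
#include <cstdint>

// Hypothetical token classification; the real cudf tokenizer has its own enum.
enum class token_kind : int8_t { STRING, FIELD_NAME, NUMBER, LITERAL, OTHER };

// Placeholder for the Spark-specific checks described in this issue
// (leading zeros, NaN/Infinity, raw control characters, ...).
__device__ bool check_token(token_kind kind, char const* begin, char const* end)
{
  return begin <= end;  // the real logic would switch on `kind` and scan the bytes
}

// One thread per row: walk that row's tokens and write true only if every
// token passes the extra checks that cudf does not already perform.
__global__ void validate_rows(char const* json,                  // raw input buffer
                              token_kind const* token_kinds,     // one entry per token
                              int64_t const* token_begin,        // token start offsets
                              int64_t const* token_end,          // token end offsets
                              int64_t const* row_token_offsets,  // num_rows + 1 entries
                              int64_t num_rows,
                              bool* row_is_valid)                // output, one per row
{
  for (int64_t row = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
       row < num_rows;
       row += static_cast<int64_t>(gridDim.x) * blockDim.x) {
    bool valid = true;
    for (int64_t t = row_token_offsets[row]; t < row_token_offsets[row + 1] && valid; ++t) {
      valid = check_token(token_kinds[t], json + token_begin[t], json + token_end[t]);
    }
    row_is_valid[row] = valid;
  }
}
```

The boolean output would then be used to null out the corresponding rows of the parsed table before it is returned, for example by turning it into a null mask (something along the lines of cudf::bools_to_mask).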

The following configs are things that Spark supports. Here is which ones we eventually need to support, and which ones we do not.

  • maxNestingDepth - this only exists in Spark 3.5.0 and later. We are not going to be able to support it, as CUDF itself has a nesting-depth limit of about 254. We will need to document that, and if the value is changed from the default to something larger than 254, then we need to fall back to the CPU. But that will not be done as a part of this.
  • maxNumLen - This is very low priority. It too was introduced in Spark 3.5.0. We should support it, but it can be done in a follow-on issue.
  • maxStringLen - Just like maxNumLen, this is very low priority and also showed up in Spark 3.5.0. We should support it, but it can be done as a follow-on issue.
  • allowComments - This would require changes to CUDF, or would have to be handled as a pre-processing step. I don't really want to try to support it, so we will fall back to the CPU if we see it set.
  • allowSingleQuotes - This is already covered by a pre-processing step before tokenization, so we don't need to worry about it.
  • allowUnquotedFieldNames - This would require changes to the CUDF parser, or another pre-processing step. Because it is off by default, I am fine if we don't support it.
  • allowNumericLeadingZeros - The official JSON spec says that leading zeros on a number are not allowed, i.e. 007 is invalid. This config relaxes that so leading zeros are allowed. It is not on by default for any JSON operation, so it could be done as a follow-on PR if needed.
  • allowNonNumericNumbers - The official JSON spec only allows numbers that match -?(?:0|[1-9]\d*)(?:\.\d+)?(?:[Ee][+-]?\d+)?. If this config is enabled, which it is by default for some operations, then we need to extend that to also accept NaN, +INF, -INF, +Infinity, Infinity, and -Infinity. (A rough sketch of this check, together with the control-character check below, is included after this list.)
  • allowBackslashEscapingAnyCharacter - This is off by default, so we don't need to worry about it. The regular CUDF tokenizer already validates escapes properly, so we are okay. Again, we don't want to make changes to CUDF unless we have to, and especially not as a part of this.
  • allowUnquotedControlChars - This is false by default, but CUDF happily allows it, so we just need to make sure that quoted strings do not contain the characters \x00-\x1f directly, unless this config is set to true.
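
As referenced in the allowNonNumericNumbers and allowUnquotedControlChars items above, here is a rough sketch of what the per-token checks for those rules might look like. The function names are made up for illustration, and the token byte ranges are assumed to come from the tokenizer; this is only meant to show the grammar being enforced, not the real code.

```cuda
// Minimal per-token checks, assuming [s, end) is the raw text of one token.

__device__ inline bool is_digit(char c) { return c >= '0' && c <= '9'; }

// Validate a number token against -?(?:0|[1-9]\d*)(?:\.\d+)?(?:[Ee][+-]?\d+)?,
// optionally relaxing the leading-zero rule (allowNumericLeadingZeros) and
// accepting the extra literals allowed by allowNonNumericNumbers.
__device__ bool is_valid_number(char const* s, char const* end,
                                bool allow_leading_zeros, bool allow_non_numeric)
{
  auto matches = [&](char const* lit) {
    char const* p = s;
    while (*lit != '\0' && p < end && *p == *lit) { ++p; ++lit; }
    return *lit == '\0' && p == end;
  };
  if (allow_non_numeric &&
      (matches("NaN") || matches("+INF") || matches("-INF") ||
       matches("+Infinity") || matches("Infinity") || matches("-Infinity"))) {
    return true;
  }
  char const* p = s;
  if (p < end && *p == '-') { ++p; }                // optional sign
  if (p >= end || !is_digit(*p)) { return false; }  // integer part is required
  if (*p == '0' && !allow_leading_zeros) {
    ++p;
    if (p < end && is_digit(*p)) { return false; }  // 007 is invalid by default
  } else {
    while (p < end && is_digit(*p)) { ++p; }
  }
  if (p < end && *p == '.') {                       // optional fraction
    ++p;
    if (p >= end || !is_digit(*p)) { return false; }
    while (p < end && is_digit(*p)) { ++p; }
  }
  if (p < end && (*p == 'e' || *p == 'E')) {        // optional exponent
    ++p;
    if (p < end && (*p == '+' || *p == '-')) { ++p; }
    if (p >= end || !is_digit(*p)) { return false; }
    while (p < end && is_digit(*p)) { ++p; }
  }
  return p == end;
}

// allowUnquotedControlChars == false: a quoted string token must not contain
// raw control characters \x00-\x1f (escaped forms like \n are fine).
__device__ bool has_raw_control_chars(char const* s, char const* end)
{
  for (char const* p = s; p < end; ++p) {
    if (static_cast<unsigned char>(*p) < 0x20) { return true; }
  }
  return false;
}
```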

Please note that a lot of this validation is done as a part of

https://github.com/NVIDIA/spark-rapids/blob/479b4a041b018ea516be02ceac2f65a65bcb1826/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonReadCommon.scala

But it is only partially done, and it operates on the columns that are returned from CUDF. So you can use it as a general guide, but not as an exact specification.

Describe alternatives you've considered
We could validate the data before it is sent to CUDF, but that would likely require a custom kernel based off of the get_json_object custom kernel, and I don't think that is what we want to do. Long term we want to share as much JSON parsing code as possible. Eventually we want to use the regular JSON tokenizer plus this validation logic to process the input JSON data, and then use a custom kernel to pull out only the parts of the JSON that match the desired pattern.
