Is your feature request related to a problem? Please describe.
This is related to rapidsai/cudf#15222
We need a way to check that the input JSON data is valid according to the JSON parser that Spark uses.
Describe the solution you'd like
After some offline discussions with the CUDF team it was decided that spark-rapids should do the initial validation work, and then we will look at moving it to CUDF depending on the needs of other customers/etc.
The idea is to use cudf's detail JSON APIs to tokenize the input data. Once we have tokens for the data we will write a custom kernel to validate it. We will only validate the things that CUDF does not already validate, so we don't need to worry about things like matching brackets or quotes. The output of this will be a boolean column that indicates whether the entire row needs to be marked as invalid or not.
After that, CUDF will finish processing the original tokens into a table, which we can then update with nulls to match the validation results and return.
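As a very rough sketch of how the pieces could fit together on the JVM side (the names here are placeholders, not existing APIs): the tokenize-and-validate step would live in a custom kernel, and the resulting boolean column would then be applied to the parsed table, assuming the cudf Java API's copyWithBooleanColumnAsValidity works on the columns we get back.

```scala
import ai.rapids.cudf.{ColumnVector, Table}

// Placeholder for the custom tokenize + validate kernel described above:
// it would consume the cudf JSON token stream for each input row and
// return a BOOL8 column with one entry per row (true == row is valid).
def validateJsonRows(jsonLines: ColumnVector): ColumnVector = ???

// Null out every column of the parsed table in rows that failed validation.
// Assumes copyWithBooleanColumnAsValidity is usable on the column types
// that come back from the JSON reader.
def applyRowValidity(parsed: Table, rowIsValid: ColumnVector): Table = {
  val cols = (0 until parsed.getNumberOfColumns).map { i =>
    parsed.getColumn(i).copyWithBooleanColumnAsValidity(rowIsValid)
  }
  try {
    new Table(cols: _*)
  } finally {
    cols.foreach(_.close())
  }
}
```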
The following configs are the ones that Spark supports, along with which of them we eventually need to support and which we do not.
maxNestingDepth - this only exists as of Spark 3.5.0. We are not going to be able to support it, as CUDF itself has a limit of about 254. We will need to document that, and if the value is changed from the default and is larger than 254, then we need to fall back to the CPU. But that will not be done as a part of this.
maxNumLen - This is very low priority. It too was introduced in Spark 3.5.0. We should support it, but it can be done in a follow-on issue.
maxStringLen - Just like maxNumLen this is very low priority, and it also showed up in Spark 3.5.0. We should support it, but it can be done as a follow-on issue.
allowComments - This would require changes to CUDF, or would have to be done as a pre-processing step. I don't really want to try to support it, so we will fall back to the CPU if we see it set.
allowSingleQuotes - This is already covered as a pre-processing step before tokenization so we don't need to worry about it.
allowUnquotedFieldNames - This would require changes to the CUDF parser, or another pre-processing step, to support. Because it is off by default I am fine if we don't support it.
allowNumericLeadingZeros - The official JSON spec says that leading zeros on a number are not allowed, e.g. 007 is invalid. This config changes that so leading zeros are allowed. It is not on by default for any JSON operation, so it could be done as a follow-on PR if needed.
allowNonNumericNumbers - The official JSON spec only allows numbers that match -? (?: 0|[1-9]\d*) (?: \.\d+)? (?: [Ee] [+-]? \d++)?. If this config is enabled, which it is by default for some operations, then we need to extend this to also include NaN, +INF, -INF, +Infinity, Infinity, and -Infinity (a rough sketch of this check, and of the leading-zero case above, is included after this list).
allowBackslashEscapingAnyCharacter - this is off by default so we don't need to worry about it. The regular CUDF tokenizer already validates this properly, so we are okay. Again we don't want to make changes to CUDF unless we have to, and especially not as a part of this.
allowUnquotedControlChars - This is false by default, but CUDF happily allows unquoted control characters, so we just need to make sure that quoted strings do not include the characters \x00-\x1f directly in them unless this config is set to true (also covered in the sketch below).
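To make the last few items a bit more concrete, here is a minimal host-side sketch of the per-value checks. The real checks would run inside the custom kernel over the token stream; the relaxed leading-zero pattern and the helper names here are illustrative assumptions, not existing code.

```scala
object JsonValueChecks {
  // Numbers allowed by the official JSON spec (see the regex quoted above).
  private val strictNumber = """-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[Ee][+-]?\d+)?""".r
  // One possible relaxation for allowNumericLeadingZeros (e.g. 007).
  private val leadingZeroNumber = """-?\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?""".r
  // Extra literals accepted when allowNonNumericNumbers is enabled.
  private val nonNumericLiterals =
    Set("NaN", "+INF", "-INF", "+Infinity", "Infinity", "-Infinity")

  def isValidNumber(token: String,
      allowLeadingZeros: Boolean,
      allowNonNumeric: Boolean): Boolean = {
    val pattern = if (allowLeadingZeros) leadingZeroNumber else strictNumber
    pattern.pattern.matcher(token).matches() ||
      (allowNonNumeric && nonNumericLiterals.contains(token))
  }

  // allowUnquotedControlChars: the contents of a quoted string may not contain
  // raw \x00-\x1f unless the config is enabled, since CUDF itself does not
  // reject them.
  def isValidString(quotedContents: String, allowUnquotedControlChars: Boolean): Boolean =
    allowUnquotedControlChars || quotedContents.forall(_ >= 0x20)
}
```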
Please note that a lot of this validation is done as a part of https://github.com/NVIDIA/spark-rapids/blob/479b4a041b018ea516be02ceac2f65a65bcb1826/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonReadCommon.scala, but it is only partially done on the columns that are returned from CUDF. So you can use it as a general guide, but not as something specific.
Describe alternatives you've considered
We could validate the data before it is sent to CUDF, but that is likely to require a custom kernel based off of the get_json_object custom kernel, and I don't think that is what we want to do. Long term we want to have as much common code for JSON parsing as possible. Eventually we want to use the regular JSON tokenizer plus this validation logic to process the input JSON data, and then use a custom kernel to pull out only the parts of the JSON that match the desired pattern.