[BUG] JsonToStructs and ScanJson should return null for non-numeric, non-boolean non-quoted strings #10479
Labels
bug
Something isn't working
cudf_dependency
An issue or PR with this label depends on a new feature in cudf
Describe the bug
The way that from_json and the JSON scan work is that they will try to parse a number/boolean first, and then if that works out the result is returned as a string. This is also related to validation. If a single column is an invalid unquoted string, then the entire row needs to be invalidated.
In this case we are looking at unquoted values. In JSON, the only boolean values allowed are true and false. They are case sensitive, so TRUE and FALSE are not valid. Numbers have to look like the desired number type or they are not valid: 1.0 is not a valid int, as in #10469. Note that 1,000 is invalid in all cases for numbers, unless it is in a quoted string and is being read as a decimal value (#10470).
Things get a little complicated because this is different for GetJsonObject or JsonTuple, where everything that is valid is returned as a string. Note that I said "is valid": TRUE is not a valid unquoted value, and it too would result in the entire line for GetJsonObject or JsonTuple being returned as null.
I think to make this work we are either going to need some help from CUDF to have better validation, or we are going to need complicated post processing by enabling CUDF to return quoted strings. I think the latter is going to give us the most flexibility, and then we can come back to CUDF and figure out how to make it work more efficiently.
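To make the rules above concrete, here is a rough sketch of the kind of check the post processing would need to do on unquoted values. It is not the plugin's or CUDF's implementation. The names classify_unquoted and row_is_valid are hypothetical, and the number patterns follow the strict JSON grammar, which may not match every Spark parser option.

```python
import re

# Strict JSON number grammar (assumption: no Spark leniency options are enabled).
_INT_RE = re.compile(r"-?(0|[1-9][0-9]*)")
_NUM_RE = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def classify_unquoted(token):
    """Classify an unquoted JSON token (hypothetical helper).

    Returns 'null', 'boolean', 'integer', or 'float', or None when the
    token is not a valid unquoted JSON value.
    """
    if token == "null":
        return "null"
    # Booleans are case sensitive: TRUE and FALSE fall through to None.
    if token in ("true", "false"):
        return "boolean"
    if _INT_RE.fullmatch(token):
        return "integer"
    # 1.0 lands here, so it is a number but not a valid int.
    if _NUM_RE.fullmatch(token):
        return "float"
    # Anything else, such as 1,000 or TRUE, is invalid.
    return None


def row_is_valid(unquoted_tokens):
    """A single invalid unquoted value invalidates the entire row."""
    return all(classify_unquoted(t) is not None for t in unquoted_tokens)
```

With this sketch, a row containing the unquoted values 1 and TRUE would be rejected as a whole, while a caller reading an int column would additionally reject 1.0 because it classifies as a float rather than an integer.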