[BUG] JsonToStructs and ScanJson should return null for non-numeric, non-boolean non-quoted strings #10479

Closed
revans2 opened this issue Feb 23, 2024 · 2 comments · Fixed by #11464
Assignees
Labels
bug Something isn't working cudf_dependency An issue or PR with this label depends on a new feature in cudf

Comments

@revans2
Collaborator

revans2 commented Feb 23, 2024

Describe the bug
The way that from_json and the JSON scan work is that they will try to parse a number/boolean first, and then if that works out the result is returned as a string. This is also related to validation. If a single column is an invalid unquoted string, then the entire row needs to be invalidated.

In this case we are looking at unquoted values. In JSON for boolean values only true and false are allowed. They are case sensitive so TRUE and FALSE are not valid. Numbers have to look like the desired number type or they are not valid. 1.0 is not a valid int like with #10469. Note that 1,000 is invalid in all cases for numbers, unless it is in a quoted string and is being read as a decimal value #10470.
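The rules above can be sketched as a small validator. This is only an illustration in Python, assuming the strict JSON grammar for numbers; the function name `is_valid_unquoted` and the target names are hypothetical, and Spark's real parser has options that may accept or reject more than this does.

```python
import re

# Strict JSON grammar for unquoted numeric tokens (assumption: no Spark
# parser options such as leading zeros or NaN/Infinity are enabled).
_INT = re.compile(r"-?(0|[1-9][0-9]*)")
_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def is_valid_unquoted(token: str, target: str) -> bool:
    """Return True if an unquoted JSON token is valid for the target type.

    `target` is a hypothetical label: "boolean", "int", or "number".
    """
    if target == "boolean":
        # Case sensitive: only the exact literals are allowed.
        return token in ("true", "false")
    if target == "int":
        # "1.0" is rejected here because it does not look like an int.
        return _INT.fullmatch(token) is not None
    if target == "number":
        # "1,000" is rejected: a comma is never part of an unquoted number.
        return _NUMBER.fullmatch(token) is not None
    raise ValueError(f"unknown target type: {target}")
```

With this sketch, `TRUE` fails as a boolean, `1.0` fails as an int but passes as a number, and `1,000` fails for every numeric target.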

Things get to be a little complicated because this is different for GetJsonObject or JsonTuple where everything that is valid is returned as a string. Note that I said is valid. TRUE is not a valid unquoted value, and it too would result in the entire line for GetJsonObject or JsonTuple being returned as null.

I think to make this work we are either going to need some help from CUDF to have better validation, or we are going to need complicated post-processing by enabling CUDF to return quoted strings. I think the latter is going to give us the most flexibility, and then we can come back to CUDF and figure out how to make it work more efficiently.

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Feb 23, 2024
@revans2 revans2 mentioned this issue Feb 26, 2024
62 tasks
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Feb 27, 2024
@revans2
Collaborator Author

revans2 commented Mar 14, 2024

This is mostly fixed, but if we try to read the data as a string, then it is not validated, it is just returned as a string.

I see this as a subset of rapidsai/cudf#15222

We can probably still fix it in our code for non-nested data, but it means we will have to run a regular expression over all of the returned string output, and ultimately we really should have CUDF do the validation everywhere if we want it to be right.
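As a rough illustration of that regular-expression pass, here is a Python sketch over a flat list of values. It assumes the reader hands back the raw token text with the quotes still attached to quoted strings, which is the post-processing option discussed earlier; the function `null_out_invalid` and the pattern are hypothetical and are not the plugin's or CUDF's code. It also only nulls the single value, whereas the real fix would need to invalidate the whole row.

```python
import re
from typing import List, Optional

# One pattern for every scalar token this sketch accepts when reading
# as a string: a quoted string, a literal, or a strict JSON number.
_VALID_SCALAR = re.compile(
    r'"(?:[^"\\]|\\.)*"'                                        # quoted string
    r'|true|false|null'                                         # literals
    r'|-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?'    # number
)


def null_out_invalid(raw_values: List[Optional[str]]) -> List[Optional[str]]:
    """Replace invalid unquoted tokens with None; keep everything else as text."""
    out: List[Optional[str]] = []
    for raw in raw_values:
        if raw is None or _VALID_SCALAR.fullmatch(raw) is None:
            out.append(None)        # invalid unquoted value such as TRUE or 1,000
        elif raw == "null":
            out.append(None)        # JSON null stays null
        elif raw.startswith('"'):
            out.append(raw[1:-1])   # strip the quotes; escapes are not decoded here
        else:
            out.append(raw)         # valid number or boolean, returned as a string
    return out
```

For example, `null_out_invalid(['"abc"', 'TRUE', '1,000', '12'])` keeps `abc` and `12` and turns the other two values into nulls.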

@revans2 revans2 added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Mar 14, 2024
@revans2 revans2 changed the title [BUG] StructsToJson and ScanJson should return null for non-numeric, non-boolean non-quoted strings [BUG] JsonToStructs and ScanJson should return null for non-numeric, non-boolean non-quoted strings Mar 15, 2024
@revans2 revans2 self-assigned this Jun 24, 2024
@revans2
Collaborator Author

revans2 commented Jun 24, 2024

This should be addressed as a part of CUDF validation.
