Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes stack context for json lines format that recovers from invalid JSON lines #14309

Merged

Conversation

elstehle
Copy link
Contributor

Description

Addresses #14282.

For the JSON lines format that recovers after an invalid JSON line, we've had two issues when we were generating the stack context that is used downstream in the full JSON pushdown transducer.

For that format, we need to make sure that we "reset" the stack context after each JSON line. That is,

  1. We need to reset the stack to the empty stack after each JSON line, as the stack may not be empty after an erroneous JSON line. E.g. {"this opening brace is never closed":123\n{"<=this brace should be on the empty stack":...}
  2. We need to reset that we are outside of a string: {"no matching end-quote on this line\n{"<=this quote is the beginning of a field name, not the end of the previous line's field name"

This fixes above requirements as follows:

  1. Was already implemented - but with an inappropriate scan operator that is not associative:
StackLevelT new_level = (symbol_to_stack_op_type(rhs.value) == stack_op_type::RESET)
                              ? 0
                              : (lhs.stack_level + rhs.stack_level);

E.g. ({,n,{,},n,{,{,n,},},n,} all fail the associativity test). This was replaced with a ScanByKey that would start with a "fresh" stack level with each new key segment.

  1. Was addressed by changing the transition table of the finite-state transducer that filters out brackets and braces that are enclosed in quotes to go back to the OOS (outside-of-string) state after every newline. This behaviour requires that every newline character is treated as a delimiter of a JSON line. This was confirmed by Spark Rapids, who is the targeted user for the recovery option to be the case.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested a review from a team as a code owner October 22, 2023 15:54
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 22, 2023
@elstehle elstehle added bug Something isn't working 3 - Ready for Review Ready for review by team cuIO cuIO issue non-breaking Non-breaking change labels Oct 22, 2023
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. just one question to confirm my understanding

cpp/src/io/fst/logical_stack.cuh Show resolved Hide resolved
@elstehle elstehle changed the title Fixes stack context for json lines format that recovers from invalid JSON lines Fixes stack context for json lines format that recovers from invalid JSON lines Oct 26, 2023
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@elstehle
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 2abf9a6 into rapidsai:branch-23.12 Oct 31, 2023
57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants