Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement re-lexing logic for improved error recovery #11640

Closed
dhruvmanila opened this issue May 31, 2024 · 0 comments · Fixed by #11845
Closed

Implement re-lexing logic for improved error recovery #11640

dhruvmanila opened this issue May 31, 2024 · 0 comments · Fixed by #11845
Assignees
Labels
parser Related to the parser

Comments

@dhruvmanila
Copy link
Member

#11457 builds up the necessary infrastructure to completely synchronize the lexer and parser. The final step is to implement re-lexing logic in the lexer and update the parser to use it during error recovery for list parsing.

Internal document: https://www.notion.so/astral-sh/Lexer-Parser-feedback-loop-dcd653ea94d64629a388530f13393424

@dhruvmanila dhruvmanila added the parser Related to the parser label May 31, 2024
@dhruvmanila dhruvmanila self-assigned this May 31, 2024
dhruvmanila added a commit that referenced this issue Jun 17, 2024
## Summary

This PR implements the re-lexing logic in the parser.

This logic is only applied when recovering from an error during list
parsing. The logic is as follows:
1. During list parsing, if an unexpected token is encountered and it
detects that an outer context can understand it and thus recover from
it, it invokes the re-lexing logic in the lexer
2. This logic first checks if the lexer is in a parenthesized context
and returns if it's not. Thus, the logic is a no-op if the lexer isn't
in a parenthesized context
3. It then reduces the nesting level by 1. It shouldn't reset it to 0
because otherwise the recovery from nested list parsing will be
incorrect
4. Then, it tries to find last newline character going backwards from
the current position of the lexer. This avoids any whitespaces but if it
encounters any character other than newline or whitespace, it aborts.
5. Now, if there's a newline character, then it needs to be re-lexed in
a logical context which means that the lexer needs to emit it as a
`Newline` token instead of `NonLogicalNewline`.
6. If the re-lexing gives a different token than the current one, the
token source needs to update it's token collection to remove all the
tokens which comes after the new current position.

It turns out that the list parsing isn't that happy with the results so
it requires some re-arranging such that the following two errors are
raised correctly:
1. Expected comma
2. Recovery context error

For (1), the following scenarios needs to be considered:
* Missing comma between two elements
* Half parsed element because the grammar doesn't allow it (for example,
named expressions)

For (2), the following scenarios needs to be considered:
1. If the parser is at a comma which means that there's a missing
element otherwise the comma would've been consumed by the first `eat`
call above. And, the parser doesn't take the re-lexing route on a comma
token.
2. If it's the first element and the current token is not a comma which
means that it's an invalid element.

resolves: #11640 

## Test Plan

- [x] Update existing test snapshots and validate them
- [x] Add additional test cases specific to the re-lexing logic and
validate the snapshots
- [x] Run the fuzzer on 3000+ valid inputs
- [x] Run the fuzzer on invalid inputs
- [x] Run the parser on various open source projects
- [x] Make sure the ecosystem changes are none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parser Related to the parser
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant