data: make a full Python parser for python grammar #11
Conversation
Force-pushed from 79ecc1d to af4d8d7.
I am making progress, but I have a couple of questions. First, I had to include the fix to make the parser return a single item rather than a list of a single item. It is the first commit; I did not update the tests, though, and I will try to make a separate PR for that. Second, I am unsure how to make pegen happy regarding the start rule. It was manually set to file before, but that does not seem completely correct; what would be the right way to do this? Finally, I would be interested in any strategy to test the parser and, next, in how to properly handle line information.
For testing the parser, the best way is to throw a lot of Python files at it (we used to use the whole stdlib or the top 5000 PyPI packages for this) and potentially compare the output (or just check whether it crashes). If you plan to generate exactly the same AST, then it makes sense to compare the output with the built-in parser.
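For what it's worth, a minimal sketch of such a comparison harness could look like the following (assuming a CPython checkout with a `Lib/` directory nearby; `ast.parse` stands in here for the generated parser's entry point, which is not named in this PR):

```python
import ast
import pathlib

def matches_builtin(parse, path):
    """Return True if `parse` produces the same AST dump as the built-in parser."""
    try:
        source = path.read_text(encoding="utf-8")
        return ast.dump(parse(source)) == ast.dump(ast.parse(source))
    except (SyntaxError, UnicodeDecodeError):
        return True  # skip files the reference parser itself rejects

# Swap ast.parse for the generated parser's entry point to run the real comparison.
files = pathlib.Path("Lib").rglob("*.py")
failures = [p for p in files if not matches_builtin(ast.parse, p)]
print(f"{len(failures)} files differ from the built-in parser")
```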
Not sure I fully understand the problem. I need some time to investigate exactly what's going on but it would be helpful if you could elaborate a bit more and maybe provide some examples :)
Could you also elaborate on this? Currently, I am quite overburdened with work on CPython, so apologies if it takes me a bit longer than usual to respond here :)
I will look into how to do that then.
Pegen requires a start rule (or trailers). The Python grammar has multiple entry points (file, interactive, eval). Looking at the original C grammar, I realize now that the start rule is dynamically set in that case. I am not sure what would be the best way to emulate that, but for testing I can simply keep
I was referring to the point discussed in #6 and for which I opened #14.
I have reached a somewhat puzzling point: when trying to parse

```python
def a():
    pass
a: int
a = 1
a += 1
a *= 1
a /= 1
a //= 1
a %= 1
a + 1
a * 1
a / 1
a // 1
a % 1
a < 1
a > 1
a <= 1
a >= 1
a == 1
a != 1
a = 2
a << 1
a >> 1
a ^ 1
a ^= 1
a = []
a = {}
```

the ast test, which just checks the str of the tree, complains about the context of some Name nodes being explicitly set to Load. I will look into making the ast comparison more robust at some point. Any help would be very much appreciated.
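For context, the mismatch is easy to see with the built-in parser alone: `ast.dump()` includes the expression context of every `Name` node, so a string comparison is only as strict as the contexts we set. A quick illustration (nothing PR-specific):

```python
import ast

# ast.dump() exposes the ctx of every Name node, so a tree built with a
# missing or wrong ctx will not match the built-in parser's output when
# compared as strings, even if everything else is identical.
print(ast.dump(ast.parse("a = 1")))  # ...Name(id='a', ctx=Store())...
print(ast.dump(ast.parse("a + 1")))  # ...Name(id='a', ctx=Load())...
```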
At some point in time, we added hard keywords to the C parser, which means that keywords do not get parsed as NAMEs but are themselves tokens. That's probably what's creating the problem: we have not implemented hard keywords for the python generator, so keywords there still come through as plain NAME tokens. Does that make sense or have I missed something? If you want to have a better look at how keywords are parsed in the C parser, have a look at the keyword list and how it gets generated.
Thanks a lot @lysnikolaou! Yes, that makes sense. Trying to browse through the C implementation, am I right to assume that the key point is https://github.com/python/cpython/blob/3bb19873abd572879cc9a8810b1db9db1f704070/Parser/pegen.c#L629, which is in charge of transforming the tokens returned by the tokenizer and turning bare NAME tokens into the appropriate keyword token type?
Exactly, yes! More specifically the function call to
Thanks for the confirmation.
I now have a better idea of how those things are connected, but I still have a couple of questions:
With pattern matching Python has
Yup. Check for instance https://github.com/python/cpython/blob/master/Grammar/python.gram#L212
Yeah,
They should be extracted from the grammar.
I started from the
Thanks
My concern is that the token type needs to be converted from NAME to a dedicated type, as is done in _PyPegen_fill_token (along with other operations), and that feels like something that should be done inside the tokenizer rather than the parser; hence my idea to provide a way to do it on the tokenizer wrapper, since we are already adding the ability to peek to the tokenizer. Ideally I would like to document the requirements on the tokenizer well enough that people can use a different one.
Thanks
Not doing this in the tokenizer was an explicit design decision, though. While your feeling makes sense (we had this discussion when doing this), the general rule is that the tokenizer should know nothing about reserved keywords. Its only job is to identify the different "sentence parts", without knowing what they mean in the context in which they're parsed. For example,
It is up to the parser to identify that
Thanks for the clarification @lysnikolaou. Happy to know that my feeling was not completely off. I will see how to implement the logic that way in the python parser/generator then. Thanks for your advice.
Good news: I can now parse import statements! However, I would appreciate a quick review of the commit adding support for keywords, because it is not without issues. Basically I followed the idea of the C generator and assigned new integers as token types. Even though I agree the check conceptually belongs in the parser, I ended up doing it inside the tokenizer wrapper, since that seems the most efficient place (the transformation is done once and only once). The keywords and soft keywords are collected from the grammar, but due to how they are collected they are only listed at the very end of the generated parser. One issue I have is that printing anything involving a TokenInfo fails because the type is unknown. Since we are basically doing string matching, I could use the OP type and still get the proper matching, but that does not seem ideal. I am open to suggestions. When I get more time I will go back to testing the parser more extensively.
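To make the trade-off concrete, here is a rough, purely illustrative sketch of that tokenizer-wrapper approach (not the code in this PR, and the keyword type values are made up; the PR later moves to a different approach, see the next comments):

```python
import io
import keyword
import token
import tokenize

# Hypothetical mapping: give each keyword its own integer type beyond the
# standard token range (names and values here are illustrative only).
KEYWORD_TYPES = {kw: token.N_TOKENS + 1 + i for i, kw in enumerate(keyword.kwlist)}

def remap_keywords(readline):
    """Turn NAME tokens whose string is a reserved keyword into dedicated types."""
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME and tok.string in KEYWORD_TYPES:
            tok = tok._replace(type=KEYWORD_TYPES[tok.string])
        yield tok

for tok in remap_keywords(io.StringIO("import sys\n").readline):
    # Note: repr(tok) would raise KeyError for remapped tokens, because
    # tokenize.tok_name does not know the new types -- the printing issue
    # mentioned above.
    print(tok.type, repr(tok.string))
```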
Force-pushed from bc483dc to 8efec0e.
I found a more elegant way to handle keywords by using the NAME matching rule. This way there is no need to touch the tokenizer, and no modification of the tokens that makes them non-printable. Since, to achieve this, I had to make a number of changes not directly related to adding a full-blown Python grammar to the project, I would like to know if you want me to split that part off and make a different PR for the Python parser. Regarding testing for the Python parser, should I focus on updating the scripts under scripts, or should I write more systematic tests under tests (maybe tests/python)?
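For readers following along, the NAME-based idea boils down to something like this sketch (the class and function names are illustrative, not the actual API of this repository's tokenizer wrapper):

```python
import io
import tokenize

class PeekableTokens:
    """Illustrative stand-in for a peekable tokenizer wrapper."""
    def __init__(self, source):
        self._tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
        self._pos = 0

    def peek(self):
        return self._tokens[self._pos]

    def advance(self):
        tok = self._tokens[self._pos]
        self._pos += 1
        return tok

def expect_keyword(tokens, kw):
    """Match a keyword as an ordinary NAME token whose string equals the keyword."""
    tok = tokens.peek()
    if tok.type == tokenize.NAME and tok.string == kw:
        return tokens.advance()
    return None

tokens = PeekableTokens("import sys\n")
print(expect_keyword(tokens, "import"))  # TokenInfo for the NAME 'import'
```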
Happy to hear that.
Truth is that would help a bunch with the review. If that's not too much work for you, I'd be really happy to review a PR with only the changes for keyword support.
I guess the answer is both. For now though, let's just focus on writing as many tests as possible for the constructs the parser can parse. For example, let's write tests that the output AST is correct for parsing import statements, and then incrementally build more tests as we support more and more things. When we've got a first working version for all of the grammar, we can see if it makes sense to modify and run something under
Ok, then I will try to make a new PR with the changes needed to support keywords, plus other minor fixes (such as not creating a method called async!), each with its own dedicated commit. And I will keep working on adding tests once I am done. Currently the one very broken thing is the handling of strings; I need to look into it. I may try a trick similar to https://github.com/nucleic/enaml/blob/main/enaml/core/parser/parser36.py#L98, since we are parsing Python in Python and I do not fancy spending too much time on parsing f-strings (but I could be convinced otherwise).
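The trick referenced there amounts to something like the following sketch (illustrative only, not the enaml code nor this PR's implementation): hand the raw literal back to the built-in parser instead of implementing (f-)string parsing by hand.

```python
import ast

def parse_string_literal(literal_source):
    """Reuse the built-in parser for a (f-)string literal and splice in the result."""
    # Constant for plain strings, JoinedStr for f-strings.
    return ast.parse(literal_source, mode="eval").body

print(ast.dump(parse_string_literal("f'{x!r:>10}'")))
```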
See #16 for a smaller PR for keyword support.
I made some more progress (you can check the files under test/data to get an idea of what we can currently parse successfully). Right now I have some issues with parsing
After adding support for the Forced node in the grammar definition and making some changes found in the current Python grammar, I can now parse the test files in the test directory and parse the source code of pegen itself! I will try to add more tests and run on some popular PyPI packages to get a better idea of how robust things are. Next I will try to add pattern matching and line numbers in the AST. Regarding this last point, I am interested in any suggestion to make the process less verbose, and in ways to test the output, since the repr of the AST will not be enough anymore; I imagine I will need tests that walk the AST in that case.
Note that black dislikes test files with Python 3.9+ only syntax. Should we exclude those files or bump the Python version running black?
Sorry for not participating here these days, but I am very swamped with the beta release of Python 3.10. I just wanted to say that this effort could actually be very useful to black in the future, as we could add actions that create a CFG instead of an AST, allowing black to support Python 3.10+ syntax. Thanks for all the patience @MatthieuDartiailh! Your work is great! Unfortunately, both @lysnikolaou and I are quite busy, so it will take us a bit longer to review 😅
No problem. I had to work on other projects impacted by changes in the latest alpha for a while, so I have made only slow progress.
I made some extra progress. I corrected a bunch of issues on async and decorators (I included a version check for the pre-3.9 syntax). The python.gram based parser can now parse the 100 PyPI packages tested by the included script! If you are aware of additional corner cases worth testing, let me know. As a side note, for async I had to revert to identifying async and await as keywords, since tokenize sends back NAME tokens, not ASYNC tokens. So I am left with pattern matching and line numbers to support. For the former, where can I find test files? I assume some exist within the CPython repo, but I could not find them. For line numbers, I am open to any suggestion on making it efficient and on how to best test for correctness.
The file that we use for
Thanks for the pointer @isidentical, I will try to have a look.
I mean, I'm not sure what you mean by "valid", but two NEWLINEs are absolutely possible; we cannot filter those. IIRC
I meant two NEWLINE tokens (which are supposed to have meaning, contrary to NL tokens). I will try to poke around more to see if I can find more odd cases and a rule to handle them.
I just replaced
Playing a bit more with that, it appears that the error only shows up when there is a comment at the end of the file without a blank line after it. In this case the tokenizer generates COMMENT, NL, NEWLINE, ENDMARKER, while in the presence of a trailing newline it generates COMMENT, NL, ENDMARKER. If any DEDENT tokens are pending, they are inserted before ENDMARKER but after the NEWLINE. So either the tokenizer wrapper needs to swallow that spurious NEWLINE, or we need [NEWLINE] in both the file rule and the block rule, since the spurious newline appears before the DEDENT tokens. Any preference @pablogsal? Should we also open a bug report on the Python bug tracker for this? It does not look like desirable behavior to me, but I may be missing something.
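A minimal way to observe the token stream described above (the exact output depends on the Python version, since this behavior was later changed upstream):

```python
import io
import tokenize

# A file ending with a comment and no trailing newline.
source = "pass\n# trailing comment"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```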
I think this looks like a tokenizer bug. I would recommend opening an issue so other folks can corroborate this or mention if this change is going to break anything for them.
I opened the issue some time ago (https://bugs.python.org/issue44667) but it did not receive any comment. I will try to make a PR this week. In the meantime, do you have any opinion @pablogsal @lysnikolaou on how to handle the issue for older versions (in the tokenizer or as extra rules)? I personally prefer the second option, to keep the grammar as close as possible to the existing one. This is the last blocker, and once it is resolved this PR can be merged; I will hence remove the WIP marker.
@MatthieuDartiailh I agree that keeping the grammar as close as possible is probably the best course of action. I will try to investigate the bpo issue you opened after I release the first release candidate for 3.10 next week. Otherwise, I would prefer to land this as soon as possible. Do you want to tackle the tokenizer issue in this PR or in another one?
Also, @lysnikolaou, if you have some spare minutes, it would be great if you could also give the PR a quick look.
Yup! I'll make some time to give it a more thorough look tomorrow evening or Sunday morning.
In this case, the Python tokenizer generates NEWLINE, COMMENT, NL, NEWLINE, but two consecutive NEWLINE tokens with nothing other than comments and NL tokens between them are never valid, so we discard the second NEWLINE.
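A sketch of that rule as a token filter (illustrative only, not the exact code of the fix in the tokenizer wrapper):

```python
import tokenize

def drop_spurious_newline(tokens):
    """Drop a NEWLINE that follows another NEWLINE with only COMMENT/NL in between."""
    last_significant = None
    for tok in tokens:
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            yield tok
            continue
        if tok.type == tokenize.NEWLINE and last_significant == tokenize.NEWLINE:
            continue  # discard the spurious second NEWLINE
        last_significant = tok.type
        yield tok
```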
I just pushed a fix for the tokenizer. Let me know what you think. I may not find time to backport your last 4 CPython commits today, @pablogsal, but I will do my best. Feel free to merge without those.
What 4 commits are you referring to?
I have tried to apply every fix you made to the CPython grammar file to the grammar file of this PR. The 4 commits I am referring to are:
There is only one PR you are probably interested in, which is a reordering of the rules for readability. The rules are the same; it's just that they are grouped and commented, so it is not a gigantic blob of text. The other PRs add syntax errors for print statements, so it is basically a single new rule.
I will give some time for other folks to review and I will land afterwards. Then I will work on making a release.
@pablogsal Should I make a new PR to address those? I have the changes ready but I need to add a couple of tests.
As you prefer. I would recommend at least doing python/cpython@302cf35.
I did the reorganization and added the new invalid rule, but no test for it yet. Will do if I find time before this PR is merged.
Such great work! Thank you so much for going through with this, @MatthieuDartiailh! I'll leave merging to Pablo, since he did all the heavy-lifting of reviewing this.
Thanks @lysnikolaou!
Fantastic! Thanks a lot @MatthieuDartiailh for the hard work and the patience 🎉 🎉 🎉 I will do a release after I manage to get 3.10.0 rc1 out today 😉 By the way, I fixed and backported https://bugs.python.org/issue44667
Nice, thanks everyone for your work on this!