
data: make a full Python parser for python grammar #11

Merged
61 commits merged into we-like-parsers:main on Aug 2, 2021

Conversation

MatthieuDartiailh
Collaborator

No description provided.

@MatthieuDartiailh
Collaborator Author

I am making progress. But I have a couple of questions.

First, I had to include the fix to make the parser return a single item rather than a list containing a single item. It is the first commit; I did not update the tests though. I will try to make a separate PR for that.

Second, I am unsure how to make pegen happy regarding the start rule. It was manually set to file before, but that does not seem completely correct. What would be the right way to do this?

Finally, I would be interested in any strategy for testing the parser, and next in how to properly handle line information.

@pablogsal
Contributor

Finally, I would be interested in any strategy for testing the parser, and next in how to properly handle line information.

For testing the parser, the best way is to throw a lot of Python files at it (we used to use the whole stdlib or the top 5000 PyPI packages for this) and compare the output (or just check whether it crashes). If you plan to generate exactly the same AST, then it makes sense to compare the output with the built-in parser.
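For illustration, a minimal sketch of such a comparison harness (the parse_string entry point is a hypothetical name for whatever the generated parser exposes, not the actual pegen API):

import ast
import pathlib

# Hypothetical import: whatever module/function the generated parser exposes.
from generated_python_parser import parse_string  # assumed name, not the real API

def same_ast(path: pathlib.Path) -> bool:
    # Compare the AST from the generated parser against the built-in compiler.
    source = path.read_text(encoding="utf-8")
    try:
        expected = ast.dump(ast.parse(source))
    except SyntaxError:
        return True  # skip files the built-in parser itself rejects
    return expected == ast.dump(parse_string(source))

corpus = pathlib.Path(ast.__file__).parent  # e.g. the stdlib as a large corpus
failures = [p for p in corpus.rglob("*.py") if not same_ast(p)]
print(f"{len(failures)} file(s) disagree with the built-in parser")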

Second, I am unsure how to make pegen happy regarding the start rule. It was manually set to file before, but that does not seem completely correct. What would be the right way to do this?

Not sure I fully understand the problem. I need some time to investigate exactly what's going on but it would be helpful if you could elaborate a bit more and maybe provide some examples :)

First, I had to include the fix to make the parser return a single item rather than a list containing a single item

Could you also elaborate on this?

Currently, I am quite overburdened with work on CPython, so apologies if it takes me a bit longer than usual to respond here :)

@MatthieuDartiailh
Collaborator Author

Finally, I would be interested in any strategy for testing the parser, and next in how to properly handle line information.

For testing the parser, the best way is to throw a lot of Python files at it (we used to use the whole stdlib or the top 5000 PyPI packages for this) and compare the output (or just check whether it crashes). If you plan to generate exactly the same AST, then it makes sense to compare the output with the built-in parser.

I will look into how to do that then.

Second, I am unsure how to make pegen happy regarding the start rule. It was manually set to file before, but that does not seem completely correct. What would be the right way to do this?

Not sure I fully understand the problem. I need some time to investigate exactly what's going on but it would be helpful if you could elaborate a bit more and maybe provide some examples :)

Pegen requires a rule named start (or the use of trailers). The Python grammar has multiple entry points (file, interactive, eval). Looking at the original C grammar, I realize now that the start rule is set dynamically in that case. I am not sure what would be the best way to emulate that, but for testing I can simply keep start: file.

First, I had to include the fix to make the parser return a single item rather than a list containing a single item

Could you also elaborate on this?

I was referring to the point discussed in #6 and for which I opened #14.

Currently, I am quite overburdened with work on CPython, so apologies if it takes me a bit longer than usual to respond here :)

@MatthieuDartiailh
Collaborator Author

I have reached a somewhat puzzling point: trying to parse import sys leads to it being matched as an expression, which then bubbles up that way and at some point gets converted to None. I tried to step through with pdb, but I am not sure I fully follow the logic. I feel like the parser should realize that matching NAME won't go anywhere and then explore different options, but that does not seem to be what happens. Parsing this appears to work though:

def a():
    pass

a: int
a = 1
a += 1
a *= 1
a /= 1
a //= 1
a %= 1

a + 1
a * 1
a / 1
a // 1
a % 1

a < 1
a > 1
a <= 1
a >= 1
a == 1
a != 1

a = 2
a << 1
a >> 1
a ^ 1
a ^= 1

a = []
a = {}

but the AST test, which just checks the str representation, complains about the context of some Name nodes being explicitly set to Load. I will look into making the AST comparison more robust at some point.

Any help would be very much appreciated.

@lysnikolaou
Contributor

At some point in time, we added hard keywords to the C parser, which means that keywords do not get parsed as NAMEs, but are themselves tokens. After that we moved the star_expressions alternative in the simple_stmt rule all the way to the top, in order to speed up the parsing, since expressions are the most common constructs in Python source code.

That's probably what's creating the problem. We have not implemented hard keywords for the python generator, which means that import gets parsed as a name and therefore as an expression. That's why star_expressions succeeds, way before import_stmt is even reached.

Does that make sense or have I missed something?

If you want to have a better look at how keywords are parsed in the C parser, have a look at the keyword list and how it gets generated.
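For reference, the stdlib tokenize module makes this easy to see: keywords come back as plain NAME tokens, so distinguishing them is entirely the parser's job.

import io
import tokenize

# 'import' is emitted as NAME, exactly like 'sys'; the tokenizer knows
# nothing about reserved keywords.
for tok in tokenize.generate_tokens(io.StringIO("import sys\n").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))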

@MatthieuDartiailh
Collaborator Author

Thanks a lot @lysnikolaou! Yes, that makes sense. Browsing through the C implementation, am I right to assume that the key point is https://github.com/python/cpython/blob/3bb19873abd572879cc9a8810b1db9db1f704070/Parser/pegen.c#L629, which is in charge of transforming the tokens returned by the tokenizer and turning bare NAME tokens into the appropriate keyword token type?

@lysnikolaou
Contributor

Exactly, yes! More specifically the function call to _get_keyword_or_name_type here.

@MatthieuDartiailh
Collaborator Author

Thanks for the confirmation.

@MatthieuDartiailh
Collaborator Author

I now have a better idea of how those things are connected, but I still have a couple of questions:

  • Regarding soft keywords (I am not sure Python has any, but I would need them for my project): looking at the c_generator, it looks like a soft keyword is not expected to be surrounded by ''. Am I reading that properly, and does that mean that 'import' NAME would treat import as a (hard) keyword, while "import" NAME would mean import is a soft keyword, since it has to be a string?
  • I am trying to figure out the best way to handle the token stream transformation. It seems natural to want to do that operation in the tokenizer. Pegen depends quite tightly on the Python tokenizer, which makes altering the tokenizer directly cumbersome (which is, however, something I wish could be done to add extra tokens, but I digress). The current approach is to wrap the bare Python tokenizer with a version that implements the proper methods, and handling the keyword transformation there would work fine. However, I like the idea of extracting the keywords from the grammar to avoid duplicating things. As a consequence, since the user is expected to create their own tokenizer, would it make sense to collect keywords on the parser and allow the tokenizer to be initialized with a list of keywords taken from the parser?
  • Related to the previous one: should keywords be declared upfront, or would identifying them as we parse the grammar be fine?

@pablogsal
Contributor

  • Regarding soft keywords (I am not sure Python has any, but I would need them for my project): looking at the c_generator, it looks like a soft keyword is not expected to be surrounded by ''.

With pattern matching Python has match and case as soft keywords.

Am I reading that properly, and does that mean that 'import' NAME would treat import as a (hard) keyword, while "import" NAME would mean import is a soft keyword, since it has to be a string?

Yup. Check for instance https://github.com/python/cpython/blob/master/Grammar/python.gram#L212

  • I am trying to figure out the best way to handle the token stream transformation. It seems natural to want to do that operation in the tokenizer. Pegen depends quite tightly on the Python tokenizer, which makes altering the tokenizer directly cumbersome (which is, however, something I wish could be done to add extra tokens, but I digress). The current approach is to wrap the bare Python tokenizer with a version that implements the proper methods, and handling the keyword transformation there would work fine. However, I like the idea of extracting the keywords from the grammar to avoid duplicating things. As a consequence, since the user is expected to create their own tokenizer, would it make sense to collect keywords on the parser and allow the tokenizer to be initialized with a list of keywords taken from the parser?

Yeah, pegen is tightly coupled with the Python tokenizer, and that could be quite difficult to untangle since it is baked into its base design. I think the best approach is to extract the keywords from the grammar and make the generated Python parser check for them, as we are doing with the C parser, as @lysnikolaou mentioned. Maybe I am not understanding your proposal correctly, but if your concern is how to enforce keywords, then I think that is the best approach. Check for example:

https://github.com/python/cpython/blob/af50c84643ce21cfbdfdabbdfae6bd5e1368c542/Tools/peg_generator/pegen/c_generator.py#L114-L134

  • Related to the previous one: should keywords be declared upfront, or would identifying them as we parse the grammar be fine?

They should be extracted from the grammar.
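As a rough illustration of what "extracted from the grammar" means (pegen itself collects keywords while walking the parsed grammar rules; the regex below is only a simplified sketch, and the data/python.gram path is an assumption):

import re

def collect_keywords(grammar_text: str):
    # Single-quoted identifiers are hard keywords, double-quoted ones are soft keywords.
    hard = set(re.findall(r"'([A-Za-z_]+)'", grammar_text))
    soft = set(re.findall(r'"([A-Za-z_]+)"', grammar_text))
    return hard, soft

with open("data/python.gram") as f:  # assumed location of the grammar file
    hard, soft = collect_keywords(f.read())
print(sorted(hard), sorted(soft))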

@MatthieuDartiailh
Collaborator Author

  • Regarding soft keywords (I am not sure Python has any, but I would need them for my project): looking at the c_generator, it looks like a soft keyword is not expected to be surrounded by ''.

With pattern matching Python has match and case as soft keywords.

I started from the python.gram in the repo, and pattern matching was not there yet, so I missed that. Thanks for the pointer.

Am I reading that properly, and does that mean that 'import' NAME would treat import as a (hard) keyword, while "import" NAME would mean import is a soft keyword, since it has to be a string?

Yup. Check for instance https://github.com/python/cpython/blob/master/Grammar/python.gram#L212

Thanks

  • I am trying to figure out the best way to handle the token stream transformation. It seems natural to want to do that operation in the tokenizer. Pegen depends quite tightly on the Python tokenizer, which makes altering the tokenizer directly cumbersome (which is, however, something I wish could be done to add extra tokens, but I digress). The current approach is to wrap the bare Python tokenizer with a version that implements the proper methods, and handling the keyword transformation there would work fine. However, I like the idea of extracting the keywords from the grammar to avoid duplicating things. As a consequence, since the user is expected to create their own tokenizer, would it make sense to collect keywords on the parser and allow the tokenizer to be initialized with a list of keywords taken from the parser?

Yeah, pegen is tightly coupled with the Python tokenizer, and that could be quite difficult to untangle since it is baked into its base design. I think the best approach is to extract the keywords from the grammar and make the generated Python parser check for them, as we are doing with the C parser, as @lysnikolaou mentioned. Maybe I am not understanding your proposal correctly, but if your concern is how to enforce keywords, then I think that is the best approach. Check for example:

https://github.com/python/cpython/blob/af50c84643ce21cfbdfdabbdfae6bd5e1368c542/Tools/peg_generator/pegen/c_generator.py#L114-L134

My concern is that the token type needs to be converted from NAME to a dedicated type, as is done in _PyPegen_fill_token (along with other operations), and that feels like something that should be done inside the tokenizer rather than the parser; hence my idea to provide a way to do it on the tokenizer wrapper, since we are already adding the ability to peek to the tokenizer. Ideally I would like to document the requirements on the tokenizer well enough to allow people to use a different one.

  • Related to the previous one: should keywords be declared upfront, or would identifying them as we parse the grammar be fine?

They should be extracted from the grammar.

Thanks

@lysnikolaou
Contributor

My concern is that the token type needs to be converted from NAME to a dedicated type, as is done in _PyPegen_fill_token (along with other operations), and that feels like something that should be done inside the tokenizer rather than the parser; hence my idea to provide a way to do it on the tokenizer wrapper, since we are already adding the ability to peek to the tokenizer.

Not doing this in the tokenizer was an explicit design decision though. While your feeling makes sense (we had this discussion when doing this), the general rule is that the tokenizer should know nothing about reserved keywords. Its only job is to identify different "sentence parts", without knowing what they mean in the context in which they're parsed.

For example, a.import() should be tokenized as:

NAME a
DOT .
NAME import
LPAR (
RPAR )

It is up to the parser to identify that import is a reserved keyword and fail here.

@MatthieuDartiailh
Collaborator Author

Thanks for the clarification @lysnikolaou. Happy to know that my feeling was not completely off. I will see how to implement the logic that way in the Python parser/generator then. Thanks for your advice.

@MatthieuDartiailh
Collaborator Author

Good news: I can now parse import statements!

However, I would appreciate a quick review of the commit adding support for keywords, because it is not without issues. Basically, I followed the idea of the C generator and assigned new integers as token types. Even though I agree it should conceptually be done in the parser, I ended up doing it inside the tokenizer wrapper, since that seems the most efficient way (this way the transformation is done once and only once). The keywords and soft keywords are collected from the grammar, but due to how they are collected, they are only listed at the very end of the generated parser.

One issue I have is that printing anything involving a TokenInfo fails because the type is unknown. Since we are basically doing string matching, I could use the OP type and still get the proper matching, but it does not seem ideal. I am open to suggestions.

When I get more time I will go back to more extensively testing the parser.

@MatthieuDartiailh
Collaborator Author

I found a more elegant way to handle keywords, by using the NAME matching rule. This way there is no need to touch the tokenizer and no modification of the tokens that makes them non-printable.
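A rough sketch of the idea, for readers following along (names and structure are illustrative, not the actual generated code):

from token import NAME

KEYWORDS = {"import", "from", "if", "else", "def", "return"}  # collected from the grammar

class ParserSketch:
    # Illustrative only: the real generated parser adds memoization,
    # error handling and a proper tokenizer wrapper.
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def name(self):
        tok = self.tokens[self.pos]
        # A NAME token only matches the name rule if it is not a reserved keyword.
        if tok.type == NAME and tok.string not in KEYWORDS:
            self.pos += 1
            return tok
        return None

    def expect_keyword(self, keyword):
        tok = self.tokens[self.pos]
        # Keywords stay NAME tokens; they are matched on their string instead.
        if tok.type == NAME and tok.string == keyword:
            self.pos += 1
            return tok
        return None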

Since, to achieve this, I had to make a number of changes not directly related to adding a full-blown Python grammar to the project, I would like to know if you want me to split that part off and make a different PR for the Python parser.

Regarding testing for the Python parser, should I focus on updating the scripts in scripts, or should I write more systematic tests under tests (maybe tests/python)?

@lysnikolaou
Contributor

I found a more elegant way to handle keywords, by using the NAME matching rule. This way there is no need to touch the tokenizer and no modification of the tokens that makes them non-printable.

Happy to hear that.

Since, to achieve this, I had to make a number of changes not directly related to adding a full-blown Python grammar to the project, I would like to know if you want me to split that part off and make a different PR for the Python parser.

Truth is that would help a bunch with the review. If that's not too much work for you, I'd be really happy to review a PR with only the changes for keyword support.

Regarding testing for the Python parser, should I focus on updating the scripts in scripts, or should I write more systematic tests under tests (maybe tests/python)?

I guess the answer is both. For now though, let's just focus on writing as many tests as possible for the constructs the parser can parse. For example, let's write tests that the output AST is correct for parsing import statements, and then incrementally build more tests as we support more and more things. When we've got a first working version for all of the grammar, we can see if it makes sense to modify and run something under /scripts.

@MatthieuDartiailh
Collaborator Author

OK, then I will try to make a new PR with the change to support keywords, plus other minor stuff (such as not creating a method called async!), each with its own dedicated commit. And I will keep working on adding tests once I am done.

Currently the one very broken thing is the handling of strings. I need to look into it. I may try a trick similar to https://github.com/nucleic/enaml/blob/main/enaml/core/parser/parser36.py#L98, since we are parsing Python in Python and I do not fancy spending too much time on parsing f-strings (but I could be convinced otherwise).

@MatthieuDartiailh
Collaborator Author

See #16 for a smaller PR for keyword support

@MatthieuDartiailh
Collaborator Author

I made some more progress (you can check the files under test/data to get an idea of what we can currently parse successfully). Right now I have an issue with parsing f(a for a in b), while 1 + f(a for a in b) parses without issues. I checked the current Python grammar and noticed that some rules make use of &&, which is not currently supported here. I will try to diff pegen against the version currently in CPython main and see what has not yet been reflected in pegen.

@MatthieuDartiailh mentioned this pull request on Apr 29, 2021
@MatthieuDartiailh
Collaborator Author

After adding support for Forced nodes in the grammar definition and making some changes found in the current Python grammar, I can now parse the test files in the test directory and parse the source code of pegen! I will try to add more tests and run on some popular PyPI packages to get a better idea of how robust things are.

Next I will try to add pattern matching and line numbers in the AST. Regarding this last point, I am interested in any suggestion to make this process less verbose, and in ways to test the output, since the repr of the AST will not be enough anymore; I imagine I will have to have tests walking the AST in that case.
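One possible way to cover line information without hand-writing an AST walker is to compare ast.dump output with include_attributes=True, which folds lineno/col_offset (and the end_* attributes) into the string. A small sketch below; the our_parser call is hypothetical:

import ast

src = "x = 1\nif x:\n    y = x + 2\n"

# include_attributes=True adds lineno, col_offset, end_lineno and end_col_offset
# to the dump, so a plain string comparison also checks location information.
reference = ast.dump(ast.parse(src), include_attributes=True)
# candidate = ast.dump(our_parser.parse_string(src), include_attributes=True)  # hypothetical
print(reference)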

@MatthieuDartiailh
Collaborator Author

Note that black dislikes test files with Python 3.9+-only syntax. Should we exclude those files or bump the Python version used to run black?

@pablogsal
Contributor

Sorry for not participating here these days, but I am very swamped with the beta release of Python 3.10. I just wanted to say that this effort could actually be very useful to black in the future, as we could add actions that create a CFG instead of an AST, allowing black to support Python 3.10+ syntax.

Thanks for all the patience @MatthieuDartiailh! Your work is great! Unfortunately, both @lysnikolaou and I are quite busy, so it will take us a bit longer to review 😅

@MatthieuDartiailh
Collaborator Author

No problem. I had to work on other projects impacted by changes in the latest alpha for a while, and so I made only slow progress.

@MatthieuDartiailh
Collaborator Author

MatthieuDartiailh commented Apr 30, 2021

I made some extra progress. I corrected a bunch of issues with async and decorators (I included a version check for the pre-3.9 syntax). The python.gram-based parser can now parse the 100 PyPI packages tested by the included script! If you are aware of additional corner cases worth testing, let me know.

As a side note, for async, I had to revert to identifying async and await as keywords, since tokenize sends back NAME tokens, not ASYNC tokens.

So I am left with pattern matching and line numbers to support. For the former, where can I find test files? I assume some exist within the CPython repo, but I could not find them. For line numbers, I am open to any suggestion to make it efficient and on how best to test for correctness.

@isidentical
Contributor

So I am left with pattern matching and line numbers to support. For the former, where can I find test files? I assume some exist within the CPython repo, but I could not find them.

The file that we use for ast.unparse is Lib/test/test_patma.py (2000+ lines of pattern matching statements).

@MatthieuDartiailh
Collaborator Author

Thanks for the pointer @isidentical, I will try to have a look.

@pablogsal
Contributor

Is there any valid use of two successive NEWLINE tokens (assuming we discard NL and COMMENT)? Naively I would say no, but I am not sure. If it is indeed never valid, we could filter it in the tokenizer wrapper; otherwise we need some other heuristic in the parser.

I mean, I am not sure what you mean by "valid", but two NEWLINEs are absolutely possible; we cannot filter those.

IIRC NL is emitted when there is an empty line:

❯ python -m tokenize <<< ""
1,0-1,1:            NL             '\n'
2,0-2,0:            ENDMARKER      ''

@MatthieuDartiailh
Collaborator Author

I meant two NEWLINE tokens (which are supposed to carry meaning, contrary to NL tokens). I will try to poke around more to see if I can find more odd cases and a rule to handle them.

@MatthieuDartiailh
Collaborator Author

MatthieuDartiailh commented Jul 17, 2021

Token 510: b'if'
Token 524: b'True'
Token 11: b':'
Token 4: b'' # NEWLINE
Token 5: b'' # INDENT
Token 1: b'b'
Token 22: b'='
Token 2: b'1'
Token 4: b'' # NEWLINE
Token 6: b'' # DEDENT
Token 0: b''

What, why is it emitting Token 524: b'True'?

I just replaced if a: by if True: to avoid a name error at execution. I updated the source in my message above.

@MatthieuDartiailh
Collaborator Author

Playing a bit more with that, it appears that the error only shows up when there is a comment at the end of the file without a blank line after it. In this case the tokenizer generates COMMENT, NL, NEWLINE, ENDMARKER, while in the presence of a newline it generates COMMENT, NL, ENDMARKER. If there are any DEDENTs pending, they are inserted before ENDMARKER but after the NEWLINE.

So either the tokenizer wrapper needs to swallow that spurious NEWLINE, or we need [NEWLINE] in both the file rule and the block rule, since the spurious newline appears before the DEDENT tokens. Any preference, @pablogsal?

Should we also open a bug report on the Python bug tracker for this? It does not look like desirable behavior to me, but I may be missing something.
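A small reproducer of the behaviour described above (output depends on the Python version; the extra NEWLINE only shows up on versions without the tokenize fix):

import io
import tokenize

# A file that ends with a comment and no blank line after it.
src = "if True:\n    b = 1\n# trailing comment"

for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# On affected Python versions this prints a NEWLINE right after the
# COMMENT/NL pair, before the pending DEDENT and the ENDMARKER.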

@pablogsal
Contributor

Should we also open a bug report on the Python bug tracker for this? It does not look like desirable behavior to me, but I may be missing something.

I think this looks like a tokenizer bug. I would recommend opening an issue so other folks can corroborate this or mention if this change is going to break anything for them.

@MatthieuDartiailh
Collaborator Author

I opened the issue some time ago (https://bugs.python.org/issue44667) but it did not receive any comments. I will try to make a PR this week. In the meantime, do you have any opinion, @pablogsal @lysnikolaou, on how to handle the issue in older versions (in the tokenizer or as extra rules)? I personally prefer the second option, to keep the grammar as close as possible to the existing one. This is the last blocker, and after this the PR can be merged. I will hence remove the WIP marker.

@MatthieuDartiailh changed the title from "data: wip on making Python parser for python grammar" to "data: make a full Python parser for python grammar" on Jul 25, 2021
@MatthieuDartiailh marked this pull request as ready for review on July 25, 2021
@pablogsal
Contributor

@MatthieuDartiailh I agree that keeping the grammar as close as possible is probably the best course of action. I will try to investigate the bpo issue you opened after I release the first release candidate for 3.10 next week.

Otherwise, I would prefer to land this as soon as possible. Do you want to tackle the tokenizer issue in this PR or in another one?

@pablogsal
Contributor

Also, @lysnikolaou if you have some spare minutes, it would be great if you could give the PR also a quick look.

@lysnikolaou
Contributor

Also, @lysnikolaou if you have some spare minutes, it would be great if you could give the PR also a quick look.

Yup! I'll make some time to give it a more thorough look tomorrow evening or Sunday morning.

In this case, the Python tokenizer generates NEWLINE, COMMENT, NL, NEWLINE, but two consecutive NEWLINE tokens with nothing other than COMMENT and NL between them are never valid, so we discard the second NEWLINE.
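A rough sketch of that kind of wrapper-level filter (illustrative only, not the exact code in this PR):

import tokenize

def drop_spurious_newlines(tokens):
    # Skip a NEWLINE when the last meaningful token (ignoring COMMENT and NL)
    # was already a NEWLINE.
    last_meaningful = None
    for tok in tokens:
        if tok.type in (tokenize.COMMENT, tokenize.NL):
            yield tok
            continue
        if tok.type == tokenize.NEWLINE and last_meaningful == tokenize.NEWLINE:
            continue  # discard the second consecutive NEWLINE
        last_meaningful = tok.type
        yield tok

# Usage: wrap the raw token stream before feeding it to the parser, e.g.
# drop_spurious_newlines(tokenize.generate_tokens(readline))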
@MatthieuDartiailh
Collaborator Author

I just pushed a fix for the tokenizer. Let me know what you think. I may not find time to backport your last 4 commits on CPython today, @pablogsal, but I will do my best. Feel free to merge without those.

@pablogsal
Contributor

I just pushed a fix for the tokenizer. Let me know what you think. I may not find time to backport your last 4 commits on CPython today, @pablogsal, but I will do my best. Feel free to merge without those.

What 4 commits are you referring to?

@MatthieuDartiailh
Collaborator Author

I have tried to apply any fixes you made to the CPython grammar file to the grammar file of this PR. The 4 commits I refer to are:
python/cpython@6948964
python/cpython@ecc3c8e
python/cpython@302cf35
python/cpython@208a7e9

@pablogsal
Contributor

I have tried to apply any fixes you made to the CPython grammar file to the grammar file of this PR. The 4 commits I refer to are:
python/cpython@6948964
python/cpython@ecc3c8e
python/cpython@302cf35
python/cpython@208a7e9

There is only one PR that you are probably interested in, which is a reordering of the rules for readability. The rules are the same; it's just that they are grouped and commented so it is not a gigantic blob of text. The other PRs are for syntax errors for print, so it's basically a single new rule.

@pablogsal
Contributor

I will give some time for other folks to review and will land it afterwards. Then I will work on making a release.

@MatthieuDartiailh
Collaborator Author

I have tried to apply any fixes you made to the CPython grammar file to the grammar file of this PR. The 4 commits I refer to are:
python/cpython@6948964
python/cpython@ecc3c8e
python/cpython@302cf35
python/cpython@208a7e9

There is only one PR that you are probably interested in, which is a reordering of the rules for readability. The rules are the same; it's just that they are grouped and commented so it is not a gigantic blob of text. The other PRs are for syntax errors for print, so it's basically a single new rule.

@pablogsal Should I make a new PR to address those? I have the changes ready, but I need to add a couple of tests.

@pablogsal
Contributor

@pablogsal Should I make a new PR to address those? I have the changes ready, but I need to add a couple of tests.

As you prefer. I would recommend at least doing python/cpython@302cf35

@MatthieuDartiailh
Collaborator Author

I did the reorganization and added the new invalid rule, but no test for it. Will do if I find time before this PR is merged.

@lysnikolaou
Contributor

Such great work! Thank you so much for going through with this, @MatthieuDartiailh! I'll leave merging to Pablo, since he did all the heavy-lifting of reviewing this.

@MatthieuDartiailh
Collaborator Author

Thanks @lysnikolaou !

@pablogsal merged commit f4126d1 into we-like-parsers:main on Aug 2, 2021
@pablogsal
Contributor

Fantastic! Thanks a lot, @MatthieuDartiailh for the hard work and the patience 🎉 🎉 🎉

I will do a release after I manage to get 3.10.0 rc 1 out today 😉

By the way, I fixed and backported https://bugs.python.org/issue44667

@cfbolz

cfbolz commented Aug 2, 2021

nice, thanks everyone for your work on this!
