Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added regex module option. #593

Merged
merged 18 commits into from
Jun 30, 2020
Merged

Added regex module option. #593

merged 18 commits into from
Jun 30, 2020

Conversation

julienmalard
Copy link
Contributor

@julienmalard julienmalard commented May 14, 2020

Added regex module option (optional alternative to re module).

I believe we should also add a test configuration to make sure all tests pass with pip install lark-parser[regex], but was unsure of where to add it in the tox.ini file; please do let me know if you have any pointers. I could then also add a conditional test with a grammar containing a regex module-specific terminal definition that only activates when the [regex] optional dependency is enabled.

Please let me know what you think!

@julienmalard
Copy link
Contributor Author

Will close #590.

@erezsh
Copy link
Member

erezsh commented May 15, 2020

Well, it's good that the tests pass.

If you want to add regex-specific tests, the place to do it is in the tests folder. Just don't run the tests if the import fails (look at the nearley test).

@MegaIng
Copy link
Member

MegaIng commented May 15, 2020

@erezsh I think what @julienmalard means that we add an additional pull-request check that also installs the regex modules.

(Also note: regex has a higher group limit, so it should be a speed up for large grammars)

@julienmalard
Copy link
Contributor Author

@MegaIng Yes, I think that that is what I meant. I was referring to adding a test matrix option that runs all tests with regex installed, so that we can check that everything works when regex is used instead of re (the tests run now run without regex installed, so, effectively, with no changes as compared to the master branch).

@erezsh
Copy link
Member

erezsh commented May 15, 2020

If regex already have tests that ensure that it's fully backwards-compatible with re, I'm not sure there's a point of also doing those tests on our end. At least not more than once in a long while.

@julienmalard
Copy link
Contributor Author

All right, sure then! Should I include regex module-specific tests, or is all fine?

@erezsh
Copy link
Member

erezsh commented May 15, 2020

Yes, you should include module-specific tests.

@julienmalard
Copy link
Contributor Author

julienmalard commented May 15, 2020

I have tried adding tests as test_regex.py. I do think that I need help for a few points:

  1. I am not too sure how to tell tox to install regex. I added a regex-requirements.txt file and also added it to the tox.ini file.
  2. There is a problematic line in utils.py: sre_parse.parse(regexp).getwidth(). The problem is that sre_parse will not accept character classes from regex (the \w bugfix works a charm, though!). The effect of that, for the moment, is that any grammars including character class regexes will fail to compile (which at least is not a regression from current behaviour).
  3. I wasn't sure how to make sure the test_regex.py tests are run. Is including the file in the tests directory enough, or do I need to specify it somewhere?

Edit: For (2), perhaps one idea would be to replace all categories \p{xxx} with single ascii characters in get_regexp_width? That way the correct width would still be returned.

Edit 2: Apparently I haven't correctly connected my new test file, because tests are passing even though point (2) is not yet resolved.

Many thanks!

@julienmalard
Copy link
Contributor Author

@erezsh Hope you are doing well! I have pushed a fix for point no. 2 in my comment above, so now getting the width of regexes with Unicode categories with the regex module works! Please do let me know what you think of my implementation.
I do think I would be in need of a pointer on how to link up my new test file test_regex to the tox tests that are automatically run on the repo. (All tests do run and pass locally on my machine with PyTest, but I have a nagging feeling that they are not being run on tox for this pull request).
Many thanks!

@erezsh
Copy link
Member

erezsh commented Jun 10, 2020

@julienmalard Is there a guarantee that using regex instead of re won't slow down some grammars? Do we know if it's slower or faster on average?

@julienmalard
Copy link
Contributor Author

@erezsh I'm unfortunately unable to find standard benchmarks (though if you know of some standard tests I could run them quickly). However, the regex page does state that

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

I am not at all an expert on regex, but did note that @MegaIng above also mentionned that

(Also note: regex has a higher group limit, so it should be a speed up for large grammars)

@erezsh
Copy link
Member

erezsh commented Jun 26, 2020

I'm worried about adding this as the default, as users might experience performance degradation. It would be much better, imho, if this was opt-in rather than "always on"

@julienmalard
Copy link
Contributor Author

Hello,
Sure, I think I could change that quickly enough. Do you think it would be best to have it as a global option, e.g.,

import lark
from lark import Lark

lark.use_regex = True
parser = Lark('my grammar.lark', 'lalr')

Or as an option specific to each parser:

from lark import Lark
parser = Lark('my grammar.lark', 'lalr', regex=True)

@erezsh
Copy link
Member

erezsh commented Jun 26, 2020

Option #2 sounds good.

@erezsh
Copy link
Member

erezsh commented Jun 26, 2020

That looks good. Just make sure that the tests pass.

@julienmalard
Copy link
Contributor Author

Thanks! It was a missing tox dependency for regex. Testing it again now.

@julienmalard
Copy link
Contributor Author

Looks like this will take a bit of trial and error! Do you know how I can tell tox to install regex prior to running tests? All works well locally, but not on GitHub builds. Thanks!

@erezsh
Copy link
Member

erezsh commented Jun 26, 2020

See .github/workflows/tests.yml

Basically copy whatever we did for nearley-requirements. Maybe we can even merge them (for the tests)

@julienmalard
Copy link
Contributor Author

Thanks! I updated stubs and the README to match changes. Please let me know if all was done well!

@erezsh
Copy link
Member

erezsh commented Jun 28, 2020

There is a reason that all the parser tests sit in test_parser.py. There is an automated process there that runs each test for every parser and lexer. For example, your tests fail when parser="lalr".

Put your tests there and make sure that they pass.

Also, you forgot to update tox.ini

And lexer.py still imports regex for some reason, although it shouldn't (and probably shouldn't import re either)

@julienmalard
Copy link
Contributor Author

Thanks! Hope this works. I changed the tests, fixed the problems with lalr, and updated tox.ini. I also removed the regex import from lexer.py, but line 48 still requires re, so I left that one in.
Please let me know how it looks now!

@erezsh
Copy link
Member

erezsh commented Jun 28, 2020

Hmm looks like most of the tests failed..

@julienmalard
Copy link
Contributor Author

That is odd! It seems that pypy3 failed and that the rest aborted for some reason (without printing out error messages). Is it possible that the pypy3 failure caused the rest to stop running?
I am not sure what problems is occuring with pypy3 (all runs well locally with regular python), so I'll unfortunately probably have to debug remotely here with lots of commits and print statements...I hope you wouldn't mind?

@erezsh
Copy link
Member

erezsh commented Jun 28, 2020

It shouldn't be that difficult to install - https://www.pypy.org/download.html#python-3-6-compatible-pypy3-6-v7-3-1

@julienmalard
Copy link
Contributor Author

I think it worked!

@erezsh erezsh merged commit 1ef0e18 into lark-parser:master Jun 30, 2020
@erezsh
Copy link
Member

erezsh commented Jun 30, 2020

Congrats on a great PR 👍

@julienmalard julienmalard deleted the regex branch June 30, 2020 11:56
@julienmalard
Copy link
Contributor Author

Many thanks for your help and guidance throughout the process!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants