Pattern matcher fixes #1876
Hi @GregDubbin, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agreement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the If you've already included the Contributor Agreement in your pull request above, you can ignore this message.
Hmm, it compiles fine for me but the CI servers are giving:
I think the compile error is coming from the unordered map. We could replace this with the
I wish I could make the overall matching code simpler and better. There are a few blocks that are fairly repetitive, but it's hard to refactor them into functions without actually making things worse.
Thanks a lot for your help here. This has been broken for a long time, and it's really not an easy bit of code to work with. I think your solution looks very good.
I believe you are correct. An alternative is to include a pointer to a MatchEntryC in the TokenPatternC data structure, initializing it so that every TokenPatternC in a pattern points to the same MatchEntryC. This should save some memory and time, as we wouldn't have to move to the end of the pattern to check the last match for the pattern.
I believe that the test examples in test_issue1450 are incorrect. Specifically,
Amazing work! And yes I think you're right about that test: I think it matched the incorrect performance. Okay I think this is good to merge --- a major improvement! |
#1503
Description
This PR includes fixes and improvements to the pattern matching behavior of spacy.matcher.Matcher. Specifically, the '*' and '+' operators are currently not greedy, which makes their output inconsistent with that of standard regular expression libraries.
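For reference, here is a small illustration of the greedy behavior that standard regular expression libraries exhibit, using Python's built-in `re` module. The sample text and patterns are made up for illustration and are not taken from the spaCy test suite.

```python
import re

text = "a b b b c"

# '*' is greedy: it consumes as many " b" repetitions as it can.
star = re.match(r"a( b)*", text).group(0)

# '+' is greedy too, and requires at least one repetition.
plus = re.search(r"( b)+", text).group(0)

# A non-greedy quantifier ('*?') stops as early as possible instead.
lazy = re.match(r"a( b)*?", text).group(0)

print(repr(star))  # 'a b b b'
print(repr(plus))  # ' b b b'
print(repr(lazy))  # 'a'
```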
The PR adds an ADVANCE_PLUS action that behaves as if both ADVANCE_ZERO and REPEAT were the action. This allows greedy matching to be implemented without backtracking. Due to the restrictions of the token patterns (i.e. one operator per token), the partial match list can be pruned to never be larger than twice the number of specs in a pattern.
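The general idea of greedy matching without backtracking can be sketched in plain Python. This is not the Cython code from this PR: the names `closure` and `longest_match`, the `(predicate, op)` pattern format, and the `(spec index, matched once)` state encoding are all assumptions made for illustration. The sketch keeps every live partial match in a set, advances all of them one token at a time, and remembers the longest end position seen so far.

```python
def closure(states, pattern):
    """Add every state reachable without consuming a token."""
    seen = set(states)
    stack = list(states)
    while stack:
        i, matched = stack.pop()
        if i == len(pattern):
            continue
        op = pattern[i][1]
        # Optional specs can be skipped; a '+' spec only after one match.
        if op in ("?", "*") or (op == "+" and matched):
            nxt = (i + 1, False)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def longest_match(pattern, tokens, start=0):
    """Return the end index of the longest match at `start`, or None.

    `pattern` is a list of (predicate, op) pairs, op in {"1", "?", "*", "+"}.
    A state is (spec index, whether that spec has already matched once).
    """
    states = closure({(0, False)}, pattern)
    best = start if any(i == len(pattern) for i, _ in states) else None
    for pos in range(start, len(tokens)):
        nxt = set()
        for i, _ in states:
            if i == len(pattern):
                continue
            pred, op = pattern[i]
            if pred(tokens[pos]):
                if op in ("*", "+"):
                    nxt.add((i, True))       # repeat: stay on this spec
                else:
                    nxt.add((i + 1, False))  # advance to the next spec
        states = closure(nxt, pattern)
        if not states:
            break
        if any(i == len(pattern) for i, _ in states):
            best = pos + 1                   # keep the longest end so far
    return best


pattern = [
    (lambda t: t == "a", "1"),
    (lambda t: t == "b", "*"),
    (lambda t: t == "c", "+"),
]
tokens = "a b b c c d".split()
print(longest_match(pattern, tokens))  # 5, i.e. tokens[0:5] is matched
```

Because a state is just a spec index plus one flag, the set of live states in this sketch can never hold more than `2 * len(pattern) + 1` entries, which is similar in spirit to the pruning bound described above.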