Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pattern matcher fixes #1876

Merged
merged 10 commits into from
Jan 24, 2018
Merged

Pattern matcher fixes #1876

merged 10 commits into from
Jan 24, 2018

Conversation

GregDubbin
Copy link
Contributor

#1503

Description

This PR includes fixes and improvements to the pattern matching behavior of spacy.matcher.Matcher. Specifically, the '*' and '+' operators are not greedy and are inconsistent with the output of standard regular expression libraries.

The PR adds a ADVANCE_PLUS action that behaves as if both ADVANCE_ZERO and REPEAT were the action. This allows greedy matching to be implemented without backtracking. Due to the restrictions of the token patterns (i.e. one operator per token), the partial match list can be pruned to never be larger than twice the number of specs in a pattern.

Types of change

  1. Fixed bug in matching state logic preventing '*' from ever being greedy
  2. Expanded functionality to allow multiple variable length quantifiers without backtracking

Checklist

  • [ x] I have submitted the spaCy Contributor Agreement.
  • [ x] I ran the tests, and all new and existing tests passed.
  • [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.

@explosion-bot
Copy link
Collaborator

Hi @GregDubbin, thanks for your pull request! 👍 It looks like you haven't filled in the spaCy Contributor Agreement (SCA) yet. The agrement ensures that we can use your contribution across the project. Once you've filled in the template, put it in the .github/contributors directory and add it to this pull request. If your pull request targets a branch that's not master, for example develop, make sure to submit the Contributor Agreement to the master branch. Thanks a lot!

If you've already included the Contributor Agreement in your pull request above, you can ignore this message.

@ines ines added bug Bugs and behaviour differing from documentation enhancement Feature requests and improvements labels Jan 23, 2018
@honnibal
Copy link
Member

Hmm, it compiles fine for me but the CI servers are giving:

/usr/include/c++/4.8/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.

@honnibal
Copy link
Member

honnibal commented Jan 23, 2018

I think the compile error is coming from the unordered map. We could replace this with the PreshMap hash table I use elsewhere in spaCy, but I think it's a bit inconvenient, since it's nice to map directly to a struct, as you do here. PreshMap only supports 64-bit keys and values. I'll try adding the compiler flag they suggest.

I wish I could make the overall matching code simpler and better. There are a few block that are fairly repetitive, but it's hard to refactor it into functions without actually making things worse.

Thanks a lot for your help here. This has been broken for a long time, and it's really not an easy bit of code to work with. I think your solution looks very good.

@GregDubbin
Copy link
Contributor Author

I believe you are correct. An alternative is to include a pointer to a MatchEntryC in TokenPatternC data structure, initializing it so that every TokenPatternC in a pattern points to the same MatchEntryC. This should save some memory and time as we wouldn't have to move to the end of the pattern to check the last match for the pattern.

@GregDubbin
Copy link
Contributor Author

I believe that the test examples in test_issue1450 are incorrect. Specifically, ('a b b c', 0, 2) should be ('a b b c', 0, 3) and ('a b b', 0, 2) should be ('a b b', 0, 3). I have changed the code accordingly, let me know if you disagree.

@honnibal
Copy link
Member

Amazing work! And yes I think you're right about that test: I think it matched the incorrect performance.

Okay I think this is good to merge --- a major improvement!

@honnibal honnibal merged commit 6a8cb90 into explosion:master Jan 24, 2018
@ines ines mentioned this pull request Jan 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bugs and behaviour differing from documentation enhancement Feature requests and improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants