RegexSet misbehave with unicode #353
Interestingly, these work fine:
Found the problem. There appears to be a bug in the compiler that's producing incorrect bytecode specifically for
The correct program should be:
My guess is that the extra |
BurntSushi added a commit that referenced this issue on May 20, 2017:
When compiling a RegexSet, it was possible for the jump locations to become incorrect if the last regex in the set had a starting location that didn't correspond to the beginning of its program. This can happen in simple cases like when your set consists of the regexes `a` and `β`. In particular, the program for `β` is:

    0: Bytes(\xB2) (goto 2)
    1: Bytes(\xCE) (goto 0)
    2: MATCH

Where the entry point is `1` instead of `0`. To fix this, we compile a set of regexes similarly to how we compile `a|β`, where we handle the holes produced by sub-expressions correctly.

Fixes #353
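To make the entry-point detail in the commit message concrete, here is a minimal toy interpreter for the three-instruction program listed there. It is only a sketch: `Inst`, `run`, and `beta_program` are hypothetical names invented for illustration and are not the regex crate's internal types, and the matcher is anchored at both ends for simplicity. It shows that the listed program accepts the UTF-8 bytes of `β` (`0xCE 0xB2`) only when execution starts at instruction `1`.

```rust
/// Toy instruction set modelled on the listing in the commit message.
/// Hypothetical types for illustration; not the regex crate's internals.
#[derive(Debug, Clone, Copy)]
enum Inst {
    /// Consume one byte equal to `byte`, then jump to `goto`.
    Byte { byte: u8, goto: usize },
    /// Report a match.
    Match,
}

/// Run `prog` starting at instruction `entry`, requiring the whole
/// input to be consumed (anchored at both ends, a simplification).
fn run(prog: &[Inst], entry: usize, input: &[u8]) -> bool {
    let mut pc = entry;
    let mut pos = 0;
    loop {
        match prog[pc] {
            Inst::Match => return pos == input.len(),
            Inst::Byte { byte, goto } => {
                if input.get(pos) != Some(&byte) {
                    return false;
                }
                pos += 1;
                pc = goto;
            }
        }
    }
}

/// The program for `β` as listed in the commit message.
fn beta_program() -> Vec<Inst> {
    vec![
        Inst::Byte { byte: 0xB2, goto: 2 }, // 0: Bytes(\xB2) (goto 2)
        Inst::Byte { byte: 0xCE, goto: 0 }, // 1: Bytes(\xCE) (goto 0)
        Inst::Match,                        // 2: MATCH
    ]
}

fn main() {
    let prog = beta_program();
    let input = "β".as_bytes(); // UTF-8 encoding: [0xCE, 0xB2]
    println!("entry 1: {}", run(&prog, 1, input));
    println!("entry 0: {}", run(&prog, 0, input));
}
```

Starting at `1` consumes `0xCE` and then `0xB2` before reaching `MATCH`. Starting at `0` fails on the first byte, which is the kind of mismatch that results when a jump targets the beginning of a sub-program rather than its actual entry point.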
bors added a commit that referenced this issue on May 20, 2017:
compiler: fix RegexSet bug
I've tested with my original issue and it also works fine now. Thanks!
Tested with regex 0.2.1
gives
The third should also be true. The only difference, `b` versus `β`, leads to different results.