Unexpected unicode character class in HIR generated without unicode support #1088
Comments
The magic happens at:

    if let Some(singletons) = singleton_chars(&new) {
        let it = singletons
            .into_iter()
            .map(|ch| ClassUnicodeRange { start: ch, end: ch });
        return Hir::class(Class::Unicode(ClassUnicode::new(it)));
    }
    if let Some(singletons) = singleton_bytes(&new) {
        let it = singletons
            .into_iter()
            .map(|b| ClassBytesRange { start: b, end: b });
        return Hir::class(Class::Bytes(ClassBytes::new(it)));
    }

The alternation is replaced with a character class when all the alternatives are a single unicode character or byte. But this code snippet doesn't take into account whether unicode mode is disabled or not. Is this intended? Unfortunately, this is done in the |
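To make the effect described above concrete, here is a minimal, self-contained sketch of the kind of check that could lead to this collapse. It is not the `regex-syntax` implementation: the function name `singleton_chars_sketch` and the use of plain byte slices in place of HIR literals are assumptions made purely for illustration. The point is that once literals are only bytes, the two bytes `\xc2\xa0` are indistinguishable from the UTF-8 encoding of U+00A0.

```rust
/// Hypothetical stand-in for the singleton check: each alternative is a raw
/// byte literal, with no record of whether Unicode mode was enabled.
fn singleton_chars_sketch(alternatives: &[&[u8]]) -> Option<Vec<char>> {
    let mut chars = Vec::with_capacity(alternatives.len());
    for bytes in alternatives {
        // Every alternative must be valid UTF-8 ...
        let s = std::str::from_utf8(bytes).ok()?;
        let mut it = s.chars();
        // ... and must decode to exactly one Unicode scalar value.
        match (it.next(), it.next()) {
            (Some(ch), None) => chars.push(ch),
            _ => return None,
        }
    }
    Some(chars)
}
```

Under these assumptions, the alternatives `a` and `\xc2\xa0` both pass the check, so the caller would be free to build a Unicode class from them even though the pattern was written in terms of bytes.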
Hmmmm, yes, this is quite the predicament. The idea is that there should be a bright dividing line between the HIR's smart constructors and all of the configuration that goes into the translator. I specifically tried to design the constructors to avoid needing to pass in a bunch of little configuration knobs that could easily (IMO) explode to make it more difficult to reason about. The idea is that these sorts of smart constructors would always do valid transformations. One simplification I did in the most recent release was to change the
However, if the current translator had retained the "literal is either Unicode or arbitrary bytes," then an optimization like this could inspect the literal type and not try to go from Unicode to bytes or vice versa. But that information (along with whether Unicode is enabled or not) is erased by the time you get to this point. Erasing information is the point of the HIR, but that does sometimes result in difficult scenarios like this.

There are also other manifestations of this erasure of knowledge. For example, a regex like

So I guess there are three options.

The first is, as you say, API changes in order to keep this optimization somehow without breaking the documented guarantee that you get a Unicode class only when Unicode mode is enabled. I'm not exactly sure of the nature of the API changes. I'm sure we could brainstorm a simple-but-ugly change. But I think I would want to explore a change that makes

The second would be to remove the optimization. It's not particularly critical and we could at least keep it for ASCII bytes... It could still flip a byte class to a Unicode class, but it would be limited to ASCII (I believe).

The third would be to keep the optimization and document this in the API docs a

I am somewhat inclined towards the third option. It feels somewhat unnecessarily restrictive to limit the smart constructors to "stay in their Unicode lane." That might be somewhat more compelling if we supported something other than UTF-8, but that's already pretty firmly baked into the entire family of

How much does this bug impact you? For |
In my case it doesn't seem to be that important. It implies that I must implement support for Unicode character classes, but that's something that I had to do anyways because I want to support Unicode regexps in general. Until now I was rejecting unicode regexps, and assuming that So, I support the option of simply changing the documentation to avoid confusion about the offered guarantees. |
Perfect. I'm sure you know this, but another path forward would be to compile Unicode classes as an alternation of byte literals at whatever point you convert an |
I just discovered another gem that could make my life easier: https://docs.rs/regex-syntax/latest/regex_syntax/utf8/index.html TBH, it looks like half the YARA Rust implementation will be your code: |
Yes indeedy. That module is precisely to aid in building byte-oriented UTF-8 automata from Unicode classes. The only other missing piece of the puzzle is optimizations. The regex/regex-automata/src/nfa/thompson/compiler.rs Lines 1431 to 1444 in 061ee81
Are you sure you can't use the NFA compiler while you're at it? :P I will just add one other thing: I would advise not going down the Unicode route without really good reason. It is an absolute nightmare. It tends to be an abstraction/optimization-busting feature. I realize the pull is basically irresistible, but you've been warned. :-) |
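As a rough illustration of the idea discussed above (compiling a Unicode class as an alternation of byte literals), the sketch below expands each scalar value in a class into its UTF-8 byte string. This is a deliberately naive approach that enumerates every code point, so it is only reasonable for tiny classes; the `regex_syntax::utf8` module linked earlier instead produces compact sequences of byte ranges. The helper name `class_to_byte_alternatives` and the tuple representation of ranges are hypothetical.

```rust
/// Naively expands inclusive scalar-value ranges into one UTF-8 byte string
/// per code point. Hypothetical helper; assumes each range has start <= end.
fn class_to_byte_alternatives(ranges: &[(char, char)]) -> Vec<Vec<u8>> {
    let mut alternatives = Vec::new();
    for &(start, end) in ranges {
        // Iterating an inclusive `char` range skips the surrogate gap.
        for ch in start..=end {
            let mut buf = [0u8; 4];
            // `encode_utf8` writes 1 to 4 bytes and returns them as a `&mut str`.
            alternatives.push(ch.encode_utf8(&mut buf).as_bytes().to_vec());
        }
    }
    alternatives
}
```

For a class containing only `a` and U+00A0, this yields the byte strings `[0x61]` and `[0xC2, 0xA0]`, which is the alternation of literals the original pattern spelled out.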
I think I'll take the advice and focus on more pressing matters before going down the Unicode rabbit hole. I have my own Unicode-inflicted scars from the past :-) |
I decided on going with option 3 above and just documenting that the There will also be some minor breaking changes to the |
I'm not using the |
Basically, we never should have guaranteed that a particular HIR would (or wouldn't) be used if the 'u' flag was present (or absent). Such a guarantee generally results in too little flexibility, particularly when it comes to HIR's smart constructors. We could probably uphold that guarantee, but it's somewhat gnarly to do and would require rejiggering some of the HIR types. For example, we would probably need a literal that is an enum of `&str` or `&[u8]` that correctly preserves the Unicode flag. This in turn comes with a bigger complexity cost in various rewriting rules. In general, it's much simpler to require the caller to be prepared for any kind of HIR regardless of what the flags are. I feel somewhat justified in this position due to the fact that part of the point of the HIR is to erase all of the regex flags so that callers no longer need to worry about them. That is, the erasure is the point that provides a simplification for everyone downstream. Closes #1088
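What "be prepared for any kind of HIR regardless of what the flags are" might look like for a downstream consumer can be sketched as follows. The enum `ClassSketch` and the function `to_ascii_byte_ranges` are hypothetical simplifications, not types from `regex-syntax`; the sketch only assumes a class is either a list of scalar-value ranges or a list of byte ranges, and that each range has start <= end.

```rust
/// Hypothetical, simplified mirror of a two-variant character class.
enum ClassSketch {
    Unicode(Vec<(char, char)>),
    Bytes(Vec<(u8, u8)>),
}

/// Lowers a class to ASCII-only byte ranges whichever variant it arrives in.
/// Returns `None` when the class can match something outside ASCII, so the
/// caller can fall back to a slower path or reject the pattern.
fn to_ascii_byte_ranges(class: &ClassSketch) -> Option<Vec<(u8, u8)>> {
    match class {
        ClassSketch::Bytes(ranges) => {
            if ranges.iter().all(|&(_, end)| end <= 0x7F) {
                Some(ranges.clone())
            } else {
                None
            }
        }
        ClassSketch::Unicode(ranges) => ranges
            .iter()
            .map(|&(start, end)| {
                // If the upper bound is ASCII, the whole range is ASCII.
                if end.is_ascii() {
                    Some((start as u8, end as u8))
                } else {
                    None
                }
            })
            .collect(),
    }
}
```

The design choice here is that the consumer never branches on which flags the pattern was compiled with; it inspects only the class it was handed.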
…classes. Unicode classes can appear even in regexps that were compiled without Unicode support. This is a well-known issue with the `regex-syntax` crate, and we should be able to handle it. See: rust-lang/regex#1088
What version of regex are you using?
Using `regex-syntax` 0.7.5.

Describe the bug at a high level.
When generating the HIR for certain regular expressions with unicode support turned off, the resulting HIR may contain Unicode character classes. I'm not sure if this is a bug or the intended behaviour, but the documentation seems to suggest that this is not expected. Specifically, the documentation for `hir::Class` says:
I assumed that the HIR produced without unicode support will contain character classes of the `Class::Bytes` variant alone. However this is not the case.

What are the steps to reproduce the behavior?
Consider this example:
It produces the following output:
Here `sub` is a class of the `Class::Unicode` variant.

What is the expected behavior?
I was expecting that `(a|\xc2\xa0)` is represented as an alternation of two literals, not as a `Class::Unicode`