Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected unicode character class in HIR generated without unicode support #1088

Closed
plusvic opened this issue Sep 8, 2023 · 9 comments
Closed

Comments

@plusvic
Copy link
Contributor

plusvic commented Sep 8, 2023

What version of regex are you using?

Using regex-syntax 0.7.5.

Describe the bug at a high level.

When generating the HIR for certain regular expressions with unicode support turned off, the resulting HIR may contain Unicode character classes. I'm not sure if this is a bug or the intended behaviour, but the documentation seems to suggest that this is not expected. Specifically, the documentation for hir::Class says:

A character class corresponds to a set of characters. A character is either defined by a Unicode scalar value or a byte. Unicode characters are used by default, while bytes are used when Unicode mode (via the u flag) is disabled.

I assumed that the HIR produced without unicode support will contain character classes of the Class::Bytes variant alone. However this is not the case.

What are the steps to reproduce the behavior?

Consider this example:

use regex_syntax; // 0.7.5

fn main() {
    let mut parser = regex_syntax::ParserBuilder::new()
        .utf8(false)
        .unicode(false)
        .build();
        
    let hir = parser.parse(r"(a|\xc2\xa0)");

    println!("{:?}", hir);
}

It produces the following output:

Ok(Capture(Capture { index: 1, name: None, sub: Class({'a'..='a', '\u{a0}'..='\u{a0}'}) }))

Here sub is a class of the Class::Unicode variant.

What is the expected behavior?

I was expecting that (a|\xc2\xa0) is represented as an alternation of two literals, not as a Class::Unicode

@plusvic
Copy link
Contributor Author

plusvic commented Sep 8, 2023

The magic happens at:
https://github.com/rust-lang/regex/blob/master/regex-syntax/src/hir/mod.rs#L586C1-L597C10

        if let Some(singletons) = singleton_chars(&new) {
            let it = singletons
                .into_iter()
                .map(|ch| ClassUnicodeRange { start: ch, end: ch });
            return Hir::class(Class::Unicode(ClassUnicode::new(it)));
        }
        if let Some(singletons) = singleton_bytes(&new) {
            let it = singletons
                .into_iter()
                .map(|b| ClassBytesRange { start: b, end: b });
            return Hir::class(Class::Bytes(ClassBytes::new(it)));
        }

The alternation is replaced with a character class when all the alternatives are a single unicode character or byte. But this code snippet doesn't take into account whether unicode mode is disabled or not. Is this intended?

Unfortunately, this is done in the Hir::alternation associated function, where there's no context about whether unicode mode is enabled or not. The caller of Hir::alternation knows this information via self.flags().unicode(), but Hir::alternation itself does not. So, the fix is not straightforward, it seems to require some API changes.

@BurntSushi
Copy link
Member

BurntSushi commented Sep 8, 2023

Hmmmm, yes, this is quite the predicament. The idea is that there should be a bright dividing line between the HIR's smart constructors and all of the configuration that goes into the translator. I specifically tried to design the constructors to avoid needing to pass in a bunch of little configuration knobs that could easily (IMO) explode to make it more difficult to reason about. The idea is that these sorts of smart constructors would always do valid transformations.

One simplification I did in the most recent release was to change the Literal HIR kind to just be an owned slice of arbitrary bytes. Previously, it could either be a u8 or a char, and the way to construct more than a single-character literal was to combine it with a concatenation. The previous translator had the philosophy of, "always try to use Unicode representation for a literal, regardless of whether Unicode mode is enabled or not." For example, using the old regex-debug tool (which in turn uses regex-syntax 0.6):

$ regex-debug hir 'a'
Hir {
    kind: Literal(
        Unicode(
            'a',
        ),
    ),
    info: HirInfo {
        bools: 1537,
    },
}

$ regex-debug hir '(?-u)a'
Hir {
    kind: Literal(
        Unicode(
            'a',
        ),
    ),
    info: HirInfo {
        bools: 1537,
    },
}

However, if the current translator had retained the "literal is either Unicode or arbitrary bytes," then an optimization like this could inspect the literal type and not try to go from Unicode to bytes or vice versa. But that information (along with whether Unicode is enabled or not) is erased by the time you get to this point. Erasing information is the point of the HIR, but that does sometimes result in difficult scenarios like this.

There are also other manifestations of this erasure of knowledge. For example, a regex like (?-u:a|b|c) will create a Unicode class even though Unicode mode is disabled. Things can get even weirder, for example, \s|(?-u:[a-z]) gets smushed into a single Unicode class despite the fact that (?-u:[a-z]) has Unicode disabled. Now, these are also arguably bugs with respect to the documented guarantee, but it's likely that they pose less of an issue in practice.

So I guess there are three options.

The first is, as you say, API changes in order to keep this optimization some-how without breaking the documented guarantee that you get a Unicode class only when Unicode mode is enabled. I'm not exactly sure of the nature of the API changes. I'm sure we could brainstorm a simple-but-ugly change. But I think I would want to explore a change that makes Literal carry through whether Unicode mode is enabled or not. Then I think we could fix this optimization to be correct. The problem is that I don't foresee myself having the energy to do this sort of change any time soon.

The second would be to remove the optimization. It's not particularly critical and we could at least keep it for ASCII bytes... It could still flip a byte class to a Unicode class, but it would be limited to ASCII (I believe).

The third would be to keep the optimization and document this in the API docs a wontfix bug or a clarification to the existing phrasing.

I am somewhat inclined towards the third option. It feels somewhat unnecessarily restrictive to limit the smart constructors to "stay in their Unicode lane." That might be somewhat more compelling if we supported something other than UTF-8, but that's already pretty firmly baked into the entire family of regex crates. Another way of looking at this is that it was a mistake on my part to provide the guarantee that the u flag guarantees a Unicode class and the lack of the u flag guarantees a byte class. Namely, part of the knowledge being erased by HIR translation is exactly all of the flags in the regex syntax. So having this one particular flag poke through a subset of the HIR (classes but not literals) does seem a bit weird. And it would be nice to have the flexibility for the smart constructors to build whatever class is appropriate there.


How much does this bug impact you? For regex-automata, this sort of thing doesn't matter because everything just get compiled down to a byte-oriented Thompson NFA. Basically, are you reporting this because you noticed an inconsistency or are you hitting a "real" problem?

@BurntSushi BurntSushi added the bug label Sep 8, 2023
@plusvic
Copy link
Contributor Author

plusvic commented Sep 8, 2023

In my case it doesn't seem to be that important. It implies that I must implement support for Unicode character classes, but that's something that I had to do anyways because I want to support Unicode regexps in general. Until now I was rejecting unicode regexps, and assuming that Class::Unicode could not be found in the HIR. The assumption seemed correct in practice, until I found the only regexp in my collection that triggers this condition. One in tens of thousands!

So, I support the option of simply changing the documentation to avoid confusion about the offered guarantees.

@BurntSushi
Copy link
Member

Perfect. I'm sure you know this, but another path forward would be to compile Unicode classes as an alternation of byte literals at whatever point you convert an Hir into a state machine. (Of course, I only mean to suggest this as a stop-gap. I don't know Yara's implementation details, but it would likely be disastrous to use this strategy for a Thompson NFA.)

@plusvic
Copy link
Contributor Author

plusvic commented Sep 8, 2023

I just discovered another gem that could make my life easier: https://docs.rs/regex-syntax/latest/regex_syntax/utf8/index.html

TBH, it looks like half the YARA Rust implementation will be your code: aho-corasick, regex-syntax, memchr. All fundamental pieces.

@BurntSushi
Copy link
Member

Yes indeedy. That module is precisely to aide in building byte oriented UTF-8 automata from Unicode classes.

The only other missing piece of the puzzle are optimizations. The regex-automata NFA compiler streams the output of regex-syntax::utf8 into Daciuk's algorithm. The high level entry point for that in the NFA compiler is here:

// In the forward direction, we always shrink our UTF-8 automata
// because we can stream it right into the UTF-8 compiler. There
// is almost no downside (in either memory or time) to using this
// approach.
let mut builder = self.builder.borrow_mut();
let mut utf8_state = self.utf8_state.borrow_mut();
let mut utf8c =
Utf8Compiler::new(&mut *builder, &mut *utf8_state)?;
for rng in cls.iter() {
for seq in Utf8Sequences::new(rng.start(), rng.end()) {
utf8c.add(seq.as_slice())?;
}
}
utf8c.finish()

Are you sure you can't use the NFA compiler while you're at it? :P

I will just add one other thing: I would advise not going down the Unicode route without really good reason. It is an absolute nightmare. It tends to be an abstraction/optimization-busting feature. I realize the pull is basically irresistible, but you've been warned. :-)

@plusvic
Copy link
Contributor Author

plusvic commented Sep 8, 2023

I think I'll take the advice and focus on more pressuring matters before going down the Unicode rabbit hole. I have my own Unicode-inflicted scars from the past :-)

@BurntSushi
Copy link
Member

I decided on going with option 3 above and just documenting that the u flag doesn't necessarily guarantee anything about which HIR class is used. (I plan to incorporate this into a regex-syntax 0.8 release, although I'm not 100% certain when it will be out.)

There will also be some minor breaking changes to the Ast data type to resolve #1090. I'm not sure if you're using Ast or not though.

@plusvic
Copy link
Contributor Author

plusvic commented Oct 4, 2023

I'm not using the Ast type at the moment, I compile regexps directly to Hir.

BurntSushi added a commit that referenced this issue Oct 6, 2023
Basically, we never should have guaranteed that a particular HIR would
(or wouldn't) be used if the 'u' flag was present (or absent). Such a
guarantee generally results in too little flexibility, particularly when
it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do
and would require rejiggering some of the HIR types. For example, we
would probably need a literal that is an enum of `&str` or `&[u8]` that
correctly preserves the Unicode flag. This in turn comes with a bigger
complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for
any kind of HIR regardless of what the flags are. I feel somewhat
justified in this position due to the fact that part of the point of the
HIR is to erase all of the regex flags so that callers no longer need to
worry about them. That is, the erasure is the point that provides a
simplification for everyone downstream.

Closes #1088
BurntSushi added a commit that referenced this issue Oct 6, 2023
Basically, we never should have guaranteed that a particular HIR would
(or wouldn't) be used if the 'u' flag was present (or absent). Such a
guarantee generally results in too little flexibility, particularly when
it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do
and would require rejiggering some of the HIR types. For example, we
would probably need a literal that is an enum of `&str` or `&[u8]` that
correctly preserves the Unicode flag. This in turn comes with a bigger
complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for
any kind of HIR regardless of what the flags are. I feel somewhat
justified in this position due to the fact that part of the point of the
HIR is to erase all of the regex flags so that callers no longer need to
worry about them. That is, the erasure is the point that provides a
simplification for everyone downstream.

Closes #1088
BurntSushi added a commit that referenced this issue Oct 8, 2023
Basically, we never should have guaranteed that a particular HIR would
(or wouldn't) be used if the 'u' flag was present (or absent). Such a
guarantee generally results in too little flexibility, particularly when
it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do
and would require rejiggering some of the HIR types. For example, we
would probably need a literal that is an enum of `&str` or `&[u8]` that
correctly preserves the Unicode flag. This in turn comes with a bigger
complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for
any kind of HIR regardless of what the flags are. I feel somewhat
justified in this position due to the fact that part of the point of the
HIR is to erase all of the regex flags so that callers no longer need to
worry about them. That is, the erasure is the point that provides a
simplification for everyone downstream.

Closes #1088
BurntSushi added a commit that referenced this issue Oct 9, 2023
Basically, we never should have guaranteed that a particular HIR would
(or wouldn't) be used if the 'u' flag was present (or absent). Such a
guarantee generally results in too little flexibility, particularly when
it comes to HIR's smart constructors.

We could probably uphold that guarantee, but it's somewhat gnarly to do
and would require rejiggering some of the HIR types. For example, we
would probably need a literal that is an enum of `&str` or `&[u8]` that
correctly preserves the Unicode flag. This in turn comes with a bigger
complexity cost in various rewriting rules.

In general, it's much simpler to require the caller to be prepared for
any kind of HIR regardless of what the flags are. I feel somewhat
justified in this position due to the fact that part of the point of the
HIR is to erase all of the regex flags so that callers no longer need to
worry about them. That is, the erasure is the point that provides a
simplification for everyone downstream.

Closes #1088
plusvic added a commit to VirusTotal/yara-x that referenced this issue Mar 18, 2024
…classes.

Unicode classes can appear even on regexps that were compiled without unicode support. This a well-known issue with the `regex-syntax` crate, and we should be able to handle it. See: rust-lang/regex#1088
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants