Unexpected unicode character class in HIR generated without unicode support #1088

plusvic · 2023-09-08T10:27:06Z

What version of regex are you using?

Using regex-syntax 0.7.5.

Describe the bug at a high level.

When generating the HIR for certain regular expressions with unicode support turned off, the resulting HIR may contain Unicode character classes. I'm not sure if this is a bug or the intended behaviour, but the documentation seems to suggest that this is not expected. Specifically, the documentation for hir::Class says:

A character class corresponds to a set of characters. A character is either defined by a Unicode scalar value or a byte. Unicode characters are used by default, while bytes are used when Unicode mode (via the u flag) is disabled.

I assumed that the HIR produced without unicode support will contain character classes of the Class::Bytes variant alone. However this is not the case.

What are the steps to reproduce the behavior?

Consider this example:

use regex_syntax; // 0.7.5

fn main() {
    let mut parser = regex_syntax::ParserBuilder::new()
        .utf8(false)
        .unicode(false)
        .build();
        
    let hir = parser.parse(r"(a|\xc2\xa0)");

    println!("{:?}", hir);
}

It produces the following output:

Ok(Capture(Capture { index: 1, name: None, sub: Class({'a'..='a', '\u{a0}'..='\u{a0}'}) }))

Here sub is a class of the Class::Unicode variant.

What is the expected behavior?

I was expecting that (a|\xc2\xa0) is represented as an alternation of two literals, not as a Class::Unicode

The text was updated successfully, but these errors were encountered:

plusvic · 2023-09-08T10:51:00Z

The magic happens at:
https://github.com/rust-lang/regex/blob/master/regex-syntax/src/hir/mod.rs#L586C1-L597C10

        if let Some(singletons) = singleton_chars(&new) {
            let it = singletons
                .into_iter()
                .map(|ch| ClassUnicodeRange { start: ch, end: ch });
            return Hir::class(Class::Unicode(ClassUnicode::new(it)));
        }
        if let Some(singletons) = singleton_bytes(&new) {
            let it = singletons
                .into_iter()
                .map(|b| ClassBytesRange { start: b, end: b });
            return Hir::class(Class::Bytes(ClassBytes::new(it)));
        }

The alternation is replaced with a character class when all the alternatives are a single unicode character or byte. But this code snippet doesn't take into account whether unicode mode is disabled or not. Is this intended?

Unfortunately, this is done in the Hir::alternation associated function, where there's no context about whether unicode mode is enabled or not. The caller of Hir::alternation knows this information via self.flags().unicode(), but Hir::alternation itself does not. So, the fix is not straightforward, it seems to require some API changes.

BurntSushi · 2023-09-08T14:04:18Z

Hmmmm, yes, this is quite the predicament. The idea is that there should be a bright dividing line between the HIR's smart constructors and all of the configuration that goes into the translator. I specifically tried to design the constructors to avoid needing to pass in a bunch of little configuration knobs that could easily (IMO) explode to make it more difficult to reason about. The idea is that these sorts of smart constructors would always do valid transformations.

One simplification I did in the most recent release was to change the Literal HIR kind to just be an owned slice of arbitrary bytes. Previously, it could either be a u8 or a char, and the way to construct more than a single-character literal was to combine it with a concatenation. The previous translator had the philosophy of, "always try to use Unicode representation for a literal, regardless of whether Unicode mode is enabled or not." For example, using the old regex-debug tool (which in turn uses regex-syntax 0.6):

$ regex-debug hir 'a'
Hir {
    kind: Literal(
        Unicode(
            'a',
        ),
    ),
    info: HirInfo {
        bools: 1537,
    },
}

$ regex-debug hir '(?-u)a'
Hir {
    kind: Literal(
        Unicode(
            'a',
        ),
    ),
    info: HirInfo {
        bools: 1537,
    },
}

However, if the current translator had retained the "literal is either Unicode or arbitrary bytes," then an optimization like this could inspect the literal type and not try to go from Unicode to bytes or vice versa. But that information (along with whether Unicode is enabled or not) is erased by the time you get to this point. Erasing information is the point of the HIR, but that does sometimes result in difficult scenarios like this.

There are also other manifestations of this erasure of knowledge. For example, a regex like (?-u:a|b|c) will create a Unicode class even though Unicode mode is disabled. Things can get even weirder, for example, \s|(?-u:[a-z]) gets smushed into a single Unicode class despite the fact that (?-u:[a-z]) has Unicode disabled. Now, these are also arguably bugs with respect to the documented guarantee, but it's likely that they pose less of an issue in practice.

So I guess there are three options.

The first is, as you say, API changes in order to keep this optimization some-how without breaking the documented guarantee that you get a Unicode class only when Unicode mode is enabled. I'm not exactly sure of the nature of the API changes. I'm sure we could brainstorm a simple-but-ugly change. But I think I would want to explore a change that makes Literal carry through whether Unicode mode is enabled or not. Then I think we could fix this optimization to be correct. The problem is that I don't foresee myself having the energy to do this sort of change any time soon.

The second would be to remove the optimization. It's not particularly critical and we could at least keep it for ASCII bytes... It could still flip a byte class to a Unicode class, but it would be limited to ASCII (I believe).

The third would be to keep the optimization and document this in the API docs a wontfix bug or a clarification to the existing phrasing.

I am somewhat inclined towards the third option. It feels somewhat unnecessarily restrictive to limit the smart constructors to "stay in their Unicode lane." That might be somewhat more compelling if we supported something other than UTF-8, but that's already pretty firmly baked into the entire family of regex crates. Another way of looking at this is that it was a mistake on my part to provide the guarantee that the u flag guarantees a Unicode class and the lack of the u flag guarantees a byte class. Namely, part of the knowledge being erased by HIR translation is exactly all of the flags in the regex syntax. So having this one particular flag poke through a subset of the HIR (classes but not literals) does seem a bit weird. And it would be nice to have the flexibility for the smart constructors to build whatever class is appropriate there.

How much does this bug impact you? For regex-automata, this sort of thing doesn't matter because everything just get compiled down to a byte-oriented Thompson NFA. Basically, are you reporting this because you noticed an inconsistency or are you hitting a "real" problem?

plusvic · 2023-09-08T14:36:53Z

In my case it doesn't seem to be that important. It implies that I must implement support for Unicode character classes, but that's something that I had to do anyways because I want to support Unicode regexps in general. Until now I was rejecting unicode regexps, and assuming that Class::Unicode could not be found in the HIR. The assumption seemed correct in practice, until I found the only regexp in my collection that triggers this condition. One in tens of thousands!

So, I support the option of simply changing the documentation to avoid confusion about the offered guarantees.

BurntSushi · 2023-09-08T14:40:00Z

Perfect. I'm sure you know this, but another path forward would be to compile Unicode classes as an alternation of byte literals at whatever point you convert an Hir into a state machine. (Of course, I only mean to suggest this as a stop-gap. I don't know Yara's implementation details, but it would likely be disastrous to use this strategy for a Thompson NFA.)

plusvic · 2023-09-08T14:54:05Z

I just discovered another gem that could make my life easier: https://docs.rs/regex-syntax/latest/regex_syntax/utf8/index.html

TBH, it looks like half the YARA Rust implementation will be your code: aho-corasick, regex-syntax, memchr. All fundamental pieces.

BurntSushi · 2023-09-08T15:03:34Z

Yes indeedy. That module is precisely to aide in building byte oriented UTF-8 automata from Unicode classes.

The only other missing piece of the puzzle are optimizations. The regex-automata NFA compiler streams the output of regex-syntax::utf8 into Daciuk's algorithm. The high level entry point for that in the NFA compiler is here:

regex/regex-automata/src/nfa/thompson/compiler.rs

Lines 1431 to 1444 in 061ee81

    
           // In the forward direction, we always shrink our UTF-8 automata 
        
           // because we can stream it right into the UTF-8 compiler. There 
        
           // is almost no downside (in either memory or time) to using this 
        
           // approach. 
        
           let mut builder = self.builder.borrow_mut(); 
        
           let mut utf8_state = self.utf8_state.borrow_mut(); 
        
           let mut utf8c = 
        
               Utf8Compiler::new(&mut *builder, &mut *utf8_state)?; 
        
           for rng in cls.iter() { 
        
               for seq in Utf8Sequences::new(rng.start(), rng.end()) { 
        
                   utf8c.add(seq.as_slice())?; 
        
               } 
        
           } 
        
           utf8c.finish()

Are you sure you can't use the NFA compiler while you're at it? :P

I will just add one other thing: I would advise not going down the Unicode route without really good reason. It is an absolute nightmare. It tends to be an abstraction/optimization-busting feature. I realize the pull is basically irresistible, but you've been warned. :-)

plusvic · 2023-09-08T15:33:27Z

I think I'll take the advice and focus on more pressuring matters before going down the Unicode rabbit hole. I have my own Unicode-inflicted scars from the past :-)

BurntSushi · 2023-10-03T20:32:11Z

I decided on going with option 3 above and just documenting that the u flag doesn't necessarily guarantee anything about which HIR class is used. (I plan to incorporate this into a regex-syntax 0.8 release, although I'm not 100% certain when it will be out.)

There will also be some minor breaking changes to the Ast data type to resolve #1090. I'm not sure if you're using Ast or not though.

plusvic · 2023-10-04T07:47:15Z

I'm not using the Ast type at the moment, I compile regexps directly to Hir.

Basically, we never should have guaranteed that a particular HIR would (or wouldn't) be used if the 'u' flag was present (or absent). Such a guarantee generally results in too little flexibility, particularly when it comes to HIR's smart constructors. We could probably uphold that guarantee, but it's somewhat gnarly to do and would require rejiggering some of the HIR types. For example, we would probably need a literal that is an enum of `&str` or `&[u8]` that correctly preserves the Unicode flag. This in turn comes with a bigger complexity cost in various rewriting rules. In general, it's much simpler to require the caller to be prepared for any kind of HIR regardless of what the flags are. I feel somewhat justified in this position due to the fact that part of the point of the HIR is to erase all of the regex flags so that callers no longer need to worry about them. That is, the erasure is the point that provides a simplification for everyone downstream. Closes #1088

…classes. Unicode classes can appear even on regexps that were compiled without unicode support. This a well-known issue with the `regex-syntax` crate, and we should be able to handle it. See: rust-lang/regex#1088

BurntSushi added the bug label Sep 8, 2023

BurntSushi mentioned this issue Oct 3, 2023

consider providing \< and \> #469

Closed

BurntSushi added the fix-incoming label Oct 8, 2023

BurntSushi closed this as completed in 536cf70 Oct 9, 2023

dburgener mentioned this issue Oct 17, 2023

Release 0.20.1 lalrpop/lalrpop#842

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected unicode character class in HIR generated without unicode support #1088

Unexpected unicode character class in HIR generated without unicode support #1088

plusvic commented Sep 8, 2023

plusvic commented Sep 8, 2023 •

edited

Loading

BurntSushi commented Sep 8, 2023 •

edited

Loading

plusvic commented Sep 8, 2023

BurntSushi commented Sep 8, 2023

plusvic commented Sep 8, 2023

BurntSushi commented Sep 8, 2023

plusvic commented Sep 8, 2023

BurntSushi commented Oct 3, 2023

plusvic commented Oct 4, 2023

Unexpected unicode character class in HIR generated without unicode support #1088

Unexpected unicode character class in HIR generated without unicode support #1088

Comments

plusvic commented Sep 8, 2023

What version of regex are you using?

Describe the bug at a high level.

What are the steps to reproduce the behavior?

What is the expected behavior?

plusvic commented Sep 8, 2023 • edited Loading

BurntSushi commented Sep 8, 2023 • edited Loading

plusvic commented Sep 8, 2023

BurntSushi commented Sep 8, 2023

plusvic commented Sep 8, 2023

BurntSushi commented Sep 8, 2023

plusvic commented Sep 8, 2023

BurntSushi commented Oct 3, 2023

plusvic commented Oct 4, 2023

plusvic commented Sep 8, 2023 •

edited

Loading

BurntSushi commented Sep 8, 2023 •

edited

Loading