panicked at 'not a char boundary' when using Regex::replace #1006

WIZeaz · 2023-06-05T12:05:12Z

What version of regex are you using?

v1.8.3

Describe the bug at a high level.

I have fuzzed the regex package with afl.rs, and afl reports some cases panicked at "char boundary":

thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'ο' (bytes 0..2) of `ο00000000000`', <my_project_path>/regex/src/re_unicode.rs:574:31

It seems replace function have some problems with utf-8 character. I have found such panic in several fuzz drivers.

What are the steps to reproduce the behavior?

fn main() {
    let re = regex::Regex::new(r"\B|00(?-u)\B").unwrap();
    let text = r"𐾁00000000";
    let rep = r"0𐾁Ű000ο";
    let _ = re.replace(text, rep);
}

fn main(){
    let re = regex::Regex::new(r"\B|00(?-u)0\B").unwrap();
    let text = "ο";
    let rep = "000";
    let _ = re.replace(text, rep);
}

fn main(){
    let re = regex::Regex::new(r"(()(?-u)\B)0|\B").unwrap();
    let _ = re.replace("Ԑ0000000000000" ,"000000000000000");
}

fn main(){
    let re = regex::Regex::new(r"0()|\B|(?-u)\B0").unwrap();
    let _ = re.replace("ⳅ000000000000" ,"000000000000000");
}

fn main(){
    let re = regex::Regex::new(r"\B|(?-u)\B0").unwrap();
    let _ = re.replace_all("00000^睓00" ,"00000000000");
}

I also found that if I replace "replace_all" with "replace", the last case will not panic. The rest of these will panic after being replaced. It seems weird.

What is the actual behavior?

Here is an error message sample; the error messages in these cases are similar.

thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'ο' (bytes 0..2) of `ο00000000000`', <my_project_path>/regex/src/re_unicode.rs:574:31

What is the expected behavior?

Regex::replace should deal with utf-8 character correctly.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2023-06-05T12:36:04Z

It looks like this is a new bug introduced by regex 1.8. Prior versions return an error for such regexes:

$ cargo run
   Compiling aho-corasick v0.7.20
   Compiling regex-syntax v0.6.29
   Compiling regex v1.7.3
   Compiling i1006 v0.1.0 (/home/andrew/tmp/issues/regex/i1006)
    Finished dev [unoptimized + debuginfo] target(s) in 1.50s
     Running `target/debug/i1006`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    \B|00(?-u)0\B
               ^^
error: pattern can match invalid UTF-8
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', main.rs:2:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This is fixed by #978 more generally by allowing the regex to compile but not reporting a match anywhere. In older versions of this crate, things like (?-u:\B) were forbidden in Unicode regexes because it can match between code units in UTF-8 encoded text. But that's technically also true of .* as well (since it can match any empty string). So #978 takes a more robust perspective on this issue and generally permits regexes that could match between code units, but adjusts the matching engines to not report such matches.

This regex failed to compile in `regex <1.8`, but the migration to regex-automata tweaked the rules in a subtle way that permitted it to compile despite the fact that the old/status-quo matching engines can't handle it correctly. By that, I mean that they may permit the \B to match between code units. That in turn results in panicking when slicing a &str. In `regex 1.9`, this regex will actually be able to be compiled, but the matching engines will correctly and robustly never report matches that split UTF-8 code units. For now, we just add code that causes `regex 1.8` to have the same behavior as previous releases. Fixes #1006

BurntSushi · 2023-06-05T12:40:25Z

I put up #1007 to make regex 1.8 also return a regex compilation error, like in previous versions.

This regex failed to compile in `regex <1.8`, but the migration to regex-automata tweaked the rules in a subtle way that permitted it to compile despite the fact that the old/status-quo matching engines can't handle it correctly. By that, I mean that they may permit the \B to match between code units. That in turn results in panicking when slicing a &str. In `regex 1.9`, this regex will actually be able to be compiled, but the matching engines will correctly and robustly never report matches that split UTF-8 code units. For now, we just add code that causes `regex 1.8` to have the same behavior as previous releases. Fixes #1006

addisoncrump · 2023-06-05T12:59:33Z

Can we add replace to the general fuzz harness for this case? Seems to only happen in replace, which I didn't write a fuzzer for.

BurntSushi · 2023-06-05T13:05:09Z

Probably just making sure that Match::as_str works is enough. But adding replace seems fine.

I'd prefer is we waited until #978 was merged to do more work on the fuzzers.

BurntSushi · 2023-06-05T13:05:41Z

This is fixed in regex 1.8.4 on crates.io.

This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | patch | `1.8.1` -> `1.8.4` | --- ### Release Notes <details> <summary>rust-lang/regex</summary> ### [`v1.8.4`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#184-2023-06-05) [Compare Source](rust-lang/regex@1.8.3...1.8.4) \================== This is a patch release that fixes a bug where `(?-u:\B)` was allowed in Unicode regexes, despite the fact that the current matching engines can report match offsets between the code units of a single UTF-8 encoded codepoint. That in turn means that match offsets that split a codepoint could be reported, which in turn results in panicking when one uses them to slice a `&str`. This bug occurred in the transition to `regex 1.8` because the underlying syntactical error that prevented this regex from compiling was intentionally removed. That's because `(?-u:\B)` will be permitted in Unicode regexes in `regex 1.9`, but the matching engines will guarantee to never report match offsets that split a codepoint. When the underlying syntactical error was removed, no code was added to ensure that `(?-u:\B)` didn't compile in the `regex 1.8` transition release. This release, `regex 1.8.4`, adds that code such that `Regex::new(r"(?-u:\B)")` returns to the `regex <1.8` behavior of not compiling. (A `bytes::Regex` can still of course compile it.) Bug fixes: - [BUG #1006](rust-lang/regex#1006): Fix a bug where `(?-u:\B)` was allowed in Unicode regexes, and in turn could lead to match offsets that split a codepoint in `&str`. ### [`v1.8.3`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#183-2023-05-25) [Compare Source](rust-lang/regex@1.8.2...1.8.3) \================== This is a patch release that fixes a bug where the regex would report a match at every position even when it shouldn't. This could occur in a very small subset of regexes, usually an alternation of simple literals that have particular properties. (See the issue linked below for a more precise description.) Bug fixes: - [BUG #999](rust-lang/regex#999): Fix a bug where a match at every position is erroneously reported. ### [`v1.8.2`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#182-2023-05-22) [Compare Source](rust-lang/regex@1.8.1...1.8.2) \================== This is a patch release that fixes a bug where regex compilation could panic in debug mode for regexes with large counted repetitions. For example, `a{2147483516}{2147483416}{5}` resulted in an integer overflow that wrapped in release mode but panicking in debug mode. Despite the unintended wrapping arithmetic in release mode, it didn't cause any other logical bugs since the errant code was for new analysis that wasn't used yet. Bug fixes: - [BUG #995](rust-lang/regex#995): Fix a bug where regex compilation with large counted repetitions could panic. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).  Co-authored-by: cabr2-bot <[email protected]> Co-authored-by: crapStone <[email protected]> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1912 Reviewed-by: crapStone <[email protected]> Co-authored-by: Calciumdibromid Bot <[email protected]> Co-committed-by: Calciumdibromid Bot <[email protected]>

BurntSushi mentioned this issue Jun 5, 2023

compile: make Regex::new(r"(?-u:\B)") fail again #1007

Merged

BurntSushi closed this as completed in #1007 Jun 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

panicked at 'not a char boundary' when using Regex::replace #1006

panicked at 'not a char boundary' when using Regex::replace #1006

WIZeaz commented Jun 5, 2023 •

edited

Loading

BurntSushi commented Jun 5, 2023

BurntSushi commented Jun 5, 2023

addisoncrump commented Jun 5, 2023

BurntSushi commented Jun 5, 2023

BurntSushi commented Jun 5, 2023

panicked at 'not a char boundary' when using Regex::replace #1006

panicked at 'not a char boundary' when using Regex::replace #1006

Comments

WIZeaz commented Jun 5, 2023 • edited Loading

What version of regex are you using?

Describe the bug at a high level.

What are the steps to reproduce the behavior?

What is the actual behavior?

What is the expected behavior?

BurntSushi commented Jun 5, 2023

BurntSushi commented Jun 5, 2023

addisoncrump commented Jun 5, 2023

BurntSushi commented Jun 5, 2023

BurntSushi commented Jun 5, 2023

WIZeaz commented Jun 5, 2023 •

edited

Loading