Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build is failing in 1.8 because a Unicode feature isn't enabled, but previously worked in 1.7 #982

Closed
DevNulPavel opened this issue Apr 21, 2023 · 15 comments
Labels

Comments

@DevNulPavel
Copy link

Hello

When i updated regex version from 1.7.3 up to 1.8.0 i've received a problem with regex parsing.

Output:

(\d+\s?(years|year|y))?\s?(\d+\s?(months|month|m))?\s?(\d+\s?(weeks|week|w))?\s?(\d+\s?(days|day|d))?\s?(\d+\s?(hours|hour|h))?
       ^^
error: Unicode-aware Perl class not found (make sure the unicode-perl feature is enabled)

Even if i enable unicode feature manually it still does not work:

regex = {version = "1.8", features = ["unicode"]}

But if i downgrade version to 1.7.3, everything works good:

regex = "~1.7.3"
@BurntSushi
Copy link
Member

Could you please include a complete reproduction here? I cannot reproduce it:

use regex::Regex;

fn main() {
    let pat = r"(\d+\s?(years|year|y))?\s?(\d+\s?(months|month|m))?\s?(\d+\s?(weeks|week|w))?\s?(\d+\s?(days|day|d))?\s?(\d+\s?(hours|hour|h))?";
    let result = Regex::new(pat);
    assert!(result.is_ok());
}

And my Cargo.toml:

[package]
name = "i982"
version = "0.1.0"
edition = "2021"

[dependencies]
regex = "1.8.0"

[[bin]]
name = "i982"
path = "main.rs"

And the result:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/i982`
$

Trying to work backwards from the error, the message you're getting is here:

UnicodePerlClassNotFound => {
"Unicode-aware Perl class not found \
(make sure the unicode-perl feature is enabled)"
}

Which corresponds to this error kind variant:

/// This occurs when a Unicode-aware Perl character class (`\w`, `\s` or
/// `\d`) could not be found. This can occur when the `unicode-perl`
/// crate feature is not enabled.
UnicodePerlClassNotFound,

This error variant is constructed in precisely one place in non-test code:

unicode::Error::PerlClassNotFound => {
self.error(sp, ErrorKind::UnicodePerlClassNotFound)
}

Which occurs only when an internal Unicode error occurs indicating that a Perl class could not be found:

PerlClassNotFound,

That variant in turn is constructed in three places. Here:

#[cfg(not(feature = "unicode-perl"))]
fn imp() -> Result<hir::ClassUnicode, Error> {
Err(Error::PerlClassNotFound)
}

here:

#[cfg(not(any(feature = "unicode-perl", feature = "unicode-bool")))]
fn imp() -> Result<hir::ClassUnicode, Error> {
Err(Error::PerlClassNotFound)
}

and here:

#[cfg(not(any(feature = "unicode-perl", feature = "unicode-gencat")))]
fn imp() -> Result<hir::ClassUnicode, Error> {
Err(Error::PerlClassNotFound)
}

You use \d and \s, but not \w, so it can't be the first. This must mean that, somehow, one of unicode-perl, unicode-bool or unicode-gencat is not present when building regex-syntax. But the default is to enable those features.

So I can't quite figure out what's going on here without more details. Please do try to include a complete reproduction here. Thanks.

@BratSinot
Copy link

BratSinot commented Apr 21, 2023

tracing-subscriber have regex in deps and after upgrades to 1.8 have same error:
image

Downgrade to 1.7 work fine.

tokio-rs/tracing#2565

@BurntSushi
Copy link
Member

Folks, I need a reproduction. Please find a way to hand me a set of commands and source code that I can enter and run. As I mentioned above, I still don't understand how this error is appearing.

@BurntSushi
Copy link
Member

@BratSinot And the error you're showing is different from the error in the OP. It's not the same.

@BurntSushi
Copy link
Member

BurntSushi commented Apr 21, 2023

I have a guess as to what is happening. This PR tipped me off to it: tokio-rs/tracing#2566

Unfortunately, this is one of those cases where folks were able to under specify regex crate features and still have their patterns work even though the features specified weren't actually sufficient to do so. The way this manifests is likely going to be a little different in each case, but taking tracing-subscriber as one example... I ran cargo tree on it and noticed that it depends on matchers and that in turn depends on regex-automata 0.1. That in turn depends on regex-syntax 0.6. And it just brings in all of regex-syntax's Unicode features unconditionally.

Now tracing-subscriber also depends on regex directly, and note that it disables all features except for std: https://github.com/tokio-rs/tracing/blob/8aae1c37b091963aafdd336b1168fe5a24c0b4f0/tracing-subscriber/Cargo.toml#L72

But because of how feature unification works in Cargo, this means that even though the regex dependency on tracing-subscriber doesn't ask for anything more than std, in practice, it ends up pulling in everything because of the transitive dependency on regex-syntax with its default features enabled. This in turn permitted tracing-subscriber to use a regex with Unicode case insensitive matching enabled, even though the regex dependency did not enable the required features to do so. It worked because the required features were enabled implicitly via a transitive dependency.

With the release of regex-syntax 0.7, the feature unification no longer applies because regex-syntax 0.6 and regex-syntax 0.7 are semver incompatible. As a result, regex 1.8 now really is only using std in tracing-subscriber, and that in turn leads to a regex compilation error because the required features are not enabled. In other words, the dependency specification in tracing-subscriber is incorrect but "happened" to work.

So... probably everything is actually working correctly here and folks are just now getting hit over the head with the fact that they under-specified the regex crate features.

If you think this analysis is wrong or have a case where this doesn't apply, please let me know. And please provide as many details as possible. Ideally a reproduction.

@brymko
Copy link

brymko commented Apr 21, 2023

Here is a minimal repro:
(edit: set regex-syntax version to 0.7)

Cargo.toml:

[package]
name = "regex_repro"
version = "0.1.0"
edition = "2021"

[dependencies]
regex = { version = "=1.8.1", default_features = false, features = ["std"] }
regex-syntax = { version = "0.7", default_features = false }

main.rs:

use regex::Regex;

fn main() {
    let pat = r"(\d+\s?(years|year|y))?\s?(\d+\s?(months|month|m))?\s?(\d+\s?(weeks|week|w))?\s?(\d+\s?(days|day|d))?\s?(\d+\s?(hours|hour|h))?";
    let result = Regex::new(pat).unwrap();
}

Ouput:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    (\d+\s?(years|year|y))?\s?(\d+\s?(months|month|m))?\s?(\d+\s?(weeks|week|w))?\s?(\d+\s?(days|day|d))?\s?(\d+\s?(hours|hour|h))?
     ^^
error: Unicode-aware Perl class not found (make sure the unicode-perl feature is enabled)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', src/main.rs:5:34
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Judging from your comment above this looks like it's one of those times where we have relied on incorrect assumptions. Even tho it hurts in the short term I'd be in favor of correctness over legacy hacks and backwards compatibility hacks like you see in webkit.

@BurntSushi
Copy link
Member

So I actually think that example isn't quite right. You want default-features = true for regex-syntax 0.6. And you do want 0.6 in there. Basically, the key is that you want something that works with regex 1.7.3 but doesn't with regex 1.8 without changing the dependencies. Here:

[dependencies]
regex = { version = "1.6.0", default_features = false, features = ["std"] }
regex-syntax = "0.6.29"

With the same main.rs as you, running that gives:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/regexdepissue`
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Syntax(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
regex parse error:
    (\d+\s?(years|year|y))?\s?(\d+\s?(months|month|m))?\s?(\d+\s?(weeks|week|w))?\s?(\d+\s?(days|day|d))?\s?(\d+\s?(hours|hour|h))?
     ^^
error: Unicode-aware Perl class not found (make sure the unicode-perl feature is enabled)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
)', main.rs:5:34
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Now let's downgrade to regex 1.7.3 via the lock file:

$ cargo update -p regex --precise 1.7.3
    Updating crates.io index
 Downgrading regex v1.8.1 -> v1.7.3
    Removing regex-syntax v0.7.1

And re-run:

$ cargo run                            
   Compiling regex v1.7.3
   Compiling regexdepissue v0.1.0 (/tmp/regexdepissue)
    Finished dev [unoptimized + debuginfo] target(s) in 0.54s
     Running `target/debug/regexdepissue`
$

(I've snipped out unrelated warnings from the above output.

Anyway, that's the right repro, because it simulates real world conditions:

  1. You've got a dependency on regex with its crate features under specified.
  2. Somewhere else there is a dependency on regex-syntax that enables everything.
  3. The under specified features in (1) gets papered over by feature unification.
  4. Everything works.

And then when you upgrade to regex 1.8, you get regex-syntax 0.7 and the feature unification no longer applies.

@BurntSushi BurntSushi changed the title Version 1.8.0, regex problem build is failing in 1.8 because a Unicode feature isn't enabled, but previously worked in 1.7 Apr 21, 2023
@dbidwell94
Copy link

I hate to say it, but wouldn't the latest change then be considered "breaking" if features were changed which causes all these cascading failures with different crates?

@BurntSushi
Copy link
Member

BurntSushi commented Apr 21, 2023

The features weren't changed. I'm confused as to why you're saying that. As I said above, the issue is that some were implicitly relying on Cargo feature unification.

The only cause of this is that I put out a new breaking change release of regex-syntax. That's all it took to expose the fact that some folks under-specified their dependency on the regex crate. There's really nothing I can do short of yanking regex 1.8 and releasing regex 2.0.0. I have no plans to do that, especially since there were no semver incompatible changes made in regex.

@dbidwell94
Copy link

Gotcha,that makes sense for sure. This is the first time actually looking into the inner workings of this crate so I didn't realize the breaking change was a dependency and not the actual crate. Thanks for clarifying :)

@BurntSushi
Copy link
Member

Yes, I suppose this is a case study in some costs that were previously not anticipated:

  1. The fact that regex-syntax even exists is really a property of how Cargo and the crates.io ecosystem works. Most regex engines do not expose a separately versioned library of their parser. But I did it because it's "virtually free" to do given the existence of tools like Cargo, and in exchange, other folks can (and do) get access to a battle tested regex parser whose semantics precisely match that of the regex crate. And since it's separately versioned, breaking changes can be made to it without putting out new breaking change releases of regex proper. Needless to say, if regex-syntax didn't exist and was instead bundled into regex proper, then these build failures wouldn't have happened.
  2. The reason why the regex crate even has things like a unicode feature is because a lot of people have complained about the binary size and compilation time of the regex crate. So by being accommodating to that feedback, we've opened ourselves up to the effects of misconfiguration like the ones seen here. If I had said, "no, your use case isn't worth adding all these crate features," then we wouldn't have had this problem.

Of course, if Cargo's feature unification worked in a different way, then we might not also have had this problem either. So I don't mean to say that (1) and (2) above are the only causes, but they are certainly part of the story here.

@dburgener
Copy link

Is there any automation that crate authors can use to detect these sorts of underspecified dependencies to help avoid these sorts of problems going forward?

@BurntSushi
Copy link
Member

No idea personally. That would be a good question for @epage though.

@epage - If you only read one comment in this thread to figure out what's going on, this is the one to read: #982 (comment)

@epage
Copy link

epage commented Apr 21, 2023

I can't think of a sure-fire way of detecting this.

  • cargo only sees the features and doesn't know about the Rust code.
  • In theory, you could have a "strict" mode that compiles a crate once per features, rather than unifying features, but that then makes interoperability impossible if you were meant to pass types between crates. I don't see a flag for this being stablized due to the limitations. Unsure how this would go over for a perma-unstable flag.
  • We could tell clippy about the features and have it warn if a #[cfg(feature = "unicode")] API item is used when the feature isn't activated but that won't help in this case because unicode changes behavior rather than making an API item available
  • Someone could write a tool to report what features of your dependencies are being activated elsewhere but if we recurse to catch the regex / regex-syntax case, that will likely be noisy and would be a lot more complicated to write.

If it isn't already there, we should probably explicitly document this compatibility hazard. I definitely don't think it should be considered a major breaking change. I think it shouldn't even be a minor breaking change. Its just a hazard.

@BurntSushi
Copy link
Member

I'm going to close this because I think the root cause is known and folks have fixed it. I also don't think there is really much that can be done here. It is apparently neither possible to fully guard against this and nor is it feasible to never put out a breaking change release of regex-syntax.

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale May 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants