cli: include \w, \s and \d in Unicode data table generation
This was an oversight when porting the old generator shell
script to regex-cli. This hasn't been an issue because I don't think
we've generated data for a new release of Unicode with this new
infrastructure yet.

This was flagged by unit tests that failed because \d was no longer a
subset of \w.
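
The failing unit tests themselves are not shown in this commit. As a minimal sketch of the kind of check that would catch this, assuming the generated tables are sorted, non-overlapping slices of inclusive `(char, char)` ranges: the helper name `is_subset` and the tiny tables below are hypothetical and hand-written for illustration; they are not the generated Unicode data.

```rust
/// Returns true if every code point covered by `sub` is also covered by `sup`.
/// Both slices are assumed to hold inclusive ranges, sorted and non-overlapping.
fn is_subset(sub: &[(char, char)], sup: &[(char, char)]) -> bool {
    let mut i = 0;
    for &(lo, hi) in sub {
        // `need` is the next code point of this range that still needs cover.
        let mut need = lo as u32;
        let hi = hi as u32;
        loop {
            // Skip ranges in `sup` that end before the code point we need.
            while i < sup.len() && (sup[i].1 as u32) < need {
                i += 1;
            }
            // No range left, or a gap before `need`: not a subset.
            if i == sup.len() || (sup[i].0 as u32) > need {
                return false;
            }
            // The current `sup` range reaches the end of this `sub` range.
            if sup[i].1 as u32 >= hi {
                break;
            }
            // Otherwise continue just past the current `sup` range.
            need = sup[i].1 as u32 + 1;
        }
    }
    true
}

// Hypothetical excerpts, NOT the real generated tables.
const DECIMAL: &[(char, char)] = &[('0', '9'), ('\u{0660}', '\u{0669}')];
const WORD: &[(char, char)] = &[
    ('0', '9'),
    ('A', 'Z'),
    ('_', '_'),
    ('a', 'z'),
    ('\u{0660}', '\u{0669}'),
];
// A word table that is missing the digits, to model stale data.
const STALE_WORD: &[(char, char)] = &[('A', 'Z'), ('_', '_'), ('a', 'z')];

fn main() {
    println!("decimal within word: {}", is_subset(DECIMAL, WORD));
    println!("decimal within stale word: {}", is_subset(DECIMAL, STALE_WORD));
}
```

With tables that are out of sync, as modeled by `STALE_WORD`, the check returns false, which is the symptom described above.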
BurntSushi committed Sep 29, 2024
1 parent b790aa5 commit 7691e49
Showing 1 changed file with 17 additions and 0 deletions.
17 changes: 17 additions & 0 deletions regex-cli/cmd/generate/unicode.rs
@@ -84,6 +84,23 @@ USAGE:
gen(d.join("sentence_break.rs"), &["sentence-break", &ucd, "--chars"])?;
gen(d.join("word_break.rs"), &["word-break", &ucd, "--chars"])?;

// These generate the \w, \d and \s Unicode-aware character classes for
// regex-syntax. \d and \s are technically part of the general category
// and boolean properties generated above. However, these are generated
// separately to make it possible to enable or disable them via Cargo
// features independently of whether all boolean properties or general
// categories are enabled or disabled. The crate ensures that only one copy
// is compiled.
gen(d.join("perl_word.rs"), &["perl-word", &ucd, "--chars"])?;
gen(
    d.join("perl_decimal.rs"),
    &["general-category", &ucd, "--chars", "--include", "decimalnumber"],
)?;
gen(
    d.join("perl_space.rs"),
    &["property-bool", &ucd, "--chars", "--include", "whitespace"],
)?;

// Data tables for regex-automata.
let d = out
    .join("regex-automata")