syntax: make Unicode completely optional
This commit refactors the way this library handles Unicode data by making it completely optional. Several features are introduced which permit callers to select only the Unicode data they need (up to a point of granularity).

An important property of these changes is that the presence or absence of crate features will never change the match semantics of a regular expression. Instead, the presence or absence of a crate feature can only add to or subtract from the set of all possible valid regular expressions. So for example, if the `unicode-case` feature is disabled, then attempting to produce `Hir` for the regex `(?i)a` will fail. Instead, callers must use `(?i-u)a` (or enable the `unicode-case` feature).

This partially addresses #583 since it permits callers to decrease binary size.
1 parent a88b696, commit 85f2c0d
Showing 15 changed files with 1,381 additions and 246 deletions.
@@ -0,0 +1,82 @@
regex-syntax
============
This crate provides a robust regular expression parser.

[![Build status](https://travis-ci.com/rust-lang/regex.svg?branch=master)](https://travis-ci.com/rust-lang/regex)
[![Build status](https://ci.appveyor.com/api/projects/status/github/rust-lang/regex?svg=true)](https://ci.appveyor.com/project/rust-lang-libs/regex)
[![](https://meritbadge.herokuapp.com/regex-syntax)](https://crates.io/crates/regex-syntax)
[![Rust](https://img.shields.io/badge/rust-1.28.0%2B-blue.svg?maxAge=3600)](https://github.com/rust-lang/regex)


### Documentation

https://docs.rs/regex-syntax


### Overview

There are two primary types exported by this crate: `Ast` and `Hir`. The former
is a faithful abstract syntax of a regular expression, and can convert regular
expressions back to their concrete syntax while mostly preserving their original
form. The latter type is a high-level intermediate representation of a regular
expression that is amenable to analysis and compilation into byte codes or
automata. An `Hir` achieves this by drastically simplifying the syntactic
structure of the regular expression. While an `Hir` can be converted back to
its equivalent concrete syntax, the result is unlikely to resemble the original
concrete syntax that produced the `Hir`.


### Example

This example shows how to parse a pattern string into its HIR:

```rust
use regex_syntax::Parser;
use regex_syntax::hir::{self, Hir};

let hir = Parser::new().parse("a|b").unwrap();
assert_eq!(hir, Hir::alternation(vec![
    Hir::literal(hir::Literal::Unicode('a')),
    Hir::literal(hir::Literal::Unicode('b')),
]));
```


### Crate features

By default, this crate bundles a fairly large amount of Unicode data tables
(a source size of ~750KB). Because of their large size, one can disable some
or all of these data tables. If a regular expression attempts to use Unicode
data that is not available, then an error will occur when translating the `Ast`
to the `Hir`.
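
The behavior described above (missing Unicode data produces an error rather than a silently different match) can be sketched with a toy model. This is not the crate's implementation: `TranslateError`, `case_insensitive_class`, and the `have_case_tables` flag are hypothetical names used only for illustration, and the standard library's case mappings stand in for real case folding tables. It mirrors the `(?i)a` versus `(?i-u)a` case from the commit message.

```rust
/// Toy stand-in for a translation error; not the crate's real error type.
#[derive(Debug, PartialEq)]
enum TranslateError {
    UnicodeCaseUnavailable,
}

/// Hypothetical helper: the set of characters that `c` matches
/// case-insensitively.
///
/// `unicode` models the `u` flag; `have_case_tables` models whether the
/// `unicode-case` feature was compiled in.
fn case_insensitive_class(
    c: char,
    unicode: bool,
    have_case_tables: bool,
) -> Result<Vec<char>, TranslateError> {
    let mut set = vec![c];
    if unicode {
        if !have_case_tables {
            // Never fall back to ASCII here: that would change match
            // semantics instead of shrinking the set of valid patterns.
            return Err(TranslateError::UnicodeCaseUnavailable);
        }
        // Assumption: std's case mappings approximate real folding tables.
        set.extend(c.to_lowercase());
        set.extend(c.to_uppercase());
    } else {
        // ASCII-only folding needs no Unicode data at all.
        set.push(c.to_ascii_lowercase());
        set.push(c.to_ascii_uppercase());
    }
    set.sort();
    set.dedup();
    Ok(set)
}

fn main() {
    // `(?i)a` with the tables present: accepted.
    assert_eq!(case_insensitive_class('a', true, true), Ok(vec!['A', 'a']));
    // `(?i)a` without the tables: rejected, not reinterpreted.
    assert_eq!(
        case_insensitive_class('a', true, false),
        Err(TranslateError::UnicodeCaseUnavailable)
    );
    // `(?i-u)a`: always valid, with or without the tables.
    assert_eq!(case_insensitive_class('a', false, false), Ok(vec!['A', 'a']));
}
```

The point of the sketch is the early `Err` return: a pattern either translates with the same meaning it would have with all features enabled, or it does not translate at all.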

The full set of features one can disable is listed
[in the "Crate features" section of the documentation](https://docs.rs/regex-syntax/*/#crate-features).


### Testing

Simply running `cargo test` will give you very good coverage. However, because
of the large number of features exposed by this crate, a `test` script is
included in this directory which will test several feature combinations. This
is the same script that is run in CI.


### Motivation

The primary purpose of this crate is to provide the parser used by `regex`.
Specifically, this crate is treated as an implementation detail of `regex`,
and is primarily developed for the needs of `regex`.

Since this crate is an implementation detail of `regex`, it may experience
breaking change releases at a different cadence from `regex`. This is only
possible because this crate is _not_ a public dependency of `regex`.

Another consequence of this decoupling is that there is no direct way to
compile a `regex::Regex` from a `regex_syntax::hir::Hir`. Instead, one must
first convert the `Hir` to a string (via its `std::fmt::Display`) and then
compile that via `Regex::new`. While this does repeat some work, compilation
typically takes much longer than parsing.

Stated differently, the coupling between `regex` and `regex-syntax` exists only
at the level of the concrete syntax.
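
To make the "concrete syntax as the only interface" idea more tangible, here is a minimal sketch using a toy type. `ToyHir` is hypothetical and far simpler than the real `regex_syntax::hir::Hir`; it only shows how a `Display` implementation can print an intermediate representation back to a pattern string, and why that string need not resemble the pattern that was originally parsed.

```rust
use std::fmt;

/// Toy stand-in for an HIR; the real `Hir` type is much richer.
enum ToyHir {
    Literal(char),
    Concat(Vec<ToyHir>),
    Alternation(Vec<ToyHir>),
}

impl fmt::Display for ToyHir {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToyHir::Literal(c) => {
                // Escape metacharacters so the output stays a valid pattern.
                if r"\.+*?()|[]{}^$".contains(*c) {
                    write!(f, "\\")?;
                }
                write!(f, "{}", c)
            }
            ToyHir::Concat(parts) => {
                for part in parts {
                    write!(f, "{}", part)?;
                }
                Ok(())
            }
            ToyHir::Alternation(alts) => {
                // Always emit a non-capturing group: unambiguous, but not
                // necessarily what the author originally wrote.
                write!(f, "(?:")?;
                for (i, alt) in alts.iter().enumerate() {
                    if i > 0 {
                        write!(f, "|")?;
                    }
                    write!(f, "{}", alt)?;
                }
                write!(f, ")")
            }
        }
    }
}

fn main() {
    // The structure a parser might build for the pattern `a|bc`.
    let hir = ToyHir::Alternation(vec![
        ToyHir::Literal('a'),
        ToyHir::Concat(vec![ToyHir::Literal('b'), ToyHir::Literal('c')]),
    ]);
    // The printed form is equivalent to `a|bc` but spelled differently.
    let pattern = hir.to_string();
    assert_eq!(pattern, "(?:a|bc)");
    println!("{}", pattern);
}
```

With the real crates, the analogous step would be passing the string produced by the `Hir`'s `Display` implementation to `Regex::new`, as described above.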