move to byte based automata #146

Closed
BurntSushi opened this issue Jan 6, 2016 · 2 comments
@BurntSushi
Member

A prerequisite to a faster regex engine, and a DFA (see #66), is compiling UTF-8 decoding into the regex automaton. The reason this is necessary, especially for a DFA, is so that state transitions can proceed one byte at a time, which avoids a large character map in each state. We can preserve the invariant that all match boundaries returned fall on UTF-8 code unit sequence boundaries, because all of our regex automata exclusively match valid UTF-8 encodings.
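
To make the size argument concrete, here is an illustrative sketch (not this crate's actual representation) of why byte-at-a-time transitions keep DFA states small: each state needs at most a 256-entry table, no matter how many codepoints the pattern can match.

```rust
// Illustrative sketch only: with byte-based transitions, a DFA state is
// just a 256-entry table indexed by the next input byte, independent of
// how large the Unicode character classes in the original pattern are.
struct DfaState {
    /// Next state id for each possible input byte (0x00-0xFF).
    next: [u32; 256],
}

fn step(states: &[DfaState], current: u32, input: u8) -> u32 {
    states[current as usize].next[input as usize]
}
```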

This is also a prerequisite to providing byte-based regexes that can match on a &[u8] instead of requiring a &str. See #85.

There are a few tricky components to this:

  1. Translating an arbitrary Unicode codepoint range into a sequence of non-overlapping byte-based character classes. This is already done in the utf8_ranges crate (see the sketch after this list).
  2. Using said byte-based character classes in the compiler. I've written a proof of concept, but the automata generated are uncomfortably large for anything non-trivial.
  3. Being diligent about supporting zero-width assertions that are Unicode aware. Fortunately, this crate only has one: word boundaries. (I think this should be somewhat straightforward to do in the NFA and backtracking engines, but it is hard to do well in a DFA.)
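
For item 1, a minimal sketch of what that translation looks like, assuming the utf8_ranges crate's `Utf8Sequences` iterator (the exact debug output format here is illustrative):

```rust
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    // Translate the codepoint range U+0430..U+044F (Cyrillic а..я) into
    // non-overlapping sequences of byte ranges. Each yielded sequence is a
    // chain of byte ranges that a compiler can emit as byte classes.
    for seq in Utf8Sequences::new('\u{0430}', '\u{044F}') {
        println!("{:?}", seq);
    }
}
```

For a two-byte range like this one, the iterator yields a small number of sequences of (lead byte range, continuation byte range) pairs, each of which compiles into a short chain of byte instructions instead of a single codepoint range.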
@BurntSushi BurntSushi self-assigned this Jan 6, 2016
@BurntSushi
Member Author

This is proving to be quite tricky. I do have this working in my branch (including reducing the UTF-8 automata by reusing common suffixes), but it has introduced some weird problems.

Byte-based automata are considerably slower when used in the NFA or backtracking engines. For example, matching \pL goes from a binary search over Unicode codepoint ranges to executing a regex program that is a couple thousand instructions long. This is an unacceptable performance regression IMO, so we'll need to support matching on both codepoints and bytes.
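
To illustrate the codepoint side of that comparison, here is a minimal, hypothetical sketch (not this crate's code) of matching a character class represented as sorted codepoint ranges: one instruction, one binary search.

```rust
use std::cmp::Ordering;

/// A class like \pL as a sorted, non-overlapping list of inclusive
/// codepoint ranges. Membership is a single binary search.
fn in_class(ranges: &[(char, char)], c: char) -> bool {
    ranges
        .binary_search_by(|&(start, end)| {
            if end < c {
                Ordering::Less
            } else if start > c {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

fn main() {
    // A tiny stand-in for a real Unicode class table.
    let ranges = [('A', 'Z'), ('a', 'z'), ('\u{00C0}', '\u{00FF}')];
    assert!(in_class(&ranges, 'q'));
    assert!(!in_class(&ranges, '7'));
}
```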

I did this by simply adding a Bytes variant to the instruction set (sketched after the list below), but this has a number of downsides:

  1. A regex program with both a Bytes op and a Char/Ranges op is invalid, but this invariant is only maintained at runtime. It would be nicer to enforce it in the type system, but that seems tricky to do while permitting the NFA/backtracking engines to match on either codepoints or bytes. It seems like the instructions might need to be put behind a trait, but it isn't clear what the interface should be yet.
  2. A codepoint-based program requires more memory per instruction, while a byte-based program requires many more instructions. Since both share a single instruction representation, the byte-based program ends up paying the per-instruction overhead of codepoint programs, which is exacerbated by its increased number of states.
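
A simplified, hypothetical sketch of the shape of the problem; the names and variants here are illustrative, not this crate's actual definitions:

```rust
// One instruction enum serving both codepoint-based and byte-based
// programs. A valid program uses either `Bytes` or the codepoint ops
// (`Char`/`Ranges`), never both, but nothing in the type system says so.
enum Inst {
    // Codepoint-based ops, used by the NFA/backtracking engines.
    Char(char),
    Ranges(Vec<(char, char)>),
    // Byte-based op, needed for a byte-at-a-time DFA. Every codepoint
    // variant also pays for the space this variant doesn't need, and
    // vice versa.
    Bytes { start: u8, end: u8 },
    // Control flow, shared by both kinds of program.
    Split(usize, usize),
    Jump(usize),
    Match,
}
```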

Since there are a few unknowns, I'm going to mush on to building the DFA before submitting a PR. Perhaps the right way to make instructions polymorphic will emerge from that...

@BurntSushi BurntSushi modified the milestone: 1.0 Jan 31, 2016
BurntSushi added a commit that referenced this issue Feb 15, 2016
A lazy DFA is much faster than executing an NFA because it doesn't
repeat the work of following epsilon transitions over and over.
Instead, it computes states during search and caches them for reuse. We
avoid exponential state blowup by bounding the cache in size. When the
DFA isn't powerful enough to fulfill the caller's request (e.g., return
sub-capture locations), it still runs to find the boundaries of the
match and then falls back to NFA execution on the matched region. The
lazy DFA can otherwise execute on every regular expression *except* for
regular expressions that contain word boundary assertions (`\b` or
`\B`). (They are tricky to implement in the lazy DFA because they are
Unicode aware and therefore require multi-byte look-behind/ahead.)
The implementation in this PR is based on the implementation in Google's
RE2 library.

Adding a lazy DFA was a substantial change and required several
modifications:

1. The compiler can now produce both Unicode based programs (still used by the
   NFA engines) and byte based programs (required by the lazy DFA, but possible
   to use in the NFA engines too). In byte based programs, UTF-8 decoding is
   built into the automaton.
2. A new `Exec` type was introduced to implement the logic for compiling
   and choosing the right engine to use on each search.
3. Prefix literal detection was rewritten to work on bytes.
4. Benchmarks were overhauled and new ones were added to more carefully
   track the impact of various optimizations.
5. A new `HACKING.md` guide has been added that gives a high-level
   design overview of this crate.

Other changes in this commit include:

1. Protection against stack overflows. All places that once required
   recursion have now either acquired a bound or have been converted to
   using a stack on the heap.
2. Update the Aho-Corasick dependency, which includes `memchr2` and
   `memchr3` optimizations.
3. Add PCRE benchmarks using the Rust `pcre` bindings.

Closes #66, #146.
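
As a rough illustration of the caching idea described in that commit message, here is a minimal, hypothetical sketch: DFA states are computed on demand from sets of NFA states, cached, and the cache is bounded so state construction can't blow up. Types and names are illustrative, not the crate's actual implementation.

```rust
use std::collections::HashMap;

/// A sorted set of NFA state ids; one DFA state corresponds to one such set.
type NfaStateSet = Vec<usize>;
type DfaStateId = usize;

struct LazyDfaCache {
    states: HashMap<NfaStateSet, DfaStateId>,
    max_states: usize,
}

impl LazyDfaCache {
    /// Return the DFA state for `set`, computing and caching it on demand.
    /// If the cache is full, wipe it and start over; this bounds memory
    /// use at the cost of occasionally recomputing states.
    fn state_for(&mut self, set: NfaStateSet) -> DfaStateId {
        if let Some(&id) = self.states.get(&set) {
            return id;
        }
        if self.states.len() >= self.max_states {
            // Discarding everything here is what keeps the DFA "lazy" and
            // memory-bounded instead of exponentially large.
            self.states.clear();
        }
        let id = self.states.len();
        self.states.insert(set, id);
        id
    }
}

fn main() {
    let mut cache = LazyDfaCache { states: HashMap::new(), max_states: 1024 };
    let start = cache.state_for(vec![0, 3, 7]);
    assert_eq!(start, cache.state_for(vec![0, 3, 7])); // second lookup hits the cache
}
```

In the actual design described above, a search also falls back to the NFA engines when the DFA can't answer the caller's question (for example, sub-capture locations or word boundary assertions).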
@BurntSushi
Member Author

Done in 2aa1727
