move to byte based automata #146

Closed
BurntSushi opened this issue Jan 6, 2016 · 2 comments
@BurntSushi
Member

A prerequisite to a faster regex engine, and a DFA (see #66), is compiling UTF-8 decoding into the regex automaton. The reason this is necessary, especially for a DFA, is so that state transitions can proceed one byte at a time, which avoids a large character map in each state. We can preserve the invariant that all match boundaries returned fall on UTF-8 code unit sequence boundaries, because all of our regex automata exclusively match valid UTF-8 encodings.
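
To make the size argument concrete, here is an illustrative sketch (not this crate's actual representation) of why byte-at-a-time transitions keep DFA states small: each state needs at most a 256-entry table, no matter how many codepoints the pattern can match.

```rust
// Illustrative sketch only: with byte-based transitions, a DFA state is
// just a 256-entry table indexed by the next input byte, independent of
// how large the Unicode character classes in the original pattern are.
struct DfaState {
    /// Next state id for each possible input byte (0x00-0xFF).
    next: [u32; 256],
}

fn step(states: &[DfaState], current: u32, input: u8) -> u32 {
    states[current as usize].next[input as usize]
}
```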

This is also a prerequisite to providing byte-based regexes that can match on a &[u8] instead of requiring a &str. See #85.

There are a few tricky components to this:

  1. Translating an arbitrary Unicode codepoint range into a sequence of non-overlapping byte-based character classes. This is already done in the utf8_ranges crate (see the sketch after this list).
  2. Using said byte-based character classes in the compiler. I've written a proof of concept, but the automata generated are uncomfortably large for anything non-trivial.
  3. Being diligent about supporting zero-width assertions that are Unicode aware. Fortunately, this crate only has one: word boundaries. (I think this should be somewhat straightforward to do in the NFA and backtracking engines, but it is hard to do well in a DFA.)
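
For item 1, a minimal sketch of what that translation looks like, assuming the utf8_ranges crate's `Utf8Sequences` iterator (the exact debug output format here is illustrative):

```rust
extern crate utf8_ranges;

use utf8_ranges::Utf8Sequences;

fn main() {
    // Translate the codepoint range U+0430..U+044F (Cyrillic а..я) into
    // non-overlapping sequences of byte ranges. Each yielded sequence is a
    // chain of byte ranges that a compiler can emit as byte classes.
    for seq in Utf8Sequences::new('\u{0430}', '\u{044F}') {
        println!("{:?}", seq);
    }
}
```

For a two-byte range like this one, the iterator yields a small number of sequences of (lead byte range, continuation byte range) pairs, each of which compiles into a short chain of byte instructions instead of a single codepoint range.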
@BurntSushi BurntSushi self-assigned this Jan 6, 2016
@BurntSushi
Member Author

This is proving to be quite tricky. I do have this working in my branch (including reducing the UTF-8 automata by reusing common suffixes), but it has introduced some weird problems.

Byte-based automata are considerably slower when used in the NFA or backtracking engines. For example, matching \pL goes from a binary search over Unicode codepoint ranges to executing a regex program that is a couple thousand instructions long. This is an unacceptable performance regression IMO, so we'll need to support matching on both codepoints and bytes.
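
To illustrate the codepoint side of that comparison, here is a minimal, hypothetical sketch (not this crate's code) of matching a character class represented as sorted codepoint ranges: one instruction, one binary search.

```rust
use std::cmp::Ordering;

/// A class like \pL as a sorted, non-overlapping list of inclusive
/// codepoint ranges. Membership is a single binary search.
fn in_class(ranges: &[(char, char)], c: char) -> bool {
    ranges
        .binary_search_by(|&(start, end)| {
            if end < c {
                Ordering::Less
            } else if start > c {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

fn main() {
    // A tiny stand-in for a real Unicode class table.
    let ranges = [('A', 'Z'), ('a', 'z'), ('\u{00C0}', '\u{00FF}')];
    assert!(in_class(&ranges, 'q'));
    assert!(!in_class(&ranges, '7'));
}
```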

I did this by simply adding a Bytes variant to the instruction set (sketched after the list below), but this has a number of downsides:

  1. A regex program with both a Bytes op and a Char/Ranges op is invalid, but this invariant is only maintained at runtime. It would be nicer to enforce it in the type system, but that seems tricky to do while permitting the NFA/backtracking engines to match on either codepoints or bytes. It seems like the instructions might need to be put behind a trait, but it isn't clear what the interface should be yet.
  2. A codepoint-based program requires more memory per instruction, while a byte-based program requires many more instructions. Since both share a single instruction representation, the byte-based program ends up paying the per-instruction overhead of codepoint programs, which is exacerbated by its increased number of states.
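
A simplified, hypothetical sketch of the shape of the problem; the names and variants here are illustrative, not this crate's actual definitions:

```rust
// One instruction enum serving both codepoint-based and byte-based
// programs. A valid program uses either `Bytes` or the codepoint ops
// (`Char`/`Ranges`), never both, but nothing in the type system says so.
enum Inst {
    // Codepoint-based ops, used by the NFA/backtracking engines.
    Char(char),
    Ranges(Vec<(char, char)>),
    // Byte-based op, needed for a byte-at-a-time DFA. Every codepoint
    // variant also pays for the space this variant doesn't need, and
    // vice versa.
    Bytes { start: u8, end: u8 },
    // Control flow, shared by both kinds of program.
    Split(usize, usize),
    Jump(usize),
    Match,
}
```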

Since there are a few unknowns, I'm going to mush on to building the DFA before submitting a PR. Perhaps the right way to make instructions polymorphic will emerge from that...

@BurntSushi BurntSushi modified the milestone: 1.0 Jan 31, 2016
BurntSushi added a commit that referenced this issue Feb 15, 2016
A lazy DFA is much faster than executing an NFA because it doesn't
repeat the work of following epsilon transitions over and over.
Instead, it computes states during search and caches them for reuse. We
avoid exponential state blowup by bounding the cache in size. When the
DFA isn't powerful enough to fulfill the caller's request (e.g., return
sub-capture locations), it still runs to find the boundaries of the
match and then falls back to NFA execution on the matched region. The
lazy DFA can otherwise execute on every regular expression *except* for
regular expressions that contain word boundary assertions (`\b` or
`\B`). (They are tricky to implement in the lazy DFA because they are
Unicode aware and therefore require multi-byte look-behind/ahead.)
The implementation in this PR is based on the implementation in Google's
RE2 library.

Adding a lazy DFA was a substantial change and required several
modifications:

1. The compiler can now produce both Unicode based programs (still used by the
   NFA engines) and byte based programs (required by the lazy DFA, but possible
   to use in the NFA engines too). In byte based programs, UTF-8 decoding is
   built into the automaton.
2. A new `Exec` type was introduced to implement the logic for compiling
   and choosing the right engine to use on each search.
3. Prefix literal detection was rewritten to work on bytes.
4. Benchmarks were overhauled and new ones were added to more carefully
   track the impact of various optimizations.
5. A new `HACKING.md` guide has been added that gives a high-level
   design overview of this crate.

Other changes in this commit include:

1. Protection against stack overflows. All places that once required
   recursion have now either acquired a bound or have been converted to
   using a stack on the heap.
2. Update the Aho-Corasick dependency, which includes `memchr2` and
   `memchr3` optimizations.
3. Add PCRE benchmarks using the Rust `pcre` bindings.

Closes #66, #146.
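
As a rough illustration of the caching idea described in that commit message, here is a minimal, hypothetical sketch: DFA states are computed on demand from sets of NFA states, cached, and the cache is bounded so state construction can't blow up. Types and names are illustrative, not the crate's actual implementation.

```rust
use std::collections::HashMap;

/// A sorted set of NFA state ids; one DFA state corresponds to one such set.
type NfaStateSet = Vec<usize>;
type DfaStateId = usize;

struct LazyDfaCache {
    states: HashMap<NfaStateSet, DfaStateId>,
    max_states: usize,
}

impl LazyDfaCache {
    /// Return the DFA state for `set`, computing and caching it on demand.
    /// If the cache is full, wipe it and start over; this bounds memory
    /// use at the cost of occasionally recomputing states.
    fn state_for(&mut self, set: NfaStateSet) -> DfaStateId {
        if let Some(&id) = self.states.get(&set) {
            return id;
        }
        if self.states.len() >= self.max_states {
            // Discarding everything here is what keeps the DFA "lazy" and
            // memory-bounded instead of exponentially large.
            self.states.clear();
        }
        let id = self.states.len();
        self.states.insert(set, id);
        id
    }
}

fn main() {
    let mut cache = LazyDfaCache { states: HashMap::new(), max_states: 1024 };
    let start = cache.state_for(vec![0, 3, 7]);
    assert_eq!(start, cache.state_for(vec![0, 3, 7])); // second lookup hits the cache
}
```

In the actual design described above, a search also falls back to the NFA engines when the DFA can't answer the caller's question (for example, sub-capture locations or word boundary assertions).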
@BurntSushi
Member Author

Done in 2aa1727
