optimize literal alternations #21
Comments
BurntSushi added a commit that referenced this issue Jun 15, 2015
Overview of changes:

* Instruction set has been redesigned to be smaller, mostly by collapsing empty-width matches into one instruction type. In addition to moving instruction-matching out of the matching engine, this makes matching engine code much simpler.
* Rewrote input handling to use an inline representation of `Option<char>` and clearer position handling with the `Input` trait.
* Added a new bounded backtracking matching engine that is invoked for small regexes/inputs. It's about twice as fast as the full NFA matching engine.
* Implemented caching for both the NFA and backtracking engines. This avoids costly allocations on subsequent uses of the regex.
* Overhauled prefix handling at both discovery and matching. Namely, sets of prefix literals can now be extracted from regexes. Depending on what the prefixes look like, an Aho-Corasick DFA is built from them. (This adds a dependency on the `aho-corasick` crate.)
* When appropriate, use `memchr` to jump around in the input when there is a single common byte prefix. (This adds a dependency on the `memchr` crate.)
* Bring the `regex!` macro up to date. Unfortunately, it still implements the full NFA matching engine and doesn't yet have access to the new prefix DFA handling. Thus, its performance has gotten *worse* than the dynamic implementation in most cases. The docs have been updated to reflect this change.

Surprisingly, all of this required exactly one new application of `unsafe`, which is isolated in the `memchr` crate. (Aho-Corasick has no `unsafe` either!)

There should be *no* breaking changes in this commit. The only public facing change is the addition of a method to the `Replacer` trait, but it comes with a default implementation so that existing implementors won't break. (Its purpose is to serve as a hint as to whether or not replacement strings need to be expanded. This is crucial to speeding up simple replacements.)

Closes #21.
BurntSushi added a commit that referenced this issue Jun 16, 2015
From https://lwn.net/Articles/589009/
In other words, this is a way to bypass the regex machinery and degrade to a simple substring search.
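To make "degrade to a simple substring search" concrete, here is a minimal sketch using only the standard library. The function name `find_literal_alternation` is hypothetical and is not part of the `regex` crate. It assumes the regex has already been reduced to a list of literal strings, and it resolves ties at the same start position in favor of the literal listed first, which is an assumption about the desired match semantics.

```rust
/// Returns the (start, end) byte offsets of the leftmost match of any literal.
/// If two literals match at the same start position, the one listed first wins
/// (an assumed tie-breaking rule that follows the order of the alternation).
fn find_literal_alternation(haystack: &str, literals: &[&str]) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    for lit in literals {
        // Plain substring search; no regex machinery involved.
        if let Some(start) = haystack.find(lit) {
            // Strictly-less comparison keeps the earlier alternative on ties.
            if best.map_or(true, |(s, _)| start < s) {
                best = Some((start, start + lit.len()));
            }
        }
    }
    best
}
```

This scans the haystack once per literal, so it is only meant to show the idea. A multi-pattern matcher would be the natural replacement when there are many literals.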
In addition to being a common case to optimize, it should also give a small bump to the `regex-dna` benchmark because one of the regexes is just an alternation of literals: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=rust&id=1. The rest contain character classes, which complicates things somewhat.

The easy part of this optimization is the actual searching of literal strings and jumping ahead in the input (there is precedent for this already in the code with literal prefixes). The harder part, I think, is analyzing the regex to find where the optimization can be applied. The issue is that an alternation is compiled to a series of split and jump instructions. It is easiest to discover the opportunity to optimize by analyzing the AST of the regex, but there will need to be a way to carry that information through to the VM.
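The AST analysis could look roughly like the sketch below. The `Ast` enum is a toy stand-in invented for illustration and is not the crate's real syntax tree. The functions `as_literal` and `literal_alternation` are likewise hypothetical names. The sketch only recognizes a top-level alternation whose branches are single characters or flat concatenations of characters.

```rust
/// A toy regex AST (hypothetical; not the crate's actual syntax tree).
#[derive(Debug, Clone, PartialEq)]
enum Ast {
    Literal(char),
    AnyChar,
    Concat(Vec<Ast>),
    Alternate(Vec<Ast>),
}

/// If `ast` is a plain literal string (one char, or a flat concatenation of
/// chars), return that string. Nested concatenations are not flattened here.
fn as_literal(ast: &Ast) -> Option<String> {
    match ast {
        Ast::Literal(c) => Some(c.to_string()),
        Ast::Concat(parts) => {
            let mut s = String::new();
            for p in parts {
                match p {
                    Ast::Literal(c) => s.push(*c),
                    _ => return None,
                }
            }
            Some(s)
        }
        _ => None,
    }
}

/// If the whole regex has the form `a|b|c|...`, return the literals in
/// alternation order; otherwise return `None`.
fn literal_alternation(ast: &Ast) -> Option<Vec<String>> {
    match ast {
        // Collecting into `Option<Vec<_>>` yields `None` if any branch is not a literal.
        Ast::Alternate(branches) => branches.iter().map(as_literal).collect(),
        _ => None,
    }
}
```

Doing the check on the tree keeps it a simple structural match, whereas recovering the same fact from split and jump instructions would mean reconstructing the branches first.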
One approach might be to tag pieces of the syntax with possible optimizations (this is hopefully the first of many). Then when the AST is compiled to instructions, that information can be stored and indexed by the current program counter. The VM can then ask, "Do there exist any optimizations for this PC?" The rest is gravy.
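One possible shape for that PC-indexed side table is sketched here. All of the names (`Optimization`, `OptTable`, `record`, `lookup`) are assumptions made for illustration; nothing here reflects the crate's actual compiler or VM types.

```rust
use std::collections::HashMap;

/// Hypothetical optimization hints that a compiler could attach to a program counter.
#[derive(Debug, Clone, PartialEq)]
enum Optimization {
    /// The alternation starting at this PC matches exactly one of these literals.
    LiteralAlternation(Vec<String>),
}

/// Side table mapping a program counter to an available optimization.
#[derive(Debug, Default)]
struct OptTable {
    by_pc: HashMap<usize, Optimization>,
}

impl OptTable {
    /// Intended to be called while compiling, when the first instruction of a
    /// tagged AST node is emitted at `pc`.
    fn record(&mut self, pc: usize, opt: Optimization) {
        self.by_pc.insert(pc, opt);
    }

    /// Intended to be called by the VM: "do there exist any optimizations for this PC?"
    fn lookup(&self, pc: usize) -> Option<&Optimization> {
        self.by_pc.get(&pc)
    }
}
```

A hash map keeps the table sparse, which suits the case where only a few program counters carry a hint. A `Vec<Option<Optimization>>` indexed by PC would trade memory for a cheaper lookup in the VM's hot loop.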
N.B. This only works for a regex that is of the form `a|b|c|...`. It might be possible to generalize this to other cases, but it seems tricky.