optimize literal alternations #21

BurntSushi · 2015-01-14T23:12:24Z

The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely of rejit) in need of improvement. Grep uses a smart Boyer-Moore algorithm. To look for aaa|bbb|ccc at position p, it looks up the character at p + 2, and if it is not a, b, or c, knows it can jump three characters ahead to p + 3 (and then look at the character at p + 5).

On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:
    loop:
      find 'aaa' at position p
      if found goto match
      find 'bbb' at position p
      if found goto match
      find 'ccc' at position p
      if found goto match
      increment position and goto loop
    match:
The complexity is proportional to the number of alternations. Worse, when the number of alternated expressions exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to some slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.

In other words, this is a way to bypass the regex machinery and degrade to a simple substring search.
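To make the contrast in the quoted passage concrete, here is a minimal sketch of both strategies over byte strings. The function names (`find_brute`, `find_skip`) are hypothetical and the skip table is a simple Horspool-style variant for a set of literals, not necessarily what grep or rejit actually implement. It assumes every literal is non-empty.

```rust
/// Brute force, as in the quoted pseudocode: try every literal at every
/// position. Returns (match start, index of the literal that matched).
fn find_brute(haystack: &[u8], lits: &[&[u8]]) -> Option<(usize, usize)> {
    for p in 0..haystack.len() {
        for (i, lit) in lits.iter().enumerate() {
            if haystack[p..].starts_with(lit) {
                return Some((p, i));
            }
        }
    }
    None
}

/// Horspool-style skipping for a set of literals (an illustrative variant).
/// With `m` the length of the shortest literal, the byte at `p + m - 1`
/// decides how far the search window may safely advance.
fn find_skip(haystack: &[u8], lits: &[&[u8]]) -> Option<(usize, usize)> {
    let m = lits.iter().map(|l| l.len()).min()?;
    if m == 0 {
        return None; // this sketch does not handle empty literals
    }
    // Default shift is `m`: a byte that occurs in no literal's first
    // `m - 1` bytes lets the window jump past it entirely.
    let mut shift = [m; 256];
    for lit in lits {
        for (i, &b) in lit[..m - 1].iter().enumerate() {
            shift[b as usize] = shift[b as usize].min(m - 1 - i);
        }
    }
    let mut p = 0;
    while p + m <= haystack.len() {
        for (i, lit) in lits.iter().enumerate() {
            if haystack[p..].starts_with(lit) {
                return Some((p, i));
            }
        }
        p += shift[haystack[p + m - 1] as usize];
    }
    None
}
```

For `aaa|bbb|ccc` the table holds a shift of 3 for every byte other than `a`, `b`, and `c`, which is the "jump three characters ahead" behavior described above.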

In addition to being a common case to optimize, it should also give a small bump to the regex-dna benchmark because one of the regexes is just an alternation of literals: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=rust&id=1. The rest contain character classes, which complicates things somewhat.

The easy part of this optimization is the actual searching of literal strings and jumping ahead in the input (there is precedent for this already in the code with literal prefixes). The harder part, I think, is analyzing the regex to find where the optimization can be applied. The issue is that an alternation is compiled to a series of split and jump instructions. It is easiest to discover the opportunity to optimize by analyzing the AST of the regex, but there will need to be a way to carry that information through to the VM.

One approach might be to tag pieces of the syntax with possible optimizations (this is hopefully the first of many). Then when the AST is compiled to instructions, that information can be stored and indexed by the current program counter. The VM can then ask, "Do there exist any optimizations for this PC?" The rest is gravy.
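A rough sketch of that idea follows. All of the types here (`Ast`, `Inst`, `Optimization`, `Program`) are hypothetical and heavily simplified for illustration; they are not the crate's actual definitions. The compiler records an optimization keyed by the program counter at which an all-literal alternation starts, and the VM can query it by PC.

```rust
use std::collections::HashMap;

// Hypothetical, heavily simplified AST: literals and alternations only.
enum Ast {
    Literal(String),
    Alternate(Vec<Ast>),
}

#[derive(Debug, PartialEq)]
enum Inst {
    Char(char),
    Split(usize, usize),
    Jump(usize),
    Match,
}

#[derive(Debug, PartialEq)]
enum Optimization {
    LiteralAlternation(Vec<String>),
}

struct Program {
    insts: Vec<Inst>,
    // Optimizations indexed by the PC where the optimizable code begins.
    opts: HashMap<usize, Optimization>,
}

impl Program {
    fn compile(&mut self, ast: &Ast) {
        match ast {
            Ast::Literal(s) => self.insts.extend(s.chars().map(Inst::Char)),
            Ast::Alternate(alts) => {
                // Tag the alternation if every branch is a plain literal.
                let lits: Option<Vec<String>> = alts
                    .iter()
                    .map(|a| match a {
                        Ast::Literal(s) => Some(s.clone()),
                        _ => None,
                    })
                    .collect();
                if let (Some(lits), false) = (lits, alts.is_empty()) {
                    let pc = self.insts.len();
                    self.opts.insert(pc, Optimization::LiteralAlternation(lits));
                }
                let mut jumps = Vec::new();
                for (i, alt) in alts.iter().enumerate() {
                    if i + 1 == alts.len() {
                        self.compile(alt); // last branch needs no split
                        continue;
                    }
                    let split = self.insts.len();
                    self.insts.push(Inst::Split(0, 0)); // patched below
                    self.compile(alt);
                    jumps.push(self.insts.len());
                    self.insts.push(Inst::Jump(0)); // patched below
                    let next = self.insts.len();
                    self.insts[split] = Inst::Split(split + 1, next);
                }
                let end = self.insts.len();
                for j in jumps {
                    self.insts[j] = Inst::Jump(end);
                }
            }
        }
    }

    /// What the VM would ask: "is there an optimization for this PC?"
    fn optimization_at(&self, pc: usize) -> Option<&Optimization> {
        self.opts.get(&pc)
    }
}

fn compile_program(ast: &Ast) -> Program {
    let mut prog = Program { insts: Vec::new(), opts: HashMap::new() };
    prog.compile(ast);
    prog.insts.push(Inst::Match);
    prog
}
```

Keeping the side table separate from the instruction stream means the split and jump instructions stay unchanged, so an engine that ignores the table still behaves correctly.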

N.B. This only works for a regex that is of the form a|b|c|.... It might be possible to generalize this to other cases, but it seems tricky.

Overview of changes:

* Instruction set has been redesigned to be smaller, mostly by collapsing empty-width matches into one instruction type. In addition to moving instruction-matching out of the matching engine, this makes matching engine code much simpler.
* Rewrote input handling to use an inline representation of `Option<char>` and clearer position handling with the `Input` trait.
* Added a new bounded backtracking matching engine that is invoked for small regexes/inputs. It's about twice as fast as the full NFA matching engine.
* Implemented caching for both the NFA and backtracking engines. This avoids costly allocations on subsequent uses of the regex.
* Overhauled prefix handling at both discovery and matching. Namely, sets of prefix literals can now be extracted from regexes. Depending on what the prefixes look like, an Aho-Corasick DFA is built from them. (This adds a dependency on the `aho-corasick` crate.)
* When appropriate, use `memchr` to jump around in the input when there is a single common byte prefix. (This adds a dependency on the `memchr` crate.)
* Bring the `regex!` macro up to date. Unfortunately, it still implements the full NFA matching engine and doesn't yet have access to the new prefix DFA handling. Thus, its performance has gotten *worse* than the dynamic implementation in most cases. The docs have been updated to reflect this change.

Surprisingly, all of this required exactly one new application of `unsafe`, which is isolated in the `memchr` crate. (Aho-Corasick has no `unsafe` either!)

There should be *no* breaking changes in this commit. The only public facing change is the addition of a method to the `Replacer` trait, but it comes with a default implementation so that existing implementors won't break. (Its purpose is to serve as a hint as to whether or not replacement strings need to be expanded. This is crucial to speeding up simple replacements.)

Closes #21.
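As an illustration of the "single common byte prefix" case mentioned in the commit message, the sketch below jumps between candidate positions by scanning for that one byte. It is an assumption-laden, std-only stand-in: `common_first_byte` and `find_with_byte_skip` are hypothetical names, and `Iterator::position` is used in place of the `memchr` crate so the example has no dependencies.

```rust
/// Returns the first byte shared by all literals, if there is one.
fn common_first_byte(lits: &[&[u8]]) -> Option<u8> {
    let first = *lits.first()?.first()?;
    if lits.iter().all(|l| l.first() == Some(&first)) {
        Some(first)
    } else {
        None
    }
}

/// Finds the leftmost literal match by skipping to occurrences of the
/// common first byte. Returns `None` both when nothing matches and when
/// the literals share no first byte (a caller would then fall back to a
/// general search).
fn find_with_byte_skip(haystack: &[u8], lits: &[&[u8]]) -> Option<(usize, usize)> {
    let b = common_first_byte(lits)?;
    let mut start = 0;
    while start < haystack.len() {
        // Stand-in for a `memchr`-style scan for the byte `b`.
        let off = haystack[start..].iter().position(|&x| x == b)?;
        let p = start + off;
        for (i, lit) in lits.iter().enumerate() {
            if haystack[p..].starts_with(lit) {
                return Some((p, i));
            }
        }
        start = p + 1;
    }
    None
}
```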

BurntSushi mentioned this issue Jun 15, 2015

Major refactoring and performance improvements. #91

Merged

BurntSushi closed this as completed in #91 Jun 16, 2015
