Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize literal alternations #21

Closed
BurntSushi opened this issue Jan 14, 2015 · 0 comments · Fixed by #91
Closed

optimize literal alternations #21

BurntSushi opened this issue Jan 14, 2015 · 0 comments · Fixed by #91

Comments

@BurntSushi
Copy link
Member

From https://lwn.net/Articles/589009/

The handling of more complex alternations is a known (relatively) weak point of jrep (more precisely of rejit) in need of improvement. Grep uses a smart Boyer-Moore algorithm. To look for aaa|bbb|ccc at position p, it looks up the character at p + 2, and if it is not a, b, or c, knows it can jump three characters ahead to p + 3 (and then look at the character at p + 5).

On the other hand, like for single strings, rejit handles alternations simply: it applies brute force. But it does so relatively efficiently, so the performance is still good. To search for aaa|bbb|ccc at some position p in the text, rejit performs operations like:

    loop:
      find 'aaa' at position p
      if found goto match
      find 'bbb' at position p
      if found goto match
      find 'ccc' at position p
      if found goto match
      increment position and goto loop
    match:

The complexity is proportional to the number of alternations. Worse, when the number of alternated expressions exceeds a threshold (i.e. when the compiler cannot allocate a register per alternated expression), rejit falls back to some slow default code. This is what happens for the two regexps with eight or more alternated strings. The code generation should be fixed to allow an arbitrary number of alternated strings.

In other words, this is a way to bypass the regex machinery and degrade to a simple substring search.

In addition to being a common case to optimize, it should also give a small bump to the regex-dna benchmark because one of the regexes is just an alternation of literals: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna&lang=rust&id=1. The rest contain character classes, which complicates things somewhat.

The easy part of this is optimization is the actual searching of literal strings and jumping ahead in the input (there is precedent for this already in the code with literal prefixes). The harder part, I think, is analyzing the regex to find where the optimization can be applied. The issue is that an alternation is compiled to a series of split and jump instructions. It is easiest to discover the opportunity to optimize by analyzing the AST of the regex---but there will need to be a way to carry that information through to the VM.

One approach might be to tag pieces of the syntax with possible optimization (this is hopefully the first of many). Then when the AST is compiled to instructions, that information can be stored and indexed by the current program counter. The VM can then ask, "Do there exist any optimizations for this PC?" The rest is gravy.

N.B. This only works for a regex that is of the form a|b|c|.... It might be possible to generalize this to other cases, but it seems tricky.

BurntSushi added a commit that referenced this issue Jun 15, 2015
Overview of changes:

* Instruction set has been redesigned to be smaller, mostly by
  collapsing empty-width matches into one instruction type.
  In addition to moving instruction-matching out of the matching
  engine, this makes matching engine code much simpler.
* Rewrote input handling to use an inline representation of
  `Option<char>` and clearer position handling with the `Input` trait.
* Added a new bounded backtracking matching engine that is invoked for
  small regexes/inputs. It's about twice as fast as the full NFA
  matching engine.
* Implemented caching for both the NFA and backtracking engines.
  This avoids costly allocations on subsequent uses of the regex.
* Overhauled prefix handling at both discovery and matching.
  Namely, sets of prefix literals can now be extracted from regexes.
  Depending on what the prefixes look like, an Aho-Corasick DFA
  is built from them.
  (This adds a dependency on the `aho-corasick` crate.)
* When appropriate, use `memchr` to jump around in the input when
  there is a single common byte prefix.
  (This adds a dependency on the `memchr` crate.)
* Bring the `regex!` macro up to date. Unfortunately, it still
  implements the full NFA matching engine and doesn't yet have
  access to the new prefix DFA handling. Thus, its performance
  has gotten *worse* than the dynamic implementation in most
  cases. The docs have been updated to reflect this change.

Surprisingly, all of this required exactly one new application of
`unsafe`, which is isolated in the `memchr` crate. (Aho-Corasick has no
`unsafe` either!)

There should be *no* breaking changes in this commit. The only public
facing change is the addition of a method to the `Replacer` trait, but
it comes with a default implementation so that existing implementors
won't break. (Its purpose is to serve as a hint as to whether or not
replacement strings need to be expanded. This is crucial to speeding
up simple replacements.)

Closes #21.
BurntSushi added a commit that referenced this issue Jun 16, 2015
Overview of changes:

* Instruction set has been redesigned to be smaller, mostly by
  collapsing empty-width matches into one instruction type.
  In addition to moving instruction-matching out of the matching
  engine, this makes matching engine code much simpler.
* Rewrote input handling to use an inline representation of
  `Option<char>` and clearer position handling with the `Input` trait.
* Added a new bounded backtracking matching engine that is invoked for
  small regexes/inputs. It's about twice as fast as the full NFA
  matching engine.
* Implemented caching for both the NFA and backtracking engines.
  This avoids costly allocations on subsequent uses of the regex.
* Overhauled prefix handling at both discovery and matching.
  Namely, sets of prefix literals can now be extracted from regexes.
  Depending on what the prefixes look like, an Aho-Corasick DFA
  is built from them.
  (This adds a dependency on the `aho-corasick` crate.)
* When appropriate, use `memchr` to jump around in the input when
  there is a single common byte prefix.
  (This adds a dependency on the `memchr` crate.)
* Bring the `regex!` macro up to date. Unfortunately, it still
  implements the full NFA matching engine and doesn't yet have
  access to the new prefix DFA handling. Thus, its performance
  has gotten *worse* than the dynamic implementation in most
  cases. The docs have been updated to reflect this change.

Surprisingly, all of this required exactly one new application of
`unsafe`, which is isolated in the `memchr` crate. (Aho-Corasick has no
`unsafe` either!)

There should be *no* breaking changes in this commit. The only public
facing change is the addition of a method to the `Replacer` trait, but
it comes with a default implementation so that existing implementors
won't break. (Its purpose is to serve as a hint as to whether or not
replacement strings need to be expanded. This is crucial to speeding
up simple replacements.)

Closes #21.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant