Add a lazy DFA. #164

BurntSushi · 2016-02-15T01:59:17Z

A lazy DFA is much faster than executing an NFA because it doesn't
repeat the work of following epsilon transitions over and over.
Instead, it computes states during search and caches them for reuse. We
avoid exponential state blow up by bounding the cache in size. When the
DFA isn't powerful enough to fulfill the caller's request (e.g., return
sub-capture locations), it still runs to find the boundaries of the
match and then falls back to NFA execution on the matched region. The
lazy DFA can otherwise execute on every regular expression except for
regular expressions that contain word boundary assertions (\b or
\B). (They are tricky to implement in the lazy DFA because they are
Unicode aware and therefore require multi-byte look-behind/ahead.)
The implementation in this PR is based on the implementation in Google's
RE2 library.
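
To make the idea above concrete, here is a minimal sketch of a lazy DFA with a bounded cache. It is not the code in this PR: `Inst`, `LazyDfa`, and `demo_prog` are hypothetical names invented for illustration, the program is hand-assembled for `(a|b)*c`, and flushing the whole cache when it fills up is just one simple policy assumed here.

```rust
use std::collections::HashMap;

/// A tiny NFA instruction set (hypothetical, for illustration only).
#[derive(Clone, Copy)]
enum Inst {
    Byte(u8, usize),     // consume this byte, then continue at the given pc
    Split(usize, usize), // epsilon transition to both targets
    Match,
}

/// A DFA state is the sorted set of NFA pcs reachable without consuming input.
type State = Vec<usize>;

struct LazyDfa<'a> {
    prog: &'a [Inst],
    max_states: usize,                  // soft bound on cached states (use >= 2)
    ids: HashMap<State, usize>,         // NFA state set -> DFA state id
    states: Vec<State>,                 // DFA state id -> NFA state set
    trans: HashMap<(usize, u8), usize>, // memoized (state, byte) -> state
}

impl<'a> LazyDfa<'a> {
    fn new(prog: &'a [Inst], max_states: usize) -> Self {
        LazyDfa { prog, max_states, ids: HashMap::new(), states: Vec::new(), trans: HashMap::new() }
    }

    /// Epsilon closure of `start`. This is the work an NFA simulation repeats
    /// on every byte; here it only runs when a transition is not cached yet.
    fn closure(&self, start: &[usize]) -> State {
        let mut seen = vec![false; self.prog.len()];
        let mut stack = start.to_vec();
        let mut set = Vec::new();
        while let Some(pc) = stack.pop() {
            if seen[pc] { continue; }
            seen[pc] = true;
            match self.prog[pc] {
                Inst::Split(a, b) => { stack.push(a); stack.push(b); }
                _ => set.push(pc),
            }
        }
        set.sort();
        set
    }

    /// Return the id of `state`, adding it to the cache if it is new.
    fn intern(&mut self, state: State) -> usize {
        if let Some(&id) = self.ids.get(&state) { return id; }
        let id = self.states.len();
        self.ids.insert(state.clone(), id);
        self.states.push(state);
        id
    }

    /// Follow one byte from DFA state `from`, computing the target lazily.
    fn step(&mut self, mut from: usize, byte: u8) -> usize {
        if let Some(&to) = self.trans.get(&(from, byte)) { return to; }
        let targets: Vec<usize> = self.states[from].iter().filter_map(|&pc| match self.prog[pc] {
            Inst::Byte(b, next) if b == byte => Some(next),
            _ => None,
        }).collect();
        let next = self.closure(&targets);
        if !self.ids.contains_key(&next) && self.states.len() >= self.max_states {
            // Cache is full: throw everything away except the current state.
            let cur = self.states[from].clone();
            self.ids.clear();
            self.states.clear();
            self.trans.clear();
            from = self.intern(cur);
        }
        let to = self.intern(next);
        self.trans.insert((from, byte), to);
        to
    }

    /// True if the whole input matches the program.
    fn is_full_match(&mut self, input: &[u8]) -> bool {
        let start = self.closure(&[0]);
        let mut cur = self.intern(start);
        for &b in input {
            cur = self.step(cur, b);
            if self.states[cur].is_empty() { return false; } // dead state
        }
        self.states[cur].iter().any(|&pc| matches!(self.prog[pc], Inst::Match))
    }
}

/// Hand-assembled program for `(a|b)*c`.
fn demo_prog() -> Vec<Inst> {
    vec![
        Inst::Split(1, 2),
        Inst::Byte(b'a', 0),
        Inst::Split(3, 4),
        Inst::Byte(b'b', 0),
        Inst::Byte(b'c', 5),
        Inst::Match,
    ]
}
```

With `LazyDfa::new(&demo_prog(), 2)`, searching `b"abbac"` computes only two DFA states, and every repeated `a` or `b` is a single hash lookup. The bound in this sketch is approximate: the start state is interned without a capacity check, so the cache can briefly hold one state more than `max_states`.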

Adding a lazy DFA was a substantial change and required several
modifications:

1. The compiler can now produce both Unicode based programs (still used by the
   NFA engines) and byte based programs (required by the lazy DFA, but possible
   to use in the NFA engines too). In byte based programs, UTF-8 decoding is
   built into the automaton.
2. A new Exec type was introduced to implement the logic for compiling
   and choosing the right engine to use on each search.
3. Prefix literal detection was rewritten to work on bytes.
4. Benchmarks were overhauled and new ones were added to more carefully
   track the impact of various optimizations.
5. A new HACKING.md guide has been added that gives a high-level
   design overview of this crate.
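
As a rough illustration of what "UTF-8 decoding is built into the automaton" means, the sketch below turns a single `char` into a sequence of byte-matching instructions, so the matcher never decodes a codepoint at search time. `ByteInst`, `compile_char`, and `matches_at` are hypothetical names for this example only. A real compiler also has to handle ranges of codepoints, which this sketch does not attempt.

```rust
/// Hypothetical byte-level instruction: match exactly one byte.
#[derive(Debug, PartialEq)]
struct ByteInst(u8);

/// Compile one Unicode scalar value into its UTF-8 byte sequence.
fn compile_char(c: char) -> Vec<ByteInst> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).bytes().map(ByteInst).collect()
}

/// Run the byte program against the start of `haystack`.
fn matches_at(prog: &[ByteInst], haystack: &[u8]) -> bool {
    haystack.len() >= prog.len()
        && prog.iter().zip(haystack).all(|(inst, &b)| inst.0 == b)
}
```

For example, `compile_char('é')` yields the two instructions `0xC3, 0xA9`, and the resulting program matches at the start of `"été".as_bytes()`.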

Other changes in this commit include:

1. Protection against stack overflows. All places that once required
   recursion have now either acquired a bound or have been converted to
   using a stack on the heap.
2. Update the Aho-Corasick dependency, which includes memchr2 and
   memchr3 optimizations.
3. Add PCRE benchmarks using the Rust pcre bindings.
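
The "stack on the heap" conversion mentioned in the first item can be sketched as follows. This is an illustration under assumptions, not code from the crate: `Expr` is a hypothetical, much simplified expression tree, and `count_literals` walks it with a `Vec` as the work list instead of recursing.

```rust
/// Hypothetical, simplified regex syntax tree.
enum Expr {
    Literal(char),
    Group(Box<Expr>),
    Concat(Vec<Expr>),
}

/// Count literals without recursion: pending nodes live in a heap-allocated
/// Vec, so the traversal itself does not grow the call stack with nesting.
fn count_literals(root: &Expr) -> usize {
    let mut stack: Vec<&Expr> = vec![root];
    let mut count = 0;
    while let Some(expr) = stack.pop() {
        match expr {
            Expr::Literal(_) => count += 1,
            Expr::Group(inner) => stack.push(&**inner),
            Expr::Concat(items) => stack.extend(items.iter()),
        }
    }
    count
}
```

Note that dropping a very deeply nested `Box` tree is itself recursive in Rust, which is one reason a bound on nesting depth is still useful alongside the iterative traversal.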

Closes #66, #146.

BurntSushi · 2016-02-15T02:02:23Z

cc @alexcrichton I don't think this PR is possible to review, it's too big. Two things for you I think:

There should be no public API changes.
I've added benchmarks for PCRE, which introduces a dev-dependency on the pcre crate. I think this is causing the test/benchmark build to fail on Windows. I'd be fine skipping benchmarks on Windows, but I'm not sure how to do that. (I guess worst case scenario is to separate benchmarks into a different crate? Blech. Or can I use a feature?)

BurntSushi · 2016-02-15T02:05:05Z

Benchmarks, without the DFA and with the DFA:

$ cargo-benchcmp dynamic-no-lazy-dfa dynamic 
name                                           dynamic-no-lazy-dfa ns/iter  dynamic ns/iter         diff ns/iter   diff %
bench::anchored_literal_long_match             169 (2,307 MB/s)             75 (5,200 MB/s)                  -94  -55.62%
bench::anchored_literal_long_non_match         85 (4,588 MB/s)              61 (6,393 MB/s)                  -24  -28.24%
bench::anchored_literal_short_match            158 (164 MB/s)               75 (346 MB/s)                    -83  -52.53%
bench::anchored_literal_short_non_match        84 (309 MB/s)                61 (426 MB/s)                    -23  -27.38%
bench::easy0_1K                                318 (3,220 MB/s)             196 (5,224 MB/s)                -122  -38.36%
bench::easy0_1MB                               257,205 (4,076 MB/s)         255,138 (4,109 MB/s)          -2,067   -0.80%
bench::easy0_32                                82 (390 MB/s)                71 (450 MB/s)                    -11  -13.41%
bench::easy0_32K                               8,666 (3,781 MB/s)           5,392 (6,077 MB/s)            -3,274  -37.78%
bench::easy1_1K                                293 (3,494 MB/s)             241 (4,248 MB/s)                 -52  -17.75%
bench::easy1_1MB                               329,774 (3,179 MB/s)         334,872 (3,131 MB/s)           5,098    1.55%
bench::easy1_32                                77 (415 MB/s)                65 (492 MB/s)                    -12  -15.58%
bench::easy1_32K                               8,856 (3,700 MB/s)           6,139 (5,337 MB/s)            -2,717  -30.68%
bench::hard_1K                                 31,888 (32 MB/s)             4,654 (220 MB/s)             -27,234  -85.41%
bench::hard_1MB                                58,435,108 (17 MB/s)         4,719,487 (222 MB/s)     -53,715,621  -91.92%
bench::hard_32                                 1,048 (30 MB/s)              199 (160 MB/s)                  -849  -81.01%
bench::hard_32K                                1,033,930 (31 MB/s)          147,389 (222 MB/s)          -886,541  -85.74%
bench::literal                                 20 (2,550 MB/s)              20 (2,550 MB/s)                    0    0.00%
bench::match_class                             84 (964 MB/s)                85 (952 MB/s)                      1    1.19%
bench::match_class_in_range                    33 (2,454 MB/s)              32 (2,531 MB/s)                   -1   -3.03%
bench::match_class_unicode                     2,218 (72 MB/s)              783 (205 MB/s)                -1,435  -64.70%
bench::medium_1K                               1,368 (748 MB/s)             1,334 (767 MB/s)                 -34   -2.49%
bench::medium_1MB                              2,034,481 (515 MB/s)         2,044,757 (512 MB/s)          10,276    0.51%
bench::medium_32                               141 (226 MB/s)               99 (323 MB/s)                    -42  -29.79%
bench::medium_32K                              59,949 (546 MB/s)            59,603 (549 MB/s)               -346   -0.58%
bench::no_exponential                          336,653                      553 (180 MB/s)              -336,100  -99.84%
bench::not_literal                             1,247 (40 MB/s)              293 (174 MB/s)                  -954  -76.50%
bench::one_pass_long_prefix                    264 (98 MB/s)                177 (146 MB/s)                   -87  -32.95%
bench::one_pass_long_prefix_not                267 (97 MB/s)                175 (148 MB/s)                   -92  -34.46%
bench::one_pass_short                          768 (22 MB/s)                134 (126 MB/s)                  -634  -82.55%
bench::one_pass_short_not                      797 (21 MB/s)                136 (125 MB/s)                  -661  -82.94%
bench::replace_all                             149                          153                                4    2.68%
bench_dynamic_compile::compile_huge            161,349                      165,209                        3,860    2.39%
bench_dynamic_compile::compile_huge_bytes      18,050,519                   18,795,770                   745,251    4.13%
bench_dynamic_compile::compile_simple          6,664                        6,883                            219    3.29%
bench_dynamic_compile::compile_simple_bytes    7,035                        7,281                            246    3.50%
bench_dynamic_compile::compile_small           8,914                        9,091                            177    1.99%
bench_dynamic_compile::compile_small_bytes     186,970                      182,815                       -4,155   -2.22%
bench_dynamic_parse::parse_huge                1,238                        1,233                             -5   -0.40%
bench_dynamic_parse::parse_simple              2,005                        2,015                             10    0.50%
bench_dynamic_parse::parse_small               2,494                        2,500                              6    0.24%
bench_sherlock::before_holmes                  42,005,594 (14 MB/s)         2,741,811 (216 MB/s)     -39,263,783  -93.47%
bench_sherlock::everything_greedy              38,431,063 (15 MB/s)         7,807,696 (76 MB/s)      -30,623,367  -79.68%
bench_sherlock::everything_greedy_nl           32,003,966 (18 MB/s)         5,424,922 (109 MB/s)     -26,579,044  -83.05%
bench_sherlock::holmes_cochar_watson           1,457,068 (408 MB/s)         266,557 (2,231 MB/s)      -1,190,511  -81.71%
bench_sherlock::holmes_coword_watson           136,035,549 (4 MB/s)         1,327,967 (448 MB/s)    -134,707,582  -99.02%
bench_sherlock::line_boundary_sherlock_holmes  33,024,291 (18 MB/s)         2,690,485 (221 MB/s)     -30,333,806  -91.85%
bench_sherlock::name_alt1                      157,989 (3,765 MB/s)         77,206 (7,705 MB/s)          -80,783  -51.13%
bench_sherlock::name_alt2                      545,254 (1,091 MB/s)         303,775 (1,958 MB/s)        -241,479  -44.29%
bench_sherlock::name_alt3                      2,245,964 (264 MB/s)         1,385,153 (429 MB/s)        -860,811  -38.33%
bench_sherlock::name_alt3_nocase               4,792,290 (124 MB/s)         1,473,833 (403 MB/s)      -3,318,457  -69.25%
bench_sherlock::name_alt4                      584,204 (1,018 MB/s)         300,912 (1,977 MB/s)        -283,292  -48.49%
bench_sherlock::name_alt4_nocase               2,318,020 (256 MB/s)         1,421,519 (418 MB/s)        -896,501  -38.68%
bench_sherlock::name_holmes                    51,880 (11,467 MB/s)         52,027 (11,435 MB/s)             147    0.28%
bench_sherlock::name_holmes_nocase             1,414,500 (420 MB/s)         1,241,204 (479 MB/s)        -173,296  -12.25%
bench_sherlock::name_sherlock                  34,294 (17,348 MB/s)         34,378 (17,305 MB/s)              84    0.24%
bench_sherlock::name_sherlock_holmes           34,531 (17,228 MB/s)         34,463 (17,262 MB/s)             -68   -0.20%
bench_sherlock::name_sherlock_holmes_nocase    1,692,651 (351 MB/s)         1,281,540 (464 MB/s)        -411,111  -24.29%
bench_sherlock::name_sherlock_nocase           1,657,413 (358 MB/s)         1,281,293 (464 MB/s)        -376,120  -22.69%
bench_sherlock::name_whitespace                131,372 (4,528 MB/s)         60,463 (9,839 MB/s)          -70,909  -53.98%
bench_sherlock::no_match_common                567,065 (1,049 MB/s)         568,357 (1,046 MB/s)           1,292    0.23%
bench_sherlock::no_match_uncommon              23,782 (25,016 MB/s)         23,656 (25,149 MB/s)            -126   -0.53%
bench_sherlock::quotes                         11,251,366 (52 MB/s)         977,907 (608 MB/s)       -10,273,459  -91.31%
bench_sherlock::the_lower                      789,781 (753 MB/s)           794,285 (749 MB/s)             4,504    0.57%
bench_sherlock::the_nocase                     1,807,509 (329 MB/s)         1,837,240 (323 MB/s)          29,731    1.64%
bench_sherlock::the_upper                      53,542 (11,111 MB/s)         54,083 (11,000 MB/s)             541    1.01%
bench_sherlock::the_whitespace                 5,410,444 (109 MB/s)         1,986,579 (299 MB/s)      -3,423,865  -63.28%
bench_sherlock::word_ending_n                  56,017,874 (10 MB/s)         55,205,101 (10 MB/s)        -812,773   -1.45%

Notably, no regressions outside of noise:

$ cargo-benchcmp dynamic-no-lazy-dfa dynamic  --regressions --threshold 5
name  dynamic-no-lazy-dfa ns/iter  dynamic ns/iter    diff ns/iter  diff %
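
For anyone reading the tables, the `diff %` column is the relative change in ns/iter from the first column to the second, so negative means faster. A small sketch of the arithmetic follows. The function names are made up for this example, and treating `--threshold 5` as "more than 5% slower" is my reading of the flag rather than something taken from the tool's source.

```rust
/// Relative change in running time, in percent. Negative means faster.
fn diff_percent(old_ns: u64, new_ns: u64) -> f64 {
    (new_ns as f64 - old_ns as f64) / old_ns as f64 * 100.0
}

/// Assumed meaning of a regression filter: slower by more than the threshold.
fn is_regression(old_ns: u64, new_ns: u64, threshold_percent: f64) -> bool {
    diff_percent(old_ns, new_ns) > threshold_percent
}
```

For `anchored_literal_long_match`, `diff_percent(169, 75)` is about -55.62, matching the table. The largest slowdown in the first table, `compile_huge_bytes` at 4.13%, stays under the 5% threshold, which is why the regressions listing is empty.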

As a bonus, a comparison with PCRE:

$ cargo-benchcmp pcre dynamic --strip-new '^bench_|^bench::'
name                                     pcre ns/iter          dynamic ns/iter         diff ns/iter    diff %
anchored_literal_long_match              90 (4,333 MB/s)       75 (5,200 MB/s)                  -15   -16.67%
anchored_literal_long_non_match          60 (6,500 MB/s)       61 (6,393 MB/s)                    1     1.67%
anchored_literal_short_match             87 (298 MB/s)         75 (346 MB/s)                    -12   -13.79%
anchored_literal_short_non_match         58 (448 MB/s)         61 (426 MB/s)                      3     5.17%
easy0_1K                                 258 (3,968 MB/s)      196 (5,224 MB/s)                 -62   -24.03%
easy0_1MB                                226,139 (4,636 MB/s)  255,138 (4,109 MB/s)          28,999    12.82%
easy0_32                                 60 (533 MB/s)         71 (450 MB/s)                     11    18.33%
easy0_32K                                7,028 (4,662 MB/s)    5,392 (6,077 MB/s)            -1,636   -23.28%
easy1_1K                                 794 (1,289 MB/s)      241 (4,248 MB/s)                -553   -69.65%
easy1_1MB                                751,438 (1,395 MB/s)  334,872 (3,131 MB/s)        -416,566   -55.44%
easy1_32                                 71 (450 MB/s)         65 (492 MB/s)                     -6    -8.45%
easy1_32K                                23,042 (1,422 MB/s)   6,139 (5,337 MB/s)           -16,903   -73.36%
hard_1K                                  30,841 (33 MB/s)      4,654 (220 MB/s)             -26,187   -84.91%
hard_1MB                                 35,239,100 (29 MB/s)  4,719,487 (222 MB/s)     -30,519,613   -86.61%
hard_32                                  86 (372 MB/s)         199 (160 MB/s)                   113   131.40%
hard_32K                                 993,011 (32 MB/s)     147,389 (222 MB/s)          -845,622   -85.16%
literal                                  130 (392 MB/s)        20 (2,550 MB/s)                 -110   -84.62%
match_class                              183 (442 MB/s)        85 (952 MB/s)                    -98   -53.55%
match_class_in_range                     175 (462 MB/s)        32 (2,531 MB/s)                 -143   -81.71%
match_class_unicode                      513 (313 MB/s)        783 (205 MB/s)                   270    52.63%
medium_1K                                278 (3,683 MB/s)      1,334 (767 MB/s)               1,056   379.86%
medium_1MB                               240,699 (4,356 MB/s)  2,044,757 (512 MB/s)       1,804,058   749.51%
medium_32                                61 (524 MB/s)         99 (323 MB/s)                     38    62.30%
medium_32K                               7,369 (4,446 MB/s)    59,603 (549 MB/s)             52,234   708.83%
not_literal                              274 (186 MB/s)        293 (174 MB/s)                    19     6.93%
one_pass_long_prefix                     87 (298 MB/s)         177 (146 MB/s)                    90   103.45%
one_pass_long_prefix_not                 86 (302 MB/s)         175 (148 MB/s)                    89   103.49%
one_pass_short                           117 (145 MB/s)        134 (126 MB/s)                    17    14.53%
one_pass_short_not                       122 (139 MB/s)        136 (125 MB/s)                    14    11.48%
sherlock::before_holmes                  14,450,308 (41 MB/s)  2,741,811 (216 MB/s)     -11,708,497   -81.03%
sherlock::holmes_cochar_watson           546,919 (1,087 MB/s)  266,557 (2,231 MB/s)        -280,362   -51.26%
sherlock::line_boundary_sherlock_holmes  194,524 (3,058 MB/s)  2,690,485 (221 MB/s)       2,495,961  1283.11%
sherlock::name_alt1                      457,899 (1,299 MB/s)  77,206 (7,705 MB/s)         -380,693   -83.14%
sherlock::name_alt2                      496,659 (1,197 MB/s)  303,775 (1,958 MB/s)        -192,884   -38.84%
sherlock::name_alt3                      983,620 (604 MB/s)    1,385,153 (429 MB/s)         401,533    40.82%
sherlock::name_alt3_nocase               3,500,367 (169 MB/s)  1,473,833 (403 MB/s)      -2,026,534   -57.89%
sherlock::name_alt4                      972,128 (611 MB/s)    300,912 (1,977 MB/s)        -671,216   -69.05%
sherlock::name_alt4_nocase               1,877,017 (316 MB/s)  1,421,519 (418 MB/s)        -455,498   -24.27%
sherlock::name_holmes                    398,258 (1,493 MB/s)  52,027 (11,435 MB/s)        -346,231   -86.94%
sherlock::name_holmes_nocase             492,292 (1,208 MB/s)  1,241,204 (479 MB/s)         748,912   152.13%
sherlock::name_sherlock                  268,891 (2,212 MB/s)  34,378 (17,305 MB/s)        -234,513   -87.21%
sherlock::name_sherlock_holmes           197,067 (3,018 MB/s)  34,463 (17,262 MB/s)        -162,604   -82.51%
sherlock::name_sherlock_holmes_nocase    1,112,501 (534 MB/s)  1,281,540 (464 MB/s)         169,039    15.19%
sherlock::name_sherlock_nocase           1,332,423 (446 MB/s)  1,281,293 (464 MB/s)         -51,130    -3.84%
sherlock::name_whitespace                267,257 (2,226 MB/s)  60,463 (9,839 MB/s)         -206,794   -77.38%
sherlock::no_match_common                595,211 (999 MB/s)    568,357 (1,046 MB/s)         -26,854    -4.51%
sherlock::no_match_uncommon              584,057 (1,018 MB/s)  23,656 (25,149 MB/s)        -560,401   -95.95%
sherlock::quotes                         1,208,235 (492 MB/s)  977,907 (608 MB/s)          -230,328   -19.06%
sherlock::the_lower                      1,210,851 (491 MB/s)  794,285 (749 MB/s)          -416,566   -34.40%
sherlock::the_nocase                     1,286,611 (462 MB/s)  1,837,240 (323 MB/s)         550,629    42.80%
sherlock::the_upper                      776,113 (766 MB/s)    54,083 (11,000 MB/s)        -722,030   -93.03%
sherlock::the_whitespace                 1,368,468 (434 MB/s)  1,986,579 (299 MB/s)         618,111    45.17%
sherlock::word_ending_n                  12,018,618 (49 MB/s)  55,205,101 (10 MB/s)      43,186,483   359.33%

My take is that we are quite competitive now. There are a few regexes where some performance is left on the table, but I think it's an otherwise pretty strong showing! (I think many of the performance differences could be resolved if something like the jetscii crate could work on Rust stable. cc @shepmaster)

BurntSushi · 2016-02-15T02:06:52Z

Previous to this PR, the regex! macro was "generally slower." Now it's substantially slower in just about every case:

$ cargo-benchcmp native dynamic 
name                                           native ns/iter        dynamic ns/iter         diff ns/iter   diff %
bench::anchored_literal_long_match             189 (2,063 MB/s)      75 (5,200 MB/s)                 -114  -60.32%
bench::anchored_literal_long_non_match         47 (8,297 MB/s)       61 (6,393 MB/s)                   14   29.79%
bench::anchored_literal_short_match            177 (146 MB/s)        75 (346 MB/s)                   -102  -57.63%
bench::anchored_literal_short_non_match        46 (565 MB/s)         61 (426 MB/s)                     15   32.61%
bench::easy0_1K                                26,578 (38 MB/s)      196 (5,224 MB/s)             -26,382  -99.26%
bench::easy0_1MB                               27,229,730 (38 MB/s)  255,138 (4,109 MB/s)     -26,974,592  -99.06%
bench::easy0_32                                867 (36 MB/s)         71 (450 MB/s)                   -796  -91.81%
bench::easy0_32K                               847,113 (38 MB/s)     5,392 (6,077 MB/s)          -841,721  -99.36%
bench::easy1_1K                                23,525 (43 MB/s)      241 (4,248 MB/s)             -23,284  -98.98%
bench::easy1_1MB                               24,075,047 (43 MB/s)  334,872 (3,131 MB/s)     -23,740,175  -98.61%
bench::easy1_32                                767 (41 MB/s)         65 (492 MB/s)                   -702  -91.53%
bench::easy1_32K                               752,730 (43 MB/s)     6,139 (5,337 MB/s)          -746,591  -99.18%
bench::hard_1K                                 44,053 (23 MB/s)      4,654 (220 MB/s)             -39,399  -89.44%
bench::hard_1MB                                44,982,170 (23 MB/s)  4,719,487 (222 MB/s)     -40,262,683  -89.51%
bench::hard_32                                 1,418 (22 MB/s)       199 (160 MB/s)                -1,219  -85.97%
bench::hard_32K                                1,407,013 (23 MB/s)   147,389 (222 MB/s)        -1,259,624  -89.52%
bench::literal                                 1,202 (42 MB/s)       20 (2,550 MB/s)               -1,182  -98.34%
bench::match_class                             2,057 (39 MB/s)       85 (952 MB/s)                 -1,972  -95.87%
bench::match_class_in_range                    2,060 (39 MB/s)       32 (2,531 MB/s)               -2,028  -98.45%
bench::match_class_unicode                     12,945 (12 MB/s)      783 (205 MB/s)               -12,162  -93.95%
bench::medium_1K                               27,874 (36 MB/s)      1,334 (767 MB/s)             -26,540  -95.21%
bench::medium_1MB                              28,614,500 (36 MB/s)  2,044,757 (512 MB/s)     -26,569,743  -92.85%
bench::medium_32                               896 (35 MB/s)         99 (323 MB/s)                   -797  -88.95%
bench::medium_32K                              892,349 (36 MB/s)     59,603 (549 MB/s)           -832,746  -93.32%
bench::no_exponential                          319,270               553 (180 MB/s)              -318,717  -99.83%
bench::not_literal                             1,477 (34 MB/s)       293 (174 MB/s)                -1,184  -80.16%
bench::one_pass_long_prefix                    653 (39 MB/s)         177 (146 MB/s)                  -476  -72.89%
bench::one_pass_long_prefix_not                651 (39 MB/s)         175 (148 MB/s)                  -476  -73.12%
bench::one_pass_short                          1,016 (16 MB/s)       134 (126 MB/s)                  -882  -86.81%
bench::one_pass_short_not                      1,588 (10 MB/s)       136 (125 MB/s)                -1,452  -91.44%
bench::replace_all                             1,078                 153                             -925  -85.81%
bench_sherlock::before_holmes                  54,264,124 (10 MB/s)  2,741,811 (216 MB/s)     -51,522,313  -94.95%
bench_sherlock::everything_greedy              22,724,158 (26 MB/s)  7,807,696 (76 MB/s)      -14,916,462  -65.64%
bench_sherlock::everything_greedy_nl           22,168,804 (26 MB/s)  5,424,922 (109 MB/s)     -16,743,882  -75.53%
bench_sherlock::holmes_cochar_watson           24,791,824 (23 MB/s)  266,557 (2,231 MB/s)     -24,525,267  -98.92%
bench_sherlock::holmes_coword_watson           885,999,793           1,327,967 (448 MB/s)    -884,671,826  -99.85%
bench_sherlock::line_boundary_sherlock_holmes  25,113,805 (23 MB/s)  2,690,485 (221 MB/s)     -22,423,320  -89.29%
bench_sherlock::name_alt1                      23,382,716 (25 MB/s)  77,206 (7,705 MB/s)      -23,305,510  -99.67%
bench_sherlock::name_alt2                      23,585,220 (25 MB/s)  303,775 (1,958 MB/s)     -23,281,445  -98.71%
bench_sherlock::name_alt3                      80,283,635 (7 MB/s)   1,385,153 (429 MB/s)     -78,898,482  -98.27%
bench_sherlock::name_alt3_nocase               77,357,394 (7 MB/s)   1,473,833 (403 MB/s)     -75,883,561  -98.09%
bench_sherlock::name_alt4                      22,736,520 (26 MB/s)  300,912 (1,977 MB/s)     -22,435,608  -98.68%
bench_sherlock::name_alt4_nocase               26,921,524 (22 MB/s)  1,421,519 (418 MB/s)     -25,500,005  -94.72%
bench_sherlock::name_holmes                    15,145,735 (39 MB/s)  52,027 (11,435 MB/s)     -15,093,708  -99.66%
bench_sherlock::name_holmes_nocase             16,285,042 (36 MB/s)  1,241,204 (479 MB/s)     -15,043,838  -92.38%
bench_sherlock::name_sherlock                  16,189,653 (36 MB/s)  34,378 (17,305 MB/s)     -16,155,275  -99.79%
bench_sherlock::name_sherlock_holmes           14,975,742 (39 MB/s)  34,463 (17,262 MB/s)     -14,941,279  -99.77%
bench_sherlock::name_sherlock_holmes_nocase    16,904,928 (35 MB/s)  1,281,540 (464 MB/s)     -15,623,388  -92.42%
bench_sherlock::name_sherlock_nocase           16,335,907 (36 MB/s)  1,281,293 (464 MB/s)     -15,054,614  -92.16%
bench_sherlock::name_whitespace                14,837,905 (40 MB/s)  60,463 (9,839 MB/s)      -14,777,442  -99.59%
bench_sherlock::no_match_common                16,036,625 (37 MB/s)  568,357 (1,046 MB/s)     -15,468,268  -96.46%
bench_sherlock::no_match_uncommon              15,278,356 (38 MB/s)  23,656 (25,149 MB/s)     -15,254,700  -99.85%
bench_sherlock::quotes                         21,580,801 (27 MB/s)  977,907 (608 MB/s)       -20,602,894  -95.47%
bench_sherlock::the_lower                      16,059,120 (37 MB/s)  794,285 (749 MB/s)       -15,264,835  -95.05%
bench_sherlock::the_nocase                     17,376,836 (34 MB/s)  1,837,240 (323 MB/s)     -15,539,596  -89.43%
bench_sherlock::the_upper                      15,259,087 (38 MB/s)  54,083 (11,000 MB/s)     -15,205,004  -99.65%
bench_sherlock::the_whitespace                 18,835,951 (31 MB/s)  1,986,579 (299 MB/s)     -16,849,372  -89.45%
bench_sherlock::word_ending_n                  59,832,390 (9 MB/s)   55,205,101 (10 MB/s)      -4,627,289   -7.73%

alexcrichton · 2016-02-15T06:12:02Z

Holy cow, nice work @BurntSushi! Some thoughts:

Maybe the pcre benchmarks could be behind an off-by-default feature? The Travis CI could then just do cargo bench --features pcre or something like that.
Should we jettison regex_macros entirely? If it's basically always slower and nightly-only, maybe it should be revisited at a later date if at all?
I like to hear the sound of no API changes!

pczarn · 2016-02-15T13:22:25Z

src/backtrack.rs

 //
-// With the above settings, this comes out to ~3.2MB. Mostly these numbers
+// With the contants below, this comes out to ~1.6MB. Mostly these numbers


contants -> constants

BurntSushi · 2016-02-15T14:06:41Z

@alexcrichton For optional pcre, I tried doing that, but Cargo gives:

Caused by:
  Dev-dependencies are not allowed to be optional: `pcre`

I guess I could put pcre into [dependencies] proper and make it optional, but that feels wrong.

Should we jettison regex_macros entirely? If it's basically always slower and nightly-only, maybe it should be revisited at a later date if at all?

Hmm. I wouldn't necessarily be opposed, because using it is almost always wrong now. I guess it could still technically be useful if you want to execute a regex without allocating (which limits one to is_match, find and find_iter, I think), but that isn't necessarily a goal of regex!---it just happens to be that way now. (Of course, maybe it should be a goal, I don't know.)

There are also a few crates using it.

BurntSushi · 2016-02-15T14:09:03Z

cc @Geal @Manishearth @llogiq @kbknapp (We are talking about possibly removing the regex! macro. See benchmarks above.)

pczarn · 2016-02-15T14:29:30Z

Could you modify regex! to use Regex with lazy_static?

BurntSushi · 2016-02-15T14:35:44Z

@pczarn I don't think we could use lazy_static! explicitly (since that would require all users to add #[macro_use] extern crate lazy_static;), but I think it might be possible to inline the logic from lazy_static! into regex!. The important bit is making sure that one can still do static RE: Regex = regex!("..."); which can be done today I think. At that point though, I question whether it's worth it at all. (I guess a benefit is that the syntax of the regex is guaranteed to be correct.)

pczarn · 2016-02-15T15:46:48Z

src/inst.rs

+    insts: Vec<Inst>,
+    bytes: bool,
+    reverse: bool,
+    byte_classes: Vec<usize>,


byte_classes can be Vec<u8>.
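
To spell out why `u8` is enough: a byte class table maps each of the 256 possible byte values to an equivalence class, and there can never be more than 256 classes, so every class id fits in a `u8`. The sketch below is a hypothetical illustration of building such a table from the byte ranges a program distinguishes. It is not the code under review.

```rust
/// Map every byte to an equivalence class id, given the inclusive byte
/// ranges that the program can tell apart. Bytes in the same class always
/// behave identically, so a DFA only needs one transition per class.
fn byte_classes(ranges: &[(u8, u8)]) -> Vec<u8> {
    // boundary[b] == true means a new class starts right after byte b.
    let mut boundary = [false; 256];
    for &(lo, hi) in ranges {
        if lo > 0 {
            boundary[lo as usize - 1] = true;
        }
        boundary[hi as usize] = true;
    }
    let mut classes = vec![0u8; 256];
    let mut class = 0u8;
    for b in 0..256 {
        classes[b] = class;
        // At most 255 increments can happen, so `class` never overflows.
        if boundary[b] && b < 255 {
            class += 1;
        }
    }
    classes
}
```

With the single range `(b'a', b'z')`, this produces three classes: bytes below `a`, the bytes `a` through `z`, and bytes above `z`.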

llogiq · 2016-02-15T17:14:14Z

Yeah, but we also have a Syntax check for dynamic regexes with clippy, so the advantage is somewhat diminished.

alexcrichton · 2016-02-15T19:01:41Z

@BurntSushi about pcre ah oh well, so long as the CI passes on Windows seems fine to me!

kbknapp · 2016-02-15T20:37:05Z

@BurntSushi nice work! And thanks for the heads up!

BurntSushi · 2016-02-15T20:43:23Z

@pczarn Thanks for the review! I've made both of your suggested changes. Nice catch!

I've also separated the PCRE benchmarks into their own sub-crate. We'll run them on Travis but not AppVeyor.

It was not used in stable at all, because it only works in nightly. Now that regex! is almost always slower¹ there's no reason to keep it in. ¹: rust-lang/regex#164

ArtemGr · 2016-02-16T08:59:44Z

Should we jettison regex_macros entirely? If it's basically always slower and nightly-only, maybe it should be revisited at a later date if at all?

There might be a gap in the logic of assuming regex_macros to be "basically always slower".

Benchmarks only take into account the cost of executing already constructed automata. The automata construction isn't benchmarked.

But keeping the compiled automata in some static variable or field isn't always convenient.

Now, I haven't benchmarked it myself, but from a recent reddit thread numbers on regex performance I'm pretty sure regex! will be much faster than Regex if constructed in place instead of being cached somewhere.

On a different note I'd like to point out that the benchmarks probably only compare the new lazy DFA engine with the default PCRE engine. But the default PCRE engine isn't the fastest engine around. If we're serious about performance, then the JIT PCRE engine should be accounted for.

And regex! compiling the regular expression to native code (Like Ragel does) might be a good way to beat the JIT PCRE in the long term.

Manishearth · 2016-02-16T09:01:27Z

AIUI regex!() is slower than Regex on usage, excluding instantiation, too.

ticki · 2016-02-16T12:38:11Z

Wow, this is great! Nice work, @BurntSushi.

ticki · 2016-02-16T12:40:39Z

@ArtemGr I don't think anyone has claimed that. The implementation just happens to be slow in this case.

BurntSushi · 2016-02-16T12:50:30Z

There might be a gap in the logic of assuming regex_macros

The benchmarks speak for themselves. :-)

Benchmarks only take into account the cost of executing already constructed automata. The automata construction isn't benchmarked.

The automata construction is benchmarked. I've even taken a profiler to it to improve compile times. It is of course true that compilation is not benchmarked in the benchmarks for searching text, because that seems really strange.

It is of course true that regex! will always have faster compilation time at runtime, since it is at 0.

Now, I haven't benchmarked it myself, but from a recent reddit thread numbers on regex performance I'm pretty sure regex! will be much faster than Regex if constructed in place instead of being cached somewhere.

Could you explain more? I'm not sure I understand. Is there any particular reason why lazy_static! doesn't help you here? You can see an example here: https://github.com/rust-lang-nursery/regex#usage-avoid-compiling-the-same-regex-in-a-loop
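
The compile-once pattern being discussed can be sketched without any external crates. The example below is an assumption-laden stand-in: it uses `std::sync::OnceLock` from the standard library (Rust 1.70 and later) in place of `lazy_static!`, and `Compiled` is a hypothetical type standing in for an expensive-to-build regex, not the `regex` crate's API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the expensive constructor actually ran.
static COMPILES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled regex.
struct Compiled {
    needle: String,
}

impl Compiled {
    fn compile(needle: &str) -> Compiled {
        COMPILES.fetch_add(1, Ordering::SeqCst);
        Compiled { needle: needle.to_string() }
    }

    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle.as_str())
    }
}

/// Built on first use and shared by every later call.
fn holmes() -> &'static Compiled {
    static RE: OnceLock<Compiled> = OnceLock::new();
    RE.get_or_init(|| Compiled::compile("Holmes"))
}

/// Calls `holmes()` on every line, but compilation still happens only once.
fn count_matching_lines(text: &str) -> usize {
    text.lines().filter(|line| holmes().is_match(line)).count()
}
```

The point of the pattern is that the call site looks as if the pattern were constructed in place, while the construction cost is paid once per process rather than once per call.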

On a different note, I'd like to point out that the benchmarks probably only compare the new lazy DFA engine with the default PCRE engine. But the default PCRE engine isn't the fastest engine around. If we're serious about performance, then the JIT PCRE engine should be accounted for.

I may have made a mistake benchmarking PCRE, but neglecting the JIT is certainly not one of them. I am, in fact, serious about performance! You can check out how PCRE regexes are constructed for the benchmarks here: https://github.com/rust-lang-nursery/regex/blob/master/benches/bench_pcre.rs#L54-L71 --- If I'm doing anything wrong, I would like to correct it.

And I have even made sure that the PCRE bindings are really enabling the JIT too. You can see the benchmarks before/after for just plain PCRE and PCRE w/ JIT:

[andrew@Liger regex] cargo-benchcmp pcre-nojit pcre-jit
name                                     pcre-nojit ns/iter    pcre-jit ns/iter        diff ns/iter   diff %
anchored_literal_long_match              162 (2,407 MB/s)      90 (4,333 MB/s)                  -72  -44.44%
anchored_literal_long_non_match          88 (4,431 MB/s)       59 (6,610 MB/s)                  -29  -32.95%
anchored_literal_short_match             161 (161 MB/s)        86 (302 MB/s)                    -75  -46.58%
anchored_literal_short_non_match         88 (295 MB/s)         58 (448 MB/s)                    -30  -34.09%
easy0_1K                                 1,490 (687 MB/s)      271 (3,778 MB/s)              -1,219  -81.81%
easy0_1MB                                1,147,791 (913 MB/s)  226,638 (4,626 MB/s)        -921,153  -80.25%
easy0_32                                 92 (347 MB/s)         61 (524 MB/s)                    -31  -33.70%
easy0_32K                                36,342 (901 MB/s)     7,033 (4,659 MB/s)           -29,309  -80.65%
easy1_1K                                 1,418 (722 MB/s)      712 (1,438 MB/s)                -706  -49.79%
easy1_1MB                                1,158,163 (905 MB/s)  749,070 (1,399 MB/s)        -409,093  -35.32%
easy1_32                                 124 (258 MB/s)        72 (444 MB/s)                    -52  -41.94%
easy1_32K                                36,830 (889 MB/s)     23,151 (1,415 MB/s)          -13,679  -37.14%
hard_1K                                  162,687 (6 MB/s)      29,162 (35 MB/s)            -133,525  -82.07%
hard_1MB                                 164,249,154 (6 MB/s)  35,046,957 (29 MB/s)    -129,202,197  -78.66%
hard_32                                  89 (359 MB/s)         85 (376 MB/s)                     -4   -4.49%
hard_32K                                 5,128,962 (6 MB/s)    995,497 (32 MB/s)         -4,133,465  -80.59%
literal                                  162 (314 MB/s)        132 (386 MB/s)                   -30  -18.52%
match_class                              207 (391 MB/s)        176 (460 MB/s)                   -31  -14.98%
match_class_in_range                     206 (393 MB/s)        179 (452 MB/s)                   -27  -13.11%
match_class_unicode                      2,111 (76 MB/s)       534 (301 MB/s)                -1,577  -74.70%
medium_1K                                2,919 (350 MB/s)      293 (3,494 MB/s)              -2,626  -89.96%
medium_1MB                               2,619,833 (400 MB/s)  238,605 (4,394 MB/s)      -2,381,228  -90.89%
medium_32                                91 (351 MB/s)         60 (533 MB/s)                    -31  -34.07%
medium_32K                               80,492 (407 MB/s)     7,474 (4,384 MB/s)           -73,018  -90.71%
not_literal                              1,565 (32 MB/s)       275 (185 MB/s)                -1,290  -82.43%
one_pass_long_prefix                     260 (100 MB/s)        89 (292 MB/s)                   -171  -65.77%
one_pass_long_prefix_not                 260 (100 MB/s)        90 (288 MB/s)                   -170  -65.38%
one_pass_short                           796 (21 MB/s)         118 (144 MB/s)                  -678  -85.18%
one_pass_short_not                       811 (20 MB/s)         120 (141 MB/s)                  -691  -85.20%
sherlock::before_holmes                  31,483,254 (18 MB/s)  14,331,327 (41 MB/s)     -17,151,927  -54.48%
sherlock::holmes_cochar_watson           810,188 (734 MB/s)    546,602 (1,088 MB/s)        -263,586  -32.53%
sherlock::letters                        47,533,812 (12 MB/s)  28,586,898 (20 MB/s)     -18,946,914  -39.86%
sherlock::letters_lower                  46,949,062 (12 MB/s)  27,705,647 (21 MB/s)     -19,243,415  -40.99%
sherlock::letters_upper                  14,959,232 (39 MB/s)  3,698,364 (160 MB/s)     -11,260,868  -75.28%
sherlock::line_boundary_sherlock_holmes  21,851,913 (27 MB/s)  193,300 (3,077 MB/s)     -21,658,613  -99.12%
sherlock::name_alt1                      385,444 (1,543 MB/s)  452,550 (1,314 MB/s)          67,106   17.41%
sherlock::name_alt2                      697,067 (853 MB/s)    491,396 (1,210 MB/s)        -205,671  -29.51%
sherlock::name_alt3                      1,607,896 (370 MB/s)  994,980 (597 MB/s)          -612,916  -38.12%
sherlock::name_alt3_nocase               18,971,907 (31 MB/s)  3,344,872 (177 MB/s)     -15,627,035  -82.37%
sherlock::name_alt4                      696,606 (854 MB/s)    936,383 (635 MB/s)           239,777   34.42%
sherlock::name_alt4_nocase               3,691,771 (161 MB/s)  1,781,904 (333 MB/s)      -1,909,867  -51.73%
sherlock::name_holmes                    423,978 (1,403 MB/s)  398,036 (1,494 MB/s)         -25,942   -6.12%
sherlock::name_holmes_nocase             1,531,623 (388 MB/s)  491,416 (1,210 MB/s)      -1,040,207  -67.92%
sherlock::name_sherlock                  360,692 (1,649 MB/s)  266,261 (2,234 MB/s)         -94,431  -26.18%
sherlock::name_sherlock_holmes           362,400 (1,641 MB/s)  196,224 (3,031 MB/s)        -166,176  -45.85%
sherlock::name_sherlock_holmes_nocase    1,583,591 (375 MB/s)  1,322,505 (449 MB/s)        -261,086  -16.49%
sherlock::name_sherlock_nocase           1,581,447 (376 MB/s)  1,265,250 (470 MB/s)        -316,197  -19.99%
sherlock::name_whitespace                366,499 (1,623 MB/s)  267,019 (2,228 MB/s)         -99,480  -27.14%
sherlock::no_match_common                1,561,214 (381 MB/s)  594,673 (1,000 MB/s)        -966,541  -61.91%
sherlock::no_match_uncommon              319,560 (1,861 MB/s)  583,568 (1,019 MB/s)         264,008   82.62%
sherlock::quotes                         2,197,806 (270 MB/s)  1,211,738 (490 MB/s)        -986,068  -44.87%
sherlock::the_lower                      2,627,774 (226 MB/s)  1,215,907 (489 MB/s)      -1,411,867  -53.73%
sherlock::the_nocase                     2,511,957 (236 MB/s)  1,276,917 (465 MB/s)      -1,235,040  -49.17%
sherlock::the_upper                      446,597 (1,332 MB/s)  770,944 (771 MB/s)           324,347   72.63%
sherlock::the_whitespace                 2,838,721 (209 MB/s)  1,359,074 (437 MB/s)      -1,479,647  -52.12%
sherlock::word_ending_n                  27,965,770 (21 MB/s)  12,697,792 (46 MB/s)     -15,267,978  -54.60%
sherlock::words                          18,562,934 (32 MB/s)  10,759,892 (55 MB/s)      -7,803,042  -42.04%
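
For readers checking the table: the last two columns look like the plain difference between the two timings and that difference relative to the first column. That reading matches the rows above, but it is an assumption about how cargo-benchcmp reports results, not something taken from its source. A minimal sketch of the arithmetic, where `diff` is a hypothetical helper:

```rust
/// Absolute and relative change between two timings in ns/iter.
/// Negative values mean the second run (here: PCRE with JIT) was faster.
/// Assumption: this mirrors the "diff ns/iter" and "diff %" columns.
fn diff(old_ns: u64, new_ns: u64) -> (i64, f64) {
    let delta = new_ns as i64 - old_ns as i64;
    let percent = delta as f64 / old_ns as f64 * 100.0;
    (delta, percent)
}
```

For anchored_literal_long_match, `diff(162, 90)` gives -72 and roughly -44.44%. A positive result, as in sherlock::name_alt1 (385,444 to 452,550), means the JIT run was slower on that benchmark.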

And having regex! compile the regular expression to native code (like Ragel does) might be a good way to beat the JIT PCRE in the long term.

We are already beating JIT PCRE now on many of the micro benchmarks. But yes, certainly something like Ragel should probably be in the regex! macro's future, but it may need to be augmented with other things to support sub-capture locations. It's quite a big undertaking!

To be clear: I welcome improvements to the benchmark suite. I wrote many of them (not all), so the suite is likely biased in favor of regexes that are better executed by this library. A possibly better methodology would be to grep source code for use of regexes and use those instead. That is however a lot of work, especially since performance can vary greatly based on the input, which is harder to capture from real world usage.

shepmaster · 2016-02-16T16:20:45Z

I guess a benefit is that the syntax of the regex is guaranteed to be correct.

That is actually my favorite feature of the macro. Forcing the regex to be checked at compile time removes an error check I need to handle in my code, even if it's just with unwrap. The performance boost was icing on the cake.

ArtemGr · 2016-02-16T17:50:47Z

Could you explain more? I'm not sure I understand. Is there any particular reason why lazy_static! doesn't help you here?

When prototyping, it's much easier to throw in a quick regex!("re").is_match() than to look for the right place to put a static variable. It's about keeping the global namespace lean; the principle is aptly covered here: http://www.youtube.com/watch?v=5Nc68IdNKdg.

P.S. One might use lazy_static! inside a function, but I keep forgetting it.
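
A function-local static can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's recommended code: `Pattern` and `Pattern::compile` are hypothetical stand-ins for `Regex` and `Regex::new`, and `std::sync::OnceLock` (a newer standard library type) stands in for `lazy_static!`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the stand-in compile step runs.
static COMPILES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled regex; a real program would hold a `Regex`.
struct Pattern {
    needle: String,
}

impl Pattern {
    /// Stand-in for the expensive `Regex::new` call.
    fn compile(needle: &str) -> Pattern {
        COMPILES.fetch_add(1, Ordering::SeqCst);
        Pattern { needle: needle.to_string() }
    }

    /// Stand-in for `Regex::is_match`; only does a substring search.
    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle.as_str())
    }
}

/// The static lives inside the function, so no global name is introduced.
/// The pattern is built on the first call and reused on every later call.
fn mentions_holmes(text: &str) -> bool {
    static RE: OnceLock<Pattern> = OnceLock::new();
    RE.get_or_init(|| Pattern::compile("Holmes")).is_match(text)
}
```

Calling `mentions_holmes` any number of times leaves `COMPILES` at 1, which is the property that matters when the function sits in a loop.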

A lot of users come from interpreted languages where regular expressions are cached by the language. The idea of caching the regular expression manually might be alien to them.

You can check out how PCRE regexes are constructed for the benchmarks here: https://github.com/rust-lang-nursery/regex/blob/master/benches/bench_pcre.rs#L54-L71 --- If I'm doing anything wrong, I would like to correct it.

Looks good!
And thanks for the extra benchmarks.

BurntSushi · 2016-02-16T17:58:05Z

@ArtemGr Thanks for responding. There's no question that regex! is a nice ergonomic win, and the only real reason why it's possible is because its runtime cost is free(ish), ceteris paribus. Doing automatic caching of a regex at runtime feels like bad juju to me. ("You mean when my Regex goes out of scope it actually leaves some of its state behind in some global cache somewhere?") I'd rather take it as an opportunity to educate others about the cost centers in their program. With that said, if there's anything more I can do on the documentation front (there are examples now in the API docs and the README added recently) to help facilitate that, then I'd be happy to hear thoughts.

ArtemGr · 2016-02-16T18:04:37Z

With that said, if there's anything more I can do on the documentation front (there are examples now in the API docs and the README added recently) to help facilitate that, then I'd be happy to hear thoughts.

Cool, next time I peruse the regex docs I'll watch out for any place that could be improved.
Also, beating PCRE-JIT is impressive! I'm impressed! : )
Congratz!

jnicholls · 2016-02-18T16:42:32Z

So, after this DFA update my regex replace operation no longer works. Example code:

let re = Regex::new(r"(?m:(\s*pub _bindgen_bitfield_\d+_: \w+,\s*\n)(\s*pub _bindgen_bitfield_\d+_: \w+,\s*\n)+)").unwrap();
let code = re.replace_all(&code, "$1");

This code is to work around an issue with rust-bindgen by replacing extra generated bitfields. After updating to the latest regex crate, this regex no longer works (no matches are found).

Please advise. Thanks.

BurntSushi · 2016-02-18T17:26:20Z

@jnicholls Probably best to file a new issue. Could you also show some text that should be matched? Thanks.

jnicholls · 2016-02-18T17:48:05Z

@BurntSushi Thanks, #169 created.

Since rust-lang/regex#164 the "dynamic" regex generated is faster than what the `regex!` macro produces. Still, to avoid the overhead of recreating the regex every time, we use lazy_static to initialize it once at first use.

BurntSushi force-pushed the dfa-pr branch from e5a5198 to 43146f6 Compare February 15, 2016 02:00

pczarn reviewed Feb 15, 2016

src/inst.rs

insts: Vec<Inst>,
bytes: bool,
reverse: bool,
byte_classes: Vec<usize>,

pczarn Feb 15, 2016

byte_classes can be Vec<u8>.
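
One reading of this suggestion, stated as an assumption rather than taken from the PR: if byte_classes maps each of the 256 possible byte values to an equivalence-class id, there can be at most 256 classes, so every id fits in a u8. The sketch below illustrates that bound; `byte_classes` here is a hypothetical helper, not the crate's code.

```rust
/// Map every byte value to an equivalence-class id, given the inclusive byte
/// ranges a program distinguishes. Bytes that no range tells apart share an id.
fn byte_classes(ranges: &[(u8, u8)]) -> Vec<u8> {
    // boundary[b] is true when byte b is the last byte of its class.
    let mut boundary = [false; 256];
    for &(start, end) in ranges {
        if start > 0 {
            boundary[start as usize - 1] = true;
        }
        boundary[end as usize] = true;
    }

    let mut classes = vec![0u8; 256];
    let mut class = 0u8;
    for b in 0..256 {
        classes[b] = class;
        // At most 255 increments happen before the last byte, so the
        // counter never exceeds 255 and cannot overflow a u8.
        if boundary[b] && b < 255 {
            class += 1;
        }
    }
    classes
}
```

With the single range `(b'a', b'z')` this yields three classes: everything below `a`, the range itself, and everything above `z`. On a 64-bit target the table takes 256 bytes as a Vec<u8>, compared with 2,048 bytes as a Vec<usize>.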

BurntSushi force-pushed the dfa-pr branch from 43146f6 to 2aa1727 Compare February 15, 2016 20:42

BurntSushi added a commit that referenced this pull request Feb 15, 2016

Merge pull request #164 from rust-lang-nursery/dfa-pr

2de9af5

Add a lazy DFA.

BurntSushi merged commit 2de9af5 into master Feb 15, 2016

BurntSushi deleted the dfa-pr branch February 15, 2016 21:24

This was referenced Feb 16, 2016

refactor: Remove dependency on regex_macro clog-tool/clog-lib#13

Merged

refactor: Remove dependency on regex_macro clog-tool/clog-cli#84

Merged
