regex: fix inner literal extraction that resulted in false negatives #2885
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
In some rare cases, it was possible for ripgrep's inner literal detector
to extract a set of literals that could produce a false negative. #2884
gives an example:
(?i:e.x|ex)
. In this case, the set extracted can bediscovered by running
rg '(?i:e.x|ex) --trace
:This extraction leads to building a multi-substring matcher for
EX
,Ex
,eX
andex
. Searching the haystacke-x
produces no match,and thus, ripgrep shows no matches. But the regex
(?i:e.x|ex)
matchese-x
.The issue at play here was that when two extracted literal sequences
were unioned, we were incorrectly unioning their "prefix" attribute.
And this in turn leads to those literal sequences being combined
incorrectly via cross product. This case in particular triggers it
because two different optimizations combine to produce an incorrect
result. Firslty, the regex has a common prefix extracted and is
rewritten as
(?i:e(?:.x|x))
. Secondly, thex
in the first branch ofthe alternation has its
prefix
attribute set tofalse
(correctly),which means it can't be cross producted with another concatenation. But
in this case, it is unioned with the
x
from the second branch, andthis results in the union result having
prefix
set totrue
. Thisin turn pops up and lets it get cross producted with the
e
prefix,producing an incorrect literal sequence.
We fix this by changing the implementation of
union
to returnprefix
set totrue
only when both literal sequences being unionedhave
prefix
set totrue
.Doing this exposed a second bug that was present, but was purely
cosmetic: the extracted literals in this case, after the fix, are
X
andx
. They were considered "exact" (i.e., lead to a match),but of course they are not. Observing an
X
or anx
does not meanthere is a match. This was fixed by making
choose
always returnan inexact literal sequence. This is perhaps too conservative in
aggregate in some cases, but always correct. The idea here is that if
one is choosing between two concatenations, then it is likely the case
that the sequence returned should be considered inexact. The issue
is that this can lead to avoiding cross products in some cases that
would otherwise be correct. This is bad because it means extracting
shorter literals in some cases. (In general, the longer the literal the
better.) But we prioritize correctness for now and fix it. You can see
a few tests where this shortens some extracted literals.
Fixes #2884