Different -w behavior from grep/git-grep #389

Closed
crumblingstatue opened this issue Mar 1, 2017 · 3 comments · Fixed by #1017
Labels
libripgrep An issue related to modularizing ripgrep into libraries. question An issue that is lacking clarity on one or more points.

crumblingstatue commented Mar 1, 2017

$ echo '-2' | rg -w '\-2' # rg yields no results
$ echo '-2' | grep -w '\-2' # grep (and git grep) matches
-2

Not sure if this is intentional or not, but it did surprise me that I didn't find what I was looking for when searching my codebase with ripgrep, and I had to resort to git-grep.

I used -w because I was specifically looking for the value -2, and not e.g. -24.

Using ripgrep 0.4.0.

BurntSushi (Owner) commented

This is highly interesting.

The relevant passage from man grep (for GNU grep) is:

-w, --word-regexp
       Select only those lines containing matches that form whole words. The
       test is that the matching substring must either be at the beginning of
       the line, or preceded by a non-word constituent character. Similarly,
       it must be either at the end of the line or followed by a non-word
       constituent character. Word-constituent characters are letters, digits,
       and the underscore. This option has no effect if -x is also specified.

The key part is "... either be at the beginning of the line, or preceded by a non-word constituent character." ripgrep currently implements the -w flag by translating the given pattern to \b(?:pattern)\b, but it looks like grep actually does (?:^|\b)(?:pattern)(?:$|\b). I guess ripgrep should do that as well.
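
To make the difference between the two translations concrete, here is a small Python sketch. It uses Python's `re` module rather than ripgrep's own regex engine, on the assumption that `\b`, `^` and `$` behave the same way for these ASCII inputs; the helper names are made up for illustration.

```python
import re

def wrap_boundary(pattern):
    # The translation ripgrep is described as using for -w.
    return r"\b(?:" + pattern + r")\b"

def wrap_hypothesis(pattern):
    # The hypothesized grep behavior: a line anchor or a word boundary on each side.
    return r"(?:^|\b)(?:" + pattern + r")(?:$|\b)"

# \b requires a word character on exactly one side, so it cannot match
# between the start of the line and '-'.
boundary_hit = re.search(wrap_boundary("-2"), "-2")

# Adding the anchors rescues the match when '-2' is the whole line.
hypothesis_hit = re.search(wrap_hypothesis("-2"), "-2")

print(boundary_hit)
print(hypothesis_hit)
```

Under these assumptions the first search finds nothing and the second finds `-2`, which is the behavior reported at the top of this issue.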

If my hypothesis is correct, then echo ' -2' | grep -w -e '-2' should return nothing. This is because neither the space nor - matches \w, and therefore \b shouldn't match at the position between them. Interestingly, it does return a match:

$ echo ' -2' | grep -w -e '-2'
 -2

While the ripgrep command that spells out my hypothesized translation does not:

$ echo ' -2' | rg -e '(^|\b)-2($|\b)'

Interestingly, the same pattern given to egrep does not match either:

$ echo ' -2' | egrep -e '(^|\b)-2($|\b)'

This has to mean that my interpretation of -w is wrong.

Re-reading it, it now seems clearer to me that it isn't actually using word boundary assertions, since it says "preceded by a non-word constituent character" but doesn't say anything about the first letter of the match.

Looking at the source of GNU grep, I spotted this:

  /* In the match_words and match_lines cases, we use a different pattern
     for the DFA matcher that will quickly throw out cases that won't work.
     Then if DFA succeeds we do some hairy stuff using the regex matcher
     to decide whether the match should really count. */
  if (match_words || match_lines)
    {
      static char const line_beg_no_bk[] = "^(";
      static char const line_end_no_bk[] = ")$";
      static char const word_beg_no_bk[] = "(^|[^[:alnum:]_])(";
      static char const word_end_no_bk[] = ")([^[:alnum:]_]|$)";
      static char const line_beg_bk[] = "^\\(";
      static char const line_end_bk[] = "\\)$";
      static char const word_beg_bk[] = "\\(^\\|[^[:alnum:]_]\\)\\(";
      static char const word_end_bk[] = "\\)\\([^[:alnum:]_]\\|$\\)";

Which is interesting and confirms my suspicion that \b isn't actually being used to implement -w.
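
As a rough illustration of what those wrapper strings do, the sketch below rebuilds the non-backslash variant in Python. Two assumptions are worth flagging: Python's `re` has no `[:alnum:]` class, so an ASCII-only stand-in is used, and per the source comment this pattern is only the quick DFA filter, so the sketch does not reproduce the later checks grep performs. The function names are hypothetical.

```python
import re

# ASCII stand-in for POSIX [^[:alnum:]_] (an assumption; not locale-aware).
NON_WORD = r"[^A-Za-z0-9_]"

def wrap_like_grep(pattern):
    # Mirrors word_beg_no_bk + pattern + word_end_no_bk from the excerpt.
    return "(^|" + NON_WORD + ")(" + pattern + ")(" + NON_WORD + "|$)"

def matches_word(pattern, line):
    # True if the wrapped pattern matches anywhere in the line.
    return re.search(wrap_like_grep(pattern), line) is not None

for line in ["-2", " -2", "x = -2;", "-24", "x-2"]:
    print(repr(line), matches_word("-2", line))
```

Only the surrounding characters are tested here, so `-2` matches at the start of a line or after a space, while `-24` and `x-2` are rejected because a word character sits directly next to the match.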


I will need to think on this more to figure out what to do. In the meantime, depending on your use case, you could work around this with:

$ echo ' -2' | rg -e '(^|\W)-2($|\W)'
1: -2

(This isn't a full solution though, since your colors will be messed up by including the surrounding non-word characters.)
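
The highlighting problem comes from the surrounding characters being part of the match. The Python sketch below compares the consuming workaround with a look-around version that only asserts the context. This is an illustration of the idea, not a drop-in ripgrep pattern: ripgrep's default regex engine does not support look-around.

```python
import re

line = " -2"

# Workaround from above: the neighboring non-word character is consumed,
# so it becomes part of the reported (and highlighted) match.
consuming = re.search(r"(^|\W)-2($|\W)", line)

# Look-around checks the same context without consuming it.
asserting = re.search(r"(?<!\w)-2(?!\w)", line)

print(consuming.span(), repr(consuming.group(0)))
print(asserting.span(), repr(asserting.group(0)))
```

The consuming version reports the span `(0, 3)`, including the leading space, while the look-around version reports `(1, 3)`, covering only `-2`.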

@BurntSushi BurntSushi added the question An issue that is lacking clarity on one or more points. label Mar 1, 2017
@crumblingstatue crumblingstatue changed the title Different word boundary behavior from grep/git-grep Different -w behavior from grep/git-grep Mar 1, 2017
crumblingstatue (Author) commented

> Re-reading it, it now seems clearer to me that it isn't actually using word boundary assertions, since it says "preceded by a non-word constituent character" but doesn't say anything about the first letter of the match.

I think that's the key point. The match itself can contain both word and non-word characters. It's the surrounding context that matters. That way, you can intuitively match expressions like 2 - 2, and it will only match that, and not e.g. 42 - 24.
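
The "surrounding context" rule can also be sketched without rewriting the pattern at all: find each match, then inspect the character on either side. This is a simplified Python illustration with hypothetical helper names, not how grep or ripgrep implement `-w`. In particular, `re.finditer` only yields non-overlapping leftmost matches, so a rejected match could hide an overlapping one that would have passed.

```python
import re

def is_word_char(ch):
    # Letters, digits and the underscore, as in the grep man page.
    return ch.isalnum() or ch == "_"

def word_matches(pattern, line):
    """Return spans of matches whose neighbors are not word characters."""
    spans = []
    for m in re.finditer(pattern, line):
        start, end = m.span()
        before_ok = start == 0 or not is_word_char(line[start - 1])
        after_ok = end == len(line) or not is_word_char(line[end])
        if before_ok and after_ok:
            spans.append((start, end))
    return spans

print(word_matches(r"2 - 2", "2 - 2"))
print(word_matches(r"2 - 2", "42 - 24"))
print(word_matches(r"-2", "x = -2;"))
print(word_matches(r"-2", "-24"))
```

The match itself may start or end with any character; only the characters just outside it decide whether it counts as a whole word.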

BurntSushi (Owner) commented

I have pretty high confidence that this will be fixed in libripgrep.

BurntSushi added a commit that referenced this issue Aug 19, 2018
This commit updates the CHANGELOG to reflect all the work done to make
libripgrep a reality.

* Closes #162 (libripgrep)
* Closes #176 (multiline search)
* Closes #188 (opt-in PCRE2 support)
* Closes #244 (JSON output)
* Closes #416 (Windows CRLF support)
* Closes #917 (trim prefix whitespace)
* Closes #993 (add --null-data flag)
* Closes #997 (--passthru works with --replace)

* Fixes #2 (memory maps and context handling work)
* Fixes #200 (ripgrep stops when pipe is closed)
* Fixes #389 (more intuitive `-w/--word-regexp`)
* Fixes #643 (detection of stdin on Windows is better)
* Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird)
* Fixes #764 (coalesce color escapes)
* Fixes #922 (memory maps failing is no big deal)
* Fixes #937 (color escapes no longer used for empty matches)
* Fixes #940 (--passthru does not impact exit status)
* Fixes #1013 (show runtime CPU features in --version output)
BurntSushi added a commit that referenced this issue Aug 20, 2018 (same commit message as the Aug 19 commit above)