consider providing \< and \> #469

BurntSushi · 2018-04-23T23:46:07Z

The \< and \> escape sequences correspond to "left word boundary" and "right word boundary," respectively. They are more specific variants of \b.

There was a request to add them to ripgrep, but in reality, this is more a request for the regex engine itself.

It's not clear whether these are worthwhile things to add, but they are present in other regex engines. There's a secondary question of whether their negation should be support too, although I don't think that's a common thing.

Since unrecognized escape sequences produce a syntax error today, these can be added in a backwards compatible way at any time.

The text was updated successfully, but these errors were encountered:

mqudsi · 2018-12-30T15:59:45Z

Note that in PCRE2 (as discussed in BurntSushi/ripgrep#1151) \> is generally recognized as an escape (although as @BurntSushi notes, it is an error in the current dialect), which may lead to unexpected results.

PCRE2 does not support \> and \< as word boundaries, which as I understand it are GNU extensions to the original dialect. I'm not able to secure information about which regex engines do support it, but I think the biggest one people are familiar with is through vim (which doesn't support \b, if I'm remembering correctly).

Just some food for thought.

jacob-keller · 2020-12-16T08:39:32Z

Grep supports them, as well as other gnu tools like sed, and emacs. Vim obviously supports them.

The primary reason why I'd like them implemented isn't that \b doesn't work for most use cases but mostly because I habitually type < instead of \b due to use in vim which doesn't support \b.. I also find it easier to think about wrapping a word in < > mentally verses using the same boundary key for left and right side..

It's not a huge deal as much as it required retraining to use different sequences in different programs. It can be worked around well enough...

I understand that specific word boundaries aren't all that useful because you generally follow up with either a word or non word character on at least one side. So yea they don't necessarily hold their weight in terms of adding useful features... It's more about compatibility with other tools.

BurntSushi · 2022-05-17T14:38:23Z

I think I'm just going to say no to this feature now. I'm happy to revisit it in the future, but I think I'd want something compelling. It is also worth nothing that even GNU grep supports \b:

$ echo 'foo bar' | egrep '\w\b' -o
o
r

So it kind of seems like vim is the oddball here that doesn't support \b.

BurntSushi · 2022-06-15T14:51:04Z

See this thread on HN for a discussion with a user that feels strongly enough about using \< and \> that they patched this crate to support it: https://news.ycombinator.com/item?id=31713791

KoviRobi · 2023-02-02T17:02:44Z

I would like to argue for this, because there is a common workflow (for me at least) that this gets in the way:

In vim, you use * to search for a word, it puts \<word\> into your / register (search register)
If you want to then search for it in other files, you just type :grep then press <Ctrl-R> then / and it inserts the contents of the / register.
Now I have to navigate and patch up the regexp so that ripgrep doesn't choke on it, which is a small annoyance but this often happens a fair amount when I'm navigating so it pops up a lot.

Given there is a patch (the URL from HN points to hvdijk@511c648 to make it easier to find, I find HN to be a difficult to read wall of text), is there any reason why we couldn't merge that patch? Could I help?

But also, if you really don't want to add \< and \>, then even just ignoring it as per #93 would be nice -- even if you make ignoring escapes an optional behaviour in ripgrep that you pass a flag to, you can tell vim to pass custom flags to the grep program.

BurntSushi · 2023-02-02T18:07:46Z

@KoviRobi I'm not sure what you mean by ignoring them. Ignoring them would mean that, e.g., \< would match <, which is probably not what you want given the workflow you described?

BurntSushi · 2023-02-02T18:10:00Z

I'll re-open this for now. We can consider adding this once #656 is done.

KoviRobi · 2023-02-02T18:18:10Z

@KoviRobi I'm not sure what you mean by ignoring them. Ignoring them would mean that, e.g., \< would match <, which is probably not what you want given the workflow you described?

Ah I misunderstood the issue I linked, I thought it ignored the escape characters at all, so \< would be the same as the empty string, not ignored the escape so \< is the same as <.

I'll re-open this for now. We can consider adding this once #656 is done.

Excellent, thank you -- let me know if I can be of any help

BurntSushi · 2023-10-03T02:02:47Z

In addition to \< and \>, I've been noodling on also adding these:

\b{start} as an alias for \<.
\b{end} as an alias for \>.
\b{prev-non-word} for (?<!\w), or equivalently, (?<=\W|^).
\b{next-non-word} for (?!\w), or equivalently, (?=\W|$).

I like \b{start} and \b{end} because I find them less opaque and also leave the door open for \B{start} and \B{end}, although I'd like to avoid adding those for the time being.

The \b{prev-non-word} and \b{next-non-word} are somewhat more specialized. For example, \b{start} only matches at a position where the left side is \W and the right side is \w, but \b{prev-non-word} only matches whenever the left side is \W, even if the right side is also \W. It turns out that this is quite useful for implementing the -w/--word-regexp flag in grep programs, where standard behavior is not to do \b(?:pat)\b, but (?<!\w)(?:pat)(?!\w). Indeed, that's how -w/--word-regexp is implemented in ripgrep when using PCRE2. But in a regex engine without general look-around, implementing it is a tortured affair. Indeed, I've found some latent bugs in the implementation.

I've thought for some time about how to implement something like \b{prev-non-word}/\b{next-non-word} outside of the regex engine and I haven't been able to come up with something that is obviously correct. However, it seems to me that they (along with \b{start} and \b{end}) could slot right in as new look-around assertions in the regex engine. And then there would no longer need to be any work-arounds in ripgrep or any other routine that wants similar semantics.

One downside of this is that adding new look-around assertions will require breaking change releases of both regex-syntax and regex-automata. I was hoping to avoid doing a breaking change release so quickly after the last one, but I'd like to get the -w/--word-regexp stuff cleaned up in ripgrep. And I can bundle in a few other planned breaking changes while I'm at it: #1090, #1088, #1031 and maybe a couple others that slip my mind at the moment.

My plan at this point is to think a bit more about working around this in ripgrep without adding the new word boundary assertions to the regex engine. If I can come up with something that makes me happy, then I might forgo this for now.

BurntSushi · 2023-10-04T11:38:51Z

I've been thinking about names because I really don't like \b{prev-non-word} and \b{next-non-word}. I'm thinking that I like \b{start-half} and \b{end-half} better. (Although I still don't like them.)

This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469

This adds Ast and Hir support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469

This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469

This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469

This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469

This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469

This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469

Closes #469

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

BurntSushi · 2023-10-08T17:43:55Z

I have this implemented in #1098 and I'm currently planning to include it in the regex 1.10 release. Specifically, this is the new syntax as it will appear in the docs:

\b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
\b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right))
\b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
\b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)

This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469

This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469

This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469

Closes #469

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

ratmice · 2023-10-08T21:21:45Z

I have this implemented in #1098 and I'm currently planning to include it in the regex 1.10 release. Specifically, this is the new syntax as it will appear in the docs:
\b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
\b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right))
\b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
\b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)

I haven't looked through the patch series yet, but this caught my eye.
Does this add the < character to regex-syntax is_meta_character, there is a decently sized comment
here where I will need to test/update some downstream code when this gets included.

https://github.com/softdevteam/grmtools/blob/cf6af64db67fed12c7259902794b98c627daadf1/lrlex/src/lib/parser.rs#L511

BurntSushi · 2023-10-08T21:41:39Z

Thanks for checking!

No, it doesn't. This change doesn't make < a meta character. That would break things so bad as to be untenable. It's only \< that is interpreted as a word boundary. If you try using \< in regex 1.9, you get an error:

$ regex-cli debug hir '\<'
failed to parse pattern 0 to AST: '\<': failed to parse pattern: regex parse error:
    \<
    ^^
error: unrecognized escape sequence

ratmice · 2023-10-08T22:09:36Z

I guess the problem then is that the code linked above is going to strip the \, off of \< unless it is a meta character i.e. currently we already need to escape < into \< to get it passed to regex as <, so I think to use this word boundary as-is without modifying the linked code above one would have to double escape \, via \\\<.

I would definitely recommend people rather use the \b{...} instead from lex, so thanks for adding those alternatives.
Anyhow I think it's not pretty but at least we can send both < via single-escape, and \< via double escape through the format.
So I think it should all work out.

But yeah, I concur that adding it as a meta-character would seem to be a mistake as it would become impossible to encode <. directly in our format.

So I don't believe there are any major issues or objections on my side of this.

BurntSushi · 2023-10-08T22:14:27Z

currently we already need to escape < into \< to get it passed to regex as <

Eh? No. The regex < matches < literally. As I showed above, if you try to write \< today, then you get an error. So you can't be doing that today...

Is there some existing valid regex pattern whose behavior or interpretation changes as a result of this PR?

Also, where and how are you using this crate? (Out of curiosity.)

ratmice · 2023-10-08T22:34:09Z

Sorry, for not making things explicitly clear, I'm dealing with 2 layers of escaping:

quoting from ¹

Both Lex and grmtools support escaping arbitrary characters, for all other characters
besides those listed above, when given an escaped character \c it will be passed to
the regex engine as the character c. This is useful when a character is used within
the lex format.

An example of this is when the character < is used at the beginning of a regex. Both Lex
and grmtools interpret this as the beginning of a start condition prefix. Which can be
escaped with \< to ensure it is treated as the start of a regular expression.

There is a simple testcase usage here², and an example where we use < before a regex, but not as part of the regex itself ³.

I use the crates in my lex/yacc lsp implementation⁴, instead of codegen it builds parse tables at runtime and so you can modify grammars and modify/parse input sources at runtime from a traditional lex/yacc specification.

Anyhow, from what I can tell the existing code I linked should be acceptable without changes, with the escaping woes stemming from the layer of escaping on top of regex.

BurntSushi · 2023-10-08T22:44:44Z

Yes, you shouldn't have to change anything. Nothing in this PR should change the behavior or semantics of any existing valid regex. And since < was never and will never be a meta character on its own, there is no need to escape it for this crate. Whatever escaping you're already doing must necessarily be sufficient after this PR, for if it weren't, it would imply your escaping routine would lead to an invalid regex pattern in the status quo.

This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469

This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469

This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469

Closes #469

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

KoviRobi · 2023-10-09T19:42:29Z

Thank you for this, I was going to give it a try by patching Cargo.toml in ripgrep https://github.com/KoviRobi/ripgrep/tree/use-word-boundary-regex but seems like I did something wrong, when I try to test it out I get the following:

ripgrep/$ cargo r -- \<foo\>
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/rg '\<foo\>'`
regex parse error:
    (?:\<foo\>)
       ^^
error: unrecognized escape sequence

possibly I have just misunderstood something (my understanding is ripgrep uses this engine, but maybe it also has some syntax checking that needs to be patched?)

BurntSushi · 2023-10-09T19:51:54Z

@KoviRobi You might need to explicitly patch regex-automata too. ripgrep master is in a somewhat state of flux at the moment, as I'm working on the ripgrep 14 release. In particular, it should no longer depend on regex itself at all. (It only uses it in dev-dependencies.) Instead it uses regex-automata directly.

BurntSushi · 2023-10-09T19:53:55Z

Note also that ripgrep's -w/--word-regexp flag will be using \b{start-half} and \b{end-half}, not \< and \>. (Although of course \< and \> will still be available for use explicitly.)

KoviRobi · 2023-10-09T20:15:38Z

@BurntSushi thank you, I have updated https://github.com/KoviRobi/ripgrep/tree/use-word-boundary-regex by updating the regex-automata and regex-syntax, then fixing the remaining errors to my best intuition, it seems to work!

ripgrep/$ cargo r -- \<oo\>
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/rg '\<oo\>'`
crates/regex/src/word.rs
289:        assert_eq!(Some((3, 7)), find(r"f?oo!?", "##\nfoo!##"));

And being able to use \< and \> explicitly fits my use-case very well :)

BurntSushi · 2023-10-09T20:46:07Z

Aye yeah, both regex-syntax and regex-automata have breaking changes (which means they get 0.8.0 and 0.4.0 releases, respectively). Once the releases are out, I'll update ripgrep master pretty quickly and you can drop your patch.

This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469

This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469

This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

BurntSushi · 2023-10-10T00:30:50Z

@KoviRobi You should be good to build ripgrep from master now!

KoviRobi · 2023-10-10T10:53:11Z

@BurntSushi thanks! Also just saw your last commit message about fixing up past commits being an effort, have you come across https://github.com/tummychow/git-absorb ? It makes the fixup workflow extremely simple

BurntSushi · 2023-10-10T12:15:49Z

Yes, I use it all the time. The issue there wasn't finding the right commit, but the conflicts that I guessed would result from inserting the change in the Cargo.lock file. (I didn't actually try it though, it was just my sense of things at the time.)

BurntSushi added enhancement question labels Apr 23, 2018

BurntSushi mentioned this issue Apr 23, 2018

[Feature request] Support \< and \> as an alias to \b BurntSushi/ripgrep#872

Closed

BurntSushi mentioned this issue Dec 30, 2018

Allow escaping < and > BurntSushi/ripgrep#1151

Closed

BurntSushi closed this as completed May 17, 2022

BurntSushi reopened this Feb 2, 2023

BurntSushi mentioned this issue Oct 3, 2023

allow starting DFA in noncontinuous bytes #1031

Closed

BurntSushi added a commit that referenced this issue Oct 8, 2023

doc: explain the new word boundary assertions

b9f0212

Closes #469

BurntSushi added a commit that referenced this issue Oct 8, 2023

lite: add special word boundaries to regex-lite

eab70f5

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

BurntSushi added a commit that referenced this issue Oct 8, 2023

doc: explain the new word boundary assertions

49d67f0

Closes #469

BurntSushi added a commit that referenced this issue Oct 8, 2023

lite: add special word boundaries to regex-lite

d53fe59

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

BurntSushi added a commit that referenced this issue Oct 9, 2023

doc: explain the new word boundary assertions

915a154

Closes #469

BurntSushi added a commit that referenced this issue Oct 9, 2023

lite: add special word boundaries to regex-lite

bd36c6f

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

BurntSushi closed this as completed in 2743a7a Oct 9, 2023

BurntSushi added a commit that referenced this issue Oct 9, 2023

lite: add special word boundaries to regex-lite

dbc5e6d

This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469

BurntSushi mentioned this issue Oct 10, 2023

-w/--word-regexp sometimes returns odd results BurntSushi/ripgrep#2623

Closed

This was referenced Oct 29, 2023

Upgrade to the latest regex crate lpil/rustexp#20

Merged

[Rust] Support for the latest version of Regex crate firasdib/Regex101#2167

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consider providing \< and \> #469

consider providing \< and \> #469

BurntSushi commented Apr 23, 2018

mqudsi commented Dec 30, 2018 •

edited

Loading

jacob-keller commented Dec 16, 2020

BurntSushi commented May 17, 2022

BurntSushi commented Jun 15, 2022

KoviRobi commented Feb 2, 2023

BurntSushi commented Feb 2, 2023

BurntSushi commented Feb 2, 2023

KoviRobi commented Feb 2, 2023

BurntSushi commented Oct 3, 2023 •

edited

Loading

BurntSushi commented Oct 4, 2023

BurntSushi commented Oct 8, 2023

ratmice commented Oct 8, 2023

BurntSushi commented Oct 8, 2023

ratmice commented Oct 8, 2023

BurntSushi commented Oct 8, 2023 •

edited

Loading

ratmice commented Oct 8, 2023 •

edited

Loading

BurntSushi commented Oct 8, 2023 •

edited

Loading

KoviRobi commented Oct 9, 2023

BurntSushi commented Oct 9, 2023

BurntSushi commented Oct 9, 2023

KoviRobi commented Oct 9, 2023 •

edited

Loading

BurntSushi commented Oct 9, 2023

BurntSushi commented Oct 10, 2023

KoviRobi commented Oct 10, 2023

BurntSushi commented Oct 10, 2023

consider providing \< and \> #469

consider providing \< and \> #469

Comments

BurntSushi commented Apr 23, 2018

mqudsi commented Dec 30, 2018 • edited Loading

jacob-keller commented Dec 16, 2020

BurntSushi commented May 17, 2022

BurntSushi commented Jun 15, 2022

KoviRobi commented Feb 2, 2023

BurntSushi commented Feb 2, 2023

BurntSushi commented Feb 2, 2023

KoviRobi commented Feb 2, 2023

BurntSushi commented Oct 3, 2023 • edited Loading

BurntSushi commented Oct 4, 2023

BurntSushi commented Oct 8, 2023

ratmice commented Oct 8, 2023

BurntSushi commented Oct 8, 2023

ratmice commented Oct 8, 2023

BurntSushi commented Oct 8, 2023 • edited Loading

ratmice commented Oct 8, 2023 • edited Loading

Footnotes

BurntSushi commented Oct 8, 2023 • edited Loading

KoviRobi commented Oct 9, 2023

BurntSushi commented Oct 9, 2023

BurntSushi commented Oct 9, 2023

KoviRobi commented Oct 9, 2023 • edited Loading

BurntSushi commented Oct 9, 2023

BurntSushi commented Oct 10, 2023

KoviRobi commented Oct 10, 2023

BurntSushi commented Oct 10, 2023

mqudsi commented Dec 30, 2018 •

edited

Loading

BurntSushi commented Oct 3, 2023 •

edited

Loading

BurntSushi commented Oct 8, 2023 •

edited

Loading

ratmice commented Oct 8, 2023 •

edited

Loading

BurntSushi commented Oct 8, 2023 •

edited

Loading

KoviRobi commented Oct 9, 2023 •

edited

Loading