RegExp.escape escaping SyntaxCharacter alone is insufficient #48

michaelficarra · 2021-01-28T18:56:42Z

If the idea for RegExp.escape is to allow injection in any context, - needs to be escaped in character class context. - is not part of SyntaxCharacter. This is just the first character I thought of needing escaping, and it wasn't escaped, so there's probably others.

The text was updated successfully, but these errors were encountered:

benjamingr · 2021-01-28T19:03:52Z

@michaelficarra can you demonstrate where escaping - is helpful?

michaelficarra · 2021-01-28T19:03:59Z

> as well for capturing group names. Though that context requires \uXXXX style escaping of anything that doesn't otherwise match RegExpIdentifierName...

benjamingr · 2021-01-28T19:04:30Z

https://github.com/tc39-transfer/RegExp.escape/blob/master/EscapedChars.md#safe-with-extra-escape-set-proposal

michaelficarra · 2021-01-28T19:05:12Z

@benjamingr

new RegExp('[' + RegExp.escape('a-z') + ']')

matches all of the letters instead of a, -, and z alone.

michaelficarra · 2021-01-28T19:10:54Z

@benjamingr Okay, I would prefer the "Safe with extra escape set" proposal described there, then. It seems like that better meets the goal of this proposal as I understood from what was presented at the TC39 meeting today: RegExp.escape output should be safe in any RegExp context. That means outside any context, each code point is matched literally; within a character class each code point is unioned onto the class; in a named capture, each code point contributes to the capture name; etc.

benjamingr · 2021-01-28T19:35:54Z

@michaelficarra last time I checked (admittedly 6 years ago, you can see the repos scanned and output in the data directory) this sort of use case was very uncommon.

(Note, if we escape - it probably also makes sense to escape } and ])

benjamingr · 2021-01-28T19:38:27Z

I am not objecting to escaping it or the "safe with extra escape set" proposal - I'm only pointing out how it was removed from the set last time we talked about this. I think it makes sense to audit every character we escape (or not) in RegExp.escape given 5 years have passed since the list of escaped characters was made.

( also on an unrelated note - Thank you for weighing in it is appreciated! 🙏 )

Edit: also see prior discussion.

michaelficarra · 2021-01-28T21:22:02Z

If this proposal was going to go toward a RegExp.escape-style solution, I think the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts. We'll have to figure out what that means when injecting arbitrary values into a context with a limited grammar, such as the capturing group names I mentioned above or the repetition numbers (suffixed {x,y}) mentioned in the documentation here.

ljharb · 2021-01-28T21:35:13Z

One solution someone suggested was to include syntax that would make it invalid in some of the contexts - like appending {1} to ensure you couldn't combine it with a quantity, or if there's something that would make it invalid in a character class.

domenic · 2021-01-28T21:56:25Z

the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts

Isn't that impossible, because of the even-odd problem? That requirement is what sunk the proposal last time from what I remember; the community is pushing to remove this requirement.

benjamingr · 2021-01-28T22:01:06Z

I think the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts.

Why? Isn't solving the problem for all (very) reasonable cases enough?

michaelficarra · 2021-01-28T22:33:13Z

@domenic I am personally fine with not considering the context following \. I don't think anybody would expect new RegExp('\\' + RegExp.escape('t')) to match a tab.

bakkot · 2021-01-28T22:49:13Z

I think as a practical question, ignoring the context following \, "escape gives output which works in any context" would mean including the SyntaxCharacter set, plus - for character classes, , for repetition groups, and > for capturing group names. (We'd want to change u-mode regexes to make the escaped versions of those legal as identity escapes, or specify that they map to their \u form.)

@michaelficarra would that work for you?

benjamingr · 2021-01-29T13:24:32Z

@michaelficarra

I don't think anybody would expect new RegExp('\' + RegExp.escape('t')) to match a tab.

I agree.
I also don't think anyone would expect RegExp.escape to work in context sensitive situations though, would they?

I think that's an assumption PHP, Perl, Python, Ruby, Java and .NET made in their implementation of their respective escaping functions. It's also the assumption userland libraries (like lodash) have made.

I am honestly not sure about this - and I can see good points in both approaches. I do however think the bias should be for the approach other implementations we've looked at are doing naturally (in JS, even the userland implementions we've found of a tag function like XRegExp's don't do context-sensitive escaping).

Note that other languages have been working on reducing the set of characters they escape in order to create more readable regular expressions - a good example of this is Python.

I am personally fine with starting with the "Escape Everything" approach that solves most things (but not everything) - but I would still prefer the "escape as little as needed without creating a user expectation of RegExp.escape working on context sensitive scenarios.

Most use cases I've seen were things like:

const names = getArrayOfNames(); // ["John Smith", "Bob D. Goldman", "Zhang Zhu"];
const matcher = new RegExp('(' + names.map(name => RegExp.escape(name)).join(')|(') + ')');

Where context sensitivity was a non-issue.

domenic · 2021-02-16T02:30:08Z

The specific idea of escaping - has bad interactions with the u flag; see #52.

ljharb · 2023-09-26T10:41:59Z

Updated escaping semantics advanced to stage 2 today.

bakkot mentioned this issue Feb 16, 2021

Ensure the result works with the u flag #52

Closed

ljharb closed this as completed Sep 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

michaelficarra commented Jan 28, 2021

benjamingr commented Jan 28, 2021

michaelficarra commented Jan 28, 2021

benjamingr commented Jan 28, 2021

michaelficarra commented Jan 28, 2021

michaelficarra commented Jan 28, 2021 •

edited

Loading

benjamingr commented Jan 28, 2021

benjamingr commented Jan 28, 2021 •

edited

Loading

michaelficarra commented Jan 28, 2021

ljharb commented Jan 28, 2021

domenic commented Jan 28, 2021

benjamingr commented Jan 28, 2021

michaelficarra commented Jan 28, 2021

bakkot commented Jan 28, 2021

benjamingr commented Jan 29, 2021 •

edited

Loading

domenic commented Feb 16, 2021

ljharb commented Sep 26, 2023

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

Comments

michaelficarra commented Jan 28, 2021

benjamingr commented Jan 28, 2021

michaelficarra commented Jan 28, 2021

benjamingr commented Jan 28, 2021

michaelficarra commented Jan 28, 2021

michaelficarra commented Jan 28, 2021 • edited Loading

benjamingr commented Jan 28, 2021

benjamingr commented Jan 28, 2021 • edited Loading

michaelficarra commented Jan 28, 2021

ljharb commented Jan 28, 2021

domenic commented Jan 28, 2021

benjamingr commented Jan 28, 2021

michaelficarra commented Jan 28, 2021

bakkot commented Jan 28, 2021

benjamingr commented Jan 29, 2021 • edited Loading

domenic commented Feb 16, 2021

ljharb commented Sep 26, 2023

michaelficarra commented Jan 28, 2021 •

edited

Loading

benjamingr commented Jan 28, 2021 •

edited

Loading

benjamingr commented Jan 29, 2021 •

edited

Loading