Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

Closed
michaelficarra opened this issue Jan 28, 2021 · 16 comments
Closed

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

michaelficarra opened this issue Jan 28, 2021 · 16 comments

Comments

@michaelficarra
Copy link
Member

If the idea for RegExp.escape is to allow injection in any context, - needs to be escaped in character class context. - is not part of SyntaxCharacter. This is just the first character I thought of needing escaping, and it wasn't escaped, so there's probably others.

@benjamingr
Copy link
Collaborator

@michaelficarra can you demonstrate where escaping - is helpful?

@michaelficarra
Copy link
Member Author

> as well for capturing group names. Though that context requires \uXXXX style escaping of anything that doesn't otherwise match RegExpIdentifierName...

@michaelficarra
Copy link
Member Author

@benjamingr

new RegExp('[' + RegExp.escape('a-z') + ']')

matches all of the letters instead of a, -, and z alone.

@michaelficarra
Copy link
Member Author

michaelficarra commented Jan 28, 2021

@benjamingr Okay, I would prefer the "Safe with extra escape set" proposal described there, then. It seems like that better meets the goal of this proposal as I understood from what was presented at the TC39 meeting today: RegExp.escape output should be safe in any RegExp context. That means outside any context, each code point is matched literally; within a character class each code point is unioned onto the class; in a named capture, each code point contributes to the capture name; etc.

@benjamingr
Copy link
Collaborator

@michaelficarra last time I checked (admittedly 6 years ago, you can see the repos scanned and output in the data directory) this sort of use case was very uncommon.

(Note, if we escape - it probably also makes sense to escape } and ])

@benjamingr
Copy link
Collaborator

benjamingr commented Jan 28, 2021

I am not objecting to escaping it or the "safe with extra escape set" proposal - I'm only pointing out how it was removed from the set last time we talked about this. I think it makes sense to audit every character we escape (or not) in RegExp.escape given 5 years have passed since the list of escaped characters was made.

( also on an unrelated note - Thank you for weighing in it is appreciated! 🙏 )

Edit: also see prior discussion.

@michaelficarra
Copy link
Member Author

If this proposal was going to go toward a RegExp.escape-style solution, I think the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts. We'll have to figure out what that means when injecting arbitrary values into a context with a limited grammar, such as the capturing group names I mentioned above or the repetition numbers (suffixed {x,y}) mentioned in the documentation here.

@ljharb
Copy link
Member

ljharb commented Jan 28, 2021

One solution someone suggested was to include syntax that would make it invalid in some of the contexts - like appending {1} to ensure you couldn't combine it with a quantity, or if there's something that would make it invalid in a character class.

@domenic
Copy link
Member

domenic commented Jan 28, 2021

the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts

Isn't that impossible, because of the even-odd problem? That requirement is what sunk the proposal last time from what I remember; the community is pushing to remove this requirement.

@benjamingr
Copy link
Collaborator

I think the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts.

Why? Isn't solving the problem for all (very) reasonable cases enough?

@michaelficarra
Copy link
Member Author

@domenic I am personally fine with not considering the context following \. I don't think anybody would expect new RegExp('\\' + RegExp.escape('t')) to match a tab.

@bakkot
Copy link
Collaborator

bakkot commented Jan 28, 2021

I think as a practical question, ignoring the context following \, "escape gives output which works in any context" would mean including the SyntaxCharacter set, plus - for character classes, , for repetition groups, and > for capturing group names. (We'd want to change u-mode regexes to make the escaped versions of those legal as identity escapes, or specify that they map to their \u form.)

@michaelficarra would that work for you?

@benjamingr
Copy link
Collaborator

benjamingr commented Jan 29, 2021

@michaelficarra

I don't think anybody would expect new RegExp('\' + RegExp.escape('t')) to match a tab.

I agree.
I also don't think anyone would expect RegExp.escape to work in context sensitive situations though, would they?

I think that's an assumption PHP, Perl, Python, Ruby, Java and .NET made in their implementation of their respective escaping functions. It's also the assumption userland libraries (like lodash) have made.

I am honestly not sure about this - and I can see good points in both approaches. I do however think the bias should be for the approach other implementations we've looked at are doing naturally (in JS, even the userland implementions we've found of a tag function like XRegExp's don't do context-sensitive escaping).

Note that other languages have been working on reducing the set of characters they escape in order to create more readable regular expressions - a good example of this is Python.

I am personally fine with starting with the "Escape Everything" approach that solves most things (but not everything) - but I would still prefer the "escape as little as needed without creating a user expectation of RegExp.escape working on context sensitive scenarios.

Most use cases I've seen were things like:

const names = getArrayOfNames(); // ["John Smith", "Bob D. Goldman", "Zhang Zhu"];
const matcher = new RegExp('(' + names.map(name => RegExp.escape(name)).join(')|(') + ')');

Where context sensitivity was a non-issue.

@domenic
Copy link
Member

domenic commented Feb 16, 2021

The specific idea of escaping - has bad interactions with the u flag; see #52.

@ljharb
Copy link
Member

ljharb commented Sep 26, 2023

Updated escaping semantics advanced to stage 2 today.

@ljharb ljharb closed this as completed Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants