Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tolerate unescaped newlines in string literals #140

Merged
merged 5 commits into from
Feb 10, 2021

Conversation

mjambon
Copy link
Contributor

@mjambon mjambon commented Oct 6, 2020

This is useful for typescript and is considered not harmful in javascript given the scope of the tree-sitter project.

This PR is a follow-up to tree-sitter/tree-sitter-typescript#91

@mjambon
Copy link
Contributor Author

mjambon commented Oct 10, 2020

This should be good to merge!

grammar.js Outdated
$.escape_sequence
)),
"'"
)
),

string: $ => $._string,
jsx_string: $ => $._string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's no need to have two separate string and jsx_string rules.

Previously, I had been thinking of a different separation - in which, to be conservative, we would create a separate jsx_string rule that allowed unescaped newlines, and would only be used inside of JSX expressions. But even if we did that, I would have used alias to make sure that the jsx_string appeared as a normal string in the output. I don't think it's helpful to have a public-facing jsx_string rule, since it's just another node type that users need to remember to target for things like syntax highlighting.

But I think it's probably fine to just allow unescaped newlines in all strings, as you've done. I would just get rid of the jsx_string thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I will do this. (sorry about rushing this PR and then not reacting - it turns out this is a "hackathon week" at r2c and we're all doing different things this week)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed the fix. There's now just one string rule that tolerates unescaped newlines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping :)

@mjambon
Copy link
Contributor Author

mjambon commented Feb 5, 2021

Hi @maxbrunsfeld, we'd like to merge this. We've been using it in prod for a while and it works fine. I just updated the generated code for mergeability.

mjambon added a commit to semgrep/tree-sitter-typescript that referenced this pull request Feb 9, 2021
grammar.js Outdated
string: $ => choice(
seq(
'"',
repeat(choice(
token.immediate(prec(PREC.STRING, /[^"\\\n]+|\\\r?\n/)),
token.immediate(prec(PREC.STRING, /[^"\\\n]+|\\?\r?\n/)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if we're allowing newlines, you could simplify the regex so that newlines are not explicitly excluded on the left-hand-side of the OR operator. Then, the right-hand-side would only be for handling escaped newlines.

Something like [^"\\]|\\\r?\n

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Also, I'm now noticing that the regexp doesn't accept escaped quotes as in "\"" or '\''. Looking into this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, $.escape_sequence takes care of escaped quotes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the regexp and added tests to cover my expectations. I also moved my original test for jsx attributes to literals.txt which seems more appropriate. expressions.txt still has some basic tests for strings and numbers which are probably fine to stay there.

@maxbrunsfeld maxbrunsfeld merged commit 1eee0d3 into tree-sitter:master Feb 10, 2021
@maxbrunsfeld
Copy link
Contributor

Thanks!

@mjambon
Copy link
Contributor Author

mjambon commented Feb 10, 2021

Thank you Max!

cpmsmith added a commit to cpmsmith/tree-sitter-javascript that referenced this pull request Jan 4, 2024
JSX text and attributes support HTML character references (a.k.a.
entities), and don't support ECMAScript string escape sequences.

Although the [spec] calls it "historical" and threatens to change it,
it _is_ in the spec, and the spec is pretty stable at this point.

In changing this, I landed back on an idea that @maxbrunsfeld suggested
in a [PR review] some time ago: having separate `string` and
`jsx_string` nodes, and aliasing `jsx_string` to `string` for consumers'
convenience. At that time, having two different node types was deemed
unnecessary, but this adds a second, more substantive difference between
the two, so I've brought the idea back, and stopped allowing invalid
newlines in JS string literals, which is invalid in both JS and TS.

[spec]: https://facebook.github.io/jsx/#sec-jsx-string-characters
[PR review]: tree-sitter#140 (comment)
amaanq pushed a commit that referenced this pull request Feb 1, 2024
JSX text and attributes support HTML character references (a.k.a.
entities), and don't support ECMAScript string escape sequences.

Although the [spec] calls it "historical" and threatens to change it,
it _is_ in the spec, and the spec is pretty stable at this point.

In changing this, I landed back on an idea that @maxbrunsfeld suggested
in a [PR review] some time ago: having separate `string` and
`jsx_string` nodes, and aliasing `jsx_string` to `string` for consumers'
convenience. At that time, having two different node types was deemed
unnecessary, but this adds a second, more substantive difference between
the two, so I've brought the idea back, and stopped allowing invalid
newlines in JS string literals, which is invalid in both JS and TS.

[spec]: https://facebook.github.io/jsx/#sec-jsx-string-characters
[PR review]: #140 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants