Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raw string literal testing disagrees with documentation #14

Closed
gibson042 opened this issue Nov 8, 2022 · 18 comments
Closed

raw string literal testing disagrees with documentation #14

gibson042 opened this issue Nov 8, 2022 · 18 comments

Comments

@gibson042
Copy link
Contributor

gibson042 commented Nov 8, 2022

jmespath.site Grammar defines the grammar for raw string literals like

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x20-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

and Raw String Literals includes a search('\\', "") -> "\\" example implying that \\ is an escape sequence representing a single U+005C REVERSE SOLIDUS just like \' is an escape sequence representing a single U+0027 APOSTROPHE.

However, tests/literal.json in this repository includes test cases contradicting the above, such as { "expression": "'\n'", "result": "\n" } (a raw string literal including a U+000A LINE FEED, which is not covered by raw-string-char) and { "comment": "Backslash not followed by single quote is treated as any other character", "expression": "'\\\\'", "result": "\\\\" } (added in 2016 by c0f7923).

Both disagreements ultimately affect whether or not raw string literals are capable of representing all possible sequences of code points (respectively those including C0 control characters and those including U+005C REVERSE SOLIDUS), and the latter is especially odd—I've never heard of any other escaping approach in which the escape prefix itself is unrepresentable.

@springcomp
Copy link
Contributor

springcomp commented Nov 8, 2022

While the example from the specification you are referring to is indeed incorrect, the spirit of the raw-string literal is basically to accept any character as is. It only processes the following two escape sequences:

  • backslash followed by simple quote -> simple quote.
  • backslash followed by itself -> backslash.

I think you have addressed this point in the following pull-request.

However I do not see any conflicting examples in the tests/literal.json you linked to.

May I suggest that you are incorrectly interpreting the example you are referring to which in fact, does not contain any backslash character at all, but indeed contains a linefeed? 🤔

Expression Expressed as JSON Result Comment
'␊' '\n' "\n" U+000A LINE FEED
'\u03a6' '\\u03a6' "\\u03a6" Do not interpret escaped Unicode
'foo\'bar' 'foo\\'bar' "foo'bar" Can escape the single quote
'\z' '\\z' "\\z" Backslash not followed by single quote is treated as any other character",
'\\' '\\\\' "\\\\" Backslash not followed by single quote is treated as any other character",

@gibson042
Copy link
Contributor Author

While the example from the specification you are referring to is indeed incorrect, the spirit of the raw-string literal is basically to accept any character as is. It only processes the following two escape sequences:

  • backslash followed by simple quote -> simple quote.
  • backslash followed by itself -> backslash.

That makes sense to me, but the second bullet point is contradicted by the last test case at

{
"comment": "Can escape the single quote",
"expression": "'foo\\'bar'",
"result": "foo'bar"
},
{
"comment": "Backslash not followed by single quote is treated as any other character",
"expression": "'\\z'",
"result": "\\z"
},
{
"comment": "Backslash not followed by single quote is treated as any other character",
"expression": "'\\\\'",
"result": "\\\\"
}
. If backslash followed by backslash were an escape sequence representing a single backslash, then that case would instead look more like "Can escape the single quote":

         {
-          "comment": "Backslash not followed by single quote is treated as any other character",
+          "comment": "Can escape backslash",
           "expression": "'\\\\'",
-          "result": "\\\\"
+          "result": "\\"
         }

May I suggest that you are incorrectly interpreting the example you are referring to which in fact, does not contain any backslash character at all, but indeed contains a linefeed?

I don't think so, and I'll demonstrate using your table:

Expression Expressed as JSON Result Comment Response
'␊' '\n' "\n" U+000A LINE FEED not currently allowed by raw-string-char = (%x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape
'\u03a6' '\\u03a6' "\\u03a6" Do not interpret escaped Unicode
'foo\'bar' 'foo\\'bar' "foo'bar" Can escape the single quote
'\z' '\\z' "\\z" Backslash not followed by single quote is treated as any other character
'\\' '\\\\' "\\\\" Backslash not followed by single quote is treated as any other character JSON text "\\\\" expresses a string consisting of two consecutive backslashes rather than a string consisting of just a single backslash, so this result is analogous to the preceding '\z' ↔︀︁︎︎︎ `"\\z"` rather than to the '\'' ↔️`"'"` that I think we both consider preferable

@springcomp
Copy link
Contributor

springcomp commented Nov 8, 2022

I realize that control characters are not currently specified as valid for raw-string literals. Thanks for pointing this out. Indeed it seems we need to update the grammar rule accordingly. (#16)

Now I understand the last example as well. The JMESPath expression is a raw-string that contains two consecutive backslashes, which represents how a single backslash is escaped. Thus the resulting string should be a single backslash, which in JSON is represented by a sequence of two backslashes 🤔😏 (#15)

I agree with your assessment.

@springcomp
Copy link
Contributor

springcomp commented Nov 8, 2022

Should we really allow control characters in raw-string literals 🤔 ?

Like:

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x00-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

That feels somehow... Uncanny 😲

@springcomp
Copy link
Contributor

springcomp commented Nov 8, 2022

Or maybe just a subset of those supported as valid escape sequences in JSON:

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x08-09 / %x0A / %x0C-0D / %x20-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x08-09 / %x0A / %x0C-0D / %x20-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

@gibson042
Copy link
Contributor Author

gibson042 commented Nov 8, 2022

I would prefer to prohibit C0 control characters in raw strings, but if any are allowed then they all should be (and I don't know your comfort level with backwards-incompatible changes to the test suite, even if they are already supported by the formal grammar).

Alternatively, if it is a goal for all possible strings to be representable as raw literals (which also implies interpretation of \\ as an escape sequence for U+005C REVERSE SOLIDUS \, making '\\' equivalent to `"\\"` and `"\u005C"`), then C0 control characters should be allowed.

@springcomp
Copy link
Contributor

I agree to forbid usage of control character in JMESPath expressions.
Seems that was never the intended spirit when introduced in JEP-12

@gibson042
Copy link
Contributor Author

gibson042 commented Nov 9, 2022

Seems that was never the intended spirit when introduced in JEP-12

Oh, thanks for digging that up! Its examples actually contain an explicit rebuttal:

  • A raw string literal that contains new lines:
    'foo
    bar
    baz!'
    
    The above expression would be parsed as a string that contains new lines:
    foo
    baz
    bar!
    

And even more usefully, the "Disallow single quotes in a raw string" alternative documents its intent to fully subsume JSON string literals:

An alternative approach could be to not allow single quotes
in a raw string literal. While this would simplify the raw-string grammar
rule, it would severely limit the usability of the raw-string rule, forcing
users to use the literal rule.

The JEP-12 proposed grammar was missing an escape for \, but it also lacked support for \ appearing inside a raw string literal at all other than as part of a \' escape sequence, which was not fixed until over a year after adoption when jmespath/jmespath.site@52b91d9 introduced preserved-escape to align it with a '\u03a6' example in the then-current jmespath.site1 introduced by the JEP-12 spec commit and test in the then-current jmespath.test introduced by the corresponding JEP-12 test commit. And the jmespath/jmespath.test@c0f7923 commit associated with that introduction of preserved-escape is the source of the incorrect "expression": "'\\\\'" test which doesn't even agree with the search('\\', "") -> "\\" example introduced in the main repository by the same author at the same time.

All things considered, I think we have a clear picture that raw string literals were intended to fully subsume the expressiveness of JSON string backtick literals and treat backslashes literally except for the two specific cases of \' and \\, which are respectively escape sequences for ' and \. And since supporting control characters such as line feed and even NULL is technically harmless, I think this community fork might as well honor that original intent. Concretely, it would mean

Footnotes

  1. the example I just corrected in https://github.com/jmespath-community/jmespath.spec/pull/132

@springcomp
Copy link
Contributor

springcomp commented Nov 10, 2022

This makes me realize that JEP-12 while having been accepted is actually incomplete. Thanks for pointing out all the subequent commits that collectively allude to the thought process at the time.

I think an addendum is required for JEP-12

@springcomp
Copy link
Contributor

springcomp commented Nov 10, 2022

However, allowing \\ as a valid escape sequence does lend itself to the leaning toothpick syndrome that JEP-12 was explicitely trying to guard against… 🤔

So I’m starting to be convinced that the "expression": "\\\\" test is not incorrect after all.
I think the intended grammar should have been:

raw-string        = "'" *raw-string-char "'"
; The first grouping matches any character other than "'" and "\"
raw-string-char   = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape /raw-string-escape
; The second grouping matches any character other than "'"
preserved-escape  = escape (%x20-26 / %x28-10FFFF)
raw-string-escape = escape "'"

Making those expressions valid raw-string literals:

  • '␊Hello␊World␊'  -> "\nHello\nWorld\n"
  • 'Null Terminated String␀'-> "Null Terminated String\u0000"
  • 'Awesome, isn\'t it?'-> "Awesome, isn't it?"
  • '\p\r\e\s\e\r\v\e\d' -> "\\p\\r\\e\\s\\e\\r\\v\\e\\d" (emphasis on '\r' -> "\\r" instead of "\r")
  • 'preserving reverse \\ solidus' -> "preserving reverse \\ solidus"

That said, I’m not comfortable allowing C0 control characters in a raw-string as there would not be any easy way to write them in the first place…

@springcomp springcomp reopened this Nov 10, 2022
@gibson042
Copy link
Contributor Author

What would be the point of a \' escape to support representation of literal ' without a corresponding \\ escape to support literal representation of \? Your last example even presumes it:

  • 'preserving reverse \\ solidus' -> "preserving reverse \\ solidus"

(which in the absence of special treatment for \\ like raw-string-escape = escape ("'" / escape) would instead have a result like "preserving reverse \\\\ solidus")

That said, I’m not comfortable allowing C0 control characters in a raw-string as there would not be any easy way to write them in the first place…

Maybe not all of them, but remember that tab and line feed are included in that set. And it's also not too hard to get text containing ANSI escape sequences (which start with 0x1B).

@springcomp
Copy link
Contributor

springcomp commented Nov 14, 2022

OK, I agree that the current grammar is consistent with the last example from that page:

search( '\\', "" ) -> "\\"

That example clearly illustrates usage of two consecutive backslash characters to escape a single resulting backslash character in the output string.

@springcomp
Copy link
Contributor

That said, I’m not comfortable allowing C0 control characters in a raw-string as there would not be any easy way to write them in the first place…

Maybe not all of them, but remember that tab and line feed are included in that set. And it's also not too hard to get text containing ANSI escape sequences (which start with 0x1B).

To be clear, I’m explicitly referring to exanding the set of characters from %x20-26 to %x00-26.
This is what I’m mostly uncomfortable with.

I can’t see a good use case for making an exception here in raw string and not in, say quoting-string expressions.

@gibson042
Copy link
Contributor Author

I can’t see a good use case for making an exception here in raw string and not in, say quoting-string expressions.

I can make that case: quoted-string is aligned with JSON and in particular already supports the representation of arbitrary code points by means of \u escapes, whereas raw-string supports escapes only for its delimiter and escape prefix and would otherwise be unable to represent code points from U+0000 through U+001F, limiting usability in a way that was rejected in JEP-12 Alternative approaches and (in the case of U+000A LINE FEED) explicitly intended per JEP-12 Examples.

@springcomp
Copy link
Contributor

(in the case of U+000A LINE FEED) explicitly intended per JEP-12 Examples.

I rest my case your honor 😏

@springcomp
Copy link
Contributor

springcomp commented Nov 15, 2022

For the record, to fix this issue, the specification needs to be updated to:

raw-string        = "'" *raw-string-char "'" 
raw-string-char   = (%x00-26 / %x28-5B / %x5D-10FFFF) / preserved-escape / raw-string-escape 
preserved-escape  = escape (%x00-26 / %x28-5B / %x5D-10FFFF) 
raw-string-escape = escape ("'" / escape)

I must admit that I do not understand the purpose of the preserved-escape rule🤔

@gibson042
Copy link
Contributor Author

gibson042 commented Nov 16, 2022

I must admit that I do not understand the purpose of the preserved-escape rule

I think I can explain it... the raw-string-escape rule exists as a place to attach the behavior by which a backslash escapes a following backslash or single quote, and backslash must therefore not be allowed in the parent raw-string-char. And given the (however misguided) desire to treat literally any backslash that is not followed by one of those two characters, the preserved-escape rule is introduced to match and implement precisely that lack of escaping. Basically, the contents of a raw string are an arbitrarily long sequence in which each element is one of the following, collectively allowing representation of any code point sequence expressible in the outer encoding:

  • the grouped terminal alternatives of raw-string-char: any code point other than ' or \, representing itself
  • raw-string-escape: \' or \\, respectively representing ' or \
  • preserved-escape: \ followed by a code point that is neither ' or \, representing the sequence consisting of both code points in order

@springcomp
Copy link
Contributor

I think this issue has been solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants