Specify UTF-8 encoding and tidy up string definition #247
@EvanTheB Perhaps it should read:
Maybe even specifically only ASCII characters? Unicode characters can still be encoded using
Yes, I think you are right. Maybe this is the time to specify that WDL files are UTF-8 encoded?
@EvanTheB +1 to UTF-8, although do you think there should be flexibility in this?
versions/development/SPEC.md
Outdated
* An escape sequence starting with `\\x`, followed by hexadecimal characters `0-9a-fA-F`. This specifies a hexadecimal escape code.
* An escape sequence starting with `\\u` or `\\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`. This specifies a unicode code point
* Any character not in set: `\`, `"` (or `'` for single-quoted string), `\n`
* An escape sequence starting with `\`, followed by one of the following characters: `\nrbtfav?`
Can we break this into separate sub-bullets indicating what each escape sequence is interpreted as?
... and we should probably double check that all of these actually make sense...
These are the same as the C89 escape sequences. Are we happy to refer to other documents, or do we want to define them here?
I'd be happy to reference out to other docs or redefine. One reason for listing them out is to make it explicit which ones we're actually supporting, since some of these do seem unnecessary (e.g. IIRC `b` is "backspace", which IMO we could probably do without)
* An escape sequence starting with `\\`, followed by 1 to 3 digits of value 0 through 7 inclusive. This specifies an octal escape code.
* An escape sequence starting with `\\x`, followed by hexadecimal characters `0-9a-fA-F`. This specifies a hexadecimal escape code.
* An escape sequence starting with `\\u` or `\\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`. This specifies a unicode code point
* Any character not in set: `\`, `"` (or `'` for single-quoted string), `\n`
We should also mention that `~{` indicates a placeholder
Just as a note here: the `~{` placeholder syntax is not mentioned under the string interpolation section. This syntax (as per the current SPEC) only applies to the command section.
versions/development/SPEC.md
Outdated
* An escape sequence starting with `\`, followed by one of the following characters: `\nrbtfav?`
* An escape sequence starting with `\`, followed by 1 to 3 digits of value 0 through 7 inclusive. This specifies an octal escape code.
* An escape sequence starting with `\x`, followed by hexadecimal characters `0-9a-fA-F`. This specifies a hexadecimal escape code.
* An escape sequence starting with `\u` or `\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`. This specifies a unicode code point
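To make the four escape forms in this hunk concrete, here is a minimal Python sketch of a decoder for them. It is an illustration only, not how Cromwell or any other WDL engine implements string parsing. The name `decode_escapes` is hypothetical, and because the hunk does not say how many hexadecimal digits follow `\x`, or which of `\u` and `\U` takes 4 and which takes 8, the sketch assumes exactly 2 digits for `\x`, 4 for `\u` and 8 for `\U`.

```python
import re

# Simple escapes listed in the hunk: \\ \n \r \b \t \f \a \v \?
SIMPLE = {
    "\\": "\\", "n": "\n", "r": "\r", "b": "\b", "t": "\t",
    "f": "\f", "a": "\a", "v": "\v", "?": "?",
}

# Assumptions (not stated in the hunk): \x takes exactly two hex digits,
# \u takes four and \U takes eight.
ESCAPE = re.compile(
    r"\\(?:"
    r"(?P<simple>[\\nrbtfav?])"
    r"|(?P<octal>[0-7]{1,3})"
    r"|x(?P<hex>[0-9a-fA-F]{2})"
    r"|u(?P<u4>[0-9a-fA-F]{4})"
    r"|U(?P<u8>[0-9a-fA-F]{8})"
    r")"
)


def decode_escapes(body: str) -> str:
    """Replace each recognised escape in a string body with its character.

    A backslash that does not start a recognised escape is left untouched;
    a real parser would more likely reject it.
    """
    def replace(match: re.Match) -> str:
        if match.group("simple") is not None:
            return SIMPLE[match.group("simple")]
        if match.group("octal") is not None:
            return chr(int(match.group("octal"), 8))
        # Exactly one of the hexadecimal groups matched at this point.
        digits = match.group("hex") or match.group("u4") or match.group("u8")
        # chr() raises ValueError above U+10FFFF, which \U can express.
        return chr(int(digits, 16))

    return ESCAPE.sub(replace, body)
```

Under these assumptions, `decode_escapes(r"\101")` and `decode_escapes(r"\x41")` both yield `"A"`, which shows that the octal and hexadecimal forms are two spellings of the same character.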
If WDL is set to unicode, would we remove these code point options?
I don't want to be pedantic, but in a spec pedantry is kind of required: the proposal was to specify UTF-8 encoding of WDL files. Unicode is not an encoding. On your point, `\u` would always be useful, because most users would find it hard or impossible to enter non-ASCII code points using their keyboard.
Python did: https://www.python.org/dev/peps/pep-0263/
which allows the encoding of the file to be specified in the file. That decision predates the dominance of UTF-8. Golang more recently specifies source files to be UTF-8 encoded. I think there are 2 sensible options: specify ASCII, or specify UTF-8.
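The distinction between a Unicode code point and an encoding can be shown with a short Python sketch. It is only an illustration of the point made in this comment. The variable names are made up for the example, and U+00E9 is an arbitrary non-ASCII code point.

```python
code_point = 0x00E9      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
escaped = "\u00e9"       # the same code point spelled with ASCII characters only

# The escape names a code point; it says nothing about bytes on disk.
assert escaped == chr(code_point)

# One code point, three different byte sequences depending on the encoding.
as_utf8 = escaped.encode("utf-8")        # b'\xc3\xa9'
as_latin1 = escaped.encode("latin-1")    # b'\xe9'
as_utf16 = escaped.encode("utf-16-le")   # b'\xe9\x00'
```

This is why fixing the file encoding and keeping a `\u` escape are separate decisions: the escape stays usable from any keyboard, whatever bytes the file itself is stored as.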
@patmagee I don't see any reason for choice. Choice means a mechanism for communicating the choice and making all subsequent decisions abstract enough to deal with all possibilities. Python is gross:
I have specified UTF-8, explained each escape, and removed many. Still to do: explain `~{}` and `${}`, and unify string literals with the 'command' thing. It seems like every concept is invented 3 times in WDL ;). https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
👍
👍
👍 UTF-8, excellent!
👍 but a warning that UTF-8 and the "hermes parser" may be incompatible, so 🤞, but I'm afraid this might be in "awaiting implementation" for quite a while until we can get away from hermes...
@cjllanwarne Oh dear! ASCII is a valid subset of UTF-8, so hermes would still be parsing a well-defined subset, I suppose. Who knows what the behaviour is like on all the non-standard characters though: the wording says anything is valid "except...". I also noticed my markdown table is not rendering as I expected. I do not have a markdown workflow here.
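The claim that ASCII is a subset of UTF-8 can be checked directly. The Python sketch below is an illustration only. `ascii_matches_utf8` and `is_ascii_only` are hypothetical helper names, and the second is just one way an ASCII-only parser could decide whether a given UTF-8 file is still within the subset it understands.

```python
def ascii_matches_utf8() -> bool:
    """Check that each of the 128 ASCII characters has the same bytes in UTF-8."""
    return all(
        chr(cp).encode("ascii") == chr(cp).encode("utf-8")
        for cp in range(128)
    )


def is_ascii_only(source: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the file is plain ASCII.

    In UTF-8 every byte of a non-ASCII character is 0x80 or above, so a
    single pass over the bytes is enough.
    """
    return all(byte < 0x80 for byte in source)
```

A file for which `is_ascii_only` returns `True` reads identically whether it is treated as ASCII or as UTF-8; what happens on anything else is exactly the open question raised here.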
👍
👍
👍
@EvanTheB Please roll back the commit you made after voting went live. If you want to apply that, it'll need to be done separately
@EvanTheB Just looping back around to remind you about removing that extraneous commit at the end
I think I have done it, it is hard to tell.
@EvanTheB Thanks. Please do feel free to submit a follow-up PR with those changes. They're valuable, but in this case the voting was already active, unfortunately
Implemented in Cromwell
De-obfuscate the description of strings. I think this is fast-trackable.
However, I do not know the intended behaviour of `\n`. It should probably read `any character not in set ... '<newline>'`. But are newlines really not allowed while other whitespace and strange characters (carriage return?) are?
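The question about newlines can be made concrete by reading the rule literally as a character class. The Python sketch below assumes the double-quoted variant of the rule (excluding `\`, `"` and a newline) and uses a hypothetical helper name, `is_plain_char`. It describes the wording under discussion, not the behaviour of any WDL implementation.

```python
import re

# Literal reading of "any character not in set: \, ", newline"
# for a double-quoted string.
PLAIN_DOUBLE_QUOTED = re.compile(r'[^\\"\n]')


def is_plain_char(ch: str) -> bool:
    """True if a single character may appear unescaped under that reading."""
    return PLAIN_DOUBLE_QUOTED.fullmatch(ch) is not None


# Newline is excluded, but carriage return and tab are accepted.
newline_ok = is_plain_char("\n")          # False
carriage_return_ok = is_plain_char("\r")  # True
tab_ok = is_plain_char("\t")              # True
```

Read this way, the rule does accept a bare carriage return while rejecting a newline, which is the asymmetry the description asks about.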