Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify UTF-8 encoding and tidy up string definition #247

Merged
merged 2 commits into from
Dec 14, 2018

Conversation

EvanTheB
Copy link
Contributor

de-obfuscate the description of strings. I think this is fast-trackable.

However I do not know the intended behaviour of '\n', It should probably read any character not in set ... '<newline>' . But are newlines really not allowed while other whitespace and strange characters are (carriage return?)

@DavyCats
Copy link
Contributor

@EvanTheB Perhaps it should read:

Any printable (non-control) character not in set `\`, `"` (or `'` for single-quoted string).

Maybe even specifically only ASCII characters? Unicode characters can still be encoded using \u, anyway.

@EvanTheB
Copy link
Contributor Author

Yes I think you are right, maybe this is the time to specify that wdls are utf-8 encoded?

@patmagee
Copy link
Member

@EvanTheB +1 to UTF8, although do you think there should be flexibility in this?

* An escape sequence starting with `\\x`, followed by hexadecimal characters `0-9a-fA-F`. This specifies a hexadecimal escape code.
* An escape sequence starting with `\\u` or `\\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`. This specifies a unicode code point
* Any character not in set: `\`, `"` (or `'` for single-quoted string), `\n`
* An escape sequence starting with `\`, followed by one of the following characters: `\nrbtfav?`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we break this into separate sub-bullets indicating what each escape sequence is interpreted as?

... and we should probably double check that all of these actually make sense...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are the same as the c89 escape sequences. Are we happy to refer to other documents or do we want to define here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to reference out to other docs or redefine. One reason for listing out is to make it explicit which ones we're actually supporting since some of these do seem unnecessary (eg IIRC b is "backspace", which IMO we could probably do without)

* An escape sequence starting with `\\`, followed by 1 to 3 digits of value 0 through 7 inclusive. This specifies an octal escape code.
* An escape sequence starting with `\\x`, followed by hexadecimal characters `0-9a-fA-F`. This specifies a hexadecimal escape code.
* An escape sequence starting with `\\u` or `\\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`. This specifies a unicode code point
* Any character not in set: `\`, `"` (or `'` for single-quoted string), `\n`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also mention that ~{ indicates a placeholder

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a note here: the ~{ placeholder syntax is not mentioned under the string interpolation section. This syntax (as per the current SPEC) only applies to the command section.

* An escape sequence starting with `\`, followed by one of the following characters: `\nrbtfav?`
* An escape sequence starting with `\`, followed by 1 to 3 digits of value 0 through 7 inclusive. This specifies an octal escape code.
* An escape sequence starting with `\x`, followed by hexadecimal characters `0-9a-fA-F`. This specifies a hexadecimal escape code.
* An escape sequence starting with `\u` or `\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`. This specifies a unicode code point
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If WDL is set to unicode, would we remove these code point options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to be pedantic, but in a spec pedantry is kind of required; the proposal was to specify UTF-8 encoding of wdl files. unicode is not an encoding. On your point, \u would always be useful, because most users would find it hard or impossible to enter non-ascii code points using their keyboard.

Python did: https://www.python.org/dev/peps/pep-0263/
Which allows the encoding of the file to be specified in the file. That decision predates the dominance of UTF-8. Golang more recently specifies source files to be utf-8 encoded. I think there are 2 sensible options: specify ascii, or specify utf-8.

@EvanTheB
Copy link
Contributor Author

EvanTheB commented Sep 6, 2018

@patmagee I don't see any reason for choice, choice means a mechanism for communicating the choice and making all subsequent decisions abstract enough to deal with all possibilities. Python is gross: # -*- coding: latin-1 -*-

@EvanTheB
Copy link
Contributor Author

I have specified UTF-8, explained each escape, and removed many. Still to do; explain ~{} and ${}, and unify string literals with the 'command' thing. It seems like every concept is invented 3 times in WDL ;).

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
http://www.asciitable.com/
https://unicode-table.com/en/#07C8
https://www.python.org/dev/peps/pep-0263/

@orodeh
Copy link
Contributor

orodeh commented Sep 18, 2018

👍

1 similar comment
@DavyCats
Copy link
Contributor

👍

@LeeTL1220
Copy link

👍 UTF-8, excellent!

@cjllanwarne
Copy link
Contributor

👍 but a warning that UTF-8 and "hermes parser" may be incompatible, so 🤞 but I'm afraid this might be in "awaiting implementation" quite a while until we can get away from hermes...

@EvanTheB
Copy link
Contributor Author

EvanTheB commented Sep 20, 2018

@cjllanwarne Oh dear! ascii is a valid subset of UTF-8, so hermes would still be parsing a well defined subset I suppose. Who knows what the behaviour is like on all the non-standard characters though - the wording says anything is valid "except...".

I also noticed my markdown table is not rendering as I expected. I do not have a markdown workflow here.

@EvanTheB EvanTheB changed the title Improve md formatting in string section Specify UTF-8 encoding and tidy up string definition Sep 20, 2018
@patmagee
Copy link
Member

👍

2 similar comments
@aednichols
Copy link
Contributor

👍

@dheiman
Copy link

dheiman commented Sep 28, 2018

👍

@geoffjentry
Copy link
Member

@EvanTheB Please rollback the commit you made after voting went live. If you want to apply that it'll need to be done separately

@geoffjentry
Copy link
Member

@EvanTheB Just looping back around to remind about removing that extraneous commit at the end

@EvanTheB
Copy link
Contributor Author

I think I have done it, it is hard to tell.

@geoffjentry
Copy link
Member

@EvanTheB Thanks. Please do feel free to submit a followup PR with those changes, they're valuable but in this case the voting was already active unfortunately

cjllanwarne added a commit to cjllanwarne/wdl that referenced this pull request Nov 27, 2018
@geoffjentry
Copy link
Member

Implemented in cromwell

@geoffjentry geoffjentry merged commit fb920ed into openwdl:master Dec 14, 2018
geoffjentry pushed a commit that referenced this pull request Dec 14, 2018
@patmagee patmagee added this to the v2.0 milestone Nov 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants