Specify UTF-8 encoding and tidy up string definition #247

EvanTheB · 2018-08-24T07:04:58Z

De-obfuscate the description of strings. I think this is fast-trackable.

However, I do not know the intended behaviour of '\n'. It should probably read "any character not in set ... '<newline>'". But are newlines really disallowed while other whitespace and strange characters (carriage return?) are allowed?

DavyCats · 2018-08-24T07:40:46Z

@EvanTheB Perhaps it should read:

Any printable (non-control) character not in set `\`, `"` (or `'` for single-quoted string).

Maybe even specifically only ASCII characters? Unicode characters can still be encoded using \u, anyway.
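A rough Python sketch of the wording suggested above may help when comparing the alternatives. It assumes "printable (non-control)" means "not in Unicode general category `Cc`", which is only one possible reading; the helper name `is_plain_string_char` is made up for illustration and is not part of the spec.

```python
import unicodedata


def is_plain_string_char(ch: str, quote: str = '"') -> bool:
    """Return True if `ch` may appear unescaped in a string literal.

    Assumption: "control character" means Unicode category Cc,
    which covers newline, carriage return, tab, and similar.
    """
    if ch == "\\" or ch == quote:
        # The backslash and the active quote character must be escaped.
        return False
    return unicodedata.category(ch) != "Cc"
```

Under this reading, newline and carriage return are rejected for the same reason, which sidesteps the question of why only `\n` was singled out.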

EvanTheB · 2018-08-24T23:10:42Z

Yes, I think you are right. Maybe this is the time to specify that WDL files are UTF-8 encoded?

patmagee · 2018-08-28T12:10:15Z

@EvanTheB +1 to UTF-8, although do you think there should be flexibility in this?

cjllanwarne · 2018-09-05T15:42:19Z

versions/development/SPEC.md

-* An escape sequence starting with `\\x`, followed by hexadecimal characters `0-9a-fA-F`.  This specifies a hexadecimal escape code.
-* An escape sequence starting with `\\u` or `\\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`.  This specifies a unicode code point
+* Any character not in set: `\`, `"` (or `'` for single-quoted string), `\n`
+* An escape sequence starting with `\`, followed by one of the following characters: `\nrbtfav?`


Can we break this into separate sub-bullets indicating what each escape sequence is interpreted as?

... and we should probably double check that all of these actually make sense...

These are the same as the C89 escape sequences. Are we happy to refer to other documents, or do we want to define them here?

I'd be happy to reference out to other docs or redefine. One reason for listing them out is to make it explicit which ones we're actually supporting, since some of these do seem unnecessary (e.g. IIRC b is "backspace", which IMO we could probably do without).
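To make the "list them out" option concrete, here is a minimal Python sketch of an explicit table for the characters in `\nrbtfav?`, using their C89 meanings. The names `SIMPLE_ESCAPES` and `unescape_simple` are hypothetical, and the table reflects the proposed diff, not a final decision about which escapes WDL keeps.

```python
# Hypothetical table: each character that may follow a backslash,
# mapped to the character it stands for (C89 meanings).
SIMPLE_ESCAPES = {
    "\\": "\\",  # backslash
    "n": "\n",   # newline (line feed)
    "r": "\r",   # carriage return
    "b": "\b",   # backspace
    "t": "\t",   # horizontal tab
    "f": "\f",   # form feed
    "a": "\a",   # alert (bell)
    "v": "\v",   # vertical tab
    "?": "?",    # question mark
}


def unescape_simple(body: str) -> str:
    """Expand the single-character escapes; reject any other escape."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        # A backslash must be followed by one of the listed characters.
        if i + 1 >= len(body) or body[i + 1] not in SIMPLE_ESCAPES:
            raise ValueError(f"unsupported escape at index {i}")
        out.append(SIMPLE_ESCAPES[body[i + 1]])
        i += 2
    return "".join(out)
```

With an explicit table, dropping an escape such as `\b` is a one-line change, and anything not listed fails loudly instead of being silently accepted.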

cjllanwarne · 2018-09-05T15:43:14Z

versions/development/SPEC.md

-* An escape sequence starting with `\\`, followed by 1 to 3 digits of value 0 through 7 inclusive.  This specifies an octal escape code.
-* An escape sequence starting with `\\x`, followed by hexadecimal characters `0-9a-fA-F`.  This specifies a hexadecimal escape code.
-* An escape sequence starting with `\\u` or `\\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`.  This specifies a unicode code point
+* Any character not in set: `\`, `"` (or `'` for single-quoted string), `\n`


We should also mention that ~{ indicates a placeholder

Just as a note here: the ~{ placeholder syntax is not mentioned under the string interpolation section. This syntax (as per the current SPEC) only applies to the command section.

cjllanwarne · 2018-09-05T15:44:32Z

versions/development/SPEC.md

+* An escape sequence starting with `\`, followed by one of the following characters: `\nrbtfav?`
+* An escape sequence starting with `\`, followed by 1 to 3 digits of value 0 through 7 inclusive.  This specifies an octal escape code.
+* An escape sequence starting with `\x`, followed by hexadecimal characters `0-9a-fA-F`.  This specifies a hexadecimal escape code.
+* An escape sequence starting with `\u` or `\U` followed by either 4 or 8 hexadecimal characters `0-9a-fA-F`.  This specifies a unicode code point


If WDL is set to unicode, would we remove these code point options?

I don't want to be pedantic, but in a spec pedantry is kind of required: the proposal was to specify UTF-8 encoding of WDL files. Unicode is not an encoding. On your point, \u would always be useful, because most users would find it hard or impossible to enter non-ASCII code points using their keyboard.

Python did this (https://www.python.org/dev/peps/pep-0263/), allowing the encoding of a file to be specified in the file itself. That decision predates the dominance of UTF-8. Golang more recently specifies that source files are UTF-8 encoded. I think there are two sensible options: specify ASCII, or specify UTF-8.
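The distinction between a code point and an encoding can be illustrated with a short Python sketch. It uses Python's own `\u` escape and codecs, not a WDL implementation, and the code point U+07C8 is simply the one that appears in a link posted in this thread.

```python
# A \u escape names a code point using ASCII-only source text.
text = "\u07c8"

# The same code point has different byte representations depending on
# the encoding chosen; "Unicode" alone does not determine the bytes.
utf8_bytes = text.encode("utf-8")       # two bytes
utf16_bytes = text.encode("utf-16-be")  # also two bytes, but different ones

# Decoding with the matching codec recovers the original code point.
round_trip = utf8_bytes.decode("utf-8")
```

This is why a `\u` escape stays useful even when the file encoding is fixed: the escape identifies the character, while UTF-8 only says how the file's bytes are to be read.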

EvanTheB · 2018-09-06T00:06:26Z

@patmagee I don't see any reason for choice. Choice means a mechanism for communicating the choice, and making all subsequent decisions abstract enough to deal with all possibilities. Python is gross: `# -*- coding: latin-1 -*-`

EvanTheB · 2018-09-12T05:09:04Z

I have specified UTF-8, explained each escape, and removed many of them. Still to do: explain ~{} and ${}, and unify string literals with the 'command' thing. It seems like every concept is invented 3 times in WDL ;).

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
http://www.asciitable.com/
https://unicode-table.com/en/#07C8
https://www.python.org/dev/peps/pep-0263/

orodeh · 2018-09-18T21:18:52Z

👍

DavyCats · 2018-09-19T11:58:32Z

👍

LeeTL1220 · 2018-09-19T15:41:33Z

👍 UTF-8, excellent!

cjllanwarne · 2018-09-19T21:07:31Z

👍 but a warning that UTF-8 and "hermes parser" may be incompatible, so 🤞 but I'm afraid this might be in "awaiting implementation" quite a while until we can get away from hermes...

EvanTheB · 2018-09-20T00:05:57Z

@cjllanwarne Oh dear! ASCII is a valid subset of UTF-8, so Hermes would still be parsing a well-defined subset, I suppose. Who knows what the behaviour is like on all the non-standard characters, though; the wording says anything is valid "except...".

I also noticed my markdown table is not rendering as I expected. I do not have a markdown workflow here.
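The subset relationship mentioned above can be checked with a small Python sketch. It only demonstrates the encoding property and says nothing about how Hermes actually treats bytes outside the ASCII range; the sample WDL-like snippets and the helper `decodes_as_ascii` are invented for illustration.

```python
def decodes_as_ascii(data: bytes) -> bool:
    """Return True if every byte in `data` is valid ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical source snippets, as bytes on disk.
ascii_source = b'String s = "hello"'
utf8_source = 'String s = "caf\u00e9"'.encode("utf-8")

# Every ASCII byte sequence decodes to the same text under UTF-8.
same_text = ascii_source.decode("ascii") == ascii_source.decode("utf-8")
```

A file that passes `decodes_as_ascii` is therefore also valid UTF-8 with identical content, while `utf8_source` contains bytes at or above 0x80 that an ASCII-only parser has no defined way to read.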

patmagee · 2018-09-20T01:27:53Z

👍

aednichols · 2018-09-21T18:50:37Z

👍

dheiman · 2018-09-28T14:18:19Z

👍

geoffjentry · 2018-10-02T07:46:58Z

@EvanTheB Please roll back the commit you made after voting went live. If you want to apply that, it'll need to be done separately.

geoffjentry · 2018-10-23T18:07:37Z

@EvanTheB Just looping back around to remind about removing that extraneous commit at the end

EvanTheB · 2018-10-25T00:13:46Z

I think I have done it, it is hard to tell.

geoffjentry · 2018-10-25T23:23:33Z

@EvanTheB Thanks. Please do feel free to submit a follow-up PR with those changes. They're valuable, but in this case the voting was already active, unfortunately.

geoffjentry · 2018-12-14T03:05:05Z

Implemented in Cromwell

Improve md formatting in string section

c11c3de

cjllanwarne reviewed Sep 5, 2018

Specify unicode, explain each escape

d901d03

geoffjentry added the Voting Active label Sep 18, 2018

EvanTheB changed the title ~~Improve md formatting in string section~~ Specify UTF-8 encoding and tidy up string definition Sep 20, 2018

geoffjentry added in review and removed Voting Active labels Oct 3, 2018

EvanTheB force-pushed the evan-string branch from 4fe236c to d901d03 Compare October 25, 2018 00:12

geoffjentry added Waiting for implementation and removed in review labels Oct 25, 2018

cjllanwarne added a commit to cjllanwarne/wdl that referenced this pull request Nov 27, 2018

Implement string escapes in openwdl#247

6654e24

This was referenced Nov 27, 2018

[Hermes Grammar] Implement string escapes for #247 #272

Merged

[WDL Biscayne] Move escape evaluation into Cromwell (and make it work) broadinstitute/cromwell#4427

Merged

geoffjentry removed the Waiting for implementation label Dec 14, 2018

geoffjentry merged commit fb920ed into openwdl:master Dec 14, 2018

geoffjentry pushed a commit that referenced this pull request Dec 14, 2018

Implement string escapes in #247 (#272)

005db97

patmagee added this to the v2.0 milestone Nov 20, 2019
