Proposal: Allow carriage return as valid newline sequence #837

hukkin · 2021-07-22T18:19:14Z

Related to some of the discussion in #835

I'd like to propose adding CR (carriage return) character that is not followed by an LF character to the list of allowed newline sequences. This is how e.g. CommonMark defines line ending.
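To make the difference concrete, here is a minimal sketch comparing the current newline definition (LF or CRLF) with the proposed one (LF, CRLF, or a CR not followed by LF). The pattern names and the `split_lines` helper are made up for this illustration and are not taken from the spec or from any parser.

```python
import re

# Current TOML v1.0.0 newline sequences: LF or CRLF.
CURRENT_NEWLINE = re.compile(r"\r\n|\n")
# Proposed: additionally treat a CR that is not followed by LF as a newline.
# CRLF is listed first so it is consumed as one sequence, not as CR + LF.
PROPOSED_NEWLINE = re.compile(r"\r\n|\n|\r")


def split_lines(text: str, newline: re.Pattern) -> list[str]:
    """Split text on the given newline pattern (hypothetical helper)."""
    return newline.split(text)


sample = "a = 1\rb = 2\r\nc = 3\nd = 4"
current_lines = split_lines(sample, CURRENT_NEWLINE)
proposed_lines = split_lines(sample, PROPOSED_NEWLINE)
```

Under the current definition the lone CR stays inside the first "line", which is why such a document is invalid today.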

The reasoning for the change is:

1. ~~Avoid roundtrip instability when an LF is prefixed by more than one CR character.~~ EDIT: Not an issue as per .abnf
2. In Python (perhaps other languages?) the standard way of opening files supports something called Universal Newline Support, meaning that LF, CR, and CRLF are all normalised to LF. This normalisation should NOT be used with the current TOML spec because
   - ~~CR characters in strings will be converted to LFs when not allowed to do so.~~ EDIT: Not an issue as per .abnf
   - Files where CRs are used as newline sequence are parsed successfully even though they are invalid TOML (spec v1.0.0).

   As far as I know, all popular Python parsers and their documented API make the mistake of doing this incorrect newline normalisation.
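For illustration, a minimal sketch of the normalisation described in point 2, using only the standard library. The sample bytes and the `read_text` helper are made up for this example; `io.TextIOWrapper` accepts the same `newline` argument as the built-in `open()`, so it stands in for reading a file in text mode.

```python
import io

# Sample document bytes using all three line endings: CR, CRLF, LF.
RAW = b"a = 1\rb = 2\r\nc = 3\n"


def read_text(raw: bytes, newline):
    """Decode bytes the way open(..., encoding="utf-8", newline=newline) would."""
    wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=newline)
    return wrapper.read()


# newline=None is the default: universal newlines, everything becomes LF.
normalised = read_text(RAW, None)
# newline="" keeps the line endings exactly as they appear in the input.
untouched = read_text(RAW, "")
```

With the default setting, a parser that receives `normalised` can no longer tell that the input contained a lone CR.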

Now for point 2, a valid counter-argument is to "just fix all Python parsers". 😄 However, perhaps other languages do similar normalisation by default or otherwise we agree that a spec change makes more sense here instead.


hukkin · 2021-07-23T11:04:49Z

If I interpret the .abnf correctly, CRs not followed by an LF character are already prohibited in MLB strings, in which case point 1 in the original post is actually not an issue.

I think this should be clarified in the text spec though. Currently the text spec explicitly mentions that unescaped CRs can be used in MLBs (with no mention of a following LF being required).

eksortso · 2021-07-23T21:31:24Z

You're right that clarification is worthwhile. But to be clear, there's no such thing as an "escaped" carriage return in TOML. There's a limited set of characters that can be escaped (with a backslash) in basic strings, and carriage returns aren't among them. (Even inside escaped end-of-line whitespace in MLBs, carriage returns are parts of newline sequences.) And other representations of carriage returns like `\r` and `\u000d` use the escape backslash but the sequences aren't considered "escaped."

hukkin · 2021-07-30T07:30:43Z

> But to be clear, there's no such thing as an "escaped" carriage return

Thanks for the correction.

FWIW, I currently think it may be best to reject this proposal in favor of the clarifying PR #838

ethanhs · 2021-07-30T10:11:43Z

Hm, shouldn't TOML use the Unicode line breaking algorithm since documents are UTF-8 encoded?
https://www.unicode.org/reports/tr14/tr14-32.html#Algorithm

hukkin · 2021-07-30T10:25:57Z

@ethanhs Based on a quick glance, it seems that the algorithm is an annex that is not part of the core spec. It also seems to do a slightly different thing than what we want: it specifies mandatory and optional break opportunities for text display, e.g. how web browsers wrap text at space characters.

pradyunsg · 2021-08-11T17:35:56Z

> I'd like to propose adding CR (carriage return) character that is not followed by an LF character to the list of allowed newline sequences. This is how e.g. CommonMark defines line ending.

Happy to accept a PR adding this.

LongTengDao · 2021-08-14T08:59:55Z

TOML is a format for general config.

In the current computing world, Windows uses CRLF, and Linux and Mac use LF; CR only exists in history (classic Mac OS).

So I think keeping EOL as CRLF/LF is good. If all newlines were considered, there would be too many:

- for JS, EOL is LF, CR, CRLF, LS (U+2028), PS (U+2029)
- for CSS, EOL is LF, FF (U+000C), CR, CRLF
- for Unicode, there are more newline chars... like U+0085 for XML

ChristianSi · 2021-08-14T10:22:29Z

I have some sympathy for @LongTengDao's viewpoint, but on the other hand I think: If it works for CommonMark, it can work for us too!

The alternative would be fragmentation, since it doesn't sound wise to require all implementations to reject lonely CR's as errors. Hence, if we don't clarify this, some will continue to accept and others to reject them. Not an ideal state of affairs.

hukkin · 2021-08-14T10:23:43Z

Note that I didn't request this because "I want all newlines" 😄, but mostly because the TOML spec seems to need either this or the clarifying PR to fix the first struck-through issue item in the original post.

I'm also slightly in favor of the clarifying PR and leaving lone CR in the history books. The nice thing is that it is a clarification, not a spec change, so less disturbing for implementations. E.g. git (as a modern system) doesn't seem to consider lone CR a newline in its text manipulations and people are happy with that.

I don't feel strongly at all however, as long as one of the two proposed changes happens.

ChristianSi · 2021-08-14T10:24:33Z

Also, from @LongTengDao's comment one can see that both JS and CSS accept lonely CR's as linebreaks, so it seems we're heading in the right direction. (I don't see a reason to allow additional linebreak characters, though. Let's keep it simple, like CommonMark does.)

pradyunsg · 2021-08-14T10:30:07Z

As far as I can tell, lone CRs are accepted in most places (think: browsers, editors, language interpreters/compilers etc) as a valid newline character sequence. I don't think I'd be expanding the specification to allow other characters.

arp242 · 2021-08-16T02:17:55Z

CR was used in the old pre-OSX MacOS and some other long-obsolete systems. Systems and standards that go back to the 80s and 90s support it because of that, but there's really no practical reason to add it in 2021.

hukkin · 2021-08-16T08:25:14Z

In addition to git's lack of support for lone CRs, the Go compiler also seems to completely refuse to parse them. So even if we stick to just LF and CRLF, it seems we're not an outlier among other modern projects.

Lemmingh · 2021-10-06T04:56:57Z

I don't think lone CR should be accepted as EOL.

LF and CR LF are enough.

Modern things should not repeat the mistakes of predecessors

Ideally, EOL should be a uniform thing in the electronic world.

The first people who made computers created so many kinds of EOL, perhaps for laziness, perhaps for business competition. Anyway, the past decades have been a big lesson.

Nowadays, only LF and CR LF have enough weight. Other options will be eradicated in practice.

Adding dying things to new specs is not wise.

"Lone `CR` as line ending" will have poor support in future

The world is moving away from chaos slowly.

New editors, such as Atom and VS Code, only accept LF and CR LF.

Compiler designers should manage to get original input

The Python example (Universal Newline Support) is not convincing.

Compilers and similar tools are expected to interpret the original input. Working on tweaked data is a mistake. Those Python implementations should open files in binary mode and perform decoding on their own.
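A minimal sketch of what "open files in binary mode and perform decoding on their own" could look like. `LONE_CR` and `decode_toml_bytes` are hypothetical names, not part of any TOML library, and the check assumes that a raw CR is only valid in a TOML v1.0.0 document when it is immediately followed by LF.

```python
import re

# A CR that is not immediately followed by LF.
LONE_CR = re.compile(r"\r(?!\n)")


def decode_toml_bytes(raw: bytes) -> str:
    """Decode raw TOML bytes as UTF-8 without any newline translation.

    Raises ValueError if the document contains a CR that is not part of CRLF
    (an assumption about what the v1.0.0 grammar permits).
    """
    text = raw.decode("utf-8")
    match = LONE_CR.search(text)
    if match is not None:
        raise ValueError(f"lone CR at character offset {match.start()}")
    return text
```

A caller would read the file with `open(path, "rb")` and pass the bytes to this function before handing the resulting string to a parser.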

On the other hand, PEP 278 was published in 2002, so there was adequate reason to recognize lone CR for compatibility. The basis no longer exists nowadays.

CommonMark is not suitable as a reference

Also, CommonMark is designed to have great backward compatibility to cater to users from different flavors and backgrounds due to the wild history of Markdown.

ChristianSi · 2021-10-06T11:02:34Z

@Lemmingh's comment makes a lot of sense. While I don't have a strong opinion on this whole issue, I agree that the rationale for the suggested change seems pretty weak indeed.

Which sane (or even insane) editor people might use these days to edit TOML files would use CR to write newlines? Probably none. So, it seems there is no problem and hence no need for a solution/change.

onerandomusername · 2021-12-04T22:03:32Z

Given the last few comments were over 2 months ago, is there an official verdict on this yet?

ChristianSi · 2021-12-06T18:54:26Z

Not yet, evidently. But considering the last two comments and the number of 👍 on them, I guess the proposal will be rejected.

septatrix · 2022-01-14T23:36:43Z

I too see no need for this change.

> The reasoning for the change is:
>
> [...]
>
> In Python (perhaps other languages?) the standard way of opening files supports something called Universal Newline Support, meaning that LF, CR, and CRLF are all normalised to LF. This normalisation should NOT be used with the current TOML spec because
>
> - ~~CR characters in strings will be converted to LFs when not allowed to do so.~~ EDIT: Not an issue as per .abnf
> - Files where CRs are used as newline sequence are parsed successfully even though they are invalid TOML (spec v1.0.0).
>
> As far as I know, all popular Python parsers and their documented API make the mistake of doing this incorrect newline normalisation.
>
> Now for point 2, a valid counter-argument is to "just fix all Python parsers". 😄 However, perhaps other languages do similar normalisation by default or otherwise we agree that a spec change makes more sense here instead.

Just to invalidate the last of your reasons (after 1 and 2a were already retracted): The newline parsing is a feature of Python and not an implementation error in the TOML libraries. If you do not want that normalization you are free to open the file in byte-mode. The libraries which I have tested will correctly reject files containing a lone `\r` if they are opened in byte-mode, as you would expect.

pradyunsg · 2022-01-15T08:47:41Z

Well, looks like we have consensus within the broader group and... well, I don't care strongly enough to push against that.

Rejecting this proposal, although I do appreciate everyone who has pitched in on these discussions! ^>^

hukkin mentioned this issue Jul 22, 2021

Support reading TOML from a filename. hukkin/tomli#99

Closed

hukkin mentioned this issue Jul 23, 2021

Clarify that CR not followed by LF is not allowed in an MLB #838

Closed

This was referenced Jul 30, 2021

Disable universal newlines when reading TOML psf/black#2408

Merged

Disable universal newlines when reading TOML python/mypy#10893

Merged

Avoid deprecation warning from Tomli pypa/pip#10238

Merged

hukkin mentioned this issue Jul 30, 2021

Raise an error on carriage return in multi-line basic string hukkin/tomli#108

Merged

hukkin mentioned this issue Jul 31, 2021

Question: Can newlines in multi-line literals be normalized? #835

Closed

This was referenced Aug 11, 2021

Allow bare CR newlines hukkin/tomli-w#10

Merged

Allow bare CR newlines hukkin/tomli#119

Closed

Allow lone CR as valid newline #840

Closed

pradyunsg closed this as completed Jan 15, 2022

hukkin mentioned this issue Jan 15, 2022

Clarify that CR not followed by LF is not allowed in an MLB #867

Merged

psvenk mentioned this issue Jan 30, 2022

TOML parsing does not work on Windows Aspine/aspine#326

Closed
