Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777

tats-u · 2024-09-23T07:12:41Z

commonmark/commonmark.js#261
commonmark/commonmark.js#289

https://spec.commonmark.org/0.31.2/#paragraphs

Leading spaces or tabs are skipped:
Example 222
 aaa
 bbb
aaa
bbb
Lines after the first may be indented any amount, since indented code blocks cannot interrupt paragraphs.
Example 223
aaa
 bbb
 ccc
aaa
bbb
ccc
However, the first line may be preceded by up to three spaces of indentation. Four spaces of indentation is too many:
Example 224
 aaa
bbb
aaa
bbb
Example 225
 aaa
bbb
<pre><code>aaa
</code></pre>
bbb
Final spaces or tabs are stripped before inline parsing, so a paragraph that ends with two or more spaces will not end with a hard line break:
Example 226
aaa 
bbb 
aaa 
bbb

The specification doesn't mention non-ASCII Unicode whitespaces, no-break space, form feed, or vertical tab, but some implementations treats them like ASCII space and tab, and remove trailing or leading ones in paragraphs.

~~cmark seems to remove at least U+00A0 NBSP, (but not U+3000 or U+2003)~~ preserves U+00A0 too

echo "`u{a0}a`u{a0}" | ./cmark | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 3c 70 3e c2 a0 61 c2 a0 ┊ 3c 2f 70 3e 0a          │<p>××a××┊</p>_   │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

However cmark removes trailing Form Feeds and Vertical Tabs.

PS> echo "guard`n`n`vVT`v`v`n`fFF`f`f`nend"  | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ ef bb bf 67 75 61 72 64 ┊ 0a 0a 0b 56 54 0b 0b 0a │×××guard┊__•VT••_│
│00000010│ 0c 46 46 0c 0c 0a 65 6e ┊ 64 0d 0a                │_FF___en┊d__     │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
PS> echo "guard`n`n`vVT`v`v`n`fFF`f`f`nend"  | ./cmark | hexyl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 3c 70 3e 67 75 61 72 64 ┊ 3c 2f 70 3e 0a 3c 70 3e │<p>guard┊</p>_<p>│
│00000010│ 0b 56 54 0a 0c 46 46 0a ┊ 65 6e 64 3c 2f 70 3e 0a │•VT__FF_┊end</p>_│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

commonmark.js removes all characters trimmed by trim() of JS.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar

micromark doesn't touch other than ASCII space (U+0020) or tab (U+0009).

Other implementations:

https://babelmark.github.io/?text=%0BVertical+Tab+(U%2B000B)%0B%0A%0A%0CLine+Separator+(U%2B000C)%0C%0A%0A+Space+(U%2B0020)+%0A%0A%09Tab+(U%2B0009)%09%0A%0A%C2%A0NBSP+(U%2B00A0)%C2%A0%0A%0A%E2%80%83EM+Space+(U%2B2003)%E2%80%83%0A%0A%E2%80%A8Line+Separator+(U%2B2028)%E2%80%A8%0A%0A%E2%80%A9Paragraph+Separator+(U%2B2029)%E2%80%A9%0A%0A%E3%80%80Chinese%2FJapanese+Space+(U%2B3000)%E3%80%80%0A

The text was updated successfully, but these errors were encountered:

wooorm · 2024-09-23T08:42:28Z

IMO commonmarkjs / cmark are wrong here. Spec is super clear about this.

tats-u · 2024-09-23T10:19:16Z

IMO commonmarkjs / cmark are wrong here.

I agree with you.

super clear

Does your opinion come from the definition of space and tab? To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions.
I think it would be clearer and stop making library authors worry about extra concerns (i.e. Unicode whitespace handling) if every "space" and "tab" are linked.

Leading [space]s or [tab]s are skipped:

[space]: #space
[tab]: #tab

I wish new additional test cases that Unicode white spaces should be preserved would be added. The current examples and test cases use only ASCII space.
Also, I think the spec should advise library authors to confirm the strict specification (what kind of characters is removed) of .trim() method in the language library and not to use them (JS & .NET remove Unicode ones and Java removes VT & FF) in this context. I think it is a common pitfall.

wooorm · 2024-09-23T10:53:16Z

I wish new additional test cases that Unicode white spaces should be preserved would be added. The current examples and test cases use only ASCII space.

It was not clear to me what this issue was about. Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?

To arrive at our interpretation of the spec, I think the specifications must be interpreted strictly, like a robot, without unnecessary distractions.

I don’t see it. The spec is very clear about characters: https://spec.commonmark.org/0.31.2/#characters-and-lines.

Specs are always a bit technical. They must be interpreted strictly.

tats-u · 2024-09-23T13:53:20Z

Are you saying one example of unicode whitespace, at the beginning and end of a line, would solve this issue for you?

I want an example like the following to be available as an officially-provided example or test case:

https://github.com/commonmark/commonmark.js/pull/289/files#diff-76b8b4c10de6a516d009bda44d45cd15db413db5eb2bbbd7f3376d3ad8ee6b8c

The spec is very clear about characters

The syntax itself is so. It might be due to the word choices ("space" and "tab", both are very common words) that library authors have incorrectly implemented the non-Unicode whitespace handling. (or due to the lack of understanding of .trim() methods)

jgm · 2024-09-23T15:16:54Z

Agreed, there should be a test case with, say, NBSP at the beginning and end of a paragraph, showing that it is not stripped.

tats-u changed the title ~~Ensure that trailing/leading Unicode whitespaces in paragraphs are not removed~~ Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Line Feed/No-break Space in paragraphs are not removed Sep 23, 2024

tats-u changed the title ~~Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Line Feed/No-break Space in paragraphs are not removed~~ Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed Sep 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777

Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777

tats-u commented Sep 23, 2024 •

edited

Loading

wooorm commented Sep 23, 2024

tats-u commented Sep 23, 2024 •

edited

Loading

wooorm commented Sep 23, 2024

tats-u commented Sep 23, 2024

jgm commented Sep 23, 2024

Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777

Ensure that trailing/leading Unicode whitespaces/Vertical Tab/Form Feed/No-break Space in paragraphs are not removed #777

Comments

tats-u commented Sep 23, 2024 • edited Loading

wooorm commented Sep 23, 2024

tats-u commented Sep 23, 2024 • edited Loading

wooorm commented Sep 23, 2024

tats-u commented Sep 23, 2024

jgm commented Sep 23, 2024

tats-u commented Sep 23, 2024 •

edited

Loading

tats-u commented Sep 23, 2024 •

edited

Loading