[css-text-3] line-break, word-break: language unclear, and a new testcase. #2559

faceless2 · 2018-04-13T10:06:14Z

The language for line-break and (in particular) word-break is unclear with regard to what changes are required to the UAX14 algorithm.

I've made a pull request for a new testcase we've been working up at web-platform-tests/wpt#10420. This testcase is complete but will require review due to the ambiguities described below.

While developing this, it became apparent that some of the language in the spec was a bit unclear: certainly to me and, as I'm seeing different results with this testcase in different browsers, maybe to others.

First, I expect I am not the first to point out that "word-break" and "line-break" have some considerable overlap. As described, breaks within words like ちょっと (UAX14 classes ID CJ CJ ID) are covered by the line-break rule, although this is a single word. And of course, "line-break: anywhere" will break words. Some sort of clarifying note as to the interaction of these two features might help.

Specific areas of the text that are a bit confusing or incomplete:

word-break states it "controls whether a soft wrap opportunity exists between adjacent typographic letter units (or other typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes" - although the note at the bottom of "keep-all" explicitly mentions Korean, the classes H2, H3, JL, JT and JV are excluded from this list. I don't know Korean so I'm unsure if that is a deliberate omission. It also doesn't mention classes CJ or NS, and again I'm not sure if this is a deliberate omission. Given the overlap with line-break it may be better to dump this descriptive paragraph completely in favour of exact descriptions of the behaviour of each property with regard to UAX14, as I've added below.
The language of "word-break: keep-all" is still a bit unclear with regard to the changes it mandates to UAX14. For example, "Breaking is forbidden within “words”: implicit soft wrap opportunities between typographic letter units are suppressed" makes no mention of character class, so isn't much help if you're implementing this. UAX14 describes this same customization as used for "ragged" Korean text, and specifies "... breaking after spaces (as in Latin text)". I believe the intention here is to treat all ideographic characters as if they were Latin text.
line-break: anywhere is described as providing "a soft wrap opportunity around every typographic character unit, including around any punctuation character or preserved spaces, or in the middle of words, disregarding any prohibition against line breaks introduced by characters with the GL, JW, or ZJW character class". It then states in the note that "This value triggers the line breaking rules typically seen in terminals.". If that's the intention then the mention of GL, JW and ZJW (which should be WJ and ZWJ by the way) is superfluous and confusing. And also superfluous. The final sentence should be "disregarding any prohibition", full stop, end of. Literally anywhere in the text is a valid break-point, even before U+20.
What happens if I specify "word-break: keep-all; line-break: anywhere"? The two rules contradict each other; which one wins?
Using the language of the text as an input to the algorithm seems a bit odd to me. Is there any reason "loose-cj" and "normal-cj" values for line-break could not be used to achieve the same thing? Not really a serious issue and I can't think of a specific reason why it's a problem, it just feels out of character with the rest of the spec so thought I'd raise it while I'm typing.

We've interpreted the various property values as having the following meaning. Whether they're correct or not is almost a secondary issue at this stage; what I'm getting at is that these definitions are exact enough to work from, so I think it would be great if the descriptions for these property values were rewritten in this form, i.e. detailing exactly what changes need to be made to UAX14.

"word-break: normal" controls breakpoints between AI, AL, CJ, H2, H3, HL, ID, JL, JT and JV exactly as defined in UAX14. This allows breakpoints in the middle of CJK words, and denies them in non-CJK words. (note: existing description states "customary rules as described above", which is nowhere near exact enough)
"word-break: break-all" treats any glyphs of class AI, AL, HL, NU and SA as class ID for the purposes of UAX14. (note: class AI is not listed in the current description; it probably should be, as UAX14 LB1 suggests that class AI is resolved to another class. HL was also missing, I think it should be treated as for AL)
"word-break: keep-all" treats any glyphs of class AI, CJ, H2, H3, ID, JL, JT and JV as if they were class AL for the purposes of UAX14. In other words, CJK text will be broken exactly as if it was latin text, i.e. with spaces.
"line-break: anywhere" allows a breakpoint between any two typographic character units. The restrictions defined in UAX14 do not apply, and the value of "word-break" is ignored.

(note: this issue originally posted against the wrong repository at web-platform-tests/wpt#10423)
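
The four interpretations listed above amount to a class remapping applied before the UAX14 pair rules run. A minimal Python sketch of that remapping follows, assuming the class lists exactly as written in this comment; the names `BREAK_ALL_AS_ID`, `KEEP_ALL_AS_AL`, `remap_class` and `remap_classes` are hypothetical, and none of this is spec text.

```python
# Hypothetical names; the sets mirror the interpretation in this comment, not the spec.
BREAK_ALL_AS_ID = frozenset({"AI", "AL", "HL", "NU", "SA"})
KEEP_ALL_AS_AL = frozenset({"AI", "CJ", "H2", "H3", "ID", "JL", "JT", "JV"})


def remap_class(lb_class: str, word_break: str = "normal") -> str:
    """Return the class to feed into the UAX14 pair rules for one character."""
    if word_break == "break-all" and lb_class in BREAK_ALL_AS_ID:
        return "ID"  # break-all: these classes break like ideographs
    if word_break == "keep-all" and lb_class in KEEP_ALL_AS_AL:
        return "AL"  # keep-all: CJK classes break like alphabetic text
    return lb_class  # normal (or unaffected class): use the UAX14 class as-is


def remap_classes(classes, word_break="normal"):
    """Remap a sequence of UAX14 class names for the given word-break value."""
    return [remap_class(c, word_break) for c in classes]


# ちょっと is ID CJ CJ ID according to the text above.
print(remap_classes(["ID", "CJ", "CJ", "ID"], "keep-all"))
# prints ['AL', 'AL', 'AL', 'AL']
```

"line-break: anywhere" is deliberately absent from this sketch: under the interpretation above it bypasses the UAX14 rules entirely rather than remapping classes.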

astearns · 2018-07-02T03:35:42Z

(removed agenda+ for now on @fantasai's recommendation)

fantasai · 2018-12-06T00:03:44Z

First, I expect I am not the first to point out that "word-break" and "line-break" have some considerable overlap. As described, breaks within words like ちょっと (UAX14 classes ID CJ CJ ID) are covered by the line-break rule, although this is a single word. And of course, "line-break: anywhere" will break words. Some sort of clarifying note as to the interaction of these two features might help.

I've tried to clarify the specific interactions. Not sure exactly how to explain the interactions at a high level other than what's there, but I'll give it a try later.

word-break states it "controls whether a soft wrap opportunity exists between adjacent typographic letter units (or other typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes" - although the note at the bottom of "keep-all" explicitly mentions Korean, the classes H2, H3, JL, JT and JV are excluded from this list. I don't know Korean so I'm unsure if that is a deliberate omission. It also doesn't mention classes CJ or NS, and again I'm not sure if this is a deliberate omission. Given the overlap with line-break it may be better to dump this descriptive paragraph completely in favour of exact descriptions of the behaviour of each property with regard to UAX14, as I've added below.

H2, H3, JL, JT, JV, and CJ are excluded from that list because they are all letters, so they're included in “typographic letter units” already. Line breaking around NS is controlled by line-break: word-break is not able to influence it. I've changed “other” to “non-letter” here to clarify. The sentence is a bit awkward because I don't know how to grammatically construct the sentence to make it clear that the “belonging to” phrase attaches only to “typographic character units” and not to “typographic letter units”, hence the parentheses. Anyway, the sentence now looks like:

“Specifically it controls whether a soft wrap opportunity exists between adjacent typographic letter units (and/or non-letter typographic character units belonging to the NU, AL, AI, or ID Unicode line breaking classes [UAX14]).”

The language of "word-break: keep-all" is still a bit unclear with regards to the changes it mandates to UAX14. For example, "Breaking is forbidden within “words”: implicit soft wrap opportunities between typographic letter units are suppressed" makes no mention of character class, so isn't much help if you're implementing this.

“typographic letter unit” is very specifically defined in https://www.w3.org/TR/css-text-3/#typographic-letter-unit so I don't know why you think there's “no mention of character class”.
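
As a rough illustration of why the Hangul and small-kana classes need no separate mention, here is a Python sketch that assumes the linked definition amounts to "a typographic character unit in one of the Unicode Letter or Number general categories". The function name is hypothetical, and a single code point stands in for a typographic character unit, which is a simplification.

```python
import unicodedata


def is_typographic_letter_unit(ch: str) -> bool:
    """Approximate check: general category starts with L (Letter) or N (Number).

    Assumption: `ch` is a single code point standing in for one
    typographic character unit.
    """
    return unicodedata.category(ch)[0] in ("L", "N")


# Small kana and Hangul syllables are letters; punctuation and spaces are not.
for ch in ["ち", "ょ", "한", "A", "7", "、", " "]:
    print(repr(ch), unicodedata.category(ch), is_typographic_letter_unit(ch))
```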

I believe the intention here is to treat all ideographic characters as if they were latin text.

Yes.

line-break: anywhere is described as providing "a soft wrap opportunity around every typographic character unit, including around any punctuation character or preserved spaces, or in the middle of words, disregarding any prohibition against line breaks introduced by characters with the GL, JW, or ZJW character class". It then states in the note that "This value triggers the line breaking rules typically seen in terminals.". If that's the intention then the mention of GL, JW and ZJW (which should be WJ and ZWJ by the way) is superfluous and confusing. And also superfluous. The final sentence should be "disregarding any prohibition", full-stop end of. Literally anywhere in the text is a valid break-point, even before U+20

Edited as “any prohibition against line breaks, even those introduced by characters ...”. I want it to be clear that explicit wrapping controls are also ignored.

What happens if I specify "word-break: keep-all; line-break: anywhere". The two rules contradict each other; which one wins?

line-break: anywhere. I'll clarify that point.
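
That precedence can be sketched in Python as follows. This is only an illustration of the answer above: `soft_wrap_offsets` and the caller-supplied `other_rules` callback are hypothetical names, and each code point is assumed to be one typographic character unit.

```python
from typing import Callable, List


def soft_wrap_offsets(
    text: str,
    word_break: str,
    line_break: str,
    other_rules: Callable[[str, str, str], List[int]],
) -> List[int]:
    """Return offsets in `text` before which a soft wrap opportunity exists."""
    if line_break == "anywhere":
        # word-break is not consulted: every boundary is an opportunity.
        # Assumption: each code point is one typographic character unit.
        return list(range(1, len(text)))
    # Every other combination is left to a caller-supplied (hypothetical) breaker.
    return other_rules(text, word_break, line_break)
```

With `line_break="anywhere"` the result is the same whatever `word_break` is, which is the point of the answer above.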

Using the language of the text as an input to the algorithm seems a bit odd to me. Is there any reason "loose-cj" and "normal-cj" values for line-break could not be used to achieve the same thing? Not really a serious issue and I can't think of a specific reason why it's a problem, it just feels out of character with the rest of the spec so thought I'd raise it while I'm typing.

There's a lot of stuff in the spec that is language- or writing-system-dependent. Much of it is not called out in such explicit terms as these rules, but line-breaking, justification, white-space collapsing, and text transforms are all language-dependent. We do this because a) we want things to work optimally by default, without the author having to think about every single CSS property that does or will exist, and b) we want to limit the number of values to the switches that are useful for an author to think about, rather than overloading everyone in the world with more values than they can easily reason about or even need to know about.

(note: existing description states "customary rules as described above", which is nowhere near exact enough)

UAX14 is a starting point for universal line breaking, not the ultimate authority on quality typesetting. We are intentionally not requiring it.

fantasai · 2018-12-12T02:54:02Z

I've tried to clarify the specific interactions. Not sure exactly how to explain the interactions at a high level other than what's there, but I'll give it a try later.

OK, did a bunch of editorial work to try to clean up overview sections and interactions. :) I think this should be fixed now, let me know if you have further suggestions.

faceless2 mentioned this issue Apr 13, 2018

[css-text-3] line-break, word-break: language unclear, and a new testcase. web-platform-tests/wpt#10423

Closed

frivoal added the css-text-3 Current Work label Apr 13, 2018

frivoal changed the title ~~line-break, word-break: language unclear, and a new testcase.~~ [css-text-3] line-break, word-break: language unclear, and a new testcase. Apr 13, 2018

frivoal self-assigned this Apr 13, 2018

xfq added the i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. label May 7, 2018

frivoal added the Agenda+ F2F label Jul 1, 2018

astearns removed the Agenda+ F2F label Jul 2, 2018

frivoal added Agenda+ F2F and removed Agenda+ F2F labels Jul 2, 2018

faceless2 mentioned this issue Sep 6, 2018

[css-text-3] line breaking rules around replaced-inline content incorrectly refer to ID class #3085

Closed

fantasai added Needs Edits Tracked in DoC labels Sep 16, 2018

frivoal assigned fantasai Oct 2, 2018

fantasai added a commit that referenced this issue Dec 6, 2018

[css-text-3] Clarifications to line breaking. #2559

dc4a24f

fantasai added the Closed Accepted as Editorial label Dec 6, 2018

himorin mentioned this issue Dec 10, 2018

[css-text-3] line-break, word-break: language unclear, and a new testcase. w3c/i18n-activity#620

Closed

fantasai added a commit that referenced this issue Dec 12, 2018

[css-text-3] Provide an overview of the line-breaking properties in t…

043a420

…he section intro. Clean up some text about interactions. #2559

fantasai added a commit that referenced this issue Dec 12, 2018

[css-text-3] Merge two sections so that all of line breaking is one c…

c5e4c9a

…hapter. #2559

fantasai removed the Needs Edits label Dec 12, 2018

fantasai closed this as completed Dec 18, 2018

frivoal added the Needs Review of Test Case(s) label Apr 25, 2019

frivoal added Tested Memory aid - issue has WPT tests and removed Needs Review of Test Case(s) labels Dec 30, 2022
