Fix for issue #3658: improve line breaking #3743

ChrisLoer · 2016-12-05T21:59:02Z

This is an experimental new line breaking algorithm aiming to create as general a fix as possible to the problems discussed in issue #3658.

The general approach borrows heavily from the line breaking algorithm used by LaTeX -- we analyze a label as a graph of potential line breaks, and then assign each potential break a "badness" score based primarily on how much the length of the line deviates from the average line width of the label. We then choose the set of line breaks that minimizes "badness" across the whole label.

The advantages of this approach are:

No (or at least much less) special treatment needed for ideographic characters, which also means that there's no problem mixing ideographic and non-ideographic characters in the same label.
Punctuation marks can be included in ideographic labels and you get reasonable behavior (because the algorithm doesn't depend on any assumptions about character width)
"Plain" latin labels get balanced too (this will really only make a difference with relatively long labels)

Finding the line breaks is O(n^2) on the number of potential line breaks (and ideographic text is basically all potential line breaks) -- it doesn't seem to be a performance issue, but if we need to we can put an upper limit on that by using a sliding window.

When I was discussing desiderata with @nickidlugash , one idea she suggested was to have the last line "shorten" to allow all the lines above it to remain the same length. Although this algorithm does encourage the last line to shorten first, it is still possible in many cases to end up with lower lines longer than upper lines (in the interests of minimizing overall raggedness).

The algorithm does not strictly enforce maxWidth because in cases where maxWidth is very close to the average line length, we get better results by going slightly over the limit.

To see some examples of the new algorithm applied to ideographic text with punctuation, see: https://github.com/mapbox/mapbox-gl-test-suite/blob/cloer_leastbad_linebreak/render-tests/text-max-width/ideographic-punctuation-breaking/expected.png

cc @ian29 @1ec5 @ansis @lucaswoj

1ec5 · 2016-12-06T01:10:47Z

I love the idea of assigning relative priorities like this, and props for taking inspiration from L^AT_EX.

Eventually, we'll definitely want to expand the left-right distinction that you have for parentheses to cover all brackets; fortunately Unicode has metadata to help us out there. There are also some trickier cases: for example, a line break is disallowed inside "U.S.A." but allowed after it. A line break is allowed after a colon that's followed by a space, but not one that's followed by a letter (because that indicates pluralization in Swedish). But this looks like a great start.

1ec5 · 2016-12-06T22:15:12Z

There are also some trickier cases: for example, a line break is disallowed inside "U.S.A." but allowed after it. A line break is allowed after a colon that's followed by a space, but not one that's followed by a letter (because that indicates pluralization in Swedish).

Come to think of it, these punctuation marks should be fine as is because they aren’t included in breakable.

ChrisLoer · 2016-12-07T01:11:08Z

Come to think of it, these punctuation marks should be fine as is because they aren’t included in breakable.

Yup. I suspect we don't have to be that aggressive about breaking on that kind of punctuation, even though we could in many cases. If we want to add more potential break points, I think the biggest challenge in front of us are languages like Thai that need a word-breaking dictionary.

lucaswoj · 2016-12-07T18:56:45Z

Thank you for tackling this project @ChrisLoer. I'm excited to 🚢 this as soon as we are confident that it doesn't regress any important cases.

I noticed that the ideographic labels are no longer perfectly balanced (rectangle-shaped) in the test suite diff associated with this branch. Is there an easy way to have the old balanced breaking behaviour alongside this new approach?

ChrisLoer · 2016-12-07T19:53:56Z

I noticed that the ideographic labels are no longer perfectly balanced (rectangle-shaped) in the test suite diff associated with this branch. Is there an easy way to have the old balanced breaking behaviour alongside this new approach?

For the text-max-width/ideographic-breaking test, I believe there are three changes:

Top left: this went from two lines to three. This is because the old algorithm in some cases allowed lines to go further over maxWidth than the new algorithm (and this particular test happens to be a case that goes as far over as possible). We're not strict about maxWidth, so it's hard to be definitive about what the behavior should be, but I think the new behavior will be somewhat closer to the label-maker's intentions.
Top middle: there are two "missing" characters preventing this from being perfectly balanced. The old algorithm took them both out of the bottom line, while the new algorithm split them between the second and the bottom line. I believe the old algorithm gave a better result here. The new algorithm tries to get similar behavior by being forgiving of shorter final lines, but that doesn't help in this case because the "target width" is actually closer to four characters than it is to five characters, so the new algorithm thinks the second line looks better with four rather than five characters.
Bottom right: the new algorithm is strictly better here, I think, since it handles mixed text and makes a much more balanced block.

I'd like to do better for that top-middle case, but I don't see an easy way to do it with this approach. Keeping this algorithm polynomial in the number of line breaks requires us to have a "target width" that we set ahead of time. We could try something like this: if all characters in the label are full-width, then round the target width up to the nearest multiple of a full-width character size -- this would cause the algorithm to favor filling out the upper lines and shortening the last one... but we'd be partway back to where we started in that adding punctuation or numbers would change line breaking behavior more than expected...

lucaswoj · 2016-12-07T20:08:09Z

Thanks for the clarification @ChrisLoer. The design decisions make sense to me 👍 .

I'm going to defer to @nickidlugash and @xrwang for judgement on what, if anything, is blocking this from 🚢 from a design perspective.

ChrisLoer · 2016-12-08T18:25:55Z

I experimented with the "round target width up to nearest full-width character size" strategy. For ideographic text without punctuation, it gets the result we're looking for ("rounding up" is on the left, the changes are in top-middle and bottom-right):

However, it leads to what I think are suboptimal results with a little bit of punctuation:

The most instructive case is bottom-middle -- the parentheses increase the average line width just a little bit past "four ideographic characters", but that gets rounded all the way up to "five characters" which throws everything off.

We could do the "rounding up" only for runs of entirely full-width characters, but I think it would be a fragile special case and I suspect it's not necessary.

ChrisLoer · 2016-12-12T20:56:27Z

It's not very scientific, but to eyeball the performance implications, I profiled a worker while panning through an area of the map that had several long (>15 character) Chinese labels. In my test run, determineLineBreaks came out to 0.68% of CPU time for the worker.

nickidlugash · 2016-12-13T23:37:37Z

I reviewed this PR with our osm data in English, osm data in Chinese, and alternative data in Chinese. Overall, I think this is definitely an improvement to existing results.

With Chinese, biggest improvements I see are with mixed labels (Chinese + latin characters/numerals) and parenthetical labels:

before:

after:

The treatment of parenthetical labels also seems more visually optimal than the carto team's quick fix of some of our data, where we hard coded a linebreak before opening parentheses:

before:

after:

For English labels, I'm seeing a fair amount of differences between this and the current implementation. The vast majority of cases are either improvements, or not significant either way. Main differences I see are a reduction in number of lines from either 5 to 4, 4 to 3, or 3 to 2:

before:

after:

(It's not that useful to evaluate this before and after in isolation of other labels in that style layer, but just wanted to include it as a visual. When looking at the map though, this "after" looked closer in design intent to the other shorter labels than the "before" did).

In addition to these visual improvements, to me this seems like a conceptually solid backbone for our line breaking algorithm. I see a few areas that I think it would be great for us to improve upon in subsequent PRs:

I agree that it may be best not to do the "rounding up" for now if it introduces special casing into the algorithm, but hopefully we can continue to explore ways to improve that top-middle scenario.
I'm seeing some issues with punctuation that I assume can be addressed with additional penalty calculations, like for title marks:

I'm happy to help compile suggestions for punctuation and their penalty. We may just be able to start off with adding more opening and closing punctuation with the same penalties as parentheses.
I wonder if we can play around with adding even more allowances for going over the maxWidth, for situations where dropping text to another line (particularly in the case of one line versus two) causes line widths to be much shorter than the maxWidth. From a design standpoint, a think maxWidth should be considered a suggestion for "ideal" width. From a conceptual standpoint, this would mean creating a more nuanced definition of what is "ideal", beyond balanced lines. (Caveat: this could get hairy once we support linebreaking for text along lines, since designers would have to consider maxWidth and symbol spacing together).

@ChrisLoer @lucaswoj pls let me know if there's anything else you specifically wanted me to test out using our data! Otherwise, I'm 👍 on this from a design perspective.

ChrisLoer · 2016-12-14T01:20:28Z

Thanks so much for the detailed review, @nickidlugash !

You're right about the title marks -- it will be easy to treat them the same way as parentheses, we just need a list of which characters get that special treatment. I could merge them into this PR or always do it in a future PR.

For a future PR, we could explore further improvements to the treatment of parentheses (and other similar punctuation). For instance, we could do lookahead for closing braces, and introduce a penalty for splitting a brace across two lines (right now as long as there's one character between the parenthesis and the line break, our algorithm doesn't give the parenthesis any special treatment).

For both the "make the top lines heavier than the bottom lines" case and the "be willing to overflow max width to avoid a ragged line break" case, we might be able to do better by running the algorithm in multiple passes... 🤔

lucaswoj

🚢 on 🍏

* Optimize for minimal line width variation on multi-line labels * Use same algorithm for all character types to support diglossic labels * Avoid hanging parentheses in ideographic text

1ec5 · 2016-12-15T10:42:00Z

You're right about the title marks -- it will be easy to treat them the same way as parentheses, we just need a list of which characters get that special treatment. I could merge them into this PR or always do it in a future PR.

#3811

ChrisLoer force-pushed the cloer_leastbad_linebreak branch 2 times, most recently from 2750211 to 6d75abc Compare December 14, 2016 16:52

ChrisLoer mentioned this pull request Dec 14, 2016

Make gl-native handling of newlines conform to gl-js behavior mapbox/mapbox-gl-native#7253

Closed

lucaswoj approved these changes Dec 14, 2016

View reviewed changes

ChrisLoer force-pushed the cloer_leastbad_linebreak branch 2 times, most recently from 3ba54b9 to 5bf6c62 Compare December 15, 2016 00:17

Fix for issue #3658: improve line breaking

1b7b5c2

* Optimize for minimal line width variation on multi-line labels * Use same algorithm for all character types to support diglossic labels * Avoid hanging parentheses in ideographic text

ChrisLoer force-pushed the cloer_leastbad_linebreak branch from 5bf6c62 to 1b7b5c2 Compare December 15, 2016 00:41

ChrisLoer merged commit 7e254fa into master Dec 15, 2016

This was referenced Dec 15, 2016

Halfwidth punctuation prevents ideographic line breaking #3658

Closed

Fix issue #7253: port gl-js line breaking behavior mapbox/mapbox-gl-native#7446

Merged

1ec5 mentioned this pull request Dec 15, 2016

Avoid breaking on wrong side of brackets #3811

Open

mourner deleted the cloer_leastbad_linebreak branch December 15, 2016 12:12

1ec5 mentioned this pull request May 21, 2019

prefer breaking lines at a zero width space. #8255

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for issue #3658: improve line breaking #3743

Fix for issue #3658: improve line breaking #3743

ChrisLoer commented Dec 5, 2016

1ec5 commented Dec 6, 2016

1ec5 commented Dec 6, 2016

ChrisLoer commented Dec 7, 2016

lucaswoj commented Dec 7, 2016 •

edited

Loading

ChrisLoer commented Dec 7, 2016 •

edited

Loading

lucaswoj commented Dec 7, 2016

ChrisLoer commented Dec 8, 2016

ChrisLoer commented Dec 12, 2016

nickidlugash commented Dec 13, 2016

ChrisLoer commented Dec 14, 2016

lucaswoj left a comment

1ec5 commented Dec 15, 2016

Fix for issue #3658: improve line breaking #3743

Fix for issue #3658: improve line breaking #3743

Conversation

ChrisLoer commented Dec 5, 2016

1ec5 commented Dec 6, 2016

1ec5 commented Dec 6, 2016

ChrisLoer commented Dec 7, 2016

lucaswoj commented Dec 7, 2016 • edited Loading

ChrisLoer commented Dec 7, 2016 • edited Loading

lucaswoj commented Dec 7, 2016

ChrisLoer commented Dec 8, 2016

ChrisLoer commented Dec 12, 2016

nickidlugash commented Dec 13, 2016

ChrisLoer commented Dec 14, 2016

lucaswoj left a comment

Choose a reason for hiding this comment

1ec5 commented Dec 15, 2016

lucaswoj commented Dec 7, 2016 •

edited

Loading

ChrisLoer commented Dec 7, 2016 •

edited

Loading