Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for issue #3658: improve line breaking #3743

Merged
merged 1 commit into from
Dec 15, 2016

Conversation

ChrisLoer
Copy link
Contributor

This is an experimental new line breaking algorithm aiming to create as general a fix as possible to the problems discussed in issue #3658.

The general approach borrows heavily from the line breaking algorithm used by LaTeX -- we analyze a label as a graph of potential line breaks, and then assign each potential break a "badness" score based primarily on how much the length of the line deviates from the average line width of the label. We then choose the set of line breaks that minimizes "badness" across the whole label.

The advantages of this approach are:

  • No (or at least much less) special treatment needed for ideographic characters, which also means that there's no problem mixing ideographic and non-ideographic characters in the same label.
  • Punctuation marks can be included in ideographic labels and you get reasonable behavior (because the algorithm doesn't depend on any assumptions about character width)
  • "Plain" latin labels get balanced too (this will really only make a difference with relatively long labels)

Finding the line breaks is O(n^2) on the number of potential line breaks (and ideographic text is basically all potential line breaks) -- it doesn't seem to be a performance issue, but if we need to we can put an upper limit on that by using a sliding window.

When I was discussing desiderata with @nickidlugash , one idea she suggested was to have the last line "shorten" to allow all the lines above it to remain the same length. Although this algorithm does encourage the last line to shorten first, it is still possible in many cases to end up with lower lines longer than upper lines (in the interests of minimizing overall raggedness).

The algorithm does not strictly enforce maxWidth because in cases where maxWidth is very close to the average line length, we get better results by going slightly over the limit.

To see some examples of the new algorithm applied to ideographic text with punctuation, see: https://github.com/mapbox/mapbox-gl-test-suite/blob/cloer_leastbad_linebreak/render-tests/text-max-width/ideographic-punctuation-breaking/expected.png

cc @ian29 @1ec5 @ansis @lucaswoj

@1ec5
Copy link
Contributor

1ec5 commented Dec 6, 2016

I love the idea of assigning relative priorities like this, and props for taking inspiration from LATEX.

Eventually, we'll definitely want to expand the left-right distinction that you have for parentheses to cover all brackets; fortunately Unicode has metadata to help us out there. There are also some trickier cases: for example, a line break is disallowed inside "U.S.A." but allowed after it. A line break is allowed after a colon that's followed by a space, but not one that's followed by a letter (because that indicates pluralization in Swedish). But this looks like a great start.

@1ec5
Copy link
Contributor

1ec5 commented Dec 6, 2016

There are also some trickier cases: for example, a line break is disallowed inside "U.S.A." but allowed after it. A line break is allowed after a colon that's followed by a space, but not one that's followed by a letter (because that indicates pluralization in Swedish).

Come to think of it, these punctuation marks should be fine as is because they aren’t included in breakable.

@ChrisLoer
Copy link
Contributor Author

Come to think of it, these punctuation marks should be fine as is because they aren’t included in breakable.

Yup. I suspect we don't have to be that aggressive about breaking on that kind of punctuation, even though we could in many cases. If we want to add more potential break points, I think the biggest challenge in front of us are languages like Thai that need a word-breaking dictionary.

@lucaswoj
Copy link
Contributor

lucaswoj commented Dec 7, 2016

Thank you for tackling this project @ChrisLoer. I'm excited to 🚢 this as soon as we are confident that it doesn't regress any important cases.

I noticed that the ideographic labels are no longer perfectly balanced (rectangle-shaped) in the test suite diff associated with this branch. Is there an easy way to have the old balanced breaking behaviour alongside this new approach?

@ChrisLoer
Copy link
Contributor Author

ChrisLoer commented Dec 7, 2016

I noticed that the ideographic labels are no longer perfectly balanced (rectangle-shaped) in the test suite diff associated with this branch. Is there an easy way to have the old balanced breaking behaviour alongside this new approach?

For the text-max-width/ideographic-breaking test, I believe there are three changes:

screen shot 2016-12-07 at 11 55 15 am

  • Top left: this went from two lines to three. This is because the old algorithm in some cases allowed lines to go further over maxWidth than the new algorithm (and this particular test happens to be a case that goes as far over as possible). We're not strict about maxWidth, so it's hard to be definitive about what the behavior should be, but I think the new behavior will be somewhat closer to the label-maker's intentions.
  • Top middle: there are two "missing" characters preventing this from being perfectly balanced. The old algorithm took them both out of the bottom line, while the new algorithm split them between the second and the bottom line. I believe the old algorithm gave a better result here. The new algorithm tries to get similar behavior by being forgiving of shorter final lines, but that doesn't help in this case because the "target width" is actually closer to four characters than it is to five characters, so the new algorithm thinks the second line looks better with four rather than five characters.
  • Bottom right: the new algorithm is strictly better here, I think, since it handles mixed text and makes a much more balanced block.

I'd like to do better for that top-middle case, but I don't see an easy way to do it with this approach. Keeping this algorithm polynomial in the number of line breaks requires us to have a "target width" that we set ahead of time. We could try something like this: if all characters in the label are full-width, then round the target width up to the nearest multiple of a full-width character size -- this would cause the algorithm to favor filling out the upper lines and shortening the last one... but we'd be partway back to where we started in that adding punctuation or numbers would change line breaking behavior more than expected...

@lucaswoj
Copy link
Contributor

lucaswoj commented Dec 7, 2016

Thanks for the clarification @ChrisLoer. The design decisions make sense to me 👍 .

I'm going to defer to @nickidlugash and @xrwang for judgement on what, if anything, is blocking this from 🚢 from a design perspective.

@ChrisLoer
Copy link
Contributor Author

I experimented with the "round target width up to nearest full-width character size" strategy. For ideographic text without punctuation, it gets the result we're looking for ("rounding up" is on the left, the changes are in top-middle and bottom-right):

screen shot 2016-12-08 at 10 17 14 am

However, it leads to what I think are suboptimal results with a little bit of punctuation:

screen shot 2016-12-08 at 10 19 43 am

The most instructive case is bottom-middle -- the parentheses increase the average line width just a little bit past "four ideographic characters", but that gets rounded all the way up to "five characters" which throws everything off.

We could do the "rounding up" only for runs of entirely full-width characters, but I think it would be a fragile special case and I suspect it's not necessary.

@ChrisLoer
Copy link
Contributor Author

It's not very scientific, but to eyeball the performance implications, I profiled a worker while panning through an area of the map that had several long (>15 character) Chinese labels. In my test run, determineLineBreaks came out to 0.68% of CPU time for the worker.

@nickidlugash
Copy link

I reviewed this PR with our osm data in English, osm data in Chinese, and alternative data in Chinese. Overall, I think this is definitely an improvement to existing results.

With Chinese, biggest improvements I see are with mixed labels (Chinese + latin characters/numerals) and parenthetical labels:

before:
parentheses-and-numeral-before

after:
parentheses-and-numeral-after

The treatment of parenthetical labels also seems more visually optimal than the carto team's quick fix of some of our data, where we hard coded a linebreak before opening parentheses:

before:
better-parenthetical-before

after:
better-parenthetical-after

For English labels, I'm seeing a fair amount of differences between this and the current implementation. The vast majority of cases are either improvements, or not significant either way. Main differences I see are a reduction in number of lines from either 5 to 4, 4 to 3, or 3 to 2:

before:
screen shot 2016-12-13 at 4 46 51 pm

after:
screen shot 2016-12-13 at 4 46 47 pm

(It's not that useful to evaluate this before and after in isolation of other labels in that style layer, but just wanted to include it as a visual. When looking at the map though, this "after" looked closer in design intent to the other shorter labels than the "before" did).

In addition to these visual improvements, to me this seems like a conceptually solid backbone for our line breaking algorithm. I see a few areas that I think it would be great for us to improve upon in subsequent PRs:

  • I agree that it may be best not to do the "rounding up" for now if it introduces special casing into the algorithm, but hopefully we can continue to explore ways to improve that top-middle scenario.

  • I'm seeing some issues with punctuation that I assume can be addressed with additional penalty calculations, like for title marks:
    title-marks
    I'm happy to help compile suggestions for punctuation and their penalty. We may just be able to start off with adding more opening and closing punctuation with the same penalties as parentheses.

  • I wonder if we can play around with adding even more allowances for going over the maxWidth, for situations where dropping text to another line (particularly in the case of one line versus two) causes line widths to be much shorter than the maxWidth. From a design standpoint, a think maxWidth should be considered a suggestion for "ideal" width. From a conceptual standpoint, this would mean creating a more nuanced definition of what is "ideal", beyond balanced lines. (Caveat: this could get hairy once we support linebreaking for text along lines, since designers would have to consider maxWidth and symbol spacing together).

@ChrisLoer @lucaswoj pls let me know if there's anything else you specifically wanted me to test out using our data! Otherwise, I'm 👍 on this from a design perspective.

@ChrisLoer
Copy link
Contributor Author

Thanks so much for the detailed review, @nickidlugash !

You're right about the title marks -- it will be easy to treat them the same way as parentheses, we just need a list of which characters get that special treatment. I could merge them into this PR or always do it in a future PR.

For a future PR, we could explore further improvements to the treatment of parentheses (and other similar punctuation). For instance, we could do lookahead for closing braces, and introduce a penalty for splitting a brace across two lines (right now as long as there's one character between the parenthesis and the line break, our algorithm doesn't give the parenthesis any special treatment).

For both the "make the top lines heavier than the bottom lines" case and the "be willing to overflow max width to avoid a ragged line break" case, we might be able to do better by running the algorithm in multiple passes... 🤔

Copy link
Contributor

@lucaswoj lucaswoj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 on 🍏

@ChrisLoer ChrisLoer force-pushed the cloer_leastbad_linebreak branch 2 times, most recently from 3ba54b9 to 5bf6c62 Compare December 15, 2016 00:17
* Optimize for minimal line width variation on multi-line labels
* Use same algorithm for all character types to support diglossic labels
* Avoid hanging parentheses in ideographic text
@1ec5
Copy link
Contributor

1ec5 commented Dec 15, 2016

You're right about the title marks -- it will be easy to treat them the same way as parentheses, we just need a list of which characters get that special treatment. I could merge them into this PR or always do it in a future PR.

#3811

@mourner mourner deleted the cloer_leastbad_linebreak branch December 15, 2016 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants