Fix for issue #3658: improve line breaking #3743
I love the idea of assigning relative priorities like this, and props for taking inspiration from LaTeX. Eventually, we'll definitely want to expand the left-right distinction that you have for parentheses to cover all brackets; fortunately Unicode has metadata to help us out there. There are also some trickier cases: for example, a line break is disallowed inside "U.S.A." but allowed after it. A line break is allowed after a colon that's followed by a space, but not one that's followed by a letter (because that indicates pluralization in Swedish). But this looks like a great start.
Come to think of it, these punctuation marks should be fine as is because they aren’t included in
Yup. I suspect we don't have to be that aggressive about breaking on that kind of punctuation, even though we could in many cases. If we want to add more potential break points, I think the biggest challenge in front of us is languages like Thai that need a word-breaking dictionary.
Thank you for tackling this project @ChrisLoer. I'm excited to 🚢 this as soon as we are confident that it doesn't regress any important cases. I noticed that the ideographic labels are no longer perfectly balanced (rectangle-shaped) in the test suite diff associated with this branch. Is there an easy way to have the old balanced breaking behaviour alongside this new approach?
For the text-max-width/ideographic-breaking test, I believe there are three changes:
I'd like to do better for that top-middle case, but I don't see an easy way to do it with this approach. Keeping this algorithm polynomial in the number of line breaks requires us to have a "target width" that we set ahead of time. We could try something like this: if all characters in the label are full-width, then round the target width up to the nearest multiple of a full-width character size -- this would cause the algorithm to favor filling out the upper lines and shortening the last one... but we'd be partway back to where we started in that adding punctuation or numbers would change line breaking behavior more than expected...
Thanks for the clarification @ChrisLoer. The design decisions make sense to me 👍. I'm going to defer to @nickidlugash and @xrwang for judgement on what, if anything, is blocking this from 🚢 from a design perspective.
I experimented with the "round target width up to nearest full-width character size" strategy. For ideographic text without punctuation, it gets the result we're looking for ("rounding up" is on the left, the changes are in top-middle and bottom-right): However, it leads to what I think are suboptimal results with a little bit of punctuation: The most instructive case is bottom-middle -- the parentheses increase the average line width just a little bit past "four ideographic characters", but that gets rounded all the way up to "five characters" which throws everything off. We could do the "rounding up" only for runs of entirely full-width characters, but I think it would be a fragile special case and I suspect it's not necessary.
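The bottom-middle failure comes down to a small piece of arithmetic, sketched below. This is only an illustration: `roundUpToFullWidth` and the 24-unit character size are hypothetical and are not taken from the branch.

```typescript
// Hypothetical helper: round a target line width up to the next multiple of
// the full-width character size. Names and units are assumptions.
function roundUpToFullWidth(averageLineWidth: number, fullWidthCharSize: number): number {
    return Math.ceil(averageLineWidth / fullWidthCharSize) * fullWidthCharSize;
}

// Assume each ideograph is 24 units wide.
// A label averaging exactly four ideographs per line keeps its target:
console.log(roundUpToFullWidth(96, 24));  // 96 (four characters)
// Parentheses nudge the average slightly past four characters, and the
// target jumps a whole character:
console.log(roundUpToFullWidth(100, 24)); // 120 (five characters)
```

A 4-unit change in the average moves the target by 24 units, which is why a little punctuation can rearrange every line of the label.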
It's not very scientific, but to eyeball the performance implications, I profiled a worker while panning through an area of the map that had several long (>15 character) Chinese labels. In my test run,
I reviewed this PR with our osm data in English, osm data in Chinese, and alternative data in Chinese. Overall, I think this is definitely an improvement to existing results. With Chinese, the biggest improvements I see are with mixed labels (Chinese + Latin characters/numerals) and parenthetical labels: The treatment of parenthetical labels also seems more visually optimal than the carto team's quick fix of some of our data, where we hard coded a linebreak before opening parentheses: For English labels, I'm seeing a fair number of differences between this and the current implementation. The vast majority of cases are either improvements, or not significant either way. The main differences I see are a reduction in number of lines from either 5 to 4, 4 to 3, or 3 to 2: (It's not that useful to evaluate this before and after in isolation of other labels in that style layer, but just wanted to include it as a visual. When looking at the map though, this "after" looked closer in design intent to the other shorter labels than the "before" did). In addition to these visual improvements, to me this seems like a conceptually solid backbone for our line breaking algorithm. I see a few areas that I think it would be great for us to improve upon in subsequent PRs:
@ChrisLoer @lucaswoj pls let me know if there's anything else you specifically wanted me to test out using our data! Otherwise, I'm 👍 on this from a design perspective.
Thanks so much for the detailed review, @nickidlugash! You're right about the title marks -- it will be easy to treat them the same way as parentheses, we just need a list of which characters get that special treatment. I could merge them into this PR or do it in a future PR. For a future PR, we could explore further improvements to the treatment of parentheses (and other similar punctuation). For instance, we could do lookahead for closing braces, and introduce a penalty for splitting a brace across two lines (right now as long as there's one character between the parenthesis and the line break, our algorithm doesn't give the parenthesis any special treatment). For both the "make the top lines heavier than the bottom lines" case and the "be willing to overflow max width to avoid a ragged line break" case, we might be able to do better by running the algorithm in multiple passes... 🤔
🚢 on 🍏
* Optimize for minimal line width variation on multi-line labels
* Use same algorithm for all character types to support diglossic labels
* Avoid hanging parentheses in ideographic text
This is an experimental new line breaking algorithm aiming to create as general a fix as possible to the problems discussed in issue #3658.
The general approach borrows heavily from the line breaking algorithm used by LaTeX -- we analyze a label as a graph of potential line breaks, and then assign each potential break a "badness" score based primarily on how much the length of the line deviates from the average line width of the label. We then choose the set of line breaks that minimizes "badness" across the whole label.
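To make the idea concrete, here is a minimal sketch of that kind of search. It is not the code in this branch: `leastBadBreaks`, its parameters, and the plain squared-deviation cost are assumptions chosen for illustration, and it ignores per-character penalties such as the parenthesis handling.

```typescript
// Hypothetical sketch: function and parameter names are illustrative, and the
// squared-deviation cost is an assumption rather than the PR's exact formula.
// Assumes maxWidth > 0 and that widths share the same units as maxWidth.
function leastBadBreaks(segmentWidths: number[], maxWidth: number): number[] {
    const n = segmentWidths.length;
    if (n === 0) return [];

    // prefix[i] is the combined width of segments 0..i-1.
    const prefix: number[] = [0];
    for (let i = 0; i < n; i++) prefix.push(prefix[i] + segmentWidths[i]);

    // Pick the line count implied by maxWidth, then aim for equal-width lines.
    const total = prefix[n];
    const lineCount = Math.max(1, Math.ceil(total / maxWidth));
    const targetWidth = total / lineCount;

    // best[j]: lowest total badness for segments 0..j-1 when a line ends at j.
    // prev[j]: the break position that precedes j on that best path.
    const best: number[] = [0];
    const prev: number[] = [0];
    for (let end = 1; end <= n; end++) {
        best.push(Infinity);
        prev.push(0);
        for (let start = 0; start < end; start++) {
            const deviation = (prefix[end] - prefix[start]) - targetWidth;
            const cost = best[start] + deviation * deviation;
            if (cost < best[end]) {
                best[end] = cost;
                prev[end] = start;
            }
        }
    }

    // Walk back from the end of the label to recover the chosen breaks.
    const breaks: number[] = [];
    for (let i = prev[n]; i > 0; i = prev[i]) breaks.unshift(i);
    return breaks;
}

// Eight equal-width characters, at most four per line: break after the 4th.
console.log(leastBadBreaks([1, 1, 1, 1, 1, 1, 1, 1], 4));    // breaks: [4]
// Nine characters: three lines of three instead of a greedy 4 + 4 + 1.
console.log(leastBadBreaks([1, 1, 1, 1, 1, 1, 1, 1, 1], 4)); // breaks: [3, 6]
```

Each returned number is the count of segments that precede a break. The sketch only uses `maxWidth` to derive the target width and never rejects a line for exceeding it.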
The advantages of this approach are:
Finding the line breaks is O(n^2) in the number of potential line breaks (and ideographic text is basically all potential line breaks) -- it doesn't seem to be a performance issue, but if we need to we can put an upper limit on that by using a sliding window.
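One way the sliding-window cap could look is sketched below, purely as an assumption about how it might be done. `windowedBreaks` and `windowSize` are hypothetical names, and the target width is passed in directly to keep the example short.

```typescript
// Hypothetical sketch of a windowed search: each line end only considers the
// previous `windowSize` break candidates, so the work is O(n * windowSize)
// rather than O(n^2). Assumes windowSize is an integer >= 1.
function windowedBreaks(segmentWidths: number[], targetWidth: number, windowSize: number): number[] {
    const n = segmentWidths.length;

    // prefix[i] is the combined width of segments 0..i-1.
    const prefix: number[] = [0];
    for (let i = 0; i < n; i++) prefix.push(prefix[i] + segmentWidths[i]);

    const best: number[] = [0];
    const prev: number[] = [0];
    for (let end = 1; end <= n; end++) {
        best.push(Infinity);
        prev.push(0);
        // Only look back over the most recent candidates.
        const firstStart = Math.max(0, end - windowSize);
        for (let start = firstStart; start < end; start++) {
            const deviation = (prefix[end] - prefix[start]) - targetWidth;
            const cost = best[start] + deviation * deviation;
            if (cost < best[end]) {
                best[end] = cost;
                prev[end] = start;
            }
        }
    }

    const breaks: number[] = [];
    for (let i = prev[n]; i > 0; i = prev[i]) breaks.unshift(i);
    return breaks;
}

const widths = [1, 1, 1, 1, 1, 1, 1, 1];
// A window that covers a full line gives the same answer as the full search.
console.log(windowedBreaks(widths, 4, 8)); // breaks: [4]
// A window shorter than a line forces lines of at most two segments.
console.log(windowedBreaks(widths, 4, 2)); // breaks: [2, 4, 6]
```

The second call shows the trade-off: the window has to span at least one target-width line's worth of candidates, otherwise the cap changes the result instead of just bounding the work.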
When I was discussing desiderata with @nickidlugash, one idea she suggested was to have the last line "shorten" to allow all the lines above it to remain the same length. Although this algorithm does encourage the last line to shorten first, it is still possible in many cases to end up with lower lines longer than upper lines (in the interests of minimizing overall raggedness).
The algorithm does not strictly enforce maxWidth because in cases where maxWidth is very close to the average line length, we get better results by going slightly over the limit.
To see some examples of the new algorithm applied to ideographic text with punctuation, see: https://github.com/mapbox/mapbox-gl-test-suite/blob/cloer_leastbad_linebreak/render-tests/text-max-width/ideographic-punctuation-breaking/expected.png
cc @ian29 @1ec5 @ansis @lucaswoj