Avoid breaking on wrong side of brackets #3811

Open
1ec5 opened this issue Dec 15, 2016 · 2 comments

@1ec5
Contributor

1ec5 commented Dec 15, 2016

Per #3743 (comment), we should weight all left and right brackets, not only ASCII parentheses, to avoid breaking on the wrong side of them. A comprehensive list of such brackets can be found by querying the Unicode Character Database for the following properties:

  • Ps (Punctuation, open)
  • Pe (Punctuation, close)
  • Pi (Punctuation, initial quote)
  • Pf (Punctuation, final quote)

This table may be a good starting point.
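As a rough illustration of the four properties listed above, here is a minimal sketch that classifies a single character using JavaScript's Unicode property escapes (ES2018). The name `bracketCategory` is hypothetical and not part of the codebase, and the result depends on the Unicode version shipped with the JavaScript engine. Note that under these properties `<` and `>` are math symbols (Sm), so they would need to be listed by hand.

```javascript
// Hypothetical helper: returns the general category (Ps, Pe, Pi, Pf)
// of a single character, or null if it has none of the four.
const CATEGORY_TESTS = [
  ['Ps', /^\p{Ps}$/u], // Punctuation, open
  ['Pe', /^\p{Pe}$/u], // Punctuation, close
  ['Pi', /^\p{Pi}$/u], // Punctuation, initial quote
  ['Pf', /^\p{Pf}$/u], // Punctuation, final quote
];

function bracketCategory(ch) {
  for (const [name, pattern] of CATEGORY_TESTS) {
    if (pattern.test(ch)) return name;
  }
  return null;
}
```

For example, `bracketCategory('«')` yields `'Pi'`, while `bracketCategory('<')` yields `null`.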

Of interest to major Western languages are the following brackets:

()[]{}<>«»‹›

These quotation marks may be problematic because they bind on different sides depending on the language. I think we should penalize them only when surrounded by ideographic characters:

“”‘’„

Of interest to CJK are the above, plus:

（）｛｝〔〕〘〙【】《》〈〉〖〗＜＞［］｟｠「」『』｢｣

/ref #3505
/cc @ChrisLoer @nickidlugash

@ChrisLoer
Contributor

Maybe it simplifies things to just disable breaking for all of these when they're adjacent to non-ideographic characters? Although opening/closing punctuation might be a decent breaking point in most Western text, you'd also usually expect to have a space before or after the punctuation...

FWIW, we can query the Unicode character properties table in our code using ICU, but it'll pull in a 35KB data dependency if we do.

@1ec5
Contributor Author

1ec5 commented Dec 16, 2016

Intuitively, I’d expect most of these punctuation marks (the non-ideographic ones) to behave just like the ASCII parentheses that we’ve special-cased. If we’ve special-cased ASCII parentheses for the ideographic case specifically, then I agree that we should only treat them as breaking when they’re in the middle of text that doesn’t use spaces as word separators (particularly Chinese, Japanese, and Thai).

In the absence of word separators, probably all the charHasNeutralVerticalOrientation() characters are breakable, but the brackets are breakable only on one side. So I guess I’d consider that left/right bias to be the criterion for special-casing.
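To make the one-sided rule concrete, here is a minimal sketch, assuming text without word separators where every gap between characters is otherwise a break candidate. `canBreakBetween` is a hypothetical name, and a real implementation would more likely add a penalty than forbid the break outright. It also treats Pi as opening and Pf as closing, which is exactly the assumption that fails for quotation marks in some languages.

```javascript
// Opening brackets bind to the character that follows them;
// closing brackets bind to the character that precedes them.
const OPENING = /^[\p{Ps}\p{Pi}]$/u;
const CLOSING = /^[\p{Pe}\p{Pf}]$/u;

// Hypothetical check: may a line break fall between `prev` and `next`?
function canBreakBetween(prev, next) {
  // Breaking right after an opening bracket would strand it at the line end.
  if (OPENING.test(prev)) return false;
  // Breaking right before a closing bracket would strand it at the line start.
  if (CLOSING.test(next)) return false;
  return true;
}
```

With this rule, a break is allowed before `「` and after `」`, but not after `「` or before `」`.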

I don’t think it’d be necessary to query the entire UCD at runtime. We could manually build a list of qualifying characters based on these properties, just as we did in script_detection.js. Alternatively, we could automate that process in a build step. Either way, we’d only pull in the specific characters we want to assign special weights to.
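The automated variant could look something like the sketch below: a build step scans every code point once, collapses the matches into ranges, and the runtime only ever sees the resulting table. All names here (`collectRanges`, `inRanges`, `openingRanges`, `closingRanges`) are hypothetical, and the table contents depend on the Unicode version of the engine that runs the build.

```javascript
// Build step: collect [start, end] ranges of code points matching `pattern`.
function collectRanges(pattern) {
  const ranges = [];
  for (let cp = 0; cp <= 0x10FFFF; cp++) {
    if (!pattern.test(String.fromCodePoint(cp))) continue;
    const last = ranges[ranges.length - 1];
    if (last && last[1] === cp - 1) {
      last[1] = cp; // extend the current run
    } else {
      ranges.push([cp, cp]); // start a new run
    }
  }
  return ranges;
}

const openingRanges = collectRanges(/^[\p{Ps}\p{Pi}]$/u);
const closingRanges = collectRanges(/^[\p{Pe}\p{Pf}]$/u);

// Runtime: binary search over the sorted, non-overlapping ranges.
function inRanges(ranges, cp) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (cp < ranges[mid][0]) hi = mid - 1;
    else if (cp > ranges[mid][1]) lo = mid + 1;
    else return true;
  }
  return false;
}
```

In a real build step, the two arrays would be serialized (for instance with `JSON.stringify`) into a generated source file, so that no Unicode data needs to ship beyond the characters that get special weights.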
