core(font-size): precise text length calculation #6973

Merged
merged 4 commits into from
Jan 17, 2019

castilloandres
Contributor

@castilloandres castilloandres commented Jan 9, 2019

Summary
What kind of change does this PR introduce?
Count Unicode code points (JavaScript string symbols) rather than UTF-16 code units.

Is this a bugfix, feature, refactoring, build related change, etc?
Refactoring.

Describe the need for this change
To calculate text lengths in the font-size gatherer more precisely.

Link any documentation or information that would help understand this change

> '𝐀'.length
2
> Array.from('𝐀').length
1
> '💩'.length
2
> Array.from('💩').length
1
> '\u{2F815}'.length
2
> Array.from('\u{2F815}').length
1
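
For reference, the effect on the gatherer's helper can be sketched as below. This is only an illustration: the node objects are hypothetical stand-ins for DOM text nodes, and the `Old`/`New` suffixes are added here to show both versions side by side.

```javascript
// Hypothetical stand-ins for DOM text nodes: only `nodeValue` is needed here.
const nodes = [
  {nodeValue: '  hello  '},
  {nodeValue: '\u{1D400}\u{1F4A9}'}, // U+1D400 and U+1F4A9, both outside the BMP
  {nodeValue: null},
];

// Before the change: counts UTF-16 code units.
function getNodeTextLengthOld(node) {
  return !node.nodeValue ? 0 : node.nodeValue.trim().length;
}

// After the change: counts Unicode code points.
function getNodeTextLengthNew(node) {
  return !node.nodeValue ? 0 : Array.from(node.nodeValue.trim()).length;
}

const oldLengths = nodes.map(getNodeTextLengthOld);
const newLengths = nodes.map(getNodeTextLengthNew);
console.log(oldLengths); // [ 5, 4, 0 ]
console.log(newLengths); // [ 5, 2, 0 ]
```

Plain ASCII text is counted the same either way; only code points outside the Basic Multilingual Plane change the result.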

Collaborator

@patrickhulce patrickhulce left a comment

Many thanks for the contribution @castilloandres!!

I wonder how far down the rabbit hole we should go. For other reviewers this article was a fun read on the topic.

Array.from("👩‍❤️‍💋‍👩").length === 8

Basic Unicode coverage seems like a pretty good start though 👍

Member

@exterkamp exterkamp left a comment

LGTM, very interesting looking into the fun and terrifying world of unicode and text encodings 😨

@@ -182,7 +182,7 @@ function getEffectiveFontRule({inlineStyle, matchedCSSRules, inherited}) {
  * @returns {number}
  */
 function getNodeTextLength(node) {
-  return !node.nodeValue ? 0 : node.nodeValue.trim().length;
+  return !node.nodeValue ? 0 : Array.from(node.nodeValue.trim()).length;
Member

Can we get a comment mentioning this PR or explaining why this is here? E.g.

// Use Array.from in order to more accurately count unicode characters. See: #6973

Maybe even add a TODO for @patrickhulce's comment on future rabbit hole exploring.

@castilloandres
Contributor Author

castilloandres commented Jan 15, 2019

I wonder how far down the rabbit hole we should go.

@patrickhulce @exterkamp I think we should go as deep as readable text goes. Apart from emoji sequences, this should count text in most non-Latin scripts more or less correctly 👍

@@ -182,7 +182,8 @@ function getEffectiveFontRule({inlineStyle, matchedCSSRules, inherited}) {
  * @returns {number}
  */
 function getNodeTextLength(node) {
-  return !node.nodeValue ? 0 : node.nodeValue.trim().length;
+  // Array.from to count JS symbols not unicode code points. See: #6973
Collaborator

JS Symbols => character ?

Contributor Author

@castilloandres castilloandres Jan 16, 2019

@hoten That depends on what counts as a character. I would use terms like graphemes, symbols, code points, and bytes rather than "character" to avoid confusion. Here is a short video with some interesting use cases.
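
To make the distinction between these terms concrete, here is a small sketch that measures one string three ways. The sample string is made up for illustration; a reader would perceive it as three graphemes, which none of these counts reports.

```javascript
// "a", U+1F4A9 (pile of poo), then "n" followed by U+0303 COMBINING TILDE.
const text = 'a\u{1F4A9}n\u0303';

const bytes = new TextEncoder().encode(text).length; // UTF-8 bytes
const codeUnits = text.length;                        // UTF-16 code units
const codePoints = Array.from(text).length;           // Unicode code points (symbols)

console.log({bytes, codeUnits, codePoints});
// { bytes: 8, codeUnits: 5, codePoints: 4 }
```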

@Mickael-van-der-Beek

Mickael-van-der-Beek commented Jan 16, 2019

@patrickhulce

  1. Emoji Sequences

The "👩‍❤️‍💋‍👩" emoji you added is an emoji sequence [1] as defined in the Unicode spec.
If you decompose it into code points and then into UTF-16 code units, you get:

[0] 👩
	[0] � (0xd83d)
	[1] � (0xdc69)

[1] 0x200d - emoji_zwj_sequence [2]
	[0] ‍ (0x200d)

[2] ❤
	[0] ❤ (0x2764)

[3] 0xfe0f - emoji_presentation_selector [3]
	[0] ️ (0xfe0f)

[4] 0x200d - emoji_zwj_sequence [2]
	[0] ‍ (0x200d)

[5] 💋
	[0] � (0xd83d)
	[1] � (0xdc8b)

[6] 0x200d - emoji_zwj_sequence [2]
	[0] ‍ (0x200d)

[7] 👩
	[0] � (0xd83d)
	[1] � (0xdc69)

[1] emoji_sequence: http://unicode.org/reports/tr51/#def_emoji_sequence
[2] emoji_zwj_sequence: http://www.unicode.org/reports/tr51/#def_emoji_zwj_sequence
[3] emoji_presentation_selector: http://www.unicode.org/reports/tr51/#def_emoji_presentation_selector

So if you wanted to go deeper into the "count graphemes" rabbit hole, you could count emoji sequences as one.
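
The breakdown above can be reproduced with a few lines of JavaScript. This is a minimal sketch: `decompose` is a hypothetical helper name, and the string is built from escapes so that the sequence is unambiguous.

```javascript
// woman, ZWJ, heart, VS16, ZWJ, kiss mark, ZWJ, woman
const kiss = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}';

// Hypothetical helper: list each code point together with its UTF-16 code units.
function decompose(str) {
  return Array.from(str, (symbol) => ({
    codePoint: symbol.codePointAt(0).toString(16),
    codeUnits: symbol.split('').map((unit) => unit.charCodeAt(0).toString(16)),
  }));
}

const parts = decompose(kiss);
for (const [index, part] of parts.entries()) {
  console.log(`[${index}] U+${part.codePoint.toUpperCase()} -> ${part.codeUnits.join(' ')}`);
}
console.log(parts.length); // 8 code points
console.log(kiss.length);  // 11 UTF-16 code units
```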

  2. Combining Diacritics

Another step could be to merge characters followed by combining diacritics, like "ñ", which can actually be represented as "n" followed by a combining tilde:

> Array.from("ñ")
[ 'n', '̃' ]

Both cases can be handled fairly well with the npm package grapheme-splitter:

https://github.com/orling/grapheme-splitter
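
As a dependency-free alternative to that package, grapheme clusters can also be counted with the built-in `Intl.Segmenter`. The sketch below uses it instead of grapheme-splitter and assumes a runtime that provides it (recent Node.js and browser versions do); `countGraphemes` is a hypothetical helper name.

```javascript
// Assumes a runtime that provides Intl.Segmenter.
const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});

// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str) {
  return Array.from(segmenter.segment(str)).length;
}

const decomposedN = 'n\u0303'; // "n" followed by U+0303 COMBINING TILDE
const kiss = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}';

console.log(Array.from(decomposedN).length); // 2 code points
console.log(countGraphemes(decomposedN));    // 1 grapheme
console.log(Array.from(kiss).length);        // 8 code points
console.log(countGraphemes(kiss));           // 1 grapheme
```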

  3. Zero-width characters

Counting or discounting zero-width characters like U+200B could also be interesting, but would require the font file to be loaded and read by an npm package like opentype.js:

https://github.com/opentypejs/opentype.js

Once the font data is loaded, each code point maps to a glyph with an advance width field, which tells you how far the cursor advances after the glyph has been drawn.
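
Without loading a font, a rough approximation is to skip a fixed list of code points that normally render with no width. The sketch below is not the opentype.js approach described above: the set is hand-picked and incomplete, and `countVisibleCodePoints` is a hypothetical helper name. The real width of any code point still depends on the font.

```javascript
// A hand-picked, incomplete set of code points that normally have no width.
const ZERO_WIDTH = new Set([
  0x200b, // ZERO WIDTH SPACE
  0x200c, // ZERO WIDTH NON-JOINER
  0x200d, // ZERO WIDTH JOINER
  0x2060, // WORD JOINER
  0xfeff, // ZERO WIDTH NO-BREAK SPACE
]);

// Hypothetical helper: count code points that are not in the set above.
function countVisibleCodePoints(str) {
  return Array.from(str)
    .filter((symbol) => !ZERO_WIDTH.has(symbol.codePointAt(0)))
    .length;
}

const text = 'foo\u200Bbar';
console.log(Array.from(text).length);      // 7
console.log(countVisibleCodePoints(text)); // 6
```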

@patrickhulce
Collaborator

Thanks very much for the explanation @Mickael-van-der-Beek! That's the example I pulled from the article I linked, which came to similar conclusions as yours :)

Member

@brendankenny brendankenny left a comment

𝐋𝐆𝐓𝐌

@brendankenny brendankenny merged commit 22e7bc5 into GoogleChrome:master Jan 17, 2019