core(font-size): precise text length calculation #6973

castilloandres · 2019-01-09T22:50:55Z

Summary
What kind of change does this PR introduce?
Count Unicode code points (JavaScript string symbols) rather than UTF-16 code units.

Is this a bugfix, feature, refactoring, build related change, etc?
Refactoring.

Describe the need for this change
To calculate text lengths on the font size gatherer with precision.

Link any documentation or information that would help understand this change

> '𝐀'.length
2
> Array.from('𝐀').length
1
> '💩'.length
2
> Array.from('💩').length
1
> '\u{2F815}'.length  // U+2F815, a supplementary-plane CJK compatibility ideograph
2
> Array.from('\u{2F815}').length
1

patrickhulce

Many thanks for the contribution @castilloandres!!

I wonder how far down the rabbit hole we should go. For other reviewers this article was a fun read on the topic.

Array.from("👩‍❤️‍💋‍👩").length === 8

Basic unicode coverage seems like a pretty good start though 👍

exterkamp

LGTM, very interesting looking into the fun and terrifying world of unicode and text encodings 😨

exterkamp · 2019-01-14T17:51:32Z

lighthouse-core/gather/gatherers/seo/font-size.js

@@ -182,7 +182,7 @@ function getEffectiveFontRule({inlineStyle, matchedCSSRules, inherited}) {
 * @returns {number}
 */
 function getNodeTextLength(node) {
-  return !node.nodeValue ? 0 : node.nodeValue.trim().length;
+  return !node.nodeValue ? 0 : Array.from(node.nodeValue.trim()).length;


Can we get a comment mentioning this PR or explaining why this is here? For example:

// Use Array.from in order to more accurately count unicode characters. See: #6973

Maybe even add a TODO for @patrickhulce's comment on future rabbit hole exploring.
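
A minimal sketch of how the commented function could read. It assumes a plain object with a `nodeValue` property stands in for the DOM text node, and the sample inputs and the TODO wording are hypothetical, not the text that was merged.

```javascript
/**
 * @param {{nodeValue: string|null}} node
 * @returns {number}
 */
function getNodeTextLength(node) {
  // Array.from iterates the string by Unicode code point, so an astral
  // character such as an emoji counts once instead of twice. See: #6973
  // TODO: grapheme clusters (emoji ZWJ sequences, combining marks) still
  // count as several units.
  return !node.nodeValue ? 0 : Array.from(node.nodeValue.trim()).length;
}

// Hypothetical stand-ins for text nodes.
console.log(getNodeTextLength({nodeValue: '  \u{1F4A9}  '})); // 1
console.log(getNodeTextLength({nodeValue: null}));            // 0
```

With the previous `.length` implementation, the first call would have returned 2.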

castilloandres · 2019-01-15T13:37:24Z

"I wonder how far down the rabbit hole we should go."

@patrickhulce @exterkamp I think we should go as deep as readable text goes. Apart from emoji, this should count foreign character sets more or less correctly 👍

connorjclark · 2019-01-15T23:48:07Z

lighthouse-core/gather/gatherers/seo/font-size.js

@@ -182,7 +182,8 @@ function getEffectiveFontRule({inlineStyle, matchedCSSRules, inherited}) {
 * @returns {number}
 */
 function getNodeTextLength(node) {
-  return !node.nodeValue ? 0 : node.nodeValue.trim().length;
+  // Array.from to count JS symbols not unicode code points. See: #6973


JS Symbols => character ?

castilloandres

@hoten It depends on what counts as a character. I would use terms like graphemes, symbols, code points, and bytes rather than "character" to avoid confusion. Here is a short video with some interesting use cases.
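
To make the terminology concrete, here is a small sketch that measures one string in three of those units. The helper name `measure` is hypothetical, and it assumes a runtime with a global `TextEncoder` (modern browsers and recent Node.js). Grapheme counting needs segmentation rules and is not covered by this sketch.

```javascript
// Measure one string in three different units.
function measure(text) {
  return {
    codeUnits: text.length,                           // UTF-16 code units
    codePoints: Array.from(text).length,              // Unicode code points
    utf8Bytes: new TextEncoder().encode(text).length, // bytes when encoded as UTF-8
  };
}

console.log(measure('\u{1F4A9}')); // { codeUnits: 2, codePoints: 1, utf8Bytes: 4 }
console.log(measure('n\u0303'));   // { codeUnits: 2, codePoints: 2, utf8Bytes: 3 }
```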

Mickael-van-der-Beek · 2019-01-16T14:29:44Z

@patrickhulce

Emoji Sequences

The "👩‍❤️‍💋‍👩" emoji you added is an emoji sequence [1] as defined in the Unicode spec.
If you decompose it into code points and then into UTF-16 code units, you get:

[0] 👩
	[0] 0xd83d (high surrogate)
	[1] 0xdc69 (low surrogate)

[1] 0x200d - emoji_zwj_sequence [2]
	[0] ‍ (0x200d)

[2] ❤
	[0] ❤ (0x2764)

[3] 0xfe0f - emoji_presentation_selector [3]
	[0] ️ (0xfe0f)

[4] 0x200d - emoji_zwj_sequence [2]
	[0] ‍ (0x200d)

[5] 💋
	[0] 0xd83d (high surrogate)
	[1] 0xdc8b (low surrogate)

[6] 0x200d - emoji_zwj_sequence [2]
	[0] ‍ (0x200d)

[7] 👩
	[0] 0xd83d (high surrogate)
	[1] 0xdc69 (low surrogate)

[1] emoji_sequence: http://unicode.org/reports/tr51/#def_emoji_sequence
[2] emoji_zwj_sequence: http://www.unicode.org/reports/tr51/#def_emoji_zwj_sequence
[3] emoji_presentation_selector: http://www.unicode.org/reports/tr51/#def_emoji_presentation_selector

So if you wanted to go deeper into the "count graphemes" rabbit hole, you could count emoji sequences as one.
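
The decomposition above can be reproduced programmatically. This is a sketch only: the helper name `decompose` is hypothetical, and the sequence is written with escapes so the invisible joiners and the variation selector stay visible in source.

```javascript
// The emoji ZWJ sequence discussed above:
// woman, ZWJ, heart, VS16, ZWJ, kiss mark, ZWJ, woman.
const kiss = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}';

// For each code point, list the UTF-16 code units that encode it.
function decompose(text) {
  return Array.from(text, (symbol) => ({
    codePoint: 'U+' + symbol.codePointAt(0).toString(16).toUpperCase(),
    codeUnits: symbol.split('').map((unit) => '0x' + unit.charCodeAt(0).toString(16)),
  }));
}

const parts = decompose(kiss);
console.log(parts.length); // 8 code points
console.log(parts.reduce((n, p) => n + p.codeUnits.length, 0)); // 11 code units
console.log(parts[0]); // { codePoint: 'U+1F469', codeUnits: [ '0xd83d', '0xdc69' ] }
```

The 8 matches `Array.from(...).length` from earlier in the thread, while `kiss.length` reports the 11 code units.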

Combining Diacritics

Another step could be to merge characters followed by combining diacritics, like ñ, which in its decomposed form is actually represented as two code points:

> Array.from("n\u0303")
[ 'n', '̃' ]
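
One partial mitigation, sketched here as an assumption rather than something this PR does, is to normalize to NFC before counting. `String.prototype.normalize` is standard JavaScript; the variable names are illustrative.

```javascript
const decomposed = 'n\u0303';                 // n + COMBINING TILDE
const composed = decomposed.normalize('NFC'); // single code point U+00F1

console.log(Array.from(decomposed).length); // 2
console.log(Array.from(composed).length);   // 1
console.log(composed === '\u00F1');         // true

// Not every base + mark pair has a precomposed form, so NFC narrows the
// gap but does not make code points equal graphemes.
const noPrecomposed = 'q\u0303'.normalize('NFC');
console.log(Array.from(noPrecomposed).length); // 2
```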

The two cases can be handled using the NPM package GraphemeSplitter pretty well:

https://github.com/orling/grapheme-splitter
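
As a dependency-free alternative, the built-in `Intl.Segmenter` can count grapheme clusters. Note that this is not the grapheme-splitter API; the sketch assumes a runtime that ships `Intl.Segmenter` (for example Node.js 16 or later), and `countGraphemes` is a hypothetical helper name.

```javascript
// Count user-perceived characters (extended grapheme clusters).
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, {granularity: 'grapheme'});
  let count = 0;
  for (const _segment of segmenter.segment(text)) count++;
  return count;
}

// Emoji ZWJ sequence from the discussion above, written with escapes.
const kissSequence = '\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}';

console.log(countGraphemes(kissSequence)); // 1
console.log(countGraphemes('n\u0303'));    // 1
console.log(countGraphemes('abc'));        // 3
```

Both cases that `Array.from` over-counts (8 and 2 code points respectively) come out as a single grapheme here.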

Zero-Width Characters

Counting or discounting 0-width characters like 0x200b could also be interesting but would require the font file to be loaded and read by an NPM package like Opentype.js:

https://github.com/opentypejs/opentype.js

Once the font data is loaded, each glyph has an advance width that tells how far the cursor advances (to the right, in left-to-right text) after the glyph is drawn.
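
A much cruder approximation that avoids loading fonts is to skip a fixed list of code points known to be zero-width. This swaps the font-based approach for a heuristic: the list below is a hypothetical, incomplete selection, and the sketch does not use Opentype.js at all.

```javascript
// Hypothetical fixed list: ZWSP, ZWNJ, ZWJ, WORD JOINER, ZWNBSP/BOM.
// A font-based check would consult glyph advance widths instead.
const ZERO_WIDTH = new Set([0x200b, 0x200c, 0x200d, 0x2060, 0xfeff]);

function countVisibleCodePoints(text) {
  let count = 0;
  // for...of iterates a string by code point, like Array.from.
  for (const symbol of text) {
    if (!ZERO_WIDTH.has(symbol.codePointAt(0))) count++;
  }
  return count;
}

console.log(Array.from('a\u200Bb').length);      // 3
console.log(countVisibleCodePoints('a\u200Bb')); // 2
```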

patrickhulce · 2019-01-16T15:22:13Z

Thanks very much for the explanation @Mickael-van-der-Beek! That's the example I pulled from the article I linked which came to similar conclusions as yours :)

brendankenny

𝐋𝐆𝐓𝐌

castilloandres added 2 commits January 9, 2019 22:40

Count JS symbols + update tests

f13f932

Fix line length

fce3f1c

castilloandres requested review from patrickhulce and paulirish as code owners January 9, 2019 22:50

patrickhulce approved these changes Jan 14, 2019

exterkamp approved these changes Jan 14, 2019

Add comment

2dcc46a

connorjclark reviewed Jan 15, 2019

Update comment

c507962

brendankenny approved these changes Jan 17, 2019

brendankenny merged commit 22e7bc5 into GoogleChrome:master Jan 17, 2019
