-
Notifications
You must be signed in to change notification settings - Fork 9.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: correctly truncate unicode strings #14911
Conversation
} else if (firstRuleEnd < PREVIEW_LENGTH) { | ||
// The entire first rule-set fits within the preview | ||
preview = preview.slice(0, firstRuleEnd + 1) + ' ...'; | ||
preview = preview.slice(0, firstRuleEnd + 1) + ' …'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use Util.truncate(0, firstRuleEnd, ' …')
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because firstRuleEnd
is a specific location in the string, we should not truncate by characters here.
core/lib/lh-error.js
Outdated
* @typedef {{sentinel: '__ErrorSentinel', message: string, code?: string, stack?: string}} SerializedBaseError | ||
*/ | ||
|
||
class LighthouseError extends Error { | ||
/** | ||
* @param {LighthouseErrorDefinition} errorDefinition | ||
* @param {Record<string, string|undefined>=} properties | ||
* @param {Record<string, unknown>=} properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initially I needed this change when the base tsconfig target was updated to 2022, reverted in de6a394 . the base tsconfig didn't need the change anymore after I moved the truncate function to shared/util.js
(it was in page-functions.js
). Might as well keep this more correct type, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a tsc bug that spreading unknown down on L125 doesn't result in an error, though. properties
are ICU placeholder values and the constructor shouldn't allow anything but strings through if at all possible, so ideally we can hold off on using unknown here unless it's absolutely required.
My only concern is that So for that cost, do we really need this for the handful of cases where grapheme clusters are significant (thanks Mathias for the term)? Most unicode will be fine, this is mostly down in html attributes and other non-prominent strings, and it'll just be strings like |
i was skeptical too. numberformat has been a pain for us. I was able to capture a profile of just axe + getnodedetails (in this branch): https://trace.cafe/t/Rd8274lf8y i made this snippet to just test from within devtools: https://gist.github.com/paulirish/e46ec350be36cd23047c350602c47435 given what i'm seeing here.. so far I feel okay about the perf side. 14ms isn't nothing, but.. it seems like it's a one-time cost? |
Co-authored-by: Paul Irish <[email protected]>
That could be the ICU startup time, and if that is the case, seems like V8 is caching it for us, which is great. We should be wary of ever doing this for the first time during tracing. If it looks good on perf, great to be doing the best segmenting here! |
@@ -8,6 +8,7 @@ import {ByteEfficiencyAudit} from './byte-efficiency-audit.js'; | |||
import * as i18n from '../../lib/i18n/i18n.js'; | |||
import {computeJSTokenLength as computeTokenLength} from '../../lib/minification-estimator.js'; | |||
import {getRequestForScript, isInline} from '../../lib/script-helpers.js'; | |||
import {Util} from '../../../shared/util.js'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels weird that this isn't a direct export
@@ -101,16 +101,16 @@ class UnusedCSS { | |||
firstRuleStart > firstRuleEnd || | |||
firstRuleStart > PREVIEW_LENGTH) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if PREVIEW_LENGTH
is in graphemes, it won't be comparable to preview.length
, firstRuleStart
, firstRuleEnd
, etc. Not sure how much we care, but definitely some wonkiness in here as a result (e.g. firstRuleStart
could be greater than PREVIEW_LENGTH
in code points but less than PREVIEW_LENGTH
in graphemes), which makes the code harder to understand/alter in the future.
core/lib/lh-error.js
Outdated
* @typedef {{sentinel: '__ErrorSentinel', message: string, code?: string, stack?: string}} SerializedBaseError | ||
*/ | ||
|
||
class LighthouseError extends Error { | ||
/** | ||
* @param {LighthouseErrorDefinition} errorDefinition | ||
* @param {Record<string, string|undefined>=} properties | ||
* @param {Record<string, unknown>=} properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a tsc bug that spreading unknown down on L125 doesn't result in an error, though. properties
are ICU placeholder values and the constructor shouldn't allow anything but strings through if at all possible, so ideally we can hold off on using unknown here unless it's absolutely required.
@@ -138,7 +138,7 @@ class UrlUtils { | |||
static elideDataURI(url) { | |||
try { | |||
const parsed = new URL(url); | |||
return parsed.protocol === 'data:' ? url.slice(0, 100) : url; | |||
return parsed.protocol === 'data:' ? Util.truncate(url, 100) : url; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW data URLs can only contain ascii characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, but at least inclusion of the ellipse suffix is an improvement here.
const iterator = segmenter.segment(string)[Symbol.iterator](); | ||
|
||
let lastSegment; | ||
for (let i = 0; i <= characterLimit - ellipseSuffix.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically ellipseSuffix.length
might not be grapheme length, but I guess we can assume we won't pass in suffixes without a 1:1 code point:grapheme ratio? :)
In various places, we truncate web strings using
.slice
, which can result in eliding in the middle of a grapheme (visual character). This is typically fine for JSON serialization (it just won't look great), but in PSI there is a serialization step that breaks under this input.This PR improves all the places (but maybe I missed some?) that truncate strings that end up in the LHR to instead elide using
Intl.Segmenter
. It also changes the result a bit, by considering the suffix added to the truncated strings as part of the max character length - and also, using an ellipse in some places that instead used...
.reprise of #11697 #11698
fixes #14897