core(entity-classification): integrate public-suffix-list into LH #15641
Bundle size increase, before vs. after (with psl): [screenshots not included]
seems pretty rad
   * @param {LH.Result.Entities=} entities
   * @return {LH.Result.LhrEntity|string}
   */
  static getEntityFromUrl(url, entities) {
the subtlety here kinda goes over my head..
so we use the new PSL-powered getrootdomain in core/c/entity-classification
but out in report we use this method which leverages the boring-basic getrootdomain..
ya?
this does fix #15623 ? my brain is losing track of when the url data is set vs tweaked for display...
On the audit side, we use the PSL library. We don't want to carry that much JS payload to the report though. Fortunately, there are only a few places on the report side that use getRootDomain(): to convert a URL to its entity name for entity grouping/lookup, and for legacy report third-party filtering, for example. For those, we have already keyed the entity-classification set with root domains as entity names, so I rewrote them as entity-name comparisons.
This weird |string return value is to support pre-10.0 LHRs, where we don't have entity classification data. For those, we fall back to getLegacyRootDomain and convert them to string comparisons.
It does fix #15623. The entity identified is the proper domain (vs. mb.ca before). Did you give it a try?
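To make the "entity-name comparison" idea concrete, here is a minimal sketch of how report-side code might compare two results that are each either an entity object or a legacy root-domain string. The helper names `entityKey` and `isSameParty` are hypothetical and are not taken from the PR; the entity objects are reduced to just a `name` field.

```javascript
// Hypothetical helpers, not part of the PR. They show how a value typed as
// LhrEntity|string can be reduced to a single comparable key.
function entityKey(entityOrDomain) {
  // Pre-10.0 LHRs yield a plain root-domain string; newer LHRs yield an entity.
  return typeof entityOrDomain === 'string' ? entityOrDomain : entityOrDomain.name;
}

function isSameParty(a, b) {
  // Compare by entity name (or legacy root domain) instead of re-deriving
  // the root domain on the report side.
  return entityKey(a) === entityKey(b);
}

// Usage with made-up data:
const mainEntity = {name: 'example.mb.ca'};
const scriptEntity = {name: 'Some CDN'};
console.log(isSameParty(mainEntity, {name: 'example.mb.ca'}));
console.log(isSameParty(mainEntity, scriptEntity));
```

Because both branches reduce to a string, the same comparison works for old and new LHRs without shipping the PSL data to the report.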
ok cool thanks.
Did you give it a try?
nope but i knew you would've.
shared/util.js
Outdated
    }

    const entity = entities.find(e => e.origins.find(origin => url.startsWith(origin)));
    return entity || Util.getLegacyRootDomain(url);
how does pre-v10 LHRs play into it? getLegacyRootDomain is used on either side of the if condition.....
is this L92 use just an extra fallback we don't expect to hit?
We're falling back to getLegacyRootDomain if (1) we don't have entity classification (L87), or (2) the URL cannot be resolved to an entity (L92). I think the latter shouldn't happen unless there's something wrong with the entity-classification dataset. Pre-v10 should proceed via line 88.
Experiments integrating npm:tldts (MIT licensed) to bring in public suffix list based root domain classification. Splits Util.getRootDomain into UrlUtils.getRootDomain that depends on PSL, and replaces report-side with entity recognition based. Preserves existing rootDomains with an explicit `Legacy` prefix to be used for rendering pre-10.0 LHRs.
Co-authored-by: Paul Irish <[email protected]>
a535cc3 to a26f8bc
Updated the PR with review changes.
Addresses #15623.
I took a stab at integrating PSL into LH, with an optimized dataset.
Experiments integrating npm:tldts (MIT licensed) that features a storage-optimized data-set (using Trie), to bring in public suffix list based root domain classification.
Splits Util.getRootDomain into UrlUtils.getRootDomain that depends on PSL, and replaces report-side with entity recognition based.
Preserves existing rootDomains with an explicit Legacy prefix to be used for rendering pre-10.0 LHRs.
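For readers unfamiliar with why a public suffix list matters here, the following sketch contrasts a naive "last two labels" root domain with a suffix-aware lookup. It is not how tldts works internally (the PR notes that tldts uses a trie); the tiny `SUFFIXES` set and both function names are assumptions for illustration, and wildcard and exception rules from the real list are ignored.

```javascript
// Tiny illustrative subset of public suffixes (assumption, not the real list).
const SUFFIXES = new Set(['ca', 'mb.ca', 'com', 'uk', 'co.uk']);

// Naive approach: always keep the last two labels.
function naiveRootDomain(hostname) {
  return hostname.split('.').slice(-2).join('.');
}

// Suffix-aware approach: find the longest known suffix, then keep one more label.
function pslRootDomain(hostname) {
  const labels = hostname.toLowerCase().split('.');
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join('.');
    if (SUFFIXES.has(candidate)) {
      // If the whole hostname is itself a suffix, there is no registrable domain.
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null; // No known suffix matched.
}

console.log(naiveRootDomain('www.example.mb.ca'));
console.log(pslRootDomain('www.example.mb.ca'));
```

The naive version collapses every site under mb.ca into one "root domain", which is the kind of misclassification reported in #15623; the suffix-aware version keeps the registrable label.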