Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core(entity-classification): integrate public-suffix-list into LH #15641

Merged
merged 4 commits into from
Dec 1, 2023

Conversation

alexnj
Copy link
Member

@alexnj alexnj commented Nov 29, 2023

Addresses #15623.

I took a stab at integrating PSL into LH, with an optimized dataset.

Experiments integrating npm:tldts (MIT licensed) that features a storage-optimized data-set (using Trie), to bring in public suffix list based root domain classification.

  • Splits Util.getRootDomain into UrlUtils.getRootDomain that depends on PSL, and replaces report-side with entity recognition based.
  • Preserves existing rootDomains with an explicit Legacy prefix to be used for rendering pre-10.0 LHRs.

@alexnj alexnj requested a review from a team as a code owner November 29, 2023 04:47
@alexnj alexnj requested review from connorjclark and removed request for a team November 29, 2023 04:47
shared/util.js Outdated Show resolved Hide resolved
@alexnj
Copy link
Member Author

alexnj commented Nov 29, 2023

Bundle size increase:

before:

1869637 lighthouse-dt-bundle.js 
 517264 lighthouse-dt-bundle.js.gz (gzip -6)
3763926 lighthouse.tgz

after (with psl):

1972732 lighthouse-dt-bundle.js (+100 KiB)
 558462 lighthouse-dt-bundle.js.gz (gzip -6) (+40 KiB)
3764644 lighthouse.tgz (+718 bytes)

Copy link
Member

@paulirish paulirish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems pretty rad

* @param {LH.Result.Entities=} entities
* @return {LH.Result.LhrEntity|string}
*/
static getEntityFromUrl(url, entities) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the subtlety here kinda goes over my head..

so we use the new PSL-powered getrootdomain in core/c/entity-classification
but out in report we use this method which leverages the boring-basic getrootdomain..

ya?

this does fix #15623 ? my brain is losing track of when the url data is set vs tweaked for display...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the audit side, we use the PSL library. We don't want to carry that much JS payload to the report though. Fortunately, there are only a few places on the report side that uses getRootDomain() — to convert a URL to its entity name for entity-grouping/lookup, and for legacy report third-party filtering, for example. For those, we already have keyed the entity-classification set with root-domains as entity names, so I rewrote them as entity-name comparisons.

This weird |string return value is to support pre-10.0 LHRs, where we dont have entity classification data. For those, we fall back to getLegacyRootDomain and convert them to string comparisons.

It does fix #15623. The entity identified is the proper domain (vs. mb.ca before). Did you give it a try?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok cool thanks.

Did you give it a try?

nope but i knew you would've.

shared/util.js Outdated
}

const entity = entities.find(e => e.origins.find(origin => url.startsWith(origin)));
return entity || Util.getLegacyRootDomain(url);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does pre-v10 LHRs play into it? getLegacyRootDomain is used on either side of the if condition.....

is this L92 use just an extra fallback we don't expect to hit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're falling back to getLegacyRootDomain if (1) we don't have entity classification L87, or (2) the URL cannot be resolved to an entity (L92). I think the latter shouldn't happen unless there's something wrong with the entity-classification dataset. Pre-v10 should proceed via line 88.

shared/util.js Show resolved Hide resolved
shared/util.js Outdated Show resolved Hide resolved
* @param {LH.Result.Entities=} entities
* @return {LH.Result.LhrEntity|string}
*/
static getEntityFromUrl(url, entities) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok cool thanks.

Did you give it a try?

nope but i knew you would've.

shared/util.js Outdated Show resolved Hide resolved
shared/util.js Outdated Show resolved Hide resolved
alexnj and others added 4 commits December 1, 2023 14:08
Experiments integrating npm:tldts (MIT licensed) to bring in
public suffix list based root domain classification.
Splits Util.getRootDomain into UrlUtils.getRootDomain that depends
on PSL, and replaces report-side with entity recognition based.
Preserves existing rootDomains with an explicit `Legacy` prefix
to be used for rendering pre-10.0 LHRs.
@alexnj
Copy link
Member Author

alexnj commented Dec 1, 2023

Updated the PR with review changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants