
deps: switch third-party-web dataset to entities-nostats subset #14548

Merged 1 commit into main from 3pweb-regular-nostat on Dec 1, 2022

Conversation

@alexnj (Member) commented Nov 18, 2022

I was tracking down why a recent upgrade of the library was losing one of the recognized entities, and in the process realized we aren't getting several entities that are in entities.json.

// Look up the same URL in the full merged dataset and in the
// HTTPArchive-only subset that Lighthouse currently bundles.
const getEntityMerged = require('third-party-web').getEntity;
const getEntityHTTPArchive = require('third-party-web/httparchive-nostats-subset.js').getEntity;

const url = 'https://mnl4bjjsnz-dsn.algolia.net/1/indexes/dev_OFFICE_SCENES/query';
console.log('result from merged entities: ', getEntityMerged(url));
console.log('result in httparchive entities:', getEntityHTTPArchive(url));

With 0.20.2, it outputs:

result from merged entities:  {
  company: 'Algolia',
  categories: [ 'utility' ],
  name: 'Algolia',
  homepage: 'https://www.algolia.com/',
  category: 'utility',
  domains: [ '*.algolianet.com', '*.algolia.net', '*.algolia.io' ],
  products: [],
  totalExecutionTime: 0,
  totalOccurrences: 0,
  averageExecutionTime: NaN
}
result in httparchive entities: undefined

The entity moved out of the HTTPArchive dataset (as expected, since it fell below the thresholds required to make it into that dataset).

Digging further:

jsonlint entities-nostats.json | grep \"name\" | wc -l
2135

jsonlint entities-httparchive-nostats.json | grep \"name\" | wc -l
1095

diff <(jsonlint entities-nostats.json | grep \"name\" | sort) \
  <(jsonlint entities-httparchive-nostats.json | grep \"name\" | sort) | grep \< | wc -l
1040
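
That is, 1,040 entity names appear in the full dataset but not in the HTTPArchive subset. For a sanity check without the shell pipeline, here is a minimal Node sketch that computes the same counts; it assumes the JSON files ship under node_modules/third-party-web/dist/ and are arrays of entity objects with a name field (both are assumptions about how the package lays out its data):

const fs = require('fs');

// Assumed location of the package's JSON data; adjust if it ships elsewhere.
const load = (file) =>
  JSON.parse(fs.readFileSync(`node_modules/third-party-web/dist/${file}`, 'utf8'));

const full = load('entities-nostats.json');
const httparchive = load('entities-httparchive-nostats.json');

const httparchiveNames = new Set(httparchive.map((entity) => entity.name));
const onlyInFull = full.filter((entity) => !httparchiveNames.has(entity.name));

// Expected to match the counts above: 2135, 1095, 1040.
console.log(full.length, httparchive.length, onlyInFull.length);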

@alexnj requested a review from a team as a code owner on November 18, 2022 18:04
@alexnj requested reviews from adamraine, connorjclark, and brendankenny and removed the request for a team on November 18, 2022 18:04
@connorjclark (Collaborator)

For reference, #9067 is all that I could find written down about this:

> I'm using the httparchive-no-stats subset of the third-party-web dataset. It only includes the origins we actually observed in HTTPArchive and deletes the stats to cut down on bundle size impact (weighs in at 65KB compared to 202KB).

@brendankenny (Member) commented Nov 18, 2022

  • What is the purpose of the different subsets? The third-party-web readme doesn't mention them and I've never looked into why they exist. Presumably Patrick chose the httparchive set for a reason.
  • One possible explanation is bundle size. What's the delta for this change?

edit: nice 🕵️ @connorjclark :)

@alexnj (Member, Author) commented Nov 18, 2022

At version 0.20.2, the difference is:

ls -ls *nostats.json | awk '{print $6 " " $10}'
111003 entities-httparchive-nostats.json
201320 entities-nostats.json

Where we are now (0.17.1), the difference is:

ls -ls *nostats.json | awk '{print $6 " " $10}'
106566 entities-httparchive-nostats.json
206842 entities-nostats.json

@alexnj (Member, Author) commented Nov 18, 2022

> What is the purpose of the different subsets?

The entities-nostats.json file seems to back the lookup the library exposes publicly. The HTTPArchive subset seems to maintain just the entities that were found in the recent HA scan and passed the defined thresholds. Or at least that's what I understood. Probably a good question for @patrickhulce :)

I think LH should switch to, and depend on, the full dataset to maximize coverage. Would the additional (roughly double) plaintext payload size be a concern?
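
For concreteness, a minimal sketch of what the switch would look like at the require site, assuming the package root exposes a nostats-subset.js entry point for the full dataset alongside the httparchive one (the entry-point name and the actual import location in LH are assumptions, not confirmed by this PR):

// Before: HTTPArchive-only subset (~1,095 entities).
// const thirdPartyWeb = require('third-party-web/httparchive-nostats-subset.js');

// After: full entities dataset with stats stripped (~2,135 entities).
// nostats-subset.js is assumed to exist and to expose the same getEntity API.
const thirdPartyWeb = require('third-party-web/nostats-subset.js');

const entity = thirdPartyWeb.getEntity(
  'https://mnl4bjjsnz-dsn.algolia.net/1/indexes/dev_OFFICE_SCENES/query');
console.log(entity && entity.name); // 'Algolia' once the full set is in use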

@connorjclark (Collaborator)

It's a ~6% increase to the devtools bundle, so not too much.

@brendankenny (Member)

How big is the dt bundle these days? 6% of big can still be pretty big :)

It might be worth looking into the helpfulness per byte added. I assume there are power law distributions at work here, e.g. if 1/1000 Lighthouse runs get a single additional entity recognized, it might not be worth it. 1/100, maybe?

For the sample_v2 test problem: there are a few requests to algolia.net subdomains still out there (so this change would technically help them), but in October 2022 there were 6,913 (out of 13,480,378) pages that made any request matching the ['*.algolianet.com', '*.algolia.net', '*.algolia.io'] wildcards, so it could just be that it's less and less of a thing? OTOH it's for a search bar, so it's possible most sites just don't make a request to an algolia domain without user interaction, so it's naturally not going to be found often in HTTP Archive data.

Maybe it's no big deal even if there are diminishing returns, but something to think about. For the actual call, I'm not sure who the dt bundle-size watcher is these days (or if there is one).

@brendankenny (Member)

Also, is that 6% pre-compression? I assume all those domain lists compress really well.

@connorjclark (Collaborator)

It's pre-compression, because I'm not familiar with exactly how the devtools build / Chrome distribution is compressed. If we can just assume something like gzip... it's 427kb vs 445kb.
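
As a rough way to reproduce a comparison like that for the two datasets themselves (the 427kb/445kb figures above are for the whole bundle), here is a Node sketch assuming gzip is representative of whatever compression the Chrome distribution actually uses; the file paths are illustrative:

const fs = require('fs');
const zlib = require('zlib');

for (const file of ['entities-httparchive-nostats.json', 'entities-nostats.json']) {
  // Assumed location of the package's JSON data.
  const raw = fs.readFileSync(`node_modules/third-party-web/dist/${file}`);
  const gzipped = zlib.gzipSync(raw);
  console.log(`${file}: ${raw.length} B raw, ${gzipped.length} B gzipped`);
}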

@patrickhulce (Collaborator)

> The HTTPArchive subset seems to maintain just the entities that were found in the recent HA scan and passed the defined thresholds. Or at least that's what I understood.

Yep, that's right! I would expect there to be diminishing returns here, but as @brendankenny points out, we definitely bias against 3Ps that are only triggered on interaction, so from a Fraggle Rock perspective it might be worth including more. Happy to craft more bespoke cuts of the dataset to publish if there's a different set of tradeoffs that would benefit LH/DT these days :)

@connorjclark (Collaborator)

> it's 427kb vs 445kb

Seems OK to just include all the data if this is what we're looking at 🤷

@alexnj merged commit 9c90f20 into main on Dec 1, 2022
@alexnj deleted the 3pweb-regular-nostat branch on December 1, 2022 17:37