
deps: switch third-party-web dataset to entities-nostats subset #14548

Merged 1 commit into main from 3pweb-regular-nostat on Dec 1, 2022

Conversation

@alexnj (Member) commented Nov 18, 2022

I was tracking down why a recent upgrade of the library was losing one of the recognized entities, and in the process realized we aren't getting several entities that are in entities.json.

// Look up the same URL in the full merged dataset and in the
// HTTPArchive-only subset that Lighthouse currently bundles.
const getEntityMerged = require('third-party-web').getEntity;
const getEntityHTTPArchive = require('third-party-web/httparchive-nostats-subset.js').getEntity;

const url = 'https://mnl4bjjsnz-dsn.algolia.net/1/indexes/dev_OFFICE_SCENES/query';
console.log('result from merged entities: ', getEntityMerged(url));
console.log('result in httparchive entities:', getEntityHTTPArchive(url));

With 0.20.2, it outputs:

result from merged entities:  {
  company: 'Algolia',
  categories: [ 'utility' ],
  name: 'Algolia',
  homepage: 'https://www.algolia.com/',
  category: 'utility',
  domains: [ '*.algolianet.com', '*.algolia.net', '*.algolia.io' ],
  products: [],
  totalExecutionTime: 0,
  totalOccurrences: 0,
  averageExecutionTime: NaN
}
result in httparchive entities: undefined

The entity moved out of the HTTPArchive dataset (as expected, since it fell below the thresholds required to make it into that dataset).

Digging further:

jsonlint entities-nostats.json | grep \"name\" | wc -l
2135

jsonlint entities-httparchive-nostats.json | grep \"name\" | wc -l
1095

diff <(jsonlint entities-nostats.json | grep \"name\" | sort) \
  <(jsonlint entities-httparchive-nostats.json | grep \"name\" | sort) | grep \< | wc -l
1040
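
That is, 1,040 entity names appear in the full dataset but not in the HTTPArchive subset. For a sanity check without the shell pipeline, here is a minimal Node sketch that computes the same counts; it assumes the JSON files ship under node_modules/third-party-web/dist/ and are arrays of entity objects with a name field (both are assumptions about how the package lays out its data):

const fs = require('fs');

// Assumed location of the package's JSON data; adjust if it ships elsewhere.
const load = (file) =>
  JSON.parse(fs.readFileSync(`node_modules/third-party-web/dist/${file}`, 'utf8'));

const full = load('entities-nostats.json');
const httparchive = load('entities-httparchive-nostats.json');

const httparchiveNames = new Set(httparchive.map((entity) => entity.name));
const onlyInFull = full.filter((entity) => !httparchiveNames.has(entity.name));

// Expected to match the counts above: 2135, 1095, 1040.
console.log(full.length, httparchive.length, onlyInFull.length);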

@alexnj requested a review from a team as a code owner on November 18, 2022 18:04
@alexnj requested reviews from adamraine, connorjclark, and brendankenny and removed the request for a team on November 18, 2022 18:04
@connorjclark (Collaborator)

For reference, #9067 is all that I could find written down about this:

> I'm using the httparchive-no-stats subset of the third-party-web dataset. It only includes the origins we actually observed in HTTPArchive and deletes the stats to cut down on bundle size impact (weighs in at 65KB compared to 202KB).

@brendankenny (Member) commented Nov 18, 2022

  • What is the purpose of the different subsets? The third-party-web readme doesn't mention them and I've never looked into why they exist. Presumably Patrick chose the httparchive set for a reason.
  • One possible explanation is bundle size. What's the delta for this change?

edit: nice 🕵️ @connorjclark :)

@alexnj (Member, Author) commented Nov 18, 2022

At version 0.20.2, the difference is:

ls -ls *nostats.json | awk '{print $6 " " $10}'
111003 entities-httparchive-nostats.json
201320 entities-nostats.json

Where we are now (0.17.1), the difference is:

ls -ls *nostats.json | awk '{print $6 " " $10}'
106566 entities-httparchive-nostats.json
206842 entities-nostats.json

@alexnj (Member, Author) commented Nov 18, 2022

> What is the purpose of the different subsets?

The entities-nostats.json file seems to back the lookup the library exposes publicly. The HTTPArchive subset seems to maintain just the entities that were found in the recent HA scan and passed the defined thresholds. Or at least that's what I understood. Probably a good question for @patrickhulce :)

I think LH should switch to, and depend on, the full dataset to maximize coverage. Would the additional (roughly double) plaintext payload size be a concern?
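
For concreteness, a minimal sketch of what the switch would look like at the require site, assuming the package root exposes a nostats-subset.js entry point for the full dataset alongside the httparchive one (the entry-point name and the actual import location in LH are assumptions, not confirmed by this PR):

// Before: HTTPArchive-only subset (~1,095 entities).
// const thirdPartyWeb = require('third-party-web/httparchive-nostats-subset.js');

// After: full entities dataset with stats stripped (~2,135 entities).
// nostats-subset.js is assumed to exist and to expose the same getEntity API.
const thirdPartyWeb = require('third-party-web/nostats-subset.js');

const entity = thirdPartyWeb.getEntity(
  'https://mnl4bjjsnz-dsn.algolia.net/1/indexes/dev_OFFICE_SCENES/query');
console.log(entity && entity.name); // 'Algolia' once the full set is in use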

@connorjclark (Collaborator)

It's a ~6% increase to the devtools bundle, so not too much.

@brendankenny (Member)

How big is the dt bundle these days? 6% of big can still be pretty big :)

It might be worth looking into the helpfulness per byte added. I assume there are power law distributions at work here, e.g. if 1/1000 Lighthouse runs get a single additional entity recognized, it might not be worth it. 1/100, maybe?

For the sample_v2 test problem: there are a few requests to algolia.net subdomains still out there (so this change would technically help them), but in October 2022 there were 6,913 (out of 13,480,378) pages that made any request matching the ['*.algolianet.com', '*.algolia.net', '*.algolia.io'] wildcards, so it could just be that it's less and less of a thing? OTOH it's for a search bar, so it's possible most sites just don't make a request to an algolia domain without user interaction, so it's naturally not going to be found often in HTTP Archive data.

Maybe it's no big deal even if there are diminishing returns, but something to think about. For the actual call, I'm not sure who the dt bundle-size watcher is these days (or if there is one).

@brendankenny (Member)

Also, is that 6% pre-compression? I assume all those domain lists compress really well.

@connorjclark (Collaborator)

It's pre-compression, because I'm not familiar with exactly how the devtools build / Chrome distribution is compressed. If we can just assume something like gzip... it's 427kb vs 445kb.
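
As a rough way to reproduce a comparison like that for the two datasets themselves (the 427kb/445kb figures above are for the whole bundle), here is a Node sketch assuming gzip is representative of whatever compression the Chrome distribution actually uses; the file paths are illustrative:

const fs = require('fs');
const zlib = require('zlib');

for (const file of ['entities-httparchive-nostats.json', 'entities-nostats.json']) {
  // Assumed location of the package's JSON data.
  const raw = fs.readFileSync(`node_modules/third-party-web/dist/${file}`);
  const gzipped = zlib.gzipSync(raw);
  console.log(`${file}: ${raw.length} B raw, ${gzipped.length} B gzipped`);
}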

@patrickhulce (Collaborator)

> The HTTPArchive subset seems to maintain just the entities that were found in the recent HA scan and passed the defined thresholds. Or at least that's what I understood.

Yep, that's right! I would expect there to be diminishing returns here, but as @brendankenny points out, we definitely bias against 3Ps that are only triggered on interaction, so from a Fraggle Rock perspective it might be worth including more. Happy to craft more bespoke cuts of the dataset to publish if there's a different set of tradeoffs that would benefit LH/DT these days :)

@connorjclark (Collaborator)

> it's 427kb vs 445kb

Seems OK to just include all the data if this is what we're looking at 🤷

@alexnj merged commit 9c90f20 into main on Dec 1, 2022
@alexnj deleted the 3pweb-regular-nostat branch on December 1, 2022 17:37