Sustainability 2022 Queries #2989
@tunetheweb could we find some time later this week to look at 42714b0#diff-e62b9f849c03e2bbbcb42e98e4ecdfc786dea8b2ce842a87892bb286604dacd5? The query we've got currently gives a total percentage, but I'd also like to break it down by % of the top 1000, 10000, and 100000 sites.
Details here: https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Analysts'-Guide#rank
So this works:
#standardSQL
# What percentage of URLs are hosted on a known green web hosting provider?
WITH green AS (
  SELECT
    NET.HOST(url) AS host,
    TRUE AS is_green
  FROM
    `httparchive.almanac.green_web_foundation`
  WHERE
    date = '2022-06-01'
),

pages AS (
  SELECT
    _TABLE_SUFFIX AS client,
    NET.HOST(url) AS host,
    rank
  FROM
    `httparchive.summary_pages.2022_06_01_*`
)

SELECT
  client,
  rank_grouping,
  COUNTIF(is_green) AS total_green,
  COUNT(0) AS total_sites,
  COUNTIF(is_green) / COUNT(0) AS pct_green
FROM
  pages
LEFT JOIN
  green
USING
  (host),
  UNNEST([1000, 10000, 100000, 1000000, 10000000]) AS rank_grouping
WHERE
  rank <= rank_grouping
GROUP BY
  client,
  rank_grouping
ORDER BY
  client,
  rank_grouping
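To make the rank grouping easier to reason about, here is a minimal sketch of the same pattern on inline toy data. The hosts, ranks, and green flags are made up for illustration, and the `client` dimension is left out for brevity. It assumes, as in the query above, that each page carries a single `rank` value and that a page should count towards every grouping whose threshold is at least that rank.

```sql
#standardSQL
# Toy illustration of cumulative rank groupings.
# All hosts and values below are hypothetical, not HTTP Archive data.
WITH pages AS (
  SELECT 'a.example' AS host, 1000 AS rank UNION ALL
  SELECT 'b.example', 10000 UNION ALL
  SELECT 'c.example', 100000
),

green AS (
  SELECT 'a.example' AS host, TRUE AS is_green UNION ALL
  SELECT 'c.example', TRUE
)

SELECT
  rank_grouping,
  COUNTIF(is_green) AS total_green,
  COUNT(0) AS total_sites,
  COUNTIF(is_green) / COUNT(0) AS pct_green
FROM
  pages
LEFT JOIN
  green
USING
  (host),
  # The comma is a cross join: each page row is repeated once per threshold
  UNNEST([1000, 10000, 100000]) AS rank_grouping
WHERE
  # Keep only the groupings the page falls inside, so the groups are cumulative
  rank <= rank_grouping
GROUP BY
  rank_grouping
ORDER BY
  rank_grouping
```

Because the groupings are cumulative, `a.example` is counted in all three rows, while `c.example` only appears in the 100000 row. Hosts with no match in `green` get a NULL `is_green` from the left join, which `COUNTIF` does not count, so they still contribute to `total_sites` but not to `total_green`.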
Looking good so far!
A few comments:
I still have some open comments on this PR, but I think most of the queries are in a reasonably fit state. I would suggest starting to run them and saving the data to the sheet, so you can see what the data looks like, while also addressing the comments I've made. I'm a little uncertain about exactly what we hope to get out of some of the queries, but maybe once we see the data it will make more sense (or you'll all see that a query perhaps doesn't make as much sense as you thought). We'll still need this PR reviewed and merged, but as long as most of the queries are OK, a few might just need rerunning once the review has identified corrections. You may also find that some need slight tweaks as you run them.
I've nuked the Green Third Parties query since it was returning some pretty strange results. I've rewritten it (f3dffe7) to produce results that look more right and give us something to talk to.
I think this is almost there as far as I can see.
Can we look at the open comments and then get this merged? We can open new PRs for any new queries we want to add after.
@tunetheweb I've updated the checklist at the top. We've got one query on Font format adoption that has been missed, but I'm guessing the Fonts chapter has data on this that we can use.
Progress on #2910
Contents of PR are duplicated from the Google doc outline
Hosting
General
(Page weight)
Cache
- [ ] Caching by resource type
Image Optimization
- [ ] Native lazy loading v. JS implementation
- [ ] Image quality
JS & CSS
Fonts
- [ ] Unused font requests
Video
Third Parties
- [ ] Co2e from third parties
- [ ] Co2e by third-party category
Platform Summary