Releases: privacy-tech-lab/gpc-web-crawler
June 2024 Crawl
Differences from April 2024 Crawl:
- addition of a GPP version field that identifies whether the site is using GPP v1.0 or v1.1
April 2024 Crawl
Differences from February 2024 Crawl:
- well-known data is no longer collected by the crawler. We use a Python script instead, which is also included in this repo.
- longer database values are now stored as TEXT instead of VARCHAR
- addition of OneTrustWPCCPAGoogleOptOut and OTGPPConsent cookies
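
For readers unfamiliar with the well-known data mentioned above: the GPC specification lets a site publish a small JSON document at /.well-known/gpc.json with a boolean "gpc" member. The following is a minimal sketch of how such a response body could be interpreted. It is not the script shipped in this repo; the function name parse_gpc_well_known is hypothetical, and fetching the body over HTTP is deliberately left out.

```python
import json
from typing import Optional


def parse_gpc_well_known(body: str) -> Optional[bool]:
    """Return the declared `gpc` value from a /.well-known/gpc.json body.

    Hypothetical helper for illustration only. Returns None when the body
    is not valid JSON, is not a JSON object, or has no boolean `gpc` member.
    """
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        # Many sites answer this path with an HTML error page instead of JSON.
        return None
    if not isinstance(data, dict):
        return None
    value = data.get("gpc")
    return value if isinstance(value, bool) else None


if __name__ == "__main__":
    sample = '{"gpc": true, "lastUpdate": "2024-04-01"}'
    print(parse_gpc_well_known(sample))
    print(parse_gpc_well_known("<html>Not Found</html>"))
```

Returning None rather than False for unusable responses keeps "the site declares that it does not support GPC" separate from "the site published nothing usable", which are different findings in a crawl.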
February 2024 Crawl
This is largely the same as the December 2023 crawl code.
Differences:
- well-known data is collected by the crawler
- column values in the debugging table are capped at 4,000 characters, which matches the column size specified in our table
- one new human-check regular expression
December 2023 Crawl
This is the code we used to perform our crawl on 11,708 sites in December 2023.
The extension collects data from Firefox's urlClassification object to determine whether a site is subject to the CCPA. It also collects the US Privacy String (USPS), the GPP string, and the OptanonConsent cookie to determine whether sites recognize GPC signals. This version uses a SQL database to store the data.
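
To make the US Privacy String check above concrete, here is a minimal sketch of decoding one. It assumes the four-character layout from the IAB CCPA specification (version digit, then notice, opt-out-of-sale, and LSPA flags, each one of Y, N, or -). The names USPrivacy, parse_us_privacy, and opted_out are hypothetical and do not come from the crawler's code; accepting lowercase flags is a leniency chosen for this sketch.

```python
from typing import NamedTuple, Optional


class USPrivacy(NamedTuple):
    version: int
    notice_given: str       # "Y", "N", or "-" (not applicable)
    opted_out_of_sale: str  # "Y", "N", or "-"
    lspa_covered: str       # "Y", "N", or "-"


_FLAGS = {"Y", "N", "-"}


def parse_us_privacy(value: str) -> Optional[USPrivacy]:
    """Decode a US Privacy String such as "1YYN"; return None if malformed."""
    if not isinstance(value, str) or len(value) != 4:
        return None
    if value[0] not in "0123456789":
        return None
    flags = value[1:].upper()  # assumption: tolerate lowercase flags
    if any(ch not in _FLAGS for ch in flags):
        return None
    return USPrivacy(int(value[0]), flags[0], flags[1], flags[2])


def opted_out(value: str) -> bool:
    """True only when the string is well formed and records an opt-out of sale."""
    parsed = parse_us_privacy(value)
    return parsed is not None and parsed.opted_out_of_sale == "Y"


if __name__ == "__main__":
    # Hypothetical values observed before and after sending a GPC signal.
    print(opted_out("1YNN"), opted_out("1YYN"))
```

Comparing the opt-out flag before and after a GPC signal is sent is one way a change in the string could be read as the site recognizing the signal; the crawler's actual analysis may differ.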
Firefox-analysis-mode-crawler
The Firefox-analysis-mode-crawler is used to crawl the top 1,000 sites of the US Privacy String Test Set.