Fetches and cleans data from national statistics office (NSO) websites and publishes it in a standardised, tidy data format.
This work has two goals:
- Provide a database of well-formatted data that can be used in Full Fact’s Stats Checking tools.
- Highlight how much work is involved in collecting and comparing national statistics data across countries, as discussed in the write-up.
The data files follow a simple timescale,observation format. Time is YYYY-MM, and observation is the percentage change. For example:
month,observation
1996-01,47.56
1996-02,43.645
1996-03,41.9048
...
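Files in this format can be read with Python's standard csv module. The sketch below is illustrative and not part of the repository: the function name `parse_observations` is hypothetical, and it assumes the month,observation header row shown in the example above.

```python
import csv
import io


def parse_observations(text):
    """Parse tidy CSV text into a list of (month, observation) tuples.

    Hypothetical helper: assumes a header row of month,observation.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [(row["month"], float(row["observation"])) for row in reader]


# The three example rows from above, inlined so the sketch is self-contained.
sample = "month,observation\n1996-01,47.56\n1996-02,43.645\n1996-03,41.9048\n"
rows = parse_observations(sample)
print(rows[0])
```

Keeping the month as a plain YYYY-MM string means rows sort chronologically with an ordinary string sort.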
Statistics for the following countries are fetched, reformatted and stored in the ./data directory:
- Argentina
- Ireland
- Japan
- Mexico
- Nigeria
- Philippines
- UK
- South Africa
In almost all cases the data file is downloaded and read in (except for the Philippines, where the numbers were hard-coded). Preferably the files would be JSON or CSV, but some countries publish PDF or XLS files. The online location of all these files, along with other metadata, is in the data/nso_stats_metadata.json file.
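Whatever the source format, the values have to end up in the two-column layout described earlier. A minimal sketch of that final step, assuming the values have already been extracted into (month, observation) pairs; `to_tidy_csv` is a hypothetical helper rather than a function from this repository, and the numbers are made up for illustration.

```python
import csv
import io


def to_tidy_csv(pairs):
    """Serialise (month, observation) pairs as month,observation CSV text.

    Hypothetical helper: rows are sorted by month before writing.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["month", "observation"])
    for month, value in sorted(pairs):
        writer.writerow([month, value])
    return buffer.getvalue()


# Illustrative values only, not real statistics.
tidy = to_tidy_csv([("2000-02", 2.5), ("2000-01", 3.0)])
print(tidy)
```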
It is also deployed as a GitHub Action which runs several times between 6am and 10am UTC, so some of the statistics should stay up to date. You can view this GitHub Action in .github/workflow/fetch_stats.yaml. However, given how variable the published statistics are, it wouldn't be surprising if the action breaks at some point when a published format changes.
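Because a published format can change without notice, one way to catch breakage early is to validate fetched rows before they overwrite a data file. The sketch below shows how such a check could look; it is an assumption, not something this repository is known to do, and `validate_rows` is a hypothetical name.

```python
import re

# YYYY-MM with a month between 01 and 12.
MONTH_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def validate_rows(rows):
    """Return a list of problems found in (month, observation) string pairs."""
    problems = []
    for index, (month, observation) in enumerate(rows):
        if not MONTH_PATTERN.fullmatch(month):
            problems.append(f"row {index}: bad month {month!r}")
        try:
            float(observation)
        except ValueError:
            problems.append(f"row {index}: bad observation {observation!r}")
    return problems


good = [("1996-01", "47.56"), ("1996-02", "43.645")]
bad = [("1996-13", "47.56"), ("1996-02", "n/a")]
print(validate_rows(good))
print(validate_rows(bad))
```

An empty list means the rows match the expected layout; anything else could be used to fail the run instead of silently publishing malformed data.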
- Java 8+ (for Tabula to read PDFs)
- Python 3.10+
- It likely works with older versions of Python, but this hasn't been tested
Clone this repo
git clone https://github.com/FullFact/nso-stats-fetcher.git
Install required libraries
Either
poetry install
or
pip install -r requirements.txt
To run the scripts and fetch updated versions of all the statistics data, run:
python src/nsofetch/fetch_all.py
Or run a single country's script on its own. We use ISO 3166 country codes as standardised country names.