NOTE: More specific documentation is available "on the spot", in the package and subpackage directories (e.g. `edscrapers/scrapers` or `edscrapers/transformers`).
- Clone this repo using `git clone`.
- Change directory into the directory created/cloned for this repo.
- From within the repo directory, run `pip install -r requirements.txt` to install all package dependencies required to run the toolkit.
You need the `ED_OUTPUT_PATH` environment variable to be set before running; not having it set will result in a fatal error. The `ED_OUTPUT_PATH` environment variable sets the path to the directory where all output generated by this toolkit will be stored. The path specified must exist.
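For example, in a Unix-like shell the variable can be set as follows (the directory location is illustrative; any existing directory works):

```shell
# Create an output directory and point ED_OUTPUT_PATH at it.
# The location below is an example, not a required path.
mkdir -p "$HOME/ed-output"
export ED_OUTPUT_PATH="$HOME/ed-output"
```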
If GNU Make is available in your environment, you can run `make install`. Alternatively, run `python setup.py install`.
After installing, run the `eds` command in a command-line prompt.
If you would like to run this toolkit in a container environment, we have packaged this toolkit into a Docker image. Simply run `docker build -t <image-name> .` in the root directory of this cloned repo. This will build an image of the scraping toolkit from the `Dockerfile`.
To get more info on the usage of the ED Scrapers Command Line Interface (`eds`), read the eds CLI docs.
To get more info on the architectural design/approach for the scraping toolkit, read the architectural design doc.
- Scraping Source: a website (or section of a website) that you scrape information from
- Scraper: a script that collects structured data from (rather unstructured) web pages
- Crawler: a script that follows links and identifies all the pages containing information to be parsed
- Parser: a script that identifies data in HTML and loads it into a machine-readable data structure
- Transformer: a script that takes a data structure and adapts it to a target structure
- ETL: the Extract + Transform + Load process for metadata
- Data.json: a specific JSON format used by CKAN harvesters. Example
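For reference, a minimal data.json skeleton might look like the following (field names follow the Project Open Data / DCAT-US schema; all values are illustrative):

```json
{
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "dataset": [
    {
      "title": "Example dataset",
      "description": "An illustrative entry.",
      "identifier": "example-001",
      "accessLevel": "public",
      "keyword": ["education"],
      "publisher": { "name": "Example Office" },
      "contactPoint": { "fn": "Jane Doe", "hasEmail": "mailto:jane@example.gov" }
    }
  ]
}
```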
Scrapers are Scrapy-powered scripts that crawl through links and parse HTML pages. The proposed structure is:
- A crawler class that defines rules for link extraction and page filters
  - This will be instantiated by a `CrawlerProcess` in the main `scraper.py` script
- A parser script that is essentially a callback for fetched HTML pages. It receives a Scrapy `Response` payload, which can be parsed using any HTML parsing method
- An optional Model class, to define the properties of extracted datasets and make them more flexible for dumping or automating operations if needed
Transformers are independent scripts that take an input and return it filtered and/or restructured. They are meant to complement the work done by scrapers by taking their output and making it usable for various applications (e.g. the CKAN harvester).
We currently have 7 transformers in place:
- `deduplicate`: removes duplicates from the scraping output
- `sanitize`: cleans up the scraping output data/metadata based on specified rules
- `datajson`: creates data.json files from the scraping output; these data.json files can then be ingested/harvested by `ckanext-harvest` (used to populate a CKAN data portal)
- `rag`: produces RAG analysis output files using an agreed weighted-value system for calculating the quality of metadata generated by the `datajson` transformer and (by extension) the 'raw' scraping output
- TODO: Add info about the others
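As an illustration of the transformer pattern (a standalone function over scraper output), here is a minimal deduplication sketch; the real `deduplicate` transformer's rules and key choice may differ:

```python
def deduplicate(datasets, key="source_url"):
    """Keep the first occurrence of each dataset, keyed on an identifying field.

    Illustrative only: the field used to detect duplicates is an assumption.
    """
    seen = set()
    unique = []
    for dataset in datasets:
        identifier = dataset.get(key)
        if identifier not in seen:
            seen.add(identifier)
            unique.append(dataset)
    return unique


scraped = [
    {"title": "Graduation rates", "source_url": "https://example.gov/a"},
    {"title": "Graduation rates (dupe)", "source_url": "https://example.gov/a"},
    {"title": "Enrollment", "source_url": "https://example.gov/b"},
]
print(len(deduplicate(scraped)))  # 2
```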