Book Depository has been discontinued. As a result, this project is now a piece of history. Don't try to scrape anything; it won't work.
The source code of the Book Depository Dataset. Here you will find the implementation for data extraction (Scrapy spider), parsing, and EDA.
The dataset is also available as a Kaggle dataset.
- crawler: Scrapy crawler for data extraction
- parser: Python script for data transformation and dataset creation
- eda: Exploratory Data Analysis on the dataset
- Run the Scrapy crawler to retrieve data from bookdepository.com
- Run the parser to create the dataset
This Scrapy project is used to extract the majority of books from bookdepository.com. If you want to extract the data on your own, please keep the settings file as is.
Use the crawler like any other Scrapy project:
poetry run scrapy crawl bdepobooks -o data/raw/textual/books.jsonlines
The scraping process takes more than a week (scraping rate: ~50 items/minute). After crawling, data/raw/textual/books.jsonlines will contain all the raw book data. Downloaded images can be found under the data/raw/media/full folder.
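As a rough illustration of working with the crawler output, the sketch below streams records from a JSON Lines file one at a time and estimates crawl duration from the stated rate of ~50 items/minute. The helper names `iter_jsonlines` and `crawl_days` are hypothetical and are not part of this repository; no assumption is made about the fields inside each record.

```python
import json
from typing import Iterator


def iter_jsonlines(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSON Lines file.

    Hypothetical helper, not part of the repository. Reading lazily
    avoids loading the whole raw file into memory.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def crawl_days(n_items: int, items_per_minute: float = 50.0) -> float:
    """Estimated crawl duration in days at a constant scraping rate."""
    minutes = n_items / items_per_minute
    return minutes / (60 * 24)


if __name__ == "__main__":
    # At 50 items/minute, 504,000 items take exactly one week.
    print(crawl_days(504_000))  # 7.0
```

At that rate a crawl that runs for more than a week corresponds to more than roughly half a million items.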
This submodule parses and manipulates the raw data in order to create the dataset in a tabular format (csv).
Use the parser directly from the command line: just provide the .jsonlines file with the raw data and the output directory.
python parse_dataset.py -h
optional arguments:
  -h, --help            show this help message and exit
  -i INP, --input-file INP
                        Input file path
  -o OUT, --output-folder OUT
                        Output folder path
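For readers unfamiliar with how such a help message is produced, here is a minimal `argparse` sketch reconstructed purely from the help text above. It is an assumption about the interface, not the actual contents of parse_dataset.py; the function name `build_parser` is hypothetical, and whether the real options are required is unknown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the help text shown above (reconstruction)."""
    parser = argparse.ArgumentParser(prog="parse_dataset.py")
    # metavar controls the placeholder printed in the help output (INP / OUT).
    parser.add_argument("-i", "--input-file", metavar="INP",
                        help="Input file path")
    parser.add_argument("-o", "--output-folder", metavar="OUT",
                        help="Output folder path")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts dashes in long option names to underscores.
    print(args.input_file, args.output_folder)
```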
poetry run python src/parser/parse_dataset.py \
--input-jsonb data/raw/textual/books.jsonlines \
--input-images data/raw/media/full \
--output-folder data/parsed
This script will create a collection of .csv and .zip files in the data/parsed/ folder.
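To give a feel for the raw-to-tabular step, the following is a minimal sketch of flattening a JSON Lines file into a single CSV using only the standard library. It is not the repository's parser: the function name `jsonlines_to_csv`, the `|` separator for list values, and any field names passed in are assumptions made for illustration.

```python
import csv
import json


def jsonlines_to_csv(src: str, dst: str, fields: list[str]) -> int:
    """Write the selected fields of each JSON Lines record to a CSV file.

    Hypothetical helper. Missing fields become empty cells, and list
    values are joined with '|' so each record stays on one row.
    Returns the number of data rows written.
    """
    rows = 0
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        writer = csv.DictWriter(fout, fieldnames=fields)
        writer.writeheader()
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            row = {}
            for name in fields:
                value = record.get(name, "")
                if isinstance(value, list):
                    value = "|".join(map(str, value))
                row[name] = value
            writer.writerow(row)
            rows += 1
    return rows
```

A call might look like `jsonlines_to_csv("data/raw/textual/books.jsonlines", "books.csv", ["title", "authors"])`, where the two field names are placeholders for whatever keys the raw records actually contain.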
@misc{simakis_2020,
title={Book Depository Dataset},
url={https://www.kaggle.com/ds/467291},
DOI={10.34740/kaggle/ds/467291},
publisher={Kaggle},
author={Simakis, Panagiotis},
year={2020}
}
A shout-out to the sponsors of this project:
- Konrad Mazanowski @konradm
All books are hosted by bookdepository.com. The use of the dataset is fair use for academic purposes.