Book Depository Dataset

Book Depository has been discontinued; as a result, this project is now a piece of history. Do not try to scrape anything: it will not work.

This repository contains the source code of the Book Depository Dataset: the implementation of data extraction (a Scrapy spider), parsing, and exploratory data analysis (EDA).

The dataset is also available as a Kaggle dataset.

Project Structure

crawler: scrapy crawler for data extraction
parser: python script for data transformation and dataset creation
eda: Exploratory Data Analysis on dataset

Steps to reproduce

Run the Scrapy crawler to retrieve data from bookdepository.com
Run the parser to create the dataset

Crawler

This Scrapy project extracts the majority of books from bookdepository.com. If you want to extract the data on your own, please keep the settings file as is.

Usage

Use the crawler like any other Scrapy project:

poetry run scrapy crawl bdepobooks -o data/raw/textual/books.jsonlines

The scraping process takes more than a week (scraping rate: ~50 items/minute). After crawling, data/raw/textual/books.jsonlines contains the raw data for all books. Downloaded images are stored under the data/raw/media/full folder.
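To make the output format and the quoted rate concrete, here is a minimal sketch that reads a JSON Lines file (one JSON object per line, which is what Scrapy's jsonlines exporter writes) and estimates crawl time at a constant rate. The helper names `iter_records` and `estimate_crawl_days` are hypothetical and not part of this repository; the 50 items/minute default is the approximate rate stated above.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def estimate_crawl_days(n_items: int, items_per_minute: float = 50.0) -> float:
    """Estimate crawl duration in days, assuming a constant scraping rate."""
    minutes = n_items / items_per_minute
    return minutes / 60 / 24


# Only runs if the crawler output is present at the path used above.
raw_file = Path("data/raw/textual/books.jsonlines")
if raw_file.exists():
    n = sum(1 for _ in iter_records(raw_file))
    print(f"{n} records, roughly {estimate_crawl_days(n):.1f} days of crawling")
```

At ~50 items/minute, one full week of crawling corresponds to about 50 × 60 × 24 × 7 = 504,000 items.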

Parser

This submodule parses and transforms the raw data to create the dataset in a tabular format (CSV).

Usage

Use the parser directly from the command line: provide the .jsonlines file with the raw data and the output directory.

python parse_dataset.py -h
optional arguments:
  -h, --help            show this help message and exit
  -i INP, --input-file INP
                        Input file path
  -o OUT, --output-folder OUT
                        Output folder path
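For readers unfamiliar with how such a help message is produced, the following is a rough sketch of an `argparse` setup that mirrors only the two options listed in the help output above. It is not the actual code of parse_dataset.py, which may define other or different options; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two options shown in the help text (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-i", "--input-file",
        metavar="INP",  # placeholder name displayed in the help message
        help="Input file path",
    )
    parser.add_argument(
        "-o", "--output-folder",
        metavar="OUT",
        help="Output folder path",
    )
    return parser


# Parsed values are exposed as attributes named after the long option.
args = build_parser().parse_args(["-i", "books.jsonlines", "-o", "data/parsed"])
print(args.input_file, args.output_folder)
```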

Working example

poetry run python src/parser/parse_dataset.py \
                  --input-jsonb data/raw/textual/books.jsonlines \
                  --input-images data/raw/media/full \
                  --output-folder data/parsed

This script creates a collection of .csv and .zip files in the data/parsed/ folder.
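A quick way to inspect what the parser produced is sketched below, using only the standard library. The names and contents of the generated files are not documented here, so the sketch makes no assumption about them: it reports the header row of every .csv file and the member names of every .zip archive it finds. `summarize_outputs` is a hypothetical helper.

```python
import csv
import zipfile
from pathlib import Path
from typing import Dict, List


def summarize_outputs(folder: Path) -> Dict[str, List[str]]:
    """Map each .csv file to its header row and each .zip file to its member names."""
    summary: Dict[str, List[str]] = {}
    for path in sorted(folder.iterdir()):
        if path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as fh:
                # First row of the file, or an empty list if the file is empty.
                summary[path.name] = next(csv.reader(fh), [])
        elif path.suffix == ".zip":
            with zipfile.ZipFile(path) as zf:
                summary[path.name] = zf.namelist()
    return summary


# Only runs if the parser output folder exists.
parsed = Path("data/parsed")
if parsed.is_dir():
    for name, entries in summarize_outputs(parsed).items():
        print(name, entries[:5])
```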

Citation

@misc{simakis_2020,
  title={Book Depository Dataset},
  url={https://www.kaggle.com/ds/467291},
  DOI={10.34740/kaggle/ds/467291},
  publisher={Kaggle},
  author={Simakis, Panagiotis},
  year={2020}
}

Sponsor

A shout-out to the sponsors of this project:

Konrad Mazanowski @konradm

Disclaimer

All books are hosted by bookdepository.com. Use of the dataset for academic purposes is fair use.
