Add datasets and documentation #1

Merged
merged 19 commits into from
Feb 24, 2022
19 changes: 19 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,19 @@
---
name: Bug report
about: Create a report to help us improve
title: "[BUG]"
labels: bug
assignees: ''

---

Before you open an issue, please check if a similar issue already exists or has been closed before.

### When reporting a bug, please be sure to include the following:
- [ ] A descriptive title
- [ ] An *isolated* way to reproduce the behavior (example: GitHub repository with code isolated to the issue that anyone can clone to observe the problem)
- [ ] What version of `squirrel-datasets` and `squirrel` you're using, and the platform(s) you're running it on
- [ ] What packages or other dependencies you're using
- [ ] The behavior you expect to see and the actual behavior

See [contributing guideline]() for more detail on what is expected of a bug report.
18 changes: 18 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,18 @@
---
name: Feature request
about: Suggest an idea for this project
title: "[FEATURE]"
labels: enhancement
assignees: ''

---

Before you open an issue, please check if a similar issue already exists or has been closed before.

### When you open an issue for a feature request, please add as much detail as possible:
- [ ] A descriptive title
- [ ] A description of the problem you're trying to solve, including *why* you think this is a problem
- [ ] An overview of the suggested solution
- [ ] If the feature changes current behavior, reasons why your solution is better

See [contributing guideline]() for more detail on what is expected of a feature request.
25 changes: 25 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,25 @@
# Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context.
List any dependencies that are required for this change.

Fixes # issue

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Refactoring including code style reformatting
- [ ] Other (please describe):

# Checklist:

- [ ] I have read the [contributing guideline doc]() (external contributors only)
- [ ] I have signed the [CLA]() (external contributors only)
- [ ] Lint and unit tests pass locally with my changes
- [ ] I have kept the PR small so that it can be easily reviewed
- [ ] I have made corresponding changes to the documentation
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] All dependency changes have been reflected in the pip requirement files.
170 changes: 170 additions & 0 deletions .gitignore
@@ -0,0 +1,170 @@
######## OSX ########
*.DS_Store
.AppleDouble
.LSOverride
.xprocess/
.Trash-0/
.pytest_cache/

# Icon must end with two \r
Icon

# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

# Terraform
.terraform
terraform.tfstate
terraform.tfstate.backup
.terraform.lock.hcl
.terraform.tfstate.lock.info

# experiment loggers
mlruns
wandb

####### Misc #######
# Vim
*.swp
*.swo

nohup.out

*.mp4
*.avi
*.wmv
*.mov
*.pdf
*.log

######## Python ########
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
node_modules/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
.venv/
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

######## PYCHARM ########
.idea

######## vscode ########
.vscode
.devcontainer.json

####### LargeVis #######
**/annoy_index_file

####### Web #######
*.sass-cache
.ruby-version
/Pipfile*
/.conda-*
/yarn-error.log
bower_components/
npm-debug.log

# devspace
.devspace

# Hydra
outputs/
29 changes: 29 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,29 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the version of Python and other tools you might need
build:
os: ubuntu-20.04
tools:
python: "3.8"
# You can also specify other tool versions:
# nodejs: "16"
# rust: "1.55"
# golang: "1.17"

# Build documentation in the docs/ directory with Sphinx
sphinx:
configuration: docs/conf.py

# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
# - pdf

# Optionally declare the Python requirements required to build your docs
# python:
# install:
# - requirements: docs/requirements.txt
7 changes: 7 additions & 0 deletions README.rst
@@ -1,3 +1,10 @@
Squirrel Datasets
=================

`squirrel-datasets-core` is a hub where you can 1) explore existing datasets registered in the data mesh by other users and 2) preprocess your datasets and share them with other users. As an end user, you will
be able to load many publicly available datasets with ease and speed with the help of `squirrel`, or load and preprocess
your own datasets with the tools we provide here.

For preprocessing, we currently support Spark as the main tool.

Please see our `documentation <https://squirrel-datasets-core.readthedocs.io>`_ for further details.
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
57 changes: 57 additions & 0 deletions docs/source/add_dataset.rst
@@ -0,0 +1,57 @@
Contribute to Squirrel Datasets
===============================

Squirrel-datasets supports you with two tasks:

* Preprocessing data and registering the preprocessed dataset in the data mesh
* Loading data from the data mesh

Preprocessing
-------------
For the first task, i.e. preprocessing, we recommend using `Apache Spark`_. A common scenario is that you want to work
with data stored in Google Cloud Storage and run your batch processing job on a Kubernetes cluster. We use
`PySpark`_ to define the preprocessing logic in Python.
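
As a rough sketch, a PySpark preprocessing job could be structured as follows. Note that the record schema, the
``clean_record`` transform, and the ``gs://`` paths are hypothetical placeholders, not part of this repository:

```python
import json


def clean_record(record: dict) -> dict:
    """Hypothetical per-record preprocessing: normalize keys and drop empty fields."""
    return {k.lower(): v for k, v in record.items() if v is not None}


def run_preprocessing(input_path: str, output_path: str) -> None:
    """Run the job with Spark. Requires a Spark installation; the paths
    (e.g. gs:// URLs) are placeholders for your own storage locations."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example-preprocessing").getOrCreate()
    records = spark.sparkContext.textFile(input_path)
    records.map(json.loads).map(clean_record).map(json.dumps).saveAsTextFile(output_path)
```

Keeping the per-record transform a pure function makes it easy to unit test without a Spark cluster.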

Data Loading
------------
For the second task, i.e. data loading, we use the high level API from :code:`squirrel`. The corresponding data loading logic
is defined through a :py:class:`Driver` class.

Add a New Dataset
------------------
With the two main tasks above in mind, adding a new dataset to :code:`squirrel-datasets` involves three steps:
define your preprocessing logic, define your loading logic, and register the dataset in a catalog plugin.

#. Define your preprocessing logic.

- Create a new directory under :code:`squirrel_datasets_core/datasets` named after your dataset, e.g. "example_dataset".
Write your preprocessing scripts in a new ``preprocessing.py`` file inside it.

#. Define your loading logic.

- After the preprocessing step, you want to make sure your preprocessed dataset is valid and readable. To verify this,
you need to define the loading logic: the driver defines how the dataset is read from the serialized files into memory.

- In :code:`squirrel` there are already many built-in drivers for reading all kinds of datasets, such as
:py:class:`CSVDataloader`, :py:class:`JSONLoader`, :py:class:`MessagePackDataLoader`, :py:class:`RecordLoader`,
and many others. For details, please refer to `squirrel.driver`_.

- Select a suitable driver if one of them is applicable to your dataset's format and compression method.

- If there is no driver suitable for your dataset, you need to define a custom one. A custom driver should
have the same interface as :py:class:`squirrel.driver.IterDriver`. We recommend subclassing this class and
adding the loading logic there. Save the class under
:code:`squirrel_datasets_core/datasets/example_dataset/driver.py`.
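
To illustrate, here is a minimal, self-contained sketch of such a driver. It does not import :code:`squirrel`;
the class only mimics the expected shape of an iterable driver, and the base class, the ``get_iter`` method name,
and the JSON-lines format are assumptions for illustration:

```python
import json
from pathlib import Path
from typing import Iterator


class ExampleDatasetDriver:
    """Sketch of a custom driver. In squirrel-datasets this would subclass
    squirrel.driver.IterDriver and live under
    squirrel_datasets_core/datasets/example_dataset/driver.py."""

    name = "example_dataset"  # hypothetical driver name used for registration

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def get_iter(self) -> Iterator[dict]:
        """Yield one sample per line from a JSON-lines file."""
        with self.path.open() as f:
            for line in f:
                yield json.loads(line)
```

With the real :code:`squirrel` base class, ``get_iter`` would typically return a composable iterable so that
transformations can be chained; here it yields plain dictionaries for simplicity.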

.. note::

Data loading does not always occur after preprocessing. For image datasets, for example, Spark is not always
the right tool; you may want to load and process the data without it, in which case you need to define the
loading logic for your raw data directly. You can then swap the steps above or apply them more flexibly.
See `squirrel_datasets_core.datasets.imagenet`_ for an example.

.. _Apache Spark: https://spark.apache.org/docs/latest/
.. _PySpark: https://spark.apache.org/docs/latest/api/python/
.. _squirrel.driver: https://squirrel.readthedocs.io/
.. _squirrel_datasets_core.datasets.imagenet: https://squirrel.readthedocs.io/
37 changes: 37 additions & 0 deletions docs/source/conf.py
@@ -0,0 +1,37 @@
# Configuration file for the Sphinx documentation builder.
import datetime

# -- Project information

project = "Squirrel Datasets"
copyright = f"{datetime.datetime.now().year}, Merantix Labs GmbH"
author = "Merantix Labs GmbH"
# -- General configuration

extensions = [
"sphinx.ext.duration",
"sphinx.ext.doctest",
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.intersphinx",
"sphinx.ext.napoleon",
"sphinx.ext.autosectionlabel",
]

intersphinx_mapping = {
"python": ("https://docs.python.org/3/", None),
"sphinx": ("https://www.sphinx-doc.org/en/master/", None),
"squirrel": ("https://squirrel.readthedocs.io/", None),
}
intersphinx_disabled_domains = ["std"]

templates_path = ["_templates"]

# -- Options for HTML output

html_theme = "sphinx_rtd_theme"

# -- Options for EPUB output
epub_show_urls = "footnote"

autoclass_content = "both"