merantix-momentum · winfried-ripken · Feb 24, 2022 · Feb 3, 2022
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,19 @@
+---
+name: Bug report
+about: Create a report to help us improve
+title: "[BUG]"
+labels: bug
+assignees: ''
+
+---
+
+Before you open an issue, please check if a similar issue already exists or has been closed before.
+
+ ### When reporting a bug, please be sure to include the following:
+ - [ ] A descriptive title
+ - [ ] An *isolated* way to reproduce the behavior (example: GitHub repository with code isolated to the issue that anyone can clone to observe the problem)
+ - [ ] What version of `squirrel-datasets` and `squirrel` you're using, and the platform(s) you're running it on
+ - [ ] What packages or other dependencies you're using
+ - [ ] The behavior you expect to see and the actual behavior
+
+ See [contributing guideline]() for more detail on what is expected of a bug report.
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,18 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: "[FEATURE]"
+labels: enhancement
+assignees: ''
+
+---
+
+Before you open an issue, please check if a similar issue already exists or has been closed before.
+
+### When you open an issue for a feature request, please add as much detail as possible:
+ - [ ] A descriptive title
+ - [ ] A description of the problem you're trying to solve, including *why* you think this is a problem
+ - [ ] An overview of the suggested solution
+ - [ ] If the feature changes current behavior, reasons why your solution is better
+
+ See [contributing guideline]() for more detail on what is expected of a feature request.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,25 @@
+# Description
+
+Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. 
+List any dependencies that are required for this change.
+
+Fixes # issue
+
+## Type of change
+
+- [ ] Bug fix (non-breaking change which fixes an issue)
+- [ ] New feature (non-breaking change which adds functionality)
+- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
+- [ ] Documentation update
+- [ ] Refactoring including code style reformatting 
+- [ ] Other (please describe):
+
+# Checklist:
+
+- [ ] I have read the [contributing guideline doc]() (external only)
+- [ ] I have signed the [CLA]() (external only)
+- [ ] Lint and unit tests pass locally with my changes
+- [ ] I have kept the PR small so that it can be easily reviewed  
+- [ ] I have made corresponding changes to the documentation
+- [ ] I have added tests that prove my fix is effective or that my feature works
+- [ ] All dependency changes have been reflected in the pip requirement files. 
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,170 @@
+######## OSX ########
+*.DS_Store
+.AppleDouble
+.LSOverride
+.xprocess/
+.Trash-0/
+.pytest_cache/
+
+# Icon must end with two \r
+Icon
+
+# Thumbnails
+._*
+
+# Files that might appear in the root of a volume
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+.com.apple.timemachine.donotpresent
+
+# Directories potentially created on remote AFP share
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk
+
+# Terraform
+.terraform
+terraform.tfstate
+terraform.tfstate.backup
+.terraform.lock.hcl
+.terraform.tfstate.lock.info
+
+# experiment loggers
+mlruns
+wandb
+
+####### Misc #######
+# Vim
+*.swp
+*.swo
+
+nohup.out
+
+*.mp4
+*.avi
+*.wmv
+*.mov
+*.pdf
+*.log
+
+######## Python ########
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+env/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+node_modules/
+parts/
+sdist/
+var/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*,cover
+.hypothesis/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+target/
+
+# IPython Notebook
+.ipynb_checkpoints
+
+# pyenv
+.python-version
+
+# celery beat schedule file
+celerybeat-schedule
+
+# dotenv
+.env
+
+# virtualenv
+.venv/
+venv/
+ENV/
+
+# Spyder project settings
+.spyderproject
+
+# Rope project settings
+.ropeproject
+
+######## PYCHARM ########
+.idea
+
+######## vscode ########
+.vscode
+.devcontainer.json
+
+####### LargeVis #######
+**/annoy_index_file
+
+####### Web #######
+*.sass-cache
+.ruby-version
+/Pipfile*
+/.conda-*
+/yarn-error.log
+bower_components/
+npm-debug.log
+
+# devspace
+.devspace
+
+# Hydra
+outputs/
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,29 @@
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+# Required
+version: 2
+
+# Set the version of Python and other tools you might need
+build:
+  os: ubuntu-20.04
+  tools:
+    python: "3.8"
+    # You can also specify other tool versions:
+    # nodejs: "16"
+    # rust: "1.55"
+    # golang: "1.17"
+
+# Build documentation in the docs/ directory with Sphinx
+sphinx:
+   configuration: docs/source/conf.py
+
+# If using Sphinx, optionally build your docs in additional formats such as PDF
+# formats:
+#    - pdf
+
+# Optionally declare the Python requirements required to build your docs
+# python:
+#   install:
+#   - requirements: docs/requirements.txt
diff --git a/README.rst b/README.rst
@@ -1,3 +1,10 @@
 Squirrel Datasets
 =================
 
+``squirrel-datasets-core`` is a hub where users can 1) explore existing datasets registered in the data mesh by other users and 2) preprocess their own datasets and share them with others. As an end user, you can
+load many publicly available datasets quickly and easily with the help of ``squirrel``, or load and preprocess
+your own datasets with the tools provided here.
+
+For preprocessing, we currently support Spark as the main tool to carry out the task.
+
+Please see our `documentation <https://squirrel-datasets-core.readthedocs.io>`_ for further details.
diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/source/add_dataset.rst b/docs/source/add_dataset.rst
@@ -0,0 +1,57 @@
+Contribute to Squirrel Datasets
+===============================
+
+Squirrel-datasets supports you with two tasks:
+
+* Preprocessing data and registering the preprocessed dataset in the data mesh
+* Loading data from the data mesh
+
+Preprocessing
+-------------
+For the first task, i.e. preprocessing, we recommend using `Apache Spark`_. A common scenario is that your data is
+stored in Google Cloud Storage and you want to run your batch processing job on a Kubernetes cluster. We use
+`PySpark`_ to define the preprocessing logic in Python.
+
+Data Loading
+------------
+For the second task, i.e. data loading, we use the high-level API from :code:`squirrel`. The corresponding data loading logic
+is defined through a :py:class:`Driver` class.
+
+Add a New Dataset
+------------------
+With these two tasks in mind, adding a new dataset to :code:`squirrel-datasets` involves three steps: define your
+preprocessing logic, define your loading logic, and register the dataset in a catalog plugin. The first two steps
+are described below.
+
+#. Define your preprocessing logic.
+
+   - Create a new directory under :code:`squirrel_datasets_core/datasets` named after your dataset, e.g. "example_dataset".
+     Write your preprocessing scripts under a new ``preprocessing.py`` file in it.
+
+#. Define your loading logic.
+
+   - After the preprocessing step, you want to make sure your preprocessed dataset is valid and readable. To do so,
+     you need to define the loading logic. The driver defines how the dataset is read from the serialized file into memory.
+
+   - In :code:`squirrel` there are already many built-in drivers for reading all kinds of datasets. There are
+     :py:class:`CSVDataloader`, :py:class:`JSONLoader`, :py:class:`MessagePackDataLoader`, :py:class:`RecordLoader`
+     and many others. For details, please refer to `squirrel.driver`_.
+
+   - Select a suitable driver if one of them is applicable to your dataset's format and compression method.
+
+   - If no existing driver suits your dataset, you need to define a custom driver. The custom driver should
+     have the same interface as :py:class:`squirrel.driver.IterDriver`. We recommend that you subclass
+     this class and add the loading logic inside. The custom driver should be saved under
+     :code:`squirrel_datasets_core/datasets/example_dataset/driver.py`.
+
+   .. note::
+
+     Data loading does not always occur after the preprocessing steps. For image datasets, Spark is
+     not always the right tool for preprocessing. In that case, you may want to load and process the data without it,
+     which requires defining the loading logic for your raw data first. You may then swap the above steps or use them more
+     flexibly. See `squirrel_datasets_core.datasets.imagenet`_ for an example.
+
+.. _Apache Spark: https://spark.apache.org/docs/latest/
+.. _PySpark: https://spark.apache.org/docs/latest/api/python/
+.. _squirrel.driver: https://squirrel.readthedocs.io/
+.. _squirrel_datasets_core.datasets.imagenet: https://squirrel.readthedocs.io/
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -0,0 +1,36 @@
+# Configuration file for the Sphinx documentation builder.
+import datetime
+
+# -- Project information
+
+project = "Squirrel Datasets"
+copyright = f"{datetime.datetime.now().year}, Merantix Labs GmbH"
+author = "Merantix Labs GmbH"
+# -- General configuration
+
+extensions = [
+    "sphinx.ext.duration",
+    "sphinx.ext.doctest",
+    "sphinx.ext.autodoc",
+    "sphinx.ext.autosummary",
+    "sphinx.ext.intersphinx",
+    "sphinx.ext.napoleon",
+    "sphinx.ext.autosectionlabel",
+]
+
+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3/", None),
+    "sphinx": ("https://www.sphinx-doc.org/en/master/", None),
+}
+intersphinx_disabled_domains = ["std"]
+
+templates_path = ["_templates"]
+
+# -- Options for HTML output
+
+html_theme = "sphinx_rtd_theme"
+
+# -- Options for EPUB output
+epub_show_urls = "footnote"
+
+autoclass_content = "both"