Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add datasets and documentation #1

Merged
merged 19 commits into from
Feb 24, 2022

Conversation

winfried-ripken
Copy link
Contributor

@winfried-ripken winfried-ripken commented Feb 3, 2022

  • Using Readthedocs: setup structure from their sphinx template
  • Copy PR template and issue templates from mxlabs-dataset
  • Adding Datasets + Documentation using this template created by Thomas
    • Kaggle Casting Quality: prototype dataset card
    • Imagenet: empty docu

TODO

  • Replace squirrel references with squirrel-core once available
  • Document preprocessing with spark

@winfried-ripken

This comment was marked as resolved.

@winfried-ripken
Copy link
Contributor Author

Would be happy to get your feedback on this prototype dataset card

Copy link
Contributor

@TiansuYu TiansuYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks really good to me.

docs/source/conf.py Outdated Show resolved Hide resolved
src/squirrel_datasets_core/datasets/imagenet/README.rst Outdated Show resolved Hide resolved
@winfried-ripken

This comment was marked as resolved.

Copy link
Contributor

@AlpAribal AlpAribal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good already! Dataset card is quite nice 👍

Depending on when this PR gets ready to merge, you may want to refrain from adding the driver .py files, since there will be a lot of changes with the new squirrel store and driver apis.

.gitignore Outdated Show resolved Hide resolved
.readthedocs.yaml Outdated Show resolved Hide resolved
docs/source/conf.py Show resolved Hide resolved
docs/source/conf.py Show resolved Hide resolved
requirements.in Outdated Show resolved Hide resolved
@ThomasWollmann
Copy link
Member

@winfried-loetzsch Very cool! Would be nice to have the templates also in the devtool, so that other projects get this when rendering the skeleton.

.gitignore Outdated Show resolved Hide resolved
@ThomasWollmann

This comment was marked as resolved.

@ThomasWollmann

This comment was marked as resolved.

@ThomasWollmann

This comment was marked as resolved.

@ThomasWollmann
Copy link
Member

There shall also be tests.

@winfried-ripken
Copy link
Contributor Author

Description

  • Create main structure of the repo
  • Add drivers for Camvid, ds_bowl2018, imagenet, kaggle_casting_quality, monthly_german_tweets as well as hub, huggingface, torchvision drivers
  • Add prototype dataset card + reference it in sphinx docu
  • Parse dataset cards to extract machine-readable information
from squirrel_datasets_core.datasets.kaggle_casting_quality import DATASET_ATTRIBUTES
print(DATASET_ATTRIBUTES)

will give

{
  'pretty_name': 'Kaggle Casting Quality', 
  'languages': '[]', 
  'licenses': 'CC BY-NC-ND 4.0', 
  'size_categories': '1K<n<10K', 
  'task_categories': 'image-classification'
}

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring including code style reformatting
  • Other (please describe):

Checklist:

  • I have read the [contributing guideline doc] () (external only)
  • I have signed the [CLA] () (external only)
  • Lint and unit tests pass locally with my changes
  • I have kept the PR small so that it can be easily reviewed
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • All dependency changes have been reflected in the pip requirement files.

@winfried-ripken winfried-ripken marked this pull request as ready for review February 21, 2022 15:21
Copy link
Contributor

@TiansuYu TiansuYu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far everything LGTM. Thanks a lot for the work!
Also interesting to see that all the 'data cards' are in each dataset folder instead of in docs. Is there any reason to prefer this way?

src/squirrel_datasets_core/datasets/camvid/driver.py Outdated Show resolved Hide resolved
except AttributeError:
pass

return drivers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will add squirrel_sources here in the future, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to exclude the sources if they point to our internal gcp buckets, but I think we could include some of the huggingface sources for example. Let's discuss this on Monday :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Copy link
Contributor

@AlpAribal AlpAribal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, left minor remarks.

Comment on lines 21 to 24
intersphinx_mapping = {
"python": ("https://docs.python.org/3/", None),
"sphinx": ("https://www.sphinx-doc.org/en/master/", None),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for future: We can add https://squirrel.readthedocs.io here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know we have already created the page. Looks nice!

Comment on lines +5 to +9
torchvision
scipy
pyspark==3.2.0
hydra-core>=1.1.0
hub
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In squirrel-datasets, we have different flavors for e.g. torchvision and hub. Do we have them all in a single reqs.in here on purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's discuss then in the group if it makes sense to have the flavors or not :)


# generate extras based on requirements files
extras_require = dict()
for a_extra in ["dev", "preprocessing", "torchvision", "hub"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove some of these extras if we dont want flavors

@winfried-ripken
Copy link
Contributor Author

So far everything LGTM. Thanks a lot for the work! Also interesting to see that all the 'data cards' are in each dataset folder instead of in docs. Is there any reason to prefer this way?

This was brought up by @ThomasWollmann, mainly to simplify the process of adding a dataset (have everything in one place) and be able to navigate in github and see the readme/dataset card while browsing through the implementations of the datasets

@winfried-ripken winfried-ripken merged commit f5a3d25 into main Feb 24, 2022
@winfried-ripken winfried-ripken deleted the wl-add-datasets-and-documentation branch February 24, 2022 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants