Version 0.1.0 #70

Merged
merged 46 commits into from
Jul 19, 2022
@ivanzvonkov ivanzvonkov commented Jul 18, 2022

Version 0.1.0 updates

Generation:

  • Update generate.ipynb to include generating the GitHub GDRIVE_CREDENTIALS_DATA secret
  • Generate requirements.txt for each project

New data format:

  • Pickle to csv transition: introduce new dataset csvs that contain labels and earth observation (eo) data and track eo status (duplicates, missing, etc.) [justification below]
  • Upgrade datasets pipeline to generate these new csvs
  • Upgrade all datasets and projects to use these new csvs
  • Remove all code that is only relevant to the previous dataset pipeline (features.py, many dvc files)
  • Create new dataset report with more information about eo data status
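As a rough illustration of the new data format, the sketch below shows what a dataset csv that keeps labels, eo data, and eo status in one row could look like, and how a report and a set of training examples might be derived from it. The column names (`class_probability`, `eo_status`, `eo_data`) and the status values are assumptions made for this example, not necessarily the names used in this PR.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical column names and status values, chosen for illustration only.
SAMPLE_CSV = """lat,lon,class_probability,eo_status,eo_data
0.10,34.20,1.0,complete,"[[0.1, 0.2], [0.3, 0.4]]"
0.15,34.25,0.0,missing,
0.10,34.20,1.0,duplicate,
"""


def load_rows(text):
    """Parse the csv text into a list of dicts, one per labeled point."""
    return list(csv.DictReader(io.StringIO(text)))


def status_report(rows):
    """Count rows per eo status, the kind of summary a dataset report could show."""
    return Counter(row["eo_status"] for row in rows)


def training_examples(rows):
    """Keep only rows whose eo data is present and return (eo_data, label) pairs."""
    examples = []
    for row in rows:
        if row["eo_status"] != "complete":
            continue
        examples.append((json.loads(row["eo_data"]), float(row["class_probability"])))
    return examples


rows = load_rows(SAMPLE_CSV)
report = status_report(rows)
examples = training_examples(rows)
print(dict(report))
print(len(examples), "usable example(s)")
```

Because the label and its eo data live in the same row, no separate matching step between label files and pickled feature files is needed in this sketch.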

Generalizability:

  • Rename "features" to "datasets" in CLI and other places
  • Rename "tifs" to "eo" so the code is not restricted to tifs in the future

Pickle to csv justification:
Some numbers:

| Data type | pickle | h5py | csv |
| --- | --- | --- | --- |
| Train 2 epochs (1) | 16.66s user 4.14s system 81% cpu 25.581 total | 24.65s user 5.94s system 73% cpu 41.766 total | 28.66s user 2.58s system 105% cpu 29.711 total |
| Train 2 epochs (2) | 17.65s user 4.41s system 82% cpu 26.907 total | 25.66s user 6.54s system 70% cpu 45.999 total | 29.96s user 3.08s system 101% cpu 32.591 total |
| Train 2 epochs (3) | 15.46s user 3.36s system 94% cpu 19.923 total | 25.51s user 5.17s system 88% cpu 34.853 total | 29.39s user 2.74s system 103% cpu 30.938 total |
| Size | 86.7 MB | 302.9 MB | 121.4 MB |

Size: h5py is far too large (about 3.5x pickle). csv is about 40% larger than pickle (121.4 MB vs 86.7 MB).
CPU: csv has the highest CPU utilization, h5py the lowest.
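The size and timing figures above can be summarized with a short calculation. The sketch below only re-uses the wall-clock totals and file sizes reported in the table; the variable names are made up for this example. On these three runs csv averages roughly 31 s against roughly 24 s for pickle.

```python
from statistics import mean

# Wall-clock "total" seconds for the three "Train 2 epochs" runs in the table.
total_seconds = {
    "pickle": [25.581, 26.907, 19.923],
    "h5py": [41.766, 45.999, 34.853],
    "csv": [29.711, 32.591, 30.938],
}
# File sizes in MB from the "Size" row.
size_mb = {"pickle": 86.7, "h5py": 302.9, "csv": 121.4}

# Average each format's runs, then express time and size relative to pickle.
mean_seconds = {fmt: mean(runs) for fmt, runs in total_seconds.items()}
time_ratio = {fmt: mean_seconds[fmt] / mean_seconds["pickle"] for fmt in mean_seconds}
size_ratio = {fmt: size_mb[fmt] / size_mb["pickle"] for fmt in size_mb}

for fmt in total_seconds:
    print(
        f"{fmt:>6}: {mean_seconds[fmt]:6.2f}s ({time_ratio[fmt]:.2f}x pickle), "
        f"{size_mb[fmt]:6.1f} MB ({size_ratio[fmt]:.2f}x pickle)"
    )
```

Three runs per format is a small sample, so the ratios are indicative rather than a rigorous benchmark.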

From the numbers alone, it does not make sense to move away from pickle, although csvs are fairly close to pickle files in both time and storage.
The difference is that pickle requires:

  • being strict about the pickled python file
  • having a decent amount of code that deals with the relationship between pickle file and label (matching, tests, updates)
  • additional data that needs to be tracked and makes it harder for someone to get started

These requirements fall away when using csvs (see the number of lines deleted in this PR), which opens a path towards importable, self-contained datasets that need minimal surrounding code.

PS: Performance losses can most likely be addressed with dask

@ivanzvonkov ivanzvonkov changed the title Simpler generation Version 0.1.0 Jul 19, 2022