Version 0.1.0 #70

ivanzvonkov · 2022-07-18T15:02:10Z

Version 0.1.0 updates

Generation:

Update generate.ipynb to also generate the GitHub GDRIVE_CREDENTIALS_DATA secret
Generate requirements.txt for each project

New data format:

Pickle to csv transition: introduce new dataset csvs that contain both labels and earth observation (eo) data and track eo status (duplicates, missing, etc.) [justification below]
Upgrade datasets pipeline to generate these new csvs
Upgrade all datasets and projects to use these new csvs
Remove all code that is only relevant to the previous dataset pipeline (features.py, many dvc files)
Create new dataset report with more information about eo data status
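
To make the new format concrete, here is a minimal sketch of what one such dataset csv might look like and how eo status could be summarized for a report. The column names (`lat`, `lon`, `class_probability`, `eo_data`, `eo_status`), the status values, and the helper functions are assumptions for illustration and are not necessarily what this PR uses.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical rows: each row carries the label, the earth observation (eo)
# time series serialized as JSON, and a status describing the eo data.
ROWS = [
    {"lat": 0.10, "lon": 34.50, "class_probability": 1.0,
     "eo_data": json.dumps([[0.12, 0.30], [0.15, 0.28]]), "eo_status": "complete"},
    {"lat": 0.20, "lon": 34.60, "class_probability": 0.0,
     "eo_data": "", "eo_status": "missing"},
    {"lat": 0.10, "lon": 34.50, "class_probability": 1.0,
     "eo_data": "", "eo_status": "duplicate"},
]
FIELDS = ["lat", "lon", "class_probability", "eo_data", "eo_status"]


def write_dataset_csv(rows):
    """Serialize label rows and their eo data into a single csv string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_training_examples(csv_text):
    """Return (eo_array, label) pairs for rows whose eo data is complete."""
    examples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["eo_status"] == "complete":
            examples.append((json.loads(row["eo_data"]), float(row["class_probability"])))
    return examples


def eo_status_report(csv_text):
    """Count rows per eo status, the kind of summary a dataset report could show."""
    return Counter(row["eo_status"] for row in csv.DictReader(io.StringIO(csv_text)))


csv_text = write_dataset_csv(ROWS)
examples = read_training_examples(csv_text)
report = eo_status_report(csv_text)
print(dict(report))
```

Because the status lives next to the label, rows with missing or duplicate eo data can be filtered or reported without consulting any other file.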

Generalizability:

Rename "features" to "datasets" in CLI and other places
Rename "tifs" to "eo" to not be restricted to tifs in future

Pickle to csv justification:
Some numbers:

Data type	pickle	h5py	csv
Train 2 epochs (1)	16.66s user 4.14s system 81% cpu 25.581 total	24.65s user 5.94s system 73% cpu 41.766 total	28.66s user 2.58s system 105% cpu 29.711 total
Train 2 epochs (2)	17.65s user 4.41s system 82% cpu 26.907 total	25.66s user 6.54s system 70% cpu 45.999 total	29.96s user 3.08s system 101% cpu 32.591 total
Train 2 epochs (3)	15.46s user 3.36s system 94% cpu 19.923 total	25.51s user 5.17s system 88% cpu 34.853 total	29.39s user 2.74s system 103% cpu 30.938 total
Size	86.7 MB	302.9 MB	121.4 MB

Size: h5py is far too large (about 3.5x pickle). csv is about 40% larger than pickle
CPU utilization: csv is highest, h5py is lowest
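
As a quick check on the comparison, the arithmetic can be sketched as follows. The wall-clock totals and sizes are copied from the table above; the helper name `relative_to_pickle` is only illustrative.

```python
from statistics import mean

# Wall-clock totals (seconds) and file sizes (MB) copied from the table above.
TOTAL_SECONDS = {
    "pickle": [25.581, 26.907, 19.923],
    "h5py": [41.766, 45.999, 34.853],
    "csv": [29.711, 32.591, 30.938],
}
SIZE_MB = {"pickle": 86.7, "h5py": 302.9, "csv": 121.4}


def relative_to_pickle(values):
    """Express each format's value as a multiple of the pickle value."""
    return {name: value / values["pickle"] for name, value in values.items()}


mean_seconds = {name: mean(runs) for name, runs in TOTAL_SECONDS.items()}
time_ratio = relative_to_pickle(mean_seconds)
size_ratio = relative_to_pickle(SIZE_MB)

for name in TOTAL_SECONDS:
    print(
        f"{name:>6}: {mean_seconds[name]:6.2f}s mean total ({time_ratio[name]:.2f}x pickle), "
        f"{SIZE_MB[name]:6.1f} MB ({size_ratio[name]:.2f}x pickle)"
    )
```

With only three runs per format, these ratios are a rough indication rather than a benchmark.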

From the numbers alone, it does not make sense to move away from pickle, although csvs stay fairly close to pickle files in both time (about 31 s versus 24 s mean total) and storage.
The difference is that pickle requires:

being strict about the pickled python file
having a decent amount of code that deals with the relationship between pickle file and label (matching, tests, updates)
additional data that needs to be tracked and makes it harder for someone to get started

These requirements fall away when using csvs (see the number of lines deleted in this PR), which opens a path towards importable, self-contained datasets that need minimal surrounding code.
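
The contrast can be sketched roughly as below. This is not the code removed in this PR: the file naming rule `feature_filename`, the column names, and the `;`-separated `eo_data` encoding are all hypothetical and only illustrate why the matching step disappears once label and eo data share a row.

```python
import csv
import io
import pickle
import tempfile
from pathlib import Path

LABELS = [
    {"lat": "0.1", "lon": "34.5", "label": "1"},
    {"lat": "0.2", "lon": "34.6", "label": "0"},
]


def feature_filename(row):
    """Hypothetical naming rule linking a label row to its pickle file."""
    return f"lat={row['lat']}_lon={row['lon']}.pkl"


def load_from_pickles(labels, feature_dir):
    """Pickle layout: every label has to be matched to a separate file on disk."""
    matched, unmatched = [], []
    for row in labels:
        path = feature_dir / feature_filename(row)
        if path.exists():
            with path.open("rb") as f:
                matched.append((pickle.load(f), int(row["label"])))
        else:
            unmatched.append(row)
    return matched, unmatched


def load_from_csv(csv_text):
    """csv layout: label and eo data share a row, so there is no matching step."""
    return [
        ([float(v) for v in row["eo_data"].split(";")], int(row["label"]))
        for row in csv.DictReader(io.StringIO(csv_text))
        if row["eo_data"]
    ]


with tempfile.TemporaryDirectory() as tmp:
    feature_dir = Path(tmp)
    # Only the first label has a pickle file; the second must be tracked as unmatched.
    with (feature_dir / feature_filename(LABELS[0])).open("wb") as f:
        pickle.dump([0.12, 0.30], f)
    matched, unmatched = load_from_pickles(LABELS, feature_dir)

CSV_TEXT = "lat,lon,label,eo_data\n0.1,34.5,1,0.12;0.30\n0.2,34.6,0,\n"
from_csv = load_from_csv(CSV_TEXT)
print(matched == from_csv, len(unmatched))
```

In the pickle layout the naming rule, the unmatched bookkeeping, and the directory of files all have to be kept in sync with the labels; in the csv layout a single file is enough.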

PS: Performance losses can most likely be addressed with dask

review-notebook-app · 2022-07-18T15:02:15Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

ivanzvonkov added 9 commits July 15, 2022 12:51

Clean generate notebook

514449c

Generate requirements file also

72a073d

Generate notebook entirely creates repo

50eab95

Upgrade test project

7b40c2a

Skip if certain files don't exist

bdf9717

Check for unused features

acc2407

Update version

044d7ed

Auto generate description

57601d8

Regenerate project with 0.0.3

b7c049a

ivanzvonkov added 20 commits July 18, 2022 12:46

Make test actually fail

8b1dad1

remove unused import

53cfd67

Remove unused notebook

7addbfe

Move features into csv

84e63d0

Regenerate crop-mask project

b34c1ed

Add new datasets to dvc

cfac6e4

Messed up copy and paste

324638d

formatting

2a8a4ed

fix datapath

f2a7974

Correct crop-mask-example bucket

9921600

Regenerate buildings-example

ffbb72d

Rename report

a5a6f3c

Upgrade maize-example

4d6ed0c

Update buildings dataset

50c4a4e

Tutorial uses 0.1.0

8503cf2

Write report

3a96fd9

ensure status is included

78a1bbf

pin openmapflow version

35f4797

Remove features naming

5d06459

Update duse create_datasets

85d3411

ivanzvonkov added 9 commits July 19, 2022 10:05

Ensure order stays the same

59d7488

regenerate reports

caf7639

Update datasets

da69a2d

Remove duplicates

0ee21e6

use eo vs tifs

007e40d

Rename to eo

6b56cbd

continue eo renaming

93cb1b2

Regenerate projects

f300473

Merge branch 'main' into clean-generate

1f6c0d3

ivanzvonkov changed the title ~~Simpler generation~~ Version 0.1.0 Jul 19, 2022

ivanzvonkov added 8 commits July 19, 2022 14:31

gee bug

8fc54a3

raw labels bug

98d16e8

Test adding a new dataset

c11af7d

Regenerate reports

b55253d

Consistent eo_data prefix

856104c

Setup eo cols in a function

732b4f4

Simpler notebook for adding data

f47b574

Update datasets

709bdad

ivanzvonkov merged commit 37be210 into main Jul 19, 2022

ivanzvonkov deleted the clean-generate branch July 19, 2022 20:26

This was referenced Jul 20, 2022

Port to OpenMapFlow nasaharvest/crop-mask#200

Merged

Allow feature caching through Google Cloud Storage #19

Closed
