
Use EarthEngine API for fetching data #107

Merged
ivanzvonkov merged 54 commits into main from ee-api on Oct 3, 2022

Conversation

@ivanzvonkov (Contributor) commented Sep 22, 2022

Code Changes

New:

  • openmapflow create-datasets --ee_api uses the EarthEngine API for getting data; this is faster but requires an active machine during the entire dataset creation process
  • _find_matching_point_url uses an EO data URL to obtain the pixel time series (similar to _find_matching_point)
  • EarthEngineAPI().get_ee_url for obtaining an EarthEngine API URL for the desired EO data (see the sketch below)
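
For illustration, a minimal sketch of the kind of call get_ee_url wraps, built on ee.Image.getDownloadURL (the collection, buffer size, scale, and format below are placeholder assumptions, not the exact values used in openmapflow):

import ee

ee.Initialize()  # the PR uses a separate, high-volume initialization for this path

def get_ee_url_sketch(lat: float, lon: float, start: str, end: str) -> str:
    # Small bounding box around the labelled point (placeholder ~160m box)
    region = ee.Geometry.Point([lon, lat]).buffer(80).bounds()
    image = (
        ee.ImageCollection("COPERNICUS/S2_SR")  # placeholder collection
        .filterDate(start, end)
        .filterBounds(region)
        .median()
    )
    # getDownloadURL is the Earth Engine call this code path is built around
    return image.getDownloadURL({"region": region, "scale": 10, "format": "GEO_TIFF"})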

Minor:
start_date parameter is made optional for load_tif

Experiments

[ORIGINAL] Creating dataset with EE export tasks:

  • Total time: 16hrs 17m
  • Started on Sept 22, 8:30 am EST, and ended on Sept 23, 12:47 am EST
  • On a fresh run, it took 16m29s to start 1000 Earth Engine export tasks (logs)
  • When the data was already available in Google Cloud (1000 examples), it took 17m50s to get the pixel time series (logs)
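
For contrast, a minimal sketch of the export-task path being timed above: one Earth Engine export task per labelled point, written to Cloud Storage and read back later (the bucket, prefix, and scale are placeholders):

import ee

ee.Initialize()

def start_export_sketch(image: ee.Image, region: ee.Geometry, index: int) -> ee.batch.Task:
    # The task runs on Earth Engine's servers, so the local machine can shut down
    # once it is started; the resulting tif is read from Cloud Storage afterwards.
    task = ee.batch.Export.image.toCloudStorage(
        image=image,
        description=f"export_{index}",
        bucket="my-eo-data-bucket",      # placeholder bucket
        fileNamePrefix=f"tifs/{index}",  # placeholder prefix
        region=region,
        scale=10,
    )
    task.start()
    return task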

[NEW] Creating dataset with EE REST API:

  • Rough time estimate: 8hrs 38m (extrapolated from 10 examples taking 5m11s)
  • Did not complete within 6 hours, which is the GitHub job execution time limit

Appendix

Parallelizing the EE REST API calls across dask npartitions, with 10 examples (see the sketch below)

  • npartitions=1, time: 12m4s (logs)
  • npartitions=4, time: 5m11s (logs)
  • npartitions=10, time: 6m50s (logs)

Side note: I tried pandarallel too but ran into time-out issues.
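
For reference, a minimal sketch of how the npartitions sweep above can be set up with dask (the per-row function is a stand-in for _find_matching_point_url, and the column names are placeholders):

import dask.dataframe as dd
import pandas as pd

def fetch_pixel_timeseries(row) -> str:
    # Stand-in for _find_matching_point_url: get the download URL, fetch the
    # data, and extract the pixel time series for this labelled point.
    return f"eo_data_{row['lat']}_{row['lon']}"

labels = pd.DataFrame({"lat": [0.10, 0.20, 0.30, 0.40], "lon": [36.10, 36.20, 36.30, 36.40]})  # placeholder points
ddf = dd.from_pandas(labels, npartitions=4)  # the value swept above: 1, 4, 10
result = ddf.apply(fetch_pixel_timeseries, axis=1, meta=("eo_data", "object")).compute()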

Why is the EE REST API still slow?

  • Getting the download URL alone takes a while (22.4s, 36s, 42.3s)
  • Getting the data from the URL also takes a while (41.3s, 44.3s, 46.4s)
  • The rest of _find_matching_point_url, after downloading the data, takes about 3-4s per example
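
For context, a hedged sketch of the "get the data from the URL" step being timed above, assuming the URL returned by getDownloadURL points at a GeoTIFF (the rasterio-based loading is an assumption; the actual loading in openmapflow may differ):

import time

import requests
from rasterio.io import MemoryFile

def fetch_patch(url: str):
    # Download the patch behind an Earth Engine download URL and read it into
    # a (bands, height, width) array, printing the elapsed time.
    t0 = time.time()
    resp = requests.get(url)
    resp.raise_for_status()
    with MemoryFile(resp.content) as memfile:
        with memfile.open() as src:
            data = src.read()
    print(f"Downloaded and loaded in {time.time() - t0:.1f}s")
    return data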

@ivanzvonkov linked an issue on Sep 22, 2022 that may be closed by this pull request
@ivanzvonkov (Contributor, Author) commented Sep 22, 2022

@gabrieltseng
a. I've implemented the EarthEngine URL method for getting data but it seems pretty slow (see results in the description). I see you also used 4 processes here, did you get faster results?

b. It may be possible to increase the speed by (1) exploring asynchronous methods or (2) parallelizing through the multiprocessing library rather than dask, do you think these would be useful? [Also (3) modifying surrounding meters to make size smaller is an option]
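
For illustration, a minimal sketch of option (2), using the standard-library multiprocessing pool instead of dask (the per-point function is a stand-in, not the actual _find_matching_point_url signature):

from multiprocessing import Pool

def fetch_point(coords):
    lat, lon = coords
    # Stand-in for the per-point work: get URL, download data, extract pixel time series
    return lat, lon

if __name__ == "__main__":
    points = [(0.10, 36.10), (0.20, 36.20)]  # placeholder labelled points
    with Pool(processes=4) as pool:
        results = pool.map(fetch_point, points)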

c. Right now I think running openmapflow create-datasets (using ee export tasks) inside a GitHub action is a good default.
If we want to get data faster, we could spin up a big VM with lots of processors and run lots of parallel API requests.
What do you think?

@gabrieltseng (Contributor) commented

a. I've implemented the EarthEngine URL method for getting data but it seems pretty slow (see results in the description). I see you also used 4 processes here, did you get faster results?

I wasn't monitoring the times as closely - I was running about 3000 exports at a time, and was finding that it was taking much less time than creating Earth Engine tasks (e.g. ~1 day instead of multiple days). But I'd start the process on a VM and leave it, typically overnight.

@ivanzvonkov (Contributor, Author) commented

Answering @gabrieltseng + own questions

a. I've implemented the EarthEngine URL method for getting data but it seems pretty slow (see results in the description). I see you also used 4 processes here, did you get faster results?

I wasn't monitoring the times as closely - I was running about 3000 exports at a time, and was finding that it was taking much less time than creating Earth Engine tasks (e.g. ~1 day instead of multiple days). But I'd start the process on a VM and leave it, typically overnight.

Okay, this seems pretty close to the numbers and estimates I get for 1000 tasks.

b. It may be possible to increase the speed by (1) exploring asynchronous methods or (2) parallelizing through the multiprocessing library rather than dask, do you think these would be useful? [Also (3) modifying surrounding meters to make size smaller is an option]

I do not anticipate that any of these will result in a massive speed up, so I do not plan to explore them in the near term.

c. Right now I think running openmapflow create-datasets (using ee export tasks) inside a GitHub action is a good default.
If we want to get data faster, we could spin up a big VM with lots of processors and run lots of parallel API requests.
What do you think?

Suggested usage:

  • Locally: openmapflow create-datasets (ee task method, pro: allows the user to avoid a long-running process)
  • In a GitHub action: openmapflow create-datasets --non-interactive (ee task method, pro: starting exports and creating features fit into the GitHub action max execution time)
  • For LEM: openmapflow create-datasets --ee_api (ee API method, pro: faster, but requires a long-running process)

@@ -24,7 +24,7 @@ def load_labels(self) -> pd.DataFrame:
PROJECT_ROOT / DataPaths.RAW_LABELS / "hansen_labelled_data.csv"
)

df = df.sample(n=1000, random_state=42)
Contributor:

Curious as to why?

Contributor Author:

Because there are way too many points, this was a way to test a few

Contributor:

the change here is to the random_state - not super important, just curious why you changed it

Contributor Author:

Oh! I think this was to force new points to be exported rather than the 1000 that already existed in Google Cloud Storage when I initially tested

@@ -306,3 +317,33 @@ def export_for_labels(
):
print(f"Started {exports_started} exports. Ending export")
return None


class EarthEngineAPI:
Contributor:

Is this class necessary for now? It seems like get_ee_url could just be a standalone function.

Contributor:

It might make sense to move ee.Initialize to the top of this file so that any import of ee_exporter forces an initialization?

Contributor Author:

I lean towards keeping it as is:

  1. I am generally wary of automatically executing code on import, after the issue in CropHarvest where cartopy loads Natural Earth on almost every usage of the package.
  2. Right now the ee initializations differ (one has the high-volume API specified). Specifying the high-volume API for both is most likely fine, but it could also change some behavior, so I want to minimize the possibility of that (see the sketch below).
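
For context, a sketch of the two initializations being contrasted (the high-volume endpoint URL is the one documented by Earth Engine; where each call lives in the codebase is paraphrased from the discussion above):

import ee

# Export-task path: standard endpoint
ee.Initialize()

# EarthEngineAPI / getDownloadURL path: high-volume endpoint, intended for
# many small concurrent requests
ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")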

@ivanzvonkov merged commit e285fb2 into main on Oct 3, 2022
@ivanzvonkov deleted the ee-api branch on October 3, 2022 at 13:45
Development

Successfully merging this pull request may close these issues.

Use Earth Engine's getDownloadURL to create training data faster
3 participants