
Use EarthEngine API for fetching data #107

Merged
ivanzvonkov merged 54 commits into main from ee-api on Oct 3, 2022

Conversation

@ivanzvonkov (Contributor) commented Sep 22, 2022

Code Changes

New:

  • openmapflow create-datasets --ee_api uses the EarthEngine API for getting data; this is faster but requires an active machine during the entire dataset creation process
  • _find_matching_point_url uses an EO data URL to obtain the pixel time series (similar to _find_matching_point)
  • EarthEngineAPI().get_ee_url for obtaining an EarthEngine API URL for the desired EO data (see the sketch below)
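
For illustration, a minimal sketch of the kind of call get_ee_url wraps, built on ee.Image.getDownloadURL (the collection, buffer size, scale, and format below are placeholder assumptions, not the exact values used in openmapflow):

import ee

ee.Initialize()  # the PR uses a separate, high-volume initialization for this path

def get_ee_url_sketch(lat: float, lon: float, start: str, end: str) -> str:
    # Small bounding box around the labelled point (placeholder ~160m box)
    region = ee.Geometry.Point([lon, lat]).buffer(80).bounds()
    image = (
        ee.ImageCollection("COPERNICUS/S2_SR")  # placeholder collection
        .filterDate(start, end)
        .filterBounds(region)
        .median()
    )
    # getDownloadURL is the Earth Engine call this code path is built around
    return image.getDownloadURL({"region": region, "scale": 10, "format": "GEO_TIFF"})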

Minor:
start_date parameter is made optional for load_tif

Experiments

[ORIGINAL] Creating dataset with EE export tasks:

  • Total time: 16hrs 17m
  • Started on Sept 22, 8:30 am EST, and ended on Sept 23, 12:47 am EST
  • On a fresh run, it took 16m29s to start 1000 Earth Engine export tasks (logs)
  • When the data was already available in Google Cloud (1000 examples), it took 17m50s to get the pixel time series (logs)
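
For contrast, a minimal sketch of the export-task path being timed above: one Earth Engine export task per labelled point, written to Cloud Storage and read back later (the bucket, prefix, and scale are placeholders):

import ee

ee.Initialize()

def start_export_sketch(image: ee.Image, region: ee.Geometry, index: int) -> ee.batch.Task:
    # The task runs on Earth Engine's servers, so the local machine can shut down
    # once it is started; the resulting tif is read from Cloud Storage afterwards.
    task = ee.batch.Export.image.toCloudStorage(
        image=image,
        description=f"export_{index}",
        bucket="my-eo-data-bucket",      # placeholder bucket
        fileNamePrefix=f"tifs/{index}",  # placeholder prefix
        region=region,
        scale=10,
    )
    task.start()
    return task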

[NEW] Creating dataset with EE REST API:

  • Rough time estimate: 8hrs 38m (extrapolated from 10 examples taking 5m11s)
  • Did not complete within 6 hours, which is the GitHub job execution time limit

Appendix

Parallelizing the EE REST API calls across dask npartitions, with 10 examples (see the sketch below)

  • npartitions=1, time: 12m4s (logs)
  • npartitions=4, time: 5m11s (logs)
  • npartitions=10, time: 6m50s (logs)

Side note: I tried pandarallel too but ran into time-out issues.
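
For reference, a minimal sketch of how the npartitions sweep above can be set up with dask (the per-row function is a stand-in for _find_matching_point_url, and the column names are placeholders):

import dask.dataframe as dd
import pandas as pd

def fetch_pixel_timeseries(row) -> str:
    # Stand-in for _find_matching_point_url: get the download URL, fetch the
    # data, and extract the pixel time series for this labelled point.
    return f"eo_data_{row['lat']}_{row['lon']}"

labels = pd.DataFrame({"lat": [0.10, 0.20, 0.30, 0.40], "lon": [36.10, 36.20, 36.30, 36.40]})  # placeholder points
ddf = dd.from_pandas(labels, npartitions=4)  # the value swept above: 1, 4, 10
result = ddf.apply(fetch_pixel_timeseries, axis=1, meta=("eo_data", "object")).compute()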

Why is the EE REST API still slow?

  • Getting the download URL alone takes a while (22.4s, 36s, 42.3s)
  • Getting the data from the URL also takes a while (41.3s, 44.3s, 46.4s)
  • The rest of _find_matching_point_url, after downloading the data, takes about 3-4s per example
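
For context, a hedged sketch of the "get the data from the URL" step being timed above, assuming the URL returned by getDownloadURL points at a GeoTIFF (the rasterio-based loading is an assumption; the actual loading in openmapflow may differ):

import time

import requests
from rasterio.io import MemoryFile

def fetch_patch(url: str):
    # Download the patch behind an Earth Engine download URL and read it into
    # a (bands, height, width) array, printing the elapsed time.
    t0 = time.time()
    resp = requests.get(url)
    resp.raise_for_status()
    with MemoryFile(resp.content) as memfile:
        with memfile.open() as src:
            data = src.read()
    print(f"Downloaded and loaded in {time.time() - t0:.1f}s")
    return data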

@ivanzvonkov linked an issue on Sep 22, 2022 that may be closed by this pull request
@ivanzvonkov (Contributor, Author) commented Sep 22, 2022

@gabrieltseng
a. I've implemented the EarthEngine URL method for getting data but it seems pretty slow (see results in the description). I see you also used 4 processes here, did you get faster results?

b. It may be possible to increase the speed by (1) exploring asynchronous methods or (2) parallelizing through the multiprocessing library rather than dask, do you think these would be useful? [Also (3) modifying surrounding meters to make size smaller is an option]
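
For illustration, a minimal sketch of option (2), using the standard-library multiprocessing pool instead of dask (the per-point function is a stand-in, not the actual _find_matching_point_url signature):

from multiprocessing import Pool

def fetch_point(coords):
    lat, lon = coords
    # Stand-in for the per-point work: get URL, download data, extract pixel time series
    return lat, lon

if __name__ == "__main__":
    points = [(0.10, 36.10), (0.20, 36.20)]  # placeholder labelled points
    with Pool(processes=4) as pool:
        results = pool.map(fetch_point, points)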

c. Right now I think running openmapflow create-datasets (using ee export tasks) inside a GitHub action is a good default.
If we want to get data faster, we could spin up a big VM with lots of processors and run lots of parallel API requests.
What do you think?

@gabrieltseng (Contributor) commented

a. I've implemented the EarthEngine URL method for getting data but it seems pretty slow (see results in the description). I see you also used 4 processes here, did you get faster results?

I wasn't monitoring the times as closely - I was running about 3000 exports at a time, and was finding that it was taking much less time than creating Earth Engine tasks (e.g. ~1 day instead of multiple days). But I'd start the process on a VM and leave it, typically overnight.

@ivanzvonkov (Contributor, Author) commented

Answering @gabrieltseng + own questions

a. I've implemented the EarthEngine URL method for getting data but it seems pretty slow (see results in the description). I see you also used 4 processes here, did you get faster results?

I wasn't monitoring the times as closely - I was running about 3000 exports at a time, and was finding that it was taking much less time than creating Earth Engine tasks (e.g. ~1 day instead of multiple days). But I'd start the process on a VM and leave it, typically overnight.

Okay, this seems pretty close to the numbers and estimates I get for 1000 tasks.

b. It may be possible to increase the speed by (1) exploring asynchronous methods or (2) parallelizing through the multiprocessing library rather than dask, do you think these would be useful? [Also (3) modifying surrounding meters to make size smaller is an option]

I do not anticipate that any of these will result in a massive speed up, so I do not plan to explore them in the near term.

c. Right now I think running openmapflow create-datasets (using ee export tasks) inside a GitHub action is a good default.
If we want to get data faster, we could spin up a big VM with lots of processors and run lots of parallel API requests.
What do you think?

Suggested usage:

  • Locally: openmapflow create-datasets (ee task method, pro: allows the user to avoid a long-running process)
  • In a GitHub action: openmapflow create-datasets --non-interactive (ee task method, pro: starting exports and creating features fit into the GitHub action max execution time)
  • For LEM: openmapflow create-datasets --ee_api (ee API method, pro: faster, but requires a long-running process)

@@ -24,7 +24,7 @@ def load_labels(self) -> pd.DataFrame:
PROJECT_ROOT / DataPaths.RAW_LABELS / "hansen_labelled_data.csv"
)

df = df.sample(n=1000, random_state=42)
Contributor:

Curious as to why?

Contributor Author:

Because there are way too many points, this was a way to test a few

Contributor:

the change here is to the random_state - not super important, just curious why you changed it

Contributor Author:

Oh! I think this was to force new points to be exported rather than the 1000 that already existed in Google Cloud Storage when I initially tested

@@ -306,3 +317,33 @@ def export_for_labels(
):
print(f"Started {exports_started} exports. Ending export")
return None


class EarthEngineAPI:
Contributor:

Is this class necessary for now? It seems like get_ee_url could just be a standalone function.

Contributor:

It might make sense to move ee.Initialize to the top of this file so that any import of ee_exporter forces an initialization?

Contributor Author:

I lean towards keeping it as is:

  1. I am generally wary of automatically executing code on import, after the issue in CropHarvest where cartopy loads Natural Earth on almost every usage of the package.
  2. Right now the ee initializations differ (one has the high-volume API specified). Specifying the high-volume API for both is most likely fine, but it could also change some behavior, so I want to minimize the possibility of that (see the sketch below).
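
For context, a sketch of the two initializations being contrasted (the high-volume endpoint URL is the one documented by Earth Engine; where each call lives in the codebase is paraphrased from the discussion above):

import ee

# Export-task path: standard endpoint
ee.Initialize()

# EarthEngineAPI / getDownloadURL path: high-volume endpoint, intended for
# many small concurrent requests
ee.Initialize(opt_url="https://earthengine-highvolume.googleapis.com")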

@ivanzvonkov merged commit e285fb2 into main on Oct 3, 2022
@ivanzvonkov deleted the ee-api branch on October 3, 2022 at 13:45
Development

Successfully merging this pull request may close these issues.

Use Earth Engine's getDownloadURL to create training data faster
3 participants