Use EarthEngine API for fetching data #107
Conversation
@gabrieltseng b. It may be possible to increase the speed by (1) exploring asynchronous methods or (2) parallelizing through the multiprocessing library rather than dask; do you think these would be useful? [Also (3) modifying surrounding meters to make the size smaller is an option] c. Right now I think running
I wasn't monitoring the times as closely - I was running about 3000 exports at a time, and was finding that it was taking much less time than creating Earth Engine tasks (e.g. ~1 day instead of multiple days). But I'd start the process on a VM and leave it, typically overnight.
Answering @gabrieltseng + own questions
Okay, this seems pretty close to the numbers and estimates I get for 1000 tasks.
I do not anticipate any of these will result in a massive speed up and so do not plan to explore in the near term.
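As a rough sketch of options (1) and (2): since each export spends most of its time waiting on EarthEngine servers, a thread pool is one way the parallelization could look. Note that fetch_pixel_series and fetch_all are hypothetical stand-ins, not functions from this PR; the real work would be requesting a download URL and fetching the data.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_pixel_series(point_id: int) -> dict:
    # Stand-in for the real per-point work: ask the EarthEngine REST API
    # for a download URL, then download the pixel time series.
    # (fetch_pixel_series is a hypothetical name, not from the PR.)
    return {"point_id": point_id, "status": "done"}

def fetch_all(point_ids, max_workers: int = 8) -> list:
    # Threads suit this workload because it is I/O-bound (waiting on
    # EarthEngine servers), so they overlap the waiting without the
    # serialization overhead of multiprocessing.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_pixel_series, point_ids))

results = fetch_all(range(10))
```

This mirrors what dask's npartitions achieves, just with the standard library; whether it beats dask here is an empirical question.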
Suggested usage:
@@ -24,7 +24,7 @@ def load_labels(self) -> pd.DataFrame:
    PROJECT_ROOT / DataPaths.RAW_LABELS / "hansen_labelled_data.csv"
)

df = df.sample(n=1000, random_state=42)
Curious as to why?
Because there are way too many points, so this was a way to test on just a few.
The change here is to the random_state; not super important, just curious why you changed it.
Oh! I think this was to force new points to be exported rather than the 1000 that already existed in Google Cloud Storage when I initially tested.
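For illustration, a toy example of how random_state affects df.sample (a synthetic DataFrame here, not the Hansen labels):

```python
import pandas as pd

df = pd.DataFrame({"lat": range(100), "lon": range(100)})

# The same seed always yields the same subset, so re-running the
# pipeline would re-export the identical points.
a = df.sample(n=5, random_state=42)
b = df.sample(n=5, random_state=42)

# A different seed draws a (very likely) different subset, which is
# why bumping random_state forces fresh points to be exported instead
# of the ones already sitting in Cloud Storage.
c = df.sample(n=5, random_state=0)
```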
@@ -306,3 +317,33 @@ def export_for_labels(
):
    print(f"Started {exports_started} exports. Ending export")
    return None
class EarthEngineAPI:
Is this class necessary for now? It seems like get_ee_url could just be a standalone function.
It might make sense to move ee.Initialize to the top of this file so that any import of ee_exporter forces an initialization?
I lean towards keeping it as is:
- I am generally wary of automatically executing code on import, after the issue in CropHarvest where cartopy loads Natural Earth data on almost every usage of the package.
- Right now the ee initializations differ (one has the high-volume API specified). Specifying the high-volume API for both is most likely fine but could also change some sort of behavior, so I want to minimize the possibility of that.
Code Changes
New:
- openmapflow create-datasets --ee_api uses the EarthEngine API for getting data; this is faster but does require an active machine during the entire dataset creation process
- _find_matching_point_url uses an EO data URL to obtain the pixel time series (similar to _find_matching_point)
- EarthEngineAPI().get_ee_url for obtaining an EarthEngine API URL for the desired EO data

Minor:
- the start_date parameter is made optional for load_tif
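The start_date change could look like the sketch below. load_tif's real signature and body are not shown in this thread, so this is a hypothetical stub illustrating only the optional parameter and the two resulting call styles.

```python
from datetime import date
from typing import Optional, Tuple

def load_tif(path: str, start_date: Optional[date] = None) -> Tuple[str, Optional[date]]:
    # Hypothetical stub: when start_date is omitted the loader would
    # skip attaching a time coordinate; callers that need a timestamp
    # pass it explicitly. The real function returns raster data.
    return (path, start_date)

# Both call styles are now valid:
no_date = load_tif("example.tif")
with_date = load_tif("example.tif", start_date=date(2020, 1, 1))
```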
Experiments
[ORIGINAL] Creating dataset with EE export tasks:
[NEW] Creating dataset with EE Rest API:
Appendix
Parallelizing EE Rest API npartitions with 10 examples:
- npartitions=1, time: 12m4s, logs
- npartitions=4, time: 5m11s, logs
- npartitions=10, time: 6m50s, logs

Side note: tried pandarallel too but ran into some sort of time-out issues.

Why is the EE Rest API still slow?
- Getting the download URL alone takes a while (22.4s, 36s, 42.3s)
- Getting the data from the URL alone also takes a while (41.3s, 44.3s, 46.4s)
- The rest of _find_matching_point_url, after downloading the data, takes about 3-4s per example
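Plugging the reported timings into a quick back-of-envelope check (pure arithmetic on the numbers above, 10 examples per run):

```python
# Wall-clock seconds for 10 examples at each npartitions setting.
timings_s = {1: 12 * 60 + 4, 4: 5 * 60 + 11, 10: 6 * 60 + 50}

serial = timings_s[1]
speedups = {n: round(serial / t, 2) for n, t in timings_s.items()}

# npartitions=4 is the sweet spot (~2.3x over serial); npartitions=10
# is slower again, consistent with per-request latency (URL generation
# plus download) dominating rather than local compute.
best = max(speedups, key=speedups.get)
```

The sub-linear scaling matches the breakdown above: most of each example's time is server-side latency, which extra partitions cannot shrink.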