Use EarthEngine API for fetching data #107

Merged (54 commits, Oct 3, 2022)

Commits
da41b32 Reduce labels for testing (ivanzvonkov, Sep 22, 2022)
8267d0b Remove processed file for testing (ivanzvonkov, Sep 22, 2022)
5dc5951 Create separate function for ee_image and add EarthEngineAPI (ivanzvonkov, Sep 22, 2022)
3f25aa7 Optional start date (ivanzvonkov, Sep 22, 2022)
ceddaa1 dates passed directly (ivanzvonkov, Sep 22, 2022)
82898b1 Create datasets use ee API (ivanzvonkov, Sep 22, 2022)
1eee323 Attempt running data pipeline inside github action (ivanzvonkov, Sep 22, 2022)
2209ac2 Ensure dask is installed (ivanzvonkov, Sep 22, 2022)
8791e7d Authenticate ee attempt (ivanzvonkov, Sep 22, 2022)
dc22d9a Cache and authenticate gcloud (ivanzvonkov, Sep 22, 2022)
95dfe7f Quiet arg (ivanzvonkov, Sep 22, 2022)
435188e Try app default again (ivanzvonkov, Sep 22, 2022)
ddcb43f Fix yaml (ivanzvonkov, Sep 22, 2022)
c4fea32 Login by default (ivanzvonkov, Sep 22, 2022)
f73714d Remove CLI auth (ivanzvonkov, Sep 22, 2022)
d1bfb9c hardcover service account key credentials (ivanzvonkov, Sep 22, 2022)
24c3bca Login with sa credentials if available (ivanzvonkov, Sep 22, 2022)
8fb5255 Read email from file (ivanzvonkov, Sep 22, 2022)
99721fc Try a different npartitions (ivanzvonkov, Sep 22, 2022)
b4a8c4b 1 partition test (ivanzvonkov, Sep 22, 2022)
c0e248b mock initialize (ivanzvonkov, Sep 22, 2022)
8245784 Remove f string (ivanzvonkov, Sep 22, 2022)
a672693 Set ref (ivanzvonkov, Sep 22, 2022)
5a2b0ce back to 4 (ivanzvonkov, Sep 22, 2022)
1a609fc Only run tests if changes in own directories (ivanzvonkov, Sep 22, 2022)
1cf502a Automated dataset updates (your-username, Sep 22, 2022)
b2bafb0 Merge branch 'main' into ee-api (ivanzvonkov, Sep 22, 2022)
6a9917c Process 1000 points (ivanzvonkov, Sep 22, 2022)
a0b9d0e No ee api by default (ivanzvonkov, Sep 22, 2022)
4fa8e40 better bot name (ivanzvonkov, Sep 22, 2022)
d6bf537 missing pd.index (ivanzvonkov, Sep 22, 2022)
c05cb5f Automated dataset updates (DATASET-bot, Sep 22, 2022)
51b6234 Regenerate 1000 point dataset with new points (ivanzvonkov, Sep 22, 2022)
f094c41 Automated dataset updates (DATASET-bot, Sep 22, 2022)
0a42ff3 Skip interactive portion (ivanzvonkov, Sep 22, 2022)
4ba303c Automated dataset updates (DATASET-bot, Sep 22, 2022)
851e2a3 Automated dataset updates (DATASET-bot, Sep 22, 2022)
7cecb09 Automated dataset updates (DATASET-bot, Sep 22, 2022)
f4d76ba Automated dataset updates (DATASET-bot, Sep 23, 2022)
070249e Automated dataset updates (DATASET-bot, Sep 23, 2022)
05783cc test engineer test (ivanzvonkov, Sep 23, 2022)
b033c27 argparse and clean ee api code (ivanzvonkov, Sep 23, 2022)
f6e90c0 Remove date parameter (ivanzvonkov, Sep 23, 2022)
b605b74 Sort engineer imports (ivanzvonkov, Sep 23, 2022)
508291c Test for find_matching_point url (ivanzvonkov, Sep 23, 2022)
230c612 Improve CLI message (ivanzvonkov, Sep 23, 2022)
5f7a06c Merge branch 'ee-api' of github.com:nasaharvest/openmapflow into ee-api (ivanzvonkov, Sep 23, 2022)
ff9d447 Reduce line length (ivanzvonkov, Sep 23, 2022)
1395cc9 Set non-interactive mode (ivanzvonkov, Sep 23, 2022)
b588a8b Automated dataset updates (DATASET-bot, Sep 23, 2022)
8bfdae8 Allow git commit to fail (ivanzvonkov, Sep 23, 2022)
fddf213 Merge branch 'ee-api' of github.com:nasaharvest/openmapflow into ee-api (ivanzvonkov, Sep 23, 2022)
b120d3a Update version (ivanzvonkov, Sep 23, 2022)
d0ed10e Merge branch 'main' into ee-api (ivanzvonkov, Oct 3, 2022)
Changes from all commits
2 changes: 2 additions & 0 deletions .github/workflows/buildings-example-test.yaml
@@ -8,6 +8,8 @@ on:
branches: [ main ]
pull_request:
branches: [ main ]
paths:
- 'buildings-example/**'

jobs:
test:
2 changes: 2 additions & 0 deletions .github/workflows/crop-mask-example-test.yaml
@@ -8,6 +8,8 @@ on:
branches: [ main ]
pull_request:
branches: [ main ]
paths:
- 'crop-mask-example/**'

jobs:
test:
30 changes: 24 additions & 6 deletions .github/workflows/forest-example-test.yaml
@@ -8,6 +8,8 @@ on:
branches: [ main ]
pull_request:
branches: [ main ]
paths:
- 'forest-example/**'

jobs:
test:
@@ -18,19 +20,35 @@
steps:
- name: Clone repo
uses: actions/checkout@v2
with:
ref: ${{ github.event.pull_request.head.ref }}
- name: Set up python
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: 3.8
- name: Install dependencies
run: pip install -r requirements.txt
- run: pip install -r requirements.txt

- name: dvc pull data
- uses: google-github-actions/auth@v0
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Run data pipeline
env:
# https://dvc.org/doc/user-guide/setup-google-drive-remote#authorization
GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
run: dvc pull -f

GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}
run: |
dvc pull -f
openmapflow create-datasets --non-interactive
dvc commit -f
dvc push
- name: Push automated dataset updates
run: |
git config --global user.name 'Dataset bot'
git config --global user.email '[email protected]'
git pull
git add data
git commit -m "Automated dataset updates" || echo "No updates to commit"
git push
- name: Integration test - Project
run: |
openmapflow cp templates/integration_test_project.py .
2 changes: 2 additions & 0 deletions .github/workflows/maize-example-test.yaml
@@ -8,6 +8,8 @@ on:
branches: [ main ]
pull_request:
branches: [ main ]
paths:
- 'maize-example/**'

jobs:
test:
4 changes: 2 additions & 2 deletions forest-example/data/datasets.dvc
@@ -1,5 +1,5 @@
outs:
- md5: 718c5017dec70570f87d1ca1941db208.dir
size: 5789085
- md5: 3d8ac3ef8c473bb3445b9b67a0fdbc33.dir
size: 5403436
nfiles: 1
path: datasets
2 changes: 1 addition & 1 deletion forest-example/datasets.py
@@ -24,7 +24,7 @@ def load_labels(self) -> pd.DataFrame:
PROJECT_ROOT / DataPaths.RAW_LABELS / "hansen_labelled_data.csv"
)

df = df.sample(n=1000, random_state=42)
Contributor:

Curious as to why?

Contributor (Author):

Because there are way too many points, so this was a way to test a few.

Contributor:

the change here is to the random_state - not super important, just curious why you changed it

Contributor (Author):

Oh! I think this was to force new points to be exported rather than the 1000 that already existed in Google Cloud Storage when I initially tested

df = df.sample(n=1000, random_state=43)

# Rename coordinate columns to be used for getting Earth observation data
df.rename(columns={"lon": LAT, "lat": LON}, inplace=True)
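
To make the thread above concrete, here is a toy sketch (synthetic data, not the project's labels) of why bumping random_state forces a fresh export: the new seed selects a mostly different set of rows, so most of the chosen points have no tif file waiting in Cloud Storage yet.

import pandas as pd

# Stand-in for the label CSV; only the row count matters here.
df = pd.DataFrame({"lat": range(10_000)})

old = df.sample(n=1000, random_state=42)
new = df.sample(n=1000, random_state=43)

# The two samples share only a small fraction of their rows.
overlap = len(set(old.index) & set(new.index))
print(f"{overlap} of 1000 sampled points overlap")  # typically around 100
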
2 changes: 1 addition & 1 deletion openmapflow/constants.py
@@ -13,7 +13,7 @@
TEMPLATE_README = TEMPLATES_DIR / "README.md"
TEMPLATE_DEPLOY_YML = TEMPLATES_DIR / "github-deploy.yaml"
TEMPLATE_TEST_YML = TEMPLATES_DIR / "github-test.yaml"
VERSION = "0.2.0rc1"
VERSION = "0.2.1rc1"

# -------------- Dataframe column names --------------------------------------
SOURCE = "source"
161 changes: 101 additions & 60 deletions openmapflow/ee_exporter.py
@@ -1,3 +1,5 @@
import json
import os
import warnings
from datetime import date, timedelta
from typing import Dict, List, Optional, Union
@@ -104,6 +106,69 @@ def ee_safe_str(s: str):
return s.replace(".", "-").replace("=", "-").replace("/", "-")[:100]


def create_ee_image(
polygon: "ee.Geometry.Polygon",
start_date: date,
end_date: date,
days_per_timestep: int = DAYS_PER_TIMESTEP,
):
image_collection_list: List[ee.Image] = []
cur_date = start_date
cur_end_date = cur_date + timedelta(days=days_per_timestep)

# first, we get all the S1 images in an exaggerated date range
vv_imcol, vh_imcol = get_s1_image_collection(
polygon, start_date - timedelta(days=31), end_date + timedelta(days=31)
)

while cur_end_date <= end_date:
image_list: List[ee.Image] = []

# first, the S1 image which gets the entire s1 collection
image_list.append(
get_single_s1_image(
region=polygon,
start_date=cur_date,
end_date=cur_end_date,
vv_imcol=vv_imcol,
vh_imcol=vh_imcol,
)
)
for image_function in DYNAMIC_IMAGE_FUNCTIONS:
image_list.append(
image_function(
region=polygon, start_date=cur_date, end_date=cur_end_date
)
)
image_collection_list.append(ee.Image.cat(image_list))

cur_date += timedelta(days=days_per_timestep)
cur_end_date += timedelta(days=days_per_timestep)

# now, we want to take our image collection and append the bands into a single image
imcoll = ee.ImageCollection(image_collection_list)
combine_bands_function = make_combine_bands_function(DYNAMIC_BANDS)
img = ee.Image(imcoll.iterate(combine_bands_function))

    # finally, we add the SRTM image separately since it's static in time
total_image_list: List[ee.Image] = [img]
for static_image_function in STATIC_IMAGE_FUNCTIONS:
total_image_list.append(static_image_function(region=polygon))

return ee.Image.cat(total_image_list)


def get_ee_credentials():
gcp_sa_key = os.environ.get("GCP_SA_KEY")
if gcp_sa_key is not None:
gcp_sa_email = json.loads(gcp_sa_key)["client_email"]
print(f"Logging into EarthEngine with {gcp_sa_email}")
return ee.ServiceAccountCredentials(gcp_sa_email, key_data=gcp_sa_key)
else:
print("Logging into EarthEngine with default credentials")
return "persistent"


class EarthEngineExporter:
"""
Export satellite data from Earth engine. It's called using the following
@@ -121,24 +186,10 @@
"""

def __init__(
self,
dest_bucket: str,
check_ee: bool = False,
check_gcp: bool = False,
credentials: Optional[str] = None,
days_per_timestep: int = DAYS_PER_TIMESTEP,
self, dest_bucket: str, check_ee: bool = False, check_gcp: bool = False
) -> None:
self.dest_bucket = dest_bucket
self.days_per_timestep = days_per_timestep
try:
if credentials:
ee.Initialize(credentials=credentials)
else:
ee.Initialize()
except Exception:
print(
"This code may not work if you have not authenticated your earthengine account"
)
ee.Initialize(get_ee_credentials())
self.check_ee = check_ee
self.ee_task_list = get_ee_task_list() if self.check_ee else []
self.check_gcp = check_gcp
@@ -172,50 +223,7 @@ def _export_for_polygon(
if len(self.ee_task_list) >= 3000:
return False

image_collection_list: List[ee.Image] = []
cur_date = start_date
cur_end_date = cur_date + timedelta(days=self.days_per_timestep)

# first, we get all the S1 images in an exaggerated date range
vv_imcol, vh_imcol = get_s1_image_collection(
polygon, start_date - timedelta(days=31), end_date + timedelta(days=31)
)

while cur_end_date <= end_date:
image_list: List[ee.Image] = []

# first, the S1 image which gets the entire s1 collection
image_list.append(
get_single_s1_image(
region=polygon,
start_date=cur_date,
end_date=cur_end_date,
vv_imcol=vv_imcol,
vh_imcol=vh_imcol,
)
)
for image_function in DYNAMIC_IMAGE_FUNCTIONS:
image_list.append(
image_function(
region=polygon, start_date=cur_date, end_date=cur_end_date
)
)
image_collection_list.append(ee.Image.cat(image_list))

cur_date += timedelta(days=self.days_per_timestep)
cur_end_date += timedelta(days=self.days_per_timestep)

# now, we want to take our image collection and append the bands into a single image
imcoll = ee.ImageCollection(image_collection_list)
combine_bands_function = make_combine_bands_function(DYNAMIC_BANDS)
img = ee.Image(imcoll.iterate(combine_bands_function))

# finally, we add the SRTM image seperately since its static in time
total_image_list: List[ee.Image] = [img]
for static_image_function in STATIC_IMAGE_FUNCTIONS:
total_image_list.append(static_image_function(region=polygon))

img = ee.Image.cat(total_image_list)
img = create_ee_image(polygon, start_date, end_date)

# and finally, export the image
if not test:
@@ -281,6 +289,9 @@ def export_for_labels(
for expected_column in [START, END, LAT, LON]:
assert expected_column in labels

labels[START] = pd.to_datetime(labels[START]).dt.date
labels[END] = pd.to_datetime(labels[END]).dt.date

exports_started = 0
print(f"Exporting {len(labels)} labels: ")

@@ -306,3 +317,33 @@
):
print(f"Started {exports_started} exports. Ending export")
return None


class EarthEngineAPI:
Contributor:

Is this class necessary for now? It seems like get_ee_url could just be a standalone function.

Contributor:

It might make sense to move ee.Initialize to the top of this file so that any import of ee_exporter forces an initialization?

Contributor (Author):

I lean towards keeping it as is:

  1. I am generally wary of automatically executing code on import, after the issue in CropHarvest where cartopy loads Natural Earth on almost every usage of the package.
  2. Right now the ee initializations differ (one has the high-volume API specified). Specifying the high-volume API for both is most likely fine, but it could also change some sort of behavior, so I want to minimize the possibility of that.

"""
Fetch satellite data from Earth engine by URL.
Credentials are resolved via get_ee_credentials(): a service account key
from the GCP_SA_KEY environment variable if set, otherwise the default
persistent credentials.
"""

def __init__(self) -> None:
ee.Initialize(
get_ee_credentials(),
opt_url="https://earthengine-highvolume.googleapis.com",
)

def get_ee_url(self, lat, lon, start_date, end_date):
ee_bbox = EEBoundingBox.from_centre(
mid_lat=lat,
mid_lon=lon,
surrounding_metres=80,
).to_ee_polygon()
img = create_ee_image(ee_bbox, start_date, end_date)
return img.getDownloadURL(
{
"region": ee_bbox,
"scale": 10,
"filePerBand": False,
"format": "GEO_TIFF",
}
)
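
A minimal usage sketch of the new URL-based flow, assuming Earth Engine credentials are available (the coordinates, dates, and use of requests here are illustrative, not part of this PR):

from datetime import date

import requests

from openmapflow.ee_exporter import EarthEngineAPI

api = EarthEngineAPI()  # initializes ee against the high-volume endpoint

url = api.get_ee_url(
    lat=7.72,
    lon=1.18,
    start_date=date(2020, 2, 1),
    end_date=date(2021, 2, 1),
)

# The URL serves a single multi-band GeoTIFF covering the ~160 m box
# around the point (surrounding_metres=80 in each direction).
with open("eo_data.tif", "wb") as f:
    f.write(requests.get(url).content)

Since each request is small and independent, the high-volume endpoint is the natural fit here, which matches the reviewer discussion above.
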
18 changes: 11 additions & 7 deletions openmapflow/engineer.py
@@ -19,7 +19,7 @@

def load_tif(
filepath: Path,
start_date: datetime,
start_date: Optional[datetime] = None,
num_timesteps: Optional[int] = None,
):
r"""
@@ -56,12 +56,16 @@ def load_tif(
time_specific_da["band"] = range(bands_per_timestep + len(STATIC_BANDS))
da_split_by_time.append(time_specific_da)

timesteps = [
start_date + timedelta(days=DAYS_PER_TIMESTEP) * i
for i in range(len(da_split_by_time))
]

dynamic_data = xr.concat(da_split_by_time, pd.Index(timesteps, name="time"))
if start_date:
timesteps = [
start_date + timedelta(days=DAYS_PER_TIMESTEP) * i
for i in range(len(da_split_by_time))
]
dynamic_data = xr.concat(da_split_by_time, pd.Index(timesteps, name="time"))
else:
dynamic_data = xr.concat(
da_split_by_time, pd.Index(range(len(da_split_by_time)), name="time")
)
dynamic_data.attrs["band_descriptions"] = BANDS

return dynamic_data, average_slope
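
A small sketch of the two call forms load_tif now supports (the file path is illustrative):

from datetime import datetime
from pathlib import Path

from openmapflow.engineer import load_tif

tif_path = Path("eo_data.tif")

# With a start date, the "time" coordinate holds real dates spaced
# DAYS_PER_TIMESTEP apart, as before.
data, average_slope = load_tif(tif_path, start_date=datetime(2020, 2, 1))

# Without a start date (new in this PR), the "time" coordinate is a
# plain integer index over the timesteps found in the file.
data, average_slope = load_tif(tif_path)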