Merge pull request #59 from nasaharvest/additional-testing

Additional testing

ivanzvonkov authored Jul 11, 2022
2 parents be8cd33 + 74881b1 commit 5128d4e

Showing 11 changed files with 272 additions and 44 deletions.
101 changes: 88 additions & 13 deletions README.md
@@ -30,6 +30,7 @@ Rapid map creation with machine learning and earth observation data.
![3maps-gif](assets/3maps.gif)

* [Tutorial](#tutorial-)
* [How it works](#how-it-works)
* [Generating a project](#generating-a-project-)
* [Adding data](#adding-data-)
* [Training a model](#training-a-model-)
@@ -39,29 +40,82 @@ Rapid map creation with machine learning and earth observation data.
Colab notebook tutorial demonstrating data exploration, model training, and inference over a small region.

**Prerequisites:**
- Github account
- Github access token (obtained [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token))
- Forked OpenMapFlow repository
- [Forked OpenMapFlow repository](https://github.com/nasaharvest/openmapflow/fork)
- Basic Python knowledge

## How it works

To create your own maps with OpenMapFlow, you need to:
1. [Generate your own OpenMapFlow project](#generating-a-project-); this will allow you to:
   1. [Add your own labeled data](#adding-data-),
   2. [Train a model](#training-a-model-) using that labeled data, and
   3. [Create a map](#creating-a-map-) using the trained model.

![openmapflow-pipeline](assets/pipeline.png)

## Generating a project [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/generate_project.ipynb)
Inside a Github repository run:

**Prerequisites:**
- [ ] [Github repository](https://docs.github.com/en/get-started/quickstart/create-a-repo) - where your project will be stored
- [ ] [Google/Gmail based account](https://www.google.com/account/about/) - for accessing Google Drive and Google Cloud
- [ ] [Google Cloud Project](https://console.cloud.google.com/projectcreate) - for deploying Cloud resources for creating a map ([additional info](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console))

Once all prerequisites are satisfied, inside your Github repository run:
```bash
pip install openmapflow
openmapflow generate
```
This generates a project for: Adding data ➞ Training a model ➞ Creating a map
The command will prompt for project configuration such as the project name and Google Cloud project ID. Several prompts show defaults in square brackets; these are used if nothing is entered.
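This default-in-brackets behavior mirrors the prompt pattern used in `openmapflow/generate.py` further down in this diff; a minimal, self-contained sketch (the function name is illustrative):

```python
def prompt_with_default(label: str, default: str, read=input) -> str:
    """Ask for a config value; fall back to the bracketed default on empty input."""
    return read(f" {label} [{default}]: ") or default

# Pressing Enter at " GCloud location [us-central1]: " keeps "us-central1".
```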

After all configuration is set, the following project structure will be generated:

```
<YOUR PROJECT NAME>
│ README.md
│ datasets.py # Dataset definitions (how labels should be processed)
│ evaluate.py # Template script for evaluating a model
│ openmapflow.yaml # Project configuration file
│ train.py # Template script for training a model
└─── .dvc/ # https://dvc.org/doc/user-guide/what-is-dvc
└─── .github
│ │
│ └─── workflows # Github actions
│ │ deploy.yaml # Automated Google Cloud deployment of trained models
│ │ test.yaml # Automated integration tests of labeled data
└─── data
│ raw_labels/ # User added labels
│ processed_labels/ # Labels standardized to common format
│ features/ # Labels combined with satellite data
│ compressed_features.tar.gz # Allows faster features downloads
│ models/ # Models trained using features
| raw_labels.dvc # Reference to a version of raw_labels/
| processed_labels.dvc # Reference to a version of processed_labels/
│ compressed_features.tar.gz.dvc # Reference to a version of features/
│ models.dvc # Reference to a version of models/
```

This project contains all the code necessary for: Adding data ➞ Training a model ➞ Creating a map.
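For reference, the generated `openmapflow.yaml` follows the template updated in this pull request and looks roughly like this (placeholder values are illustrative):

```yaml
version: 0.0.1
project: <YOUR PROJECT NAME>
description: <YOUR DESCRIPTION>
gcloud:
  project_id: <YOUR GCLOUD PROJECT ID>
  location: us-central1
  bucket_labeled_tifs: <PROJECT>-labeled-tifs
  bucket_inference_tifs: <PROJECT>-inference-tifs
  bucket_preds: <PROJECT>-preds
  bucket_preds_merged: <PROJECT>-preds-merged
```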


## Adding data [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/new_data.ipynb)

**Prerequisites:**
- [ ] [Generated OpenMapFlow project](#generating-a-project-)
- [ ] [EarthEngine account](https://earthengine.google.com/signup) - for accessing Earth Engine and pulling satellite data
- [ ] Raw labels - a file (csv/shp/zip/txt) containing a list of labels and their coordinates (latitude, longitude)

Move raw labels into project:
```bash
export RAW_LABEL_DIR=$(openmapflow datapath RAW_LABELS)
mkdir $RAW_LABEL_DIR/<my dataset name>
cp -r <path to my raw data files> $RAW_LABEL_DIR/<my dataset name>
```
Add reference to data using a `LabeledDataset` object in datasets.py:
Add reference to data using a `LabeledDataset` object in datasets.py, example:
```python
datasets = [
LabeledDataset(
@@ -74,7 +128,6 @@ datasets = [
latitude_col="latitude",
class_prob=lambda df: df["crop"],
start_year=2019,
x_y_from_centroid=False,
),
),
),
@@ -87,9 +140,6 @@ earthengine authenticate # For getting new earth observation data
gcloud auth login # For getting cached earth observation data

openmapflow create-features # Initiates or checks progress of feature creation
# May take a long time depending on the number of labels in the dataset
# TODO make the end more obvious

openmapflow datasets # Shows the status of datasets

dvc commit && dvc push # Push new data to data version control
@@ -98,26 +148,51 @@ git add .
git commit -m'Created new features'
git push
```
**Important:** When new data is pushed to the repository, a Github action runs to verify data integrity. This action pulls data using dvc and therefore needs access to remote storage (your Google Drive). To allow the Github action to access the data, add a new repository secret ([instructions](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository)).
- In step 5 of the instructions, name the secret: `GDRIVE_CREDENTIALS_DATA`
- In step 6, enter the value found in `.dvc/tmp/gdrive-user-credentials.json` (in your repository)

After this, the Github action should run successfully whenever the pushed data is valid.
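For orientation, the dvc step of that data-check workflow looks roughly like the following (a sketch; the step name is illustrative, but `GDRIVE_CREDENTIALS_DATA` is the environment variable dvc itself reads for Google Drive remotes):

```yaml
- uses: iterative/setup-dvc@v1
- name: Pull data with dvc
  env:
    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
  run: dvc pull
```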


## Training a model [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/train.ipynb)

**Prerequisites:**
- [ ] [Generated OpenMapFlow project](#generating-a-project-)
- [ ] [Added labeled data](#adding-data-)

```bash
# Pull in latest data
dvc pull
tar -xzf $(openmapflow datapath COMPRESSED_FEATURES) -C data

export MODEL_NAME=<model_name> # Set model name
python train.py --model_name $MODEL_NAME # Train a model
python evaluate.py --model_name $MODEL_NAME # Record test metrics
# Set model name, train model, record test metrics
export MODEL_NAME=<YOUR MODEL NAME>
python train.py --model_name $MODEL_NAME
python evaluate.py --model_name $MODEL_NAME

dvc commit && dvc push # Push new models to data version control
# Push new models to data version control
dvc commit
dvc push

# Make a Pull Request to the repository
git checkout -b "$MODEL_NAME"
git add .
git commit -m "$MODEL_NAME"
git push --set-upstream origin "$MODEL_NAME"
```

**Important:** When a new model is pushed to the repository, a Github action runs to deploy that model to Google Cloud. To allow the Github action to access Google Cloud, add a new repository secret ([instructions](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository)).
- In step 5 of the instructions, name the secret: `GCP_SA_KEY`
- In step 6, enter a Google Cloud Service Account key ([how to create](https://cloud.google.com/iam/docs/creating-managing-service-account-keys))

Now after merging the pull request, the model will be deployed to Google Cloud.

## Creating a map [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/create_map.ipynb)
**Prerequisites:**
- [ ] [Generated OpenMapFlow project](#generating-a-project-)
- [ ] [Added labeled data](#adding-data-)
- [ ] [Trained model](#training-a-model-)

Map creation is currently only available through Colab. The Cloud Architecture must first be deployed using the deploy.yaml Github action.

Binary file added assets/pipeline.png
19 changes: 15 additions & 4 deletions openmapflow/generate.py
@@ -34,17 +34,28 @@ def create_openmapflow_config(overwrite: bool):
description = input(" Description: ")
gcloud_project_id = input(" GCloud project ID: ")
gcloud_location = input(" GCloud location [us-central1]: ") or "us-central1"
gcloud_bucket_labeled_tifs = (
input(" GCloud bucket labeled tifs [crop-mask-tifs2]: ") or "crop-mask-tifs2"
)

buckets = {
"bucket_labeled_tifs": f"{project_name}-labeled-tifs",
"bucket_inference_tifs": f"{project_name}-inference-tifs",
"bucket_preds": f"{project_name}-preds",
"bucket_preds_merged": f"{project_name}-preds-merged",
}

for k, v in buckets.items():
buckets[k] = input(f" Gcloud {k.replace('_', ' ')} [{v}]: ") or v

openmapflow_str = (
"version: 0.0.1"
+ f"\nproject: {project_name}"
+ f"\ndescription: {description}"
+ "\ngcloud:"
+ f"\n project_id: {gcloud_project_id}"
+ f"\n location: {gcloud_location}"
+ f"\n bucket_labeled_tifs: {gcloud_bucket_labeled_tifs}"
+ f"\n bucket_labeled_tifs: {buckets['bucket_labeled_tifs']}"
+ f"\n bucket_inference_tifs: {buckets['bucket_inference_tifs']}"
+ f"\n bucket_preds: {buckets['bucket_preds']}"
+ f"\n bucket_preds_merged: {buckets['bucket_preds_merged']}"
)

with open(CONFIG_FILE, "w") as f:
10 changes: 8 additions & 2 deletions openmapflow/labeled_dataset.py
@@ -317,14 +317,20 @@ def create_processed_labels(self):

# Combine duplicate labels
df[NUM_LABELERS] = 1

def join_if_exists(values):
if all((isinstance(v, str) for v in values)):
return ",".join(values)
return ""

df = df.groupby([LON, LAT, START, END], as_index=False, sort=False).agg(
{
SOURCE: lambda sources: ",".join(sources.unique()),
CLASS_PROB: "mean",
NUM_LABELERS: "sum",
SUBSET: "first",
LABEL_DUR: lambda dur: ",".join(dur),
LABELER_NAMES: lambda name: ",".join(name),
LABEL_DUR: join_if_exists,
LABELER_NAMES: join_if_exists,
}
)
df[COUNTRY] = self.country
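The `join_if_exists` helper added above guards the label-merging groupby against missing labeler names and durations; a self-contained sketch of that behavior (column names are illustrative, not the library's constants):

```python
import pandas as pd

def join_if_exists(values):
    # Join only when every value in the group is a string;
    # any missing/non-string value makes the merged field empty.
    if all(isinstance(v, str) for v in values):
        return ",".join(values)
    return ""

df = pd.DataFrame({
    "lon": [0.1, 0.1, 0.2],
    "lat": [9.5, 9.5, 9.6],
    "labeler": ["alice", "bob", None],  # hypothetical labeler names
})
merged = df.groupby(["lon", "lat"], as_index=False, sort=False).agg(
    {"labeler": join_if_exists}
)
print(merged["labeler"].tolist())  # -> ['alice,bob', '']
```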
34 changes: 23 additions & 11 deletions openmapflow/raw_labels.py
@@ -9,6 +9,7 @@
import pandas as pd
from cropharvest.utils import set_seed
from dateutil.relativedelta import relativedelta
from fiona.errors import DriverError
from pyproj import Transformer

from openmapflow.constants import (
@@ -71,9 +72,12 @@ def _read_in_file(file_path) -> pd.DataFrame:
except UnicodeDecodeError:
return pd.read_csv(file_path, engine="python")
elif file_path.suffix == ".zip":
with zipfile.ZipFile(file_path, "r") as zip_ref:
zip_ref.extractall(file_path.parent)
return gpd.read_file(file_path.parent / file_path.stem)
try:
return gpd.read_file(file_path)
except DriverError:
with zipfile.ZipFile(file_path, "r") as zip_ref:
zip_ref.extractall(file_path.parent)
return gpd.read_file(file_path.parent / file_path.stem)
else:
return gpd.read_file(file_path)

@@ -141,7 +145,7 @@ def _set_lat_lon(
df = df[df.geometry != None] # noqa: E711
df["samples"] = (df.geometry.area / 0.001).astype(int)
list_of_points = np.vectorize(_get_points)(df.geometry, df.samples)
return gpd.GeoDataFrame(geometry=pd.concat(list_of_points, ignore_index=True))
df = gpd.GeoDataFrame(geometry=pd.concat(list_of_points, ignore_index=True))

if x_y_from_centroid:
df = df[df.geometry != None] # noqa: E711
@@ -156,10 +160,14 @@
df[LAT] = y
return df

raise ValueError(
"Must specify latitude_col and longitude_col or x_y_from_centroid=True"
)


def _set_label_metadata(df, label_duration: Optional[str], labeler_name: Optional[str]):
df[LABEL_DUR] = df[label_duration].astype(str) if label_duration else ""
df[LABELER_NAMES] = df[labeler_name].astype(str) if labeler_name else ""
df[LABEL_DUR] = df[label_duration].astype(str) if label_duration else None
df[LABELER_NAMES] = df[labeler_name].astype(str) if labeler_name else None
return df


@@ -178,6 +186,9 @@ class RawLabels:
train_val_test (Tuple[float, float, float]): A tuple of floats representing the ratio of
train, validation, and test set. The sum of the values must be 1.0
Default: (1.0, 0.0, 0.0) [All data used for training]
filter_df (Callable[[pd.DataFrame], pd.DataFrame]): A function to filter the dataframe before processing
Example: lambda df: df[df["class"].notnull()]
Default: None
start_year (int): The year when the labels were collected, should be used when all labels
are from the same year
Example: 2019
@@ -186,7 +197,7 @@
Example: "Planting Date"
x_y_from_centroid (bool): Whether to use the centroid of the label as the latitude and
longitude coordinates
Default: True
Default: False
latitude_col (str): The name of the column representing the latitude of the label
Default: None, will use the latitude of the centroid of the label
longitude_col (str): The name of the column representing the longitude of the label
@@ -198,7 +209,8 @@
Default: None, assumes EPSG:4326
label_duration (str): The name of the column representing the labeling duration of the label
Default: None
labeler_name
labeler_name (str): The name of the column representing the name of the labeler
Default: None
"""

@@ -221,7 +233,7 @@ class RawLabels:
transform_crs_from: Optional[int] = None

# Label metadata
label_duruation: Optional[str] = None
label_duration: Optional[str] = None
labeler_name: Optional[str] = None

def __post_init__(self):
@@ -231,7 +243,6 @@ def __post_init__(self):

def process(self, raw_folder: Path) -> pd.DataFrame:
df = _read_in_file(raw_folder / self.filename)
df[SOURCE] = self.filename
if self.filter_df:
df = self.filter_df(df)
df = _set_lat_lon(
@@ -242,9 +253,10 @@ def process(self, raw_folder: Path) -> pd.DataFrame:
x_y_from_centroid=self.x_y_from_centroid,
transform_crs_from=self.transform_crs_from,
)
df[SOURCE] = self.filename
df = _set_class_prob(df, self.class_prob)
df = _set_start_end_dates(df, self.start_year, self.start_date_col)
df = _set_label_metadata(df, self.label_duruation, self.labeler_name)
df = _set_label_metadata(df, self.label_duration, self.labeler_name)
df = df.dropna(subset=[LON, LAT, CLASS_PROB])
df = df.round({LON: 8, LAT: 8})
df = _train_val_test_split(df, self.train_val_test)
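`process` ends by assigning rows to train/validation/test subsets according to the `train_val_test` ratios documented above. The library's `_train_val_test_split` is not shown in this diff, so the following is only an assumed, minimal version of the idea (function name and subset labels are approximations):

```python
import numpy as np
import pandas as pd

def train_val_test_split(df: pd.DataFrame, ratios=(1.0, 0.0, 0.0), seed=42) -> pd.DataFrame:
    """Randomly assign each row to training/validation/testing by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1.0"
    rng = np.random.default_rng(seed)
    draw = rng.random(len(df))  # one uniform draw in [0, 1) per row
    df = df.copy()
    df["subset"] = np.where(
        draw < ratios[0], "training",
        np.where(draw < ratios[0] + ratios[1], "validation", "testing"),
    )
    return df

labels = pd.DataFrame({"lat": [9.5, 9.6, 9.7, 9.8], "lon": [0.1, 0.2, 0.3, 0.4]})
split = train_val_test_split(labels, ratios=(1.0, 0.0, 0.0))
# With the default ratios every row lands in "training".
```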
8 changes: 4 additions & 4 deletions openmapflow/templates/github-deploy.yaml
@@ -19,11 +19,11 @@ jobs:
python-version: 3.8
- name: Install dependencies
run: pip install -r requirements.txt
- uses: google-github-actions/setup-gcloud@v0
- uses: google-github-actions/auth@v0
with:
project_id: ${{ secrets.GCP_PROJECT_ID }}
service_account_key: ${{ secrets.GCP_SA_KEY }}
export_default_credentials: true
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Set up Cloud SDK
uses: google-github-actions/setup-gcloud@v0
- uses: iterative/setup-dvc@v1
- name: Deploy Google Cloud Architecture
env:
4 changes: 2 additions & 2 deletions openmapflow/templates/openmapflow-default.yaml
@@ -15,7 +15,7 @@ data_paths:
gcloud:
project_id: ~
location: us-central1
bucket_labeled_tifs: crop-mask-tifs2
bucket_labeled_tifs: <PROJECT>-labeled-tifs
bucket_inference_tifs: <PROJECT>-inference-tifs
bucket_preds: <PROJECT>-preds
bucket_preds_merged: <PROJECT>-preds-merged
bucket_preds_merged: <PROJECT>-preds-merged
2 changes: 1 addition & 1 deletion setup.py
@@ -40,7 +40,7 @@
"dvc[gdrive]>=2.10.1",
"earthengine-api",
"h5py>=3.1.0,!=3.7.0",
"ipyleaflet>=0.16.0",
"ipyleaflet==0.16.0",
"pandas==1.3.5",
"protobuf==3.20.1",
"pyyaml>=6.0",
4 changes: 2 additions & 2 deletions tests/test_config.py
@@ -26,7 +26,7 @@ def test_load_default_config(self):
"gcloud": {
"project_id": None,
"location": "us-central1",
"bucket_labeled_tifs": "crop-mask-tifs2",
"bucket_labeled_tifs": "fake-project-labeled-tifs",
"bucket_inference_tifs": "fake-project-inference-tifs",
"bucket_preds": "fake-project-preds",
"bucket_preds_merged": "fake-project-preds-merged",
@@ -58,7 +58,7 @@ def test_deploy_env_variables(self):
+ f"OPENMAPFLOW_LIBRARY_DIR={LIBRARY_DIR} "
+ "OPENMAPFLOW_GCLOUD_PROJECT_ID=None "
+ "OPENMAPFLOW_GCLOUD_LOCATION=us-central1 "
+ "OPENMAPFLOW_GCLOUD_BUCKET_LABELED_TIFS=crop-mask-tifs2 "
+ "OPENMAPFLOW_GCLOUD_BUCKET_LABELED_TIFS=openmapflow-labeled-tifs "
+ "OPENMAPFLOW_GCLOUD_BUCKET_INFERENCE_TIFS=openmapflow-inference-tifs "
+ "OPENMAPFLOW_GCLOUD_BUCKET_PREDS=openmapflow-preds "
+ "OPENMAPFLOW_GCLOUD_BUCKET_PREDS_MERGED=openmapflow-preds-merged "