Merge pull request #59 from nasaharvest/additional-testing

Additional testing

ivanzvonkov authored Jul 11, 2022
2 parents be8cd33 + 74881b1 commit 5128d4e

Showing 11 changed files with 272 additions and 44 deletions.
101 changes: 88 additions & 13 deletions README.md
@@ -30,6 +30,7 @@ Rapid map creation with machine learning and earth observation data.
![3maps-gif](assets/3maps.gif)

* [Tutorial](#tutorial-)
* [How it works](#how-it-works)
* [Generating a project](#generating-a-project-)
* [Adding data](#adding-data-)
* [Training a model](#training-a-model-)
@@ -39,29 +40,82 @@ Rapid map creation with machine learning and earth observation data.
Colab notebook tutorial demonstrating data exploration, model training, and inference over a small region.

**Prerequisites:**
- Github account
- Github access token (obtained [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token))
- Forked OpenMapFlow repository
- [Forked OpenMapFlow repository](https://github.com/nasaharvest/openmapflow/fork)
- Basic Python knowledge

## How it works

To create your own maps with OpenMapFlow, you need to:
1. [Generate your own OpenMapFlow project](#generating-a-project-); this will allow you to:
   1. [Add your own labeled data](#adding-data-),
   2. [Train a model](#training-a-model-) using that labeled data, and
   3. [Create a map](#creating-a-map-) using the trained model.

![openmapflow-pipeline](assets/pipeline.png)

## Generating a project [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/generate_project.ipynb)
Inside a Github repository run:

**Prerequisites:**
- [ ] [Github repository](https://docs.github.com/en/get-started/quickstart/create-a-repo) - where your project will be stored
- [ ] [Google/Gmail based account](https://www.google.com/account/about/) - for accessing Google Drive and Google Cloud
- [ ] [Google Cloud Project](https://console.cloud.google.com/projectcreate) - for deploying Cloud resources for creating a map ([additional info](https://cloud.google.com/resource-manager/docs/creating-managing-projects#console))

Once all prerequisites are satisfied, inside your Github repository run:
```bash
pip install openmapflow
openmapflow generate
```
This generates a project for: Adding data ➞ Training a model ➞ Creating a map
The command will prompt for project configuration such as the project name and Google Cloud project ID. Several prompts show defaults in square brackets; these are used if nothing is entered.
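This default-in-brackets behavior mirrors the prompt pattern used in `openmapflow/generate.py` further down in this diff; a minimal, self-contained sketch (the function name is illustrative):

```python
def prompt_with_default(label: str, default: str, read=input) -> str:
    """Ask for a config value; fall back to the bracketed default on empty input."""
    return read(f" {label} [{default}]: ") or default

# Pressing Enter at " GCloud location [us-central1]: " keeps "us-central1".
```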

After all configuration is set, the following project structure will be generated:

```
<YOUR PROJECT NAME>
│ README.md
│ datasets.py # Dataset definitions (how labels should be processed)
│ evaluate.py # Template script for evaluating a model
│ openmapflow.yaml # Project configuration file
│ train.py # Template script for training a model
└─── .dvc/ # https://dvc.org/doc/user-guide/what-is-dvc
└─── .github
│ │
│ └─── workflows # Github actions
│ │ deploy.yaml # Automated Google Cloud deployment of trained models
│ │ test.yaml # Automated integration tests of labeled data
└─── data
│ raw_labels/ # User added labels
│ processed_labels/ # Labels standardized to common format
│ features/ # Labels combined with satellite data
│ compressed_features.tar.gz # Allows faster features downloads
│ models/ # Models trained using features
| raw_labels.dvc # Reference to a version of raw_labels/
| processed_labels.dvc # Reference to a version of processed_labels/
│ compressed_features.tar.gz.dvc # Reference to a version of features/
│ models.dvc # Reference to a version of models/
```

This project contains all the code necessary for: Adding data ➞ Training a model ➞ Creating a map.
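For reference, the generated `openmapflow.yaml` follows the template updated in this pull request and looks roughly like this (placeholder values are illustrative):

```yaml
version: 0.0.1
project: <YOUR PROJECT NAME>
description: <YOUR DESCRIPTION>
gcloud:
  project_id: <YOUR GCLOUD PROJECT ID>
  location: us-central1
  bucket_labeled_tifs: <PROJECT>-labeled-tifs
  bucket_inference_tifs: <PROJECT>-inference-tifs
  bucket_preds: <PROJECT>-preds
  bucket_preds_merged: <PROJECT>-preds-merged
```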


## Adding data [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/new_data.ipynb)

**Prerequisites:**
- [ ] [Generated OpenMapFlow project](#generating-a-project-)
- [ ] [EarthEngine account](https://earthengine.google.com/signup) - for accessing Earth Engine and pulling satellite data
- [ ] Raw labels - a file (csv/shp/zip/txt) containing a list of labels and their coordinates (latitude, longitude)

Move raw labels into project:
```bash
export RAW_LABEL_DIR=$(openmapflow datapath RAW_LABELS)
mkdir $RAW_LABEL_DIR/<my dataset name>
cp -r <path to my raw data files> $RAW_LABEL_DIR/<my dataset name>
```
Add reference to data using a `LabeledDataset` object in datasets.py:
Add reference to data using a `LabeledDataset` object in datasets.py, example:
```python
datasets = [
LabeledDataset(
@@ -74,7 +128,6 @@ datasets = [
latitude_col="latitude",
class_prob=lambda df: df["crop"],
start_year=2019,
x_y_from_centroid=False,
),
),
),
@@ -87,9 +140,6 @@ earthengine authenticate # For getting new earth observation data
gcloud auth login # For getting cached earth observation data

openmapflow create-features # Initiates or checks progress of feature creation
# May take a long time depending on the number of labels in the dataset
# TODO make the end more obvious

openmapflow datasets # Shows the status of datasets

dvc commit && dvc push # Push new data to data version control
@@ -98,26 +148,51 @@ git add .
git commit -m'Created new features'
git push
```
**Important:** When new data is pushed to the repository, a Github action runs to verify data integrity. This action pulls data using dvc and therefore needs access to remote storage (your Google Drive). To allow the Github action to access the data, add a new repository secret ([instructions](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository)).
- In step 5 of the instructions, name the secret: `GDRIVE_CREDENTIALS_DATA`
- In step 6, enter the value found in `.dvc/tmp/gdrive-user-credentials.json` (in your repository)

After this, the Github action should run successfully whenever the pushed data is valid.
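For orientation, the dvc step of that data-check workflow looks roughly like the following (a sketch; the step name is illustrative, but `GDRIVE_CREDENTIALS_DATA` is the environment variable dvc itself reads for Google Drive remotes):

```yaml
- uses: iterative/setup-dvc@v1
- name: Pull data with dvc
  env:
    GDRIVE_CREDENTIALS_DATA: ${{ secrets.GDRIVE_CREDENTIALS_DATA }}
  run: dvc pull
```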


## Training a model [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/train.ipynb)

**Prerequisites:**
- [ ] [Generated OpenMapFlow project](#generating-a-project-)
- [ ] [Added labeled data](#adding-data-)

```bash
# Pull in latest data
dvc pull
tar -xzf $(openmapflow datapath COMPRESSED_FEATURES) -C data

export MODEL_NAME=<model_name> # Set model name
python train.py --model_name $MODEL_NAME # Train a model
python evaluate.py --model_name $MODEL_NAME # Record test metrics
# Set model name, train model, record test metrics
export MODEL_NAME=<YOUR MODEL NAME>
python train.py --model_name $MODEL_NAME
python evaluate.py --model_name $MODEL_NAME

dvc commit && dvc push # Push new models to data version control
# Push new models to data version control
dvc commit
dvc push

# Make a Pull Request to the repository
git checkout -b "$MODEL_NAME"
git add .
git commit -m "$MODEL_NAME"
git push --set-upstream origin "$MODEL_NAME"
```

**Important:** When a new model is pushed to the repository, a Github action runs to deploy that model to Google Cloud. To allow the Github action to access Google Cloud, add a new repository secret ([instructions](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository)).
- In step 5 of the instructions, name the secret: `GCP_SA_KEY`
- In step 6, enter a Google Cloud Service Account key ([how to create](https://cloud.google.com/iam/docs/creating-managing-service-account-keys))

Now after merging the pull request, the model will be deployed to Google Cloud.

## Creating a map [![cb]](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/create_map.ipynb)
**Prerequisites:**
- [ ] [Generated OpenMapFlow project](#generating-a-project-)
- [ ] [Added labeled data](#adding-data-)
- [ ] [Trained model](#training-a-model-)

Map creation is currently only available through Colab. The Cloud Architecture must first be deployed using the deploy.yaml Github action.

Binary file added assets/pipeline.png
19 changes: 15 additions & 4 deletions openmapflow/generate.py
@@ -34,17 +34,28 @@ def create_openmapflow_config(overwrite: bool):
description = input(" Description: ")
gcloud_project_id = input(" GCloud project ID: ")
gcloud_location = input(" GCloud location [us-central1]: ") or "us-central1"
gcloud_bucket_labeled_tifs = (
input(" GCloud bucket labeled tifs [crop-mask-tifs2]: ") or "crop-mask-tifs2"
)

buckets = {
"bucket_labeled_tifs": f"{project_name}-labeled-tifs",
"bucket_inference_tifs": f"{project_name}-inference-tifs",
"bucket_preds": f"{project_name}-preds",
"bucket_preds_merged": f"{project_name}-preds-merged",
}

for k, v in buckets.items():
buckets[k] = input(f" Gcloud {k.replace('_', ' ')} [{v}]: ") or v

openmapflow_str = (
"version: 0.0.1"
+ f"\nproject: {project_name}"
+ f"\ndescription: {description}"
+ "\ngcloud:"
+ f"\n project_id: {gcloud_project_id}"
+ f"\n location: {gcloud_location}"
+ f"\n bucket_labeled_tifs: {gcloud_bucket_labeled_tifs}"
+ f"\n bucket_labeled_tifs: {buckets['bucket_labeled_tifs']}"
+ f"\n bucket_inference_tifs: {buckets['bucket_inference_tifs']}"
+ f"\n bucket_preds: {buckets['bucket_preds']}"
+ f"\n bucket_preds_merged: {buckets['bucket_preds_merged']}"
)

with open(CONFIG_FILE, "w") as f:
10 changes: 8 additions & 2 deletions openmapflow/labeled_dataset.py
@@ -317,14 +317,20 @@ def create_processed_labels(self):

# Combine duplicate labels
df[NUM_LABELERS] = 1

def join_if_exists(values):
if all((isinstance(v, str) for v in values)):
return ",".join(values)
return ""

df = df.groupby([LON, LAT, START, END], as_index=False, sort=False).agg(
{
SOURCE: lambda sources: ",".join(sources.unique()),
CLASS_PROB: "mean",
NUM_LABELERS: "sum",
SUBSET: "first",
LABEL_DUR: lambda dur: ",".join(dur),
LABELER_NAMES: lambda name: ",".join(name),
LABEL_DUR: join_if_exists,
LABELER_NAMES: join_if_exists,
}
)
df[COUNTRY] = self.country
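The `join_if_exists` helper added above guards the label-merging groupby against missing labeler names and durations; a self-contained sketch of that behavior (column names are illustrative, not the library's constants):

```python
import pandas as pd

def join_if_exists(values):
    # Join only when every value in the group is a string;
    # any missing/non-string value makes the merged field empty.
    if all(isinstance(v, str) for v in values):
        return ",".join(values)
    return ""

df = pd.DataFrame({
    "lon": [0.1, 0.1, 0.2],
    "lat": [9.5, 9.5, 9.6],
    "labeler": ["alice", "bob", None],  # hypothetical labeler names
})
merged = df.groupby(["lon", "lat"], as_index=False, sort=False).agg(
    {"labeler": join_if_exists}
)
print(merged["labeler"].tolist())  # -> ['alice,bob', '']
```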
34 changes: 23 additions & 11 deletions openmapflow/raw_labels.py
@@ -9,6 +9,7 @@
import pandas as pd
from cropharvest.utils import set_seed
from dateutil.relativedelta import relativedelta
from fiona.errors import DriverError
from pyproj import Transformer

from openmapflow.constants import (
@@ -71,9 +72,12 @@ def _read_in_file(file_path) -> pd.DataFrame:
except UnicodeDecodeError:
return pd.read_csv(file_path, engine="python")
elif file_path.suffix == ".zip":
with zipfile.ZipFile(file_path, "r") as zip_ref:
zip_ref.extractall(file_path.parent)
return gpd.read_file(file_path.parent / file_path.stem)
try:
return gpd.read_file(file_path)
except DriverError:
with zipfile.ZipFile(file_path, "r") as zip_ref:
zip_ref.extractall(file_path.parent)
return gpd.read_file(file_path.parent / file_path.stem)
else:
return gpd.read_file(file_path)

@@ -141,7 +145,7 @@ def _set_lat_lon(
df = df[df.geometry != None] # noqa: E711
df["samples"] = (df.geometry.area / 0.001).astype(int)
list_of_points = np.vectorize(_get_points)(df.geometry, df.samples)
return gpd.GeoDataFrame(geometry=pd.concat(list_of_points, ignore_index=True))
df = gpd.GeoDataFrame(geometry=pd.concat(list_of_points, ignore_index=True))

if x_y_from_centroid:
df = df[df.geometry != None] # noqa: E711
@@ -156,10 +160,14 @@
df[LAT] = y
return df

raise ValueError(
"Must specify latitude_col and longitude_col or x_y_from_centroid=True"
)


def _set_label_metadata(df, label_duration: Optional[str], labeler_name: Optional[str]):
df[LABEL_DUR] = df[label_duration].astype(str) if label_duration else ""
df[LABELER_NAMES] = df[labeler_name].astype(str) if labeler_name else ""
df[LABEL_DUR] = df[label_duration].astype(str) if label_duration else None
df[LABELER_NAMES] = df[labeler_name].astype(str) if labeler_name else None
return df


@@ -178,6 +186,9 @@ class RawLabels:
train_val_test (Tuple[float, float, float]): A tuple of floats representing the ratio of
train, validation, and test set. The sum of the values must be 1.0
Default: (1.0, 0.0, 0.0) [All data used for training]
filter_df (Callable[[pd.DataFrame], pd.DataFrame]): A function to filter the dataframe before processing
Example: lambda df: df[df["class"].notnull()]
Default: None
start_year (int): The year when the labels were collected, should be used when all labels
are from the same year
Example: 2019
@@ -186,7 +197,7 @@
Example: "Planting Date"
x_y_from_centroid (bool): Whether to use the centroid of the label as the latitude and
longitude coordinates
Default: True
Default: False
latitude_col (str): The name of the column representing the latitude of the label
Default: None, will use the latitude of the centroid of the label
longitude_col (str): The name of the column representing the longitude of the label
@@ -198,7 +209,8 @@
Default: None, assumes EPSG:4326
label_duration (str): The name of the column representing the labeling duration of the label
Default: None
labeler_name
labeler_name (str): The name of the column representing the name of the labeler
Default: None
"""

@@ -221,7 +233,7 @@ class RawLabels:
transform_crs_from: Optional[int] = None

# Label metadata
label_duruation: Optional[str] = None
label_duration: Optional[str] = None
labeler_name: Optional[str] = None

def __post_init__(self):
@@ -231,7 +243,6 @@ def __post_init__(self):

def process(self, raw_folder: Path) -> pd.DataFrame:
df = _read_in_file(raw_folder / self.filename)
df[SOURCE] = self.filename
if self.filter_df:
df = self.filter_df(df)
df = _set_lat_lon(
@@ -242,9 +253,10 @@ def process(self, raw_folder: Path) -> pd.DataFrame:
x_y_from_centroid=self.x_y_from_centroid,
transform_crs_from=self.transform_crs_from,
)
df[SOURCE] = self.filename
df = _set_class_prob(df, self.class_prob)
df = _set_start_end_dates(df, self.start_year, self.start_date_col)
df = _set_label_metadata(df, self.label_duruation, self.labeler_name)
df = _set_label_metadata(df, self.label_duration, self.labeler_name)
df = df.dropna(subset=[LON, LAT, CLASS_PROB])
df = df.round({LON: 8, LAT: 8})
df = _train_val_test_split(df, self.train_val_test)
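`process` ends by assigning rows to train/validation/test subsets according to the `train_val_test` ratios documented above. The library's `_train_val_test_split` is not shown in this diff, so the following is only an assumed, minimal version of the idea (function name and subset labels are approximations):

```python
import numpy as np
import pandas as pd

def train_val_test_split(df: pd.DataFrame, ratios=(1.0, 0.0, 0.0), seed=42) -> pd.DataFrame:
    """Randomly assign each row to training/validation/testing by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1.0"
    rng = np.random.default_rng(seed)
    draw = rng.random(len(df))  # one uniform draw in [0, 1) per row
    df = df.copy()
    df["subset"] = np.where(
        draw < ratios[0], "training",
        np.where(draw < ratios[0] + ratios[1], "validation", "testing"),
    )
    return df

labels = pd.DataFrame({"lat": [9.5, 9.6, 9.7, 9.8], "lon": [0.1, 0.2, 0.3, 0.4]})
split = train_val_test_split(labels, ratios=(1.0, 0.0, 0.0))
# With the default ratios every row lands in "training".
```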
8 changes: 4 additions & 4 deletions openmapflow/templates/github-deploy.yaml
@@ -19,11 +19,11 @@ jobs:
python-version: 3.8
- name: Install dependencies
run: pip install -r requirements.txt
- uses: google-github-actions/setup-gcloud@v0
- uses: google-github-actions/auth@v0
with:
project_id: ${{ secrets.GCP_PROJECT_ID }}
service_account_key: ${{ secrets.GCP_SA_KEY }}
export_default_credentials: true
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Set up Cloud SDK
uses: google-github-actions/setup-gcloud@v0
- uses: iterative/setup-dvc@v1
- name: Deploy Google Cloud Architecture
env:
4 changes: 2 additions & 2 deletions openmapflow/templates/openmapflow-default.yaml
@@ -15,7 +15,7 @@ data_paths:
gcloud:
project_id: ~
location: us-central1
bucket_labeled_tifs: crop-mask-tifs2
bucket_labeled_tifs: <PROJECT>-labeled-tifs
bucket_inference_tifs: <PROJECT>-inference-tifs
bucket_preds: <PROJECT>-preds
bucket_preds_merged: <PROJECT>-preds-merged
bucket_preds_merged: <PROJECT>-preds-merged
2 changes: 1 addition & 1 deletion setup.py
@@ -40,7 +40,7 @@
"dvc[gdrive]>=2.10.1",
"earthengine-api",
"h5py>=3.1.0,!=3.7.0",
"ipyleaflet>=0.16.0",
"ipyleaflet==0.16.0",
"pandas==1.3.5",
"protobuf==3.20.1",
"pyyaml>=6.0",
4 changes: 2 additions & 2 deletions tests/test_config.py
@@ -26,7 +26,7 @@ def test_load_default_config(self):
"gcloud": {
"project_id": None,
"location": "us-central1",
"bucket_labeled_tifs": "crop-mask-tifs2",
"bucket_labeled_tifs": "fake-project-labeled-tifs",
"bucket_inference_tifs": "fake-project-inference-tifs",
"bucket_preds": "fake-project-preds",
"bucket_preds_merged": "fake-project-preds-merged",
@@ -58,7 +58,7 @@ def test_deploy_env_variables(self):
+ f"OPENMAPFLOW_LIBRARY_DIR={LIBRARY_DIR} "
+ "OPENMAPFLOW_GCLOUD_PROJECT_ID=None "
+ "OPENMAPFLOW_GCLOUD_LOCATION=us-central1 "
+ "OPENMAPFLOW_GCLOUD_BUCKET_LABELED_TIFS=crop-mask-tifs2 "
+ "OPENMAPFLOW_GCLOUD_BUCKET_LABELED_TIFS=openmapflow-labeled-tifs "
+ "OPENMAPFLOW_GCLOUD_BUCKET_INFERENCE_TIFS=openmapflow-inference-tifs "
+ "OPENMAPFLOW_GCLOUD_BUCKET_PREDS=openmapflow-preds "
+ "OPENMAPFLOW_GCLOUD_BUCKET_PREDS_MERGED=openmapflow-preds-merged "