This project aims to track changes in water level using satellite imagery and deep learning. During my studies, I worked on it with my friend Karl as our portfolio project for the Data Science Retreat, a three-month intensive in-person data science bootcamp in Berlin, Germany.
Table of Contents:
- Introduction
- Datasets
- Labeling
- Data Augmentation
- Metrics
- Baseline
- Model Optimization
- Model Results
- Dashboard
- Technical Stack
- Virtual Environment
- Next Steps
The motivation for this project is the article "Some of the World's Biggest Lakes Are Drying Up", published in the March 2018 edition of National Geographic magazine.
Freshwater is the most important resource for mankind, cross-cutting all social, economic and environmental activities. It is a condition for all life on our planet, an enabling or limiting factor for any social and technological development, a possible source of welfare or misery, cooperation or conflict. (UNESCO)
The exponential growth of satellite-based information over the past four decades has provided unprecedented opportunities to improve water resource management.
The NWPU-RESISC45 dataset (referred to below as Resic-45) is a publicly available benchmark for Remote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). It contains 31,500 images covering 45 scene classes (including water classes), with 700 images in each class.
The second dataset is a time series of cloudless Sentinel-2 imagery covering 17 critically endangered lakes, as follows:
- Lake Poopo, Bolivia;
- Lake Urmia, Iran;
- Lake Mojave, USA;
- Aral Sea, Kazakhstan;
- Lake Copais, Greece;
- Lake Ramganga, India;
- Qinghai Lake, China;
- Salton Sea, USA;
- Lake Faguibine, Mali;
- Mono Lake, USA;
- Walker Lake, USA;
- Lake Balaton, Hungary;
- Lake Koroneia, Greece;
- Lake Salda, Turkey;
- Lake Burdur, Turkey;
- Lake Mendocino, USA;
- Elephant Butte Reservoir, USA.
The MakeSense online tool was used to label the images of both datasets. It only requires a web browser, which makes it an excellent choice for small computer vision deep learning projects: preparing the dataset becomes easier and faster.
The following techniques have been applied during training:
- Height shift up to 30%;
- Horizontal flip;
- Rotation up to 45 degrees;
- No shear;
- Vertical flip;
- Width shift up to 30%;
- Zoom between 75% and 125%.
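The settings above map naturally onto the keyword arguments of Keras' `ImageDataGenerator`, which ships with the TensorFlow version listed in the technical stack. The sketch below is an assumption about how they could be expressed, not the project's actual training code; the `AUGMENTATION` dictionary and the `make_generators` helper are hypothetical names.

```python
# Augmentation settings from the list above, written as keyword arguments
# for tf.keras.preprocessing.image.ImageDataGenerator.
AUGMENTATION = {
    "height_shift_range": 0.3,   # height shift up to 30%
    "horizontal_flip": True,
    "rotation_range": 45,        # rotation up to 45 degrees
    "shear_range": 0.0,          # no shear
    "vertical_flip": True,
    "width_shift_range": 0.3,    # width shift up to 30%
    "zoom_range": (0.75, 1.25),  # zoom between 75% and 125%
}


def make_generators():
    """Hypothetical helper: one generator for images, one for masks."""
    # Imported lazily so the settings can be inspected without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    image_gen = ImageDataGenerator(**AUGMENTATION)
    mask_gen = ImageDataGenerator(**AUGMENTATION)
    return image_gen, mask_gen
```

For segmentation, the image and its mask must receive the same random transform. With two generators this is usually done by passing the same `seed` to both `flow` calls.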
The following metrics have been used to evaluate the semantic segmentation model:
- Jaccard Index
- Dice Coefficient
More information about both of these metrics can be found here.
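For two binary masks A (ground truth) and B (prediction), the Jaccard Index is |A ∩ B| / |A ∪ B| and the Dice Coefficient is 2|A ∩ B| / (|A| + |B|). A minimal NumPy sketch of both follows; the function names are illustrative, and returning 1.0 when both masks are empty is an assumption rather than something the project specifies.

```python
import numpy as np


def jaccard_index(y_true, y_pred):
    """Intersection over union of two binary masks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    union = np.logical_or(y_true, y_pred).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(np.logical_and(y_true, y_pred).sum() / union)


def dice_coefficient(y_true, y_pred):
    """Twice the overlap divided by the total number of positive pixels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    total = y_true.sum() + y_pred.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(2 * np.logical_and(y_true, y_pred).sum() / total)


truth = [[1, 1], [0, 0]]
pred = [[1, 0], [0, 0]]
print(jaccard_index(truth, pred), dice_coefficient(truth, pred))
```

The two metrics are monotonically related (Dice = 2J / (1 + J)), so they rank models the same way; Dice simply gives more weight to the overlap.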
The baseline is a simple U-Net architecture. Starting from this well-known network allowed us to modify and fine-tune the model as needed, and to spend more time understanding the optimization strategies.
Train/Validation/Test splits based on Resic-45 dataset only:
- training set: 489 images;
- validation set: 140 images;
- test set: 71 images.
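The three subset sizes above add up to 700 images. How the images were assigned to each subset is not stated here, so the sketch below simply assumes a seeded random shuffle of image indices; `split_indices` and the seed value are hypothetical.

```python
import random


def split_indices(n_images, n_train, n_val, seed=42):
    """Shuffle image indices and cut them into train/validation/test."""
    if n_train + n_val > n_images:
        raise ValueError("train and validation sizes exceed the dataset size")
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # whatever remains
    return train, val, test


train, val, test = split_indices(700, n_train=489, n_val=140)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the test images stable between runs, which matters when comparing models trained at different times.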
Model performance:
Train/Validation/Test splits based on Resic-45 dataset only:
- training set: 979 images;
- validation set: 280 images;
- test set: 122 images.
Model performance:
It can be seen clearly that the baseline model overfits when trained with image augmentation.
The following strategies have been explored:
- Using Early Stopping and adaptive learning rates;
- Using a bigger model (and dropout);
- Using regularization (Batch Normalization);
- Using residual connections;
- Dealing with class imbalance using dice loss;
- Refining label images using CRFs;
- Ensemble predictions.
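To illustrate the class-imbalance item above: a Dice-based loss ignores true negatives, so a large non-water background cannot dominate the objective the way it can with pixel-wise cross entropy. Below is a NumPy sketch of a soft Dice loss, for illustration only. A training version would use the framework's tensor operations, and the `smooth` term is a common stabiliser assumed here, not a value taken from the project.

```python
import numpy as np


def soft_dice_loss(y_true, y_prob, smooth=1.0):
    """1 - soft Dice, computed on probabilities rather than hard masks."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_prob = np.asarray(y_prob, dtype=np.float64).ravel()
    intersection = np.sum(y_true * y_prob)
    denominator = np.sum(y_true) + np.sum(y_prob)
    # smooth avoids division by zero when both masks are empty
    return float(1.0 - (2.0 * intersection + smooth) / (denominator + smooth))


mask = [1, 0, 0, 0]                             # one water pixel out of four
print(soft_dice_loss(mask, [1, 0, 0, 0]))       # perfect prediction
print(soft_dice_loss(mask, [0, 0, 0, 0]))       # water pixel missed
```

A prediction that labels everything as background scores 75% pixel accuracy on this toy mask, yet its Dice loss is clearly worse than that of the perfect prediction.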
Train/Validation/Test splits:
- training set: 489 images from the Resic-45 dataset, randomly transformed at each epoch using one of the techniques described in the Data Augmentation section;
- validation set: 211 images from the Resic-45 dataset;
- test set: 359 images from the Sentinel-2 dataset.
Model performance using binary cross entropy as the loss function:
Model performance using dice loss as the loss function:
The results presented below were measured on a test set of 182 images from the Sentinel-2 dataset.
Model 1: U-Net residual model trained without label correction:
Model 2: U-Net residual model trained with label correction using Conditional Random Fields:
Model 3: Ensemble model based on the two previous models:
The ensemble model has the highest accuracy (97.15%) and is the one used in the Dashboard application covered in the next section.
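How the two models' predictions are combined is not described here. One common approach, assumed in the sketch below, is to average the per-pixel water probabilities and threshold the mean. The function names and the 0.5 threshold are illustrative.

```python
import numpy as np


def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-pixel water probabilities and threshold the mean."""
    stacked = np.stack([np.asarray(p, dtype=np.float64) for p in prob_maps])
    mean_prob = stacked.mean(axis=0)
    return (mean_prob >= threshold).astype(np.uint8)


def pixel_accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


model_1 = [[0.9, 0.4], [0.2, 0.6]]  # toy probabilities, not real outputs
model_2 = [[0.7, 0.8], [0.1, 0.3]]
combined = ensemble_mask([model_1, model_2])
print(combined)
print(pixel_accuracy([[1, 1], [0, 1]], combined))
```

Averaging lets a confident model outvote an uncertain one on a given pixel, which is one reason an ensemble can outperform both of its members.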
The dashboard can be executed with the following command:
python app.py
A demo is available here.
Use Case 1: Lake Copais, Greece (2019)
Use Case 2: Lake Di Cancano, Italy (2019)
Use Case 3: Lake Salda, Turkey (2016)
The following libraries are required to create the virtual environment. The creation of the virtual environment is detailed in the next section.
- Cython
- Dash
- Matplotlib
- NumPy
- Pillow
- Plotly
- Pydensecrf
- Rasterio
- Requests
- TensorFlow 2.4
To set up your local environment, it is recommended to create a virtual environment using conda. Make sure you have it installed on your computer and then execute the command below:
conda env create -f environment.yml
The environment.yml file ensures that all dependencies will be downloaded.
After the environment is created, activate it as follows:
conda activate deep-water
The virtual environment can be deactivated with a single command:
conda deactivate
The topics below can be studied and analysed in the context of the project:
- Apply post-processing techniques such as defrosting;
- Collect satellite imagery with clouds;
- Collect more data using the sentinelsat package;
- Estimate the volume of a given water body.