This repository provides a script and recipe to train UNet Medical to achieve state-of-the-art accuracy, and is tested and maintained by Habana Labs, Ltd., an Intel Company. Please visit this page for performance information.
For more information about training deep learning models on Gaudi, visit developer.habana.ai.
This repository provides a script to train the UNet Medical model for 2D segmentation on Habana Gaudi (HPU). It is based on the NVIDIA UNet Medical Image Segmentation for TensorFlow 2.x repository. The implementation provided in this repository covers the UNet model as described in the original paper UNet: Convolutional Networks for Biomedical Image Segmentation.
UNet allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
The following figure shows the construction of the UNet model and its components. UNet is composed of a contracting path and an expanding path, and builds a bottleneck in its centermost part through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added to help the backward flow of gradients and thereby improve training.
Figure 1. The architecture of a UNet model. Taken from the UNet: Convolutional Networks for Biomedical Image Segmentation paper.
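To make the contracting and expanding paths concrete, the following minimal sketch traces the spatial size of a square feature map through a UNet. It assumes the layout of the original paper (unpadded 3x3 convolutions, two per block, 2x2 pooling, four resolution levels); the function name is hypothetical and the layers in model/layers.py may use different padding or depth.

```python
def unet_output_size(input_size: int, depth: int = 4, convs_per_block: int = 2) -> int:
    """Return the output side length of a UNet for a square input.

    Assumptions (from the original UNet paper, not verified against this
    repository): every 3x3 convolution is unpadded and removes 2 pixels,
    pooling halves the size, and upsampling doubles it.
    """
    size = input_size
    # Contracting path: convolution block, then 2x2 pooling.
    for _ in range(depth):
        size -= 2 * convs_per_block
        if size <= 0 or size % 2:
            raise ValueError(f"size {size} cannot be pooled evenly")
        size //= 2
    # Bottleneck convolution block at the lowest resolution.
    size -= 2 * convs_per_block
    # Expanding path: 2x upsampling, then convolution block.
    for _ in range(depth):
        size *= 2
        size -= 2 * convs_per_block
    if size <= 0:
        raise ValueError("input is too small for this depth")
    return size


if __name__ == "__main__":
    print(unet_output_size(572))
```

Under these assumptions a 572x572 input yields a 388x388 output, which is why encoder feature maps have to be cropped before they are concatenated through the skip connections.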
Major changes made to the original model from NVIDIA UNet Medical Image Segmentation for TensorFlow 2.x:
- GPU-specific configurations have been removed;
- Some scripts were changed in order to run the model on Gaudi. This includes loading Habana TensorFlow modules and using multi-Gaudi card helpers;
- The model uses bfloat16 precision instead of float16;
- tf.keras.activations.softmax was replaced with tf.nn.softmax due to performance issues described in tensorflow/tensorflow#47572;
- Additional TensorBoard and performance logging was added;
- GPU-specific files (examples/*, Dockerfile etc.) and some unused code have been removed;
- To improve performance, tf.data.experimental.prefetch_to_device has been enabled for the HPU device.
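The move from float16 to bfloat16 trades mantissa precision for exponent range: bfloat16 keeps the 8 exponent bits of float32 but only 7 mantissa bits. The sketch below is an illustration using only the Python standard library, not code from this repository; the helper names are hypothetical. It converts a float32 value to bfloat16 bits with round-to-nearest-even and back.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (round-to-nearest-even)."""
    if x != x:  # NaN: return a canonical quiet NaN pattern
        return 0x7FC0
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Add a rounding bias so that ties round to the even neighbour.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(b: int) -> float:
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))[0]


def bfloat16_roundtrip(x: float) -> float:
    """Value of x after being stored as bfloat16."""
    return bfloat16_bits_to_float(float32_to_bfloat16_bits(x))


if __name__ == "__main__":
    # 131072.0 exceeds the float16 maximum (65504) but is exact in bfloat16.
    print(bfloat16_roundtrip(131072.0))
    # Small increments near 1.0 are lost because only 7 mantissa bits remain.
    print(bfloat16_roundtrip(1.001))
```

The wider range is the usual reason bfloat16 training can often run without the loss scaling that float16 typically needs, at the cost of coarser rounding.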
The default configuration is:
- Execution mode: train and evaluate;
- Batch size: 8;
- Data type: bfloat16;
- Maximum number of steps: 6400;
- Weight decay: 0.0005;
- Learning rate: 0.0001;
- Number of Horovod workers (HPUs): 1;
- Data augmentation: True;
- Cross-validation: disabled;
- Using XLA: False;
- Logging losses and performance every N steps: 100.
Please follow the instructions given in the following link for setting up the environment including the $PYTHON environment variable: Gaudi Installation Guide.
This guide will walk you through the process of setting up your system to run the model on Gaudi.
In the docker container, clone this repository and switch to the branch that matches your SynapseAI version. (Run the hl-smi utility to determine the SynapseAI version.)
git clone -b [SynapseAI version] https://github.com/HabanaAI/Model-References /root/Model-References
Go to the UNet2D directory
cd /root/Model-References/TensorFlow/computer_vision/Unet2D
Install required packages using pip
$PYTHON -m pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:/root/Model-References/
Download the EM segmentation challenge dataset*:
$PYTHON download_dataset.py
By default, the dataset is downloaded to the ./data path; use --data_dir <path> to change the location.
*If the original location is unavailable, the dataset is also mirrored on Kaggle: https://www.kaggle.com/soumikrakshit/isbi-challenge-dataset. Registration is required.
The model was tested in both single Gaudi card and 8x Gaudi card configurations.
$PYTHON unet2d.py --data_dir <path/to/dataset> --batch_size <batch_size> \
--dtype <precision> --model_dir <path/to/model_dir> --fold <fold>
For example:
- single Gaudi card training with batch size 8, bfloat16 precision and fold 0:
$PYTHON unet2d.py --data_dir /data/tensorflow/unet2d --batch_size 8 --dtype bf16 --model_dir /tmp/unet2d_1_hpu --fold 0 --tensorboard_logging
- single Gaudi card training with batch size 8, float32 precision and fold 0:
$PYTHON unet2d.py --data_dir /data/tensorflow/unet2d --batch_size 8 --dtype fp32 --model_dir /tmp/unet2d_1_hpu --fold 0 --tensorboard_logging
Running the script via mpirun requires the --use_horovod argument and an mpirun prefix with several parameters.
The mpirun map-by PE attribute value may vary depending on your setup and should be calculated as:
socket:PE = floor((number of physical cores) / (number of gaudi devices per each node))
mpirun --allow-run-as-root --bind-to core --map-by socket:PE=7 -np 8 \
$PYTHON unet2d.py --data_dir <path/to/dataset> --batch_size <batch_size> \
--dtype <precision> --model_dir <path/to/model_dir> --fold <fold> --use_horovod
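The PE calculation above can be sketched as a small helper. This is an illustration only; the function names are hypothetical, and the 56-core host in the usage example is an assumed configuration that happens to give the PE=7 used in the commands in this README.

```python
def map_by_pe(physical_cores: int, gaudi_devices_per_node: int) -> int:
    """floor(physical cores / Gaudi devices per node), as in the formula above."""
    if physical_cores <= 0 or gaudi_devices_per_node <= 0:
        raise ValueError("both counts must be positive")
    pe = physical_cores // gaudi_devices_per_node
    if pe == 0:
        raise ValueError("fewer physical cores than devices")
    return pe


def map_by_argument(physical_cores: int, gaudi_devices_per_node: int) -> str:
    """Build the value to pass to mpirun's --map-by option."""
    return f"socket:PE={map_by_pe(physical_cores, gaudi_devices_per_node)}"


if __name__ == "__main__":
    # Assumed example host: 56 physical cores, 8 Gaudi devices.
    print(map_by_argument(56, 8))
```

Note that the formula counts physical cores; os.cpu_count() reports logical CPUs and may be larger on hosts with hyper-threading enabled.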
For example:
- 8 Gaudi cards training with batch size 8, bfloat16 precision and fold 0:
mpirun --allow-run-as-root --tag-output --merge-stderr-to-stdout --bind-to core --map-by socket:PE=7 -np 8 \
$PYTHON unet2d.py --data_dir /data/tensorflow/unet2d/ --batch_size 8 \
--dtype bf16 --model_dir /tmp/unet2d_8_hpus --fold 0 --tensorboard_logging --log_all_workers --use_horovod
- 8 Gaudi cards training with batch size 8, float32 precision and fold 0:
mpirun --allow-run-as-root --tag-output --merge-stderr-to-stdout --bind-to core --map-by socket:PE=7 -np 8 \
$PYTHON unet2d.py --data_dir /data/tensorflow/unet2d/ --batch_size 8 \
--dtype fp32 --model_dir /tmp/unet2d_8_hpus --fold 0 --tensorboard_logging --log_all_workers --use_horovod
All the commands described above train and evaluate the model on the dataset with fold 0. To perform 5-fold cross-validation on the dataset and compute the average dice score across the 5 folds, either execute the training script 5 times (once per fold) and calculate the average dice score manually, or run the bash script train_and_evaluate.sh:
bash train_and_evaluate.sh <path/to/dataset> <path/for/results> <batch_size> <precision> <number_of_HPUs>
For example:
- single Gaudi card 5-fold-cross-validation with batch size 8 and bfloat16 precision
bash train_and_evaluate.sh /data/tensorflow/unet2d/ /tmp/unet2d_1_hpu 8 bf16 1
- single Gaudi card 5-fold-cross-validation with batch size 8 and float32 precision
bash train_and_evaluate.sh /data/tensorflow/unet2d/ /tmp/unet2d_1_hpu 8 fp32 1
- 8 Gaudi cards 5-fold-cross-validation with batch size 8 and bfloat16 precision
bash train_and_evaluate.sh /data/tensorflow/unet2d/ /tmp/unet2d_8_hpus 8 bf16 8
- 8 Gaudi cards 5-fold-cross-validation with batch size 8 and float32 precision
bash train_and_evaluate.sh /data/tensorflow/unet2d/ /tmp/unet2d_8_hpus 8 fp32 8
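For the manual route, averaging the per-fold scores is a one-liner. The sketch below also shows the standard Dice coefficient on flat binary masks for reference; it is an assumption that this matches the metric reported by the training script, since the implementation in runtime/losses.py may differ (for example by operating on soft predictions). All names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2*|A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def average_dice(per_fold_scores: Sequence[float]) -> float:
    """Arithmetic mean of the dice scores collected from each fold."""
    if not per_fold_scores:
        raise ValueError("no fold scores given")
    return mean(per_fold_scores)


if __name__ == "__main__":
    # Toy masks only; real scores come from the evaluation output of each fold.
    toy_folds = [
        dice_score([1, 1, 0, 0], [1, 0, 0, 0]),
        dice_score([1, 1, 1, 0], [1, 1, 1, 0]),
    ]
    print(average_dice(toy_folds))
```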
The following sections provide more details of scripts in the repository, available parameters, and command-line options.
- unet2d.py: The training script of the UNet2D model, entry point to the application.
- download_dataset.py: Script for downloading the dataset.
- data_loading/data_loader.py: Implements the data loading and augmentation.
- model/layers.py: Defines the different blocks that are used to assemble UNet.
- model/unet.py: Defines the model architecture using the blocks from the layers.py script.
- runtime/arguments.py: Implements the command-line arguments parsing.
- runtime/losses.py: Implements the losses used during training and evaluation.
- runtime/run.py: Implements the logic for training, evaluation, and inference.
- runtime/parse_results.py: Implements the intermediate results parsing.
- runtime/setup.py: Implements helper setup functions.
- train_and_evaluate.sh: Runs the topology training and evaluates the model for 5-fold cross-validation.
Other folders included in the root directory are:
- images/: Contains a model diagram.
The complete list of the available parameters for the unet2d.py script contains:
- --exec_mode: Select the execution mode to run the model (default: train_and_evaluate). Modes available:
  - train - trains model from scratch.
  - evaluate - loads checkpoint from --model_dir (if available) and performs evaluation on validation subset (requires --fold other than None).
  - train_and_evaluate - trains model from scratch and performs validation at the end (requires --fold other than None).
  - predict - loads checkpoint from --model_dir (if available) and runs inference on the test set. Stores the results in --model_dir directory.
  - train_and_predict - trains model from scratch and performs inference.
- --model_dir: Set the output directory for information related to the model (default: /tmp/unet2d).
- --data_dir: Set the input directory containing the dataset (default: None).
- --log_dir: Set the output directory for logs (default: /tmp/unet2d).
- --batch_size: Size of each minibatch per HPU (default: 8).
- --dtype: Set precision to be used in model: fp32/bf16 (default: bf16).
- --fold: Selected fold for cross-validation (default: None).
- --max_steps: Maximum number of steps (batches) for training (default: 6400).
- --log_every: Log data every n steps (default: 100).
- --evaluate_every: Evaluate every n steps (default: 0 - evaluate once at the end).
- --warmup_steps: Used during benchmarking - the number of steps to skip (default: 200). First iterations are usually much slower since the graph is being constructed. Skipping the initial iterations is required for a fair performance assessment.
- --weight_decay: Weight decay coefficient (default: 0.0005).
- --learning_rate: Model’s learning rate (default: 0.0001).
- --seed: Set random seed for reproducibility (default: 0).
- --dump_config: Directory for dumping debug traces (default: None).
- --augment: Enable data augmentation (default: True).
- --benchmark: Enable performance benchmarking (default: False). If the flag is set, the script runs in a benchmark mode - each iteration is timed and the performance result (in images per second) is printed at the end. Works for both train and predict execution modes.
- --xla: Enable accelerated linear algebra optimization (default: False).
- --resume_training: Resume training from a checkpoint (default: False).
- --no_hpu: Disable execution on HPU, train on CPU (default: False).
- --synth_data: Use deterministic and synthetic data (default: False).
- --disable_ckpt_saving: Disables saving checkpoints (default: False).
- --use_horovod: Enable horovod usage (default: False).
- --tensorboard_logging: Enable tensorboard logging (default: False).
- --log_all_workers: Enable logging data for every horovod worker in a separate directory named worker_N (default: False).
- --bf16_config_path: Path to custom mixed precision config to use given in JSON format.
- --tf_verbosity: If set changes logging level from Tensorflow:
  - 0 - all messages are logged (default behavior);
  - 1 - INFO messages are not printed;
  - 2 - INFO and WARNING messages are not printed;
  - 3 - INFO, WARNING, and ERROR messages are not printed.
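The interaction between --benchmark and --warmup_steps described in the list above can be illustrated with a minimal sketch. It is not the repository's implementation; the function name and the way step durations are collected are assumptions. It computes images per second while discarding the first, slower iterations.

```python
from typing import Sequence


def throughput_images_per_second(
    step_durations: Sequence[float], batch_size: int, warmup_steps: int
) -> float:
    """Images per second over the steps that follow the warm-up phase.

    step_durations holds the wall-clock time of each step in seconds.
    """
    timed = list(step_durations[warmup_steps:])
    if not timed:
        raise ValueError("no steps left after skipping warm-up")
    total_time = sum(timed)
    if total_time <= 0:
        raise ValueError("step durations must be positive")
    return batch_size * len(timed) / total_time


if __name__ == "__main__":
    # Two slow warm-up steps (graph construction) followed by steady-state steps.
    durations = [5.0, 4.0] + [0.5] * 8
    print(throughput_images_per_second(durations, batch_size=8, warmup_steps=2))
    # Including the warm-up steps would understate steady-state throughput.
    print(throughput_images_per_second(durations, batch_size=8, warmup_steps=0))
```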
To see the full list of available options and their descriptions, use the -h or --help command-line option, for example:
$PYTHON unet2d.py --help
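As a rough picture of how such options are typically declared, here is a minimal argparse sketch covering a handful of the flags and defaults listed in this README. It is a hypothetical reconstruction for illustration, not the contents of runtime/arguments.py, and the validation helper is an assumption based on the "requires --fold other than None" notes.

```python
import argparse

EXEC_MODES = ["train", "evaluate", "train_and_evaluate", "predict", "train_and_predict"]


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented options with their documented defaults."""
    parser = argparse.ArgumentParser(description="UNet2D-style CLI (illustrative sketch)")
    parser.add_argument("--exec_mode", choices=EXEC_MODES, default="train_and_evaluate")
    parser.add_argument("--model_dir", default="/tmp/unet2d")
    parser.add_argument("--data_dir", default=None)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--dtype", choices=["fp32", "bf16"], default="bf16")
    parser.add_argument("--fold", type=int, default=None)
    parser.add_argument("--max_steps", type=int, default=6400)
    parser.add_argument("--use_horovod", action="store_true")
    return parser


def validate(args: argparse.Namespace) -> None:
    """Reject evaluation modes that were requested without a fold."""
    if "evaluate" in args.exec_mode and args.fold is None:
        raise ValueError(f"--exec_mode {args.exec_mode} requires --fold")


if __name__ == "__main__":
    args = build_parser().parse_args(["--dtype", "fp32", "--fold", "0"])
    validate(args)
    print(args)
```

argparse generates the -h/--help output automatically from these declarations.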
Device | SynapseAI Version | TensorFlow Version(s) |
---|---|---|
Gaudi | 1.4.1 | 2.8.0 |
Gaudi | 1.4.1 | 2.7.1 |
- removed setting number of parallel calls in dataloader mapping in order to improve performance for different TF versions
- updated requirements.txt
- moved BF16 config json file from TensorFlow/common/ to model's dir
- updated requirements.txt
- in order to improve the performance the tf.data.experimental.prefetch_to_device has been enabled for HPU device.
- Changed python or python3 to $PYTHON to execute the correct version based on environment setup.
- Import the horovod-fork package directly instead of using Model-References' TensorFlow.common.horovod_helpers; wrapped the horovod import in a try-except block so that the user is not required to install this library when the model is being run on a single card
- References to custom demo script were replaced by community entry points in README and train_and_evaluate.sh
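The optional horovod import mentioned in the list above follows a common pattern, sketched below. This is an illustration rather than the repository's code: the helper name is hypothetical, and it is an assumption that the horovod-fork package is imported under the usual horovod.tensorflow module path.

```python
from typing import Tuple

try:
    # Only needed for multi-card runs; a missing package is not an error yet.
    import horovod.tensorflow as hvd
except ImportError:
    hvd = None


def init_workers(use_horovod: bool) -> Tuple[int, int]:
    """Return (number of workers, rank of this worker).

    A single-card run never touches horovod, so it works without the package.
    """
    if not use_horovod:
        return 1, 0
    if hvd is None:
        raise RuntimeError("--use_horovod was requested but horovod is not installed")
    hvd.init()
    return hvd.size(), hvd.rank()


if __name__ == "__main__":
    print(init_workers(use_horovod=False))
```

Deferring the failure until horovod is actually requested keeps the single-card path free of the extra dependency.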