[UNet(med)/TF2] Adding TF2 version

This commit is contained in:
Przemek Strzelczyk 2020-03-02 14:49:56 +01:00
parent dee0fe36c5
commit 0634210859
35 changed files with 2130 additions and 0 deletions

View file

@ -0,0 +1,8 @@
ARG FROM_IMAGE_NAME=gitlab-master.nvidia.com:5005/dl/dgx/tensorflow:20.02-tf2-py3-devel
FROM ${FROM_IMAGE_NAME}
ADD . /workspace/unet
WORKDIR /workspace/unet
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

View file

@ -0,0 +1,29 @@
BSD 3-Clause License
Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View file

@ -0,0 +1,17 @@
Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
This repository includes software from:
* TensorFlow, (https://github.com/tensorflow/tensorflow) licensed
under the Apache License, Version 2.0

View file

@ -0,0 +1,580 @@
# U-Net Medical Image Segmentation for TensorFlow 2.x
This repository provides a script and recipe to train U-Net Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
* [Training stability results](#training-stability-results)
* [Training stability: NVIDIA DGX-1 (8x V100 16G)](#training-stability-nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16g)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the original paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). U-Net allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
The following figure shows the construction of the U-Net model and its different components. U-Net is composed of a contractive and an expanding path, which together build a bottleneck in the centermost part of the network through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve training.
![U-Net](images/unet.png)
Figure 1. The architecture of a U-Net model. Taken from the <a href="https://arxiv.org/abs/1505.04597">U-Net: Convolutional Networks for Biomedical Image Segmentation paper</a>.
### Default configuration
U-Net consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and concatenation with the correspondingly cropped feature map from the contractive path.
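To make the expanding-path step concrete, the following is a minimal Keras sketch of one upsampling step with crop-and-concatenate; the layer choices, names, and shapes are illustrative assumptions rather than the exact implementation in `model/unet.py`.
```python
import tensorflow as tf

def upsample_concat_block(x, skip, filters):
    """One expanding-path step: upsample, crop the skip connection, concatenate, convolve."""
    x = tf.keras.layers.Conv2DTranspose(filters, 2, strides=2)(x)
    crop = (skip.shape[1] - x.shape[1]) // 2        # unpadded convs leave the skip map larger
    skip = tf.keras.layers.Cropping2D(crop)(skip)
    x = tf.keras.layers.Concatenate()([x, skip])
    x = tf.keras.layers.Conv2D(filters, 3, activation='relu')(x)   # unpadded ("valid") 3x3 conv
    return tf.keras.layers.Conv2D(filters, 3, activation='relu')(x)

# Shapes taken from the original U-Net figure: 28x28 bottleneck, 64x64 skip map.
bottleneck = tf.zeros([1, 28, 28, 1024])
skip_map = tf.zeros([1, 64, 64, 512])
print(upsample_concat_block(bottleneck, skip_map, 512).shape)      # (1, 52, 52, 512)
```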
### Feature support matrix
The following features are supported by this model:
| **Feature** | **U-Net Medical** |
|-------------|---------------------|
| Automatic mixed precision (AMP) | Yes |
| Horovod Multi-GPU (NCCL) | Yes |
| Accelerated Linear Algebra (XLA)| Yes |
#### Features
**Automatic Mixed Precision (AMP)**
This implementation of U-Net uses AMP to implement mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
**Horovod**
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
**Multi-GPU training with Horovod**
Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see the [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
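For reference, a minimal sketch of the usual Horovod setup for TensorFlow 2 is shown below (process initialization, GPU pinning, and the initial variable broadcast); the exact wiring used by this repository may differ.
```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each Horovod process to a single GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# After the model is built and the first batch is processed, rank 0 broadcasts
# its variables so every worker starts from the same state, e.g.:
# hvd.broadcast_variables(model.variables, root_rank=0)
```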
**XLA support (experimental)**
XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. The results are improvements in speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled.
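As an illustration, XLA can be enabled globally in TensorFlow 2 with a single call, which is what the `--use_xla` flag is expected to toggle (the exact wiring is an assumption):
```python
import tensorflow as tf

# Enable XLA JIT compilation for TensorFlow operations.
tf.config.optimizer.set_jit(True)
```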
### Mixed precision training
Mixed precision is the combined use of different numerical precision in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
#### Enabling mixed precision
This implementation exploits the TensorFlow Automatic Mixed Precision feature. To enable AMP, you simply need to supply the `--use_amp` flag to the `main.py` script. For reference, enabling AMP required us to apply the following changes to the code:
1. Set Keras mixed precision policy:
```python
if params['use_amp']:
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')
```
2. Use loss scaling wrapper on the optimizer:
```python
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum)
if params['use_amp']:
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, "dynamic")
```
3. Use scaled loss to calculate gradients:
```python
scaled_loss = optimizer.get_scaled_loss(loss)
tape = hvd.DistributedGradientTape(tape)
scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
gradients = optimizer.get_unscaled_gradients(scaled_gradients)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```
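Putting these pieces together, a distributed training step might look like the sketch below; `model`, `loss_fn`, `features`, and `labels` are placeholders, so treat this as an illustration of the pattern rather than the exact code in `run.py`.
```python
import tensorflow as tf
import horovod.tensorflow as hvd

def train_step(features, labels, model, loss_fn, optimizer, use_amp, first_batch):
    with tf.GradientTape() as tape:
        output = model(features, training=True)
        loss = loss_fn(labels, output)
        if use_amp:
            loss = optimizer.get_scaled_loss(loss)

    # Average gradients across workers with NCCL via Horovod.
    tape = hvd.DistributedGradientTape(tape)
    gradients = tape.gradient(loss, model.trainable_variables)
    if use_amp:
        gradients = optimizer.get_unscaled_gradients(gradients)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Broadcast the initial state from rank 0 after the first step.
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss
```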
## Setup
The following section lists the requirements that you need to meet in order to start training the U-Net Medical model.
### Requirements
This repository contains a `Dockerfile` which extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- TensorFlow 20.02-tf2-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow) with TensorFlow 2.1 or later
- [NVIDIA Volta GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
- [Running TensorFlow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
For those unable to use the TensorFlow NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the U-Net model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). These steps enable you to build the U-Net TensorFlow NGC container, train and evaluate your model, and generate predictions on the test data. Furthermore, you can then choose to:
* compare your evaluation accuracy with our [Training accuracy results](#training-accuracy-results),
* compare your training performance with our [Training performance benchmark](#training-performance-benchmark),
* compare your inference performance with our [Inference performance benchmark](#inference-performance-benchmark).
For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository.
Executing this command will create your local repository with all the code to run U-Net.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Medical_TF2
```
2. Build the U-Net TensorFlow NGC container.
This command will use the `Dockerfile` to create a Docker image named `unet_tf2`, downloading all the required components automatically.
```
docker build -t unet_tf2 .
```
The NGC container contains all the components optimized for usage on NVIDIA hardware.
3. Start an interactive session in the NGC container to run preprocessing/training/inference.
The following command will launch the container and mount the `./data` directory as a volume to the `/data` directory inside the container, and `./results` directory to the `/results` directory in the container.
```bash
mkdir data
mkdir results
docker run --runtime=nvidia -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm --ipc=host -v ${PWD}/data:/data -v ${PWD}/results:/results unet_tf2:latest /bin/bash
```
Any datasets and experiment results (logs, checkpoints, etc.) saved to `/data` or `/results` will be accessible
in the `./data` or `./results` directory on the host, respectively.
4. Download and preprocess the data.
The U-Net script `main.py` operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [U-Net paper](https://arxiv.org/abs/1505.04597).
The script `download_dataset.py` is provided for data download. It is possible to select the destination folder when downloading the files by using the `--data_dir` flag. For example:
```bash
python download_dataset.py --data_dir /data
```
Training and test data are composed of 3 multi-page `TIF` files, each containing 30 2D images (around 30 MB in total). Once downloaded with the `download_dataset.py` script, the data can be used to run the training and benchmark scripts described below by pointing `main.py` to its location using the `--data_dir` flag.
**Note:** Masks are only provided for training data.
5. Start training.
After the Docker container is launched, training with the [default hyperparameters](#parameters) (for example, 1/8 GPUs, FP32/TF-AMP) can be started with:
```bash
bash examples/unet_{FP32, TF-AMP}_{1,8}GPU.sh <path/to/dataset> <path/to/checkpoint>
```
For example, to run with full precision (FP32) on 1 GPU from the projects folder, simply use:
```bash
bash examples/unet_FP32_1GPU.sh /data /results
```
This script will launch a training on a single fold and store the model's checkpoints in the `<path/to/checkpoint>` directory.
The script can be run directly by modifying flags if necessary, especially the number of GPUs, which is defined after the `-np` flag. Since the test volume does not have labels, 20% of the training data is used for validation in a 5-fold cross-validation manner. The fold can be changed using `--crossvalidation_idx` with an integer in the range 0-4. For example, to run with 4 GPUs using fold 1, use:
```bash
horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --crossvalidation_idx 1 --use_xla --use_amp
```
Training will result in a checkpoint file being written to `./results` on the host machine.
6. Start validation/evaluation.
The trained model can be evaluated by passing the `--exec_mode evaluate` flag. Since evaluation is carried out on a validation dataset, the `--crossvalidation_idx` parameter should be set. For example:
```bash
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --crossvalidation_idx 0 --use_xla --use_amp
```
Evaluation can also be triggered jointly after training by passing the `--exec_mode train_and_evaluate` flag.
7. Start inference/predictions.
The trained model can be used for inference by passing the `--exec_mode predict` flag:
```bash
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode predict --use_xla --use_amp
```
Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark the performance of your training with the [Training performance benchmark](#training-performance-benchmark), or of your inference with the [Inference performance benchmark](#inference-performance-benchmark). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
* `main.py`: Serves as the entry point to the application.
* `run.py`: Implements the logic for training, evaluation, and inference.
* `Dockerfile`: Specifies the container with the basic set of dependencies to run U-Net.
* `requirements.txt`: Set of extra requirements for running U-Net.
* `download_dataset.py`: Automatically downloads the dataset for training.
The `utils/` folder encapsulates the necessary tools to train and perform inference using U-Net. Its main components are:
* `cmd_util.py`: Implements the command-line arguments parsing.
* `data_loader.py`: Implements the data loading and augmentation.
* `losses.py`: Implements the losses used during training and evaluation.
* `parse_results.py`: Implements the intermediate results parsing.
The `model/` folder contains information about the building blocks of U-Net and the way they are assembled. Its contents are:
* `layers.py`: Defines the different blocks that are used to assemble U-Net.
* `unet.py`: Defines the model architecture using the blocks from the `layers.py` script.
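To illustrate the kind of block defined in `layers.py`, below is a hedged sketch of a contractive-path block (two unpadded 3x3 convolutions followed by 2x2 max pooling); the class name and exact layer configuration are assumptions, not the repository's actual code.
```python
import tensorflow as tf

class DownsampleBlock(tf.keras.Model):
    """Two unpadded 3x3 convolutions followed by 2x2 max pooling."""
    def __init__(self, filters):
        super().__init__()
        self.conv1 = tf.keras.layers.Conv2D(filters, 3, activation='relu')  # 'valid' padding by default
        self.conv2 = tf.keras.layers.Conv2D(filters, 3, activation='relu')
        self.pool = tf.keras.layers.MaxPool2D(pool_size=(2, 2))

    def call(self, inputs, training=None):
        x = self.conv1(inputs)
        skip = self.conv2(x)                 # kept for the skip connection
        return self.pool(skip), skip

block = DownsampleBlock(64)
pooled, skip = block(tf.zeros([1, 572, 572, 1]))
print(pooled.shape, skip.shape)              # (1, 284, 284, 64) (1, 568, 568, 64)
```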
Other folders included in the root directory are:
* `dllogger/`: Contains the utilities for logging.
* `examples/`: Provides examples for training and benchmarking U-Net.
* `images/`: Contains a model diagram.
### Parameters
The complete list of the available parameters for the `main.py` script contains:
* `--exec_mode`: Select the execution mode to run the model (default: `train`). Modes available:
* `train` - trains the model from scratch.
* `evaluate` - loads checkpoint (if available) and performs evaluation on validation subset (requires `--crossvalidation_idx` other than `None`).
* `train_and_evaluate` - trains model from scratch and performs validation at the end (requires `--crossvalidation_idx` other than `None`).
* `predict` - loads checkpoint (if available) and runs inference on the test set. Stores the results in `--model_dir` directory.
* `train_and_predict` - trains model from scratch and performs inference.
* `--model_dir`: Set the output directory for information related to the model (default: `/results`).
* `--log_dir`: Set the output directory for logs (default: None).
* `--data_dir`: Set the input directory containing the dataset (default: `None`).
* `--batch_size`: Size of each minibatch per GPU (default: `1`).
* `--crossvalidation_idx`: Selected fold for cross-validation (default: `None`).
* `--max_steps`: Maximum number of steps (batches) for training (default: `1000`).
* `--seed`: Set random seed for reproducibility (default: `0`).
* `--weight_decay`: Weight decay coefficient (default: `0.0005`).
* `--log_every`: Log performance every n steps (default: `100`).
* `--learning_rate`: Model's learning rate (default: `0.0001`).
* `--augment`: Enable data augmentation (default: `False`).
* `--benchmark`: Enable performance benchmarking (default: `False`). If the flag is set, the script runs in a benchmark mode - each iteration is timed and the performance result (in images per second) is printed at the end. Works for both `train` and `predict` execution modes.
* `--warmup_steps`: Used during benchmarking - the number of steps to skip (default: `200`). First iterations are usually much slower since the graph is being constructed. Skipping the initial iterations is required for a fair performance assessment.
* `--use_xla`: Enable accelerated linear algebra optimization (default: `False`).
* `--use_amp`: Enable automatic mixed precision (default: `False`).
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
```bash
python main.py --help
```
The following example output is printed when running the model:
```
usage: main.py [-h]
[--exec_mode {train,train_and_predict,predict,evaluate,train_and_evaluate}]
[--model_dir MODEL_DIR] --data_dir DATA_DIR [--log_dir LOG_DIR]
[--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE]
[--crossvalidation_idx CROSSVALIDATION_IDX]
[--max_steps MAX_STEPS] [--weight_decay WEIGHT_DECAY]
[--log_every LOG_EVERY] [--warmup_steps WARMUP_STEPS]
[--seed SEED] [--augment] [--no-augment] [--benchmark]
[--no-benchmark] [--use_amp] [--use_xla]
UNet-medical
optional arguments:
-h, --help show this help message and exit
--exec_mode {train,train_and_predict,predict,evaluate,train_and_evaluate}
Execution mode of running the model
--model_dir MODEL_DIR
Output directory for information related to the model
--data_dir DATA_DIR Input directory containing the dataset for training
the model
--log_dir LOG_DIR Output directory for training logs
--batch_size BATCH_SIZE
Size of each minibatch per GPU
--learning_rate LEARNING_RATE
Learning rate coefficient for AdamOptimizer
--crossvalidation_idx CROSSVALIDATION_IDX
Chosen fold for cross-validation. Use None to disable
cross-validation
--max_steps MAX_STEPS
Maximum number of steps (batches) used for training
--weight_decay WEIGHT_DECAY
Weight decay coefficient
--log_every LOG_EVERY
Log performance every n steps
--warmup_steps WARMUP_STEPS
Number of warmup steps
--seed SEED Random seed
--augment Perform data augmentation during training
--no-augment
--benchmark Collect performance metrics during training
--no-benchmark
--use_amp Train using TF-AMP
--use_xla Train using XLA
```
### Getting the data
The U-Net model was trained on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission. Upon registration, the challenge's data is made available through the following links:
* [train-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-volume.tif)
* [train-labels.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-labels.tif)
* [test-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/test-volume.tif)
Training and test data are composed of three 512x512x30 `TIF` volumes (`test-volume.tif`, `train-volume.tif` and `train-labels.tif`). The files `test-volume.tif` and `train-volume.tif` contain grayscale 2D slices to be segmented. Additionally, training masks are provided in `train-labels.tif` as a 512x512x30 `TIF` volume, where each pixel has one of two classes:
* 0 indicating the presence of cellular membrane,
* 1 corresponding to background.
The objective is to produce a set of masks that segment the data as accurately as possible. The results are expected to be submitted as a 32-bit `TIF` 3D image, with values between `0` (100% membrane certainty) and `1` (100% non-membrane certainty).
#### Dataset guidelines
The training and test datasets are given as stacks of 30 2D-images provided as a multi-page `TIF` that can be read using the Pillow library and NumPy (both Python packages are installed by the `Dockerfile`).
Initially, data is loaded from a multi-page `TIF` file and converted to 512x512x30 NumPy arrays with the use of Pillow. The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data_loader.py` script.
These NumPy arrays are fed to the model through `tf.data.Dataset.from_tensor_slices()`, in order to achieve high performance.
The voxel intensities are then normalized to the `[-1, 1]` interval, whereas labels are one-hot encoded for later use in the dice or pixel-wise cross-entropy loss, becoming 512x512x30x2 tensors.
If augmentation is enabled, the following augmentation techniques are applied:
* Random horizontal flipping
* Random vertical flipping
* Crop to a random dimension and resize to input dimension
* Random brightness shifting
In the end, images are reshaped to 388x388 and padded to 572x572 to fit the input of the network. Masks are only reshaped to 388x388 to fit the output of the network. Moreover, pixel intensities are clipped to the `[-1, 1]` interval.
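The steps above can be summarized in the short sketch below, which loads the multi-page `TIF` volumes with Pillow and builds a `tf.data` pipeline; the file paths and normalization constants are illustrative assumptions rather than the exact code in `data_loader.py`.
```python
import numpy as np
import tensorflow as tf
from PIL import Image, ImageSequence

def load_multipage_tiff(path):
    """Load a multi-page TIF file into a (pages, height, width) NumPy array."""
    return np.stack([np.array(p) for p in ImageSequence.Iterator(Image.open(path))])

images = load_multipage_tiff('/data/train-volume.tif').astype(np.float32)
labels = load_multipage_tiff('/data/train-labels.tif').astype(np.float32)

# Normalize intensities to [-1, 1] and add a channel dimension.
images = (images / 127.5 - 1.0)[..., np.newaxis]
labels = labels[..., np.newaxis]

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .shuffle(30)
           .repeat()
           .batch(1)
           .prefetch(tf.data.experimental.AUTOTUNE))
```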
#### Multi-dataset
This implementation is tuned for the EM segmentation challenge dataset. Using other datasets is possible, but might require changes to the code (data loader) and tuning some hyperparameters (e.g. learning rate, number of iterations).
In the current implementation, the data loader works with NumPy arrays by loading them at initialization and passing them for training in slices via `tf.data.Dataset.from_tensor_slices()`. If you're able to fit your dataset into memory, then convert the data into three NumPy arrays - training images, training labels, and testing images (optional). If your dataset is large, you will have to adapt the data loader for lazy loading of data. For a walk-through, check the [TensorFlow tf.data API guide](https://www.tensorflow.org/guide/data_performance).
The performance of the model depends on the dataset size.
Generally, the model should scale better for datasets containing more data. For a smaller dataset, you might experience lower performance.
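If the dataset does not fit into memory, one hedged option is to replace `from_tensor_slices()` with a generator-based pipeline; `load_example` and the path lists below are placeholders for your own loading logic.
```python
import tensorflow as tf

def example_generator(image_paths, label_paths):
    # Yield one (image, label) pair at a time instead of holding everything in memory.
    for img_path, lbl_path in zip(image_paths, label_paths):
        yield load_example(img_path), load_example(lbl_path)

dataset = tf.data.Dataset.from_generator(
    lambda: example_generator(image_paths, label_paths),
    output_types=(tf.float32, tf.float32),
    output_shapes=((572, 572, 1), (388, 388, 2)),
).batch(1).prefetch(tf.data.experimental.AUTOTUNE)
```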
### Training process
The model trains for a total of 40,000 batches (40,000 / number of GPUs), with the default U-Net setup:
* Adam optimizer with learning rate of 0.0001.
This default parametrization is applied when running scripts from the `./examples` directory and when running `main.py` without explicitly overriding these parameters. By default, the training is in full precision. To enable AMP, pass the `--use_amp` flag. AMP can be enabled for every mode of execution.
The default configuration minimizes a function _L = 1 - DICE + cross entropy_ during training.
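For reference, a sketch of that loss is shown below; the exact implementation lives in `utils/losses.py`, so the smoothing constant and reductions here are assumptions.
```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, eps=1.0):
    intersection = tf.reduce_sum(y_true * y_pred, axis=(1, 2, 3))
    union = tf.reduce_sum(y_true, axis=(1, 2, 3)) + tf.reduce_sum(y_pred, axis=(1, 2, 3))
    return tf.reduce_mean((2.0 * intersection + eps) / (union + eps))

def total_loss(y_true, logits):
    # L = 1 - DICE + cross entropy
    probs = tf.nn.softmax(logits, axis=-1)
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=logits))
    return 1.0 - dice_coefficient(y_true, probs) + cross_entropy
```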
The training can be run directly without using the predefined scripts. The name of the training script is `main.py`. Because of the multi-GPU support, training should always be run with the Horovod distributed launcher like this:
```bash
horovodrun -np <number/of/gpus> python main.py --data_dir /data [other parameters]
```
*Note:* When calling the `main.py` script manually, data augmentation is disabled. In order to enable data augmentation, use the `--augment` flag in your invocation.
The main results of the training are checkpoints stored by default in `./results/` on the host machine, and in `/results` in the container. This location can be controlled
by the `--model_dir` command-line argument, if a different location was mounted while starting the container. When the training is run in `train_and_predict` mode, the inference will take place after the training is finished, and the inference results will be stored in the `/results` directory.
If the `--exec_mode train_and_evaluate` parameter was used, and the `--crossvalidation_idx` parameter is set to an integer value in {0, 1, 2, 3, 4}, the evaluation of the validation set takes place after the training is completed. The results of the evaluation will be printed to the console.
### Inference process
Inference can be launched with the same script used for training by passing the `--exec_mode predict` flag:
```bash
python main.py --exec_mode predict --data_dir /data --model_dir <path/to/checkpoint> [other parameters]
```
The script will then:
* Load the checkpoint from the directory specified by `<path/to/checkpoint>` (by default `/results`),
* Run inference on the test dataset,
* Save the resulting binary masks in a `TIF` format.
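As a hedged illustration of that last step, the predicted masks can be written back to a multi-page 32-bit `TIF` with Pillow; `predictions` below is assumed to be a `(30, 388, 388)` NumPy array of per-pixel probabilities.
```python
import numpy as np
from PIL import Image

# Save each predicted slice as one page of a 32-bit float TIF volume.
pages = [Image.fromarray(p.astype(np.float32), mode='F') for p in predictions]
pages[0].save('/results/test-masks.tif', save_all=True, append_images=pages[1:])
```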
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark training, run one of the `TRAIN_BENCHMARK` scripts in `./examples/`:
```bash
bash examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
```
For example, to benchmark training using mixed-precision on 8 GPUs use:
```bash
bash examples/unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during training in the next 800 iterations.
To have more control, you can run the script by directly providing all relevant run parameters. For example:
```bash
horovodrun -np <num of gpus> python main.py --exec_mode train --benchmark --augment --data_dir <path/to/dataset> --model_dir <optional, path/to/checkpoint> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
```
At the end of the script, a line reporting the best train throughput will be printed.
#### Inference performance benchmark
To benchmark inference, run one of the scripts in `./examples/`:
```bash
bash examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
For example, to benchmark inference using mixed-precision:
```bash
bash examples/unet_INFER_BENCHMARK_TF-AMP.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during inference in the next 400 iterations.
To have more control, you can run the script by directly providing all relevant run parameters. For example:
```bash
python main.py --exec_mode predict --benchmark --data_dir <path/to/dataset> --model_dir <optional, path/to/checkpoint> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
```
At the end of the script, a line reporting the best inference throughput will be printed.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf2-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [hours] | Time to train - mixed precision [hours] | Time to train speedup (FP32 to mixed precision) |
|------|------------------|-----------------|----------------------------|------------------------------|----------------------------|--------------------------------|
| 1 | 8 | 0.8825 | 0.8826 | 6.51 | 2.25 | 2.89 |
| 8 | 8 | 0.8968 | 0.8962 | 0.89 | 0.32 | 2.76 |
To reproduce this result, start the Docker container interactively and run one of the TRAIN scripts:
```bash
bash examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
For example:
```bash
bash examples/unet_TRAIN_TF-AMP_8GPU.sh /data /results 8
```
This command will launch a script which runs 5-fold cross-validation training for 40,000 iterations and prints the validation DICE score and cross-entropy loss. The reported time is for a single fold, which means the full 5-fold training takes 5 times longer. The default batch size is 8; if your GPU has less than 16 GB of memory and you encounter GPU memory issues, you should decrease the batch size. The logs of the runs can be found in the `/results` directory once the script finishes.
#### Training stability results
##### Training stability: NVIDIA DGX-1 (8x V100 16G)
The histogram below shows the best DICE scores achieved for 100 experiments using mixed precision. Mean DICE score for mixed precision was equal to 0.8962 and for single precision it was equal to
0.8968.
![score_histogram](images/score_histogram.png)
#### Training performance results
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK_{TF-AMP, FP32}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf2-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs. Performance numbers (in items/images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|------|------------------|-------------------|--------------------------------|---------------------------------------------|---------------------------|--------------------------------|
| 1 | 8 | 17.98 | 51.89 | 2.89 | N/A | N/A |
| 8 | 8 | 143.08 | 386.15 | 2.70 | 7.44 | 7.97 |
To achieve these same results, follow the steps in the [Training performance benchmark](#training-performance-benchmark) section.
Throughput is reported in images per second. Latency is reported in milliseconds per image.
TensorFlow 2 runs in eager mode by default, which makes tensor evaluation trivial at the cost of lower performance. To mitigate this, multiple layers of performance optimization were implemented. Two of them, AMP and XLA, were already described. An additional one, AutoGraph, allows constructing a graph from a subset of Python syntax, improving performance simply by adding a `@tf.function` decorator to the train function. To read more about AutoGraph, see [Better performance with tf.function and AutoGraph](https://www.tensorflow.org/guide/function).
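As a toy illustration (not the repository's actual train function), adding the decorator is all it takes for AutoGraph to trace a Python function into a graph:
```python
import tensorflow as tf

@tf.function  # traced into a graph on first call, then reused
def squared_error(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

print(squared_error(tf.ones([4]), tf.zeros([4])))
```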
The training performance using 1 GPU with batch size 8 and various combinations of AMP/FP32, XLA, and AutoGraph (AG) is shown in the plot below.
![training_throughput](images/training_throughput.png)
#### Inference performance results
##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `examples/unet_INFER_BENCHMARK_{TF-AMP, FP32}.sh` inferencing benchmarking script in the tensorflow:20.02-tf2-py3 NGC container on NVIDIA DGX-1 with (1x V100 16G) GPU.
FP16
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|------|-----------|--------|---------|--------|--------|--------|
| 1 | 572x572x1 | 152.48 | 6.558 | 6.568 | 6.569 | 6.572 |
| 2 | 572x572x1 | 164.64 | 12.148 | 12.162 | 12.164 | 12.168 |
| 4 | 572x572x1 | 179.54 | 22.279 | 22.299 | 22.302 | 22.309 |
| 8 | 572x572x1 | 187.65 | 42.633 | 42.658 | 42.663 | 42.672 |
| 16 | 572x572x1 | 189.38 | 84.486 | 84.541 | 84.551 | 84.569 |
FP32
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|------|-----------|--------|---------|---------|---------|---------|
| 1 | 572x572x1 | 54.41 | 18.381 | 18.395 | 18.398 | 18.403 |
| 2 | 572x572x1 | 56.83 | 35.193 | 35.210 | 35.213 | 35.219 |
| 4 | 572x572x1 | 57.62 | 69.421 | 69.459 | 69.465 | 69.478 |
| 8 | 572x572x1 | 58.66 | 136.391 | 136.506 | 136.525 | 136.564 |
| 16 | 572x572x1 | 75.74 | 211.240 | 211.302 | 211.313 | 211.336 |
To achieve these same results, follow the steps in the [Inference performance benchmark](#inference-performance-benchmark) section.
Throughput is reported in images per second. Latency is reported in milliseconds per batch.
The inference performance using 1 GPU with batch size 8 and various combinations of AMP/FP32, XLA, and AutoGraph (AG) is shown in the plot below.
![inference_throughput](images/inference_throughput.png)
## Release notes
### Changelog
February 2020
* Initial release
### Known issues
* For TensorFlow 2.0 the training performance using AMP and XLA is around 30% lower than reported here. The issue was solved in TensorFlow 2.1.
* Due to random initialization, around 5% of training runs result in lower DICE score, usually around 0.81.

View file

@ -0,0 +1,163 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from collections import defaultdict
from datetime import datetime
import json
import atexit
class Backend(ABC):
def __init__(self, verbosity):
self._verbosity = verbosity
@property
def verbosity(self):
return self._verbosity
@abstractmethod
def log(self, timestamp, elapsedtime, step, data):
pass
@abstractmethod
def metadata(self, timestamp, elapsedtime, metric, metadata):
pass
class Verbosity:
OFF = -1
DEFAULT = 0
VERBOSE = 1
class Logger:
def __init__(self, backends):
self.backends = backends
atexit.register(self.flush)
self.starttime = datetime.now()
def metadata(self, metric, metadata):
timestamp = datetime.now()
elapsedtime = (timestamp - self.starttime).total_seconds()
for b in self.backends:
b.metadata(timestamp, elapsedtime, metric, metadata)
def log(self, step, data, verbosity=1):
timestamp = datetime.now()
elapsedtime = (timestamp - self.starttime).total_seconds()
for b in self.backends:
if b.verbosity >= verbosity:
b.log(timestamp, elapsedtime, step, data)
def flush(self):
for b in self.backends:
b.flush()
def default_step_format(step):
return str(step)
def default_metric_format(metric, metadata, value):
unit = metadata["unit"] if "unit" in metadata.keys() else ""
format = "{" + metadata["format"] + "}" if "format" in metadata.keys() else "{}"
return "{}:{} {}".format(
metric, format.format(value) if value is not None else value, unit
)
def default_prefix_format(timestamp):
return "DLL {} - ".format(timestamp)
class StdOutBackend(Backend):
def __init__(
self,
verbosity,
step_format=default_step_format,
metric_format=default_metric_format,
prefix_format=default_prefix_format,
):
super().__init__(verbosity=verbosity)
self._metadata = defaultdict(dict)
self.step_format = step_format
self.metric_format = metric_format
self.prefix_format = prefix_format
self.elapsed = 0.0
def metadata(self, timestamp, elapsedtime, metric, metadata):
self._metadata[metric].update(metadata)
def log(self, timestamp, elapsedtime, step, data):
print(
"{}{} {}{}".format(
self.prefix_format(timestamp),
self.step_format(step),
" ".join(
[
self.metric_format(m, self._metadata[m], v)
for m, v in data.items()
]
),
"elapsed:"+str(elapsedtime)
)
)
def flush(self):
pass
class JSONStreamBackend(Backend):
def __init__(self, verbosity, filename):
super().__init__(verbosity=verbosity)
self._filename = filename
self.file = open(filename, "w")
atexit.register(self.file.close)
def metadata(self, timestamp, elapsedtime, metric, metadata):
self.file.write(
"DLLL {}\n".format(
json.dumps(
dict(
timestamp=str(timestamp.timestamp()),
elapsedtime=str(elapsedtime),
datetime=str(timestamp),
type="METADATA",
metric=metric,
metadata=metadata,
)
)
)
)
def log(self, timestamp, elapsedtime, step, data):
self.file.write(
"DLLL {}\n".format(
json.dumps(
dict(
timestamp=str(timestamp.timestamp()),
datetime=str(timestamp),
elapsedtime=str(elapsedtime),
type="LOG",
step=step,
data=data,
)
)
)
)
def flush(self):
self.file.flush()

View file

@ -0,0 +1,40 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
PARSER = argparse.ArgumentParser(description="U-Net medical")
PARSER.add_argument('--data_dir',
type=str,
default='./data',
help="""Directory where to download the dataset""")
def main():
FLAGS = PARSER.parse_args()
if not os.path.exists(FLAGS.data_dir):
os.makedirs(FLAGS.data_dir)
os.system('wget http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-volume.tif -P {}'.format(FLAGS.data_dir))
os.system('wget http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-labels.tif -P {}'.format(FLAGS.data_dir))
os.system('wget http://brainiac2.mit.edu/isbi_challenge/sites/default/files/test-volume.tif -P {}'.format(FLAGS.data_dir))
print("Finished downloading files for U-Net medical to {}".format(FLAGS.data_dir))
if __name__ == '__main__':
main()

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 1 GPU and trains for 40000 iterations with batch_size 1. Usage:
# bash unet_FP32_1GPU.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2/log.json

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPUs and trains for 40000 iterations with batch_size 1. Usage:
# bash unet_FP32_8GPU.sh <path to dataset> <path to results directory>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2/log.json

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 1 GPU for inference benchmarking. Usage:
# bash unet_INFER_BENCHMARK_FP32.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 1 GPU for inference benchmarking. Usage:
# bash unet_INFER_BENCHMARK_TF-AMP.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla --use_amp

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 1 GPU for inference with batch_size 1. Usage:
# bash unet_INFER_FP32.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 1 GPU for inference batch_size 1. Usage:
# bash unet_INFER_TF-AMP.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla --use_amp

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 1 GPU and trains for 40000 iterations with batch_size 1. Usage:
# bash unet_TF-AMP_1GPU.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2/log.json

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 8 GPUs and trains for 40000 iterations with batch_size 1. Usage:
# bash unet_TF-AMP_8GPU.sh <path to dataset> <path to results directory>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2/log.json

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 1 GPU for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPUs for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 1 GPU for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp

View file

@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 8 GPUs for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp

View file

@ -0,0 +1,24 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 1 GPU and runs 5-fold cross-validation training for 40000 iterations.
# Usage:
# bash unet_TRAIN_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_1GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_1GPU

View file

@ -0,0 +1,24 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
# Usage:
# bash unet_TRAIN_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_8GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_8GPU

View file

@@ -0,0 +1,24 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in TF-AMP on 1 GPU and runs 5-fold cross-validation training for 40000 iterations.
# Usage:
# bash unet_TRAIN_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_1GPU

View file

@@ -0,0 +1,24 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in TF-AMP on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
# Usage:
# bash unet_TRAIN_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_8GPU

Binary file not shown (new image, 14 KiB).

Binary file not shown (new image, 11 KiB).

Binary file not shown (new image, 14 KiB).

Binary file not shown (new image, 101 KiB).

View file

@@ -0,0 +1,111 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Entry point of the application.
This file serves as the entry point for running UNet segmentation of neuronal processes.
Example:
Training can be adjusted by modifying the arguments specified below::
$ python main.py --exec_mode train --model_dir /dataset ...
"""
import os
import horovod.tensorflow as hvd
import tensorflow as tf
from model.unet import Unet
from run import train, evaluate, predict, restore_checkpoint
from utils.cmd_util import PARSER, _cmd_params
from utils.data_loader import Dataset
from dllogger.logger import Logger, StdOutBackend, JSONStreamBackend, Verbosity
def main():
"""
Starting point of the application
"""
flags = PARSER.parse_args()
params = _cmd_params(flags)
backends = [StdOutBackend(Verbosity.VERBOSE)]
if params.log_dir is not None:
backends.append(JSONStreamBackend(Verbosity.VERBOSE, params.log_dir))
logger = Logger(backends)
# Optimization flags
os.environ['CUDA_CACHE_DISABLE'] = '0'
os.environ['HOROVOD_GPU_ALLREDUCE'] = 'NCCL'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
    os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
    os.environ['TF_ADJUST_HUE_FUSED'] = '1'
    os.environ['TF_ADJUST_SATURATION_FUSED'] = '1'
    os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
hvd.init()
if params.use_xla:
tf.config.optimizer.set_jit(True)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
if params.use_amp:
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')
else:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '0'
# Build the model
model = Unet()
dataset = Dataset(data_dir=params.data_dir,
batch_size=params.batch_size,
fold=params.crossvalidation_idx,
augment=params.augment,
gpu_id=hvd.rank(),
num_gpus=hvd.size(),
seed=params.seed)
if 'train' in params.exec_mode:
train(params, model, dataset, logger)
if 'evaluate' in params.exec_mode:
if hvd.rank() == 0:
model = restore_checkpoint(model, params.model_dir)
evaluate(params, model, dataset, logger)
if 'predict' in params.exec_mode:
if hvd.rank() == 0:
model = restore_checkpoint(model, params.model_dir)
predict(params, model, dataset, logger)
if __name__ == '__main__':
main()
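
The dispatch in main() uses substring membership on exec_mode, so a combined mode such as train_and_evaluate enters both the training and the evaluation branches. A tiny illustration of that behaviour (mode strings only, nothing is trained):

for mode in ("train", "evaluate", "train_and_evaluate", "predict"):
    print(mode, "-> train:", "train" in mode,
          "evaluate:", "evaluate" in mode,
          "predict:", "predict" in mode)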

View file

@@ -0,0 +1,189 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# -*- coding: utf-8 -*-
""" Contains a set of utilities that allow building the UNet model
"""
import tensorflow as tf
def _crop_and_concat(inputs, residual_input):
""" Perform a central crop of ``residual_input`` and concatenate to ``inputs``
Args:
inputs (tf.Tensor): Tensor with input
residual_input (tf.Tensor): Residual input
Return:
Concatenated tf.Tensor with the size of ``inputs``
"""
factor = inputs.shape[1] / residual_input.shape[1]
return tf.concat([inputs, tf.image.central_crop(residual_input, factor)], axis=-1)
class InputBlock(tf.keras.Model):
def __init__(self, filters):
""" UNet input block
        Perform two unpadded convolutions with a specified number of filters and downsample
        through max-pooling
Args:
filters (int): Number of filters in convolution
"""
        super().__init__()
with tf.name_scope('input_block'):
self.conv1 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.conv2 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.maxpool = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2)
def call(self, inputs):
out = self.conv1(inputs)
out = self.conv2(out)
mp = self.maxpool(out)
return mp, out
class DownsampleBlock(tf.keras.Model):
def __init__(self, filters, idx):
""" UNet downsample block
Perform two unpadded convolutions with a specified number of filters and downsample
through max-pooling
Args:
filters (int): Number of filters in convolution
idx (int): Index of block
Return:
Tuple of convolved ``inputs`` after and before downsampling
"""
        super().__init__()
with tf.name_scope('downsample_block_{}'.format(idx)):
self.conv1 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.conv2 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.maxpool = tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=2)
def call(self, inputs):
out = self.conv1(inputs)
out = self.conv2(out)
mp = self.maxpool(out)
return mp, out
class BottleneckBlock(tf.keras.Model):
def __init__(self, filters):
""" UNet central block
Perform two unpadded convolutions with a specified number of filters and upsample
including dropout before upsampling for training
Args:
filters (int): Number of filters in convolution
"""
        super().__init__()
with tf.name_scope('bottleneck_block'):
self.conv1 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.conv2 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.dropout = tf.keras.layers.Dropout(rate=0.5)
self.conv_transpose = tf.keras.layers.Conv2DTranspose(filters=filters // 2,
kernel_size=(3, 3),
strides=(2, 2),
padding='same',
activation=tf.nn.relu)
def call(self, inputs, training):
out = self.conv1(inputs)
out = self.conv2(out)
out = self.dropout(out, training=training)
out = self.conv_transpose(out)
return out
class UpsampleBlock(tf.keras.Model):
def __init__(self, filters, idx):
""" UNet upsample block
Perform two unpadded convolutions with a specified number of filters and upsample
Args:
filters (int): Number of filters in convolution
idx (int): Index of block
"""
        super().__init__()
with tf.name_scope('upsample_block_{}'.format(idx)):
self.conv1 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.conv2 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.conv_transpose = tf.keras.layers.Conv2DTranspose(filters=filters // 2,
kernel_size=(3, 3),
strides=(2, 2),
padding='same',
activation=tf.nn.relu)
def call(self, inputs, residual_input):
out = _crop_and_concat(inputs, residual_input)
out = self.conv1(out)
out = self.conv2(out)
out = self.conv_transpose(out)
return out
class OutputBlock(tf.keras.Model):
def __init__(self, filters, n_classes):
""" UNet output block
        Perform two unpadded 3x3 convolutions followed by a 1x1 convolution with the same
        number of output channels as classes to predict
Args:
filters (int): Number of filters in convolution
n_classes (int): Number of output classes
"""
        super().__init__()
with tf.name_scope('output_block'):
self.conv1 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.conv2 = tf.keras.layers.Conv2D(filters=filters,
kernel_size=(3, 3),
activation=tf.nn.relu)
self.conv3 = tf.keras.layers.Conv2D(filters=n_classes,
kernel_size=(1, 1),
activation=tf.nn.relu)
def call(self, inputs, residual_input):
out = _crop_and_concat(inputs, residual_input)
out = self.conv1(out)
out = self.conv2(out)
out = self.conv3(out)
return out
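
The _crop_and_concat helper above derives a fraction from the two spatial sizes and hands it to tf.image.central_crop. A minimal standalone sketch of that arithmetic, using the 56/64 pair from the original U-Net figure rather than shapes taken from this pipeline:

import tensorflow as tf

decoder = tf.zeros((1, 56, 56, 512))   # upsampled decoder feature map
skip = tf.zeros((1, 64, 64, 512))      # encoder skip connection

factor = decoder.shape[1] / skip.shape[1]      # 0.875
cropped = tf.image.central_crop(skip, factor)  # centrally cropped to 56x56
merged = tf.concat([decoder, cropped], axis=-1)
print(merged.shape)                            # (1, 56, 56, 1024)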

View file

@@ -0,0 +1,59 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Model construction utils
This module provides a convenient way to create different topologies
based around UNet.
"""
import tensorflow as tf
from model.layers import InputBlock, DownsampleBlock, BottleneckBlock, UpsampleBlock, OutputBlock
class Unet(tf.keras.Model):
""" U-Net: Convolutional Networks for Biomedical Image Segmentation
Source:
https://arxiv.org/pdf/1505.04597
"""
def __init__(self):
        super().__init__()
self.input_block = InputBlock(filters=64)
self.bottleneck = BottleneckBlock(1024)
self.output_block = OutputBlock(filters=64, n_classes=2)
self.down_blocks = [DownsampleBlock(filters, idx)
for idx, filters in enumerate([128, 256, 512])]
self.up_blocks = [UpsampleBlock(filters, idx)
for idx, filters in enumerate([512, 256, 128])]
def call(self, x, training=True):
skip_connections = []
out, residual = self.input_block(x)
skip_connections.append(residual)
for down_block in self.down_blocks:
out, residual = down_block(out)
skip_connections.append(residual)
out = self.bottleneck(out, training)
for up_block in self.up_blocks:
out = up_block(out, skip_connections.pop())
out = self.output_block(out, skip_connections.pop())
return tf.keras.activations.softmax(out, axis=-1)
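
Because every encoder and decoder convolution is unpadded, the spatial size shrinks from 572 to 388 on the way through the network. A quick shape check, assuming the repository modules are on the import path:

import tensorflow as tf
from model.unet import Unet

model = Unet()
dummy = tf.zeros((1, 572, 572, 1))  # single-channel tile at the expected input resolution
out = model(dummy, training=False)
print(out.shape)                    # expected: (1, 388, 388, 2)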

View file

@@ -0,0 +1,3 @@
Pillow==6.2.0
tf2onnx
munch

View file

@@ -0,0 +1,168 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from time import time
import numpy as np
from PIL import Image
import horovod.tensorflow as hvd
import tensorflow as tf
from utils.losses import partial_losses
from utils.parse_results import process_performance_stats
def restore_checkpoint(model, model_dir):
try:
model.load_weights(os.path.join(model_dir, "checkpoint"))
    except Exception:
print("Failed to load checkpoint, model will have randomly initialized weights.")
return model
def train(params, model, dataset, logger):
np.random.seed(params.seed)
tf.random.set_seed(params.seed)
max_steps = params.max_steps // hvd.size()
optimizer = tf.keras.optimizers.Adam(learning_rate=params.learning_rate)
if params.use_amp:
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, "dynamic")
ce_loss = tf.keras.metrics.Mean(name='ce_loss')
f1_loss = tf.keras.metrics.Mean(name='dice_loss')
@tf.function
def train_step(features, labels, warmup_batch=False):
with tf.GradientTape() as tape:
output_map = model(features)
crossentropy_loss, dice_loss = partial_losses(output_map, labels)
added_losses = tf.add(crossentropy_loss, dice_loss, name="total_loss_ref")
loss = added_losses + params.weight_decay * tf.add_n(
[tf.nn.l2_loss(v) for v in model.trainable_variables
if 'batch_normalization' not in v.name])
if params.use_amp:
loss = optimizer.get_scaled_loss(loss)
tape = hvd.DistributedGradientTape(tape)
gradients = tape.gradient(loss, model.trainable_variables)
if params.use_amp:
gradients = optimizer.get_unscaled_gradients(gradients)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
# Note: broadcast should be done after the first gradient step to ensure optimizer
# initialization.
if warmup_batch:
hvd.broadcast_variables(model.variables, root_rank=0)
hvd.broadcast_variables(optimizer.variables(), root_rank=0)
ce_loss(crossentropy_loss)
f1_loss(dice_loss)
return loss
if params.benchmark:
assert max_steps * hvd.size() > params.warmup_steps, \
"max_steps value has to be greater than warmup_steps"
timestamps = np.zeros((hvd.size(), max_steps * hvd.size() + 1), dtype=np.float32)
for iteration, (images, labels) in enumerate(dataset.train_fn(drop_remainder=True)):
t0 = time()
loss = train_step(images, labels, warmup_batch=iteration == 0).numpy()
timestamps[hvd.rank(), iteration] = time() - t0
if iteration >= max_steps * hvd.size():
break
timestamps = np.mean(timestamps, axis=0)
if hvd.rank() == 0:
throughput_imgps, latency_ms = process_performance_stats(timestamps, params)
logger.log(step=(),
data={"throughput_train": throughput_imgps,
"latency_train": latency_ms})
else:
for iteration, (images, labels) in enumerate(dataset.train_fn()):
train_step(images, labels, warmup_batch=iteration == 0)
if (hvd.rank() == 0) and (iteration % params.log_every == 0):
logger.log(step=(iteration, max_steps),
data={"train_ce_loss": float(ce_loss.result()),
"train_dice_loss": float(f1_loss.result()),
"train_total_loss": float(f1_loss.result() + ce_loss.result())})
f1_loss.reset_states()
ce_loss.reset_states()
if iteration >= max_steps:
break
if hvd.rank() == 0:
model.save_weights(os.path.join(params.model_dir, "checkpoint"))
logger.flush()
def evaluate(params, model, dataset, logger):
ce_loss = tf.keras.metrics.Mean(name='ce_loss')
f1_loss = tf.keras.metrics.Mean(name='dice_loss')
@tf.function
def validation_step(features, labels):
output_map = model(features, training=False)
crossentropy_loss, dice_loss = partial_losses(output_map, labels)
ce_loss(crossentropy_loss)
f1_loss(dice_loss)
for iteration, (images, labels) in enumerate(dataset.eval_fn(count=1)):
validation_step(images, labels)
if iteration >= dataset.eval_size // params.batch_size:
break
if dataset.eval_size > 0:
logger.log(step=(),
data={"eval_ce_loss": float(ce_loss.result()),
"eval_dice_loss": float(f1_loss.result()),
"eval_total_loss": float(f1_loss.result() + ce_loss.result()),
"eval_dice_score": 1.0 - float(f1_loss.result())})
logger.flush()
def predict(params, model, dataset, logger):
@tf.function
def prediction_step(features):
return model(features, training=False)
if params.benchmark:
assert params.max_steps > params.warmup_steps, \
"max_steps value has to be greater than warmup_steps"
timestamps = np.zeros(params.max_steps + 1, dtype=np.float32)
for iteration, images in enumerate(dataset.test_fn(count=None, drop_remainder=True)):
t0 = time()
prediction_step(images)
timestamps[iteration] = time() - t0
if iteration >= params.max_steps:
break
throughput_imgps, latency_ms = process_performance_stats(timestamps, params)
logger.log(step=(),
data={"throughput_test": throughput_imgps,
"latency_test": latency_ms})
else:
predictions = np.concatenate([prediction_step(images).numpy()
for images in dataset.test_fn(count=1)], axis=0)
binary_masks = [np.argmax(p, axis=-1).astype(np.uint8) * 255 for p in predictions]
multipage_tif = [Image.fromarray(mask).resize(size=(512, 512), resample=Image.BILINEAR)
for mask in binary_masks]
output_dir = os.path.join(params.model_dir, 'predictions')
if not os.path.exists(output_dir):
os.makedirs(output_dir)
multipage_tif[0].save(os.path.join(output_dir, 'test-masks.tif'),
compression="tiff_deflate",
save_all=True,
append_images=multipage_tif[1:])
logger.flush()
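
The AMP branch of train_step above follows the usual scale-then-unscale pattern of the experimental LossScaleOptimizer API used in this repository. A minimal standalone sketch with a toy model (the model, data and learning rate are illustrative only):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam(1e-4)
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, "dynamic")

x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
    scaled_loss = optimizer.get_scaled_loss(loss)        # scale the loss before backprop
gradients = tape.gradient(scaled_loss, model.trainable_variables)
gradients = optimizer.get_unscaled_gradients(gradients)  # unscale before applying
optimizer.apply_gradients(zip(gradients, model.trainable_variables))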

View file

@@ -0,0 +1,124 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Command line argument parsing"""
import argparse
from munch import Munch
PARSER = argparse.ArgumentParser(description="UNet-medical")
PARSER.add_argument('--exec_mode',
choices=['train', 'train_and_predict', 'predict', 'evaluate', 'train_and_evaluate'],
type=str,
default='train_and_evaluate',
help="""Execution mode of running the model""")
PARSER.add_argument('--model_dir',
type=str,
default='/results',
help="""Output directory for information related to the model""")
PARSER.add_argument('--data_dir',
type=str,
required=True,
help="""Input directory containing the dataset for training the model""")
PARSER.add_argument('--log_dir',
type=str,
default=None,
help="""Output directory for training logs""")
PARSER.add_argument('--batch_size',
type=int,
default=1,
help="""Size of each minibatch per GPU""")
PARSER.add_argument('--learning_rate',
type=float,
default=0.0001,
help="""Learning rate coefficient for AdamOptimizer""")
PARSER.add_argument('--crossvalidation_idx',
type=int,
default=None,
help="""Chosen fold for cross-validation. Use None to disable cross-validation""")
PARSER.add_argument('--max_steps',
type=int,
default=1000,
help="""Maximum number of steps (batches) used for training""")
PARSER.add_argument('--weight_decay',
type=float,
default=0.0005,
help="""Weight decay coefficient""")
PARSER.add_argument('--log_every',
type=int,
default=100,
help="""Log performance every n steps""")
PARSER.add_argument('--warmup_steps',
type=int,
default=200,
help="""Number of warmup steps""")
PARSER.add_argument('--seed',
type=int,
default=0,
help="""Random seed""")
PARSER.add_argument('--augment', dest='augment', action='store_true',
help="""Perform data augmentation during training""")
PARSER.add_argument('--no-augment', dest='augment', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--benchmark', dest='benchmark', action='store_true',
help="""Collect performance metrics during training""")
PARSER.add_argument('--no-benchmark', dest='benchmark', action='store_false')
PARSER.set_defaults(benchmark=False)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
help="""Train using TF-AMP""")
PARSER.set_defaults(use_amp=False)
PARSER.add_argument('--use_xla', dest='use_xla', action='store_true',
help="""Train using XLA""")
PARSER.set_defaults(use_xla=False)
PARSER.add_argument('--use_trt', dest='use_trt', action='store_true',
help="""Use TF-TRT""")
PARSER.set_defaults(use_trt=False)
def _cmd_params(flags):
return Munch({
'exec_mode': flags.exec_mode,
'model_dir': flags.model_dir,
'data_dir': flags.data_dir,
'log_dir': flags.log_dir,
'batch_size': flags.batch_size,
'learning_rate': flags.learning_rate,
'crossvalidation_idx': flags.crossvalidation_idx,
'max_steps': flags.max_steps,
'weight_decay': flags.weight_decay,
'log_every': flags.log_every,
'warmup_steps': flags.warmup_steps,
'augment': flags.augment,
'benchmark': flags.benchmark,
'seed': flags.seed,
'use_amp': flags.use_amp,
'use_trt': flags.use_trt,
'use_xla': flags.use_xla,
})
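
A quick way to exercise the parser and the Munch conversion above, assuming the repository root is on the import path (the argument values are illustrative):

from utils.cmd_util import PARSER, _cmd_params

flags = PARSER.parse_args(['--data_dir', '/data', '--batch_size', '8', '--use_amp'])
params = _cmd_params(flags)
print(params.exec_mode, params.batch_size, params.use_amp)  # train_and_evaluate 8 True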

View file

@@ -0,0 +1,206 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Dataset class encapsulates the data loading"""
import multiprocessing
import os
from collections import deque
import numpy as np
import tensorflow as tf
from PIL import Image, ImageSequence
class Dataset:
"""Load, separate and prepare the data for training and prediction"""
def __init__(self, data_dir, batch_size, fold, augment=False, gpu_id=0, num_gpus=1, seed=0):
if not os.path.exists(data_dir):
raise FileNotFoundError('Cannot find data dir: {}'.format(data_dir))
self._data_dir = data_dir
self._batch_size = batch_size
self._augment = augment
self._seed = seed
images = self._load_multipage_tiff(os.path.join(self._data_dir, 'train-volume.tif'))
masks = self._load_multipage_tiff(os.path.join(self._data_dir, 'train-labels.tif'))
self._test_images = \
self._load_multipage_tiff(os.path.join(self._data_dir, 'test-volume.tif'))
train_indices, val_indices = self._get_val_train_indices(len(images), fold)
self._train_images = images[train_indices]
self._train_masks = masks[train_indices]
self._val_images = images[val_indices]
self._val_masks = masks[val_indices]
self._num_gpus = num_gpus
self._gpu_id = gpu_id
@property
def train_size(self):
return len(self._train_images)
@property
def eval_size(self):
return len(self._val_images)
@property
def test_size(self):
return len(self._test_images)
def _load_multipage_tiff(self, path):
"""Load tiff images containing many images in the channel dimension"""
return np.array([np.array(p) for p in ImageSequence.Iterator(Image.open(path))])
def _get_val_train_indices(self, length, fold, ratio=0.8):
assert 0 < ratio <= 1, "Train/total data ratio must be in range (0.0, 1.0]"
np.random.seed(self._seed)
indices = np.arange(0, length, 1, dtype=np.int)
np.random.shuffle(indices)
if fold is not None:
indices = deque(indices)
indices.rotate(fold * int((1.0 - ratio) * length))
indices = np.array(indices)
train_indices = indices[:int(ratio * len(indices))]
val_indices = indices[int(ratio * len(indices)):]
else:
train_indices = indices
val_indices = []
return train_indices, val_indices
def _normalize_inputs(self, inputs):
"""Normalize inputs"""
inputs = tf.expand_dims(tf.cast(inputs, tf.float32), -1)
# Center around zero
inputs = tf.divide(inputs, 127.5) - 1
# Resize to match output size
inputs = tf.image.resize(inputs, (388, 388))
return tf.image.resize_with_crop_or_pad(inputs, 572, 572)
def _normalize_labels(self, labels):
"""Normalize labels"""
labels = tf.expand_dims(tf.cast(labels, tf.float32), -1)
labels = tf.divide(labels, 255)
# Resize to match output size
labels = tf.image.resize(labels, (388, 388))
labels = tf.image.resize_with_crop_or_pad(labels, 572, 572)
cond = tf.less(labels, 0.5 * tf.ones(tf.shape(input=labels)))
labels = tf.where(cond, tf.zeros(tf.shape(input=labels)), tf.ones(tf.shape(input=labels)))
return tf.one_hot(tf.squeeze(tf.cast(labels, tf.int32)), 2)
@tf.function
def _preproc_samples(self, inputs, labels, augment=True):
"""Preprocess samples and perform random augmentations"""
inputs = self._normalize_inputs(inputs)
labels = self._normalize_labels(labels)
if self._augment and augment:
# Horizontal flip
h_flip = tf.random.uniform([]) > 0.5
inputs = tf.cond(pred=h_flip, true_fn=lambda: tf.image.flip_left_right(inputs), false_fn=lambda: inputs)
labels = tf.cond(pred=h_flip, true_fn=lambda: tf.image.flip_left_right(labels), false_fn=lambda: labels)
# Vertical flip
v_flip = tf.random.uniform([]) > 0.5
inputs = tf.cond(pred=v_flip, true_fn=lambda: tf.image.flip_up_down(inputs), false_fn=lambda: inputs)
labels = tf.cond(pred=v_flip, true_fn=lambda: tf.image.flip_up_down(labels), false_fn=lambda: labels)
# Prepare for batched transforms
inputs = tf.expand_dims(inputs, 0)
labels = tf.expand_dims(labels, 0)
# Random crop and resize
left = tf.random.uniform([]) * 0.3
right = 1 - tf.random.uniform([]) * 0.3
top = tf.random.uniform([]) * 0.3
bottom = 1 - tf.random.uniform([]) * 0.3
inputs = tf.image.crop_and_resize(inputs, [[top, left, bottom, right]], [0], (572, 572))
labels = tf.image.crop_and_resize(labels, [[top, left, bottom, right]], [0], (572, 572))
# Gray value variations
# Adjust brightness and keep values in range
inputs = tf.image.random_brightness(inputs, max_delta=0.2)
inputs = tf.clip_by_value(inputs, clip_value_min=-1, clip_value_max=1)
inputs = tf.squeeze(inputs, 0)
labels = tf.squeeze(labels, 0)
# Bring back labels to network's output size and remove interpolation artifacts
labels = tf.image.resize_with_crop_or_pad(labels, target_width=388, target_height=388)
cond = tf.less(labels, 0.5 * tf.ones(tf.shape(input=labels)))
labels = tf.where(cond, tf.zeros(tf.shape(input=labels)), tf.ones(tf.shape(input=labels)))
return inputs, labels
def train_fn(self, drop_remainder=False):
"""Input function for training"""
dataset = tf.data.Dataset.from_tensor_slices(
(self._train_images, self._train_masks))
dataset = dataset.shard(self._num_gpus, self._gpu_id)
dataset = dataset.repeat()
dataset = dataset.shuffle(self._batch_size * 3)
dataset = dataset.map(self._preproc_samples,
num_parallel_calls=multiprocessing.cpu_count()//self._num_gpus)
dataset = dataset.batch(self._batch_size, drop_remainder=drop_remainder)
dataset = dataset.prefetch(self._batch_size)
return dataset
def eval_fn(self, count, drop_remainder=False):
"""Input function for validation"""
dataset = tf.data.Dataset.from_tensor_slices(
(self._val_images, self._val_masks))
dataset = dataset.repeat(count=count)
dataset = dataset.map(self._preproc_samples,
num_parallel_calls=multiprocessing.cpu_count())
dataset = dataset.batch(self._batch_size, drop_remainder=drop_remainder)
dataset = dataset.prefetch(self._batch_size)
return dataset
def test_fn(self, count, drop_remainder=False):
"""Input function for testing"""
dataset = tf.data.Dataset.from_tensor_slices(
self._test_images)
dataset = dataset.repeat(count=count)
dataset = dataset.map(self._normalize_inputs)
dataset = dataset.batch(self._batch_size, drop_remainder=drop_remainder)
dataset = dataset.prefetch(self._batch_size)
return dataset
def synth_fn(self):
"""Synthetic data function for testing"""
inputs = tf.random.truncated_normal((572, 572, 1), dtype=tf.float32, mean=127.5, stddev=1, seed=self._seed,
name='synth_inputs')
masks = tf.random.truncated_normal((388, 388, 2), dtype=tf.float32, mean=0.01, stddev=0.1, seed=self._seed,
name='synth_masks')
dataset = tf.data.Dataset.from_tensors((inputs, masks))
dataset = dataset.cache()
dataset = dataset.repeat()
dataset = dataset.batch(self._batch_size)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
return dataset
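
The normalization above maps 8-bit intensities into [-1, 1] and binarizes labels at 0.5 before one-hot encoding. A small numeric check of that arithmetic with toy values:

import tensorflow as tf

pixels = tf.constant([0.0, 127.5, 255.0])
print((pixels / 127.5 - 1.0).numpy())  # [-1.  0.  1.]

labels = tf.constant([[0.1, 0.7], [0.4, 0.9]])
binary = tf.where(labels < 0.5, tf.zeros_like(labels), tf.ones_like(labels))
one_hot = tf.one_hot(tf.cast(binary, tf.int32), 2)
print(one_hot.shape)                   # (2, 2, 2): one channel per class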

View file

@@ -0,0 +1,39 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Training and evaluation losses"""
import tensorflow as tf
# Class Dice coefficient averaged over batch
def dice_coef(predict, target, axis=1, eps=1e-6):
intersection = tf.reduce_sum(input_tensor=predict * target, axis=axis)
union = tf.reduce_sum(input_tensor=predict * predict + target * target, axis=axis)
dice = (2. * intersection + eps) / (union + eps)
return tf.reduce_mean(input_tensor=dice, axis=0) # average over batch
def partial_losses(predict, target):
n_classes = predict.shape[-1]
flat_logits = tf.reshape(tf.cast(predict, tf.float32),
[tf.shape(input=predict)[0], -1, n_classes])
flat_labels = tf.reshape(target,
[tf.shape(input=predict)[0], -1, n_classes])
crossentropy_loss = tf.reduce_mean(input_tensor=tf.nn.softmax_cross_entropy_with_logits(logits=flat_logits,
labels=flat_labels),
name='cross_loss_ref')
dice_loss = tf.reduce_mean(input_tensor=1 - dice_coef(flat_logits, flat_labels), name='dice_loss_ref')
return crossentropy_loss, dice_loss
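
A worked toy example of the soft Dice coefficient defined above: identical masks give a value near 1, disjoint masks a value near 0 (assuming utils.losses is importable):

import tensorflow as tf
from utils.losses import dice_coef

pred = tf.constant([[1.0, 0.0, 1.0, 0.0]])
same = tf.constant([[1.0, 0.0, 1.0, 0.0]])
disjoint = tf.constant([[0.0, 1.0, 0.0, 1.0]])

print(float(dice_coef(pred, same)))      # ~1.0
print(float(dice_coef(pred, disjoint)))  # ~0.0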

View file

@@ -0,0 +1,82 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
import argparse
def process_performance_stats(timestamps, params):
warmup_steps = params['warmup_steps']
batch_size = params['batch_size']
timestamps_ms = 1000 * timestamps[warmup_steps:]
timestamps_ms = timestamps_ms[timestamps_ms > 0]
latency_ms = timestamps_ms.mean()
std = timestamps_ms.std()
n = np.sqrt(len(timestamps_ms))
throughput_imgps = (1000.0 * batch_size / timestamps_ms).mean()
print('Throughput Avg:', round(throughput_imgps, 3), 'img/s')
print('Latency Avg:', round(latency_ms, 3), 'ms')
for ci, lvl in zip(["90%:", "95%:", "99%:"],
[1.645, 1.960, 2.576]):
print("Latency", ci, round(latency_ms + lvl * std / n, 3), "ms")
return float(throughput_imgps), float(latency_ms)
def parse_convergence_results(path, environment):
dice_scores = []
ce_scores = []
logfiles = [f for f in os.listdir(path) if "log" in f and environment in f]
if not logfiles:
raise FileNotFoundError("No logfile found at {}".format(path))
for logfile in logfiles:
with open(os.path.join(path, logfile), "r") as f:
content = f.readlines()
if "eval_dice_score" not in content[-1]:
print("Evaluation score not found. The file", logfile, "might be corrupted.")
continue
dice_scores.append(float([val for val in content[-1].split()
if "eval_dice_score" in val][0].split(":")[1]))
ce_scores.append(float([val for val in content[-1].split()
if "eval_ce_loss" in val][0].split(":")[1]))
if dice_scores:
print("Evaluation dice score:", sum(dice_scores) / len(dice_scores))
print("Evaluation cross-entropy loss:", sum(ce_scores) / len(ce_scores))
else:
print("All logfiles were corrupted, no loss was obtained.")
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="UNet-medical-utils")
parser.add_argument('--exec_mode',
choices=['convergence', 'benchmark'],
type=str,
                        help="""Execution mode: parse convergence or benchmark results""")
parser.add_argument('--model_dir',
type=str,
required=True)
parser.add_argument('--env',
choices=['FP32_1GPU', 'FP32_8GPU', 'TF-AMP_1GPU', 'TF-AMP_8GPU'],
type=str,
required=True)
args = parser.parse_args()
if args.exec_mode == 'convergence':
parse_convergence_results(path=args.model_dir, environment=args.env)
elif args.exec_mode == 'benchmark':
pass
print()
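
The latency bounds printed by process_performance_stats are plain normal-approximation upper bounds, mean + z * std / sqrt(n). A toy recomputation with synthetic timings (values are illustrative only):

import numpy as np

timestamps = np.array([0.5, 0.4, 0.105, 0.095, 0.1, 0.11, 0.09])  # seconds, synthetic
warmup_steps = 2
ms = 1000 * timestamps[warmup_steps:]
mean, std, n = ms.mean(), ms.std(), np.sqrt(len(ms))
for label, z in (("90%", 1.645), ("95%", 1.960), ("99%", 2.576)):
    print("Latency", label, round(mean + z * std / n, 3), "ms")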