[UNet_medical/TF1&2] Updating for Ampere

This commit is contained in:
Przemek Strzelczyk 2020-07-04 01:42:09 +02:00
parent 76a056cd33
commit b27abeba07
65 changed files with 781 additions and 882 deletions

View file

@ -1,7 +1,8 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.01-tf1-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
FROM ${FROM_IMAGE_NAME}
ADD . /workspace/unet
WORKDIR /workspace/unet
RUN pip install git+https://github.com/NVIDIA/dllogger
RUN pip install -r requirements.txt

View file

@ -1,8 +1,8 @@
# U-Net Medical Image Segmentation for TensorFlow 1.x
# UNet Medical Image Segmentation for TensorFlow 1.x
This repository provides a script and recipe to train U-Net Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table of contents
This repository provides a script and recipe to train UNet Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
@ -11,6 +11,7 @@ This repository provides a script and recipe to train U-Net Medical to achieve s
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Enabling TF32](#enabling-tf32)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
@ -29,38 +30,44 @@ This repository provides a script and recipe to train U-Net Medical to achieve s
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16g)
* [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the original paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The UNet model is a convolutional neural network for 2D image segmentation. This repository contains a UNet implementation as described in the original paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). U-Net allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
UNet was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper: [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). UNet allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
The following figure shows the construction of the U-Net model and its different components. U-Net is composed of a contractive and an expanding path, that aims at building a bottleneck in its centermost part through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve the training.
The following figure shows the construction of the UNet model and its different components. UNet is composed of a contractive and an expanding path that aim to build a bottleneck in its centermost part through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve training.
![U-Net](images/unet.png)
![UNet](images/unet.png)
Figure 1. UNet architecture
### Default configuration
U-Net consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and a concatenation with the correspondingly cropped feature map from the contractive path.
UNet consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and a concatenation with the correspondingly cropped feature map from the contractive path.
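The sketch below illustrates one contracting step and the matching expanding step using TF1-era `tf.keras` layers. It is only an illustration of the scheme described above, not a copy of this repository's `model/layers.py`; the function names are made up for the example.
```python
import tensorflow as tf

def downsample_block(inputs, filters):
    # Two unpadded (VALID) 3x3 convolutions followed by 2x2 max pooling,
    # as in the contractive path described above.
    conv = tf.keras.layers.Conv2D(filters, 3, activation=tf.nn.relu)(inputs)
    conv = tf.keras.layers.Conv2D(filters, 3, activation=tf.nn.relu)(conv)
    pooled = tf.keras.layers.MaxPool2D(pool_size=2)(conv)
    return conv, pooled  # `conv` is kept for the skip connection

def upsample_block(inputs, skip, filters):
    # Upsample, center-crop the skip connection to the matching spatial size,
    # concatenate, and convolve again (expanding path).
    up = tf.keras.layers.Conv2DTranspose(filters, 2, strides=2)(inputs)
    crop = (skip.get_shape().as_list()[1] - up.get_shape().as_list()[1]) // 2
    skip = tf.keras.layers.Cropping2D(cropping=crop)(skip)
    out = tf.keras.layers.Concatenate()([up, skip])
    out = tf.keras.layers.Conv2D(filters, 3, activation=tf.nn.relu)(out)
    out = tf.keras.layers.Conv2D(filters, 3, activation=tf.nn.relu)(out)
    return out
```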
### Feature support matrix
The following features are supported by this model.
| **Feature** | **U-Net Medical** |
| **Feature** | **UNet Medical** |
|---------------------------------|-----|
| Automatic mixed precision (AMP) | Yes |
| Horovod Multi-GPU (NCCL) | Yes |
@ -70,13 +77,13 @@ The following features are supported by this model.
**Automatic Mixed Precision (AMP)**
This implementation of U-Net uses AMP to implement mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
This implementation of UNet uses AMP to implement mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
**Horovod**
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
Multi-GPU training with Horovod
**Multi-GPU training with Horovod**
Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see the [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
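As a rough illustration, the usual Horovod wiring for a TF1 training script looks like the sketch below. The Horovod calls come from its public API; the optimizer choice and hook placement are illustrative rather than copied from `main.py`.
```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # one process per GPU, e.g. launched with `horovodrun -np 8 python main.py ...`

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Wrap the optimizer so gradients are averaged across workers with NCCL allreduce.
optimizer = tf.train.AdamOptimizer(learning_rate=0.0001)
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast the initial variables from rank 0 so every worker starts from the same state.
# These hooks are typically passed to a MonitoredTrainingSession or tf.estimator.Estimator.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```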
@ -86,40 +93,58 @@ XLA is a domain-specific compiler for linear algebra that can accelerate TensorF
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
#### Enabling mixed precision
This implementation exploits the TensorFlow Automatic Mixed Precision feature. In order to enable mixed precision training, the following environment variables must be defined with the correct value before the training starts:
```
TF_ENABLE_AUTO_MIXED_PRECISION=1
```
Exporting these variables ensures that loss scaling is performed correctly and automatically.
By supplying the `--use_amp` flag to the `main.py` script while training in FP32, the following variables are set to their correct value for mixed precision training:
```
if params.use_amp:
Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP) extension which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by using simple multiplication of loss by a constant value or automatically, by TF-AMP. Automatic mixed precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations. First, programmers need not modify network model code, reducing development and maintenance effort. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
To enable mixed precision, simply set the following environment variables inside your training script:
- Enable TF-AMP graph rewrite:
```
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
```
- Enable Automated Mixed Precision:
```
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
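For instance, a hypothetical helper that turns an `--amp` command-line flag into these settings could look as follows; the environment variable names are the documented TF-AMP ones, while the helper itself is only a sketch:
```python
import os

def setup_amp(use_amp):
    # The variables must be set before the TensorFlow graph/session is created.
    if use_amp:
        os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
        os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

setup_amp(use_amp=True)  # e.g. driven by the `--amp` command-line flag
```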
#### Enabling TF32
TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
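No code changes are needed to use TF32. If you want to fall back to classic FP32 math for debugging or numerical comparisons, NVIDIA libraries honor the `NVIDIA_TF32_OVERRIDE` environment variable; the snippet below is a minimal sketch assuming it is exported before any CUDA library is initialized:
```python
import os

# TF32 is on by default on Ampere; exporting this before TensorFlow initializes
# cuBLAS/cuDNN falls back to classic FP32 math for numerical comparisons.
os.environ["NVIDIA_TF32_OVERRIDE"] = "0"
```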
## Setup
The following section lists the requirements in order to start training the U-Net Medical model.
The following section lists the requirements in order to start training the UNet Medical model.
### Requirements
This repository contains a Dockerfile that extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- TensorFlow 20.02-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
- [NVIDIA Volta GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
- TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
- GPU-based architecture:
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
@ -127,9 +152,10 @@ For more information about how to get started with NGC containers, see the follo
- [Running TensorFlow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
For those unable to use the TensorFlow NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the U-Net model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). These steps enable you to build the U-Net TensorFlow NGC container, train and evaluate your model, and generate predictions on the test data. Furthermore, you can then choose to:
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the UNet model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). These steps enable you to build the UNet TensorFlow NGC container, train and evaluate your model, and generate predictions on the test data. Furthermore, you can then choose to:
* compare your evaluation accuracy with our [Training accuracy results](#training-accuracy-results),
* compare your training performance with our [Training performance benchmark](#training-performance-benchmark),
* compare your inference performance with our [Inference performance benchmark](#inference-performance-benchmark).
@ -138,13 +164,13 @@ For the specifics concerning training and inference, see the [Advanced](#advance
1. Clone the repository.
Executing this command will create your local repository with all the code to run U-Net.
Executing this command will create your local repository with all the code to run UNet.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/U-Net_Medical_TF
cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Medical_TF
2. Build the U-Net TensorFlow NGC container.
2. Build the UNet TensorFlow NGC container.
This command will use the `Dockerfile` to create a Docker image named `unet_tf`, downloading all the required components automatically.
@ -169,14 +195,9 @@ For the specifics concerning training and inference, see the [Advanced](#advance
4. Download and preprocess the data.
The U-Net script `main.py` operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [U-Net paper](https://arxiv.org/abs/1505.04597).
The script `download_dataset.py` is provided for data download. It is possible to select the destination folder when downloading the files by using the `--data_dir` flag. For example:
```bash
python download_dataset.py --data_dir /data
```
Training and test data are composed of 3 multi-page `TIF` files, each containing 30 2D-images (around 30 Mb total). Once downloaded, the data with the `download_dataset.py` script can be used to run the training and benchmark scripts described below, by pointing `main.py` to its location using the `--data_dir` flag.
The UNet script `main.py` operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [UNet paper](https://arxiv.org/abs/1505.04597). The data is available to download upon registration on the website.
Training and test data are composed of 3 multi-page `TIF` files, each containing 30 2D-images (around 30 MB total). Once downloaded, the data can be used to run the training and benchmark scripts described below, by pointing `main.py` to its location using the `--data_dir` flag.
**Note:** Masks are only provided for training data.
@ -185,13 +206,13 @@ For the specifics concerning training and inference, see the [Advanced](#advance
After the Docker container is launched, the training with the [default hyperparameters](#parameters) (for example 1/8 GPUs FP32/TF-AMP) can be started with:
```bash
bash examples/unet_{FP32, TF-AMP}_{1,8}GPU.sh <path/to/dataset> <path/to/checkpoint>
bash examples/unet{_TF-AMP}_{1,8}GPU.sh <path/to/dataset> <path/to/checkpoint>
```
For example, to run with full precision (FP32) on 1 GPU from the projects folder, simply use:
```bash
bash examples/unet_FP32_1GPU.sh /data /results
bash examples/unet_1GPU.sh /data /results
```
This script will launch training on a single fold and store the model's checkpoint in the <path/to/checkpoint> directory.
@ -199,7 +220,7 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The script can be run directly by modifying flags if necessary, especially the number of GPUs, which is defined after the `-np` flag. Since the test volume does not have labels, 20% of the training data is used for validation in a 5-fold cross-validation manner. The fold number can be changed using `--crossvalidation_idx` with an integer in the range 0-4. For example, to run with 4 GPUs using fold 1, use:
```bash
horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --crossvalidation_idx 1 --use_xla --use_amp
horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --crossvalidation_idx 1 --xla --amp
```
Training will result in a checkpoint file being written to `./results` on the host machine.
@ -209,7 +230,7 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The trained model can be evaluated by passing the `--exec_mode evaluate` flag. Since evaluation is carried out on a validation dataset, the `--crossvalidation_idx` parameter should be filled. For example:
```bash
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --crossvalidation_idx 0 --use_xla --use_amp
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --crossvalidation_idx 0 --xla --amp
```
Evaluation can also be triggered jointly after training by passing the `--exec_mode train_and_evaluate` flag.
@ -217,7 +238,7 @@ For the specifics concerning training and inference, see the [Advanced](#advance
7. Start inference/predictions.
To run inference on a checkpointed model, run:
```bash
bash examples/unet_INFER_{FP32, TF-AMP}.sh <path/to/dataset> <path/to/checkpoint>
bash examples/unet_INFER{_TF-AMP}.sh <path/to/dataset> <path/to/checkpoint>
```
For example:
```bash
@ -234,31 +255,31 @@ The following sections provide greater details of the dataset, running training
In the root directory, the most important files are:
* `main.py`: Serves as the entry point to the application.
* `Dockerfile`: Container with the basic set of dependencies to run U-Net.
* `requirements.txt`: Set of extra requirements for running U-Net.
* `download_data.py`: Automatically downloads the dataset for training.
* `Dockerfile`: Container with the basic set of dependencies to run UNet.
* `requirements.txt`: Set of extra requirements for running UNet.
The `utils/` folder encapsulates the necessary tools to train and perform inference using U-Net. Its main components are:
The `utils/` folder encapsulates the necessary tools to train and perform inference using UNet. Its main components are:
* `cmd_util.py`: Implements the command-line arguments parsing.
* `data_loader.py`: Implements the data loading and augmentation.
* `model_fn.py`: Implements the logic for training and inference.
* `hooks/training_hook.py`: Collects different metrics during training.
* `hooks/profiling_hook.py`: Collects different metrics to be used for benchmarking and testing.
* `parse_results.py`: Implements the intermediate results parsing.
* `setup.py`: Implements helper setup functions.
The `model/` folder contains information about the building blocks of U-Net and the way they are assembled. Its contents are:
* `layers.py`: Defines the different blocks that are used to assemble U-Net
The `model/` folder contains information about the building blocks of UNet and the way they are assembled. Its contents are:
* `layers.py`: Defines the different blocks that are used to assemble UNet
* `unet.py`: Defines the model architecture using the blocks from the `layers.py` script
Other folders included in the root directory are:
* `dllogger/`: Contains the utils for logging
* `examples/`: Provides examples for training and benchmarking U-Net
* `examples/`: Provides examples for training and benchmarking UNet
* `images/`: Contains a model diagram
### Parameters
The complete list of the available parameters for the main.py script contains:
* `--exec_mode`: Select the execution mode to run the model (default: `train`). Modes available:
* `train` - trains model from scratch.
* `evaluate` - loads checkpoint (if available) and performs evaluation on validation subset (requires `--crossvalidation_idx` other than `None`).
* `train_and_evaluate` - trains model from scratch and performs validation at the end (requires `--crossvalidation_idx` other than `None`).
* `predict` - loads checkpoint (if available) and runs inference on the test set. Stores the results in `--model_dir` directory.
@ -276,8 +297,8 @@ The complete list of the available parameters for the main.py script contains:
* `--augment`: Enable data augmentation (default: `False`).
* `--benchmark`: Enable performance benchmarking (default: `False`). If the flag is set, the script runs in a benchmark mode - each iteration is timed and the performance result (in images per second) is printed at the end. Works for both `train` and `predict` execution modes.
* `--warmup_steps`: Used during benchmarking - the number of steps to skip (default: `200`). First iterations are usually much slower since the graph is being constructed. Skipping the initial iterations is required for a fair performance assessment.
* `--use_xla`: Enable accelerated linear algebra optimization (default: `False`).
* `--use_amp`: Enable automatic mixed precision (default: `False`).
* `--xla`: Enable accelerated linear algebra optimization (default: `False`).
* `--amp`: Enable automatic mixed precision (default: `False`).
### Command line options
@ -295,10 +316,10 @@ usage: main.py [-h]
[--crossvalidation_idx CROSSVALIDATION_IDX]
[--max_steps MAX_STEPS] [--weight_decay WEIGHT_DECAY]
[--log_every LOG_EVERY] [--warmup_steps WARMUP_STEPS]
[--seed SEED] [--augment] [--no-augment] [--benchmark]
[--no-benchmark] [--use_amp] [--use_xla]
[--seed SEED] [--augment] [--benchmark]
[--amp] [--xla]
U-Net-medical
UNet-medical
optional arguments:
-h, --help show this help message and exit
@ -326,25 +347,21 @@ optional arguments:
Number of warmup steps
--seed SEED Random seed
--augment Perform data augmentation during training
--no-augment
--benchmark Collect performance metrics during training
--no-benchmark
--use_amp Train using TF-AMP
--use_xla Train using XLA
--amp Train using TF-AMP
--xla Train using XLA
```
The U-Net model was trained in the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission. Upon registration, the challenge's data is made available through the following links:
* [train-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-volume.tif)
* [train-labels.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-labels.tif)
* [test-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/test-volume.tif)
## Getting the data
The UNet model uses the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission. The challenge's data is made available upon registration.
Training and test data are comprised of three 512x512x30 `TIF` volumes (`test-volume.tif`, `train-volume.tif` and `train-labels.tif`). Files `test-volume.tif` and `train-volume.tif` contain grayscale 2D slices to be segmented. Additionally, training masks are provided in `train-labels.tif` as a 512x512x30 `TIF` volume, where each pixel has one of two classes:
* 0 indicating the presence of cellular membrane,
* 1 corresponding to background.
The objective is to produce a set of masks that segment the data as accurately as possible. The results are expected to be submitted as a 32-bit `TIF` 3D image, with values between `0` (100% membrane certainty) and `1` (100% non-membrane certainty).
#### Dataset guidelines
The training and test datasets are given as stacks of 30 2D-images provided as a multi-page `TIF` that can be read using the Pillow library and NumPy (both Python packages are installed by the `Dockerfile`).
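As a rough, self-contained illustration (not the repository's `data_loader.py`), a multi-page `TIF` volume can be loaded into a NumPy array as follows, assuming the files were placed under `/data`:
```python
import numpy as np
from PIL import Image

def load_multipage_tif(path):
    """Stack every page of a multi-page TIF into a single (pages, height, width) array."""
    pages = []
    with Image.open(path) as img:
        for i in range(img.n_frames):
            img.seek(i)
            pages.append(np.array(img))
    return np.stack(pages)

volume = load_multipage_tif("/data/train-volume.tif")  # expected shape: (30, 512, 512)
masks = load_multipage_tif("/data/train-labels.tif")
```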
@ -374,10 +391,10 @@ Generally, the model should scale better for datasets containing more data. For
### Training process
The model trains for a total 40,000 batches (40,000 / number of GPUs), with the default U-Net setup:
The model trains for a total of 6,400 batches (6,400 / number of GPUs), with the default UNet setup:
* Adam optimizer with learning rate of 0.0001.
This default parametrization is applied when running scripts from the `./examples` directory and when running `main.py` without explicitly overriding these parameters. By default, the training is in full precision. To enable AMP, pass the `--use_amp` flag. AMP can be enabled for every mode of execution.
This default parametrization is applied when running scripts from the `./examples` directory and when running `main.py` without explicitly overriding these parameters. By default, the training is in full precision. To enable AMP, pass the `--amp` flag. AMP can be enabled for every mode of execution.
The default configuration minimizes a function _L = 1 - DICE + cross entropy_ during training.
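A sketch of one way to express such a loss for binary segmentation is shown below; the exact smoothing constant and reduction used in `model_fn.py` may differ.
```python
import tensorflow as tf

def dice_plus_xent_loss(logits, labels, smooth=1.0):
    # `logits` and `labels` have shape (batch, height, width, 1);
    # labels are float tensors with values in {0., 1.}.
    probs = tf.sigmoid(logits)
    intersection = tf.reduce_sum(probs * labels, axis=(1, 2, 3))
    union = tf.reduce_sum(probs, axis=(1, 2, 3)) + tf.reduce_sum(labels, axis=(1, 2, 3))
    dice = (2.0 * intersection + smooth) / (union + smooth)
    xent = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    return tf.reduce_mean(1.0 - dice) + xent
```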
@ -414,11 +431,11 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run one of the `TRAIN_BENCHMARK` scripts in `./examples/`:
```bash
bash examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
bash examples/unet_TRAIN_BENCHMARK{_TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
For example, to benchmark training using mixed-precision on 8 GPUs use:
```bash
bash examples/unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
bash examples/unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during training in the next 800 iterations.
@ -434,12 +451,12 @@ At the end of the script, a line reporting the best train throughput will be pri
To benchmark inference, run one of the scripts in `./examples/`:
```bash
bash examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
bash examples/unet_INFER_BENCHMARK{_TF-AMP}.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
For example, to benchmark inference using mixed-precision:
```bash
bash examples/unet_INFER_BENCHMARK_TF-AMP.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
bash examples/unet_INFER_BENCHMARK_TF-AMP.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during inference in the next 400 iterations.
@ -456,19 +473,29 @@ At the end of the script, a line reporting the best inference throughput will be
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN{_TF-AMP}_{1, 8}GPU.sh` training script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf1-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
| GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 [min] | Time to train - mixed precision [min] | Time to train speedup (TF32 to mixed precision) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 8 | 0.8908 | 0.8910 | 22 | 10 | 2.2 |
| 8 | 8 | 0.8938 | 0.8942 | 2.6 | 2.5 | 1.04 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [hours] | Time to train - mixed precision [hours] | Time to train speedup (FP32 to mixed precision) |
|------|------------------|-----------------|----------------------------|------------------------------|----------------------------|--------------------------------|
| 1 | 8 | 0.8884 | 0.8906 | 7.08 | 2.54 | 2.79 |
| 8 | 8 | 0.8962 | 0.8972 | 0.97 | 0.37 | 2.64 |
The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh` training script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [min] | Time to train - mixed precision [min] | Time to train speedup (FP32 to mixed precision) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 8 | 0.8910 | 0.8903 | 48 | 19 | 2.53 |
| 8 | 8 | 0.8942 | 0.8940 | 7 | 7.5 | 0.93 |
To reproduce this result, start the Docker container interactively and run one of the TRAIN scripts:
```bash
bash examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
bash examples/unet_TRAIN{_TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
for example
```bash
@ -477,45 +504,91 @@ bash examples/unet_TRAIN_TF-AMP_8GPU.sh /data /results 8
This command will launch a script which will run 5-fold cross-validation training for 6,400 iterations and print the validation DICE score and cross-entropy loss. The time reported is for one fold, which means that training for all 5 folds will take 5 times longer. The default batch size is 8; however, if your GPU has less than 16 GB of memory and you encounter GPU memory issues, you should decrease the batch size. The logs of the runs can be found in the `/results` directory once the script is finished.
**Learning curves**
The following image shows the training loss as a function of iteration for training using DGX A100 (TF32 and TF-AMP) and DGX-1 V100 (FP32 and TF-AMP).
![LearningCurves](images/U-NetMed_TF1_conv.png)
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK{_TF-AMP}_{1, 8}GPU.sh` training script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
| GPUs | Batch size / GPU | Throughput - TF32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|:----:|:----------------:|:-------------------------:|:------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
| 1 | 1 | 29.81 | 64.22 | 2.15 | - | - |
| 1 | 8 | 46.53 | 120.08 | 2.58 | - | - |
| 8 | 1 | 169.62 | 293.31 | 1.73 | 5.69 | 4.57 |
| 8 | 8 | 304.64 | 738.64 | 2.42 | 6.55 | 6.15 |
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK_{TF-AMP, FP32}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf1-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs. Performance numbers (in items/images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|------|------------------|-------------------|--------------------------------|---------------------------------------------|---------------------------|--------------------------------|
| 1 | 8 | 18.57 | 52.27 | 2.81 | N/A | N/A |
| 8 | 8 | 138.50 | 366.88 | 2.65 | 7.02 | 7.46 |
Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK{_TF-AMP}_{1, 8}GPU.sh` training script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|:----:|:----------------:|:-------------------------:|:------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
| 1 | 1 | 15.70 | 39.62 | 2.52 | - | - |
| 1 | 8 | 18.85 | 60.28 | 3.20 | - | - |
| 8 | 1 | 102.52 | 212.51 | 2.07 | 6.53 | 5.36 |
| 8 | 8 | 141.75 | 403.88 | 2.85 | 7.52 | 6.70 |
To achieve these same results, follow the steps in the [Training performance benchmark](#training-performance-benchmark) section.
Throughput is reported in images per second. Latency is reported in milliseconds per image.
##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
#### Inference performance results
##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
Our results were obtained by running the `examples/unet_INFER_BENCHMARK{_TF-AMP}.sh` inferencing benchmarking script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.
FP16
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 251.11 | 3.983 | 3.990 | 3.991 | 3.993 |
| 2 | 572x572x1 | 179.70 | 11.130 | 11.138 | 11.139 | 11.142 |
| 4 | 572x572x1 | 197.53 | 20.250 | 20.260 | 20.262 | 20.266 |
| 8 | 572x572x1 | 382.48 | 24.050 | 29.356 | 30.372 | 32.359 |
| 16 | 572x572x1 | 400.58 | 45.759 | 55.615 | 57.502 | 61.192 |
TF32
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 88.80 | 11.261 | 11.264 | 11.264 | 11.265 |
| 2 | 572x572x1 | 104.62 | 19.120 | 19.149 | 19.155 | 19.166 |
| 4 | 572x572x1 | 117.02 | 34.184 | 34.217 | 34.223 | 34.235 |
| 8 | 572x572x1 | 131.54 | 65.094 | 72.577 | 74.009 | 76.811 |
| 16 | 572x572x1 | 137.41 | 121.552 | 130.795 | 132.565 | 136.027 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Our results were obtained by running the `examples/unet_INFER_BENCHMARK_{TF-AMP, FP32}.sh` inferencing benchmarking script in the tensorflow:20.02-tf1-py3 NGC container on NVIDIA DGX-1 with (1x V100 16G) GPU.
Our results were obtained by running the `examples/unet_INFER_BENCHMARK{_TF-AMP}.sh` inferencing benchmarking script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX-1 with (1x V100 16GB) GPU.
FP16
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|------|-----------|--------|---------|--------|--------|--------|
| 1 | 572x572x1 | 133.21 | 7.507 | 7.515 | 7.517 | 7.519 |
| 2 | 572x572x1 | 153.45 | 13.033 | 13.046 | 13.048 | 13.052 |
| 4 | 572x572x1 | 173.67 | 23.032 | 23.054 | 23.058 | 23.066 |
| 8 | 572x572x1 | 181.62 | 44.047 | 49.051 | 49.067 | 50.880 |
| 16 | 572x572x1 | 184.21 | 89.377 | 94.116 | 95.024 | 96.798 |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 127.11 | 7.868 | 7.875 | 7.876 | 7.879 |
| 2 | 572x572x1 | 140.32 | 14.256 | 14.278 | 14.283 | 14.291 |
| 4 | 572x572x1 | 148.28 | 26.978 | 27.005 | 27.010 | 27.020 |
| 8 | 572x572x1 | 178.28 | 48.432 | 54.613 | 55.797 | 58.111 |
| 16 | 572x572x1 | 181.94 | 94.812 | 106.743 | 109.028 | 113.496 |
FP32
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|------|-----------|--------|---------|---------|---------|---------|
| 1 | 572x572x1 | 49.97 | 20.018 | 20.044 | 20.048 | 20.058 |
| 2 | 572x572x1 | 54.30 | 36.837 | 36.865 | 36.871 | 36.881 |
| 4 | 572x572x1 | 56.27 | 71.085 | 71.150 | 71.163 | 71.187 |
| 8 | 572x572x1 | 58.41 | 143.347 | 154.845 | 157.047 | 161.353 |
| 16 | 572x572x1 | 74.57 | 222.532 | 237.184 | 239.990 | 245.477 |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 47.32 | 21.133 | 21.155 | 21.159 | 21.167 |
| 2 | 572x572x1 | 51.43 | 38.888 | 38.921 | 38.927 | 38.940 |
| 4 | 572x572x1 | 53.56 | 74.692 | 74.763 | 74.777 | 74.804 |
| 8 | 572x572x1 | 54.41 | 152.733 | 163.148 | 165.142 | 169.042 |
| 16 | 572x572x1 | 67.11 | 245.775 | 259.548 | 262.186 | 267.343 |
To achieve these same results, follow the steps in the [Inference performance benchmark](#inference-performance-benchmark) section.
@ -524,7 +597,11 @@ Throughput is reported in images per second. Latency is reported in milliseconds
## Release notes
### Changelog
June 2020
* Updated training and inference accuracy with A100 results
* Updated training and inference performance with A100 results
February 2020
* Updated README template
* Added cross-validation for accuracy measurements

View file

@ -1,163 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from collections import defaultdict
from datetime import datetime
import json
import atexit
class Backend(ABC):
def __init__(self, verbosity):
self._verbosity = verbosity
@property
def verbosity(self):
return self._verbosity
@abstractmethod
def log(self, timestamp, elapsedtime, step, data):
pass
@abstractmethod
def metadata(self, timestamp, elapsedtime, metric, metadata):
pass
class Verbosity:
OFF = -1
DEFAULT = 0
VERBOSE = 1
class Logger:
def __init__(self, backends):
self.backends = backends
atexit.register(self.flush)
self.starttime = datetime.now()
def metadata(self, metric, metadata):
timestamp = datetime.now()
elapsedtime = (timestamp - self.starttime).total_seconds()
for b in self.backends:
b.metadata(timestamp, elapsedtime, metric, metadata)
def log(self, step, data, verbosity=1):
timestamp = datetime.now()
elapsedtime = (timestamp - self.starttime).total_seconds()
for b in self.backends:
if b.verbosity >= verbosity:
b.log(timestamp, elapsedtime, step, data)
def flush(self):
for b in self.backends:
b.flush()
def default_step_format(step):
return str(step)
def default_metric_format(metric, metadata, value):
unit = metadata["unit"] if "unit" in metadata.keys() else ""
format = "{" + metadata["format"] + "}" if "format" in metadata.keys() else "{}"
return "{}:{} {}".format(
metric, format.format(value) if value is not None else value, unit
)
def default_prefix_format(timestamp):
return "DLL {} - ".format(timestamp)
class StdOutBackend(Backend):
def __init__(
self,
verbosity,
step_format=default_step_format,
metric_format=default_metric_format,
prefix_format=default_prefix_format,
):
super().__init__(verbosity=verbosity)
self._metadata = defaultdict(dict)
self.step_format = step_format
self.metric_format = metric_format
self.prefix_format = prefix_format
self.elapsed = 0.0
def metadata(self, timestamp, elapsedtime, metric, metadata):
self._metadata[metric].update(metadata)
def log(self, timestamp, elapsedtime, step, data):
print(
"{}{} {}{}".format(
self.prefix_format(timestamp),
self.step_format(step),
" ".join(
[
self.metric_format(m, self._metadata[m], v)
for m, v in data.items()
]
),
"elapsed:"+str(elapsedtime)
)
)
def flush(self):
pass
class JSONStreamBackend(Backend):
def __init__(self, verbosity, filename):
super().__init__(verbosity=verbosity)
self._filename = filename
self.file = open(filename, "w")
atexit.register(self.file.close)
def metadata(self, timestamp, elapsedtime, metric, metadata):
self.file.write(
"DLLL {}\n".format(
json.dumps(
dict(
timestamp=str(timestamp.timestamp()),
elapsedtime=str(elapsedtime),
datetime=str(timestamp),
type="METADATA",
metric=metric,
metadata=metadata,
)
)
)
)
def log(self, timestamp, elapsedtime, step, data):
self.file.write(
"DLLL {}\n".format(
json.dumps(
dict(
timestamp=str(timestamp.timestamp()),
datetime=str(timestamp),
elapsedtime=str(elapsedtime),
type="LOG",
step=step,
data=data,
)
)
)
)
def flush(self):
self.file.flush()

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 1 GPU and trains for 40000 iterations with batch_size 1. Usage:
# This script launches U-Net run in FP32 on 1 GPU and trains for 6400 iterations with batch_size 8. Usage:
# bash unet_FP32_1GPU.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPUs and trains for 40000 iterations with batch_size 1. Usage:
# This script launches U-Net run in FP32 on 8 GPUs and trains for 6400 iterations with batch_size 8. Usage:
# bash unet_FP32_8GPU.sh <path to dataset> <path to results directory>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 1 GPU for inference batch_size 1. Usage:
# bash unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 1 GPU for inference benchmarking. Usage:
# bash unet_INFER_BENCHMARK_FP32.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 1 GPU for inference benchmarking. Usage:
# bash unet_INFER_BENCHMARK_TF-AMP.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla --use_amp
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla --amp

View file

@ -15,4 +15,4 @@
# This script launches U-Net inference benchmarking in FP32 on 1 GPU.
# Usage ./unet_INFER_BENCHMARK_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300 --use_xla
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300 --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 1 GPU for inference batch_size 1. Usage:
# bash unet_INFER_TF-AMP.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla --use_amp
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla --amp

View file

@ -15,4 +15,4 @@
# This script launches U-Net inference in TF-AMP on 1 GPU
# Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_trt --use_xla
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_trt --xla

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 1 GPU and trains for 40000 iterations batch_size 1. Usage:
# This script launches U-Net run in FP16 on 1 GPU and trains for 6400 iterations batch_size 8. Usage:
# bash unet_TF-AMP_1GPU.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 8 GPUs and trains for 40000 iterations batch_size 1. Usage:
# This script launches U-Net run in FP16 on 8 GPUs and trains for 6400 iterations batch_size 8. Usage:
# bash unet_TF-AMP_8GPU.sh <path to dataset> <path to results directory>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPU and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in FP32 on 1 GPU and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_1GPU_fold4.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla > $2/log_FP32_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla > $2/log_FP32_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla > $2/log_FP32_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla > $2/log_FP32_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla > $2/log_FP32_1GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_1GPU

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in FP32 on 8 GPUs and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_8GPU_fold4.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla > $2/log_FP32_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla > $2/log_FP32_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla > $2/log_FP32_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla > $2/log_FP32_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla > $2/log_FP32_8GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_8GPU

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 1 GPU for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 8 GPUs for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 1 GPU for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla --amp

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 8 GPUs for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla --amp

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in TF-AMP on 1 GPU and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in TF-AMP on 1 GPU and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold4.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_1GPU

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in TF-AMP on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in TF-AMP on 8 GPUs and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold4.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_8GPU

View file

@ -2,13 +2,13 @@ import argparse
import tensorflow as tf
from tf_exports.tf_export import to_savedmodel, to_tf_trt, to_onnx
from dlexport.tensorflow import to_savedmodel, to_onnx, to_tensorrt
from utils.data_loader import Dataset
from utils.model_fn import unet_fn
PARSER = argparse.ArgumentParser(description="U-Net medical")
PARSER.add_argument('--to', dest='to', choices=['savedmodel', 'tftrt', 'onnx'], required=True)
PARSER.add_argument('--to', dest='to', choices=['savedmodel', 'tensorrt', 'onnx'], required=True)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true', default=False)
PARSER.add_argument('--use_xla', dest='use_xla', action='store_true', default=False)
@ -46,14 +46,14 @@ def main():
if flags.to == 'savedmodel':
to_savedmodel(input_shape=flags.input_shape,
model_fn=unet_fn,
checkpoint_dir=flags.checkpoint_dir,
output_dir='./saved_model',
src_dir=flags.checkpoint_dir,
dst_dir='./saved_model',
input_names=['IteratorGetNext'],
output_names=['total_loss_ref'],
use_amp=flags.use_amp,
use_xla=flags.use_xla,
compress=flags.compress)
if flags.to == 'tftrt':
if flags.to == 'tensorrt':
ds = Dataset(data_dir=flags.data_dir,
batch_size=1,
augment=False,
@ -68,16 +68,16 @@ def main():
def input_data():
return {'input_tensor:0': sess.run(features)}
to_tf_trt(savedmodel_dir=flags.savedmodel_dir,
output_dir='./tf_trt_model',
to_tensorrt(src_dir=flags.savedmodel_dir,
dst_dir='./tf_trt_model',
precision=flags.precision,
feed_dict_fn=input_data,
num_runs=1,
output_tensor_names=['Softmax:0'],
compress=flags.compress)
if flags.to == 'onnx':
to_onnx(input_dir=flags.savedmodel_dir,
output_dir='./onnx_model',
to_onnx(src_dir=flags.savedmodel_dir,
dst_dir='./onnx_model',
compress=flags.compress)

Binary file not shown.


View file

@ -31,77 +31,24 @@ import numpy as np
import tensorflow as tf
from PIL import Image
from utils.cmd_util import PARSER, _cmd_params
from utils.setup import prepare_model_dir, get_logger, build_estimator, set_flags
from utils.cmd_util import PARSER, parse_args
from utils.data_loader import Dataset
from utils.hooks.profiling_hook import ProfilingHook
from utils.hooks.training_hook import TrainingHook
from utils.model_fn import unet_fn
from dllogger.logger import Logger, StdOutBackend, JSONStreamBackend, Verbosity
def main(_):
"""
Starting point of the application
"""
flags = PARSER.parse_args()
params = _cmd_params(flags)
np.random.seed(params.seed)
tf.compat.v1.random.set_random_seed(params.seed)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
backends = [StdOutBackend(Verbosity.VERBOSE)]
if params.log_dir is not None:
backends.append(JSONStreamBackend(Verbosity.VERBOSE, params.log_dir))
logger = Logger(backends)
# Optimization flags
os.environ['CUDA_CACHE_DISABLE'] = '0'
os.environ['HOROVOD_GPU_ALLREDUCE'] = 'NCCL'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = 'data'
os.environ['TF_ADJUST_HUE_FUSED'] = 'data'
os.environ['TF_ADJUST_SATURATION_FUSED'] = 'data'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = 'data'
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
if params.use_amp:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
else:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '0'
hvd.init()
set_flags()
params = parse_args(PARSER.parse_args())
model_dir = prepare_model_dir(params)
logger = get_logger(params)
# Build run config
gpu_options = tf.compat.v1.GPUOptions()
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
if params.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.compat.v1.OptimizerOptions.ON_1
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
run_config = tf.estimator.RunConfig(
save_summary_steps=1,
tf_random_seed=None,
session_config=config,
save_checkpoints_steps=params.max_steps // hvd.size(),
keep_checkpoint_max=1)
# Build the estimator model
estimator = tf.estimator.Estimator(
model_fn=unet_fn,
model_dir=params.model_dir,
config=run_config,
params=params)
estimator = build_estimator(params, model_dir)
dataset = Dataset(data_dir=params.data_dir,
batch_size=params.batch_size,

View file

@ -168,7 +168,7 @@ def output_block(inputs, residual_input, filters, n_classes):
return tf.layers.conv2d(inputs=out,
filters=n_classes,
kernel_size=(1, 1),
activation=tf.nn.relu)
activation=None)
def input_block(inputs, filters):

View file

@ -81,28 +81,24 @@ PARSER.add_argument('--seed',
PARSER.add_argument('--augment', dest='augment', action='store_true',
help="""Perform data augmentation during training""")
PARSER.add_argument('--no-augment', dest='augment', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--benchmark', dest='benchmark', action='store_true',
help="""Collect performance metrics during training""")
PARSER.add_argument('--no-benchmark', dest='benchmark', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
PARSER.add_argument('--use_amp', '--amp', dest='use_amp', action='store_true',
help="""Train using TF-AMP""")
PARSER.set_defaults(use_amp=False)
PARSER.add_argument('--use_xla', dest='use_xla', action='store_true',
PARSER.add_argument('--use_xla', '--xla', dest='use_xla', action='store_true',
help="""Train using XLA""")
PARSER.set_defaults(use_amp=False)
PARSER.add_argument('--use_trt', dest='use_trt', action='store_true',
help="""Use TF-TRT""")
PARSER.set_defaults(use_trt=False)
PARSER.add_argument('--resume_training', dest='resume_training', action='store_true',
help="""Resume training from a checkpoint""")
def _cmd_params(flags):
def parse_args(flags):
return Munch({
'exec_mode': flags.exec_mode,
'model_dir': flags.model_dir,
@ -121,4 +117,5 @@ def _cmd_params(flags):
'use_amp': flags.use_amp,
'use_trt': flags.use_trt,
'use_xla': flags.use_xla,
'resume_training': flags.resume_training,
})

View file

@ -150,6 +150,18 @@ class Dataset():
return (inputs, labels)
def _preproc_eval_samples(self, inputs, labels):
"""Preprocess samples and perform random augmentations"""
inputs = self._normalize_inputs(inputs)
labels = self._normalize_labels(labels)
# Bring back labels to network's output size and remove interpolation artifacts
labels = tf.image.resize_image_with_crop_or_pad(labels, target_width=388, target_height=388)
cond = tf.less(labels, 0.5 * tf.ones(tf.shape(labels)))
labels = tf.where(cond, tf.zeros(tf.shape(labels)), tf.ones(tf.shape(labels)))
return (inputs, labels)
def train_fn(self, drop_remainder=False):
"""Input function for training"""
dataset = tf.data.Dataset.from_tensor_slices(
@ -169,7 +181,7 @@ class Dataset():
dataset = tf.data.Dataset.from_tensor_slices(
(self._val_images, self._val_masks))
dataset = dataset.repeat(count=count)
dataset = dataset.map(self._preproc_samples)
dataset = dataset.map(self._preproc_eval_samples)
dataset = dataset.batch(self._batch_size)
dataset = dataset.prefetch(self._batch_size)

View file

@ -49,8 +49,6 @@ class ProfilingHook(tf.estimator.SessionRunHook):
def end(self, session):
if hvd.rank() == 0:
throughput_imgps, latency_ms = process_performance_stats(np.array(self._timestamps),
self._global_batch_size)
self.logger.log(step=(),
data={'throughput_{}'.format(self.mode): throughput_imgps,
'latency_{}'.format(self.mode): latency_ms})
stats = process_performance_stats(np.array(self._timestamps),
self._global_batch_size)
self.logger.log(step=(), data={metric: value for (metric, value) in stats})

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,24 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
""" Runner class encapsulating the training
This module provides the functionality to initialize a run with hyper-parameters
which can be later used for training and inference.
Example:
Runner can be created with a parameter dictionary, and those parameters
are reused for training and inference::
params = {...}
runner = Runner(params)
runner.train()
runner.predict()
"""
import os
import horovod.tensorflow as hvd
import tensorflow as tf
@ -66,15 +48,6 @@ def regularization_l2loss(weight_decay):
return l2_loss
def is_using_hvd():
env_vars = ["OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"]
if all([var in os.environ for var in env_vars]):
return True
else:
return False
def unet_fn(features, labels, mode, params):
""" Model function for tf.Estimator
@ -97,9 +70,6 @@ def unet_fn(features, labels, mode, params):
global_step = tf.compat.v1.train.get_global_step()
if mode == tf.estimator.ModeKeys.TRAIN:
lr_init = params.learning_rate
with tf.device(device):
features = tf.cast(features, dtype)
@ -130,10 +100,8 @@ def unet_fn(features, labels, mode, params):
"eval_dice_score": tf.compat.v1.metrics.mean(1.0 - dice_loss)}
return tf.estimator.EstimatorSpec(mode=mode, loss=dice_loss, eval_metric_ops=eval_metric_ops)
opt = tf.compat.v1.train.AdamOptimizer(learning_rate=lr_init)
if is_using_hvd():
opt = hvd.DistributedOptimizer(opt, device_dense='/gpu:0')
opt = tf.compat.v1.train.AdamOptimizer(learning_rate=params.learning_rate)
opt = hvd.DistributedOptimizer(opt, device_dense='/gpu:0')
with tf.control_dependencies(tf.compat.v1.get_collection(tf.compat.v1.GraphKeys.UPDATE_OPS)):
deterministic = True

View file

@ -19,17 +19,17 @@ import argparse
def process_performance_stats(timestamps, batch_size):
timestamps_ms = 1000 * timestamps
timestamps_ms = timestamps_ms[timestamps_ms > 0]
latency_ms = timestamps_ms.mean()
std = timestamps_ms.std()
n = np.sqrt(len(timestamps_ms))
throughput_imgps = (1000.0 * batch_size / timestamps_ms).mean()
print('Throughput Avg:', round(throughput_imgps, 3), 'img/s')
print('Latency Avg:', round(latency_ms, 3), 'ms')
stats = [("Throughput Avg", str(throughput_imgps)),
('Latency Avg:', str(latency_ms))]
for ci, lvl in zip(["90%:", "95%:", "99%:"],
[1.645, 1.960, 2.576]):
print("Latency", ci, round(latency_ms + lvl * std / n, 3), "ms")
return float(throughput_imgps), float(latency_ms)
stats.append(("Latency_" + ci, str(latency_ms + lvl * std / n)))
return stats
def parse_convergence_results(path, environment):
@ -40,14 +40,14 @@ def parse_convergence_results(path, environment):
raise FileNotFoundError("No logfile found at {}".format(path))
for logfile in logfiles:
with open(os.path.join(path, logfile), "r") as f:
content = f.readlines()
if "eval_dice_score" not in content[-1]:
content = f.readlines()[-1]
if "eval_dice_score" not in content:
print("Evaluation score not found. The file", logfile, "might be corrupted.")
continue
dice_scores.append(float([val for val in content[-1].split()
if "eval_dice_score" in val][0].split(":")[1]))
ce_scores.append(float([val for val in content[-1].split()
if "eval_ce_loss" in val][0].split(":")[1]))
dice_scores.append(float([val for val in content.split(" ")
if "eval_dice_score" in val][0].split()[-1]))
ce_scores.append(float([val for val in content.split(" ")
if "eval_ce_loss" in val][0].split()[-1]))
if dice_scores:
print("Evaluation dice score:", sum(dice_scores) / len(dice_scores))
print("Evaluation cross-entropy loss:", sum(ce_scores) / len(ce_scores))

View file

@ -0,0 +1,91 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import dllogger as logger
import tensorflow as tf
import horovod.tensorflow as hvd
import numpy as np
from dllogger import StdOutBackend, Verbosity, JSONStreamBackend
from utils.model_fn import unet_fn
def set_flags():
os.environ['CUDA_CACHE_DISABLE'] = '1'
os.environ['HOROVOD_GPU_ALLREDUCE'] = 'NCCL'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '0'
os.environ['TF_ADJUST_HUE_FUSED'] = '1'
os.environ['TF_ADJUST_SATURATION_FUSED'] = '1'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
def prepare_model_dir(params):
model_dir = os.path.join(params.model_dir, "model_checkpoint")
model_dir = model_dir if (hvd.rank() == 0 and not params.benchmark) else None
if model_dir is not None:
os.makedirs(model_dir, exist_ok=True)
if ('train' in params.exec_mode) and (not params.resume_training):
os.system('rm -rf {}/*'.format(model_dir))
return model_dir
def build_estimator(params, model_dir):
if params.use_amp:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
else:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '0'
np.random.seed(params.seed)
tf.compat.v1.random.set_random_seed(params.seed)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
gpu_options = tf.compat.v1.GPUOptions()
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
if params.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.compat.v1.OptimizerOptions.ON_1
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
run_config = tf.estimator.RunConfig(
save_summary_steps=1,
tf_random_seed=params.seed,
session_config=config,
save_checkpoints_steps=(params.max_steps // hvd.size()) if hvd.rank() == 0 else None,
keep_checkpoint_max=1)
estimator = tf.estimator.Estimator(
model_fn=unet_fn,
model_dir=model_dir,
config=run_config,
params=params)
return estimator
def get_logger(params):
backends = []
if hvd.rank() == 0:
backends += [StdOutBackend(Verbosity.VERBOSE)]
if params.log_dir:
backends += [JSONStreamBackend(Verbosity.VERBOSE, params.log_dir)]
logger.init(backends=backends)
return logger

View file

@ -0,0 +1 @@
results

View file

@ -1,7 +1,9 @@
FROM nvcr.io/nvidia/tensorflow:20.02-tf2-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf2-py3
FROM ${FROM_IMAGE_NAME}
ADD . /workspace/unet
WORKDIR /workspace/unet
RUN pip install --upgrade pip
RUN pip install git+https://github.com/NVIDIA/dllogger
RUN pip install -r requirements.txt

View file

@ -1,6 +1,6 @@
# U-Net Medical Image Segmentation for TensorFlow 2.x
# UNet Medical Image Segmentation for TensorFlow 2.x
This repository provides a script and recipe to train U-Net Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
This repository provides a script and recipe to train UNet Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table of Contents
@ -12,6 +12,7 @@ This repository provides a script and recipe to train U-Net Medical to achieve s
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Enabling TF32](#enabling-tf32)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
@ -30,13 +31,14 @@ This repository provides a script and recipe to train U-Net Medical to achieve s
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
* [Training stability results](#training-stability-results)
* [Training stability: NVIDIA DGX-1 (8x V100 16G)](#training-stability-nvidia-dgx-1-8x-v100-16g)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16g)
* [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
@ -45,29 +47,29 @@ This repository provides a script and recipe to train U-Net Medical to achieve s
## Model overview
The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the original paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
The UNet model is a convolutional neural network for 2D image segmentation. This repository contains a UNet implementation as described in the original paper [UNet: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). U-Net allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
UNet was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: [UNet: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). UNet allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
The following figure shows the construction of the U-Net model and its different components. U-Net is composed of a contractive and an expanding path, that aims at building a bottleneck in its centermost part through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve the training.
The following figure shows the construction of the UNet model and its different components. UNet is composed of a contractive and an expanding path, which together build a bottleneck in its centermost part through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added to help the backward flow of gradients and improve training.
![U-Net](images/unet.png)
Figure 1. The architecture of a U-Net model. Taken from the <a href="https://arxiv.org/abs/1505.04597">U-Net: Convolutional Networks for Biomedical Image Segmentation paper</a>.
![UNet](images/unet.png)
Figure 1. The architecture of a UNet model. Taken from the <a href="https://arxiv.org/abs/1505.04597">UNet: Convolutional Networks for Biomedical Image Segmentation paper</a>.
### Default configuration
U-Net consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and concatenation with the correspondingly cropped feature map from the contractive path.
UNet consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and concatenation with the correspondingly cropped feature map from the contractive path.
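For illustration only, the following is a minimal sketch of the two building blocks described above: an unpadded-convolution downsampling step and an upsampling step that concatenates the cropped skip connection. It is not the repository's `model/layers.py`; layer sizes are placeholders and a fixed input size is assumed so the crop offsets are static.

```python
import tensorflow as tf

# Schematic UNet building blocks (illustrative sketch, not the repository code).
def down_block(x, filters):
    x = tf.keras.layers.Conv2D(filters, 3, padding='valid', activation='relu')(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='valid', activation='relu')(x)
    skip = x                                       # kept for the expanding path
    x = tf.keras.layers.MaxPool2D(pool_size=2)(x)  # downsample by 2
    return x, skip

def up_block(x, skip, filters):
    x = tf.keras.layers.Conv2DTranspose(filters, 2, strides=2)(x)  # upsample by 2
    crop = (skip.shape[1] - x.shape[1]) // 2       # center-crop the skip connection
    skip = tf.keras.layers.Cropping2D(cropping=crop)(skip)
    x = tf.keras.layers.Concatenate()([x, skip])
    x = tf.keras.layers.Conv2D(filters, 3, padding='valid', activation='relu')(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding='valid', activation='relu')(x)
    return x

inputs = tf.keras.Input(shape=(572, 572, 1))       # input size used in the UNet paper
x, skip = down_block(inputs, 64)                   # 572 -> 568, pooled to 284
x = tf.keras.layers.Conv2D(128, 3, padding='valid', activation='relu')(x)  # bottleneck stand-in
x = up_block(x, skip, 64)                          # upsample, crop-and-concat, convolve
model = tf.keras.Model(inputs, x)
```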
### Feature support matrix
The following features are supported by this model:
| **Feature** | **U-Net Medical** |
| **Feature** | **UNet Medical** |
|-------------|---------------------|
| Automatic mixed precision (AMP) | Yes |
| Horovod Multi-GPU (NCCL) | Yes |
@ -77,7 +79,7 @@ The following features are supported by this model:
**Automatic Mixed Precision (AMP)**
This implementation of U-Net uses AMP to implement mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
This implementation of UNet uses AMP to implement mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
**Horovod**
@ -94,20 +96,22 @@ XLA is a domain-specific compiler for linear algebra that can accelerate TensorF
### Mixed precision training
Mixed precision is the combined use of different numerical precision in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
#### Enabling mixed precision
This implementation exploits the TensorFlow Automatic Mixed Precision feature. To enable AMP, you simply need to supply the `--use_amp` flag to the `main.py` script. For reference, enabling the AMP required us to apply the following changes to the code:
This implementation exploits the TensorFlow Automatic Mixed Precision feature. To enable AMP, supply the `--amp` flag to the `main.py` script. For reference, enabling AMP required us to apply the following changes to the code:
1. Set Keras mixed precision policy:
```python
@ -117,7 +121,7 @@ This implementation exploits the TensorFlow Automatic Mixed Precision feature. T
2. Use loss scaling wrapper on the optimizer:
```python
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
if using_amp:
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, "dynamic")
```
@ -131,17 +135,29 @@ This implementation exploits the TensorFlow Automatic Mixed Precision feature. T
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```
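Putting the excerpts above together, a minimal custom-training-step sketch might look as follows. It uses the TF 2.x experimental mixed-precision API available in this container generation; the model, data, and loss below are placeholders, not the repository's code.

```python
import tensorflow as tf

# Illustrative AMP pattern: set the policy, wrap the optimizer in a loss-scale
# wrapper, scale the loss inside the tape, and unscale the gradients before
# applying them.
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(64, 64, 1)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
optimizer = tf.keras.mixed_precision.experimental.LossScaleOptimizer(optimizer, "dynamic")

@tf.function
def train_step(features, labels):
    with tf.GradientTape() as tape:
        logits = tf.cast(model(features, training=True), tf.float32)  # compute loss in FP32
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True))
        scaled_loss = optimizer.get_scaled_loss(loss)   # keep small FP16 gradients representable
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```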
#### Enabling TF32
TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models that require a high dynamic range for weights or activations.
For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
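No code change is needed to use TF32. If you want to compare against pure FP32 math on an Ampere GPU, one option is the `NVIDIA_TF32_OVERRIDE` environment variable honored by the NVIDIA CUDA libraries; this is an assumption about the NGC container environment, not a step required by this repository, and the variable must be set before the GPU libraries initialize, for example:

```python
import os

# Hypothetical illustration: globally disable TF32 in the NVIDIA libraries by
# setting NVIDIA_TF32_OVERRIDE=0 before TensorFlow (and the CUDA libraries)
# are imported. Leaving the variable unset keeps TF32 enabled, the default.
os.environ["NVIDIA_TF32_OVERRIDE"] = "0"

import tensorflow as tf  # noqa: E402  (import after the environment is set)
```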
## Setup
The following section lists the requirements that you need to meet in order to start training the U-Net Medical model.
The following section lists the requirements that you need to meet in order to start training the UNet Medical model.
### Requirements
This repository contains a Dockerfile that extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- TensorFlow 20.02-tf2-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow) with Tensorflow 2.1 or later
- [NVIDIA Volta GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
- TensorFlow 20.06-tf2-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow) with Tensorflow 2.2 or later
- GPU-based architecture:
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
@ -149,9 +165,10 @@ For more information about how to get started with NGC containers, see the follo
- [Running TensorFlow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
For those unable to use the TensorFlow NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the U-Net model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). These steps enable you to build the U-Net TensorFlow NGC container, train and evaluate your model, and generate predictions on the test data. Furthermore, you can then choose to:
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the UNet model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). These steps enable you to build the UNet TensorFlow NGC container, train and evaluate your model, and generate predictions on the test data. Furthermore, you can then choose to:
* compare your evaluation accuracy with our [Training accuracy results](#training-accuracy-results),
* compare your training performance with our [Training performance benchmark](#training-performance-benchmark),
* compare your inference performance with our [Inference performance benchmark](#inference-performance-benchmark).
@ -160,14 +177,14 @@ For the specifics concerning training and inference, see the [Advanced](#advance
1. Clone the repository.
Executing this command will create your local repository with all the code to run U-Net.
Executing this command will create your local repository with all the code to run UNet.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Medical_TF2
cd DeepLearningExamples/TensorFlow2/Segmentation/UNet_Medical/
```
2. Build the U-Net TensorFlow NGC container.
2. Build the UNet TensorFlow NGC container.
This command will use the `Dockerfile` to create a Docker image named `unet_tf2`, downloading all the required components automatically.
@ -192,14 +209,9 @@ For the specifics concerning training and inference, see the [Advanced](#advance
4. Download and preprocess the data.
The U-Net script `main.py` operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [U-Net paper](https://arxiv.org/abs/1505.04597).
The script `download_dataset.py` is provided for data download. It is possible to select the destination folder when downloading the files by using the `--data_dir` flag. For example:
```bash
python download_dataset.py --data_dir /data
```
Training and test data are composed of 3 multi-page `TIF` files, each containing 30 2D-images (around 30 Mb total). Once downloaded, the data with the `download_dataset.py` script can be used to run the training and benchmark scripts described below, by pointing `main.py` to its location using the `--data_dir` flag.
The UNet script `main.py` operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [UNet paper](https://arxiv.org/abs/1505.04597). The data is available to download upon registration on the website.
Training and test data are composed of 3 multi-page `TIF` files, each containing 30 2D images (around 30 MB total). Once downloaded, the data can be used to run the training and benchmark scripts described below by pointing `main.py` to its location using the `--data_dir` flag.
**Note:** Masks are only provided for training data.
@ -208,13 +220,13 @@ For the specifics concerning training and inference, see the [Advanced](#advance
After the Docker container is launched, the training with the [default hyperparameters](#default-parameters) (for example 1/8 GPUs FP32/TF-AMP) can be started with:
```bash
bash examples/unet_{FP32, TF-AMP}_{1,8}GPU.sh <path/to/dataset> <path/to/checkpoint>
bash examples/unet{_TF-AMP}_{1,8}GPU.sh <path/to/dataset> <path/to/checkpoint>
```
For example, to run with full precision (FP32) on 1 GPU from the projects folder, simply use:
```bash
bash examples/unet_FP32_1GPU.sh /data /results
bash examples/unet_1GPU.sh /data /results
```
This script will launch training on a single fold and store the model's checkpoint in the <path/to/checkpoint> directory.
@ -222,7 +234,7 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The script can be run directly by modifying flags if necessary, especially the number of GPUs, which is defined after the `-np` flag. Since the test volume does not have labels, 20% of the training data is used for validation in a 5-fold cross-validation manner. The fold number can be changed using `--crossvalidation_idx` with an integer in the range 0-4. For example, to run with 4 GPUs using fold 1, use:
```bash
horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --crossvalidation_idx 1 --use_xla --use_amp
horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --crossvalidation_idx 1 --xla --amp
```
Training will result in a checkpoint file being written to `./results` on the host machine.
@ -232,7 +244,7 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The trained model can be evaluated by passing the `--exec_mode evaluate` flag. Since evaluation is carried out on a validation dataset, the `--crossvalidation_idx` parameter should be specified. For example:
```bash
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --crossvalidation_idx 0 --use_xla --use_amp
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --crossvalidation_idx 0 --xla --amp
```
Evaluation can also be triggered jointly after training by passing the `--exec_mode train_and_evaluate` flag.
@ -242,7 +254,7 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The trained model can be used for inference by passing the `--exec_mode predict` flag:
```bash
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode predict --use_xla --use_amp
python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode predict --xla --amp
```
Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark the performance of your training with the [Training performance benchmark](#training-performance-benchmark), or the performance of your inference with the [Inference performance benchmark](#inference-performance-benchmark). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
@ -256,29 +268,29 @@ The following sections provide greater details of the dataset, running training
In the root directory, the most important files are:
* `main.py`: Serves as the entry point to the application.
* `run.py`: Implements the logic for training, evaluation, and inference.
* `Dockerfile`: Specifies the container with the basic set of dependencies to run U-Net.
* `requirements.txt`: Set of extra requirements for running U-Net.
* `download_data.py`: Automatically downloads the dataset for training.
* `Dockerfile`: Specifies the container with the basic set of dependencies to run UNet.
* `requirements.txt`: Set of extra requirements for running UNet.
The `utils/` folder encapsulates the necessary tools to train and perform inference using U-Net. Its main components are:
The `utils/` folder encapsulates the necessary tools to train and perform inference using UNet. Its main components are:
* `cmd_util.py`: Implements the command-line arguments parsing.
* `data_loader.py`: Implements the data loading and augmentation.
* `losses.py`: Implements the losses used during training and evaluation.
* `parse_results.py`: Implements the intermediate results parsing.
* `setup.py`: Implements helper setup functions.
The `model/` folder contains information about the building blocks of U-Net and the way they are assembled. Its contents are:
* `layers.py`: Defines the different blocks that are used to assemble U-Net.
The `model/` folder contains information about the building blocks of UNet and the way they are assembled. Its contents are:
* `layers.py`: Defines the different blocks that are used to assemble UNet.
* `unet.py`: Defines the model architecture using the blocks from the `layers.py` script.
Other folders included in the root directory are:
* `dllogger/`: Contains the utilities for logging.
* `examples/`: Provides examples for training and benchmarking U-Net.
* `examples/`: Provides examples for training and benchmarking UNet.
* `images/`: Contains a model diagram.
### Parameters
The complete list of the available parameters for the `main.py` script contains:
* `--exec_mode`: Select the execution mode to run the model (default: `train`). Modes available:
* `train` - trains model from scratch.
* `evaluate` - loads checkpoint (if available) and performs evaluation on validation subset (requires `--crossvalidation_idx` other than `None`).
* `train_and_evaluate` - trains model from scratch and performs validation at the end (requires `--crossvalidation_idx` other than `None`).
* `predict` - loads checkpoint (if available) and runs inference on the test set. Stores the results in `--model_dir` directory.
@ -296,8 +308,8 @@ The complete list of the available parameters for the `main.py` script contains:
* `--augment`: Enable data augmentation (default: `False`).
* `--benchmark`: Enable performance benchmarking (default: `False`). If the flag is set, the script runs in a benchmark mode - each iteration is timed and the performance result (in images per second) is printed at the end. Works for both `train` and `predict` execution modes.
* `--warmup_steps`: Used during benchmarking - the number of steps to skip (default: `200`). First iterations are usually much slower since the graph is being constructed. Skipping the initial iterations is required for a fair performance assessment.
* `--use_xla`: Enable accelerated linear algebra optimization (default: `False`).
* `--use_amp`: Enable automatic mixed precision (default: `False`).
* `--xla`: Enable accelerated linear algebra optimization (default: `False`).
* `--amp`: Enable automatic mixed precision (default: `False`).
### Command-line options
@ -315,8 +327,8 @@ usage: main.py [-h]
[--crossvalidation_idx CROSSVALIDATION_IDX]
[--max_steps MAX_STEPS] [--weight_decay WEIGHT_DECAY]
[--log_every LOG_EVERY] [--warmup_steps WARMUP_STEPS]
[--seed SEED] [--augment] [--no-augment] [--benchmark]
[--no-benchmark] [--use_amp] [--use_xla]
[--seed SEED] [--augment] [--benchmark]
[--amp] [--xla]
UNet-medical
@ -346,22 +358,16 @@ optional arguments:
Number of warmup steps
--seed SEED Random seed
--augment Perform data augmentation during training
--no-augment
--benchmark Collect performance metrics during training
--no-benchmark
--use_amp Train using TF-AMP
--use_xla Train using XLA
--amp Train using TF-AMP
--xla Train using XLA
```
### Getting the data
The U-Net model was trained in the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission. Upon registration, the challenge's data is made available through the following links:
* [train-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-volume.tif)
* [train-labels.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-labels.tif)
* [train-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/test-volume.tif)
The UNet model uses the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission. The challenge's data is made available upon registration.
Training and test data consist of three 512x512x30 `TIF` volumes (`test-volume.tif`, `train-volume.tif` and `train-labels.tif`). Files `test-volume.tif` and `train-volume.tif` contain grayscale 2D slices to be segmented. Additionally, training masks are provided in `train-labels.tif` as a 512x512x30 `TIF` volume, where each pixel has one of two classes:
* 0 indicating the presence of cellular membrane,
* 1 corresponding to background.
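As a small illustration of the layout described above, the volumes can be read into NumPy arrays with Pillow, which is already used elsewhere in this codebase. This is a sketch, not the repository's `utils/data_loader.py`, and the paths are placeholders:

```python
import numpy as np
from PIL import Image

def load_multipage_tif(path):
    """Read a multi-page TIF volume into a (depth, height, width) NumPy array."""
    frames = []
    with Image.open(path) as tif:
        for i in range(tif.n_frames):
            tif.seek(i)              # move to the i-th 2D slice
            frames.append(np.array(tif))
    return np.stack(frames)

volume = load_multipage_tif('/data/train-volume.tif')   # (30, 512, 512) grayscale slices
labels = load_multipage_tif('/data/train-labels.tif')   # (30, 512, 512) segmentation masks
```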
@ -397,10 +403,10 @@ Generally, the model should scale better for datasets containing more data. For
### Training process
The model trains for a total 40,000 batches (40,000 / number of GPUs), with the default U-Net setup:
The model trains for a total of 6,400 batches (6,400 / number of GPUs), with the default UNet setup:
* Adam optimizer with learning rate of 0.0001.
This default parametrization is applied when running scripts from the `./examples` directory and when running `main.py` without explicitly overriding these parameters. By default, the training is in full precision. To enable AMP, pass the `--use_amp` flag. AMP can be enabled for every mode of execution.
This default parametrization is applied when running scripts from the `./examples` directory and when running `main.py` without explicitly overriding these parameters. By default, the training is in full precision. To enable AMP, pass the `--amp` flag. AMP can be enabled for every mode of execution.
The default configuration minimizes a function _L = 1 - DICE + cross entropy_ during training.
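For reference, a minimal sketch of that objective, assuming a single-channel binary segmentation output, could look like the following; it is illustrative only and not the repository's `utils/losses.py`:

```python
import tensorflow as tf

def dice_plus_ce_loss(logits, labels, smooth=1.0):
    """Sketch of L = (1 - DICE) + cross entropy for binary segmentation.

    `logits` and `labels` are assumed to have the same shape; `smooth`
    avoids division by zero on empty masks.
    """
    probs = tf.sigmoid(logits)
    intersection = tf.reduce_sum(probs * labels)
    dice = (2.0 * intersection + smooth) / (
        tf.reduce_sum(probs) + tf.reduce_sum(labels) + smooth)
    ce = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits))
    return (1.0 - dice) + ce
```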
@ -438,7 +444,7 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run one of the `TRAIN_BENCHMARK` scripts in `./examples/`:
```bash
bash examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
bash examples/unet_TRAIN_BENCHMARK{_TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
```
For example, to benchmark training using mixed-precision on 8 GPUs use:
```bash
@ -458,7 +464,7 @@ At the end of the script, a line reporting the best train throughput will be pri
To benchmark inference, run one of the scripts in `./examples/`:
```bash
bash examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
bash examples/unet_INFER_BENCHMARK{_TF-AMP}.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
For example, to benchmark inference using mixed-precision:
@ -482,45 +488,65 @@ The following sections provide details on how we achieved our performance and ac
#### Training accuracy results
##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf2-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN{_TF-AMP}_{1, 8}GPU.sh` training script in the `tensorflow:20.06-tf2-py3` NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [hours] | Time to train - mixed precision [hours] | Time to train speedup (FP32 to mixed precision) |
|------|------------------|-----------------|----------------------------|------------------------------|----------------------------|--------------------------------|
| 1 | 8 | 0.8825 | 0.8826 | 6.51 | 2.25 | 2.89 |
| 8 | 8 | 0.8968 | 0.8962 | 0.89 | 0.32 | 2.76 |
| GPUs | Batch size / GPU | DICE - TF32 | DICE - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 8 | 0.8900 | 0.8902 | 21.3 | 8.6 | 2.48 |
| 8 | 8 | 0.8855 | 0.8858 | 2.5 | 2.5 | 1.00 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh` training script in the `tensorflow:20.06-tf2-py3` NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
| GPUs | Batch size / GPU | DICE - FP32 | DICE - mixed precision | Time to train - FP32 [min] | Time to train - mixed precision [min] | Time to train speedup (FP32 to mixed precision) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 8 | 0.8901 | 0.8898 | 47 | 16 | 2.94 |
| 8 | 8 | 0.8848 | 0.8857 | 7 | 4.5 | 1.56 |
To reproduce this result, start the Docker container interactively and run one of the TRAIN scripts:
```bash
bash examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
bash examples/unet_TRAIN{_TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
```
For example:
```bash
bash examples/unet_TRAIN_TF-AMP_8GPU.sh /data /results 8
```
This command will launch a script which will run 5-fold cross-validation training for 40,000 iterations and print the validation DICE score and cross-entropy loss. The time reported is for one fold, which means that the training for 5 folds will take 5 times longer. The default batch size is 8, however if you have less than 16 Gb memory card and you encounter GPU memory issue you should decrease the batch size. The logs of the runs can be found in `/results` directory once the script is finished.
#### Training stability results
This command will launch a script which runs 5-fold cross-validation training for 6,400 iterations and prints the validation DICE score and cross-entropy loss. The time reported is for one fold, which means that the training for 5 folds will take 5 times longer. The default batch size is 8; however, if your GPU has less than 16 GB of memory and you encounter GPU memory issues, you should decrease the batch size. The logs of the runs can be found in the `/results` directory once the script is finished.
##### Training stability: NVIDIA DGX-1 (8x V100 16G)
**Learning curves**
The histogram below shows the best DICE scores achieved for 100 experiments using mixed precision. Mean DICE score for mixed precision was equal to 0.8962 and for single precision it was equal to
0.8968.
![score_histogram](images/score_histogram.png)
The following image shows the training loss as a function of the iteration number for training on DGX A100 (TF32 and TF-AMP) and DGX-1 V100 (FP32 and TF-AMP).
![LearningCurves](images/UNetMed_TF2_conv.png)
#### Training performance results
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK_{TF-AMP, FP32}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf2-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs. Performance numbers (in items/images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|------|------------------|-------------------|--------------------------------|---------------------------------------------|---------------------------|--------------------------------|
| 1 | 8 | 17.98 | 51.89 | 2.89 | N/A | N/A |
| 8 | 8 | 143.08 | 386.15 | 2.70 | 7.44 | 7.97 |
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK{_TF-AMP}_{1, 8}GPU.sh` training script in the NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers (in images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
| GPUs | Batch size / GPU | Throughput - TF32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|:----:|:----------------:|:-------------------------:|:------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
| 1 | 1 | 29.56 | 62.50 | 2.11 | - | - |
| 1 | 8 | 46.26 | 118.98 | 2.57 | - | - |
| 8 | 1 | 210.74 | 259.22 | 1.23 | 7.13 | 4.15 |
| 8 | 8 | 293.64 | 561.77 | 1.91 | 6.35 | 4.72 |
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK{_TF-AMP}_{1, 8}GPU.sh` training script in the `tensorflow:20.06-tf2-py3` NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|:----:|:----------------:|:-------------------------:|:------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
| 1 | 1 | 14.65 | 40.36 | 2.75 | - | - |
| 1 | 8 | 17.91 | 59.58 | 3.33 | - | - |
| 8 | 1 | 117.81 | 210.18 | 1.78 | 8.04 | 5.21 |
| 8 | 8 | 137.11 | 368.88 | 2.69 | 7.66 | 6.19 |
To achieve these same results, follow the steps in the [Training performance benchmark](#training-performance-benchmark) section.
@ -529,52 +555,78 @@ Throughput is reported in images per second. Latency is reported in milliseconds
TensorFlow 2 runs by default in eager mode, which makes tensor evaluation trivial at the cost of lower performance. To mitigate this issue, multiple layers of performance optimization were implemented. Two of them, AMP and XLA, were already described. There is an additional one called Autograph, which constructs a graph from a subset of Python syntax, improving performance simply by adding a `@tf.function` decorator to the train function. To read more about Autograph, see [Better performance with tf.function and AutoGraph](https://www.tensorflow.org/guide/function).
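The toy step below illustrates the idea with a hypothetical model, optimizer, and loss (it is not the repository's train loop): keeping the decorator lets Autograph trace the function into a graph, while removing it falls back to eager execution.
```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

@tf.function  # Autograph traces this Python function into a TensorFlow graph
def train_step(features, labels):
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True))
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```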
The training performance using 1 GPU with batch size 8 and various combinations of AMP/FP32, XLA, and Autograph (AG) is shown in the plot below.
![training_throughput](images/training_throughput.png)
#### Inference performance results
##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
Our results were obtained by running the `examples/unet_INFER_BENCHMARK{_TF-AMP}.sh` inferencing benchmarking script in the `tensorflow:20.06-tf2-py3` NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.
FP16
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 283.12 | 3.534 | 3.543 | 3.544 | 3.547 |
| 2 | 572x572x1 | 188.69 | 10.603 | 10.619 | 10.623 | 10.629 |
| 4 | 572x572x1 | 204.49 | 19.572 | 19.610 | 19.618 | 19.632 |
| 8 | 572x572x1 | 412.70 | 19.386 | 19.399 | 19.401 | 19.406 |
| 16 | 572x572x1 | 423.76 | 37.760 | 37.783 | 37.788 | 37.797 |
TF32
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 107.44 | 9.317 | 9.341 | 9.346 | 9.355 |
| 2 | 572x572x1 | 115.66 | 17.294 | 17.309 | 17.312 | 17.318 |
| 4 | 572x572x1 | 126.29 | 31.676 | 31.698 | 31.702 | 31.710 |
| 8 | 572x572x1 | 138.55 | 57.742 | 57.755 | 57.757 | 57.762 |
| 16 | 572x572x1 | 142.17 | 112.545 | 112.562 | 112.565 | 112.572 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `examples/unet_INFER_BENCHMARK_{TF-AMP, FP32}.sh` inferencing benchmarking script in the tensorflow:20.02-tf2-py3 NGC container on NVIDIA DGX-1 with (1x V100 16G) GPU.
Our results were obtained by running the `examples/unet_INFER_BENCHMARK{_TF-AMP}.sh` inferencing benchmarking script in the `tensorflow:20.06-tf2-py3` NGC container on NVIDIA DGX-1 with (1x V100 16GB) GPU.
FP16
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|------|-----------|--------|---------|--------|--------|--------|
| 1 | 572x572x1 | 152.48 | 6.558 | 6.568 | 6.569 | 6.572 |
| 2 | 572x572x1 | 164.64 | 12.148 | 12.162 | 12.164 | 12.168 |
| 4 | 572x572x1 | 179.54 | 22.279 | 22.299 | 22.302 | 22.309 |
| 8 | 572x572x1 | 187.65 | 42.633 | 42.658 | 42.663 | 42.672 |
| 16 | 572x572x1 | 189.38 | 84.486 | 84.541 | 84.551 | 84.569 |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 146.17 | 6.843 | 6.851 | 6.853 | 6.856 |
| 2 | 572x572x1 | 151.19 | 13.230 | 13.242 | 13.244 | 13.248 |
| 4 | 572x572x1 | 153.65 | 26.035 | 26.049 | 26.051 | 26.057 |
| 8 | 572x572x1 | 183.49 | 43.602 | 43.627 | 43.631 | 43.640 |
| 16 | 572x572x1 | 186.62 | 85.743 | 85.807 | 85.819 | 85.843 |
FP32
| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
|------|-----------|--------|---------|---------|---------|---------|
| 1 | 572x572x1 | 54.41 | 18.381 | 18.395 | 18.398 | 18.403 |
| 2 | 572x572x1 | 56.83 | 35.193 | 35.210 | 35.213 | 35.219 |
| 4 | 572x572x1 | 57.62 | 69.421 | 69.459 | 69.465 | 69.478 |
| 8 | 572x572x1 | 58.66 | 136.391 | 136.506 | 136.525 | 136.564 |
| 16 | 572x572x1 | 75.74 | 211.240 | 211.302 | 211.313 | 211.336 |
|:----------:|:----------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
| 1 | 572x572x1 | 51.72 | 19.336 | 19.352 | 19.355 | 19.361 |
| 2 | 572x572x1 | 53.89 | 37.112 | 37.127 | 37.130 | 37.136 |
| 4 | 572x572x1 | 54.77 | 73.033 | 73.068 | 73.074 | 73.087 |
| 8 | 572x572x1 | 55.24 | 144.829 | 144.924 | 144.943 | 144.979 |
| 16 | 572x572x1 | 68.09 | 234.995 | 235.098 | 235.118 | 235.157 |
To achieve these same results, follow the steps in the [Inference performance benchmark](#inference-performance-benchmark) section.
Throughput is reported in images per second. Latency is reported in milliseconds per batch.
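As a rough guide, the sketch below shows how such numbers can be derived from per-batch wall-clock times; it is an illustrative assumption, not the repository's `utils/parse_results.py`.
```python
import numpy as np

def summarize(step_times_s, batch_size):
    """Convert per-batch wall-clock times (seconds) into throughput and latency."""
    latency_ms = 1000.0 * np.asarray(step_times_s)            # milliseconds per batch
    throughput = (1000.0 * batch_size / latency_ms).mean()    # images per second
    return throughput, latency_ms.mean()

# Example: 100 batches of 8 images that each took ~43 ms to process.
throughput, latency = summarize(step_times_s=[0.043] * 100, batch_size=8)
print(round(throughput, 2), "img/s,", round(latency, 2), "ms")
```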
The inference performance using 1 GPU with batch size 8 and various combinations of AMP/FP32, XLA, and Autograph (AG) is shown in the plot below.
![inference_throughput](images/inference_throughput.png)
## Release notes
### Changelog
June 2020
* Updated training and inference accuracy with A100 results
* Updated training and inference performance with A100 results
February 2020
* Initial release
### Known issues
* For TensorFlow 2.0, the training performance using AMP and XLA is around 30% lower than reported here. The issue was resolved in TensorFlow 2.1.
* Due to random initialization, around 5% of training runs result in a lower DICE score, usually around 0.81.

View file

@ -0,0 +1,17 @@
jobs:
# no AMP
- python main.py --data_dir dataset --model_dir /tmp --warmup_steps 200 --max_steps 6400 --learning_rate 0.0001 --batch_size 8 --exec_mode train_and_evaluate --augment --use_xla --log_dir /result/log.json --crossvalidation_idx 0
# with AMP
- python main.py --data_dir dataset --model_dir /tmp --warmup_steps 200 --max_steps 6400 --learning_rate 0.0001 --batch_size 8 --exec_mode train_and_evaluate --augment --use_xla --log_dir /result/log.json --crossvalidation_idx 0 --use_amp
backend:
container: nvcr.io/nvidian/swdl/unetmed_tf2:20.06
download_dir: /tmp
hostname: ngc
instance: dgx1v.16g.1.norm
result_dir: /result
reports:
filename: unetmed_tf2_ngc_conv_1gpu
types:
- xls

View file

@ -0,0 +1,17 @@
jobs:
# no AMP
- horovodrun -np 8 python main.py --data_dir dataset --model_dir /tmp --warmup_steps 200 --max_steps 6400 --learning_rate 0.0001 --batch_size 8 --exec_mode train_and_evaluate --augment --use_xla --log_dir /result/log.json --crossvalidation_idx 0
# with AMP
- horovodrun -np 8 python main.py --data_dir dataset --model_dir /tmp --warmup_steps 200 --max_steps 6400 --learning_rate 0.0001 --batch_size 8 --exec_mode train_and_evaluate --augment --use_xla --log_dir /result/log.json --crossvalidation_idx 0 --use_amp
backend:
container: nvcr.io/nvidian/swdl/unetmed_tf2:20.06
download_dir: /tmp
hostname: ngc
instance: dgx1v.16g.8.norm
result_dir: /result
reports:
filename: unetmed_tf2_ngc_conv_8gpu
types:
- xls

View file

@ -1,163 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
from collections import defaultdict
from datetime import datetime
import json
import atexit
class Backend(ABC):
def __init__(self, verbosity):
self._verbosity = verbosity
@property
def verbosity(self):
return self._verbosity
@abstractmethod
def log(self, timestamp, elapsedtime, step, data):
pass
@abstractmethod
def metadata(self, timestamp, elapsedtime, metric, metadata):
pass
class Verbosity:
OFF = -1
DEFAULT = 0
VERBOSE = 1
class Logger:
def __init__(self, backends):
self.backends = backends
atexit.register(self.flush)
self.starttime = datetime.now()
def metadata(self, metric, metadata):
timestamp = datetime.now()
elapsedtime = (timestamp - self.starttime).total_seconds()
for b in self.backends:
b.metadata(timestamp, elapsedtime, metric, metadata)
def log(self, step, data, verbosity=1):
timestamp = datetime.now()
elapsedtime = (timestamp - self.starttime).total_seconds()
for b in self.backends:
if b.verbosity >= verbosity:
b.log(timestamp, elapsedtime, step, data)
def flush(self):
for b in self.backends:
b.flush()
def default_step_format(step):
return str(step)
def default_metric_format(metric, metadata, value):
unit = metadata["unit"] if "unit" in metadata.keys() else ""
format = "{" + metadata["format"] + "}" if "format" in metadata.keys() else "{}"
return "{}:{} {}".format(
metric, format.format(value) if value is not None else value, unit
)
def default_prefix_format(timestamp):
return "DLL {} - ".format(timestamp)
class StdOutBackend(Backend):
def __init__(
self,
verbosity,
step_format=default_step_format,
metric_format=default_metric_format,
prefix_format=default_prefix_format,
):
super().__init__(verbosity=verbosity)
self._metadata = defaultdict(dict)
self.step_format = step_format
self.metric_format = metric_format
self.prefix_format = prefix_format
self.elapsed = 0.0
def metadata(self, timestamp, elapsedtime, metric, metadata):
self._metadata[metric].update(metadata)
def log(self, timestamp, elapsedtime, step, data):
print(
"{}{} {}{}".format(
self.prefix_format(timestamp),
self.step_format(step),
" ".join(
[
self.metric_format(m, self._metadata[m], v)
for m, v in data.items()
]
),
"elapsed:"+str(elapsedtime)
)
)
def flush(self):
pass
class JSONStreamBackend(Backend):
def __init__(self, verbosity, filename):
super().__init__(verbosity=verbosity)
self._filename = filename
self.file = open(filename, "w")
atexit.register(self.file.close)
def metadata(self, timestamp, elapsedtime, metric, metadata):
self.file.write(
"DLLL {}\n".format(
json.dumps(
dict(
timestamp=str(timestamp.timestamp()),
elapsedtime=str(elapsedtime),
datetime=str(timestamp),
type="METADATA",
metric=metric,
metadata=metadata,
)
)
)
)
def log(self, timestamp, elapsedtime, step, data):
self.file.write(
"DLLL {}\n".format(
json.dumps(
dict(
timestamp=str(timestamp.timestamp()),
datetime=str(timestamp),
elapsedtime=str(elapsedtime),
type="LOG",
step=step,
data=data,
)
)
)
)
def flush(self):
self.file.flush()

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 1 GPU and trains for 40000 iterations with batch_size 1. Usage:
# This script launches U-Net run in FP32 on 1 GPU and trains for 6400 iterations with batch_size 8. Usage:
# bash unet_FP32_1GPU.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2/log.json
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --log_dir $2/log.json

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPUs and trains for 40000 iterations with batch_size 1. Usage:
# This script launches U-Net run in FP32 on 8 GPUs and trains for 6400 iterations with batch_size 8. Usage:
# bash unet_FP32_8GPU.sh <path to dataset> <path to results directory>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2/log.json
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --log_dir $2/log.json

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 1 GPU for inference batch_size 1. Usage:
# bash unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 1 GPU for inference benchmarking. Usage:
# bash unet_INFER_BENCHMARK_FP32.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 1 GPU for inference benchmarking. Usage:
# bash unet_INFER_BENCHMARK_TF-AMP.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla --use_amp
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla --amp

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 1 GPU for inference batch_size 1. Usage:
# bash unet_INFER_TF-AMP.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla --use_amp
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla --amp

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 1 GPU and trains for 40000 iterations batch_size 1. Usage:
# This script launches U-Net run in FP16 on 1 GPU and trains for 6400 iterations batch_size 8. Usage:
# bash unet_TF-AMP_1GPU.sh <path to dataset> <path to results directory>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2/log.json
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp --log_dir $2/log.json

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP16 on 8 GPUs and trains for 40000 iterations batch_size 1. Usage:
# This script launches U-Net run in FP16 on 8 GPUs and trains for 6400 iterations batch_size 8. Usage:
# bash unet_TF-AMP_8GPU.sh <path to dataset> <path to results directory>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2/log.json
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp --log_dir $2/log.json

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPU and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in FP32 on 8 GPU and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_1GPU_fold4.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla > $2/log_FP32_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla > $2/log_FP32_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla > $2/log_FP32_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla > $2/log_FP32_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla > $2/log_FP32_1GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_1GPU

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in FP32 on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in FP32 on 8 GPUs and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_8GPU_fold4.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla > $2/log_FP32_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla > $2/log_FP32_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla > $2/log_FP32_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla > $2/log_FP32_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla > $2/log_FP32_8GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_8GPU

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 1 GPU for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP32 on 8 GPUs for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 1 GPU for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla --amp

View file

@ -15,4 +15,4 @@
# This script launches U-Net run in FP16 on 8 GPUs for training benchmarking. Usage:
# bash unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --xla --amp

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in TF-AMP on 1 GPU and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in TF-AMP on 1 GPU and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold4.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold0.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold1.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold2.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold3.txt
horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla --amp > $2/log_TF-AMP_1GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_1GPU

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -12,13 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net run in TF-AMP on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
# This script launches U-Net run in TF-AMP on 8 GPUs and runs 5-fold cross-validation training for 6400 iterations.
# Usage:
# bash unet_TRAIN_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold4.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold0.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold1.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold2.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold3.txt
horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --xla --amp > $2/log_TF-AMP_8GPU_fold4.txt
python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_8GPU

Binary file not shown.

After

Width:  |  Height:  |  Size: 35 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 11 KiB

View file

@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@ -23,66 +23,26 @@ Example:
"""
import os
import horovod.tensorflow as hvd
import tensorflow as tf
from model.unet import Unet
from run import train, evaluate, predict, restore_checkpoint
from utils.cmd_util import PARSER, _cmd_params
from run import train, evaluate, predict
from utils.setup import get_logger, set_flags, prepare_model_dir
from utils.cmd_util import PARSER, parse_args
from utils.data_loader import Dataset
from dllogger.logger import Logger, StdOutBackend, JSONStreamBackend, Verbosity
def main():
"""
Starting point of the application
"""
flags = PARSER.parse_args()
params = _cmd_params(flags)
backends = [StdOutBackend(Verbosity.VERBOSE)]
if params.log_dir is not None:
backends.append(JSONStreamBackend(Verbosity.VERBOSE, params.log_dir))
logger = Logger(backends)
# Optimization flags
os.environ['CUDA_CACHE_DISABLE'] = '0'
os.environ['HOROVOD_GPU_ALLREDUCE'] = 'NCCL'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = 'data'
os.environ['TF_ADJUST_HUE_FUSED'] = 'data'
os.environ['TF_ADJUST_SATURATION_FUSED'] = 'data'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = 'data'
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
hvd.init()
params = parse_args(PARSER.parse_args())
set_flags(params)
model_dir = prepare_model_dir(params)
params.model_dir = model_dir
logger = get_logger(params)
if params.use_xla:
tf.config.optimizer.set_jit(True)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
if params.use_amp:
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')
else:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '0'
# Build the model
model = Unet()
dataset = Dataset(data_dir=params.data_dir,
@ -98,12 +58,10 @@ def main():
if 'evaluate' in params.exec_mode:
if hvd.rank() == 0:
model = restore_checkpoint(model, params.model_dir)
evaluate(params, model, dataset, logger)
if 'predict' in params.exec_mode:
if hvd.rank() == 0:
model = restore_checkpoint(model, params.model_dir)
predict(params, model, dataset, logger)

View file

@ -179,7 +179,7 @@ class OutputBlock(tf.keras.Model):
activation=tf.nn.relu)
self.conv3 = tf.keras.layers.Conv2D(filters=n_classes,
kernel_size=(1, 1),
activation=tf.nn.relu)
activation=None)
def call(self, inputs, residual_input):
out = _crop_and_concat(inputs, residual_input)

View file

@ -56,4 +56,4 @@ class Unet(tf.keras.Model):
out = up_block(out, skip_connections.pop())
out = self.output_block(out, skip_connections.pop())
return tf.keras.activations.softmax(out, axis=-1)
return out

View file

@ -23,14 +23,6 @@ from utils.losses import partial_losses
from utils.parse_results import process_performance_stats
def restore_checkpoint(model, model_dir):
try:
model.load_weights(os.path.join(model_dir, "checkpoint"))
except:
print("Failed to load checkpoint, model will have randomly initialized weights.")
return model
def train(params, model, dataset, logger):
np.random.seed(params.seed)
tf.random.set_seed(params.seed)
@ -42,6 +34,9 @@ def train(params, model, dataset, logger):
ce_loss = tf.keras.metrics.Mean(name='ce_loss')
f1_loss = tf.keras.metrics.Mean(name='dice_loss')
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
if params.resume_training:
checkpoint.restore(tf.train.latest_checkpoint(params.model_dir))
@tf.function
def train_step(features, labels, warmup_batch=False):
@ -83,10 +78,9 @@ def train(params, model, dataset, logger):
break
timestamps = np.mean(timestamps, axis=0)
if hvd.rank() == 0:
throughput_imgps, latency_ms = process_performance_stats(timestamps, params)
stats = process_performance_stats(timestamps, params)
logger.log(step=(),
data={"throughput_train": throughput_imgps,
"latency_train": latency_ms})
data={metric: value for (metric, value) in stats})
else:
for iteration, (images, labels) in enumerate(dataset.train_fn()):
train_step(images, labels, warmup_batch=iteration == 0)
@ -102,13 +96,16 @@ def train(params, model, dataset, logger):
if iteration >= max_steps:
break
if hvd.rank() == 0:
model.save_weights(os.path.join(params.model_dir, "checkpoint"))
checkpoint.save(file_prefix=os.path.join(params.model_dir, "checkpoint"))
logger.flush()
def evaluate(params, model, dataset, logger):
ce_loss = tf.keras.metrics.Mean(name='ce_loss')
f1_loss = tf.keras.metrics.Mean(name='dice_loss')
checkpoint = tf.train.Checkpoint(model=model)
checkpoint.restore(tf.train.latest_checkpoint(params.model_dir)).expect_partial()
@tf.function
def validation_step(features, labels):
@ -132,10 +129,12 @@ def evaluate(params, model, dataset, logger):
def predict(params, model, dataset, logger):
checkpoint = tf.train.Checkpoint(model=model)
checkpoint.restore(tf.train.latest_checkpoint(params.model_dir)).expect_partial()
@tf.function
def prediction_step(features):
return model(features, training=False)
return tf.nn.softmax(model(features, training=False), axis=-1)
if params.benchmark:
assert params.max_steps > params.warmup_steps, \
@ -147,10 +146,9 @@ def predict(params, model, dataset, logger):
timestamps[iteration] = time() - t0
if iteration >= params.max_steps:
break
throughput_imgps, latency_ms = process_performance_stats(timestamps, params)
stats = process_performance_stats(timestamps, params)
logger.log(step=(),
data={"throughput_test": throughput_imgps,
"latency_test": latency_ms})
data={metric: value for (metric, value) in stats})
else:
predictions = np.concatenate([prediction_step(images).numpy()
for images in dataset.test_fn(count=1)], axis=0)

View file

@ -89,20 +89,20 @@ PARSER.add_argument('--benchmark', dest='benchmark', action='store_true',
PARSER.add_argument('--no-benchmark', dest='benchmark', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
PARSER.add_argument('--use_amp', '--amp', dest='use_amp', action='store_true',
help="""Train using TF-AMP""")
PARSER.set_defaults(use_amp=False)
PARSER.add_argument('--use_xla', dest='use_xla', action='store_true',
PARSER.add_argument('--use_xla', '--xla', dest='use_xla', action='store_true',
help="""Train using XLA""")
PARSER.set_defaults(use_xla=False)
PARSER.add_argument('--use_trt', dest='use_trt', action='store_true',
help="""Use TF-TRT""")
PARSER.set_defaults(use_trt=False)
PARSER.add_argument('--resume_training', dest='resume_training', action='store_true',
help="""Resume training from a checkpoint""")
def _cmd_params(flags):
def parse_args(flags):
return Munch({
'exec_mode': flags.exec_mode,
'model_dir': flags.model_dir,
@ -121,4 +121,5 @@ def _cmd_params(flags):
'use_amp': flags.use_amp,
'use_trt': flags.use_trt,
'use_xla': flags.use_xla,
'resume_training': flags.resume_training,
})

View file

@ -151,6 +151,19 @@ class Dataset:
return inputs, labels
@tf.function
def _preproc_eval_samples(self, inputs, labels):
"""Preprocess samples and perform random augmentations"""
inputs = self._normalize_inputs(inputs)
labels = self._normalize_labels(labels)
# Bring back labels to network's output size and remove interpolation artifacts
labels = tf.image.resize_with_crop_or_pad(labels, target_width=388, target_height=388)
cond = tf.less(labels, 0.5 * tf.ones(tf.shape(input=labels)))
labels = tf.where(cond, tf.zeros(tf.shape(input=labels)), tf.ones(tf.shape(input=labels)))
return (inputs, labels)
def train_fn(self, drop_remainder=False):
"""Input function for training"""
dataset = tf.data.Dataset.from_tensor_slices(
@ -170,7 +183,7 @@ class Dataset:
dataset = tf.data.Dataset.from_tensor_slices(
(self._val_images, self._val_masks))
dataset = dataset.repeat(count=count)
dataset = dataset.map(self._preproc_samples,
dataset = dataset.map(self._preproc_eval_samples,
num_parallel_calls=multiprocessing.cpu_count())
dataset = dataset.batch(self._batch_size, drop_remainder=drop_remainder)
dataset = dataset.prefetch(self._batch_size)

View file

@ -35,5 +35,7 @@ def partial_losses(predict, target):
crossentropy_loss = tf.reduce_mean(input_tensor=tf.keras.backend.binary_crossentropy(output=flat_logits,
target=flat_labels),
name='cross_loss_ref')
dice_loss = tf.reduce_mean(input_tensor=1 - dice_coef(flat_logits, flat_labels), name='dice_loss_ref')
dice_loss = tf.reduce_mean(input_tensor=1 - dice_coef(tf.keras.activations.softmax(flat_logits, axis=-1),
flat_labels), name='dice_loss_ref')
return crossentropy_loss, dice_loss

View file

@ -26,12 +26,13 @@ def process_performance_stats(timestamps, params):
std = timestamps_ms.std()
n = np.sqrt(len(timestamps_ms))
throughput_imgps = (1000.0 * batch_size / timestamps_ms).mean()
print('Throughput Avg:', round(throughput_imgps, 3), 'img/s')
print('Latency Avg:', round(latency_ms, 3), 'ms')
stats = [("Throughput Avg", str(throughput_imgps)),
('Latency Avg:', str(latency_ms))]
for ci, lvl in zip(["90%:", "95%:", "99%:"],
[1.645, 1.960, 2.576]):
print("Latency", ci, round(latency_ms + lvl * std / n, 3), "ms")
return float(throughput_imgps), float(latency_ms)
stats.append(("Latency_"+ci, str(latency_ms + lvl * std / n)))
return stats
def parse_convergence_results(path, environment):
@ -42,14 +43,14 @@ def parse_convergence_results(path, environment):
raise FileNotFoundError("No logfile found at {}".format(path))
for logfile in logfiles:
with open(os.path.join(path, logfile), "r") as f:
content = f.readlines()
if "eval_dice_score" not in content[-1]:
content = f.readlines()[-1]
if "eval_dice_score" not in content:
print("Evaluation score not found. The file", logfile, "might be corrupted.")
continue
dice_scores.append(float([val for val in content[-1].split()
if "eval_dice_score" in val][0].split(":")[1]))
ce_scores.append(float([val for val in content[-1].split()
if "eval_ce_loss" in val][0].split(":")[1]))
dice_scores.append(float([val for val in content.split(" ")
if "eval_dice_score" in val][0].split()[-1]))
ce_scores.append(float([val for val in content.split(" ")
if "eval_ce_loss" in val][0].split()[-1]))
if dice_scores:
print("Evaluation dice score:", sum(dice_scores) / len(dice_scores))
print("Evaluation cross-entropy loss:", sum(ce_scores) / len(ce_scores))

View file

@ -0,0 +1,72 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
import tensorflow as tf
import dllogger as logger
import horovod.tensorflow as hvd
from dllogger import StdOutBackend, Verbosity, JSONStreamBackend
def set_flags(params):
os.environ['CUDA_CACHE_DISABLE'] = '1'
os.environ['HOROVOD_GPU_ALLREDUCE'] = 'NCCL'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '0'
os.environ['TF_ADJUST_HUE_FUSED'] = '1'
os.environ['TF_ADJUST_SATURATION_FUSED'] = '1'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
np.random.seed(params.seed)
tf.random.set_seed(params.seed)
if params.use_xla:
tf.config.optimizer.set_jit(True)
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
if params.use_amp:
tf.keras.mixed_precision.experimental.set_policy('mixed_float16')
else:
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '0'
def prepare_model_dir(params):
model_dir = os.path.join(params.model_dir, "model_checkpoint")
model_dir = model_dir if (hvd.rank() == 0 and not params.benchmark) else None
if model_dir is not None:
os.makedirs(model_dir, exist_ok=True)
if ('train' in params.exec_mode) and (not params.resume_training):
os.system('rm -rf {}/*'.format(model_dir))
return model_dir
def get_logger(params):
backends = []
if hvd.rank() == 0:
backends += [StdOutBackend(Verbosity.VERBOSE)]
if params.log_dir:
backends += [JSONStreamBackend(Verbosity.VERBOSE, params.log_dir)]
logger.init(backends=backends)
return logger