# UNet Industrial Defect Segmentation for TensorFlow
This repository provides a script and recipe to train UNet Industrial to achieve state-of-the-art
accuracy on the DAGM2007 dataset. The content is tested and maintained by NVIDIA.
## Table of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Enabling TF32](#enabling-tf32)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Training stability results](#training-stability-results)
* [Training stability: NVIDIA DGX A100 (8x A100 40GB)](#training-stability-nvidia-dgx-a100-8x-a100-40gb)
* [Training stability: NVIDIA DGX-1 (8x V100 16GB)](#training-stability-nvidia-dgx-1-8x-v100-16gb)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
This UNet model is adapted from the original [UNet model](https://arxiv.org/abs/1505.04597), which is
a convolutional auto-encoder for 2D image segmentation. UNet was first introduced by
Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper:
[UNet: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597).
This work proposes a modified version of UNet, called `TinyUNet`, which performs efficiently and with very high accuracy
on the industrial anomaly dataset [DAGM2007](https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html).
*TinyUNet*, like the original *UNet*, is composed of two parts:
- an encoding sub-network (left side)
- a decoding sub-network (right side)
The encoding sub-network repeatedly applies 3 downsampling blocks, each composed of two 2D convolutions followed by a 2D max pooling
layer. The decoding sub-network applies 3 upsampling blocks, each composed of an upsample2D
layer followed by a 2D convolution, a concatenation with the corresponding residual connection, and two 2D convolutions.
`TinyUNet` was introduced to reduce the model capacity, which was leading to a high degree of over-fitting on a
small dataset like DAGM2007. The complete architecture is presented in the figure below:
![UnetModel](images/unet.png)
Figure 1. Architecture of the UNet Industrial model
### Default configuration
This model trains for 2500 epochs, with the following setup:
* Global Batch Size: 16
* Optimizer: RMSProp
    * decay: 0.9
    * momentum: 0.8
    * centered: True
* Learning Rate Schedule: Exponential Step Decay
    * decay: 0.8
    * steps: 500
    * initial learning rate: 1e-4
* Weight Initialization: He Uniform Distribution (introduced by [Kaiming He et al. in 2015](https://arxiv.org/abs/1502.01852) to address issues related to ReLU activations in deep neural networks)
* Loss Function: Adaptive
    * Binary Cross Entropy is used while the DICE Loss is still above 0.3
    * The DICE Loss is used once it drops below 0.3 (see [Training process](#training-process))
* Data Augmentation
    * Random Horizontal Flip (50% chance)
    * Random Rotation 90°
* Activation Functions:
    * ReLU is used for all layers
    * Sigmoid is used at the output to ensure that the outputs are in the range [0, 1]
* Weight decay: 1e-5
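For reference, the optimizer and learning-rate schedule above could be expressed roughly as follows with the TensorFlow 1.x API; this is an illustrative sketch with hypothetical variable names, not the repository's actual code:
```python
import tensorflow as tf

# Illustrative sketch of the default optimizer and learning-rate schedule
# listed above (TensorFlow 1.x API); not taken from this repository's code.
global_step = tf.train.get_or_create_global_step()

learning_rate = tf.train.exponential_decay(
    learning_rate=1e-4,     # initial learning rate
    global_step=global_step,
    decay_steps=500,        # decay every 500 steps
    decay_rate=0.8,         # exponential step decay factor
    staircase=True)

optimizer = tf.train.RMSPropOptimizer(
    learning_rate=learning_rate,
    decay=0.9,
    momentum=0.8,
    centered=True)
```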
### Feature support matrix
The following features are supported by this model.
| **Feature** | **UNet Industrial** |
|---------------------------------|-----|
| Automatic mixed precision (AMP) | Yes |
| Horovod Multi-GPU (NCCL) | Yes |
| Accelerated Linear Algebra (XLA)| Yes |
#### Features
**Automatic Mixed Precision (AMP)**
This implementation of UNet uses AMP to implement mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
**Horovod**
Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
**Multi-GPU training with Horovod**
Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see the [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
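As a rough illustration, the standard Horovod pattern for TensorFlow 1.x looks like the sketch below; the actual integration lives in this repository's sources, and the names here are illustrative:
```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod and pin each process to a single GPU.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so that gradients are averaged across GPUs with NCCL.
optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-4 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial variable states from rank 0 to all other processes.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```
Such a script is typically launched with `horovodrun -np <num_gpus> python main.py ...` or an equivalent `mpirun` command.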
**XLA support (experimental)**
XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. The results are improvements in speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled.
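In TensorFlow 1.x, XLA JIT compilation can typically be requested at the session level, as in the hedged sketch below; this repository exposes it through the `--xla` flag instead:
```python
import tensorflow as tf

# Enable XLA JIT compilation for the whole session (TensorFlow 1.x style).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    pass  # build and run the graph as usual
```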
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
#### Enabling mixed precision
This implementation exploits the TensorFlow Automatic Mixed Precision feature. To enable mixed precision training, the following environment variable must be defined with the correct value before training starts:
```
TF_ENABLE_AUTO_MIXED_PRECISION=1
```
Exporting this variable ensures that loss scaling is performed correctly and automatically.
Supplying the `--amp` flag to the `main.py` script sets this variable to the correct value for mixed precision training:
```python
if params.use_amp:
    os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
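For reference, TensorFlow 1.14+ also exposes the same automatic mixed precision graph rewrite, including automatic loss scaling, through an optimizer wrapper; the snippet below is a hedged sketch of that alternative, not the mechanism used by this repository:
```python
import tensorflow as tf

# Alternative to the environment variable: wrap an existing optimizer so the
# graph is rewritten for FP16 compute with automatic (dynamic) loss scaling.
optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-4)
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)
```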
#### Enabling TF32
TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
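If you need to compare against strict FP32 math, TF32 can be turned off via the `NVIDIA_TF32_OVERRIDE` environment variable honored by NVIDIA's CUDA libraries; a minimal sketch, assuming it is set before any CUDA work starts:
```python
import os

# Assumption: setting NVIDIA_TF32_OVERRIDE=0 before the first CUDA call
# disables TF32 in cuBLAS/cuDNN, forcing strict FP32 math on Ampere GPUs.
os.environ['NVIDIA_TF32_OVERRIDE'] = '0'
```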
## Setup
The following section lists the requirements in order to start training the UNet Industrial model.
### Requirements
This repository contains a Dockerfile which extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- TensorFlow 20.06-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
- GPU-based architecture:
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
- [Running TensorFlow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
For those unable to use the TensorFlow NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the UNet model on the [DAGM2007 dataset](https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html). These steps enable you to build the UNet TensorFlow NGC container, train and evaluate your model, and generate predictions on the test data. Furthermore, you can then choose to:
* compare your evaluation accuracy with our [Training accuracy results](#training-accuracy-results),
* compare your training performance with our [Training performance benchmark](#training-performance-benchmark),
* compare your inference performance with our [Inference performance benchmark](#inference-performance-benchmark).
For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Industrial
```
2. Build the UNet TensorFlow NGC container.
```bash
# Build the docker container
docker build . --rm -t unet_industrial:latest
```
3. Start an interactive session in the NGC container to run preprocessing/training/inference.
```bash
# make a directory for the dataset, for example ./dataset
mkdir <path/to/dataset/directory>
# make a directory for results, for example ./results
mkdir <path/to/results/directory>
# start the container with nvidia-docker
nvidia-docker run -it --rm --gpus all \
--shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
-v <path/to/dataset/directory>:/data/ \
-v <path/to/result/directory>:/results \
unet_industrial:latest
```
4. Download and preprocess the dataset: DAGM2007
To download and preprocess the dataset, execute the following:
```bash
./download_and_preprocess_dagm2007.sh /data
```
**Important information:** Some files of the dataset require an account to be downloaded; the script will prompt you to download them manually and place them in the correct directory.
5. Start training.
To run training for a default configuration (as described in Default configuration, for example 1/4/8 GPUs,
FP32/TF-AMP), launch one of the scripts in the `./scripts` directory called
`./scripts/UNet{_AMP}_{1, 4, 8}GPU.sh`
Each of the scripts requires three parameters:
* path to the results directory of the model as the first argument
* path to the dataset as a second argument
* class ID from DAGM2007 used (between 1 and 10)
For example, for class 1:
```bash
cd scripts/
./UNet_1GPU.sh /results /data 1
```
6. Run evaluation
Model evaluation on a checkpoint can be launched by running one of the scripts in the `./scripts` directory
called `./scripts/UNet{_AMP}_EVAL.sh`.
Each of the scripts requires three parameters:
* path to the results directory of the model as the first argument
* path to the dataset as a second argument
* class ID from DAGM2007 used (between 1 and 10)
For example, for class 1:
```bash
cd scripts/
./UNet_EVAL.sh /results /data 1
```
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
```bash
python main.py --help
```
The following flags can be used to tune different aspects of the training:
general
-------
`--exec_mode=train_and_evaluate` Which execution mode to run the model in.
`--iter_unit=batch` Whether the model runs for X batches or X epochs.
`--num_iter=2500` Number of iterations to run.
`--batch_size=16` Size of each minibatch per GPU.
`--results_dir=/path/to/results` Directory in which to write training logs, summaries and checkpoints.
`--data_dir=/path/to/dataset` Directory which contains the DAGM2007 dataset.
`--dataset_name=DAGM2007` Name of the dataset used in this run (only DAGM2007 is supported at the moment).
`--dataset_classID=1` Class ID to consider when training or evaluating the network (used for DAGM2007).
model
-----
`--amp` Enable Automatic Mixed Precision to speed up FP32 computation using Tensor Cores.
`--xla` Enable TensorFlow XLA to maximize performance.
`--use_auto_loss_scaling` Use automatic loss scaling in TF-AMP.
#### Dataset guidelines
The UNet model was trained with the [Weakly Supervised Learning for Industrial Optical Inspection (DAGM 2007)](https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html) dataset.
> The competition is inspired by problems from industrial image processing. In order to satisfy their customers' needs, companies have to guarantee the quality of their products, which can often be achieved only by inspection of the finished product. Automatic visual defect detection has the potential to reduce the cost of quality assurance significantly.
>
> The competitors have to design a stand-alone algorithm which is able to detect miscellaneous defects on various background textures.
>
> The particular challenge of this contest is that the algorithm must learn, without human intervention, to discern defects automatically from a weakly labeled (i.e., labels are not exact to the pixel level) training set, the exact characteristics of which are unknown at development time. During the competition, the programs have to be trained on new data without any human guidance.
**Source:** https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html
> The provided data is artificially generated, but similar to real world problems. It consists of multiple data sets, each consisting of 1000 images showing the background texture without defects, and of 150 images with one labeled defect each on the background texture. The images in a single data set are very similar, but each data set is generated by a different texture model and defect model.
> Not all deviations from the texture are necessarily defects. The algorithm will need to use the weak labels provided during the training phase to learn the properties that characterize a defect.
> Below are two sample images from two data sets. In both examples, the left images are without defects; the right ones contain a scratch-shaped defect which appears as a thin dark line, and a diffuse darker area, respectively. The defects are weakly labeled by a surrounding ellipse, shown in red.
![DAGM_Examples](images/dagm2007.png)
The DAGM2007 challenge comes in two parts:
- A development set: public and available for download from
[here](https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html).
The development set contains 6 classes or sub-challenges.
- A competition set: requires an account and can be downloaded from [here](https://hci.iwr.uni-heidelberg.de/node/3616).
The competition set contains 10 classes or sub-challenges.
The challenge consists of designing a single model, with a predefined set of hyper-parameters that does not change
across the 10 different classes or sub-challenges of the competition set.
Performance is measured on the competition set, which is normalized and more complex than the public dataset,
and therefore offers the least biased evaluation.
### Training process
*Laplace Smoothing*
We use this technique in the DICE loss to improve training efficiency. It consists of replacing the very small
epsilon parameter (around +/- 1e-7, normally used to avoid division by zero) with 1. You can find more information at:
[https://en.wikipedia.org/wiki/Additive_smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
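A minimal sketch of a soft DICE loss with this smoothing term is shown below; it is illustrative only, and the repository's actual implementation may differ in details such as reduction axes:
```python
import tensorflow as tf

def dice_loss(y_pred, y_true, smoothing=1.0):
    """Soft DICE loss with Laplace smoothing: the constant 1.0 replaces the
    usual tiny epsilon in both numerator and denominator (NHWC tensors assumed)."""
    intersection = tf.reduce_sum(y_pred * y_true, axis=[1, 2, 3])
    union = tf.reduce_sum(y_pred, axis=[1, 2, 3]) + tf.reduce_sum(y_true, axis=[1, 2, 3])
    dice_coefficient = (2.0 * intersection + smoothing) / (union + smoothing)
    return 1.0 - tf.reduce_mean(dice_coefficient)
```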
*Adaptive Loss*
The DICE loss is not able to provide a meaningful gradient at initialization, which leads to instability that
often pushes the model to diverge. Nonetheless, once the model starts to converge, the DICE loss is able to very
efficiently train the model to completion. Therefore, we implemented an *adaptive loss* composed of two sub-losses:
- Binary Cross-Entropy (BCE)
- DICE Loss
The model is trained with the BCE loss until the DICE loss reaches an experimentally defined threshold (0.3).
Thereafter, the DICE loss is used to finish training.
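A hedged sketch of this switching behaviour, reusing the `dice_loss` sketched above (names and the exact switching condition are illustrative, not the repository's code):
```python
import tensorflow as tf

def adaptive_loss(y_pred, y_true, threshold=0.3):
    """BCE while the DICE loss is still high (its gradient is not yet useful),
    DICE loss once it has dropped below the threshold."""
    dice = dice_loss(y_pred, y_true)  # smoothed DICE loss from the sketch above
    bce = tf.reduce_mean(tf.keras.backend.binary_crossentropy(y_true, y_pred))
    return tf.cond(dice < threshold,
                   true_fn=lambda: dice,
                   false_fn=lambda: bce)
```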
*Weak Labelling*
This dataset is referred to as weakly labelled, meaning that the segmentation labels are not given at the pixel level
but rather in an approximate fashion.
## Performance
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
### Benchmarking
The following sections show how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance, you can run one of the scripts in the `./scripts/benchmarking/` directory
called `./scripts/benchmarking/UNet_trainbench{_AMP}_{1, 4, 8}GPU.sh`.
Each of the scripts requires two parameters:
* path to the dataset as the first argument
* class ID from DAGM2007 used (between 1 and 10)
For example:
```bash
cd scripts/benchmarking/
./UNet_trainbench_1GPU.sh /data 1
```
#### Inference performance benchmark
To benchmark the inference performance, you can run one of the scripts in the `./scripts/benchmarking/` directory
called `./scripts/benchmarking/UNet_evalbench{_AMP}.sh`.
Each of the scripts requires two parameters:
* path to the dataset as the first argument
* class ID from DAGM2007 used (between 1 and 10)
For example:
```bash
cd scripts/benchmarking/
./UNet_evalbench_AMP.sh /data 1
```
### Results
The following sections provide details on the achieved results in training accuracy, performance and inference performance.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the `./scripts/UNet{_AMP}_{1, 8}GPU.sh` training
script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
| GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 [min] | Time to train - mixed precision [min] | Time to train speedup (TF32 to mixed precision) |
|:----:|:----------------:|:---------------:|:--------------------------:|:--------------------:|:-------------------------------:|:-----------------------------------------------:|
| 1 | 16 | 0.9717 | 0.9726 | 3.6 | 2.3 | 1.57 |
| 8 | 2 | 0.9733 | 0.9683 | 4.3 | 3.5 | 1.23 |
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the `./scripts/UNet{_AMP}_{1, 8}GPU.sh` training
script in the `tensorflow:20.06-tf1-py3` NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [min] | Time to train - mixed precision [min] | Time to train speedup (FP32 to mixed precision) |
|:----:|:----------------:|:---------------:|:--------------------------:|:--------------------:|:-------------------------------:|:-----------------------------------------------:|
| 1 | 16 | 0.9643 | 0.9653 | 10 | 8 | 1.25 |
| 8 | 2 | 0.9637 | 0.9655 | 2.5 | 2.5 | 1.00 |
#### Training performance results
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the scripts
`./scripts/benchmarking/UNet_trainbench{_AMP}_{1, 4, 8}GPU.sh` training script in the
TensorFlow `20.06-tf1-py3` NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
| GPUs | Batch size / GPU | Throughput - TF32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (TF32 - mixed precision) | Strong scaling - TF32 | Strong scaling - mixed precision |
|:----:|:----------------:|:-------------------------:|:------------------------------------:|:-------------------------------------------:|:---------------------:|:--------------------------------:|
| 1 | 16 | 135.95 | 255.26 | 1.88 | - | - |
| 4 | 4 | 420.2 | 691.19 | 1.64 | 3.09 | 2.71 |
| 8 | 2 | 655.05 | 665.66 | 1.02 | 4.82 | 2.61 |
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
Our results were obtained by running the scripts
`./scripts/benchmarking/UNet_trainbench{_AMP}_{1, 4, 8}GPU.sh` training script in the
TensorFlow `20.06-tf1-py3` NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 - mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
|:----:|:----------------:|:-------------------------:|:------------------------------------:|:-------------------------------------------:|:---------------------:|:--------------------------------:|
| 1 | 16 | 86.95 | 168.54 | 1.94 | - | - |
| 4 | 4 | 287.01 | 459.07 | 1.60 | 3.30 | 2.72 |
| 8 | 2 | 474.77 | 444.13 | 0.94 | 5.46 | 2.64 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
#### Inference performance results
##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
Our results were obtained by running the scripts `./scripts/benchmarking/UNet_evalbench{_AMP}.sh`
evaluation script in the `20.06-tf1-py3` NGC container on NVIDIA DGX A100 (1x A100 40GB) GPUs.
FP16
| Batch size | Resolution | Throughput Avg [img/s] |
|:----------:|:----------:|:----------------------:|
| 1 | 512x512x1 | 247.83 |
| 8 | 512x512x1 | 761.41 |
| 16 | 512x512x1 | 823.46 |
TF32
| Batch size | Resolution | Throughput Avg [img/s] |
|:----------:|:----------:|:----------------------:|
| 1 | 512x512x1 | 227.97 |
| 8 | 512x512x1 | 419.70 |
| 16 | 512x512x1 | 424.57 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
Our results were obtained by running the scripts `./scripts/benchmarking/UNet_evalbench{_AMP}.sh`
evaluation script in the `20.06-tf1-py3` NGC container on NVIDIA DGX-1 (1x V100 16GB) GPUs.
FP16
| Batch size | Resolution | Throughput Avg [img/s] |
|:----------:|:----------:|:----------------------:|
| 1 | 512x512x1 | 157.91 |
| 8 | 512x512x1 | 438.00 |
| 16 | 512x512x1 | 469.27 |
FP32
| Batch size | Resolution | Throughput Avg [img/s] |
|:----------:|:----------:|:----------------------:|
| 1 | 512x512x1 | 159.65 |
| 8 | 512x512x1 | 243.99 |
| 16 | 512x512x1 | 250.23 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
June 2020
* Updated training and inference accuracy with A100 results
* Updated training and inference performance with A100 results
October 2019
* Jupyter notebooks added
March 2019
* Initial release
### Known issues
There are no known issues with this model.