History

Pablo Ribalta 70f3e4362c Remove dllogger and fix bugs from GH Signed-off-by: Pablo Ribalta <pribalta@nvidia.com>		2020-04-08 09:40:52 +02:00
..
datasets	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
images	Adding 9 new models: ResNet50/TF, SSD/TF, BERT/TF, NCF/TF, UNet/TF, GNMT/TF, SSD/PyT, MaskRCNN/PyT, Tacotron2+Waveglow/PyT	2019-03-18 19:45:05 +01:00
model	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
notebooks	New notebooks in UNet(industrial)/TF	2019-11-18 23:11:28 +01:00
runtime	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
scripts	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
utils	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
.gitignore	[UNet_Industrial/TF] Adding Jupyter notebooks and small fixes	2019-10-09 14:45:45 +02:00
Dockerfile	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
download_and_preprocess_dagm2007.sh	Adding 9 new models: ResNet50/TF, SSD/TF, BERT/TF, NCF/TF, UNet/TF, GNMT/TF, SSD/PyT, MaskRCNN/PyT, Tacotron2+Waveglow/PyT	2019-03-18 19:45:05 +01:00
download_and_preprocess_dagm2007_public.sh	[UNet_Industrial/TF] Adding Jupyter notebooks and small fixes	2019-10-09 14:45:45 +02:00
export_saved_model.py	[UNet_Industrial/TF] Adding Jupyter notebooks and small fixes	2019-10-09 14:45:45 +02:00
LICENSE	Adding 9 new models: ResNet50/TF, SSD/TF, BERT/TF, NCF/TF, UNet/TF, GNMT/TF, SSD/PyT, MaskRCNN/PyT, Tacotron2+Waveglow/PyT	2019-03-18 19:45:05 +01:00
main.py	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
NOTICE	Adding 9 new models: ResNet50/TF, SSD/TF, BERT/TF, NCF/TF, UNet/TF, GNMT/TF, SSD/PyT, MaskRCNN/PyT, Tacotron2+Waveglow/PyT	2019-03-18 19:45:05 +01:00
preprocess_dagm2007.py	Adding 9 new models: ResNet50/TF, SSD/TF, BERT/TF, NCF/TF, UNet/TF, GNMT/TF, SSD/PyT, MaskRCNN/PyT, Tacotron2+Waveglow/PyT	2019-03-18 19:45:05 +01:00
README.md	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00
requirements.txt	Remove dllogger and fix bugs from GH	2020-04-08 09:40:52 +02:00

README.md

U-Net Industrial Defect Segmentation for TensorFlow

This repository provides a script and recipe to train U-Net Industrial to achieve state of the art accuracy on the dataset DAGM2007, and is tested and maintained by NVIDIA.

Model overview
- Default Configuration
- Enabling mixed precision
Setup
- Requirements
Quick Start Guide
Advanced
Performance
- Benchmarking
  - Training performance benchmark
  - Inference performance benchmark
- Results
Release notes
- Changelog
- Known issues

Model overview

This U-Net model is adapted from the original version of the U-Net model which is a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation.

This work proposes a modified version of U-Net, called TinyUNet which performs efficiently and with very high accuracy on the industrial anomaly dataset DAGM2007. TinyUNet, like the original U-Net is composed of two parts:

an encoding sub-network (left-side)
a decoding sub-network (right-side).

It repeatedly applies 3 downsampling blocks composed of two 2D convolutions followed by a 2D max pooling layer in the encoding sub-network. In the decoding sub-network, 3 upsampling blocks are composed of a upsample2D layer followed by a 2D convolution, a concatenation operation with the residual connection and two 2D convolutions.

TinyUNet has been introduced to reduce the model capacity which was leading to a high degree of over-fitting on a small dataset like DAGM2007. The complete architecture is presented in the figure below:

This model trains with mixed precision tensor cores on Volta, therefore researchers can get results much faster than training without tensor cores. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

The following features were implemented in this model:

Data-parallel multi-GPU training with Horovod.
Mixed precision support with TensorFlow Automatic Mixed Precision (TF-AMP), which enables mixed precision training without any changes to the code-base by performing automatic graph rewrites and loss scaling controlled by an environmental variable.
Tensor Core operations to maximize throughput using NVIDIA Volta GPUs.
Loss auto-scaling for Tensor Cores (mixed precision) training

The following performance optimizations were implemented in this model:

XLA support (experimental)

Default Configuration

This model trains in 2500 epochs, under the following setup:

Global Batch Size: 16
Optimizer RMSProp:
- decay: 0.9
- momentum: 0.8
- centered: True
Learning Rate Schedule: Exponential Step Decay
- decay: 0.8
- steps: 500
- initial learning rate: 1e-4
Weight Initialization: He Uniform Distribution (introduced by Kaiming He et al. in 2015 to address issues related ReLU activations in deep neural networks)
Loss Function: Adaptive
- When DICE Loss < 0.3, Loss = Binary Cross Entropy
- Else, Loss = DICE Loss
Data Augmentation
- Random Horizontal Flip (50% chance)
- Random Rotation 90°
Activation Functions:
- ReLU is used for all layers
- Sigmoid is used at the output to ensure that the ouputs are between [0, 1]
Weight decay: 1e-5

Enabling mixed precision

Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of tensor cores in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:

Porting the model to use the FP16 data type where appropriate.
Manually adding loss scaling to preserve small gradient values.

This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full mixed precision methodology in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.

In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.

For information about:

How to train using mixed precision, see the Mixed Precision Training paper and Training With Mixed Precision documentation.
How to access and enable AMP for TensorFlow, see Using TF-AMP from the TensorFlow User Guide.
Techniques used for mixed precision training, see the Mixed-Precision Training of Deep Neural Networks blog.

Setup

The following section list the requirements in order to start training the U-Net model (only the TinyUnet model is proposed here).

Requirements

This repository contains Dockerfile which extends the TensorFlow NGC container and encapsulates some dependencies.
Aside from these dependencies, ensure you have the following components:

NVIDIA Docker
TensorFlow 19.12-tf1-py3 NGC container
(optional) NVIDIA Volta GPU (see section below) - for best training performance using mixed precision

For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:

Quick Start Guide

To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the default configuration of the U-Net model (only TinyUNet has been made available here) on the DAGM2007 dataset.

Clone the repository

git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNetIndustrial

Download and preprocess the dataset: DAGM2007

In order to download the dataset. You can execute the following:

./download_and_preprocess_dagm2007.sh /path/to/dataset/directory/

Important Information: Some files of the dataset require an account to be downloaded, the script will invite you to download them manually and put them in the correct directory.

Build and start the docker container based on the TensorFlow NGC container.

# Build the docker container
docker build . --rm -t unet_industrial:latest

# start the container with nvidia-docker
nvidia-docker run -it --rm \
    --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /path/to/dataset:/data/dagm2007/ \
    -v /path/to/results:/results \
    unet_industrial:latest

Run training

To run training for a default configuration (as described in Default configuration, for example 1/4/8 GPUs, FP32/TF-AMP), launch one of the scripts in the ./scripts directory called ./scripts/UNet_{FP32, AMP}_{1, 4, 8}GPU.sh

Each of the scripts requires three parameters:

path to the result directory of the model as the first argument
path to the dataset as a second argument
class ID from DAGM used (between 1-10)

For example:

cd scripts/
./UNet_FP32_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>

Run evaluation

Model evaluation on a checkpoint can be launched by running one of the scripts in the ./scripts directory called ./scripts/UNet_{FP32, AMP}_EVAL.sh.

Each of the scripts requires three parameters:

path to the result directory of the model as the first argument
path to the dataset as a second argument
class ID from DAGM used (between 1-10)

For example:

cd scripts/
./UNet_FP32_EVAL.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>

Advanced

The following sections provide greater details of the dataset, running training and inference, and the training results.

Command line options

To see the full list of available options and their descriptions, use the -h or --help command line option, for example:

python main.py --help

The following mandatory flags must be used to tune different aspects of the training:

general

--exec_mode=train_and_evaluate Which execution mode to run the model into.

--iter_unit=batch Will the model be run for X batches or X epochs ?

--num_iter=2500 Number of iterations to run.

--batch_size=16 Size of each minibatch per GPU.

--results_dir=/path/to/results Directory in which to write training logs, summaries and checkpoints.

--data_dir=/path/to/dataset Directory which contains the DAGM2007 dataset.

--dataset_name=DAGM2007 Name of the dataset used in this run (only DAGM2007 is supported atm).

--dataset_classID=1 ClassID to consider to train or evaluate the network (used for DAGM).

model

--use_tf_amp Enable Automatic Mixed Precision to speedup FP32 computation using tensor cores.

--use_xla Enable TensorFlow XLA to maximise performance.

--use_auto_loss_scaling Use AutoLossScaling in TF-AMP

Getting the data

The U-Net model was trained with the Weakly Supervised Learning for Industrial Optical Inspection (DAGM 2007) dataset.

The competition is inspired by problems from industrial image processing. In order to satisfy their customers' needs, companies have to guarantee the quality of their products, which can often be achieved only by inspection of the finished product. Automatic visual defect detection has the potential to reduce the cost of quality assurance significantly.

The competitors have to design a stand-alone algorithm which is able to detect miscellaneous defects on various background textures.

The particular challenge of this contest is that the algorithm must learn, without human intervention, to discern defects automatically from a weakly labeled (i.e., labels are not exact to the pixel level) training set, the exact characteristics of which are unknown at development time. During the competition, the programs have to be trained on new data without any human guidance.

Source: https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html

Data description

The provided data is artificially generated, but similar to real world problems. It consists of multiple data sets, each consisting of 1000 images showing the background texture without defects, and of 150 images with one labeled defect each on the background texture. The images in a single data set are very similar, but each data set is generated by a different texture model and defect model.

Not all deviations from the texture are necessarily defects. The algorithm will need to use the weak labels provided during the training phase to learn the properties that characterize a defect.

Below are two sample images from two data sets. In both examples, the left images are without defects; the right ones contain a scratch-shaped defect which appears as a thin dark line, and a diffuse darker area, respectively. The defects are weakly labeled by a surrounding ellipse, shown in red.

The DAGM2007 challenge comes in propose two different challenges:

A development set: public and available for download from here. The number of classes and sub-challenges for the development set is 6.
A competition set: which requires an account to be downloaded from here. The number of classes and sub-challenges for the competition set is 10.

Challenge description

The challenge consists in designing a single model with a set of predefined hyper-parameters which will not change across the 10 different classes or sub-challenges of the competition set.

The performance shall be measured on the competition set which is normalized and more complex that the public dataset while offering the most unbiased evaluation method.

Training Process

Laplace Smoothing

We use this technique in the DICE loss to improve the training efficiency. This technique consists in replacing the epsilon parameter (used to avoid dividing by zero and very small: +/- 1e-7) by 1. You can find more information on: https://en.wikipedia.org/wiki/Additive_smoothing

Adaptive Loss

The DICE Loss is not able to provide a meaningful gradient at initialisation. This leads to a model instability which often push the model to diverge. Nonetheless, once the model starts to converge, DICE loss is able to very efficiently fully train the model. Therefore, we implemented an adaptive loss which is composed of two sub-losses:

Binary Cross-Entropy (BCE)
DICE Loss

The model is trained with the BCE loss until the DICE Loss reach a experimentally defined threshold (0.3). Thereafter, DICE loss is used to finish training.

Weak Labelling

This dataset is referred as weakly labelled. That means that the segmentation labels are not given at the pixel level but rather in an approximate fashion.

Performance

Benchmarking

The following sections shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the inference performance, you can run one of the scripts in the ./scripts/benchmarking/ directory called ./scripts/benchmarking/DGX1v_trainbench_{FP32, AMP}_{1, 4, 8}GPU.sh.

Each of the scripts requires three parameters:

path to the result directory of the model as the first argument
path to the dataset as a second argument
class ID from DAGM used (between 1-10)

For example:

cd scripts/benchmarking/
./DGX1v_trainbench_FP32_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>

Inference performance benchmark

To benchmark the training performance, you can run one of the scripts in the ./scripts/benchmarking/ directory called ./scripts/benchmarking/DGX1v_trainbench_{FP32, AMP}_{1, 4, 8}GPU.sh.

Each of the scripts requires three parameters:

path to the result directory of the model as the first argument
path to the dataset as a second argument
class ID from DAGM used (between 1-10)

For example:

cd scripts/benchmarking/
./DGX1v_evalbench_FP16_1GPU.sh <path to result repository> <path to dataset> <dagm classID (1-10)>

Results

The following sections provide details on the achieved results in training accuracy, performance and inference performance.

Training accuracy results

Our results were obtained by running the ./scripts/UNet_{FP32, AMP}_{1, 4, 8}GPU.sh training script in the Tensorflow:19.12-tf1-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.

Threshold = 0.75

# DAGM Class ID	Precision	IoU (Intersection over Union)	TPR (True Positive Rate)	TNR (True Negative Rate)
1	FP32	0.968	100.00	97.88
1	Automatic Mixed Precision (AMP)	0.966	100.00	97.55
2	FP32	0.976	100.00	99.62
2	Automatic Mixed Precision (AMP)	0.976	100.00	99.66
3	FP32	0.979	100.00	99.64
3	Automatic Mixed Precision (AMP)	0.979	100.00	99.63
4	FP32	0.965	100.00	97.43
4	Automatic Mixed Precision (AMP)	0.960	100.00	96.85
5	FP32	0.992	100.00	99.88
5	Automatic Mixed Precision (AMP)	0.991	100.00	99.78
6	FP32	0.982	100.00	98.78
6	Automatic Mixed Precision (AMP)	0.979	100.00	98.39
7	FP32	0.980	100.00	99.58
7	Automatic Mixed Precision (AMP)	0.980	100.00	99.56
8	FP32	0.970	100.00	99.85
8	Automatic Mixed Precision (AMP)	0.971	100.00	100.00
9	FP32	0.988	99.82	99.81
9	Automatic Mixed Precision (AMP)	0.988	99.87	99.81
10	FP32	0.979	100.00	99.43
10	Automatic Mixed Precision (AMP)	0.982	100.00	99.70

Threshold = 0.85

# DAGM Class ID	Precision	IoU (Intersection over Union)	TPR (True Positive Rate)	TNR (True Negative Rate)
1	FP32	0.968	100.00	97.88
1	Automatic Mixed Precision (AMP)	0.966	100.00	97.56
2	FP32	0.976	100.00	99.62
2	Automatic Mixed Precision (AMP)	0.976	100.00	99.66
3	FP32	0.979	100.00	99.64
3	Automatic Mixed Precision (AMP)	0.977	100.00	99.33
4	FP32	0.966	100.00	97.44
4	Automatic Mixed Precision (AMP)	0.960	100.00	96.85
5	FP32	0.992	100.00	99.88
5	Automatic Mixed Precision (AMP)	0.991	100.00	99.78
6	FP32	0.982	100.00	98.80
6	Automatic Mixed Precision (AMP)	0.979	100.00	98.41
7	FP32	0.980	100.00	99.60
7	Automatic Mixed Precision (AMP)	0.980	100.00	99.58
8	FP32	0.970	100.00	99.85
8	Automatic Mixed Precision (AMP)	0.971	100.00	100.00
9	FP32	0.988	99.82	99.81
9	Automatic Mixed Precision (AMP)	0.988	99.87	99.81
10	FP32	0.980	100.00	99.45
10	Automatic Mixed Precision (AMP)	0.982	100.00	99.71

Threshold = 0.95

# DAGM Class ID	Precision	IoU (Intersection over Union)	TPR (True Positive Rate)	TNR (True Negative Rate)
1	FP32	0.969	100.00	97.93
1	Automatic Mixed Precision (AMP)	0.966	100.00	97.58
2	FP32	0.976	100.00	99.64
2	Automatic Mixed Precision (AMP)	0.976	100.00	99.66
3	FP32	0.979	100.00	99.64
3	Automatic Mixed Precision (AMP)	0.980	100.00	99.64
4	FP32	0.966	100.00	97.52
4	Automatic Mixed Precision (AMP)	0.961	100.00	96.88
5	FP32	0.992	100.00	99.88
5	Automatic Mixed Precision (AMP)	0.991	100.00	99.78
6	FP32	0.982	100.00	98.83
6	Automatic Mixed Precision (AMP)	0.980	100.00	98.58
7	FP32	0.981	100.00	99.65
7	Automatic Mixed Precision (AMP)	0.980	100.00	99.59
8	FP32	0.970	100.00	99.88
8	Automatic Mixed Precision (AMP)	0.971	100.00	100.00
9	FP32	0.988	99.82	99.83
9	Automatic Mixed Precision (AMP)	0.988	99.87	99.82
10	FP32	0.980	100.00	99.54
10	Automatic Mixed Precision (AMP)	0.982	100.00	99.72

Threshold = 0.99

# DAGM Class ID	Precision	IoU (Intersection over Union)	TPR (True Positive Rate)	TNR (True Negative Rate)
1	FP32	0.969	100.00	97.95
1	Automatic Mixed Precision (AMP)	0.967	100.00	97.66
2	FP32	0.976	100.00	99.64
2	Automatic Mixed Precision (AMP)	0.976	100.00	99.71
3	FP32	0.979	100.00	99.64
3	Automatic Mixed Precision (AMP)	0.980	100.00	99.66
4	FP32	0.967	100.00	97.55
4	Automatic Mixed Precision (AMP)	0.961	100.00	96.89
5	FP32	0.992	100.00	99.88
5	Automatic Mixed Precision (AMP)	0.991	100.00	99.78
6	FP32	0.983	100.00	98.88
6	Automatic Mixed Precision (AMP)	0.981	100.00	98.65
7	FP32	0.981	100.00	99.67
7	Automatic Mixed Precision (AMP)	0.980	100.00	99.62
8	FP32	0.970	100.00	99.92
8	Automatic Mixed Precision (AMP)	0.971	100.00	100.00
9	FP32	0.988	99.82	99.84
9	Automatic Mixed Precision (AMP)	0.988	99.87	99.82
10	FP32	0.981	100.00	99.62
10	Automatic Mixed Precision (AMP)	0.982	100.00	99.78

Training performance results

Our results were obtained by running the scripts ./scripts/benchmarking/DGX1v_trainbench_{FP32, AMP}_{1, 4, 8}GPU.sh training script in the TensorFlow 19.12-tf1-py3 NGC container on an NVIDIA DGX-1 with 8 V100 16G GPUs.

# GPUs	Precision	Throughput (Imgs/sec)	AMP Speedup	Scaling efficiency
1	FP32	92	1.00	1.00
1	Automatic Mixed Precision (AMP)	167	1.82	1.00
4	FP32	299	1.00	3.25
4	Automatic Mixed Precision (AMP)	458	1.53	2.74
8	FP32	507	1.00	5.51
8	Automatic Mixed Precision (AMP)	561	1.11	3.36

To achieve these same results, follow the Quick Start Guide outlined above.

Inference performance results

Our results were obtained by running the scripts ./scripts/benchmarking/DGX1v_evalbench_{FP32, AMP}.sh evaluation script in the 19.12-tf1-py3 NGC container on an NVIDIA DGX-1 server with 8 V100 16G GPUs.

# GPUs	Precision	Throughput (Imgs/sec)	Speedup
1	FP32	306	1.00
1	Automatic Mixed Precision (AMP)	550	1.80

To achieve these same results, follow the Quick Start Guide outlined above.

Release notes

Changelog

October 2019
- Jupyter notebooks added
March,2019
- Initial release

Known issues

There are no known issues with this model.

README.md

U-Net Industrial Defect Segmentation for TensorFlow

Table of Contents

Model overview

Default Configuration

Enabling mixed precision

Setup

Requirements

Quick Start Guide

Clone the repository

Download and preprocess the dataset: DAGM2007

Build and start the docker container based on the TensorFlow NGC container.

Run training

Run evaluation

Advanced

Command line options

general

model

Getting the data

Data description

Challenge description

Training Process

Laplace Smoothing

Adaptive Loss

Weak Labelling

Performance

Benchmarking

Training performance benchmark

Inference performance benchmark

Results

Training accuracy results

Threshold = 0.75

Threshold = 0.85

Threshold = 0.95

Threshold = 0.99

Training performance results

Inference performance results

Release notes

Changelog

Known issues