Updating models and adding BERT/PyT

Tacotron2+Waveglow/PyT
* AMP support
* Data preprocessing for Tacotron 2 training
* Fixed dropouts on LSTMCells

SSD/PyT
* script and notebook for inference
* AMP support
* README update
* updates to examples/*

BERT/PyT
* initial release

GNMT/PyT
* Default container updated to NGC PyTorch 19.05-py3
* Mixed precision training implemented using APEX AMP
* Added inference throughput and latency results on NVIDIA Tesla V100 16G
* Added option to run inference on user-provided raw input text from command line

NCF/PyT
* Updated performance tables.
* Default container changed to PyTorch 19.06-py3.
* Caching validation negatives between runs

Transformer/PyT
* new README
* JIT support added

UNet Medical/TF
* inference example scripts added
* inference benchmark measuring latency added
* TRT/TF-TRT support added
* README updated

GNMT/TF
* Performance improvements

Small updates (mostly README) for other models.
Przemek Strzelczyk 2019-07-16 21:13:08 +02:00
parent 3b3d0f6a55
commit a644350589
109 changed files with 6552 additions and 2524 deletions

View file

@ -40,7 +40,7 @@ def add_parser_arguments(parser):
parser.add_argument('data', metavar='DIR',
help='path to dataset')
parser.add_argument('--data-backend', metavar='BACKEND', default='pytorch',
parser.add_argument('--data-backend', metavar='BACKEND', default='dali-cpu',
choices=DATA_BACKEND_CHOICES)
parser.add_argument('--arch', '-a', metavar='ARCH', default='resnet50',

View file

@ -1,7 +1,9 @@
FROM nvcr.io/nvidia/pytorch:19.05-py3
# Set working directory
WORKDIR /mlperf
WORKDIR /workspace
ENV PYTHONPATH "${PYTHONPATH}:/workspace"
RUN apt-get update && apt-get install -y python3-tk python-pip git tmux htop tree

View file

@ -1,29 +1,45 @@
# SSD300 v1.1 For PyTorch
This repository provides a script and recipe to train the SSD300 v1.1 model to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Data preprocessing](#data-preprocessing)
* [Data augmentation](#data-augmentation)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g-1)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The SSD300 v1.1 model is based on the
[SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) paper, which
describes SSD as “a method for detecting objects in images using a single deep neural network”.
From the
[Speed/accuracy trade-offs for modern convolutional object detectors](https://arxiv.org/abs/1611.10012)
paper, the following enhancements were made to the backbone:
* The conv5_x, avgpool, fc and softmax layers were removed from the original classification model.
* All strides in conv4_x are set to 1x1.
This model is trained with mixed precision using Tensor Cores on NVIDIA
Volta and Turing GPUs. Therefore, researchers can get results 2x faster
than training without Tensor Cores, while experiencing the benefits of
mixed precision training. This model is tested against each NGC monthly
container release to ensure consistent accuracy and performance over time.
### Model architecture
Despite the changes described in the previous section,
the overall architecture, as described in the following diagram, has not changed.
<p align="center">
<img width="90%" src="./img/ssd_diagram.png" />
<br>
Figure 1. The architecture of a Single Shot MultiBox Detector model. Image has been taken from the <a href="https://arxiv.org/abs/1512.02325">Single Shot MultiBox Detector paper</a>.
</p>
The backbone is followed by 5 additional convolutional layers.
In addition to the convolutional layers, we attached 6 detection heads:
* The first detection head is attached to the last conv4_x layer.
* The other five detection heads are attached to the corresponding 5 additional layers.
Detector heads are similar to the ones referenced in the paper, however,
they are enhanced by additional BatchNorm layers after each convolution.
Additionally, we removed weight decay on every bias parameter and
all the BatchNorm layer parameters as described in the
[Highly Scalable Deep Learning Training System with Mixed-Precision:
Training ImageNet in Four Minutes](https://arxiv.org/abs/1807.11205) paper.
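The `main.py` diff later in this commit applies this rule through a `tencent_trick` helper when building the SGD optimizer. A sketch of what such a helper plausibly looks like (the parameter split below is inferred from the description above, not copied from the repository):
```python
def tencent_trick(model):
    """Skip weight decay for biases and BatchNorm parameters."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters are biases and BatchNorm weights/biases
        if len(param.shape) == 1 or name.endswith('.bias'):
            no_decay.append(param)
        else:
            decay.append(param)
    # the decay group inherits the optimizer's weight_decay (5e-4 here)
    return [{'params': no_decay, 'weight_decay': 0.0},
            {'params': decay}]
```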
Because of these enhancements, the SSD300 v1.1 model achieves higher accuracy.
Training of SSD requires computationally costly augmentations. To fully utilize GPUs during training, we use the [NVIDIA DALI](https://github.com/NVIDIA/DALI) library to accelerate the data preparation pipeline.
### Default configuration
We trained the model for 65 epochs with the following setup:
* SGD with momentum (0.9)
* Learning rate = 2.6e-3 * number of GPUs * (batch_size / 32)
* Learning rate decay multiplied by 0.1 before epochs 43 and 54
* We use linear warmup of the learning rate during the first epoch.
For more information, see the
[Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677) paper.
To enable warmup, provide the argument `--warmup 300`.
* Weight decay:
  * 0 for BatchNorms and biases
  * 5e-4 for other layers
**Note**: The learning rate is automatically scaled (in other words, multiplied
by the number of GPUs and multiplied by the batch size divided by 32).
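For example, with the default base learning rate, an 8-GPU run at a batch size of 64 per GPU resolves to:
```python
base_lr = 2.6e-3
n_gpus = 8
per_gpu_batch_size = 64
lr = base_lr * n_gpus * (per_gpu_batch_size / 32)  # 2.6e-3 * 8 * 2 = 0.0416
```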
### Feature support matrix
The following features are supported by this model.
| Feature | SSD300 v1.1 PyTorch |
|-----------------------|---------------------|
|Multi-GPU training with [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) | Yes |
|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes |
#### Features
Multi-GPU training with Distributed Data Parallel - Our model uses Apex's
DDP to implement efficient multi-GPU training with NCCL.
To enable multi-GPU training with DDP, you have to wrap your model
with a proper class, and change the way you launch training.
For details, see example sources in this repo or see
the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
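A minimal sketch of the wrapping step (assuming one process per GPU has already been launched, e.g. with `torch.distributed.launch`; the linear model is a stand-in for illustration):
```python
import torch
from apex.parallel import DistributedDataParallel as DDP

torch.distributed.init_process_group(backend='nccl')  # one process per GPU
model = torch.nn.Linear(10, 10).cuda()                # stand-in model
model = DDP(model)  # gradients are now all-reduced across GPUs via NCCL
```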
NVIDIA DALI - DALI is a library that accelerates data preparation pipelines.
To accelerate your input pipeline, you only need to define your data loader
with the DALI library.
For details, see example sources in this repo or see
the [DALI documentation](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html).
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in
a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740)
training offers significant computational speedup by performing operations
in half-precision format, while storing minimal information in single-precision
to retain as much information as possible in critical parts of the network.
Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores)
in the Volta and Turing architectures, significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup
on the most arithmetically intense model architectures. Using mixed precision
training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced
in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/)
in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740)
paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision
Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools
for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP)
library from [APEX](https://github.com/NVIDIA/apex) which casts variables
to half-precision upon retrieval, while storing variables in single-precision format.
Furthermore, to preserve small gradient magnitudes in backpropagation,
a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling
can be easily applied by using the `scale_loss()` method provided by AMP.
The scaling value to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler)
or fixed.
For an in-depth walkthrough on AMP, check out sample usage
[here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started).
[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
utility libraries, such as AMP, which require minimal network code changes
to leverage Tensor Cores performance.
To enable mixed precision, you can:
- Import AMP from APEX:
```
from apex import amp
```
- Initialize an AMP handle:
```
amp_handle = amp.init(enabled=True, verbose=True)
```
- Wrap your optimizer with the AMP handle:
```
optimizer = amp_handle.wrap_optimizer(optimizer)
```
- Scale loss before backpropagation (assuming loss is stored in a variable called `losses`)
- Default backpropagate for FP32:
```
losses.backward()
```
- Scale loss and backpropagate with AMP:
```
with optimizer.scale_loss(losses) as scaled_losses:
    scaled_losses.backward()
```
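The snippets above use the older `amp_handle` interface. The training code updated by this commit (see the `main.py` and `train.py` diffs below) uses the newer `amp.initialize`/`amp.scale_loss` API; a minimal end-to-end sketch of that flow, with a stand-in model, looks like this:
```python
import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# O2 runs the model in FP16 while keeping FP32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

loss = model(torch.randn(4, 10).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # gradients flow through the scaled loss
optimizer.step()
```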
### Glossary
backbone
: a part of many object detection architectures, usually pre-trained for a different,
simpler task, like classification.
input pipeline
: a set of operations performed for every item of input data before feeding the neural
network. Especially for the object detection task, the input pipeline can be complex
and computationally significant. For that reason, solutions like NVIDIA DALI emerged.
object detection
: a class of Computer Vision problems. The task of object detection is to localize
possibly multiple objects in an image and classify them. The difference between
object detection, image classification, and localization is clearly explained in the
video published as part of the [C4W3L01 course](https://www.youtube.com/watch?v=GSwYGkTfOKk).
SSD (Single Shot MultiBox Detector)
: a name for the detection model described in a [paper authored by Liu et al.](https://arxiv.org/abs/1512.02325)
ResNet (ResNet-50)
: a name for the classification model described in a [paper authored by He et al.](https://arxiv.org/abs/1512.03385)
In this repo, it is used as a backbone for SSD.
## Setup
The following section lists the requirements in order to start training the SSD300 v1.1 model.
### Requirements
This repository contains a `Dockerfile` which extends the PyTorch 19.06 NGC container
and encapsulates some dependencies. Aside from these dependencies,
ensure you have the following software:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.06-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
Documentation:
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
For those unable to use the [PyTorch 19.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
to set up the required environment or create your own container,
see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32,
perform the following steps using the default parameters of the SSD v1.1 model
on the [COCO 2017](http://cocodataset.org/#download) dataset.
For the specifics concerning training and inference,
see the [Advanced](#advanced) section.
1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Detection/SSD
```
2. Download and preprocess the dataset.
Extract the COCO 2017 dataset with `download_dataset.sh $COCO_DIR`.
Data will be downloaded to the `$COCO_DIR` directory (on the host).
3. Build the SSD300 v1.1 PyTorch NGC container.
```
docker build . -t nvidia_ssd
```
4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it --ulimit memlock=-1 --ulimit stack=67108864 -v $COCO_DIR:/coco --ipc=host nvidia_ssd
```
**Note**: the default mount point in the container is `/coco`.
5. Start training.
The `./examples` directory provides several sample scripts for various GPU settings,
which act as wrappers around the `main.py` script.
The example scripts need two arguments:
- A path to the root SSD directory.
- A path to the COCO 2017 dataset.
Remaining arguments are passed to the `main.py` script.
The `--save` flag saves the model after each epoch.
The checkpoints are stored as `./models/epoch_*.pt`.
Use `python main.py -h` to obtain the list of available options in the `main.py` script.
For example, if you want to run 8 GPU training with Tensor Core acceleration and
save checkpoints after each epoch, run:
```
bash ./examples/SSD300_FP16_8GPU.sh . /coco --save
```
6. Start validation/evaluation.
The `main.py` training script automatically runs validation during training.
The results from the validation are printed to `stdout`.
Pycocotools' open-sourced scripts provide a consistent way
to evaluate models on the COCO dataset. We are using these scripts
during validation to measure a model's performance in the AP metric.
Metrics below are evaluated using the pycocotools methodology, in the following format:
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.250
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.423
```
The metric reported in our results is present in the first row.
To evaluate a checkpointed model saved in the previous point, run:
```
python ./main.py --backbone resnet50 --mode evaluation --checkpoint ./models/epoch_*.pt --data /coco
```
7. Optionally, resume training from a checkpointed model.
```
python ./main.py --backbone resnet50 --checkpoint ./models/epoch_*.pt --data /coco
```
8. Start inference/predictions.
You can check your trained model with a Jupyter notebook provided in the examples directory.
Start by running a Docker container with a Jupyter notebook server:
```
nvidia-docker run --rm -it --ulimit memlock=-1 --ulimit stack=67108864 -v $SSD_CHECKPOINT_PATH:/checkpoints/SSD300v1.1.pt -v $COCO_PATH:/datasets/coco2017 --ipc=host -p 8888:8888 nvidia_ssd jupyter-notebook --ip 0.0.0.0 --allow-root
```
The container prints Jupyter notebook logs like this:
```
[I 16:17:58.935 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 16:17:59.769 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 16:17:59.769 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 16:17:59.770 NotebookApp] Serving notebooks from local directory: /workspace
[I 16:17:59.770 NotebookApp] The Jupyter Notebook is running at:
[I 16:17:59.770 NotebookApp] http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
[I 16:17:59.770 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 16:17:59.774 NotebookApp] No web browser found: could not locate runnable browser.
[C 16:17:59.774 NotebookApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
Or copy and paste one of these URLs:
http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
```
Use the token printed in the last line to start your notebook session.
The notebook is in `examples/inference.ipynb`, for example:
http://127.0.0.1:8888/notebooks/examples/inference.ipynb?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
## Advanced
The following sections provide greater details of the dataset,
running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
- `main.py`: the script that controls the logic of training and validation of the SSD300 v1.1 model;
- `Dockerfile`: instructions for Docker to build a container with the basic set of dependencies to run SSD300 v1.1;
- `requirements.txt`: a set of extra Python requirements for running SSD300 v1.1;
- `download_dataset.sh`: automatically downloads the COCO dataset for training.
The `src/` directory contains modules used to train and evaluate the SSD300 v1.1 model:
- `model.py`: the definition of SSD300 v1.1 model
- `data.py`: definition of input pipelines used in training and evaluation
- `train.py`: functions used to train the SSD300 v1.1 model
- `evaluate.py`: functions used to evaluate the SSD300 v1.1 model
- `coco_pipeline.py`: definition of input pipeline using NVIDIA DALI
- `coco.py`: code specific for the COCO dataset
- `logger.py`: utilities for logging
- `utils.py`: extra utility functions
The `examples/` directory contains scripts wrapping common scenarios.
### Parameters
#### The script `main.py`
The script for training and evaluating the SSD300 v1.1 model has a variety
of parameters that control these processes.
##### Common parameters
`--data`
: use it to specify where your dataset is. By default, the script will look for it
under the `/coco` directory.
`--checkpoint`
: allows you to specify the path to the pre-trained model.
`--save`
: when the flag is turned on, the script will save the trained model to disk.
`--seed`
: use it to specify the seed for RNGs.
`--amp`
: when the flag is turned on, the AMP features will be enabled.
##### Training related
`--epochs`
: the number of times the model will see every example from the training dataset.
`--evaluation`
: after this parameter, list the epochs after which evaluation should
be performed.
`--learning-rate`
: initial learning rate.
`--multistep`
: after this parameter, list the epochs after which learning rate should be decayed.
`--warmup`
: allows you to specify the number of iterations for which a linear learning-rate
warmup will be performed.
`--momentum`
: momentum argument for SGD optimizer.
`--weight-decay`
: weight decay argument for SGD optimizer.
`--batch-size`
: the number of inputs processed at once in each iteration.
`--backbone-path`
: the path to the checkpointed backbone. When it is not provided, a pre-trained model from torchvision
will be downloaded.
##### Evaluation related
`--eval-batch-size`
: the number of inputs processed at once in each evaluation iteration.
##### Utility parameters
`--help`
: displays a short description of all parameters accepted by the script.
### Command-line options
All these parameters can be controlled by passing command-line arguments
to the `main.py` script. To get a complete list of all command-line arguments
with descriptions and default values you can run:
```
python main.py --help
```
### Getting the data
The SSD model was trained on the COCO 2017 dataset. The [val2017](http://cocodataset.org/#download) validation set
was used as a validation dataset. PyTorch can work directly on JPEGs,
therefore, preprocessing/augmentation is not needed.
This repository contains the `download_dataset.sh` download script which will automatically
download and preprocess the training, validation and test datasets. By default,
data will be downloaded to the `/coco` directory.
#### Dataset guidelines
Our model expects the input data to be organized the way the `download_dataset.sh` script lays out the COCO dataset.
`train2017` and `val2017` directories should contain images in JPEG format.
Annotation format is described in [the COCO documentation](http://cocodataset.org/#format-data).
The preprocessing of the data is defined in the `src/coco_pipeline.py` module.
##### Data preprocessing
Before we feed data to the model, both during training and inference, we perform:
* JPEG decoding
* normalization with a mean = `[0.485, 0.456, 0.406]` and std dev = `[0.229, 0.224, 0.225]`
* encoding bounding boxes
* resizing to 300x300
Additionally, during training, data is:
* randomly shuffled
* filtered to skip samples without annotations
A sketch of the core preprocessing steps appears after this list.
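Assuming torchvision as a stand-in (the repository itself implements the pipeline with DALI in `src/coco_pipeline.py`; bounding-box encoding is omitted, and the file name is a placeholder), the resize-and-normalize steps look roughly like:
```python
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((300, 300)),
    T.ToTensor(),  # JPEG decoding happens in Image.open below
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
tensor = preprocess(Image.open('example.jpg').convert('RGB'))
```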
##### Data augmentation
During training we perform the following augmentation techniques:
* Random crop using the algorithm described in the [SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) paper
* Random horizontal flip
* Color jitter
### Training process
Training the SSD model is implemented in the `main.py` script.
By default, training runs for 65 epochs. Because evaluation is relatively time consuming,
it does not run after every epoch. With default settings, evaluation is executed after epochs:
21, 31, 37, 42, 48, 53, 59, 64. The model is evaluated using pycocotools distributed with
the COCO dataset.
Which epochs should be evaluated can be reconfigured with the `--evaluation` argument.
To run training with Tensor Cores, use the `--amp` flag when running the `main.py` script.
The `--save` flag enables storing checkpoints after each epoch under `./models/epoch_*.pt`.
### Inference process
Our scripts for SSD300 v1.1 present two ways to run inference.
To get meaningful results, you need a pre-trained model checkpoint.
One way is to run an interactive session in a Jupyter notebook, as described in the [Quick Start Guide](#8-start-inferencepredictions).
Another way is to run the `src/SSD300_inference.py` script. It contains the logic from the notebook, wrapped into a Python script, and includes sample usage.
To use the inference example script in your own code, you can call the `main` function, providing input image URIs as an argument. The result will be a list of detections for each input image.
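For instance, a hypothetical call from your own code could look like the following (the checkpoint path and image URL are placeholders; `main` returns one `[bboxes, classes, confidences]` triple per input image, as the script's own `__main__` block shows):
```python
from src.SSD300_inference import main

detections = main(
    checkpoint_path='/checkpoints/SSD300v1.1.pt',
    imgs=['http://images.cocodataset.org/val2017/000000397133.jpg'],
)
bboxes, classes, confidences = detections[0]
```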
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
The training benchmark was run in various scenarios on a V100 16G GPU. For each scenario, the batch size was set to 32. The benchmark does not require a checkpoint from a fully trained model.
To benchmark training, run:
```
python -m torch.distributed.launch --nproc_per_node={NGPU} \
--mode benchmark-training \
--benchmark-warmup 100 \
--benchmark-iterations 200 \
{AMP} \
--data {data}
```
Where the `{NGPU}` selects number of GPUs used in benchmark, the `{bs}` is the desired
batch size, the `{AMP}` is set to `--amp` if you want to benchmark training with
Tensor Cores, and the `{data}` is the location of the COCO 2017 dataset.
`--benchmark-warmup` is specified to omit the first iteration of the first epoch.
`--benchmark-iterations` is the number of iterations used to measure performance.
#### Inference performance benchmark
The inference benchmark was run on a single V100 16G GPU. To benchmark inference, run:
```
python main.py --eval-batch-size {bs} \
--mode benchmark-inference \
--benchmark-warmup 100 \
--benchmark-iterations 200 \
{AMP} \
--data {data}
```
Where the `{bs}` is the desired batch size, the `{AMP}` is set to `--amp` if you want to benchmark inference with Tensor Cores, and the `{data}` is the location of the COCO 2017 dataset.
`--benchmark-warmup` is specified to omit the first iterations of the first epoch. `--benchmark-iterations` is the number of iterations used to measure performance.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./examples/SSD300_FP{16,32}_{1,4,8}GPU.sh`
script in the `pytorch-19.06-py3` NGC container on NVIDIA DGX-1 with 8x
V100 16G GPUs. The batch size was set to best utilize GPU memory: 32 for
FP32 precision, 64 for mixed precision.
| **Number of GPUs** | **Mixed precision mAP** | **Training time with mixed precision** | **FP32 mAP** | **Training time with FP32** |
|:------------------:|:------------------------:|:-------------------------------------:|:------------:|:---------------------------:|
Here are example graphs of FP32 and FP16 training on an 8 GPU configuration:
![ValidationAccuracy](./img/validation_accuracy.png)
#### Training performance results
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `main.py` script with the `--mode
benchmark-training` flag in the `pytorch-19.06-py3` NGC container on NVIDIA
DGX-1 with 8x V100 16G GPUs. Performance numbers (in items/images per second)
were averaged over an entire training epoch.
| **Number of GPUs** | **Batch size per GPU** | **Mixed precision img/s (median)** | **FP32 img/s (median)** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with mixed precision** | **Multi-gpu weak scaling with FP32** |
|:------------------:|:----------------------:|:----------------------------------:|:-----------------------:|:---------------------------------:|:-----------------------------------------------:|:------------------------------------:|
| 4 | 32 | 838.457 | 397.797 | 2.11 | 3.86 | 3.88 |
| 8 | 32 | 1639.843 | 789.695 | 2.08 | 7.56 | 7.70 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
#### Inference performance results
##### NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `main.py` script with the `--mode
benchmark-inference` flag in the `pytorch-19.06-py3` NGC container on NVIDIA
DGX-1 with 1x V100 16G GPU.
| **Batch size** | **Mixed precision img/s (median)** | **FP32 img/s (median)** |
|:--------------:|:----------------------------------:|:-----------------------:|
| 16 | 470.10 | 280.57 |
| 32 | 520.54 | 302.43 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
July 2019
* Script and notebook for inference
* Use AMP instead of hand-crafted FP16 support
* README update
* Introduced a parameter with a path to the custom backbone checkpoint
* Minor enhancements of `examples/*` scripts
* Alignment to changes in PyTorch 19.06
May 2019
* Test scripts updated
March 2019
* Initial release
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1,44 @@
import numpy as np
import skimage.io
import skimage.transform


def load_image(image_path):
    """Code from Loading_Pretrained_Models.ipynb - a Caffe2 tutorial"""
    img = skimage.img_as_float(skimage.io.imread(image_path))
    if len(img.shape) == 2:
        # grayscale image: replicate the single channel to get HWC
        img = np.array([img, img, img]).swapaxes(0, 2)
    return img


def rescale(img, input_height, input_width):
    """Code from Loading_Pretrained_Models.ipynb - a Caffe2 tutorial"""
    aspect = img.shape[1] / float(img.shape[0])
    if aspect > 1:
        # landscape orientation - wide image
        res = int(aspect * input_height)
        img_scaled = skimage.transform.resize(img, (input_width, res))
    elif aspect < 1:
        # portrait orientation - tall image
        res = int(input_width / aspect)
        img_scaled = skimage.transform.resize(img, (res, input_height))
    else:
        img_scaled = skimage.transform.resize(img, (input_width, input_height))
    return img_scaled


def crop_center(img, cropx, cropy):
    """Code from Loading_Pretrained_Models.ipynb - a Caffe2 tutorial"""
    y, x, c = img.shape
    startx = x // 2 - (cropx // 2)
    starty = y // 2 - (cropy // 2)
    return img[starty:starty + cropy, startx:startx + cropx]


def normalize(img, mean=128, std=128):
    # img_as_float gives values in [0, 1]; map them to roughly [-1, 1]
    img = (img * 256 - mean) / std
    return img


def prepare_input(img_uri):
    img = load_image(img_uri)
    img = rescale(img, 300, 300)
    img = crop_center(img, 300, 300)
    img = normalize(img)
    return img
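A quick usage sketch for `prepare_input` (the URL is a placeholder; `skimage.io.imread` accepts both local paths and URLs):
```python
img = prepare_input('http://images.cocodataset.org/val2017/000000397133.jpg')
print(img.shape)  # (300, 300, 3); values roughly in [-1, 1] after normalize()
```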

View file

@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 1 GPUs using 64 batch size
# Usage bash SSD300_FP16_1GPU.sh <path to this repository> <path to dataset> <additional flags>
python $1/main.py --backbone resnet50 --warmup 300 --bs 64 --fp16 --data $2 ${@:3}
python $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}

View file

@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 4 GPUs using 256 batch size (64 per GPU)
# Usage ./SSD300_FP16_4GPU.sh <path to this repository> <path to dataset> <additional flags>
python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --fp16 --data $2 ${@:3}
python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}

View file

@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 8 GPUs using 512 batch size (64 per GPU)
# Usage ./SSD300_FP16_8GPU.sh <path to this repository> <path to dataset> <additional flags>
python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --fp16 --data $2 ${@:3}
python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}

View file

@ -1,4 +1,4 @@
# This script evaluates SSD300 model in FP16 using 32 batch size on 1 GPU
# Usage: ./SSD300_FP16_EVAL.sh <path to this repository> <path to dataset> <path to checkpoint> <additional flags>
python $1/main.py --backbone resnet50 --fp16 --ebs 32 --data $2 --mode evaluation --checkpoint $3 ${@:4}
python $1/main.py --backbone resnet50 --amp --ebs 32 --data $2 --mode evaluation --checkpoint $3 ${@:4}

View file

@ -1,4 +1,4 @@
# This script launches SSD300 inference benchmark in FP16 on 1 GPU with 64 batch size
# Usage bash SSD300_FP16_INFERENCE_BENCHMARK.sh <path to this repository> <path to dataset> <additional flags>
python $1/main.py --backbone resnet50 --mode benchmark-inference --bs 64 --fp16 --data $2 ${@:3}
python $1/main.py --backbone resnet50 --mode benchmark-inference --bs 64 --amp --data $2 ${@:3}

View file

@ -0,0 +1,82 @@
import torch
import numpy as np
from apex.fp16_utils import network_to_half
from dle.inference import prepare_input
from src.model import SSD300, ResNet
from src.utils import dboxes300_coco, Encoder


def load_checkpoint(model, model_file):
    cp = torch.load(model_file)['model']
    # strip the 'module.1.' prefix left by the training-time model wrapping
    cp = {k.replace('module.1.', ''): cp[k] for k in cp}
    model.load_state_dict(cp)


def build_predictor(model_file, backbone='resnet50'):
    ssd300 = SSD300(backbone=ResNet(backbone))
    load_checkpoint(ssd300, model_file)
    return ssd300


def prepare_model(checkpoint_path):
    ssd300 = build_predictor(checkpoint_path)
    ssd300 = ssd300.cuda()
    ssd300 = network_to_half(ssd300)
    ssd300 = ssd300.eval()
    return ssd300


def prepare_tensor(inputs):
    NHWC = np.array(inputs)
    NCHW = np.swapaxes(np.swapaxes(NHWC, 2, 3), 1, 2)
    tensor = torch.from_numpy(NCHW)
    tensor = tensor.cuda()
    tensor = tensor.half()
    return tensor


def decode_results(predictions):
    dboxes = dboxes300_coco()
    encoder = Encoder(dboxes)
    ploc, plabel = [val.float() for val in predictions]
    results = encoder.decode_batch(ploc, plabel, criteria=0.5, max_output=20)
    return [[pred.detach().cpu().numpy() for pred in detections]
            for detections in results]


def pick_best(detections, threshold):
    bboxes, classes, confidences = detections
    best = np.argwhere(confidences > threshold).squeeze(axis=1)
    return [pred[best] for pred in detections]


def main(checkpoint_path, imgs):
    inputs = [prepare_input(uri) for uri in imgs]
    tensor = prepare_tensor(inputs)
    ssd300 = prepare_model(checkpoint_path)
    predictions = ssd300(tensor)
    results = decode_results(predictions)
    best_results = [pick_best(detections, threshold=0.3)
                    for detections in results]
    return best_results


if __name__ == '__main__':
    best_results = main(
        checkpoint_path='/checkpoints/SSD300v1.1.pt',
        imgs=['http://images.cocodataset.org/val2017/000000397133.jpg',
              'http://images.cocodataset.org/val2017/000000037777.jpg',
              'http://images.cocodataset.org/val2017/000000252219.jpg',
              ]
    )
    print(best_results)

File diff suppressed because one or more lines are too long

View file

@ -0,0 +1 @@
PYTHONPATH=$PYTHONPATH:/mlperf/ jupyter-notebook --ip 0.0.0.0 --no-browser --allow-root

Binary file not shown (added image, 52 KiB).

View file

@ -34,7 +34,7 @@ def generate_mean_std(args):
mean = mean.view(*view)
std = std.view(*view)
if args.fp16:
if args.amp:
mean = mean.half()
std = std.half()
@ -90,7 +90,6 @@ def make_parser():
' When it is not provided, pretrained model from torchvision'
' will be downloaded.')
parser.add_argument('--num-workers', type=int, default=4)
parser.add_argument('--fp16', action='store_true')
parser.add_argument('--amp', action='store_true')
# Distributed
@ -102,8 +101,6 @@ def make_parser():
def train(train_loop_func, logger, args):
if args.amp:
amp_handle = amp.init(enabled=args.fp16)
# Check that GPUs are actually available
use_cuda = not args.no_cuda
@ -149,20 +146,15 @@ def train(train_loop_func, logger, args):
ssd300.cuda()
loss_func.cuda()
if args.fp16 and not args.amp:
ssd300 = network_to_half(ssd300)
optimizer = torch.optim.SGD(tencent_trick(ssd300), lr=args.learning_rate,
momentum=args.momentum, weight_decay=args.weight_decay)
scheduler = MultiStepLR(optimizer=optimizer, milestones=args.multistep, gamma=0.1)
if args.amp:
ssd300, optimizer = amp.initialize(ssd300, optimizer, opt_level='O2')
if args.distributed:
ssd300 = DDP(ssd300)
optimizer = torch.optim.SGD(tencent_trick(ssd300), lr=args.learning_rate,
momentum=args.momentum, weight_decay=args.weight_decay)
scheduler = MultiStepLR(optimizer=optimizer, milestones=args.multistep, gamma=0.1)
if args.fp16:
if args.amp:
optimizer = amp_handle.wrap_optimizer(optimizer)
else:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.)
if args.checkpoint is not None:
if os.path.isfile(args.checkpoint):
load_checkpoint(ssd300, args.checkpoint)

View file

@ -1 +1,2 @@
Cython==0.28.4
scikit-image==0.15.0

View file

@ -15,7 +15,7 @@ def get_train_loader(args, local_seed):
train_pipe = COCOPipeline(args.batch_size, args.local_rank, train_coco_root,
train_annotate, args.N_gpu, num_threads=args.num_workers,
output_fp16=args.fp16, output_nhwc=False,
output_fp16=args.amp, output_nhwc=False,
pad_output=False, seed=local_seed)
train_pipe.build()
test_run = train_pipe.run()

View file

@ -24,7 +24,7 @@ def evaluate(model, coco, cocoGt, encoder, inv_map, args):
print("Parsing batch: {}/{}".format(nbatch, len(coco)), end='\r')
with torch.no_grad():
inp = img.cuda()
if args.fp16:
if args.amp:
inp = inp.half()
# Get predictions

View file

@ -3,6 +3,8 @@ import torch
import time
from SSD import _C as C
from apex import amp
def train_loop(model, loss_func, epoch, optim, train_dataloader, val_dataloader, encoder, iteration, logger, args, mean, std):
# for nbatch, (img, _, img_size, bbox, label) in enumerate(train_dataloader):
for nbatch, data in enumerate(train_dataloader):
@ -46,12 +48,9 @@ def train_loop(model, loss_func, epoch, optim, train_dataloader, val_dataloader,
if args.local_rank == 0:
logger.update_iter(epoch, iteration, loss.item())
if args.fp16:
if args.amp:
with optim.scale_loss(loss) as scale_loss:
scale_loss.backward()
else:
optim.backward(loss)
if args.amp:
with amp.scale_loss(loss, optim) as scale_loss:
scale_loss.backward()
else:
loss.backward()
@ -118,12 +117,9 @@ def benchmark_train_loop(model, loss_func, epoch, optim, train_dataloader, val_d
# loss scaling
if args.fp16:
if args.amp:
with optim.scale_loss(loss) as scale_loss:
scale_loss.backward()
else:
optim.backward(loss)
if args.amp:
with amp.scale_loss(loss, optim) as scale_loss:
scale_loss.backward()
else:
loss.backward()
@ -170,7 +166,7 @@ def benchmark_inference_loop(model, loss_func, epoch, optim, train_dataloader, v
img = data[0]
if not args.no_cuda:
img = img.cuda()
if args.fp16:
if args.amp:
img = img.half()
img.sub_(mean).div_(std)
img = Variable(img, requires_grad=False)
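The backward-pass pattern this diff converges on, extracted into a small helper for clarity (a sketch; `use_amp` corresponds to `args.amp`):
```python
from apex import amp

def backward_step(loss, optimizer, use_amp):
    if use_amp:
        # scale the loss so small FP16 gradients do not underflow
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
```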

View file

@ -194,6 +194,9 @@ class Encoder(object):
scores_out.append(score[candidates])
labels_out.extend([i]*len(candidates))
if not bboxes_out:
return [torch.tensor([]) for _ in range(3)]
bboxes_out, labels_out, scores_out = torch.cat(bboxes_out, dim=0), \
torch.tensor(labels_out, dtype=torch.long), \
torch.cat(scores_out, dim=0)
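For context, the guard added above protects the `torch.cat` calls that follow it: when no candidate boxes survive filtering, `bboxes_out` is an empty list and `torch.cat` raises. A minimal reproduction:
```python
import torch

try:
    torch.cat([], dim=0)
except RuntimeError as err:
    print(err)  # torch.cat raises on an empty list of tensors (message varies by version)
```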

View file

@ -1,4 +1,4 @@
ARG FROM_IMAGE_NAME=gitlab-master.nvidia.com:5005/dl/dgx/pytorch:19.05-py3-devel
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && apt-get install -y pbzip2 pv bzip2 cabextract

View file

@ -0,0 +1,5 @@
BERT PyTorch
This repository includes software from https://github.com/huggingface/pytorch-pretrained-BERT
licensed under the Apache License 2.0.

File diff suppressed because it is too large

View file

@ -1,7 +1,7 @@
#! /bin/bash
WIKI_DUMP="ftp://ftpmirror.your.org/pub/wikimedia/dumps/enwiki/20190301/enwiki-20190301-pages-articles-multistream.xml.bz2"
N_PROCS_PREPROCESS=4 # Adjust this based on memory requirements and available number of cores
WIKI_DUMP="https://dumps.wikimedia.org/enwiki/20190320/enwiki-20190320-pages-articles-multistream.xml.bz2"
N_PROCS_PREPROCESS=$(nproc) # Adjust this based on memory requirements and available number of cores
# Download Wikipedia dump file
mkdir -p ./download

Binary file not shown (added image, 56 KiB).

View file

@ -170,7 +170,7 @@ def main():
type=float, default=0.0,
help='Loss scaling, positive power of 2 values can improve fp16 convergence.')
parser.add_argument('--log_freq',
type=float, default=10.0,
type=float, default=50.0,
help='frequency of logging loss.')
parser.add_argument('--checkpoint_activations',
default=False,
@ -333,12 +333,12 @@ def main():
train_data = pretraining_dataset(input_file=data_file, max_pred_length=args.max_predictions_per_seq)
if args.local_rank == -1:
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size * n_gpu, num_workers=4, pin_memory=True)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size * n_gpu, num_workers=4, pin_memory=True)
else:
train_sampler = DistributedSampler(train_data)
train_sampler = DistributedSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size, num_workers=4, pin_memory=True)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size, num_workers=4, pin_memory=True)
for step, batch in enumerate(tqdm(train_dataloader, desc="File Iteration")):
training_steps += 1
@ -378,8 +378,9 @@ def main():
loss.item(), optimizer.param_groups[0]['lr']))
average_loss = 0
if global_step >= args.max_steps or training_steps % (
args.num_steps_per_checkpoint * args.gradient_accumulation_steps) == 0:
if global_step >= args.max_steps or training_steps == 1 * args.gradient_accumulation_steps or training_steps % (args.num_steps_per_checkpoint * args.gradient_accumulation_steps) == 0:
if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
# Save a trained model
logger.info("** ** * Saving fine - tuned model ** ** * ")

View file

@ -936,40 +936,40 @@ def main():
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
if args.do_train:
if args.fp16:
try:
# from fused_adam_local import FusedAdamBert as FusedAdam
from apex.optimizers import FusedAdam
from apex.optimizers import FP16_Optimizer
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
# import ipdb; ipdb.set_trace()
optimizer = FusedAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
bias_correction=False,
max_grad_norm=1.0)
if args.fp16:
try:
# from fused_adam_local import FusedAdamBert as FusedAdam
from apex.optimizers import FusedAdam
from apex.optimizers import FP16_Optimizer
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
# import ipdb; ipdb.set_trace()
optimizer = FusedAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
bias_correction=False,
max_grad_norm=1.0)
if args.loss_scale == 0:
if args.old:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
if args.loss_scale == 0:
if args.old:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False,
loss_scale="dynamic")
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False,
loss_scale="dynamic")
if args.old:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False, loss_scale=args.loss_scale)
if not args.old and args.do_train:
scheduler = LinearWarmUpScheduler(optimizer, warmup=args.warmup_proportion, total_steps=num_train_optimization_steps)
else:
if args.old:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False, loss_scale=args.loss_scale)
if not args.old and args.do_train:
scheduler = LinearWarmUpScheduler(optimizer, warmup=args.warmup_proportion, total_steps=num_train_optimization_steps)
else:
optimizer = BertAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
optimizer = BertAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
#print(model)
if args.local_rank != -1:
@ -1086,6 +1086,10 @@ def main():
if args.do_predict and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
if not args.do_train and args.fp16:
model.half()
eval_examples = read_squad_examples(
input_file=args.predict_file, is_training=False, version_2_with_negative=args.version_2_with_negative)
eval_features = convert_examples_to_features(
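Setting aside the re-indentation (the block above mostly moves under `if args.do_train:`), the optimizer-setup logic in this diff reduces to roughly the following sketch (argument names follow the script's flags; treat it as a reading aid, not the exact code):
```python
from apex import amp
from apex.optimizers import FusedAdam, FP16_Optimizer

def build_fp16_optimizer(model, grouped_params, args):
    optimizer = FusedAdam(grouped_params, lr=args.learning_rate,
                          bias_correction=False, max_grad_norm=1.0)
    if args.old:
        # legacy path: wrap with FP16_Optimizer (dynamic or static scaling)
        if args.loss_scale == 0:
            optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
        else:
            optimizer = FP16_Optimizer(optimizer,
                                       static_loss_scale=args.loss_scale)
    else:
        # AMP path: O2 with FP16 batchnorm and a dynamic or static loss scale
        model, optimizer = amp.initialize(
            model, optimizer, opt_level='O2', keep_batchnorm_fp32=False,
            loss_scale='dynamic' if args.loss_scale == 0 else args.loss_scale)
    return model, optimizer
```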

View file

@ -1,6 +1,20 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
train_batch_size=${1:-14}
learning_rate=${2:-"0.4375e-4"}
precision=${3:-"fp16"}
num_gpus=${4:-8}
warmup_proportion=${5:-"0.01"}
train_steps=${6:-2285714}
save_checkpoint_steps=${7:-2000}
resume_training=${8:-"false"}
create_logfile=${9:-"true"}
accumulate_gradients=${10:-"false"}
gradient_accumulation_steps=${11:-1}
seed=${12:-42}
job_name=${13:-"job"}
DATASET=wikipedia_corpus # change this for other datasets
@ -29,18 +43,6 @@ if [ ! -f "$BERT_CONFIG" ] ; then
exit -1
fi
train_batch_size=${1:-14}
learning_rate=${2:-"0.4375e-4"}
precision=${3:-"fp16"}
num_gpus=${4:-8}
warmup_proportion=${5:-"0.01"}
train_steps=${6:-2285714}
save_checkpoint_steps=${7:-2000}
resume_training=${8:-"false"}
create_logfile=${9:-"true"}
checkpoint_activations=${10:-"false"}
seed=${11:-42}
PREC=""
if [ "$precision" = "fp16" ] ; then
PREC="--fp16"
@ -51,9 +53,9 @@ else
exit -2
fi
CHECKPOINT_ACTIVATIONS=""
if [ "$checkpoint_activations" == "true" ] ; then
CHECKPOINT_ACTIVATIONS="--checkpoint_activations"
ACCUMULATE_GRADIENTS=""
if [ "$accumulate_gradients" == "true" ] ; then
ACCUMULATE_GRADIENTS="--gradient_accumulation_steps=$gradient_accumulation_steps"
fi
CHECKPOINT=""
@ -67,7 +69,6 @@ CMD=" /workspace/bert/run_pretraining.py"
CMD+=" --input_dir=$DATA_DIR"
CMD+=" --output_dir=$CHECKPOINTS_DIR"
CMD+=" --config_file=$BERT_CONFIG"
CMD+=" --do_train"
CMD+=" --bert_model=bert-large-uncased"
CMD+=" --train_batch_size=$train_batch_size"
CMD+=" --max_seq_length=512"
@ -78,7 +79,7 @@ CMD+=" --num_steps_per_checkpoint=$save_checkpoint_steps"
CMD+=" --learning_rate=$learning_rate"
CMD+=" --seed=$seed"
CMD+=" $PREC"
CMD+=" $CHECKPOINT_ACTIVATIONS"
CMD+=" $ACCUMULATE_GRADIENTS"
CMD+=" $CHECKPOINT"
@ -93,7 +94,7 @@ if [ "$create_logfile" = "true" ] ; then
export GBS=$(expr $train_batch_size \* $num_gpus)
printf -v TAG "pyt_bert_pretraining_%s_gbs%d" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$TAG.$DATESTAMP.log
LOGFILE=$RESULTS_DIR/$job_name.$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi


@ -7,11 +7,11 @@ echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/workspace/checkpoints/bert_uncased.pt"}
epochs=${2:-"2.0"}
batch_size=${3:-"24"}
batch_size=${3:-"3"}
learning_rate=${4:-"3e-5"}
precision=${5:-"fp16"}
num_gpu=${6:-"8"}
seed=${7:-"42"}
seed=${7:-"1"}
squad_dir=${8:-"/workspace/bert/data/squad/v1.1"}
vocab_file=${9:-"/workspace/bert/vocab/vocab"}
OUT_DIR=${10:-"/results/SQuAD"}
@ -50,6 +50,10 @@ elif [ "$mode" = "eval" ] ; then
CMD+="--do_predict "
CMD+="--predict_file=$squad_dir/dev-v1.1.json "
CMD+="--predict_batch_size=$batch_size "
elif [ "$mode" = "prediction" ] ; then
CMD+="--do_predict "
CMD+="--predict_file=$squad_dir/dev-v1.1.json "
CMD+="--predict_batch_size=$batch_size "
else
CMD+=" --do_train "
CMD+=" --train_file=$squad_dir/train-v1.1.json "
@ -79,10 +83,18 @@ time $CMD |& tee $LOGFILE
#sed -r 's/\r|\x1b\[A/\n/g' $LOGFILE > $LOGFILE.edit  # strip carriage returns and ANSI cursor codes (disabled)
throughput=`cat $LOGFILE | grep -E 'Iteration.*[0-9.]+(s/it|it/s)' | tail -1 | egrep -o '[0-9.]+(s/it|it/s)' | head -1 | egrep -o '[0-9.]+'`
if [ "$mode" != "train" ]; then
python $squad_dir/evaluate-v1.1.py $squad_dir/dev-v1.1.json $OUT_DIR/predictions.json |& tee -a $LOGFILE
if [ "$mode" != "eval" ]; then
throughput=`cat $LOGFILE | grep -E 'Iteration.*[0-9.]+(it/s)' | tail -1 | egrep -o '[0-9.]+(s/it|it/s)' | head -1 | egrep -o '[0-9.]+'`
train_perf=$(awk 'BEGIN {print ('$throughput' * '$num_gpu' * '$batch_size')}')
echo " training throughput: $train_perf"
fi
echo "throughput: $throughput"
if [ "$mode" != "train" ]; then
if [ "$mode" != "prediction" ]; then
python $squad_dir/evaluate-v1.1.py $squad_dir/dev-v1.1.json $OUT_DIR/predictions.json |& tee -a $LOGFILE
eval_throughput=`cat $LOGFILE | grep Evaluating | tail -1 | awk -F ',' '{print $2}' | egrep -o '[0-9.]+' | head -1 | egrep -o '[0-9.]+'`
eval_perf=$(awk 'BEGIN {print ('$eval_throughput' * '$num_gpu' * '$batch_size')}')
echo " evaluation throughput: $eval_perf"
fi
fi


@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:19.05-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}
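# The base image can now be overridden at build time, for example:
#   docker build --build-arg FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3 .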
RUN apt-get update && \
apt-get install -y unzip


@ -6,46 +6,50 @@ model to achieve state of the art accuracy, and is tested and maintained by NVID
Table of Contents
=================
* [The model](#the-model)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Details](#details)
* [Scripts and sample code](#scripts-and-sample-code)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [ML-1m](#ml-1m)
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-8x-v100-32g)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Scaling beyond 8 GPUs](#scaling-beyond-8-gpus)
* [Memory usage](#memory-usage)
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-8x-v100-32g)
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-16x-v100-32g)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Scaling beyond 8 GPUs](#scaling-beyond-8-gpus)
* [Memory usage](#memory-usage)
## The model
## Model overview
The NCF model focuses on providing recommendations, also known as collaborative filtering with implicit feedback. The training data for this model should contain binary information about whether a user interacted with a specific item.
NCF was first described by Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu and Tat-Seng Chua in the [Neural Collaborative Filtering paper](https://arxiv.org/abs/1708.05031).
@ -110,6 +114,35 @@ It allows us to use FP16 training with FP32 master weights by modifying just 3 l
* Fused Adam - We use a fused implementation of the Adam optimizer provided by the Apex package. It fuses some operations for faster weight updates.
Since NCF is a relatively lightweight model with a large number of parameters, we've observed significant performance improvements from using FusedAdam; a short usage sketch follows below.
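As a rough illustration, here is a minimal sketch of the swap, assuming Apex is installed (the linear layer is a placeholder for the real NeuMF network; the betas shown mirror the script's defaults):

```python
import torch
from apex.optimizers import FusedAdam  # fused CUDA kernels for the Adam update

model = torch.nn.Linear(128, 1).cuda()  # placeholder model; parameters must live on the GPU
optimizer = FusedAdam(model.parameters(), lr=0.0045,
                      betas=(0.25, 0.5), eps=1e-8)
```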
## Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
### Enabling mixed precision
Using the Automatic Mixed Precision (AMP) package requires two modifications in the source code.
The first one is to initialize the model and the optimizer using the `amp.initialize` function:
```python
model, optimizer = amp.initialize(model, optimizer, opt_level=args.opt_level,
keep_batchnorm_fp32=False, loss_scale='dynamic')
```
The second one is to use the AMP's loss scaling context manager:
```python
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
```
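Put together, a single training step then looks roughly like the sketch below; `loader`, `criterion`, and the tensor names are placeholders rather than the actual variables used in `ncf.py`:

```python
for users, items, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(users, items), labels)
    # scale the loss so that small FP16 gradients are not flushed to zero
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```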
## Setup
The following section lists the requirements in order to start training the Neural Collaborative Filtering model.
@ -128,7 +161,7 @@ Running PyTorch
For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned NVIDIA Container Support Matrix.
### Quick Start Guide
## Quick Start Guide
1. Clone the repository.
```bash
@ -189,14 +222,14 @@ This will result in a checkpoint file being written to `/data/checkpoints/model.
6. Start validation/evaluation.
The trained model can be evaluated by passing the `--mode test` flag to the `run.sh` script:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m --mode test --checkpoint-path /data/checkpoints/model.pth
```
## Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
@ -217,7 +250,7 @@ usage: ncf.py [-h] [--data DATA] [-e EPOCHS] [-b BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE] [-f FACTORS]
[--layers LAYERS [LAYERS ...]] [-n NEGATIVE_SAMPLES]
[-l LEARNING_RATE] [-k TOPK] [--seed SEED]
[--threshold THRESHOLD] [--valid_negative VALID_NEGATIVE]
[--threshold THRESHOLD]
[--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--dropout DROPOUT]
[--checkpoint_dir CHECKPOINT_DIR] [--mode {train,test}]
[--grads_accumulated GRADS_ACCUMULATED] [--opt_level {O0,O2}]
@ -247,9 +280,6 @@ optional arguments:
--seed SEED, -s SEED Manually set random seed for torch
--threshold THRESHOLD, -t THRESHOLD
Stop training early at threshold
--valid_negative VALID_NEGATIVE
Number of negative samples for each positive test
example
--beta1 BETA1, -b1 BETA1
Beta1 for Adam
--beta2 BETA2, -b2 BETA2
@ -329,39 +359,11 @@ The script will then:
* Run inference on the test dataset
* Compute and print the validation metric
## Mixed precision training
## Performance
### Benchmarking
## Benchmarking
### Training performance benchmark
#### Training performance benchmark
NCF training on NVIDIA DGX systems is very fast; therefore, to measure training and validation throughput, you can simply run the full training job with:
```bash
@ -372,7 +374,7 @@ python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/
At the end of the script, a line reporting the best train throughput is printed.
### Inference performance benchmark
#### Inference performance benchmark
Validation throughput can be measured by running the full training job with:
```bash
@ -382,23 +384,42 @@ python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/
The best validation throughput is reported to the standard output.
## Results
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
### Training accuracy results
#### Training accuracy results
#### NVIDIA DGX-1 (8x V100 32G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
The following table lists the best hit rate at 10 for DGX-1 with 8 V100 16G GPUs. It also shows the average time to reach this HR@10 across 5 random seeds.
The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to train - FP32 (s)** | **Time to train - mixed precision (s)** | **Time to train speedup (FP32 to mixed precision)** |
|--------------------------:|-----------------------------:|--------------------------:|--------------------------:|-------------------------------:|-------------------------------:|------------------:|
| 1 | 1,048,576 | 0.95913 | 0.95887 | 188.82 | 100.37 | 1.88 |
| 8 | 131,072 | 0.95905 | 0.95906 | 43.20 | 26.68 | 1.62 |
To reproduce this result, start the NCF Docker container interactively and run:
```bash
./prepare_dataset.sh
python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m
```
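For reference, HR@10 counts a user as a hit if the held-out positive item is ranked among the top 10 of that user's scored candidates (the positive plus the sampled negatives). Below is a minimal sketch of the metric, assuming a score matrix with the positive item in the last column, matching the layout produced during test-data creation:

```python
import torch

def hit_rate_at_k(scores, k=10):
    # scores: (num_users, num_negatives + 1), positive item in the last column
    _, topk = scores.topk(k, dim=1)                 # indices of the k highest-scored items
    hits = (topk == scores.size(1) - 1).any(dim=1)  # did the positive make the cut?
    return hits.float().mean().item()
```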
##### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs.
The following table lists the best hit rate at 10 for DGX-1 with 8 V100 32G GPUs:
The following table lists the best hit rate at 10 for DGX-1 with 8 V100 32G GPUs. It also shows the average time to reach this HR@10 across 5 random seeds.
The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to train - FP32 (s)** | **Time to train - mixed precision (s)** | **Time to train speedup (FP32 to mixed precision)** |
|--------------------------:|-----------------------------:|--------------------------:|--------------------------:|-------------------------------:|-------------------------------:|------------------:|
| 1 | 1,048,576 | 0.95913 | 0.95887 | 194.72 | 106.03 | 1.84 |
| 8 | 131,072 | 0.95905 | 0.95906 | 44.07 | 27.86 | 1.58 |
| **Number of GPUs** | **Single precision HR@10** | **Mixed precision HR@10** |
|:---:|:--------:|:-------:|
|1| 0.95847 | 0.95845 |
|4| 0.95887 | 0.95841 |
|8| 0.95850 | 0.95885 |
Here's an example validation accuracy curve for mixed precision vs single precision on DGX-1 with 8 V100 32G GPUs:
@ -410,9 +431,29 @@ To reproduce this result, start the NCF Docker container interactively and run:
python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m
```
Training accuracy results on a DGX-1 with 8 V100 16G GPUs and on DGX-2 should be the same.
##### NVIDIA DGX-2 (16x V100 32G)
#### Training stability test
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
The following table lists the best hit rate at 10 for DGX-2 with 16 V100 32G GPUs. It also shows the average time to reach this HR@10 across 5 random seeds.
The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to train - FP32 (s)** | **Time to train - mixed precision (s)** | **Time to train speedup (FP32 to mixed precision)** |
|--------------------------:|-----------------------------:|--------------------------:|--------------------------:|-------------------------------:|-------------------------------:|------------------:|
| 1 | 1,048,576 | 0.95913 | 0.95887 | 180.85 | 100.33 | 1.80 |
| 8 | 131,072 | 0.95900 | 0.95918 | 44.21 | 29.68 | 1.49 |
| 16 | 65,536 | 0.95896 | 0.95906 | 34.47 | 26.52 | 1.30 |
To reproduce this result, start the NCF Docker container interactively and run:
```bash
./prepare_dataset.sh
python -m torch.distributed.launch --nproc_per_node=16 ncf.py --data /data/cache/ml-20m
```
##### Training stability test
The histogram below shows the best HR@10 achieved
for 400 experiments using mixed precision and 400 experiments using single precision.
@ -421,90 +462,60 @@ Mean HR@10 for mixed precision was equal to 0.95868 and for single precision it
![hr_histogram](./img/hr_histogram.png)
### Training performance results
#### Training performance results
#### NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
The following table shows the best training throughput:
| **Number of GPUs** | **Batch size per GPU**| **Mixed precision throughput (samples/sec)** | **Single precision throughput (samples/sec)** | **Speed-up with mixed precision** | **Multi-GPU strong scaling with mixed precision** | **Multi-GPU strong scaling with FP32** |
|:---:|:--------:|:-----:|:-----------:|:-----:|:----:|:---|
| 1 |1048576| 20,459,365| 9,777,551 | 2.09 | 1 | 1 |
| 4 |262144 | 61,782,125| 32,583,924 | 1.90 | 3.02 |3.33|
| 8 |131072 | 98,464,084| 55,365,147 | 1.78 |4.81 |5.66|
The following table shows the average time to reach HR@10 of 0.9562 across 5 random seeds. The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch Size / GPU** | **Throughput - FP32 (samples / s)** | **Throughput - Mixed precision (samples /s)** | **Throughput Speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong scaling - Mixed precision** |
|--------------------------:|-----------------------------:|----------------------------------:|----------------------------------:|------------------:|---------------------:|---------------------:|
| 1 | 1,048,576 | 10,536,076 | 21,059,303 | 2.00 | 1.00 | 1.00 |
| 8 | 131,072 | 58,286,313 | 100,760,496 | 1.73 | 5.53 | 4.78 |
| **Number of GPUs** | **Batch size per GPU** | **Mixed precision (seconds)** | **Single precision (seconds)** | **Speed-up with mixed precision** |
|:---:|:----:|:---------:|:-----------:|:-----:|
| 1 | 1048576| 67.03 | 142.31 | 2.12 |
| 4 | 262144| 23.92 | 47.57 | 1.99 |
| 8 | 131072| 18.82 | 31.48 | 1.67 |
#### NVIDIA DGX-1 (8x V100 32G)
##### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs.
The following table shows the best training throughput:
| **Number of GPUs** | **Batch size per GPU** | **Mixed precision throughput (samples/sec)** | **Single precision throughput (samples/sec)** | **Speed-up with mixed precision** | **Multi-GPU strong scaling with mixed precision** | **Multi-GPU strong scaling with FP32** |
|:---:|:----:|:---------:|:-----------:|:-----:|:---:|:---:|
| 1 | 1048576| 19,314,944 | 9,464,431 | 2.04 | 1 | 1 |
| 4 | 262144| 58,579,745 |31,577,085 | 1.86 | 3.03 | 3.34 |
| 8 | 131072| 92,964,306 | 53,972,811 | 1.72 | 4.81 | 5.70 |
The following table shows the average time to reach HR@10 of 0.9562 across 5 random seeds. The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **Number of GPUs** | **Mixed precision (seconds)** | **Single precision (seconds)** | **Speed-up with mixed precision** |
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 70.49 | 146.68 | 2.08 |
| 4 | 24.61 | 49.01 | 1.99 |
| 8 | 19.72 | 32.25 | 1.64 |
| **GPUs** | **Batch Size / GPU** | **Throughput - FP32 (samples / s)** | **Throughput - Mixed precision (samples /s)** | **Throughput Speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong scaling - Mixed precision** |
|--------------------------:|-----------------------------:|----------------------------------:|----------------------------------:|------------------:|---------------------:|---------------------:|
| 1 | 1,048,576 | 10,230,464 | 19,894,392 | 1.94 | 1.00 | 1.00 |
| 8 | 131,072 | 57,043,196 | 95,424,391 | 1.67 | 5.58 | 4.80 |
#### NVIDIA DGX-2 (16x V100 32G)
##### NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
The following table shows the best training throughput:
| **Number of GPUs ** | **Batch size per GPU** | **Mixed precision throughput (samples/sec)** | **Single precision throughput (samples/sec)** | **Speed-up with mixed precision** | **Multi-GPU strong scaling with mixed precision** | **Multi-GPU strong scaling with FP32** |
|:---:|:-----:|:-------:|:-----------:|:-----:|:---:|:---:|
| 1 | 1048576| 20,645,544 | 10,145,873 | 2.03 | 1 | 1 |
| 4 | 262144 | 63,608,950 | 34,758,369 | 1.83 | 3.08 | 3.43 |
| 8 | 131072| 98,887,103 | 57,251,418 | 1.73 | 4.79 | 5.64 |
| 16 | 65536| 128,976,394 | 82,932,545 | 1.56 | 6.25 | 8.17 |
The following table shows the average time to reach HR@10 of 0.9562 across 5 random seeds. The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **Number of GPUs ** | **Mixed precision (seconds)** | **Single precision (seconds)** | **Speed-up with mixed precision** |
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 65.99 |134.93 |2.04|
| 4 | 26.21 |41.12 |1.57|
| 8 | 21.96 |29.71 |1.35|
| 16| 22.15 |28.99 |1.31|
| **GPUs** | **Batch Size / GPU** | **Throughput - FP32 (samples / s)** | **Throughput - Mixed precision (samples /s)** | **Throughput Speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong scaling - Mixed precision** |
|--------------------------:|:-----------------------------|:----------------------------------|:----------------------------------|------------------:|---------------------:|---------------------:|
| 1 | 1,048,576 | 10,941,690 | 21,056,129 | 1.92 | 1.00 | 1.00 |
| 8 | 131,072 | 60,247,209 | 100,142,844 | 1.66 | 5.51 | 4.76 |
| 16 | 65,536 | 84,287,736 | 133,300,953 | 1.58 | 7.70 | 6.33 |
### Inference performance results
#### Inference performance results
#### NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
The following table shows the best inference throughput:
| **Number of GPUs ** | **Mixed precision (samples/sec)** | **Single precision (samples/sec)** | **Speed-up with mixed precision** |
| **Number of GPUs** | **Mixed precision (samples/sec)** | **Single precision (samples/sec)** | **Speed-up with mixed precision** |
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 57,163,273 | 28,877,257 | 1.98 |
#### NVIDIA DGX-1 (8x V100 32G)
##### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs.
@ -515,7 +526,7 @@ The following table shows the best inference throughput:
| 1 | 54,570,476 | 28,085,521 | 1.94 |
#### NVIDIA DGX-2 (16x V100 32G)
##### NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
@ -525,8 +536,9 @@ The following table shows the best inference throughput:
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 58,383,216 | 30,018,043 | 1.94 |
## Release notes
## Changelog
### Changelog
1. January 22, 2018
* Initial release
2. May, 2019
@ -536,11 +548,15 @@ The following table shows the best inference throughput:
* Data loading code cleanup.
* Default container updated to PyTorch 19.05-py3.
* Updated README.md.
3. June, 2019
* Updated performance tables.
* Default container changed to PyTorch 19.06-py3.
* Caching validation negatives between runs
## Known issues
### Known issues
### Scaling beyond 8 GPUs
#### Scaling beyond 8 GPUs
Neural Collaborative Filtering is a relatively lightweight model that trains quickly on the relatively small ML-20m dataset.
Because of that, the high ratio of communication to computation makes it difficult to
efficiently use more than 8 GPUs. Typically, this is not an issue because when using 8
@ -550,7 +566,7 @@ GPUs with FP16 precision, the training is sufficiently fast. However, if you'd
by finding hyperparameters that enable using a larger batch size or by reducing the
number of trainable parameters.
### Memory usage
#### Memory usage
In the default settings, the additional memory beyond 16G may not be fully utilized.
This is because we set the default batch size for the ML-20m dataset to 1M,


@ -32,6 +32,7 @@ from argparse import ArgumentParser
import pandas as pd
from load import implicit_load
import torch
import tqdm
from logger.logger import LOGGER
from logger import tags
@ -48,11 +49,50 @@ def parse_args():
help='Path to reviews CSV file from MovieLens')
parser.add_argument('--output', type=str, default='/data',
help='Output directory for train and test files')
parser.add_argument('--valid_negative', type=int, default=100,
help='Number of negative samples for each positive test example')
parser.add_argument('--seed', '-s', type=int, default=1,
help='Manually set random seed for torch')
return parser.parse_args()
class _TestNegSampler:
def __init__(self, train_ratings, nb_neg):
self.nb_neg = nb_neg
self.nb_users = int(train_ratings[:, 0].max()) + 1
self.nb_items = int(train_ratings[:, 1].max()) + 1
# compute unique ids so a hash set can be built for fast membership lookup
ids = (train_ratings[:, 0] * self.nb_items) + train_ratings[:, 1]
self.set = set(ids)
def generate(self, batch_size=128*1024):
users = torch.arange(0, self.nb_users).reshape([1, -1]).repeat([self.nb_neg, 1]).transpose(0, 1).reshape(-1)
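        # each user id now appears nb_neg times in a row: [0]*nb_neg + [1]*nb_neg + ...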
items = [-1] * len(users)
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
print('Generating validation negatives...')
for idx, u in enumerate(tqdm.tqdm(users.tolist())):
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
while u * self.nb_items + j in self.set:
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
items[idx] = j
items = torch.LongTensor(items)
return items
def main():
args = parse_args()
if args.seed is not None:
torch.manual_seed(args.seed)
print("Loading raw data from {}".format(args.path))
df = implicit_load(args.path, sort=False)
@ -65,7 +105,6 @@ def main():
df[USER_COLUMN] = pd.factorize(df[USER_COLUMN])[0]
df[ITEM_COLUMN] = pd.factorize(df[ITEM_COLUMN])[0]
print("Creating list of items for each user")
# Need to sort before popping to get last item
df.sort_values(by='timestamp', inplace=True)
@ -87,5 +126,10 @@ def main():
test_ratings = torch.from_numpy(test_data.values)
torch.save(test_ratings, args.output+'/test_ratings.pt')
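    # Sample the validation negatives once here, during preprocessing, and cache
    # them to disk so training runs can reload them instead of regenerating them.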
sampler = _TestNegSampler(train_ratings.cpu().numpy(), args.valid_negative)
test_negs = sampler.generate().cuda()
test_negs = test_negs.reshape(-1, args.valid_negative)
torch.save(test_negs, args.output+'/test_negatives.pt')
if __name__ == '__main__':
main()


@ -30,53 +30,16 @@
import time
import torch
import tqdm
class _TestNegSampler:
def __init__(self, train_ratings, nb_neg):
self.nb_neg = nb_neg
self.nb_users = int(train_ratings[:, 0].max()) + 1
self.nb_items = int(train_ratings[:, 1].max()) + 1
# compute unique ids for quickly created hash set and fast lookup
ids = (train_ratings[:, 0] * self.nb_items) + train_ratings[:, 1]
self.set = set(ids)
def generate(self, batch_size=128*1024):
users = torch.arange(0, self.nb_users).reshape([1, -1]).repeat([self.nb_neg, 1]).transpose(0, 1).reshape(-1)
items = [-1] * len(users)
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
print('Generating validation negatives...')
for idx, u in enumerate(tqdm.tqdm(users.tolist())):
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
while u * self.nb_items + j in self.set:
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
items[idx] = j
items = torch.LongTensor(items)
return items
def create_test_data(train_ratings, test_ratings, args):
def create_test_data(test_ratings, test_negs, args):
test_users = test_ratings[:,0]
test_pos = test_ratings[:,1].reshape(-1,1)
begin = time.time()
sampler = _TestNegSampler(train_ratings.cpu().numpy(), args.valid_negative)
test_negs = sampler.generate().cuda()
end = time.time()
print('Generating validation negatives took: ', end - begin)
del train_ratings
# create items with real sample at last position
test_users = test_users.reshape(-1,1).repeat(1, 1 + args.valid_negative)
test_items = torch.cat((test_negs.reshape(-1, args.valid_negative), test_pos), dim=1)
num_valid_negative = test_negs.shape[1]
test_users = test_users.reshape(-1,1).repeat(1, 1 + num_valid_negative)
test_items = torch.cat((test_negs, test_pos), dim=1)
del test_ratings, test_negs
# generate dup mask and real indices for exact same behavior on duplication compare to reference


@ -0,0 +1,95 @@
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch.jit
import time
from argparse import ArgumentParser
import torch
from neumf import NeuMF
from logger.logger import LOGGER, timed_block, timed_function
from logger.autologging import log_hardware, log_args
from apex import amp
LOGGER.model = 'ncf'
def parse_args():
parser = ArgumentParser(description="Benchmark inference performance of the NCF model")
parser.add_argument('--load_checkpoint_path', default=None, type=str,
help='Path to the checkpoint file to be loaded before training/evaluation')
parser.add_argument('--n_users', default=138493, type=int,
help='Number of users. Defaults to the number of users in the ml-20m dataset after preprocessing')
parser.add_argument('--n_items', default=26744, type=int,
help='Number of items. Defaults to the number of items in the ml-20m dataset after preprocessing')
parser.add_argument('-f', '--factors', type=int, default=64,
help='Number of predictive factors')
parser.add_argument('--dropout', type=float, default=0.5,
help='Dropout probability, if equal to 0 will not use dropout at all')
parser.add_argument('--layers', nargs='+', type=int,
default=[256, 256, 128, 64],
help='Sizes of hidden layers for MLP')
parser.add_argument('--batch_size', default=1, type=int, help='Batch size for inference')
parser.add_argument('--num_batches', default=20, type=int,
help='Number of batches for which to measure latency and throughput')
parser.add_argument('--opt_level', default='O2', type=str,
help='Optimization level for Automatic Mixed Precision',
choices=['O0', 'O2'])
return parser.parse_args()
def main():
log_hardware()
args = parse_args()
log_args(args)
model = NeuMF(nb_users=args.n_users, nb_items=args.n_items, mf_dim=args.factors,
mlp_layer_sizes=args.layers, dropout=args.dropout)
model = model.cuda()
if args.load_checkpoint_path:
state_dict = torch.load(args.load_checkpoint_path)
model.load_state_dict(state_dict)
if args.opt_level == "O2":
model = amp.initialize(model, opt_level=args.opt_level,
keep_batchnorm_fp32=False, loss_scale='dynamic')
users = torch.cuda.LongTensor(args.batch_size).random_(0, args.n_users)
items = torch.cuda.LongTensor(args.batch_size).random_(0, args.n_items)
latencies = []
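    # CUDA kernels launch asynchronously, so bracket the timed region with
    # explicit synchronization to measure true per-batch latency.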
for _ in range(args.num_batches):
torch.cuda.synchronize()
start = time.time()
predictions = model(users, items, sigmoid=True)
torch.cuda.synchronize()
latencies.append(time.time() - start)
LOGGER.log(key='batch_size', value=args.batch_size)
LOGGER.log(key='best_inference_throughput', value=args.batch_size / min(latencies))
LOGGER.log(key='best_inference_latency', value=min(latencies))
LOGGER.log(key='inference_latencies', value=latencies)
return
if __name__ == '__main__':
main()
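# Hypothetical invocation (script path and values are examples only):
#   python inference.py --load_checkpoint_path /data/checkpoints/model.pth \
#       --batch_size 1024 --opt_level O2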


@ -79,8 +79,6 @@ def parse_args():
help='Manually set random seed for torch')
parser.add_argument('--threshold', '-t', type=float, default=1.0,
help='Stop training early at threshold')
parser.add_argument('--valid_negative', type=int, default=100,
help='Number of negative samples for each positive test example')
parser.add_argument('--beta1', '-b1', type=float, default=0.25,
help='Beta1 for Adam')
parser.add_argument('--beta2', '-b2', type=float, default=0.5,
@ -91,6 +89,8 @@ def parse_args():
help='Dropout probability, if equal to 0 will not use dropout at all')
parser.add_argument('--checkpoint_dir', default='/data/checkpoints/', type=str,
help='Path to the directory storing the checkpoint file')
parser.add_argument('--load_checkpoint_path', default=None, type=str,
help='Path to the checkpoint file to be loaded before training/evaluation')
parser.add_argument('--mode', choices=['train', 'test'], default='train', type=str,
help='Passing "test" will only run a single evaluation, otherwise full training will be performed')
parser.add_argument('--grads_accumulated', default=1, type=int,
@ -173,17 +173,13 @@ def main():
args.distributed, args.world_size = init_distributed(args.local_rank)
log_args(args)
main_start_time = time.time()
if args.seed is not None:
torch.manual_seed(args.seed)
print("Saving results to {}".format(args.checkpoint_dir))
if not os.path.exists(args.checkpoint_dir) and args.checkpoint_dir != '':
os.makedirs(args.checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(args.checkpoint_dir, 'model.pth')
LOGGER.log(key=tags.PREPROC_HP_NUM_EVAL, value=args.valid_negative)
# The default of np.random.choice is replace=True, as is pytorch's random_()
LOGGER.log(key=tags.PREPROC_HP_SAMPLE_EVAL_REPLACEMENT, value=True)
LOGGER.log(key=tags.INPUT_HP_SAMPLE_TRAIN_REPLACEMENT, value=True)
@ -194,10 +190,16 @@ def main():
torch.distributed.broadcast(torch.tensor([1], device="cuda"), 0)
torch.cuda.synchronize()
main_start_time = time.time()
LOGGER.log(key=tags.RUN_START)
train_ratings = torch.load(args.data+'/train_ratings.pt', map_location=torch.device('cuda:{}'.format(args.local_rank)))
test_ratings = torch.load(args.data+'/test_ratings.pt', map_location=torch.device('cuda:{}'.format(args.local_rank)))
test_negs = torch.load(args.data+'/test_negatives.pt', map_location=torch.device('cuda:{}'.format(args.local_rank)))
valid_negative = test_negs.shape[1]
LOGGER.log(key=tags.PREPROC_HP_NUM_EVAL, value=valid_negative)
nb_maxs = torch.max(train_ratings, 0)[0]
nb_users = nb_maxs[0].item() + 1
@ -206,7 +208,7 @@ def main():
all_test_users = test_ratings.shape[0]
test_users, test_items, dup_mask, real_indices = dataloading.create_test_data(train_ratings, test_ratings, args)
test_users, test_items, dup_mask, real_indices = dataloading.create_test_data(test_ratings, test_negs, args)
# make pytorch memory behavior more consistent later
torch.cuda.empty_cache()
@ -248,15 +250,25 @@ def main():
LOGGER.log(key=tags.OPT_HP_ADAM_EPSILON, value=args.eps)
LOGGER.log(key=tags.MODEL_HP_LOSS_FN, value=tags.VALUE_BCE)
if args.load_checkpoint_path:
state_dict = torch.load(args.load_checkpoint_path)
model.load_state_dict(state_dict)
if args.mode == 'test':
state_dict = torch.load(checkpoint_path)
model.load_state_dict(state_dict)
LOGGER.log(key=tags.EVAL_START, value=0)
start = time.time()
hr, ndcg = val_epoch(model, test_users, test_items, dup_mask, real_indices, args.topk,
samples_per_user=args.valid_negative + 1,
samples_per_user=valid_negative + 1,
num_user=all_test_users, distributed=args.distributed)
print('HR@{K} = {hit_rate:.4f}, NDCG@{K} = {ndcg:.4f}'
.format(K=args.topk, hit_rate=hr, ndcg=ndcg))
val_time = time.time() - start
eval_size = all_test_users * (valid_negative + 1)
eval_throughput = eval_size / val_time
LOGGER.log(key=tags.EVAL_ACCURACY, value={"epoch": 0, "value": hr})
LOGGER.log(key=tags.EVAL_STOP, value=0)
LOGGER.log(key='best_eval_throughput', value=eval_throughput)
return
success = False
@ -307,7 +319,7 @@ def main():
LOGGER.log(key=tags.EVAL_START, value=epoch)
hr, ndcg = val_epoch(model, test_users, test_items, dup_mask, real_indices, args.topk,
samples_per_user=args.valid_negative + 1,
samples_per_user=valid_negative + 1,
num_user=all_test_users, epoch=epoch, distributed=args.distributed)
val_time = time.time() - begin
@ -321,15 +333,17 @@ def main():
LOGGER.log(key=tags.EVAL_TARGET, value=args.threshold)
LOGGER.log(key=tags.EVAL_STOP, value=epoch)
eval_size = all_test_users * (args.valid_negative + 1)
eval_size = all_test_users * (valid_negative + 1)
eval_throughput = eval_size / val_time
eval_throughputs.append(eval_throughput)
LOGGER.log(key='eval_throughput', value=eval_throughput)
if hr > max_hr and args.local_rank == 0:
max_hr = hr
print("New best hr! Saving the model to: ", checkpoint_path)
torch.save(model.state_dict(), checkpoint_path)
save_checkpoint_path = os.path.join(args.checkpoint_dir, 'model.pth')
print("New best hr! Saving the model to: ", save_checkpoint_path)
torch.save(model.state_dict(), save_checkpoint_path)
best_model_timestamp = time.time()
if args.threshold is not None:
if hr >= args.threshold:
@ -337,13 +351,15 @@ def main():
success = True
break
LOGGER.log(key='best_train_throughput', value=max(train_throughputs))
LOGGER.log(key='best_eval_throughput', value=max(eval_throughputs))
LOGGER.log(key='best_accuracy', value=max_hr)
LOGGER.log(key='time_to_target', value=time.time() - main_start_time)
if args.local_rank == 0:
LOGGER.log(key='best_train_throughput', value=max(train_throughputs))
LOGGER.log(key='best_eval_throughput', value=max(eval_throughputs))
LOGGER.log(key='best_accuracy', value=max_hr)
LOGGER.log(key='time_to_target', value=time.time() - main_start_time)
LOGGER.log(key='time_to_best_model', value=best_model_timestamp - main_start_time)
LOGGER.log(key=tags.RUN_STOP, value={"success": success})
LOGGER.log(key=tags.RUN_FINAL)
LOGGER.log(key=tags.RUN_STOP, value={"success": success})
LOGGER.log(key=tags.RUN_FINAL)
if __name__ == '__main__':
main()


@ -35,7 +35,7 @@ set -x
DATASET_NAME=${1:-'ml-20m'}
RAW_DATADIR=${2:-'/data'}
CACHED_DATADIR="${RAW_DATADIR}/cache/${DATASET_NAME}"
CACHED_DATADIR=${3:-"${RAW_DATADIR}/cache/${DATASET_NAME}"}
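# the cache location is now an optional third argument, e.g.:
#   ./prepare_dataset.sh ml-20m /data /data/cache/ml-20m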
# you can add another option to this case in order to support other datasets
case ${DATASET_NAME} in


@ -607,7 +607,7 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
The following table shows the expected training time for convergence for Tacotron 2 (1500 epochs):
|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
|---:|---:|---:|---:|
|---:|---:|---:|---:|---:|
|1| 128@FP16, 64@FP32 | 137.33 | 227.66 | 1.66 |
|4| 128@FP16, 64@FP32 | 40.68 | 63.99 | 1.57 |
|8| 128@FP16, 64@FP32 | 20.74 | 32.47 | 1.57 |
@ -617,7 +617,7 @@ The following table shows the expected training time for convergence for Tacotro
The following table shows the expected training time for convergence for WaveGlow (1000 epochs):
|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
|---:|---:|---:|---:|
|---:|---:|---:|---:|---:|
|1| 10@FP16, 4@FP32 | 358.00 | 793.97 | 2.22 |
|4| 10@FP16, 4@FP32 | 103.10 | 223.59 | 2.17 |
|8| 10@FP16, 4@FP32 | 50.40 | 109.45 | 2.17 |


@ -1,12 +0,0 @@
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


@ -1,9 +1,11 @@
FROM nvcr.io/nvidia/pytorch:19.01-py3
FROM nvcr.io/nvidia/pytorch:19.05-py3
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ADD . /workspace/gnmt
WORKDIR /workspace/gnmt
RUN pip install -r requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
ADD . /workspace/gnmt


@ -1,2 +1,5 @@
pytablewriter
sacrebleu==1.2.10
sacremoses==0.0.19
git+git://github.com/rsennrich/subword-nmt.git@48ba99e657591c329e0003f0c6e32e493fa959ef
git+git://github.com/NVIDIA/apex.git#egg=apex


@ -1,3 +1,3 @@
#!/bin/bash
docker build . --rm -t gnmt
docker build . --rm -t gnmt:latest


@ -1,3 +1,3 @@
#!/bin/bash
nvidia-docker run -it --rm --ipc=host -v $PWD:/workspace/gnmt/ gnmt bash
nvidia-docker run --init -it --rm --ipc=host -v $PWD:/workspace/gnmt/ gnmt bash


@ -33,12 +33,13 @@ fi
cd $REPO_DIR
python3 translate.py \
--input ${DATASET_DIR}/newstest2014.tok.bpe.32000.en \
--input ${DATASET_DIR}/newstest2014.en \
--reference ${DATASET_DIR}/newstest2014.de \
--output /tmp/output \
--model results/gnmt/model_best.pth \
--batch-size ${BATCH_SIZE} \
--beam-size ${BEAM_SIZE} \
--math ${MATH} \
--warmup 1 \
--target-bleu 24.3 \
${TARGET_PERF}


@ -1,6 +1,6 @@
fp16,128,5,Tesla V100-SXM2-16GB,31050
fp32,128,5,Tesla V100-SXM2-16GB,12500
fp16,128,5,Tesla V100-SXM2-32GB,29500
fp32,128,5,Tesla V100-SXM2-32GB,12500
fp16,128,5,Tesla V100-SXM3-32GB,34050
fp32,128,5,Tesla V100-SXM3-32GB,14250
fp16,128,5,Tesla V100-SXM2-16GB,18740
fp32,128,5,Tesla V100-SXM2-16GB,8610
fp16,128,5,Tesla V100-SXM2-32GB,17800
fp32,128,5,Tesla V100-SXM2-32GB,8180
fp16,128,5,Tesla V100-SXM3-32GB,20550
fp32,128,5,Tesla V100-SXM3-32GB,9810


@ -1,20 +1,20 @@
fp16,1,Tesla V100-SXM2-16GB,66050
fp16,4,Tesla V100-SXM2-16GB,196174
fp16,8,Tesla V100-SXM2-16GB,387282
fp32,1,Tesla V100-SXM2-16GB,21346
fp32,4,Tesla V100-SXM2-16GB,76083
fp32,8,Tesla V100-SXM2-16GB,153697
fp16,1,Tesla V100-SXM2-16GB,68000
fp16,4,Tesla V100-SXM2-16GB,221000
fp16,8,Tesla V100-SXM2-16GB,419000
fp32,1,Tesla V100-SXM2-16GB,21000
fp32,4,Tesla V100-SXM2-16GB,75000
fp32,8,Tesla V100-SXM2-16GB,149000
fp16,1,Tesla V100-SXM2-32GB,65000
fp16,4,Tesla V100-SXM2-32GB,190000
fp16,8,Tesla V100-SXM2-32GB,360000
fp16,4,Tesla V100-SXM2-32GB,210000
fp16,8,Tesla V100-SXM2-32GB,400000
fp32,1,Tesla V100-SXM2-32GB,21000
fp32,4,Tesla V100-SXM2-32GB,76000
fp32,8,Tesla V100-SXM2-32GB,150000
fp16,1,Tesla V100-SXM3-32GB,65830
fp16,4,Tesla V100-SXM3-32GB,200886
fp16,8,Tesla V100-SXM3-32GB,362612
fp16,16,Tesla V100-SXM3-32GB,738521
fp32,1,Tesla V100-SXM3-32GB,22695
fp32,4,Tesla V100-SXM3-32GB,81224
fp32,8,Tesla V100-SXM3-32GB,156536
fp32,16,Tesla V100-SXM3-32GB,314831
fp32,4,Tesla V100-SXM2-32GB,75000
fp32,8,Tesla V100-SXM2-32GB,148000
fp16,1,Tesla V100-SXM3-32GB,72000
fp16,4,Tesla V100-SXM3-32GB,237000
fp16,8,Tesla V100-SXM3-32GB,430000
fp16,16,Tesla V100-SXM3-32GB,852000
fp32,1,Tesla V100-SXM3-32GB,22000
fp32,4,Tesla V100-SXM3-32GB,80000
fp32,8,Tesla V100-SXM3-32GB,155000
fp32,16,Tesla V100-SXM3-32GB,312000


@ -47,7 +47,7 @@ python3 -m launch train.py \
--epochs 1 \
--remain-steps 1.0 \
--no-eval \
--max-size $((128 * ${GPU_COUNT} * 300)) \
--train-max-size $((128 * ${GPU_COUNT} * 300)) \
--math ${MATH} \
--train-global-batch-size ${GLOBAL_BATCH_SIZE} \
${TARGET_PERF}


@ -146,32 +146,25 @@ python3 scripts/filter_dataset.py \
-f2 ${OUTPUT_DIR}/newstest_dev.tok.clean.de
# Generate Subword Units (BPE)
# Clone Subword NMT
if [ ! -d "${OUTPUT_DIR}/subword-nmt" ]; then
git clone https://github.com/rsennrich/subword-nmt.git "${OUTPUT_DIR}/subword-nmt"
cd ${OUTPUT_DIR}/subword-nmt
git reset --hard 48ba99e657591c329e0003f0c6e32e493fa959ef
cd -
fi
# Learn Shared BPE
for merge_ops in 32000; do
echo "Learning BPE with merge_ops=${merge_ops}. This may take a while..."
cat "${OUTPUT_DIR}/train.tok.clean.de" "${OUTPUT_DIR}/train.tok.clean.en" | \
${OUTPUT_DIR}/subword-nmt/learn_bpe.py -s $merge_ops > "${OUTPUT_DIR}/bpe.${merge_ops}"
subword-nmt learn-bpe -s $merge_ops > "${OUTPUT_DIR}/bpe.${merge_ops}"
echo "Apply BPE with merge_ops=${merge_ops} to tokenized files..."
for lang in en de; do
for f in ${OUTPUT_DIR}/*.tok.${lang} ${OUTPUT_DIR}/*.tok.clean.${lang}; do
outfile="${f%.*}.bpe.${merge_ops}.${lang}"
${OUTPUT_DIR}/subword-nmt/apply_bpe.py -c "${OUTPUT_DIR}/bpe.${merge_ops}" < $f > "${outfile}"
subword-nmt apply-bpe -c "${OUTPUT_DIR}/bpe.${merge_ops}" < $f > "${outfile}"
echo ${outfile}
done
done
# Create vocabulary file for BPE
cat "${OUTPUT_DIR}/train.tok.clean.bpe.${merge_ops}.en" "${OUTPUT_DIR}/train.tok.clean.bpe.${merge_ops}.de" | \
${OUTPUT_DIR}/subword-nmt/get_vocab.py | cut -f1 -d ' ' > "${OUTPUT_DIR}/vocab.bpe.${merge_ops}"
subword-nmt get-vocab | cut -f1 -d ' ' > "${OUTPUT_DIR}/vocab.bpe.${merge_ops}"
done


@ -6,27 +6,5 @@ EOS_TOKEN = '<\s>'
# special PAD, UNKNOWN, BEGIN-OF-STRING, END-OF-STRING tokens
PAD, UNK, BOS, EOS = [0, 1, 2, 3]
# path to the BPE vocabulary file, relative to the data directory, it should
# point to file generated by subword-nmt/get_vocab.py
VOCAB_FNAME = 'vocab.bpe.32000'
# paths to source and target training files, relative to the data directory, it
# should point to BPE-encoded files, generated by subword-nmt/apply_bpe.py
SRC_TRAIN_FNAME = 'train.tok.clean.bpe.32000.en'
TGT_TRAIN_FNAME = 'train.tok.clean.bpe.32000.de'
# paths to source and target validation files, relative to the data directory,
# it should point to BPE-encoded files, generated by subword-nmt/apply_bpe.py
SRC_VAL_FNAME = 'newstest_dev.tok.clean.bpe.32000.en'
TGT_VAL_FNAME = 'newstest_dev.tok.clean.bpe.32000.de'
# path to the test source file, relative to the data directory, it should point
# to BPE-encoded file, generated by subword-nmt/apply_bpe.py
SRC_TEST_FNAME = 'newstest2014.tok.bpe.32000.en'
# path to the test target file, relative to the data directory, it should point
# to plaintext file, tokenization is performed by the sacrebleu package
TGT_TEST_TARGET_FNAME = 'newstest2014.de'
# path to the moses detokenizer, relative to the data directory
DETOKENIZER = 'mosesdecoder/scripts/tokenizer/detokenizer.perl'


@ -28,17 +28,17 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
:param seq: list of sequences
"""
lengths = [len(s) for s in seq]
lengths = torch.tensor([len(s) for s in seq], dtype=torch.int64)
batch_length = max(lengths)
shape = (batch_length, len(seq))
shape = (len(seq), batch_length)
seq_tensor = torch.full(shape, config.PAD, dtype=torch.int64)
for i, s in enumerate(seq):
end_seq = lengths[i]
seq_tensor[:end_seq, i].copy_(s[:end_seq])
seq_tensor[i, :end_seq].copy_(s[:end_seq])
if batch_first:
if not batch_first:
seq_tensor = seq_tensor.t()
return (seq_tensor, lengths)
@ -81,6 +81,71 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
return single_collate
class RawTextDataset(Dataset):
def __init__(self, raw_data=None, raw_datafile=None, tokenizer=None,
sort=False, max_size=None):
self.tokenizer = tokenizer
self.sorted = False
if raw_datafile:
with open(raw_datafile, 'r') as f:
self.raw_data = f.readlines()
else:
self.raw_data = raw_data
if max_size:
self.raw_data = self.raw_data[:max_size]
self.lengths = [len(s.split()) for s in self.raw_data]
if sort:
self.sort_by_length()
def __getitem__(self, idx):
raw = self.raw_data[idx]
tokenized = self.tokenizer.tokenize(raw)
return tokenized
def unsort(self, array):
"""
"Unsorts" given array (restores original order of elements before
dataset was sorted by sequence length).
:param array: array to be "unsorted"
"""
if self.sorted:
inverse = sorted(enumerate(self.indices), key=itemgetter(1))
array = [array[i[0]] for i in inverse]
return array
def sort_by_length(self):
output = sorted(
enumerate(self.raw_data),
key=lambda x: len(x[1].split()),
)
self.indices, self.raw_data = zip(*output)
self.lengths = [self.lengths[idx] for idx in self.indices]
self.sorted = True
def __len__(self):
return len(self.raw_data)
def get_loader(self, batch_size=1, num_workers=0, batch_first=False,
pad=False, repeat=1):
collate_fn = build_collate_fn(batch_first, parallel=False,
sort=True)
sampler = StaticDistributedSampler(self, batch_size, pad, repeat)
return DataLoader(self,
batch_size=batch_size,
collate_fn=collate_fn,
sampler=sampler,
num_workers=num_workers,
pin_memory=True,
drop_last=False)
class TextDataset(Dataset):
def __init__(self, src_fname, tokenizer, min_len=None, max_len=None,
sort=False, max_size=None):


@ -183,7 +183,7 @@ class BucketingSampler(DistributedSampler):
bucket_ids.clamp_(0, num_buckets - 1)
# build buckets
all_indices = torch.tensor(range(self.data_len))
all_indices = torch.arange(self.data_len)
self.buckets = []
self.num_samples = 0
global_bs = self.global_batch_size
@ -226,7 +226,7 @@ class BucketingSampler(DistributedSampler):
class StaticDistributedSampler(Sampler):
def __init__(self, dataset, batch_size, pad, world_size=None, rank=None):
def __init__(self, dataset, batch_size, pad, repeat=1, world_size=None, rank=None):
"""
Constructor for the StaticDistributedSampler.
@ -247,11 +247,12 @@ class StaticDistributedSampler(Sampler):
global_batch_size = batch_size * world_size
data_len = len(dataset)
num_samples = (data_len + global_batch_size - 1) \
repeated_data_len = int(len(dataset) * repeat)
num_samples = (repeated_data_len + global_batch_size - 1) \
// global_batch_size * global_batch_size
self.num_samples = num_samples
indices = list(range(data_len))
indices = list(range(repeated_data_len))
if pad:
# pad dataset to a multiple of global_batch_size samples, uses
# sample with idx 0 as pad
@ -267,6 +268,7 @@ class StaticDistributedSampler(Sampler):
indices = indices.view(-1)
# remove temporary pad
indices = indices[indices != -1]
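            # with repeat > 1 the index values can exceed data_len, so fold
            # them back into the real dataset range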
indices = indices % data_len
indices = indices.tolist()
self.indices = indices


@ -2,6 +2,9 @@ import logging
from collections import defaultdict
from functools import partial
import torch
import subword_nmt.apply_bpe
import sacremoses
import seq2seq.data.config as config
@ -9,37 +12,53 @@ class Tokenizer:
"""
Tokenizer class.
"""
def __init__(self, vocab_fname=None, pad=1, separator='@@'):
def __init__(self, vocab_fname=None, bpe_fname=None, lang=None, pad=1,
separator='@@'):
"""
Constructor for the Tokenizer class.
:param vocab_fname: path to the file with vocabulary
:param bpe_fname: path to the file with bpe codes
:param pad: pads vocabulary to a multiple of 'pad' tokens
:param separator: tokenization separator
"""
self.separator = separator
self.lang = lang
if bpe_fname:
with open(bpe_fname, 'r') as bpe_codes:
self.bpe = subword_nmt.apply_bpe.BPE(bpe_codes)
if vocab_fname:
self.separator = separator
self.build_vocabulary(vocab_fname, pad)
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
if lang:
self.init_moses(lang)
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
def init_moses(self, lang):
self.moses_tokenizer = sacremoses.MosesTokenizer(lang['src'])
self.moses_detokenizer = sacremoses.MosesDetokenizer(lang['tgt'])
self.pad_vocabulary(vocab, pad)
def build_vocabulary(self, vocab_fname, pad):
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
self.vocab_size = len(vocab)
logging.info(f'Size of vocabulary: {self.vocab_size}')
self.pad_vocabulary(vocab, pad)
self.tok2idx = defaultdict(partial(int, config.UNK))
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.vocab_size = len(vocab)
logging.info(f'Size of vocabulary: {self.vocab_size}')
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
self.tok2idx = defaultdict(partial(int, config.UNK))
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
def pad_vocabulary(self, vocab, pad):
"""
@ -58,8 +77,10 @@ class Tokenizer:
def get_state(self):
logging.info(f'Saving state of the tokenizer')
state = {
'lang': self.lang,
'separator': self.separator,
'vocab_size': self.vocab_size,
'bpe': self.bpe,
'tok2idx': self.tok2idx,
'idx2tok': self.idx2tok,
}
@ -67,11 +88,15 @@ class Tokenizer:
def set_state(self, state):
logging.info(f'Restoring state of the tokenizer')
self.lang = state['lang']
self.separator = state['separator']
self.vocab_size = state['vocab_size']
self.bpe = state['bpe']
self.tok2idx = state['tok2idx']
self.idx2tok = state['idx2tok']
self.init_moses(self.lang)
def segment(self, line):
"""
Tokenizes single sentence and adds special BOS and EOS tokens.
@ -85,7 +110,14 @@ class Tokenizer:
entry = [config.BOS] + entry + [config.EOS]
return entry
def detokenize(self, inputs, delim=' '):
def tokenize(self, line):
tokenized = self.moses_tokenizer.tokenize(line, return_str=True)
bpe = self.bpe.process_line(tokenized)
segmented = self.segment(bpe)
tensor = torch.tensor(segmented)
return tensor
def detokenize_bpe(self, inp, delim=' '):
"""
Detokenizes single sentence and removes token separator characters.
@ -94,7 +126,7 @@ class Tokenizer:
returns: string representing detokenized sentence
"""
detok = delim.join([self.idx2tok[idx] for idx in inputs])
detok = delim.join([self.idx2tok[idx] for idx in inp])
detok = detok.replace(self.separator + ' ', '')
detok = detok.replace(self.separator, '')
@ -103,3 +135,12 @@ class Tokenizer:
detok = detok.replace(config.PAD_TOKEN, '')
detok = detok.strip()
return detok
def detokenize_moses(self, inp):
output = self.moses_detokenizer.detokenize(inp.split())
return output
def detokenize(self, inp):
detok_bpe = self.detokenize_bpe(inp)
output = self.detokenize_moses(detok_bpe)
return output
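A hedged end-to-end sketch of the extended tokenizer (the file paths and sentence are made up; the vocab/BPE names mirror the WMT16 defaults used elsewhere in this change): raw text passes through Moses tokenization and BPE before segmentation, and `detokenize` reverses both steps.

```python
# Illustrative round trip through the extended Tokenizer.
tokenizer = Tokenizer(vocab_fname='data/wmt16_de_en/vocab.bpe.32000',
                      bpe_fname='data/wmt16_de_en/bpe.32000',
                      lang={'src': 'en', 'tgt': 'de'})

tensor = tokenizer.tokenize('A test sentence.')  # moses -> BPE -> token ids
text = tokenizer.detokenize(tensor.tolist())     # ids -> BPE merge -> moses
```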

View file

@ -8,7 +8,7 @@ class SequenceGenerator:
"""
Generator for the autoregressive inference with beam search decoding.
"""
def __init__(self, model, beam_size=5, max_seq_len=100, cuda=False,
def __init__(self, model, beam_size=5, max_seq_len=100,
len_norm_factor=0.6, len_norm_const=5,
cov_penalty_factor=0.1):
"""
@ -21,14 +21,12 @@ class SequenceGenerator:
:param model: model which implements generate method
:param beam_size: decoder beam size
:param max_seq_len: maximum decoder sequence length
:param cuda: whether to use cuda
:param len_norm_factor: length normalization factor
:param len_norm_const: length normalization constant
:param cov_penalty_factor: coverage penalty factor
"""
self.model = model
self.cuda = cuda
self.beam_size = beam_size
self.max_seq_len = max_seq_len
self.len_norm_factor = len_norm_factor
@ -51,18 +49,17 @@ class SequenceGenerator:
lengths: (batch_size) - lengths of generated translations
counter: number of iterations of the decoding loop
"""
device = initial_input.device
max_seq_len = self.max_seq_len
translation = torch.zeros(batch_size, max_seq_len, dtype=torch.int64)
lengths = torch.ones(batch_size, dtype=torch.int64)
active = torch.arange(0, batch_size, dtype=torch.int64)
base_mask = torch.arange(0, batch_size, dtype=torch.int64)
if self.cuda:
translation = translation.cuda()
lengths = lengths.cuda()
active = active.cuda()
base_mask = base_mask.cuda()
translation = torch.zeros(batch_size, max_seq_len, dtype=torch.int64,
device=device)
lengths = torch.ones(batch_size, dtype=torch.int64,
device=device)
active = torch.arange(0, batch_size, dtype=torch.int64,
device=device)
base_mask = torch.arange(0, batch_size, dtype=torch.int64,
device=device)
translation[:, 0] = BOS
words, context = initial_input, initial_context
@ -118,6 +115,7 @@ class SequenceGenerator:
lengths: (batch_size) - lengths of generated translations
counter: number of iterations of the decoding loop
"""
device = initial_input.device
beam_size = self.beam_size
norm_const = self.len_norm_const
norm_factor = self.len_norm_factor
@ -125,25 +123,19 @@ class SequenceGenerator:
cov_penalty_factor = self.cov_penalty_factor
translation = torch.zeros(batch_size * beam_size, max_seq_len,
dtype=torch.int64)
lengths = torch.ones(batch_size * beam_size, dtype=torch.int64)
scores = torch.zeros(batch_size * beam_size, dtype=torch.float32)
active = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
base_mask = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
dtype=torch.int64, device=device)
lengths = torch.ones(batch_size * beam_size,
dtype=torch.int64, device=device)
scores = torch.zeros(batch_size * beam_size,
dtype=torch.float32, device=device)
active = torch.arange(0, batch_size * beam_size,
dtype=torch.int64, device=device)
base_mask = torch.arange(0, batch_size * beam_size,
dtype=torch.int64, device=device)
global_offset = torch.arange(0, batch_size * beam_size, beam_size,
dtype=torch.int64)
eos_beam_fill = torch.tensor([0] + (beam_size - 1) * [float('-inf')])
if self.cuda:
translation = translation.cuda()
lengths = lengths.cuda()
active = active.cuda()
base_mask = base_mask.cuda()
scores = scores.cuda()
global_offset = global_offset.cuda()
eos_beam_fill = eos_beam_fill.cuda()
device=device, dtype=torch.int64)
eos_beam_fill = torch.tensor([0] + (beam_size - 1) * [float('-inf')],
dtype=torch.float32, device=device)
translation[:, 0] = BOS
@ -182,9 +174,8 @@ class SequenceGenerator:
context[1] = context[1].contiguous().view(batch_size * beam_size)
# context[1]: (batch * beam)
accu_attn_scores = torch.zeros(batch_size * beam_size, seq)
if self.cuda:
accu_attn_scores = accu_attn_scores.cuda()
accu_attn_scores = torch.zeros(batch_size * beam_size, seq,
dtype=torch.float32, device=device)
counter = 0
for idx in range(1, self.max_seq_len):
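The pattern above replaces the old `cuda` constructor flag throughout the generator; a minimal sketch of the idea, with illustrative names:

```python
import torch

# Derive the device from the input instead of threading a `cuda` flag,
# so decoding buffers land on CPU or GPU automatically.
def make_buffers(initial_input, batch_size, max_seq_len):
    device = initial_input.device
    translation = torch.zeros(batch_size, max_seq_len,
                              dtype=torch.int64, device=device)
    lengths = torch.ones(batch_size, dtype=torch.int64, device=device)
    return translation, lengths
```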

View file

@ -0,0 +1,110 @@
import collections
import itertools
import numpy as np
from pytablewriter import MarkdownTableWriter
def interleave(*args):
return list(itertools.chain(*zip(*args)))
class AccuracyTable:
def __init__(self, unit):
self.data = collections.defaultdict(dict)
self.unit = unit
def add(self, key, data):
self.data[key].update(data)
def write(self, title, write_math):
writer = MarkdownTableWriter()
writer.table_name = f'{title}'
main_header = ['**Batch Size**', '**Beam Size**']
data_header = []
if 'fp32' in write_math:
data_header += [f'**Accuracy - FP32 ({self.unit})**']
if 'fp16' in write_math:
data_header += [f'**Accuracy - FP16 ({self.unit})**']
writer.headers = main_header + data_header
writer.value_matrix = []
for k, v in self.data.items():
batch_size, beam_size = k
row = [batch_size, beam_size]
if 'fp32' in write_math:
row.append(v['fp32'])
if 'fp16' in write_math:
row.append(v['fp16'])
writer.value_matrix.append(row)
writer.write_table()
class PerformanceTable:
def __init__(self, percentiles, unit, reverse_percentiles=False):
self.percentiles = percentiles
self.data = collections.defaultdict(dict)
self.unit = unit
self.reverse_percentiles = reverse_percentiles
def add(self, key, value):
math, value = next(iter(value.items()))
value = np.array(value)
if self.reverse_percentiles:
percentiles = [100 - p for p in self.percentiles]
else:
percentiles = self.percentiles
stats = []
for p in percentiles:
val = np.percentile(value, p)
stats.append(val * self.unit_convert[self.unit])
avg = value.mean() * self.unit_convert[self.unit]
self.data[key].update({math: (avg, stats)})
def write(self, title, math, relative=None, reverse_speedup=False):
writer = MarkdownTableWriter()
writer.table_name = f'{title} - {math.upper()}'
main_header = ['**Batch Size**', '**Beam Size**']
data_header = [f'**Avg ({self.unit})**']
data_header += [f'**{p}% ({self.unit})**' for p in self.percentiles]
if relative:
speedup_header = ['**Speedup**'] * len(data_header)
data_header = interleave(data_header, speedup_header)
writer.headers = main_header + data_header
writer.value_matrix = []
for k, v in self.data.items():
batch_size, beam_size = k
avg, res_percentiles = v[math]
main = [batch_size, beam_size]
data = [avg, *res_percentiles]
if relative:
rel = self.data[k][relative]
rel_avg, rel_res_percentiles = rel
rel = [rel_avg, *rel_res_percentiles]
speedup = [d / r for (r, d) in zip(rel, data)]
if reverse_speedup:
speedup = [1 / s for s in speedup]
data = interleave(data, speedup)
writer.value_matrix.append(main + data)
writer.write_table()
class LatencyTable(PerformanceTable):
def __init__(self, percentiles, unit='ms'):
super().__init__(percentiles, unit)
self.unit_convert = {'s': 1, 'ms': 1e3, 'us': 1e6}
class ThroughputTable(PerformanceTable):
def __init__(self, percentiles, unit='tok/s', reverse_percentiles=True):
super().__init__(percentiles, unit, reverse_percentiles)
self.unit_convert = {'tok/s': 1}
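A short usage sketch for the new table helpers (all numbers are placeholders):

```python
# Record measurements under (batch_size, beam_size) keys, then render
# markdown tables via pytablewriter.
latency = LatencyTable(percentiles=(50, 90, 99), unit='ms')
latency.add((128, 5), {'fp16': [0.41, 0.39, 0.40]})  # per-batch runtimes in seconds
latency.write('Inference latency', 'fp16')

accuracy = AccuracyTable('BLEU')
accuracy.add((128, 5), {'fp16': 24.16})
accuracy.write('Inference accuracy', write_math=['fp16'])
```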

View file

@ -0,0 +1,238 @@
import logging
import subprocess
import time
import torch
import torch.distributed as dist
import seq2seq.data.config as config
import seq2seq.utils as utils
from seq2seq.inference.beam_search import SequenceGenerator
def gather_predictions(preds):
world_size = utils.get_world_size()
if world_size > 1:
all_preds = [preds.new(preds.size(0), preds.size(1)) for i in range(world_size)]
dist.all_gather(all_preds, preds)
preds = torch.cat(all_preds)
return preds
def run_sacrebleu(test_path, reference_path):
"""
Executes sacrebleu and returns BLEU score.
:param test_path: path to the test file
:param reference_path: path to the reference file
"""
sacrebleu_params = '--score-only -lc --tokenize intl'
logging.info(f'Running sacrebleu (parameters: {sacrebleu_params})')
sacrebleu = subprocess.run([f'sacrebleu --input {test_path} \
{reference_path} {sacrebleu_params}'],
stdout=subprocess.PIPE, shell=True)
test_bleu = round(float(sacrebleu.stdout.strip()), 2)
return test_bleu
class Translator:
"""
Translator runs validation on test dataset, executes inference, optionally
computes BLEU score using sacrebleu.
"""
def __init__(self,
model,
tokenizer,
loader=None,
beam_size=5,
len_norm_factor=0.6,
len_norm_const=5.0,
cov_penalty_factor=0.1,
max_seq_len=50,
print_freq=1,
reference=None,
):
self.model = model
self.tokenizer = tokenizer
self.loader = loader
self.insert_target_start = [config.BOS]
self.insert_src_start = [config.BOS]
self.insert_src_end = [config.EOS]
self.batch_first = model.batch_first
self.beam_size = beam_size
self.print_freq = print_freq
self.reference = reference
self.distributed = (utils.get_world_size() > 1)
self.generator = SequenceGenerator(
model=self.model,
beam_size=beam_size,
max_seq_len=max_seq_len,
len_norm_factor=len_norm_factor,
len_norm_const=len_norm_const,
cov_penalty_factor=cov_penalty_factor)
def run(self, calc_bleu=True, epoch=None, iteration=None, eval_path=None,
summary=False, warmup=0, reference_path=None):
"""
Runs translation on test dataset.
:param calc_bleu: if True compares results with reference and computes
BLEU score
:param epoch: index of the current epoch
:param iteration: index of the current iteration
:param eval_path: path to the file for saving results
:param summary: if True prints summary
:param reference_path: path to the file with reference translation
"""
if reference_path is None:
reference_path = self.reference
device = next(self.model.parameters()).device
test_bleu = torch.tensor([0.], device=device)
rank = utils.get_rank()
logging.info(f'Running evaluation on test set')
self.model.eval()
output, eval_stats = self.evaluate(self.loader, epoch, iteration,
warmup, summary)
output = output[:len(self.loader.dataset)]
output = self.loader.dataset.unsort(output)
if rank == 0 and eval_path:
with open(eval_path, 'w') as eval_file:
lines = [line + '\n' for line in output]
eval_file.writelines(lines)
if calc_bleu:
test_bleu[0] = run_sacrebleu(eval_path, reference_path)
if summary:
logging.info(f'BLEU on test dataset: {test_bleu[0]:.2f}')
utils.barrier()
logging.info(f'Finished evaluation on test set')
if self.distributed:
dist.broadcast(test_bleu, 0)
if calc_bleu:
eval_stats['bleu'] = test_bleu[0].item()
else:
eval_stats['bleu'] = None
return output, eval_stats
def evaluate(self, loader, epoch=0, iteration=0, warmup=0, summary=False):
"""
Runs evaluation on test dataset.
:param epoch: index of the current epoch
:param iteration: index of the current iteration
:param summary: if True prints summary
"""
device = next(self.model.parameters()).device
batch_time = utils.AverageMeter(warmup, keep=True)
tot_tok_per_sec = utils.AverageMeter(warmup, keep=True)
iterations = utils.AverageMeter()
enc_seq_len = utils.AverageMeter()
dec_seq_len = utils.AverageMeter()
stats = {}
batch_size = loader.batch_size
global_batch_size = batch_size * utils.get_world_size()
beam_size = self.beam_size
bos = [self.insert_target_start] * (batch_size * beam_size)
bos = torch.tensor(bos, dtype=torch.int64, device=device)
if self.batch_first:
bos = bos.view(-1, 1)
else:
bos = bos.view(1, -1)
if beam_size == 1:
generator = self.generator.greedy_search
else:
generator = self.generator.beam_search
output = []
for i, (src, indices) in enumerate(loader):
translate_timer = time.time()
src, src_length = src
stats['total_enc_len'] = int(src_length.sum())
src = src.to(device)
src_length = src_length.to(device)
with torch.no_grad():
context = self.model.encode(src, src_length)
context = [context, src_length, None]
preds, lengths, counter = generator(batch_size, bos, context)
stats['total_dec_len'] = lengths.sum().item()
stats['iters'] = counter
indices = torch.tensor(indices).to(preds)
preds = preds.scatter(0, indices.unsqueeze(1).expand_as(preds), preds)
preds = gather_predictions(preds).cpu()
for pred in preds:
pred = pred.tolist()
detok = self.tokenizer.detokenize(pred)
output.append(detok)
elapsed = time.time() - translate_timer
batch_time.update(elapsed, batch_size)
total_tokens = stats['total_dec_len'] + stats['total_enc_len']
ttps = total_tokens / elapsed
tot_tok_per_sec.update(ttps, batch_size)
iterations.update(stats['iters'])
enc_seq_len.update(stats['total_enc_len'] / batch_size, batch_size)
dec_seq_len.update(stats['total_dec_len'] / batch_size, batch_size)
if i % self.print_freq == self.print_freq - 1:
log = []
log += f'TEST '
if epoch is not None:
log += f'[{epoch}]'
if iteration is not None:
log += f'[{iteration}]'
log += f'[{i}/{len(loader)}]\t'
log += f'Time {batch_time.val:.4f} ({batch_time.avg:.4f})\t'
log += f'Decoder iters {iterations.val:.1f} ({iterations.avg:.1f})\t'
log += f'Tok/s {tot_tok_per_sec.val:.0f} ({tot_tok_per_sec.avg:.0f})'
log = ''.join(log)
logging.info(log)
tot_tok_per_sec.reduce('sum')
enc_seq_len.reduce('mean')
dec_seq_len.reduce('mean')
batch_time.reduce('mean')
iterations.reduce('sum')
if summary and utils.get_rank() == 0:
time_per_sentence = (batch_time.avg / global_batch_size)
log = []
log += f'TEST SUMMARY:\n'
log += f'Lines translated: {len(loader.dataset)}\t'
log += f'Avg total tokens/s: {tot_tok_per_sec.avg:.0f}\n'
log += f'Avg time per batch: {batch_time.avg:.3f} s\t'
log += f'Avg time per sentence: {1000*time_per_sentence:.3f} ms\n'
log += f'Avg encoder seq len: {enc_seq_len.avg:.2f}\t'
log += f'Avg decoder seq len: {dec_seq_len.avg:.2f}\t'
log += f'Total decoder iterations: {int(iterations.sum)}'
log = ''.join(log)
logging.info(log)
eval_stats = {}
eval_stats['tokens_per_sec'] = tot_tok_per_sec.avg
eval_stats['runtimes'] = batch_time.vals
eval_stats['throughputs'] = tot_tok_per_sec.vals
return output, eval_stats

View file

@ -144,7 +144,7 @@ class BahdanauAttention(nn.Module):
if self.mask is not None:
mask = self.mask.unsqueeze(1).expand(b, t_q, t_k)
# I can't use -INF because of overflow check in pytorch
scores.data.masked_fill_(mask, -65504.0)
scores.masked_fill_(mask, -65504.0)
# Normalize the scores, softmax over t_k
scores_normalized = F.softmax(scores, dim=-1)
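-65504 is the most negative normal float16 value, i.e. `torch.finfo(torch.float16).min`; a small sketch of why it stands in for `-inf` under mixed precision:

```python
import torch

# Using the fp16 minimum instead of float('-inf') keeps the softmax
# input finite, so PyTorch's overflow checks pass while masked
# positions still get (numerically) zero probability.
scores = torch.zeros(2, 3, dtype=torch.float16)
mask = torch.tensor([[False, True, False],
                     [True, False, False]])
scores.masked_fill_(mask, torch.finfo(torch.float16).min)
probs = torch.softmax(scores.float(), dim=-1)  # masked entries ~ 0
```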

View file

@ -3,9 +3,11 @@ import math
import torch
from torch.nn.utils import clip_grad_norm_
import apex.amp._amp_state
from apex import amp
class Fp16Optimizer:
class FP16Optimizer:
"""
Mixed precision optimizer with dynamic loss scaling and backoff.
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor
@ -34,12 +36,12 @@ class Fp16Optimizer:
for param, new_param in zip(params, new_params):
param.data.copy_(new_param.data)
def __init__(self, fp16_model, grad_clip=float('inf'), loss_scale=8192,
def __init__(self, model, grad_clip=float('inf'), loss_scale=8192,
dls_downscale=2, dls_upscale=2, dls_upscale_interval=128):
"""
Constructor for the Fp16Optimizer.
:param fp16_model: model (previously casted to half)
:param model: model
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
:param loss_scale: initial loss scale
@ -51,7 +53,7 @@ class Fp16Optimizer:
:param dls_upscale_interval: interval for loss scale upscaling
"""
logging.info('Initializing fp16 optimizer')
self.initialize_model(fp16_model)
self.initialize_model(model)
self.since_last_invalid = 0
self.loss_scale = loss_scale
@ -66,9 +68,11 @@ class Fp16Optimizer:
:param model: fp16 model
"""
logging.info('Converting model to half precision')
model.half()
logging.info('Initializing fp32 clone weights')
self.fp16_model = model
self.fp16_model.zero_grad()
self.model = model
self.model.zero_grad()
self.fp32_params = [param.to(torch.float32).detach()
for param in model.parameters()]
@ -93,7 +97,7 @@ class Fp16Optimizer:
loss.backward()
if update:
self.set_grads(self.fp32_params, self.fp16_model.parameters())
self.set_grads(self.fp32_params, self.model.parameters())
if self.loss_scale != 1.0:
for param in self.fp32_params:
param.grad.data /= self.loss_scale
@ -103,7 +107,7 @@ class Fp16Optimizer:
if math.isfinite(norm):
scheduler.step()
optimizer.step()
self.set_weights(self.fp16_model.parameters(),
self.set_weights(self.model.parameters(),
self.fp32_params)
self.since_last_invalid += 1
else:
@ -118,10 +122,10 @@ class Fp16Optimizer:
logging.info(f'Upscaling, new scale: {self.loss_scale}')
self.since_last_invalid = 0
self.fp16_model.zero_grad()
self.model.zero_grad()
class Fp32Optimizer:
class FP32Optimizer:
"""
Standard optimizer, computes backward and applies weight update.
"""
@ -161,3 +165,54 @@ class Fp32Optimizer:
scheduler.step()
optimizer.step()
self.model.zero_grad()
class AMPOptimizer:
"""
Optimizer compatible with AMP.
Uses AMP to apply loss scaling, computes backward and applies weight
update.
"""
def __init__(self, model, grad_clip=None, loss_scale=8192,
dls_upscale_interval=128):
"""
Constructor for the AMPOptimizer
:param model: model
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
:param loss_scale: initial loss scale
:param dls_upscale_interval: interval for loss scale upscaling
"""
logging.info('Initializing amp optimizer')
self.initialize_model(model)
self.grad_clip = grad_clip
loss_scaler = apex.amp._amp_state.loss_scalers[0]
loss_scaler._loss_scale = loss_scale
loss_scaler._scale_seq_len = dls_upscale_interval
def initialize_model(self, model):
"""
Initializes state of the model.
:param model: model
"""
self.model = model
self.model.zero_grad()
def step(self, loss, optimizer, scheduler, update=True):
"""
Performs one step of the optimizer.
:param loss: value of loss function
:param optimizer: optimizer
:param update: if True executes weight update
"""
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
if update:
if self.grad_clip != float('inf'):
clip_grad_norm_(amp.master_params(optimizer), self.grad_clip)
scheduler.step()
optimizer.step()
self.model.zero_grad()
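For readers comparing the two paths: a condensed, illustrative sketch of the dynamic loss-scaling backoff that `FP16Optimizer` implements by hand and `AMPOptimizer` delegates to Apex:

```python
import math

# Condensed control flow of dynamic loss scaling with backoff
# (illustrative; the real logic lives in FP16Optimizer.step).
def scaled_step(grad_norm, loss_scale, since_last_invalid,
                downscale=2, upscale=2, upscale_interval=128):
    if math.isfinite(grad_norm):
        # healthy gradients: apply the update, maybe grow the scale
        since_last_invalid += 1
        if since_last_invalid >= upscale_interval:
            loss_scale *= upscale
            since_last_invalid = 0
        applied = True
    else:
        # overflow: skip the update and shrink the scale
        loss_scale /= downscale
        since_last_invalid = 0
        applied = False
    return applied, loss_scale, since_last_invalid
```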

View file

@ -0,0 +1,32 @@
from pytablewriter import MarkdownTableWriter
class TrainingTable:
def __init__(self, acc_unit='BLEU', time_unit='min', perf_unit='tok/s'):
self.data = []
self.acc_unit = acc_unit
self.time_unit = time_unit
self.perf_unit = perf_unit
self.time_unit_convert = {'s': 1, 'min': 1/60, 'h': 1/3600}
def add(self, gpus, batch_size, accuracy, perf, time_to_train):
time_to_train *= self.time_unit_convert[self.time_unit]
if not accuracy:
accuracy = 0.0
accuracy = round(accuracy, 2)
self.data.append([gpus, batch_size, accuracy, perf, time_to_train])
def write(self, title, math):
writer = MarkdownTableWriter()
writer.table_name = f'{title}'
header = [f'**GPUs**',
f'**Batch Size / GPU**',
f'**Accuracy - {math.upper()} ({self.acc_unit})**',
f'**Throughput - {math.upper()} ({self.perf_unit})**',
f'**Time to Train - {math.upper()} ({self.time_unit})**',
]
writer.headers = header
writer.value_matrix = self.data
writer.write_table()
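Usage sketch (all numbers are placeholders):

```python
table = TrainingTable()
# gpus, per-GPU batch size, BLEU, throughput (tok/s), time to train (s)
table.add(8, 128, 24.45, 340000, 7200)
table.write('Training Summary', 'fp16')
```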

View file

@ -7,10 +7,12 @@ import numpy as np
import torch
import torch.optim
import torch.utils.data
from apex.parallel import DistributedDataParallel as DDP
from apex.parallel import DistributedDataParallel
from apex import amp
from seq2seq.train.fp_optimizers import Fp16Optimizer
from seq2seq.train.fp_optimizers import Fp32Optimizer
from seq2seq.train.fp_optimizers import FP16Optimizer
from seq2seq.train.fp_optimizers import FP32Optimizer
from seq2seq.train.fp_optimizers import AMPOptimizer
from seq2seq.train.lr_scheduler import WarmupMultiStepLR
from seq2seq.utils import AverageMeter
from seq2seq.utils import sync_workers
@ -28,16 +30,15 @@ class Seq2SeqTrainer:
print_freq=10,
save_freq=1000,
grad_clip=float('inf'),
batch_first=False,
save_info={},
save_path='.',
save_dir='.',
train_iterations=0,
checkpoint_filename='checkpoint%s.pth',
keep_checkpoints=5,
math='fp32',
cuda=True,
distributed=False,
loss_scaling={},
intra_epoch_eval=0,
prealloc_mode='always',
iter_size=1,
translator=None,
verbose=False):
@ -52,18 +53,17 @@ class Seq2SeqTrainer:
:param print_freq: prints short summary every 'print_freq' iterations
:param save_freq: saves checkpoint every 'save_freq' iterations
:param grad_clip: coefficient for gradient clipping
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param save_info: dict with additional state stored in each checkpoint
:param save_path: path to the directory for checkpoints
:param save_dir: path to the directory for checkpoints
:param train_iterations: total number of training iterations to execute
:param checkpoint_filename: name of files with checkpoints
:param keep_checkpoints: max number of checkpoints to keep
:param math: arithmetic type
:param cuda: if True use cuda, if False train on cpu
:param distributed: if True run distributed training
:param loss_scaling: options for dynamic loss scaling
:param intra_epoch_eval: number of additional eval runs within each
training epoch
:param prealloc_mode: controls preallocation,
choices=['off', 'once', 'always']
:param iter_size: number of iterations between weight updates
:param translator: instance of Translator, runs inference on test set
:param verbose: enables verbose logging
@ -73,38 +73,36 @@ class Seq2SeqTrainer:
self.criterion = criterion
self.epoch = 0
self.save_info = save_info
self.save_path = save_path
self.save_dir = save_dir
self.save_freq = save_freq
self.save_counter = 0
self.checkpoint_filename = checkpoint_filename
self.checkpoint_counter = cycle(range(keep_checkpoints))
self.opt_config = opt_config
self.cuda = cuda
self.distributed = distributed
self.device = next(model.parameters()).device
self.print_freq = print_freq
self.batch_first = batch_first
self.verbose = verbose
self.loss = None
self.translator = translator
self.intra_epoch_eval = intra_epoch_eval
self.iter_size = iter_size
self.prealloc_mode = prealloc_mode
self.preallocated = False
if cuda:
self.model = self.model.cuda()
self.criterion = self.criterion.cuda()
self.distributed = torch.distributed.is_initialized()
self.batch_first = model.batch_first
if math == 'fp16':
self.model = self.model.half()
params = self.model.parameters()
if distributed:
self.model = DDP(self.model)
if math == 'fp16':
self.fp_optimizer = Fp16Optimizer(self.model, grad_clip)
if math == 'manual_fp16':
self.fp_optimizer = FP16Optimizer(
self.model, grad_clip,
loss_scale=loss_scaling['init_scale'],
dls_upscale_interval=loss_scaling['upscale_interval']
)
params = self.fp_optimizer.fp32_params
elif math == 'fp32':
self.fp_optimizer = Fp32Optimizer(self.model, grad_clip)
params = self.model.parameters()
self.fp_optimizer = FP32Optimizer(self.model, grad_clip)
opt_name = opt_config.pop('optimizer')
self.optimizer = torch.optim.__dict__[opt_name](params, **opt_config)
@ -113,6 +111,24 @@ class Seq2SeqTrainer:
self.scheduler = WarmupMultiStepLR(self.optimizer, train_iterations,
**scheduler_config)
if math == 'fp16':
self.model, self.optimizer = amp.initialize(
self.model,
self.optimizer,
cast_model_outputs=torch.float16,
keep_batchnorm_fp32=False,
opt_level='O2')
self.fp_optimizer = AMPOptimizer(
self.model,
grad_clip,
loss_scale=loss_scaling['init_scale'],
dls_upscale_interval=loss_scaling['upscale_interval']
)
if self.distributed:
self.model = DistributedDataParallel(self.model)
def iterate(self, src, tgt, update=True, training=True):
"""
Performs one iteration of the training/validation.
@ -124,18 +140,14 @@ class Seq2SeqTrainer:
"""
src, src_length = src
tgt, tgt_length = tgt
src_length = torch.LongTensor(src_length)
tgt_length = torch.LongTensor(tgt_length)
src = src.to(self.device)
tgt = tgt.to(self.device)
src_length = src_length.to(self.device)
num_toks = {}
num_toks['tgt'] = int(sum(tgt_length - 1))
num_toks['src'] = int(sum(src_length))
if self.cuda:
src = src.cuda()
src_length = src_length.cuda()
tgt = tgt.cuda()
if self.batch_first:
output = self.model(src, src_length, tgt[:, :-1])
tgt_labels = tgt[:, 1:]
@ -177,8 +189,8 @@ class Seq2SeqTrainer:
batch_time = AverageMeter()
data_time = AverageMeter()
losses_per_token = AverageMeter(skip_first=False)
losses_per_sentence = AverageMeter(skip_first=False)
losses_per_token = AverageMeter()
losses_per_sentence = AverageMeter()
tot_tok_time = AverageMeter()
src_tok_time = AverageMeter()
@ -214,9 +226,15 @@ class Seq2SeqTrainer:
self.loss = losses_per_token.avg
if training and i in eval_iters:
test_bleu, _ = self.translator.run(calc_bleu=True,
epoch=self.epoch,
iteration=i)
eval_fname = f'eval_epoch_{self.epoch}_iter_{i}'
eval_path = os.path.join(self.save_dir, eval_fname)
_, eval_stats = self.translator.run(
calc_bleu=True,
epoch=self.epoch,
iteration=i,
eval_path=eval_path,
)
test_bleu = eval_stats['bleu']
log = []
log += [f'TRAIN [{self.epoch}][{i}/{len(data_loader)}]']
@ -225,7 +243,8 @@ class Seq2SeqTrainer:
logging.info(log)
self.model.train()
self.preallocate(data_loader, training=True)
self.preallocate(data_loader.batch_size,
data_loader.dataset.max_len, training=True)
if i % self.print_freq == 0:
phase = 'TRAIN' if training else 'VALIDATION'
@ -262,31 +281,37 @@ class Seq2SeqTrainer:
return losses_per_token.avg, tot_tok_time.avg
def preallocate(self, data_loader, training):
def preallocate(self, batch_size, max_length, training):
"""
Generates maximum sequence length batch and runs forward and backward
pass without updating model parameters.
:param data_loader: data loader
:param batch_size: batch size for preallocation
:param max_length: max sequence length for preallocation
:param training: if True preallocates memory for backward pass
"""
batch_size = data_loader.batch_size
max_len = data_loader.dataset.max_len
if self.prealloc_mode == 'always' or (self.prealloc_mode == 'once' and
not self.preallocated):
logging.info('Executing preallocation')
torch.cuda.empty_cache()
src_length = [max_len] * batch_size
tgt_length = [max_len] * batch_size
src_length = torch.full((batch_size,), max_length,
dtype=torch.int64)
tgt_length = torch.full((batch_size,), max_length,
dtype=torch.int64)
if self.batch_first:
shape = (batch_size, max_len)
else:
shape = (max_len, batch_size)
if self.batch_first:
shape = (batch_size, max_length)
else:
shape = (max_length, batch_size)
src = torch.full(shape, 4, dtype=torch.int64)
tgt = torch.full(shape, 4, dtype=torch.int64)
src = src, src_length
tgt = tgt, tgt_length
self.iterate(src, tgt, update=False, training=training)
self.model.zero_grad()
src = torch.full(shape, 4, dtype=torch.int64)
tgt = torch.full(shape, 4, dtype=torch.int64)
src = src, src_length
tgt = tgt, tgt_length
self.iterate(src, tgt, update=False, training=training)
self.model.zero_grad()
self.preallocated = True
def optimize(self, data_loader):
"""
@ -297,11 +322,12 @@ class Seq2SeqTrainer:
"""
torch.set_grad_enabled(True)
self.model.train()
torch.cuda.empty_cache()
self.preallocate(data_loader, training=True)
self.preallocate(data_loader.batch_size, data_loader.dataset.max_len,
training=True)
output = self.feed_data(data_loader, training=True)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
def evaluate(self, data_loader):
@ -313,11 +339,12 @@ class Seq2SeqTrainer:
"""
torch.set_grad_enabled(False)
self.model.eval()
torch.cuda.empty_cache()
self.preallocate(data_loader, training=False)
self.preallocate(data_loader.batch_size, data_loader.dataset.max_len,
training=False)
output = self.feed_data(data_loader, training=False)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
def load(self, filename):
@ -352,7 +379,7 @@ class Seq2SeqTrainer:
"""
def write_checkpoint(state, filename):
filename = os.path.join(self.save_path, filename)
filename = os.path.join(self.save_dir, filename)
logging.info(f'Saving model to {filename}')
torch.save(state, filename)
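To summarize the new precision plumbing: `manual_fp16` keeps the hand-rolled master-weights path, `fp32` is unchanged, and `fp16` now routes through Apex AMP with opt level O2. A sketch of the selection logic (condensed from the constructor above, not verbatim):

```python
# Illustrative mapping from --math to the optimizer wrappers.
def pick_fp_optimizer(math, model, grad_clip, loss_scaling):
    if math == 'manual_fp16':
        return FP16Optimizer(
            model, grad_clip,
            loss_scale=loss_scaling['init_scale'],
            dls_upscale_interval=loss_scaling['upscale_interval'])
    if math == 'fp32':
        return FP32Optimizer(model, grad_clip)
    if math == 'fp16':
        # model/optimizer must first go through amp.initialize(opt_level='O2')
        return AMPOptimizer(
            model, grad_clip,
            loss_scale=loss_scaling['init_scale'],
            dls_upscale_interval=loss_scaling['upscale_interval'])
    raise ValueError(f'unknown math mode: {math}')
```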

View file

@ -61,7 +61,7 @@ def broadcast_seeds(seeds, device):
:param device: torch.device
"""
if torch.distributed.is_available() and torch.distributed.is_initialized():
seeds_tensor = torch.LongTensor(seeds).to(device)
seeds_tensor = torch.tensor(seeds, dtype=torch.int64, device=device)
torch.distributed.broadcast(seeds_tensor, 0)
seeds = seeds_tensor.tolist()
return seeds
@ -110,13 +110,10 @@ def setup_seeds(master_seed, epochs, device):
def barrier():
"""
Works as a temporary distributed barrier, currently pytorch
doesn't implement barrier for NCCL backend.
Calls all_reduce on dummy tensor and synchronizes with GPU.
Calls torch.distributed.barrier() if distributed training is in use
"""
if torch.distributed.is_available() and torch.distributed.is_initialized():
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
torch.cuda.synchronize()
torch.distributed.barrier()
def get_rank():
@ -244,7 +241,7 @@ def log_env_info():
def pad_vocabulary(math):
if math == 'fp16':
if math == 'fp16' or math == 'manual_fp16':
pad_vocab = 8
elif math == 'fp32':
pad_vocab = 1
@ -269,78 +266,6 @@ def benchmark(test_acc, target_acc, test_perf, target_perf):
passed &= test(test_perf, target_perf, 'Performance')
return passed
class AverageMeter:
"""
Computes and stores the average and current value
"""
def __init__(self, skip_first=True):
self.reset()
self.skip = skip_first
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
if self.skip:
self.skip = False
else:
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
def reduce(self, op):
"""
Reduces average value over all workers.
:param op: 'sum' or 'mean', reduction operator
"""
if op not in ('sum', 'mean'):
raise NotImplementedError
distributed = (get_world_size() > 1)
if distributed:
# Backward/forward compatibility around
# https://github.com/pytorch/pytorch/commit/540ef9b1fc5506369a48491af8a285a686689b36 and
# https://github.com/pytorch/pytorch/commit/044d00516ccd6572c0d6ab6d54587155b02a3b86
# To accommodate changes in PyTorch's distributed API
if hasattr(dist, "get_backend"):
_backend = dist.get_backend()
if hasattr(dist, "DistBackend"):
backend_enum_holder = dist.DistBackend
else:
backend_enum_holder = dist.Backend
else:
_backend = dist._backend
backend_enum_holder = dist.dist_backend
cuda = _backend == backend_enum_holder.NCCL
if cuda:
avg = torch.cuda.FloatTensor([self.avg])
_sum = torch.cuda.FloatTensor([self.sum])
else:
avg = torch.FloatTensor([self.avg])
_sum = torch.FloatTensor([self.sum])
try:
_reduce_op = dist.ReduceOp
except AttributeError:
_reduce_op = dist.reduce_op
dist.all_reduce(avg, op=_reduce_op.SUM)
dist.all_reduce(_sum, op=_reduce_op.SUM)
self.avg = avg.item()
self.sum = _sum.item()
if op == 'mean':
self.avg /= get_world_size()
self.sum /= get_world_size()
def debug_tensor(tensor, name):
"""
@ -356,3 +281,62 @@ def debug_tensor(tensor, name):
logging.info(f'MIN: {tensor.min()} MAX: {tensor.max()} '
f'AVG: {tensor.mean()} STD: {tensor.std()} '
f'NAN: {np.isnan(tensor).sum()} INF: {np.isinf(tensor).sum()}')
class AverageMeter:
"""
Computes and stores the average and current value
"""
def __init__(self, warmup=0, keep=False):
self.reset()
self.warmup = warmup
self.keep = keep
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
self.iters = 0
self.vals = []
def update(self, val, n=1):
self.iters += 1
self.val = val
if self.iters > self.warmup:
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
if self.keep:
self.vals.append(val)
def reduce(self, op):
"""
Reduces average value over all workers.
:param op: 'sum' or 'mean', reduction operator
"""
if op not in ('sum', 'mean'):
raise NotImplementedError
distributed = (get_world_size() > 1)
if distributed:
backend = dist.get_backend()
cuda = (backend == dist.Backend.NCCL)
if cuda:
avg = torch.cuda.FloatTensor([self.avg])
_sum = torch.cuda.FloatTensor([self.sum])
else:
avg = torch.FloatTensor([self.avg])
_sum = torch.FloatTensor([self.sum])
dist.all_reduce(avg)
dist.all_reduce(_sum)
self.avg = avg.item()
self.sum = _sum.item()
if op == 'mean':
self.avg /= get_world_size()
self.sum /= get_world_size()
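A short usage sketch of the reworked meter (values are placeholders):

```python
meter = AverageMeter(warmup=1, keep=True)
for v in (10.0, 2.0, 4.0):
    meter.update(v)
# the first (warmup) value is excluded from the running average,
# but kept in `vals` for percentile-style reporting
assert meter.avg == 3.0 and meter.vals == [10.0, 2.0, 4.0]
meter.reduce('mean')  # no-op unless torch.distributed is initialized
```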

View file

@ -3,6 +3,7 @@ import argparse
import logging
import os
import sys
import time
from ast import literal_eval
import torch.nn as nn
@ -17,9 +18,10 @@ from seq2seq.data.dataset import LazyParallelDataset
from seq2seq.data.dataset import ParallelDataset
from seq2seq.data.dataset import TextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.inference.translator import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.train.smoothing import LabelSmoothing
from seq2seq.train.table import TrainingTable
def parse_args():
@ -44,17 +46,55 @@ def parse_args():
dataset = parser.add_argument_group('dataset setup')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en',
help='path to the directory with training/test data')
dataset.add_argument('--max-size', default=None, type=int,
help='use at most MAX_SIZE elements from training \
dataset (useful for benchmarking), by default \
uses entire dataset')
dataset.add_argument('--src-lang',
default='en',
help='source language')
dataset.add_argument('--tgt-lang',
default='de',
help='target language')
dataset.add_argument('--vocab',
default='vocab.bpe.32000',
help='path to the vocabulary file \
(relative to DATASET_DIR directory)')
dataset.add_argument('-bpe', '--bpe-codes', default='bpe.32000',
help='path to the file with bpe codes \
(relative to DATASET_DIR directory)')
dataset.add_argument('--train-src',
default='train.tok.clean.bpe.32000.en',
help='path to the training source data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--train-tgt',
default='train.tok.clean.bpe.32000.de',
help='path to the training target data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--val-src',
default='newstest_dev.tok.clean.bpe.32000.en',
help='path to the validation source data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--val-tgt',
default='newstest_dev.tok.clean.bpe.32000.de',
help='path to the validation target data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--test-src',
default='newstest2014.tok.bpe.32000.en',
help='path to the test source data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--test-tgt',
default='newstest2014.de',
help='path to the test target data file \
(relative to DATASET_DIR directory)')
# results
results = parser.add_argument_group('results setup')
results.add_argument('--results-dir', default='results',
help='path to directory with results, it will be \
automatically created if it does not exist')
results.add_argument('--save', default='gnmt',
results.add_argument('--save-dir', default='gnmt',
help='defines subdirectory within RESULTS_DIR for \
results from this training run')
results.add_argument('--print-freq', default=10, type=int,
@ -63,7 +103,7 @@ def parse_args():
# model
model = parser.add_argument_group('model setup')
model.add_argument('--hidden-size', default=1024, type=int,
help='model hidden size')
help='hidden size of the model')
model.add_argument('--num-layers', default=4, type=int,
help='number of RNN layers in encoder and in decoder')
model.add_argument('--dropout', default=0.2, type=float,
@ -79,12 +119,16 @@ def parse_args():
# setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', default='fp16', choices=['fp16', 'fp32'],
help='arithmetic type')
general.add_argument('--math', default='fp16',
choices=['fp16', 'fp32', 'manual_fp16'],
help='precision')
general.add_argument('--seed', default=None, type=int,
help='master seed for random number generators, if \
"seed" is undefined then the master seed will be \
sampled from random.SystemRandom()')
general.add_argument('--prealloc-mode', default='always', type=str,
choices=['off', 'once', 'always'],
help='controls preallocation')
exclusive_group(group=general, name='eval', default=True,
help='run validation and test after every epoch')
@ -100,6 +144,10 @@ def parse_args():
# training
training = parser.add_argument_group('training setup')
dataset.add_argument('--train-max-size', default=None, type=int,
help='use at most TRAIN_MAX_SIZE elements from \
training dataset (useful for benchmarking), by \
default uses entire dataset')
training.add_argument('--train-batch-size', default=128, type=int,
help='training batch size per worker')
training.add_argument('--train-global-batch-size', default=None, type=int,
@ -121,10 +169,10 @@ def parse_args():
training.add_argument('--grad-clip', default=5.0, type=float,
help='enables gradient clipping and sets maximum \
norm of gradients')
training.add_argument('--max-length-train', default=50, type=int,
training.add_argument('--train-max-length', default=50, type=int,
help='maximum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--min-length-train', default=0, type=int,
training.add_argument('--train-min-length', default=0, type=int,
help='minimum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--train-loader-workers', default=2, type=int,
@ -149,6 +197,15 @@ def parse_args():
default="{}",
help='extra options for the optimizer')
# mixed precision loss scaling
loss_scaling = parser.add_argument_group(
'mixed precision loss scaling setup'
)
loss_scaling.add_argument('--init-scale', type=float, default=8192,
help='initial loss scale')
loss_scaling.add_argument('--upscale-interval', type=float, default=128,
help='loss upscaling interval')
# scheduler
scheduler = parser.add_argument_group('learning rate scheduler setup')
scheduler.add_argument('--warmup-steps', type=str, default='200',
@ -166,10 +223,10 @@ def parse_args():
val = parser.add_argument_group('validation setup')
val.add_argument('--val-batch-size', default=64, type=int,
help='batch size for validation')
val.add_argument('--max-length-val', default=125, type=int,
val.add_argument('--val-max-length', default=125, type=int,
help='maximum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--min-length-val', default=0, type=int,
val.add_argument('--val-min-length', default=0, type=int,
help='minimum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--val-loader-workers', default=0, type=int,
@ -179,10 +236,10 @@ def parse_args():
test = parser.add_argument_group('test setup')
test.add_argument('--test-batch-size', default=128, type=int,
help='batch size for test')
test.add_argument('--max-length-test', default=150, type=int,
test.add_argument('--test-max-length', default=150, type=int,
help='maximum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--min-length-test', default=0, type=int,
test.add_argument('--test-min-length', default=0, type=int,
help='minimum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--beam-size', default=5, type=int,
@ -232,6 +289,18 @@ def parse_args():
args = parser.parse_args()
args.lang = {'src': args.src_lang, 'tgt': args.tgt_lang}
args.save_dir = os.path.join(args.results_dir, args.save_dir)
args.vocab = os.path.join(args.dataset_dir, args.vocab)
args.bpe_codes = os.path.join(args.dataset_dir, args.bpe_codes)
args.train_src = os.path.join(args.dataset_dir, args.train_src)
args.train_tgt = os.path.join(args.dataset_dir, args.train_tgt)
args.val_src = os.path.join(args.dataset_dir, args.val_src)
args.val_tgt = os.path.join(args.dataset_dir, args.val_tgt)
args.test_src = os.path.join(args.dataset_dir, args.test_src)
args.test_tgt = os.path.join(args.dataset_dir, args.test_tgt)
args.warmup_steps = literal_eval(args.warmup_steps)
args.remain_steps = literal_eval(args.remain_steps)
args.decay_interval = literal_eval(args.decay_interval)
@ -239,6 +308,25 @@ def parse_args():
return args
def set_iter_size(train_iter_size, train_global_batch_size, train_batch_size):
"""
Automatically set train_iter_size based on train_global_batch_size,
world_size and per-worker train_batch_size
:param train_iter_size: initial number of iterations between weight updates
:param train_global_batch_size: global training batch size
:param train_batch_size: local training batch size
"""
if train_global_batch_size is not None:
global_bs = train_global_batch_size
bs = train_batch_size
world_size = utils.get_world_size()
assert global_bs % (bs * world_size) == 0
train_iter_size = global_bs // (bs * world_size)
logging.info(f'Global batch size was set, '
f'Setting train_iter_size to {train_iter_size}')
return train_iter_size
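For example (hypothetical numbers), with 4 workers and a per-worker batch of 128, requesting a global batch of 1024 yields two gradient-accumulation steps:

```python
# assuming utils.get_world_size() == 4:
#   1024 // (128 * 4) == 2
iter_size = set_iter_size(train_iter_size=1,
                          train_global_batch_size=1024,
                          train_batch_size=128)
```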
def build_criterion(vocab_size, padding_idx, smoothing):
if smoothing == 0.:
logging.info(f'Building CrossEntropyLoss')
@ -252,47 +340,39 @@ def build_criterion(vocab_size, padding_idx, smoothing):
return criterion
@utils.timer('TOTAL RUNTIME', sync_gpu=False)
def main():
"""
Launches data-parallel multi-gpu training.
"""
training_start = time.time()
args = parse_args()
device = utils.set_device(args.cuda, args.local_rank)
distributed = utils.init_distributed(args.cuda)
utils.init_distributed(args.cuda)
args.rank = utils.get_rank()
if not args.cudnn:
torch.backends.cudnn.enabled = False
# create directory for results
save_path = os.path.join(args.results_dir, args.save)
args.save_path = save_path
os.makedirs(save_path, exist_ok=True)
os.makedirs(args.save_dir, exist_ok=True)
# setup logging
log_filename = f'log_rank_{utils.get_rank()}.log'
utils.setup_logging(args.log_all_ranks,
os.path.join(save_path, log_filename))
os.path.join(args.save_dir, log_filename))
if args.env:
utils.log_env_info()
logging.info(f'Saving results to: {save_path}')
logging.info(f'Saving results to: {args.save_dir}')
logging.info(f'Run arguments: {args}')
# automatically set train_iter_size based on train_global_batch_size,
# world_size and per-worker train_batch_size
if args.train_global_batch_size is not None:
global_bs = args.train_global_batch_size
bs = args.train_batch_size
world_size = utils.get_world_size()
assert global_bs % (bs * world_size) == 0
args.train_iter_size = global_bs // (bs * world_size)
logging.info(f'Global batch size was set in the config, '
f'Setting train_iter_size to {args.train_iter_size}')
args.train_iter_size = set_iter_size(args.train_iter_size,
args.train_global_batch_size,
args.train_batch_size)
worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed, args.epochs,
worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed,
args.epochs,
device)
worker_seed = worker_seeds[args.rank]
logging.info(f'Worker {args.rank} is using worker seed: {worker_seed}')
@ -300,48 +380,54 @@ def main():
# build tokenizer
pad_vocab = utils.pad_vocabulary(args.math)
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME),
pad_vocab)
tokenizer = Tokenizer(args.vocab, args.bpe_codes, args.lang, pad_vocab)
# build datasets
train_data = LazyParallelDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_TRAIN_FNAME),
tgt_fname=os.path.join(args.dataset_dir, config.TGT_TRAIN_FNAME),
src_fname=args.train_src,
tgt_fname=args.train_tgt,
tokenizer=tokenizer,
min_len=args.min_length_train,
max_len=args.max_length_train,
min_len=args.train_min_length,
max_len=args.train_max_length,
sort=False,
max_size=args.max_size)
max_size=args.train_max_size,
)
val_data = ParallelDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_VAL_FNAME),
tgt_fname=os.path.join(args.dataset_dir, config.TGT_VAL_FNAME),
src_fname=args.val_src,
tgt_fname=args.val_tgt,
tokenizer=tokenizer,
min_len=args.min_length_val,
max_len=args.max_length_val,
sort=True)
min_len=args.val_min_length,
max_len=args.val_max_length,
sort=True,
)
test_data = TextDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_TEST_FNAME),
src_fname=args.test_src,
tokenizer=tokenizer,
min_len=args.min_length_test,
max_len=args.max_length_test,
sort=True)
min_len=args.test_min_length,
max_len=args.test_max_length,
sort=True,
)
vocab_size = tokenizer.vocab_size
# build GNMT model
model_config = {'hidden_size': args.hidden_size,
'vocab_size': vocab_size,
'num_layers': args.num_layers,
'dropout': args.dropout, 'batch_first': False,
'share_embedding': args.share_embedding}
model = GNMT(vocab_size=vocab_size, **model_config)
'dropout': args.dropout,
'batch_first': False,
'share_embedding': args.share_embedding,
}
model = GNMT(**model_config).to(device)
logging.info(model)
batch_first = model.batch_first
# define loss function (criterion) and optimizer
criterion = build_criterion(vocab_size, config.PAD, args.smoothing)
criterion = build_criterion(vocab_size, config.PAD,
args.smoothing).to(device)
opt_config = {'optimizer': args.optimizer, 'lr': args.lr}
opt_config.update(literal_eval(args.optimizer_extra))
@ -384,47 +470,52 @@ def main():
tokenizer=tokenizer,
loader=test_loader,
beam_size=args.beam_size,
max_seq_len=args.max_length_test,
max_seq_len=args.test_max_length,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir,
save_path=args.save_path)
reference=args.test_tgt,
)
# create trainer
total_train_iters = len(train_loader) // args.train_iter_size * args.epochs
save_info = {'model_config': model_config, 'config': args, 'tokenizer':
tokenizer.get_state()}
save_info = {
'model_config': model_config,
'config': args,
'tokenizer': tokenizer.get_state()
}
loss_scaling = {
'init_scale': args.init_scale,
'upscale_interval': args.upscale_interval
}
trainer_options = dict(
model=model,
criterion=criterion,
grad_clip=args.grad_clip,
iter_size=args.train_iter_size,
save_path=save_path,
save_dir=args.save_dir,
save_freq=args.save_freq,
save_info=save_info,
opt_config=opt_config,
scheduler_config=scheduler_config,
train_iterations=total_train_iters,
batch_first=batch_first,
keep_checkpoints=args.keep_checkpoints,
math=args.math,
loss_scaling=loss_scaling,
print_freq=args.print_freq,
cuda=args.cuda,
distributed=distributed,
intra_epoch_eval=args.intra_epoch_eval,
translator=translator)
translator=translator,
prealloc_mode=args.prealloc_mode,
)
trainer_options['model'] = model
trainer = trainers.Seq2SeqTrainer(**trainer_options)
# optionally resume from a checkpoint
if args.resume:
checkpoint_file = args.resume
if os.path.isdir(checkpoint_file):
checkpoint_file = os.path.join(
checkpoint_file, 'model_best.pth')
checkpoint_file = os.path.join(checkpoint_file, 'model_best.pth')
if os.path.isfile(checkpoint_file):
trainer.load(checkpoint_file)
else:
@ -432,6 +523,7 @@ def main():
# training loop
best_loss = float('inf')
training_perf = []
break_training = False
test_bleu = None
for epoch in range(args.start_epoch, args.epochs):
@ -441,6 +533,7 @@ def main():
trainer.epoch = epoch
train_loss, train_perf = trainer.optimize(train_loader)
training_perf.append(train_perf)
# evaluate on validation set
if args.eval:
@ -455,7 +548,13 @@ def main():
if args.eval:
utils.barrier()
eval_stats = translator.run(calc_bleu=True, epoch=epoch)
eval_fname = f'eval_epoch_{epoch}'
eval_path = os.path.join(args.save_dir, eval_fname)
_, eval_stats = translator.run(
calc_bleu=True,
epoch=epoch,
eval_path=eval_path,
)
test_bleu = eval_stats['bleu']
if args.target_bleu and test_bleu >= args.target_bleu:
logging.info(f'Target accuracy reached')
@ -483,12 +582,22 @@ def main():
break
utils.barrier()
training_stop = time.time()
training_time = training_stop - training_start
logging.info(f'Total training time {training_time:.0f} s')
table = TrainingTable()
avg_training_perf = sum(training_perf) / len(training_perf)
table.add(utils.get_world_size(), args.train_batch_size, test_bleu,
avg_training_perf, training_time)
if utils.get_rank() == 0:
table.write('Training Summary', args.math)
passed = utils.benchmark(test_bleu, args.target_bleu,
train_perf, args.target_perf)
return passed
if not passed:
sys.exit(1)
if __name__ == '__main__':
passed = main()
if not passed:
sys.exit(1)
main()

View file

@ -1,21 +1,19 @@
#!/usr/bin/env python
import argparse
import logging
import os
import itertools
import sys
import warnings
from ast import literal_eval
from itertools import product
import torch
import torch.distributed as dist
import seq2seq.utils as utils
from seq2seq.data.dataset import TextDataset
from seq2seq.data.dataset import RawTextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.inference.translator import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.utils import setup_logging
from seq2seq.inference import tables
def parse_args():
@ -38,18 +36,22 @@ def parse_args():
# dataset
dataset = parser.add_argument_group('data setup')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en/',
help='path to directory with training/test data')
dataset.add_argument('-i', '--input', required=True,
help='full path to the input file (tokenized)')
dataset.add_argument('-o', '--output', required=True,
help='full path to the output file (tokenized)')
dataset.add_argument('-o', '--output', required=False,
help='full path to the output file \
if not specified, then the output will be printed')
dataset.add_argument('-r', '--reference', default=None,
help='full path to the file with reference \
translations (for sacrebleu)')
translations (for sacrebleu, raw text)')
dataset.add_argument('-m', '--model', required=True,
help='full path to the model checkpoint file')
exclusive_group(group=dataset, name='sort', default=True,
source = dataset.add_mutually_exclusive_group(required=True)
source.add_argument('-i', '--input', required=False,
help='full path to the input file (raw text)')
source.add_argument('-t', '--input-text', nargs='+', required=False,
help='raw input text')
exclusive_group(group=dataset, name='sort', default=False,
help='sorts dataset by sequence length')
# parameters
@ -69,9 +71,9 @@ def parse_args():
# general setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', nargs='+', default=['fp16'],
choices=['fp16', 'fp32'], help='arithmetic type')
choices=['fp16', 'fp32'], help='precision')
exclusive_group(group=general, name='env', default=True,
exclusive_group(group=general, name='env', default=False,
help='print info about execution env')
exclusive_group(group=general, name='bleu', default=True,
help='compares with reference translation and computes \
@ -102,6 +104,21 @@ def parse_args():
per second)')
benchmark.add_argument('--target-bleu', default=None, type=float,
help='target accuracy')
benchmark.add_argument('--repeat', nargs='+', default=[1], type=float,
help='loops over the dataset REPEAT times, flag \
accepts multiple arguments, one for each specified \
batch size')
benchmark.add_argument('--warmup', default=0, type=int,
help='warmup iterations for performance counters')
benchmark.add_argument('--percentiles', nargs='+', type=int,
default=(50, 90, 95, 99, 100),
help='Percentiles for confidence intervals for \
throughput/latency benchmarks')
exclusive_group(group=benchmark, name='tables', default=False,
help='print accuracy, throughput and latency results in \
tables')
# distributed
distributed = parser.add_argument_group('distributed setup')
distributed.add_argument('--rank', default=0, type=int,
@ -111,6 +128,9 @@ def parse_args():
args = parser.parse_args()
if args.input_text:
args.bleu = False
if args.bleu and args.reference is None:
parser.error('--bleu requires --reference')
@ -120,6 +140,11 @@ def parse_args():
if len(list(product(args.math, args.batch_size, args.beam_size))) > 1:
args.target_bleu = None
args.target_perf = None
args.repeat = dict(itertools.zip_longest(args.batch_size,
args.repeat,
fillvalue=1))
return args
@ -130,9 +155,10 @@ def main():
with length normalization and coverage penalty.
"""
args = parse_args()
utils.set_device(args.cuda, args.local_rank)
device = utils.set_device(args.cuda, args.local_rank)
utils.init_distributed(args.cuda)
setup_logging()
args.rank = utils.get_rank()
utils.setup_logging()
if args.env:
utils.log_env_info()
@ -150,54 +176,100 @@ def main():
# build GNMT model
tokenizer = Tokenizer()
tokenizer.set_state(checkpoint['tokenizer'])
vocab_size = tokenizer.vocab_size
model_config = checkpoint['model_config']
model_config['batch_first'] = args.batch_first
model = GNMT(vocab_size=vocab_size, **model_config)
model_config['vocab_size'] = tokenizer.vocab_size
model = GNMT(**model_config)
model.load_state_dict(checkpoint['state_dict'])
# construct the dataset
if args.input:
data = RawTextDataset(raw_datafile=args.input,
tokenizer=tokenizer,
sort=args.sort,
)
elif args.input_text:
data = RawTextDataset(raw_data=args.input_text,
tokenizer=tokenizer,
sort=args.sort,
)
latency_table = tables.LatencyTable(args.percentiles)
throughput_table = tables.ThroughputTable(args.percentiles)
accuracy_table = tables.AccuracyTable('BLEU')
dtype = {'fp32': torch.FloatTensor, 'fp16': torch.HalfTensor}
for (math, batch_size, beam_size) in product(args.math, args.batch_size,
args.beam_size):
logging.info(f'math: {math}, batch size: {batch_size}, '
f'beam size: {beam_size}')
if math == 'fp32':
dtype = torch.FloatTensor
if math == 'fp16':
dtype = torch.HalfTensor
model.type(dtype)
if args.cuda:
model = model.cuda()
model.type(dtype[math])
model = model.to(device)
model.eval()
# construct the dataset
test_data = TextDataset(src_fname=args.input,
tokenizer=tokenizer,
sort=args.sort)
# build the data loader
test_loader = test_data.get_loader(batch_size=batch_size,
batch_first=args.batch_first,
shuffle=False,
pad=True,
num_workers=0)
loader = data.get_loader(
batch_size=batch_size,
batch_first=args.batch_first,
pad=True,
repeat=args.repeat[batch_size],
num_workers=0,
)
# build the translator object
translator = Translator(model=model,
tokenizer=tokenizer,
loader=test_loader,
beam_size=beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir)
translator = Translator(
model=model,
tokenizer=tokenizer,
loader=loader,
beam_size=beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
print_freq=args.print_freq,
)
# execute the inference
stats = translator.run(calc_bleu=args.bleu, eval_path=args.output,
reference_path=args.reference, summary=True)
output, stats = translator.run(
calc_bleu=args.bleu,
eval_path=args.output,
summary=True,
warmup=args.warmup,
reference_path=args.reference,
)
# print translated outputs
if not args.output and args.rank == 0:
logging.info(f'Translated output:')
for out in output:
print(out)
key = (batch_size, beam_size)
latency_table.add(key, {math: stats['runtimes']})
throughput_table.add(key, {math: stats['throughputs']})
accuracy_table.add(key, {math: stats['bleu']})
if args.tables:
accuracy_table.write('Inference accuracy', args.math)
if 'fp16' in args.math and 'fp32' in args.math:
relative = 'fp32'
else:
relative = None
if 'fp32' in args.math:
throughput_table.write('Inference throughput', 'fp32')
if 'fp16' in args.math:
throughput_table.write('Inference throughput', 'fp16',
relative=relative)
if 'fp32' in args.math:
latency_table.write('Inference latency', 'fp32')
if 'fp16' in args.math:
latency_table.write('Inference latency', 'fp16',
relative=relative, reverse_speedup=True)
passed = utils.benchmark(stats['bleu'], args.target_bleu,
stats['tokens_per_sec'], args.target_perf)

View file

@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:19.05-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}
# Install Python dependencies
RUN pip install --upgrade --no-cache-dir pip \
@ -21,9 +22,10 @@ RUN pip install --upgrade --no-cache-dir pip \
sentencepiece
RUN apt-get update
RUN apt-get install -y cmake pkg-config libprotobuf9v5 protobuf-compiler libprotobuf-dev libgoogle-perftools-dev
RUN apt-get install -y cmake pkg-config protobuf-compiler libprotobuf-dev libgoogle-perftools-dev
RUN git clone https://github.com/google/sentencepiece.git /workspace/sentencepiece
RUN cd /workspace/sentencepiece \
&& git checkout d4dd947 \
&& mkdir build \
&& cd build \
&& cmake .. \
@ -33,6 +35,6 @@ RUN cd /workspace/sentencepiece \
ENV PYTHONPATH=/workspace/translation/examples/translation/subword-nmt/
WORKDIR /workspace/translation
COPY . .
RUN git clone https://github.com/rsennrich/subword-nmt.git /workspace/translation/examples/translation/subword-nmt/
COPY . .
RUN pip install -e .

View file

@ -1,61 +1,180 @@
# Transformer
# Transformer For PyTorch
This implementation of the Transformer model architecture is based on the optimized implementation in [Facebook's Fairseq NLP toolkit](https://github.com/pytorch/fairseq), built on top of PyTorch. The original version in the Fairseq project was developed using Tensor Cores, which provide significant training speedup. Our implementation improves training performance and is tested on a DGX-1V 16GB.
This repository provides a script and recipe to train the Transformer model to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
# Requirements and installation
This repository contains a `Dockerfile` which extends the PyTorch NGC container and encapsulates all dependencies. Ensure you have the following software:
* [nvidia-docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.01-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [SacreBLEU 1.2.10](https://pypi.org/project/sacrebleu/1.2.10/)
**Table Of Contents**
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Glossary](#glossary)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
* [Inference performance results](#inference-performance-results)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
If you use multiprocessing for multi-threaded data loaders, the default shared memory segment size that the container runs with may not be enough. Therefore, we recommend you increase the shared memory size by issuing either:
## Model overview
The Transformer is a Neural Machine Translation (NMT) model which uses an attention mechanism to boost training speed and overall accuracy. The Transformer model was introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) and improved in [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
This implementation is based on the optimized implementation in [Facebook's Fairseq NLP toolkit](https://github.com/pytorch/fairseq), built on top of PyTorch.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 3.6x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
The Transformer model uses the standard NMT encoder-decoder architecture. Unlike other NMT models, it uses no recurrent connections and operates on a fixed-size context window.
The encoder stack is made up of N identical layers. Each layer is composed of the following sublayers:
1. Self-attention layer
2. Feedforward network (which is 2 fully-connected layers)
Like the encoder stack, the decoder stack is made up of N identical layers. Each layer is composed of the sublayers:
1. Self-attention layer
2. Multi-headed attention layer combining encoder outputs with results from
the previous self-attention layer.
3. Feedforward network (2 fully-connected layers)
The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the previously generated decoder tokens as inputs.
The model also applies embeddings on the input and output tokens, and adds a constant positional encoding. The positional encoding adds information about the position of each token.
<p align="center">
<img width="50%" src="./transformer.png" />
<br>
Figure 1. The architecture of a Transformer model.
</p>
The complete description of the Transformer architecture can be found in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper.
### Default configuration
The Transformer uses the Byte Pair Encoding tokenization scheme built with the [Moses decoder](https://github.com/moses-smt/mosesdecoder). This is a lossy compression method (we drop information about white spaces). Tokenization is applied over the whole [WMT14](http://statmt.org/wmt14/) en-de dataset, including the test set. The default vocabulary size is 33708, excluding all special tokens. The encoder and decoder use shared embeddings.
We use 6 blocks in each of the encoder and decoder stacks. The self-attention layer computes its outputs according to the following formula: $`Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V`$. At each attention step, the model computes 16 different attention representations (which we will call attention heads) and concatenates them.
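For illustration, here is a minimal PyTorch sketch of the scaled dot-product attention defined above, with 16 heads computed in parallel and then concatenated. The tensor shapes and names are illustrative, not code from this repository:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    return torch.matmul(F.softmax(scores, dim=-1), v)

batch, heads, seq_len, d_k = 2, 16, 10, 64
q = k = v = torch.randn(batch, heads, seq_len, d_k)
out = scaled_dot_product_attention(q, k, v)                     # (2, 16, 10, 64)
out = out.transpose(1, 2).reshape(batch, seq_len, heads * d_k)  # concatenate heads
```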
We trained the Transformer model using the Adam optimizer with betas `(0.9, 0.997)`, epsilon `1e-9` and learning rate `6e-4`. We used the inverse square root training schedule, preceded by a linear warmup of 4000 steps.
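The learning rate schedule described above can be written as a short function. This is a sketch under the stated hyperparameters (peak learning rate 6e-4, 4000 warmup steps), not the repository's scheduler:

```python
def inverse_sqrt_lr(step, peak_lr=6e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

print(inverse_sqrt_lr(2000))   # mid-warmup: 3e-4
print(inverse_sqrt_lr(16000))  # past warmup: 6e-4 * 0.5 = 3e-4
```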
The implementation supports training in mixed precision. We use dynamic loss scaling and a custom mixed precision optimizer. Distributed multi-GPU and multi-node training is implemented with the `torch.distributed` module with the NCCL backend.
For inference, we use beam search with a default beam size of 5. Model performance is evaluated with the BLEU4 metric. For clarity, we report the internal (legacy) BLEU implementation as well as the external [SacreBleu](https://github.com/mjpost/sacreBLEU) score.
### Feature support matrix
The following features are supported by this model.<br>
| Feature | Transformer/PyT
|--------------------------|--------------------------
| Multi-GPU training with [Distributed Communication Package](https://pytorch.org/docs/stable/distributed.html) | Yes
| APEX | Yes
#### Features
Multi-GPU training with [Distributed Communication Package](https://pytorch.org/docs/stable/distributed.html)
Our model uses the torch.distributed package to implement efficient multi-GPU training with NCCL.
To enable multi-GPU training with torch.distributed, you have to initialize your model
identically in every process spawned by torch.distributed.launch; for efficiency, the only point of synchronization is gradient gathering.
For details, see the example sources in this repository or the
[PyTorch tutorial](https://pytorch.org/docs/stable/distributed.html).
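A minimal sketch of that initialization pattern, assuming the script is spawned by `torch.distributed.launch` (the `build_model` call is a hypothetical placeholder):

```python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

torch.manual_seed(1)          # same seed in every process -> identical initial weights
model = build_model().cuda()  # hypothetical constructor; identical in every worker

# the only synchronization point: after backward(), gradients are
# all-reduced across workers before the optimizer step
for p in model.parameters():
    if p.grad is not None:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
```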
APEX - This implementation uses Apex's FP16_Optimizer API to perform mixed precision training.
The purpose of APEX is to provide an easy and intuitive framework for distributed training and mixed precision training.
For details, see official [APEX repository](https://github.com/NVIDIA/apex).
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
Mixed precision is enabled using the `--fp16` option in the `train.py` script. The script then builds a custom mixed precision optimizer. The forward and backward passes are computed in FP16 precision, with the exception of the loss function, which is computed in FP32. We keep a copy of the model in higher precision in order to perform an accurate weight update. After the update, the FP32 weights are copied back to the FP16 model. We use dynamic loss scaling with an initial scale of 2^7, increasing it by a factor of 2 every 2000 successful iterations. Overflow is checked after reducing gradients from all of the workers. If we encounter infs or NaNs, the whole batch is dropped.
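A simplified sketch of that dynamic loss scaling logic (initial scale 2^7, doubling after 2000 successful steps, dropping the batch on overflow). Halving the scale on overflow is the usual policy and is an assumption here, not taken from this repository's optimizer:

```python
import torch

scale, good_steps = 2.0 ** 7, 0

def scaled_step(loss, model, optimizer):
    global scale, good_steps
    (loss * scale).backward()                 # FP16 backward pass with scaled loss
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        scale, good_steps = scale / 2, 0      # overflow: shrink scale, drop the batch
    else:
        for g in grads:
            g.div_(scale)                     # unscale before updating master weights
        optimizer.step()                      # update FP32 master weights
        good_steps += 1
        if good_steps == 2000:
            scale, good_steps = scale * 2, 0  # double the scale after 2000 good steps
    optimizer.zero_grad()
```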
### Glossary
Attention layer - Layer that computes which elements of the input sequence, or of its hidden representation, contribute the most to the currently considered output element.
Beam search - A heuristic search algorithm which at each prediction step keeps the N most probable outputs as a base for further prediction.
BPE - Byte Pair Encoding, a compression algorithm that finds the most common pair of symbols in the data and replaces it with a new symbol absent from the data.
EOS - End of a sentence.
Self attention layer - Attention layer that computes hidden representation of input using the same tensor as query, key and value.
Token - A string that is representable within the model. We also refer to the token's position in the dictionary as a token. There are special non-string tokens: alphabet tokens (all characters in a dataset), EOS token, PAD token.
Tokenizer - Object that converts raw strings to sequences of tokens.
Vocabulary embedding - Layer that projects one-hot token representations to a high dimensional space which preserves some information about correlations between tokens.
## Setup
The following section lists the requirements in order to start training the Transformer model.
### Requirements
This repository contains a Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 19.03-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
- Running [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Transformer model on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository
```
--ipc=host
git clone --recurse-submodules https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/Translation/Transformer
```
Or
```
--shm-size=<requested memory size>
```
in the command line to `nvidia-docker run`. For more information, see [Setting The Shared Memory Flag](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#setincshmem) in the NVIDIA Container User Guide.
For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
DGX Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
- [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
## Training using mixed precision with Tensor Cores
The training script provided in this project takes advantage of Tensor Cores to speed up training of the Transformer model (for a translation task in this example). Tensor Cores accelerate matrix multiplication math and are available on NVIDIA Volta and Turing based GPUs. For more information about how to use Tensor Cores, see the [Training With Mixed Precision Guide](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) to Mixed Precision Training on NVIDIA GPUs.
An additional resource for mixed precision training is NVIDIA's
[Apex](https://github.com/NVIDIA/apex), a PyTorch extension, that contains
utility libraries, such as AMP, which stands for Automatic Mixed Precision and enables the use of Tensor Cores with minimal code changes to existing PyTorch training scripts.
# Hyper parameters setting
To reach the BLEU score reported in the [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) research paper, we used mixed precision training with a batch size of 5120 per GPU and a learning rate of 6e-4 on a DGX-1V system with 8 Tesla V100s 16G. If you use a different setup, we recommend you scale your hyperparameters by applying the following rules:
1. To use FP32, reduce the batch size to 2560 and set the `--update-freq 2` and `--warmup-updates 8000` options.
2. To train on fewer GPUs, multiply `--update-freq` and `--warmup-updates` by the reciprocal of the scaling factor.
For example, when training in FP32 mode on 4 GPUs, use the `--update-freq=4` and `--warmup-updates 16000` options.
# Quick start guide
Perform the following steps to train using provided default parameters of the Transformer model on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset.
## Build and launch Transformer Docker container
2. Build and launch the Transformer PyTorch NGC container
```bash
docker build . -t your.repository:transformer
nvidia-docker run -it --rm --ipc=host -v /path/to/your/dataset:/container/dataset/path your.repository:transformer bash
nvidia-docker run -it --rm --ipc=host your.repository:transformer bash
```
## Downloading and preprocessing dataset
If you have already preprocessed data, use:
```bash
nvidia-docker run -it --rm --ipc=host -v path/to/your/data/:/data/wmt14_en_de_joined_dict your.repository:transformer bash
```
3. Download and preprocess dataset
Download and preprocess the WMT14 English-German dataset.
```bash
./run_preprocessing.sh
```
## Run training
After running this command, the processed dataset will be placed in the `/data/wmt14_en_de_joined_dict` directory.
4. Start training
The following command runs the training script that is distributed between 8 workers.
```bash
python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py /data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
@ -78,54 +197,155 @@ python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/tra
--fp16 \
--save-dir /workspace/checkpoints \
--distributed-init-method env://
```
**WARNING**: If you don't have access to sufficient disk space, use the `--save-interval $N` option. The checkpoints are ~2.5GB large. For example, it takes the Transformer model 16 epochs to reach a BLEU score of 28 points. The default option is to save the latest checkpoint, the best checkpoint and a checkpoint for every epoch, which means (16+1+1)*2.5GB = 45GB of disk space used. Specifying `--save-interval 5` reduces this to (16/5+1+1)*2.5GB = 12.5GB.
# Details
The script saves checkpoints every epoch to the directory specified in the `--save-dir` option. In addition, the best performing checkpoint (in terms of loss) and the latest checkpoints are saved separately.
**WARNING**: If you don't have access to sufficient disk space, use the `--save-interval $N` option. The checkpoints are ~2.5GB large. For example, it takes the Transformer model 16 epochs to reach the BLEU score of 28 points. The default option is to save the latest checkpoint, the best checkpoint and a checkpoint for every epoch, which means (16+1+1)*2.5GB = 45GB of disk space used. Specifying `--save-interval 5` reduces this to (16/5+1+1)*2.5GB = 12.5GB.
## Getting the data
The Transformer model was trained on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. Concatenation of the *commoncrawl*, *europarl* and *news-commentary* corpora is used as the training and validation dataset, and *newstest2014* is used as the test dataset.<br/>
This repository contains the `run_preprocessing.sh` script which will automatically download and preprocess the training and test datasets. By default, data will be stored in the `/data/wmt14_en_de_joined_dict` directory.<br/>
Our download script utilizes the [Moses decoder](https://github.com/moses-smt/mosesdecoder) to perform tokenization of the dataset and [subword-nmt](https://github.com/rsennrich/subword-nmt) to segment text into subword units (BPE). By default, the script builds a shared vocabulary of 33708 tokens, which is consistent with [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
## Running training
The default training configuration can be launched by running the `train.py` training script. By default, the script saves one checkpoint every epoch in addition to the latest and the best ones. The best checkpoint is considered the one with the lowest value of loss, not the one with the highest BLEU score. To override this behavior, use the `--save-interval $N` option to save epoch checkpoints every N epochs, or `--no-epoch-checkpoints` to disable them entirely (with this option the latest and the best checkpoints will still be saved). Specify the save directory with the `--save-dir` option.<br/>
In order to run multi-GPU training, launch the training script with `python -m torch.distributed.launch --nproc_per_node $N` prepended, where N is the number of GPUs.
### Scripts and sample code
The `preprocess.py` script performs binarization of the dataset obtained and tokenized by the `examples/translation/prepare-wmt14en2de.sh` script. The `train.py` script contains the training loop as well as statistics-gathering code. The steps performed in a single training step can be found in `fairseq/trainer.py` for FP32 precision or in `fairseq/fp16_trainer.py` for mixed precision. The model definition is placed in the file `fairseq/models/transformer.py`. Model-specific modules, including multi-headed attention and sinusoidal positional embedding, are inside the `fairseq/modules/` directory. Finally, the data wrappers are placed inside the `fairseq/data/` directory.
### Parameters
In this section we give a user-friendly description of the most common options used in the `train.py` script.
### Command-line options
`--arch` - select the specific configuration for the model. You can choose between various predefined hyperparameter values, such as the number of encoder/decoder blocks, the dropout value or the size of the hidden state representation.<br/>
`--share-all-embeddings` - use the same set of weights for encoder and decoder words embedding.<br/>
`--optimizer` - choose optimization algorithm.<br/>
`--clip-norm` - set a value that gradients will be clipped to.<br/>
`--lr-scheduler` - choose learning rate change strategy.<br/>
`--warmup-init-lr` - start linear warmup with a learning rate at this value.<br/>
`--warmup-updates` - set number of optimization steps after which linear warmup will end.<br/>
`--lr` - set learning rate.<br/>
`--min-lr` - prevent the learning rate from falling below this value under any learning rate schedule.<br/>
`--dropout` - set dropout value.<br/>
`--weight-decay` - set weight decay value.<br/>
`--criterion` - select loss function.<br/>
`--label-smoothing` - distribute value of one-hot labels between all entries of a dictionary. Value set by this option will be a value subtracted from one-hot label.<br/>
`--max-tokens` - set batch size in terms of tokens.<br/>
`--max-sentences` - set batch size in terms of sentences. Note that the actual batch size will then vary much more than with the `--max-tokens` option.<br/>
`--seed` - set random seed for NumPy and PyTorch RNGs.<br/>
`--max-epochs` - set the maximum number of epochs.<br/>
`--online-eval` - perform inference on test set and then compute BLEU score after every epoch.<br/>
`--ignore-case` - used with `--online-eval`, ignore case while computing BLEU score.<br/>
`--target-bleu` - works like `--online-eval` and sets a BLEU score threshold; once the threshold is attained, training stops.<br/>
`--fp16` - use mixed precision.<br/>
`--save-dir` - set directory for saving checkpoints.<br/>
`--distributed-init-method` - method for initializing the torch.distributed package. You can either provide addresses with the `tcp` method or use the environment variable initialization with the `env` method.<br/>
`--update-freq` - use gradient accumulation. Set the number of training steps across which gradients will be accumulated (see the sketch below).<br/>
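A self-contained sketch of the gradient accumulation performed by `--update-freq` (illustrative, not the repository's trainer):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]
update_freq = 4                          # corresponds to --update-freq 4

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = F.mse_loss(model(x), y)
    (loss / update_freq).backward()      # accumulate averaged gradients
    if (i + 1) % update_freq == 0:
        optimizer.step()                 # one weight update per update_freq batches
        optimizer.zero_grad()
```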
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
```
python train.py --help
```
The following (partial) output is printed when running the sample:
```
usage: train.py [-h] [--no-progress-bar] [--log-interval N]
[--log-format {json,none,simple,tqdm}] [--seed N] [--fp16]
[--profile PROFILE] [--task TASK]
[--skip-invalid-size-inputs-valid-test] [--max-tokens N]
[--max-sentences N] [--sentencepiece] [--train-subset SPLIT]
[--valid-subset SPLIT] [--max-sentences-valid N]
[--gen-subset SPLIT] [--num-shards N] [--shard-id ID]
[--distributed-world-size N]
[--distributed-rank DISTRIBUTED_RANK]
[--local_rank LOCAL_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
--arch ARCH [--criterion CRIT] [--max-epoch N]
[--max-update N] [--target-bleu TARGET] [--clip-norm NORM]
[--sentence-avg] [--update-freq N] [--optimizer OPT]
[--lr LR_1,LR_2,...,LR_N] [--momentum M] [--weight-decay WD]
[--lr-scheduler LR_SCHEDULER] [--lr-shrink LS] [--min-lr LR]
[--min-loss-scale D] [--enable-parallel-backward-allred-opt]
[--parallel-backward-allred-opt-threshold N]
[--enable-parallel-backward-allred-opt-correctness-check]
[--save-dir DIR] [--restore-file RESTORE_FILE]
[--save-interval N] [--save-interval-updates N]
[--keep-interval-updates N] [--no-save]
[--no-epoch-checkpoints] [--validate-interval N] [--path FILE]
[--remove-bpe [REMOVE_BPE]] [--cpu] [--quiet] [--beam N]
[--nbest N] [--max-len-a N] [--max-len-b N] [--min-len N]
[--no-early-stop] [--unnormalized] [--no-beamable-mm]
[--lenpen LENPEN] [--unkpen UNKPEN]
[--replace-unk [REPLACE_UNK]] [--score-reference]
[--prefix-size PS] [--sampling] [--sampling-topk PS]
[--sampling-temperature N] [--print-alignment]
[--model-overrides DICT] [--online-eval] [--ignore-case]
[--bpe-codes CODES] [--fuse-dropout-add] [--fuse-relu-dropout]
```
### Getting the data
The Transformer model was trained on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. Concatenation of the *commoncrawl*, *europarl* and *news-commentary* corpora is used as the training and validation dataset, and *newstest2014* is used as the test dataset.<br/>
This repository contains the `run_preprocessing.sh` script, which automatically downloads and preprocesses the training and test datasets. By default, data will be stored in the `/data/wmt14_en_de_joined_dict` directory.<br/>
Our download script utilizes [Moses decoder](https://github.com/moses-smt/mosesdecoder) to perform tokenization of the dataset and [subword-nmt](https://github.com/rsennrich/subword-nmt) to segment text into subword units (BPE). By default, the script builds a shared vocabulary of 33708 tokens, which is consistent with [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
#### Dataset guidelines
The Transformer model works with a fixed-size vocabulary. Prior to training, we need to learn a data representation that allows us to store the entire dataset as a sequence of tokens. To achieve this we use Byte Pair Encoding. This algorithm builds a vocabulary by iterating over a dataset, looking for the most frequent pair of symbols and replacing it with a new symbol, as yet absent from the dataset. After reaching the desired number of encodings (new symbols can also be merged together), it outputs a code file that is used as an input for the `Dictionary` class.
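A toy sketch of the BPE merge loop described above (the actual preprocessing uses the subword-nmt implementation; this only illustrates the algorithm):

```python
import re
from collections import Counter

def learn_bpe(words, num_merges):
    """words: dict mapping space-separated symbol sequences to frequencies."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        pattern = re.compile(r'(?<!\S)' + re.escape(a + ' ' + b) + r'(?!\S)')
        words = {pattern.sub(a + b, w): f for w, f in words.items()}
    return merges

print(learn_bpe({'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6}, 3))
```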
This approach does not minimize the length of the encoded dataset; that can be addressed by using [SentencePiece](https://github.com/google/sentencepiece/) to tokenize the dataset with the unigram model, which tries to find an encoding close to the theoretical entropy limit.
Data is then sorted by length (in terms of tokens), and examples of similar length are batched together and padded if necessary, as in the sketch below.
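A minimal sketch of that batching scheme (sort by length, group neighbours, pad each group to its longest sequence); the names and padding index are illustrative:

```python
import torch

def length_bucketed_batches(token_seqs, batch_size, pad_id=0):
    order = sorted(range(len(token_seqs)), key=lambda i: len(token_seqs[i]))
    for start in range(0, len(order), batch_size):
        group = [token_seqs[i] for i in order[start:start + batch_size]]
        max_len = max(len(s) for s in group)
        batch = torch.full((len(group), max_len), pad_id, dtype=torch.long)
        for row, seq in enumerate(group):
            batch[row, :len(seq)] = torch.tensor(seq)
        yield batch

for b in length_bucketed_batches([[5, 6], [1, 2, 3], [7], [4, 4, 4, 4]], 2):
    print(b.shape)  # sequences of similar length end up in the same padded batch
```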
#### Multi-dataset
The model has been tested on the [WMT14 en-fr](http://www.statmt.org/wmt14/translation-task.html) dataset, achieving a state-of-the-art accuracy of 41.4 BLEU.
### Training process
The default training configuration can be launched by running the `train.py` training script. By default, the script saves one checkpoint every epoch in addition to the latest and the best ones. The best checkpoint is considered the one with the lowest value of loss, not the one with the highest BLEU score. To override this behavior, use the `--save-interval $N` option to save epoch checkpoints every N epochs, or `--no-epoch-checkpoints` to disable them entirely (with this option the latest and the best checkpoints will still be saved). Specify the save directory with the `--save-dir` option.<br/>
In order to run multi-GPU training, launch the training script with `python -m torch.distributed.launch --nproc_per_node $N` prepended, where N is the number of GPUs.
We have tested reliance on up to 16 GPUs on a single node.<br/>
After each training epoch, the script runs a loss validation on the validation split of the dataset and outputs the validation loss. By default the evaluation after each epoch is disabled. To enable it, use the `--online-eval` option, or use the BLEU score value as the training stopping condition with the `--target-bleu $TGT` option. In order to compute a case-insensitive BLEU score, use the flag `--ignore-case` along with the previous ones. BLEU is computed by the internal fairseq algorithm, the implementation of which can be found in the `fairseq/bleu.py` script.<br/>
By default, the `train.py` script will launch fp32 training without Tensor Cores. To use mixed precision with Tensor Cores use `--fp16` option.<br/>
To view all available options for training, run `python train.py --help`.
After each training epoch, the script runs a loss validation on the validation split of the dataset and outputs the validation loss. By default the evaluation after each epoch is disabled. To enable it, use the `--online-eval` option, or use the BLEU score value as the training stopping condition with the `--target-bleu $TGT` option. In order to compute the case-insensitive BLEU score, use the flag `--ignore-case` along with the previous ones. The BLEU is computed by the internal fairseq algorithm, the implementation of which can be found in the `fairseq/bleu.py` script.<br/>
By default, the `train.py` script will launch FP32 training without Tensor Cores. To use mixed precision with Tensor Cores use the `--fp16` option.<br/>
## Running inference
Inference on a raw input can be performed by launching the `interactive.py` inference script. It requires a pre-trained model checkpoint, a BPE codes file and a dictionary file (both are produced by the `run_preprocessing.sh` script and can be found in the dataset directory).<br/>
To enhance the speed of inference on large input files, it is recommended to preprocess them the same way as the dataset and run inference on the binarized input with the `generate.py` script.<br/>
Both scripts run inference with a default beam size of 4 and give tokenized output. To remove BPE codes use `--remove-bpe` option.<br/>
To view all available options for training, run `python interactive.py --help`.
To reach the BLEU score reported in [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) research paper, we used mixed precision training with a batch size of 5120 per GPU and learning rate of 6e-4 on a DGX-1V system with 8 Tesla V100s 16G. If you use a different setup, we recommend you scale your hyperparameters by applying the following rules:
1. To use FP32, reduce the batch size to 2560 and set the `--update-freq 2` and `--warmup-updates 8000` options.
2. To train on fewer GPUs, multiply `--update-freq` and `--warmup-updates` by the reciprocal of the scaling factor.
## Testing
Computation of the BLEU score is built into the training script and can be used to determine when the training should stop. To disable this feature, replace the `--target-bleu $BLEU` and `--ignore-case` options with `--max-epoch $N`, where `N` is the number of training epochs. Evaluation of the Transformer model is then performed on the binarized test split of the dataset by default. To evaluate the model, issue:
```bash
python generate.py /path/to/dataset/wmt14_en_de_joined_dict \
--path /path/to/your/checkpoint.pt \
--beam 4 --remove-bpe
For example, when training in FP32 mode on 4 GPUs, use the `--update-freq=4` and `--warmup-updates 16000` options.
### Inference process
Inference on a raw input can be performed by launching the `interactive.py` inference script. It requires a pre-trained model checkpoint, BPE codes file and dictionary file (both are produced by the `run_preprocessing.sh` script and can be found in the dataset directory).<br/>
To enhance the speed of the inference on large input files, it is recommended to preprocess them the same way as the dataset and run inference on a binarized input with the `generate.py` script.<br/>
Both scripts run inference with a default beam size of 4 and give tokenized output. To remove BPE codes use the `--remove-bpe` option.<br/>
In order to run interactive inference, run the command:
```
In order to use [SacreBLEU](https://pypi.org/project/sacrebleu/1.2.10/) for evaluation, run:
```bash
sacrebleu -t wmt14/full -l en-de --echo src > wmt14-en-de.src
python interactive.py --buffer-size 1 --fp16 --path /path/to/your/checkpoint.pt --max-tokens 128 \
--fuse-dropout-add --remove-bpe --bpe-codes /path/to/code/file \
/path/to/dataset/wmt14_en_de_joined_dict/ < wmt14-en-de.src > wmt14.detok
grep ^H wmt14.detok | cut -f3- > wmt14.translated
cat wmt14.translated | sacrebleu -t wmt14 -lc -l en-de
/path/to/dataset/wmt14_en_de_joined_dict/
```
The SacreBLEU test set is a subset of the test set used during training, thus the score obtained with SacreBLEU can differ slightly from the one computed during training.
The `--buffer-size` option allows batching of input sentences of up to `--max-tokens` length.
## Training Accuracy Results
In order to test the accuracy of our implementation, we have run experiments with different seeds for 100 epochs with batch size 5120 per GPU and learning rate 6e-4 in the pytorch-19.03-py3 Docker container. The plot below shows the BLEU score changes.<br/>
![Accuracy plot](BLEU.png)
## Performance
## Training Performance Results
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance on a specific batch size, run the `train.py` training script. Performance in words per second will be printed to standard output every N iterations, as specified by the `--log-interval` option. After each epoch, the mean performance across the epoch will be reported as well.
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size, run the `generate.py` script. The mean throughput will be reported at the end of the script.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
In order to test the accuracy of our implementation, we have run experiments with different seeds for 100 epochs with batch size 5120 per GPU and learning rate 6e-4 in the pytorch-18.12-py3 Docker container. The plot below shows the BLEU score changes.<br/>
![Accuracy plot](/BLEU.png)
Running this code with the provided hyperparameters will allow you to achieve the following results. Our setup is a DGX-1 with 8x Tesla V100 16GB. We've verified our results after training 32 epochs to obtain multi-GPU and mixed precision scaling results.
@ -136,32 +356,80 @@ Running this code with the provided hyperparameters will allow you to achieve th
In some cases we can train further with the same setup to achieve slightly better results.
GPU count |Precision | BLEU score | Epochs to train | Training time
---|---|---|---|---
4 |fp16 | 28.67 | 74 | 1925 min
4 |fp32 | 28.40 | 47 | 5478 min
##### NVIDIA DGX-1 (8x V100 16G)
Results here are the best we achieved. We have observed a large variance in BLEU across random seeds. Nearly all setups reach 28.4 BLEU, although the time it takes varies between setups.
We also observed good weak scaling. We measured performance in tokens (words) per second.
Our results were obtained by running the `run_training.sh` and `run_training_fp32.sh` training scripts in the PyTorch NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.
GPU count | Mixed precision | FP32 | FP32/Mixed speedup | Mixed precision weak scaling | FP32 weak scaling
---|---|---|---|---|---
1 | 37650 | 8630 | 4.36 | 1.0 | 1.0
4 | 132700 | 30500 | 4.35 | 3.52 | 3.53
8 | 260000 | 61000 | 4.26 | 6.91 | 7.07
| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
|--------|--------------------|----------------------|---------------------------------|-----------------------------------------------|------------------------|------------------------------
|8 |2560 | 53641 | 186442 | 3.48 | 7.03 | 7.82
|4 |2560 | 26647 | 92514 | 3.47 | 3.49 | 3.88
|1 |2560 | 7635 | 23821 | 3.12 | 1 | 1
## Inference performance results
All results were obtained by running the `generate.py` inference script in the pytorch-19.01-py3 Docker container. Inference was run on a single GPU.
In addition, mixed precision training has lower memory requirements, so we can train with a batch size twice as large.
| GPUs | Batch size / GPU | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision
|--------|--------------------|---------------------------------|-----------------------------------------------|--------------------
|8 |5120 | 235077 | 4.38 | 7.31
|4 |5120 | 75574 | 2.83 | 2.35
|1 |5120 | 32153 | 4.21 | 1
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by running the `run_training.sh` and `run_training_fp32.sh` training scripts in the PyTorch NGC container on NVIDIA DGX-2 with (16x V100 32G) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.
| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
|--------|--------------------|----------------------|---------------------------------|-----------------------------------------------|------------------------|-----------------------------
| 16 | 5120 | 128319 | 476585 | 3.71 | |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Inference performance results
We provide two inference scripts, `generate.py` for preprocessed data and `interactive.py` for raw input. To measure throughput of the Transformer model, run:
```bash
python generate.py /path/to/dataset/wmt14_en_de_joined_dict \
--path /path/to/your/checkpoint.pt \
--beam 4 \
--remove-bpe \
--quiet \
--fp16
```
To measure end-to-end inference with tokenization, run:
```
python interactive.py \
--buffer-size 1 \
--fp16 \
--path /path/to/your/checkpoint.pt \
--max-tokens 128 \
--fuse-dropout-add \
--remove-bpe \
--bpe-codes /path/to/code/file \
/path/to/dataset/wmt14_en_de_joined_dict/
```
We have benchmarked the inference performance by running the `generate.py` script using the pytorch-19.03-py3 NGC Docker container. Inference was run on a single GPU.
GPU | Mixed precision | FP32 | FP16/Mixed speedup
---|---|---|---
Tesla V100 | 5129.34 | 3396.09 | 1.51
Tesla V100-SXM2-32GB | 6010 | 3414 | 1.76
## Changelog
## Release notes
### Changelog
January 2019
- initial commit, forked from [fairseq](https://github.com/pytorch/fairseq/commit/ac5fddfc691267285a84c81d39475411da5ed1c6)
- adding mid-training [SacreBLEU](https://pypi.org/project/sacrebleu/1.2.10/) evaluation. Better handling of OOMs.
May 2019:
- adding mid-training SacreBLEU evaluation. Better handling of OOMs.
June 2019
- new README
- jit support added
## Known issues
- The course of training heavily depends on the random seed. There is high variance in the time required to reach a certain BLEU score, and the highest BLEU score observed varies between runs with different seeds.

View file

@ -41,8 +41,22 @@ from . import (
)
from apex.normalization.fused_layer_norm import FusedLayerNorm
from .fused_dropout_add import fused_dropout_add
from .fused_relu_dropout import fused_relu_dropout
@torch.jit.script
def jit_dropout_add(x, residual, prob, is_training) :
# type: (Tensor, Tensor, float, bool) -> Tensor
out = F.dropout(x, p=prob, training=is_training)
out = residual + out
return out
@torch.jit.script
def jit_relu_dropout(x, prob, is_training) :
# type: (Tensor, float, bool) -> Tensor
out = F.threshold(x, 0., 0.)
out = F.dropout(out, p=prob, training=is_training)
return out
@register_model('transformer')
class TransformerModel(FairseqModel):
@ -438,7 +452,7 @@ class TransformerEncoderLayer(nn.Module):
x = self.maybe_layer_norm(0, x, before=True)
x, _ = self.self_attn(query=x, key=x, value=x, key_padding_mask=encoder_padding_mask)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -448,14 +462,14 @@ class TransformerEncoderLayer(nn.Module):
x = self.maybe_layer_norm(1, x, before=True)
if self.fuse_relu_dropout :
x = fused_relu_dropout(self.fc1(x), self.relu_dropout)
x = jit_relu_dropout(self.fc1(x), self.relu_dropout, self.training)
else :
x = F.threshold(self.fc1(x),0,0)
x = F.dropout(x, p=self.relu_dropout, training=self.training)
x = self.fc2(x)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -517,7 +531,7 @@ class TransformerDecoderLayer(nn.Module):
need_weights=False,
)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -537,7 +551,7 @@ class TransformerDecoderLayer(nn.Module):
need_weights=(not self.training and self.need_attn),
)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -546,13 +560,13 @@ class TransformerDecoderLayer(nn.Module):
residual = x
x = self.maybe_layer_norm(self.final_layer_norm, x, before=True)
if self.fuse_relu_dropout :
x = fused_relu_dropout(self.fc1(x), self.relu_dropout)
x = jit_relu_dropout(self.fc1(x), self.relu_dropout, self.training)
else :
x = F.threshold(self.fc1(x),0,0)
x = F.dropout(x, p=self.relu_dropout, training=self.training)
x = self.fc2(x)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x

View file

@ -29,6 +29,7 @@ import torch
from fairseq import data, options, tasks, tokenizer, utils
from fairseq.sequence_generator import SequenceGenerator
from fairseq.meters import StopwatchMeter
from apply_bpe import BPE
@ -156,6 +157,9 @@ def main(args):
)
return result
gen_timer = StopwatchMeter()
end2end_timer = StopwatchMeter()
def process_batch(batch):
tokens = batch.tokens
lengths = batch.lengths
@ -164,11 +168,13 @@ def main(args):
tokens = tokens.cuda()
lengths = lengths.cuda()
gen_timer.start()
translations = translator.generate(
tokens,
lengths,
maxlen=int(args.max_len_a * tokens.size(1) + args.max_len_b),
)
gen_timer.stop()
return [make_result(batch.srcs[i], t) for i, t in enumerate(translations)]
@ -178,6 +184,7 @@ def main(args):
for inputs in buffered_read(args.buffer_size):
indices = []
results = []
end2end_timer.start()
for batch, batch_indices in make_batches(inputs, args, src_dict, models[0].max_positions(), bpe):
indices.extend(batch_indices)
results += process_batch(batch)
@ -191,6 +198,12 @@ def main(args):
if align is not None:
print(align)
print('Model latency: {} s'.format(gen_timer.sum))
gen_timer.reset()
end2end_timer.stop()
print('End-to-end translation time: {} s'.format(end2end_timer.sum))
end2end_timer.reset()
if __name__ == '__main__':
parser = options.get_generation_parser(interactive=True)

View file

@ -30,4 +30,6 @@ python preprocess.py \
--nwordstgt 33712 \
--joined-dictionary
sacrebleu -t wmt14/full -l de-en --echo src > $DATASET_DIR/sacrebleu_reference.de
cp $TEXT/code $DATASET_DIR/code

View file

@ -0,0 +1,27 @@
#! /bin/bash
nvidia-smi
python /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 2560 \
--seed 1 \
--max-epoch 50 \
--no-epoch-checkpoints \
--no-progress-bar \
--save-dir /results/checkpoints \

View file

@ -0,0 +1,28 @@
#! /bin/bash
nvidia-smi
python /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--max-epoch 50 \
--fp16 \
--no-epoch-checkpoints \
--no-progress-bar \
--save-dir /results/checkpoints_1_GPU_fp16 \

View file

@ -0,0 +1,29 @@
#! /bin/bash
nvidia-smi
python -m torch.distributed.launch --nproc_per_node 16 /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--max-epoch 1 \
--fp16 \
--no-epoch-checkpoints \
--no-progress-bar \
--log-interval 500 \
--save-dir /results/checkpoints_dgx2 \
--distributed-init-method env://

View file

@ -0,0 +1,29 @@
#! /bin/bash
nvidia-smi
python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--max-epoch 50 \
--fp16 \
--no-epoch-checkpoints \
--no-progress-bar \
--log-interval 500 \
--save-dir /results/checkpoints \
--distributed-init-method env://

View file

@ -57,22 +57,6 @@ strided_batched_gemm = CUDAExtension(
}
)
fused_dropout_add = CUDAExtension(
name='fused_dropout_add_cuda',
sources=['fairseq/models/fused_dropout_add/fused_dropout_add_cuda.cpp', 'fairseq/models/fused_dropout_add/fused_dropout_add_cuda_kernel.cu'],
extra_compile_args={
'cxx': ['-O2',],
'nvcc':['--gpu-architecture=sm_70','-O3','--use_fast_math', '--expt-extended-lambda'],
}
)
fused_relu_dropout = CUDAExtension(
name='fused_relu_dropout_cuda',
sources=['fairseq/models/fused_relu_dropout/fused_relu_dropout_cuda.cpp', 'fairseq/models/fused_relu_dropout/fused_relu_dropout_cuda_kernel.cu'],
extra_compile_args={
'cxx': ['-O2',],
'nvcc':['--gpu-architecture=sm_70','-O3','--use_fast_math', '--expt-extended-lambda'],
}
)
batch_utils = CppExtension(
name='fairseq.data.batch_C',
sources=['fairseq/data/csrc/make_batches.cpp'],
@ -88,7 +72,7 @@ setup(
license=license,
install_requires=reqs.strip().split('\n'),
packages=find_packages(),
ext_modules=[bleu, strided_batched_gemm, fused_dropout_add, fused_relu_dropout, batch_utils],
ext_modules=[bleu, strided_batched_gemm, batch_utils],
cmdclass={
'build_ext': BuildExtension
},

View file

@ -115,11 +115,10 @@ def main(args):
valid_losses = [None]
valid_subsets = args.valid_subset.split(',')
while lr >= args.min_lr and epoch_itr.epoch < max_epoch and trainer.get_num_updates() < max_update and current_bleu < tgt_bleu:
while lr >= args.min_lr and epoch_itr.epoch < max_epoch and trainer.get_num_updates() < max_update and current_bleu < tgt_bleu:
# train for one epoch
train(args, trainer, task, epoch_itr)
if epoch_itr.epoch % args.validate_interval == 0:
valid_losses = validate(args, trainer, task, epoch_itr, valid_subsets)
@ -366,25 +365,11 @@ def score(args, trainer, task, epoch_itr, subset):
if args.distributed_world_size > 1:
_all_gather_bleu_scorer(scorer)
chunked_predictions = []
while True:
if len(predictions)>100:
chunked_predictions.append(predictions[:100])
predictions = predictions[100:]
else:
chunked_predictions.append(predictions)
break
reduced_predictions = []
for chunk in chunked_predictions:
torch.cuda.synchronize()
reduced_predictions += distributed_utils.all_gather_list(chunk, max_size=65000)
torch.cuda.synchronize()
predictions = _all_gather_predictions(predictions)
with open(os.path.join(args.data, 'sacrebleu_reference.de'), 'r') as reference:
refs = [reference.readlines()]
#reducing indexed predictions as strings is more memory efficient than reducing tuples
predictions = [item for sublist in reduced_predictions for item in sublist]
predictions = [tuple(item.split('\t')) for item in predictions]
predictions = [(int(item[0]), item[1]) for item in predictions]
predictions.sort(key=lambda tup: tup[0])
@ -401,6 +386,36 @@ def score(args, trainer, task, epoch_itr, subset):
return scorer.score(order=4), sacrebleu_score.score
def _all_gather_predictions(predictions):
ready = False
all_ready = False
reduced_predictions = []
max_size = 65000
while not all_ready:
lst_len = len(predictions)
size = 2000 # reserve some extra space for pickle/protocol overhead
n = 0
while n < lst_len:
str_len = len(predictions[n].encode('utf8')) + 8 # per string pickle overhead
if size + str_len >= max_size:
break
size += str_len
n += 1
chunk = predictions[:n]
predictions = predictions[n:]
if not predictions:
ready = True
chunk = (ready, chunk)
torch.cuda.synchronize()
gathered = distributed_utils.all_gather_list(chunk, max_size=65000)
torch.cuda.synchronize()
reduced_predictions += [t[1] for t in gathered]
all_ready = all([t[0] for t in gathered])
reduced_predictions = [item for sublist in reduced_predictions for item in sublist]
return reduced_predictions
def _all_gather_bleu_scorer(scorer):
stats = distributed_utils.all_gather_list(scorer.stat)
bleu_stat = bleu.BleuStat()

Binary file not shown.

After

Width:  |  Height:  |  Size: 192 KiB

View file

@ -25,7 +25,7 @@ The examples are organized first by framework, such as TensorFlow, PyTorch, etc.
### Natural Language Processing
- __GNMT__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/GNMT)] [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Translation/GNMT)]
- __Transformer__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/Transformer)]
- __BERT__ [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT)]
- __BERT__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT)][[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT)]
### Recommender Systems
- __NCF__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF)] [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Recommendation/NCF)]
@ -41,3 +41,6 @@ We're posting these examples on GitHub to better support the community, facilita
## Known issues
In each of the network READMEs, we indicate any known issues and encourage the community to provide feedback.

View file

@ -1,28 +1,45 @@
# ResNet-50 v1.5 for TensorFlow
This repository provides a script and recipe to train the ResNet-50 v1.5 model to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
* [The model](#the-model)
* [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Data Augmentation](#data-augmentation)
* [Other training recipes](#other-training-recipes)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick start guide](#quick-start-guide)
* [Details](#details)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Data augmentation](#data-augmentation)
* [Training process](#training-process)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training performance results](#training-performance-results)
* [Inference performance results](#inference-performance-results)
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-16x-v100-32g)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-16x-v100-32g)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
# The model
## Model overview
The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
The difference between v1 and v1.5 is in the bottleneck blocks which requires
@ -31,16 +48,15 @@ downsampling, for example, v1 has stride = 2 in the first 1x1 convolution, where
This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1,
but comes with a small performance drawback (~5% imgs/sec).
The following features were implemented in this model:
* Data-parallel multi-GPU training with Horovod
* Mixed precision support with TensorFlow Automatic Mixed Precision (TF-AMP), which enables mixed precision training without any changes to the code base by performing automatic graph rewrites and loss scaling controlled by an environment variable, using Tensor Core operations to maximize throughput on NVIDIA Volta GPUs.
* Static loss scaling for Tensor Cores (mixed precision) training
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The following performance optimizations were implemented in this model:
* JIT graph compilation with [XLA](https://www.tensorflow.org/xla)
* NVIDIA Data Loading ([DALI](https://github.com/NVIDIA/DALI)) support (experimental).
## Default configuration
### Model architecture
### Default configuration
This model trains for 90 epochs, with default ResNet50 v1.5 setup:
@ -56,25 +72,7 @@ during first 5 epochs according to [Training ImageNet in 1 hour](https://arxiv.o
* Weight decay: 1e-4
## Data Augmentation
This model uses the following data augmentation:
* During training, we perform the following augmentation techniques:
* Normalization
* Random resized crop to 224x224
* Scale from 8% to 100%
* Aspect ratio from 3/4 to 4/3
* Random horizontal flip
* During inference, we perform the following augmentation techniques:
* Normalization
* Scale to 256x256
* Center crop to 224x224
## Other training recipes
### Other training recipes
This script does not target any specific benchmark.
There are changes that others have made which can speed up convergence and/or increase accuracy.
@ -93,15 +91,63 @@ and this recipe keeps the original assumption that validation is done on 224px i
Using 288px images means that a lot more FLOPs are needed during inference to reach the same accuracy.
# Setup
### Feature support matrix
The following features are supported by this model.
| Feature | ResNet-50 v1.5 TensorFlow |
|-----------------------|---------------------------|
|Multi-GPU training with [Horovod](https://github.com/horovod/horovod) | Yes |
|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes |
#### Features
Multi-GPU training with Horovod - Our model uses Horovod to implement efficient multi-GPU training with NCCL.
For details, see example sources in this repository or see the [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
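As a reference, the usual Horovod TF1 recipe looks roughly like the following (a sketch only; the actual wiring lives in `utils/hvd_utils.py` and `runtime/runner.py`, and the learning rate here is a placeholder):
```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across GPUs with NCCL
opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variable states from rank 0 so all workers start in sync
# (pass `hooks` to tf.train.MonitoredTrainingSession; checkpoint only on rank 0)
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```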
NVIDIA DALI - DALI is a library for accelerating the data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader
with the DALI library. For details, see example sources in this repository or see the [DALI documentation](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html).
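For illustration, a minimal DALI pipeline over ImageNet-style TFRecords might look like the sketch below (written against the DALI 0.x API shipped in recent NGC containers; the feature names, sizes and normalization constants are placeholders, not the repository's exact pipeline from `utils/dali_utils.py`):
```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import nvidia.dali.tfrecord as tfrec

class TrainPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, tfrecord, tfrecord_idx):
        super(TrainPipeline, self).__init__(batch_size, num_threads, device_id)
        # The TFRecord reader needs the index files passed via --data_idx_dir
        self.input = ops.TFRecordReader(
            path=tfrecord, index_path=tfrecord_idx,
            features={'image/encoded': tfrec.FixedLenFeature((), tfrec.string, ""),
                      'image/class/label': tfrec.FixedLenFeature([1], tfrec.int64, -1)})
        # Decode and augment on the GPU
        self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)
        self.resize = ops.RandomResizedCrop(device="gpu", size=(224, 224))
        self.normalize = ops.CropMirrorNormalize(
            device="gpu", output_dtype=types.FLOAT, output_layout=types.NHWC,
            image_type=types.RGB, mean=[124., 117., 104.], std=[58., 57., 57.])
        self.coin = ops.CoinFlip(probability=0.5)

    def define_graph(self):
        inputs = self.input()
        images = self.decode(inputs["image/encoded"])
        images = self.resize(images)
        images = self.normalize(images, mirror=self.coin())
        return images, inputs["image/class/label"]
```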
### Mixed precision training
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
#### Enabling mixed precision
Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP) extension which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by using simple multiplication of the loss by a constant value or automatically, by TF-AMP. Automatic mixed precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations. First, programmers need not modify network model code, reducing development and maintenance effort. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
To enable mixed precision, simply set the values of the following environment variables inside your training script:
- Enable TF-AMP graph rewrite:
```
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
```
- Enable Automated Mixed Precision:
```
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
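For reference, the static alternative mentioned above boils down to scaling the loss before the backward pass and unscaling the gradients before the update. A minimal TF1 sketch (illustrative only; 128 is a placeholder scale value, and dense gradients are assumed):
```python
import tensorflow as tf

def minimize_with_static_loss_scale(optimizer, loss, loss_scale=128.0):
    # Scale the loss so small gradient values survive the FP16 backward pass
    grads_and_vars = optimizer.compute_gradients(loss * loss_scale)
    # Unscale before applying, so the weight update itself is unchanged
    unscaled = [(grad / loss_scale, var)
                for grad, var in grads_and_vars if grad is not None]
    return optimizer.apply_gradients(unscaled)
```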
## Setup
The following section lists the requirements that you need to meet in order to use the ResNet50 v1.5 model.
## Requirements
### Requirements
This repository contains a Dockerfile which extends the TensorFlow NGC container and encapsulates all dependencies. Aside from these dependencies, ensure you have the following software:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [TensorFlow 19.03-py3 NGC container or later](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow)
* [TensorFlow 19.06-py3 NGC container or later](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow)
* [NVIDIA Volta based GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
For more information about how to get started with NGC containers, see the
@ -110,28 +156,37 @@ following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running Tensorflow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running).
# Quick start guide
To train your model using mixed precision with tensor cores, perform the following steps using the default parameters of the ResNet-50 v1.5 model on the [ImageNet](http://www.image-net.org/) dataset.
For those unable to use the [TensorFlow 19.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## 1. Download and preprocess the dataset.
## Quick Start Guide
To train your model using mixed precision with tensor cores, perform the following steps using the default parameters of the ResNet-50 v1.5 model on the [ImageNet](http://www.image-net.org/) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Classification/RN50v1.5
```
2. Download and preprocess the dataset.
The ResNet50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
To download and preprocess the dataset, use the [Generate ImageNet for TensorFlow](https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh) script. The dataset will be downloaded to a directory specified as the first parameter of the script.
## 2. Build the ResNet-50 v1.5 TensorFlow NGC container.
3. Build the ResNet-50 v1.5 TensorFlow NGC container.
```bash
bash scripts/docker/build.sh
```
## 3. Start an interactive session in the NGC container to run training/inference.
4. Start an interactive session in the NGC container to run training/inference.
After you build the container image, you can start an interactive CLI session with
```bash
bash scripts/docker/interactive.sh
```
The interactive.sh script requires that the location of the dataset is specified. For example, /data.
## 4. Start training.
5. Start training.
To run training for a default configuration (as described in [Default configuration](#default-configuration)), for example 1/4/8 GPUs, FP16/FP32, run one of the scripts in the ./scripts directory called `./scripts/RN50_{FP16, FP32}_{1, 4, 8}GPU.sh`. Each of the scripts requires three parameters:
* path to the root directory of the model as the first argument
* path to the dataset as a second argument
@ -142,7 +197,7 @@ For example:
./scripts/RN50_FP16_8GPU.sh <path to model> <path to dataset> <path to results>
```
## 5. Start validation/evaluation.
6. Start validation/evaluation.
Model evaluation on a checkpoint can be launched by running one of the `./scripts/RN50_{FP16, FP32}_EVAL.sh` scripts in the `./scripts` directory. Each of the scripts requires three parameters:
* path to the root directory of the model as the first argument
@ -159,57 +214,131 @@ To run a non-default configuration, use:
`python ./main.py --mode=evaluate --use_tf_amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to checkpoint>`
# Details
## Advanced
## Command line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
- `main.py`: the script that controls the logic of training and validation of the ResNet-50 v1.5 model;
- `Dockerfile`: Instructions for docker to build a container with the basic set of dependencies to run ResNet-50 v1.5;
- `requirements.txt`: a set of extra Python requirements for running ResNet-50 v1.5;
The `model/` directory contains modules used to define the ResNet-50 v1.5 model:
- `resnet_v1_5.py`: the definition of the ResNet-50 v1.5 model
- `blocks/conv2d_block.py`: the definition of the ResNet-50 v1.5 2D convolution block
- `blocks/resnet_bottleneck_block.py`: the definition of the ResNet-50 v1.5 bottleneck block
- `layers/*.py`: definitions of specific layers used in the ResNet-50 v1.5 model
The `utils/` directory contains utility modules:
- `cmdline_helper.py`: helper module for command line processing
- `data_utils.py`: module defining input data pipelines
- `dali_utils.py`: helper module for DALI
- `hvd_utils.py`: helper module for Horovod
- `image_processing.py`: image processing and data augmentation functions
- `learning_rate.py`: definition of the learning rate schedule used
- `optimizers.py`: definitions of the custom optimizers used
- `hooks/*.py`: definitions of specific hooks allowing logging of the training and inference process
The `runtime/` directory contains modules that define the mechanics of the training process:
- `runner.py`: module encapsulating the training, inference and evaluation
The `scripts/` directory contains scripts wrapping common scenarios.
### Parameters
#### The script `main.py`
The script for training and evaluating the ResNet-50 v1.5 model has a variety of parameters that control these processes.
##### Common parameters
`--mode`
: allows specification of the mode in which the script will run: train, train_and_evaluate, evaluate, training_benchmark or inference_benchmark.
`--data_dir`, `--data_idx_dir`
: allow specification of the dataset location.
`--seed`
: allows specification of the seed for RNGs.
`--batch_size`
: allows specification of the minibatch size.
`--num_iter` and `--iter_unit`
: allow specification of the training/evaluation length.
`--use_tf_amp`
: flag enabling TF-AMP mixed precision computation.
`--use_xla`
: flag enabling XLA graph optimization.
`--use_dali`
: flag enabling the DALI input pipeline. This parameter requires `--data_idx_dir` to be set.
##### Training related
`--use_auto_loss_scaling`
: flag enabling automatic loss scaling.
`--lr_init`
: initial value of the learning rate.
`--warmup_steps`
: allows you to specify the number of iterations considered as warmup and not taken into account for performance measurements.
`--momentum`
: momentum argument for the SGD optimizer.
`--weight_decay`
: weight decay argument for the SGD optimizer.
`--batch_size`
: the number of inputs processed at once in each iteration.
`--loss_scale`
: value of the static loss scale. This parameter has no effect if `--use_auto_loss_scaling` is set.
##### Utility parameters
`--help`
: displays a short description of all parameters accepted by the script.
### Command-line options
All these parameters can be controlled by passing command-line arguments
to the `main.py` script. To get a complete list of all command-line arguments
with descriptions and default values you can run:
```
python main.py --help
```
To summarize, the most important arguments are as follows:
```
--mode {train,train_and_evaluate,evaluate,training_benchmark,inference_benchmark}
The execution mode of the script.
--data_dir DATA_DIR Path to dataset in TFRecord format. Files should be
named 'train-*' and 'validation-*'.
--data_idx_dir DATA_IDX_DIR
Path to index files for DALI. Files should be named
'train-*' and 'validation-*'.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--iter_unit {epoch,batch}
Unit of iterations.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--loss_scale LOSS_SCALE
Loss scale for mixed precision training.
--use_auto_loss_scaling
Use automatic loss scaling in fp32 AMP.
--use_xla Enable XLA (Accelerated Linear Algebra) computation
for improved performance.
--use_tf_amp Enable Automatic Mixed Precision to speedup fp32
computation using tensor cores.
--use_dali Enable DALI data input.
```
### Getting the data
The ResNet-50 v1.5 model was trained on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
## Training process
To download and preprocess the dataset, use the [Generate ImageNet for TensorFlow](https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh) script. The dataset will be downloaded to a directory specified as the first parameter of the script.
#### Data Augmentation
This model uses the following data augmentation:
* During training, we perform the following augmentation techniques:
* Normalization
* Random resized crop to 224x224
* Scale from 8% to 100%
* Aspect ratio from 3/4 to 4/3
* Random horizontal flip
* During inference, we perform the following augmentation techniques:
* Normalization
* Scale to 256x256
* Center crop to 224x224
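A rough TF1 sketch of the training-time transforms listed above (illustrative only; the actual implementation lives in `utils/image_processing.py`, and the normalization constants here are placeholders):
```python
import tensorflow as tf

def train_preprocess(image):
    # Random resized crop: sample a box covering 8%-100% of the image area
    # with aspect ratio in [3/4, 4/3], then resize the crop to 224x224
    bbox = tf.constant([0.0, 0.0, 1.0, 1.0], shape=[1, 1, 4])
    begin, size, _ = tf.image.sample_distorted_bounding_box(
        tf.shape(image), bounding_boxes=bbox,
        area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.33))
    image = tf.slice(image, begin, size)
    image = tf.image.resize_images(image, [224, 224])
    # Random horizontal flip and per-channel normalization
    image = tf.image.random_flip_left_right(image)
    image = (tf.cast(image, tf.float32) - [124., 117., 104.]) / [58., 57., 57.]
    return image
```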
### Training process
To run a configuration that is not based on the default parameters, use:
* For 1 GPU
@ -224,28 +353,13 @@ To run a configuration that is not based on the default parameters, use:
* FP16
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --batch_size=256 --use_tf_amp --data_dir=<path to imagenet> --results_dir=<path to results>`
## Performance
## Enabling mixed precision
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
# Benchmarking
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
## Training performance benchmark
#### Training performance benchmark
To benchmark the training performance on a specific batch size, run:
@ -269,7 +383,7 @@ Each of these scripts runs 200 warm-up iterations and measures the first epoch.
To control warmup and benchmark length, use `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
## Inference performance benchmark
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size, run:
@ -283,14 +397,16 @@ Each of these scripts, by default runs 20 warm-up iterations and measures the ne
To control warm-up and benchmark length, use `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
# Results
### Results
The following sections provide details on how we achieved our results in training accuracy, performance and inference performance.
## Training accuracy results
#### Training accuracy results
##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./scripts/RN50_{FP16, FP32}_{1, 4, 8}GPU.sh` script in
the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
the tensorflow-19.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
| **number of GPUs** | **mixed precision top1** | **mixed precision training time** | **FP32 top1** | **FP32 training time** |
@ -301,10 +417,10 @@ the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
## Training performance results
#### Training performance results
### DGX-1
The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbench_fp16.sh` and `./scripts/benchmarking/DGX1V_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPU.
##### NVIDIA DGX-1 (8x V100 16G)
The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbench_fp16.sh` and `./scripts/benchmarking/DGX1V_trainbench_fp32.sh` scripts in the tensorflow-19.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPU.
| **number of GPUs** | **mixed precision img/s** | **FP32 img/s** | **mixed precision speedup** | **mixed precision weak scaling** | **FP32 weak scaling** |
@ -313,10 +429,17 @@ The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbenc
| **4** | 3197.4 | 1419.4 | 2.25 | 3.96 | 3.89 |
| **8** | 6209.9 | 2778.5 | 2.24 | 7.74 | 7.61 |
##### XLA Enabled
| **number of GPUs** | **mixed precision img/s** | **mixed precision + XLA img/s** | **XLA speedup** |
|:------------------:|:-------------------------:|:-------------------------------:|:---------------:|
| **1** | 802.1 | 1177.9 | 1.47 |
| **4** | 3197.4 | 4654.1 | 1.45 |
| **8** | 6209.9 | 8104.4 | 1.30 |
### DGX-2
The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench_fp16.sh` and `./scripts/benchmarking/DGX2_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-2 with 16 V100 32G GPU.
##### NVIDIA DGX-2 (16x V100 32G)
The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench_fp16.sh` and `./scripts/benchmarking/DGX2_trainbench_fp32.sh` scripts in the tensorflow-19.06-py3 Docker container on NVIDIA DGX-2 with 16 V100 32G GPU.
| **number of GPUs** | **mixed precision img/s** | **FP32 img/s** | **mixed precision speedup** | **mixed precision weak scaling** | **FP32 weak scaling** |
|:------------------:|:-------------------------:|:--------------:|:---------------------------:|:--------------------------------:|:---------------------:|
@ -325,21 +448,7 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench
| **8** | 6439.4 | 2888.6 | 2.23 | 7.84 | 7.65 |
| **16** | 12467.5 | 5660.8 | 2.20 | 15.17 | 15.00 |
### XLA-enabled results
### DGX-1
The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbench_fp16.sh` and `./scripts/benchmarking/DGX1V_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPU.
| **number of GPUs** | **mixed precision img/s** | **mixed precision + XLA img/s** | **XLA speedup** |
|:------------------:|:-------------------------:|:-------------------------------:|:---------------:|
| **1** | 802.1 | 1177.9 | 1.47 |
| **4** | 3197.4 | 4654.1 | 1.45 |
| **8** | 6209.9 | 8104.4 | 1.30 |
### DGX-2
The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench_fp16.sh` and `./scripts/benchmarking/DGX2_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-2 with 16 V100 32G GPU.
##### XLA Enabled
| **number of GPUs** | **mixed precision img/s** | **mixed precision + XLA img/s** | **XLA speedup** |
|:------------------:|:-------------------------:|:-------------------------------:|:---------------:|
@ -348,10 +457,10 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench
| **8** | 6439.4 | 9295.5 | 1.44 |
| **16** | 12467.5 | 15902.8 | 1.27 |
## Inference performance results
#### Inference performance results
### DGX-1
The results were obtained by running the `./scripts/benchmarking/DGX1V_inferbench_fp16.sh` and `./scripts/benchmarking/DGX1V_inferbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
##### NVIDIA DGX-1 (8x V100 16G)
The results were obtained by running the `./scripts/benchmarking/DGX1V_inferbench_fp16.sh` and `./scripts/benchmarking/DGX1V_inferbench_fp32.sh` scripts in the tensorflow-19.06-py3 Docker container on a single GPU of NVIDIA DGX-1 with 8 V100 16G GPUs.
| **batch size** | **mixed precision img/s** | **FP32 img/s** | **mixed precision + XLA img/s** |
|:--------------:|:-------------------------:|:--------------:|:-------------------------------:|
@ -366,8 +475,8 @@ The results were obtained by running the `./scripts/benchmarking/DGX1V_inferbenc
| **256** | 2129.3 | N/A | 3547.9 |
### DGX-2
The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench_fp16.sh` and `./scripts/benchmarking/DGX2_inferbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 16 V100 32G GPUs.
##### NVIDIA DGX-2 (16x V100 32G)
The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench_fp16.sh` and `./scripts/benchmarking/DGX2_inferbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on a single GPU of NVIDIA DGX-2 with 16 V100 32G GPUs.
| **batch size** | **mixed precision img/s** | **FP32 img/s** | **mixed precision + XLA img/s** |
|:--------------:|:-------------------------:|:--------------:|:-------------------------------:|
@ -381,7 +490,9 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench
| **128** | 2126.5 | 1168.8 | 3469.6 |
| **256** | 2203.6 | N/A | 3713.2 |
# Changelog
## Release notes
### Changelog
1. March 1, 2019
* Initial release
2. May 15, 2019
@ -389,5 +500,5 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench
* Added scripts for DGX-2
* Added benchmark results for DGX-2 and XLA-enabled DGX-1 and DGX-2.
# Known issues
### Known issues
There are no known issues with this model.
View file
@ -52,6 +52,7 @@ class Runner(object):
model_dir=None,
log_dir=None,
data_dir=None,
data_idx_dir=None,
# ======= Optimization HParams ======== #
use_xla=False,
@ -223,6 +224,7 @@ class Runner(object):
config.allow_soft_placement = True
config.log_device_placement = False
if hvd_utils.is_using_hvd():
config.gpu_options.visible_device_list = str(hvd.local_rank())
View file
@ -332,6 +332,7 @@ To achieve same results, follow the [Quick start guide](#quick-start-guide) outl
March 2019
* Initial release
May 2019
* Test scripts updated
View file
@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/tensorflow:19.03-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:19.06-py3
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && \
apt-get install -y unzip
View file
@ -4,28 +4,30 @@ This repository provides a script and recipe to train Neural Collaborative Filte
accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
* [The Model](#the-model)
* [Model overview](#model-overview)
* [Default Configuration](#default-configuration)
* [Mixed precision training](#mixed-precision-training)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Details](#details)
* [Advanced](#advanced)
* [Command Line Arguments](#command-line-arguments)
* [Getting the Data](#getting-the-data)
* [Other Datasets](#other-datasets)
* [Training Process](#training-process)
* [Enabling Mixed Precision](#enabling-mixed-precision)
* [Evaluation Process](#evaluation-process)
* [Benchmarking](#benchmarking)
* [Performance Benchmark](#performance-benchmark)
* [Results](#results)
* [Training Accuracy Results](#training-accuracy-results)
* [Training Performance Results](#training-performance-results)
* [Inference Performance Results](#inference-performance-results)
* [Changelog](#changelog)
* [Known Issues](#known-issues)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Performance Benchmark](#performance-benchmark)
* [Results](#results)
* [Training Accuracy Results](#training-accuracy-results)
* [Training Performance Results](#training-performance-results)
* [Inference Performance Results](#inference-performance-results)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known Issues](#known-issues)
## The Model
## Model overview
The Neural Collaborative Filtering (NCF) model is a neural network that provides collaborative filtering based on
implicit feedback, specifically, it provides product recommendations based on user and item interactions. The training
@ -78,6 +80,47 @@ This implementation is implemented with the following features:
- Note: The negative samples generated for the test set are always verified regardless if the shortcut is enabled or
not.
### Mixed Precision Training
[Mixed Precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing information in single-precision to retain as much information as
possible. Mixed precision is enabled in TensorFlow by using a custom variable getter that casts variables to
half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small
gradient magnitudes in backpropagation, a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included
when applying gradients. In TensorFlow, loss scaling can be easily applied by using
[LossScaleOptimizer](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/LossScaleOptimizer) . The
scaling value to be used can be
[dynamic](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/ExponentialUpdateLossScaleManager) or
[fixed](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/FixedLossScaleManager)
Enabling mixed precision is now easier than ever with support for AMP in TensorFlow. TF-AMP is an extension of
TensorFlow that enables mixed precision without any code changes. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and loss scaling. Currently, TF-AMP
is only available through NVIDIA's TensorFlow Docker container.
TF-AMP is controlled by the `TF_ENABLE_AUTO_MIXED_PRECISION=1` environment variable; when set, TensorFlow will rewrite
all graphs to perform computations in half-precision format and loss scaling will automatically be applied.
To enable mixed precision training using TF-AMP, the environment variable can be set prior to running `ncf.py`.
Alternatively, `ncf.py` can be run with the `--fp16` flag.
**Note:** The `--fp16` flag sets the environment variable to the correct value
for mixed precision training inside the script, for example:
```
# Note that the --fp16 flag maps to the amp variable in code
if args.amp:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
```
For more information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper
and the [Training With Mixed Precision documentation](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
from the TensorFlow User Guide.
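For the manual path named above, a minimal sketch using the `tf.contrib` loss scale managers (TF1-era API, illustrative only; this is not the code path `ncf.py` takes when run with `--fp16`, and the hyperparameter values are placeholders):
```python
import tensorflow as tf

# Wrap any optimizer with a loss-scale manager; the dynamic manager grows the
# scale every N clean steps and shrinks it when gradients overflow
opt = tf.train.AdamOptimizer(learning_rate=0.0045)
manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
    init_loss_scale=2 ** 15, incr_every_n_steps=2000)
opt = tf.contrib.mixed_precision.LossScaleOptimizer(opt, manager)
```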
## Setup
The following section lists the requirements in order to start training the NCF model.
@ -103,14 +146,14 @@ Documentation and the Deep Learning Documentation:
To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the default
parameters of the NCF model on the ml-20m dataset.
### 1. Clone this repository
### Clone this repository
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Recommendation/NCF
```
### 2. Build the NCF TensorFlow NGC container.
### Build the NCF TensorFlow NGC container.
After Docker is correctly set up, you can build the NCF image with:
@ -118,7 +161,7 @@ After Docker is correctly set up, you can build the NCF image with:
docker build . -t nvidia_ncf
```
### 3. Launch the NCF TensorFlow Docker container.
### Launch the NCF TensorFlow Docker container.
```bash
mkdir data
@ -129,7 +172,9 @@ This will launch the container and mount the ./data directory as a volume to the
Any datasets and experiment results (logs, checkpoints etc.) saved to /data will be accessible in the ./data directory
on the host.
### 4. Download and preprocess the dataset.
### Download and preprocess the dataset.
#### ml-20m
Preprocessing consists of downloading the data, filtering out users that have less than 20 ratings (by default), sorting
the data and dropping the duplicates. No data augmentation techniques are used in the preprocessing stage.
@ -140,7 +185,7 @@ To download and preprocess the ml-20m dataset, run:
./prepare_dataset.sh
```
##### ml-1m
#### ml-1m
To download and preprocess the ml-1m dataset, run:
@ -151,7 +196,7 @@ To download and preprocess the ml-1m dataset, run:
This will store the preprocessed training and evaluation data in the `/data` directory, so that it can be later used to
train the model (by passing the appropriate `--data` argument to the `ncf.py` script).
### 5. Start training.
### Start training.
After the Docker container is launched, the training with the default hyper-parameters can be started with:
@ -166,7 +211,7 @@ mpirun -np $numgpu \
After the training is complete, the model parameters that provide the best evaluation accuracy are saved to the
directory passed to the `--checkpoint-dir` argument. By default, this will be in the `/data/checkpoints/` directory.
### 6. Start validation/evaluation.
### Start validation/evaluation.
To run evaluation on a specific checkpoint, simply run the following command:
@ -177,7 +222,7 @@ python ncf.py --data /data/cache/ml-20m --mode test --checkpoint-dir $checkpoint
Note: TensorFlow checkpoints consist of 3 files each with a `*.ckpt` prefix.
## Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
@ -237,7 +282,7 @@ automatically call `download_dataset.sh` to download the desired dataset, and
then preprocess the training and test datasets. By default, data will be
downloaded to the `/data` directory.
##### Other Datasets
#### Other Datasets
This implementation is tuned for the ml-20m and ml-1m datasets. Using other
datasets might require tuning some hyperparameters (for example, learning rate,
@ -319,46 +364,6 @@ will be stored at the directory pointed to by the `--checkpoint-dir` argument.
Multiple GPUs can be used for training through Horovod. The number of GPUs can
be controlled by the `-np` parameter passed to `mpirun`.
##### Enabling Mixed Precision
[Mixed Precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing information in single-precision to retain as much information as
possible. Mixed precision is enabled in TensorFlow by using a custom variable getter that casts variables to
half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small
gradient magnitudes in backpropagation, a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included
when applying gradients. In TensorFlow, loss scaling can be easily applied by using
[LossScaleOptimizer](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/LossScaleOptimizer) . The
scaling value to be used can be
[dynamic](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/ExponentialUpdateLossScaleManager) or
[fixed](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/FixedLossScaleManager)
Enabling mixed precision is now easier than ever with support for AMP in TensorFlow. TF-AMP is an extension of
TensorFlow that enables mixed precision without any code changes. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and loss scaling. Currently, TF-AMP
is only available through NVIDIA's TensorFlow Docker container.
TF-AMP is controlled by the `TF_ENABLE_AUTO_MIXED_PRECISION=1` environment variable; when set, TensorFlow will rewrite
all graphs to perform computations in half-precision format and loss scaling will automatically be applied.
To enable mixed precision training using TF-AMP, the environment variable can be set prior to running `ncf.py`.
Alternatively, `ncf.py` can be run with the `--fp16` flag.
**Note:** The `--fp16` flag sets the environment variable to the correct value
for mixed precision training inside the script, for example:
```
# Note that the --fp16 flag maps to the amp variable in code
if args.amp:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
```
For more information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper
and the [Training With Mixed Precision documentation](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
from the TensorFlow User Guide.
### Evaluation Process
The evaluation process can be run by the ncf.py script as well. By passing the
@ -372,12 +377,14 @@ The script will then output a line like the one below which describes the model
Eval Time = 1.1829, HR@10 = 0.9574, NDCG@10 = 0.7420
```
## Benchmarking
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model
performance in training and inference modes.
### Performance Benchmark
#### Performance Benchmark
To benchmark the training and inference performance, run:
@ -394,11 +401,11 @@ By default, the `ncf.py` script outputs metrics describing the following:
* Training speed and throughput
* Evaluation speed and throughput
## Results
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
## Training Accuracy Results
### Training Accuracy Results
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 8x V100 16G GPUs.
@ -414,9 +421,9 @@ recorded to demonstrate the maximum accuracy the model can achieve.
| 4 | 0.9589 | 0.9591 |
| 8 | 0.9597 | 0.9598 |
## Training Performance Results
### Training Performance Results
### NVIDIA DGX-1 (8x V100 16G)
#### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 8x V100 16GB GPUs
@ -448,7 +455,7 @@ Those results can be improved when [XLA](https://www.tensorflow.org/xla) is used
in conjunction with mixed precision, delivering up to 2.6x speedup over FP32 on a single GPU (~24.3M items/sec).
However, XLA is still considered experimental.
### NVIDIA DGX-1 (8x V100 32G)
#### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 8x V100 32G GPUs with
@ -472,7 +479,7 @@ The performance was measured by the wall clock time over one training epoch.
The number of samples in the epoch (roughly 100 million samples), was then
divided by the average training duration to obtain the items per second metric.
## Inference Performance Results
### Inference Performance Results
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 1x V100 16G GPUs.
@ -488,14 +495,16 @@ achieve.
| 4 | 88,255,971 | 66,625,422 | 1.32x |
| 8 | 119,159,304 | 100,117,608 | 1.19x |
## Changelog
## Release Notes
### Changelog
March 2019
* Initial Release
## Known Issues
### Known Issues
### Multi-GPU Scaling Efficiency
#### Multi-GPU Scaling Efficiency
Currently, this model does not exhibit good scaling efficiency when scaling to
4 and 8 GPUs. Since we could not find hyper-parameters that could hit the
@ -505,7 +514,7 @@ to a more common weak scaling strategy. Additionally, we believe that the small
dataset size does not facilitate great scaling. However, the training scripts
allow the use of custom datasets provided they are in the correct format.
### Scaling beyond 8 GPUs
#### Scaling beyond 8 GPUs
Neural Collaborative Filtering (NCF) is a relatively lightweight model that
trains quickly with this relatively smaller dataset, ml-20m. Because of the
View file
@ -0,0 +1,127 @@
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
import os
import json
import argparse
import numpy as np
import tensorflow as tf
from neumf import ncf_model_ops
def parse_args():
parser = argparse.ArgumentParser(description="Benchmark inference performance of the NCF model")
parser.add_argument('--load_checkpoint_path', default=None, type=str,
help='Path to the checkpoint file to be loaded. If None will use random weights')
parser.add_argument('--n_users', default=138493, type=int,
help='Number of users. Defaults to the number of users in the ml-20m dataset after preprocessing')
parser.add_argument('--n_items', default=26744, type=int,
help='Number of items. Defaults to the number of items in the ml-20m dataset after preprocessing')
parser.add_argument('-f', '--factors', type=int, default=64,
help='Number of predictive factors')
parser.add_argument('--layers', nargs='+', type=int,
default=[256, 256, 128, 64],
help='Sizes of hidden layers for MLP')
parser.add_argument('--batch_size', default=1, type=int, help='Batch size for inference')
parser.add_argument('--num_batches', default=20, type=int,
help='Number of batches for which to measure latency and throughput')
parser.add_argument('--no_amp', dest='amp', action='store_false', default=True,
help='Disable mixed precision')
parser.add_argument('--xla', dest='xla', action='store_true', default=False,
help='Enable XLA')
parser.add_argument('--log_path', default='nvlog.json', type=str,
help='Path to the file in which to store benchmark results')
return parser.parse_args()
def main():
args = parse_args()
if args.amp:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
# Input tensors
users = tf.placeholder(tf.int32, shape=(None,))
items = tf.placeholder(tf.int32, shape=(None,))
dropout = tf.placeholder_with_default(0.0, shape=())
# Model ops and saver
logits_op = ncf_model_ops(
users=users,
items=items,
labels=None,
dup_mask=None,
params={
'fp16': False,
'val_batch_size': args.batch_size,
'num_users': args.n_users,
'num_items': args.n_items,
'num_factors': args.factors,
'mf_reg': 0,
'layer_sizes': args.layers,
'layer_regs': [0. for i in args.layers],
'dropout': 0.0,
'sigmoid': True,
'top_k': None,
'learning_rate': None,
'beta_1': None,
'beta_2': None,
'epsilon': None,
'loss_scale': None,
},
mode='INFERENCE'
)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
if args.xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)
saver = tf.train.Saver()
if args.load_checkpoint_path:
saver.restore(sess, args.load_checkpoint_path)
else:
# Manually initialize weights with random values (no checkpoint provided)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
users_batch = np.random.randint(size=args.batch_size, low=0, high=args.n_users)
items_batch = np.random.randint(size=args.batch_size, low=0, high=args.n_items)
latencies = []
for _ in range(args.num_batches):
start = time.time()
logits = sess.run(logits_op, feed_dict={users: users_batch, items: items_batch, dropout: 0.0 })
latencies.append(time.time() - start)
results = {
'args' : vars(args),
'best_inference_throughput' : args.batch_size / min(latencies),
'best_inference_latency' : min(latencies),
'inference_latencies' : latencies
}
print('RESULTS: ', json.dumps(results, indent=4))
if args.log_path is not None:
json.dump(results, open(args.log_path, 'w'), indent=4)
if __name__ == '__main__':
main()
View file
@ -1,5 +1,4 @@
import numpy as np
import tensorflow as tf
import cupy as cp
def generate_negatives(neg_users, true_mat, item_range, sort=False, use_trick=False):
@ -29,7 +28,7 @@ def generate_negatives(neg_users, true_mat, item_range, sort=False, use_trick=Fa
neg_users = cp.concatenate(neg_u)
neg_items = cp.concatenate(neg_i)
if sort == False:
if not sort:
return neg_users, neg_items
sorted_users = cp.sort(neg_users)
@ -56,7 +55,6 @@ class DataGenerator():
pos_eval_items, # type: np.ndarray
eval_users_per_batch, # type: int
eval_negative_samples, # type: int
use_neg_trick=False, # type: bool
):
# Check input data
if train_users.shape != train_items.shape:
@ -86,7 +84,6 @@ class DataGenerator():
self._pos_eval_items = pos_eval_items
self.eval_users_per_batch = eval_users_per_batch
self._eval_negative_samples = eval_negative_samples
self.use_neg_trick = use_neg_trick
# Eval data
self.eval_users = None
@ -108,9 +105,9 @@ class DataGenerator():
neg_eval_users_base = cp.repeat(pos_eval_users, self._eval_negative_samples)
# Generate negative samples
test_u_neg, test_i_neg = generate_negatives(
neg_eval_users_base, neg_mat, self.num_items, True
)
test_u_neg, test_i_neg = generate_negatives(neg_users=neg_eval_users_base, true_mat=neg_mat,
item_range=self.num_items, sort=True, use_trick=False)
test_u_neg = test_u_neg.reshape((-1, self._eval_negative_samples)).get()
test_i_neg = test_i_neg.reshape((-1, self._eval_negative_samples)).get()
@ -150,21 +147,20 @@ class DataGenerator():
is_neg = cp.logical_not(self._train_labels)
# Do not store verification matrix if using the negatives generation shortcut
neg_mat = None if self.use_neg_trick else cp.array(self._neg_mat)
neg_mat = None
# If there are no negative samples in the local portion of the training data, do nothing
any_neg = cp.any(is_neg)
if any_neg:
self._train_users[is_neg], self._train_items[is_neg] = generate_negatives(
self._train_users[is_neg], neg_mat, self.num_items, use_trick=self.use_neg_trick
self._train_users[is_neg], neg_mat, self.num_items, use_trick=True
)
shuffled_order = cp.random.permutation(self._train_users.shape[0])
self._train_users = self._train_users[shuffled_order]
self._train_items = self._train_items[shuffled_order]
self._train_labels = self._train_labels[shuffled_order]
is_neg = cp.logical_not(self._train_labels)
# Manually create batches
split_indices = np.arange(batch_size, self._train_users.shape[0], batch_size)
self.train_users_batches = np.split(self._train_users, split_indices)
View file
@ -55,7 +55,7 @@ def parse_args():
" Filtering model")
parser.add_argument('--data', type=str,
help='path to test and training data files')
parser.add_argument('-e', '--epochs', type=int, default=40,
parser.add_argument('-e', '--epochs', type=int, default=30,
help='number of epochs to train for')
parser.add_argument('-b', '--batch-size', type=int, default=1048576,
help='number of examples for each iteration')
@ -98,12 +98,11 @@ def parse_args():
parser.add_argument('--checkpoint-dir', default='/data/checkpoints/', type=str,
help='Path to the store the result checkpoint file for training, \
or to read from for evaluation')
parser.add_argument('--load-checkpoint-path', default=None, type=str,
help='Path to the checkpoint for initialization. If None will initialize with random weights')
parser.add_argument('--mode', choices=['train', 'test'], default='train', type=str,
help='Passing "test" will only run a single evaluation, \
otherwise full training will be performed')
parser.add_argument('--no-neg-trick', action='store_true', dest='no_neg_trick',
help='do not use negative sample generation shortcut to speed up data \
augmentation (will increase GPU memory consumption)')
parser.add_argument('--eval-after', type=int, default=8,
help='Perform evaluations only after this many epochs')
parser.add_argument('--verbose', action='store_true',
@ -233,7 +232,6 @@ def main():
test_items,
args.valid_users_per_batch,
args.valid_negative,
use_neg_trick=False if args.no_neg_trick else True
)
# Create tensorflow session and saver
@ -274,7 +272,7 @@ def main():
'sigmoid': True,
'loss_scale': args.loss_scale
},
eval_only=False if args.mode == 'train' else True
mode='TRAIN' if args.mode == 'train' else 'EVAL'
)
saver = tf.train.Saver()
@ -287,11 +285,16 @@ def main():
# Prepare evaluation data
data_generator.prepare_eval_data()
if args.load_checkpoint_path:
saver.restore(sess, args.load_checkpoint_path)
else:
# Manual initialize weights
sess.run(tf.global_variables_initializer())
# If test mode, run one eval
if args.mode == 'test':
saver.restore(sess, args.checkpoint_dir)
eval_start = time.time()
sess.run(tf.local_variables_initializer())
eval_start = time.time()
for user_batch, item_batch, dup_batch \
in zip(data_generator.eval_users, data_generator.eval_items, data_generator.dup_mask):
sess.run(
@ -316,6 +319,9 @@ def main():
if hvd.rank() == 0:
LOGGER.log("Eval Time: {:.4f}, HR: {:.4f}, NDCG: {:.4f}"
.format(eval_duration, hit_rate, ndcg))
eval_throughput = pos_test_users.shape[0] * (args.valid_negative + 1) / eval_duration
LOGGER.log('Average Eval Throughput: {:.4f}'.format(eval_throughput))
return
# Performance Metrics
@ -326,8 +332,6 @@ def main():
time_to_train = 0.0
best_hr = 0
best_epoch = 0
# Manual initialize weights
sess.run(tf.global_variables_initializer())
# Buffers for global metrics
global_hr_sum = np.ones(1)
global_hr_count = np.ones(1)
@ -419,6 +423,7 @@ def main():
if hit_rate > best_hr:
best_hr = hit_rate
best_epoch = epoch
time_to_best = time.time() - begin_train
if not args.verbose:
log_string = "New Best Epoch: {:02d}, Train Time: {:.4f}, Eval Time: {:.4f}, HR: {:.4f}, NDCG: {:.4f}"
LOGGER.log(
@ -441,6 +446,11 @@ def main():
eval_times = np.array(eval_times)
eval_throughputs = pos_test_users.shape[0]*(args.valid_negative+1) / eval_times
LOGGER.log(' ')
LOGGER.log('batch_size: {}'.format(args.batch_size))
LOGGER.log('num_gpus: {}'.format(hvd.size()))
LOGGER.log('AMP: {}'.format(1 if args.amp else 0))
LOGGER.log('seed: {}'.format(args.seed))
LOGGER.log('Minimum Train Time per Epoch: {:.4f}'.format(np.min(train_times)))
LOGGER.log('Average Train Time per Epoch: {:.4f}'.format(np.mean(train_times)))
LOGGER.log('Average Train Throughput: {:.4f}'.format(np.mean(train_throughputs)))
@ -449,6 +459,7 @@ def main():
LOGGER.log('Average Eval Throughput: {:.4f}'.format(np.mean(eval_throughputs)))
LOGGER.log('First Epoch to hit: {}'.format(first_to_target))
LOGGER.log('Time to Train: {:.4f}'.format(time_to_train))
LOGGER.log('Time to Best: {:.4f}'.format(time_to_best))
LOGGER.log('Best HR: {:.4f}'.format(best_hr))
LOGGER.log('Best Epoch: {}'.format(best_epoch))
View file
@ -148,7 +148,7 @@ def ncf_model_ops(users,
labels,
dup_mask,
params,
eval_only=False):
mode='TRAIN'):
"""
Constructs the training and evaluation graphs
"""
@ -172,7 +172,6 @@ def ncf_model_ops(users,
sigmoid = False #params['sigmoid']
loss_scale = params['loss_scale']
is_training = True
model_dtype = tf.float16 if fp16 else tf.float32
# If manually enabling mixed precision, use the custom variable getter
@ -196,6 +195,9 @@ def ncf_model_ops(users,
)
logits = tf.squeeze(logits)
if mode == 'INFERENCE':
return logits
# Evaluation Ops
found_positive, dcg = compute_eval_metrics(logits, dup_mask, val_batch_size, K)
# Metrics
@ -204,7 +206,7 @@ def ncf_model_ops(users,
eval_op = tf.group(hit_rate[1], ndcg[1])
if eval_only:
if mode == 'EVAL':
return hit_rate[0], ndcg[0], eval_op, None
# Labels
View file
@ -14,7 +14,7 @@ set -e
DATASET_NAME=${1:-'ml-20m'}
RAW_DATADIR=${2:-'/data'}
CACHED_DATADIR="${RAW_DATADIR}/cache/${DATASET_NAME}"
CACHED_DATADIR=${3:-"${RAW_DATADIR}/cache/${DATASET_NAME}"}
# you can add another option to this case in order to support other datasets
case ${DATASET_NAME} in
View file
@ -4,29 +4,31 @@ This repository provides a script and recipe to train U-Net Industrial to achiev
accuracy on the dataset DAGM2007, and is tested and maintained by NVIDIA.
# Table of Contents
## Table of Contents
* [The model](#the-model)
* [Model overview](#model-overview)
* [Default Configuration](#default-configuration)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick start guide](#quick-start-guide)
* [Details](#details)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Command line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Training process](#training-process)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training performance results](#training-performance-results)
* [Inference performance results](#inference-performance-results)
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training performance results](#training-performance-results)
* [Inference performance results](#inference-performance-results)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
# The model
## Model overview
This U-Net model is adapted from the original version of the [U-Net model](https://arxiv.org/abs/1505.04597) which is
a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by
@ -63,7 +65,7 @@ by an environmental variable.
The following performance optimizations were implemented in this model:
* [XLA](https://www.tensorflow.org/xla) support (experimental)
## Default Configuration
### Default Configuration
This model trains for 2500 epochs, under the following setup:
@ -95,12 +97,42 @@ This model trains in 2500 epochs, under the following setup:
* Weight decay: 1e-5
# Setup
### Enabling mixed precision
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing minimal information in single-precision to retain as much
information as possible in critical parts of the network. Since the introduction of
[tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training
speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically
intense model architectures. Using
[mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously
required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision
methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing
TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow
framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16,
and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with
the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to
perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and
[Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
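As an illustration (the provided `UNet_{FP32, AMP}_*.sh` scripts take care of this for you), the TF-AMP graph rewrite can be requested by setting the environment variable before TensorFlow builds the graph, following the same pattern as the other models in this release:
```python
import os

# Request the TF-AMP graph rewrite before TensorFlow constructs the graph;
# supported optimizers then apply automatic loss scaling internally
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
```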
## Setup
The following section lists the requirements in order to start training the U-Net model
(only the `TinyUNet` model is provided here).
## Requirements
### Requirements
This repository contains a Dockerfile which extends the TensorFlow NGC container and encapsulates some dependencies.
Aside from these dependencies, ensure you have the following components:
@ -116,19 +148,19 @@ following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
- [Running TensorFlow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running).
# Quick start guide
## Quick Start Guide
To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the
default configuration of the U-Net model (only `TinyUNet` has been made available here) on the DAGM2007 dataset.
##### Clone the repository
### Clone the repository
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNetIndustrial
```
##### Download and preprocess the dataset: DAGM2007
### Download and preprocess the dataset: DAGM2007
To download the dataset, you can execute the following:
@ -138,7 +170,7 @@ In order to download the dataset. You can execute the following:
**Important Information:** Some files of the dataset require an account to be downloaded; the script will invite you to download them manually and place them in the correct directory.
##### Build and start the docker container based on the TensorFlow NGC container.
### Build and start the docker container based on the TensorFlow NGC container.
```bash
# Build the docker container
@ -152,7 +184,7 @@ nvidia-docker run -it --rm \
unet_industrial:latest
```
##### Run training
### Run training
To run training for a default configuration (as described in Default configuration, for example 1/4/8 GPUs,
FP32/TF-AMP), launch one of the scripts in the `./scripts` directory called
@ -170,7 +202,7 @@ cd scripts/
./UNet_FP32_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
##### Run evaluation
### Run evaluation
Model evaluation on a checkpoint can be launched by running one of the scripts in the `./scripts` directory
called `./scripts/UNet_{FP32, AMP}_EVAL.sh`.
@ -187,11 +219,11 @@ cd scripts/
./UNet_FP32_EVAL.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
# Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
## Command line options
### Command line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
@ -228,7 +260,7 @@ model
`--use_auto_loss_scaling` Use AutoLossScaling in TF-AMP
## Getting the data
### Getting the data
The U-Net model was trained with the [Weakly Supervised Learning for Industrial Optical Inspection (DAGM 2007)](https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html) dataset.
@ -240,7 +272,7 @@ The U-Net model was trained with the [Weakly Supervised Learning for Industrial
**Source:** https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html
### Data description
#### Data description
> The provided data is artificially generated, but similar to real world problems. It consists of multiple data sets, each consisting of 1000 images showing the background texture without defects, and of 150 images with one labeled defect each on the background texture. The images in a single data set are very similar, but each data set is generated by a different texture model and defect model.
@ -257,7 +289,7 @@ The number of classes and sub-challenges for the development set is 6.
- A competition set, which requires an account and can be downloaded from [here](https://hci.iwr.uni-heidelberg.de/node/3616).
The number of classes and sub-challenges for the competition set is 10.
### Challenge description
#### Challenge description
The challenge consists of designing a single model with a set of predefined hyper-parameters which will not change
across the 10 different classes or sub-challenges of the competition set.
@ -265,15 +297,15 @@ across the 10 different classes or sub-challenges of the competition set.
The performance shall be measured on the competition set, which is normalized and more complex than the public dataset,
while offering the most unbiased evaluation method.
## Training Process
### Training Process
### Laplace Smoothing
#### Laplace Smoothing
We use this technique in the DICE loss to improve training efficiency. It consists of replacing the
epsilon parameter (a very small value, around +/- 1e-7, used to avoid dividing by zero) with 1. You can find more information at:
[https://en.wikipedia.org/wiki/Additive_smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
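For illustration, a minimal sketch of a DICE loss with this smoothing (names and tensor shapes are ours, not the repository's; the `smooth=1` term plays the role of the epsilon discussed above):

```python
import tensorflow as tf

def dice_loss(y_pred, y_true, smooth=1.0):
    # y_pred and y_true are assumed to be NHWC probability maps.
    # Laplace smoothing: the +1 terms keep the ratio stable when both
    # prediction and label are close to zero, instead of a tiny epsilon.
    intersection = tf.reduce_sum(y_pred * y_true, axis=(1, 2, 3))
    union = tf.reduce_sum(y_pred, axis=(1, 2, 3)) + tf.reduce_sum(y_true, axis=(1, 2, 3))
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(dice)
```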
### Adaptive Loss
#### Adaptive Loss
The DICE loss is not able to provide a meaningful gradient at initialisation. This leads to model instability, which
often pushes the model to diverge. Nonetheless, once the model starts to converge, the DICE loss is able to very efficiently
@ -285,46 +317,18 @@ fully train the model. Therefore, we implemented an *adaptive loss* which is com
The model is trained with the BCE loss until the DICE loss reaches an experimentally defined threshold (0.3).
Thereafter, the DICE loss is used to finish training.
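A sketch of this switching logic under the stated assumptions (illustrative names; `bce_loss` and `dice_loss` are scalar loss tensors computed elsewhere):

```python
import tensorflow as tf

def adaptive_loss(bce_loss, dice_loss, threshold=0.3):
    # Train on BCE while the DICE loss is still above the experimentally
    # defined threshold; once DICE becomes informative, use it to finish.
    return tf.cond(dice_loss < threshold,
                   true_fn=lambda: dice_loss,
                   false_fn=lambda: bce_loss)
```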
### Weak Labelling
#### Weak Labelling
This dataset is referred to as weakly labelled. That means that the segmentation labels are not given at the pixel level
but rather in an approximate fashion.
## Enabling mixed precision
## Performance
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing minimal information in single-precision to retain as much
information as possible in critical parts of the network. Since the introduction of
[tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training
speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically
intense model architectures. Using
[mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously
required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision
methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing
TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow
framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16,
and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with
the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to
perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and
[Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
# Benchmarking
### Benchmarking
The following sections show how to run benchmarks measuring the model performance in training and inference modes.
## Training performance benchmark
#### Training performance benchmark
To benchmark the training performance, you can run one of the scripts in the `./scripts/benchmarking/` directory
called `./scripts/benchmarking/DGX1v_trainbench_{FP32, AMP}_{1, 4, 8}GPU.sh`.
@ -341,7 +345,7 @@ cd scripts/benchmarking/
./DGX1v_trainbench_FP32_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
## Inference performance benchmark
#### Inference performance benchmark
To benchmark the inference performance, you can run one of the scripts in the `./scripts/benchmarking/` directory
called `./scripts/benchmarking/DGX1v_evalbench_{FP32, AMP}_{1, 4, 8}GPU.sh`.
@ -358,16 +362,16 @@ cd scripts/benchmarking/
./DGX1v_evalbench_FP16_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
# Results
### Results
The following sections provide details on the achieved results in training accuracy, training performance, and inference performance.
## Training accuracy results
#### Training accuracy results
Our results were obtained by running the `./scripts/UNet_{FP32, AMP}_{1, 4, 8}GPU.sh` training
script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
### Threshold = 0.75
##### Threshold = 0.75
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -392,7 +396,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.979 | 100.00 | 99.43 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.70 |
### Threshold = 0.85
##### Threshold = 0.85
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -417,7 +421,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.980 | 100.00 | 99.45 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.71 |
### Threshold = 0.95
##### Threshold = 0.95
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -442,7 +446,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.980 | 100.00 | 99.54 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.72 |
### Threshold = 0.99
##### Threshold = 0.99
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -467,7 +471,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.981 | 100.00 | 99.62 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.78 |
## Training performance results
#### Training performance results
<!-- Spreadsheet to Markdown: https://thisdavej.com/copy-table-in-excel-and-paste-as-a-markdown-table/ -->
@ -485,9 +489,9 @@ TensorFlow 19.03-py3 NGC container on an NVIDIA DGX-1 with 8 V100 16G GPUs.
| 8 | FP32 | 445 | 1m44 | 1.00 |
| 8 | Automatic Mixed Precision (AMP) | 491 | 1m36 | 1.10 |
To achieve these same results, follow the [Quick start guide](#quick-start-guide) outlined above.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Inference performance results
#### Inference performance results
Our results were obtained by running the aforementioned scripts in the TensorFlow
19.03-py3 NGC container on an NVIDIA DGX-1 server with 8 V100 16G GPUs.
@ -497,10 +501,13 @@ Our results were obtained by running the aforementioned scripts in the TensorFlo
| 1 | FP32 | 228 | 1.00 |
| 1 | Automatic Mixed Precision (AMP) | 301 | 1.32 |
To achieve these same results, follow the [Quick start guide](#quick-start-guide) outlined above.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
# Changelog
1. **March 18, 2019:** Initial release
## Release notes
# Known issues
### Changelog
March 18, 2019
* Initial release
### Known issues
There are no known issues with this model.
View file
@ -1,12 +0,0 @@
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
View file
@ -1,4 +1,4 @@
FROM nvcr.io/nvidia/tensorflow:19.05-py3
FROM nvcr.io/nvidia/tensorflow:19.06-py3
ADD . /workspace/unet
WORKDIR /workspace/unet
View file
@ -1,57 +1,52 @@
# UNet
# UNet Medical Image Segmentation for TensorFlow
This repository provides a script and recipe to train U-Net Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table of contents
1. [The model](#1-the-model)
1. [Default configuration](#11-default-configuration)
2. [Model architecture](#12-model-architecture)
3. [Feature support matrix](#13-feature-support-matrix)
1. [Features](##131)
2. [Setup](#2-setup)
1. [Requirements](#21-requirements)
3. [Quick start guide](#3-quick-start-guide)
1. [Clone the repository](#31-clone-the-repository)
2. [Download and preprocess the dataset](#32-download-and-preprocess-the-dataset)
3. [Build the U-Net TensorFlow container](#33-build-and-start-the-docker-container-based-on-the-tensorflow-ngc-container)
4. [Start an interactive session in the NGC container to run training/inference](#34-start-an-interactive-session-in-the-ngc-container-to-run-traininginference)
5. [Start training](#35-start-training)
6. [Start inference/predictions](#36-start-inferencepredictions)
4. [Details](#4-details)
1. [Scripts and sample code](#41-scripts-and-sample-code)
2. [Parameters](#42-parameters)
3. [Command line options](#43-command-line-options)
4. [Getting the data](#44-getting-the-data)
1. [Dataset guidelines](#441-dataset-guidelines)
5. [Training process](#45-training-process)
1. [Optimizer](#451-optimizer)
2. [Augmentation](#452-augmentation)
6. [Inference process](#46-inference-process)
5. [Mixed precision training](#5-mixed-precision-training)
1. [Enabling mixed precision](#51-enabling-mixed-precision)
6. [Benchmarking](#6-benchmarking)
1. [Training performance benchmark](#61-training-performance-benchmark)
2. [Inference performance benchmark](#62-inference-performance-benchmark)
7. [Results](#7-results)
1. [Training accuracy results](#71-training-accuracy-results)
1. [NVIDIA DGX-1 (8x V100 16G)](#711-nvidia-dgx-1-8x-v100-16g)
2. [Training performance results](#72-training-performance-results)
1. [NVIDIA DGX-1 (1x V100 16G)](#721-nvidia-dgx-1-1x-v100-16g)
2. [NVIDIA DGX-1 (8x V100 16G)](#721-nvidia-dgx-1-8x-v100-16g)
3. [Inference performance results](#73-inference-performance-results)
1. [NVIDIA DGX-1 (1x V100 16G)](#731)
7. [Glossary](#7-glossary)
8. [Changelog](#8-changelog)
9. [Known issues](#9-known-issues)
* [Model overview](#model-overview)
* [Default configuration](#default-configuration)
* [Model architecture](#model-architecture)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Training process](#training-process)
* [Optimizer](#optimizer)
* [Augmentation](#augmentation)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## 1. The model
## Model overview
The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
This model is trained with mixed precision using tensor cores on NVIDIA Volta GPUs. Therefore, researchers can get results much faster than training without Tensor Cores, while experiencing the benefits of mixed precision training (for example, up to 3.5x performance boost). This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### 1.1. Model architecture
### Model architecture
U-Net was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation. U-Net allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
@ -59,7 +54,7 @@ The following figure shows the construction of the UNet model and its different
![UNet](images/unet.png)
### 1.2. Default configuration
### Default configuration
U-Net consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and a concatenation with the correspondingly cropped feature map from the contractive path.
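As a rough sketch of one contracting-path step implied by this description (illustrative only; the repository's actual building blocks live in `model/layers.py`):

```python
import tensorflow as tf

def downsample_block(inputs, filters):
    # Two unpadded (VALID) 3x3 convolutions, as in the original paper.
    x = tf.layers.conv2d(inputs, filters, kernel_size=3,
                         padding='valid', activation=tf.nn.relu)
    skip = tf.layers.conv2d(x, filters, kernel_size=3,
                            padding='valid', activation=tf.nn.relu)
    # The pre-pooling activations are kept as the skip connection that the
    # expanding path later crops and concatenates.
    out = tf.layers.max_pooling2d(skip, pool_size=2, strides=2)
    return out, skip
```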
@ -72,7 +67,7 @@ The following features were implemented in this model:
The following performance optimizations were implemented in this model:
* XLA support (experimental). For TensorFlow, mixed precision support is available through NVIDIA's TF-AMP, which requires minimal network code changes to leverage tensor cores performance.
### 1.3. Feature support matrix
### Feature support matrix
The following features are supported by this model.
@ -80,19 +75,47 @@ The following features are supported by this model.
|:---:|:--------:|
| Horovod Multi-GPU (NCCL) | Yes |
### 1.3.1. Features
#### Features
**Horovod** - Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
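A minimal sketch of how Horovod typically hooks into a TF1 training setup (our illustration, using the hyper-parameter defaults listed in the Parameters section; the repository's own wiring, including `hvd.BroadcastGlobalVariablesHook`, appears in the training code later in this commit):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # one process per GPU

# Wrap the optimizer so gradients are averaged across workers; scaling the
# learning rate by hvd.size() is the usual multi-GPU convention.
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(),
                                       momentum=0.99)
optimizer = hvd.DistributedOptimizer(optimizer)

# Rank 0 broadcasts its initial variables so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```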
## 2. Setup
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
In order to enable mixed precision training, the following environment variable must be defined with the correct value before the training starts:
```
TF_ENABLE_AUTO_MIXED_PRECISION=1
```
Exporting this variable ensures that loss scaling is performed correctly and automatically.
By supplying the `--use_amp` flag to the `main.py` script while training in FP32, this variable is set to its correct value for mixed precision training inside the `./utils/runner.py` script:
```
if params['use_amp']:
LOGGER.log("TF AMP is activated")
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
## Setup
The following section lists the requirements in order to start training the U-Net model.
### 2.1. Requirements
### Requirements
This repository contains a `Dockerfile` which extends the TensorFlow NGC container and encapsulates some additional dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [tensorflow:19.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
* [tensorflow:19.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
* [NVIDIA Volta based GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning DGX Documentation:
@ -101,17 +124,17 @@ For more information about how to get started with NGC containers, see the follo
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running Tensorflow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
## 3. Quick start guide
## Quick Start Guide
To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the default parameters of the U-Net model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home).
### 3.1. Clone the repository
### Clone the repository
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Medical
```
### 3.2. Download and preprocess the dataset
### Download and preprocess the dataset
The U-Net script main.py operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [U-Net paper](https://arxiv.org/abs/1505.04597). Upon registration, the challenge's data is made available through the following links:
@ -135,14 +158,14 @@ Once downloaded the data using the `download_dataset.py` script, it can be used
**Note:** Masks are only provided for training data.
### 3.3. Build the U-Net TensorFlow container
### Build the U-Net TensorFlow container
After Docker is correctly set up, the U-Net TensorFlow container can be built with:
```
user@~/Documents/unet_medical_tf # docker build -t unet_tf .
```
### 3.4. Start an interactive session in the NGC container to run training/inference.
### Start an interactive session in the NGC container to run training/inference.
Run the previously built Docker container:
```
@ -150,7 +173,7 @@ user@~/path/to/unet_medical_tf # docker run --runtime=nvidia --rm -it --shm-size
```
**Note:** Ensure you mount your dataset using the `-v` flag to make it available for training inside the NVIDIA Docker container.
### 3.5. Start training
### Start training
To run training for a default configuration (for example 1/8 GPUs FP32/TF-AMP), run one of the scripts in the `./examples` directory, as follows:
```
@ -161,17 +184,21 @@ For example:
root@8e522945990f:/workspace/unet# bash examples/unet_FP32_1GPU.sh . /data results
```
### 3.6. Start inference/predictions
### Start inference/predictions
To run inference on a checkpointed model, run:
```
python main.py --data_dir /data --model_dir <path to checkpoint> --exec_mode predict
bash examples/unet_INFER_{FP32, TF-AMP}.sh <path to main.py> <path to dataset> <path to results directory>
```
For example:
```
root@8e522945990f:/workspace/unet# bash examples/unet_INFER_FP32.sh . /data results
```
## 4. Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### 4.1. Scripts and sample code
### Scripts and sample code
In the root directory, the most important files are:
* `main.py`: Serves as the entry point to the application.
@ -179,13 +206,13 @@ In the root directory, the most important files are:
* `requirements.txt`: Set of extra requirements for running UNet
* `download_data.py`: Automatically downloads the dataset for training
The utils/ folder encapsulates the necessary tools to train and perform inference using UNet. Its main components are:
The `utils/` folder encapsulates the necessary tools to train and perform inference using UNet. Its main components are:
* `runner.py`: Implements the logic for training and inference
* `data_loader.py`: Implements the data loading and augmentation
* `hooks/profiler.py`: Collects different metrics to be used for benchmarking and testing
* `var_storage.py`: Helper functions for TF-AMP
The model/ folder contains information about the building blocks of UNet and the way they are assembled. Its contents are:
The `model/` folder contains information about the building blocks of UNet and the way they are assembled. Its contents are:
* `layers.py`: Defines the different blocks that are used to assemble UNet
* `unet.py`: Defines the model architecture using the blocks from the `layers.py` script
@ -194,7 +221,7 @@ Other folders included in the root directory are:
* `examples/`: Provides examples for training and benchmarking UNet
* `images/`: Contains a model diagram
### 4.2. Parameters
### Parameters
The complete list of the available parameters for the main.py script contains:
* `--exec_mode`: Select the execution mode to run the model (default: train_and_predict)
* `--model_dir`: Set the output directory for information related to the model (default: result/)
@ -213,7 +240,7 @@ The complete list of the available parameters for the main.py script contains:
* `--benchmark`: Enable performance benchmarking (default: False)
* `--use_amp`: Enable automatic mixed precision (default: False)
### 4.3. Command line options
### Command line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
```
@ -239,7 +266,7 @@ usage: main.py [-h]
[--use_amp]
```
### 4.4. Getting the data
### Getting the data
The U-Net model was trained on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission.
@ -249,7 +276,7 @@ Training and test data is comprised of three 512x512x30 `TIF` volumes (`test-vol
The objective is to produce a set of masks that segment the data as accurately as possible. The results are expected to be submitted as a 32-bit `TIF` 3D image, with values between `0` (100% membrane certainty) and `1` (100% non-membrane certainty).
#### 4.4.1 Dataset guidelines
#### Dataset guidelines
The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data_loader.py` script.
@ -268,9 +295,9 @@ If augmentation is enabled, the following set of augmentation techniques are app
At the end, intensities are clipped to the `[-1, 1]` interval.
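As a small illustrative fragment of that final step (names are ours; the actual pipeline lives in `utils/data_loader.py`):

```python
import tensorflow as tf

def clip_intensities(image):
    # Final normalization step of the input pipeline: keep pixel
    # intensities inside the [-1, 1] interval after augmentation.
    return tf.clip_by_value(image, -1.0, 1.0)
```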
### 4.5. Training process
### Training process
#### 4.5.1. Optimizer
#### Optimizer
The model trains for 40,000 batches, with the default U-Net setup as specified in the [original paper](https://arxiv.org/abs/1505.04597):
@ -298,7 +325,7 @@ Use `-h` or `--help` to obtain a list of available options in the `main.py` scri
Use the `--model_dir` flag to select the location where the artifacts of the training are stored.
### 4.6. Inference process
### Inference process
To run inference on a checkpointed model, run the script below; note that it requires a pre-trained model checkpoint and the test dataset.
```
python main.py --data_dir /data --model_dir <path to checkpoint> --exec_mode predict
```
This script should produce the prediction results over a set of masks which will be located in `<path to checkpoint>/eval`.
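To sanity-check the produced masks, a small illustrative snippet (the output path below is hypothetical and depends on your `--model_dir`; the multipage `TIF` handling mirrors the data loader shown later in this commit):

```python
import numpy as np
from PIL import Image, ImageSequence

# Hypothetical location of the predicted masks; substitute your checkpoint dir.
masks_path = 'results/eval/test-masks.tif'

# Load every page of the multipage TIF into a single numpy array.
masks = np.array([np.array(page)
                  for page in ImageSequence.Iterator(Image.open(masks_path))])
print(masks.shape, masks.dtype)  # e.g. (30, 512, 512) for the ISBI volume
```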
## 5. Mixed precision training
## Performance
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
## 5.1. Enabling mixed precision
In order to enable mixed precision training, the following environment variables must be defined with the correct value before the training starts:
```
TF_ENABLE_AUTO_MIXED_PRECISION=1
```
Exporting these variables ensures that loss scaling is performed correctly and automatically.
By supplying the `--use_amp` flag to the `main.py` script while training in FP32, the following variables are set to their correct value for mixed precision training inside the `./utils/runner.py` script:
```
if params['use_amp']:
assert params['dtype'] == tf.float32, "TF-AMP requires FP32 precision"
LOGGER.log("TF AMP is activated - Experimental Feature")
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
## 6. Benchmarking
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
### 6.1. Training performance benchmark
#### Training performance benchmark
To benchmark training, run one of the scripts in `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/main.py> <path/to/dataset> <path/to/checkpoints> <batch size>`.
Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during training in the next 100 iterations. To control warmup and benchmark length, use `--warmup_steps`, and `--max_steps` flags.
### 6.2. Inference performance benchmark
#### Inference performance benchmark
To benchmark inference, run one of the scripts in `./examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh <path/to/main.py> <path/to/dataset> <path/to/checkpoints> <batch size>`.
Each of these scripts will by default run 200 warmup iterations and benchmark the performance during inference in the next 100 iterations. To control warmup and benchmark length, use `--warmup_steps`, and `--max_steps` flags.
## 7. Results
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
### 7.1. Training accuracy results
#### Training accuracy results
#### 7.1.1 NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./examples/unet_{FP32, TF-AMP}_{1, 8}GPU.sh` scripts in the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
Our results were obtained by running the `./examples/unet_{FP32, TF-AMP}_{1, 8}GPU.sh` scripts in the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
Metrics employed by the organization are explained in detail [here](http://brainiac2.mit.edu/isbi_challenge/evaluation).
@ -371,12 +370,12 @@ The results described below were obtained after the submission of our evaluation
|1 | 0.938508265 | 0.970255682 | 0.939619101 | 0.970120138 | 7.1 | 11.28 |
|8 | 0.932395087 | 0.9786346 | 0.941360867 | 0.976235311 | 0.9 | 1.41 |
### 7.2. Training performance results
#### Training performance results
#### 7.2.1 NVIDIA DGX-1 (1x V100 16G)
##### NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_1GPU.sh` scripts in
the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
| **Batch size** | **FP32 max img/s** | **TF-AMP max img/s** | **Speedup factor** |
@ -387,10 +386,10 @@ the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU whil
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
#### 7.2.2 NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_8GPU.sh` scripts in
the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPU while data augmentation is enabled.
the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPU while data augmentation is enabled.
| **Batch size per GPU** | **FP32 max img/s** | **TF-AMP max img/s** | **Speedup factor** |
|:---:|:--------:|:-------:|:-------:|
@ -400,10 +399,12 @@ the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPU whil
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
### 7.3. Inference performance results
#### Inference performance results
#### NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `./examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh` scripts in
the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
| **Batch size** | **FP32 img/s** | **TF-AMP img/s** | **Speedup factor** |
|:---:|:--------:|:-------:|:-------:|
@ -413,11 +414,22 @@ the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU whil
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## 8. Changelog
## Release notes
### Changelog
July 2019
* Added inference example scripts
* Added inference benchmark measuring latency
* Added TRT/TF-TRT support
* Updated Pre-trained model on NGC registry
June 2019
* Updated README template
May 2019
* Initial release
## 9. Known issues
### Known issues
There are no known issues in this release.
View file
@ -15,4 +15,4 @@
# This script launches U-Net inference benchmarking in FP32 on 1 GPU with the given batch size
# Usage ./unet_INFER_BENCHMARK_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode benchmark --augment --warmup_steps 200 --log_every 100 --max_steps 300
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
View file
@ -15,4 +15,4 @@
# This script launches U-Net inference benchmarking in TF-AMP on 1 GPU with the given batch size
# Usage ./unet_INFER_BENCHMARK_TF-AMP.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --use_amp --exec_mode benchmark --augment --warmup_steps 200 --log_every 100 --max_steps 300
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --use_amp --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference benchmarking in FP32 on 1 GPU with the given batch size
# Usage ./unet_INFER_BENCHMARK_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference in FP32 on 1 GPU
# Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference in TF-AMP on 1 GPU
# Usage ./unet_INFER_TF-AMP.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_amp
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference with TF-TRT on 1 GPU
# Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_trt
View file
@ -23,121 +23,23 @@ Example:
"""
import argparse
import os
import pickle
import time
import horovod.tensorflow as hvd
import math
import numpy as np
import tensorflow as tf
from PIL import Image
from dllogger import tags
from dllogger.logger import LOGGER
from utils.runner import Runner
PARSER = argparse.ArgumentParser(description="UNet-medical")
PARSER.add_argument('--exec_mode',
choices=['train', 'train_and_predict', 'predict', 'benchmark'],
type=str,
default='train_and_predict',
help="""Which execution mode to run the model into"""
)
PARSER.add_argument('--model_dir',
type=str,
default='./results',
help="""Output directory for information related to the model"""
)
PARSER.add_argument('--data_dir',
type=str,
required=True,
help="""Input directory containing the dataset for training the model"""
)
PARSER.add_argument('--batch_size',
type=int,
default=1,
help="""Size of each minibatch per GPU""")
PARSER.add_argument('--max_steps',
type=int,
default=1000,
help="""Maximum number of steps (batches) used for training""")
PARSER.add_argument('--seed',
type=int,
default=0,
help="""Random seed""")
PARSER.add_argument('--weight_decay',
type=float,
default=0.0005,
help="""Weight decay coefficient""")
PARSER.add_argument('--log_every',
type=int,
default=100,
help="""Log performance every n steps""")
PARSER.add_argument('--warmup_steps',
type=int,
default=200,
help="""Number of warmup steps""")
PARSER.add_argument('--learning_rate',
type=float,
default=0.01,
help="""Learning rate coefficient for SGD""")
PARSER.add_argument('--momentum',
type=float,
default=0.99,
help="""Momentum coefficient for SGD""")
PARSER.add_argument('--decay_steps',
type=float,
default=5000,
help="""Decay steps for inverse learning rate decay""")
PARSER.add_argument('--decay_rate',
type=float,
default=0.95,
help="""Decay rate for learning rate decay""")
PARSER.add_argument('--augment', dest='augment', action='store_true',
help="""Perform data augmentation during training""")
PARSER.add_argument('--no-augment', dest='augment', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--benchmark', dest='benchmark', action='store_true',
help="""Collect performance metrics during training""")
PARSER.add_argument('--no-benchmark', dest='benchmark', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
help="""Train using TF-AMP""")
PARSER.set_defaults(use_amp=False)
def _cmd_params(flags):
return {
'model_dir': flags.model_dir,
'batch_size': flags.batch_size,
'data_dir': flags.data_dir,
'max_steps': flags.max_steps,
'weight_decay': flags.weight_decay,
'dtype': tf.float32,
'learning_rate': flags.learning_rate,
'momentum': flags.momentum,
'benchmark': flags.benchmark,
'augment': flags.augment,
'exec_mode': flags.exec_mode,
'seed': flags.seed,
'use_amp': flags.use_amp,
'log_every': flags.log_every,
'warmup_steps': flags.warmup_steps,
'decay_steps': flags.decay_steps,
'decay_rate': flags.decay_rate,
}
from utils.cmd_util import PARSER, _cmd_params
from utils.data_loader import Dataset
from utils.hooks.profiling_hook import ProfilingHook
from utils.hooks.training_hook import TrainingHook
from utils.model_fn import unet_fn
def main(_):
@ -160,32 +62,103 @@ def main(_):
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = 'data'
os.environ['TF_ADJUST_HUE_FUSED'] = '1'
os.environ['TF_ADJUST_SATURATION_FUSED'] = '1'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
os.environ['TF_ADJUST_HUE_FUSED'] = 'data'
os.environ['TF_ADJUST_SATURATION_FUSED'] = 'data'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = 'data'
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
os.environ['TF_DISABLE_NVTX_RANGES'] = '1'
if params['use_amp']:
assert params['dtype'] == tf.float32, "TF-AMP requires FP32 precision"
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION']='1'
LOGGER.log("TF AMP is activated - Experimental Feature")
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
hvd.init()
runner = Runner(params)
# Build run config
gpu_options = tf.GPUOptions()
config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
config.gpu_options.force_gpu_compatible = True
config.intra_op_parallelism_threads = 1
config.inter_op_parallelism_threads = max(2, 40 // hvd.size() - 2)
if 'train' in params['exec_mode'] \
or 'train_and predict' in params['exec_mode']:
runner.train()
if 'train_and predict' in params['exec_mode'] \
or 'predict' in params['exec_mode']:
runner.predict()
if 'benchmark' in params['exec_mode']:
runner.benchmark()
run_config = tf.estimator.RunConfig(
save_summary_steps=1,
tf_random_seed=None,
session_config=config,
save_checkpoints_steps=params['max_steps'],
keep_checkpoint_max=1)
# Build the estimator model
estimator = tf.estimator.Estimator(
model_fn=unet_fn,
model_dir=params['model_dir'],
config=run_config,
params=params)
dataset = Dataset(data_dir=params['data_dir'],
batch_size=params['batch_size'],
augment=params['augment'],
gpu_id=hvd.rank(),
num_gpus=hvd.size(),
seed=params['seed'])
if 'train' in params['exec_mode']:
hooks = [hvd.BroadcastGlobalVariablesHook(0),
TrainingHook(params['log_every'])]
if params['benchmark']:
hooks.append(ProfilingHook(params['batch_size'],
params['log_every'],
params['warmup_steps']))
LOGGER.log('Begin Training...')
LOGGER.log(tags.RUN_START)
estimator.train(
input_fn=dataset.train_fn,
steps=params['max_steps'],
hooks=hooks)
LOGGER.log(tags.RUN_STOP)
if 'predict' in params['exec_mode']:
if hvd.rank() == 0:
predict_steps = dataset.test_size
hooks = None
if params['benchmark']:
hooks = [ProfilingHook(params['batch_size'],
params['log_every'],
params['warmup_steps'])]
predict_steps = params['warmup_steps'] * 2 * params['batch_size']
LOGGER.log('Begin Predict...')
LOGGER.log(tags.RUN_START)
predictions = estimator.predict(
input_fn=lambda: dataset.test_fn(count=math.ceil(predict_steps/dataset.test_size)),
hooks=hooks)
binary_masks = [np.argmax(p['logits'], axis=-1).astype(np.uint8) * 255 for p in predictions]
LOGGER.log(tags.RUN_STOP)
multipage_tif = [Image.fromarray(mask).resize(size=(512, 512), resample=Image.BILINEAR)
for mask in binary_masks]
output_dir = os.path.join(params['model_dir'], 'pred')
if not os.path.exists(output_dir):
os.makedirs(output_dir)
multipage_tif[0].save(os.path.join(output_dir, 'test-masks.tif'),
compression="tiff_deflate",
save_all=True,
append_images=multipage_tif[1:])
LOGGER.log("Predict finished")
LOGGER.log("Results available in: {}".format(output_dir))
if __name__ == '__main__':
View file
@ -0,0 +1,114 @@
import argparse
import tensorflow as tf
PARSER = argparse.ArgumentParser(description="UNet-medical")
PARSER.add_argument('--exec_mode',
choices=['train', 'train_and_predict', 'predict'],
type=str,
default='train_and_predict',
help="""Which execution mode to run the model into"""
)
PARSER.add_argument('--model_dir',
type=str,
default='./results',
help="""Output directory for information related to the model"""
)
PARSER.add_argument('--data_dir',
type=str,
required=True,
help="""Input directory containing the dataset for training the model"""
)
PARSER.add_argument('--batch_size',
type=int,
default=1,
help="""Size of each minibatch per GPU""")
PARSER.add_argument('--max_steps',
type=int,
default=1000,
help="""Maximum number of steps (batches) used for training""")
PARSER.add_argument('--seed',
type=int,
default=0,
help="""Random seed""")
PARSER.add_argument('--weight_decay',
type=float,
default=0.0005,
help="""Weight decay coefficient""")
PARSER.add_argument('--log_every',
type=int,
default=100,
help="""Log performance every n steps""")
PARSER.add_argument('--warmup_steps',
type=int,
default=200,
help="""Number of warmup steps""")
PARSER.add_argument('--learning_rate',
type=float,
default=0.01,
help="""Learning rate coefficient for SGD""")
PARSER.add_argument('--momentum',
type=float,
default=0.99,
help="""Momentum coefficient for SGD""")
PARSER.add_argument('--decay_steps',
type=float,
default=5000,
help="""Decay steps for inverse learning rate decay""")
PARSER.add_argument('--decay_rate',
type=float,
default=0.95,
help="""Decay rate for learning rate decay""")
PARSER.add_argument('--augment', dest='augment', action='store_true',
help="""Perform data augmentation during training""")
PARSER.add_argument('--no-augment', dest='augment', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--benchmark', dest='benchmark', action='store_true',
help="""Collect performance metrics during training""")
PARSER.add_argument('--no-benchmark', dest='benchmark', action='store_false')
PARSER.set_defaults(benchmark=False)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
help="""Train using TF-AMP""")
PARSER.set_defaults(use_amp=False)
PARSER.add_argument('--use_trt', dest='use_trt', action='store_true',
help="""Use TF-TRT""")
PARSER.set_defaults(use_trt=False)
def _cmd_params(flags):
return {
'model_dir': flags.model_dir,
'batch_size': flags.batch_size,
'data_dir': flags.data_dir,
'max_steps': flags.max_steps,
'weight_decay': flags.weight_decay,
'dtype': tf.float32,
'learning_rate': flags.learning_rate,
'momentum': flags.momentum,
'benchmark': flags.benchmark,
'augment': flags.augment,
'exec_mode': flags.exec_mode,
'seed': flags.seed,
'use_amp': flags.use_amp,
'use_trt': flags.use_trt,
'log_every': flags.log_every,
'warmup_steps': flags.warmup_steps,
'decay_steps': flags.decay_steps,
'decay_rate': flags.decay_rate,
}
View file
@ -42,6 +42,14 @@ class Dataset():
self._num_gpus = num_gpus
self._gpu_id = gpu_id
@property
def train_size(self):
return len(self._train_images)
@property
def test_size(self):
return len(self._test_images)
def _load_multipage_tiff(self, path):
"""Load tiff images containing many images in the channel dimension"""
return np.array([np.array(p) for p in ImageSequence.Iterator(Image.open(path))])
@ -154,10 +162,11 @@ class Dataset():
return dataset
def test_fn(self):
def test_fn(self, count):
"""Input function for testing"""
dataset = tf.data.Dataset.from_tensor_slices(
(self._test_images))
self._test_images)
dataset = dataset.repeat(count=count)
dataset = dataset.map(self._normalize_inputs)
dataset = dataset.batch(self._batch_size)
dataset = dataset.prefetch(self._batch_size)
@ -179,4 +188,4 @@ class Dataset():
dataset = dataset.prefetch(buffer_size=tf.contrib.data.AUTOTUNE)
return dataset
return dataset
View file
@ -0,0 +1,54 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
import tensorflow as tf
import horovod.tensorflow as hvd
from dllogger import LOGGER, tags, AverageMeter
class ProfilingHook(tf.train.SessionRunHook):
def __init__(self, batch_size, log_every, warmup_steps):
self._log_every = log_every
self._warmup_steps = warmup_steps
self._current_step = 0
self._global_batch_size = batch_size * hvd.size()
self._meter = AverageMeter()
self._t0 = 0
def before_run(self, run_context):
if self._current_step % self._log_every == 0:
LOGGER.log('iter_start', self._current_step)
if self._current_step > self._warmup_steps:
self._t0 = time.time()
def after_run(self,
run_context,
run_values):
if self._current_step > self._warmup_steps:
batch_time = time.time() - self._t0
ips = self._global_batch_size / batch_time
self._meter.record(ips)
self._current_step += 1
def begin(self):
pass
def end(self, session):
LOGGER.log('average_images_per_second', self._meter.get_value())
View file
@ -0,0 +1,46 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tensorflow as tf
from dllogger import LOGGER, tags
class TrainingHook(tf.train.SessionRunHook):
def __init__(self, log_every=1):
self._log_every = log_every
self._iter_idx = 0
def before_run(self, run_context):
run_args = tf.train.SessionRunArgs(
fetches=[
'cross_loss_ref:0',
'dice_loss_ref:0',
'total_loss_ref:0',
]
)
return run_args
def after_run(self,
run_context,
run_values):
cross_loss, dice_loss, total_loss = run_values.results
if self._iter_idx % self._log_every == 0:
LOGGER.log('cross_loss', cross_loss)
LOGGER.log('dice_loss', dice_loss)
LOGGER.log('total_loss', total_loss)
self._iter_idx += 1
Some files were not shown because too many files have changed in this diff