[Resnet/MX] Update

2021-04-07 17:46:50 +02:00 · 2021-04-07 17:46:50 +02:00 · 8c42220b49
parent 5e7dd3b91b
commit 8c42220b49
12 changed files with 554 additions and 416 deletions
--- a/MxNet/Classification/RN50v1.5/Dockerfile
+++ b/MxNet/Classification/RN50v1.5/Dockerfile
@ -1,3 +1,10 @@
-FROM nvcr.io/nvidia/mxnet:19.07-py3
-COPY . /workspace/rn50
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/mxnet:20.12-py3
+
+FROM $FROM_IMAGE_NAME
+
 WORKDIR /workspace/rn50
+
+COPY requirements.txt .
+RUN pip install -r requirements.txt
+
+COPY . .
--- a/MxNet/Classification/RN50v1.5/README.md
+++ b/MxNet/Classification/RN50v1.5/README.md
@ -1,14 +1,17 @@
-# ResNet50 v1.5 for MXNet
+# ResNet-50 v1.5 for MXNet

-This repository provides a script and recipe to train the ResNet50 v1.5 model to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
+This repository provides a script and recipe to train the ResNet-50 v1.5 model to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.

 ## Table Of Contents
+
 - [Model overview](#model-overview)
+    * [Model architecture](#model-architecture)
    * [Default configuration](#default-configuration)
    * [Feature support matrix](#feature-support-matrix)
        * [Features](#features)
    * [Mixed precision training](#mixed-precision-training)
        * [Enabling mixed precision](#enabling-mixed-precision)
+        * [Enabling TF32](#enabling-tf32)
 - [Setup](#setup)
    * [Requirements](#requirements)
 - [Quick Start Guide](#quick-start-guide)
@ -18,7 +21,7 @@ This repository provides a script and recipe to train the ResNet50 v1.5 model to
    * [Command-line options](#command-line-options)
    * [Getting the data](#getting-the-data)
        * [Dataset guidelines](#dataset-guidelines)
-    * [Multi-dataset](#multi-dataset)
+        * [Multi-dataset](#multi-dataset)
    * [Training process](#training-process)
    * [Inference process](#inference-process)
 - [Performance](#performance)
@ -27,13 +30,16 @@ This repository provides a script and recipe to train the ResNet50 v1.5 model to
        * [Inference performance benchmark](#inference-performance-benchmark)
    * [Results](#results)
        * [Training accuracy results](#training-accuracy-results)
-            * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-(8x-v100-16G))
+            * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)  
+            * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
            * [Training stability test](#training-stability-test)
        * [Training performance results](#training-performance-results)
-            * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-(8x-v100-16G))
-            * [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-(16x-v100-32G))
+            * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb) 
+            * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
+            * [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
        * [Inference performance results](#inference-performance-results)
-            * [Inference performance: NVIDIA DGX-1 (8x V100 16G)](#inference-performance-nvidia-dgx-1-(8x-v100-16G))
+            * [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-1x-a100-80gb)
+            * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
            * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
 - [Release notes](#release-notes)
    * [Changelog](#changelog)
@ -41,35 +47,40 @@ This repository provides a script and recipe to train the ResNet50 v1.5 model to


 ## Model overview
-The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).

-The difference between v1 and v1.5 is in the bottleneck blocks which require
-downsampling. ResNet v1 has stride = 2 in the first 1x1 convolution, whereas
-v1.5 has stride = 2 in the 3x3 convolution
+The ResNet-50 v1.5 model is a modified version of the [original ResNet-50 v1 model](https://arxiv.org/abs/1512.03385).

-This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
+The difference between v1 and v1.5 is in the bottleneck blocks which require downsampling. ResNet v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.

-This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 3.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+This difference makes ResNet-50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
+
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 3.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+### Model architecture
+
+The model architecture was present in [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) paper. The main advantage of the model is the usage of residual layers as a building block that helps with gradient propagation during training.
+
+![ResidualLayer](./img/residual_diagram.png)
+
+_Image source: [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)_

 ### Default configuration

-**Optimizer:**
+**Optimizer**

 * SGD with momentum (0.875)
-* Learning rate = 0.256 for 256 batch size, for other batch sizes we lineary scale the learning rate.
-* Learning rate schedule -- we use cosine LR schedule
-* Linear warmup of the learning rate during first 5 epochs according to [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
-* Weight decay: 3.0517578125e-05 (1/32768).
-* We do not apply WD on Batch Norm trainable parameters (gamma/bias)
+* Learning rate = 0.256 for 256 batch size, for other batch sizes we linearly scale the learning rate
+* Learning rate schedule - we use cosine LR schedule
+* Linear warmup of the learning rate during the first 5 epochs according to [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
+* Weight decay: 3.0517578125e-05 (1/32768)
+* We do not apply WD on batch norm trainable parameters (gamma/bias)
 * Label Smoothing: 0.1
 * We train for:
-    * 50 Epochs -> configuration that reaches 75.9% top1 accuracy
-    * 90 Epochs -> 90 epochs is a standard for ResNet50
-    * 250 Epochs -> best possible accuracy. For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
+    * 50 Epochs - configuration that reaches 75.9% top1 accuracy
+    * 90 Epochs - 90 epochs is a standard for ResNet-50
+    * 250 Epochs - best possible accuracy. For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).

-**Data augmentation:**
-
-This model uses the following data augmentation:
+**Data augmentation**

 For training:
 * Normalization
@ -85,122 +96,153 @@ For inference:

 ### Feature support matrix

-| **Feature** | **ResNet50 MXNet** |
+| **Feature** | **ResNet-50 MXNet** |
 |:---:|:--------:|
 |[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)|yes|
 |Horovod Multi-GPU|yes|


 #### Features
+
 The following features are supported by this model.

-NVIDIA DALI - NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications.
-
-Horovod Multi-GPU - Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
-
+**NVIDIA DALI**
+NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications.

+**Horovod Multi-GPU**
+Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).

 ### Mixed precision training

-Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
 1.  Porting the model to use the FP16 data type where appropriate.
 2.  Adding loss scaling to preserve small gradient values.

 The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.

 For information about:
-   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
+-   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
 -   Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.



 #### Enabling mixed precision
+
 Using the Gluon API, ensure you perform the following steps to convert a model that supports computation with float16.

 1. Cast Gluon Block‘s parameters and expected input type to float16 by calling the cast method of the Block representing the network.
+
    ```python
    net = net.cast('float16')
    ```

 2. Ensure the data input to the network is of float16 type. If your DataLoader or Iterator produces output in another datatype, then you have to cast your data. There are different ways you can do this. The easiest way is to use the `astype` method of NDArrays.
+
    ```python
    data = data.astype('float16', copy=False)
    ```

-3. If you are using images and DataLoader, you can also use a Cast transform.  It is preferable to use multi_precision mode of optimizer when training in float16. This mode of optimizer maintains a master copy of the weights in float32 even when the training (forward and backward pass) is in float16. This helps increase precision of the weight updates and can lead to faster convergence in some scenarios.
+3. If you are using images and DataLoader, you can also use a Cast transform.  It is preferable to use `multi_precision` mode of optimizer when training in float16. This mode of optimizer maintains a master copy of the weights in float32 even when the training (forward and backward pass) is in float16. This helps increase precision of the weight updates and can lead to faster convergence in some scenarios.
+
    ```python
    optimizer = mx.optimizer.create('sgd', multi_precision=True, lr=0.01)
    ```
+   
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. 
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
+
+

 ## Setup
-The following section lists the requirements in order to start training the ResNet50 v1.5 model.
+
+The following section lists the requirements that you need to meet in order to start training the ResNet-50 v1.5 model.

 ### Requirements

 This repository contains Dockerfile which extends the MXNet NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 -   [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-   [MXNet 19.07-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia%2Fmxnet)
-   [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+-   [MXNet 20.12-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia%2Fmxnet)
+Supported GPUs:
+- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
+- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+

 For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
 -   [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
 -   [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
-   [Running MXNet](https://docs.nvidia.com/deeplearning/dgx/mxnet-release-notes/running.html#running)
+-   [Running MXNet](https://docs.nvidia.com/deeplearning/frameworks/mxnet-release-notes/running.html#running)

 For those unable to use the MXNet NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).


 ## Quick Start Guide

-**1. Clone the repository.**
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the ResNet-50 model on the ImageNet 1k dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
+
+
+1. Clone the repository.
+
 ```bash
 git clone https://github.com/NVIDIA/DeepLearningExamples
 cd DeepLearningExamples/MxNet/Classification/RN50v1.5
 ```

-**2. Build the ResNet50 MXNet NGC container.**
-After Docker is setup, you can build the ResNet50 image with:
+2. Build the ResNet-50 MXNet NGC container.
+
+After Docker is set up, you can build the ResNet-50 image with:
+
 ```bash
 docker build . -t nvidia_rn50_mx
 ```

-**3. Start an interactive session in the NGC container to run preprocessing/training/inference.**
+3. Start an interactive session in the NGC container to run preprocessing/training/inference.
+
 ```bash
-nvidia-docker run --rm -it --ipc=host <path to dataset>:/data/imagenet/train-val-recordio-passthrough nvidia_rn50_mx
+nvidia-docker run --rm -it --ipc=host -v <path to dataset>:/data/imagenet/train-val-recordio-passthrough nvidia_rn50_mx
 ```

-**4. Download and preprocess the data.**
-* Download the images from http://image-net.org/download-images.
+4. Download the data.
+
+* Download the images from `http://image-net.org/download-images`.
 * Extract the training and validation data:
+
    ```bash
    mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
    tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
    find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
    cd ..
+    mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf      ILSVRC2012_img_val.tar
+wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
    ```

-**5. Extract the validation data and move the images to subfolders.**
-```bash
-mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
-wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
-```
+5. Preprocess the ImageNet 1k dataset.

-**6. Preprocess the dataset.**
 ```bash
 ./scripts/prepare_imagenet.sh <path to raw imagenet> <path where processed dataset will be created>
 ```

-**7. Start training.**
+6. Start training.
+
 ```bash
 ./runner -n <number of gpus> -b <batch size per GPU (default 192)>
 ```

-**8. Start validation/evaluation.**
+7. Start validation/evaluation.
+
 ```bash
 ./runner -n <number of gpus> -b <batch size per GPU (default 192)> --load <path to trained model> --mode val
 ```

-**9. Start inference/predictions.**
+8. Start inference/predictions.
+
 ```bash
 ./runner --load <path to trained model> --mode pred --data-pred <path to the image>
 ```
@ -213,17 +255,17 @@ The following sections provide greater details of the dataset, running training
 ### Scripts and sample code

 In the root directory, the most important files are:
-* `runner`: A wrapper on the `train.py` script which is the main executable script for training/validation/predicting
-* `benchmark.py`: A script for benchmarking
-* `Dockerfile`: Container to build the container
-* `fit.py`: A file containing most of the training and validation logic
-* `data.py`: Data loading and preprocessing code
-* `dali.py`: Data loading and preprocessing code using DALI
-* `models.py`: The model architecture
-* `report.py`: A file containing JSON report structure and description of fields
+* `runner`: A wrapper on the `train.py` script which is the main executable script for training/validation/predicting.
+* `benchmark.py`: A script for benchmarking.
+* `Dockerfile`: Container to build the container.
+* `fit.py`: A file containing most of the training and validation logic.
+* `data.py`: Data loading and preprocessing code.
+* `dali.py`: Data loading and preprocessing code using DALI.
+* `models.py`: The model architecture.
+* `report.py`: A file containing JSON report structure and description of fields.

 In the `scripts` directory, the most important files are:
-* `prepare_imagenet.sh`: A script that converts raw dataset format to RecordIO format
+* `prepare_imagenet.sh`: A script that converts raw dataset format to RecordIO format.



@ -260,20 +302,18 @@ Training:
  --mode {train_val,train,val,pred}
                        mode (default: train_val)
  --seed SEED           random seed (default: None)
-  -n NGPUS, --ngpus NGPUS
-                        number of GPUs to use (default: 1)
+  --gpus GPUS           list of gpus to run, e.g. 0 or 0,2,5 (default: [0])
  --kv-store {device,horovod}
-                        key-value store type (default: horovod)
+                        key-value store type (default: device)
  --dtype {float32,float16}
-                        Precision (default: float16)
+                        precision (default: float16)
  --amp                 If enabled, turn on AMP (Automatic Mixed Precision)
                        (default: False)
-  -b BATCH_SIZE, --batch-size BATCH_SIZE
-                        batch size per GPU (default: 192)
-  -e NUM_EPOCHS, --num-epochs NUM_EPOCHS
+  --batch-size BATCH_SIZE
+                        the batch size (default: 192)
+  --num-epochs NUM_EPOCHS
                        number of epochs (default: 90)
-  -l LR, --lr LR        learning rate; IMPORTANT: true learning rate will be
-                        calculated as `lr * batch_size / 256` (default: 0.256)
+  --lr LR               initial learning rate (default: 0.1)
  --lr-schedule {multistep,cosine}
                        learning rate schedule (default: cosine)
  --lr-factor LR_FACTOR
@ -306,7 +346,12 @@ Training:
                        data to test (default: train)
  --log LOG             file where to save the log from the experiment
                        (default: log.log)
-  --report REPORT       file where to save report (default: report.json)
+  --dllogger-log DLLOGGER_LOG
+                        file where to save the dllogger log from the
+                        experiment (default: dllogger_log.log)
+  --workspace WORKSPACE
+                        path to directory where results will be stored
+                        (default: ./)
  --no-metrics          do not calculate evaluation metrics (for benchmarking)
                        (default: False)
  --benchmark-iters BENCHMARK_ITERS
@ -314,11 +359,19 @@ Training:
                        (default: None)

 Data:
-  --data-root DATA_ROOT
-                        Directory with RecordIO data files (default:
-                        /data/imagenet/train-val-recordio-passthrough)
-  --data-backend {dali,mxnet,synthetic}
-                        data backend (default: dali)
+  --data-train DATA_TRAIN
+                        the training data (default: None)
+  --data-train-idx DATA_TRAIN_IDX
+                        the index of training data (default: )
+  --data-val DATA_VAL   the validation data (default: None)
+  --data-val-idx DATA_VAL_IDX
+                        the index of validation data (default: )
+  --data-pred DATA_PRED
+                        the image on which run inference (only for pred mode)
+                        (default: None)
+  --data-backend {dali-gpu,dali-cpu,mxnet,synthetic}
+                        set data loading & augmentation backend (default:
+                        dali-gpu)
  --image-shape IMAGE_SHAPE
                        the image shape feed into the network (default: [3,
                        224, 224])
@ -358,6 +411,8 @@ DALI data backend:
                        DALI prefetch queue depth (default: 2)
  --dali-nvjpeg-memory-padding DALI_NVJPEG_MEMORY_PADDING
                        Memory padding value for nvJPEG (in MB) (default: 64)
+  --dali-fuse-decoder DALI_FUSE_DECODER
+                        0 or 1 whether to fuse decoder or not (default: 1)

 MXNet data backend:
  entire group applies only to mxnet data backend
@ -425,75 +480,19 @@ MXNet data backend:

 ### Command-line options

-To see the full list of available options and their descriptions, use the `-h` or `--help` command line option: `./runner --help` and `python train.py --help`. `./runner` acts as a wrapper on `train.py` and all additional flags will be passed to `train.py`.
+To see the full list of available options and their descriptions, use the `-h` or `--help` command line option: 

-`./runner` command-line options:
-```
-usage: runner [-h] [-n NGPUS] [-b BATCH_SIZE] [-e NUM_EPOCHS] [-l LR]
-              [--data-root DATA_ROOT] [--dtype {float32,float16}]
-              [--kv-store {device,horovod}]
-              [--data-backend {dali,mxnet,synthetic}]
-```
+`./runner --help` and `python train.py --help`
+
+`./runner` acts as a wrapper on `train.py` and all additional flags will be passed to `train.py`.

-`train.py` command-line options:
-```
-usage: train.py [-h]
-                [--arch {resnetv1,resnetv15,resnextv1,resnextv15,xception}]
-                [--num-layers NUM_LAYERS] [--num-groups NUM_GROUPS]
-                [--num-classes NUM_CLASSES] [--batchnorm-eps BATCHNORM_EPS]
-                [--batchnorm-mom BATCHNORM_MOM] [--fuse-bn-relu FUSE_BN_RELU]
-                [--fuse-bn-add-relu FUSE_BN_ADD_RELU]
-                [--mode {train_val,train,val,pred}] [--seed SEED]
-                [--gpus GPUS] [--kv-store {device,horovod}]
-                [--dtype {float32,float16}] [--amp] [--batch-size BATCH_SIZE]
-                [--num-epochs NUM_EPOCHS] [--lr LR]
-                [--lr-schedule {multistep,cosine}] [--lr-factor LR_FACTOR]
-                [--lr-steps LR_STEPS] [--warmup-epochs WARMUP_EPOCHS]
-                [--optimizer OPTIMIZER] [--mom MOM] [--wd WD]
-                [--label-smoothing LABEL_SMOOTHING] [--mixup MIXUP]
-                [--disp-batches DISP_BATCHES] [--model-prefix MODEL_PREFIX]
-                [--save-frequency SAVE_FREQUENCY] [--begin-epoch BEGIN_EPOCH]
-                [--load LOAD] [--test-io] [--test-io-mode {train,val}]
-                [--log LOG] [--report REPORT] [--no-metrics]
-                [--benchmark-iters BENCHMARK_ITERS] [--data-train DATA_TRAIN]
-                [--data-train-idx DATA_TRAIN_IDX] [--data-val DATA_VAL]
-                [--data-val-idx DATA_VAL_IDX] [--data-pred DATA_PRED]
-                [--data-backend {dali,mxnet,synthetic}]
-                [--image-shape IMAGE_SHAPE] [--rgb-mean RGB_MEAN]
-                [--rgb-std RGB_STD] [--input-layout {NCHW,NHWC}]
-                [--conv-layout {NCHW,NHWC}] [--batchnorm-layout {NCHW,NHWC}]
-                [--pooling-layout {NCHW,NHWC}] [--num-examples NUM_EXAMPLES]
-                [--data-val-resize DATA_VAL_RESIZE] [--dali-separ-val]
-                [--dali-threads DALI_THREADS]
-                [--dali-validation-threads DALI_VALIDATION_THREADS]
-                [--dali-prefetch-queue DALI_PREFETCH_QUEUE]
-                [--dali-nvjpeg-memory-padding DALI_NVJPEG_MEMORY_PADDING]
-                [--data-mxnet-threads DATA_MXNET_THREADS]
-                [--random-crop RANDOM_CROP] [--random-mirror RANDOM_MIRROR]
-                [--max-random-h MAX_RANDOM_H] [--max-random-s MAX_RANDOM_S]
-                [--max-random-l MAX_RANDOM_L]
-                [--min-random-aspect-ratio MIN_RANDOM_ASPECT_RATIO]
-                [--max-random-aspect-ratio MAX_RANDOM_ASPECT_RATIO]
-                [--max-random-rotate-angle MAX_RANDOM_ROTATE_ANGLE]
-                [--max-random-shear-ratio MAX_RANDOM_SHEAR_RATIO]
-                [--max-random-scale MAX_RANDOM_SCALE]
-                [--min-random-scale MIN_RANDOM_SCALE]
-                [--max-random-area MAX_RANDOM_AREA]
-                [--min-random-area MIN_RANDOM_AREA]
-                [--min-crop-size MIN_CROP_SIZE]
-                [--max-crop-size MAX_CROP_SIZE] [--brightness BRIGHTNESS]
-                [--contrast CONTRAST] [--saturation SATURATION]
-                [--pca-noise PCA_NOISE]
-                [--random-resized-crop RANDOM_RESIZED_CROP]
-```

 ### Getting the data

-The MXNet ResNet50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from ILSVRC challenge.
-You can download the images from http://image-net.org/download-images.
+The MXNet ResNet-50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from ILSVRC challenge. You can download the images from `http://image-net.org/download-images`.

 The recommended data format is
-[RecordIO](http://mxnet.io/architecture/note_data_loading.html), which
+[RecordIO](https://mxnet.apache.org/versions/1.7.0/api/architecture/note_data_loading), which
 concatenates multiple examples into seekable binary files for better read
 efficiency. MXNet provides a tool called `im2rec.py` located in the `/opt/mxnet/tools/` directory.
 The tool converts individual images into `.rec` files.
@ -508,8 +507,7 @@ python /opt/mxnet/tools/im2rec.py --list --recursive val /data/imagenet/raw/val-
 ```

 Next, we generate the `.rec` (RecordIO files with data) and `.idx` (indexes required by DALI
-to speed up data loading) files. To obtain the best training accuracy
-we do not preprocess the images when creating the RecordIO file.
+to speed up data loading) files. To obtain the best training accuracy we do not preprocess the images when creating the RecordIO file.

 ```bash
 python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 train /data/imagenet/raw/train-jpeg
@ -517,7 +515,8 @@ python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 val /data/image
 ```

 #### Dataset guidelines
-The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data.py` and `dali.py` files.
+
+The process of loading, normalizing, and augmenting the data contained in the dataset can be found in the `data.py` and `dali.py` files.

 The data is read from RecordIO format, which concatenates multiple examples into seekable binary files for better read efficiency.

@ -535,10 +534,10 @@ To convert a custom dataset, follow the steps from [Getting the data](#getting-t
 To start training, run:
 `./runner -n <number of gpus> -b <batch size per GPU> --data-root <path to imagenet> --dtype <float32 or float16>`

-By default the training script runs the validation after each epoch:
-* the best checkpoint will be stored in the `model_best.params` file in the working directory
-* the log from training will be saved in the  `log.log` file in the working directory
-* the JSON report with statistics will be saved in the  `report.json` file in the working directory
+By default, the training script runs the validation after each epoch:
+* The best checkpoint will be stored in the `model_best.params` file in the working directory.
+* The log from training will be saved in the `log.log` file in the working directory.
+* The JSON report with statistics will be saved in the `report.json` file in the working directory.

 If ImageNet is mounted in the `/data/imagenet/train-val-recordio-passthrough` directory, you don't have to specify the `--data-root` flag.

@ -548,8 +547,8 @@ To start validation, run:
 `./runner -n <number of gpus> -b <batch size per GPU> --data-root <path to imagenet> --dtype <float32 or float16> --mode val`

 By default: 
-* the log from validation will be saved in the `log.log` file in the working directory
-* the JSON report with statistics will be saved in the `report.json` file in the working directory
+* The log from validation will be saved in the `log.log` file in the working directory.
+* The JSON report with statistics will be saved in the `report.json` file in the working directory.

 ## Performance

@ -558,16 +557,19 @@ By default:
 To benchmark training and inference, run:
 `python benchmark.py -n <numbers of gpus separated by comma> -b <batch sizes per GPU separated by comma> --data-root <path to imagenet> --dtype <float32 or float16> -o <path to benchmark report>`

-To control the benchmark length per epoch, use the `-i` flag (defaults to 100 iterations).
-To control the number of epochs, use the `-e` flag.
-To control the number of warmup epochs (epochs which are not taken into account), use the `-w` flag.
-To limit the length of the dataset, use the `--num-examples` flag.
+* To control the benchmark length per epoch, use the `-i` flag (defaults to 100 iterations).
+* To control the number of epochs, use the `-e` flag.
+* To control the number of warmup epochs (epochs which are not taken into account), use the `-w` flag.
+* To limit the length of the dataset, use the `--num-examples` flag.
+
 By default, the same parameters as in `./runner` will be used. Additional flags will be passed to `./runner`.

 #### Training performance benchmark
+
 To benchmark only training, use the `--mode train` flag.

 #### Inference performance benchmark
+
 To benchmark only inference, use the `--mode val` flag.


@ -577,17 +579,28 @@ The following sections provide details on how we achieved our performance and ac

 #### Training accuracy results

-##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
+##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

-90 epochs configuration
-Our results were obtained by running the `./runner -n <number of gpus> -b 96 --dtype float32` script for FP32 and the `./runner -n <number of gpus> -b 192` script for mixed precision in the in the mxnet-19.07-py3  NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
- on NVIDIA DGX-1 with (8x V100 16G) GPUs.
+**90 epochs configuration**
+
+Our results were obtained by running 8 times the `./runner -n <number of gpus> -b 256 --dtype float32` script for TF32 and the `./runner -n <number of gpus> -b 256` script for mixed precision in the mxnet-20.12-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
+
+| **GPUs** | **Accuracy - mixed precision** | **Accuracy - TF32** | **Time to train - mixed precision** | **Time to train - TF32** | **Time to train - speedup** |
+|:---:|:---:|:---:|:---:|:---:|:---:|
+|1|77.185|77.184|14.6|31.26|2.13|
+|8|77.185|77.184|1.8|4.0|2.12|
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
+
+**90 epochs configuration**
+
+Our results were obtained by running the `./runner -n <number of gpus> -b 96 --dtype float32` training script for FP32 and the `./runner -n <number of gpus> -b 192` training script for mixed precision in the mxnet-20.12-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.

 | **GPUs** | **Accuracy - mixed precision** | **Accuracy - FP32** | **Time to train - mixed precision** | **Time to train - FP32** | **Time to train - speedup** |
 |:---:|:---:|:---:|:---:|:---:|:---:|
-|1|77.208|77.160|24.2|84.5|3.49|
-|4|77.296|77.280|6.0|21.4|3.59|
-|8|77.308|77.292|3.0|10.7|3.54|
+|1|77.342|77.160|24.2|84.5|3.49|
+|4|77.196|77.290|6.0|21.4|3.59|
+|8|77.150|77.313|3.0|10.7|3.54|

 ##### Training stability test

@ -596,21 +609,22 @@ Our results were obtained by running the following commands 8 times with differe
 * For 50 epochs
  * `./runner -n 8 -b 96 --dtype float32 --num-epochs 50` for FP32
  * `./runner -n 8 -b 192 --num-epochs 50` for mixed precision
+
 * For 90 epochs
  * `./runner -n 8 -b 96 --dtype float32` for FP32
  * `./runner -n 8 -b 192` for mixed precision
+
 * For 250 epochs
  * `./runner -n 8 -b 96 --dtype float32 --num-epochs 250 --mixup 0.2` for FP32
  * `./runner -n 8 -b 192 --num-epochs 250 --mixup 0.2` for mixed precision

 | **# of epochs** | **mixed precision avg top1** | **FP32 avg top1** | **mixed precision standard deviation** | **FP32 standard deviation** | **mixed precision minimum top1** | **FP32 minimum top1** | **mixed precision maximum top1** | **FP32 maximum top1** |
 |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-|50|76.156|76.185|0.118|0.082|76.010|76.062|76.370|76.304|
-|90|77.105|77.224|0.097|0.060|76.982|77.134|77.308|77.292|
-|250|78.317|78.400|0.073|0.102|78.202|78.316|78.432|78.570|
+|50|76.308|76.329|0.00073|0.00094|76.230|76.234|76.440|76.470|
+|90|77.150|77.313|0.00098|0.00085|76.972|77.228|77.266|77.474|
+|250|78.460|78.483|0.00078|0.00065|78.284|78.404|78.560|78.598|

-
-Plots for 250 epoch configuration
+**Plots for 250 epoch configuration**
 Here are example graphs of FP32 and mixed precision training on 8 GPU 250 epochs configuration:

 ![TrainingLoss](./img/dgx1-16g_250e_training_loss.png)
@ -622,99 +636,139 @@ Here are example graphs of FP32 and mixed precision training on 8 GPU 250 epochs

 #### Training performance results

-##### Training performance: NVIDIA DGX-1 (8x V100 16G)
+##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
+
+The following results were obtained by running the 
+`python benchmark.py -n 1,2,4,8 -b 256 --dtype float32 -o benchmark_report_tf32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for TF32 and the
+`python benchmark.py -n 1,2,4,8 -b 256 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision in the mxnet-20.12-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
+
+Training performance reported as Total IPS (data + compute time taken into account).
+Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.
+
+| **GPUs** | **Throughput - mixed precision** | **Throughput - TF32** | **Throughput speedup (TF32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - TF32** |
+|:---:|:---:|:---:|:---:|:---:|:---:|
+|1|2180 |1022 |2.18 |1.00 |1.00 |
+|2|4332 |2032 |2.13 |1.98 |1.98 |
+|4|8587 |4035 |2.12 |3.93 |3.94 |
+|8|16925|8001 |2.11 |7.76 |7.82 |
+
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)

 The following results were obtained by running the
 `python benchmark.py -n 1,2,4,8 -b 192 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision and the
-`python benchmark.py -n 1,2,4,8 -b 96 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
+`python benchmark.py -n 1,2,4,8 -b 96 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-20.12-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.

 Training performance reported as Total IPS (data + compute time taken into account).
 Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.

 | **GPUs** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
 |:---:|:---:|:---:|:---:|:---:|:---:|
-|1|1427|385|3.71|1.00|1.00|
-|2|2820|768|3.67|1.98|2.00|
-|4|5560|1513|3.68|3.90|3.93|
-|8|10931|3023|3.62|7.66|7.86|
+|1|1376  |384  |3.58 |1.00 |1.00 |
+|2|2768  |763  |3.62 |2.01 |1.98 |
+|4|5357  |1513 |3.54 |3.89 |3.94 |
+|8|10723 |3005 |3.56 |7.79 |7.82 |

-##### Training performance: NVIDIA DGX-2 (16x V100 32G)
+##### Training performance: NVIDIA DGX-2 (16x V100 32GB)

 The following results were obtained by running the
-`python benchmark.py -n 1,4,8,16 -b 256 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision and the
-`python benchmark.py -n 1,4,8,16 -b 128 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
+`python benchmark.py -n 1,2,4,8,16 -b 256 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision and the
+`python benchmark.py -n 1,2,4,8,16 -b 128 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-20.12-py3 NGC container on NVIDIA DGX-2 with (16x V100 32GB) GPUs.

 Training performance reported as Total IPS (data + compute time taken into account).
 Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.

 | **GPUs** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
 |:---:|:---:|:---:|:---:|:---:|:---:|
-|1|1438|409|3.52|1.00|1.00|
-|2|2868|817|3.51|1.99|2.00|
-|4|5624|1617|3.48|3.91|3.96|
-|8|11174|3214|3.48|7.77|7.86|
-|16|20530|6356|3.23|14.28|15.54|
+|1 |1492 |417  |3.57 |1.00 |1.00 |
+|2 |2935 |821  |3.57 |1.96 |1.96 |
+|4 |5726 |1623 |3.52 |3.83 |3.92 |
+|8 |11368|3223 |3.52 |7.61 |7.72 |
+|16|21484|6338 |3.38 |14.39|15.19|

 #### Inference performance results

-##### Inference performance: NVIDIA DGX-1 (8x V100 16G)
+##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)

 The following results were obtained by running the
 `python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
-`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
+`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_tf32.json -i 500 -e 3 -w 1 --mode val` script for TF32 in the mxnet-20.12-py3 NGC container on NVIDIA DGX A100 with (8x A100 80GB) GPUs.

 Inference performance reported as Total IPS (data + compute time taken into account).
+Reported mixed precision speedups are relative to TF32 numbers for corresponding configuration.

+| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** |
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| 1   | 463 | 1.72 | 2.15 | 1.72 | 2.10 | 1.58 | 2.23 | 1.58 | 2.39 | 1.56 | 2.94 | 1.79 |
+| 2   | 880 | 1.62 | 2.27 | 1.62 | 2.14 | 1.66 | 2.52 | 1.54 | 2.73 | 1.50 | 3.70 | 1.42 |
+| 4   | 1668| 1.76 | 2.39 | 1.76 | 2.21 | 1.86 | 2.70 | 1.66 | 3.30 | 1.44 | 5.72 | 1.01 |
+| 8   | 2522| 1.75 | 3.17 | 1.75 | 2.74 | 2.00 | 4.26 | 1.35 | 5.36 | 1.10 | 10.43| 0.65 |
+| 16  | 3704| 1.90 | 4.31 | 1.90 | 3.83 | 2.13 | 6.00 | 1.43 | 7.20 | 1.24 | 12.77| 0.85 |
+| 32  | 2964| 1.51 | 10.79| 1.51 | 10.45| 1.52 | 14.52| 1.37 | 16.07| 1.32 | 22.76| 1.21 |
+| 64  | 4547| 1.80 | 14.07| 1.80 | 13.75| 1.82 | 17.16| 1.67 | 19.04| 1.59 | 28.12| 1.28 |
+| 128 | 5530| 1.94 | 23.14| 1.94 | 23.63| 1.82 | 29.04| 1.71 | 32.75| 1.56 | 41.45| 1.34 |
+| 192 | 6198| 2.19 | 30.97| 2.19 | 31.02| 2.21 | 40.04| 1.81 | 44.03| 1.68 | 51.44| 1.51 |
+| 256 | 6120| 2.19 | 41.82| 2.19 | 42.01| 2.19 | 50.72| 1.89 | 55.09| 1.77 | 63.08| 1.60 |
+
+
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
+
+The following results were obtained by running the
+`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
+`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-20.12-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+
+Inference performance reported as Total IPS (data + compute time taken into account).
 Reported mixed precision speedups are relative to FP32 numbers for corresponding configuration.

-| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** | **100% latency (ms) - mixed precision** | **100% latency - speedup** |
-|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| 1 | 397 | 1.65 | 2.5 | 1.65 | 2.5 | 1.67 | 2.7 | 1.59 | 2.8 | 1.56 | 3.2 | 1.51 | 15.8 | 0.84 |
-| 2 | 732 | 1.81 | 2.7 | 1.81 | 2.6 | 1.88 | 3.0 | 1.67 | 3.3 | 1.52 | 4.9 | 1.10 | 18.8 | 0.83 |
-| 4 | 1269 | 2.08 | 3.2 | 2.08 | 3.0 | 2.21 | 3.5 | 1.92 | 4.0 | 1.72 | 7.5 | 0.97 | 14.5 | 0.54 |
-| 8 | 2012 | 2.53 | 4.0 | 2.53 | 3.9 | 2.59 | 4.2 | 2.45 | 4.4 | 2.37 | 8.3 | 1.29 | 15.3 | 0.72 |
-| 16 | 2667 | 2.64 | 6.0 | 2.64 | 5.9 | 2.66 | 6.3 | 2.54 | 6.4 | 2.52 | 8.3 | 2.02 | 16.9 | 1.05 |
-| 32 | 3240 | 2.86 | 9.9 | 2.86 | 9.8 | 2.87 | 10.3 | 2.79 | 10.4 | 2.76 | 11.5 | 2.53 | 28.4 | 1.12 |
-| 64 | 3776 | 3.10 | 17.0 | 3.10 | 17.0 | 3.09 | 17.5 | 3.03 | 17.7 | 3.01 | 18.1 | 3.01 | 18.7 | 2.99 |
-| 128 | 3734 | 3.02 | 34.3 | 3.02 | 33.8 | 3.05 | 35.5 | 2.93 | 36.3 | 2.88 | 42.4 | 2.79 | 51.7 | 2.38 |
-| 192 | 3641 | 2.90 | 52.7 | 2.90 | 52.4 | 2.90 | 55.2 | 2.77 | 56.2 | 2.74 | 65.4 | 2.76 | 77.1 | 2.41 |
-| 256 | 3463 | 2.73 | 73.9 | 2.73 | 72.8 | 2.75 | 77.3 | 2.61 | 79.9 | 2.54 | 100.8 | 2.39 | 104.1 | 2.35 |
+| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** | 
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| 1   | 286 | 1.27 | 3.48 | 1.27 | 3.45 | 1.27 | 3.61 | 1.26| 3.68 | 1.26| 3.86 | 1.24| 
+| 2   | 519 | 1.34 | 3.84 | 1.34 | 3.77 | 1.35 | 4.05 | 1.31| 4.16 | 1.29| 4.59 | 1.27| 
+| 4   | 910 | 1.60 | 4.39 | 1.60 | 4.35 | 1.61 | 4.59 | 1.56| 4.66 | 1.56| 5.19 | 1.47| 
+| 8   | 1642| 2.20 | 4.87 | 2.20 | 4.68 | 2.29 | 5.35 | 2.05| 6.01 | 1.84| 11.06| 1.04| 
+| 16  | 2359| 2.55 | 6.78 | 2.55 | 6.49 | 2.66 | 7.07 | 2.48| 8.33 | 2.12| 13.89| 1.30| 
+| 32  | 2902| 2.86 | 11.02| 2.86 | 10.43| 3.02 | 12.25| 2.60| 13.88| 2.31| 21.41| 1.55| 
+| 64  | 3234| 2.74 | 19.78| 2.74 | 18.89| 2.86 | 22.50| 2.44| 25.38| 2.17| 30.78| 1.81| 
+| 128 | 3362| 2.69 | 38.06| 2.69 | 37.20| 2.75 | 42.32| 2.44| 45.12| 2.30| 50.59| 2.07| 
+| 192 | 3178| 2.52 | 60.40| 2.52 | 59.62| 2.55 | 65.56| 2.35| 68.16| 2.25| 73.72| 2.10| 
+| 256 | 3057| 2.38 | 83.73| 2.38 | 82.77| 2.40 | 92.26| 2.24| 92.26| 2.17|100.84| 2.23| 

 ##### Inference performance: NVIDIA T4

 The following results were obtained by running the
 `python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
-`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-19.07-py3 NGC container on an NVIDIA T4 GPU.
+`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-20.12-py3 NGC container on an NVIDIA T4 GPU.

 Inference performance reported as Total IPS (data + compute time taken into account).
-
 Reported mixed precision speedups are relative to FP32 numbers for corresponding configuration.

-| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** | **100% latency (ms) - mixed precision** | **100% latency - speedup** |
-|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| 1 | 348 | 1.88 | 2.9 | 1.88 | 2.8 | 1.91 | 2.9 | 1.88 | 3.0 | 1.90 | 3.9 | 1.82 | 17.6 | 0.74 |
-| 2 | 594 | 2.30 | 3.4 | 2.30 | 3.3 | 2.35 | 3.4 | 2.34 | 3.5 | 2.38 | 5.7 | 1.55 | 20.2 | 0.74 |
-| 4 | 858 | 2.93 | 4.7 | 2.93 | 4.6 | 2.97 | 4.9 | 2.86 | 5.0 | 2.81 | 6.0 | 2.46 | 13.7 | 1.12 |
-| 8 | 1047 | 3.17 | 7.6 | 3.17 | 7.6 | 3.19 | 7.9 | 3.10 | 8.2 | 3.02 | 9.1 | 2.77 | 15.0 | 1.72 |
-| 16 | 1163 | 3.16 | 13.8 | 3.16 | 13.7 | 3.17 | 14.1 | 3.13 | 14.4 | 3.07 | 15.4 | 2.90 | 17.5 | 2.62 |
-| 32 | 1225 | 3.22 | 26.1 | 3.22 | 26.1 | 3.22 | 27.0 | 3.15 | 27.3 | 3.12 | 28.3 | 3.05 | 30.5 | 2.89 |
-| 64 | 1230 | 3.15 | 52.0 | 3.15 | 51.8 | 3.16 | 52.9 | 3.12 | 53.3 | 3.10 | 54.4 | 3.08 | 58.8 | 2.90 |
-| 128 | 1260 | 3.21 | 101.6 | 3.21 | 101.3 | 3.22 | 102.7 | 3.21 | 103.2 | 3.20 | 115.0 | 2.89 | 121.8 | 2.86 |
-| 192 | 1252 | 3.20 | 153.3 | 3.20 | 153.1 | 3.20 | 154.7 | 3.19 | 155.5 | 3.21 | 156.9 | 3.20 | 182.3 | 2.81 |
-| 256 | 1251 | 3.22 | 204.6 | 3.22 | 204.3 | 3.23 | 206.4 | 3.21 | 207.1 | 3.21 | 209.3 | 3.18 | 241.9 | 2.76 |
+| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** |
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| 1   | 131 | 1.11 | 7.61 | 1.17 | 7.10 | 0.97 | 10.28 | 0.92 | 11.35 | 0.95 | 15.05 | 0.96 |
+| 2   | 277 | 1.48 | 7.20 | 1.53 | 7.30 | 1.19 | 7.74  | 1.48 | 8.82  | 1.49 | 12.09 | 1.58 | 
+| 4   | 374 | 1.47 | 10.67| 1.50 | 10.20| 1.40 | 13.51 | 1.09 | 14.82 | 1.03 | 22.36 | 0.74 |
+| 8   | 672 | 2.21 | 11.90| 2.23 | 11.21| 2.21 | 14.54 | 1.74 | 17.24 | 1.48 | 28.65 | 0.92 |
+| 16  | 1267| 3.57 | 12.62| 3.58 | 12.02| 3.59 | 14.02 | 3.13 | 16.02 | 2.76 | 22.28 | 2.01 |
+| 32  | 1473| 3.85 | 21.71| 3.86 | 21.67| 3.76 | 22.63 | 3.64 | 22.98 | 3.60 | 23.85 | 3.52 |
+| 64  | 1561| 3.70 | 40.98| 3.70 | 40.87| 3.64 | 41.98 | 3.57 | 42.56 | 3.53 | 43.85 | 3.46 | 
+| 128 | 1555| 3.60 | 82.26| 3.60 | 81.86| 3.57 | 83.87 | 3.51 | 84.63 | 3.49 | 96.56 | 3.09 |
+| 192 | 1545| 3.64 |124.26| 3.64 |123.67| 3.61 |125.76 | 3.58 |126.73 | 3.56 |143.27 | 3.19 |
+| 256 | 1559| 3.71 |164.15| 3.71 |163.97| 3.71 |166.28 | 3.70 |167.01 | 3.70 |168.54 | 3.69 |

 ## Release notes 

 ### Changelog

 1. Dec, 2018
-  * Initial release (based on https://github.com/apache/incubator-mxnet/tree/master/example/image-classification)
+  * Initial release (based on https://github.com/apache/incubator-mxnet/tree/v1.8.x/example/image-classification)
 2. June, 2019
  * Code refactor
  * Label smoothing
  * Cosine LR schedule
  * MixUp regularization
  * Better configurations
+3. February, 2021
+  * DGX-A100 performance results
+  * Container version upgraded to 20.12


 ### Known Issues
--- a/MxNet/Classification/RN50v1.5/benchmark.py
+++ b/MxNet/Classification/RN50v1.5/benchmark.py
@ -15,11 +15,8 @@
 # limitations under the License.

 import argparse
-import json
 import sys
-import tempfile
 import json
-import os
 import traceback
 import numpy as np
 from collections import OrderedDict
@ -48,8 +45,8 @@ parser.add_argument('--mode', metavar='MODE', choices=('train_val', 'train', 'va
                    help="benchmark mode")
 args, other_args = parser.parse_known_args()

-latency_percentiles = ['avg', 50, 90, 95, 99, 100]
-harmonic_mean_metrics = ['train.total_ips', 'val.total_ips']
+latency_percentiles = [50, 90, 95, 99, 100]
+harmonic_mean_metrics = ['train.ips', 'val.ips']

 res = OrderedDict()
 res['model'] = ''
@ -57,11 +54,11 @@ res['ngpus'] = args.ngpus
 res['bs'] = args.batch_sizes
 res['metric_keys'] = []
 if args.mode == 'train' or args.mode == 'train_val':
-    res['metric_keys'].append('train.total_ips')
-    for percentile in latency_percentiles:
-        res['metric_keys'].append('train.latency_{}'.format(percentile))
+    res['metric_keys'].append('train.ips')
 if args.mode == 'val' or args.mode == 'train_val':
-    res['metric_keys'].append('val.total_ips')
+    res['metric_keys'].append('val.ips')
+    res['metric_keys'].append('val.latency_avg')
+if args.mode == 'val':
    for percentile in latency_percentiles:
        res['metric_keys'].append('val.latency_{}'.format(percentile))

@ -72,29 +69,26 @@ for n in args.ngpus:
    for bs in args.batch_sizes:
        res['metrics'][str(n)][str(bs)] = OrderedDict()

-        report_file = args.output + '-{},{}'.format(n, bs)
+        log_file = args.output + '-{},{}'.format(n, bs)
        Popen(['timeout', args.timeout, args.executable, '-n', str(n), '-b', str(bs),
               '--benchmark-iters', str(args.benchmark_iters),
-               '-e', str(args.epochs), '--report', report_file,
+               '-e', str(args.epochs), '--dllogger-log', log_file,
               '--mode', args.mode, '--no-metrics'] + other_args,
              stdout=sys.stderr).wait()

        try:
-            for suffix in ['', *['-{}'.format(i) for i in range(1, n)]]:
-                try:
-                    with open(report_file + suffix, 'r') as f:
-                        report = json.load(f)
-                    break
-                except FileNotFoundError:
-                    pass
-            else:
-                with open(report_file, 'r') as f:
-                    report = json.load(f)
+
+            with open(log_file, 'r') as f:
+                lines = f.read().splitlines()
+                log_data = [json.loads(line[5:]) for line in lines]
+                epochs_report = list(filter(lambda x: len(x['step']) == 1, log_data))
+
+            if len(epochs_report) != args.epochs:
+                raise ValueError('Wrong number epochs in report')
+            epochs_report = epochs_report[args.warmup:]

            for metric in res['metric_keys']:
-                if len(report['metrics'][metric]) != args.epochs:
-                    raise ValueError('Wrong number epochs in report')
-                data = report['metrics'][metric][args.warmup:]
+                data = list(map(lambda x: x['data'][metric], epochs_report))
                if metric in harmonic_mean_metrics:
                    avg = len(data) / sum(map(lambda x: 1 / x, data))
                else:
--- a/MxNet/Classification/RN50v1.5/dali.py
+++ b/MxNet/Classification/RN50v1.5/dali.py
@ -45,18 +45,17 @@ class HybridTrainPipe(Pipeline):

        if dali_cpu:
            dali_device = "cpu"
-            if args.dali_fuse_decoder:
-                self.decode = ops.HostDecoderRandomCrop(device=dali_device, output_type=types.RGB)
-            else:
-                self.decode = ops.HostDecoder(device=dali_device, output_type=types.RGB)
+            decoder_device = "cpu"
        else:
            dali_device = "gpu"
-            if args.dali_fuse_decoder:
-                self.decode = ops.nvJPEGDecoderRandomCrop(device="mixed", output_type=types.RGB,
-                                                          device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)
-            else:
-                self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB,
-                                                device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)
+            decoder_device = "mixed"
+
+        if args.dali_fuse_decoder:
+            self.decode = ops.ImageDecoderRandomCrop(device=decoder_device, output_type=types.RGB,
+                                                    device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)
+        else:
+            self.decode = ops.ImageDecoder(device=decoder_device, output_type=types.RGB,
+                                           device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)

        if args.dali_fuse_decoder:
            self.resize = ops.Resize(device=dali_device, resize_x=crop_shape[1], resize_y=crop_shape[0])
@ -89,12 +88,14 @@ class HybridValPipe(Pipeline):

        if dali_cpu:
            dali_device = "cpu"
-            self.decode = ops.HostDecoder(device=dali_device, output_type=types.RGB)
+            decoder_device = "cpu"
        else:
            dali_device = "gpu"
-            self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB,
-                                            device_memory_padding=nvjpeg_padding,
-                                            host_memory_padding=nvjpeg_padding)
+            decoder_device = "mixed"
+
+        self.decode = ops.ImageDecoder(device=decoder_device, output_type=types.RGB,
+                                       device_memory_padding=nvjpeg_padding,
+                                       host_memory_padding=nvjpeg_padding)
        self.resize = ops.Resize(device=dali_device, resize_shorter=resize_shp) if resize_shp else None
        self.cmnp = ops.CropMirrorNormalize(device="gpu",
                                            output_dtype=types.FLOAT16 if dtype == 'float16' else types.FLOAT,
--- a/MxNet/Classification/RN50v1.5/fit.py
+++ b/MxNet/Classification/RN50v1.5/fit.py
@ -35,23 +35,25 @@

 """ train fit utility """
 import logging
-import os
-import time
-import re
 import math
-import sys
+import os
 import random
+import sys
+import time
 from itertools import starmap
-import numpy as np
-import mxnet as mx
-import mxnet.ndarray as nd
+
+import dllogger
 import horovod.mxnet as hvd
+import mxnet as mx
 import mxnet.contrib.amp as amp
+import numpy as np
 from mxnet import autograd as ag
 from mxnet import gluon
-from report import Report
-from benchmarking import BenchmarkingDataIter
+
 import data
+from benchmarking import BenchmarkingDataIter
+from global_metrics import CompositeMeter, MaxMeter, MinMeter, AvgMeter, PercentileMeter
+

 def add_fit_args(parser):
    def int_list(x):
@ -65,12 +67,10 @@ def add_fit_args(parser):
                       help='mode')
    train.add_argument('--seed', type=int, default=None,
                       help='random seed')
-
    train.add_argument('--gpus', type=int_list, default=[0],
                       help='list of gpus to run, e.g. 0 or 0,2,5')
    train.add_argument('--kv-store', type=str, default='device', choices=('device', 'horovod'),
                       help='key-value store type')
-
    train.add_argument('--dtype', type=str, default='float16', choices=('float32', 'float16'),
                       help='precision')
    train.add_argument('--amp', action='store_true',
@ -79,6 +79,8 @@ def add_fit_args(parser):
                       help='the batch size')
    train.add_argument('--num-epochs', type=int, default=90,
                       help='number of epochs')
+    train.add_argument('--run-epochs', type=int, default=-1, 
+                       help='number of epochs to run in single run')
    train.add_argument('--lr', type=float, default=0.1,
                       help='initial learning rate')
    train.add_argument('--lr-schedule', choices=('multistep', 'cosine'), default='cosine',
@ -99,7 +101,6 @@ def add_fit_args(parser):
                       help='label smoothing factor')
    train.add_argument('--mixup', type=float, default=0,
                       help='alpha parameter for mixup (if 0 then mixup is not applied)')
-
    train.add_argument('--disp-batches', type=int, default=20,
                       help='show progress for every n batches')
    train.add_argument('--model-prefix', type=str, default='model',
@ -110,24 +111,27 @@ def add_fit_args(parser):
    train.add_argument('--begin-epoch', type=int, default=0,
                       help='start the model from an epoch')
    train.add_argument('--load', help='checkpoint to load')
-
    train.add_argument('--test-io', action='store_true',
                       help='test reading speed without training')
    train.add_argument('--test-io-mode', default='train', choices=('train', 'val'),
                       help='data to test')
-
    train.add_argument('--log', type=str, default='log.log',
                       help='file where to save the log from the experiment')
-    train.add_argument('--report', default='report.json', help='file where to save report')
-
-    train.add_argument('--no-metrics', action='store_true', help='do not calculate evaluation metrics (for benchmarking)')
+    train.add_argument('--dllogger-log', type=str, default='dllogger_log.log',
+                       help='file where to save the dllogger log from the experiment')
+    train.add_argument('--workspace', type=str, default='./',
+                       help='path to directory where results will be stored')
+    train.add_argument('--no-metrics', action='store_true',
+                       help='do not calculate evaluation metrics (for benchmarking)')
    train.add_argument('--benchmark-iters', type=int, default=None,
                       help='run only benchmark-iters iterations from each epoch')
    return train

+
 def get_epoch_size(args, kv):
    return math.ceil(args.num_examples / args.batch_size)

+
 def get_lr_scheduler(args):
    def multistep_schedule(x):
        lr = args.lr * (args.lr_factor ** (len(list(filter(lambda step: step <= x, args.lr_steps)))))
@ -158,6 +162,7 @@ def get_lr_scheduler(args):
    }
    return schedules[args.lr_schedule]

+
 def load_model(args, model):
    if args.load is None:
        return False
@ -165,6 +170,7 @@ def load_model(args, model):
    logging.info('Loaded model {}'.format(args.load))
    return True

+
 def save_checkpoint(net, epoch, top1, best_acc, model_prefix, save_frequency, kvstore):
    if model_prefix is None or save_frequency == 0 or ('horovod' in kvstore and hvd.rank() != 0):
        return
@ -177,29 +183,6 @@ def save_checkpoint(net, epoch, top1, best_acc, model_prefix, save_frequency, kv
        net.save_parameters(fname)
        logging.info('[Epoch {}] Saving checkpoint to {} with Accuracy: {:.4f}'.format(epoch, fname, top1))

-def add_metrics_to_report(report, mode, metric, durations, total_batch_size, loss=None, warmup=20):
-    if report is None:
-        return
-
-    top1 = metric.get('accuracy', None)
-    if top1 is not None:
-        report.add_value('{}.top1'.format(mode), top1)
-
-    top5 = metric.get('top_k_accuracy_5', None)
-    if top5 is not None:
-        report.add_value('{}.top5'.format(mode), top5)
-
-    if loss is not None:
-        report.add_value('{}.loss'.format(mode), loss.get_global()[1])
-
-    if len(durations) > warmup:
-        durations = durations[warmup:]
-    duration = np.mean(durations)
-    total_ips = total_batch_size / duration
-    report.add_value('{}.latency_avg'.format(mode), duration)
-    for percentile in [50, 90, 95, 99, 100]:
-        report.add_value('{}.latency_{}'.format(mode, percentile), np.percentile(durations, percentile))
-    report.add_value('{}.total_ips'.format(mode), total_ips)

 def model_pred(args, model, image):
    from imagenet_classes import classes
@ -209,6 +192,7 @@ def model_pred(args, model, image):
        ind = int(ind.asscalar())
        logging.info('{:2d}. {:5.2f}% -> {}'.format(i + 1, output[ind].asscalar() * 100, classes[ind]))

+
 def reduce_metrics(args, metrics, kvstore):
    if 'horovod' not in kvstore or not metrics[0] or hvd.size() == 1:
        return metrics
@ -218,7 +202,8 @@ def reduce_metrics(args, metrics, kvstore):
    values = reduced.as_in_context(mx.cpu()).asnumpy().tolist()
    return (metrics[0], values)

-def model_score(args, net, val_data, metric, kvstore, report=None):
+
+def model_score(args, net, val_data, metric, kvstore):
    if val_data is None:
        logging.info('Omitting validation: no data')
        return [], []
@ -249,8 +234,13 @@ def model_score(args, net, val_data, metric, kvstore, report=None):
        tic = time.time()

    metric = reduce_metrics(args, metric.get_global(), kvstore)
-    add_metrics_to_report(report, 'val', dict(zip(*metric)), durations, total_batch_size)
-    return metric
+    duration_stats = {
+        'ips': total_batch_size / np.mean(durations),
+        'latency_avg': np.mean(durations),
+    }
+
+    return metric, duration_stats, durations
+

 class ScalarMetric(mx.metric.Loss):
    def update(self, _, scalar):
@ -259,13 +249,14 @@ class ScalarMetric(mx.metric.Loss):
        self.num_inst += 1
        self.global_num_inst += 1

+
 def label_smoothing(labels, classes, eta):
    return labels.one_hot(classes, on_value=1 - eta + eta / classes, off_value=eta / classes)

-def model_fit(args, net, train_data, eval_metric, optimizer,
-        optimizer_params, lr_scheduler, eval_data, kvstore, kv,
-        begin_epoch, num_epoch, model_prefix, report, print_loss):

+def model_fit(args, net, train_data, eval_metric, optimizer,
+              optimizer_params, lr_scheduler, eval_data, global_metrics, kvstore, kv,
+              begin_epoch, num_epoch, run_epoch, model_prefix):
    if not isinstance(eval_metric, mx.metric.EvalMetric):
        eval_metric = mx.metric.create(eval_metric)
    loss_metric = ScalarMetric()
@ -278,6 +269,7 @@ def model_fit(args, net, train_data, eval_metric, optimizer,

    if args.amp:
        amp.init_trainer(trainer)
+    

    sparse_label_loss = (args.label_smoothing == 0 and args.mixup == 0)
    loss = gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=sparse_label_loss)
@ -288,10 +280,12 @@ def model_fit(args, net, train_data, eval_metric, optimizer,
    durations = []

    epoch_size = get_epoch_size(args, kv)
+    run_epoch = num_epoch if (run_epoch == -1) else (begin_epoch + run_epoch)

    def transform_data(images, labels):
        if args.mixup != 0:
-            coeffs = mx.nd.array(np.random.beta(args.mixup, args.mixup, size=images.shape[0])).as_in_context(images.context)
+            coeffs = mx.nd.array(np.random.beta(args.mixup, args.mixup, size=images.shape[0])).as_in_context(
+                images.context)
            image_coeffs = coeffs.astype(images.dtype, copy=False).reshape(*coeffs.shape, 1, 1, 1)
            ret_images = image_coeffs * images + (1 - image_coeffs) * images[::-1]

@ -307,21 +301,23 @@ def model_fit(args, net, train_data, eval_metric, optimizer,

        return ret_images, ret_labels

-
+    i = -1
    best_accuracy = -1
-    for epoch in range(begin_epoch, num_epoch):
+    for epoch in range(begin_epoch, min(run_epoch, num_epoch)):
        tic = time.time()
+        btic = time.time()
+        etic = time.time()
+
        train_data.reset()
        eval_metric.reset()
        loss_metric.reset()
-        btic = time.time()

        logging.info('Starting epoch {}'.format(epoch))
        outputs = []
        for i, batches in enumerate(train_data):
            # synchronize to previous iteration
-            for o in outputs:
-                o.wait_to_read()
+            #for o in outputs:
+            #    o.wait_to_read()

            trainer.set_learning_rate(lr_scheduler(epoch + i / epoch_size))

@ -353,40 +349,43 @@ def model_fit(args, net, train_data, eval_metric, optimizer,
            else:
                trainer.step(total_batch_size)

-            if print_loss:
-                loss_metric.update(..., np.mean([l.asnumpy() for l in Ls]).item())
-            eval_metric.update(orig_label, outputs)
+            loss_metric.update(..., np.mean([l.asnumpy() for l in Ls]).item())

            if args.disp_batches and not (i + 1) % args.disp_batches:
-                name, acc = eval_metric.get()
-                if print_loss:
-                    name = [loss_metric.get()[0]] + name
-                    acc = [loss_metric.get()[1]] + acc
+                dllogger_it_data = {
+                    'train.loss': loss_metric.get()[1],
+                    'train.ips': args.disp_batches * total_batch_size / (time.time() - btic),
+                    'train.lr': trainer.learning_rate
+                }
+                dllogger.log((epoch, i), data=dllogger_it_data)

-                logging.info('Epoch[{}] Batch [{}-{}]\tSpeed: {} samples/sec\tLR: {}\t{}'.format(
-                    epoch, (i // args.disp_batches) * args.disp_batches, i,
-                    args.disp_batches * total_batch_size / (time.time() - btic), trainer.learning_rate,
-                    '\t'.join(list(map(lambda x: '{}: {:.6f}'.format(*x), zip(name, acc))))))
-                eval_metric.reset_local()
                loss_metric.reset_local()
                btic = time.time()

            durations.append(time.time() - tic)
            tic = time.time()

-
-        add_metrics_to_report(report, 'train', dict(eval_metric.get_global_name_value()), durations, total_batch_size, loss_metric if print_loss else None)
-
+        dllogger_epoch_data = {
+            'train.loss': loss_metric.get_global()[1],
+            'train.ips': (i + 1) * total_batch_size / (time.time() - etic)
+        }
        if args.mode == 'train_val':
            logging.info('Validating epoch {}'.format(epoch))
-            score = model_score(args, net, eval_data, eval_metric, kvstore, report)
-            for name, value in zip(*score):
-                logging.info('Epoch[{}] Validation {:20}: {}'.format(epoch, name, value))
+            score, duration_stats, _ = model_score(args, net, eval_data, eval_metric, kvstore)
+
+            dllogger_epoch_data.update(
+                starmap(lambda key, val: ('val.{}'.format(key), val), zip(*score))
+            )
+            dllogger_epoch_data.update(
+                starmap(lambda key, val: ('val.{}'.format(key), val), duration_stats.items())
+            )

            score = dict(zip(*score))
            accuracy = score.get('accuracy', -1)
            save_checkpoint(net, epoch, accuracy, best_accuracy, model_prefix, args.save_frequency, kvstore)
            best_accuracy = max(best_accuracy, accuracy)
+        global_metrics.update_dict(dllogger_epoch_data)
+        dllogger.log(step=(epoch,), data=dllogger_epoch_data)


 def fit(args, model, data_loader):
@ -399,11 +398,8 @@ def fit(args, model, data_loader):

    start_time = time.time()

-    report = Report(args.arch, len(args.gpus), sys.argv)
-
    # select gpu for horovod process
    if 'horovod' in args.kv_store:
-        hvd.init()
        args.gpus = [args.gpus[hvd.local_rank()]]

    if args.amp:
@ -494,6 +490,23 @@ def fit(args, model, data_loader):
        params = model.collect_params()
        if params is not None:
            hvd.broadcast_parameters(params, root_rank=0)
+    global_metrics = CompositeMeter()
+    if args.mode in ['train_val', 'train']:
+        global_metrics.register_metric('train.loss', MinMeter())
+        global_metrics.register_metric('train.ips', AvgMeter())
+
+    if args.mode in ['train_val', 'val']:
+        global_metrics.register_metric('val.accuracy', MaxMeter())
+        global_metrics.register_metric('val.top_k_accuracy_5', MaxMeter())
+        global_metrics.register_metric('val.ips', AvgMeter())
+        global_metrics.register_metric('val.latency_avg', AvgMeter())
+
+    if args.mode in ['val']:
+        global_metrics.register_metric('val.latency_50', PercentileMeter(50))
+        global_metrics.register_metric('val.latency_90', PercentileMeter(90))
+        global_metrics.register_metric('val.latency_95', PercentileMeter(95))
+        global_metrics.register_metric('val.latency_99', PercentileMeter(99))
+        global_metrics.register_metric('val.latency_100', PercentileMeter(100))

    # run
    if args.mode in ['train_val', 'train']:
@ -503,30 +516,32 @@ def fit(args, model, data_loader):
            train,
            begin_epoch=args.begin_epoch,
            num_epoch=args.num_epochs,
+            run_epoch=args.run_epochs,
            eval_data=val,
            eval_metric=eval_metrics,
+            global_metrics=global_metrics,
            kvstore=args.kv_store,
            kv=kv,
            optimizer=args.optimizer,
            optimizer_params=optimizer_params,
            lr_scheduler=lr_scheduler,
-            report=report,
-            model_prefix=args.model_prefix,
-            print_loss=not args.no_metrics,
+            model_prefix=os.path.join(args.workspace, args.model_prefix),
        )
    elif args.mode == 'val':
        for epoch in range(args.num_epochs):  # loop for benchmarking
-            score = model_score(args, model, val, eval_metrics, args.kv_store, report=report)
-            for name, value in zip(*score):
-                logging.info('Validation {:20}: {}'.format(name, value))
+            score, duration_stats, durations = model_score(args, model, val, eval_metrics, args.kv_store)
+            dllogger_data = dict(starmap(lambda key, val: ('val.{}'.format(key), val), zip(*score)))
+            dllogger_data.update(
+                starmap(lambda key, val: ('val.{}'.format(key), val), duration_stats.items())
+            )
+            global_metrics.update_dict(dllogger_data)
+            for percentile in [50, 90, 95, 99, 100]:
+                metric_name = 'val.latency_{}'.format(percentile)
+                dllogger_data[metric_name] = np.percentile(durations, percentile)
+                global_metrics.update_metric(metric_name, durations)
+            dllogger.log(step=(epoch,), data=dllogger_data)
    else:
        raise ValueError('Wrong mode')

    mx.nd.waitall()
-
-    report.set_total_duration(time.time() - start_time)
-    if args.report:
-        suffix = '-{}'.format(hvd.rank()) if 'horovod' in args.kv_store and hvd.rank() != 0 else ''
-        report.save(args.report + suffix)
-
-    logging.info('Experiment took: {} sec'.format(report.total_duration))
+    dllogger.log(tuple(), data=global_metrics.get())
--- a/MxNet/Classification/RN50v1.5/global_metrics.py
+++ b/MxNet/Classification/RN50v1.5/global_metrics.py
@ -0,0 +1,99 @@
+import numpy as np
+
+
+class CompositeMeter:
+    def __init__(self):
+        self.register = {}
+
+    def register_metric(self, name, metric):
+        self.register[name] = metric
+
+    def _validate(self, metric_name):
+        if metric_name not in self.register:
+            raise ValueError('{} is not registered metric'.format(metric_name))
+
+    def update_metric(self, metric_name, value):
+        self._validate(metric_name)
+        self.register[metric_name].update(value)
+
+    def update_dict(self, dict_metric):
+        for name, val in dict_metric.items():
+            if name in self.register.keys():
+                self.update_metric(name, val)
+
+    def get(self, metric_name=None):
+        if metric_name is not None:
+            self._validate(metric_name)
+            return self.register[metric_name].get()
+        res_dict = {name: metric.get() for name, metric in self.register.items()}
+        return res_dict
+
+
+class MaxMeter:
+    def __init__(self):
+        self.max = None
+        self.n = 0
+
+    def reset(self):
+        self.max = None
+        self.n = 0
+
+    def update(self, val):
+        if self.max is None:
+            self.max = val
+        else:
+            self.max = max(self.max, val)
+
+    def get(self):
+        return self.max
+
+
+class MinMeter:
+    def __init__(self):
+        self.min = None
+        self.n = 0
+
+    def reset(self):
+        self.min = None
+        self.n = 0
+
+    def update(self, val):
+        if self.min is None:
+            self.min = val
+        else:
+            self.min = min(self.min, val)
+
+    def get(self):
+        return self.min
+
+
+class AvgMeter:
+    def __init__(self):
+        self.sum = 0
+        self.n = 0
+
+    def reset(self):
+        self.sum = 0
+        self.n = 0
+
+    def update(self, val):
+        self.sum += val
+        self.n += 1
+
+    def get(self):
+        return self.sum / self.n
+
+
+class PercentileMeter:
+    def __init__(self, q):
+        self.data = []
+        self.q = q
+
+    def reset(self):
+        self.data = []
+
+    def update(self, data):
+        self.data.extend(data)
+
+    def get(self):
+        return np.percentile(self.data, self.q)
--- a/MxNet/Classification/RN50v1.5/img/residual_diagram.png
+++ b/MxNet/Classification/RN50v1.5/img/residual_diagram.png
--- a/MxNet/Classification/RN50v1.5/log_utils.py
+++ b/MxNet/Classification/RN50v1.5/log_utils.py
@ -0,0 +1,33 @@
+import logging
+import os
+import sys
+
+import dllogger
+import horovod.mxnet as hvd
+
+
+def format_step(step):
+    if isinstance(step, str):
+        return step
+    s = ""
+    if len(step) > 0:
+        s += "Epoch: {} ".format(step[0])
+    if len(step) > 1:
+        s += "Iteration: {} ".format(step[1])
+    if len(step) > 2:
+        s += "Validation Iteration: {} ".format(step[2])
+    if len(step) == 0:
+        s = "Summary:"
+    return s
+
+
+def setup_logging(args):
+    logging.basicConfig(level=logging.DEBUG, format='{asctime}:{levelname}: {message}', style='{')
+    if hvd.rank() == 0:
+        dllogger.init(backends=[
+            dllogger.StdOutBackend(dllogger.Verbosity.DEFAULT, step_format=format_step),
+            dllogger.JSONStreamBackend(
+                dllogger.Verbosity.VERBOSE, os.path.join(args.workspace, args.dllogger_log)),
+        ])
+    else:
+        dllogger.init([])
--- a/MxNet/Classification/RN50v1.5/report.py
+++ b/MxNet/Classification/RN50v1.5/report.py
@ -1,68 +0,0 @@
-# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-# Report JSON file structure:
-# - "model"          : architecture of the model (e.g. "resnet50").
-# - "ngpus"          : number of gpus on which training was performed.
-# - "total_duration" : total duration of training in seconds.
-# - "cmd"            : list of application arguments.
-# - "metrics"        : per epoch metrics for train and validation
-#                      (some of below metrics may not exist in the report,
-#                       depending on application arguments)
-#       - "train.top1"        : training top1 accuracy in epoch.
-#       - "train.top5"        : training top5 accuracy in epoch.
-#       - "train.loss"        : training loss in epoch.
-#       - "train.total_ips"   : training speed (data and compute time taken into account) for epoch in images/sec.
-#       - "train.latency_avg" : average latency of one iteration in seconds.
-#       - "train.latency_50"  : median latency of one iteration in seconds.
-#       - "train.latency_90"  : 90th percentile latency of one iteration in seconds.
-#       - "train.latency_95"  : 95th percentile latency of one iteration in seconds.
-#       - "train.latency_99"  : 99th percentile latency of one iteration in seconds.
-#       - "train.latency_100" : highest observed latency of one iteration in seconds.
-#       - "val.top1", "val.top5", "val.time", "val.total_ips", "val.latency_avg", "val.latency_50",
-#         "val.latency_90", "val.latency_95", "val.latency_99", "val.latency_100"      : the same but for validation.
-
-import json
-from collections import OrderedDict
-
-class Report:
-    def __init__(self, model_name, ngpus, cmd):
-        self.model_name = model_name
-        self.ngpus = ngpus
-        self.cmd = cmd
-        self.total_duration = 0
-        self.metrics = OrderedDict()
-
-    def add_value(self, metric, value):
-        if metric not in self.metrics:
-            self.metrics[metric] = []
-        self.metrics[metric].append(value)
-
-    def set_total_duration(self, duration):
-        self.total_duration = duration
-
-    def save(self, filename):
-        with open(filename, 'w') as f:
-            f.write(self.get_report())
-
-    def get_report(self):
-        report = OrderedDict([
-            ('model', self.model_name),
-            ('ngpus', self.ngpus),
-            ('total_duration', self.total_duration),
-            ('cmd', self.cmd),
-            ('metrics', self.metrics),
-        ])
-        return json.dumps(report, indent=4)
--- a/MxNet/Classification/RN50v1.5/requirements.txt
+++ b/MxNet/Classification/RN50v1.5/requirements.txt
@ -0,0 +1 @@
+git+git://github.com/NVIDIA/dllogger.git@26a0f8f1958de2c0c460925ff6102a4d2486d6cc#egg=dllogger
--- a/MxNet/Classification/RN50v1.5/runner
+++ b/MxNet/Classification/RN50v1.5/runner
@ -25,10 +25,13 @@ optparser.add_argument('-b', '--batch-size', type=int, default=192, help='batch
 optparser.add_argument('-e', '--num-epochs', type=int, default=90, help='number of epochs')
 optparser.add_argument('-l', '--lr', type=float, default=0.256, help='learning rate; '
                       'IMPORTANT: true learning rate will be calculated as `lr * batch_size / 256`')
-optparser.add_argument('--data-root', type=Path, help='Directory with RecordIO data files', default=Path('/data/imagenet/train-val-recordio-passthrough'))
+optparser.add_argument('--data-root', type=Path, help='Directory with RecordIO data files',
+                       default=Path('/data/imagenet/train-val-recordio-passthrough'))
 optparser.add_argument('--dtype', help='Precision', default='float16', choices=('float32', 'float16'))
 optparser.add_argument('--kv-store', default='horovod', choices=('device', 'horovod'), help='key-value store type')
-optparser.add_argument('--data-backend', default='dali-gpu', choices=('dali-gpu', 'dali-cpu', 'mxnet', 'synthetic'), help='data backend')
+optparser.add_argument('--data-backend', default='dali-gpu', choices=('dali-gpu', 'dali-cpu', 'mxnet', 'synthetic'),
+                       help='data backend')
+optparser.add_argument('--launcher', default='horovod', choices=('horovod', 'slurm'), help='type of launcher')

 opts, args = optparser.parse_known_args()

@ -41,7 +44,7 @@ opts.batch_size *= opts.ngpus
 opts.lr *= opts.batch_size / 256

 command = []
-if 'horovod' in opts.kv_store:
+if 'horovod' in opts.kv_store and opts.launcher == 'horovod':
    command += ['horovodrun', '-np', str(opts.ngpus)]
 command += ['python', str(Path(__file__).parent / "train.py")]
 command += ['--data-train', str(opts.data_root / "train.rec")]
@ -73,7 +76,6 @@ os.environ['MXNET_GPU_COPY_NTHREADS'] = "1"
 os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = "54"
 os.environ['HOROVOD_CYCLE_TIME'] = "0.1"
 os.environ['HOROVOD_FUSION_THRESHOLD'] = "67108864"
-os.environ['HOROVOD_NUM_NCCL_STREAMS'] = "2"
 os.environ['MXNET_HOROVOD_NUM_GROUPS'] = "16"
 os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD'] = "999"
 os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD'] = "25"
--- a/MxNet/Classification/RN50v1.5/train.py
+++ b/MxNet/Classification/RN50v1.5/train.py
@ -33,16 +33,17 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import os
-import sys
 import argparse
-import logging
-import mxnet as mx
-import numpy as np

-import data, dali
+import dllogger
+import horovod.mxnet as hvd
+
+import dali
+import data
 import fit
 import models
+from log_utils import setup_logging
+

 def parse_args():
    parser = argparse.ArgumentParser(description="Train classification models on ImageNet",
@ -54,15 +55,14 @@ def parse_args():
    data.add_data_aug_args(parser)
    return parser.parse_args()

-def setup_logging(args):
-    head = '{asctime}:{levelname}: {message}'
-    logging.basicConfig(level=logging.DEBUG, format=head, style='{',
-                        handlers=[logging.StreamHandler(sys.stderr), logging.FileHandler(args.log)])
-    logging.info('Start with arguments {}'.format(args))

 if __name__ == '__main__':
    args = parse_args()
+    if 'horovod' in args.kv_store:
+        hvd.init()
+
    setup_logging(args)
+    dllogger.log(step='PARAMETER', data=vars(args))

    model = models.get_model(**vars(args))
    data_loader = data.get_data_loader(args)
				`@ -0,0 +1 @@`
				`git+git://github.com/NVIDIA/dllogger.git@26a0f8f1958de2c0c460925ff6102a4d2486d6cc#egg=dllogger`