Updating RN50/MxNet

This commit is contained in:
Przemek Strzelczyk 2019-10-21 19:20:40 +02:00
parent f2fe0904cf
commit e470c2150a
47 changed files with 3226 additions and 1503 deletions


@@ -0,0 +1,3 @@
FROM nvcr.io/nvidia/mxnet:19.07-py3
COPY . /workspace/rn50
WORKDIR /workspace/rn50


@@ -1,3 +1,4 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/


@@ -1,6 +1,46 @@
# ResNet50 v1.5 for MXNet
This repository provides a script and recipe to train the ResNet50 v1.5 model to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
- [Model overview](#model-overview)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
* [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32g)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX-1 (8x V100 16G)](#inference-performance-nvidia-dgx-1-8x-v100-16g)
* [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
@@ -9,96 +49,448 @@
The difference between v1 and v1.5 is in the bottleneck blocks which require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.
This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 3.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Default configuration
**Optimizer:**
* SGD with momentum (0.875)
* Learning rate = 0.256 for 256 batch size; for other batch sizes we linearly scale the learning rate.
* Learning rate schedule -- we use cosine LR schedule (see the sketch after this list)
* Linear warmup of the learning rate during first 5 epochs according to [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
* Weight decay: 3.0517578125e-05 (1/32768).
* We do not apply WD on Batch Norm trainable parameters (gamma/bias).
* Label smoothing: 0.1
* We train for:
  * 50 epochs -> configuration that reaches 75.9% top1 accuracy
  * 90 epochs -> the standard configuration for ResNet50
  * 250 epochs -> best possible accuracy. For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
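A minimal sketch of the schedule described above (linear batch-size scaling, 5-epoch linear warmup, cosine decay). The function below is illustrative, not the repository's implementation:
```python
import math

def learning_rate(epoch, batch_size, base_lr=0.256, warmup_epochs=5, total_epochs=90):
    """LR at a given (fractional) epoch: linear warmup, then cosine decay."""
    scaled_lr = base_lr * batch_size / 256          # linear scaling rule
    if epoch < warmup_epochs:
        return scaled_lr * epoch / warmup_epochs    # linear warmup
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return scaled_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(learning_rate(5.0, 192))   # peak LR: 0.256 * 192 / 256 = 0.192
print(learning_rate(90.0, 192))  # ~0.0 at the end of training
```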
**Data augmentation:**
This model uses the following data augmentation:
* For training:
  * Normalization
  * Random resized crop to 224x224
    * Scale from 8% to 100%
    * Aspect ratio from 3/4 to 4/3
  * Random horizontal flip
* For inference:
  * Normalization
  * Scale to 256x256
  * Center crop to 224x224
### Feature support matrix
| **Feature** | **ResNet50 MXNet** |
|:---:|:--------:|
|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)|yes|
|Horovod Multi-GPU|yes|
#### Features
The following features are supported by this model.
NVIDIA DALI - NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications.
Horovod Multi-GPU - Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
#### Enabling mixed precision
Using the Gluon API, ensure you perform the following steps to convert a model that supports computation with float16.
1. Cast Gluon Blocks parameters and expected input type to float16 by calling the cast method of the Block representing the network.
```python
net = net.cast('float16')
```
2. Ensure the data input to the network is of float16 type. If your DataLoader or Iterator produces output in another datatype, then you have to cast your data. There are different ways you can do this. The easiest way is to use the `astype` method of NDArrays.
```python
data = data.astype('float16', copy=False)
```
3. If you are using images and DataLoader, you can also use a Cast transform. It is preferable to use multi_precision mode of optimizer when training in float16. This mode of optimizer maintains a master copy of the weights in float32 even when the training (forward and backward pass) is in float16. This helps increase precision of the weight updates and can lead to faster convergence in some scenarios.
```python
optimizer = mx.optimizer.create('sgd', multi_precision=True, lr=0.01)
```
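Taken together, the three steps look roughly like the following standalone sketch. It uses a model-zoo ResNet50 and synthetic data for brevity; the network, shapes and hyperparameters are placeholders rather than this repository's settings:
```python
import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu(0)
net = gluon.model_zoo.vision.resnet50_v2(classes=1000)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.cast('float16')  # step 1: cast parameters and expected input type

# step 3: the optimizer keeps an FP32 master copy of the FP16 weights
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.01, 'momentum': 0.9, 'multi_precision': True})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

data = mx.nd.random.uniform(shape=(8, 3, 224, 224), ctx=ctx).astype('float16', copy=False)  # step 2
label = mx.nd.zeros((8,), ctx=ctx, dtype='float16')

with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(batch_size=8)
```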
## Setup
The following section lists the requirements in order to start training the ResNet50 v1.5 model.
### Requirements
This repository contains a Dockerfile that extends the MXNet NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [MXNet 19.07-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia%2Fmxnet)
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
- [Running MXNet](https://docs.nvidia.com/deeplearning/dgx/mxnet-release-notes/running.html#running)
For those unable to use the MXNet NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
**1. Clone the repository.**
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/MxNet/Classification/RN50v1.5
```
**2. Build the ResNet50 MXNet NGC container.**
After Docker is set up, you can build the ResNet50 image with:
```bash
docker build . -t nvidia_rn50_mx
```
**3. Start an interactive session in the NGC container to run preprocessing/training/inference.**
```bash
nvidia-docker run --rm -it --ipc=host -v <path to dataset>:/data/imagenet/train-val-recordio-passthrough nvidia_rn50_mx
```
**4. Download and preprocess the data.**
* Download the images from http://image-net.org/download-images.
* Extract the training and validation data:
```bash
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..
```
**5. Extract the validation data and move the images to subfolders.**
```bash
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
```
**6. Preprocess the dataset.**
```bash
./scripts/prepare_imagenet.sh <path to raw imagenet> <path where processed dataset will be created>
```
**7. Start training.**
```bash
./runner -n <number of gpus> -b <batch size per GPU (default 192)>
```
**8. Start validation/evaluation.**
```bash
./runner -n <number of gpus> -b <batch size per GPU (default 192)> --load <path to trained model> --mode val
```
**9. Start inference/predictions.**
```bash
./runner --load <path to trained model> --mode pred --data-pred <path to the image>
```
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
* `runner`: A wrapper on the `train.py` script which is the main executable script for training/validation/predicting
* `benchmark.py`: A script for benchmarking
* `Dockerfile`: Instructions to build the ResNet50 MXNet container
* `fit.py`: A file containing most of the training and validation logic
* `data.py`: Data loading and preprocessing code
* `dali.py`: Data loading and preprocessing code using DALI
* `models.py`: The model architecture
* `report.py`: A file containing JSON report structure and description of fields
In the `scripts` directory, the most important files are:
* `prepare_imagenet.sh`: A script that converts raw dataset format to RecordIO format
### Parameters
The complete list of available parameters contains:
```
Model:
--arch {resnetv1,resnetv15,resnextv1,resnextv15,xception}
model architecture (default: resnetv15)
--num-layers NUM_LAYERS
number of layers in the neural network, required by
some networks such as resnet (default: 50)
--num-groups NUM_GROUPS
number of groups for grouped convolutions, required by
some networks such as resnext (default: 32)
--num-classes NUM_CLASSES
the number of classes (default: 1000)
--batchnorm-eps BATCHNORM_EPS
the amount added to the batchnorm variance to prevent
output explosion. (default: 1e-05)
--batchnorm-mom BATCHNORM_MOM
the leaky-integrator factor controlling the batchnorm
mean and variance. (default: 0.9)
--fuse-bn-relu FUSE_BN_RELU
have batchnorm kernel perform activation relu
(default: 0)
--fuse-bn-add-relu FUSE_BN_ADD_RELU
have batchnorm kernel perform add followed by
activation relu (default: 0)
Training:
--mode {train_val,train,val,pred}
mode (default: train_val)
--seed SEED random seed (default: None)
-n NGPUS, --ngpus NGPUS
number of GPUs to use (default: 1)
--kv-store {device,horovod}
key-value store type (default: horovod)
--dtype {float32,float16}
Precision (default: float16)
--amp If enabled, turn on AMP (Automatic Mixed Precision)
(default: False)
-b BATCH_SIZE, --batch-size BATCH_SIZE
batch size per GPU (default: 192)
-e NUM_EPOCHS, --num-epochs NUM_EPOCHS
number of epochs (default: 90)
-l LR, --lr LR learning rate; IMPORTANT: true learning rate will be
calculated as `lr * batch_size / 256` (default: 0.256)
--lr-schedule {multistep,cosine}
learning rate schedule (default: cosine)
--lr-factor LR_FACTOR
the ratio to reduce lr on each step (default: 0.256)
--lr-steps LR_STEPS the epochs to reduce the lr, e.g. 30,60 (default: [])
--warmup-epochs WARMUP_EPOCHS
the epochs to ramp-up lr to scaled large-batch value
(default: 5)
--optimizer OPTIMIZER
the optimizer type (default: sgd)
--mom MOM momentum for sgd (default: 0.875)
--wd WD weight decay for sgd (default: 3.0517578125e-05)
--label-smoothing LABEL_SMOOTHING
label smoothing factor (default: 0.1)
--mixup MIXUP alpha parameter for mixup (if 0 then mixup is not
applied) (default: 0)
--disp-batches DISP_BATCHES
show progress for every n batches (default: 20)
--model-prefix MODEL_PREFIX
model checkpoint prefix (default: model)
--save-frequency SAVE_FREQUENCY
frequency of saving model in epochs (--model-prefix
must be specified). If -1 then save only best model.
If 0 then do not save anything. (default: -1)
--begin-epoch BEGIN_EPOCH
start the model from an epoch (default: 0)
--load LOAD checkpoint to load (default: None)
--test-io test reading speed without training (default: False)
--test-io-mode {train,val}
data to test (default: train)
--log LOG file where to save the log from the experiment
(default: log.log)
--report REPORT file where to save report (default: report.json)
--no-metrics do not calculate evaluation metrics (for benchmarking)
(default: False)
--benchmark-iters BENCHMARK_ITERS
run only benchmark-iters iterations from each epoch
(default: None)
Data:
--data-root DATA_ROOT
Directory with RecordIO data files (default:
/data/imagenet/train-val-recordio-passthrough)
--data-backend {dali,mxnet,synthetic}
data backend (default: dali)
--image-shape IMAGE_SHAPE
the image shape feed into the network (default: [3,
224, 224])
--rgb-mean RGB_MEAN a tuple of size 3 for the mean rgb (default: [123.68,
116.779, 103.939])
--rgb-std RGB_STD a tuple of size 3 for the std rgb (default: [58.393,
57.12, 57.375])
--input-layout {NCHW,NHWC}
the layout of the input data (default: NCHW)
--conv-layout {NCHW,NHWC}
the layout of the data assumed by the conv operation
(default: NCHW)
--batchnorm-layout {NCHW,NHWC}
the layout of the data assumed by the batchnorm
operation (default: NCHW)
--pooling-layout {NCHW,NHWC}
the layout of the data assumed by the pooling
operation (default: NCHW)
--num-examples NUM_EXAMPLES
the number of training examples (doesn't work with
mxnet data backend) (default: 1281167)
--data-val-resize DATA_VAL_RESIZE
base length of shorter edge for validation dataset
(default: 256)
DALI data backend:
entire group applies only to dali data backend
--dali-separ-val each process will perform independent validation on
whole val-set (default: False)
--dali-threads DALI_THREADS
number of threads per GPU for DALI (default: 3)
--dali-validation-threads DALI_VALIDATION_THREADS
number of threads per GPU for DALI for validation
(default: 10)
--dali-prefetch-queue DALI_PREFETCH_QUEUE
DALI prefetch queue depth (default: 2)
--dali-nvjpeg-memory-padding DALI_NVJPEG_MEMORY_PADDING
Memory padding value for nvJPEG (in MB) (default: 64)
MXNet data backend:
entire group applies only to mxnet data backend
--data-mxnet-threads DATA_MXNET_THREADS
number of threads for data decoding for mxnet data
backend (default: 40)
--random-crop RANDOM_CROP
whether or not to randomly crop the image (default: 0)
--random-mirror RANDOM_MIRROR
whether or not to randomly flip horizontally (default: 1)
--max-random-h MAX_RANDOM_H
max change of hue, whose range is [0, 180] (default:
0)
--max-random-s MAX_RANDOM_S
max change of saturation, whose range is [0, 255]
(default: 0)
--max-random-l MAX_RANDOM_L
max change of intensity, whose range is [0, 255]
(default: 0)
--min-random-aspect-ratio MIN_RANDOM_ASPECT_RATIO
min value of aspect ratio, whose value is either None
or a positive value. (default: 0.75)
--max-random-aspect-ratio MAX_RANDOM_ASPECT_RATIO
max value of aspect ratio. If min_random_aspect_ratio
is None, the aspect ratio range is
[1-max_random_aspect_ratio,
1+max_random_aspect_ratio], otherwise it is
[min_random_aspect_ratio, max_random_aspect_ratio].
(default: 1.33)
--max-random-rotate-angle MAX_RANDOM_ROTATE_ANGLE
max angle to rotate, whose range is [0, 360] (default:
0)
--max-random-shear-ratio MAX_RANDOM_SHEAR_RATIO
max ratio to shear, whose range is [0, 1] (default: 0)
--max-random-scale MAX_RANDOM_SCALE
max ratio to scale (default: 1)
--min-random-scale MIN_RANDOM_SCALE
min ratio to scale; should be >= img_size/input_shape,
otherwise use --pad-size (default: 1)
--max-random-area MAX_RANDOM_AREA
max area to crop in random resized crop, whose range
is [0, 1] (default: 1)
--min-random-area MIN_RANDOM_AREA
min area to crop in random resized crop, whose range
is [0, 1] (default: 0.05)
--min-crop-size MIN_CROP_SIZE
Crop both width and height into a random size in
[min_crop_size, max_crop_size] (default: -1)
--max-crop-size MAX_CROP_SIZE
Crop both width and height into a random size in
[min_crop_size, max_crop_size] (default: -1)
--brightness BRIGHTNESS
brightness jittering, whose range is [0, 1] (default:
0)
--contrast CONTRAST contrast jittering, whose range is [0, 1] (default: 0)
--saturation SATURATION
saturation jittering, whose range is [0, 1] (default:
0)
--pca-noise PCA_NOISE
pca noise, whose range is [0, 1] (default: 0)
--random-resized-crop RANDOM_RESIZED_CROP
whether to use random resized crop (default: 1)
```
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option: `./runner --help` and `python train.py --help`. `./runner` acts as a wrapper on `train.py` and all additional flags will be passed to `train.py`.
`./runner` command-line options:
```
usage: runner [-h] [-n NGPUS] [-b BATCH_SIZE] [-e NUM_EPOCHS] [-l LR]
[--data-root DATA_ROOT] [--dtype {float32,float16}]
[--kv-store {device,horovod}]
[--data-backend {dali,mxnet,synthetic}]
```
`train.py` command-line options:
```
usage: train.py [-h]
[--arch {resnetv1,resnetv15,resnextv1,resnextv15,xception}]
[--num-layers NUM_LAYERS] [--num-groups NUM_GROUPS]
[--num-classes NUM_CLASSES] [--batchnorm-eps BATCHNORM_EPS]
[--batchnorm-mom BATCHNORM_MOM] [--fuse-bn-relu FUSE_BN_RELU]
[--fuse-bn-add-relu FUSE_BN_ADD_RELU]
[--mode {train_val,train,val,pred}] [--seed SEED]
[--gpus GPUS] [--kv-store {device,horovod}]
[--dtype {float32,float16}] [--amp] [--batch-size BATCH_SIZE]
[--num-epochs NUM_EPOCHS] [--lr LR]
[--lr-schedule {multistep,cosine}] [--lr-factor LR_FACTOR]
[--lr-steps LR_STEPS] [--warmup-epochs WARMUP_EPOCHS]
[--optimizer OPTIMIZER] [--mom MOM] [--wd WD]
[--label-smoothing LABEL_SMOOTHING] [--mixup MIXUP]
[--disp-batches DISP_BATCHES] [--model-prefix MODEL_PREFIX]
[--save-frequency SAVE_FREQUENCY] [--begin-epoch BEGIN_EPOCH]
[--load LOAD] [--test-io] [--test-io-mode {train,val}]
[--log LOG] [--report REPORT] [--no-metrics]
[--benchmark-iters BENCHMARK_ITERS] [--data-train DATA_TRAIN]
[--data-train-idx DATA_TRAIN_IDX] [--data-val DATA_VAL]
[--data-val-idx DATA_VAL_IDX] [--data-pred DATA_PRED]
[--data-backend {dali,mxnet,synthetic}]
[--image-shape IMAGE_SHAPE] [--rgb-mean RGB_MEAN]
[--rgb-std RGB_STD] [--input-layout {NCHW,NHWC}]
[--conv-layout {NCHW,NHWC}] [--batchnorm-layout {NCHW,NHWC}]
[--pooling-layout {NCHW,NHWC}] [--num-examples NUM_EXAMPLES]
[--data-val-resize DATA_VAL_RESIZE] [--dali-separ-val]
[--dali-threads DALI_THREADS]
[--dali-validation-threads DALI_VALIDATION_THREADS]
[--dali-prefetch-queue DALI_PREFETCH_QUEUE]
[--dali-nvjpeg-memory-padding DALI_NVJPEG_MEMORY_PADDING]
[--data-mxnet-threads DATA_MXNET_THREADS]
[--random-crop RANDOM_CROP] [--random-mirror RANDOM_MIRROR]
[--max-random-h MAX_RANDOM_H] [--max-random-s MAX_RANDOM_S]
[--max-random-l MAX_RANDOM_L]
[--min-random-aspect-ratio MIN_RANDOM_ASPECT_RATIO]
[--max-random-aspect-ratio MAX_RANDOM_ASPECT_RATIO]
[--max-random-rotate-angle MAX_RANDOM_ROTATE_ANGLE]
[--max-random-shear-ratio MAX_RANDOM_SHEAR_RATIO]
[--max-random-scale MAX_RANDOM_SCALE]
[--min-random-scale MIN_RANDOM_SCALE]
[--max-random-area MAX_RANDOM_AREA]
[--min-random-area MIN_RANDOM_AREA]
[--min-crop-size MIN_CROP_SIZE]
[--max-crop-size MAX_CROP_SIZE] [--brightness BRIGHTNESS]
[--contrast CONTRAST] [--saturation SATURATION]
[--pca-noise PCA_NOISE]
[--random-resized-crop RANDOM_RESIZED_CROP]
```
### Getting the data
The MXNet ResNet50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from ILSVRC challenge.
You can download the images from http://image-net.org/download-images.
The recommended data format is
[RecordIO](http://mxnet.io/architecture/note_data_loading.html), which
@@ -106,7 +498,7 @@ concatenates multiple examples into seekable binary files for better read
efficiency. MXNet provides a tool called `im2rec.py` located in the `/opt/mxnet/tools/` directory.
The tool converts individual images into `.rec` files.
To prepare a RecordIO file containing ImageNet data, we first need to create `.lst` files
which consist of the labels and image paths. We assume that the original images were
downloaded to `/data/imagenet/raw/train-jpeg` and `/data/imagenet/raw/val-jpeg`.
@@ -115,121 +507,216 @@
```bash
python /opt/mxnet/tools/im2rec.py --list --recursive train /data/imagenet/raw/train-jpeg
python /opt/mxnet/tools/im2rec.py --list --recursive val /data/imagenet/raw/val-jpeg
```
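For reference, each line of a `.lst` file produced by `im2rec.py --list` is tab-separated: an integer index, the label, and the image path. A tiny illustration (the synset path is an example, not a required name):
```python
# One line of a .lst file: <index>\t<label>\t<relative image path>
sample = "0\t0\tn01440764/n01440764_10026.JPEG"
index, label, path = sample.split('\t')
print(index, label, path)
```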
Next, we generate the `.rec` (RecordIO files with data) and `.idx` (indexes required by DALI
to speed up data loading) files. To obtain the best training accuracy
we do not preprocess the images when creating the RecordIO file.
```bash
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 train /data/imagenet/raw/train-jpeg
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 val /data/imagenet/raw/val-jpeg
```
#### Dataset guidelines
The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data.py` and `dali.py` files.
The data is read from RecordIO format, which concatenates multiple examples into seekable binary files for better read efficiency.
Data augmentation techniques are described in the [Default configuration](#default-configuration) section.
#### Multi-dataset
In most cases, to train a model on a different dataset, no changes in the code are required, but the dataset has to be converted into RecordIO format.
To convert a custom dataset, follow the steps from the [Getting the data](#getting-the-data) section, and refer to the `scripts/prepare_dataset.py` script. A quick sanity check of the converted files is shown below.
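A converted `.rec`/`.idx` pair can be opened directly with MXNet's RecordIO reader; a minimal check (file names are assumptions):
```python
import mxnet as mx

# Read the first record by index and decode its JPEG payload (requires cv2).
record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'r')
header, img = mx.recordio.unpack_img(record.read_idx(0))
print(header.label, img.shape)
```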
### Training process
To start training, run:
`./runner -n <number of gpus> -b <batch size per GPU> --data-root <path to imagenet> --dtype <float32 or float16>`
By default the training script runs the validation after each epoch:
* the best checkpoint will be stored in the `model_best.params` file in the working directory
* the log from training will be saved in the `log.log` file in the working directory
* the JSON report with statistics will be saved in the `report.json` file in the working directory
If ImageNet is mounted in the `/data/imagenet/train-val-recordio-passthrough` directory, you don't have to specify the `--data-root` flag.
### Inference process
To start validation, run:
`./runner -n <number of gpus> -b <batch size per GPU> --data-root <path to imagenet> --dtype <float32 or float16> --mode val`
By default:
* the log from validation will be saved in the `log.log` file in the working directory
* the JSON report with statistics will be saved in the `report.json` file in the working directory
## Performance
### Benchmarking
To benchmark training and inference, run:
`python benchmark.py -n <numbers of gpus separated by comma> -b <batch sizes per gpu separated by comma> --data-root <path to imagenet> --dtype <float32 or float16> -o <path to benchmark report>`
To control the benchmark length per epoch, use the `-i` flag (defaults to 100 iterations).
To control the number of epochs, use the `-e` flag.
To control the number of warmup epochs (epochs which are not taken into account), use the `-w` flag.
To limit the length of the dataset, use the `--num-examples` flag.
By default, the same parameters as in `./runner` will be used. Additional flags will be passed to `./runner`.
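A sketch of consuming the resulting report; the nesting (GPU count, then batch size, then metric) follows the structure written by `benchmark.py`, while the file name is an assumption:
```python
import json

with open('benchmark_report_fp16.json') as f:
    report = json.load(f)
# metrics are keyed by GPU count, then batch size, then metric name
print(report['metrics']['8']['192']['train.total_ips'], 'img/s')
```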
#### Training performance benchmark
To benchmark only training, use the `--mode train` flag.
#### Inference performance benchmark
To benchmark only inference, use the `--mode val` flag.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
**90 epochs configuration**
Our results were obtained by running the `./runner -n <number of gpus> -b 96 --dtype float32` script for FP32 and the `./runner -n <number of gpus> -b 192` script for mixed precision in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
| **GPUs** | **Accuracy - mixed precision** | **Accuracy - FP32** | **Time to train - mixed precision (hours)** | **Time to train - FP32 (hours)** | **Time to train - speedup** |
|:---:|:---:|:---:|:---:|:---:|:---:|
|1|77.208|77.160|24.2|84.5|3.49|
|4|77.296|77.280|6.0|21.4|3.59|
|8|77.308|77.292|3.0|10.7|3.54|
##### Training stability test
Our results were obtained by running the following commands 8 times with different seeds.
* For 50 epochs
* `./runner -n 8 -b 96 --dtype float32 --num-epochs 50` for FP32
* `./runner -n 8 -b 192 --num-epochs 50` for mixed precision
* For 90 epochs
* `./runner -n 8 -b 96 --dtype float32` for FP32
* `./runner -n 8 -b 192` for mixed precision
* For 250 epochs
* `./runner -n 8 -b 96 --dtype float32 --num-epochs 250 --mixup 0.2` for FP32
* `./runner -n 8 -b 192 --num-epochs 250 --mixup 0.2` for mixed precision
| **# of epochs** | **mixed precision avg top1** | **FP32 avg top1** | **mixed precision standard deviation** | **FP32 standard deviation** | **mixed precision minimum top1** | **FP32 minimum top1** | **mixed precision maximum top1** | **FP32 maximum top1** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|50|76.156|76.185|0.118|0.082|76.010|76.062|76.370|76.304|
|90|77.105|77.224|0.097|0.060|76.982|77.134|77.308|77.292|
|250|78.317|78.400|0.073|0.102|78.202|78.316|78.432|78.570|
Here are example graphs of FP32 and mixed precision training on the 8 GPU, 250 epochs configuration:
![TrainingLoss](./img/dgx1-16g_250e_training_loss.png)
![ValidationTop1](./img/dgx1-16g_250e_validation_top1.png)
![ValidationTop5](./img/dgx1-16g_250e_validation_top5.png)
#### Training performance results
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
The following results were obtained by running the
`python benchmark.py -n 1,2,4,8 -b 192 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision and the
`python benchmark.py -n 1,2,4,8 -b 96 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
Training performance reported as Total IPS (data + compute time taken into account).
Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.
| **GPUs** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:---:|:---:|:---:|:---:|:---:|:---:|
|1|1427|385|3.71|1.00|1.00|
|2|2820|768|3.67|1.98|2.00|
|4|5560|1513|3.68|3.90|3.93|
|8|10931|3023|3.62|7.66|7.86|
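Weak scaling in these tables is simply throughput at N GPUs divided by throughput at 1 GPU; a quick check against the mixed precision column above:
```python
fp16_ips = {1: 1427, 2: 2820, 4: 5560, 8: 10931}
for n, ips in fp16_ips.items():
    print(n, 'GPUs:', round(ips / fp16_ips[1], 2))  # 1.0, 1.98, 3.9, 7.66
```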
##### Training performance: NVIDIA DGX-2 (16x V100 32G)
The following results were obtained by running the
`python benchmark.py -n 1,4,8,16 -b 256 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision and the
`python benchmark.py -n 1,4,8,16 -b 128 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-2 with (16x V100 32G) GPUs.
Training performance reported as Total IPS (data + compute time taken into account).
Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.
| **GPUs** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:---:|:---:|:---:|:---:|:---:|:---:|
|1|1438|409|3.52|1.00|1.00|
|2|2868|817|3.51|1.99|2.00|
|4|5624|1617|3.48|3.91|3.96|
|8|11174|3214|3.48|7.77|7.86|
|16|20530|6356|3.23|14.28|15.54|
#### Inference performance results
##### Inference performance: NVIDIA DGX-1 (8x V100 16G)
The following results were obtained by running the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
Inference performance reported as Total IPS (data + compute time taken into account).
Reported mixed precision speedups are relative to FP32 numbers for corresponding configuration.
| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** | **100% latency (ms) - mixed precision** | **100% latency - speedup** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 397 | 1.65 | 2.5 | 1.65 | 2.5 | 1.67 | 2.7 | 1.59 | 2.8 | 1.56 | 3.2 | 1.51 | 15.8 | 0.84 |
| 2 | 732 | 1.81 | 2.7 | 1.81 | 2.6 | 1.88 | 3.0 | 1.67 | 3.3 | 1.52 | 4.9 | 1.10 | 18.8 | 0.83 |
| 4 | 1269 | 2.08 | 3.2 | 2.08 | 3.0 | 2.21 | 3.5 | 1.92 | 4.0 | 1.72 | 7.5 | 0.97 | 14.5 | 0.54 |
| 8 | 2012 | 2.53 | 4.0 | 2.53 | 3.9 | 2.59 | 4.2 | 2.45 | 4.4 | 2.37 | 8.3 | 1.29 | 15.3 | 0.72 |
| 16 | 2667 | 2.64 | 6.0 | 2.64 | 5.9 | 2.66 | 6.3 | 2.54 | 6.4 | 2.52 | 8.3 | 2.02 | 16.9 | 1.05 |
| 32 | 3240 | 2.86 | 9.9 | 2.86 | 9.8 | 2.87 | 10.3 | 2.79 | 10.4 | 2.76 | 11.5 | 2.53 | 28.4 | 1.12 |
| 64 | 3776 | 3.10 | 17.0 | 3.10 | 17.0 | 3.09 | 17.5 | 3.03 | 17.7 | 3.01 | 18.1 | 3.01 | 18.7 | 2.99 |
| 128 | 3734 | 3.02 | 34.3 | 3.02 | 33.8 | 3.05 | 35.5 | 2.93 | 36.3 | 2.88 | 42.4 | 2.79 | 51.7 | 2.38 |
| 192 | 3641 | 2.90 | 52.7 | 2.90 | 52.4 | 2.90 | 55.2 | 2.77 | 56.2 | 2.74 | 65.4 | 2.76 | 77.1 | 2.41 |
| 256 | 3463 | 2.73 | 73.9 | 2.73 | 72.8 | 2.75 | 77.3 | 2.61 | 79.9 | 2.54 | 100.8 | 2.39 | 104.1 | 2.35 |
##### Inference performance: NVIDIA T4
The following results were obtained by running the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-19.07-py3 NGC container on an NVIDIA T4 GPU.
Inference performance reported as Total IPS (data + compute time taken into account).
Reported mixed precision speedups are relative to FP32 numbers for corresponding configuration.
| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** | **100% latency (ms) - mixed precision** | **100% latency - speedup** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 348 | 1.88 | 2.9 | 1.88 | 2.8 | 1.91 | 2.9 | 1.88 | 3.0 | 1.90 | 3.9 | 1.82 | 17.6 | 0.74 |
| 2 | 594 | 2.30 | 3.4 | 2.30 | 3.3 | 2.35 | 3.4 | 2.34 | 3.5 | 2.38 | 5.7 | 1.55 | 20.2 | 0.74 |
| 4 | 858 | 2.93 | 4.7 | 2.93 | 4.6 | 2.97 | 4.9 | 2.86 | 5.0 | 2.81 | 6.0 | 2.46 | 13.7 | 1.12 |
| 8 | 1047 | 3.17 | 7.6 | 3.17 | 7.6 | 3.19 | 7.9 | 3.10 | 8.2 | 3.02 | 9.1 | 2.77 | 15.0 | 1.72 |
| 16 | 1163 | 3.16 | 13.8 | 3.16 | 13.7 | 3.17 | 14.1 | 3.13 | 14.4 | 3.07 | 15.4 | 2.90 | 17.5 | 2.62 |
| 32 | 1225 | 3.22 | 26.1 | 3.22 | 26.1 | 3.22 | 27.0 | 3.15 | 27.3 | 3.12 | 28.3 | 3.05 | 30.5 | 2.89 |
| 64 | 1230 | 3.15 | 52.0 | 3.15 | 51.8 | 3.16 | 52.9 | 3.12 | 53.3 | 3.10 | 54.4 | 3.08 | 58.8 | 2.90 |
| 128 | 1260 | 3.21 | 101.6 | 3.21 | 101.3 | 3.22 | 102.7 | 3.21 | 103.2 | 3.20 | 115.0 | 2.89 | 121.8 | 2.86 |
| 192 | 1252 | 3.20 | 153.3 | 3.20 | 153.1 | 3.20 | 154.7 | 3.19 | 155.5 | 3.21 | 156.9 | 3.20 | 182.3 | 2.81 |
| 256 | 1251 | 3.22 | 204.6 | 3.22 | 204.3 | 3.23 | 206.4 | 3.21 | 207.1 | 3.21 | 209.3 | 3.18 | 241.9 | 2.76 |
## Release notes
### Changelog
1. Dec, 2018
* Initial release (based on https://github.com/apache/incubator-mxnet/tree/master/example/image-classification)
2. June, 2019
* Code refactor
* Label smoothing
* Cosine LR schedule
* MixUp regularization
* Better configurations
### Known Issues
There are no known issues with this model.

MxNet/Classification/RN50v1.5/benchmark.py Normal file → Executable file

@@ -1,3 +1,5 @@
#!/usr/bin/env python3
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -18,14 +20,21 @@ import sys
import tempfile
import json
import os
import traceback
import numpy as np
from collections import OrderedDict
from subprocess import Popen
def int_list(x):
    return list(map(int, x.split(',')))

parser = argparse.ArgumentParser(description='Benchmark',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--executable', default='./runner', help='path to runner')
parser.add_argument('-o', '--output', metavar='OUT', required=True, help="path to benchmark report")
parser.add_argument('-n', '--ngpus', metavar='N1,[N2,...]', type=int_list,
                    required=True, help='numbers of gpus separated by comma')
parser.add_argument('-b', '--batch-sizes', metavar='B1,[B2,...]', type=int_list,
                    required=True, help='batch sizes separated by comma')
parser.add_argument('-i', '--benchmark-iters', metavar='I',
                    type=int, default=100, help='iterations')
parser.add_argument('-e', '--epochs', metavar='E',
                    type=int, default=1, help='number of epochs')
parser.add_argument('-w', '--warmup', metavar='N',
                    type=int, default=0, help='warmup epochs')
parser.add_argument('--timeout', metavar='T',
                    type=str, default='inf', help='timeout for each run')
parser.add_argument('--mode', metavar='MODE', choices=('train_val', 'train', 'val'), default='train_val',
                    help="benchmark mode")
args, other_args = parser.parse_known_args()

latency_percentiles = ['avg', 50, 90, 95, 99, 100]
harmonic_mean_metrics = ['train.total_ips', 'val.total_ips']

res = OrderedDict()
res['model'] = ''
res['ngpus'] = args.ngpus
res['bs'] = args.batch_sizes
res['metric_keys'] = []
if args.mode == 'train' or args.mode == 'train_val':
    res['metric_keys'].append('train.total_ips')
    for percentile in latency_percentiles:
        res['metric_keys'].append('train.latency_{}'.format(percentile))
if args.mode == 'val' or args.mode == 'train_val':
    res['metric_keys'].append('val.total_ips')
    for percentile in latency_percentiles:
        res['metric_keys'].append('val.latency_{}'.format(percentile))
res['metrics'] = OrderedDict()

for n in args.ngpus:
    res['metrics'][str(n)] = OrderedDict()
    for bs in args.batch_sizes:
        res['metrics'][str(n)][str(bs)] = OrderedDict()
        report_file = args.output + '-{},{}'.format(n, bs)
        Popen(['timeout', args.timeout, args.executable, '-n', str(n), '-b', str(bs),
               '--benchmark-iters', str(args.benchmark_iters),
               '-e', str(args.epochs), '--report', report_file,
               '--mode', args.mode, '--no-metrics'] + other_args,
              stdout=sys.stderr).wait()
        try:
            # horovod runs may write per-rank reports with a -<rank> suffix
            for suffix in ['', *['-{}'.format(i) for i in range(1, n)]]:
                try:
                    with open(report_file + suffix, 'r') as f:
                        report = json.load(f)
                    break
                except FileNotFoundError:
                    pass
            else:
                with open(report_file, 'r') as f:
                    report = json.load(f)
            for metric in res['metric_keys']:
                if len(report['metrics'][metric]) != args.epochs:
                    raise ValueError('Wrong number of epochs in report')
                data = report['metrics'][metric][args.warmup:]
                if metric in harmonic_mean_metrics:
                    avg = len(data) / sum(map(lambda x: 1 / x, data))
                else:
                    avg = np.mean(data)
                res['metrics'][str(n)][str(bs)][metric] = avg
        except Exception as e:
            traceback.print_exc()
            for metric in res['metric_keys']:
                res['metrics'][str(n)][str(bs)][metric] = float('nan')

column_len = 11
for m in res['metric_keys']:
    print(m, file=sys.stderr)
    print(' ' * column_len, end='|', file=sys.stderr)
    for bs in args.batch_sizes:
        print(str(bs).center(column_len), end='|', file=sys.stderr)
    print(file=sys.stderr)
    print('-' * (len(args.batch_sizes) + 1) * (column_len + 1), file=sys.stderr)
    for n in args.ngpus:
        print(str(n).center(column_len), end='|', file=sys.stderr)
        for bs in args.batch_sizes:
            print('{:.5g}'.format(res['metrics'][str(n)][str(bs)][m]).center(column_len), end='|', file=sys.stderr)
        print(file=sys.stderr)
    print(file=sys.stderr)


@@ -52,11 +52,14 @@ class BenchmarkingDataIter:
    def __getattr__(self, attr):
        return getattr(self.data_iter, attr)

    def get_avg_time(self):
        if self.num <= 1:
            avg = float('nan')
        else:
            avg = self.overall_time / (self.num - 1)
        return avg

    def reset(self):
        self.overall_time = 0
        self.num = 0
        self.data_iter.reset()
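    # A hypothetical use of the wrapper above (constructor signature assumed):
    #     it = BenchmarkingDataIter(data_iter)      # wraps an MXNet DataIter
    #     for batch in it:
    #         pass                                  # iterating accumulates overall_time/num
    #     print('avg batch time:', it.get_avg_time())  # first batch excluded as warmup
    #     it.reset()                                # zero counters, reset underlying iterator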


@ -18,146 +18,166 @@ from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.plugin.mxnet import DALIClassificationIterator
import horovod.mxnet as hvd
def add_dali_args(parser):
group = parser.add_argument_group('DALI', 'pipeline and augumentation')
group.add_argument('--use-dali', action='store_true',
help='use dalli pipeline and augunetation')
group.add_argument('--separ-val', action='store_true',
group = parser.add_argument_group('DALI data backend', 'entire group applies only to dali data backend')
group.add_argument('--dali-separ-val', action='store_true',
help='each process will perform independent validation on whole val-set')
group.add_argument('--dali-threads', type=int, default=3, help="number of threads" +\
"per GPU for DALI")
group.add_argument('--validation-dali-threads', type=int, default=10, help="number of threads" +\
group.add_argument('--dali-validation-threads', type=int, default=10, help="number of threads" +\
"per GPU for DALI for validation")
group.add_argument('--dali-prefetch-queue', type=int, default=3, help="DALI prefetch queue depth")
group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=16, help="Memory padding value for nvJPEG (in MB)")
group.add_argument('--dali-prefetch-queue', type=int, default=2, help="DALI prefetch queue depth")
group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=64, help="Memory padding value for nvJPEG (in MB)")
group.add_argument('--dali-fuse-decoder', type=int, default=1, help="0 or 1 whether to fuse decoder or not")
return parser
_mean_pixel = [255 * x for x in (0.485, 0.456, 0.406)]
_std_pixel = [255 * x for x in (0.229, 0.224, 0.225)]
class HybridTrainPipe(Pipeline):
def __init__(self, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape,
nvjpeg_padding, prefetch_queue=3,
output_layout=types.NCHW, pad_output=True, dtype='float16'):
super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id, prefetch_queue_depth = prefetch_queue)
self.input = ops.MXNetReader(path = [rec_path], index_path=[idx_path],
def __init__(self, args, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape, nvjpeg_padding, prefetch_queue=3,
output_layout=types.NCHW, pad_output=True, dtype='float16', dali_cpu=False):
super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id, prefetch_queue_depth = prefetch_queue)
self.input = ops.MXNetReader(path=[rec_path], index_path=[idx_path],
random_shuffle=True, shard_id=shard_id, num_shards=num_shards)
self.decode = ops.nvJPEGDecoder(device = "mixed", output_type = types.RGB,
device_memory_padding = nvjpeg_padding,
host_memory_padding = nvjpeg_padding)
self.rrc = ops.RandomResizedCrop(device = "gpu", size = crop_shape)
self.cmnp = ops.CropMirrorNormalize(device = "gpu",
output_dtype = types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout = output_layout,
crop = crop_shape,
pad_output = pad_output,
image_type = types.RGB,
mean = _mean_pixel,
std = _std_pixel)
self.coin = ops.CoinFlip(probability = 0.5)
if dali_cpu:
dali_device = "cpu"
if args.dali_fuse_decoder:
self.decode = ops.HostDecoderRandomCrop(device=dali_device, output_type=types.RGB)
else:
self.decode = ops.HostDecoder(device=dali_device, output_type=types.RGB)
else:
dali_device = "gpu"
if args.dali_fuse_decoder:
self.decode = ops.nvJPEGDecoderRandomCrop(device="mixed", output_type=types.RGB,
device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)
else:
self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB,
device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)
if args.dali_fuse_decoder:
self.resize = ops.Resize(device=dali_device, resize_x=crop_shape[1], resize_y=crop_shape[0])
else:
self.resize = ops.RandomResizedCrop(device=dali_device, size=crop_shape)
self.cmnp = ops.CropMirrorNormalize(device="gpu",
output_dtype=types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout=output_layout, crop=crop_shape, pad_output=pad_output,
image_type=types.RGB, mean=args.rgb_mean, std=args.rgb_std)
self.coin = ops.CoinFlip(probability=0.5)
def define_graph(self):
rng = self.coin()
self.jpegs, self.labels = self.input(name = "Reader")
self.jpegs, self.labels = self.input(name="Reader")
images = self.decode(self.jpegs)
images = self.rrc(images)
output = self.cmnp(images, mirror = rng)
images = self.resize(images)
output = self.cmnp(images.gpu(), mirror=rng)
return [output, self.labels]
class HybridValPipe(Pipeline):
def __init__(self, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape,
nvjpeg_padding, prefetch_queue=3,
resize_shp=None,
output_layout=types.NCHW, pad_output=True, dtype='float16'):
super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id, prefetch_queue_depth = prefetch_queue)
self.input = ops.MXNetReader(path = [rec_path], index_path=[idx_path],
def __init__(self, args, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape, nvjpeg_padding, prefetch_queue=3, resize_shp=None,
output_layout=types.NCHW, pad_output=True, dtype='float16', dali_cpu=False):
super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id, prefetch_queue_depth=prefetch_queue)
self.input = ops.MXNetReader(path=[rec_path], index_path=[idx_path],
random_shuffle=False, shard_id=shard_id, num_shards=num_shards)
self.decode = ops.nvJPEGDecoder(device = "mixed", output_type = types.RGB,
device_memory_padding = nvjpeg_padding,
host_memory_padding = nvjpeg_padding)
self.resize = ops.Resize(device = "gpu", resize_shorter=resize_shp) if resize_shp else None
self.cmnp = ops.CropMirrorNormalize(device = "gpu",
output_dtype = types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout = output_layout,
crop = crop_shape,
pad_output = pad_output,
image_type = types.RGB,
mean = _mean_pixel,
std = _std_pixel)
if dali_cpu:
dali_device = "cpu"
self.decode = ops.HostDecoder(device=dali_device, output_type=types.RGB)
else:
dali_device = "gpu"
self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB,
device_memory_padding=nvjpeg_padding,
host_memory_padding=nvjpeg_padding)
self.resize = ops.Resize(device=dali_device, resize_shorter=resize_shp) if resize_shp else None
self.cmnp = ops.CropMirrorNormalize(device="gpu",
output_dtype=types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout=output_layout, crop=crop_shape, pad_output=pad_output,
image_type=types.RGB, mean=args.rgb_mean, std=args.rgb_std)
def define_graph(self):
self.jpegs, self.labels = self.input(name = "Reader")
self.jpegs, self.labels = self.input(name="Reader")
images = self.decode(self.jpegs)
if self.resize:
images = self.resize(images)
output = self.cmnp(images)
output = self.cmnp(images.gpu())
return [output, self.labels]
def get_rec_iter(args, kv=None, dali_cpu=False):
    gpus = args.gpus
    num_threads = args.dali_threads
    num_validation_threads = args.dali_validation_threads
    pad_output = (args.image_shape[0] == 4)
    # the input_layout w.r.t. the model is the output_layout of the image pipeline
    output_layout = types.NHWC if args.input_layout == 'NHWC' else types.NCHW
    if 'horovod' in args.kv_store:
        rank = hvd.rank()
        nWrk = hvd.size()
    else:
        rank = kv.rank if kv else 0
        nWrk = kv.num_workers if kv else 1
    batch_size = args.batch_size // nWrk // len(gpus)
    trainpipes = [HybridTrainPipe(args = args,
                                  batch_size = batch_size,
                                  num_threads = num_threads,
                                  device_id = gpu_id,
                                  rec_path = args.data_train,
                                  idx_path = args.data_train_idx,
                                  shard_id = gpus.index(gpu_id) + len(gpus)*rank,
                                  num_shards = len(gpus)*nWrk,
                                  crop_shape = args.image_shape[1:],
                                  output_layout = output_layout,
                                  dtype = args.dtype,
                                  pad_output = pad_output,
                                  dali_cpu = dali_cpu,
                                  nvjpeg_padding = args.dali_nvjpeg_memory_padding * 1024 * 1024,
                                  prefetch_queue = args.dali_prefetch_queue) for gpu_id in gpus]
    if args.data_val:
        valpipes = [HybridValPipe(args = args,
                                  batch_size = batch_size,
                                  num_threads = num_validation_threads,
                                  device_id = gpu_id,
                                  rec_path = args.data_val,
                                  idx_path = args.data_val_idx,
                                  shard_id = 0 if args.dali_separ_val
                                               else gpus.index(gpu_id) + len(gpus)*rank,
                                  num_shards = 1 if args.dali_separ_val else len(gpus)*nWrk,
                                  crop_shape = args.image_shape[1:],
                                  resize_shp = args.data_val_resize,
                                  output_layout = output_layout,
                                  dtype = args.dtype,
                                  pad_output = pad_output,
                                  dali_cpu = dali_cpu,
                                  nvjpeg_padding = args.dali_nvjpeg_memory_padding * 1024 * 1024,
                                  prefetch_queue = args.dali_prefetch_queue) for gpu_id in gpus]
    trainpipes[0].build()
    if args.data_val:
        valpipes[0].build()
        worker_val_examples = valpipes[0].epoch_size("Reader")
        if not args.dali_separ_val:
            worker_val_examples = worker_val_examples // nWrk
            if rank < valpipes[0].epoch_size("Reader") % nWrk:
                worker_val_examples += 1
    if args.num_examples < trainpipes[0].epoch_size("Reader"):
        warnings.warn("{} training examples will be used, although full training set contains {} examples".format(args.num_examples, trainpipes[0].epoch_size("Reader")))
    dali_train_iter = DALIClassificationIterator(trainpipes, args.num_examples // nWrk)
    if args.data_val:
        dali_val_iter = DALIClassificationIterator(valpipes, worker_val_examples, fill_last_batch = False)
    else:
        dali_val_iter = None
    return dali_train_iter, dali_val_iter
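# Sharding sketch (illustrative numbers, assumed rather than taken from the
# scripts above): with 2 workers and gpus=[0, 1, 2, 3], the pipeline for GPU j
# on worker r reads shard j + 4*r out of 8 total shards, and each GPU consumes
# batch_size = args.batch_size // 2 // 4 samples per step, so one step over
# all pipelines still covers args.batch_size samples globally.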

View file

@ -1,7 +1,5 @@
# Copyright 2017-2018 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
@ -36,128 +34,61 @@
# limitations under the License.
import mxnet as mx
import mxnet.ndarray as nd
import random
import argparse
from mxnet.io import DataBatch, DataIter
import numpy as np
import horovod.mxnet as hvd
import dali
def add_data_args(parser):
    def float_list(x):
        return list(map(float, x.split(',')))

    def int_list(x):
        return list(map(int, x.split(',')))

    data = parser.add_argument_group('Data')
    data.add_argument('--data-train', type=str, help='the training data')
    data.add_argument('--data-train-idx', type=str, default='', help='the index of training data')
    data.add_argument('--data-val', type=str, help='the validation data')
    data.add_argument('--data-val-idx', type=str, default='', help='the index of validation data')
    data.add_argument('--data-pred', type=str, help='the image on which to run inference (only for pred mode)')
    data.add_argument('--data-backend', choices=('dali-gpu', 'dali-cpu', 'mxnet', 'synthetic'), default='dali-gpu',
                      help='set data loading & augmentation backend')
    data.add_argument('--image-shape', type=int_list, default=[3, 224, 224],
                      help='the image shape fed into the network')
    data.add_argument('--rgb-mean', type=float_list, default=[123.68, 116.779, 103.939],
                      help='a tuple of size 3 for the mean rgb')
    data.add_argument('--rgb-std', type=float_list, default=[58.393, 57.12, 57.375],
                      help='a tuple of size 3 for the std rgb')
    data.add_argument('--input-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the input data')
    data.add_argument('--conv-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the data assumed by the conv operation')
    data.add_argument('--batchnorm-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the data assumed by the batchnorm operation')
    data.add_argument('--pooling-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the data assumed by the pooling operation')
    data.add_argument('--num-examples', type=int, default=1281167,
                      help="the number of training examples (doesn't work with mxnet data backend)")
    data.add_argument('--data-val-resize', type=int, default=256,
                      help='base length of shorter edge for validation dataset')
    return data
# Action to translate --set-resnet-aug flag to its component settings.
class SetResnetAugAction(argparse.Action):
def __init__(self, nargs=0, **kwargs):
if nargs != 0:
raise ValueError('nargs for SetResnetAug must be 0.')
super(SetResnetAugAction, self).__init__(nargs=nargs, **kwargs)
def __call__(self, parser, namespace, values, option_string=None):
# standard data augmentation setting for resnet training
setattr(namespace, 'random_crop', 1)
setattr(namespace, 'random_resized_crop', 1)
setattr(namespace, 'random_mirror', 1)
setattr(namespace, 'min_random_area', 0.08)
setattr(namespace, 'max_random_aspect_ratio', 4./3.)
setattr(namespace, 'min_random_aspect_ratio', 3./4.)
setattr(namespace, 'brightness', 0.4)
setattr(namespace, 'contrast', 0.4)
setattr(namespace, 'saturation', 0.4)
setattr(namespace, 'pca_noise', 0.1)
# record that this --set-resnet-aug 'macro arg' has been invoked
setattr(namespace, self.dest, 1)
# Similar to the above, but suitable for calling within a training script to set the defaults.
def set_resnet_aug(aug):
# standard data augmentation setting for resnet training
aug.set_defaults(random_crop=0, random_resized_crop=1)
aug.set_defaults(random_mirror=1)
aug.set_defaults(min_random_area=0.08)
aug.set_defaults(max_random_aspect_ratio=4./3., min_random_aspect_ratio=3./4.)
aug.set_defaults(brightness=0.4, contrast=0.4, saturation=0.4, pca_noise=0.1)
# Action to translate --set-data-aug-level <N> arg to its component settings.
class SetDataAugLevelAction(argparse.Action):
def __init__(self, option_strings, dest, nargs=None, **kwargs):
if nargs is not None:
raise ValueError("nargs not allowed")
super(SetDataAugLevelAction, self).__init__(option_strings, dest, **kwargs)
def __call__(self, parser, namespace, values, option_string=None):
level = values
# record that this --set-data-aug-level <N> 'macro arg' has been invoked
setattr(namespace, self.dest, level)
if level >= 1:
setattr(namespace, 'random_crop', 1)
setattr(namespace, 'random_mirror', 1)
if level >= 2:
setattr(namespace, 'max_random_h', 36)
setattr(namespace, 'max_random_s', 50)
setattr(namespace, 'max_random_l', 50)
if level >= 3:
setattr(namespace, 'max_random_rotate_angle', 10)
setattr(namespace, 'max_random_shear_ratio', 0.1)
setattr(namespace, 'max_random_aspect_ratio', 0.25)
# Similar to the above, but suitable for calling within a training script to set the defaults.
def set_data_aug_level(aug, level):
if level >= 1:
aug.set_defaults(random_crop=1, random_mirror=1)
if level >= 2:
aug.set_defaults(max_random_h=36, max_random_s=50, max_random_l=50)
if level >= 3:
aug.set_defaults(max_random_rotate_angle=10, max_random_shear_ratio=0.1, max_random_aspect_ratio=0.25)
def add_data_aug_args(parser):
    aug = parser.add_argument_group(
        'MXNet data backend', 'entire group applies only to mxnet data backend')
    aug.add_argument('--data-mxnet-threads', type=int, default=40,
                     help='number of threads for data decoding for mxnet data backend')
    aug.add_argument('--random-crop', type=int, default=0,
                     help='whether to randomly crop the image')
    aug.add_argument('--random-mirror', type=int, default=1,
                     help='whether to randomly flip the image horizontally')
    aug.add_argument('--max-random-h', type=int, default=0,
                     help='max change of hue, whose range is [0, 180]')
@ -165,9 +96,9 @@ def add_data_aug_args(parser):
help='max change of saturation, whose range is [0, 255]')
aug.add_argument('--max-random-l', type=int, default=0,
help='max change of intensity, whose range is [0, 255]')
aug.add_argument('--min-random-aspect-ratio', type=float, default=0.75,
help='min value of aspect ratio, whose value is either None or a positive value.')
aug.add_argument('--max-random-aspect-ratio', type=float, default=1.33,
help='max value of aspect ratio. If min_random_aspect_ratio is None, '
'the aspect ratio range is [1-max_random_aspect_ratio, '
'1+max_random_aspect_ratio], otherwise it is '
@ -183,7 +114,7 @@ def add_data_aug_args(parser):
'otherwise use --pad-size')
aug.add_argument('--max-random-area', type=float, default=1,
help='max area to crop in random resized crop, whose range is [0, 1]')
aug.add_argument('--min-random-area', type=float, default=0.05,
help='min area to crop in random resized crop, whose range is [0, 1]')
aug.add_argument('--min-crop-size', type=int, default=-1,
help='Crop both width and height into a random size in '
@ -199,87 +130,200 @@ def add_data_aug_args(parser):
help='saturation jittering, whose range is [0, 1]')
aug.add_argument('--pca-noise', type=float, default=0,
help='pca noise, whose range is [0, 1]')
aug.add_argument('--random-resized-crop', type=int, default=1,
help='whether to use random resized crop')
aug.add_argument('--set-resnet-aug', action=SetResnetAugAction,
help='whether to employ standard resnet augmentations (see data.py)')
aug.add_argument('--set-data-aug-level', type=int, default=None, action=SetDataAugLevelAction,
help='set multiple data augmentations based on a `level` (see data.py)')
return aug
def get_data_loader(args):
if args.data_backend == 'dali-gpu':
return (lambda *args, **kwargs: dali.get_rec_iter(*args, **kwargs, dali_cpu=False))
if args.data_backend == 'dali-cpu':
return (lambda *args, **kwargs: dali.get_rec_iter(*args, **kwargs, dali_cpu=True))
if args.data_backend == 'synthetic':
return get_synthetic_rec_iter
if args.data_backend == 'mxnet':
return get_rec_iter
raise ValueError('Wrong data backend')
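# Minimal usage sketch (assumes `args` came from a parser populated with
# add_data_args / add_data_aug_args; `kv` may be None for single-machine runs):
#   loader = get_data_loader(args)           # picks the backend from --data-backend
#   train_iter, val_iter = loader(args, kv)  # dali-* backends forward dali_cpu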
class DataGPUSplit:
def __init__(self, dataloader, ctx, dtype):
self.dataloader = dataloader
self.ctx = ctx
self.dtype = dtype
self.batch_size = dataloader.batch_size // len(ctx)
self._num_gpus = len(ctx)
def __iter__(self):
return DataGPUSplit(iter(self.dataloader), self.ctx, self.dtype)
def __next__(self):
data = next(self.dataloader)
ret = []
for i in range(len(self.ctx)):
start = i * len(data.data[0]) // len(self.ctx)
end = (i + 1) * len(data.data[0]) // len(self.ctx)
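# The iterator-level padding is distributed across the GPU slices starting
# from the last one. Illustrative numbers (assumed, not from this file): with
# 2 GPUs, per-GPU batch_size=4 and data.pad=5, GPU 1 gets pad = min(5, 4) = 4
# and GPU 0 gets pad = max(0, 5 - 4) = 1, so exactly 5 padded samples are
# discarded in total.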
pad = max(0, min(data.pad - (len(self.ctx) - i - 1) * self.batch_size, self.batch_size))
ret.append(mx.io.DataBatch(
[data.data[0][start:end].as_in_context(self.ctx[i]).astype(self.dtype)],
[data.label[0][start:end].as_in_context(self.ctx[i])],
pad=pad))
return ret
def next(self):
return next(self)
def reset(self):
self.dataloader.reset()
def get_rec_iter(args, kv=None):
    gpus = args.gpus
    if 'horovod' in args.kv_store:
        rank = hvd.rank()
        nworker = hvd.size()
        gpus = [gpus[0]]
        batch_size = args.batch_size // hvd.size()
    else:
        rank = kv.rank if kv else 0
        nworker = kv.num_workers if kv else 1
        batch_size = args.batch_size
    if args.input_layout == 'NHWC':
        raise ValueError('ImageRecordIter cannot handle layout {}'.format(args.input_layout))
    train = DataGPUSplit(mx.io.ImageRecordIter(
        path_imgrec         = args.data_train,
        path_imgidx         = args.data_train_idx,
        label_width         = 1,
        mean_r              = args.rgb_mean[0],
        mean_g              = args.rgb_mean[1],
        mean_b              = args.rgb_mean[2],
        std_r               = args.rgb_std[0],
        std_g               = args.rgb_std[1],
        std_b               = args.rgb_std[2],
        data_name           = 'data',
        label_name          = 'softmax_label',
        data_shape          = args.image_shape,
        batch_size          = batch_size,
        rand_crop           = args.random_crop,
        max_random_scale    = args.max_random_scale,
        random_resized_crop = args.random_resized_crop,
        min_random_scale    = args.min_random_scale,
        max_aspect_ratio    = args.max_random_aspect_ratio,
        min_aspect_ratio    = args.min_random_aspect_ratio,
        max_random_area     = args.max_random_area,
        min_random_area     = args.min_random_area,
        min_crop_size       = args.min_crop_size,
        max_crop_size       = args.max_crop_size,
        brightness          = args.brightness,
        contrast            = args.contrast,
        saturation          = args.saturation,
        pca_noise           = args.pca_noise,
        random_h            = args.max_random_h,
        random_s            = args.max_random_s,
        random_l            = args.max_random_l,
        max_rotate_angle    = args.max_random_rotate_angle,
        max_shear_ratio     = args.max_random_shear_ratio,
        rand_mirror         = args.random_mirror,
        preprocess_threads  = args.data_mxnet_threads,
        shuffle             = True,
        num_parts           = nworker,
        part_index          = rank,
        seed                = args.seed or '0',
    ), [mx.gpu(gpu) for gpu in gpus], args.dtype)
    if args.data_val is None:
        return (train, None)
    val = DataGPUSplit(mx.io.ImageRecordIter(
        path_imgrec         = args.data_val,
        path_imgidx         = args.data_val_idx,
        label_width         = 1,
        mean_r              = args.rgb_mean[0],
        mean_g              = args.rgb_mean[1],
        mean_b              = args.rgb_mean[2],
        std_r               = args.rgb_std[0],
        std_g               = args.rgb_std[1],
        std_b               = args.rgb_std[2],
        data_name           = 'data',
        label_name          = 'softmax_label',
        batch_size          = batch_size,
        round_batch         = False,
        data_shape          = args.image_shape,
        preprocess_threads  = args.data_mxnet_threads,
        rand_crop           = False,
        rand_mirror         = False,
        num_parts           = nworker,
        part_index          = rank,
        resize              = args.data_val_resize,
    ), [mx.gpu(gpu) for gpu in gpus], args.dtype)
    return (train, val)
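# Unlike the DALI backend, which builds one pipeline per GPU, this backend
# decodes a single ImageRecordIter stream on the CPU and lets DataGPUSplit
# slice each batch across the configured GPUs (under horovod only gpus[0] is
# used, since every rank runs in its own process).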
class SyntheticDataIter(DataIter):
def __init__(self, num_classes, data_shape, max_iter, ctx, dtype):
self.batch_size = data_shape[0]
self.cur_iter = 0
self.max_iter = max_iter
self.dtype = dtype
label = np.random.randint(0, num_classes, [self.batch_size,])
data = np.random.uniform(-1, 1, data_shape)
self.data = []
self.label = []
self._num_gpus = len(ctx)
for dev in ctx:
self.data.append(mx.nd.array(data, dtype=self.dtype, ctx=dev))
self.label.append(mx.nd.array(label, dtype=self.dtype, ctx=dev))
def __iter__(self):
return self
def next(self):
self.cur_iter += 1
if self.cur_iter <= self.max_iter:
return [DataBatch(data=(data,), label=(label,), pad=0) for data, label in zip(self.data, self.label)]
else:
raise StopIteration
def __next__(self):
return self.next()
def reset(self):
self.cur_iter = 0
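# Note: SyntheticDataIter materializes one random batch per GPU up front and
# replays it every iteration, so the 'synthetic' backend measures compute
# throughput with the input pipeline taken out of the picture.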
def get_synthetic_rec_iter(args, kv=None):
gpus = args.gpus
if 'horovod' in args.kv_store:
gpus = [gpus[0]]
batch_size = args.batch_size // hvd.size()
else:
batch_size = args.batch_size
if args.input_layout == 'NCHW':
data_shape = (batch_size, *args.image_shape)
elif args.input_layout == 'NHWC':
data_shape = (batch_size, *args.image_shape[1:], args.image_shape[0])
else:
raise ValueError('Wrong input layout')
train = SyntheticDataIter(args.num_classes, data_shape,
args.num_examples // args.batch_size,
[mx.gpu(gpu) for gpu in gpus], args.dtype)
if args.data_val is None:
return (train, None)
val = SyntheticDataIter(args.num_classes, data_shape,
args.num_examples // args.batch_size,
[mx.gpu(gpu) for gpu in gpus], args.dtype)
return (train, val)
def load_image(args, path, ctx=mx.cpu()):
image = mx.image.imread(path).astype('float32')
image = mx.image.imresize(image, *args.image_shape[1:])
image = (image - nd.array(args.rgb_mean)) / nd.array(args.rgb_std)
image = image.as_in_context(ctx)
if args.input_layout == 'NCHW':
image = image.transpose((2, 0, 1))
image = image.astype(args.dtype)
if args.image_shape[0] == 4:
dim = 0 if args.input_layout == 'NCHW' else 2
image = nd.concat(image, nd.zeros((1, *image.shape[1:]), dtype=image.dtype, ctx=image.context), dim=dim)
return image
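# Note: when image_shape[0] == 4, a zero-filled fourth channel is concatenated
# so a single image matches networks built with pad_output=True (channel-padded
# input); which axis gets the padding depends on the NCHW vs NHWC layout.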

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches the ResNet50 training benchmark in FP16 on 1, 4 and 8 GPUs with batch sizes 64, 128, 192 and 208
# Usage ./BENCHMARK_FP16.sh <additional flags>
python benchmark.py -n 1,4,8 -b 64,128,192,208 -e 2 -w 1 -i 100 -o report.json $@

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches the ResNet50 training benchmark in FP32 on 1, 4 and 8 GPUs with batch sizes 32, 64 and 96
# Usage ./BENCHMARK_FP32.sh <additional flags>
python benchmark.py -n 1,4,8 -b 32,64,96 -e 2 -w 1 -i 100 --dtype float32 -o report.json $@

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches the ResNet50 inference benchmark in FP16 on 1 GPU with batch sizes 1, 2, 4, 64, 128, 192 and 208
# Usage ./INFER_BENCHMARK_FP16.sh <additional flags>
python benchmark.py -n 1 -b 1,2,4,64,128,192,208 --only-inference -e 3 -w 1 -i 100 -o report.json $@

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP16 on 1 GPU with batch size 208
# Usage ./RN50_FP16_1GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 1 -b 208 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP16 on 4 GPUs with a total batch size of 832 (208 per GPU)
# Usage ./RN50_FP16_4GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 4 -b 208 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP16 on 8 GPUs with a total batch size of 1664 (208 per GPU)
# Usage ./RN50_FP16_8GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 8 -b 208 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP32 on 1 GPU with batch size 96
# Usage ./RN50_FP32_1GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 1 -b 96 --dtype float32 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP32 on 4 GPUs with a total batch size of 384 (96 per GPU)
# Usage ./RN50_FP32_4GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 4 -b 96 --dtype float32 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP32 on 8 GPUs with a total batch size of 768 (96 per GPU)
# Usage ./RN50_FP32_8GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 8 -b 96 --dtype float32 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script scores a ResNet50 checkpoint in FP16 on 1 GPU with batch size 128
# Usage ./SCORE_FP16.sh <model prefix> <epoch> <additional flags>
./runner -n 1 -b 128 --only-inference --model-prefix $1 --load-epoch $2 -e 1 ${@:3}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script scores a ResNet50 checkpoint in FP32 on 1 GPU with batch size 64
# Usage ./SCORE_FP32.sh <model prefix> <epoch> <additional flags>
./runner -n 1 -b 64 --dtype float32 --only-inference --model-prefix $1 --load-epoch $2 -e 1 ${@:3}

View file

@ -33,197 +33,408 @@
# See the License for the specific language governing permissions and
# limitations under the License.
""" example train fit utility """
""" train fit utility """
import logging
import os
import time
import re
import math
import sys
import random
from itertools import starmap
import numpy as np
import mxnet as mx
import mxnet.ndarray as nd
import horovod.mxnet as hvd
import mxnet.contrib.amp as amp
from mxnet import autograd as ag
from mxnet import gluon
from report import Report
from benchmarking import BenchmarkingDataIter
import data
def add_fit_args(parser):
    """
    parser : argparse.ArgumentParser
    return a parser added with args required by fit
    """
    def int_list(x):
        return list(map(int, x.split(',')))

    def float_list(x):
        return list(map(float, x.split(',')))

    train = parser.add_argument_group('Training')
    train.add_argument('--mode', default='train_val', choices=('train_val', 'train', 'val', 'pred'),
                       help='mode')
    train.add_argument('--seed', type=int, default=None,
                       help='random seed')
    train.add_argument('--gpus', type=int_list, default=[0],
                       help='list of gpus to run, e.g. 0 or 0,2,5')
    train.add_argument('--kv-store', type=str, default='device', choices=('device', 'horovod'),
                       help='key-value store type')
    train.add_argument('--dtype', type=str, default='float16', choices=('float32', 'float16'),
                       help='precision')
    train.add_argument('--amp', action='store_true',
                       help='If enabled, turn on AMP (Automatic Mixed Precision)')
    train.add_argument('--batch-size', type=int, default=192,
                       help='the batch size')
    train.add_argument('--num-epochs', type=int, default=90,
                       help='number of epochs')
    train.add_argument('--lr', type=float, default=0.1,
                       help='initial learning rate')
    train.add_argument('--lr-schedule', choices=('multistep', 'cosine'), default='cosine',
                       help='learning rate schedule')
    train.add_argument('--lr-factor', type=float, default=0.256,
                       help='the ratio to reduce lr on each step')
    train.add_argument('--lr-steps', type=float_list, default=[],
                       help='the epochs to reduce the lr, e.g. 30,60')
    train.add_argument('--warmup-epochs', type=int, default=5,
                       help='the epochs to ramp-up lr to scaled large-batch value')
    train.add_argument('--optimizer', type=str, default='sgd',
                       help='the optimizer type')
    train.add_argument('--mom', type=float, default=0.875,
                       help='momentum for sgd')
    train.add_argument('--wd', type=float, default=1 / 32768,
                       help='weight decay for sgd')
    train.add_argument('--label-smoothing', type=float, default=0.1,
                       help='label smoothing factor')
    train.add_argument('--mixup', type=float, default=0,
                       help='alpha parameter for mixup (if 0 then mixup is not applied)')
    train.add_argument('--disp-batches', type=int, default=20,
                       help='show progress for every n batches')
    train.add_argument('--model-prefix', type=str, default='model',
                       help='model checkpoint prefix')
    train.add_argument('--save-frequency', type=int, default=-1,
                       help='frequency of saving model in epochs (--model-prefix must be specified). '
                            'If -1 then save only best model. If 0 then do not save anything.')
    train.add_argument('--begin-epoch', type=int, default=0,
                       help='start the model from an epoch')
    train.add_argument('--load', help='checkpoint to load')
    train.add_argument('--test-io', action='store_true',
                       help='test reading speed without training')
    train.add_argument('--test-io-mode', default='train', choices=('train', 'val'),
                       help='data to test')
    train.add_argument('--log', type=str, default='log.log',
                       help='file where to save the log from the experiment')
    train.add_argument('--report', default='report.json', help='file where to save report')
    train.add_argument('--no-metrics', action='store_true', help='do not calculate evaluation metrics (for benchmarking)')
    train.add_argument('--benchmark-iters', type=int, default=None,
                       help='run only benchmark-iters iterations from each epoch')
    return train
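# Typical wiring sketch (illustrative; the entry-point shape is an assumption,
# not part of this file):
#   parser = argparse.ArgumentParser(description='ResNet50 training')
#   data.add_data_args(parser)
#   data.add_data_aug_args(parser)
#   add_fit_args(parser)
#   args = parser.parse_args()
#   fit(args, model, data.get_data_loader(args))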
def get_epoch_size(args, kv):
return math.ceil(args.num_examples / args.batch_size)
def get_lr_scheduler(args):
def multistep_schedule(x):
lr = args.lr * (args.lr_factor ** (len(list(filter(lambda step: step <= x, args.lr_steps)))))
warmup_coeff = min(1, x / args.warmup_epochs)
return warmup_coeff * lr
def cosine_schedule(x):
steps = args.lr_steps
if not steps or steps[0] > args.warmup_epochs:
steps = [args.warmup_epochs] + steps
elif not steps or steps[0] != 0:
steps = [0] + steps
if steps[-1] != args.num_epochs:
steps.append(args.num_epochs)
if x < args.warmup_epochs:
return args.lr * x / args.warmup_epochs
for i, (step, next_step) in enumerate(zip(steps, steps[1:])):
if next_step > x:
return args.lr * 0.5 * (1 + math.cos(math.pi * (x - step) / (next_step - step))) * (args.lr_factor ** i)
return 0
schedules = {
'multistep': multistep_schedule,
'cosine': cosine_schedule,
}
return schedules[args.lr_schedule]
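# Worked example for the cosine schedule above (values assumed for
# illustration: lr=0.256, warmup_epochs=5, num_epochs=90, lr_steps=[];
# lr_factor has no effect here since there is a single cosine segment):
#   x = 0     -> 0.0     (start of linear warmup)
#   x = 2.5   -> 0.128   (halfway through warmup)
#   x = 5     -> 0.256   (warmup done, cosine segment [5, 90] begins)
#   x = 47.5  -> 0.256 * 0.5 * (1 + cos(pi/2)) = 0.128
#   x >= 90   -> 0.0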
def load_model(args, model):
if args.load is None:
return False
model.load_parameters(args.load)
logging.info('Loaded model {}'.format(args.load))
return True
def save_checkpoint(net, epoch, top1, best_acc, model_prefix, save_frequency, kvstore):
if model_prefix is None or save_frequency == 0 or ('horovod' in kvstore and hvd.rank() != 0):
return
if save_frequency > 0 and (epoch + 1) % save_frequency == 0:
fname = '{}_{:04}.params'.format(model_prefix, epoch)
net.save_parameters(fname)
logging.info('[Epoch {}] Saving checkpoint to {} with Accuracy: {:.4f}'.format(epoch, fname, top1))
if top1 > best_acc:
fname = '{}_best.params'.format(model_prefix)
net.save_parameters(fname)
logging.info('[Epoch {}] Saving checkpoint to {} with Accuracy: {:.4f}'.format(epoch, fname, top1))
def add_metrics_to_report(report, mode, metric, durations, total_batch_size, loss=None, warmup=20):
if report is None:
return
top1 = metric.get('accuracy', None)
if top1 is not None:
report.add_value('{}.top1'.format(mode), top1)
top5 = metric.get('top_k_accuracy_5', None)
if top5 is not None:
report.add_value('{}.top5'.format(mode), top5)
if loss is not None:
report.add_value('{}.loss'.format(mode), loss.get_global()[1])
if len(durations) > warmup:
durations = durations[warmup:]
duration = np.mean(durations)
total_ips = total_batch_size / duration
report.add_value('{}.latency_avg'.format(mode), duration)
for percentile in [50, 90, 95, 99, 100]:
report.add_value('{}.latency_{}'.format(mode, percentile), np.percentile(durations, percentile))
report.add_value('{}.total_ips'.format(mode), total_ips)
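# Note: the first `warmup` measured iterations (20 by default) are dropped
# before computing the latency percentiles and total_ips, so one-time start-up
# costs do not skew the reported numbers.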
def model_pred(args, model, image):
from imagenet_classes import classes
output = model(image.reshape(-1, *image.shape))[0].softmax().as_in_context(mx.cpu())
top = output.argsort(is_ascend=False)[:10]
for i, ind in enumerate(top):
ind = int(ind.asscalar())
logging.info('{:2d}. {:5.2f}% -> {}'.format(i + 1, output[ind].asscalar() * 100, classes[ind]))
def reduce_metrics(args, metrics, kvstore):
if 'horovod' not in kvstore or not metrics[0] or hvd.size() == 1:
return metrics
m = mx.ndarray.array(metrics[1], ctx=mx.gpu(args.gpus[0]))
reduced = hvd.allreduce(m)
values = reduced.as_in_context(mx.cpu()).asnumpy().tolist()
return (metrics[0], values)
def model_score(args, net, val_data, metric, kvstore, report=None):
if val_data is None:
logging.info('Omitting validation: no data')
return [], []
if not isinstance(metric, mx.metric.EvalMetric):
metric = mx.metric.create(metric)
metric.reset()
val_data.reset()
total_batch_size = val_data.batch_size * val_data._num_gpus * (hvd.size() if 'horovod' in kvstore else 1)
durations = []
tic = time.time()
outputs = []
for batches in val_data:
# synchronize to previous iteration
for o in outputs:
o.wait_to_read()
data = [b.data[0] for b in batches]
label = [b.label[0][:len(b.data[0]) - b.pad] for b in batches if len(b.data[0]) != b.pad]
outputs = [net(X) for X, b in zip(data, batches)]
outputs = [o[:len(b.data[0]) - b.pad] for o, b in zip(outputs, batches) if len(b.data[0]) != b.pad]
metric.update(label, outputs)
durations.append(time.time() - tic)
tic = time.time()
metric = reduce_metrics(args, metric.get_global(), kvstore)
add_metrics_to_report(report, 'val', dict(zip(*metric)), durations, total_batch_size)
return metric
class ScalarMetric(mx.metric.Loss):
def update(self, _, scalar):
self.sum_metric += scalar
self.global_sum_metric += scalar
self.num_inst += 1
self.global_num_inst += 1
def label_smoothing(labels, classes, eta):
return labels.one_hot(classes, on_value=1 - eta + eta / classes, off_value=eta / classes)
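# Worked example (numbers assumed for illustration): with classes=1000 and
# eta=0.1, the true class gets on_value = 1 - 0.1 + 0.1/1000 = 0.9001 and every
# other class gets off_value = 0.1/1000 = 0.0001, so each target row still sums
# to exactly 1.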
def model_fit(args, net, train_data, eval_metric, optimizer,
optimizer_params, lr_scheduler, eval_data, kvstore, kv,
begin_epoch, num_epoch, model_prefix, report, print_loss):
if not isinstance(eval_metric, mx.metric.EvalMetric):
eval_metric = mx.metric.create(eval_metric)
loss_metric = ScalarMetric()
if 'horovod' in kvstore:
trainer = hvd.DistributedTrainer(net.collect_params(), optimizer, optimizer_params)
else:
trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params,
kvstore=kv, update_on_kvstore=False)
if args.amp:
amp.init_trainer(trainer)
sparse_label_loss = (args.label_smoothing == 0 and args.mixup == 0)
loss = gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=sparse_label_loss)
loss.hybridize(static_shape=True, static_alloc=True)
local_batch_size = train_data.batch_size
total_batch_size = local_batch_size * train_data._num_gpus * (hvd.size() if 'horovod' in kvstore else 1)
durations = []
epoch_size = get_epoch_size(args, kv)
def transform_data(images, labels):
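# Mixup blends each sample with the batch in reverse order:
# x' = c * x + (1 - c) * x[::-1] with c ~ Beta(alpha, alpha), and the
# (smoothed) one-hot labels are mixed with the same per-sample coefficients.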
if args.mixup != 0:
coeffs = mx.nd.array(np.random.beta(args.mixup, args.mixup, size=images.shape[0])).as_in_context(images.context)
image_coeffs = coeffs.astype(images.dtype, copy=False).reshape(*coeffs.shape, 1, 1, 1)
ret_images = image_coeffs * images + (1 - image_coeffs) * images[::-1]
ret_labels = label_smoothing(labels, args.num_classes, args.label_smoothing)
label_coeffs = coeffs.reshape(*coeffs.shape, 1)
ret_labels = label_coeffs * ret_labels + (1 - label_coeffs) * ret_labels[::-1]
else:
ret_images = images
if not sparse_label_loss:
ret_labels = label_smoothing(labels, args.num_classes, args.label_smoothing)
else:
ret_labels = labels
return ret_images, ret_labels
best_accuracy = -1
for epoch in range(begin_epoch, num_epoch):
tic = time.time()
train_data.reset()
eval_metric.reset()
loss_metric.reset()
btic = time.time()
logging.info('Starting epoch {}'.format(epoch))
outputs = []
for i, batches in enumerate(train_data):
# synchronize to previous iteration
for o in outputs:
o.wait_to_read()
trainer.set_learning_rate(lr_scheduler(epoch + i / epoch_size))
data = [b.data[0] for b in batches]
label = [b.label[0].as_in_context(b.data[0].context) for b in batches]
orig_label = label
data, label = zip(*starmap(transform_data, zip(data, label)))
outputs = []
Ls = []
with ag.record():
for x, y in zip(data, label):
z = net(x)
L = loss(z, y)
# store the loss and do backward after we have done forward
# on all GPUs for better speed on multiple GPUs.
Ls.append(L)
outputs.append(z)
if args.amp:
with amp.scale_loss(Ls, trainer) as scaled_loss:
ag.backward(scaled_loss)
else:
ag.backward(Ls)
if 'horovod' in kvstore:
trainer.step(local_batch_size)
else:
trainer.step(total_batch_size)
if print_loss:
loss_metric.update(..., np.mean([l.asnumpy() for l in Ls]).item())
eval_metric.update(orig_label, outputs)
if args.disp_batches and not (i + 1) % args.disp_batches:
name, acc = eval_metric.get()
if print_loss:
name = [loss_metric.get()[0]] + name
acc = [loss_metric.get()[1]] + acc
logging.info('Epoch[{}] Batch [{}-{}]\tSpeed: {} samples/sec\tLR: {}\t{}'.format(
epoch, (i // args.disp_batches) * args.disp_batches, i,
args.disp_batches * total_batch_size / (time.time() - btic), trainer.learning_rate,
'\t'.join(list(map(lambda x: '{}: {:.6f}'.format(*x), zip(name, acc))))))
eval_metric.reset_local()
loss_metric.reset_local()
btic = time.time()
durations.append(time.time() - tic)
tic = time.time()
add_metrics_to_report(report, 'train', dict(eval_metric.get_global_name_value()), durations, total_batch_size, loss_metric if print_loss else None)
if args.mode == 'train_val':
logging.info('Validating epoch {}'.format(epoch))
score = model_score(args, net, eval_data, eval_metric, kvstore, report)
for name, value in zip(*score):
logging.info('Epoch[{}] Validation {:20}: {}'.format(epoch, name, value))
score = dict(zip(*score))
accuracy = score.get('accuracy', -1)
save_checkpoint(net, epoch, accuracy, best_accuracy, model_prefix, args.save_frequency, kvstore)
best_accuracy = max(best_accuracy, accuracy)
def fit(args, model, data_loader):
"""
train a model
args : argparse returns
model : the neural network model
data_loader : function that returns the train and val data iterators
"""
start_time = time.time()
report = Report(args.arch, len(args.gpus), sys.argv)
# select gpu for horovod process
if 'horovod' in args.kv_store:
hvd.init()
args.gpus = [args.gpus[hvd.local_rank()]]
if args.amp:
amp.init()
if args.seed is not None:
logging.info('Setting seeds to {}'.format(args.seed))
random.seed(args.seed)
np.random.seed(args.seed)
mx.random.seed(args.seed)
if 'horovod' in args.kv_store:
kv = None
rank = hvd.rank()
num_workers = hvd.size()
else:
kv = mx.kvstore.create(args.kv_store)
rank = kv.rank
num_workers = kv.num_workers
if args.test_io:
train, val = data_loader(args, kv)
if args.test_io_mode == 'train':
data_iter = train
else:
data_iter = val
tic = time.time()
for i, batch in enumerate(data_iter):
if isinstance(batch, list):
for b in batch:
for j in b.data:
@ -232,232 +443,90 @@ def fit(args, network, data_loader, **kwargs):
for j in batch.data:
j.wait_to_read()
if (i + 1) % args.disp_batches == 0:
logging.info('Batch [{}]\tSpeed: {:.2f} samples/sec'.format(
i, args.disp_batches * args.batch_size / (time.time() - tic)))
tic = time.time()
return
if not load_model(args, model):
# all initializers should be specified in the model definition.
# if not, this will raise an error
model.initialize(mx.init.Initializer())
# devices for training
devs = list(map(mx.gpu, args.gpus))
model.collect_params().reset_ctx(devs)
if args.mode == 'pred':
logging.info('Inferring image {}'.format(args.data_pred))
model_pred(args, model, data.load_image(args, args.data_pred, devs[0]))
return
# learning rate
lr_scheduler = get_lr_scheduler(args)
optimizer_params = {
'learning_rate': 0,
'wd': args.wd,
'multi_precision': True,
}
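# learning_rate is deliberately 0 here: model_fit drives the effective rate
# each iteration via trainer.set_learning_rate(lr_scheduler(epoch + i / epoch_size)).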
# Only a limited number of optimizers have 'momentum' property
has_momentum = {'sgd', 'dcasgd', 'nag', 'signum', 'lbsgd'}
if args.optimizer in has_momentum:
optimizer_params['momentum'] = args.mom
# evaluation metrics
if not args.no_metrics:
eval_metrics = ['accuracy']
eval_metrics.append(mx.metric.create(
'top_k_accuracy', top_k=5))
else:
eval_metrics = []
train, val = data_loader(args, kv)
train = BenchmarkingDataIter(train, args.benchmark_iters)
if val is not None:
val = BenchmarkingDataIter(val, args.benchmark_iters)
class Gatherer:
def __init__(self, report, mode, data_iter, total_bs=None):
self.report = report
self.mode = mode
self.total_bs = total_bs
self.data_iter = data_iter
self.clear()
def clear(self):
self.num = 0
self.top1 = 0
self.top5 = 0
self.loss = 0
self.time = 0
self.tic = 0
def gather_metrics(self, data):
params = dict(data.eval_metric.get_global_name_value())
if self.num != 0:
self.time += time.time() - self.tic
self.num += 1
if not args.no_metrics:
self.top1 = params['accuracy']
self.top5 = params['top_k_accuracy_5']
self.loss = params['cross-entropy']
self.tic = time.time()
def add_metrics(self, *a, **k):
top1 = self.top1 * 100
top5 = self.top5 * 100
loss = self.loss
if self.num <= 1:
time = float('nan')
else:
time = self.time / (self.num - 1)
data = self.data_iter.get_avg_time_and_clear()
if self.total_bs is not None:
compute_ips = self.total_bs / (time - data)
total_ips = self.total_bs / time
if not args.no_metrics:
self.report.add_value('{}.top1'.format(self.mode), top1)
self.report.add_value('{}.top5'.format(self.mode), top5)
self.report.add_value('{}.loss'.format(self.mode), loss)
self.report.add_value('{}.time'.format(self.mode), time)
# self.report.add_value('{}.data'.format(self.mode), data)
if self.total_bs is not None:
# self.report.add_value('{}.compute_ips'.format(self.mode), compute_ips)
self.report.add_value('{}.total_ips'.format(self.mode), total_ips)
self.clear()
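For reference, a minimal self-contained sketch of the throughput arithmetic used in add_metrics above (all numbers are made up):

total_bs = 8 * 192                 # global batch size across GPUs
iter_time = 0.25                   # average seconds per iteration (first iteration excluded)
data_time = 0.05                   # average data-loading seconds per iteration
total_ips = total_bs / iter_time                  # end-to-end images/sec
compute_ips = total_bs / (iter_time - data_time)  # images/sec with data time excluded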
def save_report(*a, **k):
report.set_total_duration(time.time() - start_time)
if args.report:
report.save(args.report)
train_gatherer = Gatherer(report, 'train', train, args.batch_size)
eval_gatherer = Gatherer(report, 'val', val, args.batch_size)
batch_end_callbacks = [train_gatherer.gather_metrics] + batch_end_callbacks
epoch_end_callbacks = [train_gatherer.add_metrics, save_report] + epoch_end_callbacks
eval_batch_end_callbacks = [eval_gatherer.gather_metrics]
eval_end_callbacks = [eval_gatherer.add_metrics, save_report]
if 'horovod' in args.kv_store:
# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
hvd.broadcast_parameters(params, root_rank=0)
# run
model.fit(train,
begin_epoch=args.load_epoch if args.load_epoch else 0,
num_epoch=args.num_epochs if not args.only_inference else 0,
eval_data=val,
eval_metric=eval_metrics,
kvstore=kv,
optimizer=args.optimizer,
optimizer_params=optimizer_params,
initializer=initializer,
arg_params=arg_params,
aux_params=aux_params,
batch_end_callback=batch_end_callbacks,
epoch_end_callback=epoch_end_callbacks,
eval_batch_end_callback=eval_batch_end_callbacks,
eval_end_callback=eval_end_callbacks,
allow_missing=True,
monitor=monitor)
if args.mode in ['train_val', 'train']:
model_fit(
args,
model,
train,
begin_epoch=args.begin_epoch,
num_epoch=args.num_epochs,
eval_data=val,
eval_metric=eval_metrics,
kvstore=args.kv_store,
kv=kv,
optimizer=args.optimizer,
optimizer_params=optimizer_params,
lr_scheduler=lr_scheduler,
report=report,
model_prefix=args.model_prefix,
print_loss=not args.no_metrics,
)
elif args.mode == 'val':
for epoch in range(args.num_epochs): # loop for benchmarking
score = model_score(args, model, val, eval_metrics, args.kv_store, report=report)
for name, value in zip(*score):
logging.info('Validation {:20}: {}'.format(name, value))
else:
raise ValueError('Wrong mode')
if args.only_inference:
for epoch in range(args.num_epochs):
score = model.score(val, eval_metrics, batch_end_callback=eval_batch_end_callbacks, score_end_callback=eval_end_callbacks, epoch=epoch)
print('-------------')
for name, value in score:
print('{}: {}'.format(name, value))
mx.nd.waitall()
if args.profile_server_suffix:
mx.profiler.set_state(state='run', profile_process='server')
if args.profile_worker_suffix:
mx.profiler.set_state(state='run', profile_process='worker')
report.set_total_duration(time.time() - start_time)
if args.report:
suffix = '-{}'.format(hvd.rank()) if 'horovod' in args.kv_store and hvd.rank() != 0 else ''
report.save(args.report + suffix)
logging.info('Experiment took: {} sec'.format(report.total_duration))
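Assuming the report was saved via --report report.json, a minimal sketch of reading it back (keys follow the Report format documented later in this commit; metric values are per-epoch lists):

import json
with open('report.json') as f:
    report = json.load(f)
print(report['model'], report['ngpus'])
print(report['metrics']['train.total_ips'][-1])   # throughput of the last epoch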


@ -0,0 +1,522 @@
# Copyright 2017-2018 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# -----------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import mxnet as mx
from mxnet.gluon.block import HybridBlock
from mxnet.gluon import nn
def add_model_args(parser):
model = parser.add_argument_group('Model')
model.add_argument('--arch', default='resnetv15',
choices=['resnetv1', 'resnetv15',
'resnextv1', 'resnextv15',
'xception'],
help='model architecture')
model.add_argument('--num-layers', type=int, default=50,
help='number of layers in the neural network, \
required by some networks such as resnet')
model.add_argument('--num-groups', type=int, default=32,
help='number of groups for grouped convolutions, \
required by some networks such as resnext')
model.add_argument('--num-classes', type=int, default=1000,
help='the number of classes')
model.add_argument('--batchnorm-eps', type=float, default=1e-5,
help='the amount added to the batchnorm variance to prevent output explosion.')
model.add_argument('--batchnorm-mom', type=float, default=0.9,
help='the leaky-integrator factor controlling the batchnorm mean and variance.')
model.add_argument('--fuse-bn-relu', type=int, default=0,
help='have batchnorm kernel perform activation relu')
model.add_argument('--fuse-bn-add-relu', type=int, default=0,
help='have batchnorm kernel perform add followed by activation relu')
return model
class Builder:
def __init__(self, dtype, input_layout, conv_layout, bn_layout,
pooling_layout, bn_eps, bn_mom, fuse_bn_relu, fuse_bn_add_relu):
self.dtype = dtype
self.input_layout = input_layout
self.conv_layout = conv_layout
self.bn_layout = bn_layout
self.pooling_layout = pooling_layout
self.bn_eps = bn_eps
self.bn_mom = bn_mom
self.fuse_bn_relu = fuse_bn_relu
self.fuse_bn_add_relu = fuse_bn_add_relu
self.act_type = 'relu'
self.bn_gamma_initializer = lambda last: 'zeros' if last else 'ones'
self.linear_initializer = lambda groups=1: mx.init.Xavier(rnd_type='gaussian', factor_type="in",
magnitude=2 * (groups ** 0.5))
self.last_layout = self.input_layout
def copy(self):
return copy.copy(self)
def batchnorm(self, last=False):
gamma_initializer = self.bn_gamma_initializer(last)
bn_axis = 3 if self.bn_layout == 'NHWC' else 1
return self.sequence(
self.transpose(self.bn_layout),
nn.BatchNorm(axis=bn_axis, momentum=self.bn_mom, epsilon=self.bn_eps,
gamma_initializer=gamma_initializer,
running_variance_initializer=gamma_initializer)
)
def batchnorm_add_relu(self, last=False):
gamma_initializer = self.bn_gamma_initializer(last)
if self.fuse_bn_add_relu:
bn_axis = 3 if self.bn_layout == 'NHWC' else 1
return self.sequence(
self.transpose(self.bn_layout),
BatchNormAddRelu(axis=bn_axis, momentum=self.bn_mom,
epsilon=self.bn_eps, act_type=self.act_type,
gamma_initializer=gamma_initializer,
running_variance_initializer=gamma_initializer)
)
return NonFusedBatchNormAddRelu(self, last=last)
def batchnorm_relu(self, last=False):
gamma_initializer = self.bn_gamma_initializer(last)
if self.fuse_bn_relu:
bn_axis = 3 if self.bn_layout == 'NHWC' else 1
return self.sequence(
self.transpose(self.bn_layout),
nn.BatchNorm(axis=bn_axis, momentum=self.bn_mom,
epsilon=self.bn_eps, act_type=self.act_type,
gamma_initializer=gamma_initializer,
running_variance_initializer=gamma_initializer)
)
return self.sequence(self.batchnorm(last=last), self.activation())
def activation(self):
return nn.Activation(self.act_type)
def global_avg_pool(self):
return self.sequence(
self.transpose(self.pooling_layout),
nn.GlobalAvgPool2D(layout=self.pooling_layout)
)
def max_pool(self, pool_size, strides=1, padding=True):
padding = pool_size // 2 if padding is True else int(padding)
return self.sequence(
self.transpose(self.pooling_layout),
nn.MaxPool2D(pool_size, strides=strides, padding=padding,
layout=self.pooling_layout)
)
def conv(self, channels, kernel_size, padding=True, strides=1, groups=1, in_channels=0):
padding = kernel_size // 2 if padding is True else int(padding)
initializer = self.linear_initializer(groups=groups)
return self.sequence(
self.transpose(self.conv_layout),
nn.Conv2D(channels, kernel_size=kernel_size, strides=strides,
padding=padding, use_bias=False, groups=groups,
in_channels=in_channels, layout=self.conv_layout,
weight_initializer=initializer)
)
def separable_conv(self, channels, kernel_size, in_channels, padding=True, strides=1):
return self.sequence(
self.conv(in_channels, kernel_size, padding=padding,
strides=strides, groups=in_channels, in_channels=in_channels),
self.conv(channels, 1, in_channels=in_channels)
)
def dense(self, units, in_units=0):
return nn.Dense(units, in_units=in_units,
weight_initializer=self.linear_initializer())
def transpose(self, to_layout):
if self.last_layout == to_layout:
return None
ret = Transpose(self.last_layout, to_layout)
self.last_layout = to_layout
return ret
def sequence(self, *seq):
seq = list(filter(lambda x: x is not None, seq))
if len(seq) == 1:
return seq[0]
ret = nn.HybridSequential()
ret.add(*seq)
return ret
class Transpose(HybridBlock):
def __init__(self, from_layout, to_layout):
super().__init__()
supported_layouts = ['NCHW', 'NHWC']
if from_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(from_layout))
if to_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(to_layout))
self.from_layout = from_layout
self.to_layout = to_layout
def hybrid_forward(self, F, x):
# Insert transpose if from_layout and to_layout don't match
if self.from_layout == 'NCHW' and self.to_layout == 'NHWC':
return F.transpose(x, axes=(0, 2, 3, 1))
elif self.from_layout == 'NHWC' and self.to_layout == 'NCHW':
return F.transpose(x, axes=(0, 3, 1, 2))
else:
return x
def __repr__(self):
s = '{name}({content})'
if self.from_layout == self.to_layout:
content = 'passthrough ' + self.from_layout
else:
content = self.from_layout + ' -> ' + self.to_layout
return s.format(name=self.__class__.__name__,
content=content)
class LayoutWrapper(HybridBlock):
def __init__(self, op, io_layout, op_layout, **kwargs):
super(LayoutWrapper, self).__init__(**kwargs)
with self.name_scope():
self.layout1 = Transpose(io_layout, op_layout)
self.op = op
self.layout2 = Transpose(op_layout, io_layout)
def hybrid_forward(self, F, *x):
return self.layout2(self.op(*(self.layout1(y) for y in x)))
class BatchNormAddRelu(nn.BatchNorm):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
if self._kwargs.pop('act_type') != 'relu':
raise ValueError('BatchNormAddRelu can be used only with ReLU as activation')
def hybrid_forward(self, F, x, y, gamma, beta, running_mean, running_var):
return F.BatchNormAddRelu(data=x, addend=y, gamma=gamma, beta=beta,
moving_mean=running_mean, moving_var=running_var, name='fwd', **self._kwargs)
class NonFusedBatchNormAddRelu(HybridBlock):
def __init__(self, builder, **kwargs):
super().__init__()
self.bn = builder.batchnorm(**kwargs)
self.act = builder.activation()
def hybrid_forward(self, F, x, y):
return self.act(self.bn(x) + y)
# Blocks
class ResNetBasicBlock(HybridBlock):
def __init__(self, builder, channels, stride, downsample=False, in_channels=0,
version='1', resnext_groups=None, **kwargs):
super().__init__()
assert not resnext_groups
self.transpose = builder.transpose(builder.conv_layout)
builder_copy = builder.copy()
body = [
builder.conv(channels, 3, strides=stride, in_channels=in_channels),
builder.batchnorm_relu(),
builder.conv(channels, 3),
]
self.body = builder.sequence(*body)
self.bn_add_relu = builder.batchnorm_add_relu(last=True)
builder = builder_copy
if downsample:
self.downsample = builder.sequence(
builder.conv(channels, 1, strides=stride, in_channels=in_channels),
builder.batchnorm()
)
else:
self.downsample = None
def hybrid_forward(self, F, x):
if self.transpose is not None:
x = self.transpose(x)
residual = x
x = self.body(x)
if self.downsample:
residual = self.downsample(residual)
x = self.bn_add_relu(x, residual)
return x
class ResNetBottleNeck(HybridBlock):
def __init__(self, builder, channels, stride, downsample=False, in_channels=0,
version='1', resnext_groups=None):
super().__init__()
stride1 = stride if version == '1' else 1
stride2 = 1 if version == '1' else stride
mult = 2 if resnext_groups else 1
groups = resnext_groups or 1
self.transpose = builder.transpose(builder.conv_layout)
builder_copy = builder.copy()
body = [
builder.conv(channels * mult // 4, 1, strides=stride1, in_channels=in_channels),
builder.batchnorm_relu(),
builder.conv(channels * mult // 4, 3, strides=stride2),
builder.batchnorm_relu(),
builder.conv(channels, 1)
]
self.body = builder.sequence(*body)
self.bn_add_relu = builder.batchnorm_add_relu(last=True)
builder = builder_copy
if downsample:
self.downsample = builder.sequence(
builder.conv(channels, 1, strides=stride, in_channels=in_channels),
builder.batchnorm()
)
else:
self.downsample = None
def hybrid_forward(self, F, x):
if self.transpose is not None:
x = self.transpose(x)
residual = x
x = self.body(x)
if self.downsample:
residual = self.downsample(residual)
x = self.bn_add_relu(x, residual)
return x
class XceptionBlock(HybridBlock):
def __init__(self, builder, definition, in_channels, relu_at_beginning=True):
super().__init__()
self.transpose = builder.transpose(builder.conv_layout)
builder_copy = builder.copy()
body = []
if relu_at_beginning:
body.append(builder.activation())
last_channels = in_channels
for channels1, channels2 in zip(definition, definition[1:] + [0]):
if channels1 > 0:
body.append(builder.separable_conv(channels1, 3, in_channels=last_channels))
if channels2 > 0:
body.append(builder.batchnorm_relu())
else:
body.append(builder.batchnorm(last=True))
last_channels = channels1
else:
body.append(builder.max_pool(3, 2))
self.body = builder.sequence(*body)
builder = builder_copy
if any(map(lambda x: x <= 0, definition)):
self.shortcut = builder.sequence(
builder.conv(last_channels, 1, strides=2, in_channels=in_channels),
builder.batchnorm(),
)
else:
self.shortcut = builder.sequence()
def hybrid_forward(self, F, x):
return self.shortcut(x) + self.body(x)
# Nets
class ResNet(HybridBlock):
def __init__(self, builder, block, layers, channels, classes=1000,
version='1', resnext_groups=None):
super().__init__()
assert len(layers) == len(channels) - 1
self.version = version
with self.name_scope():
features = [
builder.conv(channels[0], 7, strides=2),
builder.batchnorm_relu(),
builder.max_pool(3, 2),
]
for i, num_layer in enumerate(layers):
stride = 1 if i == 0 else 2
features.append(self.make_layer(builder, block, num_layer, channels[i+1],
stride, in_channels=channels[i],
resnext_groups=resnext_groups))
features.append(builder.global_avg_pool())
self.features = builder.sequence(*features)
self.output = builder.dense(classes, in_units=channels[-1])
def make_layer(self, builder, block, layers, channels, stride,
in_channels=0, resnext_groups=None):
layer = []
layer.append(block(builder, channels, stride, channels != in_channels,
in_channels=in_channels, version=self.version,
resnext_groups=resnext_groups))
for _ in range(layers-1):
layer.append(block(builder, channels, 1, False, in_channels=channels,
version=self.version, resnext_groups=resnext_groups))
return builder.sequence(*layer)
def hybrid_forward(self, F, x):
x = self.features(x)
x = self.output(x)
return x
class Xception(HybridBlock):
def __init__(self, builder,
definition=([32, 64],
[[128, 128, 0], [256, 256, 0], [728, 728, 0],
*([[728, 728, 728]] * 8), [728, 1024, 0]],
[1536, 2048]),
classes=1000):
super().__init__()
definition1, definition2, definition3 = definition
with self.name_scope():
features = []
last_channels = 0
for i, channels in enumerate(definition1):
features += [
builder.conv(channels, 3, strides=(2 if i == 0 else 1), in_channels=last_channels),
builder.batchnorm_relu(),
]
last_channels = channels
for i, block_definition in enumerate(definition2):
features.append(XceptionBlock(builder, block_definition, in_channels=last_channels,
relu_at_beginning=False if i == 0 else True))
last_channels = list(filter(lambda x: x > 0, block_definition))[-1]
for i, channels in enumerate(definition3):
features += [
builder.separable_conv(channels, 3, in_channels=last_channels),
builder.batchnorm_relu(),
]
last_channels = channels
features.append(builder.global_avg_pool())
self.features = builder.sequence(*features)
self.output = builder.dense(classes, in_units=last_channels)
def hybrid_forward(self, F, x):
x = self.features(x)
x = self.output(x)
return x
resnet_spec = {18: (ResNetBasicBlock, [2, 2, 2, 2], [64, 64, 128, 256, 512]),
34: (ResNetBasicBlock, [3, 4, 6, 3], [64, 64, 128, 256, 512]),
50: (ResNetBottleNeck, [3, 4, 6, 3], [64, 256, 512, 1024, 2048]),
101: (ResNetBottleNeck, [3, 4, 23, 3], [64, 256, 512, 1024, 2048]),
152: (ResNetBottleNeck, [3, 8, 36, 3], [64, 256, 512, 1024, 2048])}
def create_resnet(builder, version, num_layers=50, resnext=False, classes=1000,
                  num_groups=32):
    assert num_layers in resnet_spec, \
        "Invalid number of layers: {}. Options are {}".format(
            num_layers, str(resnet_spec.keys()))
    block_class, layers, channels = resnet_spec[num_layers]
    assert not resnext or num_layers >= 50, \
        "Cannot create a ResNeXt with fewer than 50 layers"
    net = ResNet(builder, block_class, layers, channels, version=version,
                 resnext_groups=num_groups if resnext else None)
    return net
class fp16_model(mx.gluon.block.HybridBlock):
def __init__(self, net, **kwargs):
super(fp16_model, self).__init__(**kwargs)
with self.name_scope():
self._net = net
def hybrid_forward(self, F, x):
y = self._net(x)
y = F.cast(y, dtype='float32')
return y
def get_model(arch, num_classes, num_layers, image_shape, dtype, amp,
input_layout, conv_layout, batchnorm_layout, pooling_layout,
batchnorm_eps, batchnorm_mom, fuse_bn_relu, fuse_bn_add_relu, **kwargs):
builder = Builder(
dtype = dtype,
input_layout = input_layout,
conv_layout = conv_layout,
bn_layout = batchnorm_layout,
pooling_layout = pooling_layout,
bn_eps = batchnorm_eps,
bn_mom = batchnorm_mom,
fuse_bn_relu = fuse_bn_relu,
fuse_bn_add_relu = fuse_bn_add_relu,
)
if arch.startswith('resnet') or arch.startswith('resnext'):
version = '1' if arch in {'resnetv1', 'resnextv1'} else '1.5'
net = create_resnet(
    builder = builder,
    version = version,
    resnext = arch.startswith('resnext'),
    num_layers = num_layers,
    classes = num_classes,
    num_groups = kwargs.get('num_groups', 32),
)
elif arch == 'xception':
net = Xception(builder, classes=num_classes)
else:
raise ValueError('Wrong model architecture')
net.hybridize(static_shape=True, static_alloc=True)
if not amp:
net.cast(dtype)
if dtype == 'float16':
net = fp16_model(net)
return net
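A hedged usage sketch for get_model above; the argument names mirror its signature and the values are illustrative (the NHWC layouts and fused kernels match the FP16 path used elsewhere in this commit):

net = get_model(arch='resnetv15', num_classes=1000, num_layers=50,
                image_shape='3,224,224', dtype='float16', amp=False,
                input_layout='NCHW', conv_layout='NHWC',
                batchnorm_layout='NHWC', pooling_layout='NHWC',
                batchnorm_eps=1e-5, batchnorm_mom=0.9,
                fuse_bn_relu=1, fuse_bn_add_relu=1)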


@ -21,15 +21,21 @@
# - "metrics" : per epoch metrics for train and validation
# (some of below metrics may not exist in the report,
# depending on application arguments)
# - "train.top1" : training top1 accuracy in epoch.
# - "train.top5" : training top5 accuracy in epoch.
# - "train.loss" : training loss in epoch.
# - "train.time" : average training time of iteration in seconds.
# - "train.total_ips" : training speed (data and compute time taken into account) for epoch in images/sec.
# - "val.top1", "val.top5", "val.loss", "val.time", "val.total_ips" : the same but for validation.
# - "train.top1" : training top1 accuracy in epoch.
# - "train.top5" : training top5 accuracy in epoch.
# - "train.loss" : training loss in epoch.
# - "train.total_ips" : training speed (data and compute time taken into account) for epoch in images/sec.
# - "train.latency_avg" : average latency of one iteration in seconds.
# - "train.latency_50" : median latency of one iteration in seconds.
# - "train.latency_90" : 90th percentile latency of one iteration in seconds.
# - "train.latency_95" : 95th percentile latency of one iteration in seconds.
# - "train.latency_99" : 99th percentile latency of one iteration in seconds.
# - "train.latency_100" : highest observed latency of one iteration in seconds.
# - "val.top1", "val.top5", "val.time", "val.total_ips", "val.latency_avg", "val.latency_50",
# "val.latency_90", "val.latency_95", "val.latency_99", "val.latency_100" : the same but for validation.
import json
from collections import defaultdict, OrderedDict
from collections import OrderedDict
class Report:
def __init__(self, model_name, ngpus, cmd):
@ -37,15 +43,21 @@ class Report:
self.ngpus = ngpus
self.cmd = cmd
self.total_duration = 0
self.metrics = defaultdict(lambda: [])
self.metrics = OrderedDict()
def add_value(self, metric, value):
if metric not in self.metrics:
self.metrics[metric] = []
self.metrics[metric].append(value)
def set_total_duration(self, duration):
self.total_duration = duration
def save(self, filename):
with open(filename, 'w') as f:
f.write(self.get_report())
def get_report(self):
report = OrderedDict([
('model', self.model_name),
('ngpus', self.ngpus),
@ -53,5 +65,4 @@ class Report:
('cmd', self.cmd),
('metrics', self.metrics),
])
with open(filename, 'w') as f:
json.dump(report, f, indent=4)
return json.dumps(report, indent=4)
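The latency_* metrics above are percentiles over per-iteration latencies; a minimal sketch of how such percentiles can be computed (illustrative, not necessarily the exact code used):

import numpy as np
latencies = [0.051, 0.049, 0.052, 0.300, 0.050]   # made-up per-iteration seconds
for p in (50, 90, 95, 99):
    print('latency_{}: {:.3f}'.format(p, np.percentile(latencies, p)))
print('latency_avg: {:.3f}'.format(np.mean(latencies)))
print('latency_100: {:.3f}'.format(max(latencies)))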


@ -1,376 +0,0 @@
# Copyright 2017-2018 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# -----------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Adapted from https://github.com/tornadomeet/ResNet/blob/master/symbol_resnet.py
(Original author Wei Wu) by Antti-Pekka Hynninen
"Flexible Layout" (fl) version created by Dick Carter.
Implementing the original resnet ILSVRC 2015 winning network from:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Deep Residual Learning for Image Recognition"
'''
import mxnet as mx
import numpy as np
import random
# Transform a symbol from one layout to another, or do nothing if they have the same layout
def transform_layout(data, from_layout, to_layout):
supported_layouts = ['NCHW', 'NHWC']
if from_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(from_layout))
if to_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(to_layout))
# Insert transpose if from_layout and to_layout don't match
if from_layout == 'NCHW' and to_layout == 'NHWC':
return mx.sym.transpose(data, axes=(0, 2, 3, 1))
elif from_layout == 'NHWC' and to_layout == 'NCHW':
return mx.sym.transpose(data, axes=(0, 3, 1, 2))
else:
return data
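A quick self-contained illustration of the axis orders involved (NumPy, same axes arguments as above):

import numpy as np
x = np.zeros((2, 3, 224, 224))        # NCHW: batch, channels, height, width
y = np.transpose(x, (0, 2, 3, 1))     # NHWC, matching axes=(0, 2, 3, 1) above
assert y.shape == (2, 224, 224, 3)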
# A BatchNorm wrapper that responds to the input layout
def batchnorm(data, io_layout, batchnorm_layout, **kwargs):
# Transpose as needed to batchnorm_layout
transposed_as_needed = transform_layout(data, io_layout, batchnorm_layout)
bn_axis = 3 if batchnorm_layout == 'NHWC' else 1
batchnormed = mx.sym.BatchNorm(data=transposed_as_needed, axis=bn_axis, **kwargs)
# Transpose back to i/o layout as needed
return transform_layout(batchnormed, batchnorm_layout, io_layout)
# A BatchNormAddRelu wrapper that responds to the input layout
def batchnorm_add_relu(data, addend, io_layout, batchnorm_layout, **kwargs):
# Transpose as needed to batchnorm_layout
transposed_data_as_needed = transform_layout(data, io_layout, batchnorm_layout)
transposed_addend_as_needed = transform_layout(addend, io_layout, batchnorm_layout)
bn_axis = 3 if batchnorm_layout == 'NHWC' else 1
batchnormed = mx.sym.BatchNormAddRelu(data=transposed_data_as_needed,
addend=transposed_addend_as_needed,
axis=bn_axis, **kwargs)
# Transpose back to i/o layout as needed
return transform_layout(batchnormed, batchnorm_layout, io_layout)
# A Pooling wrapper that responds to the input layout
def pooling(data, io_layout, pooling_layout, **kwargs):
# Pooling kernel, as specified by pooling_layout, may be in conflict with i/o layout.
transposed_as_needed = transform_layout(data, io_layout, pooling_layout)
pooled = mx.sym.Pooling(data=transposed_as_needed, layout=pooling_layout, **kwargs)
# Transpose back to i/o layout as needed
return transform_layout(pooled, pooling_layout, io_layout)
# Assumption is that data comes in and out in the 'conv_layout' format.
# If this format is different from the 'batchnorm_layout' format, then the batchnorm() routine
# will introduce transposes on both sides of the mx.sym.BatchNorm symbol
def residual_unit(data, num_filter, stride, dim_match, name, bottle_neck=True,
workspace=256, memonger=False, conv_layout='NCHW', batchnorm_layout='NCHW',
verbose=False, cudnn_bn_off=False, bn_eps=2e-5, bn_mom=0.9, conv_algo=-1,
fuse_bn_relu=False, fuse_bn_add_relu=False, cudnn_tensor_core_only=False):
"""Return ResNet Unit symbol for building ResNet
Parameters
----------
data : str
Input data
num_filter : int
Number of output channels
stride : tuple
Stride used in convolution
dim_match : Boolean
True means channel number between input and output is the same, otherwise means differ
name : str
Base name of the operators
workspace : int
Workspace used in convolution operator
"""
act = 'relu' if fuse_bn_relu else None
if bottle_neck:
conv1 = mx.sym.Convolution(data=data, num_filter=int(num_filter*0.25), kernel=(1,1), stride=(1,1), pad=(0,0),
no_bias=True, workspace=workspace, name=name + '_conv1', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
bn1 = batchnorm(data=conv1, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn1', cudnn_off=cudnn_bn_off, act_type=act)
act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1') if not fuse_bn_relu else bn1
conv2 = mx.sym.Convolution(data=act1, num_filter=int(num_filter*0.25), kernel=(3,3), stride=stride, pad=(1,1),
no_bias=True, workspace=workspace, name=name + '_conv2', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
bn2 = batchnorm(data=conv2, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn2', cudnn_off=cudnn_bn_off, act_type=act)
act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2') if not fuse_bn_relu else bn2
conv3 = mx.sym.Convolution(data=act2, num_filter=num_filter, kernel=(1,1), stride=(1,1), pad=(0,0), no_bias=True,
workspace=workspace, name=name + '_conv3', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
if dim_match:
shortcut = data
else:
conv1sc = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=(1,1), stride=stride, no_bias=True,
workspace=workspace, name=name+'_conv1sc', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
shortcut = batchnorm(data=conv1sc, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_sc', cudnn_off=cudnn_bn_off)
if memonger:
shortcut._set_attr(mirror_stage='True')
if fuse_bn_add_relu:
return batchnorm_add_relu(data=conv3, addend=shortcut, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn3', cudnn_off=cudnn_bn_off)
else:
bn3 = batchnorm(data=conv3, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn3', cudnn_off=cudnn_bn_off)
return mx.sym.Activation(data=bn3 + shortcut, act_type='relu', name=name + '_relu3')
else:
conv1 = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=(3,3), stride=stride, pad=(1,1),
no_bias=True, workspace=workspace, name=name + '_conv1', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
bn1 = batchnorm(data=conv1, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_bn1', cudnn_off=cudnn_bn_off, act_type=act)
act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1') if not fuse_bn_relu else bn1
conv2 = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(3,3), stride=(1,1), pad=(1,1),
no_bias=True, workspace=workspace, name=name + '_conv2', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
if dim_match:
shortcut = data
else:
conv1sc = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=(1,1), stride=stride, no_bias=True,
workspace=workspace, name=name+'_conv1sc', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
shortcut = batchnorm(data=conv1sc, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_sc', cudnn_off=cudnn_bn_off)
if memonger:
shortcut._set_attr(mirror_stage='True')
if fuse_bn_add_relu:
return batchnorm_add_relu(data=conv2, addend=shortcut, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_bn2', cudnn_off=cudnn_bn_off)
else:
bn2 = batchnorm(data=conv2, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_bn2', cudnn_off=cudnn_bn_off)
return mx.sym.Activation(data=bn2 + shortcut, act_type='relu', name=name + '_relu2')
def resnet(units, num_stages, filter_list, num_classes, image_shape, bottle_neck=True, workspace=256, dtype='float32', memonger=False,
input_layout='NCHW', conv_layout='NCHW', batchnorm_layout='NCHW', pooling_layout='NCHW', verbose=False,
cudnn_bn_off=False, bn_eps=2e-5, bn_mom=0.9, conv_algo=-1,
fuse_bn_relu=False, fuse_bn_add_relu=False, force_tensor_core=False, use_dali=True):
"""Return ResNet symbol of
Parameters
----------
units : list
Number of units in each stage
num_stages : int
Number of stage
filter_list : list
Channel size of each stage
num_classes : int
Output size of the symbol
workspace : int
Workspace used in convolution operator
dtype : str
Precision (float32 or float16)
memonger : boolean
Activates "memory monger" to reduce the model's memory footprint
input_layout : str
interpretation (e.g. NCHW vs NHWC) of data provided by the i/o pipeline (may introduce transposes
if in conflict with 'conv_layout' below)
conv_layout : str
interpretation (e.g. NCHW vs NHWC) of data for convolution operation.
batchnorm_layout : str
directs which kernel performs the batchnorm (may introduce transposes if in conflict with 'conv_layout' above)
pooling_layout : str
directs which kernel performs the pooling (may introduce transposes if in conflict with 'conv_layout' above)
"""
act = 'relu' if fuse_bn_relu else None
num_unit = len(units)
assert(num_unit == num_stages)
data = mx.sym.Variable(name='data')
if not use_dali:
# double buffering of data
if dtype == 'float32':
data = mx.sym.identity(data=data, name='id')
else:
if dtype == 'float16':
data = mx.sym.Cast(data=data, dtype=np.float16)
(nchannel, height, width) = image_shape
# Insert transpose as needed to get the input layout to match the desired processing layout
data = transform_layout(data, input_layout, conv_layout)
if height <= 32: # such as cifar10
body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(3, 3), stride=(1,1), pad=(1, 1),
no_bias=True, name="conv0", workspace=workspace, layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=force_tensor_core)
# Is this BatchNorm supposed to be here?
body = batchnorm(data=body, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name='bn0', cudnn_off=cudnn_bn_off)
else: # often expected to be 224 such as imagenet
body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(7, 7), stride=(2,2), pad=(3, 3),
no_bias=True, name="conv0", workspace=workspace, layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=force_tensor_core)
body = batchnorm(data=body, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name='bn0', cudnn_off=cudnn_bn_off, act_type=act)
if not fuse_bn_relu:
body = mx.sym.Activation(data=body, act_type='relu', name='relu0')
body = pooling(data=body, io_layout=conv_layout, pooling_layout=pooling_layout,
kernel=(3, 3), stride=(2, 2), pad=(1, 1), pool_type='max')
for i in range(num_stages):
body = residual_unit(body, filter_list[i+1], (1 if i==0 else 2, 1 if i==0 else 2), False,
name='stage%d_unit%d' % (i + 1, 1),
bottle_neck=bottle_neck, workspace=workspace,
memonger=memonger, conv_layout=conv_layout, batchnorm_layout=batchnorm_layout,
verbose=verbose, cudnn_bn_off=cudnn_bn_off, bn_eps=bn_eps, bn_mom=bn_mom,
conv_algo=conv_algo, fuse_bn_relu=fuse_bn_relu, fuse_bn_add_relu=fuse_bn_add_relu,
cudnn_tensor_core_only=force_tensor_core)
for j in range(units[i]-1):
body = residual_unit(body, filter_list[i+1], (1,1), True, name='stage%d_unit%d' % (i + 1, j + 2),
bottle_neck=bottle_neck, workspace=workspace,
memonger=memonger, conv_layout=conv_layout, batchnorm_layout=batchnorm_layout,
verbose=verbose, cudnn_bn_off=cudnn_bn_off, bn_eps = bn_eps, bn_mom=bn_mom,
conv_algo=conv_algo, fuse_bn_relu=fuse_bn_relu, fuse_bn_add_relu=fuse_bn_add_relu,
cudnn_tensor_core_only=force_tensor_core)
# bn1 = mx.sym.BatchNorm(data=body, fix_gamma=False, eps=2e-5, momentum=bn_mom, name='bn1')
# relu1 = mx.sym.Activation(data=bn1, act_type='relu', name='relu1')
# Although the kernel size is not used when global_pool=True, one must still be provided
pool1 = pooling(data=body, io_layout=conv_layout, pooling_layout=pooling_layout,
global_pool=True, kernel=(7, 7), pool_type='avg', name='pool1')
flat = mx.sym.Flatten(data=pool1)
fc1 = mx.sym.FullyConnected(data=flat, num_hidden=num_classes, name='fc1', cublas_algo_verbose=verbose)
if dtype == 'float16':
fc1 = mx.sym.Cast(data=fc1, dtype=np.float32)
return mx.sym.SoftmaxOutput(data=fc1, name='softmax')
def get_symbol(num_classes, num_layers, image_shape, conv_workspace=256, dtype='float32',
input_layout='NCHW', conv_layout='NCHW', batchnorm_layout='NCHW', pooling_layout='NCHW',
verbose=False, seed=None, cudnn_bn_off=False, batchnorm_eps=2e-5, batchnorm_mom=0.9,
conv_algo=-1, fuse_bn_relu=False, fuse_bn_add_relu=False, force_tensor_core=False, use_dali=True, **kwargs):
"""
Adapted from https://github.com/tornadomeet/ResNet/blob/master/symbol_resnet.py
(Original author Wei Wu) by Antti-Pekka Hynninen
Implementing the original resnet ILSVRC 2015 winning network from:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Deep Residual Learning for Image Recognition"
"""
if seed is not None:
print('Setting seeds to %s' % (seed,))
random.seed(seed)
np.random.seed(seed)
mx.random.seed(seed)
image_shape = [int(l) for l in image_shape.split(',')]
(nchannel, height, width) = image_shape
if height <= 28:
num_stages = 3
if (num_layers-2) % 9 == 0 and num_layers >= 164:
per_unit = [(num_layers-2)//9]
filter_list = [16, 64, 128, 256]
bottle_neck = True
elif (num_layers-2) % 6 == 0 and num_layers < 164:
per_unit = [(num_layers-2)//6]
filter_list = [16, 16, 32, 64]
bottle_neck = False
else:
raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
units = per_unit * num_stages
else:
if num_layers >= 50:
filter_list = [64, 256, 512, 1024, 2048]
bottle_neck = True
else:
filter_list = [64, 64, 128, 256, 512]
bottle_neck = False
num_stages = 4
if num_layers == 18:
units = [2, 2, 2, 2]
elif num_layers == 34:
units = [3, 4, 6, 3]
elif num_layers == 50:
units = [3, 4, 6, 3]
elif num_layers == 101:
units = [3, 4, 23, 3]
elif num_layers == 152:
units = [3, 8, 36, 3]
elif num_layers == 200:
units = [3, 24, 36, 3]
elif num_layers == 269:
units = [3, 30, 48, 8]
else:
raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
return resnet(units = units,
num_stages = num_stages,
filter_list = filter_list,
num_classes = num_classes,
image_shape = image_shape,
bottle_neck = bottle_neck,
workspace = conv_workspace,
dtype = dtype,
input_layout = input_layout,
conv_layout = conv_layout,
batchnorm_layout = batchnorm_layout,
pooling_layout = pooling_layout,
verbose = verbose,
cudnn_bn_off = cudnn_bn_off,
bn_eps = batchnorm_eps,
bn_mom = batchnorm_mom,
conv_algo = conv_algo,
fuse_bn_relu = fuse_bn_relu,
fuse_bn_add_relu = fuse_bn_add_relu,
force_tensor_core = force_tensor_core,
use_dali = use_dali)
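For reference, a hedged sketch of how this (now removed) symbolic network was built; only keyword arguments from the get_symbol signature above are used:

sym = get_symbol(num_classes=1000, num_layers=50,
                 image_shape='3,224,224', dtype='float16')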


@ -14,77 +14,56 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import os, socket
from argparse import ArgumentParser
import warnings
import os
import argparse
from pathlib import Path
optparser = ArgumentParser(description="train resnet50 with MXNet")
optparser.add_argument("-n", "--n-GPUs", type=int, default=8, help="number of GPUs to use; " +\
"default = 8")
optparser.add_argument("-b", "--batch-size", type=int, default=208, help="batch size per GPU; " +\
"default = 208")
optparser.add_argument("-e", "--num-epochs", type=int, default=90, help="number of epochs; " +\
"default = 90")
optparser.add_argument("-l", "--lr", type=float, default=0.1, help="learning rate; default = 0.1; " +\
"IMPORTANT: true learning rate will be calculated as `lr * batch_size/256`")
optparser.add_argument("--no-val", action="store_true",
help="if set no validation will be performed")
optparser.add_argument("--no-dali", action="store_true", default=False,
help="use default MXNet pipeline instead of DALI")
optparser.add_argument("--data-root", type=str, help="Directory with RecordIO data files", default="/data/imagenet/train-val-recordio-passthrough")
optparser.add_argument("--data-nthreads", type=int, help="number of threads for data loading; default = 40", default=40)
optparser.add_argument("--dtype", type=str, help="Precision, float16 or float32", default="float16")
optparser = argparse.ArgumentParser(description='Train classification models on ImageNet',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
optparser.add_argument('-n', '--ngpus', type=int, default=1, help='number of GPUs to use')
optparser.add_argument('-b', '--batch-size', type=int, default=192, help='batch size per GPU')
optparser.add_argument('-e', '--num-epochs', type=int, default=90, help='number of epochs')
optparser.add_argument('-l', '--lr', type=float, default=0.256, help='learning rate; '
'IMPORTANT: true learning rate will be calculated as `lr * batch_size / 256`')
optparser.add_argument('--data-root', type=Path, help='Directory with RecordIO data files', default=Path('/data/imagenet/train-val-recordio-passthrough'))
optparser.add_argument('--dtype', help='Precision', default='float16', choices=('float32', 'float16'))
optparser.add_argument('--kv-store', default='horovod', choices=('device', 'horovod'), help='key-value store type')
optparser.add_argument('--data-backend', default='dali-gpu', choices=('dali-gpu', 'dali-cpu', 'mxnet', 'synthetic'), help='data backend')
opts, args = optparser.parse_known_args()
if opts.dtype == "float16":
n_ch = str(4 - int(opts.no_dali))
if opts.dtype == 'float16':
n_ch = str(4 - int(opts.data_backend == 'mxnet'))
else:
n_ch = str(3)
opts.batch_size *= opts.ngpus
opts.lr *= opts.batch_size / 256
command = ""
command += "python "+os.path.dirname(__file__)+"/train.py"
command += " --num-layers 50"
command += " --data-train " + opts.data_root + "/train.rec"
command += " --data-train-idx " + opts.data_root + "/train.idx"
if not opts.no_val:
command += " --data-val " + opts.data_root + "/val.rec"
command += " --data-val-idx " + opts.data_root + "/val.idx"
command += " --data-nthreads " + str(opts.data_nthreads)
command += " --optimizer sgd --dtype " + opts.dtype
command += " --lr-step-epochs 30,60,80 --max-random-area 1"
command += " --min-random-area 0.05 --max-random-scale 1"
command += " --min-random-scale 1 --min-random-aspect-ratio 0.75"
command += " --max-random-aspect-ratio 1.33 --max-random-shear-ratio 0"
command += " --max-random-rotate-angle 0 --random-resized-crop 1"
command += " --random-crop 0 --random-mirror 1"
command += " --image-shape "+n_ch+",224,224 --warmup-epochs 5"
command += " --disp-batches 20"
command += " --batchnorm-mom 0.9 --batchnorm-eps 1e-5"
command = []
if 'horovod' in opts.kv_store:
command += ['horovodrun', '-np', str(opts.ngpus)]
command += ['python', str(Path(__file__).parent / "train.py")]
command += ['--data-train', str(opts.data_root / "train.rec")]
command += ['--data-train-idx', str(opts.data_root / "train.idx")]
command += ['--data-val', str(opts.data_root / "val.rec")]
command += ['--data-val-idx', str(opts.data_root / "val.idx")]
command += ['--dtype', opts.dtype]
command += ['--image-shape', n_ch + ',224,224']
if opts.dtype == 'float16':
command += " --fuse-bn-relu 1"
command += " --input-layout NHWC --conv-layout NHWC"
command += " --batchnorm-layout NHWC --pooling-layout NHWC"
command += " --conv-algo 1 --force-tensor-core 1"
command += " --fuse-bn-add-relu 1"
command += '--fuse-bn-relu 1 --fuse-bn-add-relu 1'.split()
command += '--input-layout NCHW --conv-layout NHWC ' \
'--batchnorm-layout NHWC --pooling-layout NHWC'.split()
command += " --kv-store device"
if not opts.no_dali:
command += " --use-dali"
command += " --dali-prefetch-queue 2 --dali-nvjpeg-memory-padding 64"
command += " --lr "+str(opts.lr)
command += " --gpus " + str(list(range(opts.n_GPUs))).replace(' ', '').replace('[', '').replace(']', '')
command += " --batch-size " + str(opts.batch_size)
command += " --num-epochs " + str(opts.num_epochs)
command += ['--kv-store', opts.kv_store]
command += ['--data-backend', opts.data_backend]
command += ['--lr', str(opts.lr)]
command += ['--gpus', ','.join(list(map(str, range(opts.ngpus))))]
command += ['--batch-size', str(opts.batch_size)]
command += ['--num-epochs', str(opts.num_epochs)]
command += args
for arg in args:
command += " " + arg
os.environ['MXNET_UPDATE_ON_KVSTORE'] = "0"
os.environ['MXNET_EXEC_ENABLE_ADDTO'] = "1"
@ -92,5 +71,11 @@ os.environ['MXNET_USE_TENSORRT'] = "0"
os.environ['MXNET_GPU_WORKER_NTHREADS'] = "2"
os.environ['MXNET_GPU_COPY_NTHREADS'] = "1"
os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = "54"
os.environ['HOROVOD_CYCLE_TIME'] = "0.1"
os.environ['HOROVOD_FUSION_THRESHOLD'] = "67108864"
os.environ['HOROVOD_NUM_NCCL_STREAMS'] = "2"
os.environ['MXNET_HOROVOD_NUM_GROUPS'] = "16"
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD'] = "999"
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD'] = "25"
exit(os.system('/bin/bash -c "'+command+'"'))
os.execvp(command[0], command)
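A hedged invocation sketch for this launcher (the file name runner.py is an assumption; the flags are the ones it defines above, and it prepends horovodrun itself when --kv-store is horovod):

python runner.py -n 8 -b 192 -e 90 --dtype float16 --kv-store horovod --data-backend dali-gpu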


@ -1,3 +1,4 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -12,8 +13,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.
if [ $# -lt 2 ] ; then
echo "usage: $0 raw_dataset prepared_dataset"
exit 1
fi
# This script launches the ResNet50 inference benchmark in FP32 on 1 GPU with batch sizes 1,2,4,32,64,96
# Usage: ./INFER_BENCHMARK_FP32.sh <additional flags>
python benchmark.py -n 1 -b 1,2,4,32,64,96 --only-inference -e 3 -w 1 -i 100 -o report.json $@
cd "$2" &&
python /opt/mxnet/tools/im2rec.py --list --recursive train "$1/train" &&
python /opt/mxnet/tools/im2rec.py --list --recursive val "$1/val" &&
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 train "$1/train" &&
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 val "$1/val" &&
echo "Dataset was prepared succesfully!"


@ -34,58 +34,37 @@
# limitations under the License.
import os
import sys
import argparse
import logging
logging.basicConfig(level=logging.DEBUG)
import data, dali, fit
import mxnet as mx
import numpy as np
def set_imagenet_aug(aug):
# standard data augmentation setting for imagenet training
aug.set_defaults(rgb_mean='123.68,116.779,103.939', rgb_std='58.393,57.12,57.375')
aug.set_defaults(random_crop=0, random_resized_crop=1, random_mirror=1)
aug.set_defaults(min_random_area=0.08)
aug.set_defaults(max_random_aspect_ratio=4./3., min_random_aspect_ratio=3./4.)
aug.set_defaults(brightness=0.4, contrast=0.4, saturation=0.4, pca_noise=0.1)
import data, dali
import fit
import models
if __name__ == '__main__':
# parse args
parser = argparse.ArgumentParser(description="train resnet on imagenet",
def parse_args():
parser = argparse.ArgumentParser(description="Train classification models on ImageNet",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
models.add_model_args(parser)
fit.add_fit_args(parser)
data.add_data_args(parser)
dali.add_dali_args(parser)
data.add_data_aug_args(parser)
# Instead, to get standard resnet augmentation on a per-use basis, invoke as in:
# train_imagenet.py --set-resnet-aug ...
# Finally, to get the legacy MXNet v1.2 training settings on a per-use basis, invoke as in:
# train_imagenet.py --set-data-aug-level 3
parser.set_defaults(
# network
num_layers = 50,
return parser.parse_args()
# data
resize = 256,
num_classes = 1000,
num_examples = 1281167,
image_shape = '3,224,224',
min_random_scale = 1, # if input image has min size k, suggest to use
# 256.0/x, e.g. 0.533 for 480
# train
num_epochs = 90,
lr_step_epochs = '30,60,80',
dtype = 'float32'
)
args = parser.parse_args()
def setup_logging(args):
head = '{asctime}:{levelname}: {message}'
logging.basicConfig(level=logging.DEBUG, format=head, style='{',
handlers=[logging.StreamHandler(sys.stderr), logging.FileHandler(args.log)])
logging.info('Start with arguments {}'.format(args))
if not args.use_dali:
data.set_data_aug_level(parser, 0)
if __name__ == '__main__':
args = parse_args()
setup_logging(args)
# load network
import resnet as net
sym = net.get_symbol(**vars(args))
model = models.get_model(**vars(args))
data_loader = data.get_data_loader(args)
# train
fit.fit(args, sym, dali.get_rec_iter)
fit.fit(args, model, data_loader)


@ -1,6 +1,6 @@
/******************************************************************************
*
* Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.


@ -1,6 +1,6 @@
/******************************************************************************
*
* Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.


@ -1,6 +1,6 @@
/******************************************************************************
*
* Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import skimage


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import numpy as np


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
from argparse import ArgumentParser


@ -1,6 +1,6 @@
#!/usr/bin/env python
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.

PyTorch/Detection/SSD/src/coco.py Executable file → Normal file

@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
__author__ = 'tylin'
__version__ = '2.0'
# Interface for accessing the Microsoft COCO dataset.


@ -1,4 +1,4 @@
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import torch
@ -18,7 +32,7 @@ def get_train_loader(args, local_seed):
output_fp16=args.amp, output_nhwc=False,
pad_output=False, seed=local_seed)
train_pipe.build()
test_run = train_pipe.run()
test_run = train_pipe.schedule_run(), train_pipe.share_outputs(), train_pipe.release_outputs()
train_loader = DALICOCOIterator(train_pipe, 118287 / args.N_gpu)
return train_loader


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import time
import numpy as np


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from torchvision.models.resnet import resnet18, resnet34, resnet50, resnet101, resnet152


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from torch.autograd import Variable
import torch
import time


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torchvision.transforms as transforms
import torch.utils.data as data