Updating RN50/MxNet

This commit is contained in:
Przemek Strzelczyk 2019-10-21 19:20:40 +02:00
parent f2fe0904cf
commit e470c2150a
47 changed files with 3226 additions and 1503 deletions


@@ -0,0 +1,3 @@
FROM nvcr.io/nvidia/mxnet:19.07-py3
COPY . /workspace/rn50
WORKDIR /workspace/rn50


@@ -1,3 +1,4 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/


@@ -1,6 +1,46 @@
# ResNet50 v1.5 for MXNet
This repository provides a script and recipe to train the ResNet50 v1.5 model to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
- [Model overview](#model-overview)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
* [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32g)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX-1 (8x V100 16G)](#inference-performance-nvidia-dgx-1-8x-v100-16g)
* [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
@@ -9,96 +49,448 @@
The difference between v1 and v1.5 is in the bottleneck blocks which require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.
This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 3.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Default configuration
**Optimizer:**
* SGD with momentum (0.875)
* Learning rate = 0.256 for 256 batch size; for other batch sizes we linearly scale the learning rate.
* Learning rate schedule -- we use cosine LR schedule (see the sketch after this list)
* Linear warmup of the learning rate during first 5 epochs according to [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677).
* Weight decay: 3.0517578125e-05 (1/32768).
* We do not apply WD on Batch Norm trainable parameters (gamma/bias).
* Label smoothing: 0.1
* We train for:
  * 50 epochs -> configuration that reaches 75.9% top1 accuracy
  * 90 epochs -> the standard configuration for ResNet50
  * 250 epochs -> best possible accuracy. For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
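A minimal sketch of the schedule described above (linear batch-size scaling, 5-epoch linear warmup, cosine decay). The function below is illustrative, not the repository's implementation:
```python
import math

def learning_rate(epoch, batch_size, base_lr=0.256, warmup_epochs=5, total_epochs=90):
    """LR at a given (fractional) epoch: linear warmup, then cosine decay."""
    scaled_lr = base_lr * batch_size / 256          # linear scaling rule
    if epoch < warmup_epochs:
        return scaled_lr * epoch / warmup_epochs    # linear warmup
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return scaled_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(learning_rate(5.0, 192))   # peak LR: 0.256 * 192 / 256 = 0.192
print(learning_rate(90.0, 192))  # ~0.0 at the end of training
```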
**Data augmentation:**
This model uses the following data augmentation:
* For training:
  * Normalization
  * Random resized crop to 224x224
    * Scale from 8% to 100%
    * Aspect ratio from 3/4 to 4/3
  * Random horizontal flip
* For inference:
  * Normalization
  * Scale to 256x256
  * Center crop to 224x224
### Feature support matrix
| **Feature** | **ResNet50 MXNet** |
|:---:|:--------:|
|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)|yes|
|Horovod Multi-GPU|yes|
#### Features
The following features are supported by this model.
NVIDIA DALI - NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications.
Horovod Multi-GPU - Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
#### Enabling mixed precision
Using the Gluon API, ensure you perform the following steps to convert a model that supports computation with float16.
1. Cast Gluon Blocks parameters and expected input type to float16 by calling the cast method of the Block representing the network.
```python
net = net.cast('float16')
```
2. Ensure the data input to the network is of float16 type. If your DataLoader or Iterator produces output in another datatype, then you have to cast your data. There are different ways you can do this. The easiest way is to use the `astype` method of NDArrays.
```python
data = data.astype('float16', copy=False)
```
3. If you are using images and DataLoader, you can also use a Cast transform. It is preferable to use multi_precision mode of optimizer when training in float16. This mode of optimizer maintains a master copy of the weights in float32 even when the training (forward and backward pass) is in float16. This helps increase precision of the weight updates and can lead to faster convergence in some scenarios.
```python
optimizer = mx.optimizer.create('sgd', multi_precision=True, lr=0.01)
```
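Taken together, the three steps look roughly like the following standalone sketch. It uses a model-zoo ResNet50 and synthetic data for brevity; the network, shapes and hyperparameters are placeholders rather than this repository's settings:
```python
import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu(0)
net = gluon.model_zoo.vision.resnet50_v2(classes=1000)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.cast('float16')  # step 1: cast parameters and expected input type

# step 3: the optimizer keeps an FP32 master copy of the FP16 weights
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.01, 'momentum': 0.9, 'multi_precision': True})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

data = mx.nd.random.uniform(shape=(8, 3, 224, 224), ctx=ctx).astype('float16', copy=False)  # step 2
label = mx.nd.zeros((8,), ctx=ctx, dtype='float16')

with autograd.record():
    loss = loss_fn(net(data), label)
loss.backward()
trainer.step(batch_size=8)
```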
## Setup
The following section lists the requirements in order to start training the ResNet50 v1.5 model.
### Requirements
This repository contains a Dockerfile that extends the MXNet NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [MXNet 19.07-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia%2Fmxnet)
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
- [Running MXNet](https://docs.nvidia.com/deeplearning/dgx/mxnet-release-notes/running.html#running)
For those unable to use the MXNet NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
**1. Clone the repository.**
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/MxNet/Classification/RN50v1.5
```
**2. Build the ResNet50 MXNet NGC container.**
After Docker is set up, you can build the ResNet50 image with:
```bash
docker build . -t nvidia_rn50_mx
```
**3. Start an interactive session in the NGC container to run preprocessing/training/inference.**
```bash
nvidia-docker run --rm -it --ipc=host -v <path to dataset>:/data/imagenet/train-val-recordio-passthrough nvidia_rn50_mx
```
**4. Download and preprocess the data.**
* Download the images from http://image-net.org/download-images.
* Extract the training and validation data:
```bash
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
cd ..
```
**5. Extract the validation data and move the images to subfolders.**
```bash
mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
```
**6. Preprocess the dataset.**
```bash
./scripts/prepare_imagenet.sh <path to raw imagenet> <path where processed dataset will be created>
```
**7. Start training.**
```bash
./runner -n <number of gpus> -b <batch size per GPU (default 192)>
```
**8. Start validation/evaluation.**
```bash
./runner -n <number of gpus> -b <batch size per GPU (default 192)> --load <path to trained model> --mode val
```
**9. Start inference/predictions.**
```bash
./runner --load <path to trained model> --mode pred --data-pred <path to the image>
```
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
* `runner`: A wrapper on the `train.py` script which is the main executable script for training/validation/predicting
* `benchmark.py`: A script for benchmarking
* `Dockerfile`: Instructions to build the ResNet50 MXNet container
* `fit.py`: A file containing most of the training and validation logic
* `data.py`: Data loading and preprocessing code
* `dali.py`: Data loading and preprocessing code using DALI
* `models.py`: The model architecture
* `report.py`: A file containing JSON report structure and description of fields
In the `scripts` directory, the most important files are:
* `prepare_imagenet.sh`: A script that converts raw dataset format to RecordIO format
### Parameters
The complete list of available parameters contains:
```
Model:
--arch {resnetv1,resnetv15,resnextv1,resnextv15,xception}
model architecture (default: resnetv15)
--num-layers NUM_LAYERS
number of layers in the neural network, required by
some networks such as resnet (default: 50)
--num-groups NUM_GROUPS
number of groups for grouped convolutions, required by
some networks such as resnext (default: 32)
--num-classes NUM_CLASSES
the number of classes (default: 1000)
--batchnorm-eps BATCHNORM_EPS
the amount added to the batchnorm variance to prevent
output explosion. (default: 1e-05)
--batchnorm-mom BATCHNORM_MOM
the leaky-integrator factor controlling the batchnorm
mean and variance. (default: 0.9)
--fuse-bn-relu FUSE_BN_RELU
have batchnorm kernel perform activation relu
(default: 0)
--fuse-bn-add-relu FUSE_BN_ADD_RELU
have batchnorm kernel perform add followed by
activation relu (default: 0)
Training:
--mode {train_val,train,val,pred}
mode (default: train_val)
--seed SEED random seed (default: None)
-n NGPUS, --ngpus NGPUS
number of GPUs to use (default: 1)
--kv-store {device,horovod}
key-value store type (default: horovod)
--dtype {float32,float16}
Precision (default: float16)
--amp If enabled, turn on AMP (Automatic Mixed Precision)
(default: False)
-b BATCH_SIZE, --batch-size BATCH_SIZE
batch size per GPU (default: 192)
-e NUM_EPOCHS, --num-epochs NUM_EPOCHS
number of epochs (default: 90)
-l LR, --lr LR learning rate; IMPORTANT: true learning rate will be
calculated as `lr * batch_size / 256` (default: 0.256)
--lr-schedule {multistep,cosine}
learning rate schedule (default: cosine)
--lr-factor LR_FACTOR
the ratio to reduce lr on each step (default: 0.256)
--lr-steps LR_STEPS the epochs to reduce the lr, e.g. 30,60 (default: [])
--warmup-epochs WARMUP_EPOCHS
the epochs to ramp-up lr to scaled large-batch value
(default: 5)
--optimizer OPTIMIZER
the optimizer type (default: sgd)
--mom MOM momentum for sgd (default: 0.875)
--wd WD weight decay for sgd (default: 3.0517578125e-05)
--label-smoothing LABEL_SMOOTHING
label smoothing factor (default: 0.1)
--mixup MIXUP alpha parameter for mixup (if 0 then mixup is not
applied) (default: 0)
--disp-batches DISP_BATCHES
show progress for every n batches (default: 20)
--model-prefix MODEL_PREFIX
model checkpoint prefix (default: model)
--save-frequency SAVE_FREQUENCY
frequency of saving model in epochs (--model-prefix
must be specified). If -1 then save only best model.
If 0 then do not save anything. (default: -1)
--begin-epoch BEGIN_EPOCH
start the model from an epoch (default: 0)
--load LOAD checkpoint to load (default: None)
--test-io test reading speed without training (default: False)
--test-io-mode {train,val}
data to test (default: train)
--log LOG file where to save the log from the experiment
(default: log.log)
--report REPORT file where to save report (default: report.json)
--no-metrics do not calculate evaluation metrics (for benchmarking)
(default: False)
--benchmark-iters BENCHMARK_ITERS
run only benchmark-iters iterations from each epoch
(default: None)
Data:
--data-root DATA_ROOT
Directory with RecordIO data files (default:
/data/imagenet/train-val-recordio-passthrough)
--data-backend {dali,mxnet,synthetic}
data backend (default: dali)
--image-shape IMAGE_SHAPE
the image shape feed into the network (default: [3,
224, 224])
--rgb-mean RGB_MEAN a tuple of size 3 for the mean rgb (default: [123.68,
116.779, 103.939])
--rgb-std RGB_STD a tuple of size 3 for the std rgb (default: [58.393,
57.12, 57.375])
--input-layout {NCHW,NHWC}
the layout of the input data (default: NCHW)
--conv-layout {NCHW,NHWC}
the layout of the data assumed by the conv operation
(default: NCHW)
--batchnorm-layout {NCHW,NHWC}
the layout of the data assumed by the batchnorm
operation (default: NCHW)
--pooling-layout {NCHW,NHWC}
the layout of the data assumed by the pooling
operation (default: NCHW)
--num-examples NUM_EXAMPLES
the number of training examples (doesn't work with
mxnet data backend) (default: 1281167)
--data-val-resize DATA_VAL_RESIZE
base length of shorter edge for validation dataset
(default: 256)
DALI data backend:
entire group applies only to dali data backend
--dali-separ-val each process will perform independent validation on
whole val-set (default: False)
--dali-threads DALI_THREADS
number of threads per GPU for DALI (default: 3)
--dali-validation-threads DALI_VALIDATION_THREADS
number of threads per GPU for DALI for validation
(default: 10)
--dali-prefetch-queue DALI_PREFETCH_QUEUE
DALI prefetch queue depth (default: 2)
--dali-nvjpeg-memory-padding DALI_NVJPEG_MEMORY_PADDING
Memory padding value for nvJPEG (in MB) (default: 64)
MXNet data backend:
entire group applies only to mxnet data backend
--data-mxnet-threads DATA_MXNET_THREADS
number of threads for data decoding for mxnet data
backend (default: 40)
--random-crop RANDOM_CROP
whether or not to randomly crop the image (default: 0)
--random-mirror RANDOM_MIRROR
whether or not to randomly flip horizontally (default: 1)
--max-random-h MAX_RANDOM_H
max change of hue, whose range is [0, 180] (default:
0)
--max-random-s MAX_RANDOM_S
max change of saturation, whose range is [0, 255]
(default: 0)
--max-random-l MAX_RANDOM_L
max change of intensity, whose range is [0, 255]
(default: 0)
--min-random-aspect-ratio MIN_RANDOM_ASPECT_RATIO
min value of aspect ratio, whose value is either None
or a positive value. (default: 0.75)
--max-random-aspect-ratio MAX_RANDOM_ASPECT_RATIO
max value of aspect ratio. If min_random_aspect_ratio
is None, the aspect ratio range is
[1-max_random_aspect_ratio,
1+max_random_aspect_ratio], otherwise it is
[min_random_aspect_ratio, max_random_aspect_ratio].
(default: 1.33)
--max-random-rotate-angle MAX_RANDOM_ROTATE_ANGLE
max angle to rotate, whose range is [0, 360] (default:
0)
--max-random-shear-ratio MAX_RANDOM_SHEAR_RATIO
max ratio to shear, whose range is [0, 1] (default: 0)
--max-random-scale MAX_RANDOM_SCALE
max ratio to scale (default: 1)
--min-random-scale MIN_RANDOM_SCALE
min ratio to scale; should be >= img_size/input_shape,
otherwise use --pad-size (default: 1)
--max-random-area MAX_RANDOM_AREA
max area to crop in random resized crop, whose range
is [0, 1] (default: 1)
--min-random-area MIN_RANDOM_AREA
min area to crop in random resized crop, whose range
is [0, 1] (default: 0.05)
--min-crop-size MIN_CROP_SIZE
Crop both width and height into a random size in
[min_crop_size, max_crop_size] (default: -1)
--max-crop-size MAX_CROP_SIZE
Crop both width and height into a random size in
[min_crop_size, max_crop_size] (default: -1)
--brightness BRIGHTNESS
brightness jittering, whose range is [0, 1] (default:
0)
--contrast CONTRAST contrast jittering, whose range is [0, 1] (default: 0)
--saturation SATURATION
saturation jittering, whose range is [0, 1] (default:
0)
--pca-noise PCA_NOISE
pca noise, whose range is [0, 1] (default: 0)
--random-resized-crop RANDOM_RESIZED_CROP
whether to use random resized crop (default: 1)
```
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option: `./runner --help` and `python train.py --help`. `./runner` acts as a wrapper on `train.py` and all additional flags will be passed to `train.py`.
`./runner` command-line options:
```
usage: runner [-h] [-n NGPUS] [-b BATCH_SIZE] [-e NUM_EPOCHS] [-l LR]
[--data-root DATA_ROOT] [--dtype {float32,float16}]
[--kv-store {device,horovod}]
[--data-backend {dali,mxnet,synthetic}]
```
`train.py` command-line options:
```
usage: train.py [-h]
[--arch {resnetv1,resnetv15,resnextv1,resnextv15,xception}]
[--num-layers NUM_LAYERS] [--num-groups NUM_GROUPS]
[--num-classes NUM_CLASSES] [--batchnorm-eps BATCHNORM_EPS]
[--batchnorm-mom BATCHNORM_MOM] [--fuse-bn-relu FUSE_BN_RELU]
[--fuse-bn-add-relu FUSE_BN_ADD_RELU]
[--mode {train_val,train,val,pred}] [--seed SEED]
[--gpus GPUS] [--kv-store {device,horovod}]
[--dtype {float32,float16}] [--amp] [--batch-size BATCH_SIZE]
[--num-epochs NUM_EPOCHS] [--lr LR]
[--lr-schedule {multistep,cosine}] [--lr-factor LR_FACTOR]
[--lr-steps LR_STEPS] [--warmup-epochs WARMUP_EPOCHS]
[--optimizer OPTIMIZER] [--mom MOM] [--wd WD]
[--label-smoothing LABEL_SMOOTHING] [--mixup MIXUP]
[--disp-batches DISP_BATCHES] [--model-prefix MODEL_PREFIX]
[--save-frequency SAVE_FREQUENCY] [--begin-epoch BEGIN_EPOCH]
[--load LOAD] [--test-io] [--test-io-mode {train,val}]
[--log LOG] [--report REPORT] [--no-metrics]
[--benchmark-iters BENCHMARK_ITERS] [--data-train DATA_TRAIN]
[--data-train-idx DATA_TRAIN_IDX] [--data-val DATA_VAL]
[--data-val-idx DATA_VAL_IDX] [--data-pred DATA_PRED]
[--data-backend {dali,mxnet,synthetic}]
[--image-shape IMAGE_SHAPE] [--rgb-mean RGB_MEAN]
[--rgb-std RGB_STD] [--input-layout {NCHW,NHWC}]
[--conv-layout {NCHW,NHWC}] [--batchnorm-layout {NCHW,NHWC}]
[--pooling-layout {NCHW,NHWC}] [--num-examples NUM_EXAMPLES]
[--data-val-resize DATA_VAL_RESIZE] [--dali-separ-val]
[--dali-threads DALI_THREADS]
[--dali-validation-threads DALI_VALIDATION_THREADS]
[--dali-prefetch-queue DALI_PREFETCH_QUEUE]
[--dali-nvjpeg-memory-padding DALI_NVJPEG_MEMORY_PADDING]
[--data-mxnet-threads DATA_MXNET_THREADS]
[--random-crop RANDOM_CROP] [--random-mirror RANDOM_MIRROR]
[--max-random-h MAX_RANDOM_H] [--max-random-s MAX_RANDOM_S]
[--max-random-l MAX_RANDOM_L]
[--min-random-aspect-ratio MIN_RANDOM_ASPECT_RATIO]
[--max-random-aspect-ratio MAX_RANDOM_ASPECT_RATIO]
[--max-random-rotate-angle MAX_RANDOM_ROTATE_ANGLE]
[--max-random-shear-ratio MAX_RANDOM_SHEAR_RATIO]
[--max-random-scale MAX_RANDOM_SCALE]
[--min-random-scale MIN_RANDOM_SCALE]
[--max-random-area MAX_RANDOM_AREA]
[--min-random-area MIN_RANDOM_AREA]
[--min-crop-size MIN_CROP_SIZE]
[--max-crop-size MAX_CROP_SIZE] [--brightness BRIGHTNESS]
[--contrast CONTRAST] [--saturation SATURATION]
[--pca-noise PCA_NOISE]
[--random-resized-crop RANDOM_RESIZED_CROP]
```
### Getting the data
The MXNet ResNet50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from ILSVRC challenge.
You can download the images from http://image-net.org/download-images.
The recommended data format is
[RecordIO](http://mxnet.io/architecture/note_data_loading.html), which
@@ -106,7 +498,7 @@ concatenates multiple examples into seekable binary files for better read
efficiency. MXNet provides a tool called `im2rec.py` located in the `/opt/mxnet/tools/` directory.
The tool converts individual images into `.rec` files.
To prepare a RecordIO file containing ImageNet data, we first need to create `.lst` files
which consist of the labels and image paths. We assume that the original images were
downloaded to `/data/imagenet/raw/train-jpeg` and `/data/imagenet/raw/val-jpeg`.
@@ -115,121 +507,216 @@
```bash
python /opt/mxnet/tools/im2rec.py --list --recursive train /data/imagenet/raw/train-jpeg
python /opt/mxnet/tools/im2rec.py --list --recursive val /data/imagenet/raw/val-jpeg
```
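For reference, each line of a `.lst` file produced by `im2rec.py --list` is tab-separated: an integer index, the label, and the image path. A tiny illustration (the synset path is an example, not a required name):
```python
# One line of a .lst file: <index>\t<label>\t<relative image path>
sample = "0\t0\tn01440764/n01440764_10026.JPEG"
index, label, path = sample.split('\t')
print(index, label, path)
```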
Next, we generate the `.rec` (RecordIO files with data) and `.idx` (indexes required by DALI
to speed up data loading) files. To obtain the best training accuracy
we do not preprocess the images when creating the RecordIO file.
```bash
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 train /data/imagenet/raw/train-jpeg
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 val /data/imagenet/raw/val-jpeg
```
#### Dataset guidelines
The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data.py` and `dali.py` files.
The data is read from RecordIO format, which concatenates multiple examples into seekable binary files for better read efficiency.
Data augmentation techniques are described in the [Default configuration](#default-configuration) section.
#### Multi-dataset
In most cases, to train a model on a different dataset, no changes in the code are required, but the dataset has to be converted into RecordIO format.
To convert a custom dataset, follow the steps from the [Getting the data](#getting-the-data) section, and refer to the `scripts/prepare_dataset.py` script. A quick sanity check of the converted files is shown below.
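A converted `.rec`/`.idx` pair can be opened directly with MXNet's RecordIO reader; a minimal check (file names are assumptions):
```python
import mxnet as mx

# Read the first record by index and decode its JPEG payload (requires cv2).
record = mx.recordio.MXIndexedRecordIO('train.idx', 'train.rec', 'r')
header, img = mx.recordio.unpack_img(record.read_idx(0))
print(header.label, img.shape)
```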
### Training process
To start training, run:
`./runner -n <number of gpus> -b <batch size per GPU> --data-root <path to imagenet> --dtype <float32 or float16>`
By default the training script runs the validation after each epoch:
* the best checkpoint will be stored in the `model_best.params` file in the working directory
* the log from training will be saved in the `log.log` file in the working directory
* the JSON report with statistics will be saved in the `report.json` file in the working directory
If ImageNet is mounted in the `/data/imagenet/train-val-recordio-passthrough` directory, you don't have to specify the `--data-root` flag.
### Inference process
To start validation, run:
`./runner -n <number of gpus> -b <batch size per GPU> --data-root <path to imagenet> --dtype <float32 or float16> --mode val`
By default:
* the log from validation will be saved in the `log.log` file in the working directory
* the JSON report with statistics will be saved in the `report.json` file in the working directory
## Performance
### Benchmarking
To benchmark training and inference, run:
`python benchmark.py -n <numbers of gpus separated by comma> -b <batch sizes per gpu separated by comma> --data-root <path to imagenet> --dtype <float32 or float16> -o <path to benchmark report>`
To control the benchmark length per epoch, use the `-i` flag (defaults to 100 iterations).
To control the number of epochs, use the `-e` flag.
To control the number of warmup epochs (epochs which are not taken into account), use the `-w` flag.
To limit the length of the dataset, use the `--num-examples` flag.
By default, the same parameters as in `./runner` will be used. Additional flags will be passed to `./runner`.
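A sketch of consuming the resulting report; the nesting (GPU count, then batch size, then metric) follows the structure written by `benchmark.py`, while the file name is an assumption:
```python
import json

with open('benchmark_report_fp16.json') as f:
    report = json.load(f)
# metrics are keyed by GPU count, then batch size, then metric name
print(report['metrics']['8']['192']['train.total_ips'], 'img/s')
```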
#### Training performance benchmark
To benchmark only training, use the `--mode train` flag.
#### Inference performance benchmark
To benchmark only inference, use the `--mode val` flag.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
**90 epochs configuration**
Our results were obtained by running the `./runner -n <number of gpus> -b 96 --dtype float32` script for FP32 and the `./runner -n <number of gpus> -b 192` script for mixed precision in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
| **GPUs** | **Accuracy - mixed precision** | **Accuracy - FP32** | **Time to train - mixed precision (hours)** | **Time to train - FP32 (hours)** | **Time to train - speedup** |
|:---:|:---:|:---:|:---:|:---:|:---:|
|1|77.208|77.160|24.2|84.5|3.49|
|4|77.296|77.280|6.0|21.4|3.59|
|8|77.308|77.292|3.0|10.7|3.54|
##### Training stability test
Our results were obtained by running the following commands 8 times with different seeds.
* For 50 epochs
* `./runner -n 8 -b 96 --dtype float32 --num-epochs 50` for FP32
* `./runner -n 8 -b 192 --num-epochs 50` for mixed precision
* For 90 epochs
* `./runner -n 8 -b 96 --dtype float32` for FP32
* `./runner -n 8 -b 192` for mixed precision
* For 250 epochs
* `./runner -n 8 -b 96 --dtype float32 --num-epochs 250 --mixup 0.2` for FP32
* `./runner -n 8 -b 192 --num-epochs 250 --mixup 0.2` for mixed precision
| **# of epochs** | **mixed precision avg top1** | **FP32 avg top1** | **mixed precision standard deviation** | **FP32 standard deviation** | **mixed precision minimum top1** | **FP32 minimum top1** | **mixed precision maximum top1** | **FP32 maximum top1** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|50|76.156|76.185|0.118|0.082|76.010|76.062|76.370|76.304|
|90|77.105|77.224|0.097|0.060|76.982|77.134|77.308|77.292|
|250|78.317|78.400|0.073|0.102|78.202|78.316|78.432|78.570|
Here are example graphs of FP32 and mixed precision training on the 8 GPU, 250 epochs configuration:
![TrainingLoss](./img/dgx1-16g_250e_training_loss.png)
![ValidationTop1](./img/dgx1-16g_250e_validation_top1.png)
![ValidationTop5](./img/dgx1-16g_250e_validation_top5.png)
#### Training performance results
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
The following results were obtained by running the
`python benchmark.py -n 1,2,4,8 -b 192 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision and the
`python benchmark.py -n 1,2,4,8 -b 96 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
Training performance reported as Total IPS (data + compute time taken into account).
Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.
| **GPUs** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:---:|:---:|:---:|:---:|:---:|:---:|
|1|1427|385|3.71|1.00|1.00|
|2|2820|768|3.67|1.98|2.00|
|4|5560|1513|3.68|3.90|3.93|
|8|10931|3023|3.62|7.66|7.86|
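Weak scaling in these tables is simply throughput at N GPUs divided by throughput at 1 GPU; a quick check against the mixed precision column above:
```python
fp16_ips = {1: 1427, 2: 2820, 4: 5560, 8: 10931}
for n, ips in fp16_ips.items():
    print(n, 'GPUs:', round(ips / fp16_ips[1], 2))  # 1.0, 1.98, 3.9, 7.66
```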
##### Training performance: NVIDIA DGX-2 (16x V100 32G)
The following results were obtained by running the
`python benchmark.py -n 1,4,8,16 -b 256 --dtype float16 -o benchmark_report_fp16.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for mixed precision and the
`python benchmark.py -n 1,4,8,16 -b 128 --dtype float32 -o benchmark_report_fp32.json -i 500 -e 3 -w 1 --num-examples 32000 --mode train` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-2 with (16x V100 32G) GPUs.
Training performance reported as Total IPS (data + compute time taken into account).
Weak scaling is calculated as a ratio of speed for given number of GPUs to speed for 1 GPU.
| **GPUs** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:---:|:---:|:---:|:---:|:---:|:---:|
|1|1438|409|3.52|1.00|1.00|
|2|2868|817|3.51|1.99|2.00|
|4|5624|1617|3.48|3.91|3.96|
|8|11174|3214|3.48|7.77|7.86|
|16|20530|6356|3.23|14.28|15.54|
#### Inference performance results
##### Inference performance: NVIDIA DGX-1 (8x V100 16G)
The following results were obtained by running the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
Inference performance reported as Total IPS (data + compute time taken into account).
Reported mixed precision speedups are relative to FP32 numbers for corresponding configuration.
| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** | **100% latency (ms) - mixed precision** | **100% latency - speedup** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 397 | 1.65 | 2.5 | 1.65 | 2.5 | 1.67 | 2.7 | 1.59 | 2.8 | 1.56 | 3.2 | 1.51 | 15.8 | 0.84 |
| 2 | 732 | 1.81 | 2.7 | 1.81 | 2.6 | 1.88 | 3.0 | 1.67 | 3.3 | 1.52 | 4.9 | 1.10 | 18.8 | 0.83 |
| 4 | 1269 | 2.08 | 3.2 | 2.08 | 3.0 | 2.21 | 3.5 | 1.92 | 4.0 | 1.72 | 7.5 | 0.97 | 14.5 | 0.54 |
| 8 | 2012 | 2.53 | 4.0 | 2.53 | 3.9 | 2.59 | 4.2 | 2.45 | 4.4 | 2.37 | 8.3 | 1.29 | 15.3 | 0.72 |
| 16 | 2667 | 2.64 | 6.0 | 2.64 | 5.9 | 2.66 | 6.3 | 2.54 | 6.4 | 2.52 | 8.3 | 2.02 | 16.9 | 1.05 |
| 32 | 3240 | 2.86 | 9.9 | 2.86 | 9.8 | 2.87 | 10.3 | 2.79 | 10.4 | 2.76 | 11.5 | 2.53 | 28.4 | 1.12 |
| 64 | 3776 | 3.10 | 17.0 | 3.10 | 17.0 | 3.09 | 17.5 | 3.03 | 17.7 | 3.01 | 18.1 | 3.01 | 18.7 | 2.99 |
| 128 | 3734 | 3.02 | 34.3 | 3.02 | 33.8 | 3.05 | 35.5 | 2.93 | 36.3 | 2.88 | 42.4 | 2.79 | 51.7 | 2.38 |
| 192 | 3641 | 2.90 | 52.7 | 2.90 | 52.4 | 2.90 | 55.2 | 2.77 | 56.2 | 2.74 | 65.4 | 2.76 | 77.1 | 2.41 |
| 256 | 3463 | 2.73 | 73.9 | 2.73 | 72.8 | 2.75 | 77.3 | 2.61 | 79.9 | 2.54 | 100.8 | 2.39 | 104.1 | 2.35 |
##### Inference performance: NVIDIA T4
The following results were obtained by running the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float16 -o inferbenchmark_report_fp16.json -i 500 -e 3 -w 1 --mode val` script for mixed precision and the
`python benchmark.py -n 1 -b 1,2,4,8,16,32,64,128,192,256 --dtype float32 -o inferbenchmark_report_fp32.json -i 500 -e 3 -w 1 --mode val` script for FP32 in the mxnet-19.07-py3 NGC container on an NVIDIA T4 GPU.
Inference performance reported as Total IPS (data + compute time taken into account).
Reported mixed precision speedups are relative to FP32 numbers for corresponding configuration.
| **Batch size** | **Throughput (img/sec) - mixed precision** | **Throughput - speedup** | **Avg latency (ms) - mixed precision** | **Avg latency - speedup** | **50% latency (ms) - mixed precision** | **50% latency - speedup** | **90% latency (ms) - mixed precision** | **90% latency - speedup** | **95% latency (ms) - mixed precision** | **95% latency - speedup** | **99% latency (ms) - mixed precision** | **99% latency - speedup** | **100% latency (ms) - mixed precision** | **100% latency - speedup** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 348 | 1.88 | 2.9 | 1.88 | 2.8 | 1.91 | 2.9 | 1.88 | 3.0 | 1.90 | 3.9 | 1.82 | 17.6 | 0.74 |
| 2 | 594 | 2.30 | 3.4 | 2.30 | 3.3 | 2.35 | 3.4 | 2.34 | 3.5 | 2.38 | 5.7 | 1.55 | 20.2 | 0.74 |
| 4 | 858 | 2.93 | 4.7 | 2.93 | 4.6 | 2.97 | 4.9 | 2.86 | 5.0 | 2.81 | 6.0 | 2.46 | 13.7 | 1.12 |
| 8 | 1047 | 3.17 | 7.6 | 3.17 | 7.6 | 3.19 | 7.9 | 3.10 | 8.2 | 3.02 | 9.1 | 2.77 | 15.0 | 1.72 |
| 16 | 1163 | 3.16 | 13.8 | 3.16 | 13.7 | 3.17 | 14.1 | 3.13 | 14.4 | 3.07 | 15.4 | 2.90 | 17.5 | 2.62 |
| 32 | 1225 | 3.22 | 26.1 | 3.22 | 26.1 | 3.22 | 27.0 | 3.15 | 27.3 | 3.12 | 28.3 | 3.05 | 30.5 | 2.89 |
| 64 | 1230 | 3.15 | 52.0 | 3.15 | 51.8 | 3.16 | 52.9 | 3.12 | 53.3 | 3.10 | 54.4 | 3.08 | 58.8 | 2.90 |
| 128 | 1260 | 3.21 | 101.6 | 3.21 | 101.3 | 3.22 | 102.7 | 3.21 | 103.2 | 3.20 | 115.0 | 2.89 | 121.8 | 2.86 |
| 192 | 1252 | 3.20 | 153.3 | 3.20 | 153.1 | 3.20 | 154.7 | 3.19 | 155.5 | 3.21 | 156.9 | 3.20 | 182.3 | 2.81 |
| 256 | 1251 | 3.22 | 204.6 | 3.22 | 204.3 | 3.23 | 206.4 | 3.21 | 207.1 | 3.21 | 209.3 | 3.18 | 241.9 | 2.76 |
## Release notes
### Changelog
1. Dec, 2018
* Initial release (based on https://github.com/apache/incubator-mxnet/tree/master/example/image-classification)
2. June, 2019
* Code refactor
* Label smoothing
* Cosine LR schedule
* MixUp regularization
* Better configurations
### Known Issues
There are no known issues with this model.

MxNet/Classification/RN50v1.5/benchmark.py Normal file → Executable file

@@ -1,3 +1,5 @@
#!/usr/bin/env python3
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -18,14 +20,21 @@ import sys
import tempfile
import json
import os
import traceback
import numpy as np
from collections import OrderedDict
from subprocess import Popen
def int_list(x):
    return list(map(int, x.split(',')))

parser = argparse.ArgumentParser(description='Benchmark',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--executable', default='./runner', help='path to runner')
parser.add_argument('-o', '--output', metavar='OUT', required=True, help="path to benchmark report")
parser.add_argument('-n', '--ngpus', metavar='N1,[N2,...]', type=int_list,
                    required=True, help='numbers of gpus separated by comma')
parser.add_argument('-b', '--batch-sizes', metavar='B1,[B2,...]', type=int_list,
                    required=True, help='batch sizes separated by comma')
parser.add_argument('-i', '--benchmark-iters', metavar='I',
                    type=int, default=100, help='iterations')
parser.add_argument('-e', '--epochs', metavar='E',
                    type=int, default=1, help='number of epochs')
parser.add_argument('-w', '--warmup', metavar='N',
                    type=int, default=0, help='warmup epochs')
parser.add_argument('--timeout', metavar='T',
                    type=str, default='inf', help='timeout for each run')
parser.add_argument('--mode', metavar='MODE', choices=('train_val', 'train', 'val'), default='train_val',
                    help="benchmark mode")
args, other_args = parser.parse_known_args()

latency_percentiles = ['avg', 50, 90, 95, 99, 100]
harmonic_mean_metrics = ['train.total_ips', 'val.total_ips']

res = OrderedDict()
res['model'] = ''
res['ngpus'] = args.ngpus
res['bs'] = args.batch_sizes
res['metric_keys'] = []
if args.mode == 'train' or args.mode == 'train_val':
    res['metric_keys'].append('train.total_ips')
    for percentile in latency_percentiles:
        res['metric_keys'].append('train.latency_{}'.format(percentile))
if args.mode == 'val' or args.mode == 'train_val':
    res['metric_keys'].append('val.total_ips')
    for percentile in latency_percentiles:
        res['metric_keys'].append('val.latency_{}'.format(percentile))
res['metrics'] = OrderedDict()

for n in args.ngpus:
    res['metrics'][str(n)] = OrderedDict()
    for bs in args.batch_sizes:
        res['metrics'][str(n)][str(bs)] = OrderedDict()
        report_file = args.output + '-{},{}'.format(n, bs)
        Popen(['timeout', args.timeout, args.executable, '-n', str(n), '-b', str(bs),
               '--benchmark-iters', str(args.benchmark_iters),
               '-e', str(args.epochs), '--report', report_file,
               '--mode', args.mode, '--no-metrics'] + other_args,
              stdout=sys.stderr).wait()
        try:
            # horovod runs may write per-rank reports with a -<rank> suffix
            for suffix in ['', *['-{}'.format(i) for i in range(1, n)]]:
                try:
                    with open(report_file + suffix, 'r') as f:
                        report = json.load(f)
                    break
                except FileNotFoundError:
                    pass
            else:
                with open(report_file, 'r') as f:
                    report = json.load(f)
            for metric in res['metric_keys']:
                if len(report['metrics'][metric]) != args.epochs:
                    raise ValueError('Wrong number of epochs in report')
                data = report['metrics'][metric][args.warmup:]
                if metric in harmonic_mean_metrics:
                    avg = len(data) / sum(map(lambda x: 1 / x, data))
                else:
                    avg = np.mean(data)
                res['metrics'][str(n)][str(bs)][metric] = avg
        except Exception as e:
            traceback.print_exc()
            for metric in res['metric_keys']:
                res['metrics'][str(n)][str(bs)][metric] = float('nan')

column_len = 11
for m in res['metric_keys']:
    print(m, file=sys.stderr)
    print(' ' * column_len, end='|', file=sys.stderr)
    for bs in args.batch_sizes:
        print(str(bs).center(column_len), end='|', file=sys.stderr)
    print(file=sys.stderr)
    print('-' * (len(args.batch_sizes) + 1) * (column_len + 1), file=sys.stderr)
    for n in args.ngpus:
        print(str(n).center(column_len), end='|', file=sys.stderr)
        for bs in args.batch_sizes:
            print('{:.5g}'.format(res['metrics'][str(n)][str(bs)][m]).center(column_len), end='|', file=sys.stderr)
        print(file=sys.stderr)
    print(file=sys.stderr)


@@ -52,11 +52,14 @@ class BenchmarkingDataIter:
    def __getattr__(self, attr):
        return getattr(self.data_iter, attr)

    def get_avg_time(self):
        if self.num <= 1:
            avg = float('nan')
        else:
            avg = self.overall_time / (self.num - 1)
        return avg

    def reset(self):
        self.overall_time = 0
        self.num = 0
        self.data_iter.reset()
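    # A hypothetical use of the wrapper above (constructor signature assumed):
    #     it = BenchmarkingDataIter(data_iter)      # wraps an MXNet DataIter
    #     for batch in it:
    #         pass                                  # iterating accumulates overall_time/num
    #     print('avg batch time:', it.get_avg_time())  # first batch excluded as warmup
    #     it.reset()                                # zero counters, reset underlying iterator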


@ -18,146 +18,166 @@ from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.plugin.mxnet import DALIClassificationIterator
import horovod.mxnet as hvd
def add_dali_args(parser):
group = parser.add_argument_group('DALI', 'pipeline and augumentation')
group.add_argument('--use-dali', action='store_true',
help='use dalli pipeline and augunetation')
group.add_argument('--separ-val', action='store_true',
group = parser.add_argument_group('DALI data backend', 'entire group applies only to dali data backend')
group.add_argument('--dali-separ-val', action='store_true',
help='each process will perform independent validation on whole val-set')
group.add_argument('--dali-threads', type=int, default=3, help="number of threads" +\
"per GPU for DALI")
group.add_argument('--validation-dali-threads', type=int, default=10, help="number of threads" +\
group.add_argument('--dali-validation-threads', type=int, default=10, help="number of threads" +\
"per GPU for DALI for validation")
group.add_argument('--dali-prefetch-queue', type=int, default=3, help="DALI prefetch queue depth")
group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=16, help="Memory padding value for nvJPEG (in MB)")
group.add_argument('--dali-prefetch-queue', type=int, default=2, help="DALI prefetch queue depth")
group.add_argument('--dali-nvjpeg-memory-padding', type=int, default=64, help="Memory padding value for nvJPEG (in MB)")
group.add_argument('--dali-fuse-decoder', type=int, default=1, help="0 or 1 whether to fuse decoder or not")
return parser
_mean_pixel = [255 * x for x in (0.485, 0.456, 0.406)]
_std_pixel = [255 * x for x in (0.229, 0.224, 0.225)]
class HybridTrainPipe(Pipeline):
def __init__(self, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape,
nvjpeg_padding, prefetch_queue=3,
output_layout=types.NCHW, pad_output=True, dtype='float16'):
super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id, prefetch_queue_depth = prefetch_queue)
self.input = ops.MXNetReader(path = [rec_path], index_path=[idx_path],
def __init__(self, args, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape, nvjpeg_padding, prefetch_queue=3,
output_layout=types.NCHW, pad_output=True, dtype='float16', dali_cpu=False):
super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id, prefetch_queue_depth = prefetch_queue)
self.input = ops.MXNetReader(path=[rec_path], index_path=[idx_path],
random_shuffle=True, shard_id=shard_id, num_shards=num_shards)
self.decode = ops.nvJPEGDecoder(device = "mixed", output_type = types.RGB,
device_memory_padding = nvjpeg_padding,
host_memory_padding = nvjpeg_padding)
self.rrc = ops.RandomResizedCrop(device = "gpu", size = crop_shape)
self.cmnp = ops.CropMirrorNormalize(device = "gpu",
output_dtype = types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout = output_layout,
crop = crop_shape,
pad_output = pad_output,
image_type = types.RGB,
mean = _mean_pixel,
std = _std_pixel)
self.coin = ops.CoinFlip(probability = 0.5)
if dali_cpu:
dali_device = "cpu"
if args.dali_fuse_decoder:
self.decode = ops.HostDecoderRandomCrop(device=dali_device, output_type=types.RGB)
else:
self.decode = ops.HostDecoder(device=dali_device, output_type=types.RGB)
else:
dali_device = "gpu"
if args.dali_fuse_decoder:
self.decode = ops.nvJPEGDecoderRandomCrop(device="mixed", output_type=types.RGB,
device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)
else:
self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB,
device_memory_padding=nvjpeg_padding, host_memory_padding=nvjpeg_padding)
if args.dali_fuse_decoder:
self.resize = ops.Resize(device=dali_device, resize_x=crop_shape[1], resize_y=crop_shape[0])
else:
self.resize = ops.RandomResizedCrop(device=dali_device, size=crop_shape)
self.cmnp = ops.CropMirrorNormalize(device="gpu",
output_dtype=types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout=output_layout, crop=crop_shape, pad_output=pad_output,
image_type=types.RGB, mean=args.rgb_mean, std=args.rgb_std)
self.coin = ops.CoinFlip(probability=0.5)
def define_graph(self):
rng = self.coin()
self.jpegs, self.labels = self.input(name = "Reader")
self.jpegs, self.labels = self.input(name="Reader")
images = self.decode(self.jpegs)
images = self.rrc(images)
output = self.cmnp(images, mirror = rng)
images = self.resize(images)
output = self.cmnp(images.gpu(), mirror=rng)
return [output, self.labels]
class HybridValPipe(Pipeline):
def __init__(self, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape,
nvjpeg_padding, prefetch_queue=3,
resize_shp=None,
output_layout=types.NCHW, pad_output=True, dtype='float16'):
super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id, prefetch_queue_depth = prefetch_queue)
self.input = ops.MXNetReader(path = [rec_path], index_path=[idx_path],
def __init__(self, args, batch_size, num_threads, device_id, rec_path, idx_path,
shard_id, num_shards, crop_shape, nvjpeg_padding, prefetch_queue=3, resize_shp=None,
output_layout=types.NCHW, pad_output=True, dtype='float16', dali_cpu=False):
super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed=12 + device_id, prefetch_queue_depth=prefetch_queue)
self.input = ops.MXNetReader(path=[rec_path], index_path=[idx_path],
random_shuffle=False, shard_id=shard_id, num_shards=num_shards)
self.decode = ops.nvJPEGDecoder(device = "mixed", output_type = types.RGB,
device_memory_padding = nvjpeg_padding,
host_memory_padding = nvjpeg_padding)
self.resize = ops.Resize(device = "gpu", resize_shorter=resize_shp) if resize_shp else None
self.cmnp = ops.CropMirrorNormalize(device = "gpu",
output_dtype = types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout = output_layout,
crop = crop_shape,
pad_output = pad_output,
image_type = types.RGB,
mean = _mean_pixel,
std = _std_pixel)
if dali_cpu:
dali_device = "cpu"
self.decode = ops.HostDecoder(device=dali_device, output_type=types.RGB)
else:
dali_device = "gpu"
self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB,
device_memory_padding=nvjpeg_padding,
host_memory_padding=nvjpeg_padding)
self.resize = ops.Resize(device=dali_device, resize_shorter=resize_shp) if resize_shp else None
self.cmnp = ops.CropMirrorNormalize(device="gpu",
output_dtype=types.FLOAT16 if dtype == 'float16' else types.FLOAT,
output_layout=output_layout, crop=crop_shape, pad_output=pad_output,
image_type=types.RGB, mean=args.rgb_mean, std=args.rgb_std)
def define_graph(self):
self.jpegs, self.labels = self.input(name = "Reader")
self.jpegs, self.labels = self.input(name="Reader")
images = self.decode(self.jpegs)
if self.resize:
images = self.resize(images)
output = self.cmnp(images)
output = self.cmnp(images.gpu())
return [output, self.labels]
def get_rec_iter(args, kv=None, dali_cpu=False):
    gpus = args.gpus
    num_threads = args.dali_threads
    num_validation_threads = args.dali_validation_threads
    pad_output = (args.image_shape[0] == 4)
    # the input_layout w.r.t. the model is the output_layout of the image pipeline
    output_layout = types.NHWC if args.input_layout == 'NHWC' else types.NCHW
    if 'horovod' in args.kv_store:
        rank = hvd.rank()
        nWrk = hvd.size()
    else:
        rank = kv.rank if kv else 0
        nWrk = kv.num_workers if kv else 1
    batch_size = args.batch_size // nWrk // len(gpus)
    trainpipes = [HybridTrainPipe(args = args,
                                  batch_size = batch_size,
                                  num_threads = num_threads,
                                  device_id = gpu_id,
                                  rec_path = args.data_train,
                                  idx_path = args.data_train_idx,
                                  shard_id = gpus.index(gpu_id) + len(gpus)*rank,
                                  num_shards = len(gpus)*nWrk,
                                  crop_shape = args.image_shape[1:],
                                  output_layout = output_layout,
                                  dtype = args.dtype,
                                  pad_output = pad_output,
                                  dali_cpu = dali_cpu,
                                  nvjpeg_padding = args.dali_nvjpeg_memory_padding * 1024 * 1024,
                                  prefetch_queue = args.dali_prefetch_queue) for gpu_id in gpus]
    if args.data_val:
        valpipes = [HybridValPipe(args = args,
                                  batch_size = batch_size,
                                  num_threads = num_validation_threads,
                                  device_id = gpu_id,
                                  rec_path = args.data_val,
                                  idx_path = args.data_val_idx,
                                  shard_id = 0 if args.dali_separ_val
                                               else gpus.index(gpu_id) + len(gpus)*rank,
                                  num_shards = 1 if args.dali_separ_val else len(gpus)*nWrk,
                                  crop_shape = args.image_shape[1:],
                                  resize_shp = args.data_val_resize,
                                  output_layout = output_layout,
                                  dtype = args.dtype,
                                  pad_output = pad_output,
                                  dali_cpu = dali_cpu,
                                  nvjpeg_padding = args.dali_nvjpeg_memory_padding * 1024 * 1024,
                                  prefetch_queue = args.dali_prefetch_queue) for gpu_id in gpus]
    trainpipes[0].build()
    if args.data_val:
        valpipes[0].build()
        worker_val_examples = valpipes[0].epoch_size("Reader")
        if not args.dali_separ_val:
            worker_val_examples = worker_val_examples // nWrk
            if rank < valpipes[0].epoch_size("Reader") % nWrk:
                worker_val_examples += 1
    if args.num_examples < trainpipes[0].epoch_size("Reader"):
        warnings.warn("{} training examples will be used, although full training set contains {} examples".format(args.num_examples, trainpipes[0].epoch_size("Reader")))
    dali_train_iter = DALIClassificationIterator(trainpipes, args.num_examples // nWrk)
    if args.data_val:
        dali_val_iter = DALIClassificationIterator(valpipes, worker_val_examples, fill_last_batch = False)
    else:
        dali_val_iter = None
    return dali_train_iter, dali_val_iter
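# Sharding sketch (illustrative numbers, assumed rather than taken from the
# scripts above): with 2 workers and gpus=[0, 1, 2, 3], the pipeline for GPU j
# on worker r reads shard j + 4*r out of 8 total shards, and each GPU consumes
# batch_size = args.batch_size // 2 // 4 samples per step, so one step over
# all pipelines still covers args.batch_size samples globally.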

View file

@ -1,7 +1,5 @@
# Copyright 2017-2018 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
@ -36,128 +34,61 @@
# limitations under the License.
import mxnet as mx
import mxnet.ndarray as nd
import random
import argparse
from mxnet.io import DataBatch, DataIter
import numpy as np
import horovod.mxnet as hvd
import dali
def add_data_args(parser):
    def float_list(x):
        return list(map(float, x.split(',')))

    def int_list(x):
        return list(map(int, x.split(',')))

    data = parser.add_argument_group('Data')
    data.add_argument('--data-train', type=str, help='the training data')
    data.add_argument('--data-train-idx', type=str, default='', help='the index of training data')
    data.add_argument('--data-val', type=str, help='the validation data')
    data.add_argument('--data-val-idx', type=str, default='', help='the index of validation data')
    data.add_argument('--data-pred', type=str, help='the image on which to run inference (only for pred mode)')
    data.add_argument('--data-backend', choices=('dali-gpu', 'dali-cpu', 'mxnet', 'synthetic'), default='dali-gpu',
                      help='set data loading & augmentation backend')
    data.add_argument('--image-shape', type=int_list, default=[3, 224, 224],
                      help='the image shape fed into the network')
    data.add_argument('--rgb-mean', type=float_list, default=[123.68, 116.779, 103.939],
                      help='a tuple of size 3 for the mean rgb')
    data.add_argument('--rgb-std', type=float_list, default=[58.393, 57.12, 57.375],
                      help='a tuple of size 3 for the std rgb')
    data.add_argument('--input-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the input data')
    data.add_argument('--conv-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the data assumed by the conv operation')
    data.add_argument('--batchnorm-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the data assumed by the batchnorm operation')
    data.add_argument('--pooling-layout', type=str, default='NCHW', choices=('NCHW', 'NHWC'),
                      help='the layout of the data assumed by the pooling operation')
    data.add_argument('--num-examples', type=int, default=1281167,
                      help="the number of training examples (doesn't work with mxnet data backend)")
    data.add_argument('--data-val-resize', type=int, default=256,
                      help='base length of shorter edge for validation dataset')
    return data
# Action to translate --set-resnet-aug flag to its component settings.
class SetResnetAugAction(argparse.Action):
def __init__(self, nargs=0, **kwargs):
if nargs != 0:
raise ValueError('nargs for SetResnetAug must be 0.')
super(SetResnetAugAction, self).__init__(nargs=nargs, **kwargs)
def __call__(self, parser, namespace, values, option_string=None):
# standard data augmentation setting for resnet training
setattr(namespace, 'random_crop', 1)
setattr(namespace, 'random_resized_crop', 1)
setattr(namespace, 'random_mirror', 1)
setattr(namespace, 'min_random_area', 0.08)
setattr(namespace, 'max_random_aspect_ratio', 4./3.)
setattr(namespace, 'min_random_aspect_ratio', 3./4.)
setattr(namespace, 'brightness', 0.4)
setattr(namespace, 'contrast', 0.4)
setattr(namespace, 'saturation', 0.4)
setattr(namespace, 'pca_noise', 0.1)
# record that this --set-resnet-aug 'macro arg' has been invoked
setattr(namespace, self.dest, 1)
# Similar to the above, but suitable for calling within a training script to set the defaults.
def set_resnet_aug(aug):
# standard data augmentation setting for resnet training
aug.set_defaults(random_crop=0, random_resized_crop=1)
aug.set_defaults(random_mirror=1)
aug.set_defaults(min_random_area=0.08)
aug.set_defaults(max_random_aspect_ratio=4./3., min_random_aspect_ratio=3./4.)
aug.set_defaults(brightness=0.4, contrast=0.4, saturation=0.4, pca_noise=0.1)
# Action to translate --set-data-aug-level <N> arg to its component settings.
class SetDataAugLevelAction(argparse.Action):
def __init__(self, option_strings, dest, nargs=None, **kwargs):
if nargs is not None:
raise ValueError("nargs not allowed")
super(SetDataAugLevelAction, self).__init__(option_strings, dest, **kwargs)
def __call__(self, parser, namespace, values, option_string=None):
level = values
# record that this --set-data-aug-level <N> 'macro arg' has been invoked
setattr(namespace, self.dest, level)
if level >= 1:
setattr(namespace, 'random_crop', 1)
setattr(namespace, 'random_mirror', 1)
if level >= 2:
setattr(namespace, 'max_random_h', 36)
setattr(namespace, 'max_random_s', 50)
setattr(namespace, 'max_random_l', 50)
if level >= 3:
setattr(namespace, 'max_random_rotate_angle', 10)
setattr(namespace, 'max_random_shear_ratio', 0.1)
setattr(namespace, 'max_random_aspect_ratio', 0.25)
# Similar to the above, but suitable for calling within a training script to set the defaults.
def set_data_aug_level(aug, level):
if level >= 1:
aug.set_defaults(random_crop=1, random_mirror=1)
if level >= 2:
aug.set_defaults(max_random_h=36, max_random_s=50, max_random_l=50)
if level >= 3:
aug.set_defaults(max_random_rotate_angle=10, max_random_shear_ratio=0.1, max_random_aspect_ratio=0.25)
def add_data_aug_args(parser):
    aug = parser.add_argument_group(
        'MXNet data backend', 'entire group applies only to mxnet data backend')
    aug.add_argument('--data-mxnet-threads', type=int, default=40,
                     help='number of threads for data decoding for mxnet data backend')
    aug.add_argument('--random-crop', type=int, default=0,
                     help='whether to randomly crop the image')
    aug.add_argument('--random-mirror', type=int, default=1,
                     help='whether to randomly flip the image horizontally')
    aug.add_argument('--max-random-h', type=int, default=0,
                     help='max change of hue, whose range is [0, 180]')
@ -165,9 +96,9 @@ def add_data_aug_args(parser):
help='max change of saturation, whose range is [0, 255]')
aug.add_argument('--max-random-l', type=int, default=0,
help='max change of intensity, whose range is [0, 255]')
aug.add_argument('--min-random-aspect-ratio', type=float, default=0.75,
help='min value of aspect ratio, whose value is either None or a positive value.')
aug.add_argument('--max-random-aspect-ratio', type=float, default=1.33,
help='max value of aspect ratio. If min_random_aspect_ratio is None, '
'the aspect ratio range is [1-max_random_aspect_ratio, '
'1+max_random_aspect_ratio], otherwise it is '
@ -183,7 +114,7 @@ def add_data_aug_args(parser):
'otherwise use --pad-size')
aug.add_argument('--max-random-area', type=float, default=1,
help='max area to crop in random resized crop, whose range is [0, 1]')
aug.add_argument('--min-random-area', type=float, default=0.05,
help='min area to crop in random resized crop, whose range is [0, 1]')
aug.add_argument('--min-crop-size', type=int, default=-1,
help='Crop both width and height into a random size in '
@ -199,87 +130,200 @@ def add_data_aug_args(parser):
help='saturation jittering, whose range is [0, 1]')
aug.add_argument('--pca-noise', type=float, default=0,
help='pca noise, whose range is [0, 1]')
aug.add_argument('--random-resized-crop', type=int, default=1,
help='whether to use random resized crop')
aug.add_argument('--set-resnet-aug', action=SetResnetAugAction,
help='whether to employ standard resnet augmentations (see data.py)')
aug.add_argument('--set-data-aug-level', type=int, default=None, action=SetDataAugLevelAction,
help='set multiple data augmentations based on a `level` (see data.py)')
return aug
def get_data_loader(args):
if args.data_backend == 'dali-gpu':
return (lambda *args, **kwargs: dali.get_rec_iter(*args, **kwargs, dali_cpu=False))
if args.data_backend == 'dali-cpu':
return (lambda *args, **kwargs: dali.get_rec_iter(*args, **kwargs, dali_cpu=True))
if args.data_backend == 'synthetic':
return get_synthetic_rec_iter
if args.data_backend == 'mxnet':
return get_rec_iter
raise ValueError('Wrong data backend')
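# Minimal usage sketch (assumes `args` came from a parser populated with
# add_data_args / add_data_aug_args; `kv` may be None for single-machine runs):
#   loader = get_data_loader(args)           # picks the backend from --data-backend
#   train_iter, val_iter = loader(args, kv)  # dali-* backends forward dali_cpu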
class DataGPUSplit:
def __init__(self, dataloader, ctx, dtype):
self.dataloader = dataloader
self.ctx = ctx
self.dtype = dtype
self.batch_size = dataloader.batch_size // len(ctx)
self._num_gpus = len(ctx)
def __iter__(self):
return DataGPUSplit(iter(self.dataloader), self.ctx, self.dtype)
def __next__(self):
data = next(self.dataloader)
ret = []
for i in range(len(self.ctx)):
start = i * len(data.data[0]) // len(self.ctx)
end = (i + 1) * len(data.data[0]) // len(self.ctx)
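# The iterator-level padding is distributed across the GPU slices starting
# from the last one. Illustrative numbers (assumed, not from this file): with
# 2 GPUs, per-GPU batch_size=4 and data.pad=5, GPU 1 gets pad = min(5, 4) = 4
# and GPU 0 gets pad = max(0, 5 - 4) = 1, so exactly 5 padded samples are
# discarded in total.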
pad = max(0, min(data.pad - (len(self.ctx) - i - 1) * self.batch_size, self.batch_size))
ret.append(mx.io.DataBatch(
[data.data[0][start:end].as_in_context(self.ctx[i]).astype(self.dtype)],
[data.label[0][start:end].as_in_context(self.ctx[i])],
pad=pad))
return ret
def next(self):
return next(self)
def reset(self):
self.dataloader.reset()
def get_rec_iter(args, kv=None):
    gpus = args.gpus
    if 'horovod' in args.kv_store:
        rank = hvd.rank()
        nworker = hvd.size()
        gpus = [gpus[0]]
        batch_size = args.batch_size // hvd.size()
    else:
        rank = kv.rank if kv else 0
        nworker = kv.num_workers if kv else 1
        batch_size = args.batch_size
    if args.input_layout == 'NHWC':
        raise ValueError('ImageRecordIter cannot handle layout {}'.format(args.input_layout))
    train = DataGPUSplit(mx.io.ImageRecordIter(
        path_imgrec         = args.data_train,
        path_imgidx         = args.data_train_idx,
        label_width         = 1,
        mean_r              = args.rgb_mean[0],
        mean_g              = args.rgb_mean[1],
        mean_b              = args.rgb_mean[2],
        std_r               = args.rgb_std[0],
        std_g               = args.rgb_std[1],
        std_b               = args.rgb_std[2],
        data_name           = 'data',
        label_name          = 'softmax_label',
        data_shape          = args.image_shape,
        batch_size          = batch_size,
        rand_crop           = args.random_crop,
        max_random_scale    = args.max_random_scale,
        random_resized_crop = args.random_resized_crop,
        min_random_scale    = args.min_random_scale,
        max_aspect_ratio    = args.max_random_aspect_ratio,
        min_aspect_ratio    = args.min_random_aspect_ratio,
        max_random_area     = args.max_random_area,
        min_random_area     = args.min_random_area,
        min_crop_size       = args.min_crop_size,
        max_crop_size       = args.max_crop_size,
        brightness          = args.brightness,
        contrast            = args.contrast,
        saturation          = args.saturation,
        pca_noise           = args.pca_noise,
        random_h            = args.max_random_h,
        random_s            = args.max_random_s,
        random_l            = args.max_random_l,
        max_rotate_angle    = args.max_random_rotate_angle,
        max_shear_ratio     = args.max_random_shear_ratio,
        rand_mirror         = args.random_mirror,
        preprocess_threads  = args.data_mxnet_threads,
        shuffle             = True,
        num_parts           = nworker,
        part_index          = rank,
        seed                = args.seed or '0',
    ), [mx.gpu(gpu) for gpu in gpus], args.dtype)
    if args.data_val is None:
        return (train, None)
    val = DataGPUSplit(mx.io.ImageRecordIter(
        path_imgrec         = args.data_val,
        path_imgidx         = args.data_val_idx,
        label_width         = 1,
        mean_r              = args.rgb_mean[0],
        mean_g              = args.rgb_mean[1],
        mean_b              = args.rgb_mean[2],
        std_r               = args.rgb_std[0],
        std_g               = args.rgb_std[1],
        std_b               = args.rgb_std[2],
        data_name           = 'data',
        label_name          = 'softmax_label',
        batch_size          = batch_size,
        round_batch         = False,
        data_shape          = args.image_shape,
        preprocess_threads  = args.data_mxnet_threads,
        rand_crop           = False,
        rand_mirror         = False,
        num_parts           = nworker,
        part_index          = rank,
        resize              = args.data_val_resize,
    ), [mx.gpu(gpu) for gpu in gpus], args.dtype)
    return (train, val)
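# Unlike the DALI backend, which builds one pipeline per GPU, this backend
# decodes a single ImageRecordIter stream on the CPU and lets DataGPUSplit
# slice each batch across the configured GPUs (under horovod only gpus[0] is
# used, since every rank runs in its own process).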
class SyntheticDataIter(DataIter):
def __init__(self, num_classes, data_shape, max_iter, ctx, dtype):
self.batch_size = data_shape[0]
self.cur_iter = 0
self.max_iter = max_iter
self.dtype = dtype
label = np.random.randint(0, num_classes, [self.batch_size,])
data = np.random.uniform(-1, 1, data_shape)
self.data = []
self.label = []
self._num_gpus = len(ctx)
for dev in ctx:
self.data.append(mx.nd.array(data, dtype=self.dtype, ctx=dev))
self.label.append(mx.nd.array(label, dtype=self.dtype, ctx=dev))
def __iter__(self):
return self
def next(self):
self.cur_iter += 1
if self.cur_iter <= self.max_iter:
return [DataBatch(data=(data,), label=(label,), pad=0) for data, label in zip(self.data, self.label)]
else:
raise StopIteration
def __next__(self):
return self.next()
def reset(self):
self.cur_iter = 0
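# Note: SyntheticDataIter materializes one random batch per GPU up front and
# replays it every iteration, so the 'synthetic' backend measures compute
# throughput with the input pipeline taken out of the picture.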
def get_synthetic_rec_iter(args, kv=None):
gpus = args.gpus
if 'horovod' in args.kv_store:
gpus = [gpus[0]]
batch_size = args.batch_size // hvd.size()
else:
batch_size = args.batch_size
if args.input_layout == 'NCHW':
data_shape = (batch_size, *args.image_shape)
elif args.input_layout == 'NHWC':
data_shape = (batch_size, *args.image_shape[1:], args.image_shape[0])
else:
raise ValueError('Wrong input layout')
train = SyntheticDataIter(args.num_classes, data_shape,
args.num_examples // args.batch_size,
[mx.gpu(gpu) for gpu in gpus], args.dtype)
if args.data_val is None:
return (train, None)
val = SyntheticDataIter(args.num_classes, data_shape,
args.num_examples // args.batch_size,
[mx.gpu(gpu) for gpu in gpus], args.dtype)
return (train, val)
def load_image(args, path, ctx=mx.cpu()):
image = mx.image.imread(path).astype('float32')
image = mx.image.imresize(image, *args.image_shape[1:])
image = (image - nd.array(args.rgb_mean)) / nd.array(args.rgb_std)
image = image.as_in_context(ctx)
if args.input_layout == 'NCHW':
image = image.transpose((2, 0, 1))
image = image.astype(args.dtype)
if args.image_shape[0] == 4:
dim = 0 if args.input_layout == 'NCHW' else 2
image = nd.concat(image, nd.zeros((1, *image.shape[1:]), dtype=image.dtype, ctx=image.context), dim=dim)
return image
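# Note: when image_shape[0] == 4, a zero-filled fourth channel is concatenated
# so a single image matches networks built with pad_output=True (channel-padded
# input); which axis gets the padding depends on the NCHW vs NHWC layout.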

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches the ResNet50 training benchmark in FP16 on 1, 4 and 8 GPUs with batch sizes 64, 128, 192 and 208
# Usage ./BENCHMARK_FP16.sh <additional flags>
python benchmark.py -n 1,4,8 -b 64,128,192,208 -e 2 -w 1 -i 100 -o report.json $@

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches the ResNet50 training benchmark in FP32 on 1, 4 and 8 GPUs with batch sizes 32, 64 and 96
# Usage ./BENCHMARK_FP32.sh <additional flags>
python benchmark.py -n 1,4,8 -b 32,64,96 -e 2 -w 1 -i 100 --dtype float32 -o report.json $@

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches the ResNet50 inference benchmark in FP16 on 1 GPU with batch sizes 1, 2, 4, 64, 128, 192 and 208
# Usage ./INFER_BENCHMARK_FP16.sh <additional flags>
python benchmark.py -n 1 -b 1,2,4,64,128,192,208 --only-inference -e 3 -w 1 -i 100 -o report.json $@

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP16 on 1 GPU with batch size 208
# Usage ./RN50_FP16_1GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 1 -b 208 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP16 on 4 GPUs with a total batch size of 832 (208 per GPU)
# Usage ./RN50_FP16_4GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 4 -b 208 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP16 on 8 GPUs with a total batch size of 1664 (208 per GPU)
# Usage ./RN50_FP16_8GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 8 -b 208 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP32 on 1 GPU with batch size 96
# Usage ./RN50_FP32_1GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 1 -b 96 --dtype float32 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP32 on 4 GPUs with a total batch size of 384 (96 per GPU)
# Usage ./RN50_FP32_4GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 4 -b 96 --dtype float32 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches ResNet50 training in FP32 on 8 GPUs with a total batch size of 768 (96 per GPU)
# Usage ./RN50_FP32_8GPU.sh <path to this repository> <additional flags>
"$1/runner" -n 8 -b 96 --dtype float32 --model-prefix model ${@:2}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script scores a ResNet50 checkpoint in FP16 on 1 GPU with batch size 128
# Usage ./SCORE_FP16.sh <model prefix> <epoch> <additional flags>
./runner -n 1 -b 128 --only-inference --model-prefix $1 --load-epoch $2 -e 1 ${@:3}

View file

@ -1,19 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script scores a ResNet50 checkpoint in FP32 on 1 GPU with batch size 64
# Usage ./SCORE_FP32.sh <model prefix> <epoch> <additional flags>
./runner -n 1 -b 64 --dtype float32 --only-inference --model-prefix $1 --load-epoch $2 -e 1 ${@:3}

View file

@ -33,197 +33,408 @@
# See the License for the specific language governing permissions and
# limitations under the License.
""" example train fit utility """
""" train fit utility """
import logging
import os
import time
import re
import math
import sys
import random
from itertools import starmap
import numpy as np
import mxnet as mx
import mxnet.ndarray as nd
import horovod.mxnet as hvd
import mxnet.contrib.amp as amp
from mxnet import autograd as ag
from mxnet import gluon
from report import Report
from benchmarking import BenchmarkingDataIter
import data
def add_fit_args(parser):
    """
    parser : argparse.ArgumentParser
    return a parser added with args required by fit
    """
    def int_list(x):
        return list(map(int, x.split(',')))

    def float_list(x):
        return list(map(float, x.split(',')))

    train = parser.add_argument_group('Training')
    train.add_argument('--mode', default='train_val', choices=('train_val', 'train', 'val', 'pred'),
                       help='mode')
    train.add_argument('--seed', type=int, default=None,
                       help='random seed')
    train.add_argument('--gpus', type=int_list, default=[0],
                       help='list of gpus to run, e.g. 0 or 0,2,5')
    train.add_argument('--kv-store', type=str, default='device', choices=('device', 'horovod'),
                       help='key-value store type')
    train.add_argument('--dtype', type=str, default='float16', choices=('float32', 'float16'),
                       help='precision')
    train.add_argument('--amp', action='store_true',
                       help='If enabled, turn on AMP (Automatic Mixed Precision)')
    train.add_argument('--batch-size', type=int, default=192,
                       help='the batch size')
    train.add_argument('--num-epochs', type=int, default=90,
                       help='number of epochs')
    train.add_argument('--lr', type=float, default=0.1,
                       help='initial learning rate')
    train.add_argument('--lr-schedule', choices=('multistep', 'cosine'), default='cosine',
                       help='learning rate schedule')
    train.add_argument('--lr-factor', type=float, default=0.256,
                       help='the ratio to reduce lr on each step')
    train.add_argument('--lr-steps', type=float_list, default=[],
                       help='the epochs to reduce the lr, e.g. 30,60')
    train.add_argument('--warmup-epochs', type=int, default=5,
                       help='the epochs to ramp-up lr to scaled large-batch value')
    train.add_argument('--optimizer', type=str, default='sgd',
                       help='the optimizer type')
    train.add_argument('--mom', type=float, default=0.875,
                       help='momentum for sgd')
    train.add_argument('--wd', type=float, default=1 / 32768,
                       help='weight decay for sgd')
    train.add_argument('--label-smoothing', type=float, default=0.1,
                       help='label smoothing factor')
    train.add_argument('--mixup', type=float, default=0,
                       help='alpha parameter for mixup (if 0 then mixup is not applied)')
    train.add_argument('--disp-batches', type=int, default=20,
                       help='show progress for every n batches')
    train.add_argument('--model-prefix', type=str, default='model',
                       help='model checkpoint prefix')
    train.add_argument('--save-frequency', type=int, default=-1,
                       help='frequency of saving model in epochs (--model-prefix must be specified). '
                            'If -1 then save only best model. If 0 then do not save anything.')
    train.add_argument('--begin-epoch', type=int, default=0,
                       help='start the model from an epoch')
    train.add_argument('--load', help='checkpoint to load')
    train.add_argument('--test-io', action='store_true',
                       help='test reading speed without training')
    train.add_argument('--test-io-mode', default='train', choices=('train', 'val'),
                       help='data to test')
    train.add_argument('--log', type=str, default='log.log',
                       help='file where to save the log from the experiment')
    train.add_argument('--report', default='report.json', help='file where to save report')
    train.add_argument('--no-metrics', action='store_true', help='do not calculate evaluation metrics (for benchmarking)')
    train.add_argument('--benchmark-iters', type=int, default=None,
                       help='run only benchmark-iters iterations from each epoch')
    return train
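# Typical wiring sketch (illustrative; the entry-point shape is an assumption,
# not part of this file):
#   parser = argparse.ArgumentParser(description='ResNet50 training')
#   data.add_data_args(parser)
#   data.add_data_aug_args(parser)
#   add_fit_args(parser)
#   args = parser.parse_args()
#   fit(args, model, data.get_data_loader(args))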
def get_epoch_size(args, kv):
return math.ceil(args.num_examples / args.batch_size)
def get_lr_scheduler(args):
def multistep_schedule(x):
lr = args.lr * (args.lr_factor ** (len(list(filter(lambda step: step <= x, args.lr_steps)))))
warmup_coeff = min(1, x / args.warmup_epochs)
return warmup_coeff * lr
def cosine_schedule(x):
steps = args.lr_steps
if not steps or steps[0] > args.warmup_epochs:
steps = [args.warmup_epochs] + steps
elif not steps or steps[0] != 0:
steps = [0] + steps
if steps[-1] != args.num_epochs:
steps.append(args.num_epochs)
if x < args.warmup_epochs:
return args.lr * x / args.warmup_epochs
for i, (step, next_step) in enumerate(zip(steps, steps[1:])):
if next_step > x:
return args.lr * 0.5 * (1 + math.cos(math.pi * (x - step) / (next_step - step))) * (args.lr_factor ** i)
return 0
schedules = {
'multistep': multistep_schedule,
'cosine': cosine_schedule,
}
return schedules[args.lr_schedule]
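# Worked example for the cosine schedule above (values assumed for
# illustration: lr=0.256, warmup_epochs=5, num_epochs=90, lr_steps=[];
# lr_factor has no effect here since there is a single cosine segment):
#   x = 0     -> 0.0     (start of linear warmup)
#   x = 2.5   -> 0.128   (halfway through warmup)
#   x = 5     -> 0.256   (warmup done, cosine segment [5, 90] begins)
#   x = 47.5  -> 0.256 * 0.5 * (1 + cos(pi/2)) = 0.128
#   x >= 90   -> 0.0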
def load_model(args, model):
if args.load is None:
return False
model.load_parameters(args.load)
logging.info('Loaded model {}'.format(args.load))
return True
def save_checkpoint(net, epoch, top1, best_acc, model_prefix, save_frequency, kvstore):
if model_prefix is None or save_frequency == 0 or ('horovod' in kvstore and hvd.rank() != 0):
return
if save_frequency > 0 and (epoch + 1) % save_frequency == 0:
fname = '{}_{:04}.params'.format(model_prefix, epoch)
net.save_parameters(fname)
logging.info('[Epoch {}] Saving checkpoint to {} with Accuracy: {:.4f}'.format(epoch, fname, top1))
if top1 > best_acc:
fname = '{}_best.params'.format(model_prefix)
net.save_parameters(fname)
logging.info('[Epoch {}] Saving checkpoint to {} with Accuracy: {:.4f}'.format(epoch, fname, top1))
def add_metrics_to_report(report, mode, metric, durations, total_batch_size, loss=None, warmup=20):
if report is None:
return
top1 = metric.get('accuracy', None)
if top1 is not None:
report.add_value('{}.top1'.format(mode), top1)
top5 = metric.get('top_k_accuracy_5', None)
if top5 is not None:
report.add_value('{}.top5'.format(mode), top5)
if loss is not None:
report.add_value('{}.loss'.format(mode), loss.get_global()[1])
if len(durations) > warmup:
durations = durations[warmup:]
duration = np.mean(durations)
total_ips = total_batch_size / duration
report.add_value('{}.latency_avg'.format(mode), duration)
for percentile in [50, 90, 95, 99, 100]:
report.add_value('{}.latency_{}'.format(mode, percentile), np.percentile(durations, percentile))
report.add_value('{}.total_ips'.format(mode), total_ips)
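# Note: the first `warmup` measured iterations (20 by default) are dropped
# before computing the latency percentiles and total_ips, so one-time start-up
# costs do not skew the reported numbers.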
def model_pred(args, model, image):
from imagenet_classes import classes
output = model(image.reshape(-1, *image.shape))[0].softmax().as_in_context(mx.cpu())
top = output.argsort(is_ascend=False)[:10]
for i, ind in enumerate(top):
ind = int(ind.asscalar())
logging.info('{:2d}. {:5.2f}% -> {}'.format(i + 1, output[ind].asscalar() * 100, classes[ind]))
def reduce_metrics(args, metrics, kvstore):
if 'horovod' not in kvstore or not metrics[0] or hvd.size() == 1:
return metrics
m = mx.ndarray.array(metrics[1], ctx=mx.gpu(args.gpus[0]))
reduced = hvd.allreduce(m)
values = reduced.as_in_context(mx.cpu()).asnumpy().tolist()
return (metrics[0], values)
def model_score(args, net, val_data, metric, kvstore, report=None):
if val_data is None:
logging.info('Omitting validation: no data')
return [], []
if not isinstance(metric, mx.metric.EvalMetric):
metric = mx.metric.create(metric)
metric.reset()
val_data.reset()
total_batch_size = val_data.batch_size * val_data._num_gpus * (hvd.size() if 'horovod' in kvstore else 1)
durations = []
tic = time.time()
outputs = []
for batches in val_data:
# synchronize to previous iteration
for o in outputs:
o.wait_to_read()
data = [b.data[0] for b in batches]
label = [b.label[0][:len(b.data[0]) - b.pad] for b in batches if len(b.data[0]) != b.pad]
outputs = [net(X) for X, b in zip(data, batches)]
outputs = [o[:len(b.data[0]) - b.pad] for o, b in zip(outputs, batches) if len(b.data[0]) != b.pad]
metric.update(label, outputs)
durations.append(time.time() - tic)
tic = time.time()
metric = reduce_metrics(args, metric.get_global(), kvstore)
add_metrics_to_report(report, 'val', dict(zip(*metric)), durations, total_batch_size)
return metric
class ScalarMetric(mx.metric.Loss):
def update(self, _, scalar):
self.sum_metric += scalar
self.global_sum_metric += scalar
self.num_inst += 1
self.global_num_inst += 1
def label_smoothing(labels, classes, eta):
return labels.one_hot(classes, on_value=1 - eta + eta / classes, off_value=eta / classes)
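# Worked example (numbers assumed for illustration): with classes=1000 and
# eta=0.1, the true class gets on_value = 1 - 0.1 + 0.1/1000 = 0.9001 and every
# other class gets off_value = 0.1/1000 = 0.0001, so each target row still sums
# to exactly 1.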
def model_fit(args, net, train_data, eval_metric, optimizer,
optimizer_params, lr_scheduler, eval_data, kvstore, kv,
begin_epoch, num_epoch, model_prefix, report, print_loss):
if not isinstance(eval_metric, mx.metric.EvalMetric):
eval_metric = mx.metric.create(eval_metric)
loss_metric = ScalarMetric()
if 'horovod' in kvstore:
trainer = hvd.DistributedTrainer(net.collect_params(), optimizer, optimizer_params)
else:
trainer = gluon.Trainer(net.collect_params(), optimizer, optimizer_params,
kvstore=kv, update_on_kvstore=False)
if args.amp:
amp.init_trainer(trainer)
sparse_label_loss = (args.label_smoothing == 0 and args.mixup == 0)
loss = gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=sparse_label_loss)
loss.hybridize(static_shape=True, static_alloc=True)
local_batch_size = train_data.batch_size
total_batch_size = local_batch_size * train_data._num_gpus * (hvd.size() if 'horovod' in kvstore else 1)
durations = []
epoch_size = get_epoch_size(args, kv)
def transform_data(images, labels):
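# Mixup blends each sample with the batch in reverse order:
# x' = c * x + (1 - c) * x[::-1] with c ~ Beta(alpha, alpha), and the
# (smoothed) one-hot labels are mixed with the same per-sample coefficients.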
if args.mixup != 0:
coeffs = mx.nd.array(np.random.beta(args.mixup, args.mixup, size=images.shape[0])).as_in_context(images.context)
image_coeffs = coeffs.astype(images.dtype, copy=False).reshape(*coeffs.shape, 1, 1, 1)
ret_images = image_coeffs * images + (1 - image_coeffs) * images[::-1]
ret_labels = label_smoothing(labels, args.num_classes, args.label_smoothing)
label_coeffs = coeffs.reshape(*coeffs.shape, 1)
ret_labels = label_coeffs * ret_labels + (1 - label_coeffs) * ret_labels[::-1]
else:
ret_images = images
if not sparse_label_loss:
ret_labels = label_smoothing(labels, args.num_classes, args.label_smoothing)
else:
ret_labels = labels
return ret_images, ret_labels
best_accuracy = -1
for epoch in range(begin_epoch, num_epoch):
tic = time.time()
train_data.reset()
eval_metric.reset()
loss_metric.reset()
btic = time.time()
logging.info('Starting epoch {}'.format(epoch))
outputs = []
for i, batches in enumerate(train_data):
# synchronize to previous iteration
for o in outputs:
o.wait_to_read()
trainer.set_learning_rate(lr_scheduler(epoch + i / epoch_size))
data = [b.data[0] for b in batches]
label = [b.label[0].as_in_context(b.data[0].context) for b in batches]
orig_label = label
data, label = zip(*starmap(transform_data, zip(data, label)))
outputs = []
Ls = []
with ag.record():
for x, y in zip(data, label):
z = net(x)
L = loss(z, y)
# store the loss and do backward after we have done forward
# on all GPUs for better speed on multiple GPUs.
Ls.append(L)
outputs.append(z)
if args.amp:
with amp.scale_loss(Ls, trainer) as scaled_loss:
ag.backward(scaled_loss)
else:
ag.backward(Ls)
if 'horovod' in kvstore:
trainer.step(local_batch_size)
else:
trainer.step(total_batch_size)
if print_loss:
loss_metric.update(..., np.mean([l.asnumpy() for l in Ls]).item())
eval_metric.update(orig_label, outputs)
if args.disp_batches and not (i + 1) % args.disp_batches:
name, acc = eval_metric.get()
if print_loss:
name = [loss_metric.get()[0]] + name
acc = [loss_metric.get()[1]] + acc
logging.info('Epoch[{}] Batch [{}-{}]\tSpeed: {} samples/sec\tLR: {}\t{}'.format(
epoch, (i // args.disp_batches) * args.disp_batches, i,
args.disp_batches * total_batch_size / (time.time() - btic), trainer.learning_rate,
'\t'.join(list(map(lambda x: '{}: {:.6f}'.format(*x), zip(name, acc))))))
eval_metric.reset_local()
loss_metric.reset_local()
btic = time.time()
durations.append(time.time() - tic)
tic = time.time()
add_metrics_to_report(report, 'train', dict(eval_metric.get_global_name_value()), durations, total_batch_size, loss_metric if print_loss else None)
if args.mode == 'train_val':
logging.info('Validating epoch {}'.format(epoch))
score = model_score(args, net, eval_data, eval_metric, kvstore, report)
for name, value in zip(*score):
logging.info('Epoch[{}] Validation {:20}: {}'.format(epoch, name, value))
score = dict(zip(*score))
accuracy = score.get('accuracy', -1)
save_checkpoint(net, epoch, accuracy, best_accuracy, model_prefix, args.save_frequency, kvstore)
best_accuracy = max(best_accuracy, accuracy)
def fit(args, model, data_loader):
"""
train a model
args : argparse returns
model : the neural network model
data_loader : function that returns the train and val data iterators
"""
start_time = time.time()
report = Report(args.arch, len(args.gpus), sys.argv)
# select gpu for horovod process
if 'horovod' in args.kv_store:
hvd.init()
args.gpus = [args.gpus[hvd.local_rank()]]
if args.amp:
amp.init()
if args.seed is not None:
logging.info('Setting seeds to {}'.format(args.seed))
random.seed(args.seed)
np.random.seed(args.seed)
mx.random.seed(args.seed)
if 'horovod' in args.kv_store:
kv = None
rank = hvd.rank()
num_workers = hvd.size()
else:
kv = mx.kvstore.create(args.kv_store)
rank = kv.rank
num_workers = kv.num_workers
if args.test_io:
train, val = data_loader(args, kv)
if args.test_io_mode == 'train':
data_iter = train
else:
data_iter = val
tic = time.time()
for i, batch in enumerate(data_iter):
if isinstance(batch, list):
for b in batch:
for j in b.data:
@ -232,232 +443,90 @@ def fit(args, network, data_loader, **kwargs):
for j in batch.data:
j.wait_to_read()
if (i + 1) % args.disp_batches == 0:
logging.info('Batch [{}]\tSpeed: {:.2f} samples/sec'.format(
i, args.disp_batches * args.batch_size / (time.time() - tic)))
tic = time.time()
return
if not load_model(args, model):
# all initializers should be specified in the model definition.
# if not, this will raise an error
model.initialize(mx.init.Initializer())
# devices for training
devs = list(map(mx.gpu, args.gpus))
model.collect_params().reset_ctx(devs)
if args.mode == 'pred':
logging.info('Inferring image {}'.format(args.data_pred))
model_pred(args, model, data.load_image(args, args.data_pred, devs[0]))
return
# learning rate
lr_scheduler = get_lr_scheduler(args)
optimizer_params = {
'learning_rate': 0,
'wd': args.wd,
'multi_precision': True,
}
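# learning_rate is deliberately 0 here: model_fit drives the effective rate
# each iteration via trainer.set_learning_rate(lr_scheduler(epoch + i / epoch_size)).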
# Only a limited number of optimizers have 'momentum' property
has_momentum = {'sgd', 'dcasgd', 'nag', 'signum', 'lbsgd'}
if args.optimizer in has_momentum:
optimizer_params['momentum'] = args.mom
# evaluation metrics
if not args.no_metrics:
eval_metrics = ['accuracy']
eval_metrics.append(mx.metric.create(
'top_k_accuracy', top_k=5))
else:
eval_metrics = []
train, val = data_loader(args, kv)
train = BenchmarkingDataIter(train, args.benchmark_iters)
if val is not None:
val = BenchmarkingDataIter(val, args.benchmark_iters)
class Gatherer:
def __init__(self, report, mode, data_iter, total_bs=None):
self.report = report
self.mode = mode
self.total_bs = total_bs
self.data_iter = data_iter
self.clear()
def clear(self):
self.num = 0
self.top1 = 0
self.top5 = 0
self.loss = 0
self.time = 0
self.tic = 0
def gather_metrics(self, data):
params = dict(data.eval_metric.get_global_name_value())
if self.num != 0:
self.time += time.time() - self.tic
self.num += 1
if not args.no_metrics:
self.top1 = params['accuracy']
self.top5 = params['top_k_accuracy_5']
self.loss = params['cross-entropy']
self.tic = time.time()
def add_metrics(self, *a, **k):
top1 = self.top1 * 100
top5 = self.top5 * 100
loss = self.loss
if self.num <= 1:
time = float('nan')
else:
time = self.time / (self.num - 1)
data = self.data_iter.get_avg_time_and_clear()
if self.total_bs is not None:
compute_ips = self.total_bs / (time - data)
total_ips = self.total_bs / time
if not args.no_metrics:
self.report.add_value('{}.top1'.format(self.mode), top1)
self.report.add_value('{}.top5'.format(self.mode), top5)
self.report.add_value('{}.loss'.format(self.mode), loss)
self.report.add_value('{}.time'.format(self.mode), time)
# self.report.add_value('{}.data'.format(self.mode), data)
if self.total_bs is not None:
# self.report.add_value('{}.compute_ips'.format(self.mode), compute_ips)
self.report.add_value('{}.total_ips'.format(self.mode), total_ips)
self.clear()
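For reference, a minimal self-contained sketch of the throughput arithmetic used in add_metrics above (all numbers are made up):

total_bs = 8 * 192                 # global batch size across GPUs
iter_time = 0.25                   # average seconds per iteration (first iteration excluded)
data_time = 0.05                   # average data-loading seconds per iteration
total_ips = total_bs / iter_time                  # end-to-end images/sec
compute_ips = total_bs / (iter_time - data_time)  # images/sec with data time excluded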
def save_report(*a, **k):
report.set_total_duration(time.time() - start_time)
if args.report:
report.save(args.report)
train_gatherer = Gatherer(report, 'train', train, args.batch_size)
eval_gatherer = Gatherer(report, 'val', val, args.batch_size)
batch_end_callbacks = [train_gatherer.gather_metrics] + batch_end_callbacks
epoch_end_callbacks = [train_gatherer.add_metrics, save_report] + epoch_end_callbacks
eval_batch_end_callbacks = [eval_gatherer.gather_metrics]
eval_end_callbacks = [eval_gatherer.add_metrics, save_report]
if 'horovod' in args.kv_store:
# Fetch and broadcast parameters
params = model.collect_params()
if params is not None:
hvd.broadcast_parameters(params, root_rank=0)
# run
model.fit(train,
begin_epoch=args.load_epoch if args.load_epoch else 0,
num_epoch=args.num_epochs if not args.only_inference else 0,
eval_data=val,
eval_metric=eval_metrics,
kvstore=kv,
optimizer=args.optimizer,
optimizer_params=optimizer_params,
initializer=initializer,
arg_params=arg_params,
aux_params=aux_params,
batch_end_callback=batch_end_callbacks,
epoch_end_callback=epoch_end_callbacks,
eval_batch_end_callback=eval_batch_end_callbacks,
eval_end_callback=eval_end_callbacks,
allow_missing=True,
monitor=monitor)
if args.mode in ['train_val', 'train']:
model_fit(
args,
model,
train,
begin_epoch=args.begin_epoch,
num_epoch=args.num_epochs,
eval_data=val,
eval_metric=eval_metrics,
kvstore=args.kv_store,
kv=kv,
optimizer=args.optimizer,
optimizer_params=optimizer_params,
lr_scheduler=lr_scheduler,
report=report,
model_prefix=args.model_prefix,
print_loss=not args.no_metrics,
)
elif args.mode == 'val':
for epoch in range(args.num_epochs): # loop for benchmarking
score = model_score(args, model, val, eval_metrics, args.kv_store, report=report)
for name, value in zip(*score):
logging.info('Validation {:20}: {}'.format(name, value))
else:
raise ValueError('Wrong mode')
if args.only_inference:
for epoch in range(args.num_epochs):
score = model.score(val, eval_metrics, batch_end_callback=eval_batch_end_callbacks, score_end_callback=eval_end_callbacks, epoch=epoch)
print('-------------')
for name, value in score:
print('{}: {}'.format(name, value))
mx.nd.waitall()
if args.profile_server_suffix:
mx.profiler.set_state(state='run', profile_process='server')
if args.profile_worker_suffix:
mx.profiler.set_state(state='run', profile_process='worker')
report.set_total_duration(time.time() - start_time)
if args.report:
suffix = '-{}'.format(hvd.rank()) if 'horovod' in args.kv_store and hvd.rank() != 0 else ''
report.save(args.report + suffix)
logging.info('Experiment took: {} sec'.format(report.total_duration))
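Assuming the report was saved via --report report.json, a minimal sketch of reading it back (keys follow the Report format documented later in this commit; metric values are per-epoch lists):

import json
with open('report.json') as f:
    report = json.load(f)
print(report['model'], report['ngpus'])
print(report['metrics']['train.total_ips'][-1])   # throughput of the last epoch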


@ -0,0 +1,522 @@
# Copyright 2017-2018 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# -----------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import copy
import mxnet as mx
from mxnet.gluon.block import HybridBlock
from mxnet.gluon import nn
def add_model_args(parser):
model = parser.add_argument_group('Model')
model.add_argument('--arch', default='resnetv15',
choices=['resnetv1', 'resnetv15',
'resnextv1', 'resnextv15',
'xception'],
help='model architecture')
model.add_argument('--num-layers', type=int, default=50,
help='number of layers in the neural network, \
required by some networks such as resnet')
model.add_argument('--num-groups', type=int, default=32,
help='number of groups for grouped convolutions, \
required by some networks such as resnext')
model.add_argument('--num-classes', type=int, default=1000,
help='the number of classes')
model.add_argument('--batchnorm-eps', type=float, default=1e-5,
help='the amount added to the batchnorm variance to prevent output explosion.')
model.add_argument('--batchnorm-mom', type=float, default=0.9,
help='the leaky-integrator factor controlling the batchnorm mean and variance.')
model.add_argument('--fuse-bn-relu', type=int, default=0,
help='have batchnorm kernel perform activation relu')
model.add_argument('--fuse-bn-add-relu', type=int, default=0,
help='have batchnorm kernel perform add followed by activation relu')
return model
class Builder:
def __init__(self, dtype, input_layout, conv_layout, bn_layout,
pooling_layout, bn_eps, bn_mom, fuse_bn_relu, fuse_bn_add_relu):
self.dtype = dtype
self.input_layout = input_layout
self.conv_layout = conv_layout
self.bn_layout = bn_layout
self.pooling_layout = pooling_layout
self.bn_eps = bn_eps
self.bn_mom = bn_mom
self.fuse_bn_relu = fuse_bn_relu
self.fuse_bn_add_relu = fuse_bn_add_relu
self.act_type = 'relu'
self.bn_gamma_initializer = lambda last: 'zeros' if last else 'ones'
self.linear_initializer = lambda groups=1: mx.init.Xavier(rnd_type='gaussian', factor_type="in",
magnitude=2 * (groups ** 0.5))
self.last_layout = self.input_layout
def copy(self):
return copy.copy(self)
def batchnorm(self, last=False):
gamma_initializer = self.bn_gamma_initializer(last)
bn_axis = 3 if self.bn_layout == 'NHWC' else 1
return self.sequence(
self.transpose(self.bn_layout),
nn.BatchNorm(axis=bn_axis, momentum=self.bn_mom, epsilon=self.bn_eps,
gamma_initializer=gamma_initializer,
running_variance_initializer=gamma_initializer)
)
def batchnorm_add_relu(self, last=False):
gamma_initializer = self.bn_gamma_initializer(last)
if self.fuse_bn_add_relu:
bn_axis = 3 if self.bn_layout == 'NHWC' else 1
return self.sequence(
self.transpose(self.bn_layout),
BatchNormAddRelu(axis=bn_axis, momentum=self.bn_mom,
epsilon=self.bn_eps, act_type=self.act_type,
gamma_initializer=gamma_initializer,
running_variance_initializer=gamma_initializer)
)
return NonFusedBatchNormAddRelu(self, last=last)
def batchnorm_relu(self, last=False):
gamma_initializer = self.bn_gamma_initializer(last)
if self.fuse_bn_relu:
bn_axis = 3 if self.bn_layout == 'NHWC' else 1
return self.sequence(
self.transpose(self.bn_layout),
nn.BatchNorm(axis=bn_axis, momentum=self.bn_mom,
epsilon=self.bn_eps, act_type=self.act_type,
gamma_initializer=gamma_initializer,
running_variance_initializer=gamma_initializer)
)
return self.sequence(self.batchnorm(last=last), self.activation())
def activation(self):
return nn.Activation(self.act_type)
def global_avg_pool(self):
return self.sequence(
self.transpose(self.pooling_layout),
nn.GlobalAvgPool2D(layout=self.pooling_layout)
)
def max_pool(self, pool_size, strides=1, padding=True):
padding = pool_size // 2 if padding is True else int(padding)
return self.sequence(
self.transpose(self.pooling_layout),
nn.MaxPool2D(pool_size, strides=strides, padding=padding,
layout=self.pooling_layout)
)
def conv(self, channels, kernel_size, padding=True, strides=1, groups=1, in_channels=0):
padding = kernel_size // 2 if padding is True else int(padding)
initializer = self.linear_initializer(groups=groups)
return self.sequence(
self.transpose(self.conv_layout),
nn.Conv2D(channels, kernel_size=kernel_size, strides=strides,
padding=padding, use_bias=False, groups=groups,
in_channels=in_channels, layout=self.conv_layout,
weight_initializer=initializer)
)
def separable_conv(self, channels, kernel_size, in_channels, padding=True, strides=1):
return self.sequence(
self.conv(in_channels, kernel_size, padding=padding,
strides=strides, groups=in_channels, in_channels=in_channels),
self.conv(channels, 1, in_channels=in_channels)
)
def dense(self, units, in_units=0):
return nn.Dense(units, in_units=in_units,
weight_initializer=self.linear_initializer())
def transpose(self, to_layout):
if self.last_layout == to_layout:
return None
ret = Transpose(self.last_layout, to_layout)
self.last_layout = to_layout
return ret
def sequence(self, *seq):
seq = list(filter(lambda x: x is not None, seq))
if len(seq) == 1:
return seq[0]
ret = nn.HybridSequential()
ret.add(*seq)
return ret
class Transpose(HybridBlock):
def __init__(self, from_layout, to_layout):
super().__init__()
supported_layouts = ['NCHW', 'NHWC']
if from_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(from_layout))
if to_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(to_layout))
self.from_layout = from_layout
self.to_layout = to_layout
def hybrid_forward(self, F, x):
# Insert transpose if from_layout and to_layout don't match
if self.from_layout == 'NCHW' and self.to_layout == 'NHWC':
return F.transpose(x, axes=(0, 2, 3, 1))
elif self.from_layout == 'NHWC' and self.to_layout == 'NCHW':
return F.transpose(x, axes=(0, 3, 1, 2))
else:
return x
def __repr__(self):
s = '{name}({content})'
if self.from_layout == self.to_layout:
content = 'passthrough ' + self.from_layout
else:
content = self.from_layout + ' -> ' + self.to_layout
return s.format(name=self.__class__.__name__,
content=content)
class LayoutWrapper(HybridBlock):
def __init__(self, op, io_layout, op_layout, **kwargs):
super(LayoutWrapper, self).__init__(**kwargs)
with self.name_scope():
self.layout1 = Transpose(io_layout, op_layout)
self.op = op
self.layout2 = Transpose(op_layout, io_layout)
def hybrid_forward(self, F, *x):
return self.layout2(self.op(*(self.layout1(y) for y in x)))
class BatchNormAddRelu(nn.BatchNorm):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
if self._kwargs.pop('act_type') != 'relu':
raise ValueError('BatchNormAddRelu can be used only with ReLU as activation')
def hybrid_forward(self, F, x, y, gamma, beta, running_mean, running_var):
return F.BatchNormAddRelu(data=x, addend=y, gamma=gamma, beta=beta,
moving_mean=running_mean, moving_var=running_var, name='fwd', **self._kwargs)
class NonFusedBatchNormAddRelu(HybridBlock):
def __init__(self, builder, **kwargs):
super().__init__()
self.bn = builder.batchnorm(**kwargs)
self.act = builder.activation()
def hybrid_forward(self, F, x, y):
return self.act(self.bn(x) + y)
# Blocks
class ResNetBasicBlock(HybridBlock):
def __init__(self, builder, channels, stride, downsample=False, in_channels=0,
version='1', resnext_groups=None, **kwargs):
super().__init__()
assert not resnext_groups
self.transpose = builder.transpose(builder.conv_layout)
builder_copy = builder.copy()
body = [
builder.conv(channels, 3, strides=stride, in_channels=in_channels),
builder.batchnorm_relu(),
builder.conv(channels, 3),
]
self.body = builder.sequence(*body)
self.bn_add_relu = builder.batchnorm_add_relu(last=True)
builder = builder_copy
if downsample:
self.downsample = builder.sequence(
builder.conv(channels, 1, strides=stride, in_channels=in_channels),
builder.batchnorm()
)
else:
self.downsample = None
def hybrid_forward(self, F, x):
if self.transpose is not None:
x = self.transpose(x)
residual = x
x = self.body(x)
if self.downsample:
residual = self.downsample(residual)
x = self.bn_add_relu(x, residual)
return x
class ResNetBottleNeck(HybridBlock):
def __init__(self, builder, channels, stride, downsample=False, in_channels=0,
version='1', resnext_groups=None):
super().__init__()
stride1 = stride if version == '1' else 1
stride2 = 1 if version == '1' else stride
mult = 2 if resnext_groups else 1
groups = resnext_groups or 1
self.transpose = builder.transpose(builder.conv_layout)
builder_copy = builder.copy()
body = [
builder.conv(channels * mult // 4, 1, strides=stride1, in_channels=in_channels),
builder.batchnorm_relu(),
builder.conv(channels * mult // 4, 3, strides=stride2),
builder.batchnorm_relu(),
builder.conv(channels, 1)
]
self.body = builder.sequence(*body)
self.bn_add_relu = builder.batchnorm_add_relu(last=True)
builder = builder_copy
if downsample:
self.downsample = builder.sequence(
builder.conv(channels, 1, strides=stride, in_channels=in_channels),
builder.batchnorm()
)
else:
self.downsample = None
def hybrid_forward(self, F, x):
if self.transpose is not None:
x = self.transpose(x)
residual = x
x = self.body(x)
if self.downsample:
residual = self.downsample(residual)
x = self.bn_add_relu(x, residual)
return x
class XceptionBlock(HybridBlock):
def __init__(self, builder, definition, in_channels, relu_at_beginning=True):
super().__init__()
self.transpose = builder.transpose(builder.conv_layout)
builder_copy = builder.copy()
body = []
if relu_at_beginning:
body.append(builder.activation())
last_channels = in_channels
for channels1, channels2 in zip(definition, definition[1:] + [0]):
if channels1 > 0:
body.append(builder.separable_conv(channels1, 3, in_channels=last_channels))
if channels2 > 0:
body.append(builder.batchnorm_relu())
else:
body.append(builder.batchnorm(last=True))
last_channels = channels1
else:
body.append(builder.max_pool(3, 2))
self.body = builder.sequence(*body)
builder = builder_copy
if any(map(lambda x: x <= 0, definition)):
self.shortcut = builder.sequence(
builder.conv(last_channels, 1, strides=2, in_channels=in_channels),
builder.batchnorm(),
)
else:
self.shortcut = builder.sequence()
def hybrid_forward(self, F, x):
return self.shortcut(x) + self.body(x)
# Nets
class ResNet(HybridBlock):
def __init__(self, builder, block, layers, channels, classes=1000,
version='1', resnext_groups=None):
super().__init__()
assert len(layers) == len(channels) - 1
self.version = version
with self.name_scope():
features = [
builder.conv(channels[0], 7, strides=2),
builder.batchnorm_relu(),
builder.max_pool(3, 2),
]
for i, num_layer in enumerate(layers):
stride = 1 if i == 0 else 2
features.append(self.make_layer(builder, block, num_layer, channels[i+1],
stride, in_channels=channels[i],
resnext_groups=resnext_groups))
features.append(builder.global_avg_pool())
self.features = builder.sequence(*features)
self.output = builder.dense(classes, in_units=channels[-1])
def make_layer(self, builder, block, layers, channels, stride,
in_channels=0, resnext_groups=None):
layer = []
layer.append(block(builder, channels, stride, channels != in_channels,
in_channels=in_channels, version=self.version,
resnext_groups=resnext_groups))
for _ in range(layers-1):
layer.append(block(builder, channels, 1, False, in_channels=channels,
version=self.version, resnext_groups=resnext_groups))
return builder.sequence(*layer)
def hybrid_forward(self, F, x):
x = self.features(x)
x = self.output(x)
return x
class Xception(HybridBlock):
def __init__(self, builder,
definition=([32, 64],
[[128, 128, 0], [256, 256, 0], [728, 728, 0],
*([[728, 728, 728]] * 8), [728, 1024, 0]],
[1536, 2048]),
classes=1000):
super().__init__()
definition1, definition2, definition3 = definition
with self.name_scope():
features = []
last_channels = 0
for i, channels in enumerate(definition1):
features += [
builder.conv(channels, 3, strides=(2 if i == 0 else 1), in_channels=last_channels),
builder.batchnorm_relu(),
]
last_channels = channels
for i, block_definition in enumerate(definition2):
features.append(XceptionBlock(builder, block_definition, in_channels=last_channels,
relu_at_beginning=False if i == 0 else True))
last_channels = list(filter(lambda x: x > 0, block_definition))[-1]
for i, channels in enumerate(definition3):
features += [
builder.separable_conv(channels, 3, in_channels=last_channels),
builder.batchnorm_relu(),
]
last_channels = channels
features.append(builder.global_avg_pool())
self.features = builder.sequence(*features)
self.output = builder.dense(classes, in_units=last_channels)
def hybrid_forward(self, F, x):
x = self.features(x)
x = self.output(x)
return x
resnet_spec = {18: (ResNetBasicBlock, [2, 2, 2, 2], [64, 64, 128, 256, 512]),
34: (ResNetBasicBlock, [3, 4, 6, 3], [64, 64, 128, 256, 512]),
50: (ResNetBottleNeck, [3, 4, 6, 3], [64, 256, 512, 1024, 2048]),
101: (ResNetBottleNeck, [3, 4, 23, 3], [64, 256, 512, 1024, 2048]),
152: (ResNetBottleNeck, [3, 8, 36, 3], [64, 256, 512, 1024, 2048])}
def create_resnet(builder, version, num_layers=50, resnext=False, classes=1000,
                  num_groups=32):
    assert num_layers in resnet_spec, \
        "Invalid number of layers: {}. Options are {}".format(
            num_layers, str(resnet_spec.keys()))
    block_class, layers, channels = resnet_spec[num_layers]
    assert not resnext or num_layers >= 50, \
        "Cannot create a ResNeXt with fewer than 50 layers"
    net = ResNet(builder, block_class, layers, channels, version=version,
                 resnext_groups=num_groups if resnext else None)
    return net
class fp16_model(mx.gluon.block.HybridBlock):
def __init__(self, net, **kwargs):
super(fp16_model, self).__init__(**kwargs)
with self.name_scope():
self._net = net
def hybrid_forward(self, F, x):
y = self._net(x)
y = F.cast(y, dtype='float32')
return y
def get_model(arch, num_classes, num_layers, image_shape, dtype, amp,
input_layout, conv_layout, batchnorm_layout, pooling_layout,
batchnorm_eps, batchnorm_mom, fuse_bn_relu, fuse_bn_add_relu, **kwargs):
builder = Builder(
dtype = dtype,
input_layout = input_layout,
conv_layout = conv_layout,
bn_layout = batchnorm_layout,
pooling_layout = pooling_layout,
bn_eps = batchnorm_eps,
bn_mom = batchnorm_mom,
fuse_bn_relu = fuse_bn_relu,
fuse_bn_add_relu = fuse_bn_add_relu,
)
if arch.startswith('resnet') or arch.startswith('resnext'):
version = '1' if arch in {'resnetv1', 'resnextv1'} else '1.5'
net = create_resnet(
    builder = builder,
    version = version,
    resnext = arch.startswith('resnext'),
    num_layers = num_layers,
    classes = num_classes,
    num_groups = kwargs.get('num_groups', 32),
)
elif arch == 'xception':
net = Xception(builder, classes=num_classes)
else:
raise ValueError('Wrong model architecture')
net.hybridize(static_shape=True, static_alloc=True)
if not amp:
net.cast(dtype)
if dtype == 'float16':
net = fp16_model(net)
return net
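A hedged usage sketch for get_model above; the argument names mirror its signature and the values are illustrative (the NHWC layouts and fused kernels match the FP16 path used elsewhere in this commit):

net = get_model(arch='resnetv15', num_classes=1000, num_layers=50,
                image_shape='3,224,224', dtype='float16', amp=False,
                input_layout='NCHW', conv_layout='NHWC',
                batchnorm_layout='NHWC', pooling_layout='NHWC',
                batchnorm_eps=1e-5, batchnorm_mom=0.9,
                fuse_bn_relu=1, fuse_bn_add_relu=1)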


@ -21,15 +21,21 @@
# - "metrics" : per epoch metrics for train and validation
# (some of below metrics may not exist in the report,
# depending on application arguments)
# - "train.top1" : training top1 accuracy in epoch.
# - "train.top5" : training top5 accuracy in epoch.
# - "train.loss" : training loss in epoch.
# - "train.time" : average training time of iteration in seconds.
# - "train.total_ips" : training speed (data and compute time taken into account) for epoch in images/sec.
# - "val.top1", "val.top5", "val.loss", "val.time", "val.total_ips" : the same but for validation.
# - "train.top1" : training top1 accuracy in epoch.
# - "train.top5" : training top5 accuracy in epoch.
# - "train.loss" : training loss in epoch.
# - "train.total_ips" : training speed (data and compute time taken into account) for epoch in images/sec.
# - "train.latency_avg" : average latency of one iteration in seconds.
# - "train.latency_50" : median latency of one iteration in seconds.
# - "train.latency_90" : 90th percentile latency of one iteration in seconds.
# - "train.latency_95" : 95th percentile latency of one iteration in seconds.
# - "train.latency_99" : 99th percentile latency of one iteration in seconds.
# - "train.latency_100" : highest observed latency of one iteration in seconds.
# - "val.top1", "val.top5", "val.time", "val.total_ips", "val.latency_avg", "val.latency_50",
# "val.latency_90", "val.latency_95", "val.latency_99", "val.latency_100" : the same but for validation.
import json
from collections import defaultdict, OrderedDict
from collections import OrderedDict
class Report:
def __init__(self, model_name, ngpus, cmd):
@ -37,15 +43,21 @@ class Report:
self.ngpus = ngpus
self.cmd = cmd
self.total_duration = 0
self.metrics = defaultdict(lambda: [])
self.metrics = OrderedDict()
def add_value(self, metric, value):
if metric not in self.metrics:
self.metrics[metric] = []
self.metrics[metric].append(value)
def set_total_duration(self, duration):
self.total_duration = duration
def save(self, filename):
with open(filename, 'w') as f:
f.write(self.get_report())
def get_report(self):
report = OrderedDict([
('model', self.model_name),
('ngpus', self.ngpus),
@ -53,5 +65,4 @@ class Report:
('cmd', self.cmd),
('metrics', self.metrics),
])
with open(filename, 'w') as f:
json.dump(report, f, indent=4)
return json.dumps(report, indent=4)
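The latency_* metrics above are percentiles over per-iteration latencies; a minimal sketch of how such percentiles can be computed (illustrative, not necessarily the exact code used):

import numpy as np
latencies = [0.051, 0.049, 0.052, 0.300, 0.050]   # made-up per-iteration seconds
for p in (50, 90, 95, 99):
    print('latency_{}: {:.3f}'.format(p, np.percentile(latencies, p)))
print('latency_avg: {:.3f}'.format(np.mean(latencies)))
print('latency_100: {:.3f}'.format(max(latencies)))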


@ -1,376 +0,0 @@
# Copyright 2017-2018 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# -----------------------------------------------------------------------
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Adapted from https://github.com/tornadomeet/ResNet/blob/master/symbol_resnet.py
(Original author Wei Wu) by Antti-Pekka Hynninen
"Flexible Layout" (fl) version created by Dick Carter.
Implementing the original resnet ILSVRC 2015 winning network from:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Deep Residual Learning for Image Recognition"
'''
import mxnet as mx
import numpy as np
import random
# Transform a symbol from one layout to another, or do nothing if they have the same layout
def transform_layout(data, from_layout, to_layout):
supported_layouts = ['NCHW', 'NHWC']
if from_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(from_layout))
if to_layout not in supported_layouts:
raise ValueError('Not prepared to handle layout: {}'.format(to_layout))
# Insert transpose if from_layout and to_layout don't match
if from_layout == 'NCHW' and to_layout == 'NHWC':
return mx.sym.transpose(data, axes=(0, 2, 3, 1))
elif from_layout == 'NHWC' and to_layout == 'NCHW':
return mx.sym.transpose(data, axes=(0, 3, 1, 2))
else:
return data
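A quick self-contained illustration of the axis orders involved (NumPy, same axes arguments as above):

import numpy as np
x = np.zeros((2, 3, 224, 224))        # NCHW: batch, channels, height, width
y = np.transpose(x, (0, 2, 3, 1))     # NHWC, matching axes=(0, 2, 3, 1) above
assert y.shape == (2, 224, 224, 3)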
# A BatchNorm wrapper that responds to the input layout
def batchnorm(data, io_layout, batchnorm_layout, **kwargs):
# Transpose as needed to batchnorm_layout
transposed_as_needed = transform_layout(data, io_layout, batchnorm_layout)
bn_axis = 3 if batchnorm_layout == 'NHWC' else 1
batchnormed = mx.sym.BatchNorm(data=transposed_as_needed, axis=bn_axis, **kwargs)
# Transpose back to i/o layout as needed
return transform_layout(batchnormed, batchnorm_layout, io_layout)
# A BatchNormAddRelu wrapper that responds to the input layout
def batchnorm_add_relu(data, addend, io_layout, batchnorm_layout, **kwargs):
# Transpose as needed to batchnorm_layout
transposed_data_as_needed = transform_layout(data, io_layout, batchnorm_layout)
transposed_addend_as_needed = transform_layout(addend, io_layout, batchnorm_layout)
bn_axis = 3 if batchnorm_layout == 'NHWC' else 1
batchnormed = mx.sym.BatchNormAddRelu(data=transposed_data_as_needed,
addend=transposed_addend_as_needed,
axis=bn_axis, **kwargs)
# Transpose back to i/o layout as needed
return transform_layout(batchnormed, batchnorm_layout, io_layout)
# A Pooling wrapper that responds to the input layout
def pooling(data, io_layout, pooling_layout, **kwargs):
# Pooling kernel, as specified by pooling_layout, may be in conflict with i/o layout.
transposed_as_needed = transform_layout(data, io_layout, pooling_layout)
pooled = mx.sym.Pooling(data=transposed_as_needed, layout=pooling_layout, **kwargs)
# Transpose back to i/o layout as needed
return transform_layout(pooled, pooling_layout, io_layout)
# Assumption is that data comes in and out in the 'conv_layout' format.
# If this format is different from the 'batchnorm_layout' format, then the batchnorm() routine
# will introduce transposes on both sides of the mx.sym.BatchNorm symbol
def residual_unit(data, num_filter, stride, dim_match, name, bottle_neck=True,
workspace=256, memonger=False, conv_layout='NCHW', batchnorm_layout='NCHW',
verbose=False, cudnn_bn_off=False, bn_eps=2e-5, bn_mom=0.9, conv_algo=-1,
fuse_bn_relu=False, fuse_bn_add_relu=False, cudnn_tensor_core_only=False):
"""Return ResNet Unit symbol for building ResNet
Parameters
----------
data : str
Input data
num_filter : int
Number of output channels
stride : tuple
Stride used in convolution
dim_match : Boolean
True means channel number between input and output is the same, otherwise means differ
name : str
Base name of the operators
workspace : int
Workspace used in convolution operator
"""
act = 'relu' if fuse_bn_relu else None
if bottle_neck:
conv1 = mx.sym.Convolution(data=data, num_filter=int(num_filter*0.25), kernel=(1,1), stride=(1,1), pad=(0,0),
no_bias=True, workspace=workspace, name=name + '_conv1', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
bn1 = batchnorm(data=conv1, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn1', cudnn_off=cudnn_bn_off, act_type=act)
act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1') if not fuse_bn_relu else bn1
conv2 = mx.sym.Convolution(data=act1, num_filter=int(num_filter*0.25), kernel=(3,3), stride=stride, pad=(1,1),
no_bias=True, workspace=workspace, name=name + '_conv2', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
bn2 = batchnorm(data=conv2, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn2', cudnn_off=cudnn_bn_off, act_type=act)
act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2') if not fuse_bn_relu else bn2
conv3 = mx.sym.Convolution(data=act2, num_filter=num_filter, kernel=(1,1), stride=(1,1), pad=(0,0), no_bias=True,
workspace=workspace, name=name + '_conv3', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
if dim_match:
shortcut = data
else:
conv1sc = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=(1,1), stride=stride, no_bias=True,
workspace=workspace, name=name+'_conv1sc', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
shortcut = batchnorm(data=conv1sc, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_sc', cudnn_off=cudnn_bn_off)
if memonger:
shortcut._set_attr(mirror_stage='True')
if fuse_bn_add_relu:
return batchnorm_add_relu(data=conv3, addend=shortcut, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn3', cudnn_off=cudnn_bn_off)
else:
bn3 = batchnorm(data=conv3, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name=name + '_bn3', cudnn_off=cudnn_bn_off)
return mx.sym.Activation(data=bn3 + shortcut, act_type='relu', name=name + '_relu3')
else:
conv1 = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=(3,3), stride=stride, pad=(1,1),
no_bias=True, workspace=workspace, name=name + '_conv1', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
bn1 = batchnorm(data=conv1, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_bn1', cudnn_off=cudnn_bn_off, act_type=act)
act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1') if not fuse_bn_relu else bn1
conv2 = mx.sym.Convolution(data=act1, num_filter=num_filter, kernel=(3,3), stride=(1,1), pad=(1,1),
no_bias=True, workspace=workspace, name=name + '_conv2', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
if dim_match:
shortcut = data
else:
conv1sc = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=(1,1), stride=stride, no_bias=True,
workspace=workspace, name=name+'_conv1sc', layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=cudnn_tensor_core_only)
shortcut = batchnorm(data=conv1sc, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_sc', cudnn_off=cudnn_bn_off)
if memonger:
shortcut._set_attr(mirror_stage='True')
if fuse_bn_add_relu:
return batchnorm_add_relu(data=conv2, addend=shortcut, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_bn2', cudnn_off=cudnn_bn_off)
else:
bn2 = batchnorm(data=conv2, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, momentum=bn_mom, eps=bn_eps, name=name + '_bn2', cudnn_off=cudnn_bn_off)
return mx.sym.Activation(data=bn2 + shortcut, act_type='relu', name=name + '_relu2')
def resnet(units, num_stages, filter_list, num_classes, image_shape, bottle_neck=True, workspace=256, dtype='float32', memonger=False,
input_layout='NCHW', conv_layout='NCHW', batchnorm_layout='NCHW', pooling_layout='NCHW', verbose=False,
cudnn_bn_off=False, bn_eps=2e-5, bn_mom=0.9, conv_algo=-1,
fuse_bn_relu=False, fuse_bn_add_relu=False, force_tensor_core=False, use_dali=True):
"""Return ResNet symbol of
Parameters
----------
units : list
Number of units in each stage
num_stages : int
Number of stage
filter_list : list
Channel size of each stage
num_classes : int
Output size of the symbol
workspace : int
Workspace used in convolution operator
dtype : str
Precision (float32 or float16)
memonger : boolean
Activates "memory monger" to reduce the model's memory footprint
input_layout : str
interpretation (e.g. NCHW vs NHWC) of data provided by the i/o pipeline (may introduce transposes
if in conflict with 'conv_layout' below)
conv_layout : str
interpretation (e.g. NCHW vs NHWC) of data for convolution operation.
batchnorm_layout : str
directs which kernel performs the batchnorm (may introduce transposes if in conflict with 'conv_layout' above)
pooling_layout : str
directs which kernel performs the pooling (may introduce transposes if in conflict with 'conv_layout' above)
"""
act = 'relu' if fuse_bn_relu else None
num_unit = len(units)
assert(num_unit == num_stages)
data = mx.sym.Variable(name='data')
if not use_dali:
# double buffering of data
if dtype == 'float32':
data = mx.sym.identity(data=data, name='id')
else:
if dtype == 'float16':
data = mx.sym.Cast(data=data, dtype=np.float16)
(nchannel, height, width) = image_shape
# Insert transpose as needed to get the input layout to match the desired processing layout
data = transform_layout(data, input_layout, conv_layout)
if height <= 32: # such as cifar10
body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(3, 3), stride=(1,1), pad=(1, 1),
no_bias=True, name="conv0", workspace=workspace, layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=force_tensor_core)
# Is this BatchNorm supposed to be here?
body = batchnorm(data=body, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name='bn0', cudnn_off=cudnn_bn_off)
else: # often expected to be 224 such as imagenet
body = mx.sym.Convolution(data=data, num_filter=filter_list[0], kernel=(7, 7), stride=(2,2), pad=(3, 3),
no_bias=True, name="conv0", workspace=workspace, layout=conv_layout,
cudnn_algo_verbose=verbose,
cudnn_algo_fwd=conv_algo, cudnn_algo_bwd_data=conv_algo, cudnn_algo_bwd_filter=conv_algo,
cudnn_tensor_core_only=force_tensor_core)
body = batchnorm(data=body, io_layout=conv_layout, batchnorm_layout=batchnorm_layout,
fix_gamma=False, eps=bn_eps, momentum=bn_mom, name='bn0', cudnn_off=cudnn_bn_off, act_type=act)
if not fuse_bn_relu:
body = mx.sym.Activation(data=body, act_type='relu', name='relu0')
body = pooling(data=body, io_layout=conv_layout, pooling_layout=pooling_layout,
kernel=(3, 3), stride=(2, 2), pad=(1, 1), pool_type='max')
for i in range(num_stages):
body = residual_unit(body, filter_list[i+1], (1 if i==0 else 2, 1 if i==0 else 2), False,
name='stage%d_unit%d' % (i + 1, 1),
bottle_neck=bottle_neck, workspace=workspace,
memonger=memonger, conv_layout=conv_layout, batchnorm_layout=batchnorm_layout,
verbose=verbose, cudnn_bn_off=cudnn_bn_off, bn_eps=bn_eps, bn_mom=bn_mom,
conv_algo=conv_algo, fuse_bn_relu=fuse_bn_relu, fuse_bn_add_relu=fuse_bn_add_relu,
cudnn_tensor_core_only=force_tensor_core)
for j in range(units[i]-1):
body = residual_unit(body, filter_list[i+1], (1,1), True, name='stage%d_unit%d' % (i + 1, j + 2),
bottle_neck=bottle_neck, workspace=workspace,
memonger=memonger, conv_layout=conv_layout, batchnorm_layout=batchnorm_layout,
verbose=verbose, cudnn_bn_off=cudnn_bn_off, bn_eps = bn_eps, bn_mom=bn_mom,
conv_algo=conv_algo, fuse_bn_relu=fuse_bn_relu, fuse_bn_add_relu=fuse_bn_add_relu,
cudnn_tensor_core_only=force_tensor_core)
# bn1 = mx.sym.BatchNorm(data=body, fix_gamma=False, eps=2e-5, momentum=bn_mom, name='bn1')
# relu1 = mx.sym.Activation(data=bn1, act_type='relu', name='relu1')
# Although the kernel size is not used when global_pool=True, one must still be provided
pool1 = pooling(data=body, io_layout=conv_layout, pooling_layout=pooling_layout,
global_pool=True, kernel=(7, 7), pool_type='avg', name='pool1')
flat = mx.sym.Flatten(data=pool1)
fc1 = mx.sym.FullyConnected(data=flat, num_hidden=num_classes, name='fc1', cublas_algo_verbose=verbose)
if dtype == 'float16':
fc1 = mx.sym.Cast(data=fc1, dtype=np.float32)
return mx.sym.SoftmaxOutput(data=fc1, name='softmax')
def get_symbol(num_classes, num_layers, image_shape, conv_workspace=256, dtype='float32',
input_layout='NCHW', conv_layout='NCHW', batchnorm_layout='NCHW', pooling_layout='NCHW',
verbose=False, seed=None, cudnn_bn_off=False, batchnorm_eps=2e-5, batchnorm_mom=0.9,
conv_algo=-1, fuse_bn_relu=False, fuse_bn_add_relu=False, force_tensor_core=False, use_dali=True, **kwargs):
"""
Adapted from https://github.com/tornadomeet/ResNet/blob/master/symbol_resnet.py
(Original author Wei Wu) by Antti-Pekka Hynninen
Implementing the original resnet ILSVRC 2015 winning network from:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Deep Residual Learning for Image Recognition"
"""
if seed is not None:
print('Setting seeds to %s' % (seed,))
random.seed(seed)
np.random.seed(seed)
mx.random.seed(seed)
image_shape = [int(l) for l in image_shape.split(',')]
(nchannel, height, width) = image_shape
if height <= 28:
num_stages = 3
if (num_layers-2) % 9 == 0 and num_layers >= 164:
per_unit = [(num_layers-2)//9]
filter_list = [16, 64, 128, 256]
bottle_neck = True
elif (num_layers-2) % 6 == 0 and num_layers < 164:
per_unit = [(num_layers-2)//6]
filter_list = [16, 16, 32, 64]
bottle_neck = False
else:
raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
units = per_unit * num_stages
else:
if num_layers >= 50:
filter_list = [64, 256, 512, 1024, 2048]
bottle_neck = True
else:
filter_list = [64, 64, 128, 256, 512]
bottle_neck = False
num_stages = 4
if num_layers == 18:
units = [2, 2, 2, 2]
elif num_layers == 34:
units = [3, 4, 6, 3]
elif num_layers == 50:
units = [3, 4, 6, 3]
elif num_layers == 101:
units = [3, 4, 23, 3]
elif num_layers == 152:
units = [3, 8, 36, 3]
elif num_layers == 200:
units = [3, 24, 36, 3]
elif num_layers == 269:
units = [3, 30, 48, 8]
else:
raise ValueError("no experiments done on num_layers {}, you can do it yourself".format(num_layers))
return resnet(units = units,
num_stages = num_stages,
filter_list = filter_list,
num_classes = num_classes,
image_shape = image_shape,
bottle_neck = bottle_neck,
workspace = conv_workspace,
dtype = dtype,
input_layout = input_layout,
conv_layout = conv_layout,
batchnorm_layout = batchnorm_layout,
pooling_layout = pooling_layout,
verbose = verbose,
cudnn_bn_off = cudnn_bn_off,
bn_eps = batchnorm_eps,
bn_mom = batchnorm_mom,
conv_algo = conv_algo,
fuse_bn_relu = fuse_bn_relu,
fuse_bn_add_relu = fuse_bn_add_relu,
force_tensor_core = force_tensor_core,
use_dali = use_dali)
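For reference, a hedged sketch of how this (now removed) symbolic network was built; only keyword arguments from the get_symbol signature above are used:

sym = get_symbol(num_classes=1000, num_layers=50,
                 image_shape='3,224,224', dtype='float16')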


@ -14,77 +14,56 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import os, socket
from argparse import ArgumentParser
import warnings
import os
import argparse
from pathlib import Path
optparser = ArgumentParser(description="train resnet50 with MXNet")
optparser.add_argument("-n", "--n-GPUs", type=int, default=8, help="number of GPUs to use; " +\
"default = 8")
optparser.add_argument("-b", "--batch-size", type=int, default=208, help="batch size per GPU; " +\
"default = 208")
optparser.add_argument("-e", "--num-epochs", type=int, default=90, help="number of epochs; " +\
"default = 90")
optparser.add_argument("-l", "--lr", type=float, default=0.1, help="learning rate; default = 0.1; " +\
"IMPORTANT: true learning rate will be calculated as `lr * batch_size/256`")
optparser.add_argument("--no-val", action="store_true",
help="if set no validation will be performed")
optparser.add_argument("--no-dali", action="store_true", default=False,
help="use default MXNet pipeline instead of DALI")
optparser.add_argument("--data-root", type=str, help="Directory with RecordIO data files", default="/data/imagenet/train-val-recordio-passthrough")
optparser.add_argument("--data-nthreads", type=int, help="number of threads for data loading; default = 40", default=40)
optparser.add_argument("--dtype", type=str, help="Precision, float16 or float32", default="float16")
optparser = argparse.ArgumentParser(description='Train classification models on ImageNet',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
optparser.add_argument('-n', '--ngpus', type=int, default=1, help='number of GPUs to use')
optparser.add_argument('-b', '--batch-size', type=int, default=192, help='batch size per GPU')
optparser.add_argument('-e', '--num-epochs', type=int, default=90, help='number of epochs')
optparser.add_argument('-l', '--lr', type=float, default=0.256, help='learning rate; '
'IMPORTANT: true learning rate will be calculated as `lr * batch_size / 256`')
optparser.add_argument('--data-root', type=Path, help='Directory with RecordIO data files', default=Path('/data/imagenet/train-val-recordio-passthrough'))
optparser.add_argument('--dtype', help='Precision', default='float16', choices=('float32', 'float16'))
optparser.add_argument('--kv-store', default='horovod', choices=('device', 'horovod'), help='key-value store type')
optparser.add_argument('--data-backend', default='dali-gpu', choices=('dali-gpu', 'dali-cpu', 'mxnet', 'synthetic'), help='data backend')
opts, args = optparser.parse_known_args()
if opts.dtype == "float16":
n_ch = str(4 - int(opts.no_dali))
if opts.dtype == 'float16':
n_ch = str(4 - int(opts.data_backend == 'mxnet'))
else:
n_ch = str(3)
opts.batch_size *= opts.ngpus
opts.lr *= opts.batch_size / 256
command = ""
command += "python "+os.path.dirname(__file__)+"/train.py"
command += " --num-layers 50"
command += " --data-train " + opts.data_root + "/train.rec"
command += " --data-train-idx " + opts.data_root + "/train.idx"
if not opts.no_val:
command += " --data-val " + opts.data_root + "/val.rec"
command += " --data-val-idx " + opts.data_root + "/val.idx"
command += " --data-nthreads " + str(opts.data_nthreads)
command += " --optimizer sgd --dtype " + opts.dtype
command += " --lr-step-epochs 30,60,80 --max-random-area 1"
command += " --min-random-area 0.05 --max-random-scale 1"
command += " --min-random-scale 1 --min-random-aspect-ratio 0.75"
command += " --max-random-aspect-ratio 1.33 --max-random-shear-ratio 0"
command += " --max-random-rotate-angle 0 --random-resized-crop 1"
command += " --random-crop 0 --random-mirror 1"
command += " --image-shape "+n_ch+",224,224 --warmup-epochs 5"
command += " --disp-batches 20"
command += " --batchnorm-mom 0.9 --batchnorm-eps 1e-5"
command = []
if 'horovod' in opts.kv_store:
command += ['horovodrun', '-np', str(opts.ngpus)]
command += ['python', str(Path(__file__).parent / "train.py")]
command += ['--data-train', str(opts.data_root / "train.rec")]
command += ['--data-train-idx', str(opts.data_root / "train.idx")]
command += ['--data-val', str(opts.data_root / "val.rec")]
command += ['--data-val-idx', str(opts.data_root / "val.idx")]
command += ['--dtype', opts.dtype]
command += ['--image-shape', n_ch + ',224,224']
if opts.dtype == 'float16':
command += " --fuse-bn-relu 1"
command += " --input-layout NHWC --conv-layout NHWC"
command += " --batchnorm-layout NHWC --pooling-layout NHWC"
command += " --conv-algo 1 --force-tensor-core 1"
command += " --fuse-bn-add-relu 1"
command += '--fuse-bn-relu 1 --fuse-bn-add-relu 1'.split()
command += '--input-layout NCHW --conv-layout NHWC ' \
'--batchnorm-layout NHWC --pooling-layout NHWC'.split()
command += " --kv-store device"
if not opts.no_dali:
command += " --use-dali"
command += " --dali-prefetch-queue 2 --dali-nvjpeg-memory-padding 64"
command += " --lr "+str(opts.lr)
command += " --gpus " + str(list(range(opts.n_GPUs))).replace(' ', '').replace('[', '').replace(']', '')
command += " --batch-size " + str(opts.batch_size)
command += " --num-epochs " + str(opts.num_epochs)
command += ['--kv-store', opts.kv_store]
command += ['--data-backend', opts.data_backend]
command += ['--lr', str(opts.lr)]
command += ['--gpus', ','.join(list(map(str, range(opts.ngpus))))]
command += ['--batch-size', str(opts.batch_size)]
command += ['--num-epochs', str(opts.num_epochs)]
command += args
for arg in args:
command += " " + arg
os.environ['MXNET_UPDATE_ON_KVSTORE'] = "0"
os.environ['MXNET_EXEC_ENABLE_ADDTO'] = "1"
@ -92,5 +71,11 @@ os.environ['MXNET_USE_TENSORRT'] = "0"
os.environ['MXNET_GPU_WORKER_NTHREADS'] = "2"
os.environ['MXNET_GPU_COPY_NTHREADS'] = "1"
os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = "54"
os.environ['HOROVOD_CYCLE_TIME'] = "0.1"
os.environ['HOROVOD_FUSION_THRESHOLD'] = "67108864"
os.environ['HOROVOD_NUM_NCCL_STREAMS'] = "2"
os.environ['MXNET_HOROVOD_NUM_GROUPS'] = "16"
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD'] = "999"
os.environ['MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD'] = "25"
exit(os.system('/bin/bash -c "'+command+'"'))
os.execvp(command[0], command)
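A hedged invocation sketch for this launcher (the file name runner.py is an assumption; the flags are the ones it defines above, and it prepends horovodrun itself when --kv-store is horovod):

python runner.py -n 8 -b 192 -e 90 --dtype float16 --kv-store horovod --data-backend dali-gpu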


@ -1,3 +1,4 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@ -12,8 +13,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.
if [ $# -lt 2 ] ; then
echo "usage: $0 raw_dataset prepared_dataset"
exit 1
fi
# This script launches the ResNet50 inference benchmark in FP32 on 1 GPU with batch sizes 1,2,4,32,64,96
# Usage: ./INFER_BENCHMARK_FP32.sh <additional flags>
python benchmark.py -n 1 -b 1,2,4,32,64,96 --only-inference -e 3 -w 1 -i 100 -o report.json $@
cd "$2" &&
python /opt/mxnet/tools/im2rec.py --list --recursive train "$1/train" &&
python /opt/mxnet/tools/im2rec.py --list --recursive val "$1/val" &&
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 train "$1/train" &&
python /opt/mxnet/tools/im2rec.py --pass-through --num-thread 40 val "$1/val" &&
echo "Dataset was prepared succesfully!"


@ -34,58 +34,37 @@
# limitations under the License.
import os
import sys
import argparse
import logging
logging.basicConfig(level=logging.DEBUG)
import data, dali, fit
import mxnet as mx
import numpy as np
def set_imagenet_aug(aug):
# standard data augmentation setting for imagenet training
aug.set_defaults(rgb_mean='123.68,116.779,103.939', rgb_std='58.393,57.12,57.375')
aug.set_defaults(random_crop=0, random_resized_crop=1, random_mirror=1)
aug.set_defaults(min_random_area=0.08)
aug.set_defaults(max_random_aspect_ratio=4./3., min_random_aspect_ratio=3./4.)
aug.set_defaults(brightness=0.4, contrast=0.4, saturation=0.4, pca_noise=0.1)
import data, dali
import fit
import models
if __name__ == '__main__':
# parse args
parser = argparse.ArgumentParser(description="train resnet on imagenet",
def parse_args():
parser = argparse.ArgumentParser(description="Train classification models on ImageNet",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
models.add_model_args(parser)
fit.add_fit_args(parser)
data.add_data_args(parser)
dali.add_dali_args(parser)
data.add_data_aug_args(parser)
# Instead, to get standard resnet augmentation on a per-use basis, invoke as in:
# train_imagenet.py --set-resnet-aug ...
# Finally, to get the legacy MXNet v1.2 training settings on a per-use basis, invoke as in:
# train_imagenet.py --set-data-aug-level 3
parser.set_defaults(
# network
num_layers = 50,
return parser.parse_args()
# data
resize = 256,
num_classes = 1000,
num_examples = 1281167,
image_shape = '3,224,224',
min_random_scale = 1, # if input image has min size k, suggest to use
# 256.0/x, e.g. 0.533 for 480
# train
num_epochs = 90,
lr_step_epochs = '30,60,80',
dtype = 'float32'
)
args = parser.parse_args()
def setup_logging(args):
head = '{asctime}:{levelname}: {message}'
logging.basicConfig(level=logging.DEBUG, format=head, style='{',
handlers=[logging.StreamHandler(sys.stderr), logging.FileHandler(args.log)])
logging.info('Start with arguments {}'.format(args))
if not args.use_dali:
data.set_data_aug_level(parser, 0)
if __name__ == '__main__':
args = parse_args()
setup_logging(args)
# load network
import resnet as net
sym = net.get_symbol(**vars(args))
model = models.get_model(**vars(args))
data_loader = data.get_data_loader(args)
# train
fit.fit(args, sym, dali.get_rec_iter)
fit.fit(args, model, data_loader)


@ -1,6 +1,6 @@
/******************************************************************************
*
* Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.


@ -1,6 +1,6 @@
/******************************************************************************
*
* Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.


@ -1,6 +1,6 @@
/******************************************************************************
*
* Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import skimage


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import numpy as np


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import time
from argparse import ArgumentParser


@ -1,6 +1,6 @@
#!/usr/bin/env python
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.

PyTorch/Detection/SSD/src/coco.py Executable file → Normal file

@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
__author__ = 'tylin'
__version__ = '2.0'
# Interface for accessing the Microsoft COCO dataset.


@ -1,4 +1,4 @@
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import torch
@ -18,7 +32,7 @@ def get_train_loader(args, local_seed):
output_fp16=args.amp, output_nhwc=False,
pad_output=False, seed=local_seed)
train_pipe.build()
test_run = train_pipe.run()
test_run = train_pipe.schedule_run(), train_pipe.share_outputs(), train_pipe.release_outputs()
train_loader = DALICOCOIterator(train_pipe, 118287 / args.N_gpu)
return train_loader


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import time
import numpy as np


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
from torchvision.models.resnet import resnet18, resnet34, resnet50, resnet101, resnet152


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from torch.autograd import Variable
import torch
import time


@ -1,3 +1,17 @@
# Copyright (c) 2018-2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torchvision.transforms as transforms
import torch.utils.data as data