Updating models and adding BERT/PyT

Tacotron2+Waveglow/PyT
* AMP support
* Data preprocessing for Tacotron 2 training
* Fixed dropouts on LSTMCells

SSD/PyT
* script and notebook for inference
* AMP support
* README update
* updates to examples/*

BERT/PyT
* initial release

GNMT/PyT
* Default container updated to NGC PyTorch 19.05-py3
* Mixed precision training implemented using APEX AMP
* Added inference throughput and latency results on NVIDIA Tesla V100 16G
* Added option to run inference on user-provided raw input text from command line

NCF/PyT
* Updated performance tables.
* Default container changed to PyTorch 19.06-py3.
* Caching validation negatives between runs

Transformer/PyT
* new README
* JIT support added

UNet Medical/TF
* inference example scripts added
* inference benchmark measuring latency added
* TRT/TF-TRT support added
* README updated

GNMT/TF
* Performance improvements

Small updates (mostly README) for other models.
Przemek Strzelczyk 2019-07-16 21:13:08 +02:00
parent 3b3d0f6a55
commit a644350589
109 changed files with 6552 additions and 2524 deletions

View file

@ -40,7 +40,7 @@ def add_parser_arguments(parser):
parser.add_argument('data', metavar='DIR',
help='path to dataset')
parser.add_argument('--data-backend', metavar='BACKEND', default='pytorch',
parser.add_argument('--data-backend', metavar='BACKEND', default='dali-cpu',
choices=DATA_BACKEND_CHOICES)
parser.add_argument('--arch', '-a', metavar='ARCH', default='resnet50',

View file

@ -1,7 +1,9 @@
FROM nvcr.io/nvidia/pytorch:19.05-py3
# Set working directory
WORKDIR /mlperf
WORKDIR /workspace
ENV PYTHONPATH "${PYTHONPATH}:/workspace"
RUN apt-get update && apt-get install -y python3-tk python-pip git tmux htop tree

View file

@ -1,29 +1,45 @@
# SSD300 v1.1 For PyTorch
This repository provides a script and recipe to train the SSD300 v1.1 model to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Data preprocessing](#data-preprocessing)
* [Data augmentation](#data-augmentation)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g-1)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The SSD300 v1.1 model is based on the
[SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) paper, which
describes SSD as “a method for detecting objects in images using a single deep neural network”.
From the
[Speed/accuracy trade-offs for modern convolutional object detectors](https://arxiv.org/abs/1611.10012)
paper, the following enhancements were made to the backbone:
* The conv5_x, avgpool, fc and softmax layers were removed from the original classification model.
* All strides in conv4_x are set to 1x1.
This model is trained with mixed precision using Tensor Cores on NVIDIA
Volta and Turing GPUs. Therefore, researchers can get results 2x faster
than training without Tensor Cores, while experiencing the benefits of
mixed precision training. This model is tested against each NGC monthly
container release to ensure consistent accuracy and performance over time.
### Model architecture
Despite the changes described in the previous section,
the overall architecture, as described in the following diagram, has not changed.
<p align="center">
<img width="90%" src="./img/ssd_diagram.png" />
<br>
Figure 1. The architecture of a Single Shot MultiBox Detector model. Image has been taken from the <a href="https://arxiv.org/abs/1512.02325">Single Shot MultiBox Detector paper</a>.
</p>
The backbone is followed by 5 additional convolutional layers.
In addition to the convolutional layers, we attached 6 detection heads:
* The first detection head is attached to the last conv4_x layer.
* The other five detection heads are attached to the corresponding 5 additional layers.
Detector heads are similar to the ones referenced in the paper, however,
they are enhanced by additional BatchNorm layers after each convolution.
Additionally, we removed weight decay on every bias parameter and
all the BatchNorm layer parameters as described in the
[Highly Scalable Deep Learning Training System with Mixed-Precision:
Training ImageNet in Four Minutes](https://arxiv.org/abs/1807.11205) paper.
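The `main.py` diff later in this commit applies this rule through a `tencent_trick` helper when building the SGD optimizer. A sketch of what such a helper plausibly looks like (the parameter split below is inferred from the description above, not copied from the repository):
```python
def tencent_trick(model):
    """Skip weight decay for biases and BatchNorm parameters."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters are biases and BatchNorm weights/biases
        if len(param.shape) == 1 or name.endswith('.bias'):
            no_decay.append(param)
        else:
            decay.append(param)
    # the decay group inherits the optimizer's weight_decay (5e-4 here)
    return [{'params': no_decay, 'weight_decay': 0.0},
            {'params': decay}]
```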
Because of these enhancements, the SSD300 v1.1 model achieves higher accuracy.
Training of SSD requires computationally costly augmentations. To fully utilize GPUs during training, we use the [NVIDIA DALI](https://github.com/NVIDIA/DALI) library to accelerate the data preparation pipeline.
### Default configuration
We trained the model for 65 epochs with the following setup:
* SGD with momentum (0.9)
* Learning rate = 2.6e-3 * number of GPUs * (batch_size / 32)
* Learning rate decay multiplied by 0.1 before epochs 43 and 54
* We use linear warmup of the learning rate during the first epoch.
For more information, see the
[Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677) paper.
To enable warmup, provide the argument `--warmup 300`.
* Weight decay:
  * 0 for BatchNorms and biases
  * 5e-4 for other layers
**Note**: The learning rate is automatically scaled (in other words, multiplied
by the number of GPUs and multiplied by the batch size divided by 32).
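For example, with the default base learning rate, an 8-GPU run at a batch size of 64 per GPU resolves to:
```python
base_lr = 2.6e-3
n_gpus = 8
per_gpu_batch_size = 64
lr = base_lr * n_gpus * (per_gpu_batch_size / 32)  # 2.6e-3 * 8 * 2 = 0.0416
```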
### Feature support matrix
The following features are supported by this model.
| Feature | SSD300 v1.1 PyTorch |
|-----------------------|---------------------|
|Multi-GPU training with [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) | Yes |
|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes |
#### Features
Multi-GPU training with Distributed Data Parallel - Our model uses Apex's
DDP to implement efficient multi-GPU training with NCCL.
To enable multi-GPU training with DDP, you have to wrap your model
with a proper class, and change the way you launch training.
For details, see example sources in this repo or see
the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
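A minimal sketch of the wrapping step (assuming one process per GPU has already been launched, e.g. with `torch.distributed.launch`; the linear model is a stand-in for illustration):
```python
import torch
from apex.parallel import DistributedDataParallel as DDP

torch.distributed.init_process_group(backend='nccl')  # one process per GPU
model = torch.nn.Linear(10, 10).cuda()                # stand-in model
model = DDP(model)  # gradients are now all-reduced across GPUs via NCCL
```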
NVIDIA DALI - DALI is a library that accelerates data preparation pipelines.
To accelerate your input pipeline, you only need to define your data loader
with the DALI library.
For details, see example sources in this repo or see
the [DALI documentation](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html).
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in
a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740)
training offers significant computational speedup by performing operations
in half-precision format, while storing minimal information in single-precision
to retain as much information as possible in critical parts of the network.
Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores)
in the Volta and Turing architectures, significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup
on the most arithmetically intense model architectures. Using mixed precision
training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced
in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/)
in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740)
paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision
Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools
for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP)
library from [APEX](https://github.com/NVIDIA/apex) which casts variables
to half-precision upon retrieval, while storing variables in single-precision format.
Furthermore, to preserve small gradient magnitudes in backpropagation,
a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling
can be easily applied by using the `scale_loss()` method provided by AMP.
The scaling value to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler)
or fixed.
For an in-depth walkthrough on AMP, check out sample usage
[here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started).
[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
utility libraries, such as AMP, which require minimal network code changes
to leverage Tensor Cores performance.
To enable mixed precision, you can:
- Import AMP from APEX:
```
from apex import amp
```
- Initialize an AMP handle:
```
amp_handle = amp.init(enabled=True, verbose=True)
```
- Wrap your optimizer with the AMP handle:
```
optimizer = amp_handle.wrap_optimizer(optimizer)
```
- Scale loss before backpropagation (assuming loss is stored in a variable called `losses`)
- Default backpropagate for FP32:
```
losses.backward()
```
- Scale loss and backpropagate with AMP:
```
with optimizer.scale_loss(losses) as scaled_losses:
    scaled_losses.backward()
```
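The snippets above use the older `amp_handle` interface. The training code updated by this commit (see the `main.py` and `train.py` diffs below) uses the newer `amp.initialize`/`amp.scale_loss` API; a minimal end-to-end sketch of that flow, with a stand-in model, looks like this:
```python
import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()  # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# O2 runs the model in FP16 while keeping FP32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

loss = model(torch.randn(4, 10).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # gradients flow through the scaled loss
optimizer.step()
```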
### Glossary
backbone
: a part of many object detection architectures, usually pre-trained for a different,
simpler task, like classification.
input pipeline
: a set of operations performed for every item of input data before feeding the neural
network. Especially for the object detection task, the input pipeline can be complex
and computationally significant. For that reason, solutions like NVIDIA DALI emerged.
object detection
: a class of Computer Vision problems. The task of object detection is to localize
possibly multiple objects in an image and classify them. The difference between
object detection, image classification, and localization is clearly explained in the
video published as part of the [C4W3L01 course](https://www.youtube.com/watch?v=GSwYGkTfOKk).
SSD (Single Shot MultiBox Detector)
: a name for the detection model described in a [paper authored by Liu et al.](https://arxiv.org/abs/1512.02325)
ResNet (ResNet-50)
: a name for the classification model described in a [paper authored by He et al.](https://arxiv.org/abs/1512.03385)
In this repo, it is used as a backbone for SSD.
## Setup
The following section lists the requirements in order to start training the SSD300 v1.1 model.
### Requirements
This repository contains a `Dockerfile` which extends the PyTorch 19.06 NGC container
and encapsulates some dependencies. Aside from these dependencies,
ensure you have the following software:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.06-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
Documentation:
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
For those unable to use the [PyTorch 19.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
to set up the required environment or create your own container,
see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32,
perform the following steps using the default parameters of the SSD v1.1 model
on the [COCO 2017](http://cocodataset.org/#download) dataset.
For the specifics concerning training and inference,
see the [Advanced](#advanced) section.
1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Detection/SSD
```
2. Download and preprocess the dataset.
Extract the COCO 2017 dataset with `download_dataset.sh $COCO_DIR`.
Data will be downloaded to the `$COCO_DIR` directory (on the host).
3. Build the SSD300 v1.1 PyTorch NGC container.
```
docker build . -t nvidia_ssd
```
4. Start an interactive session in the NGC container to run training/inference.
```
nvidia-docker run --rm -it --ulimit memlock=-1 --ulimit stack=67108864 -v $COCO_DIR:/coco --ipc=host nvidia_ssd
```
**Note**: the default mount point in the container is `/coco`.
5. Start training.
The `./examples` directory provides several sample scripts for various GPU settings,
which act as wrappers around the `main.py` script.
The example scripts need two arguments:
- A path to the root SSD directory.
- A path to the COCO 2017 dataset.
Remaining arguments are passed to the `main.py` script.
The `--save` flag saves the model after each epoch.
The checkpoints are stored as `./models/epoch_*.pt`.
Use `python main.py -h` to obtain the list of available options in the `main.py` script.
For example, if you want to run 8 GPU training with Tensor Core acceleration and
save checkpoints after each epoch, run:
```
bash ./examples/SSD300_FP16_8GPU.sh . /coco --save
```
6. Start validation/evaluation.
The `main.py` training script automatically runs validation during training.
The results from the validation are printed to `stdout`.
Pycocotools' open-sourced scripts provide a consistent way
to evaluate models on the COCO dataset. We are using these scripts
during validation to measure a model's performance in the AP metric.
Metrics below are evaluated using the pycocotools methodology, in the following format:
```
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.250
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.423
```
The metric reported in our results is present in the first row.
To evaluate a checkpointed model saved in the previous point, run:
```
python ./main.py --backbone resnet50 --mode evaluation --checkpoint ./models/epoch_*.pt --data /coco
```
7. Optionally, resume training from a checkpointed model.
```
python ./main.py --backbone resnet50 --checkpoint ./models/epoch_*.pt --data /coco
```
8. Start inference/predictions.
You can check your trained model with a Jupyter notebook provided in the examples directory.
Start by running a Docker container with a Jupyter notebook server:
```
nvidia-docker run --rm -it --ulimit memlock=-1 --ulimit stack=67108864 -v $SSD_CHECKPOINT_PATH:/checkpoints/SSD300v1.1.pt -v $COCO_PATH:/datasets/coco2017 --ipc=host -p 8888:8888 nvidia_ssd jupyter-notebook --ip 0.0.0.0 --allow-root
```
The container prints Jupyter notebook logs like this:
```
[I 16:17:58.935 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 16:17:59.769 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 16:17:59.769 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 16:17:59.770 NotebookApp] Serving notebooks from local directory: /workspace
[I 16:17:59.770 NotebookApp] The Jupyter Notebook is running at:
[I 16:17:59.770 NotebookApp] http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
[I 16:17:59.770 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 16:17:59.774 NotebookApp] No web browser found: could not locate runnable browser.
[C 16:17:59.774 NotebookApp]
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
Or copy and paste one of these URLs:
http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
```
Use the token printed in the last line to start your notebook session.
The notebook is in `examples/inference.ipynb`, for example:
http://127.0.0.1:8888/notebooks/examples/inference.ipynb?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
## Advanced
The following sections provide greater details of the dataset,
running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
- `main.py`: the script that controls the logic of training and validation of the SSD300 v1.1 model;
- `Dockerfile`: instructions for Docker to build a container with the basic set of dependencies to run SSD300 v1.1;
- `requirements.txt`: a set of extra Python requirements for running SSD300 v1.1;
- `download_dataset.sh`: automatically downloads the COCO dataset for training.
The `src/` directory contains modules used to train and evaluate the SSD300 v1.1 model:
- `model.py`: the definition of SSD300 v1.1 model
- `data.py`: definition of input pipelines used in training and evaluation
- `train.py`: functions used to train the SSD300 v1.1 model
- `evaluate.py`: functions used to evaluate the SSD300 v1.1 model
- `coco_pipeline.py`: definition of input pipeline using NVIDIA DALI
- `coco.py`: code specific for the COCO dataset
- `logger.py`: utilities for logging
- `utils.py`: extra utility functions
The `examples/` directory contains scripts wrapping common scenarios.
### Parameters
#### The script `main.py`
The script for training and evaluating the SSD300 v1.1 model has a variety
of parameters that control these processes.
##### Common parameters
`--data`
: use it to specify where your dataset is. By default, the script will look for it
under the `/coco` directory.
`--checkpoint`
: allows you to specify the path to the pre-trained model.
`--save`
: when the flag is turned on, the script will save the trained model to disk.
`--seed`
: use it to specify the seed for RNGs.
`--amp`
: when the flag is turned on, the AMP features will be enabled.
##### Training related
`--epochs`
: the number of times the model will see every example from the training dataset.
`--evaluation`
: after this parameter, list the epochs after which evaluation should
be performed.
`--learning-rate`
: initial learning rate.
`--multistep`
: after this parameter, list the epochs after which learning rate should be decayed.
`--warmup`
: allows you to specify the number of iterations for which a linear learning-rate
warmup will be performed.
`--momentum`
: momentum argument for SGD optimizer.
`--weight-decay`
: weight decay argument for SGD optimizer.
`--batch-size`
: the number of inputs processed at once in each iteration.
`--backbone-path`
: the path to the checkpointed backbone. When it is not provided, a pre-trained model from torchvision
will be downloaded.
##### Evaluation related
`--eval-batch-size`
: the number of inputs processed at once in each evaluation iteration.
##### Utility parameters
`--help`
: displays a short description of all parameters accepted by the script.
### Command-line options
All these parameters can be controlled by passing command-line arguments
to the `main.py` script. To get a complete list of all command-line arguments
with descriptions and default values you can run:
```
python main.py --help
```
### Getting the data
The SSD model was trained on the COCO 2017 dataset. The [val2017](http://cocodataset.org/#download) validation set
was used as a validation dataset. PyTorch can work directly on JPEGs,
therefore, preprocessing/augmentation is not needed.
This repository contains the `download_dataset.sh` download script which will automatically
download and preprocess the training, validation and test datasets. By default,
data will be downloaded to the `/coco` directory.
#### Dataset guidelines
Our model expects the input data to be organized the way the `download_dataset.sh` script lays out the COCO dataset.
`train2017` and `val2017` directories should contain images in JPEG format.
Annotation format is described in [the COCO documentation](http://cocodataset.org/#format-data).
The preprocessing of the data is defined in the `src/coco_pipeline.py` module.
##### Data preprocessing
Before we feed data to the model, both during training and inference, we perform:
* JPEG decoding
* normalization with a mean = `[0.485, 0.456, 0.406]` and std dev = `[0.229, 0.224, 0.225]`
* encoding bounding boxes
* resizing to 300x300
Additionally, during training, data is:
* randomly shuffled
* filtered to skip samples without annotations
A sketch of the core preprocessing steps appears after this list.
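Assuming torchvision as a stand-in (the repository itself implements the pipeline with DALI in `src/coco_pipeline.py`; bounding-box encoding is omitted, and the file name is a placeholder), the resize-and-normalize steps look roughly like:
```python
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((300, 300)),
    T.ToTensor(),  # JPEG decoding happens in Image.open below
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
tensor = preprocess(Image.open('example.jpg').convert('RGB'))
```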
##### Data augmentation
During training we perform the following augmentation techniques:
* Random crop using the algorithm described in the [SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) paper
* Random horizontal flip
* Color jitter
### Training process
Training the SSD model is implemented in the `main.py` script.
By default, training runs for 65 epochs. Because evaluation is relatively time consuming,
it does not run after every epoch. With default settings, evaluation is executed after epochs:
21, 31, 37, 42, 48, 53, 59, 64. The model is evaluated using pycocotools distributed with
the COCO dataset.
Which epochs should be evaluated can be reconfigured with the `--evaluation` argument.
To run training with Tensor Cores, use the `--amp` flag when running the `main.py` script.
The `--save` flag enables storing checkpoints after each epoch under `./models/epoch_*.pt`.
### Inference process
Our scripts for SSD300 v1.1 present two ways to run inference.
To get meaningful results, you need a pre-trained model checkpoint.
One way is to run an interactive session in a Jupyter notebook, as described in the [Quick Start Guide](#8-start-inferencepredictions).
Another way is to run the `src/SSD300_inference.py` script. It contains the logic from the notebook, wrapped into a Python script, and includes sample usage.
To use the inference example script in your own code, you can call the `main` function, providing input image URIs as an argument. The result will be a list of detections for each input image.
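For instance, a hypothetical call from your own code could look like the following (the checkpoint path and image URL are placeholders; `main` returns one `[bboxes, classes, confidences]` triple per input image, as the script's own `__main__` block shows):
```python
from src.SSD300_inference import main

detections = main(
    checkpoint_path='/checkpoints/SSD300v1.1.pt',
    imgs=['http://images.cocodataset.org/val2017/000000397133.jpg'],
)
bboxes, classes, confidences = detections[0]
```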
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
The training benchmark was run in various scenarios on a V100 16G GPU. For each scenario, the batch size was set to 32. The benchmark does not require a checkpoint from a fully trained model.
To benchmark training, run:
```
python -m torch.distributed.launch --nproc_per_node={NGPU} \
--mode benchmark-training \
--benchmark-warmup 100 \
--benchmark-iterations 200 \
{AMP} \
--data {data}
```
Where the `{NGPU}` selects number of GPUs used in benchmark, the `{bs}` is the desired
batch size, the `{AMP}` is set to `--amp` if you want to benchmark training with
Tensor Cores, and the `{data}` is the location of the COCO 2017 dataset.
`--benchmark-warmup` is specified to omit the first iteration of the first epoch.
`--benchmark-iterations` is the number of iterations used to measure performance.
#### Inference performance benchmark
The inference benchmark was run on a single V100 16G GPU. To benchmark inference, run:
```
python main.py --eval-batch-size {bs} \
--mode benchmark-inference \
--benchmark-warmup 100 \
--benchmark-iterations 200 \
{AMP} \
--data {data}
```
Where the `{bs}` is the desired batch size, the `{AMP}` is set to `--amp` if you want to benchmark inference with Tensor Cores, and the `{data}` is the location of the COCO 2017 dataset.
`--benchmark-warmup` is specified to omit the first iterations of the first epoch. `--benchmark-iterations` is the number of iterations used to measure performance.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./examples/SSD300_FP{16,32}_{1,4,8}GPU.sh`
script in the `pytorch-19.06-py3` NGC container on NVIDIA DGX-1 with 8x
V100 16G GPUs. The batch size was set to best utilize GPU memory: 32 for
FP32 precision, 64 for mixed precision.
| **Number of GPUs** | **Mixed precision mAP** | **Training time with mixed precision** | **FP32 mAP** | **Training time with FP32** |
|:------------------:|:------------------------:|:-------------------------------------:|:------------:|:---------------------------:|
Here are example graphs of FP32 and FP16 training on an 8 GPU configuration:
![ValidationAccuracy](./img/validation_accuracy.png)
#### Training performance results
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `main.py` script with the `--mode
benchmark-training` flag in the `pytorch-19.06-py3` NGC container on NVIDIA
DGX-1 with 8x V100 16G GPUs. Performance numbers (in items/images per second)
were averaged over an entire training epoch.
| **Number of GPUs** | **Batch size per GPU** | **Mixed precision img/s (median)** | **FP32 img/s (median)** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with mixed precision** | **Multi-gpu weak scaling with FP32** |
|:------------------:|:----------------------:|:----------------------------------:|:-----------------------:|:---------------------------------:|:-----------------------------------------------:|:------------------------------------:|
| 4 | 32 | 838.457 | 397.797 | 2.11 | 3.86 | 3.88 |
| 8 | 32 | 1639.843 | 789.695 | 2.08 | 7.56 | 7.70 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
#### Inference performance results
##### NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `main.py` script with the `--mode
benchmark-inference` flag in the `pytorch-19.06-py3` NGC container on NVIDIA
DGX-1 with 1x V100 16G GPU.
| **Batch size** | **Mixed precision img/s (median)** | **FP32 img/s (median)** |
|:--------------:|:----------------------------------:|:-----------------------:|
| 16 | 470.10 | 280.57 |
| 32 | 520.54 | 302.43 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
July 2019
* Script and notebook for inference
* Use AMP instead of hand-crafted FP16 support
* README update
* Introduced a parameter with a path to the custom backbone checkpoint
* Minor enhancements of `examples/*` scripts
* Alignment to changes in PyTorch 19.06
May 2019
* Test scripts updated
March 2019
* Initial release
### Known issues
There are no known issues with this model.

View file

@ -0,0 +1,44 @@
import numpy as np
import skimage.io
import skimage.transform


def load_image(image_path):
    """Code from Loading_Pretrained_Models.ipynb - a Caffe2 tutorial"""
    img = skimage.img_as_float(skimage.io.imread(image_path))
    if len(img.shape) == 2:
        # grayscale image: replicate the single channel to get HWC
        img = np.array([img, img, img]).swapaxes(0, 2)
    return img


def rescale(img, input_height, input_width):
    """Code from Loading_Pretrained_Models.ipynb - a Caffe2 tutorial"""
    aspect = img.shape[1] / float(img.shape[0])
    if aspect > 1:
        # landscape orientation - wide image
        res = int(aspect * input_height)
        img_scaled = skimage.transform.resize(img, (input_width, res))
    elif aspect < 1:
        # portrait orientation - tall image
        res = int(input_width / aspect)
        img_scaled = skimage.transform.resize(img, (res, input_height))
    else:
        img_scaled = skimage.transform.resize(img, (input_width, input_height))
    return img_scaled


def crop_center(img, cropx, cropy):
    """Code from Loading_Pretrained_Models.ipynb - a Caffe2 tutorial"""
    y, x, c = img.shape
    startx = x // 2 - (cropx // 2)
    starty = y // 2 - (cropy // 2)
    return img[starty:starty + cropy, startx:startx + cropx]


def normalize(img, mean=128, std=128):
    # img_as_float gives values in [0, 1]; map them to roughly [-1, 1]
    img = (img * 256 - mean) / std
    return img


def prepare_input(img_uri):
    img = load_image(img_uri)
    img = rescale(img, 300, 300)
    img = crop_center(img, 300, 300)
    img = normalize(img)
    return img
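A quick usage sketch for `prepare_input` (the URL is a placeholder; `skimage.io.imread` accepts both local paths and URLs):
```python
img = prepare_input('http://images.cocodataset.org/val2017/000000397133.jpg')
print(img.shape)  # (300, 300, 3); values roughly in [-1, 1] after normalize()
```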

View file

@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 1 GPUs using 64 batch size
# Usage bash SSD300_FP16_1GPU.sh <path to this repository> <path to dataset> <additional flags>
python $1/main.py --backbone resnet50 --warmup 300 --bs 64 --fp16 --data $2 ${@:3}
python $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}

View file

@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 4 GPUs using 256 batch size (64 per GPU)
# Usage ./SSD300_FP16_4GPU.sh <path to this repository> <path to dataset> <additional flags>
python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --fp16 --data $2 ${@:3}
python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}

View file

@ -1,4 +1,4 @@
# This script launches SSD300 training in FP16 on 8 GPUs using 512 batch size (64 per GPU)
# Usage ./SSD300_FP16_8GPU.sh <path to this repository> <path to dataset> <additional flags>
python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --fp16 --data $2 ${@:3}
python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --warmup 300 --bs 64 --amp --data $2 ${@:3}

View file

@ -1,4 +1,4 @@
# This script evaluates SSD300 model in FP16 using 32 batch size on 1 GPU
# Usage: ./SSD300_FP16_EVAL.sh <path to this repository> <path to dataset> <path to checkpoint> <additional flags>
python $1/main.py --backbone resnet50 --fp16 --ebs 32 --data $2 --mode evaluation --checkpoint $3 ${@:4}
python $1/main.py --backbone resnet50 --amp --ebs 32 --data $2 --mode evaluation --checkpoint $3 ${@:4}

View file

@ -1,4 +1,4 @@
# This script launches SSD300 inference benchmark in FP16 on 1 GPU with 64 batch size
# Usage bash SSD300_FP16_INFERENCE_BENCHMARK.sh <path to this repository> <path to dataset> <additional flags>
python $1/main.py --backbone resnet50 --mode benchmark-inference --bs 64 --fp16 --data $2 ${@:3}
python $1/main.py --backbone resnet50 --mode benchmark-inference --bs 64 --amp --data $2 ${@:3}

View file

@ -0,0 +1,82 @@
import torch
import numpy as np
from apex.fp16_utils import network_to_half
from dle.inference import prepare_input
from src.model import SSD300, ResNet
from src.utils import dboxes300_coco, Encoder


def load_checkpoint(model, model_file):
    cp = torch.load(model_file)['model']
    # strip the 'module.1.' prefix left by the training-time model wrapping
    cp = {k.replace('module.1.', ''): cp[k] for k in cp}
    model.load_state_dict(cp)


def build_predictor(model_file, backbone='resnet50'):
    ssd300 = SSD300(backbone=ResNet(backbone))
    load_checkpoint(ssd300, model_file)
    return ssd300


def prepare_model(checkpoint_path):
    ssd300 = build_predictor(checkpoint_path)
    ssd300 = ssd300.cuda()
    ssd300 = network_to_half(ssd300)
    ssd300 = ssd300.eval()
    return ssd300


def prepare_tensor(inputs):
    NHWC = np.array(inputs)
    NCHW = np.swapaxes(np.swapaxes(NHWC, 2, 3), 1, 2)
    tensor = torch.from_numpy(NCHW)
    tensor = tensor.cuda()
    tensor = tensor.half()
    return tensor


def decode_results(predictions):
    dboxes = dboxes300_coco()
    encoder = Encoder(dboxes)
    ploc, plabel = [val.float() for val in predictions]
    results = encoder.decode_batch(ploc, plabel, criteria=0.5, max_output=20)
    return [[pred.detach().cpu().numpy() for pred in detections]
            for detections in results]


def pick_best(detections, threshold):
    bboxes, classes, confidences = detections
    best = np.argwhere(confidences > threshold).squeeze(axis=1)
    return [pred[best] for pred in detections]


def main(checkpoint_path, imgs):
    inputs = [prepare_input(uri) for uri in imgs]
    tensor = prepare_tensor(inputs)
    ssd300 = prepare_model(checkpoint_path)
    predictions = ssd300(tensor)
    results = decode_results(predictions)
    best_results = [pick_best(detections, threshold=0.3)
                    for detections in results]
    return best_results


if __name__ == '__main__':
    best_results = main(
        checkpoint_path='/checkpoints/SSD300v1.1.pt',
        imgs=['http://images.cocodataset.org/val2017/000000397133.jpg',
              'http://images.cocodataset.org/val2017/000000037777.jpg',
              'http://images.cocodataset.org/val2017/000000252219.jpg',
              ]
    )
    print(best_results)

File diff suppressed because one or more lines are too long

View file

@ -0,0 +1 @@
PYTHONPATH=$PYTHONPATH:/mlperf/ jupyter-notebook --ip 0.0.0.0 --no-browser --allow-root

Binary file not shown (added image, 52 KiB).

View file

@ -34,7 +34,7 @@ def generate_mean_std(args):
mean = mean.view(*view)
std = std.view(*view)
if args.fp16:
if args.amp:
mean = mean.half()
std = std.half()
@ -90,7 +90,6 @@ def make_parser():
' When it is not provided, pretrained model from torchvision'
' will be downloaded.')
parser.add_argument('--num-workers', type=int, default=4)
parser.add_argument('--fp16', action='store_true')
parser.add_argument('--amp', action='store_true')
# Distributed
@ -102,8 +101,6 @@ def make_parser():
def train(train_loop_func, logger, args):
if args.amp:
amp_handle = amp.init(enabled=args.fp16)
# Check that GPUs are actually available
use_cuda = not args.no_cuda
@ -149,20 +146,15 @@ def train(train_loop_func, logger, args):
ssd300.cuda()
loss_func.cuda()
if args.fp16 and not args.amp:
ssd300 = network_to_half(ssd300)
optimizer = torch.optim.SGD(tencent_trick(ssd300), lr=args.learning_rate,
momentum=args.momentum, weight_decay=args.weight_decay)
scheduler = MultiStepLR(optimizer=optimizer, milestones=args.multistep, gamma=0.1)
if args.amp:
ssd300, optimizer = amp.initialize(ssd300, optimizer, opt_level='O2')
if args.distributed:
ssd300 = DDP(ssd300)
optimizer = torch.optim.SGD(tencent_trick(ssd300), lr=args.learning_rate,
momentum=args.momentum, weight_decay=args.weight_decay)
scheduler = MultiStepLR(optimizer=optimizer, milestones=args.multistep, gamma=0.1)
if args.fp16:
if args.amp:
optimizer = amp_handle.wrap_optimizer(optimizer)
else:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=128.)
if args.checkpoint is not None:
if os.path.isfile(args.checkpoint):
load_checkpoint(ssd300, args.checkpoint)

View file

@ -1 +1,2 @@
Cython==0.28.4
scikit-image==0.15.0

View file

@ -15,7 +15,7 @@ def get_train_loader(args, local_seed):
train_pipe = COCOPipeline(args.batch_size, args.local_rank, train_coco_root,
train_annotate, args.N_gpu, num_threads=args.num_workers,
output_fp16=args.fp16, output_nhwc=False,
output_fp16=args.amp, output_nhwc=False,
pad_output=False, seed=local_seed)
train_pipe.build()
test_run = train_pipe.run()

View file

@ -24,7 +24,7 @@ def evaluate(model, coco, cocoGt, encoder, inv_map, args):
print("Parsing batch: {}/{}".format(nbatch, len(coco)), end='\r')
with torch.no_grad():
inp = img.cuda()
if args.fp16:
if args.amp:
inp = inp.half()
# Get predictions

View file

@ -3,6 +3,8 @@ import torch
import time
from SSD import _C as C
from apex import amp
def train_loop(model, loss_func, epoch, optim, train_dataloader, val_dataloader, encoder, iteration, logger, args, mean, std):
# for nbatch, (img, _, img_size, bbox, label) in enumerate(train_dataloader):
for nbatch, data in enumerate(train_dataloader):
@ -46,12 +48,9 @@ def train_loop(model, loss_func, epoch, optim, train_dataloader, val_dataloader,
if args.local_rank == 0:
logger.update_iter(epoch, iteration, loss.item())
if args.fp16:
if args.amp:
with optim.scale_loss(loss) as scale_loss:
scale_loss.backward()
else:
optim.backward(loss)
if args.amp:
with amp.scale_loss(loss, optim) as scale_loss:
scale_loss.backward()
else:
loss.backward()
@ -118,12 +117,9 @@ def benchmark_train_loop(model, loss_func, epoch, optim, train_dataloader, val_d
# loss scaling
if args.fp16:
if args.amp:
with optim.scale_loss(loss) as scale_loss:
scale_loss.backward()
else:
optim.backward(loss)
if args.amp:
with amp.scale_loss(loss, optim) as scale_loss:
scale_loss.backward()
else:
loss.backward()
@ -170,7 +166,7 @@ def benchmark_inference_loop(model, loss_func, epoch, optim, train_dataloader, v
img = data[0]
if not args.no_cuda:
img = img.cuda()
if args.fp16:
if args.amp:
img = img.half()
img.sub_(mean).div_(std)
img = Variable(img, requires_grad=False)
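The backward-pass pattern this diff converges on, extracted into a small helper for clarity (a sketch; `use_amp` corresponds to `args.amp`):
```python
from apex import amp

def backward_step(loss, optimizer, use_amp):
    if use_amp:
        # scale the loss so small FP16 gradients do not underflow
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
    else:
        loss.backward()
```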

View file

@ -194,6 +194,9 @@ class Encoder(object):
scores_out.append(score[candidates])
labels_out.extend([i]*len(candidates))
if not bboxes_out:
return [torch.tensor([]) for _ in range(3)]
bboxes_out, labels_out, scores_out = torch.cat(bboxes_out, dim=0), \
torch.tensor(labels_out, dtype=torch.long), \
torch.cat(scores_out, dim=0)
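For context, the guard added above protects the `torch.cat` calls that follow it: when no candidate boxes survive filtering, `bboxes_out` is an empty list and `torch.cat` raises. A minimal reproduction:
```python
import torch

try:
    torch.cat([], dim=0)
except RuntimeError as err:
    print(err)  # torch.cat raises on an empty list of tensors (message varies by version)
```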

View file

@ -1,4 +1,4 @@
ARG FROM_IMAGE_NAME=gitlab-master.nvidia.com:5005/dl/dgx/pytorch:19.05-py3-devel
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && apt-get install -y pbzip2 pv bzip2 cabextract

View file

@ -0,0 +1,5 @@
BERT PyTorch
This repository includes software from https://github.com/huggingface/pytorch-pretrained-BERT
licensed under the Apache License 2.0.

File diff suppressed because it is too large

View file

@ -1,7 +1,7 @@
#! /bin/bash
WIKI_DUMP="ftp://ftpmirror.your.org/pub/wikimedia/dumps/enwiki/20190301/enwiki-20190301-pages-articles-multistream.xml.bz2"
N_PROCS_PREPROCESS=4 # Adjust this based on memory requirements and available number of cores
WIKI_DUMP="https://dumps.wikimedia.org/enwiki/20190320/enwiki-20190320-pages-articles-multistream.xml.bz2"
N_PROCS_PREPROCESS=$(nproc) # Adjust this based on memory requirements and available number of cores
# Download Wikipedia dump file
mkdir -p ./download

Binary file not shown (added image, 56 KiB).

View file

@ -170,7 +170,7 @@ def main():
type=float, default=0.0,
help='Loss scaling, positive power of 2 values can improve fp16 convergence.')
parser.add_argument('--log_freq',
type=float, default=10.0,
type=float, default=50.0,
help='frequency of logging loss.')
parser.add_argument('--checkpoint_activations',
default=False,
@ -333,12 +333,12 @@ def main():
train_data = pretraining_dataset(input_file=data_file, max_pred_length=args.max_predictions_per_seq)
if args.local_rank == -1:
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size * n_gpu, num_workers=4, pin_memory=True)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size * n_gpu, num_workers=4, pin_memory=True)
else:
train_sampler = DistributedSampler(train_data)
train_sampler = DistributedSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size, num_workers=4, pin_memory=True)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size, num_workers=4, pin_memory=True)
for step, batch in enumerate(tqdm(train_dataloader, desc="File Iteration")):
training_steps += 1
@ -378,8 +378,9 @@ def main():
loss.item(), optimizer.param_groups[0]['lr']))
average_loss = 0
if global_step >= args.max_steps or training_steps % (
args.num_steps_per_checkpoint * args.gradient_accumulation_steps) == 0:
if global_step >= args.max_steps or training_steps == 1 * args.gradient_accumulation_steps or training_steps % (args.num_steps_per_checkpoint * args.gradient_accumulation_steps) == 0:
if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
# Save a trained model
logger.info("** ** * Saving fine - tuned model ** ** * ")

View file

@ -936,40 +936,40 @@ def main():
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
if args.do_train:
if args.fp16:
try:
# from fused_adam_local import FusedAdamBert as FusedAdam
from apex.optimizers import FusedAdam
from apex.optimizers import FP16_Optimizer
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
# import ipdb; ipdb.set_trace()
optimizer = FusedAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
bias_correction=False,
max_grad_norm=1.0)
if args.fp16:
try:
# from fused_adam_local import FusedAdamBert as FusedAdam
from apex.optimizers import FusedAdam
from apex.optimizers import FP16_Optimizer
except ImportError:
raise ImportError(
"Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.")
# import ipdb; ipdb.set_trace()
optimizer = FusedAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
bias_correction=False,
max_grad_norm=1.0)
if args.loss_scale == 0:
if args.old:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
if args.loss_scale == 0:
if args.old:
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False,
loss_scale="dynamic")
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False,
loss_scale="dynamic")
if args.old:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False, loss_scale=args.loss_scale)
if not args.old and args.do_train:
scheduler = LinearWarmUpScheduler(optimizer, warmup=args.warmup_proportion, total_steps=num_train_optimization_steps)
else:
if args.old:
optimizer = FP16_Optimizer(optimizer, static_loss_scale=args.loss_scale)
else:
model, optimizer = amp.initialize(model, optimizer, opt_level="O2", keep_batchnorm_fp32=False, loss_scale=args.loss_scale)
if not args.old and args.do_train:
scheduler = LinearWarmUpScheduler(optimizer, warmup=args.warmup_proportion, total_steps=num_train_optimization_steps)
else:
optimizer = BertAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
optimizer = BertAdam(optimizer_grouped_parameters,
lr=args.learning_rate,
warmup=args.warmup_proportion,
t_total=num_train_optimization_steps)
#print(model)
if args.local_rank != -1:
@ -1086,6 +1086,10 @@ def main():
if args.do_predict and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
if not args.do_train and args.fp16:
model.half()
eval_examples = read_squad_examples(
input_file=args.predict_file, is_training=False, version_2_with_negative=args.version_2_with_negative)
eval_features = convert_examples_to_features(
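Setting aside the re-indentation (the block above mostly moves under `if args.do_train:`), the optimizer-setup logic in this diff reduces to roughly the following sketch (argument names follow the script's flags; treat it as a reading aid, not the exact code):
```python
from apex import amp
from apex.optimizers import FusedAdam, FP16_Optimizer

def build_fp16_optimizer(model, grouped_params, args):
    optimizer = FusedAdam(grouped_params, lr=args.learning_rate,
                          bias_correction=False, max_grad_norm=1.0)
    if args.old:
        # legacy path: wrap with FP16_Optimizer (dynamic or static scaling)
        if args.loss_scale == 0:
            optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)
        else:
            optimizer = FP16_Optimizer(optimizer,
                                       static_loss_scale=args.loss_scale)
    else:
        # AMP path: O2 with FP16 batchnorm and a dynamic or static loss scale
        model, optimizer = amp.initialize(
            model, optimizer, opt_level='O2', keep_batchnorm_fp32=False,
            loss_scale='dynamic' if args.loss_scale == 0 else args.loss_scale)
    return model, optimizer
```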

View file

@ -1,6 +1,20 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
train_batch_size=${1:-14}
learning_rate=${2:-"0.4375e-4"}
precision=${3:-"fp16"}
num_gpus=${4:-8}
warmup_proportion=${5:-"0.01"}
train_steps=${6:-2285714}
save_checkpoint_steps=${7:-2000}
resume_training=${8:-"false"}
create_logfile=${9:-"true"}
accumulate_gradients=${10:-"false"}
gradient_accumulation_steps=${11:-1}
seed=${12:-42}
job_name=${13:-"job"}
DATASET=wikipedia_corpus # change this for other datasets
@ -29,18 +43,6 @@ if [ ! -f "$BERT_CONFIG" ] ; then
exit -1
fi
train_batch_size=${1:-14}
learning_rate=${2:-"0.4375e-4"}
precision=${3:-"fp16"}
num_gpus=${4:-8}
warmup_proportion=${5:-"0.01"}
train_steps=${6:-2285714}
save_checkpoint_steps=${7:-2000}
resume_training=${8:-"false"}
create_logfile=${9:-"true"}
checkpoint_activations=${10:-"false"}
seed=${11:-42}
PREC=""
if [ "$precision" = "fp16" ] ; then
PREC="--fp16"
@ -51,9 +53,9 @@ else
exit -2
fi
CHECKPOINT_ACTIVATIONS=""
if [ "$checkpoint_activations" == "true" ] ; then
CHECKPOINT_ACTIVATIONS="--checkpoint_activations"
ACCUMULATE_GRADIENTS=""
if [ "$accumulate_gradients" == "true" ] ; then
ACCUMULATE_GRADIENTS="--gradient_accumulation_steps=$gradient_accumulation_steps"
fi
CHECKPOINT=""
@ -67,7 +69,6 @@ CMD=" /workspace/bert/run_pretraining.py"
CMD+=" --input_dir=$DATA_DIR"
CMD+=" --output_dir=$CHECKPOINTS_DIR"
CMD+=" --config_file=$BERT_CONFIG"
CMD+=" --do_train"
CMD+=" --bert_model=bert-large-uncased"
CMD+=" --train_batch_size=$train_batch_size"
CMD+=" --max_seq_length=512"
@ -78,7 +79,7 @@ CMD+=" --num_steps_per_checkpoint=$save_checkpoint_steps"
CMD+=" --learning_rate=$learning_rate"
CMD+=" --seed=$seed"
CMD+=" $PREC"
CMD+=" $CHECKPOINT_ACTIVATIONS"
CMD+=" $ACCUMULATE_GRADIENTS"
CMD+=" $CHECKPOINT"
@ -93,7 +94,7 @@ if [ "$create_logfile" = "true" ] ; then
export GBS=$(expr $train_batch_size \* $num_gpus)
printf -v TAG "pyt_bert_pretraining_%s_gbs%d" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$TAG.$DATESTAMP.log
LOGFILE=$RESULTS_DIR/$job_name.$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi


@ -7,11 +7,11 @@ echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/workspace/checkpoints/bert_uncased.pt"}
epochs=${2:-"2.0"}
batch_size=${3:-"24"}
batch_size=${3:-"3"}
learning_rate=${4:-"3e-5"}
precision=${5:-"fp16"}
num_gpu=${6:-"8"}
seed=${7:-"42"}
seed=${7:-"1"}
squad_dir=${8:-"/workspace/bert/data/squad/v1.1"}
vocab_file=${9:-"/workspace/bert/vocab/vocab"}
OUT_DIR=${10:-"/results/SQuAD"}
@ -50,6 +50,10 @@ elif [ "$mode" = "eval" ] ; then
CMD+="--do_predict "
CMD+="--predict_file=$squad_dir/dev-v1.1.json "
CMD+="--predict_batch_size=$batch_size "
elif [ "$mode" = "prediction" ] ; then
CMD+="--do_predict "
CMD+="--predict_file=$squad_dir/dev-v1.1.json "
CMD+="--predict_batch_size=$batch_size "
else
CMD+=" --do_train "
CMD+=" --train_file=$squad_dir/train-v1.1.json "
@ -79,10 +83,18 @@ time $CMD |& tee $LOGFILE
#sed -r 's/\r|\x1b\[A/\n/g' $LOGFILE > $LOGFILE.edit  # strip carriage returns and ANSI cursor codes (disabled)
throughput=`cat $LOGFILE | grep -E 'Iteration.*[0-9.]+(s/it|it/s)' | tail -1 | egrep -o '[0-9.]+(s/it|it/s)' | head -1 | egrep -o '[0-9.]+'`
if [ "$mode" != "train" ]; then
python $squad_dir/evaluate-v1.1.py $squad_dir/dev-v1.1.json $OUT_DIR/predictions.json |& tee -a $LOGFILE
if [ "$mode" != "eval" ]; then
throughput=`cat $LOGFILE | grep -E 'Iteration.*[0-9.]+(it/s)' | tail -1 | egrep -o '[0-9.]+(s/it|it/s)' | head -1 | egrep -o '[0-9.]+'`
train_perf=$(awk 'BEGIN {print ('$throughput' * '$num_gpu' * '$batch_size')}')
echo " training throughput: $train_perf"
fi
echo "throughput: $throughput"
if [ "$mode" != "train" ]; then
if [ "$mode" != "prediction" ]; then
python $squad_dir/evaluate-v1.1.py $squad_dir/dev-v1.1.json $OUT_DIR/predictions.json |& tee -a $LOGFILE
eval_throughput=`cat $LOGFILE | grep Evaluating | tail -1 | awk -F ',' '{print $2}' | egrep -o '[0-9.]+' | head -1 | egrep -o '[0-9.]+'`
eval_perf=$(awk 'BEGIN {print ('$eval_throughput' * '$num_gpu' * '$batch_size')}')
echo " evaluation throughput: $eval_perf"
fi
fi


@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:19.05-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}
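# The base image can now be overridden at build time, for example:
#   docker build --build-arg FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3 .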
RUN apt-get update && \
apt-get install -y unzip


@ -6,46 +6,50 @@ model to achieve state of the art accuracy, and is tested and maintained by NVID
Table of Contents
=================
* [The model](#the-model)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Details](#details)
* [Scripts and sample code](#scripts-and-sample-code)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [ML-1m](#ml-1m)
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-8x-v100-32g)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Scaling beyond 8 GPUs](#scaling-beyond-8-gpus)
* [Memory usage](#memory-usage)
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-8x-v100-32g)
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-16x-v100-32g)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16g))
* [NVIDIA DGX-1 (8x V100 32G)](#nvidia-dgx-1-(8x-v100-32g))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32g))
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Scaling beyond 8 GPUs](#scaling-beyond-8-gpus)
* [Memory usage](#memory-usage)
## The model
## Model overview
The NCF model focuses on providing recommendations, also known as collaborative filtering with implicit feedback. The training data for this model should contain binary information about whether a user interacted with a specific item.
NCF was first described by Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu and Tat-Seng Chua in the [Neural Collaborative Filtering paper](https://arxiv.org/abs/1708.05031).
@ -110,6 +114,35 @@ It allows us to use FP16 training with FP32 master weights by modifying just 3 l
* Fused Adam - We use a fused implementation of the Adam optimizer provided by the Apex package. It fuses some operations for faster weight updates.
Since NCF is a relatively lightweight model with a large number of parameters, we've observed significant performance improvements from using FusedAdam; a short usage sketch follows below.
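As a rough illustration, here is a minimal sketch of the swap, assuming Apex is installed (the linear layer is a placeholder for the real NeuMF network; the betas shown mirror the script's defaults):

```python
import torch
from apex.optimizers import FusedAdam  # fused CUDA kernels for the Adam update

model = torch.nn.Linear(128, 1).cuda()  # placeholder model; parameters must live on the GPU
optimizer = FusedAdam(model.parameters(), lr=0.0045,
                      betas=(0.25, 0.5), eps=1e-8)
```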
## Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
### Enabling mixed precision
Using the Automatic Mixed Precision (AMP) package requires two modifications in the source code.
The first one is to initialize the model and the optimizer using the `amp.initialize` function:
```python
model, optimizer = amp.initialize(model, optimizer, opt_level=args.opt_level,
keep_batchnorm_fp32=False, loss_scale='dynamic')
```
The second one is to use the AMP's loss scaling context manager:
```python
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
```
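Put together, a single training step then looks roughly like the sketch below; `loader`, `criterion`, and the tensor names are placeholders rather than the actual variables used in `ncf.py`:

```python
for users, items, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(users, items), labels)
    # scale the loss so that small FP16 gradients are not flushed to zero
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```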
## Setup
The following section lists the requirements in order to start training the Neural Collaborative Filtering model.
@ -128,7 +161,7 @@ Running PyTorch
For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned NVIDIA Container Support Matrix.
### Quick Start Guide
## Quick Start Guide
1. Clone the repository.
```bash
@ -189,14 +222,14 @@ This will result in a checkpoint file being written to `/data/checkpoints/model.
6. Start validation/evaluation.
The trained model can be evaluated by passing the `--mode test` flag to the `run.sh` script:
```bash
python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m --mode test --checkpoint-path /data/checkpoints/model.pth
```
## Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
@ -217,7 +250,7 @@ usage: ncf.py [-h] [--data DATA] [-e EPOCHS] [-b BATCH_SIZE]
[--valid_batch_size VALID_BATCH_SIZE] [-f FACTORS]
[--layers LAYERS [LAYERS ...]] [-n NEGATIVE_SAMPLES]
[-l LEARNING_RATE] [-k TOPK] [--seed SEED]
[--threshold THRESHOLD] [--valid_negative VALID_NEGATIVE]
[--threshold THRESHOLD]
[--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--dropout DROPOUT]
[--checkpoint_dir CHECKPOINT_DIR] [--mode {train,test}]
[--grads_accumulated GRADS_ACCUMULATED] [--opt_level {O0,O2}]
@ -247,9 +280,6 @@ optional arguments:
--seed SEED, -s SEED Manually set random seed for torch
--threshold THRESHOLD, -t THRESHOLD
Stop training early at threshold
--valid_negative VALID_NEGATIVE
Number of negative samples for each positive test
example
--beta1 BETA1, -b1 BETA1
Beta1 for Adam
--beta2 BETA2, -b2 BETA2
@ -329,39 +359,11 @@ The script will then:
* Run inference on the test dataset
* Compute and print the validation metric
## Mixed precision training
## Performance
### Benchmarking
## Benchmarking
### Training performance benchmark
#### Training performance benchmark
NCF training on NVIDIA DGX systems is very fast; therefore, to measure training and validation throughput, you can simply run the full training job with:
```bash
@ -372,7 +374,7 @@ python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/
At the end of the script, a line reporting the best train throughput is printed.
### Inference performance benchmark
#### Inference performance benchmark
Validation throughput can be measured by running the full training job with:
```bash
@ -382,23 +384,42 @@ python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/
The best validation throughput is reported to the standard output.
## Results
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
### Training accuracy results
#### Training accuracy results
#### NVIDIA DGX-1 (8x V100 32G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
The following table lists the best hit rate at 10 for DGX-1 with 8 V100 16G GPUs. It also shows the average time to reach this HR@10 across 5 random seeds.
The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to train - FP32 (s)** | **Time to train - mixed precision (s)** | **Time to train speedup (FP32 to mixed precision)** |
|--------------------------:|-----------------------------:|--------------------------:|--------------------------:|-------------------------------:|-------------------------------:|------------------:|
| 1 | 1,048,576 | 0.95913 | 0.95887 | 188.82 | 100.37 | 1.88 |
| 8 | 131,072 | 0.95905 | 0.95906 | 43.20 | 26.68 | 1.62 |
To reproduce this result, start the NCF Docker container interactively and run:
```bash
./prepare_dataset.sh
python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m
```
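For reference, HR@10 counts a user as a hit if the held-out positive item is ranked among the top 10 of that user's scored candidates (the positive plus the sampled negatives). Below is a minimal sketch of the metric, assuming a score matrix with the positive item in the last column, matching the layout produced during test-data creation:

```python
import torch

def hit_rate_at_k(scores, k=10):
    # scores: (num_users, num_negatives + 1), positive item in the last column
    _, topk = scores.topk(k, dim=1)                 # indices of the k highest-scored items
    hits = (topk == scores.size(1) - 1).any(dim=1)  # did the positive make the cut?
    return hits.float().mean().item()
```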
##### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs.
The following table lists the best hit rate at 10 for DGX-1 with 8 V100 32G GPUs:
The following table lists the best hit rate at 10 for DGX-1 with 8 V100 32G GPUs. It also shows the average time to reach this HR@10 across 5 random seeds.
The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to train - FP32 (s)** | **Time to train - mixed precision (s)** | **Time to train speedup (FP32 to mixed precision)** |
|--------------------------:|-----------------------------:|--------------------------:|--------------------------:|-------------------------------:|-------------------------------:|------------------:|
| 1 | 1,048,576 | 0.95913 | 0.95887 | 194.72 | 106.03 | 1.84 |
| 8 | 131,072 | 0.95905 | 0.95906 | 44.07 | 27.86 | 1.58 |
| **Number of GPUs** | **Single precision HR@10** | **Mixed precision HR@10** |
|:---:|:--------:|:-------:|
|1| 0.95847 | 0.95845 |
|4| 0.95887 | 0.95841 |
|8| 0.95850 | 0.95885 |
Here's an example validation accuracy curve for mixed precision vs single precision on DGX-1 with 8 V100 32G GPUs:
@ -410,9 +431,29 @@ To reproduce this result, start the NCF Docker container interactively and run:
python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m
```
Training accuracy results on a DGX-1 with 8 V100 16G GPUs and on DGX-2 should be the same.
##### NVIDIA DGX-2 (16x V100 32G)
#### Training stability test
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
The following table lists the best hit rate at 10 for DGX-2 with 16 V100 32G GPUs. It also shows the average time to reach this HR@10 across 5 random seeds.
The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to train - FP32 (s)** | **Time to train - mixed precision (s)** | **Time to train speedup (FP32 to mixed precision)** |
|--------------------------:|-----------------------------:|--------------------------:|--------------------------:|-------------------------------:|-------------------------------:|------------------:|
| 1 | 1,048,576 | 0.95913 | 0.95887 | 180.85 | 100.33 | 1.80 |
| 8 | 131,072 | 0.95900 | 0.95918 | 44.21 | 29.68 | 1.49 |
| 16 | 65,536 | 0.95896 | 0.95906 | 34.47 | 26.52 | 1.30 |
To reproduce this result, start the NCF Docker container interactively and run:
```bash
./prepare_dataset.sh
python -m torch.distributed.launch --nproc_per_node=16 ncf.py --data /data/cache/ml-20m
```
##### Training stability test
The histogram below shows the best HR@10 achieved
for 400 experiments using mixed precision and 400 experiments using single precision.
@ -421,90 +462,60 @@ Mean HR@10 for mixed precision was equal to 0.95868 and for single precision it
![hr_histogram](./img/hr_histogram.png)
### Training performance results
#### Training performance results
#### NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
The following table shows the best training throughput:
| **Number of GPUs** | **Batch size per GPU**| **Mixed precision throughput (samples/sec)** | **Single precision throughput (samples/sec)** | **Speed-up with mixed precision** | **Multi-GPU strong scaling with mixed precision** | **Multi-GPU strong scaling with FP32** |
|:---:|:--------:|:-----:|:-----------:|:-----:|:----:|:---|
| 1 |1048576| 20,459,365| 9,777,551 | 2.09 | 1 | 1 |
| 4 |262144 | 61,782,125| 32,583,924 | 1.90 | 3.02 |3.33|
| 8 |131072 | 98,464,084| 55,365,147 | 1.78 |4.81 |5.66|
The following table shows the average time to reach HR@10 of 0.9562 across 5 random seeds. The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **GPUs** | **Batch Size / GPU** | **Throughput - FP32 (samples / s)** | **Throughput - Mixed precision (samples /s)** | **Throughput Speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong scaling - Mixed precision** |
|--------------------------:|-----------------------------:|----------------------------------:|----------------------------------:|------------------:|---------------------:|---------------------:|
| 1 | 1,048,576 | 10,536,076 | 21,059,303 | 2.00 | 1.00 | 1.00 |
| 8 | 131,072 | 58,286,313 | 100,760,496 | 1.73 | 5.53 | 4.78 |
| **Number of GPUs** | **Batch size per GPU** | **Mixed precision (seconds)** | **Single precision (seconds)** | **Speed-up with mixed precision** |
|:---:|:----:|:---------:|:-----------:|:-----:|
| 1 | 1048576| 67.03 | 142.31 | 2.12 |
| 4 | 262144| 23.92 | 47.57 | 1.99 |
| 8 | 131072| 18.82 | 31.48 | 1.67 |
#### NVIDIA DGX-1 (8x V100 32G)
##### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs.
The following table shows the best training throughput:
| **Number of GPUs** | **Batch size per GPU** | **Mixed precision throughput (samples/sec)** | **Single precision throughput (samples/sec)** | **Speed-up with mixed precision** | **Multi-GPU strong scaling with mixed precision** | **Multi-GPU strong scaling with FP32** |
|:---:|:----:|:---------:|:-----------:|:-----:|:---:|:---:|
| 1 | 1048576| 19,314,944 | 9,464,431 | 2.04 | 1 | 1 |
| 4 | 262144| 58,579,745 |31,577,085 | 1.86 | 3.03 | 3.34 |
| 8 | 131072| 92,964,306 | 53,972,811 | 1.72 | 4.81 | 5.70 |
The following table shows the average time to reach HR@10 of 0.9562 across 5 random seeds. The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **Number of GPUs** | **Mixed precision (seconds)** | **Single precision (seconds)** | **Speed-up with mixed precision** |
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 70.49 | 146.68 | 2.08 |
| 4 | 24.61 | 49.01 | 1.99 |
| 8 | 19.72 | 32.25 | 1.64 |
| **GPUs** | **Batch Size / GPU** | **Throughput - FP32 (samples / s)** | **Throughput - Mixed precision (samples /s)** | **Throughput Speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong scaling - Mixed precision** |
|--------------------------:|-----------------------------:|----------------------------------:|----------------------------------:|------------------:|---------------------:|---------------------:|
| 1 | 1,048,576 | 10,230,464 | 19,894,392 | 1.94 | 1.00 | 1.00 |
| 8 | 131,072 | 57,043,196 | 95,424,391 | 1.67 | 5.58 | 4.80 |
#### NVIDIA DGX-2 (16x V100 32G)
##### NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
The following table shows the best training throughput:
| **Number of GPUs ** | **Batch size per GPU** | **Mixed precision throughput (samples/sec)** | **Single precision throughput (samples/sec)** | **Speed-up with mixed precision** | **Multi-GPU strong scaling with mixed precision** | **Multi-GPU strong scaling with FP32** |
|:---:|:-----:|:-------:|:-----------:|:-----:|:---:|:---:|
| 1 | 1048576| 20,645,544 | 10,145,873 | 2.03 | 1 | 1 |
| 4 | 262144 | 63,608,950 | 34,758,369 | 1.83 | 3.08 | 3.43 |
| 8 | 131072| 98,887,103 | 57,251,418 | 1.73 | 4.79 | 5.64 |
| 16 | 65536| 128,976,394 | 82,932,545 | 1.56 | 6.25 | 8.17 |
The following table shows the average time to reach HR@10 of 0.9562 across 5 random seeds. The training time was measured excluding data downloading, preprocessing, validation data generation and library initialization times.
| **Number of GPUs ** | **Mixed precision (seconds)** | **Single precision (seconds)** | **Speed-up with mixed precision** |
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 65.99 |134.93 |2.04|
| 4 | 26.21 |41.12 |1.57|
| 8 | 21.96 |29.71 |1.35|
| 16| 22.15 |28.99 |1.31|
| **GPUs** | **Batch Size / GPU** | **Throughput - FP32 (samples / s)** | **Throughput - Mixed precision (samples /s)** | **Throughput Speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong scaling - Mixed precision** |
|--------------------------:|:-----------------------------|:----------------------------------|:----------------------------------|------------------:|---------------------:|---------------------:|
| 1 | 1,048,576 | 10,941,690 | 21,056,129 | 1.92 | 1.00 | 1.00 |
| 8 | 131,072 | 60,247,209 | 100,142,844 | 1.66 | 5.51 | 4.76 |
| 16 | 65,536 | 84,287,736 | 133,300,953 | 1.58 | 7.70 | 6.33 |
### Inference performance results
#### Inference performance results
#### NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
The following table shows the best inference throughput:
| **Number of GPUs ** | **Mixed precision (samples/sec)** | **Single precision (samples/sec)** | **Speed-up with mixed precision** |
| **Number of GPUs** | **Mixed precision (samples/sec)** | **Single precision (samples/sec)** | **Speed-up with mixed precision** |
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 57,163,273 | 28,877,257 | 1.98 |
#### NVIDIA DGX-1 (8x V100 32G)
##### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs.
@ -515,7 +526,7 @@ The following table shows the best inference throughput:
| 1 | 54,570,476 | 28,085,521 | 1.94 |
#### NVIDIA DGX-2 (16x V100 32G)
##### NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by following the steps in the Quick Start Guide in the PyTorch 19.05-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
@ -525,8 +536,9 @@ The following table shows the best inference throughput:
|:---:|:-------------:|:-----------:|:-----:|
| 1 | 58,383,216 | 30,018,043 | 1.94 |
## Release notes
## Changelog
### Changelog
1. January 22, 2018
* Initial release
2. May, 2019
@ -536,11 +548,15 @@ The following table shows the best inference throughput:
* Data loading code cleanup.
* Default container updated to PyTorch 19.05-py3.
* Updated README.md.
3. June, 2019
* Updated performance tables.
* Default container changed to PyTorch 19.06-py3.
* Caching validation negatives between runs
## Known issues
### Known issues
### Scaling beyond 8 GPUs
#### Scaling beyond 8 GPUs
Neural Collaborative Filtering is a relatively lightweight model that trains quickly on the relatively small ML-20m dataset.
Because of that, the high ratio of communication to computation makes it difficult to
efficiently use more than 8 GPUs. Typically, this is not an issue because when using 8
@ -550,7 +566,7 @@ GPUs with FP16 precision, the training is sufficiently fast. However, if you'd
by finding hyperparameters that enable using a larger batch size or by reducing the
number of trainable parameters.
### Memory usage
#### Memory usage
In the default settings, the additional memory beyond 16G may not be fully utilized.
This is because we set the default batch size for the ML-20m dataset to 1M,


@ -32,6 +32,7 @@ from argparse import ArgumentParser
import pandas as pd
from load import implicit_load
import torch
import tqdm
from logger.logger import LOGGER
from logger import tags
@ -48,11 +49,50 @@ def parse_args():
help='Path to reviews CSV file from MovieLens')
parser.add_argument('--output', type=str, default='/data',
help='Output directory for train and test files')
parser.add_argument('--valid_negative', type=int, default=100,
help='Number of negative samples for each positive test example')
parser.add_argument('--seed', '-s', type=int, default=1,
help='Manually set random seed for torch')
return parser.parse_args()
class _TestNegSampler:
def __init__(self, train_ratings, nb_neg):
self.nb_neg = nb_neg
self.nb_users = int(train_ratings[:, 0].max()) + 1
self.nb_items = int(train_ratings[:, 1].max()) + 1
# compute unique ids so a hash set can be built for fast membership lookup
ids = (train_ratings[:, 0] * self.nb_items) + train_ratings[:, 1]
self.set = set(ids)
def generate(self, batch_size=128*1024):
users = torch.arange(0, self.nb_users).reshape([1, -1]).repeat([self.nb_neg, 1]).transpose(0, 1).reshape(-1)
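        # each user id now appears nb_neg times in a row: [0]*nb_neg + [1]*nb_neg + ...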
items = [-1] * len(users)
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
print('Generating validation negatives...')
for idx, u in enumerate(tqdm.tqdm(users.tolist())):
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
while u * self.nb_items + j in self.set:
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
items[idx] = j
items = torch.LongTensor(items)
return items
def main():
args = parse_args()
if args.seed is not None:
torch.manual_seed(args.seed)
print("Loading raw data from {}".format(args.path))
df = implicit_load(args.path, sort=False)
@ -65,7 +105,6 @@ def main():
df[USER_COLUMN] = pd.factorize(df[USER_COLUMN])[0]
df[ITEM_COLUMN] = pd.factorize(df[ITEM_COLUMN])[0]
print("Creating list of items for each user")
# Need to sort before popping to get last item
df.sort_values(by='timestamp', inplace=True)
@ -87,5 +126,10 @@ def main():
test_ratings = torch.from_numpy(test_data.values)
torch.save(test_ratings, args.output+'/test_ratings.pt')
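    # Sample the validation negatives once here, during preprocessing, and cache
    # them to disk so training runs can reload them instead of regenerating them.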
sampler = _TestNegSampler(train_ratings.cpu().numpy(), args.valid_negative)
test_negs = sampler.generate().cuda()
test_negs = test_negs.reshape(-1, args.valid_negative)
torch.save(test_negs, args.output+'/test_negatives.pt')
if __name__ == '__main__':
main()


@ -30,53 +30,16 @@
import time
import torch
import tqdm
class _TestNegSampler:
def __init__(self, train_ratings, nb_neg):
self.nb_neg = nb_neg
self.nb_users = int(train_ratings[:, 0].max()) + 1
self.nb_items = int(train_ratings[:, 1].max()) + 1
# compute unique ids for quickly created hash set and fast lookup
ids = (train_ratings[:, 0] * self.nb_items) + train_ratings[:, 1]
self.set = set(ids)
def generate(self, batch_size=128*1024):
users = torch.arange(0, self.nb_users).reshape([1, -1]).repeat([self.nb_neg, 1]).transpose(0, 1).reshape(-1)
items = [-1] * len(users)
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
print('Generating validation negatives...')
for idx, u in enumerate(tqdm.tqdm(users.tolist())):
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
while u * self.nb_items + j in self.set:
if not random_items:
random_items = torch.LongTensor(batch_size).random_(0, self.nb_items).tolist()
j = random_items.pop()
items[idx] = j
items = torch.LongTensor(items)
return items
def create_test_data(train_ratings, test_ratings, args):
def create_test_data(test_ratings, test_negs, args):
test_users = test_ratings[:,0]
test_pos = test_ratings[:,1].reshape(-1,1)
begin = time.time()
sampler = _TestNegSampler(train_ratings.cpu().numpy(), args.valid_negative)
test_negs = sampler.generate().cuda()
end = time.time()
print('Generating validation negatives took: ', end - begin)
del train_ratings
# create items with real sample at last position
test_users = test_users.reshape(-1,1).repeat(1, 1 + args.valid_negative)
test_items = torch.cat((test_negs.reshape(-1, args.valid_negative), test_pos), dim=1)
num_valid_negative = test_negs.shape[1]
test_users = test_users.reshape(-1,1).repeat(1, 1 + num_valid_negative)
test_items = torch.cat((test_negs, test_pos), dim=1)
del test_ratings, test_negs
# generate dup mask and real indices for exact same behavior on duplication compare to reference


@ -0,0 +1,95 @@
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch.jit
import time
from argparse import ArgumentParser
import torch
from neumf import NeuMF
from logger.logger import LOGGER, timed_block, timed_function
from logger.autologging import log_hardware, log_args
from apex import amp
LOGGER.model = 'ncf'
def parse_args():
parser = ArgumentParser(description="Benchmark inference performance of the NCF model")
parser.add_argument('--load_checkpoint_path', default=None, type=str,
help='Path to the checkpoint file to be loaded before training/evaluation')
parser.add_argument('--n_users', default=138493, type=int,
help='Number of users. Defaults to the number of users in the ml-20m dataset after preprocessing')
parser.add_argument('--n_items', default=26744, type=int,
help='Number of items. Defaults to the number of items in the ml-20m dataset after preprocessing')
parser.add_argument('-f', '--factors', type=int, default=64,
help='Number of predictive factors')
parser.add_argument('--dropout', type=float, default=0.5,
help='Dropout probability, if equal to 0 will not use dropout at all')
parser.add_argument('--layers', nargs='+', type=int,
default=[256, 256, 128, 64],
help='Sizes of hidden layers for MLP')
parser.add_argument('--batch_size', default=1, type=int, help='Batch size for inference')
parser.add_argument('--num_batches', default=20, type=int,
help='Number of batches for which to measure latency and throughput')
parser.add_argument('--opt_level', default='O2', type=str,
help='Optimization level for Automatic Mixed Precision',
choices=['O0', 'O2'])
return parser.parse_args()
def main():
log_hardware()
args = parse_args()
log_args(args)
model = NeuMF(nb_users=args.n_users, nb_items=args.n_items, mf_dim=args.factors,
mlp_layer_sizes=args.layers, dropout=args.dropout)
model = model.cuda()
if args.load_checkpoint_path:
state_dict = torch.load(args.load_checkpoint_path)
model.load_state_dict(state_dict)
if args.opt_level == "O2":
model = amp.initialize(model, opt_level=args.opt_level,
keep_batchnorm_fp32=False, loss_scale='dynamic')
users = torch.cuda.LongTensor(args.batch_size).random_(0, args.n_users)
items = torch.cuda.LongTensor(args.batch_size).random_(0, args.n_items)
latencies = []
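    # CUDA kernels launch asynchronously, so bracket the timed region with
    # explicit synchronization to measure true per-batch latency.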
for _ in range(args.num_batches):
torch.cuda.synchronize()
start = time.time()
predictions = model(users, items, sigmoid=True)
torch.cuda.synchronize()
latencies.append(time.time() - start)
LOGGER.log(key='batch_size', value=args.batch_size)
LOGGER.log(key='best_inference_throughput', value=args.batch_size / min(latencies))
LOGGER.log(key='best_inference_latency', value=min(latencies))
LOGGER.log(key='inference_latencies', value=latencies)
return
if __name__ == '__main__':
main()
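# Hypothetical invocation (script path and values are examples only):
#   python inference.py --load_checkpoint_path /data/checkpoints/model.pth \
#       --batch_size 1024 --opt_level O2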


@ -79,8 +79,6 @@ def parse_args():
help='Manually set random seed for torch')
parser.add_argument('--threshold', '-t', type=float, default=1.0,
help='Stop training early at threshold')
parser.add_argument('--valid_negative', type=int, default=100,
help='Number of negative samples for each positive test example')
parser.add_argument('--beta1', '-b1', type=float, default=0.25,
help='Beta1 for Adam')
parser.add_argument('--beta2', '-b2', type=float, default=0.5,
@ -91,6 +89,8 @@ def parse_args():
help='Dropout probability, if equal to 0 will not use dropout at all')
parser.add_argument('--checkpoint_dir', default='/data/checkpoints/', type=str,
help='Path to the directory storing the checkpoint file')
parser.add_argument('--load_checkpoint_path', default=None, type=str,
help='Path to the checkpoint file to be loaded before training/evaluation')
parser.add_argument('--mode', choices=['train', 'test'], default='train', type=str,
help='Passing "test" will only run a single evaluation, otherwise full training will be performed')
parser.add_argument('--grads_accumulated', default=1, type=int,
@ -173,17 +173,13 @@ def main():
args.distributed, args.world_size = init_distributed(args.local_rank)
log_args(args)
main_start_time = time.time()
if args.seed is not None:
torch.manual_seed(args.seed)
print("Saving results to {}".format(args.checkpoint_dir))
if not os.path.exists(args.checkpoint_dir) and args.checkpoint_dir != '':
os.makedirs(args.checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(args.checkpoint_dir, 'model.pth')
LOGGER.log(key=tags.PREPROC_HP_NUM_EVAL, value=args.valid_negative)
# The default of np.random.choice is replace=True, as is pytorch's random_()
LOGGER.log(key=tags.PREPROC_HP_SAMPLE_EVAL_REPLACEMENT, value=True)
LOGGER.log(key=tags.INPUT_HP_SAMPLE_TRAIN_REPLACEMENT, value=True)
@ -194,10 +190,16 @@ def main():
torch.distributed.broadcast(torch.tensor([1], device="cuda"), 0)
torch.cuda.synchronize()
main_start_time = time.time()
LOGGER.log(key=tags.RUN_START)
train_ratings = torch.load(args.data+'/train_ratings.pt', map_location=torch.device('cuda:{}'.format(args.local_rank)))
test_ratings = torch.load(args.data+'/test_ratings.pt', map_location=torch.device('cuda:{}'.format(args.local_rank)))
test_negs = torch.load(args.data+'/test_negatives.pt', map_location=torch.device('cuda:{}'.format(args.local_rank)))
valid_negative = test_negs.shape[1]
LOGGER.log(key=tags.PREPROC_HP_NUM_EVAL, value=valid_negative)
nb_maxs = torch.max(train_ratings, 0)[0]
nb_users = nb_maxs[0].item() + 1
@ -206,7 +208,7 @@ def main():
all_test_users = test_ratings.shape[0]
test_users, test_items, dup_mask, real_indices = dataloading.create_test_data(train_ratings, test_ratings, args)
test_users, test_items, dup_mask, real_indices = dataloading.create_test_data(test_ratings, test_negs, args)
# make pytorch memory behavior more consistent later
torch.cuda.empty_cache()
@ -248,15 +250,25 @@ def main():
LOGGER.log(key=tags.OPT_HP_ADAM_EPSILON, value=args.eps)
LOGGER.log(key=tags.MODEL_HP_LOSS_FN, value=tags.VALUE_BCE)
if args.load_checkpoint_path:
state_dict = torch.load(args.load_checkpoint_path)
model.load_state_dict(state_dict)
if args.mode == 'test':
state_dict = torch.load(checkpoint_path)
model.load_state_dict(state_dict)
LOGGER.log(key=tags.EVAL_START, value=0)
start = time.time()
hr, ndcg = val_epoch(model, test_users, test_items, dup_mask, real_indices, args.topk,
samples_per_user=args.valid_negative + 1,
samples_per_user=valid_negative + 1,
num_user=all_test_users, distributed=args.distributed)
print('HR@{K} = {hit_rate:.4f}, NDCG@{K} = {ndcg:.4f}'
.format(K=args.topk, hit_rate=hr, ndcg=ndcg))
val_time = time.time() - start
eval_size = all_test_users * (valid_negative + 1)
eval_throughput = eval_size / val_time
LOGGER.log(key=tags.EVAL_ACCURACY, value={"epoch": 0, "value": hr})
LOGGER.log(key=tags.EVAL_STOP, value=0)
LOGGER.log(key='best_eval_throughput', value=eval_throughput)
return
success = False
@ -307,7 +319,7 @@ def main():
LOGGER.log(key=tags.EVAL_START, value=epoch)
hr, ndcg = val_epoch(model, test_users, test_items, dup_mask, real_indices, args.topk,
samples_per_user=args.valid_negative + 1,
samples_per_user=valid_negative + 1,
num_user=all_test_users, epoch=epoch, distributed=args.distributed)
val_time = time.time() - begin
@ -321,15 +333,17 @@ def main():
LOGGER.log(key=tags.EVAL_TARGET, value=args.threshold)
LOGGER.log(key=tags.EVAL_STOP, value=epoch)
eval_size = all_test_users * (args.valid_negative + 1)
eval_size = all_test_users * (valid_negative + 1)
eval_throughput = eval_size / val_time
eval_throughputs.append(eval_throughput)
LOGGER.log(key='eval_throughput', value=eval_throughput)
if hr > max_hr and args.local_rank == 0:
max_hr = hr
print("New best hr! Saving the model to: ", checkpoint_path)
torch.save(model.state_dict(), checkpoint_path)
save_checkpoint_path = os.path.join(args.checkpoint_dir, 'model.pth')
print("New best hr! Saving the model to: ", save_checkpoint_path)
torch.save(model.state_dict(), save_checkpoint_path)
best_model_timestamp = time.time()
if args.threshold is not None:
if hr >= args.threshold:
@ -337,13 +351,15 @@ def main():
success = True
break
LOGGER.log(key='best_train_throughput', value=max(train_throughputs))
LOGGER.log(key='best_eval_throughput', value=max(eval_throughputs))
LOGGER.log(key='best_accuracy', value=max_hr)
LOGGER.log(key='time_to_target', value=time.time() - main_start_time)
if args.local_rank == 0:
LOGGER.log(key='best_train_throughput', value=max(train_throughputs))
LOGGER.log(key='best_eval_throughput', value=max(eval_throughputs))
LOGGER.log(key='best_accuracy', value=max_hr)
LOGGER.log(key='time_to_target', value=time.time() - main_start_time)
LOGGER.log(key='time_to_best_model', value=best_model_timestamp - main_start_time)
LOGGER.log(key=tags.RUN_STOP, value={"success": success})
LOGGER.log(key=tags.RUN_FINAL)
LOGGER.log(key=tags.RUN_STOP, value={"success": success})
LOGGER.log(key=tags.RUN_FINAL)
if __name__ == '__main__':
main()


@ -35,7 +35,7 @@ set -x
DATASET_NAME=${1:-'ml-20m'}
RAW_DATADIR=${2:-'/data'}
CACHED_DATADIR="${RAW_DATADIR}/cache/${DATASET_NAME}"
CACHED_DATADIR=${3:-"${RAW_DATADIR}/cache/${DATASET_NAME}"}
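# the cache location is now an optional third argument, e.g.:
#   ./prepare_dataset.sh ml-20m /data /data/cache/ml-20m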
# you can add another option to this case in order to support other datasets
case ${DATASET_NAME} in


@ -607,7 +607,7 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
The following table shows the expected training time for convergence for Tacotron 2 (1500 epochs):
|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
|---:|---:|---:|---:|
|---:|---:|---:|---:|---:|
|1| 128@FP16, 64@FP32 | 137.33 | 227.66 | 1.66 |
|4| 128@FP16, 64@FP32 | 40.68 | 63.99 | 1.57 |
|8| 128@FP16, 64@FP32 | 20.74 | 32.47 | 1.57 |
@ -617,7 +617,7 @@ The following table shows the expected training time for convergence for Tacotro
The following table shows the expected training time for convergence for WaveGlow (1000 epochs):
|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
|---:|---:|---:|---:|
|---:|---:|---:|---:|---:|
|1| 10@FP16, 4@FP32 | 358.00 | 793.97 | 2.22 |
|4| 10@FP16, 4@FP32 | 103.10 | 223.59 | 2.17 |
|8| 10@FP16, 4@FP32 | 50.40 | 109.45 | 2.17 |


@ -1,12 +0,0 @@
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


@ -1,9 +1,11 @@
FROM nvcr.io/nvidia/pytorch:19.01-py3
FROM nvcr.io/nvidia/pytorch:19.05-py3
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
ADD . /workspace/gnmt
WORKDIR /workspace/gnmt
RUN pip install -r requirements.txt
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
ADD . /workspace/gnmt


@ -1,2 +1,5 @@
pytablewriter
sacrebleu==1.2.10
sacremoses==0.0.19
git+git://github.com/rsennrich/subword-nmt.git@48ba99e657591c329e0003f0c6e32e493fa959ef
git+git://github.com/NVIDIA/apex.git#egg=apex


@ -1,3 +1,3 @@
#!/bin/bash
docker build . --rm -t gnmt
docker build . --rm -t gnmt:latest


@ -1,3 +1,3 @@
#!/bin/bash
nvidia-docker run -it --rm --ipc=host -v $PWD:/workspace/gnmt/ gnmt bash
nvidia-docker run --init -it --rm --ipc=host -v $PWD:/workspace/gnmt/ gnmt bash


@ -33,12 +33,13 @@ fi
cd $REPO_DIR
python3 translate.py \
--input ${DATASET_DIR}/newstest2014.tok.bpe.32000.en \
--input ${DATASET_DIR}/newstest2014.en \
--reference ${DATASET_DIR}/newstest2014.de \
--output /tmp/output \
--model results/gnmt/model_best.pth \
--batch-size ${BATCH_SIZE} \
--beam-size ${BEAM_SIZE} \
--math ${MATH} \
--warmup 1 \
--target-bleu 24.3 \
${TARGET_PERF}


@ -1,6 +1,6 @@
fp16,128,5,Tesla V100-SXM2-16GB,31050
fp32,128,5,Tesla V100-SXM2-16GB,12500
fp16,128,5,Tesla V100-SXM2-32GB,29500
fp32,128,5,Tesla V100-SXM2-32GB,12500
fp16,128,5,Tesla V100-SXM3-32GB,34050
fp32,128,5,Tesla V100-SXM3-32GB,14250
fp16,128,5,Tesla V100-SXM2-16GB,18740
fp32,128,5,Tesla V100-SXM2-16GB,8610
fp16,128,5,Tesla V100-SXM2-32GB,17800
fp32,128,5,Tesla V100-SXM2-32GB,8180
fp16,128,5,Tesla V100-SXM3-32GB,20550
fp32,128,5,Tesla V100-SXM3-32GB,9810


@ -1,20 +1,20 @@
fp16,1,Tesla V100-SXM2-16GB,66050
fp16,4,Tesla V100-SXM2-16GB,196174
fp16,8,Tesla V100-SXM2-16GB,387282
fp32,1,Tesla V100-SXM2-16GB,21346
fp32,4,Tesla V100-SXM2-16GB,76083
fp32,8,Tesla V100-SXM2-16GB,153697
fp16,1,Tesla V100-SXM2-16GB,68000
fp16,4,Tesla V100-SXM2-16GB,221000
fp16,8,Tesla V100-SXM2-16GB,419000
fp32,1,Tesla V100-SXM2-16GB,21000
fp32,4,Tesla V100-SXM2-16GB,75000
fp32,8,Tesla V100-SXM2-16GB,149000
fp16,1,Tesla V100-SXM2-32GB,65000
fp16,4,Tesla V100-SXM2-32GB,190000
fp16,8,Tesla V100-SXM2-32GB,360000
fp16,4,Tesla V100-SXM2-32GB,210000
fp16,8,Tesla V100-SXM2-32GB,400000
fp32,1,Tesla V100-SXM2-32GB,21000
fp32,4,Tesla V100-SXM2-32GB,76000
fp32,8,Tesla V100-SXM2-32GB,150000
fp16,1,Tesla V100-SXM3-32GB,65830
fp16,4,Tesla V100-SXM3-32GB,200886
fp16,8,Tesla V100-SXM3-32GB,362612
fp16,16,Tesla V100-SXM3-32GB,738521
fp32,1,Tesla V100-SXM3-32GB,22695
fp32,4,Tesla V100-SXM3-32GB,81224
fp32,8,Tesla V100-SXM3-32GB,156536
fp32,16,Tesla V100-SXM3-32GB,314831
fp32,4,Tesla V100-SXM2-32GB,75000
fp32,8,Tesla V100-SXM2-32GB,148000
fp16,1,Tesla V100-SXM3-32GB,72000
fp16,4,Tesla V100-SXM3-32GB,237000
fp16,8,Tesla V100-SXM3-32GB,430000
fp16,16,Tesla V100-SXM3-32GB,852000
fp32,1,Tesla V100-SXM3-32GB,22000
fp32,4,Tesla V100-SXM3-32GB,80000
fp32,8,Tesla V100-SXM3-32GB,155000
fp32,16,Tesla V100-SXM3-32GB,312000


@ -47,7 +47,7 @@ python3 -m launch train.py \
--epochs 1 \
--remain-steps 1.0 \
--no-eval \
--max-size $((128 * ${GPU_COUNT} * 300)) \
--train-max-size $((128 * ${GPU_COUNT} * 300)) \
--math ${MATH} \
--train-global-batch-size ${GLOBAL_BATCH_SIZE} \
${TARGET_PERF}


@ -146,32 +146,25 @@ python3 scripts/filter_dataset.py \
-f2 ${OUTPUT_DIR}/newstest_dev.tok.clean.de
# Generate Subword Units (BPE)
# Clone Subword NMT
if [ ! -d "${OUTPUT_DIR}/subword-nmt" ]; then
git clone https://github.com/rsennrich/subword-nmt.git "${OUTPUT_DIR}/subword-nmt"
cd ${OUTPUT_DIR}/subword-nmt
git reset --hard 48ba99e657591c329e0003f0c6e32e493fa959ef
cd -
fi
# Learn Shared BPE
for merge_ops in 32000; do
echo "Learning BPE with merge_ops=${merge_ops}. This may take a while..."
cat "${OUTPUT_DIR}/train.tok.clean.de" "${OUTPUT_DIR}/train.tok.clean.en" | \
${OUTPUT_DIR}/subword-nmt/learn_bpe.py -s $merge_ops > "${OUTPUT_DIR}/bpe.${merge_ops}"
subword-nmt learn-bpe -s $merge_ops > "${OUTPUT_DIR}/bpe.${merge_ops}"
echo "Apply BPE with merge_ops=${merge_ops} to tokenized files..."
for lang in en de; do
for f in ${OUTPUT_DIR}/*.tok.${lang} ${OUTPUT_DIR}/*.tok.clean.${lang}; do
outfile="${f%.*}.bpe.${merge_ops}.${lang}"
${OUTPUT_DIR}/subword-nmt/apply_bpe.py -c "${OUTPUT_DIR}/bpe.${merge_ops}" < $f > "${outfile}"
subword-nmt apply-bpe -c "${OUTPUT_DIR}/bpe.${merge_ops}" < $f > "${outfile}"
echo ${outfile}
done
done
# Create vocabulary file for BPE
cat "${OUTPUT_DIR}/train.tok.clean.bpe.${merge_ops}.en" "${OUTPUT_DIR}/train.tok.clean.bpe.${merge_ops}.de" | \
${OUTPUT_DIR}/subword-nmt/get_vocab.py | cut -f1 -d ' ' > "${OUTPUT_DIR}/vocab.bpe.${merge_ops}"
subword-nmt get-vocab | cut -f1 -d ' ' > "${OUTPUT_DIR}/vocab.bpe.${merge_ops}"
done


@ -6,27 +6,5 @@ EOS_TOKEN = '<\s>'
# special PAD, UNKNOWN, BEGIN-OF-STRING, END-OF-STRING tokens
PAD, UNK, BOS, EOS = [0, 1, 2, 3]
# path to the BPE vocabulary file, relative to the data directory, it should
# point to file generated by subword-nmt/get_vocab.py
VOCAB_FNAME = 'vocab.bpe.32000'
# paths to source and target training files, relative to the data directory, it
# should point to BPE-encoded files, generated by subword-nmt/apply_bpe.py
SRC_TRAIN_FNAME = 'train.tok.clean.bpe.32000.en'
TGT_TRAIN_FNAME = 'train.tok.clean.bpe.32000.de'
# paths to source and target validation files, relative to the data directory,
# it should point to BPE-encoded files, generated by subword-nmt/apply_bpe.py
SRC_VAL_FNAME = 'newstest_dev.tok.clean.bpe.32000.en'
TGT_VAL_FNAME = 'newstest_dev.tok.clean.bpe.32000.de'
# path to the test source file, relative to the data directory, it should point
# to BPE-encoded file, generated by subword-nmt/apply_bpe.py
SRC_TEST_FNAME = 'newstest2014.tok.bpe.32000.en'
# path to the test target file, relative to the data directory, it should point
# to plaintext file, tokenization is performed by the sacrebleu package
TGT_TEST_TARGET_FNAME = 'newstest2014.de'
# path to the moses detokenizer, relative to the data directory
DETOKENIZER = 'mosesdecoder/scripts/tokenizer/detokenizer.perl'


@ -28,17 +28,17 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
:param seq: list of sequences
"""
lengths = [len(s) for s in seq]
lengths = torch.tensor([len(s) for s in seq], dtype=torch.int64)
batch_length = max(lengths)
shape = (batch_length, len(seq))
shape = (len(seq), batch_length)
seq_tensor = torch.full(shape, config.PAD, dtype=torch.int64)
for i, s in enumerate(seq):
end_seq = lengths[i]
seq_tensor[:end_seq, i].copy_(s[:end_seq])
seq_tensor[i, :end_seq].copy_(s[:end_seq])
if batch_first:
if not batch_first:
seq_tensor = seq_tensor.t()
return (seq_tensor, lengths)
@ -81,6 +81,71 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
return single_collate
class RawTextDataset(Dataset):
def __init__(self, raw_data=None, raw_datafile=None, tokenizer=None,
sort=False, max_size=None):
self.tokenizer = tokenizer
self.sorted = False
if raw_datafile:
with open(raw_datafile, 'r') as f:
self.raw_data = f.readlines()
else:
self.raw_data = raw_data
if max_size:
self.raw_data = self.raw_data[:max_size]
self.lengths = [len(s.split()) for s in self.raw_data]
if sort:
self.sort_by_length()
def __getitem__(self, idx):
raw = self.raw_data[idx]
tokenized = self.tokenizer.tokenize(raw)
return tokenized
def unsort(self, array):
"""
"Unsorts" given array (restores original order of elements before
dataset was sorted by sequence length).
:param array: array to be "unsorted"
"""
if self.sorted:
inverse = sorted(enumerate(self.indices), key=itemgetter(1))
array = [array[i[0]] for i in inverse]
return array
def sort_by_length(self):
output = sorted(
enumerate(self.raw_data),
key=lambda x: len(x[1].split()),
)
self.indices, self.raw_data = zip(*output)
self.lengths = [self.lengths[idx] for idx in self.indices]
self.sorted = True
def __len__(self):
return len(self.raw_data)
def get_loader(self, batch_size=1, num_workers=0, batch_first=False,
pad=False, repeat=1):
collate_fn = build_collate_fn(batch_first, parallel=False,
sort=True)
sampler = StaticDistributedSampler(self, batch_size, pad, repeat)
return DataLoader(self,
batch_size=batch_size,
collate_fn=collate_fn,
sampler=sampler,
num_workers=num_workers,
pin_memory=True,
drop_last=False)
class TextDataset(Dataset):
def __init__(self, src_fname, tokenizer, min_len=None, max_len=None,
sort=False, max_size=None):


@ -183,7 +183,7 @@ class BucketingSampler(DistributedSampler):
bucket_ids.clamp_(0, num_buckets - 1)
# build buckets
all_indices = torch.tensor(range(self.data_len))
all_indices = torch.arange(self.data_len)
self.buckets = []
self.num_samples = 0
global_bs = self.global_batch_size
@ -226,7 +226,7 @@ class BucketingSampler(DistributedSampler):
class StaticDistributedSampler(Sampler):
def __init__(self, dataset, batch_size, pad, world_size=None, rank=None):
def __init__(self, dataset, batch_size, pad, repeat=1, world_size=None, rank=None):
"""
Constructor for the StaticDistributedSampler.
@ -247,11 +247,12 @@ class StaticDistributedSampler(Sampler):
global_batch_size = batch_size * world_size
data_len = len(dataset)
num_samples = (data_len + global_batch_size - 1) \
repeated_data_len = int(len(dataset) * repeat)
num_samples = (repeated_data_len + global_batch_size - 1) \
// global_batch_size * global_batch_size
self.num_samples = num_samples
indices = list(range(data_len))
indices = list(range(repeated_data_len))
if pad:
# pad dataset to a multiple of global_batch_size samples, uses
# sample with idx 0 as pad
@ -267,6 +268,7 @@ class StaticDistributedSampler(Sampler):
indices = indices.view(-1)
# remove temporary pad
indices = indices[indices != -1]
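            # with repeat > 1 the index values can exceed data_len, so fold
            # them back into the real dataset range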
indices = indices % data_len
indices = indices.tolist()
self.indices = indices


@ -2,6 +2,9 @@ import logging
from collections import defaultdict
from functools import partial
import torch
import subword_nmt.apply_bpe
import sacremoses
import seq2seq.data.config as config
@ -9,37 +12,53 @@ class Tokenizer:
"""
Tokenizer class.
"""
def __init__(self, vocab_fname=None, pad=1, separator='@@'):
def __init__(self, vocab_fname=None, bpe_fname=None, lang=None, pad=1,
separator='@@'):
"""
Constructor for the Tokenizer class.
:param vocab_fname: path to the file with vocabulary
:param bpe_fname: path to the file with bpe codes
:param pad: pads vocabulary to a multiple of 'pad' tokens
:param separator: tokenization separator
"""
self.separator = separator
self.lang = lang
if bpe_fname:
with open(bpe_fname, 'r') as bpe_codes:
self.bpe = subword_nmt.apply_bpe.BPE(bpe_codes)
if vocab_fname:
self.separator = separator
self.build_vocabulary(vocab_fname, pad)
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
if lang:
self.init_moses(lang)
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
def init_moses(self, lang):
self.moses_tokenizer = sacremoses.MosesTokenizer(lang['src'])
self.moses_detokenizer = sacremoses.MosesDetokenizer(lang['tgt'])
self.pad_vocabulary(vocab, pad)
def build_vocabulary(self, vocab_fname, pad):
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
self.vocab_size = len(vocab)
logging.info(f'Size of vocabulary: {self.vocab_size}')
self.pad_vocabulary(vocab, pad)
self.tok2idx = defaultdict(partial(int, config.UNK))
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.vocab_size = len(vocab)
logging.info(f'Size of vocabulary: {self.vocab_size}')
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
self.tok2idx = defaultdict(partial(int, config.UNK))
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
def pad_vocabulary(self, vocab, pad):
"""
@ -58,8 +77,10 @@ class Tokenizer:
def get_state(self):
logging.info(f'Saving state of the tokenizer')
state = {
'lang': self.lang,
'separator': self.separator,
'vocab_size': self.vocab_size,
'bpe': self.bpe,
'tok2idx': self.tok2idx,
'idx2tok': self.idx2tok,
}
@ -67,11 +88,15 @@ class Tokenizer:
def set_state(self, state):
logging.info(f'Restoring state of the tokenizer')
self.lang = state['lang']
self.separator = state['separator']
self.vocab_size = state['vocab_size']
self.bpe = state['bpe']
self.tok2idx = state['tok2idx']
self.idx2tok = state['idx2tok']
self.init_moses(self.lang)
def segment(self, line):
"""
Tokenizes single sentence and adds special BOS and EOS tokens.
@ -85,7 +110,14 @@ class Tokenizer:
entry = [config.BOS] + entry + [config.EOS]
return entry
def detokenize(self, inputs, delim=' '):
def tokenize(self, line):
tokenized = self.moses_tokenizer.tokenize(line, return_str=True)
bpe = self.bpe.process_line(tokenized)
segmented = self.segment(bpe)
tensor = torch.tensor(segmented)
return tensor
def detokenize_bpe(self, inp, delim=' '):
"""
Detokenizes single sentence and removes token separator characters.
@ -94,7 +126,7 @@ class Tokenizer:
returns: string representing detokenized sentence
"""
detok = delim.join([self.idx2tok[idx] for idx in inputs])
detok = delim.join([self.idx2tok[idx] for idx in inp])
detok = detok.replace(self.separator + ' ', '')
detok = detok.replace(self.separator, '')
@ -103,3 +135,12 @@ class Tokenizer:
detok = detok.replace(config.PAD_TOKEN, '')
detok = detok.strip()
return detok
def detokenize_moses(self, inp):
output = self.moses_detokenizer.detokenize(inp.split())
return output
def detokenize(self, inp):
detok_bpe = self.detokenize_bpe(inp)
output = self.detokenize_moses(detok_bpe)
return output
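A hedged end-to-end sketch of the extended tokenizer (the file paths and sentence are made up; the vocab/BPE names mirror the WMT16 defaults used elsewhere in this change): raw text passes through Moses tokenization and BPE before segmentation, and `detokenize` reverses both steps.

```python
# Illustrative round trip through the extended Tokenizer.
tokenizer = Tokenizer(vocab_fname='data/wmt16_de_en/vocab.bpe.32000',
                      bpe_fname='data/wmt16_de_en/bpe.32000',
                      lang={'src': 'en', 'tgt': 'de'})

tensor = tokenizer.tokenize('A test sentence.')  # moses -> BPE -> token ids
text = tokenizer.detokenize(tensor.tolist())     # ids -> BPE merge -> moses
```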

View file

@ -8,7 +8,7 @@ class SequenceGenerator:
"""
Generator for the autoregressive inference with beam search decoding.
"""
def __init__(self, model, beam_size=5, max_seq_len=100, cuda=False,
def __init__(self, model, beam_size=5, max_seq_len=100,
len_norm_factor=0.6, len_norm_const=5,
cov_penalty_factor=0.1):
"""
@ -21,14 +21,12 @@ class SequenceGenerator:
:param model: model which implements generate method
:param beam_size: decoder beam size
:param max_seq_len: maximum decoder sequence length
:param cuda: whether to use cuda
:param len_norm_factor: length normalization factor
:param len_norm_const: length normalization constant
:param cov_penalty_factor: coverage penalty factor
"""
self.model = model
self.cuda = cuda
self.beam_size = beam_size
self.max_seq_len = max_seq_len
self.len_norm_factor = len_norm_factor
@ -51,18 +49,17 @@ class SequenceGenerator:
lengths: (batch_size) - lengths of generated translations
counter: number of iterations of the decoding loop
"""
device = initial_input.device
max_seq_len = self.max_seq_len
translation = torch.zeros(batch_size, max_seq_len, dtype=torch.int64)
lengths = torch.ones(batch_size, dtype=torch.int64)
active = torch.arange(0, batch_size, dtype=torch.int64)
base_mask = torch.arange(0, batch_size, dtype=torch.int64)
if self.cuda:
translation = translation.cuda()
lengths = lengths.cuda()
active = active.cuda()
base_mask = base_mask.cuda()
translation = torch.zeros(batch_size, max_seq_len, dtype=torch.int64,
device=device)
lengths = torch.ones(batch_size, dtype=torch.int64,
device=device)
active = torch.arange(0, batch_size, dtype=torch.int64,
device=device)
base_mask = torch.arange(0, batch_size, dtype=torch.int64,
device=device)
translation[:, 0] = BOS
words, context = initial_input, initial_context
@ -118,6 +115,7 @@ class SequenceGenerator:
lengths: (batch_size) - lengths of generated translations
counter: number of iterations of the decoding loop
"""
device = initial_input.device
beam_size = self.beam_size
norm_const = self.len_norm_const
norm_factor = self.len_norm_factor
@ -125,25 +123,19 @@ class SequenceGenerator:
cov_penalty_factor = self.cov_penalty_factor
translation = torch.zeros(batch_size * beam_size, max_seq_len,
dtype=torch.int64)
lengths = torch.ones(batch_size * beam_size, dtype=torch.int64)
scores = torch.zeros(batch_size * beam_size, dtype=torch.float32)
active = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
base_mask = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
dtype=torch.int64, device=device)
lengths = torch.ones(batch_size * beam_size,
dtype=torch.int64, device=device)
scores = torch.zeros(batch_size * beam_size,
dtype=torch.float32, device=device)
active = torch.arange(0, batch_size * beam_size,
dtype=torch.int64, device=device)
base_mask = torch.arange(0, batch_size * beam_size,
dtype=torch.int64, device=device)
global_offset = torch.arange(0, batch_size * beam_size, beam_size,
dtype=torch.int64)
eos_beam_fill = torch.tensor([0] + (beam_size - 1) * [float('-inf')])
if self.cuda:
translation = translation.cuda()
lengths = lengths.cuda()
active = active.cuda()
base_mask = base_mask.cuda()
scores = scores.cuda()
global_offset = global_offset.cuda()
eos_beam_fill = eos_beam_fill.cuda()
device=device, dtype=torch.int64)
eos_beam_fill = torch.tensor([0] + (beam_size - 1) * [float('-inf')],
dtype=torch.float32, device=device)
translation[:, 0] = BOS
@ -182,9 +174,8 @@ class SequenceGenerator:
context[1] = context[1].contiguous().view(batch_size * beam_size)
# context[1]: (batch * beam)
accu_attn_scores = torch.zeros(batch_size * beam_size, seq)
if self.cuda:
accu_attn_scores = accu_attn_scores.cuda()
accu_attn_scores = torch.zeros(batch_size * beam_size, seq,
dtype=torch.float32, device=device)
counter = 0
for idx in range(1, self.max_seq_len):
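The pattern above replaces the old `cuda` constructor flag throughout the generator; a minimal sketch of the idea, with illustrative names:

```python
import torch

# Derive the device from the input instead of threading a `cuda` flag,
# so decoding buffers land on CPU or GPU automatically.
def make_buffers(initial_input, batch_size, max_seq_len):
    device = initial_input.device
    translation = torch.zeros(batch_size, max_seq_len,
                              dtype=torch.int64, device=device)
    lengths = torch.ones(batch_size, dtype=torch.int64, device=device)
    return translation, lengths
```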

View file

@ -0,0 +1,110 @@
import collections
import itertools
import numpy as np
from pytablewriter import MarkdownTableWriter
def interleave(*args):
return list(itertools.chain(*zip(*args)))
class AccuracyTable:
def __init__(self, unit):
self.data = collections.defaultdict(dict)
self.unit = unit
def add(self, key, data):
self.data[key].update(data)
def write(self, title, write_math):
writer = MarkdownTableWriter()
writer.table_name = f'{title}'
main_header = ['**Batch Size**', '**Beam Size**']
data_header = []
if 'fp32' in write_math:
data_header += [f'**Accuracy - FP32 ({self.unit})**']
if 'fp16' in write_math:
data_header += [f'**Accuracy - FP16 ({self.unit})**']
writer.headers = main_header + data_header
writer.value_matrix = []
for k, v in self.data.items():
batch_size, beam_size = k
row = [batch_size, beam_size]
if 'fp32' in write_math:
row.append(v['fp32'])
if 'fp16' in write_math:
row.append(v['fp16'])
writer.value_matrix.append(row)
writer.write_table()
class PerformanceTable:
def __init__(self, percentiles, unit, reverse_percentiles=False):
self.percentiles = percentiles
self.data = collections.defaultdict(dict)
self.unit = unit
self.reverse_percentiles = reverse_percentiles
def add(self, key, value):
math, value = next(iter(value.items()))
value = np.array(value)
if self.reverse_percentiles:
percentiles = [100 - p for p in self.percentiles]
else:
percentiles = self.percentiles
stats = []
for p in percentiles:
val = np.percentile(value, p)
stats.append(val * self.unit_convert[self.unit])
avg = value.mean() * self.unit_convert[self.unit]
self.data[key].update({math: (avg, stats)})
def write(self, title, math, relative=None, reverse_speedup=False):
writer = MarkdownTableWriter()
writer.table_name = f'{title} - {math.upper()}'
main_header = ['**Batch Size**', '**Beam Size**']
data_header = [f'**Avg ({self.unit})**']
data_header += [f'**{p}% ({self.unit})**' for p in self.percentiles]
if relative:
speedup_header = ['**Speedup**'] * len(data_header)
data_header = interleave(data_header, speedup_header)
writer.headers = main_header + data_header
writer.value_matrix = []
for k, v in self.data.items():
batch_size, beam_size = k
avg, res_percentiles = v[math]
main = [batch_size, beam_size]
data = [avg, *res_percentiles]
if relative:
rel = self.data[k][relative]
rel_avg, rel_res_percentiles = rel
rel = [rel_avg, *rel_res_percentiles]
speedup = [d / r for (r, d) in zip(rel, data)]
if reverse_speedup:
speedup = [1 / s for s in speedup]
data = interleave(data, speedup)
writer.value_matrix.append(main + data)
writer.write_table()
class LatencyTable(PerformanceTable):
def __init__(self, percentiles, unit='ms'):
super().__init__(percentiles, unit)
self.unit_convert = {'s': 1, 'ms': 1e3, 'us': 1e6}
class ThroughputTable(PerformanceTable):
def __init__(self, percentiles, unit='tok/s', reverse_percentiles=True):
super().__init__(percentiles, unit, reverse_percentiles)
self.unit_convert = {'tok/s': 1}
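A short usage sketch for the new table helpers (all numbers are placeholders):

```python
# Record measurements under (batch_size, beam_size) keys, then render
# markdown tables via pytablewriter.
latency = LatencyTable(percentiles=(50, 90, 99), unit='ms')
latency.add((128, 5), {'fp16': [0.41, 0.39, 0.40]})  # per-batch runtimes in seconds
latency.write('Inference latency', 'fp16')

accuracy = AccuracyTable('BLEU')
accuracy.add((128, 5), {'fp16': 24.16})
accuracy.write('Inference accuracy', write_math=['fp16'])
```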

View file

@ -0,0 +1,238 @@
import logging
import subprocess
import time
import torch
import torch.distributed as dist
import seq2seq.data.config as config
import seq2seq.utils as utils
from seq2seq.inference.beam_search import SequenceGenerator
def gather_predictions(preds):
world_size = utils.get_world_size()
if world_size > 1:
all_preds = [preds.new(preds.size(0), preds.size(1)) for i in range(world_size)]
dist.all_gather(all_preds, preds)
preds = torch.cat(all_preds)
return preds
def run_sacrebleu(test_path, reference_path):
"""
Executes sacrebleu and returns BLEU score.
:param test_path: path to the test file
:param reference_path: path to the reference file
"""
sacrebleu_params = '--score-only -lc --tokenize intl'
logging.info(f'Running sacrebleu (parameters: {sacrebleu_params})')
sacrebleu = subprocess.run([f'sacrebleu --input {test_path} \
{reference_path} {sacrebleu_params}'],
stdout=subprocess.PIPE, shell=True)
test_bleu = round(float(sacrebleu.stdout.strip()), 2)
return test_bleu
class Translator:
"""
Translator runs validation on test dataset, executes inference, optionally
computes BLEU score using sacrebleu.
"""
def __init__(self,
model,
tokenizer,
loader=None,
beam_size=5,
len_norm_factor=0.6,
len_norm_const=5.0,
cov_penalty_factor=0.1,
max_seq_len=50,
print_freq=1,
reference=None,
):
self.model = model
self.tokenizer = tokenizer
self.loader = loader
self.insert_target_start = [config.BOS]
self.insert_src_start = [config.BOS]
self.insert_src_end = [config.EOS]
self.batch_first = model.batch_first
self.beam_size = beam_size
self.print_freq = print_freq
self.reference = reference
self.distributed = (utils.get_world_size() > 1)
self.generator = SequenceGenerator(
model=self.model,
beam_size=beam_size,
max_seq_len=max_seq_len,
len_norm_factor=len_norm_factor,
len_norm_const=len_norm_const,
cov_penalty_factor=cov_penalty_factor)
def run(self, calc_bleu=True, epoch=None, iteration=None, eval_path=None,
summary=False, warmup=0, reference_path=None):
"""
Runs translation on test dataset.
:param calc_bleu: if True compares results with reference and computes
BLEU score
:param epoch: index of the current epoch
:param iteration: index of the current iteration
:param eval_path: path to the file for saving results
:param summary: if True prints summary
:param reference_path: path to the file with reference translation
"""
if reference_path is None:
reference_path = self.reference
device = next(self.model.parameters()).device
test_bleu = torch.tensor([0.], device=device)
rank = utils.get_rank()
logging.info(f'Running evaluation on test set')
self.model.eval()
output, eval_stats = self.evaluate(self.loader, epoch, iteration,
warmup, summary)
output = output[:len(self.loader.dataset)]
output = self.loader.dataset.unsort(output)
if rank == 0 and eval_path:
with open(eval_path, 'w') as eval_file:
lines = [line + '\n' for line in output]
eval_file.writelines(lines)
if calc_bleu:
test_bleu[0] = run_sacrebleu(eval_path, reference_path)
if summary:
logging.info(f'BLEU on test dataset: {test_bleu[0]:.2f}')
utils.barrier()
logging.info(f'Finished evaluation on test set')
if self.distributed:
dist.broadcast(test_bleu, 0)
if calc_bleu:
eval_stats['bleu'] = test_bleu[0].item()
else:
eval_stats['bleu'] = None
return output, eval_stats
def evaluate(self, loader, epoch=0, iteration=0, warmup=0, summary=False):
"""
Runs evaluation on test dataset.
:param epoch: index of the current epoch
:param iteration: index of the current iteration
:param summary: if True prints summary
"""
device = next(self.model.parameters()).device
batch_time = utils.AverageMeter(warmup, keep=True)
tot_tok_per_sec = utils.AverageMeter(warmup, keep=True)
iterations = utils.AverageMeter()
enc_seq_len = utils.AverageMeter()
dec_seq_len = utils.AverageMeter()
stats = {}
batch_size = loader.batch_size
global_batch_size = batch_size * utils.get_world_size()
beam_size = self.beam_size
bos = [self.insert_target_start] * (batch_size * beam_size)
bos = torch.tensor(bos, dtype=torch.int64, device=device)
if self.batch_first:
bos = bos.view(-1, 1)
else:
bos = bos.view(1, -1)
if beam_size == 1:
generator = self.generator.greedy_search
else:
generator = self.generator.beam_search
output = []
for i, (src, indices) in enumerate(loader):
translate_timer = time.time()
src, src_length = src
stats['total_enc_len'] = int(src_length.sum())
src = src.to(device)
src_length = src_length.to(device)
with torch.no_grad():
context = self.model.encode(src, src_length)
context = [context, src_length, None]
preds, lengths, counter = generator(batch_size, bos, context)
stats['total_dec_len'] = lengths.sum().item()
stats['iters'] = counter
indices = torch.tensor(indices).to(preds)
preds = preds.scatter(0, indices.unsqueeze(1).expand_as(preds), preds)
preds = gather_predictions(preds).cpu()
for pred in preds:
pred = pred.tolist()
detok = self.tokenizer.detokenize(pred)
output.append(detok)
elapsed = time.time() - translate_timer
batch_time.update(elapsed, batch_size)
total_tokens = stats['total_dec_len'] + stats['total_enc_len']
ttps = total_tokens / elapsed
tot_tok_per_sec.update(ttps, batch_size)
iterations.update(stats['iters'])
enc_seq_len.update(stats['total_enc_len'] / batch_size, batch_size)
dec_seq_len.update(stats['total_dec_len'] / batch_size, batch_size)
if i % self.print_freq == self.print_freq - 1:
log = []
log += f'TEST '
if epoch is not None:
log += f'[{epoch}]'
if iteration is not None:
log += f'[{iteration}]'
log += f'[{i}/{len(loader)}]\t'
log += f'Time {batch_time.val:.4f} ({batch_time.avg:.4f})\t'
log += f'Decoder iters {iterations.val:.1f} ({iterations.avg:.1f})\t'
log += f'Tok/s {tot_tok_per_sec.val:.0f} ({tot_tok_per_sec.avg:.0f})'
log = ''.join(log)
logging.info(log)
tot_tok_per_sec.reduce('sum')
enc_seq_len.reduce('mean')
dec_seq_len.reduce('mean')
batch_time.reduce('mean')
iterations.reduce('sum')
if summary and utils.get_rank() == 0:
time_per_sentence = (batch_time.avg / global_batch_size)
log = []
log += f'TEST SUMMARY:\n'
log += f'Lines translated: {len(loader.dataset)}\t'
log += f'Avg total tokens/s: {tot_tok_per_sec.avg:.0f}\n'
log += f'Avg time per batch: {batch_time.avg:.3f} s\t'
log += f'Avg time per sentence: {1000*time_per_sentence:.3f} ms\n'
log += f'Avg encoder seq len: {enc_seq_len.avg:.2f}\t'
log += f'Avg decoder seq len: {dec_seq_len.avg:.2f}\t'
log += f'Total decoder iterations: {int(iterations.sum)}'
log = ''.join(log)
logging.info(log)
eval_stats = {}
eval_stats['tokens_per_sec'] = tot_tok_per_sec.avg
eval_stats['runtimes'] = batch_time.vals
eval_stats['throughputs'] = tot_tok_per_sec.vals
return output, eval_stats

View file

@ -144,7 +144,7 @@ class BahdanauAttention(nn.Module):
if self.mask is not None:
mask = self.mask.unsqueeze(1).expand(b, t_q, t_k)
# I can't use -INF because of overflow check in pytorch
scores.data.masked_fill_(mask, -65504.0)
scores.masked_fill_(mask, -65504.0)
# Normalize the scores, softmax over t_k
scores_normalized = F.softmax(scores, dim=-1)
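-65504 is the most negative normal float16 value, i.e. `torch.finfo(torch.float16).min`; a small sketch of why it stands in for `-inf` under mixed precision:

```python
import torch

# Using the fp16 minimum instead of float('-inf') keeps the softmax
# input finite, so PyTorch's overflow checks pass while masked
# positions still get (numerically) zero probability.
scores = torch.zeros(2, 3, dtype=torch.float16)
mask = torch.tensor([[False, True, False],
                     [True, False, False]])
scores.masked_fill_(mask, torch.finfo(torch.float16).min)
probs = torch.softmax(scores.float(), dim=-1)  # masked entries ~ 0
```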

View file

@ -3,9 +3,11 @@ import math
import torch
from torch.nn.utils import clip_grad_norm_
import apex.amp._amp_state
from apex import amp
class Fp16Optimizer:
class FP16Optimizer:
"""
Mixed precision optimizer with dynamic loss scaling and backoff.
https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor
@ -34,12 +36,12 @@ class Fp16Optimizer:
for param, new_param in zip(params, new_params):
param.data.copy_(new_param.data)
def __init__(self, fp16_model, grad_clip=float('inf'), loss_scale=8192,
def __init__(self, model, grad_clip=float('inf'), loss_scale=8192,
dls_downscale=2, dls_upscale=2, dls_upscale_interval=128):
"""
Constructor for the Fp16Optimizer.
:param fp16_model: model (previously casted to half)
:param model: model
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
:param loss_scale: initial loss scale
@ -51,7 +53,7 @@ class Fp16Optimizer:
:param dls_upscale_interval: interval for loss scale upscaling
"""
logging.info('Initializing fp16 optimizer')
self.initialize_model(fp16_model)
self.initialize_model(model)
self.since_last_invalid = 0
self.loss_scale = loss_scale
@ -66,9 +68,11 @@ class Fp16Optimizer:
:param model: fp16 model
"""
logging.info('Converting model to half precision')
model.half()
logging.info('Initializing fp32 clone weights')
self.fp16_model = model
self.fp16_model.zero_grad()
self.model = model
self.model.zero_grad()
self.fp32_params = [param.to(torch.float32).detach()
for param in model.parameters()]
@ -93,7 +97,7 @@ class Fp16Optimizer:
loss.backward()
if update:
self.set_grads(self.fp32_params, self.fp16_model.parameters())
self.set_grads(self.fp32_params, self.model.parameters())
if self.loss_scale != 1.0:
for param in self.fp32_params:
param.grad.data /= self.loss_scale
@ -103,7 +107,7 @@ class Fp16Optimizer:
if math.isfinite(norm):
scheduler.step()
optimizer.step()
self.set_weights(self.fp16_model.parameters(),
self.set_weights(self.model.parameters(),
self.fp32_params)
self.since_last_invalid += 1
else:
@ -118,10 +122,10 @@ class Fp16Optimizer:
logging.info(f'Upscaling, new scale: {self.loss_scale}')
self.since_last_invalid = 0
self.fp16_model.zero_grad()
self.model.zero_grad()
class Fp32Optimizer:
class FP32Optimizer:
"""
Standard optimizer, computes backward and applies weight update.
"""
@ -161,3 +165,54 @@ class Fp32Optimizer:
scheduler.step()
optimizer.step()
self.model.zero_grad()
class AMPOptimizer:
"""
Optimizer compatible with AMP.
Uses AMP to apply loss scaling, computes backward and applies weight
update.
"""
def __init__(self, model, grad_clip=None, loss_scale=8192,
dls_upscale_interval=128):
"""
Constructor for the AMPOptimizer
:param model: model
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
:param loss_scale: initial loss scale
:param dls_upscale_interval: interval for loss scale upscaling
"""
logging.info('Initializing amp optimizer')
self.initialize_model(model)
self.grad_clip = grad_clip
loss_scaler = apex.amp._amp_state.loss_scalers[0]
loss_scaler._loss_scale = loss_scale
loss_scaler._scale_seq_len = dls_upscale_interval
def initialize_model(self, model):
"""
Initializes state of the model.
:param model: model
"""
self.model = model
self.model.zero_grad()
def step(self, loss, optimizer, scheduler, update=True):
"""
Performs one step of the optimizer.
:param loss: value of loss function
:param optimizer: optimizer
:param update: if True executes weight update
"""
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
if update:
if self.grad_clip != float('inf'):
clip_grad_norm_(amp.master_params(optimizer), self.grad_clip)
scheduler.step()
optimizer.step()
self.model.zero_grad()
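For readers comparing the two paths: a condensed, illustrative sketch of the dynamic loss-scaling backoff that `FP16Optimizer` implements by hand and `AMPOptimizer` delegates to Apex:

```python
import math

# Condensed control flow of dynamic loss scaling with backoff
# (illustrative; the real logic lives in FP16Optimizer.step).
def scaled_step(grad_norm, loss_scale, since_last_invalid,
                downscale=2, upscale=2, upscale_interval=128):
    if math.isfinite(grad_norm):
        # healthy gradients: apply the update, maybe grow the scale
        since_last_invalid += 1
        if since_last_invalid >= upscale_interval:
            loss_scale *= upscale
            since_last_invalid = 0
        applied = True
    else:
        # overflow: skip the update and shrink the scale
        loss_scale /= downscale
        since_last_invalid = 0
        applied = False
    return applied, loss_scale, since_last_invalid
```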

View file

@ -0,0 +1,32 @@
from pytablewriter import MarkdownTableWriter
class TrainingTable:
def __init__(self, acc_unit='BLEU', time_unit='min', perf_unit='tok/s'):
self.data = []
self.acc_unit = acc_unit
self.time_unit = time_unit
self.perf_unit = perf_unit
self.time_unit_convert = {'s': 1, 'min': 1/60, 'h': 1/3600}
def add(self, gpus, batch_size, accuracy, perf, time_to_train):
time_to_train *= self.time_unit_convert[self.time_unit]
if not accuracy:
accuracy = 0.0
accuracy = round(accuracy, 2)
self.data.append([gpus, batch_size, accuracy, perf, time_to_train])
def write(self, title, math):
writer = MarkdownTableWriter()
writer.table_name = f'{title}'
header = [f'**GPUs**',
f'**Batch Size / GPU**',
f'**Accuracy - {math.upper()} ({self.acc_unit})**',
f'**Throughput - {math.upper()} ({self.perf_unit})**',
f'**Time to Train - {math.upper()} ({self.time_unit})**',
]
writer.headers = header
writer.value_matrix = self.data
writer.write_table()
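Usage sketch (all numbers are placeholders):

```python
table = TrainingTable()
# gpus, per-GPU batch size, BLEU, throughput (tok/s), time to train (s)
table.add(8, 128, 24.45, 340000, 7200)
table.write('Training Summary', 'fp16')
```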

View file

@ -7,10 +7,12 @@ import numpy as np
import torch
import torch.optim
import torch.utils.data
from apex.parallel import DistributedDataParallel as DDP
from apex.parallel import DistributedDataParallel
from apex import amp
from seq2seq.train.fp_optimizers import Fp16Optimizer
from seq2seq.train.fp_optimizers import Fp32Optimizer
from seq2seq.train.fp_optimizers import FP16Optimizer
from seq2seq.train.fp_optimizers import FP32Optimizer
from seq2seq.train.fp_optimizers import AMPOptimizer
from seq2seq.train.lr_scheduler import WarmupMultiStepLR
from seq2seq.utils import AverageMeter
from seq2seq.utils import sync_workers
@ -28,16 +30,15 @@ class Seq2SeqTrainer:
print_freq=10,
save_freq=1000,
grad_clip=float('inf'),
batch_first=False,
save_info={},
save_path='.',
save_dir='.',
train_iterations=0,
checkpoint_filename='checkpoint%s.pth',
keep_checkpoints=5,
math='fp32',
cuda=True,
distributed=False,
loss_scaling={},
intra_epoch_eval=0,
prealloc_mode='always',
iter_size=1,
translator=None,
verbose=False):
@ -52,18 +53,17 @@ class Seq2SeqTrainer:
:param print_freq: prints short summary every 'print_freq' iterations
:param save_freq: saves checkpoint every 'save_freq' iterations
:param grad_clip: coefficient for gradient clipping
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param save_info: dict with additional state stored in each checkpoint
:param save_path: path to the directory for checkpoints
:param save_dir: path to the directory for checkpoints
:param train_iterations: total number of training iterations to execute
:param checkpoint_filename: name of files with checkpoints
:param keep_checkpoints: max number of checkpoints to keep
:param math: arithmetic type
:param cuda: if True use cuda, if False train on cpu
:param distributed: if True run distributed training
:param loss_scaling: options for dynamic loss scaling
:param intra_epoch_eval: number of additional eval runs within each
training epoch
:param prealloc_mode: controls preallocation,
choices=['off', 'once', 'always']
:param iter_size: number of iterations between weight updates
:param translator: instance of Translator, runs inference on test set
:param verbose: enables verbose logging
@ -73,38 +73,36 @@ class Seq2SeqTrainer:
self.criterion = criterion
self.epoch = 0
self.save_info = save_info
self.save_path = save_path
self.save_dir = save_dir
self.save_freq = save_freq
self.save_counter = 0
self.checkpoint_filename = checkpoint_filename
self.checkpoint_counter = cycle(range(keep_checkpoints))
self.opt_config = opt_config
self.cuda = cuda
self.distributed = distributed
self.device = next(model.parameters()).device
self.print_freq = print_freq
self.batch_first = batch_first
self.verbose = verbose
self.loss = None
self.translator = translator
self.intra_epoch_eval = intra_epoch_eval
self.iter_size = iter_size
self.prealloc_mode = prealloc_mode
self.preallocated = False
if cuda:
self.model = self.model.cuda()
self.criterion = self.criterion.cuda()
self.distributed = torch.distributed.is_initialized()
self.batch_first = model.batch_first
if math == 'fp16':
self.model = self.model.half()
params = self.model.parameters()
if distributed:
self.model = DDP(self.model)
if math == 'fp16':
self.fp_optimizer = Fp16Optimizer(self.model, grad_clip)
if math == 'manual_fp16':
self.fp_optimizer = FP16Optimizer(
self.model, grad_clip,
loss_scale=loss_scaling['init_scale'],
dls_upscale_interval=loss_scaling['upscale_interval']
)
params = self.fp_optimizer.fp32_params
elif math == 'fp32':
self.fp_optimizer = Fp32Optimizer(self.model, grad_clip)
params = self.model.parameters()
self.fp_optimizer = FP32Optimizer(self.model, grad_clip)
opt_name = opt_config.pop('optimizer')
self.optimizer = torch.optim.__dict__[opt_name](params, **opt_config)
@ -113,6 +111,24 @@ class Seq2SeqTrainer:
self.scheduler = WarmupMultiStepLR(self.optimizer, train_iterations,
**scheduler_config)
if math == 'fp16':
self.model, self.optimizer = amp.initialize(
self.model,
self.optimizer,
cast_model_outputs=torch.float16,
keep_batchnorm_fp32=False,
opt_level='O2')
self.fp_optimizer = AMPOptimizer(
self.model,
grad_clip,
loss_scale=loss_scaling['init_scale'],
dls_upscale_interval=loss_scaling['upscale_interval']
)
if self.distributed:
self.model = DistributedDataParallel(self.model)
def iterate(self, src, tgt, update=True, training=True):
"""
Performs one iteration of the training/validation.
@ -124,18 +140,14 @@ class Seq2SeqTrainer:
"""
src, src_length = src
tgt, tgt_length = tgt
src_length = torch.LongTensor(src_length)
tgt_length = torch.LongTensor(tgt_length)
src = src.to(self.device)
tgt = tgt.to(self.device)
src_length = src_length.to(self.device)
num_toks = {}
num_toks['tgt'] = int(sum(tgt_length - 1))
num_toks['src'] = int(sum(src_length))
if self.cuda:
src = src.cuda()
src_length = src_length.cuda()
tgt = tgt.cuda()
if self.batch_first:
output = self.model(src, src_length, tgt[:, :-1])
tgt_labels = tgt[:, 1:]
@ -177,8 +189,8 @@ class Seq2SeqTrainer:
batch_time = AverageMeter()
data_time = AverageMeter()
losses_per_token = AverageMeter(skip_first=False)
losses_per_sentence = AverageMeter(skip_first=False)
losses_per_token = AverageMeter()
losses_per_sentence = AverageMeter()
tot_tok_time = AverageMeter()
src_tok_time = AverageMeter()
@ -214,9 +226,15 @@ class Seq2SeqTrainer:
self.loss = losses_per_token.avg
if training and i in eval_iters:
test_bleu, _ = self.translator.run(calc_bleu=True,
epoch=self.epoch,
iteration=i)
eval_fname = f'eval_epoch_{self.epoch}_iter_{i}'
eval_path = os.path.join(self.save_dir, eval_fname)
_, eval_stats = self.translator.run(
calc_bleu=True,
epoch=self.epoch,
iteration=i,
eval_path=eval_path,
)
test_bleu = eval_stats['bleu']
log = []
log += [f'TRAIN [{self.epoch}][{i}/{len(data_loader)}]']
@ -225,7 +243,8 @@ class Seq2SeqTrainer:
logging.info(log)
self.model.train()
self.preallocate(data_loader, training=True)
self.preallocate(data_loader.batch_size,
data_loader.dataset.max_len, training=True)
if i % self.print_freq == 0:
phase = 'TRAIN' if training else 'VALIDATION'
@ -262,31 +281,37 @@ class Seq2SeqTrainer:
return losses_per_token.avg, tot_tok_time.avg
def preallocate(self, data_loader, training):
def preallocate(self, batch_size, max_length, training):
"""
Generates maximum sequence length batch and runs forward and backward
pass without updating model parameters.
:param data_loader: data loader
:param batch_size: batch size for preallocation
:param max_length: max sequence length for preallocation
:param training: if True preallocates memory for backward pass
"""
batch_size = data_loader.batch_size
max_len = data_loader.dataset.max_len
if self.prealloc_mode == 'always' or (self.prealloc_mode == 'once' and
not self.preallocated):
logging.info('Executing preallocation')
torch.cuda.empty_cache()
src_length = [max_len] * batch_size
tgt_length = [max_len] * batch_size
src_length = torch.full((batch_size,), max_length,
dtype=torch.int64)
tgt_length = torch.full((batch_size,), max_length,
dtype=torch.int64)
if self.batch_first:
shape = (batch_size, max_len)
else:
shape = (max_len, batch_size)
if self.batch_first:
shape = (batch_size, max_length)
else:
shape = (max_length, batch_size)
src = torch.full(shape, 4, dtype=torch.int64)
tgt = torch.full(shape, 4, dtype=torch.int64)
src = src, src_length
tgt = tgt, tgt_length
self.iterate(src, tgt, update=False, training=training)
self.model.zero_grad()
src = torch.full(shape, 4, dtype=torch.int64)
tgt = torch.full(shape, 4, dtype=torch.int64)
src = src, src_length
tgt = tgt, tgt_length
self.iterate(src, tgt, update=False, training=training)
self.model.zero_grad()
self.preallocated = True
def optimize(self, data_loader):
"""
@ -297,11 +322,12 @@ class Seq2SeqTrainer:
"""
torch.set_grad_enabled(True)
self.model.train()
torch.cuda.empty_cache()
self.preallocate(data_loader, training=True)
self.preallocate(data_loader.batch_size, data_loader.dataset.max_len,
training=True)
output = self.feed_data(data_loader, training=True)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
def evaluate(self, data_loader):
@ -313,11 +339,12 @@ class Seq2SeqTrainer:
"""
torch.set_grad_enabled(False)
self.model.eval()
torch.cuda.empty_cache()
self.preallocate(data_loader, training=False)
self.preallocate(data_loader.batch_size, data_loader.dataset.max_len,
training=False)
output = self.feed_data(data_loader, training=False)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
def load(self, filename):
@ -352,7 +379,7 @@ class Seq2SeqTrainer:
"""
def write_checkpoint(state, filename):
filename = os.path.join(self.save_path, filename)
filename = os.path.join(self.save_dir, filename)
logging.info(f'Saving model to {filename}')
torch.save(state, filename)
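To summarize the new precision plumbing: `manual_fp16` keeps the hand-rolled master-weights path, `fp32` is unchanged, and `fp16` now routes through Apex AMP with opt level O2. A sketch of the selection logic (condensed from the constructor above, not verbatim):

```python
# Illustrative mapping from --math to the optimizer wrappers.
def pick_fp_optimizer(math, model, grad_clip, loss_scaling):
    if math == 'manual_fp16':
        return FP16Optimizer(
            model, grad_clip,
            loss_scale=loss_scaling['init_scale'],
            dls_upscale_interval=loss_scaling['upscale_interval'])
    if math == 'fp32':
        return FP32Optimizer(model, grad_clip)
    if math == 'fp16':
        # model/optimizer must first go through amp.initialize(opt_level='O2')
        return AMPOptimizer(
            model, grad_clip,
            loss_scale=loss_scaling['init_scale'],
            dls_upscale_interval=loss_scaling['upscale_interval'])
    raise ValueError(f'unknown math mode: {math}')
```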

View file

@ -61,7 +61,7 @@ def broadcast_seeds(seeds, device):
:param device: torch.device
"""
if torch.distributed.is_available() and torch.distributed.is_initialized():
seeds_tensor = torch.LongTensor(seeds).to(device)
seeds_tensor = torch.tensor(seeds, dtype=torch.int64, device=device)
torch.distributed.broadcast(seeds_tensor, 0)
seeds = seeds_tensor.tolist()
return seeds
@ -110,13 +110,10 @@ def setup_seeds(master_seed, epochs, device):
def barrier():
"""
Works as a temporary distributed barrier, currently pytorch
doesn't implement barrier for NCCL backend.
Calls all_reduce on dummy tensor and synchronizes with GPU.
Calls torch.distributed.barrier() if distributed training is in use
"""
if torch.distributed.is_available() and torch.distributed.is_initialized():
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
torch.cuda.synchronize()
torch.distributed.barrier()
def get_rank():
@ -244,7 +241,7 @@ def log_env_info():
def pad_vocabulary(math):
if math == 'fp16':
if math == 'fp16' or math == 'manual_fp16':
pad_vocab = 8
elif math == 'fp32':
pad_vocab = 1
@ -269,78 +266,6 @@ def benchmark(test_acc, target_acc, test_perf, target_perf):
passed &= test(test_perf, target_perf, 'Performance')
return passed
class AverageMeter:
"""
Computes and stores the average and current value
"""
def __init__(self, skip_first=True):
self.reset()
self.skip = skip_first
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
def update(self, val, n=1):
self.val = val
if self.skip:
self.skip = False
else:
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
def reduce(self, op):
"""
Reduces average value over all workers.
:param op: 'sum' or 'mean', reduction operator
"""
if op not in ('sum', 'mean'):
raise NotImplementedError
distributed = (get_world_size() > 1)
if distributed:
# Backward/forward compatibility around
# https://github.com/pytorch/pytorch/commit/540ef9b1fc5506369a48491af8a285a686689b36 and
# https://github.com/pytorch/pytorch/commit/044d00516ccd6572c0d6ab6d54587155b02a3b86
# To accommodate changes in PyTorch's distributed API
if hasattr(dist, "get_backend"):
_backend = dist.get_backend()
if hasattr(dist, "DistBackend"):
backend_enum_holder = dist.DistBackend
else:
backend_enum_holder = dist.Backend
else:
_backend = dist._backend
backend_enum_holder = dist.dist_backend
cuda = _backend == backend_enum_holder.NCCL
if cuda:
avg = torch.cuda.FloatTensor([self.avg])
_sum = torch.cuda.FloatTensor([self.sum])
else:
avg = torch.FloatTensor([self.avg])
_sum = torch.FloatTensor([self.sum])
try:
_reduce_op = dist.ReduceOp
except AttributeError:
_reduce_op = dist.reduce_op
dist.all_reduce(avg, op=_reduce_op.SUM)
dist.all_reduce(_sum, op=_reduce_op.SUM)
self.avg = avg.item()
self.sum = _sum.item()
if op == 'mean':
self.avg /= get_world_size()
self.sum /= get_world_size()
def debug_tensor(tensor, name):
"""
@ -356,3 +281,62 @@ def debug_tensor(tensor, name):
logging.info(f'MIN: {tensor.min()} MAX: {tensor.max()} '
f'AVG: {tensor.mean()} STD: {tensor.std()} '
f'NAN: {np.isnan(tensor).sum()} INF: {np.isinf(tensor).sum()}')
class AverageMeter:
"""
Computes and stores the average and current value
"""
def __init__(self, warmup=0, keep=False):
self.reset()
self.warmup = warmup
self.keep = keep
def reset(self):
self.val = 0
self.avg = 0
self.sum = 0
self.count = 0
self.iters = 0
self.vals = []
def update(self, val, n=1):
self.iters += 1
self.val = val
if self.iters > self.warmup:
self.sum += val * n
self.count += n
self.avg = self.sum / self.count
if self.keep:
self.vals.append(val)
def reduce(self, op):
"""
Reduces average value over all workers.
:param op: 'sum' or 'mean', reduction operator
"""
if op not in ('sum', 'mean'):
raise NotImplementedError
distributed = (get_world_size() > 1)
if distributed:
backend = dist.get_backend()
cuda = (backend == dist.Backend.NCCL)
if cuda:
avg = torch.cuda.FloatTensor([self.avg])
_sum = torch.cuda.FloatTensor([self.sum])
else:
avg = torch.FloatTensor([self.avg])
_sum = torch.FloatTensor([self.sum])
dist.all_reduce(avg)
dist.all_reduce(_sum)
self.avg = avg.item()
self.sum = _sum.item()
if op == 'mean':
self.avg /= get_world_size()
self.sum /= get_world_size()
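A short usage sketch of the reworked meter (values are placeholders):

```python
meter = AverageMeter(warmup=1, keep=True)
for v in (10.0, 2.0, 4.0):
    meter.update(v)
# the first (warmup) value is excluded from the running average,
# but kept in `vals` for percentile-style reporting
assert meter.avg == 3.0 and meter.vals == [10.0, 2.0, 4.0]
meter.reduce('mean')  # no-op unless torch.distributed is initialized
```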

View file

@ -3,6 +3,7 @@ import argparse
import logging
import os
import sys
import time
from ast import literal_eval
import torch.nn as nn
@ -17,9 +18,10 @@ from seq2seq.data.dataset import LazyParallelDataset
from seq2seq.data.dataset import ParallelDataset
from seq2seq.data.dataset import TextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.inference.translator import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.train.smoothing import LabelSmoothing
from seq2seq.train.table import TrainingTable
def parse_args():
@ -44,17 +46,55 @@ def parse_args():
dataset = parser.add_argument_group('dataset setup')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en',
help='path to the directory with training/test data')
dataset.add_argument('--max-size', default=None, type=int,
help='use at most MAX_SIZE elements from training \
dataset (useful for benchmarking), by default \
uses entire dataset')
dataset.add_argument('--src-lang',
default='en',
help='source language')
dataset.add_argument('--tgt-lang',
default='de',
help='target language')
dataset.add_argument('--vocab',
default='vocab.bpe.32000',
help='path to the vocabulary file \
(relative to DATASET_DIR directory)')
dataset.add_argument('-bpe', '--bpe-codes', default='bpe.32000',
help='path to the file with bpe codes \
(relative to DATASET_DIR directory)')
dataset.add_argument('--train-src',
default='train.tok.clean.bpe.32000.en',
help='path to the training source data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--train-tgt',
default='train.tok.clean.bpe.32000.de',
help='path to the training target data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--val-src',
default='newstest_dev.tok.clean.bpe.32000.en',
help='path to the validation source data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--val-tgt',
default='newstest_dev.tok.clean.bpe.32000.de',
help='path to the validation target data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--test-src',
default='newstest2014.tok.bpe.32000.en',
help='path to the test source data file \
(relative to DATASET_DIR directory)')
dataset.add_argument('--test-tgt',
default='newstest2014.de',
help='path to the test target data file \
(relative to DATASET_DIR directory)')
# results
results = parser.add_argument_group('results setup')
results.add_argument('--results-dir', default='results',
help='path to directory with results, it will be \
automatically created if it does not exist')
results.add_argument('--save', default='gnmt',
results.add_argument('--save-dir', default='gnmt',
help='defines subdirectory within RESULTS_DIR for \
results from this training run')
results.add_argument('--print-freq', default=10, type=int,
@ -63,7 +103,7 @@ def parse_args():
# model
model = parser.add_argument_group('model setup')
model.add_argument('--hidden-size', default=1024, type=int,
help='model hidden size')
help='hidden size of the model')
model.add_argument('--num-layers', default=4, type=int,
help='number of RNN layers in encoder and in decoder')
model.add_argument('--dropout', default=0.2, type=float,
@ -79,12 +119,16 @@ def parse_args():
# setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', default='fp16', choices=['fp16', 'fp32'],
help='arithmetic type')
general.add_argument('--math', default='fp16',
choices=['fp16', 'fp32', 'manual_fp16'],
help='precision')
general.add_argument('--seed', default=None, type=int,
help='master seed for random number generators, if \
"seed" is undefined then the master seed will be \
sampled from random.SystemRandom()')
general.add_argument('--prealloc-mode', default='always', type=str,
choices=['off', 'once', 'always'],
help='controls preallocation')
exclusive_group(group=general, name='eval', default=True,
help='run validation and test after every epoch')
@ -100,6 +144,10 @@ def parse_args():
# training
training = parser.add_argument_group('training setup')
dataset.add_argument('--train-max-size', default=None, type=int,
help='use at most TRAIN_MAX_SIZE elements from \
training dataset (useful for benchmarking), by \
default uses entire dataset')
training.add_argument('--train-batch-size', default=128, type=int,
help='training batch size per worker')
training.add_argument('--train-global-batch-size', default=None, type=int,
@ -121,10 +169,10 @@ def parse_args():
training.add_argument('--grad-clip', default=5.0, type=float,
help='enables gradient clipping and sets maximum \
norm of gradients')
training.add_argument('--max-length-train', default=50, type=int,
training.add_argument('--train-max-length', default=50, type=int,
help='maximum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--min-length-train', default=0, type=int,
training.add_argument('--train-min-length', default=0, type=int,
help='minimum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--train-loader-workers', default=2, type=int,
@ -149,6 +197,15 @@ def parse_args():
default="{}",
help='extra options for the optimizer')
# mixed precision loss scaling
loss_scaling = parser.add_argument_group(
'mixed precision loss scaling setup'
)
loss_scaling.add_argument('--init-scale', type=float, default=8192,
help='initial loss scale')
loss_scaling.add_argument('--upscale-interval', type=float, default=128,
help='loss upscaling interval')
# scheduler
scheduler = parser.add_argument_group('learning rate scheduler setup')
scheduler.add_argument('--warmup-steps', type=str, default='200',
@ -166,10 +223,10 @@ def parse_args():
val = parser.add_argument_group('validation setup')
val.add_argument('--val-batch-size', default=64, type=int,
help='batch size for validation')
val.add_argument('--max-length-val', default=125, type=int,
val.add_argument('--val-max-length', default=125, type=int,
help='maximum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--min-length-val', default=0, type=int,
val.add_argument('--val-min-length', default=0, type=int,
help='minimum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--val-loader-workers', default=0, type=int,
@ -179,10 +236,10 @@ def parse_args():
test = parser.add_argument_group('test setup')
test.add_argument('--test-batch-size', default=128, type=int,
help='batch size for test')
test.add_argument('--max-length-test', default=150, type=int,
test.add_argument('--test-max-length', default=150, type=int,
help='maximum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--min-length-test', default=0, type=int,
test.add_argument('--test-min-length', default=0, type=int,
help='minimum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--beam-size', default=5, type=int,
@ -232,6 +289,18 @@ def parse_args():
args = parser.parse_args()
args.lang = {'src': args.src_lang, 'tgt': args.tgt_lang}
args.save_dir = os.path.join(args.results_dir, args.save_dir)
args.vocab = os.path.join(args.dataset_dir, args.vocab)
args.bpe_codes = os.path.join(args.dataset_dir, args.bpe_codes)
args.train_src = os.path.join(args.dataset_dir, args.train_src)
args.train_tgt = os.path.join(args.dataset_dir, args.train_tgt)
args.val_src = os.path.join(args.dataset_dir, args.val_src)
args.val_tgt = os.path.join(args.dataset_dir, args.val_tgt)
args.test_src = os.path.join(args.dataset_dir, args.test_src)
args.test_tgt = os.path.join(args.dataset_dir, args.test_tgt)
args.warmup_steps = literal_eval(args.warmup_steps)
args.remain_steps = literal_eval(args.remain_steps)
args.decay_interval = literal_eval(args.decay_interval)
@ -239,6 +308,25 @@ def parse_args():
return args
def set_iter_size(train_iter_size, train_global_batch_size, train_batch_size):
"""
Automatically set train_iter_size based on train_global_batch_size,
world_size and per-worker train_batch_size
:param train_iter_size: initial number of iterations between weight updates
:param train_global_batch_size: global training batch size
:param train_batch_size: local training batch size
"""
if train_global_batch_size is not None:
global_bs = train_global_batch_size
bs = train_batch_size
world_size = utils.get_world_size()
assert global_bs % (bs * world_size) == 0
train_iter_size = global_bs // (bs * world_size)
logging.info(f'Global batch size was set, '
f'Setting train_iter_size to {train_iter_size}')
return train_iter_size
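For example (hypothetical numbers), with 4 workers and a per-worker batch of 128, requesting a global batch of 1024 yields two gradient-accumulation steps:

```python
# assuming utils.get_world_size() == 4:
#   1024 // (128 * 4) == 2
iter_size = set_iter_size(train_iter_size=1,
                          train_global_batch_size=1024,
                          train_batch_size=128)
```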
def build_criterion(vocab_size, padding_idx, smoothing):
if smoothing == 0.:
logging.info(f'Building CrossEntropyLoss')
@ -252,47 +340,39 @@ def build_criterion(vocab_size, padding_idx, smoothing):
return criterion
@utils.timer('TOTAL RUNTIME', sync_gpu=False)
def main():
"""
Launches data-parallel multi-gpu training.
"""
training_start = time.time()
args = parse_args()
device = utils.set_device(args.cuda, args.local_rank)
distributed = utils.init_distributed(args.cuda)
utils.init_distributed(args.cuda)
args.rank = utils.get_rank()
if not args.cudnn:
torch.backends.cudnn.enabled = False
# create directory for results
save_path = os.path.join(args.results_dir, args.save)
args.save_path = save_path
os.makedirs(save_path, exist_ok=True)
os.makedirs(args.save_dir, exist_ok=True)
# setup logging
log_filename = f'log_rank_{utils.get_rank()}.log'
utils.setup_logging(args.log_all_ranks,
os.path.join(save_path, log_filename))
os.path.join(args.save_dir, log_filename))
if args.env:
utils.log_env_info()
logging.info(f'Saving results to: {save_path}')
logging.info(f'Saving results to: {args.save_dir}')
logging.info(f'Run arguments: {args}')
# automatically set train_iter_size based on train_global_batch_size,
# world_size and per-worker train_batch_size
if args.train_global_batch_size is not None:
global_bs = args.train_global_batch_size
bs = args.train_batch_size
world_size = utils.get_world_size()
assert global_bs % (bs * world_size) == 0
args.train_iter_size = global_bs // (bs * world_size)
logging.info(f'Global batch size was set in the config, '
f'Setting train_iter_size to {args.train_iter_size}')
args.train_iter_size = set_iter_size(args.train_iter_size,
args.train_global_batch_size,
args.train_batch_size)
worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed, args.epochs,
worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed,
args.epochs,
device)
worker_seed = worker_seeds[args.rank]
logging.info(f'Worker {args.rank} is using worker seed: {worker_seed}')
@ -300,48 +380,54 @@ def main():
# build tokenizer
pad_vocab = utils.pad_vocabulary(args.math)
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME),
pad_vocab)
tokenizer = Tokenizer(args.vocab, args.bpe_codes, args.lang, pad_vocab)
# build datasets
train_data = LazyParallelDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_TRAIN_FNAME),
tgt_fname=os.path.join(args.dataset_dir, config.TGT_TRAIN_FNAME),
src_fname=args.train_src,
tgt_fname=args.train_tgt,
tokenizer=tokenizer,
min_len=args.min_length_train,
max_len=args.max_length_train,
min_len=args.train_min_length,
max_len=args.train_max_length,
sort=False,
max_size=args.max_size)
max_size=args.train_max_size,
)
val_data = ParallelDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_VAL_FNAME),
tgt_fname=os.path.join(args.dataset_dir, config.TGT_VAL_FNAME),
src_fname=args.val_src,
tgt_fname=args.val_tgt,
tokenizer=tokenizer,
min_len=args.min_length_val,
max_len=args.max_length_val,
sort=True)
min_len=args.val_min_length,
max_len=args.val_max_length,
sort=True,
)
test_data = TextDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_TEST_FNAME),
src_fname=args.test_src,
tokenizer=tokenizer,
min_len=args.min_length_test,
max_len=args.max_length_test,
sort=True)
min_len=args.test_min_length,
max_len=args.test_max_length,
sort=True,
)
vocab_size = tokenizer.vocab_size
# build GNMT model
model_config = {'hidden_size': args.hidden_size,
'vocab_size': vocab_size,
'num_layers': args.num_layers,
'dropout': args.dropout, 'batch_first': False,
'share_embedding': args.share_embedding}
model = GNMT(vocab_size=vocab_size, **model_config)
'dropout': args.dropout,
'batch_first': False,
'share_embedding': args.share_embedding,
}
model = GNMT(**model_config).to(device)
logging.info(model)
batch_first = model.batch_first
# define loss function (criterion) and optimizer
criterion = build_criterion(vocab_size, config.PAD, args.smoothing)
criterion = build_criterion(vocab_size, config.PAD,
args.smoothing).to(device)
opt_config = {'optimizer': args.optimizer, 'lr': args.lr}
opt_config.update(literal_eval(args.optimizer_extra))
@ -384,47 +470,52 @@ def main():
tokenizer=tokenizer,
loader=test_loader,
beam_size=args.beam_size,
max_seq_len=args.max_length_test,
max_seq_len=args.test_max_length,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir,
save_path=args.save_path)
reference=args.test_tgt,
)
# create trainer
total_train_iters = len(train_loader) // args.train_iter_size * args.epochs
save_info = {'model_config': model_config, 'config': args, 'tokenizer':
tokenizer.get_state()}
save_info = {
'model_config': model_config,
'config': args,
'tokenizer': tokenizer.get_state()
}
loss_scaling = {
'init_scale': args.init_scale,
'upscale_interval': args.upscale_interval
}
trainer_options = dict(
model=model,
criterion=criterion,
grad_clip=args.grad_clip,
iter_size=args.train_iter_size,
save_path=save_path,
save_dir=args.save_dir,
save_freq=args.save_freq,
save_info=save_info,
opt_config=opt_config,
scheduler_config=scheduler_config,
train_iterations=total_train_iters,
batch_first=batch_first,
keep_checkpoints=args.keep_checkpoints,
math=args.math,
loss_scaling=loss_scaling,
print_freq=args.print_freq,
cuda=args.cuda,
distributed=distributed,
intra_epoch_eval=args.intra_epoch_eval,
translator=translator)
translator=translator,
prealloc_mode=args.prealloc_mode,
)
trainer_options['model'] = model
trainer = trainers.Seq2SeqTrainer(**trainer_options)
# optionally resume from a checkpoint
if args.resume:
checkpoint_file = args.resume
if os.path.isdir(checkpoint_file):
checkpoint_file = os.path.join(
checkpoint_file, 'model_best.pth')
checkpoint_file = os.path.join(checkpoint_file, 'model_best.pth')
if os.path.isfile(checkpoint_file):
trainer.load(checkpoint_file)
else:
@ -432,6 +523,7 @@ def main():
# training loop
best_loss = float('inf')
training_perf = []
break_training = False
test_bleu = None
for epoch in range(args.start_epoch, args.epochs):
@ -441,6 +533,7 @@ def main():
trainer.epoch = epoch
train_loss, train_perf = trainer.optimize(train_loader)
training_perf.append(train_perf)
# evaluate on validation set
if args.eval:
@ -455,7 +548,13 @@ def main():
if args.eval:
utils.barrier()
eval_stats = translator.run(calc_bleu=True, epoch=epoch)
eval_fname = f'eval_epoch_{epoch}'
eval_path = os.path.join(args.save_dir, eval_fname)
_, eval_stats = translator.run(
calc_bleu=True,
epoch=epoch,
eval_path=eval_path,
)
test_bleu = eval_stats['bleu']
if args.target_bleu and test_bleu >= args.target_bleu:
logging.info(f'Target accuracy reached')
@ -483,12 +582,22 @@ def main():
break
utils.barrier()
training_stop = time.time()
training_time = training_stop - training_start
logging.info(f'Total training time {training_time:.0f} s')
table = TrainingTable()
avg_training_perf = sum(training_perf) / len(training_perf)
table.add(utils.get_world_size(), args.train_batch_size, test_bleu,
avg_training_perf, training_time)
if utils.get_rank() == 0:
table.write('Training Summary', args.math)
passed = utils.benchmark(test_bleu, args.target_bleu,
train_perf, args.target_perf)
return passed
if not passed:
sys.exit(1)
if __name__ == '__main__':
passed = main()
if not passed:
sys.exit(1)
main()

View file

@ -1,21 +1,19 @@
#!/usr/bin/env python
import argparse
import logging
import os
import itertools
import sys
import warnings
from ast import literal_eval
from itertools import product
import torch
import torch.distributed as dist
import seq2seq.utils as utils
from seq2seq.data.dataset import TextDataset
from seq2seq.data.dataset import RawTextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.inference.translator import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.utils import setup_logging
from seq2seq.inference import tables
def parse_args():
@ -38,18 +36,22 @@ def parse_args():
# dataset
dataset = parser.add_argument_group('data setup')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en/',
help='path to directory with training/test data')
dataset.add_argument('-i', '--input', required=True,
help='full path to the input file (tokenized)')
dataset.add_argument('-o', '--output', required=True,
help='full path to the output file (tokenized)')
dataset.add_argument('-o', '--output', required=False,
help='full path to the output file \
if not specified, then the output will be printed')
dataset.add_argument('-r', '--reference', default=None,
help='full path to the file with reference \
translations (for sacrebleu)')
translations (for sacrebleu, raw text)')
dataset.add_argument('-m', '--model', required=True,
help='full path to the model checkpoint file')
exclusive_group(group=dataset, name='sort', default=True,
source = dataset.add_mutually_exclusive_group(required=True)
source.add_argument('-i', '--input', required=False,
help='full path to the input file (raw text)')
source.add_argument('-t', '--input-text', nargs='+', required=False,
help='raw input text')
exclusive_group(group=dataset, name='sort', default=False,
help='sorts dataset by sequence length')
# parameters
@ -69,9 +71,9 @@ def parse_args():
# general setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', nargs='+', default=['fp16'],
choices=['fp16', 'fp32'], help='arithmetic type')
choices=['fp16', 'fp32'], help='precision')
exclusive_group(group=general, name='env', default=True,
exclusive_group(group=general, name='env', default=False,
help='print info about execution env')
exclusive_group(group=general, name='bleu', default=True,
help='compares with reference translation and computes \
@ -102,6 +104,21 @@ def parse_args():
per second)')
benchmark.add_argument('--target-bleu', default=None, type=float,
help='target accuracy')
benchmark.add_argument('--repeat', nargs='+', default=[1], type=float,
help='loops over the dataset REPEAT times, flag \
accepts multiple arguments, one for each specified \
batch size')
benchmark.add_argument('--warmup', default=0, type=int,
help='warmup iterations for performance counters')
benchmark.add_argument('--percentiles', nargs='+', type=int,
default=(50, 90, 95, 99, 100),
help='Percentiles for confidence intervals for \
throughput/latency benchmarks')
exclusive_group(group=benchmark, name='tables', default=False,
help='print accuracy, throughput and latency results in \
tables')
# distributed
distributed = parser.add_argument_group('distributed setup')
distributed.add_argument('--rank', default=0, type=int,
@ -111,6 +128,9 @@ def parse_args():
args = parser.parse_args()
if args.input_text:
args.bleu = False
if args.bleu and args.reference is None:
parser.error('--bleu requires --reference')
@ -120,6 +140,11 @@ def parse_args():
if len(list(product(args.math, args.batch_size, args.beam_size))) > 1:
args.target_bleu = None
args.target_perf = None
args.repeat = dict(itertools.zip_longest(args.batch_size,
args.repeat,
fillvalue=1))
return args
@ -130,9 +155,10 @@ def main():
with length normalization and coverage penalty.
"""
args = parse_args()
utils.set_device(args.cuda, args.local_rank)
device = utils.set_device(args.cuda, args.local_rank)
utils.init_distributed(args.cuda)
setup_logging()
args.rank = utils.get_rank()
utils.setup_logging()
if args.env:
utils.log_env_info()
@ -150,54 +176,100 @@ def main():
# build GNMT model
tokenizer = Tokenizer()
tokenizer.set_state(checkpoint['tokenizer'])
vocab_size = tokenizer.vocab_size
model_config = checkpoint['model_config']
model_config['batch_first'] = args.batch_first
model = GNMT(vocab_size=vocab_size, **model_config)
model_config['vocab_size'] = tokenizer.vocab_size
model = GNMT(**model_config)
model.load_state_dict(checkpoint['state_dict'])
# construct the dataset
if args.input:
data = RawTextDataset(raw_datafile=args.input,
tokenizer=tokenizer,
sort=args.sort,
)
elif args.input_text:
data = RawTextDataset(raw_data=args.input_text,
tokenizer=tokenizer,
sort=args.sort,
)
latency_table = tables.LatencyTable(args.percentiles)
throughput_table = tables.ThroughputTable(args.percentiles)
accuracy_table = tables.AccuracyTable('BLEU')
dtype = {'fp32': torch.FloatTensor, 'fp16': torch.HalfTensor}
for (math, batch_size, beam_size) in product(args.math, args.batch_size,
args.beam_size):
logging.info(f'math: {math}, batch size: {batch_size}, '
f'beam size: {beam_size}')
if math == 'fp32':
dtype = torch.FloatTensor
if math == 'fp16':
dtype = torch.HalfTensor
model.type(dtype)
if args.cuda:
model = model.cuda()
model.type(dtype[math])
model = model.to(device)
model.eval()
# construct the dataset
test_data = TextDataset(src_fname=args.input,
tokenizer=tokenizer,
sort=args.sort)
# build the data loader
test_loader = test_data.get_loader(batch_size=batch_size,
batch_first=args.batch_first,
shuffle=False,
pad=True,
num_workers=0)
loader = data.get_loader(
batch_size=batch_size,
batch_first=args.batch_first,
pad=True,
repeat=args.repeat[batch_size],
num_workers=0,
)
# build the translator object
translator = Translator(model=model,
tokenizer=tokenizer,
loader=test_loader,
beam_size=beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir)
translator = Translator(
model=model,
tokenizer=tokenizer,
loader=loader,
beam_size=beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
print_freq=args.print_freq,
)
# execute the inference
stats = translator.run(calc_bleu=args.bleu, eval_path=args.output,
reference_path=args.reference, summary=True)
output, stats = translator.run(
calc_bleu=args.bleu,
eval_path=args.output,
summary=True,
warmup=args.warmup,
reference_path=args.reference,
)
# print translated outputs
if not args.output and args.rank == 0:
logging.info(f'Translated output:')
for out in output:
print(out)
key = (batch_size, beam_size)
latency_table.add(key, {math: stats['runtimes']})
throughput_table.add(key, {math: stats['throughputs']})
accuracy_table.add(key, {math: stats['bleu']})
if args.tables:
accuracy_table.write('Inference accuracy', args.math)
if 'fp16' in args.math and 'fp32' in args.math:
relative = 'fp32'
else:
relative = None
if 'fp32' in args.math:
throughput_table.write('Inference throughput', 'fp32')
if 'fp16' in args.math:
throughput_table.write('Inference throughput', 'fp16',
relative=relative)
if 'fp32' in args.math:
latency_table.write('Inference latency', 'fp32')
if 'fp16' in args.math:
latency_table.write('Inference latency', 'fp16',
relative=relative, reverse_speedup=True)
passed = utils.benchmark(stats['bleu'], args.target_bleu,
stats['tokens_per_sec'], args.target_perf)

View file

@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/pytorch:19.05-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}
# Install Python dependencies
RUN pip install --upgrade --no-cache-dir pip \
@ -21,9 +22,10 @@ RUN pip install --upgrade --no-cache-dir pip \
sentencepiece
RUN apt-get update
RUN apt-get install -y cmake pkg-config libprotobuf9v5 protobuf-compiler libprotobuf-dev libgoogle-perftools-dev
RUN apt-get install -y cmake pkg-config protobuf-compiler libprotobuf-dev libgoogle-perftools-dev
RUN git clone https://github.com/google/sentencepiece.git /workspace/sentencepiece
RUN cd /workspace/sentencepiece \
&& git checkout d4dd947 \
&& mkdir build \
&& cd build \
&& cmake .. \
@ -33,6 +35,6 @@ RUN cd /workspace/sentencepiece \
ENV PYTHONPATH=/workspace/translation/examples/translation/subword-nmt/
WORKDIR /workspace/translation
COPY . .
RUN git clone https://github.com/rsennrich/subword-nmt.git /workspace/translation/examples/translation/subword-nmt/
COPY . .
RUN pip install -e .

View file

@ -1,61 +1,180 @@
# Transformer
# Transformer For PyTorch
This implementation of the Transformer model architecture is based on the optimized implementation in [Facebook's Fairseq NLP toolkit](https://github.com/pytorch/fairseq), built on top of PyTorch. The original version in the Fairseq project was developed using Tensor Cores, which provide significant training speedup. Our implementation improves training performance and is tested on a DGX-1V 16GB.
This repository provides a script and recipe to train the Transformer model to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
# Requirements and installation
This repository contains a `Dockerfile` which extends the PyTorch NGC container and encapsulates all dependencies. Ensure you have the following software:
* [nvidia-docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.01-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [SacreBLEU 1.2.10](https://pypi.org/project/sacrebleu/1.2.10/)
**Table Of Contents**
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Glossary](#glossary)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-(8x-v100-16G))
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-(16x-v100-32G))
* [Inference performance results](#inference-performance-results)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
If you use multiprocessing for multi-threaded data loaders, the default shared memory segment size that the container runs with may not be enough. Therefore, we recommend you increase the shared memory size by issuing either:
## Model overview
The Transformer is a Neural Machine Translation (NMT) model which uses an attention mechanism to boost training speed and overall accuracy. The Transformer model was introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) and improved in [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
This implementation is based on the optimized implementation in [Facebook's Fairseq NLP toolkit](https://github.com/pytorch/fairseq), built on top of PyTorch.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 3.6x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
The Transformer model uses the standard NMT encoder-decoder architecture. Unlike other NMT models, it uses no recurrent connections and operates on a fixed-size context window.
The encoder stack is made up of N identical layers. Each layer is composed of the following sublayers:
1. Self-attention layer
2. Feedforward network (which is 2 fully-connected layers)
Like the encoder stack, the decoder stack is made up of N identical layers. Each layer is composed of the sublayers:
1. Self-attention layer
2. Multi-headed attention layer combining encoder outputs with results from
the previous self-attention layer.
3. Feedforward network (2 fully-connected layers)
The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the previously generated decoder tokens as inputs.
The model also applies embeddings on the input and output tokens, and adds a constant positional encoding. The positional encoding adds information about the position of each token.
<p align="center">
<img width="50%" src="./transformer.png" />
<br>
Figure 1. The architecture of a Transformer model.
</p>
The complete description of the Transformer architecture can be found in [Attention Is All You Need](https://arxiv.org/abs/1706.03762) paper.
### Default configuration
The Transformer uses the Byte Pair Encoding tokenization scheme built with the [Moses decoder](https://github.com/moses-smt/mosesdecoder). This is a lossy compression method (we drop information about white spaces). Tokenization is applied over the whole [WMT14](http://statmt.org/wmt14/) en-de dataset, including the test set. The default vocabulary size is 33708, excluding all special tokens. The encoder and decoder use shared embeddings.
We use 6 blocks in each of the encoder and decoder stacks. The self-attention layer computes its outputs according to the following formula: $`Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V`$. At each attention step, the model computes 16 different attention representations (which we will call attention heads) and concatenates them.
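For illustration, here is a minimal PyTorch sketch of the scaled dot-product attention defined above, with 16 heads computed in parallel and then concatenated. The tensor shapes and names are illustrative, not code from this repository:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    return torch.matmul(F.softmax(scores, dim=-1), v)

batch, heads, seq_len, d_k = 2, 16, 10, 64
q = k = v = torch.randn(batch, heads, seq_len, d_k)
out = scaled_dot_product_attention(q, k, v)                     # (2, 16, 10, 64)
out = out.transpose(1, 2).reshape(batch, seq_len, heads * d_k)  # concatenate heads
```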
We trained the Transformer model using the Adam optimizer with betas `(0.9, 0.997)`, epsilon `1e-9` and learning rate `6e-4`. We used the inverse square root training schedule, preceded by a linear warmup of 4000 steps.
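The learning rate schedule described above can be written as a short function. This is a sketch under the stated hyperparameters (peak learning rate 6e-4, 4000 warmup steps), not the repository's scheduler:

```python
def inverse_sqrt_lr(step, peak_lr=6e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

print(inverse_sqrt_lr(2000))   # mid-warmup: 3e-4
print(inverse_sqrt_lr(16000))  # past warmup: 6e-4 * 0.5 = 3e-4
```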
The implementation supports training in mixed precision. We use dynamic loss scaling and a custom mixed precision optimizer. Distributed multi-GPU and multi-node training is implemented with the `torch.distributed` module with the NCCL backend.
For inference, we use beam search with a default beam size of 5. Model performance is evaluated with the BLEU4 metric. For clarity, we report the internal (legacy) BLEU implementation as well as the external [SacreBleu](https://github.com/mjpost/sacreBLEU) score.
### Feature support matrix
The following features are supported by this model.<br>
| Feature | Transformer/PyT
|--------------------------|--------------------------
| Multi-GPU training with [Distributed Communication Package](https://pytorch.org/docs/stable/distributed.html) | Yes
| APEX | Yes
#### Features
Multi-GPU training with [Distributed Communication Package](https://pytorch.org/docs/stable/distributed.html)
Our model uses the torch.distributed package to implement efficient multi-GPU training with NCCL.
To enable multi-GPU training with torch.distributed, you have to initialize your model
identically in every process spawned by torch.distributed.launch; for efficiency, the only point of synchronization is gradient gathering.
For details, see the example sources in this repository or the
[PyTorch tutorial](https://pytorch.org/docs/stable/distributed.html).
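A minimal sketch of that initialization pattern, assuming the script is spawned by `torch.distributed.launch` (the `build_model` call is a hypothetical placeholder):

```python
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

torch.manual_seed(1)          # same seed in every process -> identical initial weights
model = build_model().cuda()  # hypothetical constructor; identical in every worker

# the only synchronization point: after backward(), gradients are
# all-reduced across workers before the optimizer step
for p in model.parameters():
    if p.grad is not None:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
```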
APEX - This implementation uses Apex's FP16_Optimizer API to perform mixed precision training.
The purpose of APEX is to provide an easy and intuitive framework for distributed training and mixed precision training.
For details, see official [APEX repository](https://github.com/NVIDIA/apex).
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
Mixed precision is enabled using the `--fp16` option in the `train.py` script. The script then builds a custom mixed precision optimizer. The forward and backward passes are computed in FP16 precision, with the exception of the loss function, which is computed in FP32. We keep a copy of the model in higher precision in order to perform an accurate weight update. After the update, the FP32 weights are copied back to the FP16 model. We use dynamic loss scaling with an initial scale of 2^7, increasing it by a factor of 2 every 2000 successful iterations. Overflow is checked after reducing gradients from all of the workers. If we encounter infs or NaNs, the whole batch is dropped.
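A simplified sketch of that dynamic loss scaling logic (initial scale 2^7, doubling after 2000 successful steps, dropping the batch on overflow). Halving the scale on overflow is the usual policy and is an assumption here, not taken from this repository's optimizer:

```python
import torch

scale, good_steps = 2.0 ** 7, 0

def scaled_step(loss, model, optimizer):
    global scale, good_steps
    (loss * scale).backward()                 # FP16 backward pass with scaled loss
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        scale, good_steps = scale / 2, 0      # overflow: shrink scale, drop the batch
    else:
        for g in grads:
            g.div_(scale)                     # unscale before updating master weights
        optimizer.step()                      # update FP32 master weights
        good_steps += 1
        if good_steps == 2000:
            scale, good_steps = scale * 2, 0  # double the scale after 2000 good steps
    optimizer.zero_grad()
```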
### Glossary
Attention layer - Layer that computes which elements of the input sequence, or of its hidden representation, contribute the most to the currently considered output element.
Beam search - A heuristic search algorithm which at each prediction step keeps the N most probable outputs as a base for further prediction.
BPE - Byte Pair Encoding, a compression algorithm that finds the most common pair of symbols in the data and replaces it with a new symbol absent from the data.
EOS - End of a sentence.
Self attention layer - Attention layer that computes hidden representation of input using the same tensor as query, key and value.
Token - A string that is representable within the model. We also refer to the token's position in the dictionary as a token. There are special non-string tokens: alphabet tokens (all characters in a dataset), EOS token, PAD token.
Tokenizer - Object that converts raw strings to sequences of tokens.
Vocabulary embedding - Layer that projects one-hot token representations to a high dimensional space which preserves some information about correlations between tokens.
## Setup
The following section lists the requirements in order to start training the Transformer model.
### Requirements
This repository contains a Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 19.03-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
- Running [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Transformer model on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository
```
--ipc=host
git clone --recurse-submodules https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/Translation/Transformer
```
Or
```
--shm-size=<requested memory size>
```
in the command line to `nvidia-docker run`. For more information, see [Setting The Shared Memory Flag](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#setincshmem) in the NVIDIA Container User Guide.
For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
DGX Documentation:
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
- [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
## Training using mixed precision with Tensor Cores
The training script provided in this project takes advantage of Tensor Cores to speed up training of the Transformer model (for a translation task in this example). Tensor Cores accelerate matrix multiplication math and are available on NVIDIA Volta and Turing based GPUs. For more information about how to use Tensor Cores, see the [Training With Mixed Precision Guide](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) to Mixed Precision Training on NVIDIA GPUs.
An additional resource for mixed precision training is NVIDIA's
[Apex](https://github.com/NVIDIA/apex), a PyTorch extension, that contains
utility libraries, such as AMP, which stands for Automatic Mixed Precision and enables the use of Tensor Cores with minimal code changes to existing PyTorch training scripts.
# Hyper parameters setting
To reach the BLEU score reported in the [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) research paper, we used mixed precision training with a batch size of 5120 per GPU and a learning rate of 6e-4 on a DGX-1V system with 8 Tesla V100s 16G. If you use a different setup, we recommend you scale your hyperparameters by applying the following rules:
1. To use FP32, reduce the batch size to 2560 and set the `--update-freq 2` and `--warmup-updates 8000` options.
2. To train on fewer GPUs, multiply `--update-freq` and `--warmup-updates` by the reciprocal of the scaling factor.
For example, when training in FP32 mode on 4 GPUs, use the `--update-freq=4` and `--warmup-updates 16000` options.
# Quick start guide
Perform the following steps to train using provided default parameters of the Transformer model on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset.
## Build and launch Transformer Docker container
2. Build and launch the Transformer PyTorch NGC container
```bash
docker build . -t your.repository:transformer
nvidia-docker run -it --rm --ipc=host -v /path/to/your/dataset:/container/dataset/path your.repository:transformer bash
nvidia-docker run -it --rm --ipc=host your.repository:transformer bash
```
## Downloading and preprocessing dataset
If you have already preprocessed data, use:
```bash
nvidia-docker run -it --rm --ipc=host -v path/to/your/data/:/data/wmt14_en_de_joined_dict your.repository:transformer bash
```
3. Download and preprocess dataset
Download and preprocess the WMT14 English-German dataset.
```bash
./run_preprocessing.sh
```
## Run training
After running this command, the processed dataset will be placed in the `/data/wmt14_en_de_joined_dict` directory.
4. Start training
The following command runs the training script that is distributed between 8 workers.
```bash
python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py /data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
@ -78,54 +197,155 @@ python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/tra
--fp16 \
--save-dir /workspace/checkpoints \
--distributed-init-method env://
```
**WARNING**: If you don't have access to sufficient disk space, use the `--save-interval $N` option. The checkpoints are ~2.5GB large. For example, it takes the Transformer model 16 epochs to reach a BLEU score of 28 points. The default option is to save the latest checkpoint, the best checkpoint and a checkpoint for every epoch, which means (16+1+1)*2.5GB = 45GB of disk space used. Specifying `--save-interval 5` reduces this to (16/5+1+1)*2.5GB = 12.5GB.
# Details
The script saves checkpoints every epoch to the directory specified in the `--save-dir` option. In addition, the best performing checkpoint (in terms of loss) and the latest checkpoints are saved separately.
**WARNING**: If you don't have access to sufficient disk space, use the `--save-interval $N` option. The checkpoints are ~2.5GB large. For example, it takes the Transformer model 16 epochs to reach the BLEU score of 28 points. The default option is to save the latest checkpoint, the best checkpoint and a checkpoint for every epoch, which means (16+1+1)*2.5GB = 45GB of disk space used. Specifying `--save-interval 5` reduces this to (16/5+1+1)*2.5GB = 12.5GB.
## Getting the data
The Transformer model was trained on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. Concatenation of the *commoncrawl*, *europarl* and *news-commentary* corpora is used as the training and validation dataset, and *newstest2014* is used as the test dataset.<br/>
This repository contains the `run_preprocessing.sh` script which will automatically download and preprocess the training and test datasets. By default, data will be stored in the `/data/wmt14_en_de_joined_dict` directory.<br/>
Our download script utilizes the [Moses decoder](https://github.com/moses-smt/mosesdecoder) to perform tokenization of the dataset and [subword-nmt](https://github.com/rsennrich/subword-nmt) to segment text into subword units (BPE). By default, the script builds a shared vocabulary of 33708 tokens, which is consistent with [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
## Running training
The default training configuration can be launched by running the `train.py` training script. By default, the script saves one checkpoint every epoch in addition to the latest and the best ones. The best checkpoint is considered the one with the lowest value of loss, not the one with the highest BLEU score. To override this behavior, use the `--save-interval $N` option to save epoch checkpoints every N epochs, or `--no-epoch-checkpoints` to disable them entirely (with this option the latest and the best checkpoints will still be saved). Specify the save directory with the `--save-dir` option.<br/>
In order to run multi-GPU training, launch the training script with `python -m torch.distributed.launch --nproc_per_node $N` prepended, where N is the number of GPUs.
### Scripts and sample code
The `preprocess.py` script performs binarization of the dataset obtained and tokenized by the `examples/translation/prepare-wmt14en2de.sh` script. The `train.py` script contains the training loop as well as statistics-gathering code. The steps performed in a single training step can be found in `fairseq/trainer.py` for FP32 precision or in `fairseq/fp16_trainer.py` for mixed precision. The model definition is placed in the file `fairseq/models/transformer.py`. Model-specific modules, including multi-headed attention and sinusoidal positional embedding, are inside the `fairseq/modules/` directory. Finally, the data wrappers are placed inside the `fairseq/data/` directory.
### Parameters
In this section we give a user-friendly description of the most common options used in the `train.py` script.
### Command-line options
`--arch` - select the specific configuration for the model. You can choose between various predefined hyperparameter values, such as the number of encoder/decoder blocks, the dropout value or the size of the hidden state representation.<br/>
`--share-all-embeddings` - use the same set of weights for encoder and decoder words embedding.<br/>
`--optimizer` - choose optimization algorithm.<br/>
`--clip-norm` - set a value that gradients will be clipped to.<br/>
`--lr-scheduler` - choose learning rate change strategy.<br/>
`--warmup-init-lr` - start linear warmup with a learning rate at this value.<br/>
`--warmup-updates` - set number of optimization steps after which linear warmup will end.<br/>
`--lr` - set learning rate.<br/>
`--min-lr` - prevent the learning rate from falling below this value under any learning rate schedule.<br/>
`--dropout` - set dropout value.<br/>
`--weight-decay` - set weight decay value.<br/>
`--criterion` - select loss function.<br/>
`--label-smoothing` - distribute value of one-hot labels between all entries of a dictionary. Value set by this option will be a value subtracted from one-hot label.<br/>
`--max-tokens` - set batch size in terms of tokens.<br/>
`--max-sentences` - set batch size in terms of sentences. Note that the actual batch size will then vary much more than with the `--max-tokens` option.<br/>
`--seed` - set random seed for NumPy and PyTorch RNGs.<br/>
`--max-epochs` - set the maximum number of epochs.<br/>
`--online-eval` - perform inference on test set and then compute BLEU score after every epoch.<br/>
`--ignore-case` - used with `--online-eval`, ignore case while computing BLEU score.<br/>
`--target-bleu` - works like `--online-eval` and sets a BLEU score threshold; once the threshold is attained, training stops.<br/>
`--fp16` - use mixed precision.<br/>
`--save-dir` - set directory for saving checkpoints.<br/>
`--distributed-init-method` - method for initializing the torch.distributed package. You can either provide addresses with the `tcp` method or use the environment variable initialization with the `env` method.<br/>
`--update-freq` - use gradient accumulation. Set the number of training steps across which gradients will be accumulated (see the sketch below).<br/>
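A self-contained sketch of the gradient accumulation performed by `--update-freq` (illustrative, not the repository's trainer):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]
update_freq = 4                          # corresponds to --update-freq 4

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = F.mse_loss(model(x), y)
    (loss / update_freq).backward()      # accumulate averaged gradients
    if (i + 1) % update_freq == 0:
        optimizer.step()                 # one weight update per update_freq batches
        optimizer.zero_grad()
```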
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
```
python train.py --help
```
The following (partial) output is printed when running the sample:
```
usage: train.py [-h] [--no-progress-bar] [--log-interval N]
[--log-format {json,none,simple,tqdm}] [--seed N] [--fp16]
[--profile PROFILE] [--task TASK]
[--skip-invalid-size-inputs-valid-test] [--max-tokens N]
[--max-sentences N] [--sentencepiece] [--train-subset SPLIT]
[--valid-subset SPLIT] [--max-sentences-valid N]
[--gen-subset SPLIT] [--num-shards N] [--shard-id ID]
[--distributed-world-size N]
[--distributed-rank DISTRIBUTED_RANK]
[--local_rank LOCAL_RANK]
[--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD]
[--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
--arch ARCH [--criterion CRIT] [--max-epoch N]
[--max-update N] [--target-bleu TARGET] [--clip-norm NORM]
[--sentence-avg] [--update-freq N] [--optimizer OPT]
[--lr LR_1,LR_2,...,LR_N] [--momentum M] [--weight-decay WD]
[--lr-scheduler LR_SCHEDULER] [--lr-shrink LS] [--min-lr LR]
[--min-loss-scale D] [--enable-parallel-backward-allred-opt]
[--parallel-backward-allred-opt-threshold N]
[--enable-parallel-backward-allred-opt-correctness-check]
[--save-dir DIR] [--restore-file RESTORE_FILE]
[--save-interval N] [--save-interval-updates N]
[--keep-interval-updates N] [--no-save]
[--no-epoch-checkpoints] [--validate-interval N] [--path FILE]
[--remove-bpe [REMOVE_BPE]] [--cpu] [--quiet] [--beam N]
[--nbest N] [--max-len-a N] [--max-len-b N] [--min-len N]
[--no-early-stop] [--unnormalized] [--no-beamable-mm]
[--lenpen LENPEN] [--unkpen UNKPEN]
[--replace-unk [REPLACE_UNK]] [--score-reference]
[--prefix-size PS] [--sampling] [--sampling-topk PS]
[--sampling-temperature N] [--print-alignment]
[--model-overrides DICT] [--online-eval] [--ignore-case]
[--bpe-codes CODES] [--fuse-dropout-add] [--fuse-relu-dropout]
```
### Getting the data
The Transformer model was trained on the [WMT14 English-German](http://statmt.org/wmt14/translation-task.html#Download) dataset. Concatenation of the *commoncrawl*, *europarl* and *news-commentary* corpora is used as the training and validation dataset, and *newstest2014* is used as the test dataset.<br/>
This repository contains the `run_preprocessing.sh` script, which automatically downloads and preprocesses the training and test datasets. By default, data will be stored in the `/data/wmt14_en_de_joined_dict` directory.<br/>
Our download script utilizes [Moses decoder](https://github.com/moses-smt/mosesdecoder) to perform tokenization of the dataset and [subword-nmt](https://github.com/rsennrich/subword-nmt) to segment text into subword units (BPE). By default, the script builds a shared vocabulary of 33708 tokens, which is consistent with [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187).
#### Dataset guidelines
The Transformer model works with a fixed-size vocabulary. Prior to training, we need to learn a data representation that allows us to store the entire dataset as a sequence of tokens. To achieve this we use Byte Pair Encoding. This algorithm builds a vocabulary by iterating over a dataset, looking for the most frequent pair of symbols and replacing it with a new symbol, as yet absent from the dataset. After reaching the desired number of encodings (new symbols can also be merged together), it outputs a code file that is used as an input for the `Dictionary` class.
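A toy sketch of the BPE merge loop described above (the actual preprocessing uses the subword-nmt implementation; this only illustrates the algorithm):

```python
import re
from collections import Counter

def learn_bpe(words, num_merges):
    """words: dict mapping space-separated symbol sequences to frequencies."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        pattern = re.compile(r'(?<!\S)' + re.escape(a + ' ' + b) + r'(?!\S)')
        words = {pattern.sub(a + b, w): f for w, f in words.items()}
    return merges

print(learn_bpe({'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6}, 3))
```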
This approach does not minimize the length of the encoded dataset; that can be addressed by using [SentencePiece](https://github.com/google/sentencepiece/) to tokenize the dataset with the unigram model, which tries to find an encoding close to the theoretical entropy limit.
Data is then sorted by length (in terms of tokens), and examples of similar length are batched together and padded if necessary, as in the sketch below.
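A minimal sketch of that batching scheme (sort by length, group neighbours, pad each group to its longest sequence); the names and padding index are illustrative:

```python
import torch

def length_bucketed_batches(token_seqs, batch_size, pad_id=0):
    order = sorted(range(len(token_seqs)), key=lambda i: len(token_seqs[i]))
    for start in range(0, len(order), batch_size):
        group = [token_seqs[i] for i in order[start:start + batch_size]]
        max_len = max(len(s) for s in group)
        batch = torch.full((len(group), max_len), pad_id, dtype=torch.long)
        for row, seq in enumerate(group):
            batch[row, :len(seq)] = torch.tensor(seq)
        yield batch

for b in length_bucketed_batches([[5, 6], [1, 2, 3], [7], [4, 4, 4, 4]], 2):
    print(b.shape)  # sequences of similar length end up in the same padded batch
```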
#### Multi-dataset
The model has been tested on the [WMT14 en-fr](http://www.statmt.org/wmt14/translation-task.html) dataset, achieving a state-of-the-art accuracy of 41.4 BLEU.
### Training process
The default training configuration can be launched by running the `train.py` training script. By default, the script saves one checkpoint every epoch in addition to the latest and the best ones. The best checkpoint is considered the one with the lowest value of loss, not the one with the highest BLEU score. To override this behavior, use the `--save-interval $N` option to save epoch checkpoints every N epochs, or `--no-epoch-checkpoints` to disable them entirely (with this option the latest and the best checkpoints will still be saved). Specify the save directory with the `--save-dir` option.<br/>
In order to run multi-GPU training, launch the training script with `python -m torch.distributed.launch --nproc_per_node $N` prepended, where N is the number of GPUs.
We have tested reliance on up to 16 GPUs on a single node.<br/>
After each training epoch, the script runs a loss validation on the validation split of the dataset and outputs the validation loss. By default the evaluation after each epoch is disabled. To enable it, use the `--online-eval` option, or use the BLEU score value as the training stopping condition with the `--target-bleu $TGT` option. In order to compute a case-insensitive BLEU score, use the flag `--ignore-case` along with the previous ones. BLEU is computed by the internal fairseq algorithm, the implementation of which can be found in the `fairseq/bleu.py` script.<br/>
By default, the `train.py` script will launch fp32 training without Tensor Cores. To use mixed precision with Tensor Cores use `--fp16` option.<br/>
To view all available options for training, run `python train.py --help`.
After each training epoch, the script runs a loss validation on the validation split of the dataset and outputs the validation loss. By default the evaluation after each epoch is disabled. To enable it, use the `--online-eval` option, or use the BLEU score value as the training stopping condition with the `--target-bleu $TGT` option. In order to compute the case-insensitive BLEU score, use the flag `--ignore-case` along with the previous ones. The BLEU is computed by the internal fairseq algorithm, the implementation of which can be found in the `fairseq/bleu.py` script.<br/>
By default, the `train.py` script will launch FP32 training without Tensor Cores. To use mixed precision with Tensor Cores use the `--fp16` option.<br/>
## Running inference
Inference on a raw input can be performed by launching the `interactive.py` inference script. It requires a pre-trained model checkpoint, a BPE codes file and a dictionary file (both are produced by the `run_preprocessing.sh` script and can be found in the dataset directory).<br/>
To enhance the speed of inference on large input files, it is recommended to preprocess them the same way as the dataset and run inference on the binarized input with the `generate.py` script.<br/>
Both scripts run inference with a default beam size of 4 and give tokenized output. To remove BPE codes use `--remove-bpe` option.<br/>
To view all available options for training, run `python interactive.py --help`.
To reach the BLEU score reported in [Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187) research paper, we used mixed precision training with a batch size of 5120 per GPU and learning rate of 6e-4 on a DGX-1V system with 8 Tesla V100s 16G. If you use a different setup, we recommend you scale your hyperparameters by applying the following rules:
1. To use FP32, reduce the batch size to 2560 and set the `--update-freq 2` and `--warmup-updates 8000` options.
2. To train on fewer GPUs, multiply `--update-freq` and `--warmup-updates` by the reciprocal of the scaling factor.
## Testing
Computation of the BLEU score is built into the training script and can be used to determine when the training should stop. To disable this feature, replace the `--target-bleu $BLEU` and `--ignore-case` options with `--max-epoch $N`, where `N` is the number of training epochs. Evaluation of the Transformer model is then performed on the binarized test split of the dataset by default. To evaluate the model, issue:
```bash
python generate.py /path/to/dataset/wmt14_en_de_joined_dict \
--path /path/to/your/checkpoint.pt \
--beam 4 --remove-bpe
For example, when training in FP32 mode on 4 GPUs, use the `--update-freq=4` and `--warmup-updates 16000` options.
### Inference process
Inference on a raw input can be performed by launching the `interactive.py` inference script. It requires a pre-trained model checkpoint, BPE codes file and dictionary file (both are produced by the `run_preprocessing.sh` script and can be found in the dataset directory).<br/>
To enhance the speed of the inference on large input files, it is recommended to preprocess them the same way as the dataset and run inference on a binarized input with the `generate.py` script.<br/>
Both scripts run inference with a default beam size of 4 and give tokenized output. To remove BPE codes use the `--remove-bpe` option.<br/>
In order to run interactive inference, run the command:
```
In order to use [SacreBLEU](https://pypi.org/project/sacrebleu/1.2.10/) for evaluation, run:
```bash
sacrebleu -t wmt14/full -l en-de --echo src > wmt14-en-de.src
python interactive.py --buffer-size 1 --fp16 --path /path/to/your/checkpoint.pt --max-tokens 128 \
--fuse-dropout-add --remove-bpe --bpe-codes /path/to/code/file \
/path/to/dataset/wmt14_en_de_joined_dict/ < wmt14-en-de.src > wmt14.detok
grep ^H wmt14.detok | cut -f3- > wmt14.translated
cat wmt14.translated | sacrebleu -t wmt14 -lc -l en-de
/path/to/dataset/wmt14_en_de_joined_dict/
```
The SacreBLEU test set is a subset of the test set used during training, thus the score obtained with SacreBLEU can differ slightly from the one computed during training.
The `--buffer-size` option allows batching of input sentences of up to `--max-tokens` length.
## Training Accuracy Results
In order to test the accuracy of our implementation, we have run experiments with different seeds for 100 epochs with batch size 5120 per GPU and learning rate 6e-4 in the pytorch-19.03-py3 Docker container. The plot below shows the BLEU score changes.<br/>
![Accuracy plot](BLEU.png)
## Performance
## Training Performance Results
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance on a specific batch size, run the `train.py` training script. Performance in words per second will be printed to standard output every N iterations, as specified by the `--log-interval` option. After each epoch, the mean performance across the epoch will be reported as well.
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size, run the `generate.py` script. The mean throughput will be reported at the end of the script.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training accuracy results
In order to test the accuracy of our implementation, we have run experiments with different seeds for 100 epochs with batch size 5120 per GPU and learning rate 6e-4 in the pytorch-18.12-py3 Docker container. The plot below shows the BLEU score changes.<br/>
![Accuracy plot](/BLEU.png)
Running this code with the provided hyperparameters will allow you to achieve the following results. Our setup is a DGX-1 with 8x Tesla V100 16GB. We've verified our results after training 32 epochs to obtain multi-GPU and mixed precision scaling results.
@ -136,32 +356,80 @@ Running this code with the provided hyperparameters will allow you to achieve th
In some cases we can train further with the same setup to achieve slightly better results.
GPU count |Precision | BLEU score | Epochs to train | Training time
---|---|---|---|---
4 |fp16 | 28.67 | 74 | 1925 min
4 |fp32 | 28.40 | 47 | 5478 min
##### NVIDIA DGX-1 (8x V100 16G)
Results here are the best we achieved. We have observed a large variance in BLEU across random seeds. Nearly all setups reach 28.4 BLEU, although the time it takes varies between setups.
We also observed good weak scaling. We measured performance in tokens (words) per second.
Our results were obtained by running the `run_training.sh` and `run_training_fp32.sh` training scripts in the PyTorch NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.
GPU count | Mixed precision | FP32 | FP32/Mixed speedup | Mixed precision weak scaling | FP32 weak scaling
---|---|---|---|---|---
1 | 37650 | 8630 | 4.36 | 1.0 | 1.0
4 | 132700 | 30500 | 4.35 | 3.52 | 3.53
8 | 260000 | 61000 | 4.26 | 6.91 | 7.07
| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
|--------|--------------------|----------------------|---------------------------------|-----------------------------------------------|------------------------|------------------------------
|8 |2560 | 53641 | 186442 | 3.48 | 7.03 | 7.82
|4 |2560 | 26647 | 92514 | 3.47 | 3.49 | 3.88
|1 |2560 | 7635 | 23821 | 3.12 | 1 | 1
## Inference performance results
All results were obtained by running the `generate.py` inference script in the pytorch-19.01-py3 Docker container. Inference was run on a single GPU.
In addition, mixed precision training has lower memory requirements, so we can train with a batch size twice as large.
| GPUs | Batch size / GPU | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision
|--------|--------------------|---------------------------------|-----------------------------------------------|--------------------
|8 |5120 | 235077 | 4.38 | 7.31
|4 |5120 | 75574 | 2.83 | 2.35
|1 |5120 | 32153 | 4.21 | 1
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by running the `run_training.sh` and `run_training_fp32.sh` training scripts in the PyTorch NGC container on NVIDIA DGX-2 with (16x V100 32G) GPUs. Performance numbers (in tokens per second) were averaged over an entire training epoch.
| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision
|--------|--------------------|----------------------|---------------------------------|-----------------------------------------------|------------------------|-----------------------------
| 16 | 5120 | 128319 | 476585 | 3.71 | |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Inference performance results
We provide two inference scripts, `generate.py` for preprocessed data and `interactive.py` for raw input. To measure throughput of the Transformer model, run:
```bash
python generate.py /path/to/dataset/wmt14_en_de_joined_dict \
--path /path/to/your/checkpoint.pt \
--beam 4 \
--remove-bpe \
--quiet \
--fp16
```
To measure end-to-end inference with tokenization, run:
```
python interactive.py \
--buffer-size 1 \
--fp16 \
--path /path/to/your/checkpoint.pt \
--max-tokens 128 \
--fuse-dropout-add \
--remove-bpe \
--bpe-codes /path/to/code/file \
/path/to/dataset/wmt14_en_de_joined_dict/
```
We have benchmarked the inference performance by running the `generate.py` script using the pytorch-19.03-py3 NGC Docker container. Inference was run on a single GPU.
GPU | Mixed precision | FP32 | FP16/Mixed speedup
---|---|---|---
Tesla V100 | 5129.34 | 3396.09 | 1.51
Tesla V100-SXM2-32GB | 6010 | 3414 | 1.76
## Changelog
## Release notes
### Changelog
January 2019
- initial commit, forked from [fairseq](https://github.com/pytorch/fairseq/commit/ac5fddfc691267285a84c81d39475411da5ed1c6)
- adding mid-training [SacreBLEU](https://pypi.org/project/sacrebleu/1.2.10/) evaluation. Better handling of OOMs.
May 2019:
- adding mid-training SacreBLEU evaluation. Better handling of OOMs.
June 2019
- new README
- jit support added
## Known issues
- The course of training heavily depends on the random seed. There is high variance in the time required to reach a certain BLEU score, and the highest BLEU score observed varies between runs with different seeds.

View file

@ -41,8 +41,22 @@ from . import (
)
from apex.normalization.fused_layer_norm import FusedLayerNorm
from .fused_dropout_add import fused_dropout_add
from .fused_relu_dropout import fused_relu_dropout
@torch.jit.script
def jit_dropout_add(x, residual, prob, is_training) :
# type: (Tensor, Tensor, float, bool) -> Tensor
out = F.dropout(x, p=prob, training=is_training)
out = residual + out
return out
@torch.jit.script
def jit_relu_dropout(x, prob, is_training) :
# type: (Tensor, float, bool) -> Tensor
out = F.threshold(x, 0., 0.)
out = F.dropout(out, p=prob, training=is_training)
return out
@register_model('transformer')
class TransformerModel(FairseqModel):
@ -438,7 +452,7 @@ class TransformerEncoderLayer(nn.Module):
x = self.maybe_layer_norm(0, x, before=True)
x, _ = self.self_attn(query=x, key=x, value=x, key_padding_mask=encoder_padding_mask)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -448,14 +462,14 @@ class TransformerEncoderLayer(nn.Module):
x = self.maybe_layer_norm(1, x, before=True)
if self.fuse_relu_dropout :
x = fused_relu_dropout(self.fc1(x), self.relu_dropout)
x = jit_relu_dropout(self.fc1(x), self.relu_dropout, self.training)
else :
x = F.threshold(self.fc1(x),0,0)
x = F.dropout(x, p=self.relu_dropout, training=self.training)
x = self.fc2(x)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -517,7 +531,7 @@ class TransformerDecoderLayer(nn.Module):
need_weights=False,
)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -537,7 +551,7 @@ class TransformerDecoderLayer(nn.Module):
need_weights=(not self.training and self.need_attn),
)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x
@ -546,13 +560,13 @@ class TransformerDecoderLayer(nn.Module):
residual = x
x = self.maybe_layer_norm(self.final_layer_norm, x, before=True)
if self.fuse_relu_dropout :
x = fused_relu_dropout(self.fc1(x), self.relu_dropout)
x = jit_relu_dropout(self.fc1(x), self.relu_dropout, self.training)
else :
x = F.threshold(self.fc1(x),0,0)
x = F.dropout(x, p=self.relu_dropout, training=self.training)
x = self.fc2(x)
if self.fuse_dropout_add and self.training :
x = fused_dropout_add(x, residual, self.dropout)
x = jit_dropout_add(x, residual, self.dropout, self.training)
else :
x = F.dropout(x, p=self.dropout, training=self.training)
x = residual + x

View file

@ -29,6 +29,7 @@ import torch
from fairseq import data, options, tasks, tokenizer, utils
from fairseq.sequence_generator import SequenceGenerator
from fairseq.meters import StopwatchMeter
from apply_bpe import BPE
@ -156,6 +157,9 @@ def main(args):
)
return result
gen_timer = StopwatchMeter()
end2end_timer = StopwatchMeter()
def process_batch(batch):
tokens = batch.tokens
lengths = batch.lengths
@ -164,11 +168,13 @@ def main(args):
tokens = tokens.cuda()
lengths = lengths.cuda()
gen_timer.start()
translations = translator.generate(
tokens,
lengths,
maxlen=int(args.max_len_a * tokens.size(1) + args.max_len_b),
)
gen_timer.stop()
return [make_result(batch.srcs[i], t) for i, t in enumerate(translations)]
@ -178,6 +184,7 @@ def main(args):
for inputs in buffered_read(args.buffer_size):
indices = []
results = []
end2end_timer.start()
for batch, batch_indices in make_batches(inputs, args, src_dict, models[0].max_positions(), bpe):
indices.extend(batch_indices)
results += process_batch(batch)
@ -191,6 +198,12 @@ def main(args):
if align is not None:
print(align)
print('Model latency: {} s'.format(gen_timer.sum))
gen_timer.reset()
end2end_timer.stop()
print('End-to-end translation time: {} s'.format(end2end_timer.sum))
end2end_timer.reset()
if __name__ == '__main__':
parser = options.get_generation_parser(interactive=True)

View file

@ -30,4 +30,6 @@ python preprocess.py \
--nwordstgt 33712 \
--joined-dictionary
sacrebleu -t wmt14/full -l de-en --echo src > $DATASET_DIR/sacrebleu_reference.de
cp $TEXT/code $DATASET_DIR/code

View file

@ -0,0 +1,27 @@
#! /bin/bash
nvidia-smi
python /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 2560 \
--seed 1 \
--max-epoch 50 \
--no-epoch-checkpoints \
--no-progress-bar \
--save-dir /results/checkpoints \

View file

@ -0,0 +1,28 @@
#! /bin/bash
nvidia-smi
python /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--max-epoch 50 \
--fp16 \
--no-epoch-checkpoints \
--no-progress-bar \
--save-dir /results/checkpoints_1_GPU_fp16 \

View file

@ -0,0 +1,29 @@
#! /bin/bash
nvidia-smi
python -m torch.distributed.launch --nproc_per_node 16 /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--max-epoch 1 \
--fp16 \
--no-epoch-checkpoints \
--no-progress-bar \
--log-interval 500 \
--save-dir /results/checkpoints_dgx2 \
--distributed-init-method env://

View file

@ -0,0 +1,29 @@
#! /bin/bash
nvidia-smi
python -m torch.distributed.launch --nproc_per_node 8 /workspace/translation/train.py \
/data/wmt14_en_de_joined_dict \
--arch transformer_wmt_en_de_big_t2t \
--share-all-embeddings \
--optimizer adam \
--adam-betas '(0.9, 0.997)' \
--adam-eps "1e-9" \
--clip-norm 0.0 \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 0.0 \
--warmup-updates 4000 \
--lr 0.0006\
--min-lr 0.0 \
--dropout 0.1 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 5120 \
--seed 1 \
--max-epoch 50 \
--fp16 \
--no-epoch-checkpoints \
--no-progress-bar \
--log-interval 500 \
--save-dir /results/checkpoints \
--distributed-init-method env://

View file

@ -57,22 +57,6 @@ strided_batched_gemm = CUDAExtension(
}
)
fused_dropout_add = CUDAExtension(
name='fused_dropout_add_cuda',
sources=['fairseq/models/fused_dropout_add/fused_dropout_add_cuda.cpp', 'fairseq/models/fused_dropout_add/fused_dropout_add_cuda_kernel.cu'],
extra_compile_args={
'cxx': ['-O2',],
'nvcc':['--gpu-architecture=sm_70','-O3','--use_fast_math', '--expt-extended-lambda'],
}
)
fused_relu_dropout = CUDAExtension(
name='fused_relu_dropout_cuda',
sources=['fairseq/models/fused_relu_dropout/fused_relu_dropout_cuda.cpp', 'fairseq/models/fused_relu_dropout/fused_relu_dropout_cuda_kernel.cu'],
extra_compile_args={
'cxx': ['-O2',],
'nvcc':['--gpu-architecture=sm_70','-O3','--use_fast_math', '--expt-extended-lambda'],
}
)
batch_utils = CppExtension(
name='fairseq.data.batch_C',
sources=['fairseq/data/csrc/make_batches.cpp'],
@ -88,7 +72,7 @@ setup(
license=license,
install_requires=reqs.strip().split('\n'),
packages=find_packages(),
ext_modules=[bleu, strided_batched_gemm, fused_dropout_add, fused_relu_dropout, batch_utils],
ext_modules=[bleu, strided_batched_gemm, batch_utils],
cmdclass={
'build_ext': BuildExtension
},

View file

@ -115,11 +115,10 @@ def main(args):
valid_losses = [None]
valid_subsets = args.valid_subset.split(',')
while lr >= args.min_lr and epoch_itr.epoch < max_epoch and trainer.get_num_updates() < max_update and current_bleu < tgt_bleu:
while lr >= args.min_lr and epoch_itr.epoch < max_epoch and trainer.get_num_updates() < max_update and current_bleu < tgt_bleu:
# train for one epoch
train(args, trainer, task, epoch_itr)
if epoch_itr.epoch % args.validate_interval == 0:
valid_losses = validate(args, trainer, task, epoch_itr, valid_subsets)
@ -366,25 +365,11 @@ def score(args, trainer, task, epoch_itr, subset):
if args.distributed_world_size > 1:
_all_gather_bleu_scorer(scorer)
chunked_predictions = []
while True:
if len(predictions)>100:
chunked_predictions.append(predictions[:100])
predictions = predictions[100:]
else:
chunked_predictions.append(predictions)
break
reduced_predictions = []
for chunk in chunked_predictions:
torch.cuda.synchronize()
reduced_predictions += distributed_utils.all_gather_list(chunk, max_size=65000)
torch.cuda.synchronize()
predictions = _all_gather_predictions(predictions)
with open(os.path.join(args.data, 'sacrebleu_reference.de'), 'r') as reference:
refs = [reference.readlines()]
#reducing indexed predictions as strings is more memory efficient than reducing tuples
predictions = [item for sublist in reduced_predictions for item in sublist]
predictions = [tuple(item.split('\t')) for item in predictions]
predictions = [(int(item[0]), item[1]) for item in predictions]
predictions.sort(key=lambda tup: tup[0])
@ -401,6 +386,36 @@ def score(args, trainer, task, epoch_itr, subset):
return scorer.score(order=4), sacrebleu_score.score
def _all_gather_predictions(predictions):
ready = False
all_ready = False
reduced_predictions = []
max_size = 65000
while not all_ready:
lst_len = len(predictions)
size = 2000 # reserve some extra space for pickle/protocol overhead
n = 0
while n < lst_len:
str_len = len(predictions[n].encode('utf8')) + 8 # per string pickle overhead
if size + str_len >= max_size:
break
size += str_len
n += 1
chunk = predictions[:n]
predictions = predictions[n:]
if not predictions:
ready = True
chunk = (ready, chunk)
torch.cuda.synchronize()
gathered = distributed_utils.all_gather_list(chunk, max_size=65000)
torch.cuda.synchronize()
reduced_predictions += [t[1] for t in gathered]
all_ready = all([t[0] for t in gathered])
reduced_predictions = [item for sublist in reduced_predictions for item in sublist]
return reduced_predictions
def _all_gather_bleu_scorer(scorer):
stats = distributed_utils.all_gather_list(scorer.stat)
bleu_stat = bleu.BleuStat()

Binary file not shown.

After

Width:  |  Height:  |  Size: 192 KiB

View file

@ -25,7 +25,7 @@ The examples are organized first by framework, such as TensorFlow, PyTorch, etc.
### Natural Language Processing
- __GNMT__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/GNMT)] [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Translation/GNMT)]
- __Transformer__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Translation/Transformer)]
- __BERT__ [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT)]
- __BERT__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT)][[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT)]
### Recommender Systems
- __NCF__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF)] [[TensorFlow](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Recommendation/NCF)]
@ -41,3 +41,6 @@ We're posting these examples on GitHub to better support the community, facilita
## Known issues
In each of the network READMEs, we indicate any known issues and encourage the community to provide feedback.

View file

@ -1,28 +1,45 @@
# ResNet-50 v1.5 for TensorFlow
This repository provides a script and recipe to train the ResNet-50 v1.5 model to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
* [The model](#the-model)
* [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Data Augmentation](#data-augmentation)
* [Other training recipes](#other-training-recipes)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick start guide](#quick-start-guide)
* [Details](#details)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Data augmentation](#data-augmentation)
* [Training process](#training-process)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training performance results](#training-performance-results)
* [Inference performance results](#inference-performance-results)
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-16x-v100-32g)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [NVIDIA DGX-2 (16x V100 32G)](#nvidia-dgx-2-16x-v100-32g)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
# The model
## Model overview
The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
The difference between v1 and v1.5 is in the bottleneck blocks which requires
@ -31,16 +48,15 @@ downsampling, for example, v1 has stride = 2 in the first 1x1 convolution, where
This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1,
but comes with a small performance drawback (~5% imgs/sec).
The following features were implemented in this model:
* Data-parallel multi-GPU training with Horovod
* Mixed precision support with TensorFlow Automatic Mixed Precision (TF-AMP), which enables mixed precision training without any changes to the code base by performing automatic graph rewrites and loss scaling controlled by an environment variable, using Tensor Core operations to maximize throughput on NVIDIA Volta GPUs.
* Static loss scaling for Tensor Cores (mixed precision) training
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The following performance optimizations were implemented in this model:
* JIT graph compilation with [XLA](https://www.tensorflow.org/xla)
* NVIDIA Data Loading ([DALI](https://github.com/NVIDIA/DALI)) support (experimental).
## Default configuration
### Model architecture
### Default configuration
This model trains for 90 epochs, with default ResNet50 v1.5 setup:
@ -56,25 +72,7 @@ during first 5 epochs according to [Training ImageNet in 1 hour](https://arxiv.o
* Weight decay: 1e-4
## Data Augmentation
This model uses the following data augmentation:
* During training, we perform the following augmentation techniques:
* Normalization
* Random resized crop to 224x224
* Scale from 8% to 100%
* Aspect ratio from 3/4 to 4/3
* Random horizontal flip
* During inference, we perform the following augmentation techniques:
* Normalization
* Scale to 256x256
* Center crop to 224x224
## Other training recipes
### Other training recipes
This script does not target any specific benchmark.
There are changes that others have made which can speed up convergence and/or increase accuracy.
@ -93,15 +91,63 @@ and this recipe keeps the original assumption that validation is done on 224px i
Using 288px images means that a lot more FLOPs are needed during inference to reach the same accuracy.
# Setup
### Feature support matrix
The following features are supported by this model.
| Feature | ResNet-50 v1.5 TensorFlow |
|-----------------------|---------------------------|
|Multi-GPU training with [Horovod](https://github.com/horovod/horovod) | Yes |
|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes |
#### Features
Multi-GPU training with Horovod - Our model uses Horovod to implement efficient multi-GPU training with NCCL.
For details, see example sources in this repository or see the [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
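As a reference, the usual Horovod TF1 recipe looks roughly like the following (a sketch only; the actual wiring lives in `utils/hvd_utils.py` and `runtime/runner.py`, and the learning rate here is a placeholder):
```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across GPUs with NCCL
opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variable states from rank 0 so all workers start in sync
# (pass `hooks` to tf.train.MonitoredTrainingSession; checkpoint only on rank 0)
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```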
NVIDIA DALI - DALI is a library for accelerating the data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader
with the DALI library. For details, see example sources in this repository or see the [DALI documentation](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html).
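For illustration, a minimal DALI pipeline over ImageNet-style TFRecords might look like the sketch below (written against the DALI 0.x API shipped in recent NGC containers; the feature names, sizes and normalization constants are placeholders, not the repository's exact pipeline from `utils/dali_utils.py`):
```python
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import nvidia.dali.tfrecord as tfrec

class TrainPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id, tfrecord, tfrecord_idx):
        super(TrainPipeline, self).__init__(batch_size, num_threads, device_id)
        # The TFRecord reader needs the index files passed via --data_idx_dir
        self.input = ops.TFRecordReader(
            path=tfrecord, index_path=tfrecord_idx,
            features={'image/encoded': tfrec.FixedLenFeature((), tfrec.string, ""),
                      'image/class/label': tfrec.FixedLenFeature([1], tfrec.int64, -1)})
        # Decode and augment on the GPU
        self.decode = ops.nvJPEGDecoder(device="mixed", output_type=types.RGB)
        self.resize = ops.RandomResizedCrop(device="gpu", size=(224, 224))
        self.normalize = ops.CropMirrorNormalize(
            device="gpu", output_dtype=types.FLOAT, output_layout=types.NHWC,
            image_type=types.RGB, mean=[124., 117., 104.], std=[58., 57., 57.])
        self.coin = ops.CoinFlip(probability=0.5)

    def define_graph(self):
        inputs = self.input()
        images = self.decode(inputs["image/encoded"])
        images = self.resize(images)
        images = self.normalize(images, mirror=self.coin())
        return images, inputs["image/class/label"]
```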
### Mixed precision training
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
#### Enabling mixed precision
Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP) extension which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by using simple multiplication of the loss by a constant value or automatically, by TF-AMP. Automatic mixed precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations. First, programmers need not modify network model code, reducing development and maintenance effort. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
To enable mixed precision, simply set the values of the following environment variables inside your training script:
- Enable TF-AMP graph rewrite:
```
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
```
- Enable Automated Mixed Precision:
```
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
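For reference, the static alternative mentioned above boils down to scaling the loss before the backward pass and unscaling the gradients before the update. A minimal TF1 sketch (illustrative only; 128 is a placeholder scale value, and dense gradients are assumed):
```python
import tensorflow as tf

def minimize_with_static_loss_scale(optimizer, loss, loss_scale=128.0):
    # Scale the loss so small gradient values survive the FP16 backward pass
    grads_and_vars = optimizer.compute_gradients(loss * loss_scale)
    # Unscale before applying, so the weight update itself is unchanged
    unscaled = [(grad / loss_scale, var)
                for grad, var in grads_and_vars if grad is not None]
    return optimizer.apply_gradients(unscaled)
```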
## Setup
The following section lists the requirements that you need to meet in order to use the ResNet50 v1.5 model.
## Requirements
### Requirements
This repository contains a Dockerfile which extends the TensorFlow NGC container and encapsulates all dependencies. Aside from these dependencies, ensure you have the following software:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [TensorFlow 19.03-py3 NGC container or later](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow)
* [TensorFlow 19.06-py3 NGC container or later](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow)
* [NVIDIA Volta based GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
For more information about how to get started with NGC containers, see the
@ -110,28 +156,37 @@ following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running Tensorflow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running).
# Quick start guide
To train your model using mixed precision with tensor cores, perform the following steps using the default parameters of the ResNet-50 v1.5 model on the [ImageNet](http://www.image-net.org/) dataset.
For those unable to use the [TensorFlow 19.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## 1. Download and preprocess the dataset.
## Quick Start Guide
To train your model using mixed precision with tensor cores, perform the following steps using the default parameters of the ResNet-50 v1.5 model on the [ImageNet](http://www.image-net.org/) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Classification/RN50v1.5
```
2. Download and preprocess the dataset.
The ResNet50 v1.5 script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
To download and preprocess the dataset, use the [Generate ImageNet for TensorFlow](https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh) script. The dataset will be downloaded to a directory specified as the first parameter of the script.
## 2. Build the ResNet-50 v1.5 TensorFlow NGC container.
3. Build the ResNet-50 v1.5 TensorFlow NGC container.
```bash
bash scripts/docker/build.sh
```
## 3. Start an interactive session in the NGC container to run training/inference.
4. Start an interactive session in the NGC container to run training/inference.
After you build the container image, you can start an interactive CLI session with
```bash
bash scripts/docker/interactive.sh
```
The interactive.sh script requires that the location of the dataset is specified. For example, /data.
## 4. Start training.
5. Start training.
To run training for a default configuration (as described in [Default configuration](#default-configuration)), for example 1/4/8 GPUs, FP16/FP32, run one of the scripts in the ./scripts directory called `./scripts/RN50_{FP16, FP32}_{1, 4, 8}GPU.sh`. Each of the scripts requires three parameters:
* path to the root directory of the model as the first argument
* path to the dataset as a second argument
@ -142,7 +197,7 @@ For example:
./scripts/RN50_FP16_8GPU.sh <path to model> <path to dataset> <path to results>
```
## 5. Start validation/evaluation.
6. Start validation/evaluation.
Model evaluation on a checkpoint can be launched by running one of the `./scripts/RN50_{FP16, FP32}_EVAL.sh` scripts in the `./scripts` directory. Each of the scripts requires three parameters:
* path to the root directory of the model as the first argument
@ -159,57 +214,131 @@ To run a non-default configuration, use:
`python ./main.py --mode=evaluate --use_tf_amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to checkpoint>`
# Details
## Advanced
## Command line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In the root directory, the most important files are:
- `main.py`: the script that controls the logic of training and validation of the ResNet-50 v1.5 model;
- `Dockerfile`: Instructions for docker to build a container with the basic set of dependencies to run ResNet-50 v1.5;
- `requirements.txt`: a set of extra Python requirements for running ResNet-50 v1.5;
The `model/` directory contains modules used to define the ResNet-50 v1.5 model:
- `resnet_v1_5.py`: the definition of the ResNet-50 v1.5 model
- `blocks/conv2d_block.py`: the definition of the ResNet-50 v1.5 2D convolution block
- `blocks/resnet_bottleneck_block.py`: the definition of the ResNet-50 v1.5 bottleneck block
- `layers/*.py`: definitions of specific layers used in the ResNet-50 v1.5 model
The `utils/` directory contains utility modules:
- `cmdline_helper.py`: helper module for command line processing
- `data_utils.py`: module defining input data pipelines
- `dali_utils.py`: helper module for DALI
- `hvd_utils.py`: helper module for Horovod
- `image_processing.py`: image processing and data augmentation functions
- `learning_rate.py`: definition of the learning rate schedule used
- `optimizers.py`: definitions of the custom optimizers used
- `hooks/*.py`: definitions of specific hooks allowing logging of the training and inference process
The `runtime/` directory contains modules that define the mechanics of the training process:
- `runner.py`: module encapsulating the training, inference and evaluation
The `scripts/` directory contains scripts wrapping common scenarios.
### Parameters
#### The script `main.py`
The script for training and evaluating the ResNet-50 v1.5 model has a variety of parameters that control these processes.
##### Common parameters
`--mode`
: allows specification of the mode in which the script will run: train, train_and_evaluate, evaluate, training_benchmark or inference_benchmark.
`--data_dir`, `--data_idx_dir`
: allow specification of the dataset location.
`--seed`
: allows specification of the seed for RNGs.
`--batch_size`
: allows specification of the minibatch size.
`--num_iter` and `--iter_unit`
: allow specification of the training/evaluation length.
`--use_tf_amp`
: flag enabling TF-AMP mixed precision computation.
`--use_xla`
: flag enabling XLA graph optimization.
`--use_dali`
: flag enabling the DALI input pipeline. This parameter requires `--data_idx_dir` to be set.
##### Training related
`--use_auto_loss_scaling`
: flag enabling automatic loss scaling.
`--lr_init`
: initial value of the learning rate.
`--warmup_steps`
: allows you to specify the number of iterations considered as warmup and not taken into account for performance measurements.
`--momentum`
: momentum argument for the SGD optimizer.
`--weight_decay`
: weight decay argument for the SGD optimizer.
`--batch_size`
: the number of inputs processed at once in each iteration.
`--loss_scale`
: value of the static loss scale. This parameter has no effect if `--use_auto_loss_scaling` is set.
##### Utility parameters
`--help`
: displays a short description of all parameters accepted by the script.
### Command-line options
All these parameters can be controlled by passing command-line arguments
to the `main.py` script. To get a complete list of all command-line arguments
with descriptions and default values you can run:
```
python main.py --help
```
To summarize, the most important arguments are as follows:
```
--mode {train,train_and_evaluate,evaluate,training_benchmark,inference_benchmark}
The execution mode of the script.
--data_dir DATA_DIR Path to dataset in TFRecord format. Files should be
named 'train-*' and 'validation-*'.
--data_idx_dir DATA_IDX_DIR
Path to index files for DALI. Files should be named
'train-*' and 'validation-*'.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--iter_unit {epoch,batch}
Unit of iterations.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--loss_scale LOSS_SCALE
Loss scale for mixed precision training.
--use_auto_loss_scaling
Use automatic loss scaling in fp32 AMP.
--use_xla Enable XLA (Accelerated Linear Algebra) computation
for improved performance.
--use_tf_amp Enable Automatic Mixed Precision to speedup fp32
computation using tensor cores.
--use_dali Enable DALI data input.
```
### Getting the data
The ResNet-50 v1.5 model was trained on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
## Training process
To download and preprocess the dataset, use the [Generate ImageNet for TensorFlow](https://github.com/tensorflow/models/blob/master/research/inception/inception/data/download_and_preprocess_imagenet.sh) script. The dataset will be downloaded to a directory specified as the first parameter of the script.
#### Data Augmentation
This model uses the following data augmentation:
* During training, we perform the following augmentation techniques:
* Normalization
* Random resized crop to 224x224
* Scale from 8% to 100%
* Aspect ratio from 3/4 to 4/3
* Random horizontal flip
* During inference, we perform the following augmentation techniques:
* Normalization
* Scale to 256x256
* Center crop to 224x224
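A rough TF1 sketch of the training-time transforms listed above (illustrative only; the actual implementation lives in `utils/image_processing.py`, and the normalization constants here are placeholders):
```python
import tensorflow as tf

def train_preprocess(image):
    # Random resized crop: sample a box covering 8%-100% of the image area
    # with aspect ratio in [3/4, 4/3], then resize the crop to 224x224
    bbox = tf.constant([0.0, 0.0, 1.0, 1.0], shape=[1, 1, 4])
    begin, size, _ = tf.image.sample_distorted_bounding_box(
        tf.shape(image), bounding_boxes=bbox,
        area_range=(0.08, 1.0), aspect_ratio_range=(0.75, 1.33))
    image = tf.slice(image, begin, size)
    image = tf.image.resize_images(image, [224, 224])
    # Random horizontal flip and per-channel normalization
    image = tf.image.random_flip_left_right(image)
    image = (tf.cast(image, tf.float32) - [124., 117., 104.]) / [58., 57., 57.]
    return image
```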
### Training process
To run a configuration that is not based on the default parameters, use:
* For 1 GPU
@ -224,28 +353,13 @@ To run a configuration that is not based on the default parameters, use:
* FP16
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --batch_size=256 --use_tf_amp --data_dir=<path to imagenet> --results_dir=<path to results>`
## Performance
## Enabling mixed precision
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
# Benchmarking
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
## Training performance benchmark
#### Training performance benchmark
To benchmark the training performance on a specific batch size, run:
@ -269,7 +383,7 @@ Each of these scripts runs 200 warm-up iterations and measures the first epoch.
To control warmup and benchmark length, use `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
## Inference performance benchmark
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size, run:
@ -283,14 +397,16 @@ Each of these scripts, by default runs 20 warm-up iterations and measures the ne
To control warm-up and benchmark length, use `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
# Results
### Results
The following sections provide details on how we achieved our results in training accuracy, performance and inference performance.
## Training accuracy results
#### Training accuracy results
##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./scripts/RN50_{FP16, FP32}_{1, 4, 8}GPU.sh` script in
the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
the tensorflow-19.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
| **number of GPUs** | **mixed precision top1** | **mixed precision training time** | **FP32 top1** | **FP32 training time** |
@ -301,10 +417,10 @@ the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
## Training performance results
#### Training performance results
### DGX-1
The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbench_fp16.sh` and `./scripts/benchmarking/DGX1V_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPU.
##### NVIDIA DGX-1 (8x V100 16G)
The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbench_fp16.sh` and `./scripts/benchmarking/DGX1V_trainbench_fp32.sh` scripts in the tensorflow-19.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPU.
| **number of GPUs** | **mixed precision img/s** | **FP32 img/s** | **mixed precision speedup** | **mixed precision weak scaling** | **FP32 weak scaling** |
@ -313,10 +429,17 @@ The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbenc
| **4** | 3197.4 | 1419.4 | 2.25 | 3.96 | 3.89 |
| **8** | 6209.9 | 2778.5 | 2.24 | 7.74 | 7.61 |
##### XLA Enabled
| **number of GPUs** | **mixed precision img/s** | **mixed precision + XLA img/s** | **XLA speedup** |
|:------------------:|:-------------------------:|:-------------------------------:|:---------------:|
| **1** | 802.1 | 1177.9 | 1.47 |
| **4** | 3197.4 | 4654.1 | 1.45 |
| **8** | 6209.9 | 8104.4 | 1.30 |
### DGX-2
The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench_fp16.sh` and `./scripts/benchmarking/DGX2_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-2 with 16 V100 32G GPU.
##### NVIDIA DGX-2 (16x V100 32G)
The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench_fp16.sh` and `./scripts/benchmarking/DGX2_trainbench_fp32.sh` scripts in the tensorflow-19.06-py3 Docker container on NVIDIA DGX-2 with 16 V100 32G GPU.
| **number of GPUs** | **mixed precision img/s** | **FP32 img/s** | **mixed precision speedup** | **mixed precision weak scaling** | **FP32 weak scaling** |
|:------------------:|:-------------------------:|:--------------:|:---------------------------:|:--------------------------------:|:---------------------:|
@ -325,21 +448,7 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench
| **8** | 6439.4 | 2888.6 | 2.23 | 7.84 | 7.65 |
| **16** | 12467.5 | 5660.8 | 2.20 | 15.17 | 15.00 |
### XLA-enabled results
### DGX-1
The results were obtained by running the `./scripts/benchmarking/DGX1V_trainbench_fp16.sh` and `./scripts/benchmarking/DGX1V_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPU.
| **number of GPUs** | **mixed precision img/s** | **mixed precision + XLA img/s** | **XLA speedup** |
|:------------------:|:-------------------------:|:-------------------------------:|:---------------:|
| **1** | 802.1 | 1177.9 | 1.47 |
| **4** | 3197.4 | 4654.1 | 1.45 |
| **8** | 6209.9 | 8104.4 | 1.30 |
### DGX-2
The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench_fp16.sh` and `./scripts/benchmarking/DGX2_trainbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-2 with 16 V100 32G GPU.
##### XLA Enabled
| **number of GPUs** | **mixed precision img/s** | **mixed precision + XLA img/s** | **XLA speedup** |
|:------------------:|:-------------------------:|:-------------------------------:|:---------------:|
@ -348,10 +457,10 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_trainbench
| **8** | 6439.4 | 9295.5 | 1.44 |
| **16** | 12467.5 | 15902.8 | 1.27 |
## Inference performance results
#### Inference performance results
### DGX-1
The results were obtained by running the `./scripts/benchmarking/DGX1V_inferbench_fp16.sh` and `./scripts/benchmarking/DGX1V_inferbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
##### NVIDIA DGX-1 (8x V100 16G)
The results were obtained by running the `./scripts/benchmarking/DGX1V_inferbench_fp16.sh` and `./scripts/benchmarking/DGX1V_inferbench_fp32.sh` scripts in the tensorflow-19.06-py3 Docker container on a single GPU of NVIDIA DGX-1 with 8 V100 16G GPUs.
| **batch size** | **mixed precision img/s** | **FP32 img/s** | **mixed precision + XLA img/s** |
|:--------------:|:-------------------------:|:--------------:|:-------------------------------:|
@ -366,8 +475,8 @@ The results were obtained by running the `./scripts/benchmarking/DGX1V_inferbenc
| **256** | 2129.3 | N/A | 3547.9 |
### DGX-2
The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench_fp16.sh` and `./scripts/benchmarking/DGX2_inferbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on NVIDIA DGX-1 with 16 V100 32G GPUs.
##### NVIDIA DGX-2 (16x V100 32G)
The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench_fp16.sh` and `./scripts/benchmarking/DGX2_inferbench_fp32.sh` scripts in the tensorflow-19.05-py3 Docker container on a single GPU of NVIDIA DGX-2 with 16 V100 32G GPUs.
| **batch size** | **mixed precision img/s** | **FP32 img/s** | **mixed precision + XLA img/s** |
|:--------------:|:-------------------------:|:--------------:|:-------------------------------:|
@ -381,7 +490,9 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench
| **128** | 2126.5 | 1168.8 | 3469.6 |
| **256** | 2203.6 | N/A | 3713.2 |
# Changelog
## Release notes
### Changelog
1. March 1, 2019
* Initial release
2. May 15, 2019
@ -389,5 +500,5 @@ The results were obtained by running the `./scripts/benchmarking/DGX2_inferbench
* Added scripts for DGX-2
* Added benchmark results for DGX-2 and XLA-enabled DGX-1 and DGX-2.
# Known issues
### Known issues
There are no known issues with this model.
View file
@ -52,6 +52,7 @@ class Runner(object):
model_dir=None,
log_dir=None,
data_dir=None,
data_idx_dir=None,
# ======= Optimization HParams ======== #
use_xla=False,
@ -223,6 +224,7 @@ class Runner(object):
config.allow_soft_placement = True
config.log_device_placement = False
if hvd_utils.is_using_hvd():
config.gpu_options.visible_device_list = str(hvd.local_rank())
View file
@ -332,6 +332,7 @@ To achieve same results, follow the [Quick start guide](#quick-start-guide) outl
March 2019
* Initial release
May 2019
* Test scripts updated
View file
@ -12,7 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvcr.io/nvidia/tensorflow:19.03-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:19.06-py3
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && \
apt-get install -y unzip
View file
@ -4,28 +4,30 @@ This repository provides a script and recipe to train Neural Collaborative Filte
accuracy, and is tested and maintained by NVIDIA.
## Table Of Contents
* [The Model](#the-model)
* [Model overview](#model-overview)
* [Default Configuration](#default-configuration)
* [Mixed precision training](#mixed-precision-training)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Details](#details)
* [Advanced](#advanced)
* [Command Line Arguments](#command-line-arguments)
* [Getting the Data](#getting-the-data)
* [Other Datasets](#other-datasets)
* [Training Process](#training-process)
* [Enabling Mixed Precision](#enabling-mixed-precision)
* [Evaluation Process](#evaluation-process)
* [Benchmarking](#benchmarking)
* [Performance Benchmark](#performance-benchmark)
* [Results](#results)
* [Training Accuracy Results](#training-accuracy-results)
* [Training Performance Results](#training-performance-results)
* [Inference Performance Results](#inference-performance-results)
* [Changelog](#changelog)
* [Known Issues](#known-issues)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Performance Benchmark](#performance-benchmark)
* [Results](#results)
* [Training Accuracy Results](#training-accuracy-results)
* [Training Performance Results](#training-performance-results)
* [Inference Performance Results](#inference-performance-results)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known Issues](#known-issues)
## The Model
## Model overview
The Neural Collaborative Filtering (NCF) model is a neural network that provides collaborative filtering based on
implicit feedback, specifically, it provides product recommendations based on user and item interactions. The training
@ -78,6 +80,47 @@ This implementation is implemented with the following features:
- Note: The negative samples generated for the test set are always verified regardless if the shortcut is enabled or
not.
### Mixed Precision Training
[Mixed Precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing information in single-precision to retain as much information as
possible. Mixed precision is enabled in TensorFlow by using a custom variable getter that casts variables to
half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small
gradient magnitudes in backpropagation, a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included
when applying gradients. In TensorFlow, loss scaling can be easily applied by using
[LossScaleOptimizer](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/LossScaleOptimizer) . The
scaling value to be used can be
[dynamic](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/ExponentialUpdateLossScaleManager) or
[fixed](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/FixedLossScaleManager)
Enabling mixed precision is now easier than ever with support for AMP in TensorFlow. TF-AMP is an extension of
TensorFlow that enables mixed precision without any code changes. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and loss scaling. Currently, TF-AMP
is only available through NVIDIA's TensorFlow Docker container.
TF-AMP is controlled by the `TF_ENABLE_AUTO_MIXED_PRECISION=1` environment variable; when set, TensorFlow will rewrite
all graphs to perform computations in half-precision format and loss scaling will automatically be applied.
To enable mixed precision training using TF-AMP, the environment variable can be set prior to running `ncf.py`.
Alternatively, `ncf.py` can be run with the `--fp16` flag.
**Note:** The `--fp16` flag sets the environment variable to the correct value
for mixed precision training inside the script, for example:
```
# Note that the --fp16 flag maps to the amp variable in code
if args.amp:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
```
For more information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper
and the [Training With Mixed Precision documentation](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
from the TensorFlow User Guide.
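For the manual path named above, a minimal sketch using the `tf.contrib` loss scale managers (TF1-era API, illustrative only; this is not the code path `ncf.py` takes when run with `--fp16`, and the hyperparameter values are placeholders):
```python
import tensorflow as tf

# Wrap any optimizer with a loss-scale manager; the dynamic manager grows the
# scale every N clean steps and shrinks it when gradients overflow
opt = tf.train.AdamOptimizer(learning_rate=0.0045)
manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
    init_loss_scale=2 ** 15, incr_every_n_steps=2000)
opt = tf.contrib.mixed_precision.LossScaleOptimizer(opt, manager)
```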
## Setup
The following section lists the requirements in order to start training the NCF model.
@ -103,14 +146,14 @@ Documentation and the Deep Learning Documentation:
To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the default
parameters of the NCF model on the ml-20m dataset.
### 1. Clone this repository
### Clone this repository
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Recommendation/NCF
```
### 2. Build the NCF TensorFlow NGC container.
### Build the NCF TensorFlow NGC container.
After Docker is correctly set up, you can build the NCF image with:
@ -118,7 +161,7 @@ After Docker is correctly set up, you can build the NCF image with:
docker build . -t nvidia_ncf
```
### 3. Launch the NCF TensorFlow Docker container.
### Launch the NCF TensorFlow Docker container.
```bash
mkdir data
@ -129,7 +172,9 @@ This will launch the container and mount the ./data directory as a volume to the
Any datasets and experiment results (logs, checkpoints etc.) saved to /data will be accessible in the ./data directory
on the host.
### 4. Download and preprocess the dataset.
### Download and preprocess the dataset.
#### ml-20m
Preprocessing consists of downloading the data, filtering out users that have less than 20 ratings (by default), sorting
the data and dropping the duplicates. No data augmentation techniques are used in the preprocessing stage.
@ -140,7 +185,7 @@ To download and preprocess the ml-20m dataset, run:
./prepare_dataset.sh
```
##### ml-1m
#### ml-1m
To download and preprocess the ml-1m dataset, run:
@ -151,7 +196,7 @@ To download and preprocess the ml-1m dataset, run:
This will store the preprocessed training and evaluation data in the `/data` directory, so that it can be later used to
train the model (by passing the appropriate `--data` argument to the `ncf.py` script).
### 5. Start training.
### Start training.
After the Docker container is launched, the training with the default hyper-parameters can be started with:
@ -166,7 +211,7 @@ mpirun -np $numgpu \
After the training is complete, the model parameters that provide the best evaluation accuracy are saved to the
directory passed to the `--checkpoint-dir` argument. By default, this will be in the `/data/checkpoints/` directory.
### 6. Start validation/evaluation.
### Start validation/evaluation.
To run evaluation on a specific checkpoint, simply run the following command:
@ -177,7 +222,7 @@ python ncf.py --data /data/cache/ml-20m --mode test --checkpoint-dir $checkpoint
Note: TensorFlow checkpoints consist of 3 files each with a `*.ckpt` prefix.
## Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
@ -237,7 +282,7 @@ automatically call `download_dataset.sh` to download the desired dataset, and
then preprocess the training and test datasets. By default, data will be
downloaded to the `/data` directory.
##### Other Datasets
#### Other Datasets
This implementation is tuned for the ml-20m and ml-1m datasets. Using other
datasets might require tuning some hyperparameters (for example, learning rate,
@ -319,46 +364,6 @@ will be stored at the directory pointed to by the `--checkpoint-dir` argument.
Multiple GPUs can be used for training through Horovod. The number of GPUs can
be controlled by the `-np` parameter passed to `mpirun`.
##### Enabling Mixed Precision
[Mixed Precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing information in single-precision to retain as much information as
possible. Mixed precision is enabled in TensorFlow by using a custom variable getter that casts variables to
half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small
gradient magnitudes in backpropagation, a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included
when applying gradients. In TensorFlow, loss scaling can be easily applied by using
[LossScaleOptimizer](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/LossScaleOptimizer) . The
scaling value to be used can be
[dynamic](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/ExponentialUpdateLossScaleManager) or
[fixed](https://www.tensorflow.org/api_docs/python/tf/contrib/mixed_precision/FixedLossScaleManager)
Enabling mixed precision is now easier than ever with support for AMP in TensorFlow. TF-AMP is an extension of
TensorFlow that enables mixed precision without any code changes. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and loss scaling. Currently, TF-AMP
is only available through NVIDIA's TensorFlow Docker container.
TF-AMP is controlled by the `TF_ENABLE_AUTO_MIXED_PRECISION=1` environment variable; when set, TensorFlow will rewrite
all graphs to perform computations in half-precision format and loss scaling will automatically be applied.
To enable mixed precision training using TF-AMP, the environment variable can be set prior to running `ncf.py`.
Alternatively, `ncf.py` can be run with the `--fp16` flag.
**Note:** The `--fp16` flag sets the environment variable to the correct value
for mixed precision training inside the script, for example:
```
# Note that the --fp16 flag maps to the amp variable in code
if args.amp:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
```
For more information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper
and the [Training With Mixed Precision documentation](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
from the TensorFlow User Guide.
### Evaluation Process
The evaluation process can be run by the ncf.py script as well. By passing the
@ -372,12 +377,14 @@ The script will then output a line like the one below which describes the model
Eval Time = 1.1829, HR@10 = 0.9574, NDCG@10 = 0.7420
```
## Benchmarking
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model
performance in training and inference modes.
### Performance Benchmark
#### Performance Benchmark
To benchmark the training and inference performance, run:
@ -394,11 +401,11 @@ By default, the `ncf.py` script outputs metrics describing the following:
* Training speed and throughput
* Evaluation speed and throughput
## Results
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
## Training Accuracy Results
### Training Accuracy Results
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 8x V100 16G GPUs.
@ -414,9 +421,9 @@ recorded to demonstrate the maximum accuracy the model can achieve.
| 4 | 0.9589 | 0.9591 |
| 8 | 0.9597 | 0.9598 |
## Training Performance Results
### Training Performance Results
### NVIDIA DGX-1 (8x V100 16G)
#### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 8x V100 16GB GPUs
@ -448,7 +455,7 @@ Those results can be improved when [XLA](https://www.tensorflow.org/xla) is used
in conjunction with mixed precision, delivering up to 2.6x speedup over FP32 on a single GPU (~24.3M items/sec).
However, XLA is still considered experimental.
### NVIDIA DGX-1 (8x V100 32G)
#### NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 8x V100 32G GPUs with
@ -472,7 +479,7 @@ The performance was measured by the wall clock time over one training epoch.
The number of samples in the epoch (roughly 100 million samples), was then
divided by the average training duration to obtain the items per second metric.
## Inference Performance Results
### Inference Performance Results
Our results were obtained by running the `ncf.py` training script in the
TensorFlow 19.03-py3 NGC container on a NVIDIA DGX-1 with 1x V100 16G GPUs.
@ -488,14 +495,16 @@ achieve.
| 4 | 88,255,971 | 66,625,422 | 1.32x |
| 8 | 119,159,304 | 100,117,608 | 1.19x |
## Changelog
## Release Notes
### Changelog
March 2019
* Initial Release
## Known Issues
### Known Issues
### Multi-GPU Scaling Efficiency
#### Multi-GPU Scaling Efficiency
Currently, this model does not exhibit good scaling efficiency when scaling to
4 and 8 GPUs. Since we could not find hyper-parameters that could hit the
@ -505,7 +514,7 @@ to a more common weak scaling strategy. Additionally, we believe that the small
dataset size does not facilitate great scaling. However, the training scripts
allow the use of custom datasets provided they are in the correct format.
### Scaling beyond 8 GPUs
#### Scaling beyond 8 GPUs
Neural Collaborative Filtering (NCF) is a relatively lightweight model that
trains quickly with this relatively smaller dataset, ml-20m. Because of the
View file
@ -0,0 +1,127 @@
#
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
import os
import json
import argparse
import numpy as np
import tensorflow as tf
from neumf import ncf_model_ops
def parse_args():
parser = argparse.ArgumentParser(description="Benchmark inference performance of the NCF model")
parser.add_argument('--load_checkpoint_path', default=None, type=str,
help='Path to the checkpoint file to be loaded. If None will use random weights')
parser.add_argument('--n_users', default=138493, type=int,
help='Number of users. Defaults to the number of users in the ml-20m dataset after preprocessing')
parser.add_argument('--n_items', default=26744, type=int,
help='Number of items. Defaults to the number of items in the ml-20m dataset after preprocessing')
parser.add_argument('-f', '--factors', type=int, default=64,
help='Number of predictive factors')
parser.add_argument('--layers', nargs='+', type=int,
default=[256, 256, 128, 64],
help='Sizes of hidden layers for MLP')
parser.add_argument('--batch_size', default=1, type=int, help='Batch size for inference')
parser.add_argument('--num_batches', default=20, type=int,
help='Number of batches for which to measure latency and throughput')
parser.add_argument('--no_amp', dest='amp', action='store_false', default=True,
help='Disable mixed precision')
parser.add_argument('--xla', dest='xla', action='store_true', default=False,
help='Enable XLA')
parser.add_argument('--log_path', default='nvlog.json', type=str,
help='Path to the file in which to store benchmark results')
return parser.parse_args()
def main():
args = parse_args()
if args.amp:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
# Input tensors
users = tf.placeholder(tf.int32, shape=(None,))
items = tf.placeholder(tf.int32, shape=(None,))
dropout = tf.placeholder_with_default(0.0, shape=())
# Model ops and saver
logits_op = ncf_model_ops(
users=users,
items=items,
labels=None,
dup_mask=None,
params={
'fp16': False,
'val_batch_size': args.batch_size,
'num_users': args.n_users,
'num_items': args.n_items,
'num_factors': args.factors,
'mf_reg': 0,
'layer_sizes': args.layers,
'layer_regs': [0. for i in args.layers],
'dropout': 0.0,
'sigmoid': True,
'top_k': None,
'learning_rate': None,
'beta_1': None,
'beta_2': None,
'epsilon': None,
'loss_scale': None,
},
mode='INFERENCE'
)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
if args.xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)
saver = tf.train.Saver()
if args.load_checkpoint_path:
saver.restore(sess, args.load_checkpoint_path)
else:
# Manually initialize weights with random values (no checkpoint provided)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
users_batch = np.random.randint(size=args.batch_size, low=0, high=args.n_users)
items_batch = np.random.randint(size=args.batch_size, low=0, high=args.n_items)
latencies = []
for _ in range(args.num_batches):
start = time.time()
logits = sess.run(logits_op, feed_dict={users: users_batch, items: items_batch, dropout: 0.0 })
latencies.append(time.time() - start)
results = {
'args' : vars(args),
'best_inference_throughput' : args.batch_size / min(latencies),
'best_inference_latency' : min(latencies),
'inference_latencies' : latencies
}
print('RESULTS: ', json.dumps(results, indent=4))
if args.log_path is not None:
json.dump(results, open(args.log_path, 'w'), indent=4)
if __name__ == '__main__':
main()
View file
@ -1,5 +1,4 @@
import numpy as np
import tensorflow as tf
import cupy as cp
def generate_negatives(neg_users, true_mat, item_range, sort=False, use_trick=False):
@ -29,7 +28,7 @@ def generate_negatives(neg_users, true_mat, item_range, sort=False, use_trick=Fa
neg_users = cp.concatenate(neg_u)
neg_items = cp.concatenate(neg_i)
if sort == False:
if not sort:
return neg_users, neg_items
sorted_users = cp.sort(neg_users)
@ -56,7 +55,6 @@ class DataGenerator():
pos_eval_items, # type: np.ndarray
eval_users_per_batch, # type: int
eval_negative_samples, # type: int
use_neg_trick=False, # type: bool
):
# Check input data
if train_users.shape != train_items.shape:
@ -86,7 +84,6 @@ class DataGenerator():
self._pos_eval_items = pos_eval_items
self.eval_users_per_batch = eval_users_per_batch
self._eval_negative_samples = eval_negative_samples
self.use_neg_trick = use_neg_trick
# Eval data
self.eval_users = None
@ -108,9 +105,9 @@ class DataGenerator():
neg_eval_users_base = cp.repeat(pos_eval_users, self._eval_negative_samples)
# Generate negative samples
test_u_neg, test_i_neg = generate_negatives(
neg_eval_users_base, neg_mat, self.num_items, True
)
test_u_neg, test_i_neg = generate_negatives(neg_users=neg_eval_users_base, true_mat=neg_mat,
item_range=self.num_items, sort=True, use_trick=False)
test_u_neg = test_u_neg.reshape((-1, self._eval_negative_samples)).get()
test_i_neg = test_i_neg.reshape((-1, self._eval_negative_samples)).get()
@ -150,21 +147,20 @@ class DataGenerator():
is_neg = cp.logical_not(self._train_labels)
# Do not store verification matrix if using the negatives generation shortcut
neg_mat = None if self.use_neg_trick else cp.array(self._neg_mat)
neg_mat = None
# If there are no negative samples in the local portion of the training data, do nothing
any_neg = cp.any(is_neg)
if any_neg:
self._train_users[is_neg], self._train_items[is_neg] = generate_negatives(
self._train_users[is_neg], neg_mat, self.num_items, use_trick=self.use_neg_trick
self._train_users[is_neg], neg_mat, self.num_items, use_trick=True
)
shuffled_order = cp.random.permutation(self._train_users.shape[0])
self._train_users = self._train_users[shuffled_order]
self._train_items = self._train_items[shuffled_order]
self._train_labels = self._train_labels[shuffled_order]
is_neg = cp.logical_not(self._train_labels)
# Manually create batches
split_indices = np.arange(batch_size, self._train_users.shape[0], batch_size)
self.train_users_batches = np.split(self._train_users, split_indices)
View file
@ -55,7 +55,7 @@ def parse_args():
" Filtering model")
parser.add_argument('--data', type=str,
help='path to test and training data files')
parser.add_argument('-e', '--epochs', type=int, default=40,
parser.add_argument('-e', '--epochs', type=int, default=30,
help='number of epochs to train for')
parser.add_argument('-b', '--batch-size', type=int, default=1048576,
help='number of examples for each iteration')
@ -98,12 +98,11 @@ def parse_args():
parser.add_argument('--checkpoint-dir', default='/data/checkpoints/', type=str,
help='Path to the store the result checkpoint file for training, \
or to read from for evaluation')
parser.add_argument('--load-checkpoint-path', default=None, type=str,
help='Path to the checkpoint for initialization. If None will initialize with random weights')
parser.add_argument('--mode', choices=['train', 'test'], default='train', type=str,
help='Passing "test" will only run a single evaluation, \
otherwise full training will be performed')
parser.add_argument('--no-neg-trick', action='store_true', dest='no_neg_trick',
help='do not use negative sample generation shortcut to speed up data \
augmentation (will increase GPU memory consumption)')
parser.add_argument('--eval-after', type=int, default=8,
help='Perform evaluations only after this many epochs')
parser.add_argument('--verbose', action='store_true',
@ -233,7 +232,6 @@ def main():
test_items,
args.valid_users_per_batch,
args.valid_negative,
use_neg_trick=False if args.no_neg_trick else True
)
# Create tensorflow session and saver
@ -274,7 +272,7 @@ def main():
'sigmoid': True,
'loss_scale': args.loss_scale
},
eval_only=False if args.mode == 'train' else True
mode='TRAIN' if args.mode == 'train' else 'EVAL'
)
saver = tf.train.Saver()
@ -287,11 +285,16 @@ def main():
# Prepare evaluation data
data_generator.prepare_eval_data()
if args.load_checkpoint_path:
saver.restore(sess, args.load_checkpoint_path)
else:
# Manual initialize weights
sess.run(tf.global_variables_initializer())
# If test mode, run one eval
if args.mode == 'test':
saver.restore(sess, args.checkpoint_dir)
eval_start = time.time()
sess.run(tf.local_variables_initializer())
eval_start = time.time()
for user_batch, item_batch, dup_batch \
in zip(data_generator.eval_users, data_generator.eval_items, data_generator.dup_mask):
sess.run(
@ -316,6 +319,9 @@ def main():
if hvd.rank() == 0:
LOGGER.log("Eval Time: {:.4f}, HR: {:.4f}, NDCG: {:.4f}"
.format(eval_duration, hit_rate, ndcg))
eval_throughput = pos_test_users.shape[0] * (args.valid_negative + 1) / eval_duration
LOGGER.log('Average Eval Throughput: {:.4f}'.format(eval_throughput))
return
# Performance Metrics
@ -326,8 +332,6 @@ def main():
time_to_train = 0.0
best_hr = 0
best_epoch = 0
# Manual initialize weights
sess.run(tf.global_variables_initializer())
# Buffers for global metrics
global_hr_sum = np.ones(1)
global_hr_count = np.ones(1)
@ -419,6 +423,7 @@ def main():
if hit_rate > best_hr:
best_hr = hit_rate
best_epoch = epoch
time_to_best = time.time() - begin_train
if not args.verbose:
log_string = "New Best Epoch: {:02d}, Train Time: {:.4f}, Eval Time: {:.4f}, HR: {:.4f}, NDCG: {:.4f}"
LOGGER.log(
@ -441,6 +446,11 @@ def main():
eval_times = np.array(eval_times)
eval_throughputs = pos_test_users.shape[0]*(args.valid_negative+1) / eval_times
LOGGER.log(' ')
LOGGER.log('batch_size: {}'.format(args.batch_size))
LOGGER.log('num_gpus: {}'.format(hvd.size()))
LOGGER.log('AMP: {}'.format(1 if args.amp else 0))
LOGGER.log('seed: {}'.format(args.seed))
LOGGER.log('Minimum Train Time per Epoch: {:.4f}'.format(np.min(train_times)))
LOGGER.log('Average Train Time per Epoch: {:.4f}'.format(np.mean(train_times)))
LOGGER.log('Average Train Throughput: {:.4f}'.format(np.mean(train_throughputs)))
@ -449,6 +459,7 @@ def main():
LOGGER.log('Average Eval Throughput: {:.4f}'.format(np.mean(eval_throughputs)))
LOGGER.log('First Epoch to hit: {}'.format(first_to_target))
LOGGER.log('Time to Train: {:.4f}'.format(time_to_train))
LOGGER.log('Time to Best: {:.4f}'.format(time_to_best))
LOGGER.log('Best HR: {:.4f}'.format(best_hr))
LOGGER.log('Best Epoch: {}'.format(best_epoch))
View file
@ -148,7 +148,7 @@ def ncf_model_ops(users,
labels,
dup_mask,
params,
eval_only=False):
mode='TRAIN'):
"""
Constructs the training and evaluation graphs
"""
@ -172,7 +172,6 @@ def ncf_model_ops(users,
sigmoid = False #params['sigmoid']
loss_scale = params['loss_scale']
is_training = True
model_dtype = tf.float16 if fp16 else tf.float32
# If manually enabling mixed precision, use the custom variable getter
@ -196,6 +195,9 @@ def ncf_model_ops(users,
)
logits = tf.squeeze(logits)
if mode == 'INFERENCE':
return logits
# Evaluation Ops
found_positive, dcg = compute_eval_metrics(logits, dup_mask, val_batch_size, K)
# Metrics
@ -204,7 +206,7 @@ def ncf_model_ops(users,
eval_op = tf.group(hit_rate[1], ndcg[1])
if eval_only:
if mode == 'EVAL':
return hit_rate[0], ndcg[0], eval_op, None
# Labels
View file
@ -14,7 +14,7 @@ set -e
DATASET_NAME=${1:-'ml-20m'}
RAW_DATADIR=${2:-'/data'}
CACHED_DATADIR="${RAW_DATADIR}/cache/${DATASET_NAME}"
CACHED_DATADIR=${3:-"${RAW_DATADIR}/cache/${DATASET_NAME}"}
# you can add another option to this case in order to support other datasets
case ${DATASET_NAME} in
View file
@ -4,29 +4,31 @@ This repository provides a script and recipe to train U-Net Industrial to achiev
accuracy on the dataset DAGM2007, and is tested and maintained by NVIDIA.
# Table of Contents
## Table of Contents
* [The model](#the-model)
* [Model overview](#model-overview)
* [Default Configuration](#default-configuration)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick start guide](#quick-start-guide)
* [Details](#details)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Command line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Training process](#training-process)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training performance results](#training-performance-results)
* [Inference performance results](#inference-performance-results)
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training performance results](#training-performance-results)
* [Inference performance results](#inference-performance-results)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
# The model
## Model overview
This U-Net model is adapted from the original version of the [U-Net model](https://arxiv.org/abs/1505.04597) which is
a convolutional auto-encoder for 2D image segmentation. U-Net was first introduced by
@ -63,7 +65,7 @@ by an environmental variable.
The following performance optimizations were implemented in this model:
* [XLA](https://www.tensorflow.org/xla) support (experimental)
## Default Configuration
### Default Configuration
This model trains for 2500 epochs, under the following setup:
@ -95,12 +97,42 @@ This model trains in 2500 epochs, under the following setup:
* Weight decay: 1e-5
# Setup
### Enabling mixed precision
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing minimal information in single-precision to retain as much
information as possible in critical parts of the network. Since the introduction of
[tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training
speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically
intense model architectures. Using
[mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously
required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision
methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing
TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow
framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16,
and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with
the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to
perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and
[Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
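As an illustration (the provided `UNet_{FP32, AMP}_*.sh` scripts take care of this for you), the TF-AMP graph rewrite can be requested by setting the environment variable before TensorFlow builds the graph, following the same pattern as the other models in this release:
```python
import os

# Request the TF-AMP graph rewrite before TensorFlow constructs the graph;
# supported optimizers then apply automatic loss scaling internally
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
```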
## Setup
The following section lists the requirements in order to start training the U-Net model
(only the `TinyUNet` model is provided here).
## Requirements
### Requirements
This repository contains a Dockerfile which extends the TensorFlow NGC container and encapsulates some dependencies.
Aside from these dependencies, ensure you have the following components:
@ -116,19 +148,19 @@ following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
- [Running TensorFlow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running).
# Quick start guide
## Quick Start Guide
To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the
default configuration of the U-Net model (only `TinyUNet` has been made available here) on the DAGM2007 dataset.
##### Clone the repository
### Clone the repository
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNetIndustrial
```
##### Download and preprocess the dataset: DAGM2007
### Download and preprocess the dataset: DAGM2007
To download the dataset, you can execute the following:
@ -138,7 +170,7 @@ In order to download the dataset. You can execute the following:
**Important Information:** Some files of the dataset require an account to be downloaded; the script will invite you to download them manually and place them in the correct directory.
##### Build and start the docker container based on the TensorFlow NGC container.
### Build and start the docker container based on the TensorFlow NGC container.
```bash
# Build the docker container
@ -152,7 +184,7 @@ nvidia-docker run -it --rm \
unet_industrial:latest
```
##### Run training
### Run training
To run training for a default configuration (as described in Default configuration, for example 1/4/8 GPUs,
FP32/TF-AMP), launch one of the scripts in the `./scripts` directory called
@ -170,7 +202,7 @@ cd scripts/
./UNet_FP32_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
##### Run evaluation
### Run evaluation
Model evaluation on a checkpoint can be launched by running one of the scripts in the `./scripts` directory
called `./scripts/UNet_{FP32, AMP}_EVAL.sh`.
@ -187,11 +219,11 @@ cd scripts/
./UNet_FP32_EVAL.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
# Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
## Command line options
### Command line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
@ -228,7 +260,7 @@ model
`--use_auto_loss_scaling` Use AutoLossScaling in TF-AMP
## Getting the data
### Getting the data
The U-Net model was trained with the [Weakly Supervised Learning for Industrial Optical Inspection (DAGM 2007)](https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html) dataset.
@ -240,7 +272,7 @@ The U-Net model was trained with the [Weakly Supervised Learning for Industrial
**Source:** https://resources.mpi-inf.mpg.de/conference/dagm/2007/prizes.html
### Data description
#### Data description
> The provided data is artificially generated, but similar to real world problems. It consists of multiple data sets, each consisting of 1000 images showing the background texture without defects, and of 150 images with one labeled defect each on the background texture. The images in a single data set are very similar, but each data set is generated by a different texture model and defect model.
@ -257,7 +289,7 @@ The number of classes and sub-challenges for the development set is 6.
- A competition set, which requires an account and can be downloaded from [here](https://hci.iwr.uni-heidelberg.de/node/3616).
The number of classes and sub-challenges for the competition set is 10.
### Challenge description
#### Challenge description
The challenge consists of designing a single model with a set of predefined hyper-parameters which will not change
across the 10 different classes or sub-challenges of the competition set.
@ -265,15 +297,15 @@ across the 10 different classes or sub-challenges of the competition set.
The performance shall be measured on the competition set, which is normalized and more complex than the public dataset,
while offering the most unbiased evaluation method.
## Training Process
### Training Process
### Laplace Smoothing
#### Laplace Smoothing
We use this technique in the DICE loss to improve training efficiency. It consists of replacing the
epsilon parameter (a very small value, around +/- 1e-7, used to avoid dividing by zero) with 1. You can find more information at:
[https://en.wikipedia.org/wiki/Additive_smoothing](https://en.wikipedia.org/wiki/Additive_smoothing)
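For illustration, a minimal sketch of a DICE loss with this smoothing (names and tensor shapes are ours, not the repository's; the `smooth=1` term plays the role of the epsilon discussed above):

```python
import tensorflow as tf

def dice_loss(y_pred, y_true, smooth=1.0):
    # y_pred and y_true are assumed to be NHWC probability maps.
    # Laplace smoothing: the +1 terms keep the ratio stable when both
    # prediction and label are close to zero, instead of a tiny epsilon.
    intersection = tf.reduce_sum(y_pred * y_true, axis=(1, 2, 3))
    union = tf.reduce_sum(y_pred, axis=(1, 2, 3)) + tf.reduce_sum(y_true, axis=(1, 2, 3))
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(dice)
```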
### Adaptive Loss
#### Adaptive Loss
The DICE loss is not able to provide a meaningful gradient at initialisation. This leads to model instability, which
often pushes the model to diverge. Nonetheless, once the model starts to converge, the DICE loss is able to very efficiently
@ -285,46 +317,18 @@ fully train the model. Therefore, we implemented an *adaptive loss* which is com
The model is trained with the BCE loss until the DICE loss reaches an experimentally defined threshold (0.3).
Thereafter, the DICE loss is used to finish training.
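A sketch of this switching logic under the stated assumptions (illustrative names; `bce_loss` and `dice_loss` are scalar loss tensors computed elsewhere):

```python
import tensorflow as tf

def adaptive_loss(bce_loss, dice_loss, threshold=0.3):
    # Train on BCE while the DICE loss is still above the experimentally
    # defined threshold; once DICE becomes informative, use it to finish.
    return tf.cond(dice_loss < threshold,
                   true_fn=lambda: dice_loss,
                   false_fn=lambda: bce_loss)
```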
### Weak Labelling
#### Weak Labelling
This dataset is referred to as weakly labelled. That means that the segmentation labels are not given at the pixel level
but rather in an approximate fashion.
## Enabling mixed precision
## Performance
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing
operations in half-precision format, while storing minimal information in single-precision to retain as much
information as possible in critical parts of the network. Since the introduction of
[tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training
speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically
intense model architectures. Using
[mixed precision training](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) previously
required two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.
This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision
methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing
TensorFlow model code. AMP enables mixed precision training on Volta and Turing GPUs automatically. The TensorFlow
framework code makes all necessary model changes internally.
In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16,
and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with
the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to
perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all
computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and
[Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
# Benchmarking
### Benchmarking
The following sections show how to run benchmarks measuring the model performance in training and inference modes.
## Training performance benchmark
#### Training performance benchmark
To benchmark the training performance, you can run one of the scripts in the `./scripts/benchmarking/` directory
called `./scripts/benchmarking/DGX1v_trainbench_{FP32, AMP}_{1, 4, 8}GPU.sh`.
@ -341,7 +345,7 @@ cd scripts/benchmarking/
./DGX1v_trainbench_FP32_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
## Inference performance benchmark
#### Inference performance benchmark
To benchmark the inference performance, you can run one of the scripts in the `./scripts/benchmarking/` directory
called `./scripts/benchmarking/DGX1v_evalbench_{FP32, AMP}_{1, 4, 8}GPU.sh`.
@ -358,16 +362,16 @@ cd scripts/benchmarking/
./DGX1v_evalbench_FP16_1GPU.sh <path to result repository> <path to dataset> <DAGM2007 classID (1-10)>
```
# Results
### Results
The following sections provide details on the achieved results in training accuracy, training performance, and inference performance.
## Training accuracy results
#### Training accuracy results
Our results were obtained by running the `./scripts/UNet_{FP32, AMP}_{1, 4, 8}GPU.sh` training
script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
### Threshold = 0.75
##### Threshold = 0.75
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -392,7 +396,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.979 | 100.00 | 99.43 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.70 |
### Threshold = 0.85
##### Threshold = 0.85
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -417,7 +421,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.980 | 100.00 | 99.45 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.71 |
### Threshold = 0.95
##### Threshold = 0.95
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -442,7 +446,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.980 | 100.00 | 99.54 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.72 |
### Threshold = 0.99
##### Threshold = 0.99
| # DAGM Class ID | Precision | IoU (Intersection over Union) | TPR (True Positive Rate) | TNR (True Negative Rate) |
|-----------------|---------------------------------|-------------------------------|--------------------------|--------------------------|
@ -467,7 +471,7 @@ script in the Tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16
| 10 | FP32 | 0.981 | 100.00 | 99.62 |
| 10 | Automatic Mixed Precision (AMP) | 0.982 | 100.00 | 99.78 |
## Training performance results
#### Training performance results
<!-- Spreadsheet to Markdown: https://thisdavej.com/copy-table-in-excel-and-paste-as-a-markdown-table/ -->
@ -485,9 +489,9 @@ TensorFlow 19.03-py3 NGC container on an NVIDIA DGX-1 with 8 V100 16G GPUs.
| 8 | FP32 | 445 | 1m44 | 1.00 |
| 8 | Automatic Mixed Precision (AMP) | 491 | 1m36 | 1.10 |
To achieve these same results, follow the [Quick start guide](#quick-start-guide) outlined above.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Inference performance results
#### Inference performance results
Our results were obtained by running the aforementioned scripts in the TensorFlow
19.03-py3 NGC container on an NVIDIA DGX-1 server with 8 V100 16G GPUs.
@ -497,10 +501,13 @@ Our results were obtained by running the aforementioned scripts in the TensorFlo
| 1 | FP32 | 228 | 1.00 |
| 1 | Automatic Mixed Precision (AMP) | 301 | 1.32 |
To achieve these same results, follow the [Quick start guide](#quick-start-guide) outlined above.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
# Changelog
1. **March 18, 2019:** Initial release
## Release notes
# Known issues
### Changelog
March 18, 2019
* Initial release
### Known issues
There are no known issues with this model.
View file
@ -1,12 +0,0 @@
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
View file
@ -1,4 +1,4 @@
FROM nvcr.io/nvidia/tensorflow:19.05-py3
FROM nvcr.io/nvidia/tensorflow:19.06-py3
ADD . /workspace/unet
WORKDIR /workspace/unet
View file
@ -1,57 +1,52 @@
# UNet
# UNet Medical Image Segmentation for TensorFlow
This repository provides a script and recipe to train U-Net Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
## Table of contents
1. [The model](#1-the-model)
1. [Default configuration](#11-default-configuration)
2. [Model architecture](#12-model-architecture)
3. [Feature support matrix](#13-feature-support-matrix)
1. [Features](##131)
2. [Setup](#2-setup)
1. [Requirements](#21-requirements)
3. [Quick start guide](#3-quick-start-guide)
1. [Clone the repository](#31-clone-the-repository)
2. [Download and preprocess the dataset](#32-download-and-preprocess-the-dataset)
3. [Build the U-Net TensorFlow container](#33-build-and-start-the-docker-container-based-on-the-tensorflow-ngc-container)
4. [Start an interactive session in the NGC container to run training/inference](#34-start-an-interactive-session-in-the-ngc-container-to-run-traininginference)
5. [Start training](#35-start-training)
6. [Start inference/predictions](#36-start-inferencepredictions)
4. [Details](#4-details)
1. [Scripts and sample code](#41-scripts-and-sample-code)
2. [Parameters](#42-parameters)
3. [Command line options](#43-command-line-options)
4. [Getting the data](#44-getting-the-data)
1. [Dataset guidelines](#441-dataset-guidelines)
5. [Training process](#45-training-process)
1. [Optimizer](#451-optimizer)
2. [Augmentation](#452-augmentation)
6. [Inference process](#46-inference-process)
5. [Mixed precision training](#5-mixed-precision-training)
1. [Enabling mixed precision](#51-enabling-mixed-precision)
6. [Benchmarking](#6-benchmarking)
1. [Training performance benchmark](#61-training-performance-benchmark)
2. [Inference performance benchmark](#62-inference-performance-benchmark)
7. [Results](#7-results)
1. [Training accuracy results](#71-training-accuracy-results)
1. [NVIDIA DGX-1 (8x V100 16G)](#711-nvidia-dgx-1-8x-v100-16g)
2. [Training performance results](#72-training-performance-results)
1. [NVIDIA DGX-1 (1x V100 16G)](#721-nvidia-dgx-1-1x-v100-16g)
2. [NVIDIA DGX-1 (8x V100 16G)](#721-nvidia-dgx-1-8x-v100-16g)
3. [Inference performance results](#73-inference-performance-results)
1. [NVIDIA DGX-1 (1x V100 16G)](#731)
7. [Glossary](#7-glossary)
8. [Changelog](#8-changelog)
9. [Known issues](#9-known-issues)
* [Model overview](#model-overview)
* [Default configuration](#default-configuration)
* [Model architecture](#model-architecture)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Training process](#training-process)
* [Optimizer](#optimizer)
* [Augmentation](#augmentation)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## 1. The model
## Model overview
The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
This model is trained with mixed precision using tensor cores on NVIDIA Volta GPUs. Therefore, researchers can get results much faster than training without Tensor Cores, while experiencing the benefits of mixed precision training (for example, up to 3.5x performance boost). This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### 1.1. Model architecture
### Model architecture
U-Net was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation. U-Net allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
@ -59,7 +54,7 @@ The following figure shows the construction of the UNet model and its different
![UNet](images/unet.png)
### 1.2. Default configuration
### Default configuration
U-Net consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and a concatenation with the correspondingly cropped feature map from the contractive path.
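As a rough sketch of one contracting-path step implied by this description (illustrative only; the repository's actual building blocks live in `model/layers.py`):

```python
import tensorflow as tf

def downsample_block(inputs, filters):
    # Two unpadded (VALID) 3x3 convolutions, as in the original paper.
    x = tf.layers.conv2d(inputs, filters, kernel_size=3,
                         padding='valid', activation=tf.nn.relu)
    skip = tf.layers.conv2d(x, filters, kernel_size=3,
                            padding='valid', activation=tf.nn.relu)
    # The pre-pooling activations are kept as the skip connection that the
    # expanding path later crops and concatenates.
    out = tf.layers.max_pooling2d(skip, pool_size=2, strides=2)
    return out, skip
```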
@ -72,7 +67,7 @@ The following features were implemented in this model:
The following performance optimizations were implemented in this model:
* XLA support (experimental). For TensorFlow, mixed precision support is available through NVIDIA's TF-AMP, which requires minimal network code changes to leverage tensor cores performance.
### 1.3. Feature support matrix
### Feature support matrix
The following features are supported by this model.
@ -80,19 +75,47 @@ The following features are supported by this model.
|:---:|:--------:|
| Horovod Multi-GPU (NCCL) | Yes |
### 1.3.1. Features
#### Features
**Horovod** - Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
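A minimal sketch of how Horovod typically hooks into a TF1 training setup (our illustration, using the hyper-parameter defaults listed in the Parameters section; the repository's own wiring, including `hvd.BroadcastGlobalVariablesHook`, appears in the training code later in this commit):

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # one process per GPU

# Wrap the optimizer so gradients are averaged across workers; scaling the
# learning rate by hvd.size() is the usual multi-GPU convention.
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(),
                                       momentum=0.99)
optimizer = hvd.DistributedOptimizer(optimizer)

# Rank 0 broadcasts its initial variables so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```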
## 2. Setup
### Mixed precision training
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
In order to enable mixed precision training, the following environment variable must be defined with the correct value before the training starts:
```
TF_ENABLE_AUTO_MIXED_PRECISION=1
```
Exporting this variable ensures that loss scaling is performed correctly and automatically.
By supplying the `--use_amp` flag to the `main.py` script while training in FP32, this variable is set to its correct value for mixed precision training inside the `./utils/runner.py` script:
```
if params['use_amp']:
LOGGER.log("TF AMP is activated")
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
## Setup
The following section lists the requirements in order to start training the U-Net model.
### 2.1. Requirements
### Requirements
This repository contains a `Dockerfile` which extends the TensorFlow NGC container and encapsulates some additional dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [tensorflow:19.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
* [tensorflow:19.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
* [NVIDIA Volta based GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning DGX Documentation:
@ -101,17 +124,17 @@ For more information about how to get started with NGC containers, see the follo
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running Tensorflow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
## 3. Quick start guide
## Quick Start Guide
To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the default parameters of the U-Net model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home).
### 3.1. Clone the repository
### Clone the repository
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Medical
```
### 3.2. Download and preprocess the dataset
### Download and preprocess the dataset
The U-Net script main.py operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [U-Net paper](https://arxiv.org/abs/1505.04597). Upon registration, the challenge's data is made available through the following links:
@ -135,14 +158,14 @@ Once downloaded the data using the `download_dataset.py` script, it can be used
**Note:** Masks are only provided for training data.
### 3.3. Build the U-Net TensorFlow container
### Build the U-Net TensorFlow container
After Docker is correctly set up, the U-Net TensorFlow container can be built with:
```
user@~/Documents/unet_medical_tf # docker build -t unet_tf .
```
### 3.4. Start an interactive session in the NGC container to run training/inference.
### Start an interactive session in the NGC container to run training/inference.
Run the previously built Docker container:
```
@ -150,7 +173,7 @@ user@~/path/to/unet_medical_tf # docker run --runtime=nvidia --rm -it --shm-size
```
**Note:** Ensure you mount your dataset using the `-v` flag to make it available for training inside the NVIDIA Docker container.
### 3.5. Start training
### Start training
To run training for a default configuration (for example 1/8 GPUs FP32/TF-AMP), run one of the scripts in the `./examples` directory, as follows:
```
@ -161,17 +184,21 @@ For example:
root@8e522945990f:/workspace/unet# bash examples/unet_FP32_1GPU.sh . /data results
```
### 3.6. Start inference/predictions
### Start inference/predictions
To run inference on a checkpointed model, run:
```
python main.py --data_dir /data --model_dir <path to checkpoint> --exec_mode predict
bash examples/unet_INFER_{FP32, TF-AMP}.sh <path to main.py> <path to dataset> <path to results directory>
```
For example:
```
root@8e522945990f:/workspace/unet# bash examples/unet_INFER_FP32.sh . /data results
```
## 4. Details
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### 4.1. Scripts and sample code
### Scripts and sample code
In the root directory, the most important files are:
* `main.py`: Serves as the entry point to the application.
@ -179,13 +206,13 @@ In the root directory, the most important files are:
* `requirements.txt`: Set of extra requirements for running UNet
* `download_data.py`: Automatically downloads the dataset for training
The utils/ folder encapsulates the necessary tools to train and perform inference using UNet. Its main components are:
The `utils/` folder encapsulates the necessary tools to train and perform inference using UNet. Its main components are:
* `runner.py`: Implements the logic for training and inference
* `data_loader.py`: Implements the data loading and augmentation
* `hooks/profiler.py`: Collects different metrics to be used for benchmarking and testing
* `var_storage.py`: Helper functions for TF-AMP
The model/ folder contains information about the building blocks of UNet and the way they are assembled. Its contents are:
The `model/` folder contains information about the building blocks of UNet and the way they are assembled. Its contents are:
* `layers.py`: Defines the different blocks that are used to assemble UNet
* `unet.py`: Defines the model architecture using the blocks from the `layers.py` script
@ -194,7 +221,7 @@ Other folders included in the root directory are:
* `examples/`: Provides examples for training and benchmarking UNet
* `images/`: Contains a model diagram
### 4.2. Parameters
### Parameters
The complete list of the available parameters for the main.py script contains:
* `--exec_mode`: Select the execution mode to run the model (default: train_and_predict)
* `--model_dir`: Set the output directory for information related to the model (default: result/)
@ -213,7 +240,7 @@ The complete list of the available parameters for the main.py script contains:
* `--benchmark`: Enable performance benchmarking (default: False)
* `--use_amp`: Enable automatic mixed precision (default: False)
### 4.3. Command line options
### Command line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
```
@ -239,7 +266,7 @@ usage: main.py [-h]
[--use_amp]
```
### 4.4. Getting the data
### Getting the data
The U-Net model was trained on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission.
@ -249,7 +276,7 @@ Training and test data is comprised of three 512x512x30 `TIF` volumes (`test-vol
The objective is to produce a set of masks that segment the data as accurately as possible. The results are expected to be submitted as a 32-bit `TIF` 3D image, with values between `0` (100% membrane certainty) and `1` (100% non-membrane certainty).
#### 4.4.1 Dataset guidelines
#### Dataset guidelines
The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data_loader.py` script.
@ -268,9 +295,9 @@ If augmentation is enabled, the following set of augmentation techniques are app
At the end, intensities are clipped to the `[-1, 1]` interval.
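As a small illustrative fragment of that final step (names are ours; the actual pipeline lives in `utils/data_loader.py`):

```python
import tensorflow as tf

def clip_intensities(image):
    # Final normalization step of the input pipeline: keep pixel
    # intensities inside the [-1, 1] interval after augmentation.
    return tf.clip_by_value(image, -1.0, 1.0)
```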
### 4.5. Training process
### Training process
#### 4.5.1. Optimizer
#### Optimizer
The model trains for 40,000 batches, with the default U-Net setup as specified in the [original paper](https://arxiv.org/abs/1505.04597):
@ -298,7 +325,7 @@ Use `-h` or `--help` to obtain a list of available options in the `main.py` scri
Use the `--model_dir` flag to select the location where the artifacts of the training are stored.
### 4.6. Inference process
### Inference process
To run inference on a checkpointed model, run the script below; note that it requires a pre-trained model checkpoint and the test dataset.
```
python main.py --data_dir /data --model_dir <path to checkpoint> --exec_mode predict
```
This script should produce the prediction results over a set of masks which will be located in `<path to checkpoint>/eval`.
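To sanity-check the produced masks, a small illustrative snippet (the output path below is hypothetical and depends on your `--model_dir`; the multipage `TIF` handling mirrors the data loader shown later in this commit):

```python
import numpy as np
from PIL import Image, ImageSequence

# Hypothetical location of the predicted masks; substitute your checkpoint dir.
masks_path = 'results/eval/test-masks.tif'

# Load every page of the multipage TIF into a single numpy array.
masks = np.array([np.array(page)
                  for page in ImageSequence.Iterator(Image.open(masks_path))])
print(masks.shape, masks.dtype)  # e.g. (30, 512, 512) for the ISBI volume
```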
## 5. Mixed precision training
## Performance
Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
## 5.1. Enabling mixed precision
In order to enable mixed precision training, the following environment variables must be defined with the correct value before the training starts:
```
TF_ENABLE_AUTO_MIXED_PRECISION=1
```
Exporting these variables ensures that loss scaling is performed correctly and automatically.
By supplying the `--use_amp` flag to the `main.py` script while training in FP32, the following variables are set to their correct value for mixed precision training inside the `./utils/runner.py` script:
```
if params['use_amp']:
assert params['dtype'] == tf.float32, "TF-AMP requires FP32 precision"
LOGGER.log("TF AMP is activated - Experimental Feature")
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
```
## 6. Benchmarking
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
### 6.1. Training performance benchmark
#### Training performance benchmark
To benchmark training, run one of the scripts in `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/main.py> <path/to/dataset> <path/to/checkpoints> <batch size>`.
Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during training in the next 100 iterations. To control warmup and benchmark length, use `--warmup_steps`, and `--max_steps` flags.
### 6.2. Inference performance benchmark
#### Inference performance benchmark
To benchmark inference, run one of the scripts in `./examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh <path/to/main.py> <path/to/dataset> <path/to/checkpoints> <batch size>`.
Each of these scripts will by default run 200 warmup iterations and benchmark the performance during inference in the next 100 iterations. To control warmup and benchmark length, use `--warmup_steps`, and `--max_steps` flags.
## 7. Results
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
### 7.1. Training accuracy results
#### Training accuracy results
#### 7.1.1 NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./examples/unet_{FP32, TF-AMP}_{1, 8}GPU.sh` scripts in the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
Our results were obtained by running the `./examples/unet_{FP32, TF-AMP}_{1, 8}GPU.sh` scripts in the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
Metrics employed by the organization are explained in detail [here](http://brainiac2.mit.edu/isbi_challenge/evaluation).
@ -371,12 +370,12 @@ The results described below were obtained after the submission of our evaluation
|1 | 0.938508265 | 0.970255682 | 0.939619101 | 0.970120138 | 7.1 | 11.28 |
|8 | 0.932395087 | 0.9786346 | 0.941360867 | 0.976235311 | 0.9 | 1.41 |
### 7.2. Training performance results
#### Training performance results
#### 7.2.1 NVIDIA DGX-1 (1x V100 16G)
##### NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_1GPU.sh` scripts in
the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
| **Batch size** | **FP32 max img/s** | **TF-AMP max img/s** | **Speedup factor** |
@ -387,10 +386,10 @@ the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU whil
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
#### 7.2.2 NVIDIA DGX-1 (8x V100 16G)
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_8GPU.sh` scripts in
the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPU while data augmentation is enabled.
the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPU while data augmentation is enabled.
| **Batch size per GPU** | **FP32 max img/s** | **TF-AMP max img/s** | **Speedup factor** |
|:---:|:--------:|:-------:|:-------:|
@ -400,10 +399,12 @@ the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPU whil
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
### 7.3. Inference performance results
#### Inference performance results
#### NVIDIA DGX-1 (1x V100 16G)
Our results were obtained by running the `./examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh` scripts in
the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
| **Batch size** | **FP32 img/s** | **TF-AMP img/s** | **Speedup factor** |
|:---:|:--------:|:-------:|:-------:|
@ -413,11 +414,22 @@ the tensorflow:19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU whil
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## 8. Changelog
## Release notes
### Changelog
July 2019
* Added inference example scripts
* Added inference benchmark measuring latency
* Added TRT/TF-TRT support
* Updated Pre-trained model on NGC registry
June 2019
* Updated README template
May 2019
* Initial release
## 9. Known issues
### Known issues
There are no known issues in this release.
View file
@ -15,4 +15,4 @@
# This script launches U-Net inference benchmarking in FP32 on 1 GPU with the given batch size
# Usage ./unet_INFER_BENCHMARK_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode benchmark --augment --warmup_steps 200 --log_every 100 --max_steps 300
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
View file
@ -15,4 +15,4 @@
# This script launches U-Net inference benchmarking in TF-AMP on 1 GPU with the given batch size
# Usage ./unet_INFER_BENCHMARK_TF-AMP.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --use_amp --exec_mode benchmark --augment --warmup_steps 200 --log_every 100 --max_steps 300
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --use_amp --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference benchmarking in FP32 on 1 GPU with the given batch size
# Usage ./unet_INFER_BENCHMARK_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference in FP32 on 1 GPU
# Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference in TF-AMP on 1 GPU
# Usage ./unet_INFER_TF-AMP.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_amp
View file
@ -0,0 +1,18 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script launches U-Net inference with TF-TRT on 1 GPU
# Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_trt
View file
@ -23,121 +23,23 @@ Example:
"""
import argparse
import os
import pickle
import time
import horovod.tensorflow as hvd
import math
import numpy as np
import tensorflow as tf
from PIL import Image
from dllogger import tags
from dllogger.logger import LOGGER
from utils.runner import Runner
PARSER = argparse.ArgumentParser(description="UNet-medical")
PARSER.add_argument('--exec_mode',
choices=['train', 'train_and_predict', 'predict', 'benchmark'],
type=str,
default='train_and_predict',
help="""Which execution mode to run the model into"""
)
PARSER.add_argument('--model_dir',
type=str,
default='./results',
help="""Output directory for information related to the model"""
)
PARSER.add_argument('--data_dir',
type=str,
required=True,
help="""Input directory containing the dataset for training the model"""
)
PARSER.add_argument('--batch_size',
type=int,
default=1,
help="""Size of each minibatch per GPU""")
PARSER.add_argument('--max_steps',
type=int,
default=1000,
help="""Maximum number of steps (batches) used for training""")
PARSER.add_argument('--seed',
type=int,
default=0,
help="""Random seed""")
PARSER.add_argument('--weight_decay',
type=float,
default=0.0005,
help="""Weight decay coefficient""")
PARSER.add_argument('--log_every',
type=int,
default=100,
help="""Log performance every n steps""")
PARSER.add_argument('--warmup_steps',
type=int,
default=200,
help="""Number of warmup steps""")
PARSER.add_argument('--learning_rate',
type=float,
default=0.01,
help="""Learning rate coefficient for SGD""")
PARSER.add_argument('--momentum',
type=float,
default=0.99,
help="""Momentum coefficient for SGD""")
PARSER.add_argument('--decay_steps',
type=float,
default=5000,
help="""Decay steps for inverse learning rate decay""")
PARSER.add_argument('--decay_rate',
type=float,
default=0.95,
help="""Decay rate for learning rate decay""")
PARSER.add_argument('--augment', dest='augment', action='store_true',
help="""Perform data augmentation during training""")
PARSER.add_argument('--no-augment', dest='augment', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--benchmark', dest='benchmark', action='store_true',
help="""Collect performance metrics during training""")
PARSER.add_argument('--no-benchmark', dest='benchmark', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
help="""Train using TF-AMP""")
PARSER.set_defaults(use_amp=False)
def _cmd_params(flags):
return {
'model_dir': flags.model_dir,
'batch_size': flags.batch_size,
'data_dir': flags.data_dir,
'max_steps': flags.max_steps,
'weight_decay': flags.weight_decay,
'dtype': tf.float32,
'learning_rate': flags.learning_rate,
'momentum': flags.momentum,
'benchmark': flags.benchmark,
'augment': flags.augment,
'exec_mode': flags.exec_mode,
'seed': flags.seed,
'use_amp': flags.use_amp,
'log_every': flags.log_every,
'warmup_steps': flags.warmup_steps,
'decay_steps': flags.decay_steps,
'decay_rate': flags.decay_rate,
}
from utils.cmd_util import PARSER, _cmd_params
from utils.data_loader import Dataset
from utils.hooks.profiling_hook import ProfilingHook
from utils.hooks.training_hook import TrainingHook
from utils.model_fn import unet_fn
def main(_):
@ -160,32 +62,103 @@ def main(_):
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = '1'
os.environ['TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT'] = 'data'
os.environ['TF_ADJUST_HUE_FUSED'] = '1'
os.environ['TF_ADJUST_SATURATION_FUSED'] = '1'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
os.environ['TF_ADJUST_HUE_FUSED'] = 'data'
os.environ['TF_ADJUST_SATURATION_FUSED'] = 'data'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = 'data'
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
os.environ['TF_DISABLE_NVTX_RANGES'] = '1'
if params['use_amp']:
assert params['dtype'] == tf.float32, "TF-AMP requires FP32 precision"
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION']='1'
LOGGER.log("TF AMP is activated - Experimental Feature")
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
hvd.init()
runner = Runner(params)
# Build run config
gpu_options = tf.GPUOptions()
config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
config.gpu_options.force_gpu_compatible = True
config.intra_op_parallelism_threads = 1
config.inter_op_parallelism_threads = max(2, 40 // hvd.size() - 2)
if 'train' in params['exec_mode'] \
or 'train_and predict' in params['exec_mode']:
runner.train()
if 'train_and predict' in params['exec_mode'] \
or 'predict' in params['exec_mode']:
runner.predict()
if 'benchmark' in params['exec_mode']:
runner.benchmark()
run_config = tf.estimator.RunConfig(
save_summary_steps=1,
tf_random_seed=None,
session_config=config,
save_checkpoints_steps=params['max_steps'],
keep_checkpoint_max=1)
# Build the estimator model
estimator = tf.estimator.Estimator(
model_fn=unet_fn,
model_dir=params['model_dir'],
config=run_config,
params=params)
dataset = Dataset(data_dir=params['data_dir'],
batch_size=params['batch_size'],
augment=params['augment'],
gpu_id=hvd.rank(),
num_gpus=hvd.size(),
seed=params['seed'])
if 'train' in params['exec_mode']:
hooks = [hvd.BroadcastGlobalVariablesHook(0),
TrainingHook(params['log_every'])]
if params['benchmark']:
hooks.append(ProfilingHook(params['batch_size'],
params['log_every'],
params['warmup_steps']))
LOGGER.log('Begin Training...')
LOGGER.log(tags.RUN_START)
estimator.train(
input_fn=dataset.train_fn,
steps=params['max_steps'],
hooks=hooks)
LOGGER.log(tags.RUN_STOP)
if 'predict' in params['exec_mode']:
if hvd.rank() == 0:
predict_steps = dataset.test_size
hooks = None
if params['benchmark']:
hooks = [ProfilingHook(params['batch_size'],
params['log_every'],
params['warmup_steps'])]
predict_steps = params['warmup_steps'] * 2 * params['batch_size']
LOGGER.log('Begin Predict...')
LOGGER.log(tags.RUN_START)
predictions = estimator.predict(
input_fn=lambda: dataset.test_fn(count=math.ceil(predict_steps/dataset.test_size)),
hooks=hooks)
binary_masks = [np.argmax(p['logits'], axis=-1).astype(np.uint8) * 255 for p in predictions]
LOGGER.log(tags.RUN_STOP)
multipage_tif = [Image.fromarray(mask).resize(size=(512, 512), resample=Image.BILINEAR)
for mask in binary_masks]
output_dir = os.path.join(params['model_dir'], 'pred')
if not os.path.exists(output_dir):
os.makedirs(output_dir)
multipage_tif[0].save(os.path.join(output_dir, 'test-masks.tif'),
compression="tiff_deflate",
save_all=True,
append_images=multipage_tif[1:])
LOGGER.log("Predict finished")
LOGGER.log("Results available in: {}".format(output_dir))
if __name__ == '__main__':
View file
@ -0,0 +1,114 @@
import argparse
import tensorflow as tf
PARSER = argparse.ArgumentParser(description="UNet-medical")
PARSER.add_argument('--exec_mode',
choices=['train', 'train_and_predict', 'predict'],
type=str,
default='train_and_predict',
help="""Which execution mode to run the model into"""
)
PARSER.add_argument('--model_dir',
type=str,
default='./results',
help="""Output directory for information related to the model"""
)
PARSER.add_argument('--data_dir',
type=str,
required=True,
help="""Input directory containing the dataset for training the model"""
)
PARSER.add_argument('--batch_size',
type=int,
default=1,
help="""Size of each minibatch per GPU""")
PARSER.add_argument('--max_steps',
type=int,
default=1000,
help="""Maximum number of steps (batches) used for training""")
PARSER.add_argument('--seed',
type=int,
default=0,
help="""Random seed""")
PARSER.add_argument('--weight_decay',
type=float,
default=0.0005,
help="""Weight decay coefficient""")
PARSER.add_argument('--log_every',
type=int,
default=100,
help="""Log performance every n steps""")
PARSER.add_argument('--warmup_steps',
type=int,
default=200,
help="""Number of warmup steps""")
PARSER.add_argument('--learning_rate',
type=float,
default=0.01,
help="""Learning rate coefficient for SGD""")
PARSER.add_argument('--momentum',
type=float,
default=0.99,
help="""Momentum coefficient for SGD""")
PARSER.add_argument('--decay_steps',
type=float,
default=5000,
help="""Decay steps for inverse learning rate decay""")
PARSER.add_argument('--decay_rate',
type=float,
default=0.95,
help="""Decay rate for learning rate decay""")
PARSER.add_argument('--augment', dest='augment', action='store_true',
help="""Perform data augmentation during training""")
PARSER.add_argument('--no-augment', dest='augment', action='store_false')
PARSER.set_defaults(augment=False)
PARSER.add_argument('--benchmark', dest='benchmark', action='store_true',
help="""Collect performance metrics during training""")
PARSER.add_argument('--no-benchmark', dest='benchmark', action='store_false')
PARSER.set_defaults(benchmark=False)
PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
help="""Train using TF-AMP""")
PARSER.set_defaults(use_amp=False)
PARSER.add_argument('--use_trt', dest='use_trt', action='store_true',
help="""Use TF-TRT""")
PARSER.set_defaults(use_trt=False)
def _cmd_params(flags):
return {
'model_dir': flags.model_dir,
'batch_size': flags.batch_size,
'data_dir': flags.data_dir,
'max_steps': flags.max_steps,
'weight_decay': flags.weight_decay,
'dtype': tf.float32,
'learning_rate': flags.learning_rate,
'momentum': flags.momentum,
'benchmark': flags.benchmark,
'augment': flags.augment,
'exec_mode': flags.exec_mode,
'seed': flags.seed,
'use_amp': flags.use_amp,
'use_trt': flags.use_trt,
'log_every': flags.log_every,
'warmup_steps': flags.warmup_steps,
'decay_steps': flags.decay_steps,
'decay_rate': flags.decay_rate,
}
View file
@ -42,6 +42,14 @@ class Dataset():
self._num_gpus = num_gpus
self._gpu_id = gpu_id
@property
def train_size(self):
return len(self._train_images)
@property
def test_size(self):
return len(self._test_images)
def _load_multipage_tiff(self, path):
"""Load tiff images containing many images in the channel dimension"""
return np.array([np.array(p) for p in ImageSequence.Iterator(Image.open(path))])
@ -154,10 +162,11 @@ class Dataset():
return dataset
def test_fn(self):
def test_fn(self, count):
"""Input function for testing"""
dataset = tf.data.Dataset.from_tensor_slices(
(self._test_images))
self._test_images)
dataset = dataset.repeat(count=count)
dataset = dataset.map(self._normalize_inputs)
dataset = dataset.batch(self._batch_size)
dataset = dataset.prefetch(self._batch_size)
@ -179,4 +188,4 @@ class Dataset():
dataset = dataset.prefetch(buffer_size=tf.contrib.data.AUTOTUNE)
return dataset
return dataset
View file
@ -0,0 +1,54 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
import tensorflow as tf
import horovod.tensorflow as hvd
from dllogger import LOGGER, tags, AverageMeter
class ProfilingHook(tf.train.SessionRunHook):
def __init__(self, batch_size, log_every, warmup_steps):
self._log_every = log_every
self._warmup_steps = warmup_steps
self._current_step = 0
self._global_batch_size = batch_size * hvd.size()
self._meter = AverageMeter()
self._t0 = 0
def before_run(self, run_context):
if self._current_step % self._log_every == 0:
LOGGER.log('iter_start', self._current_step)
if self._current_step > self._warmup_steps:
self._t0 = time.time()
def after_run(self,
run_context,
run_values):
if self._current_step > self._warmup_steps:
batch_time = time.time() - self._t0
ips = self._global_batch_size / batch_time
self._meter.record(ips)
self._current_step += 1
def begin(self):
pass
def end(self, session):
LOGGER.log('average_images_per_second', self._meter.get_value())
View file
@ -0,0 +1,46 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tensorflow as tf
from dllogger import LOGGER, tags
class TrainingHook(tf.train.SessionRunHook):
def __init__(self, log_every=1):
self._log_every = log_every
self._iter_idx = 0
def before_run(self, run_context):
run_args = tf.train.SessionRunArgs(
fetches=[
'cross_loss_ref:0',
'dice_loss_ref:0',
'total_loss_ref:0',
]
)
return run_args
def after_run(self,
run_context,
run_values):
cross_loss, dice_loss, total_loss = run_values.results
if self._iter_idx % self._log_every == 0:
LOGGER.log('cross_loss', cross_loss)
LOGGER.log('dice_loss', dice_loss)
LOGGER.log('total_loss', total_loss)
self._iter_idx += 1
Some files were not shown because too many files have changed in this diff