# GNMT v2 For PyTorch

This repository provides a script and recipe to train the GNMT v2 model to
achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.

## Table Of Contents

<!-- TOC GFM -->

* [Model overview](#model-overview)
    * [Model architecture](#model-architecture)
    * [Default configuration](#default-configuration)
    * [Feature support matrix](#feature-support-matrix)
        * [Features](#features)
    * [Mixed precision training](#mixed-precision-training)
        * [Enabling mixed precision](#enabling-mixed-precision)
        * [Enabling TF32](#enabling-tf32)
* [Setup](#setup)
    * [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
    * [Scripts and sample code](#scripts-and-sample-code)
    * [Parameters](#parameters)
    * [Command-line options](#command-line-options)
    * [Getting the data](#getting-the-data)
        * [Dataset guidelines](#dataset-guidelines)
    * [Training process](#training-process)
    * [Inference process](#inference-process)
* [Performance](#performance)
    * [Benchmarking](#benchmarking)
        * [Training performance benchmark](#training-performance-benchmark)
        * [Inference performance benchmark](#inference-performance-benchmark)
    * [Results](#results)
        * [Training accuracy results](#training-accuracy-results)
            * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
            * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
            * [Training accuracy: NVIDIA DGX-2H (16x V100 32GB)](#training-accuracy-nvidia-dgx-2h-16x-v100-32gb)
            * [Training stability test](#training-stability-test)
        * [Training throughput results](#training-throughput-results)
            * [Training throughput: NVIDIA DGX A100 (8x A100 40GB)](#training-throughput-nvidia-dgx-a100-8x-a100-40gb)
            * [Training throughput: NVIDIA DGX-1 (8x V100 16GB)](#training-throughput-nvidia-dgx-1-8x-v100-16gb)
            * [Training throughput: NVIDIA DGX-2H (16x V100 32GB)](#training-throughput-nvidia-dgx-2h-16x-v100-32gb)
        * [Inference accuracy results](#inference-accuracy-results)
            * [Inference accuracy: NVIDIA A100 40GB](#inference-accuracy-nvidia-a100-40gb)
            * [Inference accuracy: NVIDIA Tesla V100 16GB](#inference-accuracy-nvidia-tesla-v100-16gb)
            * [Inference accuracy: NVIDIA T4](#inference-accuracy-nvidia-t4)
        * [Inference throughput results](#inference-throughput-results)
            * [Inference throughput: NVIDIA A100 40GB](#inference-throughput-nvidia-a100-40gb)
            * [Inference throughput: NVIDIA T4](#inference-throughput-nvidia-t4)
        * [Inference latency results](#inference-latency-results)
            * [Inference latency: NVIDIA A100 40GB](#inference-latency-nvidia-a100-40gb)
            * [Inference latency: NVIDIA T4](#inference-latency-nvidia-t4)
* [Release notes](#release-notes)
    * [Changelog](#changelog)
    * [Known issues](#known-issues)

<!-- /TOC -->

## Model overview

The GNMT v2 model is similar to the one discussed in the [Google's Neural
Machine Translation System: Bridging the Gap between Human and Machine
Translation](https://arxiv.org/abs/1609.08144) paper.

The most important difference between the two models is in the attention
mechanism. In our model, the output from the first LSTM layer of the decoder
goes into the attention module, then the re-weighted context is concatenated
with inputs to all subsequent LSTM layers in the decoder at the current
time step.

The same attention mechanism is also implemented in the default GNMT-like
models from [TensorFlow Neural Machine Translation
Tutorial](https://github.com/tensorflow/nmt) and [NVIDIA OpenSeq2Seq
Toolkit](https://github.com/NVIDIA/OpenSeq2Seq).

### Model architecture

![ModelArchitecture](./img/diagram.png)

### Default configuration

The following features were implemented in this model:

* general:
  * encoder and decoder use shared embeddings
  * data-parallel multi-GPU training
  * dynamic loss scaling with backoff for Tensor Cores (mixed precision)
    training
  * trained with label smoothing loss (smoothing factor 0.1)
* encoder:
  * 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest are
    unidirectional
  * residual connections starting from the 3rd layer
  * uses the standard PyTorch `nn.LSTM` layer
  * dropout is applied on the input to all LSTM layers, probability of dropout
    is set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
* decoder:
  * 4-layer unidirectional LSTM with hidden size 1024 and fully-connected
    classifier
  * residual connections starting from the 3rd layer
  * uses the standard PyTorch `nn.LSTM` layer
  * dropout is applied on the input to all LSTM layers, probability of dropout
    is set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
  * weights and biases of the fully-connected classifier are initialized with
    the uniform(-0.1, 0.1) distribution
* attention:
  * normalized Bahdanau attention (a simplified sketch follows after this
    list)
  * output from the first LSTM layer of the decoder goes into attention, then
    the re-weighted context is concatenated with the input to all subsequent
    LSTM layers of the decoder at the current timestep
  * linear transform of keys and queries is initialized with uniform(-0.1,
    0.1), normalization scalar is initialized with 1.0/sqrt(1024),
    normalization bias is initialized with zero
* inference:
  * beam search with default beam size of 5
  * with coverage penalty and length normalization, coverage penalty factor is
    set to 0.1, length normalization factor is set to 0.6 and length
    normalization constant is set to 5.0
  * de-tokenized BLEU computed by
    [SacreBLEU](https://github.com/mjpost/sacrebleu)
  * [motivation](https://github.com/mjpost/sacrebleu#motivation) for choosing
    SacreBLEU
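
The normalized Bahdanau attention and its initialization can be illustrated
with a short, self-contained sketch. This is a simplified re-implementation
for clarity, not the code from `attention.py` in the `seq2seq/model`
directory; tensor shapes and the module layout are assumptions:

```
import math
import torch
import torch.nn as nn

class NormalizedBahdanauAttention(nn.Module):
    """Simplified sketch of normalized Bahdanau (additive) attention."""
    def __init__(self, hidden_size=1024):
        super().__init__()
        # linear transforms of queries and keys, initialized with uniform(-0.1, 0.1)
        self.linear_q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.linear_k = nn.Linear(hidden_size, hidden_size, bias=False)
        nn.init.uniform_(self.linear_q.weight, -0.1, 0.1)
        nn.init.uniform_(self.linear_k.weight, -0.1, 0.1)
        self.v = nn.Parameter(torch.empty(hidden_size).uniform_(-0.1, 0.1))
        # normalization scalar initialized with 1.0/sqrt(1024), bias with zeros
        self.norm_scalar = nn.Parameter(torch.tensor(1.0 / math.sqrt(hidden_size)))
        self.norm_bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, query, keys):
        # query: (batch, 1, hidden) - output of the first decoder LSTM layer
        # keys:  (batch, src_len, hidden) - encoder outputs
        features = torch.tanh(self.linear_q(query) + self.linear_k(keys) + self.norm_bias)
        v = self.norm_scalar * self.v / self.v.norm()
        scores = features.matmul(v)                       # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys)   # (batch, 1, hidden)
        return context, weights

# The returned context is what gets concatenated with the inputs of the
# remaining decoder LSTM layers at the current timestep.
attn = NormalizedBahdanauAttention(hidden_size=1024)
context, weights = attn(torch.randn(2, 1, 1024), torch.randn(2, 7, 1024))
```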

When comparing the BLEU score, there are various tokenization approaches and
BLEU calculation methodologies; therefore, ensure you align similar metrics.

Code from this repository can be used to train a larger, 8-layer GNMT v2 model.
Our experiments show that a 4-layer model is significantly faster to train and
yields comparable accuracy on the public [WMT16
English-German](http://www.statmt.org/wmt16/translation-task.html) dataset. The
number of LSTM layers is controlled by the `--num-layers` parameter in the
`train.py` training script.

### Feature support matrix

The following features are supported by this model.

| **Feature** | **GNMT v2** |
|:------------|------------:|
| [Apex AMP](https://nvidia.github.io/apex/amp.html) | Yes |
| [Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) | Yes |

#### Features

[Apex AMP](https://nvidia.github.io/apex/amp.html) - a tool that enables Tensor
Core-accelerated training. Refer to the [Enabling mixed
precision](#enabling-mixed-precision) section for more details.

[Apex
DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) -
a module wrapper that enables easy multiprocess distributed data parallel
training, similar to
[torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
`DistributedDataParallel` is optimized for use with
[NCCL](https://github.com/NVIDIA/nccl). It achieves high performance by
overlapping communication with computation during `backward()` and bucketing
smaller gradient transfers to reduce the total number of transfers required.
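
In practice, enabling this wrapper amounts to a one-line change after the
model is created. A minimal sketch, assuming the script was launched with
`torch.distributed.launch` so that the process group environment is already
configured (the stand-in model below is illustrative, not the repository
code):

```
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel

# Assumes rank/world-size environment variables were set by the launcher.
dist.init_process_group(backend='nccl')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the GNMT model
model = DistributedDataParallel(model)       # gradients are all-reduced with NCCL
```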

### Mixed precision training

Mixed precision is the combined use of different numerical precisions in a
computational method.
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant
computational speedup by performing operations in half-precision format, while
storing minimal information in single-precision to retain as much information
as possible in critical parts of the network. Since the introduction of [Tensor
Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with
both the Turing and Ampere architectures, significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup on the
most arithmetically intense model architectures. Using mixed precision training
previously required two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced
in the Pascal architecture and first supported in [CUDA
8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep
Learning SDK.

For information about:
* How to train using mixed precision, see the [Mixed Precision
  Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed
  Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
  documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision
  Training of Deep Neural
  Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
  blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy
  Mixed-Precision Training in
  PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).

#### Enabling mixed precision

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
(AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
to half-precision upon retrieval, while storing variables in single-precision
format. Furthermore, to preserve small gradient magnitudes in backpropagation,
a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling can be
easily applied by using the `scale_loss()` method provided by AMP. The scaling
value to be used can be
[dynamic](https://nvidia.github.io/apex/amp.html#apex.amp.initialize) or fixed.

For an in-depth walk through on AMP, check out sample usage
[here](https://nvidia.github.io/apex/amp.html#).
[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
utility libraries, such as AMP, which require minimal network code changes to
leverage Tensor Cores performance.

The following steps were needed to enable mixed precision training in GNMT:

* Import AMP from APEX (file: `seq2seq/train/trainer.py`):

  ```
  from apex import amp
  ```

* Initialize AMP and wrap the model and the optimizer (file:
  `seq2seq/train/trainer.py`, class: `Seq2SeqTrainer`):

  ```
  self.model, self.optimizer = amp.initialize(
      self.model,
      self.optimizer,
      cast_model_outputs=torch.float16,
      keep_batchnorm_fp32=False,
      opt_level='O2')
  ```

* Apply the `scale_loss` context manager (file:
  `seq2seq/train/fp_optimizers.py`, class: `AMPOptimizer`):

  ```
  with amp.scale_loss(loss, optimizer) as scaled_loss:
      scaled_loss.backward()
  ```

* Apply gradient clipping on single precision master weights (file:
  `seq2seq/train/fp_optimizers.py`, class: `AMPOptimizer`):

  ```
  if self.grad_clip != float('inf'):
      clip_grad_norm_(amp.master_params(optimizer), self.grad_clip)
  ```

#### Enabling TF32

TensorFloat-32 (TF32) is the new math mode in [NVIDIA
A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the
matrix math also called tensor operations. TF32 running on Tensor Cores in A100
GPUs can provide up to 10x speedups compared to single-precision floating-point
math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of
accuracy. It is more robust than FP16 for models which require high dynamic
range for weights or activations.

For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates
AI Training, HPC up to
20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/)
blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by
default.
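
Because TF32 is on by default, no changes to the training scripts are needed.
If you want to verify or override that behavior, recent PyTorch releases
expose global switches for it; a hedged sketch using standard PyTorch flags
(not GNMT-specific options):

```
import torch

# TF32 is enabled by default on NVIDIA Ampere GPUs in recent PyTorch releases.
print(torch.backends.cuda.matmul.allow_tf32)  # TF32 for matrix multiplications
print(torch.backends.cudnn.allow_tf32)        # TF32 for cuDNN convolutions/RNNs

# Force full FP32 math instead of TF32, e.g. for an accuracy comparison.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```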

## Setup

The following section lists the requirements in order to start training the
GNMT v2 model.

### Requirements

This repository contains a `Dockerfile` which extends the PyTorch NGC container
and encapsulates some dependencies. Aside from these dependencies, ensure you
have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* GPU architecture:
  * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
  * [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
  * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)

For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep
Learning DGX Documentation:
* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html),
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry),
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running).

For those unable to use the PyTorch NGC container, to set up the required
environment or create your own container, see the versioned [NVIDIA Container
Support
Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).

## Quick Start Guide

To train your model using mixed or TF32 precision with Tensor Cores or using
FP32, perform the following steps using the default parameters of the GNMT v2
model on the WMT16 English-German dataset. For the specifics concerning
training and inference, see the [Advanced](#advanced) section.

**1. Clone the repository.**

```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Translation/GNMT
```

**2. Build the GNMT v2 Docker container.**

```
bash scripts/docker/build.sh
```

**3. Start an interactive session in the container to run training/inference.**

```
bash scripts/docker/interactive.sh
```

**4. Download and preprocess the dataset.**

Data will be downloaded to the `data` directory (on the host). The `data`
directory is mounted to the `/workspace/gnmt/data` location in the Docker
container.

```
bash scripts/wmt16_en_de.sh
```

**5. Start training.**

The training script saves only one checkpoint with the lowest value of the loss
function on the validation dataset. All results and logs are saved to the
`gnmt` directory (on the host) or to the `/workspace/gnmt/gnmt` directory
(in the container). By default, the `train.py` script will launch mixed
precision training with Tensor Cores. You can change this behavior by setting:
* the `--math fp32` flag to launch single precision training (for NVIDIA Volta
  and NVIDIA Turing architectures) or
* the `--math tf32` flag to launch TF32 training with Tensor Cores (for NVIDIA
  Ampere architecture)

for the `train.py` training script.

To launch mixed precision training on 1, 4 or 8 GPUs, run:

```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024
```

To launch mixed precision training on 16 GPUs, run:

```
python3 -m torch.distributed.launch --nproc_per_node=16 train.py --seed 2 --train-global-batch-size 2048
```

By default, the training script will launch training with batch size 128 per
GPU. If `--train-global-batch-size` is specified and larger than 128 times the
number of GPUs available for the training, then the training script will
accumulate gradients over consecutive iterations and then perform the weight
update. For example, 1 GPU training with `--train-global-batch-size 1024` will
accumulate gradients over 8 iterations before doing the weight update with
accumulated gradients.
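
The number of accumulation steps follows the formula documented for
`--train-global-batch-size`; a quick sanity check of the example above (the
values are illustrative):

```
# train_iter_size = train_global_batch_size // (train_batch_size * world_size)
train_batch_size = 128          # default per-GPU batch size
world_size = 1                  # number of GPUs participating in the training
train_global_batch_size = 1024  # value passed via --train-global-batch-size

train_iter_size = train_global_batch_size // (train_batch_size * world_size)
print(train_iter_size)          # -> 8 gradient accumulation steps per update
```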

**6. Start evaluation.**

The training process automatically runs evaluation and outputs the BLEU score
after each training epoch. Additionally, after the training is done, you can
manually run inference on the test dataset with the checkpoint saved during the
training.

To launch FP16 inference on the `newstest2014.en` test set, run:

```
python3 translate.py \
  --input data/wmt16_de_en/newstest2014.en \
  --reference data/wmt16_de_en/newstest2014.de \
  --output /tmp/output \
  --model gnmt/model_best.pth
```

The script will load the checkpoint specified by the `--model` option, then it
will launch inference on the file specified by the `--input` option, and
compute BLEU score against the reference translation specified by the
`--reference` option. Outputs will be stored to the location specified by the
`--output` option.

Additionally, one can pass the input text directly from the command-line:

```
python3 translate.py \
  --input-text "The quick brown fox jumps over the lazy dog" \
  --model gnmt/model_best.pth
```

Translated output will be printed to the console:

```
(...)
0: Translated output:
Der schnelle braune Fuchs springt über den faulen Hund
```

By default, the `translate.py` script will launch FP16 inference with Tensor
Cores. You can change this behavior by setting:
* the `--math fp32` flag to launch single precision inference (for NVIDIA Volta
  and NVIDIA Turing architectures) or
* the `--math tf32` flag to launch TF32 inference with Tensor Cores (for NVIDIA
  Ampere architecture)

for the `translate.py` inference script.

## Advanced

The following sections provide greater details of the dataset, running training
and inference, and the training results.

### Scripts and sample code

In the `root` directory, the most important files are:
* `train.py`: serves as the entry point to launch the training
* `translate.py`: serves as the entry point to launch inference
* `Dockerfile`: container with the basic set of dependencies to run GNMT v2
* `requirements.txt`: set of extra requirements for running GNMT v2

The `seq2seq/model` directory contains the implementation of GNMT v2 building
blocks:
* `attention.py`: implementation of normalized Bahdanau attention
* `encoder.py`: implementation of recurrent encoder
* `decoder.py`: implementation of recurrent decoder with attention
* `seq2seq_base.py`: base class for seq2seq models
* `gnmt.py`: implementation of GNMT v2 model

The `seq2seq/train` directory encapsulates the necessary tools to execute
training:
* `trainer.py`: implementation of training loop
* `smoothing.py`: implementation of cross-entropy with label smoothing (an
  illustrative sketch follows at the end of this section)
* `lr_scheduler.py`: implementation of exponential learning rate warmup and
  step decay
* `fp_optimizers.py`: implementation of optimizers for various floating point
  precisions

The `seq2seq/inference` directory contains scripts required to run inference:
* `beam_search.py`: implementation of beam search with length normalization and
  length penalty
* `translator.py`: implementation of auto-regressive inference

The `seq2seq/data` directory contains implementation of components needed for
data loading:
* `dataset.py`: implementation of text datasets
* `sampler.py`: implementation of batch samplers with bucketing by sequence
  length
* `tokenizer.py`: implementation of tokenizer (maps integer vocabulary indices
  to text)
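
As a companion to `smoothing.py`, the following is an illustrative sketch of
cross-entropy with label smoothing (smoothing factor 0.1, as used by default).
It is a simplified re-implementation, not the repository code, and padding
handling is reduced to a mask:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothingLoss(nn.Module):
    """Illustrative cross-entropy with label smoothing."""
    def __init__(self, vocab_size, padding_idx=0, smoothing=0.1):
        super().__init__()
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        # spread the remaining probability mass over the non-target classes
        self.smooth_value = smoothing / (vocab_size - 2)

    def forward(self, logits, target):
        # logits: (num_tokens, vocab_size), target: (num_tokens,)
        log_probs = F.log_softmax(logits, dim=-1)
        smooth_target = torch.full_like(log_probs, self.smooth_value)
        smooth_target.scatter_(1, target.unsqueeze(1), self.confidence)
        smooth_target[:, self.padding_idx] = 0.0
        loss = -(smooth_target * log_probs).sum(dim=-1)
        non_pad = target != self.padding_idx
        return loss[non_pad].sum()

criterion = LabelSmoothingLoss(vocab_size=32320, padding_idx=0, smoothing=0.1)
loss = criterion(torch.randn(6, 32320), torch.tensor([5, 9, 2, 0, 7, 3]))
```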

### Parameters

Training

The complete list of available parameters for the `train.py` training script
contains:

```
dataset setup:
  --dataset-dir DATASET_DIR
                        path to the directory with training/test data
                        (default: data/wmt16_de_en)
  --src-lang SRC_LANG   source language (default: en)
  --tgt-lang TGT_LANG   target language (default: de)
  --vocab VOCAB         path to the vocabulary file (relative to DATASET_DIR
                        directory) (default: vocab.bpe.32000)
  -bpe BPE_CODES, --bpe-codes BPE_CODES
                        path to the file with bpe codes (relative to
                        DATASET_DIR directory) (default: bpe.32000)
  --train-src TRAIN_SRC
                        path to the training source data file (relative to
                        DATASET_DIR directory) (default:
                        train.tok.clean.bpe.32000.en)
  --train-tgt TRAIN_TGT
                        path to the training target data file (relative to
                        DATASET_DIR directory) (default:
                        train.tok.clean.bpe.32000.de)
  --val-src VAL_SRC     path to the validation source data file (relative to
                        DATASET_DIR directory) (default:
                        newstest_dev.tok.clean.bpe.32000.en)
  --val-tgt VAL_TGT     path to the validation target data file (relative to
                        DATASET_DIR directory) (default:
                        newstest_dev.tok.clean.bpe.32000.de)
  --test-src TEST_SRC   path to the test source data file (relative to
                        DATASET_DIR directory) (default:
                        newstest2014.tok.bpe.32000.en)
  --test-tgt TEST_TGT   path to the test target data file (relative to
                        DATASET_DIR directory) (default: newstest2014.de)
  --train-max-size TRAIN_MAX_SIZE
                        use at most TRAIN_MAX_SIZE elements from training
                        dataset (useful for benchmarking), by default uses
                        entire dataset (default: None)

results setup:
  --save-dir SAVE_DIR   path to directory with results, it will be
                        automatically created if it does not exist (default:
                        gnmt)
  --print-freq PRINT_FREQ
                        print log every PRINT_FREQ batches (default: 10)

model setup:
  --hidden-size HIDDEN_SIZE
                        hidden size of the model (default: 1024)
  --num-layers NUM_LAYERS
                        number of RNN layers in encoder and in decoder
                        (default: 4)
  --dropout DROPOUT     dropout applied to input of RNN cells (default: 0.2)
  --share-embedding     use shared embeddings for encoder and decoder (use '--
                        no-share-embedding' to disable) (default: True)
  --smoothing SMOOTHING
                        label smoothing, if equal to zero model will use
                        CrossEntropyLoss, if not zero model will be trained
                        with label smoothing loss (default: 0.1)

general setup:
  --math {fp16,fp32,tf32,manual_fp16}
                        precision (default: fp16)
  --seed SEED           master seed for random number generators, if "seed" is
                        undefined then the master seed will be sampled from
                        random.SystemRandom() (default: None)
  --prealloc-mode {off,once,always}
                        controls preallocation (default: always)
  --dllog-file DLLOG_FILE
                        Name of the DLLogger output file (default:
                        train_log.json)
  --eval                run validation and test after every epoch (use '--no-
                        eval' to disable) (default: True)
  --env                 print info about execution env (use '--no-env' to
                        disable) (default: True)
  --cuda                enables cuda (use '--no-cuda' to disable) (default:
                        True)
  --cudnn               enables cudnn (use '--no-cudnn' to disable) (default:
                        True)
  --log-all-ranks       enables logging from all distributed ranks, if
                        disabled then only logs from rank 0 are reported (use
                        '--no-log-all-ranks' to disable) (default: True)

training setup:
  --train-batch-size TRAIN_BATCH_SIZE
                        training batch size per worker (default: 128)
  --train-global-batch-size TRAIN_GLOBAL_BATCH_SIZE
                        global training batch size, this argument does not
                        have to be defined, if it is defined it will be used
                        to automatically compute train_iter_size using the
                        equation: train_iter_size = train_global_batch_size //
                        (train_batch_size * world_size) (default: None)
  --train-iter-size N   training iter size, training loop will accumulate
                        gradients over N iterations and execute optimizer
                        every N steps (default: 1)
  --epochs EPOCHS       max number of training epochs (default: 6)
  --grad-clip GRAD_CLIP
                        enables gradient clipping and sets maximum norm of
                        gradients (default: 5.0)
  --train-max-length TRAIN_MAX_LENGTH
                        maximum sequence length for training (including
                        special BOS and EOS tokens) (default: 50)
  --train-min-length TRAIN_MIN_LENGTH
                        minimum sequence length for training (including
                        special BOS and EOS tokens) (default: 0)
  --train-loader-workers TRAIN_LOADER_WORKERS
                        number of workers for training data loading (default:
                        2)
  --batching {random,sharding,bucketing}
                        select batching algorithm (default: bucketing)
  --shard-size SHARD_SIZE
                        shard size for "sharding" batching algorithm, in
                        multiples of global batch size (default: 80)
  --num-buckets NUM_BUCKETS
                        number of buckets for "bucketing" batching algorithm
                        (default: 5)

optimizer setup:
  --optimizer OPTIMIZER
                        training optimizer (default: Adam)
  --lr LR               learning rate (default: 0.002)
  --optimizer-extra OPTIMIZER_EXTRA
                        extra options for the optimizer (default: {})

mixed precision loss scaling setup:
  --init-scale INIT_SCALE
                        initial loss scale (default: 8192)
  --upscale-interval UPSCALE_INTERVAL
                        loss upscaling interval (default: 128)

learning rate scheduler setup:
  --warmup-steps WARMUP_STEPS
                        number of learning rate warmup iterations (default:
                        200)
  --remain-steps REMAIN_STEPS
                        starting iteration for learning rate decay (default:
                        0.666)
  --decay-interval DECAY_INTERVAL
                        interval between learning rate decay steps (default:
                        None)
  --decay-steps DECAY_STEPS
                        max number of learning rate decay steps (default: 4)
  --decay-factor DECAY_FACTOR
                        learning rate decay factor (default: 0.5)

validation setup:
  --val-batch-size VAL_BATCH_SIZE
                        batch size for validation (default: 64)
  --val-max-length VAL_MAX_LENGTH
                        maximum sequence length for validation (including
                        special BOS and EOS tokens) (default: 125)
  --val-min-length VAL_MIN_LENGTH
                        minimum sequence length for validation (including
                        special BOS and EOS tokens) (default: 0)
  --val-loader-workers VAL_LOADER_WORKERS
                        number of workers for validation data loading
                        (default: 0)

test setup:
  --test-batch-size TEST_BATCH_SIZE
                        batch size for test (default: 128)
  --test-max-length TEST_MAX_LENGTH
                        maximum sequence length for test (including special
                        BOS and EOS tokens) (default: 150)
  --test-min-length TEST_MIN_LENGTH
                        minimum sequence length for test (including special
                        BOS and EOS tokens) (default: 0)
  --beam-size BEAM_SIZE
                        beam size (default: 5)
  --len-norm-factor LEN_NORM_FACTOR
                        length normalization factor (default: 0.6)
  --cov-penalty-factor COV_PENALTY_FACTOR
                        coverage penalty factor (default: 0.1)
  --len-norm-const LEN_NORM_CONST
                        length normalization constant (default: 5.0)
  --intra-epoch-eval N  evaluate within training epoch, this option will
                        enable extra N equally spaced evaluations executed
                        during each training epoch (default: 0)
  --test-loader-workers TEST_LOADER_WORKERS
                        number of workers for test data loading (default: 0)

checkpointing setup:
  --start-epoch START_EPOCH
                        manually set initial epoch counter (default: 0)
  --resume PATH         resumes training from checkpoint from PATH (default:
                        None)
  --save-all            saves checkpoint after every epoch (default: False)
  --save-freq SAVE_FREQ
                        save checkpoint every SAVE_FREQ batches (default:
                        5000)
  --keep-checkpoints KEEP_CHECKPOINTS
                        keep only last KEEP_CHECKPOINTS checkpoints, affects
                        only checkpoints controlled by --save-freq option
                        (default: 0)

benchmark setup:
  --target-perf TARGET_PERF
                        target training performance (in tokens per second)
                        (default: None)
  --target-bleu TARGET_BLEU
                        target accuracy (default: None)
```
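
The learning rate scheduler options above (`--warmup-steps`, `--remain-steps`,
`--decay-interval`, `--decay-steps`, `--decay-factor`) combine warmup, a
constant phase and step decay. The sketch below is a simplified illustration
of such a schedule with illustrative step values; the exact rules live in
`seq2seq/train/lr_scheduler.py`:

```
import math

def gnmt_like_lr(step, base_lr=2e-3, warmup_steps=200, remain_steps=2000,
                 decay_interval=500, decay_steps=4, decay_factor=0.5):
    """Illustrative exponential warmup followed by step decay."""
    if step < warmup_steps:
        # exponential warmup from ~1% of base_lr up to base_lr
        growth = math.exp(math.log(0.01) / warmup_steps)
        return base_lr * growth ** (warmup_steps - step)
    if step < remain_steps:
        # hold the base learning rate until decay starts
        return base_lr
    # decay by decay_factor every decay_interval iterations,
    # at most decay_steps times
    num_decays = min((step - remain_steps) // decay_interval + 1, decay_steps)
    return base_lr * decay_factor ** num_decays

schedule = [gnmt_like_lr(step) for step in range(0, 4000, 100)]
```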

Inference

The complete list of available parameters for the `translate.py` inference
script contains:

```
data setup:
  -o OUTPUT, --output OUTPUT
                        full path to the output file if not specified, then
                        the output will be printed (default: None)
  -r REFERENCE, --reference REFERENCE
                        full path to the file with reference translations (for
                        sacrebleu, raw text) (default: None)
  -m MODEL, --model MODEL
                        full path to the model checkpoint file (default: None)
  --synthetic           use synthetic dataset (default: False)
  --synthetic-batches SYNTHETIC_BATCHES
                        number of synthetic batches to generate (default: 64)
  --synthetic-vocab SYNTHETIC_VOCAB
                        size of synthetic vocabulary (default: 32320)
  --synthetic-len SYNTHETIC_LEN
                        sequence length of synthetic samples (default: 50)
  -i INPUT, --input INPUT
                        full path to the input file (raw text) (default: None)
  -t INPUT_TEXT [INPUT_TEXT ...], --input-text INPUT_TEXT [INPUT_TEXT ...]
                        raw input text (default: None)
  --sort                sorts dataset by sequence length (use '--no-sort' to
                        disable) (default: False)

inference setup:
  --batch-size BATCH_SIZE [BATCH_SIZE ...]
                        batch size per GPU (default: [128])
  --beam-size BEAM_SIZE [BEAM_SIZE ...]
                        beam size (default: [5])
  --max-seq-len MAX_SEQ_LEN
                        maximum generated sequence length (default: 80)
  --len-norm-factor LEN_NORM_FACTOR
                        length normalization factor (default: 0.6)
  --cov-penalty-factor COV_PENALTY_FACTOR
                        coverage penalty factor (default: 0.1)
  --len-norm-const LEN_NORM_CONST
                        length normalization constant (default: 5.0)

general setup:
  --math {fp16,fp32,tf32} [{fp16,fp32,tf32} ...]
                        precision (default: ['fp16'])
  --env                 print info about execution env (use '--no-env' to
                        disable) (default: False)
  --bleu                compares with reference translation and computes BLEU
                        (use '--no-bleu' to disable) (default: True)
  --cuda                enables cuda (use '--no-cuda' to disable) (default:
                        True)
  --cudnn               enables cudnn (use '--no-cudnn' to disable) (default:
                        True)
  --batch-first         uses (batch, seq, feature) data format for RNNs
                        (default: True)
  --seq-first           uses (seq, batch, feature) data format for RNNs
                        (default: True)
  --save-dir SAVE_DIR   path to directory with results, it will be
                        automatically created if it does not exist (default:
                        gnmt)
  --dllog-file DLLOG_FILE
                        Name of the DLLogger output file (default:
                        eval_log.json)
  --print-freq PRINT_FREQ, -p PRINT_FREQ
                        print log every PRINT_FREQ batches (default: 1)

benchmark setup:
  --target-perf TARGET_PERF
                        target inference performance (in tokens per second)
                        (default: None)
  --target-bleu TARGET_BLEU
                        target accuracy (default: None)
  --repeat REPEAT [REPEAT ...]
                        loops over the dataset REPEAT times, flag accepts
                        multiple arguments, one for each specified batch size
                        (default: [1])
  --warmup WARMUP       warmup iterations for performance counters (default:
                        0)
  --percentiles PERCENTILES [PERCENTILES ...]
                        Percentiles for confidence intervals for
                        throughput/latency benchmarks (default: (90, 95, 99))
  --tables              print accuracy, throughput and latency results in
                        tables (use '--no-tables' to disable) (default: False)
```

### Command-line options

To see the full list of available options and their descriptions, use the `-h`
or `--help` command line option. For example, for training:

```
python3 train.py --help

usage: train.py [-h] [--dataset-dir DATASET_DIR] [--src-lang SRC_LANG]
                [--tgt-lang TGT_LANG] [--vocab VOCAB] [-bpe BPE_CODES]
                [--train-src TRAIN_SRC] [--train-tgt TRAIN_TGT]
                [--val-src VAL_SRC] [--val-tgt VAL_TGT] [--test-src TEST_SRC]
                [--test-tgt TEST_TGT] [--save-dir SAVE_DIR]
                [--print-freq PRINT_FREQ] [--hidden-size HIDDEN_SIZE]
                [--num-layers NUM_LAYERS] [--dropout DROPOUT]
                [--share-embedding] [--smoothing SMOOTHING]
                [--math {fp16,fp32,tf32,manual_fp16}] [--seed SEED]
                [--prealloc-mode {off,once,always}] [--dllog-file DLLOG_FILE]
                [--eval] [--env] [--cuda] [--cudnn] [--log-all-ranks]
                [--train-max-size TRAIN_MAX_SIZE]
                [--train-batch-size TRAIN_BATCH_SIZE]
                [--train-global-batch-size TRAIN_GLOBAL_BATCH_SIZE]
                [--train-iter-size N] [--epochs EPOCHS]
                [--grad-clip GRAD_CLIP] [--train-max-length TRAIN_MAX_LENGTH]
                [--train-min-length TRAIN_MIN_LENGTH]
                [--train-loader-workers TRAIN_LOADER_WORKERS]
                [--batching {random,sharding,bucketing}]
                [--shard-size SHARD_SIZE] [--num-buckets NUM_BUCKETS]
                [--optimizer OPTIMIZER] [--lr LR]
                [--optimizer-extra OPTIMIZER_EXTRA] [--init-scale INIT_SCALE]
                [--upscale-interval UPSCALE_INTERVAL]
                [--warmup-steps WARMUP_STEPS] [--remain-steps REMAIN_STEPS]
                [--decay-interval DECAY_INTERVAL] [--decay-steps DECAY_STEPS]
                [--decay-factor DECAY_FACTOR]
                [--val-batch-size VAL_BATCH_SIZE]
                [--val-max-length VAL_MAX_LENGTH]
                [--val-min-length VAL_MIN_LENGTH]
                [--val-loader-workers VAL_LOADER_WORKERS]
                [--test-batch-size TEST_BATCH_SIZE]
                [--test-max-length TEST_MAX_LENGTH]
                [--test-min-length TEST_MIN_LENGTH] [--beam-size BEAM_SIZE]
                [--len-norm-factor LEN_NORM_FACTOR]
                [--cov-penalty-factor COV_PENALTY_FACTOR]
                [--len-norm-const LEN_NORM_CONST] [--intra-epoch-eval N]
                [--test-loader-workers TEST_LOADER_WORKERS]
                [--start-epoch START_EPOCH] [--resume PATH] [--save-all]
                [--save-freq SAVE_FREQ] [--keep-checkpoints KEEP_CHECKPOINTS]
                [--target-perf TARGET_PERF] [--target-bleu TARGET_BLEU]
                [--local_rank LOCAL_RANK]
```

For example, for inference:

```
python3 translate.py --help

usage: translate.py [-h] [-o OUTPUT] [-r REFERENCE] [-m MODEL] [--synthetic]
                    [--synthetic-batches SYNTHETIC_BATCHES]
                    [--synthetic-vocab SYNTHETIC_VOCAB]
                    [--synthetic-len SYNTHETIC_LEN]
                    [-i INPUT | -t INPUT_TEXT [INPUT_TEXT ...]] [--sort]
                    [--batch-size BATCH_SIZE [BATCH_SIZE ...]]
                    [--beam-size BEAM_SIZE [BEAM_SIZE ...]]
                    [--max-seq-len MAX_SEQ_LEN]
                    [--len-norm-factor LEN_NORM_FACTOR]
                    [--cov-penalty-factor COV_PENALTY_FACTOR]
                    [--len-norm-const LEN_NORM_CONST]
                    [--math {fp16,fp32,tf32} [{fp16,fp32,tf32} ...]] [--env]
                    [--bleu] [--cuda] [--cudnn] [--batch-first | --seq-first]
                    [--save-dir SAVE_DIR] [--dllog-file DLLOG_FILE]
                    [--print-freq PRINT_FREQ] [--target-perf TARGET_PERF]
                    [--target-bleu TARGET_BLEU] [--repeat REPEAT [REPEAT ...]]
                    [--warmup WARMUP]
                    [--percentiles PERCENTILES [PERCENTILES ...]] [--tables]
                    [--local_rank LOCAL_RANK]
```

### Getting the data

The GNMT v2 model was trained on the [WMT16
English-German](http://www.statmt.org/wmt16/translation-task.html) dataset.
The concatenation of the newstest2015 and newstest2016 test sets is used as a
validation dataset and newstest2014 is used as a testing dataset.

This repository contains the `scripts/wmt16_en_de.sh` download script which
automatically downloads and preprocesses the training, validation and test
datasets. By default, data is downloaded to the `data` directory.

Our download script is very similar to the `wmt16_en_de.sh` script from the
[tensorflow/nmt](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh)
repository. Our download script contains an extra preprocessing step, which
discards all pairs of sentences which can't be decoded by the *latin-1*
encoder. The `scripts/wmt16_en_de.sh` script uses the
[subword-nmt](https://github.com/rsennrich/subword-nmt) package to segment text
into subword units (Byte Pair Encodings -
[BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding)). By default, the
script builds the shared vocabulary of 32,000 tokens.

In order to test with other datasets, the script needs to be customized
accordingly.

#### Dataset guidelines

The process of downloading and preprocessing the data can be found in the
`scripts/wmt16_en_de.sh` script.

Initially, data is downloaded from [www.statmt.org](www.statmt.org). Then the
`europarl-v7`, `commoncrawl` and `news-commentary` corpora are concatenated to
form the training dataset; similarly, `newstest2015` and `newstest2016` are
concatenated to form the validation dataset. Raw data is preprocessed with
[Moses](https://github.com/moses-smt/mosesdecoder), first by launching the
[Moses
tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)
(the tokenizer breaks up text into individual words), then by launching
[clean-corpus-n.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl)
which removes invalid sentences and does initial filtering by sequence length.

The second stage of preprocessing is done by launching the
`scripts/filter_dataset.py` script, which discards all pairs of sentences that
can't be decoded by the latin-1 encoder.

The third stage of preprocessing uses the
[subword-nmt](https://github.com/rsennrich/subword-nmt) package. First it
builds a shared [byte pair
encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) vocabulary with
32,000 merge operations (command `subword-nmt learn-bpe`), then it applies the
generated vocabulary to the training, validation and test corpora (command
`subword-nmt apply-bpe`).
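
For reference, the two `subword-nmt` commands can be driven from Python as
sketched below. The file names are placeholders, and the real pipeline
(including building the shared vocabulary from both languages) is implemented
in `scripts/wmt16_en_de.sh`:

```
import subprocess

# Learn BPE merge operations from a tokenized training corpus
# (placeholder file names).
with open("train.tok.clean.en", "rb") as corpus, open("bpe.32000", "wb") as codes:
    subprocess.run(["subword-nmt", "learn-bpe", "-s", "32000"],
                   stdin=corpus, stdout=codes, check=True)

# Apply the learned codes to a tokenized file.
with open("newstest2014.tok.en", "rb") as inp, \
     open("newstest2014.tok.bpe.32000.en", "wb") as out:
    subprocess.run(["subword-nmt", "apply-bpe", "-c", "bpe.32000"],
                   stdin=inp, stdout=out, check=True)
```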

### Training process

The default training configuration can be launched by running the `train.py`
training script. By default, the training script saves only one checkpoint with
the lowest value of the loss function on the validation dataset. An evaluation
is then performed after each training epoch. Results are stored in the
`gnmt` directory.

The training script launches data-parallel training with batch size 128 per GPU
on all available GPUs. We have tested training on up to 16 GPUs on a single
node.

After each training epoch, the script runs an evaluation on the validation
dataset and outputs a BLEU score on the test dataset (newstest2014). BLEU is
computed by the [SacreBLEU](https://github.com/mjpost/sacreBLEU) package. Logs
from the training and evaluation are saved to the `gnmt` directory.
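
If you want to reproduce the BLEU computation outside of the training loop,
the same package can be called directly. A minimal sketch with hypothetical
strings; the exact call used by the repository may differ:

```
import sacrebleu

# hypothetical detokenized hypotheses and references
hypotheses = ["The quick brown fox jumps over the lazy dog."]
references = ["The quick brown fox jumped over the lazy dog."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```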

The summary after each training epoch is printed in the following format:

```
0: Summary: Epoch: 3 Training Loss: 3.1336 Validation Loss: 2.9587 Test BLEU: 23.18
0: Performance: Epoch: 3 Training: 418772 Tok/s Validation: 1445331 Tok/s
```

The training loss is averaged over an entire training epoch, the validation
loss is averaged over the validation dataset and the BLEU score is computed on
the test dataset. Performance is reported in total tokens per second. The
result is averaged over an entire training epoch and summed over all GPUs
participating in the training.

By default, the `train.py` script will launch mixed precision training with
Tensor Cores. You can change this behavior by setting:
* the `--math fp32` flag to launch single precision training (for NVIDIA Volta
  and NVIDIA Turing architectures) or
* the `--math tf32` flag to launch TF32 training with Tensor Cores (for NVIDIA
  Ampere architecture)

for the `train.py` training script.

To view all available options for training, run `python3 train.py --help`.

### Inference process

Inference can be run by launching the `translate.py` inference script, although
it requires a pre-trained model checkpoint and tokenized input.

The inference script, `translate.py`, supports batched inference. By default,
it launches beam search with a beam size of 5, a coverage penalty term and a
length normalization term. Greedy decoding can be enabled by setting the beam
size to 1.

To view all available options for inference, run `python3 translate.py --help`.
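
The length normalization applied during beam search follows the GNMT-style
formulation controlled by `--len-norm-const` (default 5.0) and
`--len-norm-factor` (default 0.6). The sketch below illustrates the idea only;
the exact scoring, including the coverage penalty, is implemented in
`beam_search.py`:

```
def length_penalty(length, norm_const=5.0, norm_factor=0.6):
    # GNMT-style length normalization term
    return ((norm_const + length) / (norm_const + 1.0)) ** norm_factor

def normalized_score(sum_log_probs, length):
    # hypotheses on the beam are compared by length-normalized log-probability
    return sum_log_probs / length_penalty(length)

print(normalized_score(-12.3, length=10))
```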

## Performance

The performance measurements in this document were conducted at the time of
publication and may not reflect the performance achieved from NVIDIA’s latest
software release. For the most up-to-date performance measurements, go to
[NVIDIA Data Center Deep Learning Product
Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).

### Benchmarking

The following section shows how to run benchmarks measuring the model
performance in training and inference modes.

#### Training performance benchmark

Training is launched on batches of text data; different batches have different
sequence lengths (number of tokens in the longest sequence). Sequence length
and batch efficiency (ratio of non-pad tokens to total number of tokens) affect
performance of the training, therefore it's recommended to run the training on
a large chunk of the training dataset to get a stable and reliable average
training performance. Ideally, at least one full epoch of training should be
launched to get a good estimate of training performance.
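
Batch efficiency, as used above, is simply the fraction of non-padding tokens
in a padded batch; a quick illustration with made-up sequence lengths:

```
# Illustration of batch efficiency: ratio of non-pad tokens to total tokens.
seq_lengths = [12, 50, 31, 45, 8, 50, 27, 40]   # tokens per sentence in a batch
max_len = max(seq_lengths)

non_pad_tokens = sum(seq_lengths)
total_tokens = max_len * len(seq_lengths)        # size of the padded batch
batch_efficiency = non_pad_tokens / total_tokens
print(f"{batch_efficiency:.2f}")                 # ~0.66 for this batch
```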

The following commands will launch one epoch of training.

To launch mixed precision training on 1, 4 or 8 GPUs, run:
```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024 --epochs 1 --math fp16
```

To launch mixed precision training on 16 GPUs, run:
```
python3 -m torch.distributed.launch --nproc_per_node=16 train.py --seed 2 --train-global-batch-size 2048 --epochs 1 --math fp16
```

Change `--math fp16` to `--math fp32` to launch single precision training (for
NVIDIA Volta and NVIDIA Turing architectures) or to `--math tf32` to launch
TF32 training with Tensor Cores (for NVIDIA Ampere architecture).

After the training is completed, the `train.py` script prints a summary to
standard output. Performance results are printed in the following format:

```
(...)
0: Performance: Epoch: 0 Training: 418926 Tok/s Validation: 1430828 Tok/s
(...)
```

`Training: 418926 Tok/s` represents training throughput averaged over an entire
training epoch and summed over all GPUs participating in the training.

#### Inference performance benchmark

The inference performance and accuracy benchmarks require a checkpoint from a
fully trained model.

Command to launch the inference accuracy benchmark on NVIDIA Volta or NVIDIA
Turing architectures:

```
python3 translate.py \
  --model gnmt/model_best.pth \
  --input data/wmt16_de_en/newstest2014.en \
  --reference data/wmt16_de_en/newstest2014.de \
  --output /tmp/output \
  --math fp16 fp32 \
  --batch-size 128 \
  --beam-size 1 2 5 \
  --tables
```
|
|
|
|
|
|
2020-08-01 15:47:34 +02:00
|
|
|
|
Command to launch the inference accuracy benchmark on NVIDIA Ampere architecture:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
python3 translate.py \
|
|
|
|
|
--model gnmt/model_best.pth \
|
|
|
|
|
--input data/wmt16_de_en/newstest2014.en \
|
|
|
|
|
--reference data/wmt16_de_en/newstest2014.de \
|
|
|
|
|
--output /tmp/output \
|
|
|
|
|
--math fp16 tf32 \
|
|
|
|
|
--batch-size 128 \
|
|
|
|
|
--beam-size 1 2 5 \
|
|
|
|
|
--tables
|
|
|
|
|
```
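
If you want an independent sanity check of a BLEU score, a rough sketch using
the `sacrebleu` package is shown below. This is only an assumption-laden
illustration, not the scoring path used by `translate.py`: it presumes a
single-configuration run (for example `--math fp16 --beam-size 5`) whose
detokenized translations ended up in `/tmp/output`, and the resulting score may
differ slightly from the one reported by the benchmark:

```
# Rough sketch: score a detokenized output file against a detokenized reference.
import sacrebleu

with open("/tmp/output") as hyp_file:
    hypotheses = [line.strip() for line in hyp_file]
with open("data/wmt16_de_en/newstest2014.de") as ref_file:
    references = [line.strip() for line in ref_file]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```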

Command to launch the inference throughput and latency benchmarks on NVIDIA
Volta or NVIDIA Turing architectures:

```
python3 translate.py \
    --model gnmt/model_best.pth \
    --input data/wmt16_de_en/newstest2014.en \
    --reference data/wmt16_de_en/newstest2014.de \
    --output /tmp/output \
    --math fp16 fp32 \
    --batch-size 1 2 4 8 32 128 512 \
    --repeat 1 1 1 1 2 8 16 \
    --beam-size 1 2 5 \
    --warmup 5 \
    --tables
```

Command to launch the inference throughput and latency benchmarks on NVIDIA
Ampere architecture:

```
python3 translate.py \
    --model gnmt/model_best.pth \
    --input data/wmt16_de_en/newstest2014.en \
    --reference data/wmt16_de_en/newstest2014.de \
    --output /tmp/output \
    --math fp16 tf32 \
    --batch-size 1 2 4 8 32 128 512 \
    --repeat 1 1 1 1 2 8 16 \
    --beam-size 1 2 5 \
    --warmup 5 \
    --tables
```

### Results

The following sections provide details on how we achieved our performance and
accuracy in training and inference.

#### Training accuracy results

##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-20.06-py3 NGC container on NVIDIA DGX
A100 with 8x A100 40GB GPUs.

Command to launch the training:

```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024 --math fp16
```

Change `--math fp16` to `--math tf32` to launch TF32 training with Tensor Cores.

| **GPUs** | **Batch Size / GPU** | **Accuracy - TF32 (BLEU)** | **Accuracy - Mixed precision (BLEU)** | **Time to Train - TF32 (minutes)** | **Time to Train - Mixed precision (minutes)** | **Time to Train Speedup (TF32 to Mixed precision)** |
| --- | --- | ----- | ----- | ----- | ------ | ---- |
| 8 | 128 | 24.46 | 24.60 | 34.7 | 22.7 | 1.53 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1
with 8x V100 16GB GPUs.

Command to launch the training:

```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024 --math fp16
```

Change `--math fp16` to `--math fp32` to launch single precision training.

| **GPUs** | **Batch Size / GPU** | **Accuracy - FP32 (BLEU)** | **Accuracy - Mixed precision (BLEU)** | **Time to Train - FP32 (minutes)** | **Time to Train - Mixed precision (minutes)** | **Time to Train Speedup (FP32 to Mixed precision)** |
| --- | --- | ----- | ----- | ----- | ------ | ---- |
| 1 | 128 | 24.41 | 24.42 | 810.0 | 224.0 | 3.62 |
| 4 | 128 | 24.40 | 24.33 | 218.2 | 69.5 | 3.14 |
| 8 | 128 | 24.45 | 24.38 | 112.0 | 38.6 | 2.90 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training accuracy: NVIDIA DGX-2H (16x V100 32GB)

Our results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-20.06-py3 NGC container on NVIDIA DGX-2H
with 16x V100 32GB GPUs.

To launch mixed precision training on 16 GPUs, run:

```
python3 -m torch.distributed.launch --nproc_per_node=16 train.py --seed 2 --train-global-batch-size 2048 --math fp16
```

Change `--math fp16` to `--math fp32` to launch single precision training.

| **GPUs** | **Batch Size / GPU** | **Accuracy - FP32 (BLEU)** | **Accuracy - Mixed precision (BLEU)** | **Time to Train - FP32 (minutes)** | **Time to Train - Mixed precision (minutes)** | **Time to Train Speedup (FP32 to Mixed precision)** |
| --- | --- | ----- | ----- | ------ | ----- | ---- |
| 16 | 128 | 24.41 | 24.38 | 52.1 | 19.4 | 2.69 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

![TrainingLoss](./img/training_loss.png)

##### Training stability test

The GNMT v2 model was trained for 6 epochs, starting from 32 different initial
random seeds. After each training epoch, the model was evaluated on the test
dataset and the BLEU score was recorded. The training was performed in the
pytorch-20.06-py3 Docker container on NVIDIA DGX A100 with 8x A100 40GB GPUs.
The following table summarizes the results of the stability test: it shows the
distribution of BLEU scores across the 32 initial random seeds after each
training epoch.

| **Epoch** | **Average** | **Standard deviation** | **Minimum** | **Maximum** | **Median** |
| --- | ------ | ----- | ------ | ------ | ------ |
| 1 | 19.959 | 0.238 | 19.410 | 20.390 | 19.970 |
| 2 | 21.772 | 0.293 | 20.960 | 22.280 | 21.820 |
| 3 | 22.435 | 0.264 | 21.740 | 22.870 | 22.465 |
| 4 | 23.167 | 0.166 | 22.870 | 23.620 | 23.195 |
| 5 | 24.233 | 0.149 | 23.820 | 24.530 | 24.235 |
| 6 | 24.416 | 0.131 | 24.140 | 24.660 | 24.390 |
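
Summary statistics of this kind can be reproduced from the raw per-seed BLEU
scores with a few lines of standard-library Python. In the sketch below,
`bleu_per_seed` is only a placeholder for the 32 values recorded for one epoch:

```
# Minimal sketch: aggregate per-seed BLEU scores for one epoch into the
# statistics reported in the table above.
import statistics

# Placeholder: replace with the 32 BLEU scores recorded for one epoch.
bleu_per_seed = [24.14, 24.29, 24.35, 24.41, 24.48, 24.53, 24.66]

print(f"Average: {statistics.mean(bleu_per_seed):.3f}")
print(f"Standard deviation: {statistics.stdev(bleu_per_seed):.3f}")
print(f"Minimum: {min(bleu_per_seed):.3f}")
print(f"Maximum: {max(bleu_per_seed):.3f}")
print(f"Median: {statistics.median(bleu_per_seed):.3f}")
```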

#### Training throughput results

##### Training throughput: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the `train.py` training script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs.
Throughput performance numbers (in tokens per second) were averaged over an
entire training epoch.

| **GPUs** | **Batch size / GPU** | **Throughput - TF32 (tok/s)** | **Throughput - Mixed precision (tok/s)** | **Throughput speedup (TF32 to Mixed precision)** | **Strong Scaling - TF32** | **Strong Scaling - Mixed precision** |
| --- | --- | ------ | ------ | ----- | ----- | ----- |
| 1 | 128 | 83214 | 140909 | 1.693 | 1.000 | 1.000 |
| 4 | 128 | 278576 | 463144 | 1.663 | 3.348 | 3.287 |
| 8 | 128 | 519952 | 822024 | 1.581 | 6.248 | 5.834 |
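
The derived columns in the throughput tables follow directly from the raw
measurements: the speedup column divides mixed precision throughput by TF32
(or FP32) throughput at the same GPU count, and the strong scaling columns
divide multi-GPU throughput by the corresponding single-GPU throughput. A
small sketch of that arithmetic, using the numbers from the table above, is
shown below:

```
# Sketch: recompute the derived columns of the DGX A100 throughput table
# from the raw tok/s measurements.
tf32 = {1: 83214, 4: 278576, 8: 519952}
fp16 = {1: 140909, 4: 463144, 8: 822024}

for gpus in sorted(tf32):
    speedup = fp16[gpus] / tf32[gpus]        # TF32 -> mixed precision speedup
    scaling_tf32 = tf32[gpus] / tf32[1]      # strong scaling, TF32
    scaling_fp16 = fp16[gpus] / fp16[1]      # strong scaling, mixed precision
    print(f"{gpus} GPUs: speedup {speedup:.3f}, "
          f"scaling TF32 {scaling_tf32:.3f}, scaling FP16 {scaling_fp16:.3f}")
```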

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training throughput: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the `train.py` training script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.
Throughput performance numbers (in tokens per second) were averaged over an
entire training epoch.

| **GPUs** | **Batch size / GPU** | **Throughput - FP32 (tok/s)** | **Throughput - Mixed precision (tok/s)** | **Throughput speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong Scaling - Mixed precision** |
| --- | --- | ------ | ------ | ----- | ----- | ----- |
| 1 | 128 | 21860 | 76438 | 3.497 | 1.000 | 1.000 |
| 4 | 128 | 80224 | 249168 | 3.106 | 3.670 | 3.260 |
| 8 | 128 | 154168 | 447832 | 2.905 | 7.053 | 5.859 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training throughput: NVIDIA DGX-2H (16x V100 32GB)

Our results were obtained by running the `train.py` training script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX-2H with 16x V100 32GB GPUs.
Throughput performance numbers (in tokens per second) were averaged over an
entire training epoch.

| **GPUs** | **Batch size / GPU** | **Throughput - FP32 (tok/s)** | **Throughput - Mixed precision (tok/s)** | **Throughput speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong Scaling - Mixed precision** |
| --- | --- | ------ | ------ | ----- | ------ | ------ |
| 1 | 128 | 25583 | 87829 | 3.433 | 1.000 | 1.000 |
| 4 | 128 | 91400 | 290640 | 3.180 | 3.573 | 3.309 |
| 8 | 128 | 176616 | 522008 | 2.956 | 6.904 | 5.943 |
| 16 | 128 | 351792 | 1010880 | 2.874 | 13.751 | 11.510 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

#### Inference accuracy results

##### Inference accuracy: NVIDIA A100 40GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA A100 40GB GPU. Full
command to launch the inference accuracy benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

| **Batch Size** | **Beam Size** | **Accuracy - TF32 (BLEU)** | **Accuracy - FP16 (BLEU)** |
| -------------: | ------------: | -------------------------: | -------------------------: |
| 128 | 1 | 23.07 | 23.07 |
| 128 | 2 | 23.81 | 23.81 |
| 128 | 5 | 24.41 | 24.43 |

##### Inference accuracy: NVIDIA Tesla V100 16GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA Tesla V100 16GB GPU. Full
command to launch the inference accuracy benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

| **Batch Size** | **Beam Size** | **Accuracy - FP32 (BLEU)** | **Accuracy - FP16 (BLEU)** |
| -------------: | ------------: | -------------------------: | -------------------------: |
| 128 | 1 | 23.07 | 23.07 |
| 128 | 2 | 23.81 | 23.79 |
| 128 | 5 | 24.40 | 24.43 |

##### Inference accuracy: NVIDIA T4

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA Tesla T4. Full command to
launch the inference accuracy benchmark was provided in the [Inference
performance benchmark](#inference-performance-benchmark) section.

| **Batch Size** | **Beam Size** | **Accuracy - FP32 (BLEU)** | **Accuracy - FP16 (BLEU)** |
| -------------: | ------------: | -------------------------: | -------------------------: |
| 128 | 1 | 23.07 | 23.08 |
| 128 | 2 | 23.81 | 23.80 |
| 128 | 5 | 24.40 | 24.39 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

#### Inference throughput results

Tables presented in this section show the average inference throughput (columns
**Avg (tok/s)**) and inference throughput for various confidence intervals
(columns **N% (tok/s)**, where `N` denotes the confidence interval). Inference
throughput is measured in tokens per second. Speedups reported in FP16
subsections are relative to FP32 (for NVIDIA Volta and NVIDIA Turing) and
relative to TF32 (for NVIDIA Ampere) numbers for the corresponding configuration.

##### Inference throughput: NVIDIA A100 40GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA A100 40GB.
Full command to launch the inference throughput benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (tok/s)**|**Speedup**|**90% (tok/s)**|**Speedup**|**95% (tok/s)**|**Speedup**|**99% (tok/s)**|**Speedup**|
|-------------:|------------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|
| 1| 1| 1291.6| 1.031| 1195.7| 1.029| 1165.8| 1.029| 1104.7| 1.030|
| 1| 2| 882.7| 1.019| 803.4| 1.015| 769.2| 1.015| 696.7| 1.017|
| 1| 5| 848.3| 1.042| 753.0| 1.037| 715.0| 1.043| 636.4| 1.033|
| 2| 1| 2060.5| 1.034| 1700.8| 1.032| 1621.8| 1.032| 1487.4| 1.022|
| 2| 2| 1445.7| 1.026| 1197.6| 1.024| 1132.5| 1.023| 1043.7| 1.033|
| 2| 5| 1402.3| 1.063| 1152.4| 1.056| 1100.5| 1.053| 992.9| 1.053|
| 4| 1| 3465.6| 1.046| 2838.3| 1.040| 2672.7| 1.043| 2392.8| 1.043|
| 4| 2| 2425.4| 1.041| 2002.5| 1.028| 1898.3| 1.033| 1690.2| 1.028|
| 4| 5| 2364.4| 1.075| 1930.0| 1.067| 1822.0| 1.065| 1626.1| 1.058|
| 8| 1| 6151.1| 1.099| 5078.0| 1.087| 4786.5| 1.096| 4206.9| 1.090|
| 8| 2| 4241.9| 1.075| 3494.1| 1.066| 3293.6| 1.066| 2970.9| 1.064|
| 8| 5| 4117.7| 1.118| 3430.9| 1.103| 3224.5| 1.104| 2833.5| 1.110|
| 32| 1| 18830.4| 1.147| 16210.0| 1.152| 15563.9| 1.138| 13973.2| 1.135|
| 32| 2| 12698.2| 1.133| 10812.3| 1.114| 10256.1| 1.145| 9330.2| 1.101|
| 32| 5| 11802.6| 1.355| 9998.8| 1.318| 9671.6| 1.329| 9058.4| 1.335|
| 128| 1| 53394.5| 1.350| 48867.6| 1.342| 46898.5| 1.414| 40670.6| 1.305|
| 128| 2| 34876.4| 1.483| 31687.4| 1.491| 30025.4| 1.505| 27677.1| 1.421|
| 128| 5| 28201.3| 1.986| 25660.5| 1.997| 24306.0| 1.967| 23326.2| 2.007|
| 512| 1| 119675.3| 1.904| 112400.5| 1.971| 109694.8| 1.927| 108781.3| 1.919|
| 512| 2| 74514.7| 2.126| 69578.9| 2.209| 69348.1| 2.210| 69253.7| 2.212|
| 512| 5| 47003.2| 2.760| 43348.2| 2.893| 43080.3| 2.884| 42878.4| 2.881|

##### Inference throughput: NVIDIA T4

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA T4.
Full command to launch the inference throughput benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (tok/s)**|**Speedup**|**90% (tok/s)**|**Speedup**|**95% (tok/s)**|**Speedup**|**99% (tok/s)**|**Speedup**|
|-------------:|------------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|
| 1| 1| 1133.8| 1.266| 1059.1| 1.253| 1036.6| 1.251| 989.5| 1.242|
| 1| 2| 793.9| 1.169| 728.3| 1.165| 698.1| 1.163| 637.1| 1.157|
| 1| 5| 766.8| 1.343| 685.6| 1.335| 649.3| 1.335| 584.1| 1.318|
| 2| 1| 1759.8| 1.233| 1461.6| 1.239| 1402.3| 1.242| 1302.1| 1.242|
| 2| 2| 1313.3| 1.186| 1088.7| 1.185| 1031.6| 1.180| 953.2| 1.178|
| 2| 5| 1257.2| 1.301| 1034.1| 1.316| 990.3| 1.313| 886.3| 1.265|
| 4| 1| 2974.0| 1.261| 2440.3| 1.255| 2294.6| 1.257| 2087.7| 1.261|
| 4| 2| 2204.7| 1.320| 1826.3| 1.283| 1718.9| 1.260| 1548.4| 1.260|
| 4| 5| 2106.1| 1.340| 1727.8| 1.345| 1625.7| 1.353| 1467.7| 1.346|
| 8| 1| 5076.6| 1.423| 4207.9| 1.367| 3904.4| 1.360| 3475.3| 1.355|
| 8| 2| 3761.7| 1.311| 3108.1| 1.285| 2931.6| 1.300| 2628.7| 1.300|
| 8| 5| 3578.2| 1.660| 2998.2| 1.614| 2812.1| 1.609| 2447.6| 1.523|
| 32| 1| 14637.8| 1.636| 12702.5| 1.644| 12070.3| 1.634| 11036.9| 1.647|
| 32| 2| 10627.3| 1.818| 9198.3| 1.818| 8431.6| 1.725| 8000.0| 1.773|
| 32| 5| 8205.7| 2.598| 7117.6| 2.476| 6825.2| 2.497| 6293.2| 2.437|
| 128| 1| 33800.5| 2.755| 30824.5| 2.816| 27685.2| 2.661| 26580.9| 2.694|
| 128| 2| 20829.4| 2.795| 18665.2| 2.778| 17372.1| 2.639| 16820.5| 2.821|
| 128| 5| 11753.9| 3.309| 10658.1| 3.273| 10308.7| 3.205| 9630.7| 3.328|
| 512| 1| 44474.6| 3.327| 40108.1| 3.394| 39816.6| 3.378| 39708.0| 3.381|
| 512| 2| 26057.9| 3.295| 23197.3| 3.294| 23019.8| 3.284| 22951.4| 3.284|
| 512| 5| 12161.5| 3.428| 10777.5| 3.418| 10733.1| 3.414| 10710.5| 3.420|

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

#### Inference latency results

Tables presented in this section show the average inference latency (columns **Avg
(ms)**) and inference latency for various confidence intervals (columns **N%
(ms)**, where `N` denotes the confidence interval). Inference latency is
measured in milliseconds. Speedups reported in FP16 subsections are relative to
FP32 (for NVIDIA Volta and NVIDIA Turing) and relative to TF32 (for NVIDIA
Ampere) numbers for the corresponding configuration.
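
The percentile columns can be derived from per-iteration latency measurements
collected during a benchmark run. The sketch below is only an illustration
(not part of the repository); `latencies_ms` is a placeholder for the measured
values:

```
# Sketch: summarize per-iteration latency measurements into the average and
# percentile columns used in these tables.
import numpy as np

# Placeholder: per-iteration latencies in milliseconds from a benchmark run.
latencies_ms = np.array([44.1, 45.3, 46.0, 47.2, 49.8, 52.5, 60.3, 74.9, 84.6, 99.1])

print(f"Avg (ms): {latencies_ms.mean():.2f}")
for confidence in (90, 95, 99):
    print(f"{confidence}% (ms): {np.percentile(latencies_ms, confidence):.2f}")
```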

##### Inference latency: NVIDIA A100 40GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA A100 40GB.
Full command to launch the inference latency benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (ms)**|**Speedup**|**90% (ms)**|**Speedup**|**95% (ms)**|**Speedup**|**99% (ms)**|**Speedup**|
|-------------:|------------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|
| 1| 1| 44.69| 1.032| 74.04| 1.035| 84.61| 1.034| 99.14| 1.042|
| 1| 2| 64.76| 1.020| 105.18| 1.018| 118.92| 1.019| 139.42| 1.023|
| 1| 5| 67.06| 1.043| 107.56| 1.049| 121.82| 1.054| 143.85| 1.054|
| 2| 1| 56.57| 1.034| 85.59| 1.037| 92.55| 1.038| 107.59| 1.046|
| 2| 2| 80.22| 1.027| 119.22| 1.027| 128.43| 1.030| 150.06| 1.028|
| 2| 5| 82.54| 1.063| 121.37| 1.067| 132.35| 1.069| 156.34| 1.059|
| 4| 1| 67.29| 1.047| 92.69| 1.048| 100.08| 1.056| 112.63| 1.064|
| 4| 2| 95.86| 1.041| 129.83| 1.040| 139.48| 1.044| 162.34| 1.045|
| 4| 5| 98.34| 1.075| 133.83| 1.076| 142.70| 1.068| 168.30| 1.075|
| 8| 1| 75.60| 1.099| 97.87| 1.103| 104.13| 1.099| 117.40| 1.102|
| 8| 2| 109.38| 1.074| 137.71| 1.079| 147.69| 1.069| 168.79| 1.065|
| 8| 5| 112.71| 1.116| 143.50| 1.104| 153.17| 1.118| 172.60| 1.113|
| 32| 1| 98.40| 1.146| 117.02| 1.153| 123.42| 1.150| 129.01| 1.128|
| 32| 2| 145.87| 1.133| 171.71| 1.159| 184.01| 1.127| 188.64| 1.141|
| 32| 5| 156.82| 1.357| 189.10| 1.374| 194.95| 1.392| 196.65| 1.419|
| 128| 1| 137.97| 1.350| 150.04| 1.348| 151.52| 1.349| 154.52| 1.434|
| 128| 2| 211.58| 1.484| 232.96| 1.490| 237.46| 1.505| 239.86| 1.567|
| 128| 5| 261.44| 1.990| 288.54| 2.017| 291.63| 2.052| 298.73| 2.136|
| 512| 1| 245.93| 1.906| 262.51| 1.998| 264.24| 1.999| 265.23| 2.000|
| 512| 2| 395.61| 2.129| 428.54| 2.219| 431.58| 2.224| 433.86| 2.227|
| 512| 5| 627.21| 2.767| 691.72| 2.878| 696.01| 2.895| 702.13| 2.887|

##### Inference latency: NVIDIA T4

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA T4.
Full command to launch the inference latency benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (ms)**|**Speedup**|**90% (ms)**|**Speedup**|**95% (ms)**|**Speedup**|**99% (ms)**|**Speedup**|
|-------------:|------------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|
| 1| 1| 51.08| 1.261| 84.82| 1.254| 97.45| 1.251| 114.6| 1.257|
| 1| 2| 72.05| 1.168| 117.41| 1.165| 132.33| 1.170| 155.8| 1.174|
| 1| 5| 74.20| 1.345| 119.45| 1.352| 135.07| 1.354| 160.3| 1.354|
| 2| 1| 66.31| 1.232| 100.90| 1.232| 108.52| 1.235| 126.9| 1.238|
| 2| 2| 88.35| 1.185| 131.47| 1.188| 141.46| 1.185| 164.7| 1.191|
| 2| 5| 92.12| 1.305| 136.30| 1.310| 148.66| 1.309| 174.8| 1.320|
| 4| 1| 78.54| 1.260| 108.53| 1.256| 117.19| 1.259| 133.7| 1.259|
| 4| 2| 105.54| 1.315| 142.74| 1.317| 154.36| 1.307| 178.7| 1.303|
| 4| 5| 110.43| 1.351| 150.62| 1.388| 161.61| 1.397| 191.2| 1.427|
| 8| 1| 91.65| 1.418| 117.92| 1.421| 126.60| 1.405| 144.0| 1.411|
| 8| 2| 123.39| 1.315| 156.00| 1.337| 167.34| 1.347| 193.4| 1.340|
| 8| 5| 129.69| 1.666| 165.01| 1.705| 178.18| 1.723| 200.3| 1.765|
| 32| 1| 126.53| 1.641| 153.23| 1.689| 159.58| 1.692| 167.0| 1.700|
| 32| 2| 174.37| 1.822| 209.04| 1.899| 219.59| 1.877| 228.6| 1.878|
| 32| 5| 226.15| 2.598| 277.38| 2.636| 290.27| 2.648| 299.4| 2.664|
| 128| 1| 218.29| 2.755| 238.94| 2.826| 243.18| 2.843| 267.1| 2.828|
| 128| 2| 354.83| 2.796| 396.63| 2.832| 410.53| 2.803| 433.2| 2.866|
| 128| 5| 628.32| 3.311| 699.57| 3.353| 723.98| 3.323| 771.0| 3.337|
| 512| 1| 663.07| 3.330| 748.62| 3.388| 753.20| 3.388| 758.0| 3.378|
| 512| 2| 1134.04| 3.295| 1297.85| 3.283| 1302.25| 3.304| 1306.9| 3.308|
| 512| 5| 2428.82| 3.428| 2771.72| 3.415| 2801.32| 3.427| 2817.6| 3.422|

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

## Release notes

### Changelog

* July 2020
  * Added support for NVIDIA DGX A100
  * Default container updated to NGC PyTorch 20.06-py3
* June 2019
  * Default container updated to NGC PyTorch 19.05-py3
  * Mixed precision training implemented using APEX AMP
  * Added inference throughput and latency results on NVIDIA T4 and NVIDIA
    Tesla V100 16GB
  * Added option to run inference on user-provided raw input text from command
    line
* February 2019
  * Different batching algorithm (bucketing with 5 equal-width buckets)
  * Additional dropouts before first LSTM layer in encoder and in decoder
  * Weight initialization changed to uniform (-0.1, 0.1)
  * Switched order of dropout and concatenation with attention in decoder
  * Default container updated to NGC PyTorch 19.01-py3
* December 2018
  * Added exponential warm-up and step learning rate decay
  * Multi-GPU (distributed) inference and validation
  * Default container updated to NGC PyTorch 18.11-py3
  * General performance improvements
* August 2018
  * Initial release

### Known issues

There are no known issues in this release.