# GNMT v2 For PyTorch

This repository provides a script and recipe to train GNMT v2 to achieve
state-of-the-art accuracy, and is tested and maintained by NVIDIA.

## Table Of Contents

* [The model](#the-model)
* [Default configuration](#default-configuration)
* [Setup](#setup)
  * [Requirements](#requirements)
  * [Training using mixed precision with Tensor Cores](#training-using-mixed-precision-with-tensor-cores)
* [Quick Start Guide](#quick-start-guide)
* [Details](#details)
  * [Command line arguments](#command-line-arguments)
  * [Getting the data](#getting-the-data)
  * [Training process](#training-process)
  * [Inference process](#inference-process)
* [Results](#results)
  * [Training accuracy results](#training-accuracy-results)
    * [NVIDIA DGX-1 (8x Tesla V100 16G)](#nvidia-dgx-1-8x-tesla-v100-16g)
    * [NVIDIA DGX-2 (16x Tesla V100 32G)](#nvidia-dgx-2-16x-tesla-v100-32g)
    * [Training stability test](#training-stability-test)
  * [Training performance results](#training-performance-results)
    * [NVIDIA DGX-1 (8x Tesla V100 16G)](#nvidia-dgx-1-8x-tesla-v100-16g-1)
    * [NVIDIA DGX-2 (16x Tesla V100 32G)](#nvidia-dgx-2-16x-tesla-v100-32g-1)
  * [Inference performance results](#inference-performance-results)
* [Changelog](#changelog)
* [Known issues](#known-issues)

## The model
The GNMT v2 model is similar to the one discussed in the
[Google's Neural Machine Translation System: Bridging the Gap between Human
and Machine Translation](https://arxiv.org/abs/1609.08144) paper.

The most important difference between the two models is in the attention
mechanism. In our model, the output from the first LSTM layer of the decoder
goes into the attention module, then the re-weighted context is concatenated
with the inputs to all subsequent LSTM layers in the decoder at the current
timestep.

The same attention mechanism is also implemented in the default
GNMT-like models from the
[TensorFlow Neural Machine Translation Tutorial](https://github.com/tensorflow/nmt)
and the
[NVIDIA OpenSeq2Seq Toolkit](https://github.com/NVIDIA/OpenSeq2Seq).

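For illustration, below is a minimal sketch of how such a normalized Bahdanau
attention module could look in PyTorch. This is hypothetical code written for
this description, not the implementation from this repository; the
initialization values follow the defaults listed in the next section.

```
import math
import torch
import torch.nn as nn

class NormalizedBahdanauAttention(nn.Module):
    """Additive (Bahdanau) attention with weight normalization (sketch)."""

    def __init__(self, hidden_size=1024):
        super().__init__()
        self.linear_q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.linear_k = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Parameter(torch.empty(hidden_size))
        # normalization scalar initialized to 1/sqrt(hidden_size),
        # normalization bias initialized to zero
        self.scale = nn.Parameter(torch.tensor(1.0 / math.sqrt(hidden_size)))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        for p in (self.linear_q.weight, self.linear_k.weight, self.v):
            nn.init.uniform_(p, -0.1, 0.1)

    def forward(self, query, keys):
        # query: (batch, 1, hidden) -- output of the first decoder LSTM layer
        # keys:  (batch, src_len, hidden) -- encoder outputs
        sum_qk = self.linear_q(query) + self.linear_k(keys) + self.bias
        v = self.scale * self.v / self.v.norm()   # weight-normalized scoring vector
        scores = torch.tanh(sum_qk).matmul(v)     # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)
        context = weights.unsqueeze(1).bmm(keys)  # (batch, 1, hidden)
        return context, weights
```

At each decoder timestep, the returned context is concatenated with the
inputs to the subsequent decoder LSTM layers.
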
## Default configuration

The following features were implemented in this model:

* general:
  * encoder and decoder with shared embeddings
  * data-parallel multi-GPU training
  * dynamic loss scaling with backoff for Tensor Cores (mixed precision)
    training
  * trained with label smoothing loss, smoothing factor 0.1 (a minimal sketch
    appears at the end of this section)
* encoder:
  * 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest
    are unidirectional
  * residual connections starting from the 3rd layer
  * uses the standard PyTorch `nn.LSTM` layer
  * dropout is applied to the input of all LSTM layers, with dropout
    probability set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
* decoder:
  * 4-layer unidirectional LSTM with hidden size 1024 and a fully-connected
    classifier
  * residual connections starting from the 3rd layer
  * uses the standard PyTorch `nn.LSTM` layer
  * dropout is applied to the input of all LSTM layers, with dropout
    probability set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
  * weights and biases of the fully-connected classifier are initialized with
    the uniform(-0.1, 0.1) distribution
* attention:
  * normalized Bahdanau attention
  * output from the first LSTM layer of the decoder goes into the attention
    module, then the re-weighted context is concatenated with the input to
    all subsequent LSTM layers of the decoder at the current timestep
  * linear transform of keys and queries is initialized with
    uniform(-0.1, 0.1), the normalization scalar is initialized with
    1.0 / sqrt(1024), and the normalization bias is initialized with zero
* inference:
  * beam search with a default beam size of 5
  * coverage penalty and length normalization; the coverage penalty factor is
    set to 0.1, the length normalization factor is set to 0.6 and the length
    normalization constant is set to 5.0
  * de-tokenized BLEU computed by
    [SacreBLEU](https://github.com/awslabs/sockeye/tree/master/sockeye_contrib/sacrebleu)
  * [motivation](https://github.com/awslabs/sockeye/tree/master/sockeye_contrib/sacrebleu#motivation)
    for choosing SacreBLEU

When comparing BLEU scores, note that there are various tokenization
approaches and BLEU calculation methodologies; therefore, ensure that you
compare scores computed with the same methodology.

Code from this repository can be used to train a larger, 8-layer GNMT v2
model. Our experiments show that a 4-layer model is significantly faster to
train and yields comparable accuracy on the public
[WMT16 English-German](http://www.statmt.org/wmt16/translation-task.html)
dataset. The number of LSTM layers is controlled by the `--num-layers`
parameter of the `train.py` training script.

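As an illustration of the label smoothing loss listed above, the following is
a minimal, hypothetical sketch (not the repository's exact implementation),
assuming a smoothing factor of 0.1 and a padding index that is excluded from
the loss:

```
import torch
import torch.nn as nn

class LabelSmoothingLoss(nn.Module):
    """Cross-entropy with uniform label smoothing (sketch)."""

    def __init__(self, padding_idx, smoothing=0.1):
        super().__init__()
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.padding_idx = padding_idx

    def forward(self, logits, target):
        # logits: (N, vocab_size), target: (N,)
        logprobs = torch.log_softmax(logits, dim=-1)
        nll = -logprobs.gather(1, target.unsqueeze(1)).squeeze(1)
        # uniform smoothing over the full vocabulary (a common simplification)
        smooth = -logprobs.mean(dim=-1)
        loss = self.confidence * nll + self.smoothing * smooth
        return loss[target.ne(self.padding_idx)].sum()
```

The smoothed target distribution keeps probability 0.9 on the reference token
and spreads the remaining 0.1 uniformly over the vocabulary.
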
# Setup

The following section lists the requirements for starting to train the
GNMT v2 model.

## Requirements

This repository contains a `Dockerfile` which extends the PyTorch NGC
container and encapsulates some dependencies. Aside from these dependencies,
ensure you have the following components:

* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.01-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* [NVIDIA Volta based GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)

For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep
Learning DGX Documentation:

* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)

## Training using mixed precision with Tensor Cores
Before you can train using mixed precision with Tensor Cores, ensure that you
have an
[NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
based GPU. Other platforms may work but are not officially supported.
For information about how to train using mixed precision, see the
[Mixed Precision Training paper](https://arxiv.org/abs/1710.03740)
and the
[Training With Mixed Precision documentation](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html).

Another option for adding mixed-precision support is NVIDIA's
[APEX](https://github.com/NVIDIA/apex), a PyTorch extension that contains
utility libraries, such as AMP, which require minimal network code changes to
leverage Tensor Core performance.

# Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32,
perform the following steps using the default parameters of the GNMT v2 model
on the *WMT16 English-German* dataset.

### 1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Translation/GNMT
```

### 2. Build the GNMT v2 container.
```
bash scripts/docker/build.sh
```

### 3. Start an interactive session in the container to run training/inference.
```
bash scripts/docker/interactive.sh
```

### 4. Download and preprocess the dataset.
Data will be downloaded to the `data` directory (on the host). The `data`
directory is mounted to the `/workspace/gnmt/data` location in the Docker
container.

```
bash scripts/wmt16_en_de.sh
```

### 5. Start training.
By default, the training script uses all available GPUs. The training script
saves only one checkpoint, the one with the lowest value of the loss function
on the validation dataset. All results and logs are saved to the `results`
directory (on the host) or to the `/workspace/gnmt/results` directory (in the
container). By default, the `train.py` script launches mixed precision
training with Tensor Cores. You can change this behaviour by setting the
`--math fp32` flag for the `train.py` training script.

To launch mixed precision training on 1, 4 or 8 GPUs, run:

```
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
```

To launch mixed precision training on 16 GPUs, run:

```
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
```

By default, the training script launches training with batch size 128 per
GPU. If the specified `--train-global-batch-size` is larger than 128 times
the number of GPUs available for the training, then the training script will
accumulate gradients over consecutive iterations and then perform the weight
update. For example, 1 GPU training with `--train-global-batch-size 1024`
will accumulate gradients over 8 iterations before doing the weight update
with the accumulated gradients.

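The following minimal sketch (hypothetical code, not an excerpt from
`train.py`) illustrates this accumulation pattern; with `train_iter_size = 8`
and batch size 128, one optimizer step on a single GPU corresponds to a
global batch of 1024 examples:

```
def train_epoch(model, criterion, optimizer, loader, train_iter_size=8):
    # Accumulate gradients over train_iter_size batches per weight update.
    optimizer.zero_grad()
    for i, (src, tgt) in enumerate(loader):
        loss = criterion(model(src), tgt)
        # Scale the loss so the summed gradient matches one large-batch step.
        (loss / train_iter_size).backward()
        if (i + 1) % train_iter_size == 0:
            optimizer.step()
            optimizer.zero_grad()
```
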
### 6. Start evaluation.
The training process automatically runs evaluation and outputs the BLEU score
after each training epoch. Additionally, after the training is done, you can
manually run inference on the test dataset with the checkpoint saved during
the training.

To launch mixed precision inference on 1 GPU, run:

```
python3 translate.py --input data/wmt16_de_en/newstest2014.tok.bpe.32000.en \
  --reference data/wmt16_de_en/newstest2014.de --output /tmp/output \
  --model results/gnmt/model_best.pth --batch-size 128
```

By default, the `translate.py` script launches mixed precision inference
with Tensor Cores. You can change this behaviour by setting the `--math fp32`
flag for the `translate.py` inference script.

# Details
The following sections provide greater details of the dataset, running
training and inference, and the training results.

## Command line arguments
To see the full list of available options and their descriptions, use the
`-h` or `--help` command line option, for example:

For training:

```
python3 train.py --help
```

To summarize, the most useful arguments for training are as follows:

```
dataset setup:
  --dataset-dir DATASET_DIR
                        path to the directory with training/test data
                        (default: data/wmt16_de_en)

results setup:
  --results-dir RESULTS_DIR
                        path to directory with results, it will be
                        automatically created if it does not exist (default:
                        results)
  --save SAVE           defines subdirectory within RESULTS_DIR for results
                        from this training run (default: gnmt)
  --print-freq PRINT_FREQ
                        print log every PRINT_FREQ batches (default: 10)

model setup:
  --num-layers NUM_LAYERS
                        number of RNN layers in encoder and in decoder
                        (default: 4)

general setup:
  --math {fp16,fp32}    arithmetic type (default: fp16)
  --seed SEED           master seed for random number generators, if "seed" is
                        undefined then the master seed will be sampled from
                        random.SystemRandom() (default: None)

training setup:
  --train-batch-size TRAIN_BATCH_SIZE
                        training batch size per worker (default: 128)
  --train-global-batch-size TRAIN_GLOBAL_BATCH_SIZE
                        global training batch size, this argument does not
                        have to be defined, if it is defined it will be used
                        to automatically compute train_iter_size using the
                        equation: train_iter_size = train_global_batch_size //
                        (train_batch_size * world_size) (default: None)
  --train-iter-size N   training iter size, training loop will accumulate
                        gradients over N iterations and execute optimizer
                        every N steps (default: 1)
  --epochs EPOCHS       max number of training epochs (default: 6)

optimizer setup:
  --optimizer OPTIMIZER
                        training optimizer (default: Adam)
  --lr LR               learning rate (default: 0.002)

test setup:
  --beam-size BEAM_SIZE
                        beam size (default: 5)
```

For inference:

```
python3 translate.py --help
```

To summarize, the most useful arguments for inference are as follows:

```
data setup:
  --dataset-dir DATASET_DIR
                        path to directory with training/test data (default:
                        data/wmt16_de_en/)
  -i INPUT, --input INPUT
                        full path to the input file (tokenized) (default:
                        None)
  -o OUTPUT, --output OUTPUT
                        full path to the output file (tokenized) (default:
                        None)
  -r REFERENCE, --reference REFERENCE
                        full path to the file with reference translations (for
                        sacrebleu) (default: None)
  -m MODEL, --model MODEL
                        full path to the model checkpoint file (default: None)

inference setup:
  --batch-size BATCH_SIZE [BATCH_SIZE ...]
                        batch size per GPU (default: [128])
  --beam-size BEAM_SIZE [BEAM_SIZE ...]
                        beam size (default: [5])
  --max-seq-len MAX_SEQ_LEN
                        maximum generated sequence length (default: 80)

general setup:
  --math {fp16,fp32} [{fp16,fp32} ...]
                        arithmetic type (default: ['fp16'])
  --bleu                compares with reference translation and computes BLEU
                        (use '--no-bleu' to disable) (default: True)
  --print-freq PRINT_FREQ, -p PRINT_FREQ
                        print log every PRINT_FREQ batches (default: 1)
```

## Getting the data
The GNMT v2 model was trained on the
[WMT16 English-German](http://www.statmt.org/wmt16/translation-task.html)
dataset. The concatenation of the *newstest2015* and *newstest2016* test sets
is used as the validation dataset and *newstest2014* is used as the test
dataset.

This repository contains the `scripts/wmt16_en_de.sh` download script which
automatically downloads and preprocesses the training, validation and test
datasets. By default, data is downloaded to the `data` directory.

Our download script is very similar to the `wmt16_en_de.sh` script from the
[tensorflow/nmt](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh)
repository. Our download script contains an extra preprocessing step, which
discards all pairs of sentences which can't be decoded with the *latin-1*
codec (a minimal sketch of this step appears below). The
`scripts/wmt16_en_de.sh` script uses the
[subword-nmt](https://github.com/rsennrich/subword-nmt)
package to segment text into subword units (Byte Pair Encodings -
[BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding)). By default, the
script builds a shared vocabulary of 32,000 tokens.

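A minimal sketch of that filtering step (hypothetical code; the actual
implementation lives inside the download script):

```
def keep_pair(src_line: str, tgt_line: str) -> bool:
    # Keep a sentence pair only if both sides are representable in latin-1.
    try:
        src_line.encode("latin-1")
        tgt_line.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False
```
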
In order to test with other datasets, the scripts need to be customized
accordingly.

## Training process
The default training configuration can be launched by running the `train.py`
training script. By default, the training script saves only one checkpoint,
the one with the lowest value of the loss function on the validation dataset;
an evaluation is performed after each training epoch. Results are stored in
the `results/gnmt` directory.

The training script launches data-parallel training with batch size 128 per
GPU on all available GPUs. It has been tested on up to 16 GPUs on a single
node. After each training epoch, the script runs an evaluation on the
validation dataset and outputs a BLEU score on the test dataset
(*newstest2014*). BLEU is computed by the
[SacreBLEU](https://github.com/awslabs/sockeye/tree/master/sockeye_contrib/sacrebleu)
package. Logs from the training and evaluation are saved to the `results`
directory.

The summary after each training epoch is printed in the following format:

```
Summary: Epoch: 3 Training Loss: 3.1735 Validation Loss: 3.0511 Test BLEU: 21.89
Performance: Epoch: 3 Training: 300155 Tok/s Validation: 156066 Tok/s
```

The training loss is averaged over the entire training epoch, the validation
loss is averaged over the validation dataset and the BLEU score is computed
on the test dataset. Performance is reported in total tokens per second. The
result is averaged over the entire training epoch and summed over all GPUs
participating in the training.

Although the training script uses all available GPUs by default, you can
change this behavior by setting the `CUDA_VISIBLE_DEVICES` variable in your
environment or by setting the `NV_GPU` variable at the Docker container
launch
([see section "GPU isolation"](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation)).

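For example, to train on only the first four GPUs of a machine, you could run
the following (512 = 4 GPUs times the per-GPU batch size of 128, so no
gradient accumulation is needed):

```
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m launch train.py --seed 2 \
  --train-global-batch-size 512
```
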
By default, the `train.py` script launches mixed precision training with
Tensor Cores. You can change this behaviour by setting the `--math fp32` flag
for the `train.py` script.

To view all available options for training, run `python3 train.py --help`.

## Inference process
Inference can be run by launching the `translate.py` inference script,
although it requires a pre-trained model checkpoint and tokenized input.

The inference script, `translate.py`, supports batched inference. By default,
it launches beam search with a beam size of 5, a coverage penalty term and a
length normalization term. Greedy decoding can be enabled by setting the beam
size to 1.

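For example, to run greedy (beam size 1) mixed precision inference with the
checkpoint from the Quick Start Guide:

```
python3 translate.py --input data/wmt16_de_en/newstest2014.tok.bpe.32000.en \
  --reference data/wmt16_de_en/newstest2014.de --output /tmp/output \
  --model results/gnmt/model_best.pth --beam-size 1
```
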
To view all available options for inference, run `python3 translate.py --help`.

# Results

The following sections provide details on how we achieved our performance and
accuracy in training and inference.

## Training accuracy results
Our results were obtained by running the `train.py` script with the default
batch size of 128 per GPU in the pytorch-19.01-py3 Docker container.

### NVIDIA DGX-1 (8x Tesla V100 16G)
Command to launch the training:

```
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
```

| **Number of GPUs** | **Batch size/GPU** | **Mixed precision BLEU** | **FP32 BLEU** | **Mixed precision training time** | **FP32 training time** |
| --- | --- | ----- | ----- | ------------- | ------------- |
| 1 | 128 | 24.59 | 24.71 | 264.4 minutes | 824.4 minutes |
| 4 | 128 | 24.30 | 24.45 | 89.5 minutes | 230.8 minutes |
| 8 | 128 | 24.45 | 24.48 | 46.2 minutes | 116.6 minutes |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

### NVIDIA DGX-2 (16x Tesla V100 32G)
Commands to launch the training:

```
# for 1, 4 or 8 GPUs:
python3 -m launch train.py --seed 2 --train-global-batch-size 1024

# for 16 GPUs:
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
```

| **Number of GPUs** | **Batch size/GPU** | **Mixed precision BLEU** | **FP32 BLEU** | **Mixed precision training time** | **FP32 training time** |
| --- | --- | ----- | ----- | ------------- | ------------- |
| 1 | 128 | 24.59 | 24.71 | 265.0 minutes | 825.1 minutes |
| 4 | 128 | 24.69 | 24.33 | 87.4 minutes | 216.3 minutes |
| 8 | 128 | 24.50 | 24.47 | 49.6 minutes | 113.5 minutes |
| 16 | 128 | 24.22 | 24.16 | 26.3 minutes | 58.6 minutes |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

![TrainingLoss](./img/training_loss.png)

### Training stability test
The GNMT v2 model was trained for 6 epochs, starting from 50 different
initial random seeds. After each training epoch, the model was evaluated on
the test dataset and the BLEU score was recorded. The training was performed
in the pytorch-19.01-py3 Docker container on NVIDIA DGX-1 with 8 Tesla V100
16G GPUs. The following plot and table summarize the results of the stability
test.

![TrainingAccuracy](./img/training_accuracy.png)

In the following table, the BLEU scores after each training epoch for
different initial random seeds are displayed.

| **Epoch** | **Average** | **Standard deviation** | **Minimum** | **Maximum** | **Median** |
| --- | ------ | ----- | ------ | ------ | ------ |
| 1 | 19.954 | 0.326 | 18.710 | 20.490 | 20.020 |
| 2 | 21.734 | 0.222 | 21.220 | 22.120 | 21.765 |
| 3 | 22.502 | 0.223 | 21.960 | 22.970 | 22.485 |
| 4 | 23.004 | 0.221 | 22.350 | 23.430 | 23.020 |
| 5 | 24.201 | 0.146 | 23.900 | 24.480 | 24.215 |
| 6 | 24.423 | 0.159 | 24.070 | 24.820 | 24.395 |

## Training performance results
Our results were obtained by running the `train.py` training script in the
pytorch-19.01-py3 Docker container. Performance numbers (in tokens per
second) were averaged over an entire training epoch.

### NVIDIA DGX-1 (8x Tesla V100 16G)

| **Number of GPUs** | **Batch size/GPU** | **Mixed precision tokens/s** | **FP32 tokens/s** | **Mixed precision speedup** | **Mixed precision multi-GPU strong scaling** | **FP32 multi-GPU strong scaling** |
| --- | --- | ------ | ------ | ----- | ----- | ----- |
| 1 | 128 | 66050 | 21346 | 3.094 | 1.000 | 1.000 |
| 4 | 128 | 196174 | 76083 | 2.578 | 2.970 | 3.564 |
| 8 | 128 | 387282 | 153697 | 2.520 | 5.863 | 7.200 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

### NVIDIA DGX-2 (16x Tesla V100 32G)

| **Number of GPUs** | **Batch size/GPU** | **Mixed precision tokens/s** | **FP32 tokens/s** | **Mixed precision speedup** | **Mixed precision multi-GPU strong scaling** | **FP32 multi-GPU strong scaling** |
| --- | --- | ------ | ------- | ----- | ------ | ------ |
| 1 | 128 | 65830 | 22695 | 2.901 | 1.000 | 1.000 |
| 4 | 128 | 200886 | 81224 | 2.473 | 3.052 | 3.579 |
| 8 | 128 | 362612 | 156536 | 2.316 | 5.508 | 6.897 |
| 16 | 128 | 738521 | 314831 | 2.346 | 11.219 | 13.872 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

## Inference performance results
Our results were obtained by running the `translate.py` script in the
pytorch-19.01-py3 Docker container on NVIDIA DGX-1. The inference benchmark
was run on a single Tesla V100 16G GPU. The benchmark requires a checkpoint
from a fully trained model.

Command to launch the inference benchmark:

```
python3 translate.py --input data/wmt16_de_en/newstest2014.tok.bpe.32000.en \
  --reference data/wmt16_de_en/newstest2014.de --output /tmp/output \
  --model results/gnmt/model_best.pth --batch-size 32 128 512 \
  --beam-size 1 2 5 10 --math fp16 fp32
```

| **Batch size** | **Beam size** | **Mixed precision BLEU** | **FP32 BLEU** | **Mixed precision tokens/s** | **FP32 tokens/s** |
| ---- | ----- | ------- | ------- | --------- | -------- |
| 32 | 1 | 23.18 | 23.18 | 23571 | 19462 |
| 32 | 2 | 24.09 | 24.12 | 15303 | 12345 |
| 32 | 5 | 24.63 | 24.62 | 13644 | 7725 |
| 32 | 10 | 24.50 | 24.48 | 11049 | 5359 |
| 128 | 1 | 23.17 | 23.18 | 73429 | 42272 |
| 128 | 2 | 24.07 | 24.12 | 43373 | 23131 |
| 128 | 5 | 24.69 | 24.63 | 29646 | 12525 |
| 128 | 10 | 24.45 | 24.48 | 19100 | 6886 |
| 512 | 1 | 23.17 | 23.18 | 135333 | 48962 |
| 512 | 2 | 24.08 | 24.12 | 74367 | 27308 |
| 512 | 5 | 24.60 | 24.63 | 39217 | 12674 |
| 512 | 10 | 24.54 | 24.48 | 21433 | 6640 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

# Changelog
1. Aug 7, 2018
   * Initial release
2. Dec 4, 2018
   * Added exponential warm-up and step learning rate decay
   * Multi-GPU (distributed) inference and validation
   * Default container updated to PyTorch 18.11-py3
   * General performance improvements
3. Feb 14, 2019
   * Different batching algorithm (bucketing with 5 equal-width buckets)
   * Additional dropouts before the first LSTM layer in the encoder and in
     the decoder
   * Weight initialization changed to uniform(-0.1, 0.1)
   * Switched order of dropout and concatenation with attention in the decoder
   * Default container updated to PyTorch 19.01-py3

# Known issues
There are no known issues in this release.