# GNMT v2 For PyTorch

This repository provides a script and recipe to train the GNMT v2 model to
achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.

## Table Of Contents

<!-- TOC GFM -->

* [Model overview](#model-overview)
    * [Model architecture](#model-architecture)
    * [Default configuration](#default-configuration)
    * [Feature support matrix](#feature-support-matrix)
        * [Features](#features)
    * [Mixed precision training](#mixed-precision-training)
        * [Enabling mixed precision](#enabling-mixed-precision)
        * [Enabling TF32](#enabling-tf32)
* [Setup](#setup)
    * [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
    * [Scripts and sample code](#scripts-and-sample-code)
    * [Parameters](#parameters)
    * [Command-line options](#command-line-options)
    * [Getting the data](#getting-the-data)
        * [Dataset guidelines](#dataset-guidelines)
    * [Training process](#training-process)
    * [Inference process](#inference-process)
* [Performance](#performance)
    * [Benchmarking](#benchmarking)
        * [Training performance benchmark](#training-performance-benchmark)
        * [Inference performance benchmark](#inference-performance-benchmark)
    * [Results](#results)
        * [Training accuracy results](#training-accuracy-results)
            * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
            * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
            * [Training accuracy: NVIDIA DGX-2H (16x V100 32GB)](#training-accuracy-nvidia-dgx-2h-16x-v100-32gb)
            * [Training stability test](#training-stability-test)
        * [Training throughput results](#training-throughput-results)
            * [Training throughput: NVIDIA DGX A100 (8x A100 40GB)](#training-throughput-nvidia-dgx-a100-8x-a100-40gb)
            * [Training throughput: NVIDIA DGX-1 (8x V100 16GB)](#training-throughput-nvidia-dgx-1-8x-v100-16gb)
            * [Training throughput: NVIDIA DGX-2H (16x V100 32GB)](#training-throughput-nvidia-dgx-2h-16x-v100-32gb)
        * [Inference accuracy results](#inference-accuracy-results)
            * [Inference accuracy: NVIDIA A100 40GB](#inference-accuracy-nvidia-a100-40gb)
            * [Inference accuracy: NVIDIA Tesla V100 16GB](#inference-accuracy-nvidia-tesla-v100-16gb)
            * [Inference accuracy: NVIDIA T4](#inference-accuracy-nvidia-t4)
        * [Inference throughput results](#inference-throughput-results)
            * [Inference throughput: NVIDIA A100 40GB](#inference-throughput-nvidia-a100-40gb)
            * [Inference throughput: NVIDIA T4](#inference-throughput-nvidia-t4)
        * [Inference latency results](#inference-latency-results)
            * [Inference latency: NVIDIA A100 40GB](#inference-latency-nvidia-a100-40gb)
            * [Inference latency: NVIDIA T4](#inference-latency-nvidia-t4)
* [Release notes](#release-notes)
    * [Changelog](#changelog)
    * [Known issues](#known-issues)

<!-- /TOC -->

## Model overview

The GNMT v2 model is similar to the one discussed in the [Google's Neural
Machine Translation System: Bridging the Gap between Human and Machine
Translation](https://arxiv.org/abs/1609.08144) paper.

The most important difference between the two models is in the attention
mechanism. In our model, the output from the first LSTM layer of the decoder
goes into the attention module, then the re-weighted context is concatenated
with inputs to all subsequent LSTM layers in the decoder at the current
time step.

The same attention mechanism is also implemented in the default GNMT-like
models from [TensorFlow Neural Machine Translation
Tutorial](https://github.com/tensorflow/nmt) and [NVIDIA OpenSeq2Seq
Toolkit](https://github.com/NVIDIA/OpenSeq2Seq).

### Model architecture

![ModelArchitecture](./img/diagram.png)

### Default configuration

The following features were implemented in this model:

* general:
  * encoder and decoder use shared embeddings
  * data-parallel multi-GPU training
  * dynamic loss scaling with backoff for Tensor Cores (mixed precision)
    training
  * trained with label smoothing loss (smoothing factor 0.1)
* encoder:
  * 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest are
    unidirectional
  * residual connections starting from the 3rd layer
  * uses the standard PyTorch `nn.LSTM` layer
  * dropout is applied on the input to all LSTM layers, probability of dropout
    is set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
* decoder:
  * 4-layer unidirectional LSTM with hidden size 1024 and fully-connected
    classifier
  * residual connections starting from the 3rd layer
  * uses the standard PyTorch `nn.LSTM` layer
  * dropout is applied on the input to all LSTM layers, probability of dropout
    is set to 0.2
  * hidden state of LSTM layers is initialized with zeros
  * weights and biases of LSTM layers are initialized with the
    uniform(-0.1, 0.1) distribution
  * weights and biases of the fully-connected classifier are initialized with
    the uniform(-0.1, 0.1) distribution
* attention:
  * normalized Bahdanau attention (a simplified sketch follows after this
    list)
  * output from the first LSTM layer of the decoder goes into attention, then
    the re-weighted context is concatenated with the input to all subsequent
    LSTM layers of the decoder at the current timestep
  * linear transform of keys and queries is initialized with uniform(-0.1,
    0.1), normalization scalar is initialized with 1.0/sqrt(1024),
    normalization bias is initialized with zero
* inference:
  * beam search with default beam size of 5
  * with coverage penalty and length normalization, coverage penalty factor is
    set to 0.1, length normalization factor is set to 0.6 and length
    normalization constant is set to 5.0
  * de-tokenized BLEU computed by
    [SacreBLEU](https://github.com/mjpost/sacrebleu)
  * [motivation](https://github.com/mjpost/sacrebleu#motivation) for choosing
    SacreBLEU
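
The normalized Bahdanau attention and its initialization can be illustrated
with a short, self-contained sketch. This is a simplified re-implementation
for clarity, not the code from `attention.py` in the `seq2seq/model`
directory; tensor shapes and the module layout are assumptions:

```
import math
import torch
import torch.nn as nn

class NormalizedBahdanauAttention(nn.Module):
    """Simplified sketch of normalized Bahdanau (additive) attention."""
    def __init__(self, hidden_size=1024):
        super().__init__()
        # linear transforms of queries and keys, initialized with uniform(-0.1, 0.1)
        self.linear_q = nn.Linear(hidden_size, hidden_size, bias=False)
        self.linear_k = nn.Linear(hidden_size, hidden_size, bias=False)
        nn.init.uniform_(self.linear_q.weight, -0.1, 0.1)
        nn.init.uniform_(self.linear_k.weight, -0.1, 0.1)
        self.v = nn.Parameter(torch.empty(hidden_size).uniform_(-0.1, 0.1))
        # normalization scalar initialized with 1.0/sqrt(1024), bias with zeros
        self.norm_scalar = nn.Parameter(torch.tensor(1.0 / math.sqrt(hidden_size)))
        self.norm_bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, query, keys):
        # query: (batch, 1, hidden) - output of the first decoder LSTM layer
        # keys:  (batch, src_len, hidden) - encoder outputs
        features = torch.tanh(self.linear_q(query) + self.linear_k(keys) + self.norm_bias)
        v = self.norm_scalar * self.v / self.v.norm()
        scores = features.matmul(v)                       # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys)   # (batch, 1, hidden)
        return context, weights

# The returned context is what gets concatenated with the inputs of the
# remaining decoder LSTM layers at the current timestep.
attn = NormalizedBahdanauAttention(hidden_size=1024)
context, weights = attn(torch.randn(2, 1, 1024), torch.randn(2, 7, 1024))
```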

When comparing the BLEU score, there are various tokenization approaches and
BLEU calculation methodologies; therefore, ensure you align similar metrics.

Code from this repository can be used to train a larger, 8-layer GNMT v2 model.
Our experiments show that a 4-layer model is significantly faster to train and
yields comparable accuracy on the public [WMT16
English-German](http://www.statmt.org/wmt16/translation-task.html) dataset. The
number of LSTM layers is controlled by the `--num-layers` parameter in the
`train.py` training script.

### Feature support matrix

The following features are supported by this model.

| **Feature** | **GNMT v2** |
|:------------|------------:|
| [Apex AMP](https://nvidia.github.io/apex/amp.html) | Yes |
| [Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) | Yes |

#### Features

[Apex AMP](https://nvidia.github.io/apex/amp.html) - a tool that enables Tensor
Core-accelerated training. Refer to the [Enabling mixed
precision](#enabling-mixed-precision) section for more details.

[Apex
DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) -
a module wrapper that enables easy multiprocess distributed data parallel
training, similar to
[torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
`DistributedDataParallel` is optimized for use with
[NCCL](https://github.com/NVIDIA/nccl). It achieves high performance by
overlapping communication with computation during `backward()` and bucketing
smaller gradient transfers to reduce the total number of transfers required.
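
In practice, enabling this wrapper amounts to a one-line change after the
model is created. A minimal sketch, assuming the script was launched with
`torch.distributed.launch` so that the process group environment is already
configured (the stand-in model below is illustrative, not the repository
code):

```
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel

# Assumes rank/world-size environment variables were set by the launcher.
dist.init_process_group(backend='nccl')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the GNMT model
model = DistributedDataParallel(model)       # gradients are all-reduced with NCCL
```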

### Mixed precision training

Mixed precision is the combined use of different numerical precisions in a
computational method.
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant
computational speedup by performing operations in half-precision format, while
storing minimal information in single-precision to retain as much information
as possible in critical parts of the network. Since the introduction of [Tensor
Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with
both the Turing and Ampere architectures, significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup on the
most arithmetically intense model architectures. Using mixed precision training
previously required two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced
in the Pascal architecture and first supported in [CUDA
8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep
Learning SDK.

For information about:
* How to train using mixed precision, see the [Mixed Precision
  Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed
  Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
  documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision
  Training of Deep Neural
  Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
  blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy
  Mixed-Precision Training in
  PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).

#### Enabling mixed precision

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
(AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
to half-precision upon retrieval, while storing variables in single-precision
format. Furthermore, to preserve small gradient magnitudes in backpropagation,
a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling can be
easily applied by using the `scale_loss()` method provided by AMP. The scaling
value to be used can be
[dynamic](https://nvidia.github.io/apex/amp.html#apex.amp.initialize) or fixed.

For an in-depth walk through on AMP, check out sample usage
[here](https://nvidia.github.io/apex/amp.html#).
[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
utility libraries, such as AMP, which require minimal network code changes to
leverage Tensor Cores performance.

The following steps were needed to enable mixed precision training in GNMT:

* Import AMP from APEX (file: `seq2seq/train/trainer.py`):

  ```
  from apex import amp
  ```

* Initialize AMP and wrap the model and the optimizer (file:
  `seq2seq/train/trainer.py`, class: `Seq2SeqTrainer`):

  ```
  self.model, self.optimizer = amp.initialize(
      self.model,
      self.optimizer,
      cast_model_outputs=torch.float16,
      keep_batchnorm_fp32=False,
      opt_level='O2')
  ```

* Apply the `scale_loss` context manager (file:
  `seq2seq/train/fp_optimizers.py`, class: `AMPOptimizer`):

  ```
  with amp.scale_loss(loss, optimizer) as scaled_loss:
      scaled_loss.backward()
  ```

* Apply gradient clipping on single precision master weights (file:
  `seq2seq/train/fp_optimizers.py`, class: `AMPOptimizer`):

  ```
  if self.grad_clip != float('inf'):
      clip_grad_norm_(amp.master_params(optimizer), self.grad_clip)
  ```

#### Enabling TF32

TensorFloat-32 (TF32) is the new math mode in [NVIDIA
A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the
matrix math also called tensor operations. TF32 running on Tensor Cores in A100
GPUs can provide up to 10x speedups compared to single-precision floating-point
math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of
accuracy. It is more robust than FP16 for models which require high dynamic
range for weights or activations.

For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates
AI Training, HPC up to
20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/)
blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by
default.
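
Because TF32 is on by default, no changes to the training scripts are needed.
If you want to verify or override that behavior, recent PyTorch releases
expose global switches for it; a hedged sketch using standard PyTorch flags
(not GNMT-specific options):

```
import torch

# TF32 is enabled by default on NVIDIA Ampere GPUs in recent PyTorch releases.
print(torch.backends.cuda.matmul.allow_tf32)  # TF32 for matrix multiplications
print(torch.backends.cudnn.allow_tf32)        # TF32 for cuDNN convolutions/RNNs

# Force full FP32 math instead of TF32, e.g. for an accuracy comparison.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```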

## Setup

The following section lists the requirements in order to start training the
GNMT v2 model.

### Requirements

This repository contains a `Dockerfile` which extends the PyTorch NGC container
and encapsulates some dependencies. Aside from these dependencies, ensure you
have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* GPU architecture:
  * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
  * [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
  * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)

For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep
Learning DGX Documentation:
* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html),
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry),
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running).

For those unable to use the PyTorch NGC container, to set up the required
environment or create your own container, see the versioned [NVIDIA Container
Support
Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).

## Quick Start Guide

To train your model using mixed or TF32 precision with Tensor Cores or using
FP32, perform the following steps using the default parameters of the GNMT v2
model on the WMT16 English-German dataset. For the specifics concerning
training and inference, see the [Advanced](#advanced) section.

**1. Clone the repository.**

```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/Translation/GNMT
```

**2. Build the GNMT v2 Docker container.**

```
bash scripts/docker/build.sh
```

**3. Start an interactive session in the container to run training/inference.**

```
bash scripts/docker/interactive.sh
```

**4. Download and preprocess the dataset.**

Data will be downloaded to the `data` directory (on the host). The `data`
directory is mounted to the `/workspace/gnmt/data` location in the Docker
container.

```
bash scripts/wmt16_en_de.sh
```

**5. Start training.**

The training script saves only one checkpoint with the lowest value of the loss
function on the validation dataset. All results and logs are saved to the
`gnmt` directory (on the host) or to the `/workspace/gnmt/gnmt` directory
(in the container). By default, the `train.py` script will launch mixed
precision training with Tensor Cores. You can change this behavior by setting:
* the `--math fp32` flag to launch single precision training (for NVIDIA Volta
  and NVIDIA Turing architectures) or
* the `--math tf32` flag to launch TF32 training with Tensor Cores (for NVIDIA
  Ampere architecture)

for the `train.py` training script.

To launch mixed precision training on 1, 4 or 8 GPUs, run:

```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024
```

To launch mixed precision training on 16 GPUs, run:

```
python3 -m torch.distributed.launch --nproc_per_node=16 train.py --seed 2 --train-global-batch-size 2048
```

By default, the training script will launch training with batch size 128 per
GPU. If `--train-global-batch-size` is specified and larger than 128 times the
number of GPUs available for the training, then the training script will
accumulate gradients over consecutive iterations and then perform the weight
update. For example, 1 GPU training with `--train-global-batch-size 1024` will
accumulate gradients over 8 iterations before doing the weight update with
accumulated gradients.
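
The number of accumulation steps follows the formula documented for
`--train-global-batch-size`; a quick sanity check of the example above (the
values are illustrative):

```
# train_iter_size = train_global_batch_size // (train_batch_size * world_size)
train_batch_size = 128          # default per-GPU batch size
world_size = 1                  # number of GPUs participating in the training
train_global_batch_size = 1024  # value passed via --train-global-batch-size

train_iter_size = train_global_batch_size // (train_batch_size * world_size)
print(train_iter_size)          # -> 8 gradient accumulation steps per update
```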

**6. Start evaluation.**

The training process automatically runs evaluation and outputs the BLEU score
after each training epoch. Additionally, after the training is done, you can
manually run inference on the test dataset with the checkpoint saved during the
training.

To launch FP16 inference on the `newstest2014.en` test set, run:

```
python3 translate.py \
  --input data/wmt16_de_en/newstest2014.en \
  --reference data/wmt16_de_en/newstest2014.de \
  --output /tmp/output \
  --model gnmt/model_best.pth
```

The script will load the checkpoint specified by the `--model` option, then it
will launch inference on the file specified by the `--input` option, and
compute BLEU score against the reference translation specified by the
`--reference` option. Outputs will be stored to the location specified by the
`--output` option.

Additionally, one can pass the input text directly from the command-line:

```
python3 translate.py \
  --input-text "The quick brown fox jumps over the lazy dog" \
  --model gnmt/model_best.pth
```

Translated output will be printed to the console:

```
(...)
0: Translated output:
Der schnelle braune Fuchs springt über den faulen Hund
```

By default, the `translate.py` script will launch FP16 inference with Tensor
Cores. You can change this behavior by setting:
* the `--math fp32` flag to launch single precision inference (for NVIDIA Volta
  and NVIDIA Turing architectures) or
* the `--math tf32` flag to launch TF32 inference with Tensor Cores (for NVIDIA
  Ampere architecture)

for the `translate.py` inference script.

## Advanced

The following sections provide greater details of the dataset, running training
and inference, and the training results.

### Scripts and sample code

In the `root` directory, the most important files are:
* `train.py`: serves as the entry point to launch the training
* `translate.py`: serves as the entry point to launch inference
* `Dockerfile`: container with the basic set of dependencies to run GNMT v2
* `requirements.txt`: set of extra requirements for running GNMT v2

The `seq2seq/model` directory contains the implementation of GNMT v2 building
blocks:
* `attention.py`: implementation of normalized Bahdanau attention
* `encoder.py`: implementation of recurrent encoder
* `decoder.py`: implementation of recurrent decoder with attention
* `seq2seq_base.py`: base class for seq2seq models
* `gnmt.py`: implementation of GNMT v2 model

The `seq2seq/train` directory encapsulates the necessary tools to execute
training:
* `trainer.py`: implementation of training loop
* `smoothing.py`: implementation of cross-entropy with label smoothing (an
  illustrative sketch follows at the end of this section)
* `lr_scheduler.py`: implementation of exponential learning rate warmup and
  step decay
* `fp_optimizers.py`: implementation of optimizers for various floating point
  precisions

The `seq2seq/inference` directory contains scripts required to run inference:
* `beam_search.py`: implementation of beam search with length normalization and
  length penalty
* `translator.py`: implementation of auto-regressive inference

The `seq2seq/data` directory contains implementation of components needed for
data loading:
* `dataset.py`: implementation of text datasets
* `sampler.py`: implementation of batch samplers with bucketing by sequence
  length
* `tokenizer.py`: implementation of tokenizer (maps integer vocabulary indices
  to text)
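
As a companion to `smoothing.py`, the following is an illustrative sketch of
cross-entropy with label smoothing (smoothing factor 0.1, as used by default).
It is a simplified re-implementation, not the repository code, and padding
handling is reduced to a mask:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelSmoothingLoss(nn.Module):
    """Illustrative cross-entropy with label smoothing."""
    def __init__(self, vocab_size, padding_idx=0, smoothing=0.1):
        super().__init__()
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        # spread the remaining probability mass over the non-target classes
        self.smooth_value = smoothing / (vocab_size - 2)

    def forward(self, logits, target):
        # logits: (num_tokens, vocab_size), target: (num_tokens,)
        log_probs = F.log_softmax(logits, dim=-1)
        smooth_target = torch.full_like(log_probs, self.smooth_value)
        smooth_target.scatter_(1, target.unsqueeze(1), self.confidence)
        smooth_target[:, self.padding_idx] = 0.0
        loss = -(smooth_target * log_probs).sum(dim=-1)
        non_pad = target != self.padding_idx
        return loss[non_pad].sum()

criterion = LabelSmoothingLoss(vocab_size=32320, padding_idx=0, smoothing=0.1)
loss = criterion(torch.randn(6, 32320), torch.tensor([5, 9, 2, 0, 7, 3]))
```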

### Parameters

Training

The complete list of available parameters for the `train.py` training script
contains:

```
dataset setup:
  --dataset-dir DATASET_DIR
                        path to the directory with training/test data
                        (default: data/wmt16_de_en)
  --src-lang SRC_LANG   source language (default: en)
  --tgt-lang TGT_LANG   target language (default: de)
  --vocab VOCAB         path to the vocabulary file (relative to DATASET_DIR
                        directory) (default: vocab.bpe.32000)
  -bpe BPE_CODES, --bpe-codes BPE_CODES
                        path to the file with bpe codes (relative to
                        DATASET_DIR directory) (default: bpe.32000)
  --train-src TRAIN_SRC
                        path to the training source data file (relative to
                        DATASET_DIR directory) (default:
                        train.tok.clean.bpe.32000.en)
  --train-tgt TRAIN_TGT
                        path to the training target data file (relative to
                        DATASET_DIR directory) (default:
                        train.tok.clean.bpe.32000.de)
  --val-src VAL_SRC     path to the validation source data file (relative to
                        DATASET_DIR directory) (default:
                        newstest_dev.tok.clean.bpe.32000.en)
  --val-tgt VAL_TGT     path to the validation target data file (relative to
                        DATASET_DIR directory) (default:
                        newstest_dev.tok.clean.bpe.32000.de)
  --test-src TEST_SRC   path to the test source data file (relative to
                        DATASET_DIR directory) (default:
                        newstest2014.tok.bpe.32000.en)
  --test-tgt TEST_TGT   path to the test target data file (relative to
                        DATASET_DIR directory) (default: newstest2014.de)
  --train-max-size TRAIN_MAX_SIZE
                        use at most TRAIN_MAX_SIZE elements from training
                        dataset (useful for benchmarking), by default uses
                        entire dataset (default: None)

results setup:
  --save-dir SAVE_DIR   path to directory with results, it will be
                        automatically created if it does not exist (default:
                        gnmt)
  --print-freq PRINT_FREQ
                        print log every PRINT_FREQ batches (default: 10)

model setup:
  --hidden-size HIDDEN_SIZE
                        hidden size of the model (default: 1024)
  --num-layers NUM_LAYERS
                        number of RNN layers in encoder and in decoder
                        (default: 4)
  --dropout DROPOUT     dropout applied to input of RNN cells (default: 0.2)
  --share-embedding     use shared embeddings for encoder and decoder (use '--
                        no-share-embedding' to disable) (default: True)
  --smoothing SMOOTHING
                        label smoothing, if equal to zero model will use
                        CrossEntropyLoss, if not zero model will be trained
                        with label smoothing loss (default: 0.1)

general setup:
  --math {fp16,fp32,tf32,manual_fp16}
                        precision (default: fp16)
  --seed SEED           master seed for random number generators, if "seed" is
                        undefined then the master seed will be sampled from
                        random.SystemRandom() (default: None)
  --prealloc-mode {off,once,always}
                        controls preallocation (default: always)
  --dllog-file DLLOG_FILE
                        Name of the DLLogger output file (default:
                        train_log.json)
  --eval                run validation and test after every epoch (use '--no-
                        eval' to disable) (default: True)
  --env                 print info about execution env (use '--no-env' to
                        disable) (default: True)
  --cuda                enables cuda (use '--no-cuda' to disable) (default:
                        True)
  --cudnn               enables cudnn (use '--no-cudnn' to disable) (default:
                        True)
  --log-all-ranks       enables logging from all distributed ranks, if
                        disabled then only logs from rank 0 are reported (use
                        '--no-log-all-ranks' to disable) (default: True)

training setup:
  --train-batch-size TRAIN_BATCH_SIZE
                        training batch size per worker (default: 128)
  --train-global-batch-size TRAIN_GLOBAL_BATCH_SIZE
                        global training batch size, this argument does not
                        have to be defined, if it is defined it will be used
                        to automatically compute train_iter_size using the
                        equation: train_iter_size = train_global_batch_size //
                        (train_batch_size * world_size) (default: None)
  --train-iter-size N   training iter size, training loop will accumulate
                        gradients over N iterations and execute optimizer
                        every N steps (default: 1)
  --epochs EPOCHS       max number of training epochs (default: 6)
  --grad-clip GRAD_CLIP
                        enables gradient clipping and sets maximum norm of
                        gradients (default: 5.0)
  --train-max-length TRAIN_MAX_LENGTH
                        maximum sequence length for training (including
                        special BOS and EOS tokens) (default: 50)
  --train-min-length TRAIN_MIN_LENGTH
                        minimum sequence length for training (including
                        special BOS and EOS tokens) (default: 0)
  --train-loader-workers TRAIN_LOADER_WORKERS
                        number of workers for training data loading (default:
                        2)
  --batching {random,sharding,bucketing}
                        select batching algorithm (default: bucketing)
  --shard-size SHARD_SIZE
                        shard size for "sharding" batching algorithm, in
                        multiples of global batch size (default: 80)
  --num-buckets NUM_BUCKETS
                        number of buckets for "bucketing" batching algorithm
                        (default: 5)

optimizer setup:
  --optimizer OPTIMIZER
                        training optimizer (default: Adam)
  --lr LR               learning rate (default: 0.002)
  --optimizer-extra OPTIMIZER_EXTRA
                        extra options for the optimizer (default: {})

mixed precision loss scaling setup:
  --init-scale INIT_SCALE
                        initial loss scale (default: 8192)
  --upscale-interval UPSCALE_INTERVAL
                        loss upscaling interval (default: 128)

learning rate scheduler setup:
  --warmup-steps WARMUP_STEPS
                        number of learning rate warmup iterations (default:
                        200)
  --remain-steps REMAIN_STEPS
                        starting iteration for learning rate decay (default:
                        0.666)
  --decay-interval DECAY_INTERVAL
                        interval between learning rate decay steps (default:
                        None)
  --decay-steps DECAY_STEPS
                        max number of learning rate decay steps (default: 4)
  --decay-factor DECAY_FACTOR
                        learning rate decay factor (default: 0.5)

validation setup:
  --val-batch-size VAL_BATCH_SIZE
                        batch size for validation (default: 64)
  --val-max-length VAL_MAX_LENGTH
                        maximum sequence length for validation (including
                        special BOS and EOS tokens) (default: 125)
  --val-min-length VAL_MIN_LENGTH
                        minimum sequence length for validation (including
                        special BOS and EOS tokens) (default: 0)
  --val-loader-workers VAL_LOADER_WORKERS
                        number of workers for validation data loading
                        (default: 0)

test setup:
  --test-batch-size TEST_BATCH_SIZE
                        batch size for test (default: 128)
  --test-max-length TEST_MAX_LENGTH
                        maximum sequence length for test (including special
                        BOS and EOS tokens) (default: 150)
  --test-min-length TEST_MIN_LENGTH
                        minimum sequence length for test (including special
                        BOS and EOS tokens) (default: 0)
  --beam-size BEAM_SIZE
                        beam size (default: 5)
  --len-norm-factor LEN_NORM_FACTOR
                        length normalization factor (default: 0.6)
  --cov-penalty-factor COV_PENALTY_FACTOR
                        coverage penalty factor (default: 0.1)
  --len-norm-const LEN_NORM_CONST
                        length normalization constant (default: 5.0)
  --intra-epoch-eval N  evaluate within training epoch, this option will
                        enable extra N equally spaced evaluations executed
                        during each training epoch (default: 0)
  --test-loader-workers TEST_LOADER_WORKERS
                        number of workers for test data loading (default: 0)

checkpointing setup:
  --start-epoch START_EPOCH
                        manually set initial epoch counter (default: 0)
  --resume PATH         resumes training from checkpoint from PATH (default:
                        None)
  --save-all            saves checkpoint after every epoch (default: False)
  --save-freq SAVE_FREQ
                        save checkpoint every SAVE_FREQ batches (default:
                        5000)
  --keep-checkpoints KEEP_CHECKPOINTS
                        keep only last KEEP_CHECKPOINTS checkpoints, affects
                        only checkpoints controlled by --save-freq option
                        (default: 0)

benchmark setup:
  --target-perf TARGET_PERF
                        target training performance (in tokens per second)
                        (default: None)
  --target-bleu TARGET_BLEU
                        target accuracy (default: None)
```
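
The learning rate scheduler options above (`--warmup-steps`, `--remain-steps`,
`--decay-interval`, `--decay-steps`, `--decay-factor`) combine warmup, a
constant phase and step decay. The sketch below is a simplified illustration
of such a schedule with illustrative step values; the exact rules live in
`seq2seq/train/lr_scheduler.py`:

```
import math

def gnmt_like_lr(step, base_lr=2e-3, warmup_steps=200, remain_steps=2000,
                 decay_interval=500, decay_steps=4, decay_factor=0.5):
    """Illustrative exponential warmup followed by step decay."""
    if step < warmup_steps:
        # exponential warmup from ~1% of base_lr up to base_lr
        growth = math.exp(math.log(0.01) / warmup_steps)
        return base_lr * growth ** (warmup_steps - step)
    if step < remain_steps:
        # hold the base learning rate until decay starts
        return base_lr
    # decay by decay_factor every decay_interval iterations,
    # at most decay_steps times
    num_decays = min((step - remain_steps) // decay_interval + 1, decay_steps)
    return base_lr * decay_factor ** num_decays

schedule = [gnmt_like_lr(step) for step in range(0, 4000, 100)]
```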

Inference

The complete list of available parameters for the `translate.py` inference
script contains:

```
data setup:
  -o OUTPUT, --output OUTPUT
                        full path to the output file if not specified, then
                        the output will be printed (default: None)
  -r REFERENCE, --reference REFERENCE
                        full path to the file with reference translations (for
                        sacrebleu, raw text) (default: None)
  -m MODEL, --model MODEL
                        full path to the model checkpoint file (default: None)
  --synthetic           use synthetic dataset (default: False)
  --synthetic-batches SYNTHETIC_BATCHES
                        number of synthetic batches to generate (default: 64)
  --synthetic-vocab SYNTHETIC_VOCAB
                        size of synthetic vocabulary (default: 32320)
  --synthetic-len SYNTHETIC_LEN
                        sequence length of synthetic samples (default: 50)
  -i INPUT, --input INPUT
                        full path to the input file (raw text) (default: None)
  -t INPUT_TEXT [INPUT_TEXT ...], --input-text INPUT_TEXT [INPUT_TEXT ...]
                        raw input text (default: None)
  --sort                sorts dataset by sequence length (use '--no-sort' to
                        disable) (default: False)

inference setup:
  --batch-size BATCH_SIZE [BATCH_SIZE ...]
                        batch size per GPU (default: [128])
  --beam-size BEAM_SIZE [BEAM_SIZE ...]
                        beam size (default: [5])
  --max-seq-len MAX_SEQ_LEN
                        maximum generated sequence length (default: 80)
  --len-norm-factor LEN_NORM_FACTOR
                        length normalization factor (default: 0.6)
  --cov-penalty-factor COV_PENALTY_FACTOR
                        coverage penalty factor (default: 0.1)
  --len-norm-const LEN_NORM_CONST
                        length normalization constant (default: 5.0)

general setup:
  --math {fp16,fp32,tf32} [{fp16,fp32,tf32} ...]
                        precision (default: ['fp16'])
  --env                 print info about execution env (use '--no-env' to
                        disable) (default: False)
  --bleu                compares with reference translation and computes BLEU
                        (use '--no-bleu' to disable) (default: True)
  --cuda                enables cuda (use '--no-cuda' to disable) (default:
                        True)
  --cudnn               enables cudnn (use '--no-cudnn' to disable) (default:
                        True)
  --batch-first         uses (batch, seq, feature) data format for RNNs
                        (default: True)
  --seq-first           uses (seq, batch, feature) data format for RNNs
                        (default: True)
  --save-dir SAVE_DIR   path to directory with results, it will be
                        automatically created if it does not exist (default:
                        gnmt)
  --dllog-file DLLOG_FILE
                        Name of the DLLogger output file (default:
                        eval_log.json)
  --print-freq PRINT_FREQ, -p PRINT_FREQ
                        print log every PRINT_FREQ batches (default: 1)

benchmark setup:
  --target-perf TARGET_PERF
                        target inference performance (in tokens per second)
                        (default: None)
  --target-bleu TARGET_BLEU
                        target accuracy (default: None)
  --repeat REPEAT [REPEAT ...]
                        loops over the dataset REPEAT times, flag accepts
                        multiple arguments, one for each specified batch size
                        (default: [1])
  --warmup WARMUP       warmup iterations for performance counters (default:
                        0)
  --percentiles PERCENTILES [PERCENTILES ...]
                        Percentiles for confidence intervals for
                        throughput/latency benchmarks (default: (90, 95, 99))
  --tables              print accuracy, throughput and latency results in
                        tables (use '--no-tables' to disable) (default: False)
```

### Command-line options

To see the full list of available options and their descriptions, use the `-h`
or `--help` command line option. For example, for training:

```
python3 train.py --help

usage: train.py [-h] [--dataset-dir DATASET_DIR] [--src-lang SRC_LANG]
                [--tgt-lang TGT_LANG] [--vocab VOCAB] [-bpe BPE_CODES]
                [--train-src TRAIN_SRC] [--train-tgt TRAIN_TGT]
                [--val-src VAL_SRC] [--val-tgt VAL_TGT] [--test-src TEST_SRC]
                [--test-tgt TEST_TGT] [--save-dir SAVE_DIR]
                [--print-freq PRINT_FREQ] [--hidden-size HIDDEN_SIZE]
                [--num-layers NUM_LAYERS] [--dropout DROPOUT]
                [--share-embedding] [--smoothing SMOOTHING]
                [--math {fp16,fp32,tf32,manual_fp16}] [--seed SEED]
                [--prealloc-mode {off,once,always}] [--dllog-file DLLOG_FILE]
                [--eval] [--env] [--cuda] [--cudnn] [--log-all-ranks]
                [--train-max-size TRAIN_MAX_SIZE]
                [--train-batch-size TRAIN_BATCH_SIZE]
                [--train-global-batch-size TRAIN_GLOBAL_BATCH_SIZE]
                [--train-iter-size N] [--epochs EPOCHS]
                [--grad-clip GRAD_CLIP] [--train-max-length TRAIN_MAX_LENGTH]
                [--train-min-length TRAIN_MIN_LENGTH]
                [--train-loader-workers TRAIN_LOADER_WORKERS]
                [--batching {random,sharding,bucketing}]
                [--shard-size SHARD_SIZE] [--num-buckets NUM_BUCKETS]
                [--optimizer OPTIMIZER] [--lr LR]
                [--optimizer-extra OPTIMIZER_EXTRA] [--init-scale INIT_SCALE]
                [--upscale-interval UPSCALE_INTERVAL]
                [--warmup-steps WARMUP_STEPS] [--remain-steps REMAIN_STEPS]
                [--decay-interval DECAY_INTERVAL] [--decay-steps DECAY_STEPS]
                [--decay-factor DECAY_FACTOR]
                [--val-batch-size VAL_BATCH_SIZE]
                [--val-max-length VAL_MAX_LENGTH]
                [--val-min-length VAL_MIN_LENGTH]
                [--val-loader-workers VAL_LOADER_WORKERS]
                [--test-batch-size TEST_BATCH_SIZE]
                [--test-max-length TEST_MAX_LENGTH]
                [--test-min-length TEST_MIN_LENGTH] [--beam-size BEAM_SIZE]
                [--len-norm-factor LEN_NORM_FACTOR]
                [--cov-penalty-factor COV_PENALTY_FACTOR]
                [--len-norm-const LEN_NORM_CONST] [--intra-epoch-eval N]
                [--test-loader-workers TEST_LOADER_WORKERS]
                [--start-epoch START_EPOCH] [--resume PATH] [--save-all]
                [--save-freq SAVE_FREQ] [--keep-checkpoints KEEP_CHECKPOINTS]
                [--target-perf TARGET_PERF] [--target-bleu TARGET_BLEU]
                [--local_rank LOCAL_RANK]
```

For example, for inference:

```
python3 translate.py --help

usage: translate.py [-h] [-o OUTPUT] [-r REFERENCE] [-m MODEL] [--synthetic]
                    [--synthetic-batches SYNTHETIC_BATCHES]
                    [--synthetic-vocab SYNTHETIC_VOCAB]
                    [--synthetic-len SYNTHETIC_LEN]
                    [-i INPUT | -t INPUT_TEXT [INPUT_TEXT ...]] [--sort]
                    [--batch-size BATCH_SIZE [BATCH_SIZE ...]]
                    [--beam-size BEAM_SIZE [BEAM_SIZE ...]]
                    [--max-seq-len MAX_SEQ_LEN]
                    [--len-norm-factor LEN_NORM_FACTOR]
                    [--cov-penalty-factor COV_PENALTY_FACTOR]
                    [--len-norm-const LEN_NORM_CONST]
                    [--math {fp16,fp32,tf32} [{fp16,fp32,tf32} ...]] [--env]
                    [--bleu] [--cuda] [--cudnn] [--batch-first | --seq-first]
                    [--save-dir SAVE_DIR] [--dllog-file DLLOG_FILE]
                    [--print-freq PRINT_FREQ] [--target-perf TARGET_PERF]
                    [--target-bleu TARGET_BLEU] [--repeat REPEAT [REPEAT ...]]
                    [--warmup WARMUP]
                    [--percentiles PERCENTILES [PERCENTILES ...]] [--tables]
                    [--local_rank LOCAL_RANK]
```

### Getting the data

The GNMT v2 model was trained on the [WMT16
English-German](http://www.statmt.org/wmt16/translation-task.html) dataset.
The concatenation of the newstest2015 and newstest2016 test sets is used as a
validation dataset and newstest2014 is used as a testing dataset.

This repository contains the `scripts/wmt16_en_de.sh` download script which
automatically downloads and preprocesses the training, validation and test
datasets. By default, data is downloaded to the `data` directory.

Our download script is very similar to the `wmt16_en_de.sh` script from the
[tensorflow/nmt](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh)
repository. Our download script contains an extra preprocessing step, which
discards all pairs of sentences which can't be decoded by the *latin-1*
encoder. The `scripts/wmt16_en_de.sh` script uses the
[subword-nmt](https://github.com/rsennrich/subword-nmt) package to segment text
into subword units (Byte Pair Encodings -
[BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding)). By default, the
script builds the shared vocabulary of 32,000 tokens.

In order to test with other datasets, the script needs to be customized
accordingly.

#### Dataset guidelines

The process of downloading and preprocessing the data can be found in the
`scripts/wmt16_en_de.sh` script.

Initially, data is downloaded from [www.statmt.org](www.statmt.org). Then the
`europarl-v7`, `commoncrawl` and `news-commentary` corpora are concatenated to
form the training dataset; similarly, `newstest2015` and `newstest2016` are
concatenated to form the validation dataset. Raw data is preprocessed with
[Moses](https://github.com/moses-smt/mosesdecoder), first by launching the
[Moses
tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)
(the tokenizer breaks up text into individual words), then by launching
[clean-corpus-n.perl](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl)
which removes invalid sentences and does initial filtering by sequence length.

The second stage of preprocessing is done by launching the
`scripts/filter_dataset.py` script, which discards all pairs of sentences that
can't be decoded by the latin-1 encoder.

The third stage of preprocessing uses the
[subword-nmt](https://github.com/rsennrich/subword-nmt) package. First it
builds a shared [byte pair
encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) vocabulary with
32,000 merge operations (command `subword-nmt learn-bpe`), then it applies the
generated vocabulary to the training, validation and test corpora (command
`subword-nmt apply-bpe`).
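
For reference, the two `subword-nmt` commands can be driven from Python as
sketched below. The file names are placeholders, and the real pipeline
(including building the shared vocabulary from both languages) is implemented
in `scripts/wmt16_en_de.sh`:

```
import subprocess

# Learn BPE merge operations from a tokenized training corpus
# (placeholder file names).
with open("train.tok.clean.en", "rb") as corpus, open("bpe.32000", "wb") as codes:
    subprocess.run(["subword-nmt", "learn-bpe", "-s", "32000"],
                   stdin=corpus, stdout=codes, check=True)

# Apply the learned codes to a tokenized file.
with open("newstest2014.tok.en", "rb") as inp, \
     open("newstest2014.tok.bpe.32000.en", "wb") as out:
    subprocess.run(["subword-nmt", "apply-bpe", "-c", "bpe.32000"],
                   stdin=inp, stdout=out, check=True)
```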

### Training process

The default training configuration can be launched by running the `train.py`
training script. By default, the training script saves only one checkpoint with
the lowest value of the loss function on the validation dataset. An evaluation
is then performed after each training epoch. Results are stored in the
`gnmt` directory.

The training script launches data-parallel training with batch size 128 per GPU
on all available GPUs. We have tested training on up to 16 GPUs on a single
node.

After each training epoch, the script runs an evaluation on the validation
dataset and outputs a BLEU score on the test dataset (newstest2014). BLEU is
computed by the [SacreBLEU](https://github.com/mjpost/sacreBLEU) package. Logs
from the training and evaluation are saved to the `gnmt` directory.
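
If you want to reproduce the BLEU computation outside of the training loop,
the same package can be called directly. A minimal sketch with hypothetical
strings; the exact call used by the repository may differ:

```
import sacrebleu

# hypothetical detokenized hypotheses and references
hypotheses = ["The quick brown fox jumps over the lazy dog."]
references = ["The quick brown fox jumped over the lazy dog."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```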

The summary after each training epoch is printed in the following format:

```
0: Summary: Epoch: 3 Training Loss: 3.1336 Validation Loss: 2.9587 Test BLEU: 23.18
0: Performance: Epoch: 3 Training: 418772 Tok/s Validation: 1445331 Tok/s
```

The training loss is averaged over an entire training epoch, the validation
loss is averaged over the validation dataset and the BLEU score is computed on
the test dataset. Performance is reported in total tokens per second. The
result is averaged over an entire training epoch and summed over all GPUs
participating in the training.

By default, the `train.py` script will launch mixed precision training with
Tensor Cores. You can change this behavior by setting:
* the `--math fp32` flag to launch single precision training (for NVIDIA Volta
  and NVIDIA Turing architectures) or
* the `--math tf32` flag to launch TF32 training with Tensor Cores (for NVIDIA
  Ampere architecture)

for the `train.py` training script.

To view all available options for training, run `python3 train.py --help`.

### Inference process

Inference can be run by launching the `translate.py` inference script, although
it requires a pre-trained model checkpoint and tokenized input.

The inference script, `translate.py`, supports batched inference. By default,
it launches beam search with a beam size of 5, a coverage penalty term and a
length normalization term. Greedy decoding can be enabled by setting the beam
size to 1.

To view all available options for inference, run `python3 translate.py --help`.
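
The length normalization applied during beam search follows the GNMT-style
formulation controlled by `--len-norm-const` (default 5.0) and
`--len-norm-factor` (default 0.6). The sketch below illustrates the idea only;
the exact scoring, including the coverage penalty, is implemented in
`beam_search.py`:

```
def length_penalty(length, norm_const=5.0, norm_factor=0.6):
    # GNMT-style length normalization term
    return ((norm_const + length) / (norm_const + 1.0)) ** norm_factor

def normalized_score(sum_log_probs, length):
    # hypotheses on the beam are compared by length-normalized log-probability
    return sum_log_probs / length_penalty(length)

print(normalized_score(-12.3, length=10))
```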

## Performance

The performance measurements in this document were conducted at the time of
publication and may not reflect the performance achieved from NVIDIA’s latest
software release. For the most up-to-date performance measurements, go to
[NVIDIA Data Center Deep Learning Product
Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).

### Benchmarking

The following section shows how to run benchmarks measuring the model
performance in training and inference modes.

#### Training performance benchmark

Training is launched on batches of text data; different batches have different
sequence lengths (number of tokens in the longest sequence). Sequence length
and batch efficiency (ratio of non-pad tokens to total number of tokens) affect
performance of the training, therefore it's recommended to run the training on
a large chunk of the training dataset to get a stable and reliable average
training performance. Ideally, at least one full epoch of training should be
launched to get a good estimate of training performance.
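
Batch efficiency, as used above, is simply the fraction of non-padding tokens
in a padded batch; a quick illustration with made-up sequence lengths:

```
# Illustration of batch efficiency: ratio of non-pad tokens to total tokens.
seq_lengths = [12, 50, 31, 45, 8, 50, 27, 40]   # tokens per sentence in a batch
max_len = max(seq_lengths)

non_pad_tokens = sum(seq_lengths)
total_tokens = max_len * len(seq_lengths)        # size of the padded batch
batch_efficiency = non_pad_tokens / total_tokens
print(f"{batch_efficiency:.2f}")                 # ~0.66 for this batch
```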

The following commands will launch one epoch of training.

To launch mixed precision training on 1, 4 or 8 GPUs, run:
```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024 --epochs 1 --math fp16
```

To launch mixed precision training on 16 GPUs, run:
```
python3 -m torch.distributed.launch --nproc_per_node=16 train.py --seed 2 --train-global-batch-size 2048 --epochs 1 --math fp16
```

Change `--math fp16` to `--math fp32` to launch single precision training (for
NVIDIA Volta and NVIDIA Turing architectures) or to `--math tf32` to launch
TF32 training with Tensor Cores (for NVIDIA Ampere architecture).

After the training is completed, the `train.py` script prints a summary to
standard output. Performance results are printed in the following format:

```
(...)
0: Performance: Epoch: 0 Training: 418926 Tok/s Validation: 1430828 Tok/s
(...)
```

`Training: 418926 Tok/s` represents training throughput averaged over an entire
training epoch and summed over all GPUs participating in the training.

#### Inference performance benchmark

The inference performance and accuracy benchmarks require a checkpoint from a
fully trained model.

Command to launch the inference accuracy benchmark on NVIDIA Volta or NVIDIA
Turing architectures:

```
python3 translate.py \
  --model gnmt/model_best.pth \
  --input data/wmt16_de_en/newstest2014.en \
  --reference data/wmt16_de_en/newstest2014.de \
  --output /tmp/output \
  --math fp16 fp32 \
  --batch-size 128 \
  --beam-size 1 2 5 \
  --tables
```
|
|
|
|
|
|
2020-08-01 15:47:34 +02:00
|
|
|
|
Command to launch the inference accuracy benchmark on NVIDIA Ampere architecture:
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
|
python3 translate.py \
|
|
|
|
|
--model gnmt/model_best.pth \
|
|
|
|
|
--input data/wmt16_de_en/newstest2014.en \
|
|
|
|
|
--reference data/wmt16_de_en/newstest2014.de \
|
|
|
|
|
--output /tmp/output \
|
|
|
|
|
--math fp16 tf32 \
|
|
|
|
|
--batch-size 128 \
|
|
|
|
|
--beam-size 1 2 5 \
|
|
|
|
|
--tables
|
|
|
|
|
```
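
If you want an independent sanity check of a BLEU score, a rough sketch using
the `sacrebleu` package is shown below. This is only an assumption-laden
illustration, not the scoring path used by `translate.py`: it presumes a
single-configuration run (for example `--math fp16 --beam-size 5`) whose
detokenized translations ended up in `/tmp/output`, and the resulting score may
differ slightly from the one reported by the benchmark:

```
# Rough sketch: score a detokenized output file against a detokenized reference.
import sacrebleu

with open("/tmp/output") as hyp_file:
    hypotheses = [line.strip() for line in hyp_file]
with open("data/wmt16_de_en/newstest2014.de") as ref_file:
    references = [line.strip() for line in ref_file]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```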

Command to launch the inference throughput and latency benchmarks on NVIDIA
Volta or NVIDIA Turing architectures:

```
python3 translate.py \
    --model gnmt/model_best.pth \
    --input data/wmt16_de_en/newstest2014.en \
    --reference data/wmt16_de_en/newstest2014.de \
    --output /tmp/output \
    --math fp16 fp32 \
    --batch-size 1 2 4 8 32 128 512 \
    --repeat 1 1 1 1 2 8 16 \
    --beam-size 1 2 5 \
    --warmup 5 \
    --tables
```

Command to launch the inference throughput and latency benchmarks on NVIDIA
Ampere architecture:

```
python3 translate.py \
    --model gnmt/model_best.pth \
    --input data/wmt16_de_en/newstest2014.en \
    --reference data/wmt16_de_en/newstest2014.de \
    --output /tmp/output \
    --math fp16 tf32 \
    --batch-size 1 2 4 8 32 128 512 \
    --repeat 1 1 1 1 2 8 16 \
    --beam-size 1 2 5 \
    --warmup 5 \
    --tables
```

### Results

The following sections provide details on how we achieved our performance and
accuracy in training and inference.

#### Training accuracy results

##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-20.06-py3 NGC container on NVIDIA DGX
A100 with 8x A100 40GB GPUs.

Command to launch the training:

```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024 --math fp16
```

Change `--math fp16` to `--math tf32` to launch TF32 training with Tensor Cores.

| **GPUs** | **Batch Size / GPU** | **Accuracy - TF32 (BLEU)** | **Accuracy - Mixed precision (BLEU)** | **Time to Train - TF32 (minutes)** | **Time to Train - Mixed precision (minutes)** | **Time to Train Speedup (TF32 to Mixed precision)** |
| --- | --- | ----- | ----- | ----- | ------ | ---- |
| 8 | 128 | 24.46 | 24.60 | 34.7 | 22.7 | 1.53 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1
with 8x V100 16GB GPUs.

Command to launch the training:

```
python3 -m torch.distributed.launch --nproc_per_node=<#GPUs> train.py --seed 2 --train-global-batch-size 1024 --math fp16
```

Change `--math fp16` to `--math fp32` to launch single precision training.

| **GPUs** | **Batch Size / GPU** | **Accuracy - FP32 (BLEU)** | **Accuracy - Mixed precision (BLEU)** | **Time to Train - FP32 (minutes)** | **Time to Train - Mixed precision (minutes)** | **Time to Train Speedup (FP32 to Mixed precision)** |
| --- | --- | ----- | ----- | ----- | ------ | ---- |
| 1 | 128 | 24.41 | 24.42 | 810.0 | 224.0 | 3.62 |
| 4 | 128 | 24.40 | 24.33 | 218.2 | 69.5 | 3.14 |
| 8 | 128 | 24.45 | 24.38 | 112.0 | 38.6 | 2.90 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training accuracy: NVIDIA DGX-2H (16x V100 32GB)

Our results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-20.06-py3 NGC container on NVIDIA DGX-2H
with 16x V100 32GB GPUs.

To launch mixed precision training on 16 GPUs, run:

```
python3 -m torch.distributed.launch --nproc_per_node=16 train.py --seed 2 --train-global-batch-size 2048 --math fp16
```

Change `--math fp16` to `--math fp32` to launch single precision training.

| **GPUs** | **Batch Size / GPU** | **Accuracy - FP32 (BLEU)** | **Accuracy - Mixed precision (BLEU)** | **Time to Train - FP32 (minutes)** | **Time to Train - Mixed precision (minutes)** | **Time to Train Speedup (FP32 to Mixed precision)** |
| --- | --- | ----- | ----- | ------ | ----- | ---- |
| 16 | 128 | 24.41 | 24.38 | 52.1 | 19.4 | 2.69 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

![TrainingLoss](./img/training_loss.png)

##### Training stability test

The GNMT v2 model was trained for 6 epochs, starting from 32 different initial
random seeds. After each training epoch, the model was evaluated on the test
dataset and the BLEU score was recorded. The training was performed in the
pytorch-20.06-py3 Docker container on NVIDIA DGX A100 with 8x A100 40GB GPUs.
The following table summarizes the results of the stability test: it shows the
distribution of BLEU scores across the 32 initial random seeds after each
training epoch.

| **Epoch** | **Average** | **Standard deviation** | **Minimum** | **Maximum** | **Median** |
| --- | ------ | ----- | ------ | ------ | ------ |
| 1 | 19.959 | 0.238 | 19.410 | 20.390 | 19.970 |
| 2 | 21.772 | 0.293 | 20.960 | 22.280 | 21.820 |
| 3 | 22.435 | 0.264 | 21.740 | 22.870 | 22.465 |
| 4 | 23.167 | 0.166 | 22.870 | 23.620 | 23.195 |
| 5 | 24.233 | 0.149 | 23.820 | 24.530 | 24.235 |
| 6 | 24.416 | 0.131 | 24.140 | 24.660 | 24.390 |
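
Summary statistics of this kind can be reproduced from the raw per-seed BLEU
scores with a few lines of standard-library Python. In the sketch below,
`bleu_per_seed` is only a placeholder for the 32 values recorded for one epoch:

```
# Minimal sketch: aggregate per-seed BLEU scores for one epoch into the
# statistics reported in the table above.
import statistics

# Placeholder: replace with the 32 BLEU scores recorded for one epoch.
bleu_per_seed = [24.14, 24.29, 24.35, 24.41, 24.48, 24.53, 24.66]

print(f"Average: {statistics.mean(bleu_per_seed):.3f}")
print(f"Standard deviation: {statistics.stdev(bleu_per_seed):.3f}")
print(f"Minimum: {min(bleu_per_seed):.3f}")
print(f"Maximum: {max(bleu_per_seed):.3f}")
print(f"Median: {statistics.median(bleu_per_seed):.3f}")
```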

#### Training throughput results

##### Training throughput: NVIDIA DGX A100 (8x A100 40GB)

Our results were obtained by running the `train.py` training script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100 40GB GPUs.
Throughput performance numbers (in tokens per second) were averaged over an
entire training epoch.

| **GPUs** | **Batch size / GPU** | **Throughput - TF32 (tok/s)** | **Throughput - Mixed precision (tok/s)** | **Throughput speedup (TF32 to Mixed precision)** | **Strong Scaling - TF32** | **Strong Scaling - Mixed precision** |
| --- | --- | ------ | ------ | ----- | ----- | ----- |
| 1 | 128 | 83214 | 140909 | 1.693 | 1.000 | 1.000 |
| 4 | 128 | 278576 | 463144 | 1.663 | 3.348 | 3.287 |
| 8 | 128 | 519952 | 822024 | 1.581 | 6.248 | 5.834 |
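
The derived columns in the throughput tables follow directly from the raw
measurements: the speedup column divides mixed precision throughput by TF32
(or FP32) throughput at the same GPU count, and the strong scaling columns
divide multi-GPU throughput by the corresponding single-GPU throughput. A
small sketch of that arithmetic, using the numbers from the table above, is
shown below:

```
# Sketch: recompute the derived columns of the DGX A100 throughput table
# from the raw tok/s measurements.
tf32 = {1: 83214, 4: 278576, 8: 519952}
fp16 = {1: 140909, 4: 463144, 8: 822024}

for gpus in sorted(tf32):
    speedup = fp16[gpus] / tf32[gpus]        # TF32 -> mixed precision speedup
    scaling_tf32 = tf32[gpus] / tf32[1]      # strong scaling, TF32
    scaling_fp16 = fp16[gpus] / fp16[1]      # strong scaling, mixed precision
    print(f"{gpus} GPUs: speedup {speedup:.3f}, "
          f"scaling TF32 {scaling_tf32:.3f}, scaling FP16 {scaling_fp16:.3f}")
```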

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training throughput: NVIDIA DGX-1 (8x V100 16GB)

Our results were obtained by running the `train.py` training script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.
Throughput performance numbers (in tokens per second) were averaged over an
entire training epoch.

| **GPUs** | **Batch size / GPU** | **Throughput - FP32 (tok/s)** | **Throughput - Mixed precision (tok/s)** | **Throughput speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong Scaling - Mixed precision** |
| --- | --- | ------ | ------ | ----- | ----- | ----- |
| 1 | 128 | 21860 | 76438 | 3.497 | 1.000 | 1.000 |
| 4 | 128 | 80224 | 249168 | 3.106 | 3.670 | 3.260 |
| 8 | 128 | 154168 | 447832 | 2.905 | 7.053 | 5.859 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

##### Training throughput: NVIDIA DGX-2H (16x V100 32GB)

Our results were obtained by running the `train.py` training script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX-2H with 16x V100 32GB GPUs.
Throughput performance numbers (in tokens per second) were averaged over an
entire training epoch.

| **GPUs** | **Batch size / GPU** | **Throughput - FP32 (tok/s)** | **Throughput - Mixed precision (tok/s)** | **Throughput speedup (FP32 to Mixed precision)** | **Strong Scaling - FP32** | **Strong Scaling - Mixed precision** |
| --- | --- | ------ | ------ | ----- | ------ | ------ |
| 1 | 128 | 25583 | 87829 | 3.433 | 1.000 | 1.000 |
| 4 | 128 | 91400 | 290640 | 3.180 | 3.573 | 3.309 |
| 8 | 128 | 176616 | 522008 | 2.956 | 6.904 | 5.943 |
| 16 | 128 | 351792 | 1010880 | 2.874 | 13.751 | 11.510 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

#### Inference accuracy results

##### Inference accuracy: NVIDIA A100 40GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA A100 40GB GPU. Full
command to launch the inference accuracy benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

| **Batch Size** | **Beam Size** | **Accuracy - TF32 (BLEU)** | **Accuracy - FP16 (BLEU)** |
| -------------: | ------------: | -------------------------: | -------------------------: |
| 128 | 1 | 23.07 | 23.07 |
| 128 | 2 | 23.81 | 23.81 |
| 128 | 5 | 24.41 | 24.43 |

##### Inference accuracy: NVIDIA Tesla V100 16GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA Tesla V100 16GB GPU. Full
command to launch the inference accuracy benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

| **Batch Size** | **Beam Size** | **Accuracy - FP32 (BLEU)** | **Accuracy - FP16 (BLEU)** |
| -------------: | ------------: | -------------------------: | -------------------------: |
| 128 | 1 | 23.07 | 23.07 |
| 128 | 2 | 23.81 | 23.79 |
| 128 | 5 | 24.40 | 24.43 |

##### Inference accuracy: NVIDIA T4

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA Tesla T4. Full command to
launch the inference accuracy benchmark was provided in the [Inference
performance benchmark](#inference-performance-benchmark) section.

| **Batch Size** | **Beam Size** | **Accuracy - FP32 (BLEU)** | **Accuracy - FP16 (BLEU)** |
| -------------: | ------------: | -------------------------: | -------------------------: |
| 128 | 1 | 23.07 | 23.08 |
| 128 | 2 | 23.81 | 23.80 |
| 128 | 5 | 24.40 | 24.39 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

#### Inference throughput results

Tables presented in this section show the average inference throughput (columns
**Avg (tok/s)**) and inference throughput for various confidence intervals
(columns **N% (tok/s)**, where `N` denotes the confidence interval). Inference
throughput is measured in tokens per second. Speedups reported in FP16
subsections are relative to FP32 (for NVIDIA Volta and NVIDIA Turing) and
relative to TF32 (for NVIDIA Ampere) numbers for the corresponding configuration.

##### Inference throughput: NVIDIA A100 40GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA A100 40GB.
Full command to launch the inference throughput benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (tok/s)**|**Speedup**|**90% (tok/s)**|**Speedup**|**95% (tok/s)**|**Speedup**|**99% (tok/s)**|**Speedup**|
|-------------:|------------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|
| 1| 1| 1291.6| 1.031| 1195.7| 1.029| 1165.8| 1.029| 1104.7| 1.030|
| 1| 2| 882.7| 1.019| 803.4| 1.015| 769.2| 1.015| 696.7| 1.017|
| 1| 5| 848.3| 1.042| 753.0| 1.037| 715.0| 1.043| 636.4| 1.033|
| 2| 1| 2060.5| 1.034| 1700.8| 1.032| 1621.8| 1.032| 1487.4| 1.022|
| 2| 2| 1445.7| 1.026| 1197.6| 1.024| 1132.5| 1.023| 1043.7| 1.033|
| 2| 5| 1402.3| 1.063| 1152.4| 1.056| 1100.5| 1.053| 992.9| 1.053|
| 4| 1| 3465.6| 1.046| 2838.3| 1.040| 2672.7| 1.043| 2392.8| 1.043|
| 4| 2| 2425.4| 1.041| 2002.5| 1.028| 1898.3| 1.033| 1690.2| 1.028|
| 4| 5| 2364.4| 1.075| 1930.0| 1.067| 1822.0| 1.065| 1626.1| 1.058|
| 8| 1| 6151.1| 1.099| 5078.0| 1.087| 4786.5| 1.096| 4206.9| 1.090|
| 8| 2| 4241.9| 1.075| 3494.1| 1.066| 3293.6| 1.066| 2970.9| 1.064|
| 8| 5| 4117.7| 1.118| 3430.9| 1.103| 3224.5| 1.104| 2833.5| 1.110|
| 32| 1| 18830.4| 1.147| 16210.0| 1.152| 15563.9| 1.138| 13973.2| 1.135|
| 32| 2| 12698.2| 1.133| 10812.3| 1.114| 10256.1| 1.145| 9330.2| 1.101|
| 32| 5| 11802.6| 1.355| 9998.8| 1.318| 9671.6| 1.329| 9058.4| 1.335|
| 128| 1| 53394.5| 1.350| 48867.6| 1.342| 46898.5| 1.414| 40670.6| 1.305|
| 128| 2| 34876.4| 1.483| 31687.4| 1.491| 30025.4| 1.505| 27677.1| 1.421|
| 128| 5| 28201.3| 1.986| 25660.5| 1.997| 24306.0| 1.967| 23326.2| 2.007|
| 512| 1| 119675.3| 1.904| 112400.5| 1.971| 109694.8| 1.927| 108781.3| 1.919|
| 512| 2| 74514.7| 2.126| 69578.9| 2.209| 69348.1| 2.210| 69253.7| 2.212|
| 512| 5| 47003.2| 2.760| 43348.2| 2.893| 43080.3| 2.884| 42878.4| 2.881|

##### Inference throughput: NVIDIA T4

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA T4.
Full command to launch the inference throughput benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (tok/s)**|**Speedup**|**90% (tok/s)**|**Speedup**|**95% (tok/s)**|**Speedup**|**99% (tok/s)**|**Speedup**|
|-------------:|------------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|--------------:|----------:|
| 1| 1| 1133.8| 1.266| 1059.1| 1.253| 1036.6| 1.251| 989.5| 1.242|
| 1| 2| 793.9| 1.169| 728.3| 1.165| 698.1| 1.163| 637.1| 1.157|
| 1| 5| 766.8| 1.343| 685.6| 1.335| 649.3| 1.335| 584.1| 1.318|
| 2| 1| 1759.8| 1.233| 1461.6| 1.239| 1402.3| 1.242| 1302.1| 1.242|
| 2| 2| 1313.3| 1.186| 1088.7| 1.185| 1031.6| 1.180| 953.2| 1.178|
| 2| 5| 1257.2| 1.301| 1034.1| 1.316| 990.3| 1.313| 886.3| 1.265|
| 4| 1| 2974.0| 1.261| 2440.3| 1.255| 2294.6| 1.257| 2087.7| 1.261|
| 4| 2| 2204.7| 1.320| 1826.3| 1.283| 1718.9| 1.260| 1548.4| 1.260|
| 4| 5| 2106.1| 1.340| 1727.8| 1.345| 1625.7| 1.353| 1467.7| 1.346|
| 8| 1| 5076.6| 1.423| 4207.9| 1.367| 3904.4| 1.360| 3475.3| 1.355|
| 8| 2| 3761.7| 1.311| 3108.1| 1.285| 2931.6| 1.300| 2628.7| 1.300|
| 8| 5| 3578.2| 1.660| 2998.2| 1.614| 2812.1| 1.609| 2447.6| 1.523|
| 32| 1| 14637.8| 1.636| 12702.5| 1.644| 12070.3| 1.634| 11036.9| 1.647|
| 32| 2| 10627.3| 1.818| 9198.3| 1.818| 8431.6| 1.725| 8000.0| 1.773|
| 32| 5| 8205.7| 2.598| 7117.6| 2.476| 6825.2| 2.497| 6293.2| 2.437|
| 128| 1| 33800.5| 2.755| 30824.5| 2.816| 27685.2| 2.661| 26580.9| 2.694|
| 128| 2| 20829.4| 2.795| 18665.2| 2.778| 17372.1| 2.639| 16820.5| 2.821|
| 128| 5| 11753.9| 3.309| 10658.1| 3.273| 10308.7| 3.205| 9630.7| 3.328|
| 512| 1| 44474.6| 3.327| 40108.1| 3.394| 39816.6| 3.378| 39708.0| 3.381|
| 512| 2| 26057.9| 3.295| 23197.3| 3.294| 23019.8| 3.284| 22951.4| 3.284|
| 512| 5| 12161.5| 3.428| 10777.5| 3.418| 10733.1| 3.414| 10710.5| 3.420|

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

#### Inference latency results

Tables presented in this section show the average inference latency (columns **Avg
(ms)**) and inference latency for various confidence intervals (columns **N%
(ms)**, where `N` denotes the confidence interval). Inference latency is
measured in milliseconds. Speedups reported in FP16 subsections are relative to
FP32 (for NVIDIA Volta and NVIDIA Turing) and relative to TF32 (for NVIDIA
Ampere) numbers for the corresponding configuration.
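
The percentile columns can be derived from per-iteration latency measurements
collected during a benchmark run. The sketch below is only an illustration
(not part of the repository); `latencies_ms` is a placeholder for the measured
values:

```
# Sketch: summarize per-iteration latency measurements into the average and
# percentile columns used in these tables.
import numpy as np

# Placeholder: per-iteration latencies in milliseconds from a benchmark run.
latencies_ms = np.array([44.1, 45.3, 46.0, 47.2, 49.8, 52.5, 60.3, 74.9, 84.6, 99.1])

print(f"Avg (ms): {latencies_ms.mean():.2f}")
for confidence in (90, 95, 99):
    print(f"{confidence}% (ms): {np.percentile(latencies_ms, confidence):.2f}")
```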

##### Inference latency: NVIDIA A100 40GB

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA A100 40GB.
Full command to launch the inference latency benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (ms)**|**Speedup**|**90% (ms)**|**Speedup**|**95% (ms)**|**Speedup**|**99% (ms)**|**Speedup**|
|-------------:|------------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|
| 1| 1| 44.69| 1.032| 74.04| 1.035| 84.61| 1.034| 99.14| 1.042|
| 1| 2| 64.76| 1.020| 105.18| 1.018| 118.92| 1.019| 139.42| 1.023|
| 1| 5| 67.06| 1.043| 107.56| 1.049| 121.82| 1.054| 143.85| 1.054|
| 2| 1| 56.57| 1.034| 85.59| 1.037| 92.55| 1.038| 107.59| 1.046|
| 2| 2| 80.22| 1.027| 119.22| 1.027| 128.43| 1.030| 150.06| 1.028|
| 2| 5| 82.54| 1.063| 121.37| 1.067| 132.35| 1.069| 156.34| 1.059|
| 4| 1| 67.29| 1.047| 92.69| 1.048| 100.08| 1.056| 112.63| 1.064|
| 4| 2| 95.86| 1.041| 129.83| 1.040| 139.48| 1.044| 162.34| 1.045|
| 4| 5| 98.34| 1.075| 133.83| 1.076| 142.70| 1.068| 168.30| 1.075|
| 8| 1| 75.60| 1.099| 97.87| 1.103| 104.13| 1.099| 117.40| 1.102|
| 8| 2| 109.38| 1.074| 137.71| 1.079| 147.69| 1.069| 168.79| 1.065|
| 8| 5| 112.71| 1.116| 143.50| 1.104| 153.17| 1.118| 172.60| 1.113|
| 32| 1| 98.40| 1.146| 117.02| 1.153| 123.42| 1.150| 129.01| 1.128|
| 32| 2| 145.87| 1.133| 171.71| 1.159| 184.01| 1.127| 188.64| 1.141|
| 32| 5| 156.82| 1.357| 189.10| 1.374| 194.95| 1.392| 196.65| 1.419|
| 128| 1| 137.97| 1.350| 150.04| 1.348| 151.52| 1.349| 154.52| 1.434|
| 128| 2| 211.58| 1.484| 232.96| 1.490| 237.46| 1.505| 239.86| 1.567|
| 128| 5| 261.44| 1.990| 288.54| 2.017| 291.63| 2.052| 298.73| 2.136|
| 512| 1| 245.93| 1.906| 262.51| 1.998| 264.24| 1.999| 265.23| 2.000|
| 512| 2| 395.61| 2.129| 428.54| 2.219| 431.58| 2.224| 433.86| 2.227|
| 512| 5| 627.21| 2.767| 691.72| 2.878| 696.01| 2.895| 702.13| 2.887|

##### Inference latency: NVIDIA T4

Our results were obtained by running the `translate.py` script in the
pytorch-20.06-py3 NGC Docker container with NVIDIA T4.
Full command to launch the inference latency benchmark was provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16**

|**Batch Size**|**Beam Size**|**Avg (ms)**|**Speedup**|**90% (ms)**|**Speedup**|**95% (ms)**|**Speedup**|**99% (ms)**|**Speedup**|
|-------------:|------------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|-----------:|----------:|
| 1| 1| 51.08| 1.261| 84.82| 1.254| 97.45| 1.251| 114.6| 1.257|
| 1| 2| 72.05| 1.168| 117.41| 1.165| 132.33| 1.170| 155.8| 1.174|
| 1| 5| 74.20| 1.345| 119.45| 1.352| 135.07| 1.354| 160.3| 1.354|
| 2| 1| 66.31| 1.232| 100.90| 1.232| 108.52| 1.235| 126.9| 1.238|
| 2| 2| 88.35| 1.185| 131.47| 1.188| 141.46| 1.185| 164.7| 1.191|
| 2| 5| 92.12| 1.305| 136.30| 1.310| 148.66| 1.309| 174.8| 1.320|
| 4| 1| 78.54| 1.260| 108.53| 1.256| 117.19| 1.259| 133.7| 1.259|
| 4| 2| 105.54| 1.315| 142.74| 1.317| 154.36| 1.307| 178.7| 1.303|
| 4| 5| 110.43| 1.351| 150.62| 1.388| 161.61| 1.397| 191.2| 1.427|
| 8| 1| 91.65| 1.418| 117.92| 1.421| 126.60| 1.405| 144.0| 1.411|
| 8| 2| 123.39| 1.315| 156.00| 1.337| 167.34| 1.347| 193.4| 1.340|
| 8| 5| 129.69| 1.666| 165.01| 1.705| 178.18| 1.723| 200.3| 1.765|
| 32| 1| 126.53| 1.641| 153.23| 1.689| 159.58| 1.692| 167.0| 1.700|
| 32| 2| 174.37| 1.822| 209.04| 1.899| 219.59| 1.877| 228.6| 1.878|
| 32| 5| 226.15| 2.598| 277.38| 2.636| 290.27| 2.648| 299.4| 2.664|
| 128| 1| 218.29| 2.755| 238.94| 2.826| 243.18| 2.843| 267.1| 2.828|
| 128| 2| 354.83| 2.796| 396.63| 2.832| 410.53| 2.803| 433.2| 2.866|
| 128| 5| 628.32| 3.311| 699.57| 3.353| 723.98| 3.323| 771.0| 3.337|
| 512| 1| 663.07| 3.330| 748.62| 3.388| 753.20| 3.388| 758.0| 3.378|
| 512| 2| 1134.04| 3.295| 1297.85| 3.283| 1302.25| 3.304| 1306.9| 3.308|
| 512| 5| 2428.82| 3.428| 2771.72| 3.415| 2801.32| 3.427| 2817.6| 3.422|

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
outlined above.

## Release notes

### Changelog

* July 2020
  * Added support for NVIDIA DGX A100
  * Default container updated to NGC PyTorch 20.06-py3
* June 2019
  * Default container updated to NGC PyTorch 19.05-py3
  * Mixed precision training implemented using APEX AMP
  * Added inference throughput and latency results on NVIDIA T4 and NVIDIA
    Tesla V100 16GB
  * Added option to run inference on user-provided raw input text from command
    line
* February 2019
  * Different batching algorithm (bucketing with 5 equal-width buckets)
  * Additional dropouts before first LSTM layer in encoder and in decoder
  * Weight initialization changed to uniform (-0.1, 0.1)
  * Switched order of dropout and concatenation with attention in decoder
  * Default container updated to NGC PyTorch 19.01-py3
* December 2018
  * Added exponential warm-up and step learning rate decay
  * Multi-GPU (distributed) inference and validation
  * Default container updated to NGC PyTorch 18.11-py3
  * General performance improvements
* August 2018
  * Initial release

### Known issues

There are no known issues in this release.