# Transformer-XL For PyTorch

This repository provides a script and recipe to train the Transformer-XL model
to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.

## Table Of Contents

<!-- TOC GFM -->

* [Model overview](#model-overview)
  * [Model architecture](#model-architecture)
  * [Default configuration](#default-configuration)
  * [Feature support matrix](#feature-support-matrix)
    * [Features](#features)
  * [Mixed precision training](#mixed-precision-training)
    * [Enabling mixed precision](#enabling-mixed-precision)
    * [Enabling TF32](#enabling-tf32)
* [Setup](#setup)
  * [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
  * [Scripts and sample code](#scripts-and-sample-code)
  * [Parameters](#parameters)
  * [Command-line options](#command-line-options)
  * [Getting the data](#getting-the-data)
    * [Dataset guidelines](#dataset-guidelines)
    * [Multi-dataset](#multi-dataset)
  * [Training process](#training-process)
    * [Multi-node](#multi-node)
  * [Inference process](#inference-process)
* [Performance](#performance)
  * [Benchmarking](#benchmarking)
    * [Training performance benchmark](#training-performance-benchmark)
      * [Training performance benchmark for multi-node](#training-performance-benchmark-for-multi-node)
    * [Inference performance benchmark](#inference-performance-benchmark)
  * [Results](#results)
    * [Training accuracy results](#training-accuracy-results)
      * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
        * [Base model](#base-model)
        * [Large model](#large-model)
      * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
        * [Base model](#base-model-1)
        * [Large model](#large-model-1)
      * [Training accuracy: NVIDIA DGX-2H (16x V100 32GB)](#training-accuracy-nvidia-dgx-2h-16x-v100-32gb)
        * [Base model](#base-model-2)
        * [Large model](#large-model-2)
      * [Training accuracy: 8x NVIDIA DGX-2H (16x V100 32GB)](#training-accuracy-8x-nvidia-dgx-2h-16x-v100-32gb)
        * [Large model](#large-model-3)
      * [Training accuracy plots](#training-accuracy-plots)
        * [Base model](#base-model-3)
        * [Large model (single-node)](#large-model-single-node)
        * [Large model (multi-node)](#large-model-multi-node)
      * [Training stability test](#training-stability-test)
        * [Base model](#base-model-4)
        * [Large model (single-node)](#large-model-single-node-1)
        * [Large model (multi-node)](#large-model-multi-node-1)
    * [Training performance results](#training-performance-results)
      * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
        * [Base model](#base-model-5)
        * [Large model](#large-model-4)
      * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
        * [Base model](#base-model-6)
        * [Large model](#large-model-5)
      * [Training performance: NVIDIA DGX-2H (16x V100 32GB)](#training-performance-nvidia-dgx-2h-16x-v100-32gb)
        * [Base model](#base-model-7)
        * [Large model](#large-model-6)
      * [Training performance: 8x NVIDIA DGX-2H (16x V100 32GB)](#training-performance-8x-nvidia-dgx-2h-16x-v100-32gb)
        * [Large model](#large-model-7)
    * [Inference performance results](#inference-performance-results)
      * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
        * [Base model](#base-model-8)
        * [Large model](#large-model-8)
      * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
        * [Base model](#base-model-9)
        * [Large model](#large-model-9)
      * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
        * [Base model](#base-model-10)
        * [Large model](#large-model-10)
* [Release notes](#release-notes)
  * [Changelog](#changelog)
  * [Known issues](#known-issues)

<!-- /TOC -->

## Model overview

This repository provides an implementation of the Transformer-XL model in
[PyTorch](https://pytorch.org) from the paper [Transformer-XL: Attentive
Language Models Beyond a Fixed-Length
Context](https://arxiv.org/abs/1901.02860). Transformer-XL is a
transformer-based language model with segment-level recurrence and a novel
relative positional encoding. The enhancements introduced in Transformer-XL
help capture long-term dependencies by attending to tokens from multiple
previous segments.

Our implementation is based on the
[codebase](https://github.com/kimiyoung/transformer-xl) published by the
authors of the Transformer-XL paper.
Our implementation uses a modified model architecture; the modifications were
made to achieve better hardware utilization and to take advantage of Tensor
Cores. Similar modifications were also proposed in an implementation available
from
[github.com/cybertronai/transformer-xl](https://github.com/cybertronai/transformer-xl).
Refer to the [Model architecture](#model-architecture) section for more
details.

This model is trained with mixed precision using Tensor Cores on the NVIDIA
Volta and NVIDIA Ampere GPU architectures and evaluated on the Volta, Turing,
and NVIDIA Ampere GPU architectures. Therefore, researchers can get results up
to 2.5x faster than training without Tensor Cores, while experiencing the
benefits of mixed precision training. This model is tested against each NGC
monthly container release to ensure consistent accuracy and performance over
time.

### Model architecture

The Transformer-XL "base" model for the WikiText-103 dataset available in this
repository was modified to use the following hyperparameter values:

|**Hyperparameter**|**Description**|**Original setting for the base model**|**Our modification for the base model**|
|------------------|---------------|--------------------------------------:|--------------------------------------:|
| `d_model` | hidden size | 410 | 512 |
| `n_head` | number of attention heads | 10 | 8 |
| `d_head` | size of each attention head | 41 | 64 |
| `d_inner` | hidden size in fully-connected layers | 2100 | 2048 |
| `tgt_len` | number of tokens to predict during training | 150 | 192 |
| `mem_len` | number of tokens cached from previous iterations during training | 150 | 192 |

The changes described above align certain hyperparameters with powers of two;
with this modification, the model achieves better hardware utilization and
therefore higher training throughput.

The Transformer-XL "large" model for the WikiText-103 dataset available in this
repository uses the original hyperparameters from the [reference
implementation](https://github.com/kimiyoung/transformer-xl).

The following table lists the hyperparameters for the large and the base
Transformer-XL models for the WikiText-103 dataset available in this
repository.

| **Hyperparameter** | **Description** | **Base model** | **Large model** |
| ------------------ | ---------------------------------------------------------------- | -------------: | ---------------: |
| `n_layer` | number of layers | 16 | 18 |
| `d_model` | hidden size | 512 | 1024 |
| `n_head` | number of attention heads | 8 | 16 |
| `d_head` | size of each attention head | 64 | 64 |
| `d_inner` | inner hidden size in fully-connected layers | 2048 | 4096 |
| `dropout` | dropout | 0.1 | 0.2 |
| `dropatt` | dropout after softmax in the attention | 0.0 | 0.2 |
| `lr` | base learning rate | 0.01 | 0.01 |
| `eta_min` | minimum learning rate (for cosine decay) | 0.001 | 0.0001 |
| `max_step` | number of training steps | 40,000 | 100,000 |
| `warmup_step` | number of learning rate warmup steps | 1,000 | 16,000 |
| `batch_size` | training batch size | 256 | 128 |
| `tgt_len` | number of tokens to predict during training | 192 | 384 |
| `mem_len` | number of tokens cached from previous iterations during training | 192 | 384 |

The Transformer-XL model addresses the limitations of vanilla transformer-based
language models, which are only able to use a relatively short context, bounded
by the segment length. Transformer-XL introduces a recurrence mechanism, which
is able to use a cached hidden state from previous segments. During training,
the context consists of a concatenation of the current segment's hidden state
and cached states from previous iterations. Gradients are backpropagated only
through the current segment, although the model is able to take advantage of
the extra information stored in the cache, and is therefore able to model
long-term dependencies.

An illustration of the recurrence mechanism taken from the [Transformer-XL
paper](https://arxiv.org/abs/1901.02860) is shown below.

![model](pytorch/img/model.png)

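The following minimal PyTorch sketch illustrates the mechanism; `layer` and
`forward_segment` are illustrative placeholders, not the repository's API (see
`pytorch/mem_transformer.py` for the actual implementation):

```python
import torch

def forward_segment(layer, segment, memory):
    # Context = cached hidden states from previous segments + current segment.
    # detach() stops gradients at the cache boundary, so backpropagation only
    # covers the current segment while attention still sees the longer context.
    context = torch.cat([memory.detach(), segment], dim=0)
    output = layer(segment, context)           # attend over the full context
    new_memory = context[-memory.size(0):]     # keep the most recent states
    return output, new_memory
```
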
### Default configuration

The following features were implemented in this model:

* general
  * single-node or multi-node, data-parallel, multi-GPU training
  * training and inference with mixed precision using Tensor Cores
  * mixed precision training implemented using
    [Apex AMP](https://nvidia.github.io/apex/amp.html), with the `O2`
    optimization level and dynamic loss scaling
* model
  * 16-layer base Transformer-XL model with hidden size 512, 8 attention heads,
    each head with hidden size 64
  * 18-layer large Transformer-XL model with hidden size 1024, 16 attention
    heads, each head with hidden size 64
  * the model trained on the
    [WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
    dataset, using a word-level vocabulary and adaptive softmax
  * embedding weights are tied with weights in the classifier
* training
  * training with the [LAMB](https://arxiv.org/abs/1904.00962) optimizer; the
    implementation of the optimizer uses
    [TorchScript](https://pytorch.org/docs/stable/jit.html), which enables
    the fusion of elementwise operations and accelerates the training
  * support for training with gradient accumulation
  * base model:
    * linear learning rate warmup for 1,000 iterations, followed by a cosine
      learning rate schedule; the initial learning rate is set to 0.01, and the
      final learning rate is set to 0.001 (a minimal sketch of this schedule
      follows the list)
    * training for 40,000 steps, using a batch size of 256
  * large model:
    * single node:
      * linear learning rate warmup for 16,000 iterations, followed by a cosine
        learning rate schedule; the initial learning rate is set to 0.01, and
        the final learning rate is set to 0.0001
      * training for 100,000 steps, using a batch size of 128
    * multi node:
      * linear learning rate warmup for 16,000 iterations, followed by a cosine
        learning rate schedule; the initial learning rate is set to 0.02, and
        the final learning rate is set to 0.0002
      * training for 25,000 steps, using a batch size of 512
* inference
  * support for multi-GPU inference
  * support for [TorchScript](https://pytorch.org/docs/stable/jit.html) and
    pure Python inference
  * each token uses the same size of context from previous time steps
  * base model:
    * target length is set to 64; length of memory is set to 640
    * positional embeddings are clamped after 400 time steps
  * large model:
    * target length is set to 128; length of memory is set to 1,600
    * positional embeddings are clamped after 1,000 time steps

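The base-model schedule above can be sketched as follows; this is an
illustration only, as the actual implementation in `pytorch/train.py` uses
PyTorch learning rate schedulers and may differ in details:

```python
import math

def learning_rate(step, base_lr=0.01, eta_min=0.001, warmup_step=1000,
                  max_step=40000):
    # Linear warmup followed by a cosine decay to eta_min (base-model defaults).
    if step < warmup_step:
        return base_lr * step / warmup_step
    progress = (step - warmup_step) / (max_step - warmup_step)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))
```
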
### Feature support matrix

The following features are supported by this model:

| **Feature** | **Transformer-XL** |
|:------------|-------------------:|
|[Apex AMP](https://nvidia.github.io/apex/amp.html) | Yes |
|[PyTorch DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) | Yes |
|[LAMB](https://arxiv.org/abs/1904.00962v3) | Yes |
| Inference with [TorchScript](https://pytorch.org/docs/stable/jit.html) | Yes |
| Multi-node training | Yes |

#### Features

[Apex AMP](https://nvidia.github.io/apex/amp.html) - a tool that enables Tensor
Core-accelerated training. Refer to the [Enabling mixed
precision](#enabling-mixed-precision) section for more details.

[PyTorch
DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) - a module
wrapper that enables easy multiprocess distributed data-parallel training.

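A minimal sketch of the wrapping pattern, assuming one process per GPU launched
by a distributed launcher that sets the usual environment variables (`RANK`,
`WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`); the `torch.nn.Linear` module is a
stand-in for the Transformer-XL model:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Join the process group and pin this process to its GPU.
dist.init_process_group(backend='nccl', init_method='env://')
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda()  # stand-in for the real model
model = DistributedDataParallel(model, device_ids=[local_rank])
```
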
[LAMB](https://arxiv.org/abs/1904.00962v3) - stands for Layerwise Adaptive
Moments Based optimizer; it is a large batch optimization technique that helps
accelerate training of deep neural networks using large minibatches.

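For reference, the per-layer LAMB update from the paper rescales an Adam-style
step (with bias-corrected moments $\hat{m}_t$, $\hat{v}_t$) by a trust ratio,
where $\phi$ is a scaling function, $\lambda$ the weight decay, and $\eta$ the
learning rate:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

u_t = \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_t, \qquad
w_{t+1} = w_t - \eta \, \frac{\phi(\lVert w_t \rVert)}{\lVert u_t \rVert} \, u_t
```
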
[TorchScript](https://pytorch.org/docs/stable/jit.html) - a way to create
serializable and optimizable models from PyTorch code. Any TorchScript program
can be saved from a Python process and loaded in a process where there is no
Python dependency.

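A minimal sketch of the TorchScript save/load round trip; the module here is a
toy stand-in, not the repository's model:

```python
import torch

# Script a module, save it, and load it back. The loaded program no longer
# needs the original Python class definition.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
scripted = torch.jit.script(model)
scripted.save('model_jit.pt')

restored = torch.jit.load('model_jit.pt')
output = restored(torch.randn(1, 8))
```
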
### Mixed precision training

Mixed precision is the combined use of different numerical precisions in a
computational method.
[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant
computational speedup by performing operations in half-precision format while
storing minimal information in single-precision to retain as much information
as possible in critical parts of the network. Since the introduction of [Tensor
Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with
both the Turing and Ampere architectures, significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup on the
most arithmetically intense model architectures. Using mixed precision training
previously required two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Manually adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced
in the Pascal architecture and first supported in [CUDA
8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep
Learning SDK.

For information about:
* How to train using mixed precision, see the [Mixed Precision
  Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed
  Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
  documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision
  Training of Deep Neural
  Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
  blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy
  Mixed-Precision Training in
  PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/)
  blog.

#### Enabling mixed precision

The `pytorch/train.py` training script launches mixed precision training
with Tensor Cores if the flag `--fp16` is set.

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
(AMP) library from [APEX](https://github.com/NVIDIA/apex), which casts
variables to half-precision upon retrieval, while storing variables in
single-precision format. Furthermore, to preserve small gradient magnitudes in
backpropagation, a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling can be
easily applied by using the `scale_loss()` method provided by AMP. The scaling
value to be used can be
[dynamic](https://nvidia.github.io/apex/amp.html#apex.amp.initialize) or fixed.

For an in-depth walkthrough on AMP, check out sample usage
[here](https://nvidia.github.io/apex/amp.html#).
[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
utility libraries, such as AMP, which require minimal network code changes to
leverage Tensor Core performance.

The following steps were needed to enable mixed precision training in
Transformer-XL:

1. Import AMP from APEX:

```
from apex import amp
```

2. Initialize AMP and wrap the model and the optimizer before starting the
   training:

```
model, optimizer = amp.initialize(
    model,
    optimizer,
    opt_level='O2',
    )
```

3. Apply the `scale_loss` context manager:

```
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```

4. Apply gradient clipping on single precision master weights:

```
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.clip)
```

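Taken together, a single training step follows the pattern below. This is an
illustrative sketch, not the exact loop from `pytorch/train.py`; `model`,
`optimizer`, `loader`, `criterion`, and `args.clip` are assumed to be defined
as in steps 1-2 and a typical training loop:

```python
for data, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    # Scale the loss so that small gradients survive the FP16 backward pass.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    # Clip gradients on the FP32 master copies of the weights.
    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.clip)
    optimizer.step()
```
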
#### Enabling TF32

TensorFloat-32 (TF32) is the new math mode in [NVIDIA
A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the
matrix math, also called tensor operations. TF32 running on Tensor Cores in
A100 GPUs can provide up to 10x speedups compared to single-precision
floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of
accuracy. It is more robust than FP16 for models which require a high dynamic
range for weights or activations.

For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates
AI Training, HPC up to
20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/)
blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by
default.

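TF32 is controlled by global PyTorch switches. A minimal sketch, assuming a
PyTorch release that exposes these flags (they exist since PyTorch 1.7 and are
enabled by default on Ampere in the releases contemporary with this
repository):

```python
import torch

# Verify or override the TF32 defaults on Ampere GPUs.
torch.backends.cuda.matmul.allow_tf32 = True  # TF32 for matrix multiplications
torch.backends.cudnn.allow_tf32 = True        # TF32 for cuDNN convolutions
```
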
## Setup

The following section lists the requirements that you need to meet in order to
start training the Transformer-XL model.

### Requirements

This repository contains a `Dockerfile` which extends the PyTorch NGC container
and encapsulates some dependencies. Aside from these dependencies, ensure you
have the following components:

* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* GPU architecture:
  * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
  * [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
  * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)

For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep
Learning DGX Documentation:

* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)

For those unable to use the PyTorch NGC container, to set up the required
environment or create your own container, see the versioned [NVIDIA Container
Support
Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).

For multi-node, the sample provided in this repository requires
[Enroot](https://github.com/NVIDIA/enroot) and
[Pyxis](https://github.com/NVIDIA/pyxis) set up on a
[SLURM](https://slurm.schedmd.com) cluster.

## Quick Start Guide

To train your model using mixed or TF32 precision with Tensor Cores or using
FP32, perform the following steps using the default parameters of the
Transformer-XL base model on the
[WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset.

For the specifics concerning training and inference, see the
[Advanced](#advanced) section.

1. Clone the repository.

```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/LanguageModeling/Transformer-XL
```

2. Download and preprocess the dataset.

```
bash getdata.sh
```

3. Build the Transformer-XL PyTorch NGC container.

```
bash pytorch/scripts/docker/build.sh
```

4. Start an interactive session in the NGC container to run training/inference.

```
bash pytorch/scripts/docker/interactive.sh
```

5. Start training.

This repository contains a number of predefined configurations to run the
training on NVIDIA DGX-1, NVIDIA DGX-2H, or NVIDIA DGX A100 nodes.

To start the training on NVIDIA DGX-1 or NVIDIA DGX-2H, run:

```
bash run_wt103_{base,large}.sh train <#GPUs> --config {dgx1,dgx2}_<#GPUs>gpu_{fp16,fp32}
```

To start the training on NVIDIA DGX A100, run:

```
bash run_wt103_{base,large}.sh train <#GPUs> --config dgxa100_<#GPUs>gpu_{fp16,tf32}
```

* use the `run_wt103_base.sh` script to train the base model, and use the
  `run_wt103_large.sh` script to train the large model
* the training is executed on `<#GPUs>` GPUs; supported values for `<#GPUs>`
  are 1, 2, 4, 8 for NVIDIA DGX-1 and NVIDIA DGX A100, and 1, 2, 4, 8, 16 for
  NVIDIA DGX-2H
* use configs with the `dgx1` prefix to run on a NVIDIA DGX-1, configs with the
  `dgx2` prefix to run on a NVIDIA DGX-2H, and configs with the `dgxa100`
  prefix to run on a NVIDIA DGX A100
* configs with the `fp16` suffix launch mixed precision training, configs with
  the `fp32` suffix launch FP32 training, and configs with the `tf32` suffix
  launch TF32 training

Examples:

To launch TF32 training of the base Transformer-XL model on a NVIDIA DGX A100
using 8 GPUs, run:

```
bash run_wt103_base.sh train 8 --config dgxa100_8gpu_tf32
```

To launch FP32 training of the base Transformer-XL model on a NVIDIA DGX-1
using 8 GPUs, run:

```
bash run_wt103_base.sh train 8 --config dgx1_8gpu_fp32
```

To launch mixed precision training of the large Transformer-XL model on a
NVIDIA DGX-2H using 16 GPUs, run:

```
bash run_wt103_large.sh train 16 --config dgx2_16gpu_fp16
```

To launch mixed precision training of the large Transformer-XL model on a
NVIDIA DGX A100 using 8 GPUs, run:

```
bash run_wt103_large.sh train 8 --config dgxa100_8gpu_fp16
```

To run on multiple nodes, see the [Multi-node](#multi-node) section.

For more information on the available options, and for an explanation of what
happens at the end of training, refer to the [Training
process](#training-process) section.

6. Start evaluation.

To start inference on the test set using `<#GPUs>` GPUs, run:

```
bash run_wt103_{base,large}.sh eval <#GPUs> [--fp16] [--type {pytorch, torchscript}]
```

Select `run_wt103_base.sh` for the base Transformer-XL model and
`run_wt103_large.sh` for the large Transformer-XL model.
The `--fp16` flag is optional; if it's specified, the script launches mixed
precision inference with Tensor Cores. If the flag is not present, the script
launches FP32 inference on NVIDIA Volta and NVIDIA Turing GPUs and TF32
inference on NVIDIA Ampere GPUs.

By default, the script loads the checkpoint from
`LM-TFM/checkpoint_best.pt`, which contains the model corresponding to the
lowest value of the validation loss from the previous training run. The path to
the checkpoint can be customized by setting the `--model` flag.

Inference can use pure Python execution or TorchScript, selected with the
`--type` flag.

Supported values for `<#GPUs>` are: 1, 2, 4, 8 for NVIDIA DGX-1 and NVIDIA DGX
A100 and 1, 2, 4, 8, 16 for NVIDIA DGX-2H.

Additionally, one can pass the input text directly from the command-line using
the `--manual` flag. This mode of operation supports only 1 GPU and a batch
size of 1. The script outputs the average loss and perplexity for the provided
input text; the reported perplexity is the exponential of the average loss.

Examples:

```
bash run_wt103_base.sh eval 1 \
  --model LM-TFM/checkpoint_best.pt \
  --fp16 \
  --manual "recognize speech"

===============================================================================
| test loss 6.20 | test ppl 494.291
===============================================================================
```

```
bash run_wt103_base.sh eval 1 \
  --model LM-TFM/checkpoint_best.pt \
  --fp16 \
  --manual "wreck a nice beach"

===============================================================================
| test loss 8.04 | test ppl 3099.706
===============================================================================
```

For more information on the available options, refer to the [Inference
process](#inference-process) section.

## Advanced

The following sections provide greater details of the dataset, running training
and inference, and the training results.

### Scripts and sample code

In the root directory, the most important files are:

* `Dockerfile`: container with the basic set of dependencies to run
  Transformer-XL
* `requirements.txt`: set of extra requirements for running Transformer-XL
* `getdata.sh`: script for downloading datasets

In the `pytorch` directory, the most important files are:

* `data_utils.py`: data loading utilities
* `eval.py`: serves as the entry point to launch the evaluation and inference
* `lamb.py`: implementation of the [LAMB](https://arxiv.org/abs/1904.00962)
  optimizer
* `mem_transformer.py`: implementation of the Transformer-XL model
* `train.py`: serves as the entry point to launch the training
* `run.sub`: Slurm batch script for launching multi-node training

The `pytorch/utils` directory contains the following additional modules:

* `adaptive_softmax.py`: implementation of adaptive softmax
* `data_parallel.py`: implementation of the `BalancedDataParallel` class
* `distributed.py`: utility functions for running distributed training
* `exp_utils.py`: utility functions for running training and benchmarking
* `log_uniform_sampler.py`: implementation of the log-uniform sampler
* `proj_adaptive_softmax.py`: implementation of projected adaptive softmax
* `vocabulary.py`: implementation of word-level and BPE-based vocabularies

The `pytorch/inference` directory contains modules optimized for running
inference with TorchScript:

* `mem_transformer_jit.py`: implementation of a TorchScript-compatible
  Transformer-XL model
* `proj_adaptive_softmax_jit.py`: implementation of TorchScript-compatible
  projected adaptive softmax

### Parameters

**Training**

The complete list of available parameters for the `pytorch/train.py` training
script contains:

```
general setup:
  --work_dir WORK_DIR   Directory for the results
  --append_dataset      Automatically append dataset name to work_dir
  --append_time         Automatically append current time to work_dir
  --cuda                Run training on a GPU using CUDA
  --fp16                Run training in fp16/mixed precision
  --restart RESTART     Restart training from the saved checkpoint
  --debug               Run in debug mode (do not create exp dir)
  --log_all_ranks       Enable logging from all distributed ranks
  --dllog_file DLLOG_FILE
                        Name of the DLLogger output file
  --txtlog_file TXTLOG_FILE
                        Name of the txt log file
  --save_all            Save all checkpoints
  --no_env              Do not print info on execution env
  --no_eval             Disable model evaluation
  --log_interval LOG_INTERVAL
                        Report interval
  --target_throughput TARGET_THROUGHPUT
                        Target training throughput (for benchmarking)
  --target_perplexity TARGET_PERPLEXITY
                        Target validation perplexity (for benchmarking)
  --amp_mode {O0,O1,O2,O3}
                        Optimization level for apex amp

dataset setup:
  --data DATA           Location of the data corpus
  --dataset {wt103,lm1b,enwik8,text8}
                        Dataset name
  --vocab {word,bpe}    Type of vocabulary

model setup:
  --n_layer N_LAYER     Number of total layers
  --n_head N_HEAD       Number of heads
  --d_head D_HEAD       Head dimension
  --d_embed D_EMBED     Embedding dimension
  --d_model D_MODEL     Model dimension
  --d_inner D_INNER     Inner dimension in feedforward layer
  --dropout DROPOUT     Global dropout rate
  --dropatt DROPATT     Attention probability dropout rate
  --pre_lnorm           Apply LayerNorm to the input instead of the output
  --attn_type ATTN_TYPE
                        Attention type. 0 for ours, 1 for Shaw et al, 2 for
                        Vaswani et al, 3 for Al Rfou et al.
  --not_tied            Do not tie the word embedding and softmax weights
  --clamp_len CLAMP_LEN
                        Use the same pos embeddings after clamp_len
  --adaptive            Use adaptive softmax
  --div_val DIV_VAL     Dividend value for adaptive input and softmax
  --sample_softmax SAMPLE_SOFTMAX
                        Number of samples in sampled softmax
  --init INIT           Parameter initializer to use
  --emb_init EMB_INIT   Parameter initializer to use
  --init_range INIT_RANGE
                        Parameters initialized by U(-init_range, init_range)
  --emb_init_range EMB_INIT_RANGE
                        Parameters initialized by U(-init_range, init_range)
  --init_std INIT_STD   Parameters initialized by N(0, init_std)
  --proj_init_std PROJ_INIT_STD
                        Parameters initialized by N(0, init_std)

optimizer setup:
  --optim {adam,sgd,adagrad,lamb,jitlamb}
                        Optimizer to use
  --lr LR               Initial learning rate
  --mom MOM             Momentum for sgd
  --scheduler {cosine,inv_sqrt,dev_perf,constant}
                        LR scheduler to use
  --max_step_scheduler MAX_STEP_SCHEDULER
                        Max number of training steps for LR scheduler
  --warmup_step WARMUP_STEP
                        Number of iterations for LR warmup
  --decay_rate DECAY_RATE
                        Decay factor when ReduceLROnPlateau is used
  --lr_min LR_MIN       Minimum learning rate during annealing
  --clip CLIP           Gradient clipping
  --weight_decay WEIGHT_DECAY
                        Weight decay for adam|lamb
  --clip_nonemb         Only clip the gradient of non-embedding params
  --patience PATIENCE   Patience
  --eta_min ETA_MIN     Min learning rate for cosine scheduler

training setup:
  --max_step MAX_STEP   Max number of training steps
  --batch_size BATCH_SIZE
                        Global batch size
  --local_batch_size LOCAL_BATCH_SIZE
                        Local (per-device) batch size, this setting overrides
                        global --batch_size and sets batch_size to
                        local_batch_size * world_size
  --batch_chunk BATCH_CHUNK
                        Split batch into chunks and train with gradient
                        accumulation
  --roll                Enable random shifts within each data stream
  --tgt_len TGT_LEN     Number of tokens to predict
  --ext_len EXT_LEN     Length of the extended context
  --mem_len MEM_LEN     Length of the retained previous heads
  --seed SEED           Random seed
  --multi_gpu {ddp,dp}  Use multiple GPU
  --gpu0_bsz GPU0_BSZ   Batch size on gpu 0 (for "dp" backend)
  --same_length         Use the same attn length for all tokens
  --varlen              Use variable length

validation setup:
  --eval_tgt_len EVAL_TGT_LEN
                        Number of tokens to predict for evaluation
  --eval_batch_size EVAL_BATCH_SIZE
                        Eval batch size
  --eval_max_steps EVAL_MAX_STEPS
                        Max eval steps
  --eval_interval EVAL_INTERVAL
                        Evaluation interval
```

**Inference**

The complete list of available parameters for the `eval.py` inference
script contains:

```
  --work_dir WORK_DIR   experiment directory
  --debug               run in debug mode (do not create exp dir)
  --data DATA           location of the data corpus
  --manual MANUAL [MANUAL ...]
                        run model on raw input data
  --dataset {wt103,lm1b,enwik8,text8}
                        dataset name
  --split {all,valid,test}
                        which split to evaluate
  --type {pytorch,torchscript}
                        type of runtime to use
  --batch_size BATCH_SIZE
                        batch size
  --tgt_len TGT_LEN     number of tokens to predict
  --ext_len EXT_LEN     length of the extended context
  --mem_len MEM_LEN     length of the retained previous heads
  --seed SEED           Random seed
  --clamp_len CLAMP_LEN
                        max positional embedding index
  --cuda                Run evaluation on a GPU using CUDA
  --model MODEL         path to the checkpoint
  --manual_config MANUAL_CONFIG
                        Manually specify config for the model
  --manual_vocab {word,bpe}
                        Manually specify type of vocabulary
  --fp16                Run training in fp16/mixed precision
  --log_all_ranks       Enable logging for all distributed ranks
  --dllog_file DLLOG_FILE
                        Name of the DLLogger output file
  --same_length         set same length attention with masking
  --no_env              Do not print info on execution env
  --log_interval LOG_INTERVAL
                        Report interval
  --target_perplexity TARGET_PERPLEXITY
                        target perplexity
  --target_throughput TARGET_THROUGHPUT
                        target throughput
  --save_data           save latency and throughput data to a file
  --repeat REPEAT       loop over the dataset REPEAT times
  --max_size MAX_SIZE   run inference on up to MAX_SIZE batches
  --percentiles PERCENTILES [PERCENTILES ...]
                        percentiles for latency confidence intervals
  --save_torchscript SAVE_TORCHSCRIPT
                        save torchscript model to a file
  --load_torchscript LOAD_TORCHSCRIPT
                        load torchscript model from a file
```

### Command-line options

To see the full list of available options and their descriptions, use the `-h`
or `--help` command-line option. For example, for training:

```
python3 train.py --help

usage: train.py [-h] [--work_dir WORK_DIR] [--append_dataset] [--append_time]
                [--cuda] [--fp16] [--restart RESTART] [--debug]
                [--log_all_ranks] [--dllog_file DLLOG_FILE]
                [--txtlog_file TXTLOG_FILE] [--save_all] [--no_env]
                [--no_eval] [--log_interval LOG_INTERVAL]
                [--target_throughput TARGET_THROUGHPUT]
                [--target_perplexity TARGET_PERPLEXITY]
                [--amp_mode {O0,O1,O2,O3}] [--data DATA]
                [--dataset {wt103,lm1b,enwik8,text8}] [--vocab {word,bpe}]
                [--n_layer N_LAYER] [--n_head N_HEAD] [--d_head D_HEAD]
                [--d_embed D_EMBED] [--d_model D_MODEL] [--d_inner D_INNER]
                [--dropout DROPOUT] [--dropatt DROPATT] [--pre_lnorm]
                [--attn_type ATTN_TYPE] [--not_tied] [--clamp_len CLAMP_LEN]
                [--adaptive] [--div_val DIV_VAL]
                [--sample_softmax SAMPLE_SOFTMAX] [--init INIT]
                [--emb_init EMB_INIT] [--init_range INIT_RANGE]
                [--emb_init_range EMB_INIT_RANGE] [--init_std INIT_STD]
                [--proj_init_std PROJ_INIT_STD]
                [--optim {adam,sgd,adagrad,lamb,jitlamb}] [--lr LR]
                [--mom MOM] [--scheduler {cosine,inv_sqrt,dev_perf,constant}]
                [--max_step_scheduler MAX_STEP_SCHEDULER]
                [--warmup_step WARMUP_STEP] [--decay_rate DECAY_RATE]
                [--lr_min LR_MIN] [--clip CLIP] [--weight_decay WEIGHT_DECAY]
                [--clip_nonemb] [--patience PATIENCE] [--eta_min ETA_MIN]
                [--max_step MAX_STEP] [--batch_size BATCH_SIZE]
                [--local_batch_size LOCAL_BATCH_SIZE]
                [--batch_chunk BATCH_CHUNK] [--roll] [--tgt_len TGT_LEN]
                [--ext_len EXT_LEN] [--mem_len MEM_LEN] [--seed SEED]
                [--multi_gpu {ddp,dp}] [--gpu0_bsz GPU0_BSZ] [--same_length]
                [--varlen] [--eval_tgt_len EVAL_TGT_LEN]
                [--eval_batch_size EVAL_BATCH_SIZE]
                [--eval_max_steps EVAL_MAX_STEPS]
                [--eval_interval EVAL_INTERVAL] [--local_rank LOCAL_RANK]
```

For example, for inference:

```
python3 eval.py --help

usage: eval.py [-h] [--work_dir WORK_DIR] [--debug] [--data DATA]
               [--manual MANUAL [MANUAL ...]]
               [--dataset {wt103,lm1b,enwik8,text8}]
               [--split {all,valid,test}] [--type {pytorch,torchscript}]
               [--batch_size BATCH_SIZE] [--tgt_len TGT_LEN]
               [--ext_len EXT_LEN] [--mem_len MEM_LEN] [--seed SEED]
               [--clamp_len CLAMP_LEN] [--cuda] [--model MODEL]
               [--manual_config MANUAL_CONFIG] [--manual_vocab {word,bpe}]
               [--fp16] [--log_all_ranks] [--dllog_file DLLOG_FILE]
               [--same_length] [--no_env] [--log_interval LOG_INTERVAL]
               [--target_perplexity TARGET_PERPLEXITY]
               [--target_throughput TARGET_THROUGHPUT] [--save_data]
               [--repeat REPEAT] [--max_size MAX_SIZE]
               [--percentiles PERCENTILES [PERCENTILES ...]]
               [--save_torchscript SAVE_TORCHSCRIPT]
               [--load_torchscript LOAD_TORCHSCRIPT] [--local_rank LOCAL_RANK]
```

### Getting the data

The Transformer-XL model was trained on the
[WikiText-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
dataset. The WikiText-103 dataset is a collection of over 100 million tokens
extracted from the set of verified
[Good](https://en.wikipedia.org/wiki/Wikipedia:Good_articles) and
[Featured](https://en.wikipedia.org/wiki/Wikipedia:Featured_articles) articles
on Wikipedia.

This repository contains the `getdata.sh` download script, which
automatically downloads and extracts the training, validation, and test
datasets. By default, data is downloaded to the `data` directory.

In order to test with other datasets, the script needs to be customized
accordingly.

#### Dataset guidelines

The WikiText-103 dataset was already pre-tokenized with word-level tokens. The
dataset features a large vocabulary of 267,735 tokens and retains the original
case, punctuation, and numbers.

The `getdata.sh` script downloads the data, extracts the archive, and renames
the training, validation, and test sets to `train.txt`, `valid.txt`, and
`test.txt`, respectively.

#### Multi-dataset

Using other datasets requires changes in the following files:

* `pytorch/train.py`:
  * the name of the new dataset should be added to the `dataset` argument in
    the `parse_args()` function
  * desired values of cutoffs for adaptive softmax should be added in the
    `main()` function, after the section which builds train/valid/test data
    iterators (the sketch after this section illustrates the cutoff concept)
* `pytorch/data_utils.py`:
  * the support for the new dataset needs to be added to the `Corpus` class:
    names of files containing training, validation, and test data, options for
    the tokenizer, and the dataset iterator

The current codebase supports training with a word-level vocabulary
(automatically generated based on the provided dataset) and with a BPE
vocabulary (using a pre-built vocabulary from the pretrained GPT2 model
imported from
[github.com/huggingface/transformers](https://github.com/huggingface/transformers)).

Additionally, using other datasets may require changes in some hyperparameters
(for example, batch size, learning rate, number of training steps,
and the configuration of the learning rate scheduler).

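To illustrate the cutoff concept, here is a sketch using PyTorch's built-in
`torch.nn.AdaptiveLogSoftmaxWithLoss`; the repository uses its own
implementation in `pytorch/utils/proj_adaptive_softmax.py`, and the cutoff
values below are only examples for a WikiText-103-sized vocabulary:

```python
import torch

vocab_size, d_model = 267735, 512
# Frequent words go to the head; rare words go to progressively smaller tails.
adaptive_softmax = torch.nn.AdaptiveLogSoftmaxWithLoss(
    in_features=d_model,
    n_classes=vocab_size,
    cutoffs=[20000, 40000, 200000],  # example cutoffs, not the repo's values
)
hidden = torch.randn(128, d_model)
target = torch.randint(0, vocab_size, (128,))
output, loss = adaptive_softmax(hidden, target)
```
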
### Training process

The default training configuration can be launched by running the
`run_wt103_base.sh` or the `run_wt103_large.sh` script with the first argument
set to `train`. By default, the training results are saved to the `LM-TFM`
directory; this can be customized by setting the `--work_dir` parameter.

The training script launches single-node data-parallel training with a fixed
global batch size of 256, optionally with gradient accumulation to allow
training on configurations with fewer than 8 GPUs. Logs from the training are
automatically saved to the `LM-TFM/train_log.log` file.

**Command-line**

You can launch training of the Transformer-XL base/large model on the
WikiText-103 dataset with the word-based vocabulary and adaptive softmax using
`<#GPUs>` GPUs. For example:

```
bash run_wt103_base.sh train <#GPUs> [--fp16] [--batch_chunk CHUNK]
```

and

```
bash run_wt103_large.sh train <#GPUs> [--fp16] [--batch_chunk CHUNK]
```

The `--fp16` flag is optional; if it's specified, the script launches mixed
precision training with Tensor Cores. If the flag is not present, the script
launches FP32 training on NVIDIA Volta GPUs and TF32 training on NVIDIA Ampere
GPUs.

The `--batch_chunk CHUNK` parameter controls gradient accumulation. With
gradient accumulation, the batch size is split into `CHUNK` chunks of equal
size; the training script executes the forward and backward pass using each
chunk and then executes the optimizer step using the accumulated gradients.
The sketch below illustrates this pattern in plain PyTorch.

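This is an illustration only; `model`, `optimizer`, `criterion`, `data`,
`target`, and `CHUNK` are assumed to be defined, and the actual loop lives in
`pytorch/train.py`:

```python
optimizer.zero_grad()
# Split one global batch into CHUNK equal chunks.
for data_chunk, target_chunk in zip(data.chunk(CHUNK), target.chunk(CHUNK)):
    loss = criterion(model(data_chunk), target_chunk) / CHUNK
    loss.backward()   # gradients accumulate across chunks
optimizer.step()      # one optimizer step per global batch
```
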
**Examples**

You can launch mixed precision training of the Transformer-XL base model on the
WikiText-103 dataset using 16 GPUs. For example:

```
bash run_wt103_base.sh train 16 --fp16 --batch_chunk 1
```

The batch size per GPU is equal to the default global batch size of 256 divided
by the product of the number of GPUs and the number of chunks; in this case,
the batch size per GPU is equal to `256 / (16 * 1) = 16`.

You can launch FP32 training using 8 GPUs; the batch size per GPU is then equal
to 16 (`--batch_chunk` is set to `2` because a local batch size of 32 runs out
of memory on a NVIDIA DGX-1 with Tesla V100 16GB in FP32 training). For
example:

```
bash run_wt103_base.sh train 8 --batch_chunk 2
```

A summary of the training progress is printed after every 10 training
iterations; this can be customized by setting the `--log_interval` parameter.
The summary is printed in the following format:

```
| epoch 18 step 36000 | batches 283 / 2101 | lr 1.220e-03 | ms/batch 185.1 | tok/s 265585 | loss 3.12 | ppl 22.71
```

which contains information about the current training epoch, current training
step, number of batches processed within the current epoch, current learning
rate, execution time in milliseconds per batch, throughput in tokens per
second, current training loss, and training perplexity (the exponential of the
training loss).

The script saves two checkpoints: `checkpoint_best.pt`, which contains the
model corresponding to the lowest value of the validation loss, and
`checkpoint_last.pt`, which contains the model from the last execution of the
validation step. By default, the validation is executed every 5000 training
steps; this can be customized by setting the `--eval_interval` parameter. The
summary of results on the validation dataset is printed in the following
format:

```
| Eval 7 at step 35000 | time: 1.37s | valid loss 3.14 | valid ppl 23.132
```

which contains information about the current epoch, current training step, time
needed to execute the validation, current validation loss, and validation
perplexity.

At the end of the training, the training script automatically runs evaluation
on the test dataset. This automatic evaluation is executed with the values of
the `mem_len` and `tgt_len` hyperparameters inherited from the training setup.
Evaluation (inference) benefits from longer attention sequences; therefore, to
reproduce the perplexity values reported in the [Transformer-XL
paper](https://arxiv.org/abs/1901.02860), it's necessary to run the final
evaluation with a dedicated inference script. Refer to the [Inference
process](#inference-process) section for more details.

#### Multi-node

Multi-node runs can be launched on a Pyxis/Enroot Slurm cluster (see
[Requirements](#requirements)). To launch a multi-node run, issue the
`run.sub` script with the following command; for example, for an 8-node DGX-2H
training:

```
sbatch run.sub all
```

This repository contains a number of predefined configurations to run the
multi-node training on DGX-2H nodes. By default, `run.sub` launches 8-node
training.

To launch multi-node training on `<NODES>` DGX-2H nodes, run:

```
CONFIG=<NODES>dgx2_16gpu_{fp16,fp32} sbatch -N <NODES> run.sub all
```

* supported values for the `<NODES>` parameter are: 1, 2, 4, 8
* configs with the `fp16` suffix launch mixed precision training, configs with
  the `fp32` suffix launch FP32 training

Examples:

To launch 4-node mixed-precision training, run:

```
CONFIG=4dgx2_16gpu_fp16 sbatch -N 4 run.sub all
```

To launch 2-node FP32 training, run:

```
CONFIG=2dgx2_16gpu_fp32 sbatch -N 2 run.sub all
```

Note that the `run.sub` script is a starting point that has to be adapted
depending on the environment. In particular, variables such as `WORK_DIR`
handle the location of the workspace in the file system. The variable `CONT`
should point to the location of the Transformer-XL Docker container. It's
assumed that the Docker container built with the `scripts/docker/build.sh`
script was pushed to a Docker registry accessible from all compute nodes.

Refer to the contents of the file to see the full list of variables to adjust
for your system.

### Inference process

Inference can be run by launching the `run_wt103_base.sh` or the
`run_wt103_large.sh` script with the first argument set to `eval`. Running
inference requires a pre-trained model checkpoint.

The script supports single-node multi-GPU inference; each batch is split
equally among all GPUs running the inference, and the loss is averaged over the
global batch. Logs from the inference are automatically saved to the
`LM-TFM/eval_log.log` file.

**Command-line**

You can launch inference of the Transformer-XL base/large model on the
WikiText-103 dataset with the word-based vocabulary and adaptive softmax using
`<#GPUs>` GPUs. For example:

```
bash run_wt103_base.sh eval <#GPUs> --model <PATH TO THE CHECKPOINT> [--fp16] [--type {pytorch, torchscript}]
```

and

```
bash run_wt103_large.sh eval <#GPUs> --model <PATH TO THE CHECKPOINT> [--fp16] [--type {pytorch, torchscript}]
```

The `--fp16` flag is optional; if it's specified, the script launches inference
with Tensor Cores. If the flag is not present, the script launches FP32
inference on NVIDIA Volta and NVIDIA Turing GPUs and TF32 inference on NVIDIA
Ampere GPUs.

The `--type` flag selects between pure Python PyTorch execution and TorchScript
execution.

Supported values for `<#GPUs>` are: 1, 2, 4, 8 for NVIDIA DGX-1 and NVIDIA DGX
A100 and 1, 2, 4, 8, 16 for NVIDIA DGX-2H.

**Examples**

To launch TorchScript mixed precision inference on 8 GPUs using a checkpoint
loaded from `LM-TFM/checkpoint_best.pt`, run:

```
bash run_wt103_base.sh eval 8 --model LM-TFM/checkpoint_best.pt --fp16 --type torchscript
```

To launch pure Python TF32/FP32 inference on a single GPU using a checkpoint
loaded from `LM-TFM/checkpoint_best.pt`, run:

```
bash run_wt103_base.sh eval 1 --model LM-TFM/checkpoint_best.pt --type pytorch
```

After the execution, the script prints a summary in the following format:

```
Evaluating with math fp16 type torchscript bsz 16 tgt_len 64 ext_len 0 mem_len 640 clamp_len 400
Time : 5.29s, 22.05ms/segment
====================================================================================================
| test loss 3.15 | test ppl 23.304
====================================================================================================
```

which contains information about the runtime parameters, execution time, and
the loss and perplexity on the test dataset.

## Performance

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).

### Benchmarking

The following section shows how to run benchmarks measuring the model
performance in training and inference modes.

#### Training performance benchmark

To benchmark the training performance for a specific local (per-GPU) batch size
`<LBS>`, with a specific number of GPUs `<#GPUs>`, for a specific number of
training iterations `<ITER>`, run:

```
bash run_wt103_{base,large}.sh train <#GPUs> --config trainbench --local_batch_size <LBS> --max_step <ITER> [--fp16]
```

* use the `run_wt103_base.sh` script to run the benchmark for the base model,
  and use the `run_wt103_large.sh` script to run the benchmark for the large
  model
* it's recommended to launch at least 500 training steps to get a reliable
  estimate of training performance
* the `--fp16` flag is optional; if it's specified, the script launches mixed
  precision training with Tensor Cores. If the flag is not present, the script
  launches FP32 training on NVIDIA Volta GPUs and TF32 training on NVIDIA
  Ampere GPUs.

For more information about the available options, refer to the [Training
process](#training-process) section.

The training script prints information in the following format:

```
(...)
| epoch 1 step 499 | batches 499 / 16802 | lr 4.990e-03 | ms/batch 219.9 | tok/s 27947 | loss 6.43 | ppl 620.80
| epoch 1 step 500 | batches 500 / 16802 | lr 5.000e-03 | ms/batch 221.4 | tok/s 27747 | loss 6.42 | ppl 611.70
-------------------------------------------------------------------------------
(...)
Training time: 1.81 minutes
Training throughput: 28508.91 tok/s
```

The last two lines contain information on the total training time and on the
average training throughput measured in tokens per second.

##### Training performance benchmark for multi-node

To benchmark the multi-node training performance of the large model on a
specific number of DGX-2H nodes `<NODES>` with a specific local batch size
`<LBS>`, run:

For mixed precision:

```
FP16=1 LOCAL_BATCH_SIZE=<LBS> CONFIG=trainbench_multinode sbatch -N <NODES> run.sub train
```

For FP32:

```
LOCAL_BATCH_SIZE=<LBS> CONFIG=trainbench_multinode sbatch -N <NODES> run.sub train
```

#### Inference performance benchmark

The inference performance and accuracy benchmarks require a checkpoint from a
trained model.

To benchmark the inference performance on a specific global batch size `<BS>`
with a specific number of GPUs `<#GPUs>`, run:

For the base model:

```
bash run_wt103_base.sh eval <#GPUs> --model <CHECKPOINT> --batch_size <BS> --save_data [--fp16] [--type {pytorch, torchscript}]
```

For the large model:

```
bash run_wt103_large.sh eval <#GPUs> --model <CHECKPOINT> --batch_size <BS> --save_data [--fp16] [--type {pytorch, torchscript}]
```

The inference script prints information in the following format:

```
Evaluating with math fp16 type torchscript bsz 16 tgt_len 64 ext_len 0 mem_len 640 clamp_len 400
Time : 5.25s, 21.88ms/segment
====================================================================================================
| test loss 3.15 | test ppl 23.304
====================================================================================================
Throughput Avg: 46316.64 tok/s
Latency Avg: 22.09 ms
Latency 90%: 22.22 ms
Latency 95%: 22.25 ms
Latency 99%: 22.37 ms
====================================================================================================
```

The output contains information on the achieved test loss and test perplexity,
average inference throughput (measured in tokens per second), and average
inference latency together with the 90th, 95th, and 99th percentile latencies
(measured in milliseconds).

The `scripts/inference_benchmark.sh` benchmarking script is provided for
convenience; it automatically launches TF32/FP32 and FP16 inference for various
batch sizes.

### Results

The following sections provide details on how we achieved our performance and
accuracy in training and inference.

#### Training accuracy results

##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)

###### Base model

Our results were obtained by running the `pytorch/run_wt103_base.sh`
training script in the pytorch-20.06-py3 NGC container on NVIDIA DGX A100
with 8x A100 40GB GPUs.

|**GPUs**|**Batch Size / GPU**|**Accuracy - TF32 (perplexity)**|**Accuracy - Mixed precision (perplexity)**|**Time to Train - TF32 (minutes)**|**Time to Train - Mixed precision (minutes)**|**Time to Train Speedup (TF32 to Mixed precision)**|
|-------:|-------------------:|-------------------------------:|------------------------------------------:|---------------------------------:|--------------------------------------------:|--------------------------------------------------:|
| 8 | 32 | 23.24 | 23.24 | 110 | 76 | 1.45 |

###### Large model

Our results were obtained by running the `pytorch/run_wt103_large.sh`
training script in the pytorch-20.06-py3 NGC container on NVIDIA DGX A100
with 8x A100 40GB GPUs.

|**GPUs**|**Batch Size / GPU**|**Accuracy - TF32 (perplexity)**|**Accuracy - Mixed precision (perplexity)**|**Time to Train - TF32 (minutes)**|**Time to Train - Mixed precision (minutes)**|**Time to Train Speedup (TF32 to Mixed precision)**|
|-------:|-------------------:|-------------------------------:|------------------------------------------:|---------------------------------:|--------------------------------------------:|--------------------------------------------------:|
| 8 | 8 | 18.18 | 18.18 | 735 | 477 | 1.54 |
| 8 | 16 | N/A | 18.19 | N/A | 430 | 1.71 |

##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

###### Base model

Our results were obtained by running the `pytorch/run_wt103_base.sh`
training script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1
with 8x V100 16GB GPUs.

|**GPUs**|**Batch Size / GPU**|**Accuracy - FP32 (perplexity)**|**Accuracy - Mixed precision (perplexity)**|**Time to Train - FP32 (minutes)**|**Time to Train - Mixed precision (minutes)**|**Time to Train Speedup (FP32 to Mixed precision)**|
|-------:|-------------------:|-------------------------------:|------------------------------------------:|---------------------------------:|--------------------------------------------:|--------------------------------------------------:|
| 1 | 16 | 23.12 | 23.13 | 2146 | 960 | 2.24 |
| 8 | 16 | 23.17 | 23.14 | 316 | 167 | 1.89 |
| 1 | 32 | N/A | 23.15 | N/A | 766 | 2.80 |
| 8 | 32 | N/A | 23.18 | N/A | 121 | 2.61 |

###### Large model

Our results were obtained by running the `pytorch/run_wt103_large.sh`
training script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1
with 8x V100 16GB GPUs.

|**GPUs**|**Batch Size / GPU**|**Accuracy - FP32 (perplexity)**|**Accuracy - Mixed precision (perplexity)**|**Time to Train - FP32 (minutes)**|**Time to Train - Mixed precision (minutes)**|**Time to Train Speedup (FP32 to Mixed precision)**|
|-------:|-------------------:|-------------------------------:|------------------------------------------:|---------------------------------:|--------------------------------------------:|--------------------------------------------------:|
| 8 | 2 | 18.22 | 18.20 | 2983 | 1480 | 2.01 |
| 8 | 4 | N/A | 18.17 | N/A | 984 | 3.03 |

##### Training accuracy: NVIDIA DGX-2H (16x V100 32GB)
|
||
|
||
###### Base model
|
||
|
||
Our results were obtained by running the `pytorch/run_wt103_base.sh`
|
||
training script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-2H
|
||
with 16x V100 32GB GPUs.
|
||
|
||
|**GPUs**|**Batch Size / GPU**|**Accuracy - FP32 (perplexity)**|**Accuracy - Mixed precision (perplexity)**|**Time to Train - FP32 (minutes)**|**Time to Train - Mixed precision (minutes)**|**Time to Train Speedup (FP32 to Mixed precision)**|
|
||
|-------:|-------------------:|-------------------------------:|------------------------------------------:|---------------------------------:|--------------------------------------------:|--------------------------------------------------:|
|
||
| 16 | 16 | 23.22 | 23.22 | 149 | 80 | 1.86 |
|
||
|
||
|
||
###### Large model
|
||
|
||
Our results were obtained by running the `pytorch/run_wt103_large.sh`
|
||
training script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-2H
|
||
with 16x V100 32GB GPUs.
|
||
|
||
|**GPUs**|**Batch Size / GPU**|**Accuracy - FP32 (perplexity)**|**Accuracy - Mixed precision (perplexity)**|**Time to Train - FP32 (minutes)**|**Time to Train - Mixed precision (minutes)**|**Time to Train Speedup (FP32 to Mixed precision)**|
|
||
|-------:|-------------------:|-------------------------------:|------------------------------------------:|---------------------------------:|--------------------------------------------:|--------------------------------------------------:|
|
||
| 16 | 8 | 18.21 | 18.20 | 1075 | 394 | 2.73 |
|
||
|
||
|
||
##### Training accuracy: 8x NVIDIA DGX-2H (16x V100 32GB)
|
||
|
||
###### Large model
|
||
|
||
Our results were obtained by running the `pytorch/run.sub`
|
||
training script in the pytorch-20.06-py3 NGC container on 8x NVIDIA DGX-2H
|
||
with 16x V100 32GB GPUs.
|
||
|
||
|**DGX System**|**Nodes**|**Batch Size / GPU**|**Accuracy - FP32 (perplexity)**|**Accuracy - Mixed precision (perplexity)**|**Time to Train - FP32 (minutes)**|**Time to Train - Mixed precision (minutes)**|**Time to Train Speedup (FP32 to Mixed precision)**|
|
||
|-------------:|--------:|-------------------:|-------------------------------:|------------------------------------------:|---------------------------------:|--------------------------------------------:|--------------------------------------------------:|
|
||
| DGX-2H | 8 | 4 | 18.27 | 18.28 | 156 | 74 | 2.11 |

##### Training accuracy plots

###### Base model

![TrainingLossBase](pytorch/img/training_loss_base.png)

###### Large model (single-node)

![TrainingLossLarge](pytorch/img/training_loss_large.png)

###### Large model (multi-node)

![TrainingLossLargeMultiNode](pytorch/img/training_loss_large_multinode.png)

##### Training stability test

###### Base model

The Transformer-XL base model was trained for 40,000 training steps, starting
from 16 different initial random seeds. After every 5,000 training steps, the
model was evaluated on the validation dataset and the validation perplexity was
recorded. The training was performed in the pytorch-20.06-py3 NGC container on
NVIDIA DGX A100 with 8x A100 40GB GPUs. The following table summarizes the
perplexity on the validation dataset.

|**Training step**|**Average perplexity**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|----------------:|---------------------:|---------------------:|----------:|----------:|---------:|
| 5000 | 42.62 | 0.27311 | 42.01 | 43.09 | 42.67 |
| 10000 | 32.31 | 0.12814 | 32.10 | 32.59 | 32.31 |
| 15000 | 28.38 | 0.10764 | 28.23 | 28.57 | 28.35 |
| 20000 | 26.14 | 0.10218 | 25.96 | 26.36 | 26.14 |
| 25000 | 24.59 | 0.09060 | 24.42 | 24.81 | 24.60 |
| 30000 | 23.71 | 0.07259 | 23.61 | 23.84 | 23.71 |
| 35000 | 23.15 | 0.04781 | 23.05 | 23.26 | 23.15 |
| 40000 | 22.93 | 0.05593 | 22.83 | 23.04 | 22.94 |

After training, the models were evaluated on the test dataset. The following
table summarizes the final perplexity on the test set.

|**Average perplexity**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|---------------------:|---------------------:|----------:|----------:|---------:|
| 23.24 | 0.07794 | 23.11 | 23.38 | 23.25 |
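
The summary statistics in these tables follow the usual definitions; the
sketch below reproduces them with Python's standard library. The perplexity
values are illustrative stand-ins, not the measured per-seed results.

```python
# Sketch: summarize one stability-test run (one value per random seed).
# The values below are illustrative, not the actual measurements.
import statistics

perplexities = [23.11, 23.18, 23.20, 23.25, 23.29, 23.38]

print(f"average {statistics.mean(perplexities):.2f}")
print(f"stdev   {statistics.stdev(perplexities):.5f}")  # sample standard deviation
print(f"minimum {min(perplexities):.2f}")
print(f"maximum {max(perplexities):.2f}")
print(f"median  {statistics.median(perplexities):.2f}")
```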

###### Large model (single-node)

The Transformer-XL large model was trained for 100,000 training steps, starting
from 16 different initial random seeds. After every 10,000 training steps, the
model was evaluated on the validation dataset and the validation perplexity was
recorded. The training was performed in the pytorch-20.06-py3 NGC container on
NVIDIA DGX A100 with 8x A100 40GB GPUs. The following table summarizes the
perplexity on the validation dataset.

|**Training step**|**Average perplexity**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|----------------:|---------------------:|---------------------:|----------:|----------:|---------:|
| 10000 | 32.63 | 0.20432 | 32.34 | 33.05 | 32.62 |
| 20000 | 24.08 | 0.10980 | 23.90 | 24.28 | 24.10 |
| 30000 | 21.52 | 0.09069 | 21.36 | 21.66 | 21.52 |
| 40000 | 20.17 | 0.06922 | 20.06 | 20.27 | 20.17 |
| 50000 | 19.23 | 0.05975 | 19.11 | 19.33 | 19.24 |
| 60000 | 18.57 | 0.06008 | 18.47 | 18.72 | 18.56 |
| 70000 | 18.17 | 0.06473 | 18.08 | 18.32 | 18.15 |
| 80000 | 17.95 | 0.06506 | 17.82 | 18.08 | 17.94 |
| 90000 | 17.80 | 0.04350 | 17.71 | 17.90 | 17.80 |
| 100000 | 17.80 | 0.03592 | 17.74 | 17.86 | 17.81 |

After training, the models were evaluated on the test dataset. The following
table summarizes the final perplexity on the test set.

|**Average perplexity**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|---------------------:|---------------------:|----------:|----------:|---------:|
| 18.17 | 0.04016 | 18.09 | 18.24 | 18.17 |

###### Large model (multi-node)

The Transformer-XL large model was trained for 25,000 training steps, starting
from 10 different initial random seeds. After every 1,000 training steps, the
model was evaluated on the validation dataset and the validation perplexity was
recorded. The training was performed in the pytorch-20.06-py3 NGC container on
8x NVIDIA DGX-2H with 16x V100 32GB GPUs. The following table summarizes the
perplexity on the validation dataset.

|**Training step**|**Average perplexity**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|----------------:|---------------------:|---------------------:|----------:|----------:|---------:|
| 1000 | 608.09 | 3.80116 | 600.65 | 613.73 | 609.40 |
| 2000 | 142.75 | 0.94452 | 141.21 | 143.84 | 143.07 |
| 3000 | 62.19 | 0.44544 | 61.38 | 63.01 | 62.18 |
| 4000 | 40.22 | 0.16397 | 39.93 | 40.54 | 40.20 |
| 5000 | 32.00 | 0.15850 | 31.61 | 32.19 | 32.02 |
| 6000 | 28.05 | 0.17854 | 27.81 | 28.41 | 28.05 |
| 7000 | 25.65 | 0.10946 | 25.51 | 25.87 | 25.65 |
| 8000 | 24.20 | 0.11385 | 23.98 | 24.36 | 24.20 |
| 9000 | 23.18 | 0.14936 | 22.84 | 23.37 | 23.20 |
| 10000 | 22.88 | 0.22752 | 22.54 | 23.33 | 22.94 |
| 11000 | 21.99 | 0.16232 | 21.73 | 22.29 | 21.97 |
| 12000 | 21.69 | 0.10824 | 21.46 | 21.81 | 21.73 |
| 13000 | 21.42 | 0.09154 | 21.25 | 21.57 | 21.44 |
| 14000 | 21.33 | 0.13821 | 21.15 | 21.55 | 21.27 |
| 15000 | 21.24 | 0.15526 | 20.95 | 21.57 | 21.20 |
| 16000 | 21.19 | 0.10521 | 21.01 | 21.44 | 21.18 |
| 17000 | 20.89 | 0.18239 | 20.69 | 21.18 | 20.82 |
| 18000 | 20.36 | 0.10715 | 20.21 | 20.53 | 20.34 |
| 19000 | 19.74 | 0.12803 | 19.45 | 19.92 | 19.75 |
| 20000 | 19.18 | 0.10020 | 19.05 | 19.39 | 19.15 |
| 21000 | 18.49 | 0.06319 | 18.36 | 18.60 | 18.49 |
| 22000 | 18.17 | 0.03674 | 18.11 | 18.22 | 18.16 |
| 23000 | 17.98 | 0.03682 | 17.90 | 18.04 | 17.99 |
| 24000 | 17.88 | 0.02880 | 17.84 | 17.92 | 17.89 |
| 25000 | 17.85 | 0.02793 | 17.80 | 17.90 | 17.86 |

After training, the models were evaluated on the test dataset. The following
table summarizes the final perplexity on the test set.

|**Average perplexity**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|---------------------:|---------------------:|----------:|----------:|---------:|
| 18.30 | 0.02747 | 18.24 | 18.33 | 18.30 |

#### Training performance results

##### Training performance: NVIDIA DGX A100 (8x A100 40GB)

###### Base model

Our results were obtained by running the `pytorch/run_wt103_base.sh` training
script in the pytorch-20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100
40GB GPUs. Performance numbers (in tokens per second) were averaged over 500
training iterations.

|**GPUs**|**Batch Size / GPU**|**Throughput - TF32 (tok/s)**|**Throughput - Mixed precision (tok/s)**|**Throughput speedup (TF32 to Mixed precision)**|**Weak Scaling - TF32**|**Weak Scaling - Mixed precision**|
|-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
| 1 | 32 | 41,527 | 59,961 | 1.444 | 1.000 | 1.000 |
| 2 | 32 | 77,625 | 113,238 | 1.459 | 1.869 | 1.889 |
| 4 | 32 | 153,945 | 225,609 | 1.466 | 3.707 | 3.763 |
| 8 | 32 | 305,933 | 449,890 | 1.471 | 7.367 | 7.503 |
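
Weak scaling in these tables is the throughput at N GPUs divided by the
single-GPU throughput at the same per-GPU batch size. Checking the TF32 rows
above:

```python
# Weak scaling = throughput(N GPUs) / throughput(1 GPU); TF32 rows from the table.
single_gpu = 41_527
for gpus, tput in [(2, 77_625), (4, 153_945), (8, 305_933)]:
    print(f"{gpus} GPUs: {tput / single_gpu:.3f}")  # -> 1.869, 3.707, 7.367
```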

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Training performance benchmark](#training-performance-benchmark) section for
instructions on how to launch the benchmark.

###### Large model

Our results were obtained by running the `pytorch/run_wt103_large.sh` training
script in the pytorch-20.06-py3 NGC container on NVIDIA DGX A100 with 8x A100
40GB GPUs. Performance numbers (in tokens per second) were averaged over 500
training iterations.

|**GPUs**|**Batch Size / GPU**|**Throughput - TF32 (tok/s)**|**Throughput - Mixed precision (tok/s)**|**Throughput speedup (TF32 to Mixed precision)**|**Weak Scaling - TF32**|**Weak Scaling - Mixed precision**|
|-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
| 1 | 8 | 14,497 | 21,554 | 1.487 | 1.000 | 1.000 |
| 2 | 8 | 27,304 | 40,222 | 1.473 | 1.883 | 1.866 |
| 4 | 8 | 53,756 | 80,226 | 1.492 | 3.708 | 3.722 |
| 8 | 8 | 106,651 | 159,185 | 1.493 | 7.357 | 7.385 |
| 1 | 16 | N/A | 25,084 | 1.730 | N/A | 1.000 |
| 2 | 16 | N/A | 48,562 | 1.779 | N/A | 1.936 |
| 4 | 16 | N/A | 95,997 | 1.786 | N/A | 3.827 |
| 8 | 16 | N/A | 191,148 | 1.792 | N/A | 7.620 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Training performance benchmark](#training-performance-benchmark) section for
instructions on how to launch the benchmark.

##### Training performance: NVIDIA DGX-1 (8x V100 16GB)

###### Base model

Our results were obtained by running the `pytorch/run_wt103_base.sh` training
script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB
GPUs. Performance numbers (in tokens per second) were averaged over 500
training iterations.

|**GPUs**|**Batch Size / GPU**|**Throughput - FP32 (tok/s)**|**Throughput - Mixed precision (tok/s)**|**Throughput speedup (FP32 to Mixed precision)**|**Weak Scaling - FP32**|**Weak Scaling - Mixed precision**|
|-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
| 1 | 16 | 13,981 | 26,639 | 1.905 | 1.000 | 1.000 |
| 2 | 16 | 23,163 | 45,299 | 1.956 | 1.657 | 1.700 |
| 4 | 16 | 48,893 | 92,618 | 1.894 | 3.497 | 3.477 |
| 8 | 16 | 97,005 | 170,532 | 1.758 | 6.938 | 6.402 |
| 1 | 32 | N/A | 36,692 | 2.624 | N/A | 1.000 |
| 2 | 32 | N/A | 65,889 | 2.845 | N/A | 1.796 |
| 4 | 32 | N/A | 133,838 | 2.737 | N/A | 3.648 |
| 8 | 32 | N/A | 258,648 | 2.666 | N/A | 7.049 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Training performance benchmark](#training-performance-benchmark) section for
instructions on how to launch the benchmark.

###### Large model

Our results were obtained by running the `pytorch/run_wt103_large.sh` training
script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB
GPUs. Performance numbers (in tokens per second) were averaged over 500
training iterations.

|**GPUs**|**Batch Size / GPU**|**Throughput - FP32 (tok/s)**|**Throughput - Mixed precision (tok/s)**|**Throughput speedup (FP32 to Mixed precision)**|**Weak Scaling - FP32**|**Weak Scaling - Mixed precision**|
|-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
| 1 | 2 | 3,558 | 6,907 | 1.941 | 1.000 | 1.000 |
| 2 | 2 | 6,153 | 11,272 | 1.832 | 1.729 | 1.632 |
| 4 | 2 | 12,492 | 22,530 | 1.804 | 3.511 | 3.262 |
| 8 | 2 | 24,595 | 40,920 | 1.664 | 6.913 | 5.925 |
| 1 | 4 | N/A | 10,210 | 2.870 | N/A | 1.000 |
| 2 | 4 | N/A | 17,984 | 2.923 | N/A | 1.761 |
| 4 | 4 | N/A | 36,340 | 2.909 | N/A | 3.559 |
| 8 | 4 | N/A | 66,716 | 2.713 | N/A | 6.535 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Training performance benchmark](#training-performance-benchmark) section for
instructions on how to launch the benchmark.

##### Training performance: NVIDIA DGX-2H (16x V100 32GB)

###### Base model

Our results were obtained by running the `pytorch/run_wt103_base.sh` training
script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-2H with 16x V100
32GB GPUs. Performance numbers (in tokens per second) were averaged over 500
training iterations.

|**GPUs**|**Batch Size / GPU**|**Throughput - FP32 (tok/s)**|**Throughput - Mixed precision (tok/s)**|**Throughput speedup (FP32 to Mixed precision)**|**Weak Scaling - FP32**|**Weak Scaling - Mixed precision**|
|-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
| 1 | 16 | 16,150 | 32,875 | 2.036 | 1.000 | 1.000 |
| 2 | 16 | 29,712 | 59,058 | 1.988 | 1.840 | 1.796 |
| 4 | 16 | 58,011 | 113,985 | 1.965 | 3.592 | 3.467 |
| 8 | 16 | 114,655 | 223,907 | 1.953 | 7.099 | 6.811 |
| 16 | 16 | 222,920 | 414,994 | 1.862 | 13.803 | 12.623 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Training performance benchmark](#training-performance-benchmark) section for
instructions on how to launch the benchmark.

###### Large model

Our results were obtained by running the `pytorch/run_wt103_large.sh` training
script in the pytorch-20.06-py3 NGC container on NVIDIA DGX-2H with 16x V100
32GB GPUs. Performance numbers (in tokens per second) were averaged over 500
training iterations.

|**GPUs**|**Batch Size / GPU**|**Throughput - FP32 (tok/s)**|**Throughput - Mixed precision (tok/s)**|**Throughput speedup (FP32 to Mixed precision)**|**Weak Scaling - FP32**|**Weak Scaling - Mixed precision**|
|-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
| 1 | 8 | 5,169 | 14,787 | 2.861 | 1.000 | 1.000 |
| 2 | 8 | 9,977 | 27,710 | 2.777 | 1.930 | 1.874 |
| 4 | 8 | 19,691 | 54,207 | 2.753 | 3.810 | 3.666 |
| 8 | 8 | 39,157 | 107,073 | 2.734 | 7.576 | 7.241 |
| 16 | 8 | 77,568 | 211,387 | 2.725 | 15.008 | 14.296 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Training performance benchmark](#training-performance-benchmark) section for
instructions on how to launch the benchmark.

##### Training performance: 8x NVIDIA DGX-2H (16x V100 32GB)

Our results were obtained by running the `pytorch/run.sub` training script in
the pytorch-20.06-py3 NGC container. Performance numbers (in tokens per second)
were averaged over 500 training iterations.

###### Large model

|**DGX System**|**Nodes**|**Batch Size / GPU**|**Throughput - FP32 (tok/s)**|**Throughput - Mixed precision (tok/s)**|**Throughput speedup (FP32 to Mixed precision)**|**Weak Scaling - FP32**|**Weak Scaling - Mixed precision**|
|-------------:|--------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
| DGX-2H | 1 | 4 | 69,070 | 154,950 | 2.24 | 1.00 | 1.00 |
| DGX-2H | 2 | 4 | 136,960 | 307,520 | 2.25 | 1.98 | 1.98 |
| DGX-2H | 4 | 4 | 270,120 | 605,530 | 2.24 | 3.91 | 3.91 |
| DGX-2H | 8 | 4 | 514,500 | 1,189,700 | 2.31 | 7.45 | 7.68 |
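
Dividing weak scaling by the node count gives the parallel efficiency of the
multi-node run; checking the mixed-precision column above:

```python
# Parallel efficiency = weak scaling / number of nodes (mixed-precision column).
for nodes, weak_scaling in [(1, 1.00), (2, 1.98), (4, 3.91), (8, 7.68)]:
    print(f"{nodes} node(s): {100 * weak_scaling / nodes:.0f}% efficiency")
# -> 100%, 99%, 98%, 96%
```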

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and then
proceed to the
[Training performance benchmark for
multi-node](#training-performance-benchmark-for-multi-node) section for
instructions on how to launch the multi-node performance benchmark. The numbers
presented above were obtained with `LOCAL_BATCH_SIZE=4`.

#### Inference performance results

##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)

###### Base model

Our results were obtained by running the
`pytorch/scripts/inference_benchmark.sh` inference benchmarking script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX A100 with 1x A100 40GB GPU.

The command to launch the inference performance benchmark is provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 4,163.7 | 15.38 | 15.58 | 15.66 | 16.12 |
| 2 | 64 | 640 | 7,915.4 | 16.17 | 16.36 | 16.42 | 17.19 |
| 4 | 64 | 640 | 15,710.2 | 16.29 | 16.45 | 16.49 | 17.38 |
| 8 | 64 | 640 | 32,712.1 | 15.64 | 15.77 | 15.82 | 16.65 |
| 16 | 64 | 640 | 59,378.6 | 17.23 | 17.32 | 17.36 | 18.39 |
| 32 | 64 | 640 | 91,654.2 | 22.33 | 22.39 | 22.53 | 23.63 |
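
Throughput and latency in these tables are tied together by the amount of
work per step: each forward pass scores roughly `batch_size * sequence_length`
tokens. Assuming that relationship, the last row above can be cross-checked:

```python
# Approximate throughput from average latency, assuming
# batch_size * sequence_length tokens are processed per step.
batch_size, seq_len = 32, 64
latency_avg_s = 22.33e-3
print(f"{batch_size * seq_len / latency_avg_s:,.0f} tok/s")  # ~91,715 tok/s
```

which lands close to the 91,654.2 tok/s reported.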

**FP16, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 6,935.9 | 9.231 | 9.388 | 9.445 | 9.534 |
| 2 | 64 | 640 | 12,649.4 | 10.120 | 10.253 | 10.294 | 10.945 |
| 4 | 64 | 640 | 25,029.5 | 10.223 | 10.346 | 10.381 | 10.475 |
| 8 | 64 | 640 | 52,666.3 | 9.716 | 9.808 | 9.851 | 10.540 |
| 16 | 64 | 640 | 90,767.8 | 11.274 | 11.321 | 11.334 | 11.800 |
| 32 | 64 | 640 | 107,082.4 | 19.109 | 19.138 | 19.162 | 19.608 |

**TF32, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 4,003.8 | 15.99 | 16.26 | 16.36 | 16.58 |
| 2 | 64 | 640 | 7,499.2 | 17.07 | 17.32 | 17.39 | 17.86 |
| 4 | 64 | 640 | 14,835.4 | 17.25 | 17.46 | 17.50 | 18.34 |
| 8 | 64 | 640 | 30,001.5 | 17.06 | 17.22 | 17.28 | 18.40 |
| 16 | 64 | 640 | 50,189.7 | 20.39 | 20.48 | 20.52 | 21.41 |
| 32 | 64 | 640 | 63,660.5 | 32.14 | 32.17 | 32.29 | 33.19 |

**TF32, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 6,084.5 | 10.52 | 10.74 | 10.84 | 10.95 |
| 2 | 64 | 640 | 11,680.6 | 10.96 | 11.17 | 11.22 | 11.76 |
| 4 | 64 | 640 | 22,867.3 | 11.19 | 11.35 | 11.40 | 12.07 |
| 8 | 64 | 640 | 45,165.5 | 11.33 | 11.46 | 11.49 | 12.03 |
| 16 | 64 | 640 | 61,042.0 | 16.76 | 16.84 | 16.86 | 17.13 |
| 32 | 64 | 640 | 71,124.1 | 28.77 | 28.81 | 28.84 | 28.86 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Inference performance benchmark](#inference-performance-benchmark) section for
instructions on how to launch the benchmark.

###### Large model

Our results were obtained by running the
`pytorch/scripts/inference_benchmark.sh` inference benchmarking script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX A100 with 1x A100 40GB GPU.

The command to launch the inference performance benchmark is provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 7,033.0 | 18.20 | 18.57 | 18.64 | 18.93 |
| 2 | 128 | 1,600 | 12,832.5 | 19.94 | 20.23 | 20.29 | 21.07 |
| 4 | 128 | 1,600 | 21,500.2 | 23.80 | 23.99 | 24.07 | 25.09 |
| 8 | 128 | 1,600 | 25,797.1 | 39.66 | 39.74 | 39.91 | 41.00 |
| 16 | 128 | 1,600 | 28,143.5 | 72.71 | 72.74 | 73.12 | 74.00 |
| 32 | 128 | 1,600 | 28,533.6 | 143.44 | 143.30 | 143.48 | 149.07 |

**FP16, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 11,068.2 | 11.57 | 11.83 | 11.88 | 12.42 |
| 2 | 128 | 1,600 | 19,847.0 | 12.89 | 13.09 | 13.11 | 13.27 |
| 4 | 128 | 1,600 | 24,450.7 | 20.92 | 21.08 | 21.10 | 21.15 |
| 8 | 128 | 1,600 | 27,938.4 | 36.62 | 36.72 | 36.75 | 36.86 |
| 16 | 128 | 1,600 | 30,783.0 | 66.48 | 66.54 | 66.59 | 66.98 |
| 32 | 128 | 1,600 | 32,161.6 | 127.26 | 127.19 | 127.34 | 131.64 |

**TF32, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 6,558.8 | 19.52 | 19.87 | 19.95 | 20.44 |
| 2 | 128 | 1,600 | 10,658.4 | 24.00 | 24.28 | 24.36 | 25.17 |
| 4 | 128 | 1,600 | 14,769.6 | 34.64 | 34.82 | 34.89 | 35.74 |
| 8 | 128 | 1,600 | 16,852.6 | 60.71 | 60.82 | 61.05 | 62.17 |
| 16 | 128 | 1,600 | 18,071.8 | 113.23 | 113.28 | 113.37 | 114.64 |
| 32 | 128 | 1,600 | 17,619.2 | 234.04 | 229.98 | 239.30 | 328.15 |

**TF32, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 9,084.4 | 14.09 | 14.37 | 14.40 | 14.46 |
| 2 | 128 | 1,600 | 12,839.4 | 19.92 | 20.15 | 20.17 | 20.25 |
| 4 | 128 | 1,600 | 15,582.4 | 32.83 | 33.00 | 33.02 | 33.28 |
| 8 | 128 | 1,600 | 17,825.0 | 57.40 | 57.55 | 57.59 | 57.94 |
| 16 | 128 | 1,600 | 19,419.2 | 105.38 | 105.49 | 105.54 | 105.91 |
| 32 | 128 | 1,600 | 20,079.4 | 203.81 | 203.77 | 203.84 | 207.47 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Inference performance benchmark](#inference-performance-benchmark) section for
instructions on how to launch the benchmark.

##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)

###### Base model

Our results were obtained by running the
`pytorch/scripts/inference_benchmark.sh` inference benchmarking script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.

The command to launch the inference performance benchmark is provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 2,999.6 | 21.36 | 21.72 | 21.90 | 24.86 |
| 2 | 64 | 640 | 5,738.5 | 22.32 | 22.64 | 22.89 | 25.97 |
| 4 | 64 | 640 | 11,773.5 | 21.73 | 21.92 | 22.06 | 22.68 |
| 8 | 64 | 640 | 22,604.7 | 22.63 | 22.92 | 23.08 | 23.56 |
| 16 | 64 | 640 | 41,481.6 | 24.67 | 24.83 | 24.99 | 25.73 |
| 32 | 64 | 640 | 58,556.9 | 34.95 | 35.13 | 35.24 | 35.85 |

**FP16, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 5,199.9 | 12.31 | 12.59 | 12.65 | 12.98 |
| 2 | 64 | 640 | 9,802.5 | 13.06 | 13.30 | 13.42 | 13.82 |
| 4 | 64 | 640 | 19,609.4 | 13.05 | 13.17 | 13.24 | 13.94 |
| 8 | 64 | 640 | 37,598.7 | 13.61 | 13.71 | 13.77 | 14.62 |
| 16 | 64 | 640 | 57,840.2 | 17.69 | 17.73 | 17.76 | 18.36 |
| 32 | 64 | 640 | 66,955.9 | 30.57 | 30.78 | 30.86 | 30.96 |

**FP32, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 2,940.0 | 21.79 | 22.23 | 22.42 | 25.52 |
| 2 | 64 | 640 | 5,652.0 | 22.66 | 23.00 | 23.20 | 26.86 |
| 4 | 64 | 640 | 10,526.0 | 24.30 | 24.62 | 24.72 | 25.03 |
| 8 | 64 | 640 | 15,767.2 | 32.45 | 32.67 | 32.78 | 33.32 |
| 16 | 64 | 640 | 20,303.2 | 50.39 | 50.82 | 50.89 | 51.07 |
| 32 | 64 | 640 | 21,707.1 | 94.26 | 94.76 | 94.94 | 95.26 |

**FP32, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 4,974.1 | 12.88 | 13.25 | 13.37 | 13.69 |
| 2 | 64 | 640 | 9,625.3 | 13.30 | 13.58 | 13.72 | 14.15 |
| 4 | 64 | 640 | 15,069.9 | 16.98 | 17.27 | 17.35 | 17.54 |
| 8 | 64 | 640 | 18,269.8 | 28.00 | 28.23 | 28.28 | 28.37 |
| 16 | 64 | 640 | 20,884.5 | 48.99 | 49.46 | 49.50 | 49.63 |
| 32 | 64 | 640 | 22,289.2 | 91.80 | 92.25 | 92.56 | 92.67 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Inference performance benchmark](#inference-performance-benchmark) section for
instructions on how to launch the benchmark.

###### Large model

Our results were obtained by running the
`pytorch/scripts/inference_benchmark.sh` inference benchmarking script in the
pytorch-20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.

The command to launch the inference performance benchmark is provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 5,119.6 | 25.00 | 25.47 | 25.66 | 26.12 |
| 2 | 128 | 1,600 | 8,676.1 | 29.49 | 29.81 | 29.94 | 30.88 |
| 4 | 128 | 1,600 | 12,960.9 | 39.47 | 39.84 | 39.91 | 40.69 |
| 8 | 128 | 1,600 | 14,870.6 | 68.81 | 69.28 | 69.42 | 69.76 |
| 16 | 128 | 1,600 | 15,528.5 | 131.78 | 132.74 | 132.86 | 133.07 |
| 32 | 128 | 1,600 | 15,649.4 | 261.54 | 262.45 | 262.99 | 271.10 |

**FP16, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 8,718.2 | 14.68 | 15.01 | 15.07 | 15.50 |
| 2 | 128 | 1,600 | 12,157.8 | 21.04 | 21.29 | 21.31 | 21.38 |
| 4 | 128 | 1,600 | 14,534.8 | 35.20 | 35.48 | 35.53 | 35.93 |
| 8 | 128 | 1,600 | 15,863.8 | 64.50 | 64.90 | 65.15 | 65.31 |
| 16 | 128 | 1,600 | 16,674.0 | 122.73 | 123.34 | 123.66 | 123.92 |
| 32 | 128 | 1,600 | 17,154.1 | 238.60 | 239.48 | 239.73 | 247.48 |

**FP32, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 3,009.8 | 42.52 | 43.01 | 43.09 | 43.53 |
| 2 | 128 | 1,600 | 3,838.4 | 66.64 | 67.24 | 67.45 | 67.83 |
| 4 | 128 | 1,600 | 4,265.3 | 119.94 | 120.87 | 121.00 | 121.39 |
| 8 | 128 | 1,600 | 4,646.5 | 220.19 | 221.30 | 221.50 | 221.68 |
| 16 | 128 | 1,600 | 4,805.4 | 426.39 | 426.25 | 426.47 | 427.25 |
| 32 | 128 | 1,600 | 4,787.4 | 855.09 | 854.95 | 855.46 | 912.05 |

**FP32, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 3,319.0 | 38.56 | 38.91 | 39.01 | 39.19 |
| 2 | 128 | 1,600 | 3,925.2 | 65.16 | 65.74 | 65.89 | 66.12 |
| 4 | 128 | 1,600 | 4,344.1 | 117.76 | 118.46 | 118.55 | 118.69 |
| 8 | 128 | 1,600 | 4,716.2 | 216.94 | 217.99 | 218.27 | 218.69 |
| 16 | 128 | 1,600 | 4,922.1 | 415.72 | 417.16 | 417.32 | 417.59 |
| 32 | 128 | 1,600 | 4,965.2 | 824.98 | 821.79 | 831.71 | 952.47 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Inference performance benchmark](#inference-performance-benchmark) section for
instructions on how to launch the benchmark.

##### Inference performance: NVIDIA T4

###### Base model

Our results were obtained by running the
`pytorch/scripts/inference_benchmark.sh` inference benchmarking script in the
pytorch-20.06-py3 NGC container on NVIDIA T4.

The command to launch the inference performance benchmark is provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 3,775.3 | 16.97 | 17.51 | 17.84 | 18.18 |
| 2 | 64 | 640 | 6,417.4 | 19.96 | 20.49 | 20.56 | 21.52 |
| 4 | 64 | 640 | 9,988.6 | 25.64 | 26.07 | 26.14 | 27.32 |
| 8 | 64 | 640 | 11,878.9 | 43.07 | 43.42 | 43.46 | 44.24 |
| 16 | 64 | 640 | 13,630.0 | 75.07 | 75.26 | 75.32 | 76.07 |
| 32 | 64 | 640 | 14,511.2 | 141.01 | 141.38 | 141.41 | 142.16 |

**FP16, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 6,132.5 | 10.47 | 10.93 | 11.31 | 11.45 |
| 2 | 64 | 640 | 8,319.4 | 15.39 | 15.89 | 15.92 | 16.10 |
| 4 | 64 | 640 | 11,259.1 | 22.74 | 23.16 | 23.23 | 23.30 |
| 8 | 64 | 640 | 13,120.3 | 38.99 | 39.35 | 39.37 | 39.42 |
| 16 | 64 | 640 | 15,120.0 | 67.67 | 67.90 | 67.94 | 68.06 |
| 32 | 64 | 640 | 16,158.1 | 126.65 | 126.97 | 127.03 | 127.18 |

**FP32, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 2,323.1 | 27.59 | 29.39 | 29.56 | 29.86 |
| 2 | 64 | 640 | 3,094.8 | 41.39 | 42.49 | 42.78 | 43.47 |
| 4 | 64 | 640 | 3,889.8 | 65.82 | 66.60 | 66.71 | 67.57 |
| 8 | 64 | 640 | 4,270.1 | 119.80 | 120.61 | 120.68 | 120.89 |
| 16 | 64 | 640 | 4,765.7 | 214.68 | 215.87 | 216.01 | 216.14 |
| 32 | 64 | 640 | 4,985.2 | 410.43 | 413.58 | 413.67 | 413.92 |

**FP32, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 64 | 640 | 2,486.3 | 25.78 | 27.52 | 27.66 | 27.92 |
| 2 | 64 | 640 | 3,260.7 | 39.28 | 40.32 | 40.49 | 40.84 |
| 4 | 64 | 640 | 4,033.3 | 63.48 | 64.28 | 64.35 | 64.56 |
| 8 | 64 | 640 | 4,411.4 | 115.96 | 116.74 | 116.85 | 116.89 |
| 16 | 64 | 640 | 4,924.9 | 207.74 | 208.91 | 209.04 | 209.21 |
| 32 | 64 | 640 | 5,163.1 | 396.29 | 399.42 | 399.50 | 399.70 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Inference performance benchmark](#inference-performance-benchmark) section for
instructions on how to launch the benchmark.

###### Large model

Our results were obtained by running the
`pytorch/scripts/inference_benchmark.sh` inference benchmarking script in the
pytorch-20.06-py3 NGC container on NVIDIA T4.

The command to launch the inference performance benchmark is provided in the
[Inference performance benchmark](#inference-performance-benchmark) section.

**FP16, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 2,978.0 | 42.99 | 43.40 | 43.44 | 44.40 |
| 2 | 128 | 1,600 | 3,161.4 | 80.98 | 81.38 | 81.45 | 81.75 |
| 4 | 128 | 1,600 | 3,459.3 | 147.89 | 148.11 | 148.14 | 148.49 |
| 8 | 128 | 1,600 | 3,657.8 | 279.74 | 279.82 | 279.86 | 280.48 |
| 16 | 128 | 1,600 | 3,762.9 | 543.92 | 543.48 | 543.55 | 544.43 |
| 32 | 128 | 1,600 | 3,794.4 | 1079.15 | 1076.23 | 1076.37 | 1158.93 |

**FP16, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 3,066.4 | 41.74 | 42.08 | 42.12 | 42.19 |
| 2 | 128 | 1,600 | 3,399.2 | 75.31 | 75.54 | 75.57 | 75.64 |
| 4 | 128 | 1,600 | 3,721.5 | 137.47 | 137.65 | 137.70 | 137.82 |
| 8 | 128 | 1,600 | 3,932.9 | 260.19 | 260.23 | 260.29 | 260.50 |
| 16 | 128 | 1,600 | 4,057.9 | 504.43 | 503.97 | 504.01 | 504.14 |
| 32 | 128 | 1,600 | 4,117.8 | 994.54 | 991.40 | 991.46 | 1079.17 |

**FP32, pure Python**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 786.9 | 162.7 | 163.2 | 163.3 | 163.9 |
| 2 | 128 | 1,600 | 889.6 | 287.8 | 288.1 | 288.2 | 288.4 |
| 4 | 128 | 1,600 | 992.1 | 515.6 | 516.0 | 516.0 | 516.5 |
| 8 | 128 | 1,600 | 1,047.0 | 977.2 | 977.6 | 977.6 | 977.8 |
| 16 | 128 | 1,600 | 1,069.3 | 1913.5 | 1914.7 | 1914.7 | 1915.0 |
| 32 | 128 | 1,600 | 1,069.5 | 3826.3 | 3823.7 | 3823.8 | 3915.8 |

**FP32, TorchScript**

|**Batch size**|**Sequence length**|**Memory length**|**Throughput Avg (tok/s)**|**Latency Avg (ms)**|**Latency 90% (ms)**|**Latency 95% (ms)**|**Latency 99% (ms)**|
|-------------:|------------------:|----------------:|-------------------------:|-------------------:|-------------------:|-------------------:|-------------------:|
| 1 | 128 | 1,600 | 792.5 | 161.5 | 161.9 | 162.0 | 162.2 |
| 2 | 128 | 1,600 | 904.7 | 283.0 | 283.3 | 283.3 | 283.4 |
| 4 | 128 | 1,600 | 1,009.0 | 507.0 | 507.3 | 507.4 | 507.5 |
| 8 | 128 | 1,600 | 1,065.0 | 960.7 | 961.1 | 961.1 | 961.2 |
| 16 | 128 | 1,600 | 1,088.6 | 1879.7 | 1880.9 | 1881.0 | 1881.1 |
| 32 | 128 | 1,600 | 1,102.0 | 3713.7 | 3710.0 | 3718.1 | 3819.0 |

To achieve these same results, follow the steps in the
[Quick Start Guide](#quick-start-guide) to download the dataset and set up
the container, and then proceed to the
[Inference performance benchmark](#inference-performance-benchmark) section for
instructions on how to launch the benchmark.

## Release notes

### Changelog

* June 2020
  * Added support for NVIDIA DGX A100
  * Updated default NGC container to pytorch-20.06-py3
* December 2019
  * Added support for the large Transformer-XL model trained on the
    WikiText-103 dataset; the large model was trained on NVIDIA DGX-1, NVIDIA
    DGX-2, and on 8x NVIDIA DGX-2H (multi-node training)
  * Updated default NGC container to pytorch-19.11-py3
  * Added support for inference with TorchScript
* October 2019
  * Initial release
  * Support for FP32 and mixed precision training on NVIDIA DGX-1 and
    NVIDIA DGX-2, and inference on NVIDIA Tesla V100 16GB and NVIDIA T4

### Known issues

There are no known issues with this model.