Add FastSpeech in CUDA-Optimized

This commit is contained in:
Dabi Ahn 2020-07-31 14:59:15 +08:00
parent 6c1d562eb9
commit e0fcb62e83
13215 changed files with 10008 additions and 2 deletions

CUDA-Optimized/FastSpeech/.gitignore vendored Normal file

@@ -0,0 +1,5 @@
.idea
__pycache__
.DS_Store
*.egg-info
.vscode

CUDA-Optimized/FastSpeech/.gitmodules vendored Normal file

@@ -0,0 +1,6 @@
[submodule "waveglow"]
path = waveglow
url = https://github.com/NVIDIA/waveglow.git
[submodule "cub"]
path = cub
url = https://github.com/NVlabs/cub.git

CUDA-Optimized/FastSpeech/Dockerfile Normal file

@@ -0,0 +1,7 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.03-py3
FROM ${FROM_IMAGE_NAME}
ADD . /workspace/fastspeech
WORKDIR /workspace/fastspeech
RUN sh ./scripts/install.sh

CUDA-Optimized/FastSpeech/LICENSE Normal file

@@ -0,0 +1,23 @@
Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the NVIDIA CORPORATION nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

CUDA-Optimized/FastSpeech/MANIFEST.in Normal file

@@ -0,0 +1,2 @@
include fastspeech/hparams/*.yaml
recursive-include fastspeech/trt/plugins *.so

CUDA-Optimized/FastSpeech/README.md Normal file

@@ -0,0 +1,442 @@
# FastSpeech For PyTorch and TensorRT
This repository provides a script and recipe to train the FastSpeech model to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
It also provides an optimization in TensorRT to accelerate inference performance without loss of accuracy.
For more details, see this [talk](https://developer.nvidia.com/gtc/2020/video/s21420) and the [slides](https://drive.google.com/file/d/1V-h5wBWAZpIpwg-qjwOuxZuOk4CLDRxy/view?usp=sharing) presented at GTC 2020.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Training process](#training-process)
* [Inference process](#inference-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training performance results](#training-performance-results)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
* [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The [FastSpeech](https://arxiv.org/pdf/1905.09263.pdf) model is one of the state-of-the-art Text-to-Mel models. It was developed by Microsoft, and its paper was published at NeurIPS 2019. This model uses the WaveGlow vocoder model to generate waveforms.
One of the main strengths of this model is that inference is extremely fast: only a single feed-forward pass is required, with no recurrence or auto-regression in the model. Another benefit is that it is robust to errors, meaning it does not produce repeated or skipped words.
Our implementation of the FastSpeech model differs from the model described in the paper. Our implementation uses Tacotron2 instead of Transformer TTS as a teacher model to get alignments between texts and mel-spectrograms.
This FastSpeech model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results up to 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model also accelerates inference by running on TensorRT, up to 3x faster than running on the PyTorch framework on NVIDIA Volta and Turing GPUs. The models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
### Model architecture
FastSpeech is a Text-to-Mel model that is not based on any recurrent blocks or autoregressive logic. It consists of three parts: Phoneme-Side blocks, the Length Regulator, and Mel-Side blocks. The Phoneme-Side blocks contain an embedding layer, six Feed-Forward Transformer (FFT) blocks, and a positional-encoding addition layer. The Length Regulator contains a nested neural model, the Duration Predictor. The Mel-Side blocks are almost identical to the Phoneme-Side blocks, except for a linear layer at the end.
The FFT block is a variant of the Transformer block. It contains a multi-head attention layer with a residual connection, two 1D-convolution layers with a residual connection, and two layer normalization layers.
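As a rough PyTorch sketch of the FFT block described above (illustrative only - the hidden size, number of attention heads, and convolution widths are assumed values, not the repository's actual hyperparameters):
```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One FFT block: self-attention + residual, 1D convolutions + residual, two LayerNorms."""
    def __init__(self, d_model=384, n_heads=2, conv_channels=1536, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, conv_channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (time, batch, d_model)
        # Multi-head self-attention with a residual connection, followed by LayerNorm.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Two 1D convolutions over the time axis with a residual connection, then LayerNorm.
        conv_out = self.conv(x.permute(1, 2, 0)).permute(2, 0, 1)
        return self.norm2(x + conv_out)
```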
The Length Regulator is the key block in the FastSpeech model. One of the biggest difficulties in TTS is handling variable-length data, which is why most recent neural TTS models rely on recurrent blocks or autoregressive logic. The Length Regulator handles variable-length data in a completely different way: it controls the output length by repeating elements of the sequence according to the predicted durations. The Duration Predictor inside the Length Regulator predicts each phoneme's duration; it is itself a small neural model consisting of two 1D-convolution layers and a fully connected layer. Finally, the Length Regulator expands each element of the sequence by its predicted duration.
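A minimal sketch of the length-regulation step (illustrative only, not the repository's implementation): each phoneme-side hidden state is repeated according to its predicted duration to reach mel-spectrogram length.
```python
import torch

def length_regulate(phoneme_hidden, durations):
    """Repeat each phoneme-side hidden state by its predicted duration (in mel frames)."""
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(4, 8)                       # 4 phonemes, hidden size 8 (toy values)
durations = torch.tensor([2, 3, 1, 4])           # predicted frames per phoneme
expanded = length_regulate(hidden, durations)
print(expanded.shape)                            # torch.Size([10, 8]) -- sum of durations
```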
![](./materials/fastspeech_arch.png "FastSpeech architecture")
Figure 1. Architecture of the FastSpeech model. Taken from the
[FastSpeech](https://arxiv.org/pdf/1905.09263.pdf) paper.
### Default configuration
This FastSpeech model supports multi-GPU and mixed precision training with dynamic loss scaling (see Apex code [here](https://github.com/NVIDIA/apex/blob/master/apex/fp16_utils/loss_scaler.py)), as well as mixed precision inference. To speed up FastSpeech training, reference mel-spectrograms and alignments between texts and mel-spectrograms are generated during a preprocessing step and read directly from disk during training, instead of being generated during training. Also, this model utilizes fused layer normalization supported by Apex (see [here](https://nvidia.github.io/apex/layernorm.html)) to get extra speed-up during training and inference.
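For illustration, the Apex fused layer normalization mentioned above is a drop-in replacement for torch.nn.LayerNorm (a minimal sketch assuming Apex is installed and a CUDA device is available; the hidden size is an arbitrary example, not the model's actual configuration):
```python
import torch
from apex.normalization import FusedLayerNorm

# Drop-in replacement for torch.nn.LayerNorm backed by a fused CUDA kernel.
layer_norm = FusedLayerNorm(384).cuda()          # 384 is an illustrative hidden size
x = torch.randn(16, 100, 384, device="cuda")     # (batch, time, hidden)
y = layer_norm(x)                                # same semantics as torch.nn.LayerNorm
```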
This model's inference is accelerated by our implementation using the TensorRT Python API (see [here](https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html)). Custom CUDA/C++ plugins are provided for some layers to implement complex operations in the model for TensorRT and to improve inference performance. We also provide a multi-engine inference implementation as an experimental feature that further improves inference performance when dealing with variable input lengths. For more details, refer to [running on TensorRT](fastspeech/trt/README.md).
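As a rough sketch of the TensorRT Python API flow implied above (illustrative only; the engine file name is hypothetical, and the actual inferencer in ./fastspeech/trt additionally manages plugins, bindings, and multiple engines):
```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize a previously built engine and create an execution context.
with open("fastspeech.fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
# Device buffers for inputs/outputs would then be bound and context.execute_v2(bindings) called.
```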
In summary, the following features were implemented in this model:
* Data-parallel multi-GPU training
* Dynamic loss scaling with backoff for Tensor Cores (mixed precision) training
* Accelerated inference on TensorRT using custom plugins and multi-engines approach
### Feature support matrix
The following features are supported by this model:

| Feature                         | FastSpeech |
|---------------------------------|------------|
| Automatic mixed precision (AMP) | Yes        |
| TensorRT inference              | Yes        |
#### Features
Automatic Mixed Precision (AMP) - AMP is a tool that enables Tensor Core-accelerated training. For more information, refer to [APEX AMP docs](https://nvidia.github.io/apex/amp.html).
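A minimal sketch of how AMP is typically enabled with Apex (illustrative only; the stand-in model, optimizer, and opt_level are assumptions, not the exact settings used by this repository):
```python
import torch
from apex import amp

model = torch.nn.Linear(384, 80).cuda()                      # stand-in for the FastSpeech model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Wrap model and optimizer; "O1" enables mixed precision with dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 384, device="cuda")
loss = model(x).pow(2).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:          # scale the loss before backward
    scaled_loss.backward()
optimizer.step()
```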
TensorRT - a library for high-performance inference on NVIDIA GPUs, improving latency, throughput, power efficiency, and memory consumption. It builds optimized runtime engines by selecting the most performant kernels & algorithms, fusing layers, and using mixed precision. For more information, refer to [github.com/NVIDIA/TensorRT](https://github.com/NVIDIA/TensorRT).
## Setup
The following section lists the requirements that you need to meet in order to start training the FastSpeech model.
### Requirements
This repository contains a Dockerfile which extends the PyTorch NGC container
and encapsulates some dependencies. Aside from these dependencies, ensure you
have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.03-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
or newer
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
Documentation:
* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)
For those unable to use the PyTorch NGC container, to set up the required
environment or create your own container, see the versioned
[NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the FastSpeech model on the [LJSpeech](https://keithito.com/LJ-Speech-Dataset) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
1. Clone the repository,
```
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/CUDA-Optimized/FastSpeech
```
and pull submodules.
```
git submodule init
git submodule update
```
2. Download and preprocess the dataset. Data is downloaded to the ./LJSpeech-1.1 directory (on the host). The ./LJSpeech-1.1 directory is mounted to the /workspace/fastspeech/LJSpeech-1.1 location in the NGC container.
```
bash scripts/prepare_dataset.sh
```
3. Build the FastSpeech PyTorch NGC container.
```
bash scripts/docker/build.sh
```
4. Start an interactive session in the NGC container to run training/inference. After you build the container image, you can start an interactive CLI session with:
```
bash scripts/docker/interactive.sh
```
5. Start training. To preprocess mel-spectrograms for faster training, first run:
```
python fastspeech/dataset/ljspeech_dataset.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1"
```
The preprocessed mel-spectrograms are stored in the ./mels_ljspeech1.1 directory. Also, the preprocessed alignments are prepared in the ./aligns_ljspeech1.1 directory. For more information, refer to the [training process section](#training-process).
Finally, run the training script:
```
python fastspeech/train.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1" --aligns_path="./aligns_ljspeech1.1" --log_path="./logs" --checkpoint_path="./checkpoints"
```
The checkpoints and Tensorboard log files are stored in the ./checkpoints and ./logs directories, respectively.
Additionally, to accelerate the training using AMP, run with --use_amp:
```
python fastspeech/train.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1" --aligns_path="./aligns_ljspeech1.1" --log_path="./logs" --checkpoint_path="./checkpoints" --use_amp
```
6. Start generation. To generate waveforms with the WaveGlow vocoder, download the [pretrained WaveGlow model](https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF) and place it in the home directory, for example, ./waveglow_256channels.pt.
After you have trained the FastSpeech model, you can perform generation using the checkpoint stored in ./checkpoints. Then run:
```
python generate.py --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="./test_sentences.txt"
```
The script automatically loads the latest checkpoint (if any exists), or you can pass a checkpoint file through --ckpt_file. It reads the input texts from ./test_sentences.txt and stores the results in the ./results directory. You can also set the result directory path with --results_path.
You can also run with a sample text:
```
python generate.py --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="The more you buy, the more you save."
```
7. Accelerate generation (inference of both FastSpeech and WaveGlow) with TensorRT. Set the parameters config file with --hparam=trt.yaml to enable TensorRT inference mode. To prepare for running WaveGlow on TensorRT, first extract a TensorRT engine file via [DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/trt](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/trt) and copy it to the home directory, for example, ./waveglow.fp16.trt. Then run with --waveglow_engine_path:
```
python generate.py --hparam=trt.yaml --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" --waveglow_engine_path="waveglow.fp16.trt"
```
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
The ./fastspeech directory contains models and scripts for data processing, training/inference, and estimating performance.
* train.py: the FastSpeech model training script.
* infer.py: the FastSpeech model inference script.
* perf_infer.py: the script for estimating inference performance.
* align_tacotron2.py: the script for preprocessing alignments.
The ./fastspeech/trt directory contains the FastSpeech TensorRT model, the inferencer, and plugins for TensorRT.
Finally, ./generate.py is the script for generating waveforms with a vocoder.
### Parameters
All parameters of the FastSpeech model and for training/inference are defined in parameters config files in ./fastspeech/hparams.
The base config file, base.yaml, contains the most common parameters, including paths, audio processing settings, and model hyperparameters. The default config file for training, train.yaml, contains parameters used during training such as learning rate, batch size, and number of steps. The default config file for inference, infer.yaml, contains parameters required for inference, including batch size and whether to use half precision. For more details, refer to base.yaml, train.yaml, and infer.yaml in ./fastspeech/hparams.
You can also define a new config file by overriding the default config, and set the config file via a command-line option --hparam, for example:
```yaml
# File name: ./fastspeech/hparams/my_train.yaml
# Inherit all parameters from train.yaml.
parent_yaml: "train.yaml"
# Override the learning rate.
learning_rate: 0.0005
```
```
python fastspeech/train.py --hparam=my_train.yaml ...
```
### Command-line options
To see the full list of available options and their descriptions, use the `-- -h` or `-- --help` command-line option, for example:
```
python fastspeech/train.py -- -h
```
Although the help message does not display all parameters defined in the config files, you can override any of them from the command line, for example:
```
python fastspeech/train.py ... --batch_size=8 --final_steps=64000
```
### Getting the data
The FastSpeech model was trained on the LJSpeech-1.1 dataset. This repository contains the ./scripts/prepare_dataset.sh script which will automatically download and extract the whole dataset. By default, data will be extracted to the ./LJSpeech-1.1 directory. The dataset directory contains a README file, a wavs directory with all audio samples, and a file metadata.csv that contains audio file names and the corresponding transcripts.
#### Dataset guidelines
The LJSpeech dataset has 13,100 clips that amount to about 24 hours of speech. Since the original dataset keeps all transcripts in the metadata.csv file, the ./scripts/prepare_dataset.sh script partitions metadata.csv into sub-meta files for the training and test sets - metadata_train.csv and metadata_test.csv, containing 13,000 and 100 transcripts, respectively.
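For illustration, a partition along these lines could be produced with a few lines of Python (a hedged sketch only - the repository's ./scripts/prepare_dataset.sh performs the actual split, and which 100 rows form the test set here is an assumption):
```python
# Read the full metadata and split it into test/train sub-meta files.
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()

test_lines, train_lines = lines[:100], lines[100:]   # 100 test rows, ~13,000 training rows (split choice assumed)

with open("LJSpeech-1.1/metadata_test.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(test_lines) + "\n")
with open("LJSpeech-1.1/metadata_train.csv", "w", encoding="utf-8") as f:
    f.write("\n".join(train_lines) + "\n")
```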
### Training process
To accelerate the training performance, preprocessing of alignments between texts and mel-spectrograms is performed prior to the training iterations.
The FastSpeech model requires reference alignments between texts and mel-spectrograms, extracted from an auto-regressive teacher TTS model. Since Tacotron2 is used as the teacher in our implementation, download the NVIDIA [pretrained Tacotron2 checkpoint](https://drive.google.com/file/d/1c5ZTuT7J08wLUoVZ2KkUs_VdZuJ86ZqA/view) and use it for preprocessing the alignments.
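As a hypothetical illustration of the idea (this is not the repository's align_tacotron2.py), per-phoneme durations can be derived from a teacher attention matrix by counting, for each phoneme, how many mel frames attend to it most strongly:
```python
import torch

def durations_from_attention(attention):
    """attention: (mel_frames, phonemes) alignment weights from the teacher model."""
    best_phoneme = attention.argmax(dim=1)                       # phoneme index chosen per mel frame
    return torch.bincount(best_phoneme, minlength=attention.size(1))  # frame count per phoneme

attn = torch.softmax(torch.randn(120, 40), dim=1)                # fake 120-frame, 40-phoneme alignment
print(durations_from_attention(attn).sum())                      # == 120 mel frames in total
```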
Run ```align_tacotron2.py``` to get alignments on the LJSpeech dataset by feed-forwarding through the teacher model. --tacotron2_path sets the Tacotron2 checkpoint file path, and the resulting alignments are stored in --aligns_path. These alignments are then loaded during training.
```
python fastspeech/align_tacotron2.py --dataset_path="./LJSpeech-1.1" --tacotron2_path="tacotron2_statedict.pt" --aligns_path="aligns_ljspeech1.1"
```
You can also preprocess mel-spectrograms for faster training. The resulting mel-spectrograms are stored in --mels_path and loaded during training. If --mels_path is not set, mel-spectrograms are computed on the fly during training.
Run ```ljspeech_dataset.py```:
```
python fastspeech/dataset/ljspeech_dataset.py --dataset_path="./LJSpeech-1.1" --mels_path="mels_ljspeech1.1"
```
#### Accelerated training
The NVIDIA [APEX](https://github.com/NVIDIA/apex) library provides a simple way to obtain up to a 2x speed-up during training, with easy-to-use APIs for AMP and layer fusions.
To use AMP during training, run with --use_amp:
```
python fastspeech/train.py ... --use_amp
```
Another approach for extra speed-up during training is fusing operations. To use fused layer normalization, set --fused_layernorm.
```
python fastspeech/train.py ... --use_amp --fused_layernorm
```
### Inference process
```infer.py``` is provided to test the FastSpeech model on the LJSpeech dataset. --n_iters is the number of batches to infer. To run in FP16, run with --use_fp16.
```
python fastspeech/infer.py --dataset_path="./LJSpeech-1.1" --checkpoint_path="./checkpoints" --n_iters=10 --use_fp16
```
#### Accelerated inference
To accelerate inference with TensorRT, set --hparam=trt.yaml.
```
python fastspeech/infer.py --hparam=trt.yaml --dataset_path="./LJSpeech-1.1" --checkpoint_path="./checkpoints" --n_iters=10 --use_fp16
```
For more details, refer to [accelerating inference with TensorRT](fastspeech/trt/README.md).
#### Generation
To generate waveforms with the WaveGlow vocoder, 1) make sure the [NVIDIA WaveGlow](https://github.com/NVIDIA/waveglow) submodule has been pulled, and 2) download the [pretrained WaveGlow model](https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF) and place it in the home directory, for example, ./waveglow_256channels.pt.
```
git submodule init
git submodule update
```
Run generate.py with:
* --text - an input text or the text file path.
* --results_path - result waveforms directory path. (default=./results).
* --ckpt_file - checkpoint file path. (default checkpoint file is the latest file in --checkpoint_path)
```
python generate.py --waveglow_path="./waveglow_256channels.pt" --text="The more you buy, the more you save."
```
or
```
python generate.py --waveglow_path="./waveglow_256channels.pt" --text=test_sentences.txt
```
Sample result waveforms are [here](https://gitlab-master.nvidia.com/dahn/fastspeech/tree/master/samples).
To generate waveforms with the whole pipeline of FastSpeech and WaveGlow with TensorRT, extract a WaveGlow TRT engine file through https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/trt and run generate.py with --hparam=trt.yaml and --waveglow_engine_path.
```
python generate.py --hparam=trt.yaml --waveglow_path="./waveglow_256channels.pt" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save."
```
Sample result waveforms are [FP32](https://gitlab-master.nvidia.com/dahn/fastspeech/-/tree/master/fastspeech/trt/samples) and [FP16](https://gitlab-master.nvidia.com/dahn/fastspeech/-/tree/master/fastspeech/trt/samples_fp16).
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance, set CUDA_VISIBLE_DEVICES, depending on GPU count:
* for 1 GPU,
```
export CUDA_VISIBLE_DEVICES=0
```
* for 4 GPUs,
```
export CUDA_VISIBLE_DEVICES=0,1,2,3
```
and run on a specific batch size:
* in FP32
```
python fastspeech/train.py --batch_size=BATCH_SIZE
```
* in mixed precision
```
python fastspeech/train.py --batch_size=BATCH_SIZE --use_amp
```
#### Inference performance benchmark
Set CUDA_VISIBLE_DEVICES=0 to use a single GPU,
```
export CUDA_VISIBLE_DEVICES=0
```
and run on a specific batch size:
* in FP32
```
python fastspeech/perf_infer.py --batch_size=BATCH_SIZE
```
* in FP16
```
python fastspeech/perf_infer.py --batch_size=BATCH_SIZE --use_fp16
```
To benchmark the inference performance with the vocoder:
```
python fastspeech/perf_infer.py --batch_size=BATCH_SIZE --with_vocoder --waveglow_path=WAVEGLOW_PATH
```
Finally, to benchmark the inference performance on TensorRT:
```
python fastspeech/perf_infer.py --hparam=trt.yaml --batch_size=BATCH_SIZE
```
* with the vocoder:
```
python fastspeech/perf_infer.py --hparam=trt.yaml --batch_size=BATCH_SIZE --with_vocoder --waveglow_path=WAVEGLOW_PATH
```
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training performance results
Our results were obtained by running the script in [training performance benchmark](#training-performance-benchmark) in the PyTorch-20.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in mels per second) were averaged over an entire training epoch.

| GPUs | Batch size / GPU | Throughput (mels/s) - FP32 | Throughput (mels/s) - mixed precision | Throughput speedup (FP32 to mixed precision) | Multi-GPU weak scaling - FP32 | Multi-GPU weak scaling - mixed precision |
|---|----|--------|--------|------|-----|------|
| 1 | 32 | 31674 | 63431 | 2.00 | 1 | 1 |
| 4 | 32 | 101115 | 162847 | 1.61 | 3.19| 2.57 |
| 8 | 32 | 167650 | 188251 | 1.12 | 5.29| 2.97 |
#### Inference performance results
Our results were obtained by running the script in [inference performance benchmark](#inference-performance-benchmark) in the PyTorch-20.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU and on an NVIDIA T4. The following tables show inference statistics for the FastSpeech and WaveGlow text-to-speech system on PyTorch, and comparisons between frameworks, with batch size 1 in FP16, gathered from 1000 inference runs. Latency is measured from the start of FastSpeech inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor, which indicates how many seconds of speech are generated in one second of compute. The WaveGlow model used is a 256-channel model. The numbers reported below were taken with a moderate input length of 128 characters.
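As a small worked example of how the throughput and RTF columns relate to the measured latency (the generated audio length below is an assumption chosen to roughly match the first FP16 row of the DGX-1 table):
```python
sample_rate = 22050          # LJSpeech sampling rate in Hz
audio_seconds = 7.07         # assumed seconds of audio generated for the 128-character input
avg_latency = 0.2287         # measured average latency in seconds (batch size 1, FP16)

throughput = audio_seconds * sample_rate / avg_latency   # generated audio samples per second
rtf = audio_seconds / avg_latency                        # seconds of speech per second of compute
print(f"throughput = {throughput:,.0f} samples/s, RTF = {rtf:.2f}")   # roughly 681,700 samples/s and 30.9
```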
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
| Batch size | Precision | Avg latency (s) | Std latency(s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up with mixed precision |
|------------|-----------|-----------------|----------------|------------------------------------|------------------------------------|--------------------|---------------------|---------|-------------------------------|
| 1 | FP16 | 0.2287 | 0.001 | 0.2295 | 0.2297 | 0.2303 | 681,773 | 30.92 | 1.50 |
| 4 | FP16 | 0.5003 | 0.0016 | 0.502 | 0.5023 | 0.5032 | 1,244,466 | 14.11 | 2.57 |
| 8 | FP16 | 0.9695 | 0.0023 | 0.9722 | 0.9732 | 0.9748 | 1,284,339 | 7.28 | 2.73 |
| 1 | FP32 | 0.3428 | 0.0016 | 0.3445 | 0.3449 | 0.3458 | 454,833 | 20.63 | 1.00 |
| 4 | FP32 | 1.287 | 0.0039 | 1.2916 | 1.2927 | 1.2954 | 484,558 | 5.50 | 1.00 |
| 8 | FP32 | 2.6481 | 0.0041 | 2.6535 | 2.6549 | 2.657 | 470,992 | 2.67 | 1.00 |

| Framework | Batch size | Precision | Avg latency (s) | Std latency(s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up (PyT - PyT+TRT) |
|-----------|------------|-----------|-----------------|----------------|------------------------------------|------------------------------------|--------------------|---------------------|---------|-------------------------------|
| PyT | 1 | FP16 | 0.2287 | 0.001 | 0.2295 | 0.2297 | 0.2303 | 681,773 | 30.92 | 1 |
| PyT+TRT | 1 | FP16 | 0.1115 | 0.0007 | 0.1122 | 0.1124 | 0.1135 | 1,398,343 | 63.42 | 2.05 |
| PyT | 4 | FP16 | 0.5003 | 0.0016 | 0.502 | 0.5023 | 0.5032 | 1,244,466 | 14.11 | 1 |
| PyT+TRT | 4 | FP16 | 0.3894 | 0.0019 | 0.3917 | 0.3925 | 0.3961 | 1,599,005 | 18.13 | 1.28 |
##### Inference performance: NVIDIA T4
| Batch size | Precision | Avg latency (s) | Std latency(s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up with mixed precision |
|------------|-----------|-----------------|----------------|------------------------------------|------------------------------------|--------------------|---------------------|---------|-------------------------------|
| 1 | FP16 | 0.9345 | 0.0294 | 0.9662 | 0.9723 | 0.9806 | 167,003 | 7.57 | 1.28 |
| 4 | FP16 | 3.7815 | 0.0877 | 3.9078 | 3.9393 | 3.9632 | 164,730 | 1.87 | 1.28 |
| 8 | FP16 | 7.5722 | 0.1764 | 7.8273 | 7.8829 | 7.9286 | 164,530 | 0.93 | 1.21 |
| 1 | FP32 | 1.1952 | 0.0368 | 1.2438 | 1.2487 | 1.2589 | 130,572 | 5.92 | 1.00 |
| 4 | FP32 | 4.8578 | 0.1215 | 5.0343 | 5.0651 | 5.1027 | 128,453 | 1.46 | 1.00 |
| 8 | FP32 | 9.1563 | 0.4114 | 9.4049 | 9.4571 | 9.5194 | 136,367 | 0.77 | 1.00 |

| Framework | Batch size | Precision | Avg latency (s) | Std latency(s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up (PyT - PyT+TRT) |
|-----------|------------|-----------|-----------------|----------------|------------------------------------|------------------------------------|--------------------|---------------------|---------|-------------------------------|
| PyT | 1 | FP16 | 0.9345 | 0.0294 | 0.9662 | 0.9723 | 0.9806 | 167,003 | 7.57 | 1 |
| PyT+TRT | 1 | FP16 | 0.3234 | 0.0058 | 0.3304 | 0.3326 | 0.3358 | 482,286 | 21.87 | 2.89 |
## Release notes
### Changelog
July 2020
- Initial release
### Known issues
There are no known issues in this release.

Some files were not shown because too many files have changed in this diff.