Tacotron2+WaveGlow Inference Using TensorRT Inference Server with TensorRT
This is a subfolder of the Tacotron2 for PyTorch repository that provides scripts to deploy high-performance inference using NVIDIA TensorRT Inference Server with a custom TensorRT backend.
Model overview
The Tacotron2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech. In this implementation, the Tacotron2 network is split into three sub-networks: the encoder, the decoder, and the postnet. This is followed by WaveGlow as a vocoder, and a Denoiser network that uses an STFT to remove noise from the audio output. More information on the Tacotron2 and WaveGlow architectures, as well as on training, can be found in the Tacotron2 PyTorch README.
Tacotron2 plugins
Because the layers in Tacotron2's decoder are quite small, many deep learning frameworks fail to achieve high throughput at a batch size of one, as the overhead associated with executing each of these small layers can dominate the runtime.
TensorRT supports custom layers through its plugin interface, which not only allows custom operations, but also allows developers to manually tune and/or fuse specific layers in their networks while still using TensorRT to perform automated optimizations on the other layers, and to manage and execute the entire network. This implementation uses several plugins for Tacotron2's decoder, including fusing layers of the Prenet and Attention, as well as creating LSTM Cell kernels optimized specifically for the dimensions used in Tacotron2.
Setup
Requirements
Building and running the container requires docker, nvidia-docker, and bash.
In addition, the host machine must have a Volta- or Turing-based GPU.
Quick Start Guide
Clone the repository
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/trtis_cpp
Export the models
You can either train the models yourself, or download pretrained checkpoints from NGC and copy them to the ./checkpoints directory:
mkdir checkpoints
cp <Tacotron2_checkpoint> ./checkpoints/
cp <WaveGlow_checkpoint> ./checkpoints/
Next you will need to export the PyTorch checkpoints so that they can be used to build TensorRT engines. This can be done with the export_weights.sh script:
mkdir models
./export_weights.sh checkpoints/tacotron2_1032590_6000_amp checkpoints/waveglow_1076430_14000_amp models/
Set up the Triton server
./build_trtis.sh models/tacotron2.json models/waveglow.onnx models/denoiser.json
This will take some time as TensorRT tries out different tactics for best performance while building the engines.
Set up the Triton client
Next you need to build the client Docker container. To do this, enter the trtis_client directory and run the build_trtis_client.sh script:
cd trtis_client
./build_trtis_client.sh
cd ..
Run the Triton server
To run the server locally, use the run_trtis_server.sh script:
./run_trtis_server.sh
You can use the NVIDIA_VISIBLE_DEVICES environment variable to set which GPUs the Triton server sees.
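For example, to restrict the server to specific GPUs, export the variable before launching; the launch script's child processes inherit it. The GPU indices below are illustrative, so adjust them for your machine:

```shell
# Illustrative: expose only GPUs 0 and 1 to the Triton server,
# then launch it with ./run_trtis_server.sh as usual.
export NVIDIA_VISIBLE_DEVICES=0,1

# Any child process (such as the server launch script) inherits the setting:
sh -c 'echo "GPUs visible to the container: $NVIDIA_VISIBLE_DEVICES"'
# prints: GPUs visible to the container: 0,1
```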
Run the Triton client
Leave the server running. In another terminal, type:
cd trtis_client/
./run_trtis_client.sh phrases.txt
This will generate one WAV file per line in phrases.txt, named after the line number (e.g., 1.wav through 8.wav for an 8-line file), in the audio/ directory. It is important that each line in the file end with a period, or Tacotron2 may fail to detect the end of the phrase.
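A minimal input file can be created as follows. The phrases here are placeholders; substitute your own text, keeping the trailing periods:

```shell
# Create a sample phrases.txt; note that each line ends with a period.
printf '%s\n' \
  "The quick brown fox jumps over the lazy dog." \
  "Speech synthesis with TensorRT can run faster than real time." \
  > phrases.txt

# Running the client on this file would produce audio/1.wav and audio/2.wav.
cat phrases.txt   # prints the two phrases
```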
Advanced
Code structure
The src/ directory contains the following sub-directories:
- trtis: code for the custom Triton backend.
- trt/tacotron2: the Tacotron2 implementation in TensorRT.
- trt/waveglow: the WaveGlow implementation in TensorRT.
- trt/denoiser: the Denoiser (STFT) implementation in TensorRT.
- trt/plugins: plugins used by the TensorRT engines.
The trtis_client/ directory contains the code for running the client.
Precision
By default, the ./build_trtis.sh script builds the TensorRT engines with FP16 mode enabled, which allows some operations to be performed in lower precision in order to increase throughput. To build engines that use only FP32 precision, add 0 to ./build_trtis.sh's arguments:
./build_trtis.sh models/tacotron2.json models/waveglow.onnx models/denoiser.json 0
Performance
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following tables show inference statistics for the Tacotron2 and WaveGlow text-to-speech system. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor, which tells how many seconds of speech are generated in 1 second of processing time. For all tests in these tables, we used WaveGlow with 256 residual channels.
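As a concrete check of these definitions: the real-time factor is the average length of generated audio divided by the average latency. Using the FP16, batch-size-1 TensorRT row from the table below (0.40 s latency, 6.72 s of audio):

```shell
# RTF = seconds of audio generated per second of processing time.
# Values are the rounded figures from the FP16, batch 1 TensorRT row.
awk -v latency=0.40 -v audio=6.72 \
  'BEGIN { printf "RTF = %.1f\n", audio / latency }'
# prints: RTF = 16.8
```

The table reports 16.77 rather than 16.8 because it divides the unrounded measurements.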
Performance on NVIDIA T4
TensorRT w/ Plugins vs. PyTorch
Latency in this table is measured from just before the input sequence starts being copied from host memory to the GPU, to just after the generated audio finishes being copied back to host memory. That is, it covers only what takes place in the custom backend inside of Triton.
Framework | Batch size | Input length | Precision | Avg latency (s) | Latency std (s) | Latency interval 90% (s) | Latency interval 95% (s) | Latency interval 99% (s) | Throughput (samples/sec) | Speed-up vs. PyT FP32 | Speed-up vs. PyT FP16 | Avg mels generated | Avg audio length (s) | Avg RTF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TRT w/ plugins | 1 | 128 | FP16 | 0.40 | 0.00 | 0.40 | 0.40 | 0.40 | 369,862 | 4.27x | 3.90x | 579 | 6.72 | 16.77 |
TRT w/ plugins | 1 | 128 | FP32 | 1.20 | 0.01 | 1.21 | 1.21 | 1.21 | 123,922 | 1.43x | 1.31x | 581 | 6.74 | 5.62 |
PyTorch | 1 | 128 | FP16 | 1.63 | 0.07 | 1.71 | 1.73 | 1.81 | 94,758 | 1.10x | 1.00x | 601 | 6.98 | 4.30 |
PyTorch | 1 | 128 | FP32 | 1.77 | 0.08 | 1.88 | 1.92 | 2.00 | 86,705 | 1.00x | 0.91x | 600 | 6.96 | 3.92 |
That is a 4.27x speedup when using TensorRT FP16 with plugins when compared to PyTorch FP32, and still a 3.90x speedup when compared to PyTorch FP16.
The TensorRT entries in this table can be reproduced by using the output of the Triton server, when performing the steps for Running the benchmark below. The PyTorch entries can be reproduced by following the instructions here.
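The speed-up columns follow directly from the throughput column. For the TensorRT FP16 row versus the two PyTorch rows (throughput values in samples/sec, taken from the table above):

```shell
# Speed-up = TRT throughput / PyTorch throughput, from the table above.
awk 'BEGIN {
  trt_fp16 = 369862; pyt_fp32 = 86705; pyt_fp16 = 94758
  printf "vs. PyTorch FP32: %.2fx\n", trt_fp16 / pyt_fp32
  printf "vs. PyTorch FP16: %.2fx\n", trt_fp16 / pyt_fp16
}'
# prints: vs. PyTorch FP32: 4.27x
#         vs. PyTorch FP16: 3.90x
```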
TensorRT w/ Plugins in Triton
Latency in this table is measured from the client sending the request, to it receiving back the generated audio. This includes network time, request/response formatting time, as well as the backend time shown in the above section.
Batch size | Input length | Precision | Avg latency (s) | Latency std (s) | Latency interval 90% (s) | Latency interval 95% (s) | Latency interval 99% (s) | Avg mels generated | Avg audio length (s) | Avg RTF |
---|---|---|---|---|---|---|---|---|---|---|
1 | 128 | FP16 | 0.42 | 0.00 | 0.42 | 0.42 | 0.42 | 579 | 6.72 | 15.95 |
8 | 128 | FP16 | 2.55 | 0.01 | 2.56 | 2.56 | 2.57 | 571 | 6.62 | 2.60 |
1 | 128 | FP32 | 1.22 | 0.01 | 1.22 | 1.23 | 1.23 | 581 | 6.75 | 5.54 |
8 | 128 | FP32 | 8.64 | 0.01 | 8.68 | 8.69 | 8.71 | 569 | 6.61 | 0.72 |
To reproduce this table, see Running the benchmark below.
Running the benchmark
Once you have performed the steps in Set up the Triton server and Set up the Triton client, you can run the benchmark by starting the Triton server via:
./run_trtis_server.sh
Leave the server running, and in another terminal run the trtis_client/run_trtis_benchmark_client.sh script:
cd trtis_client/
./run_trtis_benchmark_client.sh <batch size>
Replace <batch size> with the desired batch size between 1 and 32. The engines are built with a maximum batch size of 32 in the ./build_trtis.sh script.
After some time this should produce output like:
Performed 1000 runs.
batch size = 1
avg latency (s) = 0.421375
latency std (s) = 0.00170839
latency interval 50% (s) = 0.421553
latency interval 90% (s) = 0.422805
latency interval 95% (s) = 0.423273
latency interval 99% (s) = 0.424153
average mels generated = 582
average audio generated (s) = 6.72218
average real-time factor = 15.953
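As in the tables above, the reported real-time factor is simply the average audio length divided by the average latency; checking it against the sample output above (these values are from the sample run, not a guarantee for your hardware):

```shell
# real-time factor = average audio generated (s) / average latency (s)
awk -v audio=6.72218 -v latency=0.421375 \
  'BEGIN { printf "RTF = %.3f\n", audio / latency }'
# prints: RTF = 15.953
```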