Tacotron2+WaveGlow Inference Using TensorRT Inference Server with TensorRT
This is a subfolder of the Tacotron2 for PyTorch repository that provides scripts to deploy high-performance inference using NVIDIA TensorRT Inference Server with a custom TensorRT backend.
Model overview
The Tacotron2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech. In this implementation, the Tacotron2 network is split into three sub-networks: the encoder, the decoder, and the postnet. This is followed by WaveGlow as a vocoder, and a Denoiser network that uses an STFT to remove noise from the audio output. More information on the Tacotron2 and WaveGlow architectures, as well as on training, can be found in the Tacotron2 PyTorch README.
Tacotron2 plugins
Because the layers in Tacotron2's decoder are quite small, many deep learning frameworks fail to achieve high throughput at a batch size of one, as the overhead associated with executing each of these small layers can dominate the runtime.
TensorRT supports custom layers through its plugin interface, which not only allows custom operations, but also allows developers to manually tune and/or fuse specific layers in their networks while still using TensorRT to perform automated optimizations on the other layers, and to manage and execute the entire network. This implementation uses several plugins for Tacotron2's decoder, including fusing layers of the Prenet and Attention, as well as creating LSTM Cell kernels optimized specifically for the dimensions used in Tacotron2.
Setup
Requirements
Building and running the container requires docker, nvidia-docker, and bash.
In addition, the host machine must have a Volta- or Turing-based GPU.
Quick Start Guide
Clone the repository
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/trtis_cpp
Export the models
You can either train the models yourself, or download pretrained checkpoints from NGC and copy them to the ./checkpoints directory:
mkdir checkpoints
cp <Tacotron2_checkpoint> ./checkpoints/
cp <WaveGlow_checkpoint> ./checkpoints/
Next you will need to export the PyTorch checkpoints so that they can be used to build TensorRT engines. This can be done with the export_weights.sh script:
mkdir models
./export_weights.sh checkpoints/tacotron2_1032590_6000_amp checkpoints/waveglow_1076430_14000_amp models/
Set up the Triton server
./build_trtis.sh models/tacotron2.json models/waveglow.onnx models/denoiser.json
This will take some time as TensorRT tries out different tactics for best performance while building the engines.
Set up the Triton client
Next you need to build the client Docker container. To do this, enter the trtis_client directory and run the build_trtis_client.sh script:
cd trtis_client
./build_trtis_client.sh
cd ..
Run the Triton server
To run the server locally, use the run_trtis_server.sh script:
./run_trtis_server.sh
You can use the NVIDIA_VISIBLE_DEVICES environment variable to set which GPUs the Triton server sees.
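For example, to restrict the server to specific GPUs, export the variable before launching; the launch script's child processes inherit it. The GPU indices below are illustrative, so adjust them for your machine:

```shell
# Illustrative: expose only GPUs 0 and 1 to the Triton server,
# then launch it with ./run_trtis_server.sh as usual.
export NVIDIA_VISIBLE_DEVICES=0,1

# Any child process (such as the server launch script) inherits the setting:
sh -c 'echo "GPUs visible to the container: $NVIDIA_VISIBLE_DEVICES"'
# prints: GPUs visible to the container: 0,1
```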
Run the Triton client
Leave the server running. In another terminal, type:
cd trtis_client/
./run_trtis_client.sh phrases.txt
This will generate one WAV file per line in phrases.txt, named after the line number (e.g., 1.wav through 8.wav for an 8-line file), in the audio/ directory. It is important that each line in the file end with a period, or Tacotron2 may fail to detect the end of the phrase.
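A minimal input file can be created as follows. The phrases here are placeholders; substitute your own text, keeping the trailing periods:

```shell
# Create a sample phrases.txt; note that each line ends with a period.
printf '%s\n' \
  "The quick brown fox jumps over the lazy dog." \
  "Speech synthesis with TensorRT can run faster than real time." \
  > phrases.txt

# Running the client on this file would produce audio/1.wav and audio/2.wav.
cat phrases.txt   # prints the two phrases
```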
Advanced
Code structure
The src/ directory contains the following sub-directories:
- trtis: code for the custom Triton backend.
- trt/tacotron2: the Tacotron2 implementation in TensorRT.
- trt/waveglow: the WaveGlow implementation in TensorRT.
- trt/denoiser: the Denoiser (STFT) implementation in TensorRT.
- trt/plugins: plugins used by the TensorRT engines.
The trtis_client/ directory contains the code for running the client.
Precision
By default, the ./build_trtis.sh script builds the TensorRT engines with FP16 mode enabled, which allows some operations to be performed in lower precision in order to increase throughput. To build engines that use only FP32 precision, add 0 to ./build_trtis.sh's arguments:
./build_trtis.sh models/tacotron2.json models/waveglow.onnx models/denoiser.json 0
Performance
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following tables show inference statistics for the Tacotron2 and WaveGlow text-to-speech system. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor, which tells how many seconds of speech are generated in 1 second of processing time. For all tests in these tables, we used WaveGlow with 256 residual channels.
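As a concrete check of these definitions: the real-time factor is the average length of generated audio divided by the average latency. Using the FP16, batch-size-1 TensorRT row from the table below (0.40 s latency, 6.72 s of audio):

```shell
# RTF = seconds of audio generated per second of processing time.
# Values are the rounded figures from the FP16, batch 1 TensorRT row.
awk -v latency=0.40 -v audio=6.72 \
  'BEGIN { printf "RTF = %.1f\n", audio / latency }'
# prints: RTF = 16.8
```

The table reports 16.77 rather than 16.8 because it divides the unrounded measurements.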
Performance on NVIDIA T4
TensorRT w/ Plugins vs. PyTorch
Latency in this table is measured from just before the input sequence starts being copied from host memory to the GPU, to just after the generated audio finishes being copied back to host memory. That is, it covers only what takes place in the custom backend inside of Triton.
Framework | Batch size | Input length | Precision | Avg latency (s) | Latency std (s) | Latency interval 90% (s) | Latency interval 95% (s) | Latency interval 99% (s) | Throughput (samples/sec) | Speed-up vs. PyT FP32 | Speed-up vs. PyT FP16 | Avg mels generated | Avg audio length (s) | Avg RTF |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TRT w/ plugins | 1 | 128 | FP16 | 0.40 | 0.00 | 0.40 | 0.40 | 0.40 | 369,862 | 4.27x | 3.90x | 579 | 6.72 | 16.77 |
TRT w/ plugins | 1 | 128 | FP32 | 1.20 | 0.01 | 1.21 | 1.21 | 1.21 | 123,922 | 1.43x | 1.31x | 581 | 6.74 | 5.62 |
PyTorch | 1 | 128 | FP16 | 1.63 | 0.07 | 1.71 | 1.73 | 1.81 | 94,758 | 1.10x | 1.00x | 601 | 6.98 | 4.30 |
PyTorch | 1 | 128 | FP32 | 1.77 | 0.08 | 1.88 | 1.92 | 2.00 | 86,705 | 1.00x | 0.91x | 600 | 6.96 | 3.92 |
That is a 4.27x speedup when using TensorRT FP16 with plugins when compared to PyTorch FP32, and still a 3.90x speedup when compared to PyTorch FP16.
The TensorRT entries in this table can be reproduced by using the output of the Triton server, when performing the steps for Running the benchmark below. The PyTorch entries can be reproduced by following the instructions here.
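The speed-up columns follow directly from the throughput column. For the TensorRT FP16 row versus the two PyTorch rows (throughput values in samples/sec, taken from the table above):

```shell
# Speed-up = TRT throughput / PyTorch throughput, from the table above.
awk 'BEGIN {
  trt_fp16 = 369862; pyt_fp32 = 86705; pyt_fp16 = 94758
  printf "vs. PyTorch FP32: %.2fx\n", trt_fp16 / pyt_fp32
  printf "vs. PyTorch FP16: %.2fx\n", trt_fp16 / pyt_fp16
}'
# prints: vs. PyTorch FP32: 4.27x
#         vs. PyTorch FP16: 3.90x
```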
TensorRT w/ Plugins in Triton
Latency in this table is measured from the client sending the request, to it receiving back the generated audio. This includes network time, request/response formatting time, as well as the backend time shown in the above section.
Batch size | Input length | Precision | Avg latency (s) | Latency std (s) | Latency interval 90% (s) | Latency interval 95% (s) | Latency interval 99% (s) | Avg mels generated | Avg audio length (s) | Avg RTF |
---|---|---|---|---|---|---|---|---|---|---|
1 | 128 | FP16 | 0.42 | 0.00 | 0.42 | 0.42 | 0.42 | 579 | 6.72 | 15.95 |
8 | 128 | FP16 | 2.55 | 0.01 | 2.56 | 2.56 | 2.57 | 571 | 6.62 | 2.60 |
1 | 128 | FP32 | 1.22 | 0.01 | 1.22 | 1.23 | 1.23 | 581 | 6.75 | 5.54 |
8 | 128 | FP32 | 8.64 | 0.01 | 8.68 | 8.69 | 8.71 | 569 | 6.61 | 0.72 |
To reproduce this table, see Running the benchmark below.
Running the benchmark
Once you have performed the steps in Set up the Triton server and Set up the Triton client, you can run the benchmark by starting the Triton server via:
./run_trtis_server.sh
Leave the server running, and in another terminal run the trtis_client/run_trtis_benchmark_client.sh script:
cd trtis_client/
./run_trtis_benchmark_client.sh <batch size>
Replace <batch size> with the desired batch size between 1 and 32. The engines are built with a maximum batch size of 32 in the ./build_trtis.sh script.
After some time this should produce output like:
Performed 1000 runs.
batch size = 1
avg latency (s) = 0.421375
latency std (s) = 0.00170839
latency interval 50% (s) = 0.421553
latency interval 90% (s) = 0.422805
latency interval 95% (s) = 0.423273
latency interval 99% (s) = 0.424153
average mels generated = 582
average audio generated (s) = 6.72218
average real-time factor = 15.953
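As in the tables above, the reported real-time factor is simply the average audio length divided by the average latency; checking it against the sample output above (these values are from the sample run, not a guarantee for your hardware):

```shell
# real-time factor = average audio generated (s) / average latency (s)
awk -v audio=6.72218 -v latency=0.421375 \
  'BEGIN { printf "RTF = %.3f\n", audio / latency }'
# prints: RTF = 15.953
```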