DeepLearningExamples/CUDA-Optimized/FastSpeech

README.md

FastSpeech For PyTorch and TensorRT

This repository provides a script and recipe to train the FastSpeech model to achieve state-of-the-art accuracy and is tested and maintained by NVIDIA. It also provides an optimization in TensorRT to accelerate inference performance without loss of accuracy.

For more details, see the talk and slides presented at GTC 2020.

Table Of Contents

Model overview

The FastSpeech model is one of the state-of-the-art Text-to-Mel models. It was developed by Microsoft, and its paper was published at NeurIPS 2019. This model uses the WaveGlow vocoder model to generate waveforms.

One of the main strengths of this model is that inference is extremely fast. This is possible because it requires only a single feed-forward pass, with no recurrence or auto-regression in the model. Another benefit is that it is robust to errors, meaning it produces no repeated or skipped words.

Our implementation of the FastSpeech model differs from the model described in the paper. Our implementation uses Tacotron2 instead of Transformer TTS as a teacher model to get alignments between texts and mel-spectrograms.

This FastSpeech model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results up to 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. Also, inference is accelerated by running on TensorRT, up to 3x faster than running in the PyTorch framework on NVIDIA Volta and Turing GPUs. The models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

FastSpeech is a Text-to-Mel model that is not based on any recurrent blocks or autoregressive logic. It consists of three parts: Phoneme-Side blocks, the Length Regulator, and Mel-Side blocks. The Phoneme-Side blocks contain an embedding layer, six Feed-Forward Transformer (FFT) blocks, and a positional encoding layer. The Length Regulator contains a nested neural model, the Duration Predictor. The Mel-Side blocks are almost identical to the Phoneme-Side blocks, except for a linear layer at the end.

The FFT Block is a variant of the Transformer block. It contains a multi-head attention layer with a residual connection, two layers of 1D-convolutional network with residual connections and two Layer Normalization layers.
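
For illustration, a minimal PyTorch sketch of a block with this structure could look like the following (the layer sizes are typical FastSpeech hyperparameters used here as assumptions, not the repository's exact implementation):

import torch.nn as nn

class FFTBlock(nn.Module):
    # Sketch of one FFT block: self-attention and a 1D-convolutional feed-forward,
    # each followed by a residual connection and Layer Normalization.
    def __init__(self, d_model=384, n_heads=2, d_hidden=1536, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_hidden, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        a = x.transpose(0, 1)                 # nn.MultiheadAttention expects (seq_len, batch, d_model)
        attn_out, _ = self.attn(a, a, a)
        x = self.norm1(x + attn_out.transpose(0, 1))             # residual + LayerNorm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, seq_len)
        return self.norm2(x + conv_out)                          # residual + LayerNorm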

The Length Regulator is the key block in the FastSpeech model. One of the biggest difficulties in TTS is handling variable-length data, which is why most recent deep neural TTS models rely on recurrent blocks or autoregressive logic. The Length Regulator handles variable lengths in a completely different way: it controls the output length by repeating elements of the sequence according to predicted durations. The Duration Predictor inside the Length Regulator predicts each phoneme's duration; it is itself a small neural model consisting of two 1D-convolution layers and a fully connected layer. Finally, the Length Regulator expands each element of the sequence by its predicted duration.
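
The expansion step itself can be sketched in a few lines of PyTorch (a simplified illustration that ignores batching and padding; it is not the repository's implementation):

import torch

def length_regulate(phoneme_hidden, durations):
    # phoneme_hidden: (text_len, d_model) hidden states from the Phoneme-Side blocks
    # durations:      (text_len,) integer number of mel frames per phoneme
    # Returns a (sum(durations), d_model) mel-length sequence.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

# Example: a 3-phoneme sequence with predicted durations [2, 1, 3] expands to 6 mel frames.
hidden = torch.randn(3, 384)
mel_hidden = length_regulate(hidden, torch.tensor([2, 1, 3]))
print(mel_hidden.shape)  # torch.Size([6, 384])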

Figure 1. Architecture of the FastSpeech model. Taken from the FastSpeech paper.

Default configuration

This FastSpeech model supports multi-GPU and mixed precision training with dynamic loss scaling (see Apex code here), as well as mixed precision inference. To speed up FastSpeech training, reference mel-spectrograms and alignments between texts and mel-spectrograms are generated during a preprocessing step and read directly from disk during training, instead of being generated during training. Also, this model utilizes fused layer normalization supported by Apex (see here) to get extra speed-up during training and inference.
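
For example, the fused layer normalization from Apex is a drop-in replacement for torch.nn.LayerNorm; a minimal usage sketch (the hidden size is an assumption for the example):

try:
    # Apex provides a fused CUDA kernel with the same interface as torch.nn.LayerNorm.
    from apex.normalization import FusedLayerNorm as LayerNorm
except ImportError:
    # Fall back to the standard implementation if Apex is not available.
    from torch.nn import LayerNorm

layer_norm = LayerNorm(384)  # normalized hidden size; 384 is an assumed value for the example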

Inference is accelerated by our implementation using the TensorRT Python API (see here). Custom CUDA/C++ plugins are provided for some layers to implement complex operations of the model in TensorRT and to improve inference performance. We also provide a multi-engine inference implementation as an experimental feature to further improve inference performance when dealing with variable input lengths. For more details, refer to running on TensorRT.
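
The multi-engine idea can be illustrated as follows: several engines are built for different maximum input lengths, and at inference time the smallest engine that fits the input is selected (a conceptual sketch with hypothetical names, not the repository's actual code):

# Hypothetical illustration of selecting a TensorRT engine by input length.
ENGINE_BUCKETS = [32, 64, 128, 256]   # assumed maximum text lengths the engines were built for

def pick_engine(engines, text_len):
    # Return the engine built for the smallest bucket that still fits the input.
    for max_len, engine in zip(ENGINE_BUCKETS, engines):
        if text_len <= max_len:
            return engine, max_len
    raise ValueError("Input length %d exceeds the largest engine bucket." % text_len)

# The input is then padded up to the chosen bucket length before running that engine.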

In summary, the following features were implemented in this model:

  • Data-parallel multi-GPU training
  • Dynamic loss scaling with backoff for Tensor Cores (mixed precision) training
  • Accelerated inference on TensorRT using custom plugins and multi-engines approach

Feature support matrix

The following features are supported by this model:

Feature                         | FastSpeech
Automatic mixed precision (AMP) | Yes
TensorRT inferencing            | Yes

Features

Automatic Mixed Precision (AMP) - AMP is a tool that enables Tensor Core-accelerated training. For more information, refer to APEX AMP docs.
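
A minimal Apex AMP training loop looks like the following sketch (build_model and data_loader are placeholders; in this repository AMP is enabled through the --use_amp flag rather than written by hand):

import torch
from apex import amp

model = build_model().cuda()                       # placeholder model constructor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Patch the model and optimizer for mixed precision with dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for batch in data_loader:                          # placeholder data loader
    optimizer.zero_grad()
    loss = model(batch)
    # Scale the loss so FP16 gradients do not underflow, then backpropagate.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()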

TensorRT - a library for high-performance inference on NVIDIA GPUs, improving latency, throughput, power efficiency, and memory consumption. It builds optimized runtime engines by selecting the most performant kernels & algorithms, fusing layers, and using mixed precision. For more information, refer to github.com/NVIDIA/TensorRT.
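
As a general reference, building an FP16 engine from an ONNX model with the TensorRT Python API follows the pattern below (sketched against the TensorRT 7-era interface; this is not the repository's build code):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(onnx_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX graph into a TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("Failed to parse %s" % onnx_path)

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30    # 1 GB of scratch space for kernel selection
    config.set_flag(trt.BuilderFlag.FP16)  # allow mixed precision kernels
    return builder.build_engine(network, config)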

Setup

The following section lists the requirements that you need to meet in order to start training the FastSpeech model.

Requirements

This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:

For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned NVIDIA Container Support Matrix.

Quick Start Guide

To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the FastSpeech model on the LJSpeech (https://keithito.com/LJ-Speech-Dataset) dataset. For the specifics concerning training and inference, see the Advanced section.

  1. Clone the repository,

    git clone https://github.com/NVIDIA/DeepLearningExamples.git
    cd DeepLearningExamples/CUDA-Optimized/FastSpeech
    
  2. Download and preprocess the dataset. Data is downloaded to the ./LJSpeech-1.1 directory (on the host). The ./LJSpeech-1.1 directory is mounted to the /workspace/fastspeech/LJSpeech-1.1 location in the NGC container.

    bash scripts/prepare_dataset.sh
    
  3. Build the FastSpeech PyTorch NGC container.

    bash scripts/docker/build.sh
    
  4. Start an interactive session in the NGC container to run training/inference. After you build the container image, you can start an interactive CLI session with:

    bash scripts/docker/interactive.sh
    
  5. Start training. To preprocess mel-spectrograms for faster training, first run:

    python fastspeech/dataset/ljspeech_dataset.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1"
    

    The preprocessed mel-spectrograms are stored in the ./mels_ljspeech1.1 directory.

    Next, preprocess the alignments on the LJSpeech dataset by feeding it through the teacher model. Download the NVIDIA pretrained Tacotron2 checkpoint to get a pretrained teacher model, set --tacotron2_path to the Tacotron2 checkpoint file path, and the resulting alignments will be stored in the path given by --aligns_path.

    python fastspeech/align_tacotron2.py --dataset_path="./LJSpeech-1.1" --tacotron2_path="tacotron2_statedict.pt" --aligns_path="aligns_ljspeech1.1"
    

    The preprocessed alignments are stored in the ./aligns_ljspeech1.1 directory. For more information, refer to the training process section.

    Finally, run the training script:

    python fastspeech/train.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1" --aligns_path="./aligns_ljspeech1.1" --log_path="./logs" --checkpoint_path="./checkpoints"
    

    The checkpoints and TensorBoard log files are stored in the ./checkpoints and ./logs directories, respectively.

    Additionally, to accelerate the training using AMP, run with --use_amp:

    python fastspeech/train.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1" --aligns_path="./aligns_ljspeech1.1" --log_path="./logs" --checkpoint_path="./checkpoints" --use_amp
    
  6. Start generation. To generate waveforms with the WaveGlow vocoder, get the pretrained WaveGlow model from NGC and place it in the home directory, for example, ./nvidia_waveglow256pyt_fp16.

    After you have trained the FastSpeech model, you can perform generation using the checkpoint stored in ./checkpoints. Then run:

    python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="./test_sentences.txt"
    

    The script automatically loads the latest checkpoint (if one exists), or you can pass a checkpoint file through --ckpt_file. It loads the input texts from ./test_sentences.txt and stores the results in the ./results directory. You can also set the result directory path with --results_path.

    You can also run with a sample text:

    python generate.py  --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="The more you buy, the more you save."
    
  7. Accelerate generation (inference of FastSpeech and WaveGlow) with TensorRT. Set the parameters config file with --hparam=trt.yaml to enable TensorRT inference mode. To prepare for running WaveGlow on TensorRT, first get an ONNX file via DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/tensorrt, convert it to a TensorRT engine using scripts/waveglow/convert_onnx2trt.py, and copy the engine into the home directory, for example, ./waveglow.fp16.trt. Then run with --waveglow_engine_path:

    python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" --waveglow_engine_path="waveglow.fp16.trt"
    

Advanced

The following sections provide greater details of the dataset, running training and inference, and the training results.

Scripts and sample code

The ./fastspeech directory contains models and scripts for data processing, training/inference, and estimating performance.

  • train.py: the FastSpeech model training script.
  • infer.py: the FastSpeech model inference script.
  • perf_infer.py: the script for estimating inference performance.
  • align_tacotron2.py: the script for preprocessing alignments.

The ./fastspeech/trt directory contains the FastSpeech TensorRT model, inferencer and plugins for TensorRT.

Finally, ./generate.py is the script for generating waveforms with a vocoder.

Parameters

All parameters of the FastSpeech model and for training/inference are defined in parameters config files in ./fastspeech/hparams.

The default config file, base.yaml, contains the most common parameters including paths, audio processing, and model hyperparams. The default config file for training, train.yaml, contains parameters used during training such as learning rate, batch size, and number of steps. And the default config file for inference, infer.yaml, contains parameters required for inference including batch size and usage of half precision. For more details, refer to the config files, i.e., base.yaml, train.yaml, and infer.yaml in ./fastspeech/hparams.

You can also define a new config file by overriding the default config, and set the config file via a command-line option --hparam, for example:

# File name: ./fastspeech/hparams/my_train.yaml

# Inherit all parameters from train.yaml.
parent_yaml: "train.yaml"

# Override the learning rate.
learning_rate: 0.0005

Then run with the new config file:

python fastspeech/train.py --hparam=my_train.yaml ...

Command-line options

To see the full list of available options and their descriptions, use the -- -h or -- --help command-line option, for example:

python fastspeech/train.py -- -h

Although it will not display all parameters defined in the config files, you can override any parameters in the config files, for example:

python fastspeech/train.py ... --batch_size=8 --final_steps=64000

Getting the data

The FastSpeech model was trained on the LJSpeech-1.1 dataset. This repository contains the ./scripts/prepare_dataset.sh script which will automatically download and extract the whole dataset. By default, data will be extracted to the ./LJSpeech-1.1 directory. The dataset directory contains a README file, a wavs directory with all audio samples, and a file metadata.csv that contains audio file names and the corresponding transcripts.

Dataset guidelines

The LJSpeech dataset has 13,100 clips that amount to about 24 hours of speech. Since the original dataset keeps all transcripts in the metadata.csv file, the ./scripts/prepare_dataset.sh script partitions metadata.csv into sub-meta files for the training and test sets - metadata_train.csv and metadata_test.csv, containing 13,000 and 100 transcripts, respectively.
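
A minimal sketch of this kind of split (assuming a simple hold-out of the last 100 lines; the actual selection is done by ./scripts/prepare_dataset.sh and may differ):

# Hypothetical sketch of partitioning metadata.csv into train/test sub-meta files.
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    lines = f.readlines()

with open("LJSpeech-1.1/metadata_train.csv", "w", encoding="utf-8") as f:
    f.writelines(lines[:-100])            # 13,000 transcripts for training
with open("LJSpeech-1.1/metadata_test.csv", "w", encoding="utf-8") as f:
    f.writelines(lines[-100:])            # 100 transcripts held out for testing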

Training process

To accelerate the training performance, preprocessing of alignments between texts and mel-spectrograms is performed prior to the training iterations.

The FastSpeech model requires reference alignments between texts and mel-spectrograms extracted from an auto-regressive TTS teacher model. As Tacotron2 is used as the teacher in our implementation, download the NVIDIA pretrained Tacotron2 checkpoint and use it for preprocessing the alignments.

Run align_tacotron2.py to get alignments on the LJSpeech dataset by feeding it through the teacher model. Set --tacotron2_path to the Tacotron2 checkpoint file path; the resulting alignments are stored in the path given by --aligns_path and loaded during training.

python fastspeech/align_tacotron2.py --dataset_path="./LJSpeech-1.1" --tacotron2_path="tacotron2_statedict.pt" --aligns_path="aligns_ljspeech1.1"
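
Conceptually, each phoneme's duration comes from the teacher's attention alignment: every mel frame is assigned to the phoneme it attends to most strongly, and the duration of a phoneme is the number of frames assigned to it. A simplified sketch of this idea from the FastSpeech paper (not the repository's exact code):

import torch

def durations_from_alignment(attn):
    # attn: (mel_len, text_len) attention weights from the Tacotron2 teacher.
    mel_len, text_len = attn.shape
    assignments = attn.argmax(dim=1)                    # strongest phoneme per mel frame
    durations = torch.bincount(assignments, minlength=text_len)
    return durations                                    # (text_len,), sums to mel_len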

You can also preprocess mel-spectrograms for faster training. The resulting mel-spectrograms are stored in --mels_path and loaded during training. If --mels_path is not set, mel-spectrograms are computed on the fly during training.

Run ljspeech_dataset.py:

python fastspeech/dataset/ljspeech_dataset.py --dataset_path="./LJSpeech-1.1" --mels_path="mels_ljspeech1.1"
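
For reference, mel-spectrogram preprocessing of this kind typically looks like the sketch below (librosa-based; the parameter values are common defaults assumed for the example, while the repository's actual audio settings live in ./fastspeech/hparams/base.yaml):

import librosa
import numpy as np

def wav_to_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, win_length=1024, n_mels=80):
    # Load the waveform at the target sampling rate and compute a log-mel-spectrogram.
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
        win_length=win_length, n_mels=n_mels)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))     # log compression with a small floor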

Accelerated training

The NVIDIA Apex library provides a simple way to obtain up to 2x speed-up during training, offering easy-to-use APIs for AMP and layer fusions.

To use AMP during training, run with --use_amp

python fastspeech/train.py ... --use_amp

Another approach for extra speed-up during training is fusing operations. To use fused layer normalization, set --fused_layernorm.

python fastspeech/train.py ... --use_amp --fused_layernorm

Inference process

infer.py is provided to test the FastSpeech model on the LJSpeech dataset. --n_iters is the number of batches to infer. To run in FP16, run with --use_fp16.

python fastspeech/infer.py --dataset_path="./LJSpeech-1.1" --checkpoint_path="./checkpoints" --n_iters=10 --use_fp16

Accelerated inference

To accelerate inference with TensorRT, set --hparam=trt.yaml.

python fastspeech/infer.py --hparam=trt.yaml --dataset_path="./LJSpeech-1.1" --checkpoint_path="./checkpoints" --n_iters=10 --use_fp16

For more details, refer to accelerating inference with TensorRT.

Generation

To generate waveforms with the WaveGlow vocoder, get the pretrained WaveGlow model from NGC and place it in the home directory, for example, ./nvidia_waveglow256pyt_fp16.

Run generate.py with:

  • --text - an input text or the text file path.
  • --results_path - result waveforms directory path. (default=./results).
  • --ckpt_file - checkpoint file path. (default checkpoint file is the latest file in --checkpoint_path)

For example:

python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --text="The more you buy, the more you save."

or

python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --text=test_sentences.txt

Sample result waveforms are here.

To generate waveforms with the whole pipeline of FastSpeech and WaveGlow with TensorRT, extract a WaveGlow TRT engine file through https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt and run generate.py with --hparam=trt.yaml and --waveglow_engine_path.

python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save."

Sample result waveforms are FP32 and FP16.

Performance

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved with NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance, set CUDA_VISIBLE_DEVICES, depending on GPU count:

  • for 1 GPU,
export CUDA_VISIBLE_DEVICES=0
  • for 4 GPUs,
export CUDA_VISIBLE_DEVICES=0,1,2,3

and run on a specific batch size:

  • in FP32
python fastspeech/train.py --batch_size=BATCH_SIZE
  • in mixed precision
python fastspeech/train.py --batch_size=BATCH_SIZE --use_amp

Inference performance benchmark

Set CUDA_VISIBLE_DEVICES=0 to use single GPU,

export CUDA_VISIBLE_DEVICES=0

and run on a specific batch size:

  • in FP32
python fastspeech/perf_infer.py --batch_size=BATCH_SIZE
  • in FP16
python fastspeech/perf_infer.py --batch_size=BATCH_SIZE --use_fp16

To benchmark the inference performance with vocoder,

python fastspeech/perf_infer.py --batch_size=BATCH_SIZE --with_vocoder --waveglow_path=WAVEGLOW_PATH

Finally, to benchmark the inference performance on TensorRT,

python fastspeech/perf_infer.py --hparam=trt.yaml --batch_size=BATCH_SIZE
  • with vocoder
python fastspeech/perf_infer.py --hparam=trt.yaml --batch_size=BATCH_SIZE --with_vocoder --waveglow_path=WAVEGLOW_PATH

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training performance results

Our results were obtained by running the script from the training performance benchmark section on an NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in mels per second) were averaged over an entire training epoch.

Training performance: NVIDIA DGX-1 (8x V100 16GB)
GPUs | Batch size / GPU | Throughput (mels/s) - FP32 | Throughput (mels/s) - mixed precision | Throughput speedup (FP32 to mixed precision) | Multi-GPU weak scaling - FP32 | Multi-GPU weak scaling - mixed precision
1 | 32 | 31674  | 63431  | 2.00 | 1    | 1
4 | 32 | 101115 | 162847 | 1.61 | 3.19 | 2.57
8 | 32 | 167650 | 188251 | 1.12 | 5.29 | 2.97

Inference performance results

Our results were obtained by running the script from the inference performance benchmark section on an NVIDIA DGX-1 with 1x V100 16GB GPU and on an NVIDIA T4. The following tables show inference statistics for the FastSpeech and WaveGlow text-to-speech system on PyTorch, and a comparison between frameworks with batch size 1 in FP16, gathered from 1000 inference runs. Latency is measured from the start of FastSpeech inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor, which tells how many seconds of speech are generated in 1 second of compute. The WaveGlow model used is a 256-channel model. The numbers reported below were taken with a moderate input length of 128 characters.
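
For example, with LJSpeech's 22,050 Hz sampling rate, the batch-size-1 RTF follows directly from the reported throughput:

sample_rate = 22050       # LJSpeech sampling rate in Hz
throughput = 681773       # generated samples per second (batch size 1, FP16, V100 row below)
rtf = throughput / sample_rate
print(round(rtf, 2))      # 30.92 seconds of audio per second of compute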

Inference performance: NVIDIA DGX-1 (1x V100 16GB)

Batch size | Precision | Avg latency (s) | Std latency (s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up with mixed precision
1 | FP16 | 0.2287 | 0.001  | 0.2295 | 0.2297 | 0.2303 | 681,773   | 30.92 | 1.50
4 | FP16 | 0.5003 | 0.0016 | 0.502  | 0.5023 | 0.5032 | 1,244,466 | 14.11 | 2.57
8 | FP16 | 0.9695 | 0.0023 | 0.9722 | 0.9732 | 0.9748 | 1,284,339 | 7.28  | 2.73
1 | FP32 | 0.3428 | 0.0016 | 0.3445 | 0.3449 | 0.3458 | 454,833   | 20.63 | 1.00
4 | FP32 | 1.287  | 0.0039 | 1.2916 | 1.2927 | 1.2954 | 484,558   | 5.50  | 1.00
8 | FP32 | 2.6481 | 0.0041 | 2.6535 | 2.6549 | 2.657  | 470,992   | 2.67  | 1.00

Framework | Batch size | Precision | Avg latency (s) | Std latency (s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up (PyT vs PyT+TRT)
PyT     | 1 | FP16 | 0.2287 | 0.001  | 0.2295 | 0.2297 | 0.2303 | 681,773   | 30.92 | 1
PyT+TRT | 1 | FP16 | 0.1115 | 0.0007 | 0.1122 | 0.1124 | 0.1135 | 1,398,343 | 63.42 | 2.05
PyT     | 4 | FP16 | 0.5003 | 0.0016 | 0.502  | 0.5023 | 0.5032 | 1,244,466 | 14.11 | 1
PyT+TRT | 4 | FP16 | 0.3894 | 0.0019 | 0.3917 | 0.3925 | 0.3961 | 1,599,005 | 18.13 | 1.28
Inference performance: NVIDIA T4

Batch size | Precision | Avg latency (s) | Std latency (s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up with mixed precision
1 | FP16 | 0.9345 | 0.0294 | 0.9662 | 0.9723 | 0.9806 | 167,003 | 7.57 | 1.28
4 | FP16 | 3.7815 | 0.0877 | 3.9078 | 3.9393 | 3.9632 | 164,730 | 1.87 | 1.28
8 | FP16 | 7.5722 | 0.1764 | 7.8273 | 7.8829 | 7.9286 | 164,530 | 0.93 | 1.21
1 | FP32 | 1.1952 | 0.0368 | 1.2438 | 1.2487 | 1.2589 | 130,572 | 5.92 | 1.00
4 | FP32 | 4.8578 | 0.1215 | 5.0343 | 5.0651 | 5.1027 | 128,453 | 1.46 | 1.00
8 | FP32 | 9.1563 | 0.4114 | 9.4049 | 9.4571 | 9.5194 | 136,367 | 0.77 | 1.00

Framework | Batch size | Precision | Avg latency (s) | Std latency (s) | Latency tolerance interval 90% (s) | Latency tolerance interval 95% (s) | Latency tolerance interval 99% (s) | Throughput (samples/s) | Avg RTF | Speed-up (PyT vs PyT+TRT)
PyT     | 1 | FP16 | 0.9345 | 0.0294 | 0.9662 | 0.9723 | 0.9806 | 167,003 | 7.57 | 1
PyT+TRT | 1 | FP16 | 0.3234 | 0.0058 | 0.3304 | 0.3326 | 0.3358 | 482,286 | 21.87 | 2.89

Release notes

Changelog

Oct 2020

  • PyTorch 1.7, TensorRT 7.2 support

July 2020

  • Initial release

Known issues

There are no known issues in this release.