# Tacotron 2 And WaveGlow v1.10 For PyTorch

This repository provides a script and recipe to train Tacotron 2 and WaveGlow v1.10 models to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.

## Table of Contents

- [Model overview](#model-overview)
    * [Model architecture](#model-architecture)
    * [Default configuration](#default-configuration)
    * [Feature support matrix](#feature-support-matrix)
        * [Features](#features)
    * [Mixed precision training](#mixed-precision-training)
        * [Enabling mixed precision](#enabling-mixed-precision)
        * [Enabling TF32](#enabling-tf32)
- [Setup](#setup)
    * [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
    * [Scripts and sample code](#scripts-and-sample-code)
    * [Parameters](#parameters)
        * [Shared parameters](#shared-parameters)
        * [Shared audio/STFT parameters](#shared-audiostft-parameters)
        * [Tacotron 2 parameters](#tacotron-2-parameters)
        * [WaveGlow parameters](#waveglow-parameters)
    * [Command-line options](#command-line-options)
    * [Getting the data](#getting-the-data)
        * [Dataset guidelines](#dataset-guidelines)
        * [Multi-dataset](#multi-dataset)
    * [Training process](#training-process)
    * [Inference process](#inference-process)
- [Performance](#performance)
    * [Benchmarking](#benchmarking)
        * [Training performance benchmark](#training-performance-benchmark)
        * [Inference performance benchmark](#inference-performance-benchmark)
    * [Results](#results)
        * [Training accuracy results](#training-accuracy-results)
            * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
            * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
            * [Training curves](#training-curves)
        * [Training performance results](#training-performance-results)
            * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
            * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
            * [Expected training time](#expected-training-time)
        * [Inference performance results](#inference-performance-results)
            * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
            * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
            * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
    * [Changelog](#changelog)
    * [Known issues](#known-issues)

## Model overview

This text-to-speech (TTS) system is a combination of two neural network models:

* a modified Tacotron 2 model from the [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884) paper
* a flow-based neural network model from the [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002) paper

The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts without any additional information such as patterns and/or rhythms of speech.

Our implementation of the Tacotron 2 model differs from the model described in the paper. Our implementation uses Dropout instead of Zoneout to regularize the LSTM layers. Also, the original text-to-speech system proposed in the paper uses the [WaveNet](https://arxiv.org/abs/1609.03499) model to synthesize waveforms. In our implementation, we use the WaveGlow model for this purpose.
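
For a quick end-to-end feel for how the two models fit together (text → Tacotron 2 → mel-spectrogram → WaveGlow → waveform), a minimal inference sketch is shown below. It assumes the PyTorch Hub entry points that NVIDIA publishes for these models (`nvidia_tacotron2`, `nvidia_waveglow`, `nvidia_tts_utils`); the exact entry-point names and arguments may differ between releases, and the repository's own `inference.py` script (described later) remains the supported path.

```python
import torch

# Minimal end-to-end sketch (assumes the published PyTorch Hub entry points;
# names and signatures may differ between releases).
hub_repo = 'NVIDIA/DeepLearningExamples:torchhub'
tacotron2 = torch.hub.load(hub_repo, 'nvidia_tacotron2').to('cuda').eval()
waveglow = torch.hub.load(hub_repo, 'nvidia_waveglow').to('cuda').eval()
utils = torch.hub.load(hub_repo, 'nvidia_tts_utils')

# Tacotron 2 maps character sequences to mel-spectrograms;
# WaveGlow maps mel-spectrograms to audio samples (22050 Hz for LJSpeech).
sequences, lengths = utils.prepare_input_sequence(["Speech synthesis with Tacotron 2 and WaveGlow."])
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
```
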
Both models are based on implementations from the NVIDIA GitHub repositories [Tacotron 2](https://github.com/NVIDIA/tacotron2) and [WaveGlow](https://github.com/NVIDIA/waveglow), and are trained on the publicly available [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/).

The Tacotron 2 and WaveGlow models enable you to efficiently synthesize high-quality speech from text.

Both models are trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2.0x faster for Tacotron 2 and 3.1x faster for WaveGlow than training without Tensor Cores, while experiencing the benefits of mixed precision training. The models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

### Model architecture

The Tacotron 2 model is a recurrent sequence-to-sequence model with attention that predicts mel-spectrograms from text. The encoder (blue blocks in the figure below) transforms the whole text into a fixed-size hidden feature representation. This feature representation is then consumed by the autoregressive decoder (orange blocks) that produces one spectrogram frame at a time. In our implementation, the autoregressive WaveNet (green block) is replaced by the flow-based generative WaveGlow.

![](./img/tacotron2_arch.png "Tacotron 2 architecture")

Figure 1. Architecture of the Tacotron 2 model. Taken from the [Tacotron 2](https://arxiv.org/abs/1712.05884) paper.

The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution using mel-spectrogram conditioning (Figure 2). During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution, followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer.

![](./img/waveglow_arch.png "WaveGlow architecture")

Figure 2. Architecture of the WaveGlow model. Taken from the [WaveGlow](https://arxiv.org/abs/1811.00002) paper.

### Default configuration

Both models support multi-GPU and mixed precision training with dynamic loss scaling (see Apex code [here](https://github.com/NVIDIA/apex/blob/master/apex/fp16_utils/loss_scaler.py)), as well as mixed precision inference. To speed up Tacotron 2 training, reference mel-spectrograms are generated during a preprocessing step and read directly from disk during training, instead of being generated during training.

The following features were implemented in this model:

* data-parallel multi-GPU training
* dynamic loss scaling with backoff for Tensor Cores (mixed precision) training.

### Feature support matrix

The following features are supported by this model.

| Feature                                                                     | Tacotron 2 | WaveGlow |
| :-------------------------------------------------------------------------- | ---------: | -------: |
| [AMP](https://nvidia.github.io/apex/amp.html)                                | Yes        | Yes      |
| [Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html)  | Yes        | Yes      |

#### Features

AMP - a tool that enables Tensor Core-accelerated training. For more information, refer to [Enabling mixed precision](#enabling-mixed-precision).

Apex DistributedDataParallel - a module wrapper that enables easy multiprocess distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`.
`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during `backward()` and bucketing smaller gradient transfers to reduce the total number of transfers required.

### Mixed precision training

*Mixed precision* is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.

For information about:

* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) blog.

#### Enabling mixed precision

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In PyTorch, loss scaling can be easily applied by using the `scale_loss()` method provided by AMP. The scaling value can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.

By default, the `train_tacotron2.sh` and `train_waveglow.sh` scripts launch mixed precision training with Tensor Cores. You can change this behavior by removing the `--amp` flag passed to `train.py`.
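
To build intuition for the dynamic loss scaling with backoff mentioned above, the sketch below shows the core idea in plain PyTorch: the loss is multiplied by a scale before `backward()`, the update is skipped and the scale halved when non-finite gradients appear, and the scale is grown again after a run of stable steps. This is only an illustration with assumed constants, not the Apex loss scaler used by the training scripts.

```python
import torch

# Minimal illustration of dynamic loss scaling with backoff (assumed constants,
# not the Apex DynamicLossScaler used by the training scripts).
loss_scale = 2.0 ** 15      # initial scale (assumption)
growth_interval = 2000      # stable steps before the scale grows (assumption)
stable_steps = 0

def step_with_loss_scaling(loss, model, optimizer):
    global loss_scale, stable_steps
    optimizer.zero_grad()
    (loss * loss_scale).backward()              # scale up so small FP16 gradients survive

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        loss_scale /= 2.0                       # overflow: back off and skip the update
        stable_steps = 0
        return
    for g in grads:
        g.div_(loss_scale)                      # unscale before the optimizer step
    optimizer.step()

    stable_steps += 1
    if stable_steps % growth_interval == 0:
        loss_scale *= 2.0                       # grow the scale again after a stable run
```
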
To enable mixed precision, the following steps were performed in the Tacotron 2 and WaveGlow models:

* Import AMP from APEX:

    ```python
    from apex import amp
    amp.lists.functional_overrides.FP32_FUNCS.remove('softmax')
    amp.lists.functional_overrides.FP16_FUNCS.append('softmax')
    ```

* Initialize AMP:

    ```python
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
    ```

* If running on multi-GPU, wrap the model with `DistributedDataParallel`:

    ```python
    from apex.parallel import DistributedDataParallel as DDP
    model = DDP(model)
    ```

* Scale loss before backpropagation (assuming loss is stored in a variable called `losses`):

    * Default backpropagation for FP32:

        ```python
        losses.backward()
        ```

    * Scale loss and backpropagate with AMP:

        ```python
        with amp.scale_loss(losses, optimizer) as scaled_losses:
            scaled_losses.backward()
        ```

#### Enabling TF32

TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations. For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.

## Setup

The following section lists the requirements you need to meet in order to start training the Tacotron 2 and WaveGlow models.

### Requirements

This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
- Supported GPUs:
    - [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
    - [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
    - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)

For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:

- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
- [Running PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)

For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).

## Quick Start Guide

To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Tacotron 2 and WaveGlow models on the [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) dataset.

1. Clone the repository.

    ```bash
    git clone https://github.com/NVIDIA/DeepLearningExamples.git
    cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2
    ```

2. Download and preprocess the dataset. Use the `./scripts/prepare_dataset.sh` download script to automatically download and preprocess the training, validation and test datasets. To run this script, issue:

    ```bash
    bash scripts/prepare_dataset.sh
    ```

    Data is downloaded to the `./LJSpeech-1.1` directory (on the host). The `./LJSpeech-1.1` directory is mounted to the `/workspace/tacotron2/LJSpeech-1.1` location in the NGC container.

3. Build the Tacotron 2 and WaveGlow PyTorch NGC container.

    ```bash
    bash scripts/docker/build.sh
    ```

4. Start an interactive session in the NGC container to run training/inference. After you build the container image, you can start an interactive CLI session with:

    ```bash
    bash scripts/docker/interactive.sh
    ```

    The `interactive.sh` script requires that the location of the dataset is specified. For example, `LJSpeech-1.1`. To preprocess the datasets for Tacotron 2 training, use the `./scripts/prepare_mels.sh` script:

    ```bash
    bash scripts/prepare_mels.sh
    ```

    The preprocessed mel-spectrograms are stored in the `./LJSpeech-1.1/mels` directory.

5. Start training. To start Tacotron 2 training, run:

    ```bash
    bash scripts/train_tacotron2.sh
    ```

    To start WaveGlow training, run:

    ```bash
    bash scripts/train_waveglow.sh
    ```

6. Start validation/evaluation. Ensure your loss values are comparable to those listed in the table in the [Results](#results) section. For both models, the loss values are stored in the `./output/nvlog.json` log file.

    After you have trained the Tacotron 2 and WaveGlow models, you should get audio results similar to the samples in the `./audio` folder. For details about generating audio, see the [Inference process](#inference-process) section below.

    The training scripts automatically run the validation after each training epoch. The results from the validation are printed to the standard output (`stdout`) and saved to the log files.

7. Start inference. After you have trained the Tacotron 2 and WaveGlow models, you can perform inference using the respective checkpoints that are passed as the `--tacotron2` and `--waveglow` arguments. Tacotron 2 and WaveGlow checkpoints can also be downloaded from NGC:

    * https://ngc.nvidia.com/catalog/models/nvidia:tacotron2pyt_fp16/files?version=3
    * https://ngc.nvidia.com/catalog/models/nvidia:waveglow256pyt_fp16/files?version=2

    To run inference, issue:

    ```bash
    python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> --wn-channels 256 -o output/ -i phrases/phrase.txt --fp16
    ```

    The speech is generated from lines of text in the file that is passed with the `-i` argument. The number of lines determines the inference batch size. To run inference in mixed precision, use the `--fp16` flag. The output audio will be stored in the path specified by the `-o` argument.

    You can also run inference on the CPU with TorchScript by adding the `--cpu` flag:

    ```bash
    export CUDA_VISIBLE_DEVICES=
    ```

    ```bash
    python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> --wn-channels 256 --cpu -o output/ -i phrases/phrase.txt
    ```

## Advanced

The following sections provide greater details of the dataset, running training and inference, and the training results.

### Scripts and sample code

The sample code for Tacotron 2 and WaveGlow has scripts specific to a particular model, located in the `./tacotron2` and `./waveglow` directories, as well as scripts common to both models, located in the `./common` directory.
The model-specific scripts are as follows:

* `<model_name>/model.py` - the model architecture, definition of forward and inference functions
* `<model_name>/arg_parser.py` - argument parser for parameters specific to a given model
* `<model_name>/data_function.py` - data loading functions
* `<model_name>/loss_function.py` - loss function for the model

The common scripts contain layer definitions common to both models (`common/layers.py`), some utility scripts (`common/utils.py`) and scripts for audio processing (`common/audio_processing.py` and `common/stft.py`).

In the root directory `./` of this repository, the `./train.py` script is used for training, while inference can be executed with the `./inference.py` script. The scripts `./models.py`, `./data_functions.py` and `./loss_functions.py` call the respective scripts in the `<model_name>` directory, depending on what model is trained using the `train.py` script.

### Parameters

In this section, we list the most important hyperparameters and command-line arguments, together with their default values, that are used to train the Tacotron 2 and WaveGlow models.

#### Shared parameters

* `--epochs` - number of epochs (Tacotron 2: 1501, WaveGlow: 1001)
* `--learning-rate` - learning rate (Tacotron 2: 1e-3, WaveGlow: 1e-4)
* `--batch-size` - batch size (Tacotron 2 FP16/FP32: 104/48, WaveGlow FP16/FP32: 10/4)
* `--amp` - use mixed precision training
* `--cpu` - use CPU with TorchScript for inference

#### Shared audio/STFT parameters

* `--sampling-rate` - sampling rate in Hz of input and output audio (22050)
* `--filter-length` - FFT filter length (1024)
* `--hop-length` - hop length for FFT, i.e., sample stride between consecutive FFTs (256)
* `--win-length` - window size for FFT (1024)
* `--mel-fmin` - lowest frequency in Hz (0.0)
* `--mel-fmax` - highest frequency in Hz (8000.0)

#### Tacotron 2 parameters

* `--anneal-steps` - epochs at which to anneal the learning rate (500 1000 1500)
* `--anneal-factor` - factor by which to anneal the learning rate (FP16/FP32: 0.3/0.1)

#### WaveGlow parameters

* `--segment-length` - segment length of the input audio processed by the neural network (8000)
* `--wn-channels` - number of residual channels in the coupling layer networks (512)

### Command-line options

To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:

```bash
python train.py --help
```

The following example output is printed when running the sample:

```bash
Batch: 7/260 epoch 0
:::NVLOGv0.2.2 Tacotron2_PyT 1560936205.667271376 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_start: 7
:::NVLOGv0.2.2 Tacotron2_PyT 1560936207.209611416 (/workspace/tacotron2/dllogger/logger.py:251) train_iteration_loss: 5.415428161621094
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.705905914 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_stop: 7
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.706479311 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_items/sec: 8924.00136085362
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.706998110 (/workspace/tacotron2/dllogger/logger.py:251) iter_time: 3.0393316745758057
Batch: 8/260 epoch 0
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.711485624 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_start: 8
:::NVLOGv0.2.2 Tacotron2_PyT 1560936210.236668825 (/workspace/tacotron2/dllogger/logger.py:251) train_iteration_loss: 5.516331672668457
```

### Getting the data

The Tacotron 2 and WaveGlow models were trained on the LJSpeech-1.1 dataset.
This repository contains the `./scripts/prepare_dataset.sh` script, which automatically downloads and extracts the whole dataset. By default, data is extracted to the `./LJSpeech-1.1` directory. The dataset directory contains a `README` file, a `wavs` directory with all audio samples, and a `metadata.csv` file that contains audio file names and the corresponding transcripts.

#### Dataset guidelines

The LJSpeech dataset has 13,100 clips that amount to about 24 hours of speech. Since the original dataset has all transcripts in the `metadata.csv` file, in this repository we provide file lists in the `./filelists` directory that determine the training and validation subsets: `ljs_audio_text_train_filelist.txt` defines the training subset and `ljs_audio_text_val_filelist.txt` defines the validation subset.

#### Multi-dataset

To use datasets different than the default LJSpeech dataset:

1. Prepare a directory with all audio files and pass it to the `--dataset-path` command-line option.

2. Add two text files containing file lists: one for the training subset (`--training-files`) and one for the validation subset (`--validation-files`). The structure of the filelists should be as follows (a sketch for generating such lists from a `metadata.csv` file is shown after this list):

    ```bash
    `<audio file path>|<transcript>`
    ```
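
As an illustration of the expected filelist format, the following sketch splits an LJSpeech-style `metadata.csv` (pipe-separated `id|transcription|normalized transcription` rows) into training and validation file lists. It is a minimal example, not part of the repository's scripts; the output file names and the 95/5 split are arbitrary choices.

```python
import csv
import random
from pathlib import Path

# Minimal sketch (not part of the repository): build `<audio file path>|<transcript>`
# file lists from an LJSpeech-style metadata.csv.
dataset_path = Path("LJSpeech-1.1")
rows = []
with open(dataset_path / "metadata.csv", encoding="utf-8") as f:
    for record in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
        clip_id, normalized_text = record[0], record[-1]
        rows.append(f"{dataset_path}/wavs/{clip_id}.wav|{normalized_text}")

random.seed(0)
random.shuffle(rows)
split = int(0.95 * len(rows))  # arbitrary 95/5 train/validation split

Path("filelists").mkdir(exist_ok=True)
Path("filelists/custom_audio_text_train_filelist.txt").write_text("\n".join(rows[:split]) + "\n", encoding="utf-8")
Path("filelists/custom_audio_text_val_filelist.txt").write_text("\n".join(rows[split:]) + "\n", encoding="utf-8")
```
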