updated to 19.07 version
parent 3137fbeae3
commit 979e291848
@@ -1,10 +1,5 @@
-FROM nvcr.io/nvidia/pytorch:19.03-py3
+FROM nvcr.io/nvidia/pytorch:19.06-py3

 ADD . /workspace/tacotron2
 WORKDIR /workspace/tacotron2
 RUN pip install -r requirements.txt
-RUN cd /workspace; \
-    git clone https://github.com/NVIDIA/apex.git; \
-    cd /workspace/apex; \
-    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
-WORKDIR /workspace/tacotron2
@@ -1,20 +1,20 @@
-# Tacotron 2 And WaveGlow v1.5 For PyTorch
+# Tacotron 2 And WaveGlow v1.6 For PyTorch

 This repository provides a script and recipe to train Tacotron 2 and WaveGlow
-v1.5 models to achieve state of the art accuracy, and is tested and maintained by
-NVIDIA.
+v1.6 models to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.

-Table of Contents
-=================
-* [The model](#the-model)
+## Table of Contents
+* [Model overview](#model-overview)
+  * [Model architecture](#model-architecture)
   * [Default configuration](#default-configuration)
+  * [Feature support matrix](#feature-support-matrix)
+    * [Features](#features)
+  * [Mixed precision training](#mixed-precision-training)
+    * [Enabling mixed precision](#enabling-mixed-precision)
 * [Setup](#setup)
   * [Requirements](#requirements)
 * [Quick Start Guide](#quick-start-guide)
-* [Details](#details)
+* [Advanced](#advanced)
+  * [Scripts and sample code](#scripts-and-sample-code)
+  * [Parameters](#parameters)
+    * [Shared parameters](#shared-parameters)
@@ -27,30 +27,30 @@ Table of Contents
     * [Multi-dataset](#multi-dataset)
   * [Training process](#training-process)
   * [Inference process](#inference-process)
-* [Mixed precision training](#mixed-precision-training)
-  * [Enabling mixed precision](#enabling-mixed-precision)
-* [Benchmarking](#benchmarking)
-  * [Training performance benchmark](#training-performance-benchmark)
-  * [Inference performance benchmark](#inference-performance-benchmark)
-* [Results](#results)
-  * [Training accuracy results](#training-accuracy-results)
-    * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
-  * [Training performance results](#training-performance-results)
-    * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
-    * [Expected training time](#expected-training-time)
-  * [Inference performance results](#inference-performance-results)
-    * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
-* [Changelog](#changelog)
-* [Known issues](#known-issues)
+* [Performance](#performance)
+  * [Benchmarking](#benchmarking)
+    * [Training performance benchmark](#training-performance-benchmark)
+    * [Inference performance benchmark](#inference-performance-benchmark)
+  * [Results](#results)
+    * [Training accuracy results](#training-accuracy-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
+    * [Training performance results](#training-performance-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
+      * [Expected training time](#expected-training-time)
+    * [Inference performance results](#inference-performance-results)
+      * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
+* [Release notes](#release-notes)
+  * [Changelog](#changelog)
+  * [Known issues](#known-issues)

-## The model
+## Model overview

 This text-to-speech (TTS) system is a combination of two neural network
 models:

 * a modified Tacotron 2 model from the [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
-  paper and
-* a flow-based neural network model from the [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002) paper.
+  paper
+* a flow-based neural network model from the [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002) paper

 The Tacotron 2 and WaveGlow models form a text-to-speech system that enables
 users to synthesize natural sounding speech from raw transcripts without
@@ -106,7 +106,6 @@ distribution.
 Figure 2. Architecture of the WaveGlow model. Taken from the
 [WaveGlow](https://arxiv.org/abs/1811.00002) paper.

-
 ### Default configuration

 Both models support multi-GPU and mixed precision training with dynamic loss
@@ -126,16 +125,98 @@ training.

 The following features are supported by this model.

-| Feature | Tacotron 2 | and WaveGlow |
-|:-------|---------:|-----------:|
+| Feature | Tacotron 2 | WaveGlow |
+| :-----------------------|------------:|--------------:|
 |[AMP](https://nvidia.github.io/apex/amp.html) | Yes | Yes |
 |[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html) | Yes | Yes |

 #### Features

-AMP - a tool that enables Tensor Core-accelerated training. Please refer to section [Enabling mixed precision](#enabling-mixed-precision) for more details.
+AMP - a tool that enables Tensor Core-accelerated training. For more information,
+refer to [Enabling mixed precision](#enabling-mixed-precision).

-Apex DistributedDataParallel - a module wrapper that enables easy multiprocess distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`. `DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during backward() and bucketing smaller gradient transfers to reduce the total number of transfers required.
+Apex DistributedDataParallel - a module wrapper that enables easy multiprocess
+distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`.
+`DistributedDataParallel` is optimized for use with NCCL. It achieves high
+performance by overlapping communication with computation during `backward()`
+and bucketing smaller gradient transfers to reduce the total number of transfers
+required.

+## Mixed precision training
+
+*Mixed precision* is the combined use of different numerical precisions in a
+computational method. [Mixed precision](https://arxiv.org/abs/1710.03740)
+training offers significant computational speedup by performing operations in
+half-precision format, while storing minimal information in single-precision
+to retain as much information as possible in critical parts of the network.
+Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores)
+in the Volta and Turing architectures, significant training speedups are
+experienced by switching to mixed precision -- up to 3x overall speedup on
+the most arithmetically intense model architectures. Using mixed precision
+training requires two steps:
+
+1. Porting the model to use the FP16 data type where appropriate.
+2. Adding loss scaling to preserve small gradient values.
+
+The ability to train deep learning networks with lower precision was
+introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+
+For information about:
+* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740)
+  paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
+  documentation.
+* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
+  blog.
+* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/)
+  blog.
+
+### Enabling mixed precision
+
+Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
+(AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
+to half-precision upon retrieval, while storing variables in single-precision
+format. Furthermore, to preserve small gradient magnitudes in backpropagation,
+a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
+step must be included when applying gradients. In PyTorch, loss scaling can be
+easily applied by using the `scale_loss()` method provided by AMP. The scaling value
+to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
+
+By default, the `train_tacotron2.sh` and `train_waveglow.sh` scripts launch
+mixed precision training with Tensor Cores. You can change this behaviour by
+removing the `--amp-run` flag passed to the `train.py` script.
+
+To enable mixed precision, the following steps were performed in the Tacotron 2 and
+WaveGlow models:
+* Import AMP from APEX:
+  ```python
+  from apex import amp
+  amp.lists.functional_overrides.FP32_FUNCS.remove('softmax')
+  amp.lists.functional_overrides.FP16_FUNCS.append('softmax')
+  ```
+
+* Initialize AMP:
+  ```python
+  model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
+  ```
+
+* If running on multi-GPU, wrap the model with `DistributedDataParallel`:
+  ```python
+  from apex.parallel import DistributedDataParallel as DDP
+  model = DDP(model)
+  ```
+
+* Scale loss before backpropagation (assuming loss is stored in a variable
+  called `losses`):
+
+  * Default backpropagation for FP32:
+    ```python
+    losses.backward()
+    ```
+
+  * Scale loss and backpropagate with AMP:
+    ```python
+    with amp.scale_loss(losses, optimizer) as scaled_losses:
+        scaled_losses.backward()
+    ```
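Put together, a single training step with these pieces looks like the following minimal sketch. This is illustrative only: the toy `nn.Linear` model, loss, and random data stand in for the real Tacotron 2 / WaveGlow training loop and are not code from this repository.

```python
import torch
from torch import nn
from apex import amp

# a toy model standing in for Tacotron 2 / WaveGlow
model = nn.Linear(80, 80).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# cast to mixed precision and attach dynamic loss scaling
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for _ in range(10):                       # stand-in for the real data loader
    x = torch.randn(16, 80, device='cuda')
    y = torch.randn(16, 80, device='cuda')
    optimizer.zero_grad()
    losses = criterion(model(x), y)
    with amp.scale_loss(losses, optimizer) as scaled_losses:
        scaled_losses.backward()          # gradients are unscaled before the step
    optimizer.step()
```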

 ## Setup
@@ -149,22 +230,21 @@ and encapsulates some dependencies. Aside from these dependencies, ensure you
 have the following components:

 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.04-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+* [PyTorch 19.05-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+  or newer
 * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU

 For more information about how to get started with NGC containers, see the
 following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
 Documentation:

 * [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
-* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
-* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
+* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
+* [Running PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)

 For those unable to use the PyTorch NGC container, to set up the required
 environment or create your own container, see the versioned
-[NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html).
+[NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).

 ## Quick Start Guide
@@ -174,84 +254,86 @@ and WaveGlow model on the [LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
 dataset.

 1. Clone the repository.
    ```bash
    git clone https://github.com/NVIDIA/DeepLearningExamples.git
    cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2
    ```

 2. Download and preprocess the dataset.
    Use the `./scripts/prepare-dataset.sh` download script to automatically
    download and preprocess the training, validation and test datasets. To run
    this script, issue:
    ```bash
    bash scripts/prepare-dataset.sh
    ```

    To preprocess the datasets for Tacotron 2 training, use the
    `./scripts/prepare_mels.sh` script:
    ```bash
    bash scripts/prepare_mels.sh
    ```

    Data is downloaded to the `./LJSpeech-1.1` directory (on the host). The
    `./LJSpeech-1.1` directory is mounted to the `/workspace/tacotron2/LJSpeech-1.1`
    location in the NGC container. The preprocessed mel-spectrograms are stored in the
    `./LJSpeech-1.1/mels` directory.

 3. Build the Tacotron 2 and WaveGlow PyTorch NGC container.
    ```bash
    bash scripts/docker/build.sh
    ```

 4. Start an interactive session in the NGC container to run training/inference.
    After you build the container image, you can start an interactive CLI session with:
    ```bash
    bash scripts/docker/interactive.sh
    ```

    The `interactive.sh` script requires that the location of the dataset is specified.
    For example, `LJSpeech-1.1`.

 5. Start training.
    To start Tacotron 2 training, run:
    ```bash
    bash scripts/train_tacotron2.sh
    ```

    To start WaveGlow training, run:
    ```bash
    bash scripts/train_waveglow.sh
    ```

 6. Start validation/evaluation.
    Ensure your loss values are comparable to those listed in the table in the
    [Results](#results) section. For both models, the loss values are stored in the
    `./output/nvlog.json` log file.

    After you have trained the Tacotron 2 model for 1500 epochs and the
    WaveGlow model for 800 epochs, you should get audio results similar to the
    samples in the `./audio` folder. For details about generating audio, see the
    [Inference process](#inference-process) section below.

    The training scripts automatically run the validation after each training
    epoch. The results from the validation are printed to the standard output
    (`stdout`) and saved to the log files.

 7. Start inference.
    After you have trained the Tacotron 2 and WaveGlow models, you can perform
    inference using the respective checkpoints that are passed as `--tacotron2`
    and `--waveglow` arguments.

    To run inference, issue:
    ```bash
-   python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --fp16-run
+   python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i phrases/phrase.txt --amp-run
    ```

-   The speech is generated from a text file that is passed with the `-i`
-   argument. To run inference in mixed precision, use the `--amp-run` flag.
-   The output audio will be stored in the path specified by the `-o` argument.
+   The speech is generated from the lines of text in the file that is passed
+   with the `-i` argument. The number of lines determines the inference batch
+   size. To run inference in mixed precision, use the `--amp-run` flag. The
+   output audio will be stored in the path specified by the `-o` argument.

-## Details
+## Advanced

 The following sections provide greater details of the dataset, running
 training and inference, and the training results.
@@ -312,11 +394,26 @@ WaveGlow models.

 ### Command-line options

 To see the full list of available options and their descriptions, use the `-h`
 or `--help` command line option, for example:
 ```bash
 python train.py --help
 ```

+The following example output is printed when running the sample:
+
+```
+Batch: 7/260 epoch 0
+:::NVLOGv0.2.2 Tacotron2_PyT 1560936205.667271376 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_start: 7
+:::NVLOGv0.2.2 Tacotron2_PyT 1560936207.209611416 (/workspace/tacotron2/dllogger/logger.py:251) train_iteration_loss: 5.415428161621094
+:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.705905914 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_stop: 7
+:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.706479311 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_items/sec: 8924.00136085362
+:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.706998110 (/workspace/tacotron2/dllogger/logger.py:251) iter_time: 3.0393316745758057
+Batch: 8/260 epoch 0
+:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.711485624 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_start: 8
+:::NVLOGv0.2.2 Tacotron2_PyT 1560936210.236668825 (/workspace/tacotron2/dllogger/logger.py:251) train_iteration_loss: 5.516331672668457
+```

 ### Getting the data
@@ -335,12 +432,11 @@ To use datasets different than the default LJSpeech dataset:

 2. Add two text files containing file lists: one for the training subset (`--training-files`) and one for the validation subset (`--validation-files`).
    The structure of the filelists should be as follows:
    ```bash
    <audio file path>|<transcript>
    ```

    The `<audio file path>` is the relative path to the path provided by the `--dataset-path` option.
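As an illustrative sketch, a filelist in this format can be consumed with a few lines of Python. The filelist name below comes from this repository's `filelists/` directory; joining entries against `--dataset-path` is an assumption based on the sentence above.

```python
import os

def load_filelist(filelist_path, dataset_path):
    """Yield (absolute audio file path, transcript) pairs from a filelist."""
    with open(filelist_path, encoding='utf-8') as f:
        for line in f:
            audio_path, transcript = line.rstrip('\n').split('|', 1)
            yield os.path.join(dataset_path, audio_path), transcript

pairs = list(load_filelist('filelists/ljs_mel_text_train_filelist.txt',
                           './LJSpeech-1.1'))
```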

 ### Training process
@@ -351,7 +447,7 @@ of Tacotron 2 and as conditioning input to the network in case of WaveGlow.

 The training loss is averaged over an entire training epoch, whereas the
 validation loss is averaged over the validation dataset. Performance is
-reported in total input tokens per second for the Tacotron 2 model and
+reported in total output mel-spectrograms per second for the Tacotron 2 model and
 in total output samples per second for the WaveGlow model. Both measures are
 recorded as `train_iter_items/sec` (after each iteration) and
 `train_epoch_items/sec` (averaged over epoch) in the output log file `./output/nvlog.json`. The result is
@@ -372,100 +468,21 @@ models and input text as a text file, with one phrase per line.

 To run inference, issue:
 ```bash
-python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --amp-run
+python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase.txt --amp-run
 ```
 Here, `Tacotron2_checkpoint` and `WaveGlow_checkpoint` are pre-trained
-checkpoints for the respective models, and `text.txt` contains input phrases.
-Audio will be saved in the output folder.
+checkpoints for the respective models, and `phrases/phrase.txt` contains input
+phrases. The number of text lines determines the inference batch size. Audio
+will be saved in the output folder.

 You can find all the available options by calling `python inference.py --help`.
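As a small sketch of the batching behaviour described above (the file name is an example):

```python
# Each line of the input file becomes one element of the inference batch.
with open('phrases/phrase.txt') as f:
    texts = f.readlines()

batch_size = len(texts)  # e.g. a 4-line phrase file yields batch size 4
```

The lines are then padded into a single tensor by `prepare_input_sequence` in the updated `inference.py` shown below.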

-## Mixed precision training
+## Performance

-## Benchmarking
+### Benchmarking

 The following section shows how to run benchmarks measuring the model
 performance in training and inference mode.

-### Training performance benchmark
+#### Training performance benchmark

 To benchmark the training performance on a specific batch size, run:
@@ -517,37 +534,34 @@ Each of these scripts runs for 10 epochs and for each epoch measures the
 average number of items per second. The performance results can be read from
 the `nvlog.json` files produced by the commands.

-### Inference performance benchmark
+#### Inference performance benchmark

 To benchmark the inference performance on a batch size of 1, run:

-* For FP32
-  ```bash
-  python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --log-file=output/nvlog_fp32.json
-  ```
 * For FP16
   ```bash
-  python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --amp-run --log-file=output/nvlog_fp16.json
+  python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase_1_64.txt --amp-run --log-file=output/nvlog_fp16.json
   ```
+* For FP32
+  ```bash
+  python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase_1_64.txt --log-file=output/nvlog_fp32.json
+  ```

-The log files contain performance numbers for the Tacotron 2 model
-(number of input tokens per second, reported as `tacotron2_items_per_sec`)
-and for WaveGlow (number of output samples per second, reported as
-`waveglow_items_per_sec`).
+The output log files will contain performance numbers for the Tacotron 2 model
+(number of output mel-spectrograms per second, reported as `tacotron2_items_per_sec`)
+and for WaveGlow (number of output samples per second, reported as `waveglow_items_per_sec`).
+The `inference.py` script will run a few warmup iterations before running the benchmark.

-## Results
+### Results

 The following sections provide details on how we achieved our performance
 and accuracy in training and inference.

-### Training accuracy results
+#### Training accuracy results

 ##### NVIDIA DGX-1 (8x V100 16G)

-Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{FP16,FP32}_DGX1_16GB_8GPU.sh` training script in the PyTorch-19.04-py3
+Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{FP16,FP32}_DGX1_16GB_8GPU.sh` training script in the PyTorch-19.05-py3
 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.

 All of the results were produced using the `train.py` script as described in the
@@ -573,87 +587,84 @@ WaveGlow FP32 loss - batch size 4 (mean and std over 16 runs)
 ![](./img/waveglow_fp32_loss.png "WaveGlow FP32 loss")

-### Training performance results
+#### Training performance results

 ##### NVIDIA DGX-1 (8x V100 16G)

 Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{FP16,FP32}_DGX1_16GB_8GPU.sh`
-training script in the PyTorch-19.04-py3 NGC container on NVIDIA DGX-1 with
-8x V100 16G GPUs. Performance numbers (in input tokens per second for
+training script in the PyTorch-19.05-py3 NGC container on NVIDIA DGX-1 with
+8x V100 16G GPUs. Performance numbers (in output mel-spectrograms per second for
 Tacotron 2 and output samples per second for WaveGlow) were averaged over
 an entire training epoch.

 This table shows the results for Tacotron 2:

-|Number of GPUs|Batch size per GPU|Number of tokens used with mixed precision|Number of tokens used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
+|Number of GPUs|Batch size per GPU|Number of mels used with mixed precision|Number of mels used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
 |---:|---:|---:|---:|---:|---:|---:|
-|1|128@FP16, 64@FP32 | 3,746 | 2,087 | 1.79 | 1.00 | 1.00 |
-|4|128@FP16, 64@FP32 | 13,264 | 8,052 | 1.65 | 3.54 | 3.86 |
-|8|128@FP16, 64@FP32 | 25,056 | 15,863 | 1.58 | 6.69 | 7.60 |
+|1|128@FP16, 64@FP32 | 20,992 | 12,933 | 1.62 | 1.00 | 1.00 |
+|4|128@FP16, 64@FP32 | 74,989 | 46,115 | 1.63 | 3.57 | 3.57 |
+|8|128@FP16, 64@FP32 | 140,060 | 88,719 | 1.58 | 6.67 | 6.86 |

 The following table shows the results for WaveGlow:

 |Number of GPUs|Batch size per GPU|Number of samples used with mixed precision|Number of samples used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
 |---:|---:|---:|---:|---:|---:|---:|
-|1| 10@FP16, 4@FP32 | 79,249 | 35,696 | 2.22 | 1.00 | 1.00 |
-|4| 10@FP16, 4@FP32 | 275,310 | 126,498 | 2.18 | 3.47 | 3.54 |
-|8| 10@FP16, 4@FP32 | 576,709 | 255,155 | 2.26 | 7.28 | 7.15 |
+|1| 10@FP16, 4@FP32 | 81,503 | 36,671 | 2.22 | 1.00 | 1.00 |
+|4| 10@FP16, 4@FP32 | 275,803 | 124,504 | 2.22 | 3.38 | 3.40 |
+|8| 10@FP16, 4@FP32 | 583,887 | 264,903 | 2.20 | 7.16 | 7.22 |

 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

-#### Expected training time
+##### Expected training time

 The following table shows the expected training time for convergence for Tacotron 2 (1500 epochs):

 |Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
 |---:|---:|---:|---:|---:|
-|1| 128@FP16, 64@FP32 | 137.33 | 227.66 | 1.66 |
-|4| 128@FP16, 64@FP32 | 40.68 | 63.99 | 1.57 |
-|8| 128@FP16, 64@FP32 | 20.74 | 32.47 | 1.57 |
+|1| 128@FP16, 64@FP32 | 153 | 234 | 1.53 |
+|4| 128@FP16, 64@FP32 | 42 | 64 | 1.54 |
+|8| 128@FP16, 64@FP32 | 22 | 33 | 1.52 |

 The following table shows the expected training time for convergence for WaveGlow (1000 epochs):

 |Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
 |---:|---:|---:|---:|---:|
-|1| 10@FP16, 4@FP32 | 358.00 | 793.97 | 2.22 |
-|4| 10@FP16, 4@FP32 | 103.10 | 223.59 | 2.17 |
-|8| 10@FP16, 4@FP32 | 50.40 | 109.45 | 2.17 |
+|1| 10@FP16, 4@FP32 | 347 | 768 | 2.21 |
+|4| 10@FP16, 4@FP32 | 106 | 231 | 2.18 |
+|8| 10@FP16, 4@FP32 | 49 | 105 | 2.16 |

-### Inference performance results
+#### Inference performance results

 ##### NVIDIA DGX-1 (8x V100 16G)

-Our results were obtained by running the `./inference.py` inference script in the
-PyTorch-18.12.1-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
-Performance numbers (in input tokens per second for Tacotron 2 and output
-samples per second for WaveGlow) were averaged over 16 runs.
+Our results were obtained by running the `./inference.py` inference script in
+the PyTorch-19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
+Performance numbers (in output mel-spectrograms per second for Tacotron 2 and
+output samples per second for WaveGlow) were averaged over 16 runs.

-The following table shows the inference performance results for Tacotron 2.
-Results are measured in the number of input tokens per second.
+The following table shows the inference performance results for the Tacotron 2 model.
+Results are measured in the number of output mel-spectrograms per second.

-|Number of GPUs|Number of tokens used with mixed precision|Number of tokens used with FP32|Speed-up with mixed precision|
+|Number of GPUs|Number of mels used with mixed precision|Number of mels used with FP32|Speed-up with mixed precision|
 |---:|---:|---:|---:|
-|1|168|173|0.97|
+|**1**|637|619|1.03|

-The following table shows the inference performance results for WaveGlow.
-Results are measured in the number of output audio samples per second.<sup>1</sup>
+The following table shows the inference performance results for the WaveGlow model.
+Results are measured in the number of output samples per second<sup>1</sup>.

 |Number of GPUs|Number of samples used with mixed precision|Number of samples used with FP32|Speed-up with mixed precision|
 |---:|---:|---:|---:|
-|1|583318|553380|1.05|
+|**1**|565629|578322|0.98|

 <sup>1</sup>With a sampling rate of 22050 Hz, one second of audio is generated from 22050 samples.

 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

-## Changelog
+## Release notes
+
+### Changelog

 March 2019
 * Initial release
@@ -662,5 +673,12 @@ June 2019
 * Data preprocessing for Tacotron 2 training
 * Fixed dropouts on LSTMCells

-## Known issues
+July 2019
+* Changed measurement units for Tacotron 2 training and inference performance
+  benchmarks from input tokens per second to output mel-spectrograms per second
+* Introduced batched inference
+* Included warmup in the inference script
+
+### Known issues

 There are no known issues in this release.
@@ -27,7 +27,6 @@

 import torch
 from librosa.filters import mel as librosa_mel_fn
-from torch.nn import functional as F
 from common.audio_processing import dynamic_range_compression, dynamic_range_decompression
 from common.stft import STFT
@@ -33,7 +33,6 @@ import numpy as np
 from scipy.io.wavfile import write

 import sys
-sys.path.append('waveglow/')

 import time
 from dllogger.logger import LOGGER
@@ -46,19 +45,15 @@ def parse_args(parser):
     """
     Parse commandline arguments.
     """
-    parser.add_argument('-i', '--input', type=str, default="",
+    parser.add_argument('-i', '--input', type=str, required=True,
                         help='full path to the input text (phrases separated by new lines)')
     parser.add_argument('-o', '--output', required=True,
                         help='output folder to save audio (file per phrase)')
-    parser.add_argument('--tacotron2', type=str, default="",
+    parser.add_argument('--tacotron2', type=str,
                         help='full path to the Tacotron2 model checkpoint file')
-    parser.add_argument('--mel-file', type=str, default="",
-                        help='set if using mel spectrograms instead of Tacotron2 model')
-    parser.add_argument('--waveglow', required=True,
+    parser.add_argument('--waveglow', type=str,
                         help='full path to the WaveGlow model checkpoint file')
-    parser.add_argument('--old-waveglow', action='store_true',
-                        help='set if WaveGlow checkpoint is from GitHub.com/NVIDIA/waveglow')
     parser.add_argument('-s', '--sigma-infer', default=0.6, type=float)
     parser.add_argument('-sr', '--sampling-rate', default=22050, type=int,
                         help='Sampling rate')
@@ -66,6 +61,10 @@ def parse_args(parser):
                         help='inference with AMP')
     parser.add_argument('--log-file', type=str, default='nvlog.json',
                         help='Filename for logging')
+    parser.add_argument('--include-warmup', action='store_true',
+                        help='Include warmup')
+    parser.add_argument('--stft-hop-length', type=int, default=256,
+                        help='STFT hop length for estimating audio length from mel size')

     return parser
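The new `--stft-hop-length` option is what ties mel frames back to audio samples when trimming padded output; a quick sanity check of the arithmetic (example numbers only):

```python
hop_length = 256          # samples advanced per mel frame (script default)
sampling_rate = 22050     # Hz (script default)
mel_frames = 620          # example Tacotron 2 output length

audio_samples = mel_frames * hop_length       # 158720 samples
seconds = audio_samples / sampling_rate       # ~7.2 seconds of audio
```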
@@ -135,6 +134,41 @@ def load_and_setup_model(model_name, parser, checkpoint, amp_run):
     return model


+# taken from tacotron2/data_function.py:TextMelCollate.__call__
+def pad_sequences(batch):
+    # Right zero-pad all one-hot text sequences to max input length
+    input_lengths, ids_sorted_decreasing = torch.sort(
+        torch.LongTensor([len(x) for x in batch]),
+        dim=0, descending=True)
+    max_input_len = input_lengths[0]
+
+    text_padded = torch.LongTensor(len(batch), max_input_len)
+    text_padded.zero_()
+    for i in range(len(ids_sorted_decreasing)):
+        text = batch[ids_sorted_decreasing[i]]
+        text_padded[i, :text.size(0)] = text
+
+    return text_padded, input_lengths
+
+
+def prepare_input_sequence(texts):
+    # encode each phrase as a sequence of symbol IDs, then pad into one batch
+    d = []
+    for text in texts:
+        d.append(torch.IntTensor(
+            text_to_sequence(text, ['english_cleaners'])[:]))
+
+    text_padded, input_lengths = pad_sequences(d)
+    if torch.cuda.is_available():
+        text_padded = torch.autograd.Variable(text_padded).cuda().long()
+        input_lengths = torch.autograd.Variable(input_lengths).cuda().long()
+    else:
+        text_padded = torch.autograd.Variable(text_padded).long()
+        input_lengths = torch.autograd.Variable(input_lengths).long()
+
+    return text_padded, input_lengths
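A minimal usage sketch for the two helpers above (example phrases; note that `pad_sequences` re-sorts the batch by decreasing length, so outputs follow the sorted order):

```python
texts = ["The quick brown fox jumps over the lazy dog.", "Hello world."]
text_padded, input_lengths = prepare_input_sequence(texts)
# text_padded:   LongTensor of shape [2, max_len] with zero-padded symbol IDs
# input_lengths: LongTensor of shape [2], lengths in decreasing order
```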

 def main():
     """
     Launches text to speech (inference).
@@ -153,70 +187,65 @@ def main():
                    logging_scope=dllg.TRAIN_ITER_SCOPE, iteration_interval=1)
     ])
     LOGGER.register_metric("tacotron2_items_per_sec", metric_scope=dllg.TRAIN_ITER_SCOPE)
+    LOGGER.register_metric("tacotron2_latency", metric_scope=dllg.TRAIN_ITER_SCOPE)
     LOGGER.register_metric("waveglow_items_per_sec", metric_scope=dllg.TRAIN_ITER_SCOPE)
+    LOGGER.register_metric("waveglow_latency", metric_scope=dllg.TRAIN_ITER_SCOPE)
+    LOGGER.register_metric("latency", metric_scope=dllg.TRAIN_ITER_SCOPE)

     log_hardware()
     log_args(args)

-    # tacotron2 model filepath was specified
-    if args.tacotron2:
-        # Setup Tacotron2
-        tacotron2 = load_and_setup_model('Tacotron2', parser, args.tacotron2,
-                                         args.amp_run)
-    # file with mel spectrogram was specified
-    elif args.mel_file:
-        mel = torch.load(args.mel_file)
-        mel = torch.autograd.Variable(mel.cuda())
-        mel = torch.unsqueeze(mel, 0)
-
-    # Setup WaveGlow
-    if args.old_waveglow:
-        waveglow = torch.load(args.waveglow)['model']
-        waveglow = waveglow.remove_weightnorm(waveglow)
-        waveglow = waveglow.cuda()
-        waveglow.eval()
-    else:
-        waveglow = load_and_setup_model('WaveGlow', parser, args.waveglow,
-                                        args.amp_run)
+    tacotron2 = load_and_setup_model('Tacotron2', parser, args.tacotron2,
+                                     args.amp_run)
+    waveglow = load_and_setup_model('WaveGlow', parser, args.waveglow,
+                                    args.amp_run)

     texts = []
     try:
         f = open(args.input, 'r')
         texts = f.readlines()
     except:
-        print("Could not read file. Using default text.")
-        texts = ["The forms of printed letters should be beautiful, and\
-            that their arrangement on the page should be reasonable and\
-            a help to the shapeliness of the letters themselves."]
+        print("Could not read file")
+        sys.exit(1)

-    for i, text in enumerate(texts):
-
-        LOGGER.iteration_start()
-
-        sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
-        sequence = torch.autograd.Variable(
-            torch.from_numpy(sequence)).cuda().long()
-
-        if args.tacotron2:
-            tacotron2_t0 = time.time()
-            with torch.no_grad():
-                _, mel, _, _ = tacotron2.infer(sequence)
-            tacotron2_t1 = time.time()
-            tacotron2_infer_perf = sequence.size(1)/(tacotron2_t1-tacotron2_t0)
-            LOGGER.log(key="tacotron2_items_per_sec", value=tacotron2_infer_perf)
-
-        waveglow_t0 = time.time()
-        with torch.no_grad():
-            audio = waveglow.infer(mel, sigma=args.sigma_infer)
-            audio = audio.float()
-        waveglow_t1 = time.time()
-        waveglow_infer_perf = audio[0].size(0)/(waveglow_t1-waveglow_t0)
+    # warm up both models on dummy input before timing real inference
+    if args.include_warmup:
+        sequence = torch.randint(low=0, high=148, size=(1,50),
+                                 dtype=torch.long).cuda()
+        input_lengths = torch.IntTensor([sequence.size(1)]).cuda().long()
+        for i in range(3):
+            with torch.no_grad():
+                _, mel, _, _, mel_lengths = tacotron2.infer(sequence, input_lengths)
+                _ = waveglow.infer(mel)

+    LOGGER.iteration_start()
+
+    sequences_padded, input_lengths = prepare_input_sequence(texts)
+
+    tacotron2_t0 = time.time()
+    with torch.no_grad():
+        _, mel, _, _, mel_lengths = tacotron2.infer(sequences_padded, input_lengths)
+    tacotron2_t1 = time.time()
+
+    waveglow_t0 = time.time()
+    with torch.no_grad():
+        audios = waveglow.infer(mel, sigma=args.sigma_infer)
+        audios = audios.float()
+    waveglow_t1 = time.time()
+
+    # items/sec counts mel frames for Tacotron 2 and audio samples for WaveGlow
+    tacotron2_infer_perf = mel.size(0)*mel.size(2)/(tacotron2_t1-tacotron2_t0)
+    LOGGER.log(key="tacotron2_items_per_sec", value=tacotron2_infer_perf)
+    LOGGER.log(key="tacotron2_latency", value=(tacotron2_t1-tacotron2_t0))
+    waveglow_infer_perf = audios.size(0)*audios.size(1)/(waveglow_t1-waveglow_t0)
+    LOGGER.log(key="waveglow_items_per_sec", value=waveglow_infer_perf)
+    LOGGER.log(key="waveglow_latency", value=(waveglow_t1-waveglow_t0))
+    LOGGER.log(key="latency", value=(waveglow_t1-tacotron2_t0))
+
+    for i, audio in enumerate(audios):
         audio_path = args.output + "audio_"+str(i)+".wav"
-        write(audio_path, args.sampling_rate, audio[0].data.cpu().numpy())
+        # trim padding: each mel frame corresponds to stft_hop_length samples
+        write(audio_path, args.sampling_rate,
+              audio.data.cpu().numpy()[:mel_lengths[i]*args.stft_hop_length])

-        LOGGER.log(key="waveglow_items_per_sec", value=waveglow_infer_perf)
-        LOGGER.iteration_stop()
+    LOGGER.iteration_stop()

     LOGGER.finish()
@@ -25,12 +25,10 @@
 #
 # *****************************************************************************

-from tacotron2.text import text_to_sequence
 import models
 import torch
 import argparse
 import numpy as np
 from scipy.io.wavfile import write
-import json
 import time
@@ -49,43 +47,17 @@ def parse_args(parser):
     """
     parser.add_argument('-m', '--model-name', type=str, default='', required=True,
                         help='Model to train')
-    parser.add_argument('--input-text', type=str, default=None,
-                        help='Path to tensor containing text (when running Tacotron2)')
-    parser.add_argument('--input-mels', type=str, default=None,
-                        help='Path to tensor containing mels (when running WaveGlow)')
     parser.add_argument('-sr', '--sampling-rate', default=22050, type=int,
                         help='Sampling rate')
     parser.add_argument('--amp-run', action='store_true',
                         help='inference with AMP')
     parser.add_argument('-bs', '--batch-size', type=int, default=1)
-    parser.add_argument('--create-benchmark', action='store_true')
     parser.add_argument('--log-file', type=str, default='nvlog.json',
                         help='Filename for logging')

     return parser


-def collate_text(batch):
-    """Collates a training batch from normalized text and mel-spectrogram
-    PARAMS
-    ------
-    batch: [text_normalized]
-    """
-    # Right zero-pad all one-hot text sequences to max input length
-    input_lengths, ids_sorted_decreasing = torch.sort(
-        torch.LongTensor([len(x) for x in batch]),
-        dim=0, descending=True)
-    max_input_len = input_lengths[0]
-
-    text_padded = torch.LongTensor(len(batch), max_input_len)
-    text_padded.zero_()
-    for i in range(len(ids_sorted_decreasing)):
-        text = batch[ids_sorted_decreasing[i]]
-        text_padded[i, :text.size(0)] = text
-
-    return text_padded, input_lengths


 def main():
     """
     Launches inference benchmark.
@@ -107,55 +79,36 @@ def main():
     ])
     LOGGER.register_metric("items_per_sec",
                            metric_scope=dllg.TRAIN_ITER_SCOPE)
+    LOGGER.register_metric("latency",
+                           metric_scope=dllg.TRAIN_ITER_SCOPE)

     log_hardware()
     log_args(args)

-    # ### uncomment to generate new padded text
-    # texts = []
-    # f = open('qa/ljs_text_train_subset_2500.txt', 'r')
-    # texts = f.readlines()
-    # sequence = []
-    # for i, text in enumerate(texts):
-    #     sequence.append(torch.IntTensor(text_to_sequence(text, ['english_cleaners'])))
-
-    # text_padded, input_lengths = collate_text(sequence)
-    # text_padded = torch.autograd.Variable(text_padded).cuda().long()
-    # torch.save(text_padded, "qa/text_padded.pt")
-    # torch.save(input_lengths, "qa/input_lengths.pt")
-
-    model = load_and_setup_model(args.model_name, parser, None,
-                                 args.amp_run)
+    model = load_and_setup_model(args.model_name, parser, None, args.amp_run)

-    dry_runs = 3
-    num_iters = (16+dry_runs) if args.create_benchmark else (1+dry_runs)
+    warmup_iters = 3
+    num_iters = 1+warmup_iters

     for i in range(num_iters):
-        ## skipping the first inference which is slower
-        if i >= dry_runs:
+        # skip the warmup iterations, which are slower
+        if i >= warmup_iters:
             LOGGER.iteration_start()

         if args.model_name == 'Tacotron2':
-            text_padded = torch.load(args.input_text)
-            text_padded = text_padded[:args.batch_size]
-            text_padded = torch.autograd.Variable(text_padded).cuda().long()
+            # benchmark on random symbol IDs of fixed length 140
+            text_padded = torch.randint(low=0, high=148, size=(args.batch_size, 140),
+                                        dtype=torch.long).cuda()
+            input_lengths = torch.IntTensor([text_padded.size(1)]*args.batch_size).cuda().long()
             t0 = time.time()
             with torch.no_grad():
-                _, mels, _, _ = model.infer(text_padded)
-            t1 = time.time()
-            inference_time = t1 - t0
-            num_items = text_padded.size(0)*text_padded.size(1)
-
-            # # ## uncomment to generate new padded mels
-            # torch.save(mels, "qa/mel_padded.pt")
+                _, mels, _, _, _ = model.infer(text_padded, input_lengths)
+            inference_time = time.time() - t0
+            num_items = mels.size(0)*mels.size(2)

         if args.model_name == 'WaveGlow':
-            mel_padded = torch.load(args.input_mels)
-            mel_padded = torch.cat((mel_padded, mel_padded, mel_padded, mel_padded))
-            mel_padded = mel_padded[:args.batch_size]
-            mel_padded = mel_padded.cuda()
+            n_mel_channels = model.upsample.in_channels
+            num_mels = 895
+            # random mel input with fixed mean/std as a stand-in for real spectrograms
+            mel_padded = torch.zeros(args.batch_size, n_mel_channels,
+                                     num_mels).normal_(-5.62, 1.98).cuda()
             if args.amp_run:
                 mel_padded = mel_padded.half()
@@ -163,12 +116,12 @@ def main():
             with torch.no_grad():
                 audios = model.infer(mel_padded)
                 audios = audios.float()
-            t1 = time.time()
-            inference_time = t1 - t0
+            inference_time = time.time() - t0
             num_items = audios.size(0)*audios.size(1)

-        if i >= dry_runs:
+        if i >= warmup_iters:
             LOGGER.log(key="items_per_sec", value=(num_items/inference_time))
+            LOGGER.log(key="latency", value=inference_time)
             LOGGER.iteration_stop()

     LOGGER.finish()
@@ -25,24 +25,28 @@
 #
 # *****************************************************************************

+import sys
+from os.path import abspath, dirname
+# enabling modules discovery from global entrypoint
+sys.path.append(abspath(dirname(__file__)+'/'))
 from tacotron2.model import Tacotron2
 from waveglow.model import WaveGlow
-from tacotron2.arg_parser import parse_tacotron2_args
-from waveglow.arg_parser import parse_waveglow_args
 import torch


 def parse_model_args(model_name, parser, add_help=False):
     if model_name == 'Tacotron2':
+        from tacotron2.arg_parser import parse_tacotron2_args
         return parse_tacotron2_args(parser, add_help)
     if model_name == 'WaveGlow':
+        from waveglow.arg_parser import parse_waveglow_args
         return parse_waveglow_args(parser, add_help)
     else:
         raise NotImplementedError(model_name)


 def batchnorm_to_float(module):
-    """Converts batch norm modules to FP32"""
+    """Converts batch norm to FP32"""
     if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
         module.float()
     for child in module.children():
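A hypothetical usage sketch for `parse_model_args` (not part of the commit):

```python
import argparse

parser = argparse.ArgumentParser(description='PyTorch Tacotron 2 example',
                                 add_help=False)
parser = parse_model_args('Tacotron2', parser)  # adds the Tacotron 2 options
args, unknown = parser.parse_known_args()
```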
@@ -0,0 +1,2 @@
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.

@@ -0,0 +1 @@
+She sells seashells by the seashore, shells she sells are great

@@ -0,0 +1,4 @@
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.

@@ -0,0 +1,4 @@
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great

@@ -0,0 +1,8 @@
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
+The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.

@@ -0,0 +1,8 @@
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
+She sells seashells by the seashore, shells she sells are great
@ -1,2 +0,0 @@
|
|||
mkdir -p output
|
||||
python train.py -m Tacotron2 -o output/ --fp16-run -lr 1e-3 --epochs 2001 -bs 128 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.3
|
|
@ -1,2 +0,0 @@
|
|||
mkdir -p output
|
||||
python -m multiproc train.py -m Tacotron2 -o output/ --fp16-run -lr 1e-3 --epochs 2001 -bs 128 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.3
|
|
@ -1,2 +0,0 @@
|
|||
mkdir -p output
|
||||
python -m multiproc train.py -m Tacotron2 -o output/ --fp16-run -lr 1e-3 --epochs 2001 -bs 128 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.3
|
|
@ -1,2 +0,0 @@
|
|||
mkdir -p output
|
||||
python train.py -m WaveGlow -o output/ --fp16-run -lr 1e-4 --epochs 2001 -bs 10 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-benchmark --cudnn-enabled --log-file output/nvlog.json
|
|
@ -1,2 +0,0 @@
|
|||
mkdir -p output
|
||||
python -m multiproc train.py -m WaveGlow -o output/ --fp16-run -lr 1e-4 --epochs 2001 -bs 10 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-benchmark --cudnn-enabled --log-file output/nvlog.json
|
|
@ -1,2 +0,0 @@
|
|||
mkdir -p output
|
||||
python -m multiproc train.py -m WaveGlow -o output/ --fp16-run -lr 1e-4 --epochs 2001 -bs 10 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-benchmark --cudnn-enabled --log-file output/nvlog.json
|
|
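The six script hunks above share one launch pattern: `mkdir -p output`, then either a single-GPU `python train.py` run or a multi-GPU `python -m multiproc train.py` run, both in mixed precision (`--fp16-run`). The Tacotron 2 commands additionally request a stepwise learning-rate anneal. A hedged sketch of the schedule the `--anneal-steps`/`--anneal-factor` flags suggest; the actual `train.py` logic may differ in detail:

```python
# Illustrative only: stepwise decay implied by
# --anneal-steps 500 1000 1500 --anneal-factor 0.3.
def annealed_lr(base_lr, epoch, anneal_steps=(500, 1000, 1500), anneal_factor=0.3):
    passed = sum(1 for step in anneal_steps if epoch >= step)  # boundaries crossed so far
    return base_lr * (anneal_factor ** passed)

annealed_lr(1e-3, 250)   # 1e-3 (no boundary crossed yet)
annealed_lr(1e-3, 1200)  # 1e-3 * 0.3**2 = 9e-5
```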
@ -69,7 +69,7 @@ def parse_tacotron2_args(parent, add_help=False):
|
|||
help='Number of units in decoder LSTM')
|
||||
decoder.add_argument('--prenet-dim', default=256, type=int,
|
||||
help='Number of ReLU units in prenet layers')
|
||||
decoder.add_argument('--max-decoder-steps', default=1000, type=int,
|
||||
decoder.add_argument('--max-decoder-steps', default=2000, type=int,
|
||||
help='Maximum number of output mel spectrograms')
|
||||
decoder.add_argument('--gate-threshold', default=0.5, type=float,
|
||||
help='Probability threshold for stop token')
|
||||
|
|
|
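Doubling `--max-decoder-steps` from 1000 to 2000 raises the hard cap on generated mel frames; together with `--gate-threshold` it bounds the autoregressive decoding loop. A minimal sketch of the stop rule these two flags control, for a single item (names are illustrative, not the repo's exact API):

```python
import torch

def should_stop(gate_logit, step, gate_threshold=0.5, max_decoder_steps=2000):
    # Stop once the stop-token probability crosses the threshold,
    # or once the frame budget is exhausted.
    return torch.sigmoid(gate_logit).item() > gate_threshold or step >= max_decoder_steps
```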
@ -151,5 +151,5 @@ def batch_to_gpu(batch):
|
|||
output_lengths = to_gpu(output_lengths).long()
|
||||
x = (text_padded, input_lengths, mel_padded, max_len, output_lengths)
|
||||
y = (mel_padded, gate_padded)
|
||||
len_x = torch.sum(input_lengths)
|
||||
len_x = torch.sum(output_lengths)
|
||||
return (x, y, len_x)
|
||||
|
|
|
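`batch_to_gpu` now reports `len_x` as the total number of target mel frames (`output_lengths`) rather than input text tokens, which presumably makes per-second throughput reflect decoder output. Illustrative values only:

```python
import torch

input_lengths = torch.tensor([37, 52])     # text tokens per item (old basis)
output_lengths = torch.tensor([410, 566])  # mel frames per item (new basis)
len_x = torch.sum(output_lengths)          # tensor(976) -- frames, not tokens
```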
@ -30,6 +30,10 @@ import torch
|
|||
from torch.autograd import Variable
|
||||
from torch import nn
|
||||
from torch.nn import functional as F
|
||||
import sys
|
||||
from os.path import abspath, dirname
|
||||
# enable module discovery from the global entry point
|
||||
sys.path.append(abspath(dirname(__file__)+'/../'))
|
||||
from common.layers import ConvNorm, LinearNorm
|
||||
from common.utils import to_gpu, get_mask_from_lengths
|
||||
|
||||
|
@ -122,9 +126,17 @@ class Prenet(nn.Module):
|
|||
[LinearNorm(in_size, out_size, bias=False)
|
||||
for (in_size, out_size) in zip(in_sizes, sizes)])
|
||||
|
||||
def forward(self, x):
|
||||
for linear in self.layers:
|
||||
x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
|
||||
def forward(self, x, inference=False):
|
||||
if inference:
|
||||
for linear in self.layers:
|
||||
x = F.relu(linear(x))
|
||||
x0 = x[0].unsqueeze(0)
|
||||
mask = Variable(torch.bernoulli(x0.data.new(x0.data.size()).fill_(0.5)))
|
||||
mask = mask.expand(x.size(0), x.size(1))
|
||||
x = x*mask*2
|
||||
else:
|
||||
for linear in self.layers:
|
||||
x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
|
||||
return x
|
||||
|
||||
|
||||
|
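The new `inference=True` path keeps the prenet stochastic at inference, as the Tacotron 2 recipe requires, but samples one Bernoulli mask from the first batch element and shares it across the whole batch, so every item in a batched run sees the same dropout pattern. A minimal sketch of the idea, assuming inverted dropout with p = 0.5:

```python
import torch

def shared_inference_dropout(x, p=0.5):
    # One keep-mask sampled from the first batch element, broadcast to all,
    # scaled by 1/(1-p) so the expected activation matches training.
    keep = torch.bernoulli(torch.full_like(x[0:1], 1.0 - p))
    return x * keep.expand_as(x) / (1.0 - p)
```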
@ -457,7 +469,7 @@ class Decoder(nn.Module):
|
|||
return mel_outputs, gate_outputs, alignments
|
||||
|
||||
|
||||
def infer(self, memory):
|
||||
def infer(self, memory, memory_lengths):
|
||||
""" Decoder inference
|
||||
PARAMS
|
||||
------
|
||||
|
@ -471,13 +483,23 @@ class Decoder(nn.Module):
|
|||
"""
|
||||
decoder_input = self.get_go_frame(memory)
|
||||
|
||||
self.initialize_decoder_states(memory, mask=None)
|
||||
if memory.size(0) > 1:
|
||||
mask = ~get_mask_from_lengths(memory_lengths)
|
||||
else:
|
||||
mask = None
|
||||
|
||||
self.initialize_decoder_states(memory, mask=mask)
|
||||
|
||||
mel_lengths = torch.zeros([memory.size(0)], dtype=torch.int32)
|
||||
not_finished = torch.ones([memory.size(0)], dtype=torch.int32)
|
||||
if torch.cuda.is_available():
|
||||
mel_lengths = mel_lengths.cuda()
|
||||
not_finished = not_finished.cuda()
|
||||
|
||||
|
||||
mel_lengths = torch.zeros([memory.size(0)], dtype=torch.int32).cuda()
|
||||
not_finished = torch.ones([memory.size(0)], dtype=torch.int32).cuda()
|
||||
mel_outputs, gate_outputs, alignments = [], [], []
|
||||
while True:
|
||||
decoder_input = self.prenet(decoder_input)
|
||||
decoder_input = self.prenet(decoder_input, inference=True)
|
||||
mel_output, gate_output, alignment = self.decode(decoder_input)
|
||||
|
||||
mel_outputs += [mel_output.squeeze(1)]
|
||||
|
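`Decoder.infer` now builds a padding mask from `memory_lengths` for batched inputs and tracks per-item `mel_lengths` with a `not_finished` counter, so items that emit their stop token early keep their true length while the loop continues for the rest. A hedged sketch of that bookkeeping, assuming sigmoid gate outputs (names are illustrative):

```python
import torch

def update_lengths(gate_output, not_finished, mel_lengths, gate_threshold=0.5):
    # An item stays active (1) until its stop-token probability crosses
    # the threshold; only still-active items grow by one frame per step.
    still_going = (torch.sigmoid(gate_output) <= gate_threshold).to(torch.int32)
    not_finished = not_finished * still_going
    mel_lengths = mel_lengths + not_finished
    return not_finished, mel_lengths
```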
@ -500,7 +522,7 @@ class Decoder(nn.Module):
|
|||
mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs(
|
||||
mel_outputs, gate_outputs, alignments)
|
||||
|
||||
return mel_outputs, gate_outputs, alignments
|
||||
return mel_outputs, gate_outputs, alignments, mel_lengths
|
||||
|
||||
|
||||
class Tacotron2(nn.Module):
|
||||
|
@ -552,9 +574,6 @@ class Tacotron2(nn.Module):
|
|||
(text_padded, input_lengths, mel_padded, max_len, output_lengths),
|
||||
(mel_padded, gate_padded))
|
||||
|
||||
def parse_input(self, inputs):
|
||||
return inputs
|
||||
|
||||
def parse_output(self, outputs, output_lengths=None):
|
||||
if self.mask_padding and output_lengths is not None:
|
||||
mask = ~get_mask_from_lengths(output_lengths)
|
||||
|
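With `parse_input` gone, output masking hinges entirely on `get_mask_from_lengths`. A common implementation of that length-to-mask helper, shown here as an assumption about its contract (True inside each sequence, so `~mask` marks padding):

```python
import torch

def get_mask_from_lengths(lengths):
    max_len = int(torch.max(lengths).item())
    ids = torch.arange(max_len, device=lengths.device)
    return ids < lengths.unsqueeze(1)  # (batch, max_len), True = valid frame

get_mask_from_lengths(torch.tensor([2, 4]))
# tensor([[ True,  True, False, False],
#         [ True,  True,  True,  True]])
```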
@ -568,8 +587,7 @@ class Tacotron2(nn.Module):
|
|||
return outputs
|
||||
|
||||
def forward(self, inputs):
|
||||
inputs, input_lengths, targets, max_len, \
|
||||
output_lengths = self.parse_input(inputs)
|
||||
inputs, input_lengths, targets, max_len, output_lengths = inputs
|
||||
input_lengths, output_lengths = input_lengths.data, output_lengths.data
|
||||
|
||||
embedded_inputs = self.embedding(inputs).transpose(1, 2)
|
||||
|
@ -586,17 +604,17 @@ class Tacotron2(nn.Module):
|
|||
[mel_outputs, mel_outputs_postnet, gate_outputs, alignments],
|
||||
output_lengths)
|
||||
|
||||
def infer(self, inputs):
|
||||
inputs = self.parse_input(inputs)
|
||||
def infer(self, inputs, input_lengths):
|
||||
|
||||
embedded_inputs = self.embedding(inputs).transpose(1, 2)
|
||||
encoder_outputs = self.encoder.infer(embedded_inputs)
|
||||
mel_outputs, gate_outputs, alignments = self.decoder.infer(
|
||||
encoder_outputs)
|
||||
encoder_outputs = self.encoder(embedded_inputs, input_lengths)
|
||||
mel_outputs, gate_outputs, alignments, mel_lengths = self.decoder.infer(
|
||||
encoder_outputs, input_lengths)
|
||||
|
||||
mel_outputs_postnet = self.postnet(mel_outputs)
|
||||
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
|
||||
|
||||
outputs = self.parse_output(
|
||||
[mel_outputs, mel_outputs_postnet, gate_outputs, alignments])
|
||||
[mel_outputs, mel_outputs_postnet, gate_outputs, alignments, mel_lengths])
|
||||
|
||||
return outputs
|
||||
|
|
|
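`Tacotron2.infer` now takes `input_lengths`, reuses the regular encoder forward, and returns `mel_lengths` alongside the usual outputs. A hypothetical single-utterance call; the symbol range and `model` are assumptions, not the repo's API surface:

```python
import torch

# `model` is assumed to be a trained Tacotron2 instance already on the GPU;
# 148 is an assumed symbol-table size for the dummy encoded text.
sequence = torch.randint(1, 148, (1, 120), device='cuda').long()
lengths = torch.tensor([sequence.size(1)], device='cuda')
with torch.no_grad():
    mel, mel_postnet, gates, alignments, mel_lengths = model.infer(sequence, lengths)
```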
@ -1,311 +0,0 @@
|
|||
# *****************************************************************************
|
||||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
# * Redistributions of source code must retain the above copyright
|
||||
# notice, this list of conditions and the following disclaimer.
|
||||
# * Redistributions in binary form must reproduce the above copyright
|
||||
# notice, this list of conditions and the following disclaimer in the
|
||||
# documentation and/or other materials provided with the distribution.
|
||||
# * Neither the name of the NVIDIA CORPORATION nor the
|
||||
# names of its contributors may be used to endorse or promote products
|
||||
# derived from this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
|
||||
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
# *****************************************************************************
|
||||
|
||||
import copy
|
||||
import torch
|
||||
from torch.autograd import Variable
|
||||
import torch.nn.functional as F
|
||||
|
||||
|
||||
@torch.jit.script
|
||||
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
|
||||
n_channels_int = n_channels[0]
|
||||
in_act = input_a+input_b
|
||||
t_act = torch.nn.functional.tanh(in_act[:, :n_channels_int, :])
|
||||
s_act = torch.nn.functional.sigmoid(in_act[:, n_channels_int:, :])
|
||||
acts = t_act * s_act
|
||||
return acts
|
||||
|
||||
|
||||
class WaveGlowLoss(torch.nn.Module):
|
||||
def __init__(self, sigma=1.0):
|
||||
super(WaveGlowLoss, self).__init__()
|
||||
self.sigma = sigma
|
||||
|
||||
def forward(self, model_output):
|
||||
z, log_s_list, log_det_W_list = model_output
|
||||
for i, log_s in enumerate(log_s_list):
|
||||
if i == 0:
|
||||
log_s_total = torch.sum(log_s)
|
||||
log_det_W_total = log_det_W_list[i]
|
||||
else:
|
||||
log_s_total = log_s_total + torch.sum(log_s)
|
||||
log_det_W_total += log_det_W_list[i]
|
||||
|
||||
loss = torch.sum(z*z)/(2*self.sigma*self.sigma) - log_s_total - log_det_W_total
|
||||
return loss/(z.size(0)*z.size(1)*z.size(2))
|
||||
|
||||
|
||||
class Invertible1x1Conv(torch.nn.Module):
|
||||
"""
|
||||
The layer outputs both the convolution, and the log determinant
|
||||
of its weight matrix. If reverse=True it does convolution with
|
||||
inverse
|
||||
"""
|
||||
def __init__(self, c):
|
||||
super(Invertible1x1Conv, self).__init__()
|
||||
self.conv = torch.nn.Conv1d(c, c, kernel_size=1, stride=1, padding=0,
|
||||
bias=False)
|
||||
|
||||
# Sample a random orthonormal matrix to initialize weights
|
||||
W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
|
||||
|
||||
# Ensure determinant is 1.0 not -1.0
|
||||
if torch.det(W) < 0:
|
||||
W[:,0] = -1*W[:,0]
|
||||
W = W.view(c, c, 1)
|
||||
self.conv.weight.data = W
|
||||
|
||||
def forward(self, z, reverse=False):
|
||||
# shape
|
||||
batch_size, group_size, n_of_groups = z.size()
|
||||
|
||||
W = self.conv.weight.squeeze()
|
||||
|
||||
if reverse:
|
||||
if not hasattr(self, 'W_inverse'):
|
||||
# Reverse computation
|
||||
W_inverse = W.inverse()
|
||||
W_inverse = Variable(W_inverse[..., None])
|
||||
if z.type() == 'torch.cuda.HalfTensor' or z.type() == 'torch.HalfTensor': W_inverse = W_inverse.half()
|
||||
self.W_inverse = W_inverse
|
||||
z = F.conv1d(z, self.W_inverse, bias=None, stride=1, padding=0)
|
||||
return z
|
||||
else:
|
||||
# Forward computation
|
||||
log_det_W = batch_size * n_of_groups * torch.logdet(W)
|
||||
z = self.conv(z)
|
||||
return z, log_det_W
|
||||
|
||||
|
||||
class WN(torch.nn.Module):
|
||||
"""
|
||||
This is the WaveNet like layer for the affine coupling. The primary difference
|
||||
from WaveNet is the convolutions need not be causal. There is also no dilation
|
||||
size reset. The dilation only doubles on each layer
|
||||
"""
|
||||
def __init__(self, n_in_channels, n_mel_channels, n_layers, n_channels,
|
||||
kernel_size):
|
||||
super(WN, self).__init__()
|
||||
assert(kernel_size % 2 == 1)
|
||||
assert(n_channels % 2 == 0)
|
||||
self.n_layers = n_layers
|
||||
self.n_channels = n_channels
|
||||
self.in_layers = torch.nn.ModuleList()
|
||||
self.res_skip_layers = torch.nn.ModuleList()
|
||||
self.cond_layers = torch.nn.ModuleList()
|
||||
|
||||
start = torch.nn.Conv1d(n_in_channels, n_channels, 1)
|
||||
start = torch.nn.utils.weight_norm(start, name='weight')
|
||||
self.start = start
|
||||
|
||||
# Initializing last layer to 0 makes the affine coupling layers
|
||||
# do nothing at first. This helps with training stability
|
||||
end = torch.nn.Conv1d(n_channels, 2*n_in_channels, 1)
|
||||
end.weight.data.zero_()
|
||||
end.bias.data.zero_()
|
||||
self.end = end
|
||||
|
||||
for i in range(n_layers):
|
||||
dilation = 2 ** i
|
||||
padding = int((kernel_size*dilation - dilation)/2)
|
||||
in_layer = torch.nn.Conv1d(n_channels, 2*n_channels, kernel_size,
|
||||
dilation=dilation, padding=padding)
|
||||
in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
|
||||
self.in_layers.append(in_layer)
|
||||
|
||||
cond_layer = torch.nn.Conv1d(n_mel_channels, 2*n_channels, 1)
|
||||
cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
|
||||
self.cond_layers.append(cond_layer)
|
||||
|
||||
# last one is not necessary
|
||||
if i < n_layers - 1:
|
||||
res_skip_channels = 2*n_channels
|
||||
else:
|
||||
res_skip_channels = n_channels
|
||||
res_skip_layer = torch.nn.Conv1d(n_channels, res_skip_channels, 1)
|
||||
res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
|
||||
self.res_skip_layers.append(res_skip_layer)
|
||||
|
||||
def forward(self, forward_input):
|
||||
audio, spect = forward_input
|
||||
audio = self.start(audio)
|
||||
|
||||
for i in range(self.n_layers):
|
||||
acts = fused_add_tanh_sigmoid_multiply(
|
||||
self.in_layers[i](audio),
|
||||
self.cond_layers[i](spect),
|
||||
torch.IntTensor([self.n_channels]))
|
||||
|
||||
res_skip_acts = self.res_skip_layers[i](acts)
|
||||
if i < self.n_layers - 1:
|
||||
audio = res_skip_acts[:,:self.n_channels,:] + audio
|
||||
skip_acts = res_skip_acts[:,self.n_channels:,:]
|
||||
else:
|
||||
skip_acts = res_skip_acts
|
||||
|
||||
if i == 0:
|
||||
output = skip_acts
|
||||
else:
|
||||
output = skip_acts + output
|
||||
return self.end(output)
|
||||
|
||||
|
||||
class WaveGlow(torch.nn.Module):
|
||||
def __init__(self, n_mel_channels, n_flows, n_group, n_early_every,
|
||||
n_early_size, WN_config):
|
||||
super(WaveGlow, self).__init__()
|
||||
|
||||
self.upsample = torch.nn.ConvTranspose1d(n_mel_channels,
|
||||
n_mel_channels,
|
||||
1024, stride=256)
|
||||
assert(n_group % 2 == 0)
|
||||
self.n_flows = n_flows
|
||||
self.n_group = n_group
|
||||
self.n_early_every = n_early_every
|
||||
self.n_early_size = n_early_size
|
||||
self.WN = torch.nn.ModuleList()
|
||||
self.convinv = torch.nn.ModuleList()
|
||||
|
||||
n_half = int(n_group/2)
|
||||
|
||||
# Set up layers with the right sizes based on how many dimensions
|
||||
# have been output already
|
||||
n_remaining_channels = n_group
|
||||
for k in range(n_flows):
|
||||
if k % self.n_early_every == 0 and k > 0:
|
||||
n_half = n_half - int(self.n_early_size/2)
|
||||
n_remaining_channels = n_remaining_channels - self.n_early_size
|
||||
self.convinv.append(Invertible1x1Conv(n_remaining_channels))
|
||||
self.WN.append(WN(n_half, n_mel_channels*n_group, **WN_config))
|
||||
self.n_remaining_channels = n_remaining_channels # Useful during inference
|
||||
|
||||
def forward(self, forward_input):
|
||||
"""
|
||||
forward_input[0] = mel_spectrogram: batch x n_mel_channels x frames
|
||||
forward_input[1] = audio: batch x time
|
||||
"""
|
||||
spect, audio = forward_input
|
||||
|
||||
# Upsample spectrogram to size of audio
|
||||
spect = self.upsample(spect)
|
||||
assert(spect.size(2) >= audio.size(1))
|
||||
if spect.size(2) > audio.size(1):
|
||||
spect = spect[:, :, :audio.size(1)]
|
||||
|
||||
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
|
||||
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
|
||||
|
||||
audio = audio.unfold(1, self.n_group, self.n_group).permute(0, 2, 1)
|
||||
output_audio = []
|
||||
log_s_list = []
|
||||
log_det_W_list = []
|
||||
|
||||
for k in range(self.n_flows):
|
||||
if k % self.n_early_every == 0 and k > 0:
|
||||
output_audio.append(audio[:,:self.n_early_size,:])
|
||||
audio = audio[:,self.n_early_size:,:]
|
||||
|
||||
audio, log_det_W = self.convinv[k](audio)
|
||||
log_det_W_list.append(log_det_W)
|
||||
|
||||
n_half = int(audio.size(1)/2)
|
||||
audio_0 = audio[:,:n_half,:]
|
||||
audio_1 = audio[:,n_half:,:]
|
||||
|
||||
output = self.WN[k]((audio_0, spect))
|
||||
log_s = output[:, n_half:, :]
|
||||
b = output[:, :n_half, :]
|
||||
audio_1 = torch.exp(log_s)*audio_1 + b
|
||||
log_s_list.append(log_s)
|
||||
|
||||
audio = torch.cat([audio_0, audio_1],1)
|
||||
|
||||
output_audio.append(audio)
|
||||
return torch.cat(output_audio,1), log_s_list, log_det_W_list
|
||||
|
||||
def infer(self, spect, sigma=1.0):
|
||||
print("+++++++++++++++++++++glow.py")
|
||||
spect = self.upsample(spect)
|
||||
# trim conv artifacts. maybe pad spec to kernel multiple
|
||||
time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0]
|
||||
spect = spect[:, :, :-time_cutoff]
|
||||
|
||||
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
|
||||
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
|
||||
|
||||
if spect.type() == 'torch.cuda.HalfTensor':
|
||||
audio = torch.cuda.HalfTensor(spect.size(0),
|
||||
self.n_remaining_channels,
|
||||
spect.size(2)).normal_()
|
||||
else:
|
||||
audio = torch.cuda.FloatTensor(spect.size(0),
|
||||
self.n_remaining_channels,
|
||||
spect.size(2)).normal_()
|
||||
|
||||
audio = torch.autograd.Variable(sigma*audio)
|
||||
|
||||
for k in reversed(range(self.n_flows)):
|
||||
n_half = int(audio.size(1)/2)
|
||||
audio_0 = audio[:,:n_half,:]
|
||||
audio_1 = audio[:,n_half:,:]
|
||||
|
||||
output = self.WN[k]((audio_0, spect))
|
||||
s = output[:, n_half:, :]
|
||||
b = output[:, :n_half, :]
|
||||
audio_1 = (audio_1 - b)/torch.exp(s)
|
||||
audio = torch.cat([audio_0, audio_1],1)
|
||||
|
||||
audio = self.convinv[k](audio, reverse=True)
|
||||
|
||||
if k % self.n_early_every == 0 and k > 0:
|
||||
if spect.type() == 'torch.cuda.HalfTensor':
|
||||
z = torch.cuda.HalfTensor(spect.size(0), self.n_early_size, spect.size(2)).normal_()
|
||||
else:
|
||||
z = torch.cuda.FloatTensor(spect.size(0), self.n_early_size, spect.size(2)).normal_()
|
||||
audio = torch.cat((sigma*z, audio),1)
|
||||
|
||||
audio = audio.permute(0,2,1).contiguous().view(audio.size(0), -1).data
|
||||
return audio
|
||||
|
||||
@staticmethod
|
||||
def remove_weightnorm(model):
|
||||
waveglow = model
|
||||
for WN in waveglow.WN:
|
||||
WN.start = torch.nn.utils.remove_weight_norm(WN.start)
|
||||
WN.in_layers = remove(WN.in_layers)
|
||||
WN.cond_layers = remove(WN.cond_layers)
|
||||
WN.res_skip_layers = remove(WN.res_skip_layers)
|
||||
return waveglow
|
||||
|
||||
|
||||
def remove(conv_list):
|
||||
new_conv_list = torch.nn.ModuleList()
|
||||
for old_conv in conv_list:
|
||||
old_conv = torch.nn.utils.remove_weight_norm(old_conv)
|
||||
new_conv_list.append(old_conv)
|
||||
return new_conv_list
|
|
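The deleted `glow.py` above carries the flow objective worth keeping in mind when reading the consolidated module: the Gaussian negative log-likelihood of the latent plus the log-determinants contributed by the affine couplings and the invertible 1x1 convolutions. A compact restatement of `WaveGlowLoss.forward`:

```python
import torch

def waveglow_loss(z, log_s_list, log_det_W_list, sigma=1.0):
    # ||z||^2 / (2 sigma^2) minus the summed coupling and 1x1-conv
    # log-determinants, averaged over every element of z.
    log_s_total = sum(log_s.sum() for log_s in log_s_list)
    log_det_W_total = sum(log_det_W_list)
    loss = z.pow(2).sum() / (2 * sigma * sigma) - log_s_total - log_det_W_total
    return loss / (z.size(0) * z.size(1) * z.size(2))
```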
@ -1,269 +0,0 @@
|
|||
# *****************************************************************************
|
||||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Redistribution and use in source and binary forms, with or without
|
||||
# modification, are permitted provided that the following conditions are met:
|
||||
# * Redistributions of source code must retain the above copyright
|
||||
# notice, this list of conditions and the following disclaimer.
|
||||
# * Redistributions in binary form must reproduce the above copyright
|
||||
# notice, this list of conditions and the following disclaimer in the
|
||||
# documentation and/or other materials provided with the distribution.
|
||||
# * Neither the name of the NVIDIA CORPORATION nor the
|
||||
# names of its contributors may be used to endorse or promote products
|
||||
# derived from this software without specific prior written permission.
|
||||
#
|
||||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
|
||||
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
|
||||
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
|
||||
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
|
||||
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
|
||||
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
|
||||
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
|
||||
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
|
||||
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||
#
|
||||
# *****************************************************************************
|
||||
|
||||
import copy
|
||||
import torch
|
||||
from waveglow.glow import Invertible1x1Conv, remove
|
||||
|
||||
|
||||
@torch.jit.script
|
||||
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
|
||||
n_channels_int = n_channels[0]
|
||||
in_act = input_a+input_b
|
||||
t_act = torch.tanh(in_act[:, :n_channels_int, :])
|
||||
s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
|
||||
acts = t_act * s_act
|
||||
return acts
|
||||
|
||||
|
||||
class WN(torch.nn.Module):
|
||||
"""
|
||||
This is the WaveNet like layer for the affine coupling. The primary difference
|
||||
from WaveNet is the convolutions need not be causal. There is also no dilation
|
||||
size reset. The dilation only doubles on each layer
|
||||
"""
|
||||
def __init__(self, n_in_channels, n_mel_channels, n_layers, n_channels,
|
||||
kernel_size):
|
||||
super(WN, self).__init__()
|
||||
assert(kernel_size % 2 == 1)
|
||||
assert(n_channels % 2 == 0)
|
||||
self.n_layers = n_layers
|
||||
self.n_channels = n_channels
|
||||
self.in_layers = torch.nn.ModuleList()
|
||||
self.res_skip_layers = torch.nn.ModuleList()
|
||||
self.cond_layers = torch.nn.ModuleList()
|
||||
|
||||
start = torch.nn.Conv1d(n_in_channels, n_channels, 1)
|
||||
start = torch.nn.utils.weight_norm(start, name='weight')
|
||||
self.start = start
|
||||
|
||||
# Initializing last layer to 0 makes the affine coupling layers
|
||||
# do nothing at first. This helps with training stability
|
||||
end = torch.nn.Conv1d(n_channels, 2*n_in_channels, 1)
|
||||
end.weight.data.zero_()
|
||||
end.bias.data.zero_()
|
||||
self.end = end
|
||||
|
||||
for i in range(n_layers):
|
||||
dilation = 2 ** i
|
||||
padding = int((kernel_size*dilation - dilation)/2)
|
||||
in_layer = torch.nn.Conv1d(n_channels, 2*n_channels, kernel_size,
|
||||
dilation=dilation, padding=padding)
|
||||
in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
|
||||
self.in_layers.append(in_layer)
|
||||
|
||||
cond_layer = torch.nn.Conv1d(n_mel_channels, 2*n_channels, 1)
|
||||
cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
|
||||
self.cond_layers.append(cond_layer)
|
||||
|
||||
# last one is not necessary
|
||||
if i < n_layers - 1:
|
||||
res_skip_channels = 2*n_channels
|
||||
else:
|
||||
res_skip_channels = n_channels
|
||||
res_skip_layer = torch.nn.Conv1d(n_channels, res_skip_channels, 1)
|
||||
res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
|
||||
self.res_skip_layers.append(res_skip_layer)
|
||||
|
||||
def forward(self, forward_input):
|
||||
audio, spect = forward_input
|
||||
audio = self.start(audio)
|
||||
|
||||
for i in range(self.n_layers):
|
||||
acts = fused_add_tanh_sigmoid_multiply(
|
||||
self.in_layers[i](audio),
|
||||
self.cond_layers[i](spect),
|
||||
torch.IntTensor([self.n_channels]))
|
||||
|
||||
res_skip_acts = self.res_skip_layers[i](acts)
|
||||
if i < self.n_layers - 1:
|
||||
audio = res_skip_acts[:,:self.n_channels,:] + audio
|
||||
skip_acts = res_skip_acts[:,self.n_channels:,:]
|
||||
else:
|
||||
skip_acts = res_skip_acts
|
||||
|
||||
if i == 0:
|
||||
output = skip_acts
|
||||
else:
|
||||
output = skip_acts + output
|
||||
return self.end(output)
|
||||
|
||||
|
||||
class WaveGlow(torch.nn.Module):
|
||||
def __init__(self, n_mel_channels, n_flows, n_group, n_early_every,
|
||||
n_early_size, WN_config):
|
||||
super(WaveGlow, self).__init__()
|
||||
|
||||
self.upsample = torch.nn.ConvTranspose1d(n_mel_channels,
|
||||
n_mel_channels,
|
||||
1024, stride=256)
|
||||
assert(n_group % 2 == 0)
|
||||
self.n_flows = n_flows
|
||||
self.n_group = n_group
|
||||
self.n_early_every = n_early_every
|
||||
self.n_early_size = n_early_size
|
||||
self.WN = torch.nn.ModuleList()
|
||||
self.convinv = torch.nn.ModuleList()
|
||||
|
||||
n_half = int(n_group/2)
|
||||
|
||||
# Set up layers with the right sizes based on how many dimensions
|
||||
# have been output already
|
||||
n_remaining_channels = n_group
|
||||
for k in range(n_flows):
|
||||
if k % self.n_early_every == 0 and k > 0:
|
||||
n_half = n_half - int(self.n_early_size/2)
|
||||
n_remaining_channels = n_remaining_channels - self.n_early_size
|
||||
self.convinv.append(Invertible1x1Conv(n_remaining_channels))
|
||||
self.WN.append(WN(n_half, n_mel_channels*n_group, **WN_config))
|
||||
self.n_remaining_channels = n_remaining_channels # Useful during inference
|
||||
|
||||
def forward(self, forward_input):
|
||||
return None
|
||||
"""
|
||||
forward_input[0] = audio: batch x time
|
||||
forward_input[1] = upsamp_spectrogram: batch x n_cond_channels x time
|
||||
"""
|
||||
"""
|
||||
spect, audio = forward_input
|
||||
|
||||
# Upsample spectrogram to size of audio
|
||||
spect = self.upsample(spect)
|
||||
assert(spect.size(2) >= audio.size(1))
|
||||
if spect.size(2) > audio.size(1):
|
||||
spect = spect[:, :, :audio.size(1)]
|
||||
|
||||
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
|
||||
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
|
||||
|
||||
audio = audio.unfold(1, self.n_group, self.n_group).permute(0, 2, 1)
|
||||
output_audio = []
|
||||
s_list = []
|
||||
s_conv_list = []
|
||||
|
||||
for k in range(self.n_flows):
|
||||
if k%4 == 0 and k > 0:
|
||||
output_audio.append(audio[:,:self.n_multi,:])
|
||||
audio = audio[:,self.n_multi:,:]
|
||||
|
||||
# project to new basis
|
||||
audio, s = self.convinv[k](audio)
|
||||
s_conv_list.append(s)
|
||||
|
||||
n_half = int(audio.size(1)/2)
|
||||
if k%2 == 0:
|
||||
audio_0 = audio[:,:n_half,:]
|
||||
audio_1 = audio[:,n_half:,:]
|
||||
else:
|
||||
audio_1 = audio[:,:n_half,:]
|
||||
audio_0 = audio[:,n_half:,:]
|
||||
|
||||
output = self.nn[k]((audio_0, spect))
|
||||
s = output[:, n_half:, :]
|
||||
b = output[:, :n_half, :]
|
||||
audio_1 = torch.exp(s)*audio_1 + b
|
||||
s_list.append(s)
|
||||
|
||||
if k%2 == 0:
|
||||
audio = torch.cat([audio[:,:n_half,:], audio_1],1)
|
||||
else:
|
||||
audio = torch.cat([audio_1, audio[:,n_half:,:]], 1)
|
||||
output_audio.append(audio)
|
||||
return torch.cat(output_audio,1), s_list, s_conv_list
|
||||
"""
|
||||
|
||||
def infer(self, spect, sigma=1.0):
|
||||
spect = self.upsample(spect)
|
||||
# trim conv artifacts. maybe pad spec to kernel multiple
|
||||
time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0]
|
||||
spect = spect[:, :, :-time_cutoff]
|
||||
|
||||
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
|
||||
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
|
||||
|
||||
if spect.type() == 'torch.cuda.HalfTensor':
|
||||
audio = torch.cuda.HalfTensor(spect.size(0),
|
||||
self.n_remaining_channels,
|
||||
spect.size(2)).normal_()
|
||||
elif spect.type() == 'torch.cuda.FloatTensor':
|
||||
audio = torch.cuda.FloatTensor(spect.size(0),
|
||||
self.n_remaining_channels,
|
||||
spect.size(2)).normal_()
|
||||
else:
|
||||
audio = torch.FloatTensor(spect.size(0),
|
||||
self.n_remaining_channels,
|
||||
spect.size(2)).normal_()
|
||||
|
||||
audio = torch.autograd.Variable(sigma*audio)
|
||||
|
||||
for k in reversed(range(self.n_flows)):
|
||||
n_half = int(audio.size(1)/2)
|
||||
if k%2 == 0:
|
||||
audio_0 = audio[:,:n_half,:]
|
||||
audio_1 = audio[:,n_half:,:]
|
||||
else:
|
||||
audio_1 = audio[:,:n_half,:]
|
||||
audio_0 = audio[:,n_half:,:]
|
||||
|
||||
output = self.WN[k]((audio_0, spect))
|
||||
s = output[:, n_half:, :]
|
||||
b = output[:, :n_half, :]
|
||||
audio_1 = (audio_1 - b)/torch.exp(s)
|
||||
if k%2 == 0:
|
||||
audio = torch.cat([audio[:,:n_half,:], audio_1],1)
|
||||
else:
|
||||
audio = torch.cat([audio_1, audio[:,n_half:,:]], 1)
|
||||
|
||||
audio = self.convinv[k](audio, reverse=True)
|
||||
|
||||
if k%4 == 0 and k > 0:
|
||||
if spect.type() == 'torch.cuda.HalfTensor':
|
||||
z = torch.cuda.HalfTensor(spect.size(0),
|
||||
self.n_early_size,
|
||||
spect.size(2)).normal_()
|
||||
elif spect.type() == 'torch.cuda.FloatTensor':
|
||||
z = torch.cuda.FloatTensor(spect.size(0),
|
||||
self.n_early_size,
|
||||
spect.size(2)).normal_()
|
||||
else:
|
||||
z = torch.FloatTensor(spect.size(0),
|
||||
self.n_early_size,
|
||||
spect.size(2)).normal_()
|
||||
|
||||
audio = torch.cat((sigma*z, audio),1)
|
||||
|
||||
return audio.permute(0,2,1).contiguous().view(audio.size(0), -1).data
|
||||
|
||||
@staticmethod
|
||||
def remove_weightnorm(model):
|
||||
waveglow = model
|
||||
for WN in waveglow.WN:
|
||||
WN.start = torch.nn.utils.remove_weight_norm(WN.start)
|
||||
WN.in_layers = remove(WN.in_layers)
|
||||
WN.cond_layers = remove(WN.cond_layers)
|
||||
WN.res_skip_layers = remove(WN.res_skip_layers)
|
||||
return waveglow
|
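Both deleted variants rely on the same invertibility property: the affine coupling `audio_1 = exp(s) * audio_1 + b` used in training is undone exactly at synthesis by `audio_1 = (audio_1 - b) / exp(s)`, because `s` and `b` are computed only from the other half of the channels. A toy check of that round trip:

```python
import torch

s, b, x1 = torch.randn(8), torch.randn(8), torch.randn(8)
y1 = torch.exp(s) * x1 + b         # forward coupling (training direction)
x1_rec = (y1 - b) / torch.exp(s)   # inverse used in infer()
assert torch.allclose(x1, x1_rec, atol=1e-5)
```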