Updated to version 19.07

This commit is contained in:
Grzegorz Karch 2019-07-23 12:45:37 -07:00
parent 3137fbeae3
commit 979e291848
24 changed files with 422 additions and 971 deletions

View file

@ -1,10 +1,5 @@
FROM nvcr.io/nvidia/pytorch:19.03-py3
FROM nvcr.io/nvidia/pytorch:19.06-py3
ADD . /workspace/tacotron2
WORKDIR /workspace/tacotron2
RUN pip install -r requirements.txt
RUN cd /workspace; \
git clone https://github.com/NVIDIA/apex.git; \
cd /workspace/apex; \
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
WORKDIR /workspace/tacotron2

View file

@ -1,20 +1,20 @@
# Tacotron 2 And WaveGlow v1.5 For PyTorch
# Tacotron 2 And WaveGlow v1.6 For PyTorch
This repository provides a script and recipe to train Tacotron 2 and WaveGlow
v1.5 models to achieve state of the art accuracy, and is tested and maintained by
NVIDIA.
v1.6 models to achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
Table of Contents
=================
* [The model](#the-model)
## Table of Contents
* [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Setup](#setup)
* [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
* [Details](#details)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Shared parameters](#shared-parameters)
@ -27,30 +27,30 @@ Table of Contents
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Expected training time](#expected-training-time)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Changelog](#changelog)
* [Known issues](#known-issues)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Training performance results](#training-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Expected training time](#expected-training-time)
* [Inference performance results](#inference-performance-results)
* [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## The model
## Model overview
This text-to-speech (TTS) system is a combination of two neural network
models:
* a modified Tacotron 2 model from the [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
paper and
* a flow-based neural network model from the [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002) paper.
paper
* a flow-based neural network model from the [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002) paper
The Tacotron 2 and WaveGlow models form a text-to-speech system that enables
users to synthesize natural sounding speech from raw transcripts without
@ -106,7 +106,6 @@ distribution.
Figure 2. Architecture of the WaveGlow model. Taken from the
[WaveGlow](https://arxiv.org/abs/1811.00002) paper.
### Default configuration
Both models support multi-GPU and mixed precision training with dynamic loss
@ -126,16 +125,98 @@ training.
The following features are supported by this model.
| Feature | Tacotron 2 | and WaveGlow |
|:-------|---------:|-----------:|
| Feature | Tacotron 2 | WaveGlow |
| :-----------------------|------------:|--------------:|
|[AMP](https://nvidia.github.io/apex/amp.html) | Yes | Yes |
|[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html) | Yes | Yes |
#### Features
AMP - a tool that enables Tensor Core-accelerated training. Please refer to section [Enabling mixed precision](#enabling-mixed-precision) for more details.
AMP - a tool that enables Tensor Core-accelerated training. For more information,
refer to [Enabling mixed precision](#enabling-mixed-precision).
Apex DistributedDataParallel - a module wrapper that enables easy multiprocess distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`. `DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by overlapping communication with computation during backward() and bucketing smaller gradient transfers to reduce the total number of transfers required.
Apex DistributedDataParallel - a module wrapper that enables easy multiprocess
distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`.
`DistributedDataParallel` is optimized for use with NCCL. It achieves high
performance by overlapping communication with computation during `backward()`
and bucketing smaller gradient transfers to reduce the total number of transfers
required.
## Mixed precision training
*Mixed precision* is the combined use of different numerical precisions in a
computational method. [Mixed precision](https://arxiv.org/abs/1710.03740)
training offers significant computational speedup by performing operations in
half-precision format, while storing minimal information in single-precision
to retain as much information as possible in critical parts of the network.
Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores)
in the Volta and Turing architecture, significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup on
the most arithmetically intense model architectures. Using mixed precision
training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was
introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740)
paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
### Enabling mixed precision
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
(AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
to half-precision upon retrieval, while storing variables in single-precision
format. Furthermore, to preserve small gradient magnitudes in backpropagation,
a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling can be
easily applied by using the `scale_loss()` method provided by AMP. The scaling value
to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
By default, the `train_tacotron2.sh` and `train_waveglow.sh` scripts will
launch mixed precision training with Tensor Cores. You can change this
behaviour by removing the `--amp-run` flag from the `train.py` script.
To enable mixed precision, the following steps were performed in the Tacotron 2 and
WaveGlow models (a combined sketch of these steps follows the list):
* Import AMP from APEX:
```bash
from apex import amp
amp.lists.functional_overrides.FP32_FUNCS.remove('softmax')
amp.lists.functional_overrides.FP16_FUNCS.append('softmax')
```
* Initialize AMP:
```bash
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
```
* If running on multi-GPU, wrap the model with `DistributedDataParallel`:
```bash
from apex.parallel import DistributedDataParallel as DDP
model = DDP(model)
```
* Scale loss before backpropagation (assuming loss is stored in a variable
called `losses`):
* Default backpropagate for FP32:
```bash
losses.backward()
```
* Scale loss and backpropagate with AMP:
```bash
with optimizer.scale_loss(losses) as scaled_losses:
scaled_losses.backward()
```
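Taken together, the steps above reduce to a short training step. The following is a minimal sketch under stated assumptions, not the repository's `train.py`: it uses a toy `torch.nn.Linear` model, random tensors, and the `amp.scale_loss(loss, optimizer)` context manager from Apex, and it assumes Apex and a CUDA device are available.

```python
# Minimal mixed-precision training step with Apex AMP (illustrative sketch only).
# Assumes Apex is installed and a GPU is available; the model and data are toys.
import torch
from apex import amp

model = torch.nn.Linear(80, 80).cuda()                      # stand-in for Tacotron 2 / WaveGlow
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# After this call, AMP handles the FP16 casts and loss-scale bookkeeping.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

inputs = torch.randn(16, 80, device="cuda")
targets = torch.randn(16, 80, device="cuda")

optimizer.zero_grad()
losses = torch.nn.functional.mse_loss(model(inputs), targets)

# Loss scaling preserves small gradient magnitudes in the FP16 backward pass.
with amp.scale_loss(losses, optimizer) as scaled_losses:
    scaled_losses.backward()
optimizer.step()
```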
## Setup
@ -149,22 +230,21 @@ and encapsulates some dependencies. Aside from these dependencies, ensure you
have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.04-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* [PyTorch 19.05-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
or newer
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the
following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
Documentation:
* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)
For those unable to use the PyTorch NGC container, to set up the required
environment or create your own container, see the versioned
[NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html).
[NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
## Quick Start Guide
@ -174,84 +254,86 @@ and WaveGlow model on the [LJ Speech](https://keithito.com/LJ-Speech-Dataset/)
dataset.
1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2
```
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2
```
2. Download and preprocess the dataset.
Use the `./scripts/prepare-dataset.sh` download script to automatically
download and preprocess the training, validation and test datasets. To run
this script, issue:
```bash
bash scripts/prepare-dataset.sh
```
```bash
bash scripts/prepare-dataset.sh
```
To preprocess the datasets for Tacotron 2 training, use the
`./scripts/prepare-mels.sh` script:
```bash
bash scripts/prepare_mels.sh
```
Data is downloaded to the `./LJSpeech-1.1` directory (on the host). The
To preprocess the datasets for Tacotron 2 training, use the
`./scripts/prepare_mels.sh` script:
```bash
bash scripts/prepare_mels.sh
```
Data is downloaded to the `./LJSpeech-1.1` directory (on the host). The
`./LJSpeech-1.1` directory is mounted to the `/workspace/tacotron2/LJSpeech-1.1`
location in the NGC container. The preprocessed mel-spectrograms are stored in the
`./LJSpeech-1.1/mels` directory.
3. Build the Tacotron 2 and WaveGlow PyTorch NGC container.
```bash
bash scripts/docker/build.sh
```
```bash
bash scripts/docker/build.sh
```
4. Start an interactive session in the NGC container to run training/inference.
After you build the container image, you can start an interactive CLI session with:
```bash
bash scripts/docker/interactive.sh
```
```bash
bash scripts/docker/interactive.sh
```
The `interactive.sh` script requires that the location on the dataset is specified.
For example, `LJSpeech-1.1`.
The `interactive.sh` script requires that the location of the dataset is specified.
For example, `LJSpeech-1.1`.
5. Start training.
To start Tacotron 2 training, run:
```bash
bash scripts/train_tacotron2.sh
```
```bash
bash scripts/train_tacotron2.sh
```
To start WaveGlow training, run:
```bash
bash scripts/train_waveglow.sh
```
To start WaveGlow training, run:
```bash
bash scripts/train_waveglow.sh
```
6. Start validation/evaluation.
Ensure your loss values are comparable to those listed in the table in the
[Results][#results] section. For both models, the loss values are stored in the
`./output/nvlog.json` log file.
[Results](#results) section. For both models, the loss values are stored in the `./output/nvlog.json` log file.
After you have trained the Tacotron 2 model for 1500 epochs and the
WaveGlow model for 800 epochs, you should get audio results similar to the
samples in the `./audio` folder. For details about generating audio, see the
[Inference process](#inference-process) section below.
After you have trained the Tacotron 2 model for 1500 epochs and the
WaveGlow model for 800 epochs, you should get audio results similar to the
samples in the `./audio` folder. For details about generating audio, see the
[Inference process](#inference-process) section below.
The training scripts automatically run the validation after each training
epoch. The results from the validation are printed to the standard output
(`stdout`) and saved to the log files.
The training scripts automatically run the validation after each training
epoch. The results from the validation are printed to the standard output
(`stdout`) and saved to the log files.
7. Start inference.
After you have trained the Tacotron 2 and WaveGlow models, you can perform
inference using the respective checkpoints that are passed as `--tacotron2`
and `--waveglow` arguments.
To run inference issue:
```bash
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --fp16-run
```
The speech is generated from a text file that is passed with `-i` argument. To run
inference in mixed precision, use the `--amp-run` flag. The output audio will
be stored in the path specified by the `-o` argument.
To run inference, issue:
## Details
```bash
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i phrases/phrase.txt --amp-run
```
The speech is generated from lines of text in the file that is passed with the
`-i` argument. The number of lines determines the inference batch size. To run
inference in mixed precision, use the `--amp-run` flag. The output audio will
be stored in the path specified by the `-o` argument.
## Advanced
The following sections provide greater details of the dataset, running
training and inference, and the training results.
@ -312,11 +394,26 @@ WaveGlow models.
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
To see the full list of available options and their descriptions, use the `-h`
or `--help` command line option, for example:
```bash
python train.py --help
```
The following example output is printed when running the sample:
```bash
Batch: 7/260 epoch 0
:::NVLOGv0.2.2 Tacotron2_PyT 1560936205.667271376 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_start: 7
:::NVLOGv0.2.2 Tacotron2_PyT 1560936207.209611416 (/workspace/tacotron2/dllogger/logger.py:251) train_iteration_loss: 5.415428161621094
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.705905914 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_stop: 7
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.706479311 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_items/sec: 8924.00136085362
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.706998110 (/workspace/tacotron2/dllogger/logger.py:251) iter_time: 3.0393316745758057
Batch: 8/260 epoch 0
:::NVLOGv0.2.2 Tacotron2_PyT 1560936208.711485624 (/workspace/tacotron2/dllogger/logger.py:251) train_iter_start: 8
:::NVLOGv0.2.2 Tacotron2_PyT 1560936210.236668825 (/workspace/tacotron2/dllogger/logger.py:251) train_iteration_loss: 5.516331672668457
```
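Each metric in this output follows the same `:::NVLOGv... <timestamp> (...) <key>: <value>` pattern, so it can be scraped directly from captured stdout. Below is a small illustrative sketch, not part of the repository, that extracts the per-iteration loss values from a saved stdout log (the file name is a placeholder).

```python
# Illustrative helper (not repository code): parse ":::NVLOGv..." stdout lines
# like the sample above and collect (timestamp, train_iteration_loss) pairs.
import re

PATTERN = re.compile(
    r":::NVLOGv\S+ Tacotron2_PyT (\S+) \(.*\) train_iteration_loss: (\S+)")

def iteration_losses(log_path):
    losses = []
    with open(log_path) as f:
        for line in f:
            match = PATTERN.search(line)
            if match:
                losses.append((float(match.group(1)), float(match.group(2))))
    return losses

# e.g. iteration_losses("train_stdout.log") -> [(1560936207.209..., 5.4154...), ...]
```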
### Getting the data
@ -335,12 +432,11 @@ To use datasets different than the default LJSpeech dataset:
2. Add two text files containing file lists: one for the training subset (`--training-files`) and one for the validation subset (`--validation-files`).
The structure of the filelists should be as follows:
```bash
`<audio file path>|<transcript>`
```
The `<audio file path>` is the relative path to the path provided by the `--dataset-path` option.
```bash
<audio file path>|<transcript>
```
The `<audio file path>` is relative to the path provided by the `--dataset-path` option.
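For illustration, a filelist in this format can be read with a few lines of Python. This is a hedged sketch, not repository code; the filelist and dataset paths below are placeholders.

```python
# Sketch of reading a filelist in the "<audio file path>|<transcript>" format above.
# File names are placeholders; <audio file path> is relative to --dataset-path.
import os

def read_filelist(filelist_path, dataset_path):
    pairs = []
    with open(filelist_path, encoding="utf-8") as f:
        for line in f:
            audio_rel_path, transcript = line.rstrip("\n").split("|", 1)
            pairs.append((os.path.join(dataset_path, audio_rel_path), transcript))
    return pairs

# e.g. read_filelist("filelists/my_train_filelist.txt", "./MyDataset")
```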
### Training process
@ -351,7 +447,7 @@ of Tacotron 2 and as conditioning input to the network in case of WaveGlow.
The training loss is averaged over an entire training epoch, whereas the
validation loss is averaged over the validation dataset. Performance is
reported in total input tokens per second for the Tacotron 2 model and
reported in total output mel-spectrograms per second for the Tacotron 2 model and
in total output samples per second for the WaveGlow model. Both measures are
recorded as `train_iter_items/sec` (after each iteration) and
`train_epoch_items/sec` (averaged over epoch) in the output log file `./output/nvlog.json`. The result is
@ -372,100 +468,21 @@ models and input text as a text file, with one phrase per line.
To run inference, issue:
```bash
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --amp-run
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase.txt --amp-run
```
Here, `Tacotron2_checkpoint` and `WaveGlow_checkpoint` are pre-trained
checkpoints for the respective models, and `text.txt` contains input phrases.
Audio will be saved in the output folder.
checkpoints for the respective models, and `phrases/phrase.txt` contains input phrases. The number of text lines determines the inference batch size. Audio will be saved in the output folder.
You can find all the available options by calling `python inference.py --help`.
## Mixed precision training
## Performance
*Mixed precision* is the combined use of different numerical precisions in a
computational method. [Mixed precision](https://arxiv.org/abs/1710.03740)
training offers significant computational speedup by performing operations in
half-precision format, while storing minimal information in single-precision
to retain as much information as possible in critical parts of the network.
Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores)
in the Volta and Turing architecture, significant training speedups are
experienced by switching to mixed precision -- up to 3x overall speedup on
the most arithmetically intense model architectures. Using mixed precision
training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was
introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740)
paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
blog.
* How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
from the TensorFlow User Guide.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
### Enabling mixed precision
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
(AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
to half-precision upon retrieval, while storing variables in single-precision
format. Furthermore, to preserve small gradient magnitudes in backpropagation,
a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling can be
easily applied by using the `scale_loss()` method provided by AMP. The scaling value
to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
By default, the `train_tacotron2.sh` and `train_waveglow.sh` scripts will
launch mixed precision training with Tensor Cores. You can change this
behaviour by removing the `--amp-run` flag from the `train.py` script.
To enable mixed precision, the following steps were performed in the Tacotron 2 and
WaveGlow models:
* Import AMP from APEX:
```bash
from apex import amp
amp.lists.functional_overrides.FP32_FUNCS.remove('softmax')
amp.lists.functional_overrides.FP16_FUNCS.append('softmax')
```
* Initialize AMP:
```bash
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
```
* If running on multi-GPU, wrap the model with `DistributedDataParallel`:
```bash
from apex.parallel import DistributedDataParallel as DDP
model = DDP(model)
```
* Scale loss before backpropagation (assuming loss is stored in a variable
called `losses`)
* Default backpropagate for FP32:
```bash
losses.backward()
```
* Scale loss and backpropagate with AMP:
```bash
with optimizer.scale_loss(losses) as scaled_losses:
scaled_losses.backward()
```
## Benchmarking
### Benchmarking
The following section shows how to run benchmarks measuring the model
performance in training and inference mode.
### Training performance benchmark
#### Training performance benchmark
To benchmark the training performance on a specific batch size, run:
@ -517,37 +534,34 @@ Each of these scripts runs for 10 epochs and for each epoch measures the
average number of items per second. The performance results can be read from
the `nvlog.json` files produced by the commands.
### Inference performance benchmark
#### Inference performance benchmark
To benchmark the inference performance on a batch size=1, run:
* For FP32
```bash
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --log-file=output/nvlog_fp32.json
```
* For FP16
```bash
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ -i text.txt --amp-run --log-file=output/nvlog_fp16.json
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase_1_64.txt --amp-run --log-file=output/nvlog_fp16.json
```
* For FP32
```bash
python inference.py --tacotron2 <Tacotron2_checkpoint> --waveglow <WaveGlow_checkpoint> -o output/ --include-warmup -i phrases/phrase_1_64.txt --log-file=output/nvlog_fp32.json
```
The log files contain performance numbers for Tacotron 2 model
(number of input tokens per second, reported as `tacotron2_items_per_sec`)
and for WaveGlow (number of output samples per second, reported as
`waveglow_items_per_sec`).
The output log files will contain performance numbers for the Tacotron 2 model
(number of output mel-spectrograms per second, reported as `tacotron2_items_per_sec`)
and for WaveGlow (number of output samples per second, reported as `waveglow_items_per_sec`).
The `inference.py` script will run a few warmup iterations before running the benchmark.
## Results
### Results
The following sections provide details on how we achieved our performance
and accuracy in training and inference.
### Training accuracy results
#### Training accuracy results
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{FP16,FP32}_DGX1_16GB_8GPU.sh` training script in the PyTorch-19.04-py3
Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{FP16,FP32}_DGX1_16GB_8GPU.sh` training script in the PyTorch-19.05-py3
NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
All of the results were produced using the `train.py` script as described in the
@ -573,87 +587,84 @@ WaveGlow FP32 loss - batch size 4 (mean and std over 16 runs)
![](./img/waveglow_fp32_loss.png "WaveGlow FP32 loss")
### Training performance results
#### Training performance results
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./platform/train_{tacotron2,waveglow}_{FP16,FP32}_DGX1_16GB_8GPU.sh`
training script in the PyTorch-19.04-py3 NGC container on NVIDIA DGX-1 with
8x V100 16G GPUs. Performance numbers (in input tokens per second for
training script in the PyTorch-19.05-py3 NGC container on NVIDIA DGX-1 with
8x V100 16G GPUs. Performance numbers (in output mel-spectrograms per second for
Tacotron 2 and output samples per second for WaveGlow) were averaged over
an entire training epoch.
This table shows the results for Tacotron 2:
|Number of GPUs|Batch size per GPU|Number of tokens used with mixed precision|Number of tokens used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
|Number of GPUs|Batch size per GPU|Number of mels used with mixed precision|Number of mels used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
|---:|---:|---:|---:|---:|---:|---:|
|1|128@FP16, 64@FP32 | 3,746 | 2,087 | 1.79 | 1.00 | 1.00 |
|4|128@FP16, 64@FP32 | 13,264 | 8,052 | 1.65 | 3.54 | 3.86 |
|8|128@FP16, 64@FP32 | 25,056 | 15,863 | 1.58 | 6.69 | 7.60 |
|1|128@FP16, 64@FP32 | 20,992 | 12,933 | 1.62 | 1.00 | 1.00 |
|4|128@FP16, 64@FP32 | 74,989 | 46,115 | 1.63 | 3.57 | 3.57 |
|8|128@FP16, 64@FP32 | 140,060 | 88,719 | 1.58 | 6.67 | 6.86 |
The following table shows the results for WaveGlow:
|Number of GPUs|Batch size per GPU|Number of samples used with mixed precision|Number of samples used with FP32|Speed-up with mixed precision|Multi-GPU weak scaling with mixed precision|Multi-GPU weak scaling with FP32|
|---:|---:|---:|---:|---:|---:|---:|
|1| 10@FP16, 4@FP32 | 79248.87426 | 35695.56774 | 2.22 | 1.00 | 1.00 |
|4| 10@FP16, 4@FP32 | 275310.0262 | 126497.6265 | 2.18 | 3.47 | 3.54 |
|8| 10@FP16, 4@FP32 | 576709.4935 | 255155.1798 | 2.26 | 7.28 | 7.15 |
|1| 10@FP16, 4@FP32 | 81,503 | 36,671 | 2.22 | 1.00 | 1.00 |
|4| 10@FP16, 4@FP32 | 275,803 | 124,504 | 2.22 | 3.38 | 3.40 |
|8| 10@FP16, 4@FP32 | 583,887 | 264,903 | 2.20 | 7.16 | 7.22 |
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Expected training time
##### Expected training time
The following table shows the expected training time for convergence for Tacotron 2 (1500 epochs):
|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
|---:|---:|---:|---:|---:|
|1| 128@FP16, 64@FP32 | 137.33 | 227.66 | 1.66 |
|4| 128@FP16, 64@FP32 | 40.68 | 63.99 | 1.57 |
|8| 128@FP16, 64@FP32 | 20.74 | 32.47 | 1.57 |
|1| 128@FP16, 64@FP32 | 153 | 234 | 1.53 |
|4| 128@FP16, 64@FP32 | 42 | 64 | 1.54 |
|8| 128@FP16, 64@FP32 | 22 | 33 | 1.52 |
The following table shows the expected training time for convergence for WaveGlow (1000 epochs):
|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
|---:|---:|---:|---:|---:|
|1| 10@FP16, 4@FP32 | 358.00 | 793.97 | 2.22 |
|4| 10@FP16, 4@FP32 | 103.10 | 223.59 | 2.17 |
|8| 10@FP16, 4@FP32 | 50.40 | 109.45 | 2.17 |
### Inference performance results
|1| 10@FP16, 4@FP32 | 347 | 768 | 2.21 |
|4| 10@FP16, 4@FP32 | 106 | 231 | 2.18 |
|8| 10@FP16, 4@FP32 | 49 | 105 | 2.16 |
#### Inference performance results
##### NVIDIA DGX-1 (8x V100 16G)
Our results were obtained by running the `./inference.py` inference script in the
PyTorch-18.12.1-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
Performance numbers (in input tokens per second for Tacotron 2 and output
samples per second for WaveGlow) were averaged over 16 runs.
Our results were obtained by running the `./inference.py` inference script in
the PyTorch-19.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
Performance numbers (in output mel-spectrograms per second for Tacotron 2 and
output samples per second for WaveGlow) were averaged over 16 runs.
The following table shows the inference performance results for Tacotron 2 model.
Results are measured in the number of output mel-spectrograms per second.
The following table shows the inference performance results for Tacotron 2.
Results are measured in the number of input tokens per second.
|Number of GPUs|Number of tokens used with mixed precision|Number of tokens used with FP32|Speed-up with mixed precision|
|Number of GPUs|Number of mels used with mixed precision|Number of mels used with FP32|Speed-up with mixed precision|
|---:|---:|---:|---:|
|1|168|173|0.97|
|**1**|637|619|1.03|
The following table shows the inference performance results for WaveGlow.
Results are measured in the number of output audio samples per second.<sup>1</sup>
The following table shows the inference performance results for WaveGlow model.
Results are measured in the number of output samples per second<sup>1</sup>.
|Number of GPUs|Number of samples used with mixed precision|Number of samples used with FP32|Speed-up with mixed precision|
|---:|---:|---:|---:|
|1|583318|553380|1.05|
|**1**|565629|578322|0.98|
<sup>1</sup>With a sampling rate of 22050 Hz, one second of audio is generated from 22050 samples.
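Because throughput is reported in output samples per second, dividing by the 22050 Hz sampling rate gives a rough real-time factor. A quick sketch using the 565,629 samples/s mixed-precision figure from the table above:

```python
# Rough real-time factor: seconds of audio synthesized per second of wall-clock time.
SAMPLING_RATE = 22050                 # samples per second of audio

samples_per_sec_fp16 = 565629         # mixed-precision throughput from the table above
print(samples_per_sec_fp16 / SAMPLING_RATE)   # ~25.7x faster than real time
```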
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
## Changelog
## Release notes
### Changelog
March 2019
* Initial release
@ -662,5 +673,12 @@ June 2019
* Data preprocessing for Tacotron 2 training
* Fixed dropouts on LSTMCells
## Known issues
July 2019
* Changed measurement units for Tacotron 2 training and inference performance
benchmarks from input tokens per second to output mel-spectrograms per second
* Introduced batched inference
* Included warmup in the inference script
### Known issues
There are no known issues in this release.

View file

@ -27,7 +27,6 @@
import torch
from librosa.filters import mel as librosa_mel_fn
from torch.nn import functional as F
from common.audio_processing import dynamic_range_compression, dynamic_range_decompression
from common.stft import STFT

View file

@ -33,7 +33,6 @@ import numpy as np
from scipy.io.wavfile import write
import sys
sys.path.append('waveglow/')
import time
from dllogger.logger import LOGGER
@ -46,19 +45,15 @@ def parse_args(parser):
"""
Parse commandline arguments.
"""
parser.add_argument('-i', '--input', type=str, default="",
parser.add_argument('-i', '--input', type=str, required=True,
help='full path to the input text (phrases separated by new line); \
if not provided then use default text')
parser.add_argument('-o', '--output', required=True,
help='output folder to save audio (file per phrase)')
parser.add_argument('--tacotron2', type=str, default="",
parser.add_argument('--tacotron2', type=str,
help='full path to the Tacotron2 model checkpoint file')
parser.add_argument('--mel-file', type=str, default="",
help='set if using mel spectrograms instead of Tacotron2 model')
parser.add_argument('--waveglow', required=True,
parser.add_argument('--waveglow', type=str,
help='full path to the WaveGlow model checkpoint file')
parser.add_argument('--old-waveglow', action='store_true',
help='set if WaveGlow checkpoint is from GitHub.com/NVIDIA/waveglow')
parser.add_argument('-s', '--sigma-infer', default=0.6, type=float)
parser.add_argument('-sr', '--sampling-rate', default=22050, type=int,
help='Sampling rate')
@ -66,6 +61,10 @@ def parse_args(parser):
help='inference with AMP')
parser.add_argument('--log-file', type=str, default='nvlog.json',
help='Filename for logging')
parser.add_argument('--include-warmup', action='store_true',
help='Include warmup')
parser.add_argument('--stft-hop-length', type=int, default=256,
help='STFT hop length for estimating audio length from mel size')
return parser
@ -135,6 +134,41 @@ def load_and_setup_model(model_name, parser, checkpoint, amp_run):
return model
# taken from tacotron2/data_function.py:TextMelCollate.__call__
def pad_sequences(batch):
    # Right zero-pad all one-hot text sequences to max input length
    input_lengths, ids_sorted_decreasing = torch.sort(
        torch.LongTensor([len(x) for x in batch]),
        dim=0, descending=True)
    max_input_len = input_lengths[0]

    text_padded = torch.LongTensor(len(batch), max_input_len)
    text_padded.zero_()
    for i in range(len(ids_sorted_decreasing)):
        text = batch[ids_sorted_decreasing[i]]
        text_padded[i, :text.size(0)] = text

    return text_padded, input_lengths


def prepare_input_sequence(texts):
    d = []
    for i,text in enumerate(texts):
        d.append(torch.IntTensor(
            text_to_sequence(text, ['english_cleaners'])[:]))

    text_padded, input_lengths = pad_sequences(d)
    if torch.cuda.is_available():
        text_padded = torch.autograd.Variable(text_padded).cuda().long()
        input_lengths = torch.autograd.Variable(input_lengths).cuda().long()
    else:
        text_padded = torch.autograd.Variable(text_padded).long()
        input_lengths = torch.autograd.Variable(input_lengths).long()

    return text_padded, input_lengths
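# Hypothetical usage of the helpers above (illustration only, not part of this file):
#   texts = ["She sells seashells by the seashore, shells she sells are great"]
#   sequences_padded, input_lengths = prepare_input_sequence(texts)
# sequences_padded is a zero-padded LongTensor of shape [batch, max_len] and
# input_lengths holds the corresponding lengths sorted in descending order,
# matching what tacotron2.infer(sequences_padded, input_lengths) expects in main().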
def main():
"""
Launches text to speech (inference).
@ -153,70 +187,65 @@ def main():
logging_scope=dllg.TRAIN_ITER_SCOPE, iteration_interval=1)
])
LOGGER.register_metric("tacotron2_items_per_sec", metric_scope=dllg.TRAIN_ITER_SCOPE)
LOGGER.register_metric("tacotron2_latency", metric_scope=dllg.TRAIN_ITER_SCOPE)
LOGGER.register_metric("waveglow_items_per_sec", metric_scope=dllg.TRAIN_ITER_SCOPE)
LOGGER.register_metric("waveglow_latency", metric_scope=dllg.TRAIN_ITER_SCOPE)
LOGGER.register_metric("latency", metric_scope=dllg.TRAIN_ITER_SCOPE)
log_hardware()
log_args(args)
# tacotron2 model filepath was specified
if args.tacotron2:
# Setup Tacotron2
tacotron2 = load_and_setup_model('Tacotron2', parser, args.tacotron2,
args.amp_run)
# file with mel spectrogram was specified
elif args.mel_file:
mel = torch.load(args.mel_file)
mel = torch.autograd.Variable(mel.cuda())
mel = torch.unsqueeze(mel, 0)
# Setup WaveGlow
if args.old_waveglow:
waveglow = torch.load(args.waveglow)['model']
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.cuda()
waveglow.eval()
else:
waveglow = load_and_setup_model('WaveGlow', parser, args.waveglow,
args.amp_run)
tacotron2 = load_and_setup_model('Tacotron2', parser, args.tacotron2,
args.amp_run)
waveglow = load_and_setup_model('WaveGlow', parser, args.waveglow,
args.amp_run)
texts = []
try:
f = open(args.input, 'r')
texts = f.readlines()
except:
print("Could not read file. Using default text.")
texts = ["The forms of printed letters should be beautiful, and\
that their arrangement on the page should be reasonable and\
a help to the shapeliness of the letters themselves."]
print("Could not read file")
sys.exit(1)
for i, text in enumerate(texts):
LOGGER.iteration_start()
sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.autograd.Variable(
torch.from_numpy(sequence)).cuda().long()
if args.tacotron2:
tacotron2_t0 = time.time()
if args.include_warmup:
sequence = torch.randint(low=0, high=148, size=(1,50),
dtype=torch.long).cuda()
input_lengths = torch.IntTensor([sequence.size(1)]).cuda().long()
for i in range(3):
with torch.no_grad():
_, mel, _, _ = tacotron2.infer(sequence)
tacotron2_t1 = time.time()
tacotron2_infer_perf = sequence.size(1)/(tacotron2_t1-tacotron2_t0)
LOGGER.log(key="tacotron2_items_per_sec", value=tacotron2_infer_perf)
_, mel, _, _, mel_lengths = tacotron2.infer(sequence, input_lengths)
_ = waveglow.infer(mel)
waveglow_t0 = time.time()
with torch.no_grad():
audio = waveglow.infer(mel, sigma=args.sigma_infer)
audio = audio.float()
waveglow_t1 = time.time()
waveglow_infer_perf = audio[0].size(0)/(waveglow_t1-waveglow_t0)
LOGGER.iteration_start()
sequences_padded, input_lengths = prepare_input_sequence(texts)
tacotron2_t0 = time.time()
with torch.no_grad():
_, mel, _, _, mel_lengths = tacotron2.infer(sequences_padded, input_lengths)
tacotron2_t1 = time.time()
waveglow_t0 = time.time()
with torch.no_grad():
audios = waveglow.infer(mel, sigma=args.sigma_infer)
audios = audios.float()
waveglow_t1 = time.time()
tacotron2_infer_perf = mel.size(0)*mel.size(2)/(tacotron2_t1-tacotron2_t0)
LOGGER.log(key="tacotron2_items_per_sec", value=tacotron2_infer_perf)
LOGGER.log(key="tacotron2_latency", value=(tacotron2_t1-tacotron2_t0))
waveglow_infer_perf = audios.size(0)*audios.size(1)/(waveglow_t1-waveglow_t0)
LOGGER.log(key="waveglow_items_per_sec", value=waveglow_infer_perf)
LOGGER.log(key="waveglow_latency", value=(waveglow_t1-waveglow_t0))
LOGGER.log(key="latency", value=(waveglow_t1-tacotron2_t0))
for i, audio in enumerate(audios):
audio_path = args.output + "audio_"+str(i)+".wav"
write(audio_path, args.sampling_rate, audio[0].data.cpu().numpy())
write(audio_path, args.sampling_rate,
audio.data.cpu().numpy()[:mel_lengths[i]*args.stft_hop_length])
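# Each mel frame corresponds to --stft-hop-length audio samples, so slicing at
# mel_lengths[i]*args.stft_hop_length trims the padded tail of shorter items in the batch.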
LOGGER.log(key="waveglow_items_per_sec", value=waveglow_infer_perf)
LOGGER.iteration_stop()
LOGGER.iteration_stop()
LOGGER.finish()

View file

@ -25,12 +25,10 @@
#
# *****************************************************************************
from tacotron2.text import text_to_sequence
import models
import torch
import argparse
import numpy as np
from scipy.io.wavfile import write
import json
import time
@ -49,43 +47,17 @@ def parse_args(parser):
"""
parser.add_argument('-m', '--model-name', type=str, default='', required=True,
help='Model to train')
parser.add_argument('--input-text', type=str, default=None,
help='Path to tensor containing text (when running Tacotron2)')
parser.add_argument('--input-mels', type=str, default=None,
help='Path to tensor containing mels (when running WaveGlow)')
parser.add_argument('-sr', '--sampling-rate', default=22050, type=int,
help='Sampling rate')
parser.add_argument('--amp-run', action='store_true',
help='inference with AMP')
parser.add_argument('-bs', '--batch-size', type=int, default=1)
parser.add_argument('--create-benchmark', action='store_true')
parser.add_argument('--log-file', type=str, default='nvlog.json',
help='Filename for logging')
return parser
def collate_text(batch):
"""Collate's training batch from normalized text and mel-spectrogram
PARAMS
------
batch: [text_normalized]
"""
# Right zero-pad all one-hot text sequences to max input length
input_lengths, ids_sorted_decreasing = torch.sort(
torch.LongTensor([len(x) for x in batch]),
dim=0, descending=True)
max_input_len = input_lengths[0]
text_padded = torch.LongTensor(len(batch), max_input_len)
text_padded.zero_()
for i in range(len(ids_sorted_decreasing)):
text = batch[ids_sorted_decreasing[i]]
text_padded[i, :text.size(0)] = text
return text_padded, input_lengths
def main():
"""
Launches inference benchmark.
@ -107,55 +79,36 @@ def main():
])
LOGGER.register_metric("items_per_sec",
metric_scope=dllg.TRAIN_ITER_SCOPE)
LOGGER.register_metric("latency",
metric_scope=dllg.TRAIN_ITER_SCOPE)
log_hardware()
log_args(args)
# ### uncomment to generate new padded text
# texts = []
# f = open('qa/ljs_text_train_subset_2500.txt', 'r')
# texts = f.readlines()
# sequence = []
# for i, text in enumerate(texts):
# sequence.append(torch.IntTensor(text_to_sequence(text, ['english_cleaners'])))
model = load_and_setup_model(args.model_name, parser, None, args.amp_run)
# text_padded, input_lengths = collate_text(sequence)
# text_padded = torch.autograd.Variable(text_padded).cuda().long()
# torch.save(text_padded, "qa/text_padded.pt")
# torch.save(input_lengths, "qa/input_lengths.pt")
model = load_and_setup_model(args.model_name, parser, None,
args.amp_run)
dry_runs = 3
num_iters = (16+dry_runs) if args.create_benchmark else (1+dry_runs)
warmup_iters = 3
num_iters = 1+warmup_iters
for i in range(num_iters):
## skipping the first inference which is slower
if i >= dry_runs:
if i >= warmup_iters:
LOGGER.iteration_start()
if args.model_name == 'Tacotron2':
text_padded = torch.load(args.input_text)
text_padded = text_padded[:args.batch_size]
text_padded = torch.autograd.Variable(text_padded).cuda().long()
text_padded = torch.randint(low=0, high=148, size=(args.batch_size, 140),
dtype=torch.long).cuda()
input_lengths = torch.IntTensor([text_padded.size(1)]*args.batch_size).cuda().long()
t0 = time.time()
with torch.no_grad():
_, mels, _, _ = model.infer(text_padded)
t1 = time.time()
inference_time= t1 - t0
num_items = text_padded.size(0)*text_padded.size(1)
# # ## uncomment to generate new padded mels
# torch.save(mels, "qa/mel_padded.pt")
_, mels, _, _, _ = model.infer(text_padded, input_lengths)
inference_time = time.time() - t0
num_items = mels.size(0)*mels.size(2)
if args.model_name == 'WaveGlow':
mel_padded = torch.load(args.input_mels)
mel_padded = torch.cat((mel_padded, mel_padded, mel_padded, mel_padded))
mel_padded = mel_padded[:args.batch_size]
mel_padded = mel_padded.cuda()
n_mel_channels = model.upsample.in_channels
num_mels = 895
mel_padded = torch.zeros(args.batch_size, n_mel_channels,
num_mels).normal_(-5.62, 1.98).cuda()
if args.amp_run:
mel_padded = mel_padded.half()
@ -163,12 +116,12 @@ def main():
with torch.no_grad():
audios = model.infer(mel_padded)
audios = audios.float()
t1 = time.time()
inference_time = t1 - t0
inference_time = time.time() - t0
num_items = audios.size(0)*audios.size(1)
if i >= dry_runs:
if i >= warmup_iters:
LOGGER.log(key="items_per_sec", value=(num_items/inference_time))
LOGGER.log(key="latency", value=inference_time)
LOGGER.iteration_stop()
LOGGER.finish()

View file

@ -25,24 +25,28 @@
#
# *****************************************************************************
import sys
from os.path import abspath, dirname
# enabling modules discovery from global entrypoint
sys.path.append(abspath(dirname(__file__)+'/'))
from tacotron2.model import Tacotron2
from waveglow.model import WaveGlow
from tacotron2.arg_parser import parse_tacotron2_args
from waveglow.arg_parser import parse_waveglow_args
import torch
def parse_model_args(model_name, parser, add_help=False):
if model_name == 'Tacotron2':
from tacotron2.arg_parser import parse_tacotron2_args
return parse_tacotron2_args(parser, add_help)
if model_name == 'WaveGlow':
from waveglow.arg_parser import parse_waveglow_args
return parse_waveglow_args(parser, add_help)
else:
raise NotImplementedError(model_name)
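# Hypothetical usage (illustration only): extend an existing argparse parser with
# the options of one model, e.g.
#   parser = argparse.ArgumentParser(add_help=False)
#   parser = parse_model_args('Tacotron2', parser)
#   args, _ = parser.parse_known_args()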
def batchnorm_to_float(module):
"""Converts batch norm modules to FP32"""
"""Converts batch norm to FP32"""
if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
module.float()
for child in module.children():

View file

@ -0,0 +1,2 @@
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.

View file

@ -0,0 +1 @@
She sells seashells by the seashore, shells she sells are great

View file

@ -0,0 +1,4 @@
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.

View file

@ -0,0 +1,4 @@
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great

View file

@ -0,0 +1,8 @@
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.
The forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves and the form of printed letters should be beautiful, and that their arrangement on pages.

View file

@ -0,0 +1,8 @@
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great
She sells seashells by the seashore, shells she sells are great

View file

@ -1,2 +0,0 @@
mkdir -p output
python train.py -m Tacotron2 -o output/ --fp16-run -lr 1e-3 --epochs 2001 -bs 128 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.3

View file

@ -1,2 +0,0 @@
mkdir -p output
python -m multiproc train.py -m Tacotron2 -o output/ --fp16-run -lr 1e-3 --epochs 2001 -bs 128 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.3

View file

@ -1,2 +0,0 @@
mkdir -p output
python -m multiproc train.py -m Tacotron2 -o output/ --fp16-run -lr 1e-3 --epochs 2001 -bs 128 --weight-decay 1e-6 --grad-clip-thresh 1.0 --cudnn-enabled --load-mel-from-disk --training-files=filelists/ljs_mel_text_train_filelist.txt --validation-files=filelists/ljs_mel_text_val_filelist.txt --log-file output/nvlog.json --anneal-steps 500 1000 1500 --anneal-factor 0.3

View file

@ -1,2 +0,0 @@
mkdir -p output
python train.py -m WaveGlow -o output/ --fp16-run -lr 1e-4 --epochs 2001 -bs 10 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-benchmark --cudnn-enabled --log-file output/nvlog.json

View file

@ -1,2 +0,0 @@
mkdir -p output
python -m multiproc train.py -m WaveGlow -o output/ --fp16-run -lr 1e-4 --epochs 2001 -bs 10 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-benchmark --cudnn-enabled --log-file output/nvlog.json

View file

@ -1,2 +0,0 @@
mkdir -p output
python -m multiproc train.py -m WaveGlow -o output/ --fp16-run -lr 1e-4 --epochs 2001 -bs 10 --segment-length 8000 --weight-decay 0 --grad-clip-thresh 65504.0 --cudnn-benchmark --cudnn-enabled --log-file output/nvlog.json

View file

@ -69,7 +69,7 @@ def parse_tacotron2_args(parent, add_help=False):
help='Number of units in decoder LSTM')
decoder.add_argument('--prenet-dim', default=256, type=int,
help='Number of ReLU units in prenet layers')
decoder.add_argument('--max-decoder-steps', default=1000, type=int,
decoder.add_argument('--max-decoder-steps', default=2000, type=int,
help='Maximum number of output mel spectrograms')
decoder.add_argument('--gate-threshold', default=0.5, type=float,
help='Probability threshold for stop token')

View file

@ -151,5 +151,5 @@ def batch_to_gpu(batch):
output_lengths = to_gpu(output_lengths).long()
x = (text_padded, input_lengths, mel_padded, max_len, output_lengths)
y = (mel_padded, gate_padded)
len_x = torch.sum(input_lengths)
len_x = torch.sum(output_lengths)
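# len_x now counts output mel frames rather than input tokens, so the items/sec
# metric reported during training is output mel-spectrograms per second.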
return (x, y, len_x)

View file

@ -30,6 +30,10 @@ import torch
from torch.autograd import Variable
from torch import nn
from torch.nn import functional as F
import sys
from os.path import abspath, dirname
# enabling modules discovery from global entrypoint
sys.path.append(abspath(dirname(__file__)+'/../'))
from common.layers import ConvNorm, LinearNorm
from common.utils import to_gpu, get_mask_from_lengths
@ -122,9 +126,17 @@ class Prenet(nn.Module):
[LinearNorm(in_size, out_size, bias=False)
for (in_size, out_size) in zip(in_sizes, sizes)])
def forward(self, x):
for linear in self.layers:
x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
def forward(self, x, inference=False):
if inference:
for linear in self.layers:
x = F.relu(linear(x))
x0 = x[0].unsqueeze(0)
mask = Variable(torch.bernoulli(x0.data.new(x0.data.size()).fill_(0.5)))
mask = mask.expand(x.size(0), x.size(1))
x = x*mask*2
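# Inference-time dropout: a single Bernoulli(0.5) mask is drawn with the shape of the
# first batch item and shared across the batch; multiplying by 2 (= 1/keep_prob) keeps
# the expected activation scale, mirroring F.dropout in the training branch below.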
else:
for linear in self.layers:
x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
return x
@ -457,7 +469,7 @@ class Decoder(nn.Module):
return mel_outputs, gate_outputs, alignments
def infer(self, memory):
def infer(self, memory, memory_lengths):
""" Decoder inference
PARAMS
------
@ -471,13 +483,23 @@ class Decoder(nn.Module):
"""
decoder_input = self.get_go_frame(memory)
self.initialize_decoder_states(memory, mask=None)
if memory.size(0) > 1:
mask = ~get_mask_from_lengths(memory_lengths)
else:
mask = None
self.initialize_decoder_states(memory, mask=mask)
mel_lengths = torch.zeros([memory.size(0)], dtype=torch.int32)
not_finished = torch.ones([memory.size(0)], dtype=torch.int32)
if torch.cuda.is_available():
mel_lengths = mel_lengths.cuda()
not_finished = not_finished.cuda()
mel_lengths = torch.zeros([memory.size(0)], dtype=torch.int32).cuda()
not_finished = torch.ones([memory.size(0)], dtype=torch.int32).cuda()
mel_outputs, gate_outputs, alignments = [], [], []
while True:
decoder_input = self.prenet(decoder_input)
decoder_input = self.prenet(decoder_input, inference=True)
mel_output, gate_output, alignment = self.decode(decoder_input)
mel_outputs += [mel_output.squeeze(1)]
@ -500,7 +522,7 @@ class Decoder(nn.Module):
mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs(
mel_outputs, gate_outputs, alignments)
return mel_outputs, gate_outputs, alignments
return mel_outputs, gate_outputs, alignments, mel_lengths
class Tacotron2(nn.Module):
@ -552,9 +574,6 @@ class Tacotron2(nn.Module):
(text_padded, input_lengths, mel_padded, max_len, output_lengths),
(mel_padded, gate_padded))
def parse_input(self, inputs):
return inputs
def parse_output(self, outputs, output_lengths=None):
if self.mask_padding and output_lengths is not None:
mask = ~get_mask_from_lengths(output_lengths)
@ -568,8 +587,7 @@ class Tacotron2(nn.Module):
return outputs
def forward(self, inputs):
inputs, input_lengths, targets, max_len, \
output_lengths = self.parse_input(inputs)
inputs, input_lengths, targets, max_len, output_lengths = inputs
input_lengths, output_lengths = input_lengths.data, output_lengths.data
embedded_inputs = self.embedding(inputs).transpose(1, 2)
@ -586,17 +604,17 @@ class Tacotron2(nn.Module):
[mel_outputs, mel_outputs_postnet, gate_outputs, alignments],
output_lengths)
def infer(self, inputs):
inputs = self.parse_input(inputs)
def infer(self, inputs, input_lengths):
embedded_inputs = self.embedding(inputs).transpose(1, 2)
encoder_outputs = self.encoder.infer(embedded_inputs)
mel_outputs, gate_outputs, alignments = self.decoder.infer(
encoder_outputs)
encoder_outputs = self.encoder(embedded_inputs, input_lengths)
mel_outputs, gate_outputs, alignments, mel_lengths = self.decoder.infer(
encoder_outputs, input_lengths)
mel_outputs_postnet = self.postnet(mel_outputs)
mel_outputs_postnet = mel_outputs + mel_outputs_postnet
outputs = self.parse_output(
[mel_outputs, mel_outputs_postnet, gate_outputs, alignments])
[mel_outputs, mel_outputs_postnet, gate_outputs, alignments, mel_lengths])
return outputs

View file

@ -1,311 +0,0 @@
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
import copy
import torch
from torch.autograd import Variable
import torch.nn.functional as F
@torch.jit.script
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
n_channels_int = n_channels[0]
in_act = input_a+input_b
t_act = torch.tanh(in_act[:, :n_channels_int, :])
s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
acts = t_act * s_act
return acts
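# WaveGlowLoss below is the normalizing-flow negative log-likelihood under a
# zero-mean Gaussian prior with standard deviation sigma:
#   loss = sum(z^2) / (2 * sigma^2) - sum_i log s_i - sum_k log det W_k,
# averaged over every element of z.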
class WaveGlowLoss(torch.nn.Module):
def __init__(self, sigma=1.0):
super(WaveGlowLoss, self).__init__()
self.sigma = sigma
def forward(self, model_output):
z, log_s_list, log_det_W_list = model_output
for i, log_s in enumerate(log_s_list):
if i == 0:
log_s_total = torch.sum(log_s)
log_det_W_total = log_det_W_list[i]
else:
log_s_total = log_s_total + torch.sum(log_s)
log_det_W_total += log_det_W_list[i]
loss = torch.sum(z*z)/(2*self.sigma*self.sigma) - log_s_total - log_det_W_total
return loss/(z.size(0)*z.size(1)*z.size(2))
class Invertible1x1Conv(torch.nn.Module):
"""
The layer outputs both the convolution and the log determinant
of its weight matrix. If reverse=True it performs the convolution
with the inverse of its weight matrix.
"""
def __init__(self, c):
super(Invertible1x1Conv, self).__init__()
self.conv = torch.nn.Conv1d(c, c, kernel_size=1, stride=1, padding=0,
bias=False)
# Sample a random orthonormal matrix to initialize weights
W = torch.qr(torch.FloatTensor(c, c).normal_())[0]
# Ensure determinant is 1.0 not -1.0
if torch.det(W) < 0:
W[:,0] = -1*W[:,0]
W = W.view(c, c, 1)
self.conv.weight.data = W
def forward(self, z, reverse=False):
# shape
batch_size, group_size, n_of_groups = z.size()
W = self.conv.weight.squeeze()
if reverse:
if not hasattr(self, 'W_inverse'):
# Reverse computation
W_inverse = W.inverse()
W_inverse = Variable(W_inverse[..., None])
if z.type() == 'torch.cuda.HalfTensor' or z.type() == 'torch.HalfTensor': W_inverse = W_inverse.half()
self.W_inverse = W_inverse
z = F.conv1d(z, self.W_inverse, bias=None, stride=1, padding=0)
return z
else:
# Forward computation
log_det_W = batch_size * n_of_groups * torch.logdet(W)
z = self.conv(z)
return z, log_det_W
class WN(torch.nn.Module):
"""
This is the WaveNet-like layer for the affine coupling. The primary difference
from WaveNet is that the convolutions need not be causal. There is also no
dilation size reset; the dilation only doubles on each layer.
"""
def __init__(self, n_in_channels, n_mel_channels, n_layers, n_channels,
kernel_size):
super(WN, self).__init__()
assert(kernel_size % 2 == 1)
assert(n_channels % 2 == 0)
self.n_layers = n_layers
self.n_channels = n_channels
self.in_layers = torch.nn.ModuleList()
self.res_skip_layers = torch.nn.ModuleList()
self.cond_layers = torch.nn.ModuleList()
start = torch.nn.Conv1d(n_in_channels, n_channels, 1)
start = torch.nn.utils.weight_norm(start, name='weight')
self.start = start
# Initializing last layer to 0 makes the affine coupling layers
# do nothing at first. This helps with training stability
end = torch.nn.Conv1d(n_channels, 2*n_in_channels, 1)
end.weight.data.zero_()
end.bias.data.zero_()
self.end = end
for i in range(n_layers):
dilation = 2 ** i
padding = int((kernel_size*dilation - dilation)/2)
in_layer = torch.nn.Conv1d(n_channels, 2*n_channels, kernel_size,
dilation=dilation, padding=padding)
in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
self.in_layers.append(in_layer)
cond_layer = torch.nn.Conv1d(n_mel_channels, 2*n_channels, 1)
cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
self.cond_layers.append(cond_layer)
# the residual half is not needed after the last layer
if i < n_layers - 1:
res_skip_channels = 2*n_channels
else:
res_skip_channels = n_channels
res_skip_layer = torch.nn.Conv1d(n_channels, res_skip_channels, 1)
res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
self.res_skip_layers.append(res_skip_layer)
def forward(self, forward_input):
audio, spect = forward_input
audio = self.start(audio)
for i in range(self.n_layers):
acts = fused_add_tanh_sigmoid_multiply(
self.in_layers[i](audio),
self.cond_layers[i](spect),
torch.IntTensor([self.n_channels]))
res_skip_acts = self.res_skip_layers[i](acts)
if i < self.n_layers - 1:
audio = res_skip_acts[:,:self.n_channels,:] + audio
skip_acts = res_skip_acts[:,self.n_channels:,:]
else:
skip_acts = res_skip_acts
if i == 0:
output = skip_acts
else:
output = skip_acts + output
return self.end(output)
class WaveGlow(torch.nn.Module):
def __init__(self, n_mel_channels, n_flows, n_group, n_early_every,
n_early_size, WN_config):
super(WaveGlow, self).__init__()
self.upsample = torch.nn.ConvTranspose1d(n_mel_channels,
n_mel_channels,
1024, stride=256)
assert(n_group % 2 == 0)
self.n_flows = n_flows
self.n_group = n_group
self.n_early_every = n_early_every
self.n_early_size = n_early_size
self.WN = torch.nn.ModuleList()
self.convinv = torch.nn.ModuleList()
n_half = int(n_group/2)
# Set up layers with the right sizes based on how many dimensions
# have been output already
n_remaining_channels = n_group
for k in range(n_flows):
if k % self.n_early_every == 0 and k > 0:
n_half = n_half - int(self.n_early_size/2)
n_remaining_channels = n_remaining_channels - self.n_early_size
self.convinv.append(Invertible1x1Conv(n_remaining_channels))
self.WN.append(WN(n_half, n_mel_channels*n_group, **WN_config))
self.n_remaining_channels = n_remaining_channels # Useful during inference
def forward(self, forward_input):
"""
forward_input[0] = mel_spectrogram: batch x n_mel_channels x frames
forward_input[1] = audio: batch x time
"""
spect, audio = forward_input
# Upsample spectrogram to size of audio
spect = self.upsample(spect)
assert(spect.size(2) >= audio.size(1))
if spect.size(2) > audio.size(1):
spect = spect[:, :, :audio.size(1)]
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
audio = audio.unfold(1, self.n_group, self.n_group).permute(0, 2, 1)
output_audio = []
log_s_list = []
log_det_W_list = []
for k in range(self.n_flows):
if k % self.n_early_every == 0 and k > 0:
output_audio.append(audio[:,:self.n_early_size,:])
audio = audio[:,self.n_early_size:,:]
audio, log_det_W = self.convinv[k](audio)
log_det_W_list.append(log_det_W)
n_half = int(audio.size(1)/2)
audio_0 = audio[:,:n_half,:]
audio_1 = audio[:,n_half:,:]
output = self.WN[k]((audio_0, spect))
log_s = output[:, n_half:, :]
b = output[:, :n_half, :]
audio_1 = torch.exp(log_s)*audio_1 + b
log_s_list.append(log_s)
audio = torch.cat([audio_0, audio_1],1)
output_audio.append(audio)
return torch.cat(output_audio,1), log_s_list, log_det_W_list
def infer(self, spect, sigma=1.0):
print("+++++++++++++++++++++glow.py")
spect = self.upsample(spect)
# trim transposed-conv edge artifacts (alternatively, the spectrogram could be padded to a kernel multiple)
time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0]
spect = spect[:, :, :-time_cutoff]
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
if spect.type() == 'torch.cuda.HalfTensor':
audio = torch.cuda.HalfTensor(spect.size(0),
self.n_remaining_channels,
spect.size(2)).normal_()
else:
audio = torch.cuda.FloatTensor(spect.size(0),
self.n_remaining_channels,
spect.size(2)).normal_()
audio = torch.autograd.Variable(sigma*audio)
for k in reversed(range(self.n_flows)):
n_half = int(audio.size(1)/2)
audio_0 = audio[:,:n_half,:]
audio_1 = audio[:,n_half:,:]
output = self.WN[k]((audio_0, spect))
s = output[:, n_half:, :]
b = output[:, :n_half, :]
audio_1 = (audio_1 - b)/torch.exp(s)
audio = torch.cat([audio_0, audio_1],1)
audio = self.convinv[k](audio, reverse=True)
if k % self.n_early_every == 0 and k > 0:
if spect.type() == 'torch.cuda.HalfTensor':
z = torch.cuda.HalfTensor(spect.size(0), self.n_early_size, spect.size(2)).normal_()
else:
z = torch.cuda.FloatTensor(spect.size(0), self.n_early_size, spect.size(2)).normal_()
audio = torch.cat((sigma*z, audio),1)
audio = audio.permute(0,2,1).contiguous().view(audio.size(0), -1).data
return audio
@staticmethod
def remove_weightnorm(model):
waveglow = model
for WN in waveglow.WN:
WN.start = torch.nn.utils.remove_weight_norm(WN.start)
WN.in_layers = remove(WN.in_layers)
WN.cond_layers = remove(WN.cond_layers)
WN.res_skip_layers = remove(WN.res_skip_layers)
return waveglow
def remove(conv_list):
new_conv_list = torch.nn.ModuleList()
for old_conv in conv_list:
old_conv = torch.nn.utils.remove_weight_norm(old_conv)
new_conv_list.append(old_conv)
return new_conv_list
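
For reference, a minimal sketch of how the WaveGlow module removed above was typically driven at inference time. The hyperparameters are illustrative assumptions (real checkpoints use larger values), and a CUDA device is required because `infer()` allocates `torch.cuda` tensors:

```python
import torch

# Tiny, illustrative configuration — not the values used for released checkpoints.
wn_config = dict(n_layers=4, n_channels=64, kernel_size=3)
model = WaveGlow(n_mel_channels=80, n_flows=4, n_group=8,
                 n_early_every=4, n_early_size=2, WN_config=wn_config).cuda().eval()
model = WaveGlow.remove_weightnorm(model)   # fold weight norm for inference

mel = torch.randn(1, 80, 60).cuda()         # batch x n_mel_channels x frames
with torch.no_grad():
    audio = model.infer(mel, sigma=0.6)     # batch x samples (~256 per mel frame)
```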

View file

@@ -1,269 +0,0 @@
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
import copy
import torch
from waveglow.glow import Invertible1x1Conv, remove
@torch.jit.script
def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
n_channels_int = n_channels[0]
in_act = input_a+input_b
t_act = torch.tanh(in_act[:, :n_channels_int, :])
s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
acts = t_act * s_act
return acts
class WN(torch.nn.Module):
"""
This is the WaveNet-like layer for the affine coupling. The primary difference
from WaveNet is that the convolutions need not be causal. There is also no
dilation size reset; the dilation only doubles on each layer.
"""
def __init__(self, n_in_channels, n_mel_channels, n_layers, n_channels,
kernel_size):
super(WN, self).__init__()
assert(kernel_size % 2 == 1)
assert(n_channels % 2 == 0)
self.n_layers = n_layers
self.n_channels = n_channels
self.in_layers = torch.nn.ModuleList()
self.res_skip_layers = torch.nn.ModuleList()
self.cond_layers = torch.nn.ModuleList()
start = torch.nn.Conv1d(n_in_channels, n_channels, 1)
start = torch.nn.utils.weight_norm(start, name='weight')
self.start = start
# Initializing last layer to 0 makes the affine coupling layers
# do nothing at first. This helps with training stability
end = torch.nn.Conv1d(n_channels, 2*n_in_channels, 1)
end.weight.data.zero_()
end.bias.data.zero_()
self.end = end
for i in range(n_layers):
dilation = 2 ** i
padding = int((kernel_size*dilation - dilation)/2)
in_layer = torch.nn.Conv1d(n_channels, 2*n_channels, kernel_size,
dilation=dilation, padding=padding)
in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
self.in_layers.append(in_layer)
cond_layer = torch.nn.Conv1d(n_mel_channels, 2*n_channels, 1)
cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
self.cond_layers.append(cond_layer)
# the residual half is not needed after the last layer
if i < n_layers - 1:
res_skip_channels = 2*n_channels
else:
res_skip_channels = n_channels
res_skip_layer = torch.nn.Conv1d(n_channels, res_skip_channels, 1)
res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
self.res_skip_layers.append(res_skip_layer)
def forward(self, forward_input):
audio, spect = forward_input
audio = self.start(audio)
for i in range(self.n_layers):
acts = fused_add_tanh_sigmoid_multiply(
self.in_layers[i](audio),
self.cond_layers[i](spect),
torch.IntTensor([self.n_channels]))
res_skip_acts = self.res_skip_layers[i](acts)
if i < self.n_layers - 1:
audio = res_skip_acts[:,:self.n_channels,:] + audio
skip_acts = res_skip_acts[:,self.n_channels:,:]
else:
skip_acts = res_skip_acts
if i == 0:
output = skip_acts
else:
output = skip_acts + output
return self.end(output)
class WaveGlow(torch.nn.Module):
def __init__(self, n_mel_channels, n_flows, n_group, n_early_every,
n_early_size, WN_config):
super(WaveGlow, self).__init__()
self.upsample = torch.nn.ConvTranspose1d(n_mel_channels,
n_mel_channels,
1024, stride=256)
assert(n_group % 2 == 0)
self.n_flows = n_flows
self.n_group = n_group
self.n_early_every = n_early_every
self.n_early_size = n_early_size
self.WN = torch.nn.ModuleList()
self.convinv = torch.nn.ModuleList()
n_half = int(n_group/2)
# Set up layers with the right sizes based on how many dimensions
# have been output already
n_remaining_channels = n_group
for k in range(n_flows):
if k % self.n_early_every == 0 and k > 0:
n_half = n_half - int(self.n_early_size/2)
n_remaining_channels = n_remaining_channels - self.n_early_size
self.convinv.append(Invertible1x1Conv(n_remaining_channels))
self.WN.append(WN(n_half, n_mel_channels*n_group, **WN_config))
self.n_remaining_channels = n_remaining_channels # Useful during inference
def forward(self, forward_input):
return None
"""
forward_input[0] = audio: batch x time
forward_input[1] = upsamp_spectrogram: batch x n_cond_channels x time
"""
"""
spect, audio = forward_input
# Upsample spectrogram to size of audio
spect = self.upsample(spect)
assert(spect.size(2) >= audio.size(1))
if spect.size(2) > audio.size(1):
spect = spect[:, :, :audio.size(1)]
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
audio = audio.unfold(1, self.n_group, self.n_group).permute(0, 2, 1)
output_audio = []
s_list = []
s_conv_list = []
for k in range(self.n_flows):
if k%4 == 0 and k > 0:
output_audio.append(audio[:,:self.n_multi,:])
audio = audio[:,self.n_multi:,:]
# project to new basis
audio, s = self.convinv[k](audio)
s_conv_list.append(s)
n_half = int(audio.size(1)/2)
if k%2 == 0:
audio_0 = audio[:,:n_half,:]
audio_1 = audio[:,n_half:,:]
else:
audio_1 = audio[:,:n_half,:]
audio_0 = audio[:,n_half:,:]
output = self.nn[k]((audio_0, spect))
s = output[:, n_half:, :]
b = output[:, :n_half, :]
audio_1 = torch.exp(s)*audio_1 + b
s_list.append(s)
if k%2 == 0:
audio = torch.cat([audio[:,:n_half,:], audio_1],1)
else:
audio = torch.cat([audio_1, audio[:,n_half:,:]], 1)
output_audio.append(audio)
return torch.cat(output_audio,1), s_list, s_conv_list
"""
def infer(self, spect, sigma=1.0):
spect = self.upsample(spect)
# trim transposed-conv edge artifacts (alternatively, the spectrogram could be padded to a kernel multiple)
time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0]
spect = spect[:, :, :-time_cutoff]
spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3)
spect = spect.contiguous().view(spect.size(0), spect.size(1), -1).permute(0, 2, 1)
if spect.type() == 'torch.cuda.HalfTensor':
audio = torch.cuda.HalfTensor(spect.size(0),
self.n_remaining_channels,
spect.size(2)).normal_()
elif spect.type() == 'torch.cuda.FloatTensor':
audio = torch.cuda.FloatTensor(spect.size(0),
self.n_remaining_channels,
spect.size(2)).normal_()
else:
audio = torch.FloatTensor(spect.size(0),
self.n_remaining_channels,
spect.size(2)).normal_()
audio = torch.autograd.Variable(sigma*audio)
for k in reversed(range(self.n_flows)):
n_half = int(audio.size(1)/2)
if k%2 == 0:
audio_0 = audio[:,:n_half,:]
audio_1 = audio[:,n_half:,:]
else:
audio_1 = audio[:,:n_half,:]
audio_0 = audio[:,n_half:,:]
output = self.WN[k]((audio_0, spect))
s = output[:, n_half:, :]
b = output[:, :n_half, :]
audio_1 = (audio_1 - b)/torch.exp(s)
if k%2 == 0:
audio = torch.cat([audio[:,:n_half,:], audio_1],1)
else:
audio = torch.cat([audio_1, audio[:,n_half:,:]], 1)
audio = self.convinv[k](audio, reverse=True)
if k%4 == 0 and k > 0:
if spect.type() == 'torch.cuda.HalfTensor':
z = torch.cuda.HalfTensor(spect.size(0),
self.n_early_size,
spect.size(2)).normal_()
elif spect.type() == 'torch.cuda.FloatTensor':
z = torch.cuda.FloatTensor(spect.size(0),
self.n_early_size,
spect.size(2)).normal_()
else:
z = torch.FloatTensor(spect.size(0),
self.n_early_size,
spect.size(2)).normal_()
audio = torch.cat((sigma*z, audio),1)
return audio.permute(0,2,1).contiguous().view(audio.size(0), -1).data
@staticmethod
def remove_weightnorm(model):
waveglow = model
for WN in waveglow.WN:
WN.start = torch.nn.utils.remove_weight_norm(WN.start)
WN.in_layers = remove(WN.in_layers)
WN.cond_layers = remove(WN.cond_layers)
WN.res_skip_layers = remove(WN.res_skip_layers)
return waveglow