[FastPitch/PyT] Add FastPitch 1.1 (autoaligned)

2021-08-13 22:22:45 +00:00 · 2021-08-13 22:22:45 +00:00 · ba8f04bfe2
parent 85c54b5c36
commit ba8f04bfe2
75 changed files with 28945 additions and 2770 deletions
--- a/PyTorch/SpeechSynthesis/FastPitch/.gitignore
+++ b/PyTorch/SpeechSynthesis/FastPitch/.gitignore
@ -3,7 +3,6 @@ LJSpeech-1.1/
 output*
 scripts_joc/
 tests/
-workspace/
 pretrained_models/
 *.pyc
 __pycache__
--- a/PyTorch/SpeechSynthesis/FastPitch/Dockerfile
+++ b/PyTorch/SpeechSynthesis/FastPitch/Dockerfile
@ -1,4 +1,4 @@
-# Copyright (c) 2020 NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.02-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.05-py3
 FROM ${FROM_IMAGE_NAME}

 ENV PYTHONPATH /workspace/fastpitch
--- a/PyTorch/SpeechSynthesis/FastPitch/README.md
+++ b/PyTorch/SpeechSynthesis/FastPitch/README.md
@ -1,4 +1,4 @@
-# FastPitch 1.0 for PyTorch
+# FastPitch 1.1 for PyTorch

 This repository provides a script and recipe to train the FastPitch model to achieve state-of-the-art accuracy and is tested and maintained by NVIDIA.

@ -25,21 +25,20 @@ This repository provides a script and recipe to train the FastPitch model to ach
        * [Multi-dataset](#multi-dataset)
    * [Training process](#training-process)
    * [Inference process](#inference-process)
-
 - [Performance](#performance)
    * [Benchmarking](#benchmarking)
        * [Training performance benchmark](#training-performance-benchmark)
        * [Inference performance benchmark](#inference-performance-benchmark)
    * [Results](#results)
        * [Training accuracy results](#training-accuracy-results)
-            * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
+            * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
            * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
        * [Training performance results](#training-performance-results)
-            * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
+            * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
            * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
            * [Expected training time](#expected-training-time)
        * [Inference performance results](#inference-performance-results)
-            * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-40gb)
+            * [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-80gb)
            * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
            * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
 - [Release notes](#release-notes)
@ -48,8 +47,7 @@ This repository provides a script and recipe to train the FastPitch model to ach

 ## Model overview

-[FastPitch](https://arxiv.org/abs/2006.06873) is a fully-parallel transformer architecture with prosody control over pitch and individual phoneme duration.
-It is one of two major components in a neural, text-to-speech (TTS) system:
+[FastPitch](https://arxiv.org/abs/2006.06873) is one of two major components in a neural, text-to-speech (TTS) system:

 * a mel-spectrogram generator such as [FastPitch](https://arxiv.org/abs/2006.06873) or [Tacotron 2](https://arxiv.org/abs/1712.05884), and
 * a waveform synthesizer such as [WaveGlow](https://arxiv.org/abs/1811.00002) (see [NVIDIA example code](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)).
@ -57,23 +55,40 @@ It is one of two major components in a neural, text-to-speech (TTS) system:
 Such two-component TTS system is able to synthesize natural sounding speech from raw transcripts.

 The FastPitch model generates mel-spectrograms and predicts a pitch contour from raw input text.
+In version 1.1, it does not need any pre-trained aligning model to bootstrap from.
+It lets you exert additional control over the synthesized utterances, for example:
+* modify the pitch contour to control the prosody,
+* increase or decrease the fundamental frequency in a natural-sounding way that preserves the perceived identity of the speaker,
+* alter the rate of speech,
+* adjust the energy,
+* specify input as graphemes or phonemes,
+* switch speakers when the model has been trained with data from multiple speakers.
 Some of the capabilities of FastPitch are presented on the website with [samples](https://fastpitch.github.io/).

-Speech synthesized with FastPitch has state-of-the-art quality, and does not suffer from missing/repeating phrases like Tacotron2 does.
+Speech synthesized with FastPitch has state-of-the-art quality, and does not suffer from missing/repeating phrases like Tacotron 2 does.
 This is reflected in Mean Opinion Scores ([details](https://arxiv.org/abs/2006.06873)).

-| Model     | Mean Opinion Score (MOS) |
-|:----------|:-------------------------|
-| Tacotron2 | 3.946 ± 0.134            |
-| FastPitch | 4.080 ± 0.133            |
+| Model          | Mean Opinion Score (MOS) |
+|:---------------|:-------------------------|
+| Tacotron 2     | 3.946 ± 0.134            |
+| FastPitch 1.0  | 4.080 ± 0.133            |
+
+The current version of the model offers even higher quality, as reflected
+in the pairwise preference scores.
+
+| Model          | Average preference |
+|:---------------|:-------------------|
+| FastPitch 1.0  | 0.435 ± 0.068      |
+| FastPitch 1.1  | 0.565 ± 0.068      |

 The FastPitch model is based on the [FastSpeech](https://arxiv.org/abs/1905.09263) model. The main differences between FastPitch and FastSpeech are that FastPitch:
+* does not depend on an external aligner (Transformer TTS, Tacotron 2); in version 1.1, FastPitch aligns audio to transcriptions by itself,
 * explicitly learns to predict the pitch contour,
 * pitch conditioning removes harsh sounding artifacts and provides faster convergence,
 * no need for distilling mel-spectrograms with a teacher model,
-* [character durations](#glossary) are extracted with a pre-trained Tacotron 2 model.
+* can be trained as a multi-speaker model.

-The FastPitch model is similar to [FastSpeech2](https://arxiv.org/abs/2006.04558), which has been developed concurrently. FastPitch averages pitch values over input tokens, and does not use additional conditioning such as the energy.
+The FastPitch model is similar to [FastSpeech2](https://arxiv.org/abs/2006.04558), which has been developed concurrently. FastPitch averages pitch/energy values over input tokens, and treats energy as optional.

 FastPitch is trained on a publicly
 available [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/).
@ -108,9 +123,8 @@ The following features were implemented in this model:
 training,
 * gradient accumulation for reproducible results regardless of the number of GPUs.

-To speed-up FastPitch training,
-reference mel-spectrograms, character durations, and pitch cues
-are generated during the pre-processing step and read
+Pitch contours and mel-spectrograms can be generated on-line during training.
+To speed up training, they can instead be generated during the pre-processing step and read
 directly from the disk during training. For more information on data pre-processing refer to [Dataset guidelines
 ](#dataset-guidelines) and the [paper](https://arxiv.org/abs/2006.06873).

@ -118,22 +132,20 @@ directly from the disk during training. For more information on data pre-process

 The following features are supported by this model.

-| Feature                                                            | FastPitch   |
-| :------------------------------------------------------------------|------------:|
-|[AMP](https://nvidia.github.io/apex/amp.html)                               | Yes |
-|[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html) | Yes |
+| Feature                         | FastPitch |
+| :-------------------------------|----------:|
+| Automatic mixed precision (AMP) | Yes       |
+| Distributed data parallel (DDP) | Yes       |

 #### Features

-AMP - a tool that enables Tensor Core-accelerated training. For more information,
-refer to [Enabling mixed precision](#enabling-mixed-precision).
+Automatic Mixed Precision (AMP) - This implementation uses the native PyTorch AMP
+implementation of mixed precision training. It enables FP16 training
+with FP32 master weights by modifying just a few lines of code.

-Apex DistributedDataParallel - a module wrapper that enables easy multiprocess
-distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`.
-`DistributedDataParallel` is optimized for use with NCCL. It achieves high
-performance by overlapping communication with computation during `backward()`
-and bucketing smaller gradient transfers to reduce the total number of transfers
-required.
+DistributedDataParallel (DDP) - The model uses PyTorch Lightning implementation
+of distributed data parallelism at the module level which can run across
+multiple machines.

 ### Mixed precision training

@ -150,48 +162,9 @@ For information about:

 #### Enabling mixed precision

-Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
-(AMP)  library from [APEX](https://github.com/NVIDIA/apex) that casts variables
-to half-precision upon retrieval, while storing variables in single-precision
-format. Furthermore, to preserve small gradient magnitudes in backpropagation,
-a [loss scaling](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#lossscaling)
-step must be included when applying gradients. In PyTorch, loss scaling can be
-easily applied by using the `scale_loss()` method provided by AMP. The scaling value
-to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
+For training and inference, mixed precision can be enabled by adding the `--amp` flag.
+Mixed precision uses the [native PyTorch implementation](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/).

-By default, the `scripts/train.sh` script will run in full precision.To launch mixed precision training with Tensor Cores, either set env variable `AMP=true`
-when using `scripts/train.sh`, or add `--amp` flag when directly executing `train.py` without the helper script.
-
-To enable mixed precision, the following steps were performed:
-* Import AMP from APEX:
-    ```bash
-    from apex import amp
-    ```
-
-* Initialize AMP:
-    ```bash
-    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
-    ```
-
-* If running on multi-GPU, wrap the model with `DistributedDataParallel`:
-    ```bash
-    from apex.parallel import DistributedDataParallel as DDP
-    model = DDP(model)
-    ```
-
-* Scale loss before backpropagation (assuming loss is stored in a variable
-called `losses`):
-
-    * Default backpropagate for FP32:
-        ```bash
-        losses.backward()
-        ```
-
-    * Scale loss and backpropagate with AMP:
-        ```bash
-        with optimizer.scale_loss(losses) as scaled_losses:
-            scaled_losses.backward()
-        ```
 #### Enabling TF32

 TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
@ -207,9 +180,6 @@ TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by defaul
 **Character duration**
 The time during which a character is being articulated. Could be measured in milliseconds, mel-spectrogram frames, etc. Some characters are not pronounced, and thus have 0 duration.

-**Forced alignment**
-Segmentation of a recording into lexical units like characters, words, or phonemes. The segmentation is hard and defines exact starting and ending times for every unit.
-
 **Fundamental frequency**
 The lowest vibration frequency of a periodic soundwave, for example, produced by a vibrating instrument. It is perceived as the loudest. In the context of speech, it refers to the frequency of vibration of vocal chords.  Abbreviated as *f0*.

@ -227,7 +197,7 @@ The following section lists the requirements that you need to meet in order to s

 This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 -   [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-   [PyTorch 20.09-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+-   [PyTorch 21.05-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
 or newer
 - supported GPUs:
    - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
@ -273,11 +243,9 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3
   location in the NGC container. The complete dataset has the following structure:
   ```bash
   ./LJSpeech-1.1
-   ├── durations        # Character durations estimates for forced alignment training
-   ├── mels             # Pre-calculated target mel-spectrograms
+   ├── mels             # (optional) Pre-calculated target mel-spectrograms; may be calculated on-line
   ├── metadata.csv     # Mapping of waveforms to utterances
-   ├── pitch_char       # Average per-character fundamental frequencies for input utterances
-   ├── pitch_char_stats__ljs_audio_text_train_filelist.json    # Mean and std of pitch for training data
+   ├── pitch            # Fundamental frequency contours for input utterances; may be calculated on-line
   ├── README
   └── wavs             # Raw waveforms
   ```
@ -306,12 +274,13 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3
   You can perform inference using the respective `.pt` checkpoints that are passed as `--fastpitch`
   and `--waveglow` arguments:
   ```bash
-   python inference.py --cuda \
-                       --fastpitch output/<FastPitch checkpoint> \
-                       --waveglow pretrained_models/waveglow/<WaveGlow checkpoint> \
-                       --wn-channels 256 \
-                       -i phrases/devset10.tsv \
-                       -o output/wavs_devset10
+   python inference.py \
+       --cuda \
+       --fastpitch output/<FastPitch checkpoint> \
+       --waveglow pretrained_models/waveglow/<WaveGlow checkpoint> \
+       --wn-channels 256 \
+       -i phrases/devset10.tsv \
+       -o output/wavs_devset10
   ```

   The speech is generated from a file passed with the `-i` argument, with one utterance per line:
@ -338,15 +307,9 @@ given model
 * `<model_name>/data_function.py` - data loading functions
 * `<model_name>/loss_function.py` - loss function for the model

-The common scripts contain layer definitions common to both models
-(`common/layers.py`), some utility scripts (`common/utils.py`) and scripts
-for audio processing (`common/audio_processing.py` and `common/stft.py`).
-
 In the root directory `./` of this repository, the `./train.py` script is used for
 training while inference can be executed with the `./inference.py` script. The
-scripts `./models.py`, `./data_functions.py` and `./loss_functions.py` call
-the respective scripts in the `<model_name>` directory, depending on what
-the model is trained using the `train.py` script.
+script `./models.py` is used to construct a model of the requested type and properties.

 The repository is structured similarly to the [NVIDIA Tacotron2 Deep Learning example](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2), so that they could be combined in more advanced use cases.

@ -357,13 +320,12 @@ together with their default values that are used to train FastPitch.

 * `--epochs` - number of epochs (default: 1500)
 * `--learning-rate` - learning rate (default: 0.1)
-* `--batch-size` - batch size (default: 32)
+* `--batch-size` - batch size for a single forward-backward step (default: 16)
+* `--grad-accumulation` - number of steps over which gradients are accumulated (default: 2)
 * `--amp` - use mixed precision training (default: disabled)
-
-* `--pitch-predictor-loss-scale` - rescale the loss of the pitch predictor module to dampen
-its influence on the shared feedforward transformer blocks
-* `--duration-predictor-loss-scale` - rescale the loss of the duration predictor module to dampen
-its influence on the shared feedforward transformer blocks
+* `--load-pitch-from-disk` - pre-calculated fundamental frequency values, estimated before training, are loaded from the disk during training (default: enabled)
+* `--energy-conditioning` - enables additional conditioning on energy (default: enabled)
+* `--p-arpabet` - probability of choosing phonemic over graphemic representation for every word, if available (default: 1.0)

 ### Command-line options

@ -376,9 +338,9 @@ python train.py --help
 The following example output is printed when running the model:

 ```bash
-DLL 2020-03-30 10:41:12.562594 - epoch    1 | iter   1/19 | loss 36.99 | mel loss 35.25 |  142370.52 items/s | elapsed 2.50 s | lrate 1.00E-01 -> 3.16E-06
-DLL 2020-03-30 10:41:13.202835 - epoch    1 | iter   2/19 | loss 37.26 | mel loss 35.98 |  561459.27 items/s | elapsed 0.64 s | lrate 3.16E-06 -> 6.32E-06
-DLL 2020-03-30 10:41:13.831189 - epoch    1 | iter   3/19 | loss 36.93 | mel loss 35.41 |  583530.16 items/s | elapsed 0.63 s | lrate 6.32E-06 -> 9.49E-06
+DLL 2021-06-14 23:08:53.659718 - epoch    1 | iter   1/48 | loss 40.97 | mel loss 35.04 | kl loss 0.02240 | kl weight 0.01000 |    5730.98 frames/s | took 24.54 s | lrate 3.16e-06
+DLL 2021-06-14 23:09:28.449961 - epoch    1 | iter   2/48 | loss 41.07 | mel loss 35.12 | kl loss 0.02258 | kl weight 0.01000 |    4154.18 frames/s | took 34.79 s | lrate 6.32e-06
+DLL 2021-06-14 23:09:59.365398 - epoch    1 | iter   3/48 | loss 40.86 | mel loss 34.93 | kl loss 0.02252 | kl weight 0.01000 |    4589.15 frames/s | took 30.91 s | lrate 9.49e-06
 ```

 ### Getting the data
@ -391,25 +353,23 @@ The `./scripts/download_dataset.sh` script will automatically download and extra
 The LJSpeech dataset has 13,100 clips that amount to about 24 hours of speech of a single, female speaker. Since the original dataset does not define a train/dev/test split of the data, we provide a split in the form of three file lists:
 ```bash
 ./filelists
-├── ljs_mel_dur_pitch_text_test_filelist.txt
-├── ljs_mel_dur_pitch_text_train_filelist.txt
-└── ljs_mel_dur_pitch_text_val_filelist.txt
+├── ljs_audio_pitch_text_train_v3.txt
+├── ljs_audio_pitch_text_test.txt
+└── ljs_audio_pitch_text_val.txt
 ```

-***NOTE: When combining FastPitch/WaveGlow with external models trained on LJSpeech-1.1, make sure that your train/dev/test split matches. Different organizations may use custom splits. A mismatch poses a risk of leaking the training data through model weights during validation and testing.***
-
 FastPitch predicts character durations just like [FastSpeech](https://arxiv.org/abs/1905.09263) does.
-This calls for training with forced alignments, expressed as the number of output mel-spectrogram frames for every input character.
-To this end, a pre-trained
-[Tacotron 2 model](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)
-is used. Its attention matrix
-relates the input characters with the output mel-spectrogram frames.
+FastPitch 1.1 aligns input symbols to output mel-spectrogram frames automatically and does not rely
+on any external aligning model. FastPitch training can now be started on raw waveforms
+without any pre-processing: pitch values and mel-spectrograms will be calculated on-line.

-For every mel-spectrogram frame, its fundamental frequency in Hz is estimated with [Praat](http://praat.org).
-Those values are then averaged over every character, in order to provide sparse
-pitch cues for the model. Character boundaries are calculated from durations
-extracted previously with Tacotron 2.
+For every mel-spectrogram frame, its fundamental frequency in Hz is estimated with either
+the Probabilistic YIN algorithm or [Praat](http://praat.org).

+The former is more accurate but time-consuming, so we recommend pre-calculating
+pitch during the data pre-processing step. The latter is suitable for on-line pitch calculation.
+Pitch values are then averaged over every character, in order to provide sparse
+pitch cues for the model.
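
The averaging step described above can be illustrated with a minimal sketch. It assumes that per-character durations (in mel-spectrogram frames) are already known and that unvoiced frames are marked with 0 Hz and left out of the mean. The function name `average_pitch_per_char` and the unvoiced-frame convention are assumptions made for illustration, not code taken from the repository.

```python
def average_pitch_per_char(frame_pitch, durations):
    """Average frame-level f0 (Hz) over each character's span of frames.

    frame_pitch: list of per-frame f0 values; 0.0 marks an unvoiced frame
                 (an assumed convention for this sketch).
    durations:   list of per-character durations, in frames.
    """
    if sum(durations) != len(frame_pitch):
        raise ValueError("durations must sum to the number of frames")

    averaged = []
    start = 0
    for dur in durations:
        # Keep only voiced frames that belong to the current character.
        voiced = [f0 for f0 in frame_pitch[start:start + dur] if f0 > 0.0]
        # Characters with no voiced frames (or zero duration) get a 0.0 cue.
        averaged.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += dur
    return averaged


frame_pitch = [0.0, 200.0, 210.0, 220.0, 0.0, 0.0, 180.0]
durations = [1, 3, 0, 2, 1]
print(average_pitch_per_char(frame_pitch, durations))
# prints [0.0, 210.0, 0.0, 0.0, 180.0]
```

The result has one value per input character, which is what makes the pitch cues sparse compared with the frame-level contour.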

 <p align="center">
  <img src="./img/pitch.png" alt="Pitch estimates extracted with Praat" />
@ -432,40 +392,38 @@ Follow these steps to use datasets different from the default LJSpeech dataset.
 2. Prepare filelists with transcripts and paths to .wav files. They define training/validation split of the data (test is currently unused):
   ```bash
   ./filelists
-   ├── my_dataset_mel_ali_pitch_text_train_filelist.txt
-   └── my_dataset_mel_ali_pitch_text_val_filelist.txt
+   ├── my-dataset_audio_text_train.txt
+   └── my-dataset_audio_text_val.txt
   ```
-
-Those filelists should list a single utterance per line as:
+   Those filelists should list a single utterance per line as:
   ```bash
   `<audio file path>|<transcript>`
   ```
-The `<audio file path>` is the relative path to the path provided by the `--dataset-path` option of `train.py`.
+   The `<audio file path>` is the relative path to the path provided by the `--dataset-path` option of `train.py`.

-3. Run the pre-processing script to calculate mel-spectrograms, durations and pitch:
+3. Run the pre-processing script to calculate pitch:
   ```bash
-   python extract_mels.py --cuda \
-                          --dataset-path ./my_dataset \
-                          --wav-text-filelist ./filelists/my_dataset_mel_ali_pitch_text_train_filelist.txt \
-                          --extract-mels \
-                          --extract-durations \
-                          --extract-pitch-char \
-                          --tacotron2-checkpoint ./pretrained_models/tacotron2/state_dict.pt"
-
-   python extract_mels.py --cuda \
-                          --dataset-path ./my_dataset \
-                          --wav-text-filelist ./filelists/my_dataset_mel_ali_pitch_text_val_filelist.txt \
-                          --extract-mels \
-                          --extract-durations \
-                          --extract-pitch-char \
-                          --tacotron2-checkpoint ./pretrained_models/tacotron2/state_dict.pt"
+    python prepare_dataset.py \
+        --wav-text-filelists filelists/my-dataset_audio_text_train.txt \
+                             filelists/my-dataset_audio_text_val.txt \
+        --n-workers 16 \
+        --batch-size 1 \
+        --dataset-path $DATA_DIR \
+        --extract-pitch \
+        --f0-method pyin
+   ```
+4. Prepare file lists with paths to pre-calculated pitch:
+   ```bash
+   ./filelists
+   ├── my-dataset_audio_pitch_text_train.txt
+   └── my-dataset_audio_pitch_text_val.txt
   ```

 In order to use the prepared dataset, pass the following to the `train.py` script:
   ```bash
    --dataset-path ./my_dataset \
-   --training-files ./filelists/my_dataset_mel_ali_pitch_text_train_filelist.txt \
-   --validation files ./filelists/my_dataset_mel_ali_pitch_text_val_filelist.txt
+   --training-files ./filelists/my-dataset_audio_pitch_text_train.txt \
+   --validation-files ./filelists/my-dataset_audio_pitch_text_val.txt
   ```

 ### Training process
@ -480,18 +438,15 @@ The result is averaged over an entire training epoch and summed over all GPUs th
 included in the training.

 The `scripts/train.sh` script is configured for 8x GPU with at least 16GB of memory:
-    ```bash
-    --batch-size 32
-    --gradient-accumulation-steps 1
-    ```
-In a single accumulated step, there are `batch_size x gradient_accumulation_steps x GPUs = 256` examples being processed in parallel. With a smaller number of GPUs, increase `--gradient_accumulation_steps` to keep this relation satisfied, e.g., through env variables
-    ```bash
-    NUM_GPUS=4 GRAD_ACCUMULATION=2 bash scripts/train.sh
-    ```
-With automatic mixed precision (AMP), a larger batch size fits in 16GB of memory:
-    ```bash
-    NUM_GPUS=4 GRAD_ACCUMULATION=1 BS=64 AMP=true bash scripts/train.sh
-    ```
+```bash
+--batch-size 16
+--grad-accumulation 2
+```
+
+In a single accumulated step, there are `batch_size x grad_accumulation x GPUs = 256` examples being processed in parallel. With a smaller number of GPUs, increase `--grad-accumulation` to keep this relation satisfied, e.g., through env variables:
+```bash
+NUM_GPUS=1 GRAD_ACCUMULATION=16 bash scripts/train.sh
+```
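
The relation between batch size, gradient accumulation, and GPU count can be checked with a short calculation. The sketch below is only an illustration of that arithmetic; the helper name `grad_accumulation_steps` is hypothetical and is not part of `train.py` or `scripts/train.sh`.

```python
def grad_accumulation_steps(num_gpus, batch_size_per_gpu, global_batch_size=256):
    """Return the accumulation steps needed to reach the global batch size.

    Solves batch_size_per_gpu * steps * num_gpus == global_batch_size.
    """
    per_step = num_gpus * batch_size_per_gpu
    if per_step <= 0 or global_batch_size % per_step != 0:
        raise ValueError(
            f"{global_batch_size} is not a multiple of "
            f"{num_gpus} GPUs x batch size {batch_size_per_gpu}"
        )
    return global_batch_size // per_step


for num_gpus in (8, 4, 1):
    print(num_gpus, grad_accumulation_steps(num_gpus, batch_size_per_gpu=16))
# prints:
# 8 2
# 4 4
# 1 16
```

The 8-GPU and 1-GPU rows match the `--grad-accumulation 2` default and the `GRAD_ACCUMULATION=16` example shown above.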

 ### Inference process

@ -510,7 +465,7 @@ Examine the `inference_example.sh` script to adjust paths to pre-trained models,
 and call `python inference.py --help` to learn all available options.
 By default, synthesized audio samples are saved in `./output/audio_*` folders.

-FastPitch allows us to linearly adjust the pace of synthesized speech like [FastSpeech](https://arxiv.org/abs/1905.09263).
+FastPitch allows us to linearly adjust the rate of synthesized speech like [FastSpeech](https://arxiv.org/abs/1905.09263).
 For instance, pass `--pace 0.5` for a twofold decrease in speed.
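
As a rough illustration of what a linear pace factor means, the sketch below divides per-character durations by the pace and rounds to whole frames. This is an assumption about how such scaling could be done, and `apply_pace` is a hypothetical helper, not a function from `inference.py`.

```python
def apply_pace(durations, pace):
    """Scale per-character durations (in frames) by a pace factor.

    pace < 1.0 stretches durations (slower speech);
    pace > 1.0 shrinks them (faster speech).
    """
    if pace <= 0:
        raise ValueError("pace must be positive")
    # Divide by pace and round to a whole number of frames.
    return [int(round(d / pace)) for d in durations]


durations = [2, 3, 5]
print(apply_pace(durations, 1.0))  # prints [2, 3, 5]
print(apply_pace(durations, 0.5))  # prints [4, 6, 10]
```

With `pace=0.5` every character lasts twice as many frames, so the utterance is twice as long, which corresponds to the twofold decrease in speed mentioned above.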

 For every input character, the model predicts a pitch cue - an average pitch over a character in Hz.
@ -523,7 +478,7 @@ Pitch can be adjusted by transforming those pitch cues. A few simple examples ar
 | Invert pitch wrt. to the mean pitch         |`--pitch-transform-invert`     | [link](./audio/sample_fp16_invert.wav)  |
 | Raise/lower pitch by <hz>                   |`--pitch-transform-shift <hz>` | [link](./audio/sample_fp16_shift.wav)   |
 | Flatten the pitch to a constant value       |`--pitch-transform-flatten`    | [link](./audio/sample_fp16_flatten.wav) |
-| Change the pace of speech (1.0 = unchanged) |`--pace <value>`               | [link](./audio/sample_fp16_pace.wav)    |
+| Change the rate of speech (1.0 = unchanged) |`--pace <value>`               | [link](./audio/sample_fp16_pace.wav)    |

 The flags can be combined. Modify these functions directly in the `inference.py` script to gain more control over the final result.
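
To make the transformations in the table concrete, here is a minimal sketch that applies them to a list of per-character pitch cues expressed in Hz. The function names are hypothetical, and the choice to flatten to the mean and to invert around the mean is an assumption for illustration; the functions in `inference.py` may operate on a different (for example, normalized) representation.

```python
def mean_pitch(cues):
    """Mean of the pitch cues in Hz."""
    return sum(cues) / len(cues)


def shift(cues, hz):
    """Raise (hz > 0) or lower (hz < 0) every cue by a fixed offset."""
    return [p + hz for p in cues]


def flatten(cues):
    """Replace every cue with the mean, giving a monotone contour."""
    m = mean_pitch(cues)
    return [m for _ in cues]


def invert(cues):
    """Mirror every cue around the mean pitch."""
    m = mean_pitch(cues)
    return [2 * m - p for p in cues]


cues = [100.0, 200.0, 300.0]
print(shift(cues, 50.0))  # prints [150.0, 250.0, 350.0]
print(flatten(cues))      # prints [200.0, 200.0, 200.0]
print(invert(cues))       # prints [300.0, 200.0, 100.0]
```

Because each function maps a list of cues to a new list, they can be chained in the same way the command-line flags can be combined.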

@ -532,8 +487,6 @@ More examples are presented on the website with [samples](https://fastpitch.gith

 ## Performance

-The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
-
 ### Benchmarking

 The following section shows how to run benchmarks measuring the model
@ -543,20 +496,20 @@ performance in training and inference mode.

 To benchmark the training performance on a specific batch size, run:

-* NVIDIA DGX A100 (8x A100 40GB)
+* NVIDIA DGX A100 (8x A100 80GB)
    ```bash
-        AMP=true NUM_GPUS=1 BS=128 GRAD_ACCUMULATION=2 EPOCHS=10 bash scripts/train.sh
+        AMP=true NUM_GPUS=1 BS=32 GRAD_ACCUMULATION=8 EPOCHS=10 bash scripts/train.sh
        AMP=true NUM_GPUS=8 BS=32 GRAD_ACCUMULATION=1 EPOCHS=10 bash scripts/train.sh
-        NUM_GPUS=1 BS=128 GRAD_ACCUMULATION=2 EPOCHS=10 bash scripts/train.sh
-        NUM_GPUS=8 BS=32 GRAD_ACCUMULATION=1 EPOCHS=10 bash scripts/train.sh
+        AMP=false NUM_GPUS=1 BS=32 GRAD_ACCUMULATION=8 EPOCHS=10 bash scripts/train.sh
+        AMP=false NUM_GPUS=8 BS=32 GRAD_ACCUMULATION=1 EPOCHS=10 bash scripts/train.sh
    ```

 * NVIDIA DGX-1 (8x V100 16GB)
    ```bash
-        AMP=true NUM_GPUS=1 BS=64 GRAD_ACCUMULATION=4 EPOCHS=10 bash scripts/train.sh
-        AMP=true NUM_GPUS=8 BS=32 GRAD_ACCUMULATION=1 EPOCHS=10 bash scripts/train.sh
-        NUM_GPUS=1 BS=32 GRAD_ACCUMULATION=8 EPOCHS=10 bash scripts/train.sh
-        NUM_GPUS=8 BS=32 GRAD_ACCUMULATION=1 EPOCHS=10 bash scripts/train.sh
+        AMP=true NUM_GPUS=1 BS=16 GRAD_ACCUMULATION=16 EPOCHS=10 bash scripts/train.sh
+        AMP=true NUM_GPUS=8 BS=16 GRAD_ACCUMULATION=2 EPOCHS=10 bash scripts/train.sh
+        AMP=false NUM_GPUS=1 BS=16 GRAD_ACCUMULATION=16 EPOCHS=10 bash scripts/train.sh
+        AMP=false NUM_GPUS=8 BS=16 GRAD_ACCUMULATION=2 EPOCHS=10 bash scripts/train.sh
    ```

 Each of these scripts runs for 10 epochs and for each epoch measures the
@ -574,7 +527,7 @@ To benchmark the inference performance on a specific batch size, run:

 * For FP32 or TF32
    ```bash
-    BS_SEQUENCE=”1 4 8” REPEATS=100 bash scripts/inference_benchmark.sh
+    AMP=false BS_SEQUENCE="1 4 8" REPEATS=100 bash scripts/inference_benchmark.sh
    ```

 The output log files will contain performance numbers for the FastPitch model
@ -591,70 +544,68 @@ and accuracy in training and inference.

 #### Training accuracy results

-##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
+##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

-Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.

 | Loss (Model/Epoch)   |    50 |   250 |   500 |   750 |  1000 |  1250 |  1500 |
 |:---------------------|------:|------:|------:|------:|------:|------:|------:|
-| FastPitch AMP        | 0.503 | 0.252 | 0.214 | 0.202 | 0.193 | 0.188 | 0.184 |
-| FastPitch TF32       | 0.500 | 0.252 | 0.215 | 0.201 | 0.193 | 0.187 | 0.183 |
+| FastPitch AMP        |  3.35 |  2.89 |  2.79 |  2.71 |  2.68 |  2.64 |  2.61 |
+| FastPitch TF32       |  3.37 |  2.88 |  2.78 |  2.71 |  2.68 |  2.63 |  2.61 |

 ##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)

-Our results were obtained by running the `./platform/DGX1_FastPitch_{AMP,FP32}_8GPU.sh` training script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.
+Our results were obtained by running the `./platform/DGX1_FastPitch_{AMP,FP32}_8GPU.sh` training script in the PyTorch 21.05-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.

 All of the results were produced using the `train.py` script as described in the
 [Training process](#training-process) section of this document.

 | Loss (Model/Epoch)   |    50 |   250 |   500 |   750 |  1000 |  1250 |  1500 |
 |:---------------------|------:|------:|------:|------:|------:|------:|------:|
-| FastPitch AMP        | 0.499 | 0.250 | 0.211 | 0.198 | 0.190 | 0.184 | 0.180 |
-| FastPitch FP32       | 0.503 | 0.251 | 0.214 | 0.201 | 0.192 | 0.186 | 0.182 |
+| FastPitch AMP        |  3.38 |  2.88 |  2.79 |  2.71 |  2.68 |  2.64 |  2.61 |
+| FastPitch FP32       |  3.38 |  2.89 |  2.80 |  2.71 |  2.68 |  2.65 |  2.62 |


 <div style="text-align:center" align="center">
  <img src="./img/loss.png" alt="Loss curves" />
 </div>

-
-
-
 #### Training performance results

-##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
+##### Training performance: NVIDIA DGX A100 (8x A100 80GB)

-Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
 an entire training epoch.

-|Number of GPUs|Batch size per GPU|Frames/s with mixed precision|Frames/s with TF32|Speed-up with mixed precision|Multi-GPU strong scaling with mixed precision|Multi-GPU strong scaling with TF32|
-|---:|------------------:|--------:|-------:|-----:|-----:|-----:|
-|  1 | 128@AMP, 128@TF32 |  164955 | 113725 | 1.45 | 1.00 | 1.00 |
-|  4 |  64@AMP,  64@TF32 |  619527 | 435951 | 1.42 | 3.76 | 3.83 |
-|  8 |  32@AMP,  32@TF32 | 1040206 | 643569 | 1.62 | 6.31 | 5.66 |
+| Batch size / GPU | Grad accumulation | GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Strong scaling - TF32 | Strong scaling - mixed precision |
+|---:|--:|--:|--------:|--------:|-----:|-----:|-----:|
+| 32 | 8 | 1 |  97,735 | 101,730 | 1.04 | 1.00 | 1.00 |
+| 32 | 2 | 4 | 337,163 | 352,300 | 1.04 | 3.45 | 3.46 |
+| 32 | 1 | 8 | 599,221 | 623,498 | 1.04 | 6.13 | 6.13 |
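The derived columns of the DGX A100 table above can be recomputed from the raw throughput figures. The following is a minimal sketch, not part of the repository; the names `rows` and `derived_columns` are illustrative. It also shows that batch size per GPU × gradient accumulation steps × number of GPUs gives the same global batch of 256 in every row.

```python
# Rows copied from the DGX A100 table above:
# (batch size / GPU, grad accumulation, GPUs, TF32 frames/s, mixed-precision frames/s)
rows = [
    (32, 8, 1, 97_735, 101_730),
    (32, 2, 4, 337_163, 352_300),
    (32, 1, 8, 599_221, 623_498),
]


def derived_columns(rows):
    """Recompute speedup and scaling relative to the single-GPU row."""
    base_tf32, base_amp = rows[0][3], rows[0][4]
    out = []
    for batch, accum, gpus, tf32, amp in rows:
        out.append({
            "global_batch": batch * accum * gpus,
            "amp_speedup": round(amp / tf32, 2),
            "scaling_tf32": round(tf32 / base_tf32, 2),
            "scaling_amp": round(amp / base_amp, 2),
        })
    return out


for row in derived_columns(rows):
    print(row)
```

The same arithmetic applies to the DGX-1 table, with the FP32 column in place of TF32.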

 ###### Expected training time

 The following table shows the expected training time for convergence for 1500 epochs:

-|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with TF32 (Hrs)|Speed-up with mixed precision|
-|---:|-----------------:|-----:|-----:|-----:|
-|  1 |128@AMP, 128@TF32 | 18.5 | 26.6 | 1.44 |
-|  4 | 64@AMP,  64@TF32 |  5.5 |  7.5 | 1.36 |
-|  8 | 32@AMP,  32@TF32 |  3.6 |  5.3 | 1.47 |
+| Batch size / GPU | GPUs | Grad accumulation | Time to train with TF32 (Hrs) | Time to train with mixed precision (Hrs) | Speed-up with mixed precision|
+|---:|--:|--:|-----:|-----:|-----:|
+| 32 | 1 | 8 | 32.8 | 31.6 | 1.04 |
+| 32 | 4 | 2 |  9.6 |  9.2 | 1.04 |
+| 32 | 8 | 1 |  5.5 |  5.3 | 1.04 |

 ##### Training performance: NVIDIA DGX-1 (8x V100 16GB)

 Our results were obtained by running the `./platform/DGX1_FastPitch_{AMP,FP32}_8GPU.sh`
-training script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX-1 with
+training script in the PyTorch 21.05-py3 NGC container on NVIDIA DGX-1 with
 8x V100 16GB GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
 an entire training epoch.

-|Number of GPUs|Batch size per GPU|Frames/s with mixed precision|Frames/s with FP32|Speed-up with mixed precision|Multi-GPU strong scaling with mixed precision|Multi-GPU strong scaling with FP32|
-|---:|----------------:|-------:|-------:|-----:|-----:|-----:|
-|  1 | 64@AMP, 32@FP32 | 110370 |  41066 | 2.69 | 1.00 | 1.00 |
-|  4 | 64@AMP, 32@FP32 | 402368 | 153853 | 2.62 | 3.65 | 3.75 |
-|  8 | 32@AMP, 32@FP32 | 570968 | 296767 | 1.92 | 5.17 | 7.23 |
+| Batch size / GPU | GPUs | Grad accumulation | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
+|---:|--:|---:|--------:|--------:|-----:|-----:|-----:|
+| 16 | 1 | 16 |  33,456 |  63,986 | 1.91 | 1.00 | 1.00 |
+| 16 | 4 |  4 | 120,393 | 209,335 | 1.74 | 3.60 | 3.27 |
+| 16 | 8 |  2 | 222,161 | 356,522 | 1.60 | 6.64 | 5.57 |
+

 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

@ -662,13 +613,13 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic

 The following table shows the expected training time for convergence for 1500 epochs:

-|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
-|---:|-----------------:|-----:|-----:|-----:|
-|  1 | 64@AMP,  32@FP32 | 27.6 | 72.7 | 2.63 |
-|  4 | 64@AMP,  32@FP32 |  8.2 | 20.3 | 2.48 |
-|  8 | 32@AMP,  32@FP32 |  5.9 | 10.9 | 1.85 |
+| Batch size / GPU | GPUs | Grad accumulation | Time to train with FP32 (Hrs) | Time to train with mixed precision (Hrs) | Speed-up with mixed precision|
+|---:|--:|---:|-----:|-----:|-----:|
+| 16 | 1 | 16 | 89.3 | 47.4 | 1.91 |
+| 16 | 4 |  4 | 24.9 | 14.6 | 1.74 |
+| 16 | 8 |  2 | 13.6 |  8.6 | 1.60 |

-Note that most of the quality is achieved after the initial 500 epochs.
+Note that most of the quality is achieved after the initial 1000 epochs.

 #### Inference performance results

@ -679,48 +630,48 @@ as the number of generated audio samples per second at 22KHz. RTF is the real-ti
 The used WaveGlow model is a 256-channel model.

 Note that performance numbers are related to the length of input. The numbers reported below were taken with a moderate length of 128 characters. Longer utterances yield higher RTF, as the generator is fully parallel.
-##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
+##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)

-Our results were obtained by running the `./scripts/inference_benchmark.sh` inferencing benchmarking script in the 20.06-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.
+Our results were obtained by running the `./scripts/inference_benchmark.sh` inference benchmarking script in the 21.05-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.

 |Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|------:|------------:|--------------:|--------------:|--------------:|--------------:|----------------:|---------------:|----------:|
-|    1 | FP16   |     0.106 |   0.106 |   0.106 |   0.107 |      1,636,913 |      1.60 | 74.24 |
-|    4 | FP16   |     0.390 |   0.391 |   0.391 |   0.391 |      1,780,764 |      1.55 | 20.19 |
-|    8 | FP16   |     0.758 |   0.758 |   0.758 |   0.758 |      1,832,544 |      1.52 | 10.39 |
-|    1 | TF32   |     0.170 |   0.170 |   0.170 |   0.170 |      1,020,894 |         - | 46.30 |
-|    4 | TF32   |     0.603 |   0.603 |   0.603 |   0.603 |      1,150,598 |         - | 13.05 |
-|    8 | TF32   |     1.153 |   1.154 |   1.154 |   1.154 |      1,202,463 |         - |  6.82 |
+|-----:|-------:|----------:|--------:|--------:|--------:|---------------:|----------:|------:|
+|    1 | FP16   |     0.091 |   0.092 |   0.092 |   0.092 |      1,879,189 | 1.28      | 85.22 |
+|    4 | FP16   |     0.335 |   0.337 |   0.337 |   0.338 |      2,043,641 | 1.21      | 23.17 |
+|    8 | FP16   |     0.652 |   0.654 |   0.654 |   0.655 |      2,103,765 | 1.21      | 11.93 |
+|    1 | TF32   |     0.117 |   0.117 |   0.118 |   0.118 |      1,473,838 | -         | 66.84 |
+|    4 | TF32   |     0.406 |   0.408 |   0.408 |   0.409 |      1,688,141 | -         | 19.14 |
+|    8 | TF32   |     0.792 |   0.794 |   0.794 |   0.795 |      1,735,463 | -         |  9.84 |
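The "Avg RTF" column appears consistent with dividing the throughput by the sampling rate and the batch size. The sketch below checks this against the rows above. It assumes the "22KHz" rate quoted in the README means 22,050 Hz (the default of `--sampling-rate` elsewhere in this diff); the function name `real_time_factor` is illustrative.

```python
SAMPLING_RATE_HZ = 22_050  # assumption: "22KHz" in the README means 22,050 Hz


def real_time_factor(throughput_samples_per_s, batch_size,
                     sampling_rate=SAMPLING_RATE_HZ):
    """Seconds of audio produced per second of wall-clock time, per utterance."""
    return throughput_samples_per_s / (sampling_rate * batch_size)


# (batch size, throughput in samples/s) copied from the FP16 rows above
fp16_rows = [(1, 1_879_189), (4, 2_043_641), (8, 2_103_765)]
for batch_size, throughput in fp16_rows:
    print(batch_size, round(real_time_factor(throughput, batch_size), 2))
```

Larger batches raise total throughput only slightly, so the per-utterance RTF falls roughly in proportion to the batch size.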

 ##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)

 Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
-the PyTorch 20.06-py3 NGC container. The input utterance has 128 characters, synthesized audio has 8.05 s.
+the PyTorch 21.05-py3 NGC container. The input utterance has 128 characters, and the synthesized audio is 8.05 s long.


 |Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|------:|------------:|--------------:|--------------:|--------------:|--------------:|----------------:|---------------:|----------:|
-|    1 | FP16   |     0.193 |   0.194 |   0.194 |   0.194 |       902,960 |      2.35 | 40.95 |
-|    4 | FP16   |     0.610 |   0.613 |   0.613 |   0.614 |     1,141,207 |      2.78 | 12.94 |
-|    8 | FP16   |     1.157 |   1.161 |   1.161 |   1.162 |     1,201,684 |      2.68 |  6.81 |
-|    1 | FP32   |     0.453 |   0.455 |   0.456 |   0.457 |       385,027 |         - | 17.46 |
-|    4 | FP32   |     1.696 |   1.703 |   1.705 |   1.707 |       411,124 |         - |  4.66 |
-|    8 | FP32   |     3.111 |   3.118 |   3.120 |   3.122 |       448,275 |         - |  2.54 |
+|-----:|-------:|----------:|--------:|--------:|--------:|---------------:|----------:|------:|
+|    1 | FP16   |     0.149 |   0.150 |   0.150 |   0.151 |      1,154,061 | 2.64      | 52.34 |
+|    4 | FP16   |     0.535 |   0.538 |   0.538 |   0.539 |      1,282,680 | 2.71      | 14.54 |
+|    8 | FP16   |     1.055 |   1.058 |   1.059 |   1.060 |      1,300,261 | 2.71      |  7.37 |
+|    1 | FP32   |     0.393 |   0.395 |   0.395 |   0.396 |        436,961 | -         | 19.82 |
+|    4 | FP32   |     1.449 |   1.452 |   1.452 |   1.453 |        473,515 | -         |  5.37 |
+|    8 | FP32   |     2.861 |   2.865 |   2.866 |   2.867 |        479,642 | -         |  2.72 |

 ##### Inference performance: NVIDIA T4

 Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
-the PyTorch 20.06-py3 NGC container.
+the PyTorch 21.05-py3 NGC container.
 The input utterance has 128 characters, and the synthesized audio is 8.05 s long.

 |Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|-----:|-------:|----------:|--------:|--------:|--------:|-------------:|----------:|------:|
-|    1 | FP16   |     0.533 |   0.540 |   0.541 |   0.543 |      326,471 |      2.56 | 14.81 |
-|    4 | FP16   |     2.292 |   2.302 |   2.304 |   2.308 |      304,283 |      2.38 |  3.45 |
-|    8 | FP16   |     4.564 |   4.578 |   4.580 |   4.585 |      305,568 |      1.99 |  1.73 |
-|    1 | FP32   |     1.365 |   1.383 |   1.387 |   1.394 |      127,765 |         - |  5.79 |
-|    4 | FP32   |     5.192 |   5.214 |   5.218 |   5.226 |      134,309 |         - |  1.52 |
-|    8 | FP32   |     9.09  |   9.11  |   9.114 |   9.122 |      153,434 |         - |  0.87 |
+|-----:|-------:|----------:|--------:|--------:|--------:|--------------:|----------:|------:|
+|    1 | FP16   |     0.446 |   0.449 |   0.449 |   0.450 |       384,743 | 2.72      | 17.45 |
+|    4 | FP16   |     1.822 |   1.826 |   1.827 |   1.828 |       376,480 | 2.70      |  4.27 |
+|    8 | FP16   |     3.656 |   3.662 |   3.664 |   3.666 |       375,329 | 2.70      |  2.13 |
+|    1 | FP32   |     1.213 |   1.218 |   1.219 |   1.220 |       141,403 | -         |  6.41 |
+|    4 | FP32   |     4.928 |   4.937 |   4.939 |   4.942 |       139,208 | -         |  1.58 |
+|    8 | FP32   |     9.853 |   9.868 |   9.871 |   9.877 |       139,266 | -         |  0.79 |

 ## Release notes

@ -728,6 +679,16 @@ We're constantly refining and improving our performance on AI and HPC workloads

 ### Changelog

+August 2021
+- Improved quality of synthesized audio
+- Added capability to automatically align audio to transcripts during training without a pre-trained Tacotron 2 aligning model
+- Added capability to train on both graphemes and phonemes
+- Added conditioning on energy
+- Faster training recipe
+- F0 is now estimated with Probabilistic YIN (PYIN)
+- Updated performance tables
+- Changed version of FastPitch from 1.0 to 1.1
+
 October 2020
 - Added multispeaker capabilities
 - Updated text processing module
--- a/PyTorch/SpeechSynthesis/FastPitch/cmudict/heteronyms
+++ b/PyTorch/SpeechSynthesis/FastPitch/cmudict/heteronyms
@ -0,0 +1,414 @@
+
+abject
+abrogate
+absent
+abstract
+abuse
+ache
+Acre
+acuminate
+addict
+address
+adduct
+Adele
+advocate
+affect
+affiliate
+agape
+aged
+agglomerate
+aggregate
+agonic
+agora
+allied
+ally
+alternate
+alum
+am
+analyses
+Andrea
+animate
+apply
+appropriate
+approximate
+ares
+arithmetic
+arsenic
+articulate
+associate
+attribute
+august
+axes
+ay
+aye
+bases
+bass
+bathed
+bested
+bifurcate
+blessed
+blotto
+bow
+bowed
+bowman
+brassy
+buffet
+bustier
+carbonate
+Celtic
+choral
+Chumash
+close
+closer
+coax
+coincidence
+color coordinate
+colour coordinate
+comber
+combine
+combs
+committee
+commune
+compact
+complex
+compound
+compress
+concert
+conduct
+confine
+confines
+conflict
+conglomerate
+conscript
+conserve
+consist
+console
+consort
+construct
+consult
+consummate
+content
+contest
+contract
+contracts
+contrast
+converse
+convert
+convict
+coop
+coordinate
+covey
+crooked
+curate
+cussed
+decollate
+decrease
+defect
+defense
+delegate
+deliberate
+denier
+desert
+detail
+deviate
+diagnoses
+diffuse
+digest
+discard
+discharge
+discount
+do
+document
+does
+dogged
+domesticate
+Dominican
+dove
+dr
+drawer
+duplicate
+egress
+ejaculate
+eject
+elaborate
+ellipses
+email
+emu
+entrace
+entrance
+escort
+estimate
+eta
+Etna
+evening
+excise
+excuse
+exploit
+export
+extract
+fine
+flower
+forbear
+four-legged
+frequent
+furrier
+gallant
+gel
+geminate
+gillie
+glower
+Gotham
+graduate
+haggis
+heavy
+hinder
+house
+housewife
+impact
+imped
+implant
+implement
+import
+impress
+incense
+incline
+increase
+infix
+insert
+instar
+insult
+integral
+intercept
+interchange
+interflow
+interleaf
+intermediate
+intern
+interspace
+intimate
+intrigue
+invalid
+invert
+invite
+irony
+jagged
+Jesses
+Julies
+kite
+laminate
+Laos
+lather
+lead
+learned
+leasing
+lech
+legitimate
+lied
+lima
+lipread
+live
+lower
+lunged
+maas
+Magdalen
+manes
+mare
+marked
+merchandise
+merlion
+minute
+misconduct
+misled
+misprint
+mobile
+moderate
+mong
+moped
+moth
+mouth
+mow
+mpg
+multiply
+mush
+nana
+nice
+Nice
+number
+numerate
+nun
+object
+opiate
+ornament
+outbox
+outcry
+outpour
+outreach
+outride
+outright
+outside
+outwork
+overall
+overbid
+overcall
+overcast
+overfall
+overflow
+overhaul
+overhead
+overlap
+overlay
+overuse
+overweight
+overwork
+pace
+palled
+palling
+para
+pasty
+pate
+Pauline
+pedal
+peer
+perfect
+periodic
+permit
+pervert
+pinta
+placer
+platy
+polish
+Polish
+poll
+pontificate
+postulate
+pram
+prayer
+precipitate
+predate
+predicate
+prefix
+preposition
+present
+pretest
+primer
+proceeds
+produce
+progress
+project
+proportionate
+prospect
+protest
+pussy
+putter
+putting
+quite
+ragged
+raven
+re
+read
+reading
+Reading
+real
+rebel
+recall
+recap
+recitative
+recollect
+record
+recreate
+recreation
+redress
+refill
+refund
+refuse
+reject
+relay
+remake
+repaint
+reprint
+reread
+rerun
+resent
+reside
+resign
+respray
+resume
+retard
+retest
+retread
+rewrite
+root
+routed
+routing
+row
+rugged
+rummy
+sais
+sake
+sambuca
+saucier
+second
+secrete
+secreted
+secreting
+segment
+separate
+sewer
+shirk
+shower
+sin
+skied
+slaver
+slough
+sow
+spoof
+squid
+stingy
+subject
+subordinate
+subvert
+supply
+supposed
+survey
+suspect
+syringes
+tabulate
+tales
+tarrier
+tarry
+taxes
+taxis
+tear
+Theron
+thou
+three-legged
+tier
+tinged
+torment
+transfer
+transform
+transplant
+transport
+transpose
+tush
+two-legged
+unionised
+unionized
+update
+uplift
+upset
+use
+used
+vale
+violist
+viva
+ware
+whinged
+whoop
+wicked
+wind
+windy
+wino
+won
+worsted
+wound
--- a/PyTorch/SpeechSynthesis/FastPitch/common/stft.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/stft.py
@ -109,11 +109,9 @@ class STFT(torch.nn.Module):
            [magnitude*torch.cos(phase), magnitude*torch.sin(phase)], dim=1)

        with torch.no_grad():
-            inverse_transform = F.conv_transpose2d(
-                recombine_magnitude_phase.unsqueeze(-1),
-                self.inverse_basis.unsqueeze(-1),
-                stride=self.hop_length,
-                padding=0).squeeze(-1)
+            inverse_transform = F.conv_transpose1d(
+                recombine_magnitude_phase, self.inverse_basis,
+                stride=self.hop_length, padding=0)

        if self.window is not None:
            window_sum = window_sumsquare(
--- a/PyTorch/SpeechSynthesis/FastPitch/common/tb_dllogger.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/tb_dllogger.py
@ -112,6 +112,11 @@ def init(log_fpath, log_dir, enabled=True, tb_subsets=[], **tb_kw):
        dllogger.metadata(f"{id_}_mel_loss",
                          {"name": f"{pref}mel loss", "format": ":>5.2f"})

+        dllogger.metadata(f"{id_}_kl_loss",
+                          {"name": f"{pref}kl loss", "format": ":>5.5f"})
+        dllogger.metadata(f"{id_}_kl_weight",
+                          {"name": f"{pref}kl weight", "format": ":>5.5f"})
+
        dllogger.metadata(f"{id_}_frames/s",
                          {"name": None, "unit": "frames/s", "format": ":>10.2f"})
        dllogger.metadata(f"{id_}_took",
--- a/PyTorch/SpeechSynthesis/FastPitch/common/text/abbreviations.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/text/abbreviations.py
@ -3,6 +3,7 @@ import re
 _no_period_re = re.compile(r'(No[.])(?=[ ]?[0-9])')
 _percent_re = re.compile(r'([ ]?[%])')
 _half_re = re.compile('([0-9]½)|(½)')
+_url_re = re.compile(r'([a-zA-Z])\.(com|gov|org)')


 # List of (regular expression, replacement) pairs for abbreviations:
@ -27,6 +28,7 @@ _abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in
    ('col', 'colonel'),
    ('ft', 'fort'),
    ('sen', 'senator'),
+    ('etc', 'et cetera'),
 ]]


@ -48,10 +50,17 @@ def _expand_half(m):
    return word[0] + ' and a half'


+def _expand_urls(m):
+    return f'{m.group(1)} dot {m.group(2)}'
+
+
 def normalize_abbreviations(text):
    text = re.sub(_no_period_re, _expand_no_period, text)
    text = re.sub(_percent_re, _expand_percent, text)
    text = re.sub(_half_re, _expand_half, text)
+    text = re.sub('&', ' and ', text)
+    text = re.sub('@', ' at ', text)
+    text = re.sub(_url_re, _expand_urls, text)

    for regex, replacement in _abbreviations:
        text = re.sub(regex, replacement, text)
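To see what the three substitutions added to `normalize_abbreviations` do in isolation, here is a minimal standalone sketch. The wrapper name `expand_symbols_and_urls` is hypothetical; the pattern and replacement strings are copied from the hunk above.

```python
import re

# Same pattern as the `_url_re` added in this hunk.
_url_re = re.compile(r'([a-zA-Z])\.(com|gov|org)')


def expand_symbols_and_urls(text):
    """Apply only the '&', '@' and URL-suffix rewrites, in the order used above."""
    text = re.sub('&', ' and ', text)
    text = re.sub('@', ' at ', text)
    text = re.sub(_url_re, lambda m: f'{m.group(1)} dot {m.group(2)}', text)
    return text


print(expand_symbols_and_urls('mail info@example.org'))
print(expand_symbols_and_urls('R&D'))
```

The `&` and `@` rewrites can leave doubled spaces when the symbol was already surrounded by spaces. Note also that the URL pattern has no trailing word boundary, so a word such as `a.company` would match as well.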
--- a/PyTorch/SpeechSynthesis/FastPitch/common/text/acronyms.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/text/acronyms.py
@ -31,12 +31,30 @@ _letter_to_arpabet = {
    's': 'Z'
 }

+# Acronyms that should not be expanded
+hardcoded_acronyms = [
+    'BMW', 'MVD', 'WDSU', 'GOP', 'UK', 'AI', 'GPS', 'BP', 'FBI', 'HD',
+    'CES', 'LRA', 'PC', 'NBA', 'BBL', 'OS', 'IRS', 'SAC', 'UV', 'CEO', 'TV',
+    'CNN', 'MSS', 'GSA', 'USSR', 'DNA', 'PRS', 'TSA', 'US', 'GPU', 'USA',
+    'FPCC', 'CIA']
+
+# Words and acronyms that should be read as regular words, e.g., NATO, HAPPY, etc.
+uppercase_whiteliset = []
+
+acronyms_exceptions = {
+    'NVIDIA': 'N.VIDIA',
+}
+
+non_uppercase_exceptions = {
+    'email': 'e-mail',
+}
+
 # must ignore roman numerals
-# _acronym_re = re.compile(r'([A-Z][A-Z]+)s?|([A-Z]\.([A-Z]\.)+s?)')
-_acronym_re = re.compile(r'([A-Z][A-Z]+)s?')
+_acronym_re = re.compile(r'([a-z]*[A-Z][A-Z]+)s?\.?')
+_non_uppercase_re = re.compile(r'\b({})\b'.format('|'.join(non_uppercase_exceptions.keys())), re.IGNORECASE)


-def _expand_acronyms(m, add_spaces=True):
+def _expand_acronyms_to_arpa(m, add_spaces=True):
    acronym = m.group(0)

    # remove dots if they exist
@ -63,5 +81,29 @@ def _expand_acronyms(m, add_spaces=True):


 def normalize_acronyms(text):
-    text = re.sub(_acronym_re, _expand_acronyms, text)
+    text = re.sub(_acronym_re, _expand_acronyms_to_arpa, text)
+    return text
+
+
+def expand_acronyms(m):
+    text = m.group(1)
+    if text in acronyms_exceptions:
+        text = acronyms_exceptions[text]
+    elif text in uppercase_whiteliset:
+        text = text
+    else:
+        text = '.'.join(text) + '.'
+
+    if 's' in m.group(0):
+        text = text + '\'s'
+
+    if text[-1] != '.' and m.group(0)[-1] == '.':
+        return text + '.'
+    else:
+        return text
+
+
+def spell_acronyms(text):
+    text = re.sub(_non_uppercase_re, lambda m: non_uppercase_exceptions[m.group(0).lower()], text)
+    text = re.sub(_acronym_re, expand_acronyms, text)
    return text
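The behaviour of the new `spell_acronyms` path can be checked with a standalone copy of the relevant definitions. This sketch reproduces the regexes and logic from the hunk above (including the `uppercase_whiteliset` spelling) and omits the ARPAbet path; the sample inputs are illustrative.

```python
import re

# Definitions copied from the diff above.
uppercase_whiteliset = []
acronyms_exceptions = {'NVIDIA': 'N.VIDIA'}
non_uppercase_exceptions = {'email': 'e-mail'}

_acronym_re = re.compile(r'([a-z]*[A-Z][A-Z]+)s?\.?')
_non_uppercase_re = re.compile(
    r'\b({})\b'.format('|'.join(non_uppercase_exceptions.keys())),
    re.IGNORECASE)


def expand_acronyms(m):
    text = m.group(1)
    if text in acronyms_exceptions:
        text = acronyms_exceptions[text]
    elif text not in uppercase_whiteliset:
        # Spell the acronym letter by letter: FBI -> F.B.I.
        text = '.'.join(text) + '.'

    if 's' in m.group(0):
        text = text + '\'s'

    # Restore a sentence-final period that the regex consumed.
    if text[-1] != '.' and m.group(0)[-1] == '.':
        return text + '.'
    return text


def spell_acronyms(text):
    text = re.sub(_non_uppercase_re,
                  lambda m: non_uppercase_exceptions[m.group(0).lower()], text)
    return re.sub(_acronym_re, expand_acronyms, text)


print(spell_acronyms('the FBI'))
print(spell_acronyms('NVIDIA GPUs'))
print(spell_acronyms('Email me'))
```

Two details are visible in the code as written: the plural check looks for a lowercase `s` anywhere in the match, so a lowercase prefix containing `s` also triggers it, and `hardcoded_acronyms` is defined but not referenced in this hunk.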
--- a/PyTorch/SpeechSynthesis/FastPitch/common/text/cleaners.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/text/cleaners.py
@ -15,7 +15,7 @@ hyperparameter. Some cleaners are English-specific. You'll typically want to use
 import re
 from unidecode import unidecode
 from .numerical import normalize_numbers
-from .acronyms import normalize_acronyms
+from .acronyms import normalize_acronyms, spell_acronyms
 from .datestime import normalize_datestime
 from .letters_and_numbers import normalize_letters_and_numbers
 from .abbreviations import normalize_abbreviations
@ -78,10 +78,6 @@ def transliteration_cleaners(text):
    return text


-def english_cleaners_post_chars(word):
-    return word
-
-
 def english_cleaners(text):
    '''Pipeline for English text, with number and abbreviation expansion.'''
    text = convert_to_ascii(text)
@ -90,3 +86,17 @@ def english_cleaners(text):
    text = expand_abbreviations(text)
    text = collapse_whitespace(text)
    return text
+
+
+def english_cleaners_v2(text):
+    text = convert_to_ascii(text)
+    text = expand_datestime(text)
+    text = expand_letters_and_numbers(text)
+    text = expand_numbers(text)
+    text = expand_abbreviations(text)
+    text = spell_acronyms(text)
+    text = lowercase(text)
+    text = collapse_whitespace(text)
+    # compatibility with basic_english symbol set
+    text = re.sub(r'/+', ' ', text)
+    return text
--- a/PyTorch/SpeechSynthesis/FastPitch/common/text/cmudict.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/text/cmudict.py
@ -16,6 +16,13 @@ valid_symbols = [
 _valid_symbol_set = set(valid_symbols)


+def lines_to_list(filename):
+  with open(filename, encoding='utf-8') as f:
+    lines = f.readlines()
+  lines = [l.rstrip() for l in lines]
+  return lines
+
+
 class CMUDict:
  '''Thin wrapper around CMUDict data. http://www.speech.cs.cmu.edu/cgi-bin/cmudict'''
  def __init__(self, file_or_path=None, heteronyms_path=None, keep_ambiguous=True):
--- a/PyTorch/SpeechSynthesis/FastPitch/common/text/letters_and_numbers.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/text/letters_and_numbers.py
@ -49,7 +49,7 @@ def _expand_letters_and_numbers(m):
                if string[-1] == '0':
                    string = [string]
                else:
-                    string = [string[:-3], string[-2], string[-1]]
+                    string = [string[:-2], string[-2], string[-1]]
            elif len(string) % 2 == 0:
                string = [string[i:i+2] for i in range(0, len(string), 2)]
            elif len(string) > 2:
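The one-character change from `string[:-3]` to `string[:-2]` matters because the old slice skipped the character at index `-3`. The following sketch isolates the two slicing expressions; the enclosing condition is not visible in this hunk, so the helper names and inputs are illustrative only.

```python
def split_old(string):
    # Pre-change slicing: the character at index -3 is dropped.
    return [string[:-3], string[-2], string[-1]]


def split_new(string):
    # Post-change slicing: the three pieces cover the whole string.
    return [string[:-2], string[-2], string[-1]]


for s in ('105', '1205'):
    print(s, split_old(s), split_new(s))
```

With the new slice, joining the pieces always reproduces the input, which was not true before.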
--- a/PyTorch/SpeechSynthesis/FastPitch/common/text/numerical.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/text/numerical.py
@ -13,7 +13,7 @@ _currency_key = {'$': 'dollar', '£': 'pound', '€': 'euro', '₩': 'won'}
 _inflect = inflect.engine()
 _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
 _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
-_currency_re = re.compile(r'([\$€£₩])([0-9\.\,]*[0-9]+)(?:[ ]?({})(?=[^a-zA-Z]))?'.format("|".join(_magnitudes)), re.IGNORECASE)
+_currency_re = re.compile(r'([\$€£₩])([0-9\.\,]*[0-9]+)(?:[ ]?({})(?=[^a-zA-Z]|$))?'.format("|".join(_magnitudes)), re.IGNORECASE)
 _measurement_re = re.compile(r'([0-9\.\,]*[0-9]+(\s)?{}\b)'.format(_measurements), re.IGNORECASE)
 _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
 # _range_re = re.compile(r'(?<=[0-9])+(-)(?=[0-9])+.*?')
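The added `|$` alternative in the lookahead of `_currency_re` lets a magnitude word match when it is the last thing in the string. A minimal sketch comparing the two patterns follows; `_magnitudes` here is a hypothetical stand-in, since the real list is not shown in this diff.

```python
import re

# Hypothetical stand-in for the module's `_magnitudes` list.
_magnitudes = ['million', 'billion']

old_currency_re = re.compile(
    r'([\$€£₩])([0-9\.\,]*[0-9]+)(?:[ ]?({})(?=[^a-zA-Z]))?'.format(
        "|".join(_magnitudes)), re.IGNORECASE)
new_currency_re = re.compile(
    r'([\$€£₩])([0-9\.\,]*[0-9]+)(?:[ ]?({})(?=[^a-zA-Z]|$))?'.format(
        "|".join(_magnitudes)), re.IGNORECASE)

for text in ('$5 million', '$5 million.'):
    print(repr(text),
          old_currency_re.search(text).group(3),
          new_currency_re.search(text).group(3))
```

With the old pattern, the lookahead needs a following non-letter character, so `'$5 million'` at the end of the input loses its magnitude group.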
--- a/PyTorch/SpeechSynthesis/FastPitch/common/text/text_processing.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/text/text_processing.py
@ -3,7 +3,7 @@ import re
 import numpy as np
 from . import cleaners
 from .symbols import get_symbols
-from .cmudict import CMUDict
+from . import cmudict
 from .numerical import _currency_re, _expand_currency


@ -21,13 +21,6 @@ _words_re = re.compile(r"([a-zA-ZÀ-ž]+['][a-zA-ZÀ-ž]{1,2}|[a-zA-ZÀ-ž]+)|([
 _arpa_re = re.compile(r'{[^}]+}|\S+')


-def lines_to_list(filename):
-    with open(filename, encoding='utf-8') as f:
-        lines = f.readlines()
-    lines = [l.rstrip() for l in lines]
-    return lines
-
-
 class TextProcessing(object):
    def __init__(self, symbol_set, cleaner_names, p_arpabet=0.0,
                 handle_arpabet='word', handle_arpabet_ambiguous='ignore',
@ -111,6 +104,10 @@ class TextProcessing(object):
        elif arpabet[0] == '{':
            arpabet = [arpabet[1:-1]]

+        # XXX arpabet might not be a list here
+        if type(arpabet) is not list:
+            return word
+
        if len(arpabet) > 1:
            if self.handle_arpabet_ambiguous == 'first':
                arpabet = arpabet[0]
@ -125,21 +122,13 @@ class TextProcessing(object):

        return arpabet

-    # def get_characters(self, word):
-    #     for name in self.cleaner_names:
-    #         cleaner = getattr(cleaners, f'{name}_post_chars')
-    #         if not cleaner:
-    #             raise Exception('Unknown cleaner: %s' % name)
-    #         word = cleaner(word)
-
-    #     return word
-
    def encode_text(self, text, return_all=False):
        if self.expand_currency:
            text = re.sub(_currency_re, _expand_currency, text)
        text_clean = [self.clean_text(split) if split[0] != '{' else split
                      for split in _arpa_re.findall(text)]
        text_clean = ' '.join(text_clean)
+        text_clean = cleaners.collapse_whitespace(text_clean)
        text = text_clean

        text_arpabet = ''
--- a/PyTorch/SpeechSynthesis/FastPitch/common/utils.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/common/utils.py
@ -25,9 +25,12 @@
 #
 # *****************************************************************************

+import shutil
+import warnings
 from pathlib import Path
 from typing import Optional

+import librosa
 import numpy as np

 import torch
@ -42,8 +45,12 @@ def mask_from_lens(lens, max_len: Optional[int] = None):
    return mask


-def load_wav_to_torch(full_path):
-    sampling_rate, data = read(full_path)
+def load_wav_to_torch(full_path, force_sampling_rate=None):
+    if force_sampling_rate is not None:
+        data, sampling_rate = librosa.load(full_path, sr=force_sampling_rate)
+    else:
+        sampling_rate, data = read(full_path)
+
    return torch.FloatTensor(data.astype(np.float32)), sampling_rate


@ -57,7 +64,7 @@ def load_filepaths_and_text(dataset_path, fnames, has_speakers=False, split="|")
        return tuple(str(Path(root, p)) for p in paths) + tuple(non_paths)

    fpaths_and_text = []
-    for fname in fnames.split(','):
+    for fname in fnames:
        with open(fname, encoding='utf-8') as f:
            fpaths_and_text += [split_line(dataset_path, line) for line in f]
    return fpaths_and_text
@ -81,3 +88,13 @@ def to_device_async(tensor, device):

 def to_numpy(x):
    return x.cpu().numpy() if isinstance(x, torch.Tensor) else x
+
+
+def prepare_tmp(path):
+    if path is None:
+        return
+    p = Path(path)
+    if p.is_dir():
+        warnings.warn(f'{p} exists. Removing...')
+        shutil.rmtree(p, ignore_errors=True)
+    p.mkdir(parents=False, exist_ok=False)
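A short usage sketch for `prepare_tmp`, using a copy of the function from the hunk above. The directory and file names are hypothetical, and `demo` exists only for illustration.

```python
import shutil
import tempfile
import warnings
from pathlib import Path


def prepare_tmp(path):  # copied from the diff above
    if path is None:
        return
    p = Path(path)
    if p.is_dir():
        warnings.warn(f'{p} exists. Removing...')
        shutil.rmtree(p, ignore_errors=True)
    p.mkdir(parents=False, exist_ok=False)


def demo():
    with tempfile.TemporaryDirectory() as root:
        tmp = Path(root, 'fastpitch_tmp')  # hypothetical directory name
        prepare_tmp(tmp)                   # first call: creates the directory
        (tmp / 'stale.txt').write_text('old')
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter('always')
            prepare_tmp(tmp)               # second call: warns, wipes, recreates
        return tmp.is_dir(), sorted(p.name for p in tmp.iterdir()), len(caught)


print(demo())
```

Because `mkdir` is called with `parents=False`, the parent directory must already exist; passing `None` makes the function a no-op.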
--- a/PyTorch/SpeechSynthesis/FastPitch/data_functions.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/data_functions.py
@ -1,51 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import torch
-
-from fastpitch.data_function import (TextMelAliCollate, TextMelAliLoader,
-                                      batch_to_gpu as batch_to_gpu_fastpitch)
-from tacotron2.data_function import batch_to_gpu as batch_to_gpu_tacotron2
-from tacotron2.data_function import TextMelCollate, TextMelLoader
-from waveglow.data_function import batch_to_gpu as batch_to_gpu_waveglow
-from waveglow.data_function import MelAudioLoader
-
-
-def get_collate_function(model_name):
-    return {'Tacotron2': lambda _: TextMelCollate(n_frames_per_step=1),
-            'WaveGlow': lambda _: torch.utils.data.dataloader.default_collate,
-            'FastPitch': TextMelAliCollate}[model_name]()
-
-def get_data_loader(model_name, **args):
-    return {'Tacotron2': TextMelLoader,
-            'WaveGlow': MelAudioLoader,
-            'FastPitch': TextMelAliLoader}[model_name](**args)
-
-def get_batch_to_gpu(model_name):
-    return {'Tacotron2': batch_to_gpu_tacotron2,
-            'WaveGlow': batch_to_gpu_waveglow,
-            'FastPitch': batch_to_gpu_fastpitch}[model_name]
--- a/PyTorch/SpeechSynthesis/FastPitch/extract_mels.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/extract_mels.py
@ -1,298 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import argparse
-import json
-import time
-from pathlib import Path
-
-import parselmouth
-import torch
-import dllogger as DLLogger
-import numpy as np
-from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
-from torch.utils.data import DataLoader
-
-from common import utils
-from inference import load_and_setup_model
-from tacotron2.data_function import TextMelLoader, TextMelCollate, batch_to_gpu
-from common.text.text_processing import TextProcessing
-
-
-def parse_args(parser):
-    """
-    Parse commandline arguments.
-    """
-    parser.add_argument('--tacotron2-checkpoint', type=str,
-                        help='full path to the generator checkpoint file')
-    parser.add_argument('-b', '--batch-size', default=32, type=int)
-    parser.add_argument('--log-file', type=str, default='nvlog.json',
-                        help='Filename for logging')
-    # Mel extraction
-    parser.add_argument('-d', '--dataset-path', type=str,
-                        default='./', help='Path to dataset')
-    parser.add_argument('--wav-text-filelist', required=True,
-                        type=str, help='Path to file with audio paths and text')
-    parser.add_argument('--text-cleaners', nargs='*',
-                        default=['english_cleaners'], type=str,
-                        help='Type of text cleaners for input text')
-    parser.add_argument('--symbol-set', type=str, default='english_basic',
-                        help='Define symbol set for input text')
-    parser.add_argument('--max-wav-value', default=32768.0, type=float,
-                        help='Maximum audiowave value')
-    parser.add_argument('--sampling-rate', default=22050, type=int,
-                        help='Sampling rate')
-    parser.add_argument('--filter-length', default=1024, type=int,
-                        help='Filter length')
-    parser.add_argument('--hop-length', default=256, type=int,
-                        help='Hop (stride) length')
-    parser.add_argument('--win-length', default=1024, type=int,
-                        help='Window length')
-    parser.add_argument('--mel-fmin', default=0.0, type=float,
-                        help='Minimum mel frequency')
-    parser.add_argument('--mel-fmax', default=8000.0, type=float,
-                        help='Maximum mel frequency')
-    # Duration extraction
-    parser.add_argument('--extract-mels', action='store_true',
-                        help='Calculate spectrograms from .wav files')
-    parser.add_argument('--extract-mels-teacher', action='store_true',
-                        help='Extract Taco-generated mel-spectrograms for KD')
-    parser.add_argument('--extract-durations', action='store_true',
-                        help='Extract char durations from attention matrices')
-    parser.add_argument('--extract-attentions', action='store_true',
-                        help='Extract full attention matrices')
-    parser.add_argument('--extract-pitch-mel', action='store_true',
-                        help='Extract pitch')
-    parser.add_argument('--extract-pitch-char', action='store_true',
-                        help='Extract pitch averaged over input characters')
-    parser.add_argument('--extract-pitch-trichar', action='store_true',
-                        help='Extract pitch averaged over input characters')
-    parser.add_argument('--train-mode', action='store_true',
-                        help='Run the model in .train() mode')
-    parser.add_argument('--cuda', action='store_true',
-                        help='Extract mels on a GPU using CUDA')
-    return parser
-
-
-class FilenamedLoader(TextMelLoader):
-    def __init__(self, filenames, **kwargs):
-        # dict_args = vars(args)
-        kwargs['audiopaths_and_text'] = kwargs['wav_text_filelist']
-        kwargs['load_mel_from_disk'] = False
-        super(FilenamedLoader, self).__init__(**kwargs)
-        self.tp = TextProcessing(kwargs['symbol_set'], kwargs['text_cleaners'])
-        self.filenames = filenames
-
-    def __getitem__(self, index):
-        mel_text = super(FilenamedLoader, self).__getitem__(index)
-        return mel_text + (self.filenames[index],)
-
-
-def maybe_pad(vec, l):
-    assert np.abs(vec.shape[0] - l) <= 3
-    vec = vec[:l]
-    if vec.shape[0] < l:
-        vec = np.pad(vec, pad_width=(0, l - vec.shape[0]))
-    return vec
-
-
-def dur_chunk_sizes(n, ary):
-    """Split a single duration into almost-equally-sized chunks
-
-    Examples:
-      dur_chunk(3, 2) --> [2, 1]
-      dur_chunk(3, 3) --> [1, 1, 1]
-      dur_chunk(5, 3) --> [2, 2, 1]
-    """
-    ret = np.ones((ary,), dtype=np.int32) * (n // ary)
-    ret[:n % ary] = n // ary + 1
-    assert ret.sum() == n
-    return ret
-
-
-def calculate_pitch(wav, durs):
-    mel_len = durs.sum()
-    durs_cum = np.cumsum(np.pad(durs, (1, 0)))
-    snd = parselmouth.Sound(wav)
-    pitch = snd.to_pitch(time_step=snd.duration / (mel_len + 3)
-                         ).selected_array['frequency']
-    assert np.abs(mel_len - pitch.shape[0]) <= 1.0
-
-    # Average pitch over characters
-    pitch_char = np.zeros((durs.shape[0],), dtype=np.float)
-    for idx, a, b in zip(range(mel_len), durs_cum[:-1], durs_cum[1:]):
-        values = pitch[a:b][np.where(pitch[a:b] != 0.0)[0]]
-        pitch_char[idx] = np.mean(values) if len(values) > 0 else 0.0
-
-    # Average to three values per character
-    pitch_trichar = np.zeros((3 * durs.shape[0],), dtype=np.float)
-
-    durs_tri = np.concatenate([dur_chunk_sizes(d, 3) for d in durs])
-    durs_tri_cum = np.cumsum(np.pad(durs_tri, (1, 0)))
-
-    for idx, a, b in zip(range(3 * mel_len), durs_tri_cum[:-1], durs_tri_cum[1:]):
-        values = pitch[a:b][np.where(pitch[a:b] != 0.0)[0]]
-        pitch_trichar[idx] = np.mean(values) if len(values) > 0 else 0.0
-
-    pitch_mel = maybe_pad(pitch, mel_len)
-    pitch_char = maybe_pad(pitch_char, len(durs))
-    pitch_trichar = maybe_pad(pitch_trichar, len(durs_tri))
-
-    return pitch_mel, pitch_char, pitch_trichar
-
-
-def normalize_pitch_vectors(pitch_vecs):
-    nonzeros = np.concatenate([v[np.where(v != 0.0)[0]]
-                               for v in pitch_vecs.values()])
-    mean, std = np.mean(nonzeros), np.std(nonzeros)
-
-    for v in pitch_vecs.values():
-        zero_idxs = np.where(v == 0.0)[0]
-        v -= mean
-        v /= std
-        v[zero_idxs] = 0.0
-
-    return mean, std
-
-
-def save_stats(dataset_path, wav_text_filelist, feature_name, mean, std):
-    fpath = utils.stats_filename(dataset_path, wav_text_filelist, feature_name)
-    with open(fpath, 'w') as f:
-        json.dump({'mean': mean, 'std': std}, f, indent=4)
-
-
-def main():
-    parser = argparse.ArgumentParser(description='PyTorch TTS Data Pre-processing')
-    parser = parse_args(parser)
-    args, unk_args = parser.parse_known_args()
-    if len(unk_args) > 0:
-        raise ValueError(f'Invalid options {unk_args}')
-
-    if args.extract_pitch_char:
-        assert args.extract_durations, "Durations required for pitch extraction"
-
-    DLLogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT, args.log_file),
-                            StdOutBackend(Verbosity.VERBOSE)])
-    for k,v in vars(args).items():
-        DLLogger.log(step="PARAMETER", data={k:v})
-
-    model = load_and_setup_model(
-        'Tacotron2', parser, args.tacotron2_checkpoint, amp=False,
-        device=torch.device('cuda' if args.cuda else 'cpu'),
-        forward_is_infer=False, ema=False)
-
-    if args.train_mode:
-        model.train()
-
-    # n_mel_channels arg has been consumed by model's arg parser
-    args.n_mel_channels = model.n_mel_channels
-
-    for datum in ('mels', 'mels_teacher', 'attentions', 'durations',
-                  'pitch_mel', 'pitch_char', 'pitch_trichar'):
-        if getattr(args, f'extract_{datum}'):
-            Path(args.dataset_path, datum).mkdir(parents=False, exist_ok=True)
-
-    filenames = [Path(l.split('|')[0]).stem
-                 for l in open(args.wav_text_filelist, 'r')]
-    # Compatibility with Tacotron2 Data loader
-    args.n_speakers = 1
-    dataset = FilenamedLoader(filenames, **vars(args))
-    # TextMelCollate supports only n_frames_per_step=1
-    data_loader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False,
-                             sampler=None, num_workers=0,
-                             collate_fn=TextMelCollate(1),
-                             pin_memory=False, drop_last=False)
-    pitch_vecs = {'mel': {}, 'char': {}, 'trichar': {}}
-    for i, batch in enumerate(data_loader):
-        tik = time.time()
-        fnames = batch[-1]
-        x, _, _ = batch_to_gpu(batch[:-1])
-        _, text_lens, mels_padded, _, mel_lens = x
-
-        for j, mel in enumerate(mels_padded):
-            fpath = Path(args.dataset_path, 'mels', fnames[j] + '.pt')
-            torch.save(mel[:, :mel_lens[j]].cpu(), fpath)
-
-        with torch.no_grad():
-            out_mels, out_mels_postnet, _, alignments = model.forward(x)
-
-        if args.extract_mels_teacher:
-            for j, mel in enumerate(out_mels_postnet):
-                fpath = Path(args.dataset_path, 'mels_teacher', fnames[j] + '.pt')
-                torch.save(mel[:, :mel_lens[j]].cpu(), fpath)
-        if args.extract_attentions:
-            for j, ali in enumerate(alignments):
-                ali = ali[:mel_lens[j],:text_lens[j]]
-                fpath = Path(args.dataset_path, 'attentions', fnames[j] + '.pt')
-                torch.save(ali.cpu(), fpath)
-        durations = []
-        if args.extract_durations:
-            for j, ali in enumerate(alignments):
-                text_len = text_lens[j]
-                ali = ali[:mel_lens[j],:text_len]
-                dur = torch.histc(torch.argmax(ali, dim=1), min=0,
-                                  max=text_len-1, bins=text_len)
-                durations.append(dur)
-                fpath = Path(args.dataset_path, 'durations', fnames[j] + '.pt')
-                torch.save(dur.cpu().int(), fpath)
-        if args.extract_pitch_mel or args.extract_pitch_char or args.extract_pitch_trichar:
-            for j, dur in enumerate(durations):
-                fpath = Path(args.dataset_path, 'pitch_char', fnames[j] + '.pt')
-                wav = Path(args.dataset_path, 'wavs', fnames[j] + '.wav')
-                p_mel, p_char, p_trichar = calculate_pitch(str(wav), dur.cpu().numpy())
-                pitch_vecs['mel'][fnames[j]] = p_mel
-                pitch_vecs['char'][fnames[j]] = p_char
-                pitch_vecs['trichar'][fnames[j]] = p_trichar
-
-        nseconds = time.time() - tik
-        DLLogger.log(step=f'{i+1}/{len(data_loader)} ({nseconds:.2f}s)', data={})
-
-    if args.extract_pitch_mel:
-        normalize_pitch_vectors(pitch_vecs['mel'])
-        for fname, pitch in pitch_vecs['mel'].items():
-            fpath = Path(args.dataset_path, 'pitch_mel', fname + '.pt')
-            torch.save(torch.from_numpy(pitch), fpath)
-
-    if args.extract_pitch_char:
-        mean, std = normalize_pitch_vectors(pitch_vecs['char'])
-        for fname, pitch in pitch_vecs['char'].items():
-            fpath = Path(args.dataset_path, 'pitch_char', fname + '.pt')
-            torch.save(torch.from_numpy(pitch), fpath)
-        save_stats(args.dataset_path, args.wav_text_filelist, 'pitch_char',
-                   mean, std)
-
-    if args.extract_pitch_trichar:
-        normalize_pitch_vectors(pitch_vecs['trichar'])
-        for fname, pitch in pitch_vecs['trichar'].items():
-            fpath = Path(args.dataset_path, 'pitch_trichar', fname + '.pt')
-            torch.save(torch.from_numpy(pitch), fpath)
-
-    DLLogger.flush()
-
-
-if __name__ == '__main__':
-    main()
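Review note: the script removed above derived per-character durations from the Tacotron 2 attention matrix by taking the argmax text position for each mel frame and counting hits per position (the `torch.histc(torch.argmax(...))` call). The sketch below shows only that counting step in plain Python, assuming the alignment is a list of rows (one per mel frame). `durations_from_alignment` is a hypothetical name used for illustration and is not part of the repository.

```python
def durations_from_alignment(ali):
    """Count, for each text position, how many mel frames attend to it most.

    ali: list of rows (mel frames), each a list of attention weights over
    text positions. Illustrative helper only, not repository code.
    """
    n_text = len(ali[0])
    durs = [0] * n_text
    for row in ali:
        # Index of the largest weight in this frame (first one on ties).
        best = max(range(n_text), key=row.__getitem__)
        durs[best] += 1
    return durs


if __name__ == "__main__":
    ali = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    print(durations_from_alignment(ali))
```

The durations always sum to the number of mel frames, which is the property the old pitch-averaging code relied on.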
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/alignment.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/alignment.py
@ -0,0 +1,85 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numpy as np
+from numba import jit, prange
+
+
+@jit(nopython=True)
+def mas(attn_map, width=1):
+    # assumes mel x text
+    opt = np.zeros_like(attn_map)
+    attn_map = np.log(attn_map)
+    attn_map[0, 1:] = -np.inf
+    log_p = np.zeros_like(attn_map)
+    log_p[0, :] = attn_map[0, :]
+    prev_ind = np.zeros_like(attn_map, dtype=np.int64)
+    for i in range(1, attn_map.shape[0]):
+        for j in range(attn_map.shape[1]):  # for each text dim
+            prev_j = np.arange(max(0, j-width), j+1)
+            prev_log = np.array([log_p[i-1, prev_idx] for prev_idx in prev_j])
+
+            ind = np.argmax(prev_log)
+            log_p[i, j] = attn_map[i, j] + prev_log[ind]
+            prev_ind[i, j] = prev_j[ind]
+
+    # now backtrack
+    curr_text_idx = attn_map.shape[1]-1
+    for i in range(attn_map.shape[0]-1, -1, -1):
+        opt[i, curr_text_idx] = 1
+        curr_text_idx = prev_ind[i, curr_text_idx]
+    opt[0, curr_text_idx] = 1
+    return opt
+
+
+@jit(nopython=True)
+def mas_width1(attn_map):
+    """mas with hardcoded width=1"""
+    # assumes mel x text
+    opt = np.zeros_like(attn_map)
+    attn_map = np.log(attn_map)
+    attn_map[0, 1:] = -np.inf
+    log_p = np.zeros_like(attn_map)
+    log_p[0, :] = attn_map[0, :]
+    prev_ind = np.zeros_like(attn_map, dtype=np.int64)
+    for i in range(1, attn_map.shape[0]):
+        for j in range(attn_map.shape[1]):  # for each text dim
+            prev_log = log_p[i-1, j]
+            prev_j = j
+
+            if j-1 >= 0 and log_p[i-1, j-1] >= log_p[i-1, j]:
+                prev_log = log_p[i-1, j-1]
+                prev_j = j-1
+
+            log_p[i, j] = attn_map[i, j] + prev_log
+            prev_ind[i, j] = prev_j
+
+    # now backtrack
+    curr_text_idx = attn_map.shape[1]-1
+    for i in range(attn_map.shape[0]-1, -1, -1):
+        opt[i, curr_text_idx] = 1
+        curr_text_idx = prev_ind[i, curr_text_idx]
+    opt[0, curr_text_idx] = 1
+    return opt
+
+
+@jit(nopython=True, parallel=True)
+def b_mas(b_attn_map, in_lens, out_lens, width=1):
+    assert width == 1
+    attn_out = np.zeros_like(b_attn_map)
+
+    for b in prange(b_attn_map.shape[0]):
+        out = mas_width1(b_attn_map[b, 0, :out_lens[b], :in_lens[b]])
+        attn_out[b, 0, :out_lens[b], :in_lens[b]] = out
+    return attn_out
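Review note: `mas_width1` above is a Viterbi-style dynamic program over a mel x text attention map. Each frame either stays on the same text token or advances by one, the path is forced to start at the first token, and backtracking starts from the last token. The following is a minimal pure-Python sketch of the same recurrence for readers who want to step through it without numba. The function name `monotonic_alignment` and the list-of-lists input are assumptions made for illustration.

```python
import math


def monotonic_alignment(probs):
    """Return a 0/1 matrix marking the most likely monotonic path.

    probs: list of rows (mel frames) of attention probabilities over text
    tokens. Illustrative re-statement of the width-1 search, not the
    numba implementation from the diff.
    """
    n_mel, n_text = len(probs), len(probs[0])
    neg_inf = float("-inf")
    log_a = [[math.log(p) if p > 0 else neg_inf for p in row]
             for row in probs]

    # The path must start at the first text token.
    log_p = [[neg_inf] * n_text for _ in range(n_mel)]
    log_p[0][0] = log_a[0][0]
    prev = [[0] * n_text for _ in range(n_mel)]

    for i in range(1, n_mel):
        for j in range(n_text):
            # Either stay on token j or arrive from token j - 1.
            best_j = j
            if j >= 1 and log_p[i - 1][j - 1] >= log_p[i - 1][j]:
                best_j = j - 1
            log_p[i][j] = log_a[i][j] + log_p[i - 1][best_j]
            prev[i][j] = best_j

    # Backtrack from the last text token at the last frame.
    opt = [[0] * n_text for _ in range(n_mel)]
    j = n_text - 1
    for i in range(n_mel - 1, -1, -1):
        opt[i][j] = 1
        j = prev[i][j]
    return opt


if __name__ == "__main__":
    probs = [[0.8, 0.1, 0.1],
             [0.6, 0.3, 0.1],
             [0.1, 0.7, 0.2],
             [0.1, 0.2, 0.7]]
    for row in monotonic_alignment(probs):
        print(row)
```

Summing the columns of the returned matrix gives per-token durations; for the sample input the path visits tokens 0, 0, 1, 2, so the durations are 2, 1, 1.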
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/arg_parser.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/arg_parser.py
@ -108,9 +108,22 @@ def parse_fastpitch_args(parent, add_help=False):
    pitch_pred.add_argument('--pitch-predictor-n-layers', default=2, type=int,
                            help='Number of conv-1D layers')

+    energy_pred = parser.add_argument_group('energy predictor parameters')
+    energy_pred.add_argument('--energy-conditioning', action='store_true')
+    energy_pred.add_argument('--energy-predictor-kernel-size', default=3, type=int,
+                            help='Energy predictor conv-1D kernel size')
+    energy_pred.add_argument('--energy-predictor-filter-size', default=256, type=int,
+                            help='Energy predictor conv-1D filter size')
+    energy_pred.add_argument('--p-energy-predictor-dropout', default=0.1, type=float,
+                            help='Dropout probability for energy predictor')
+    energy_pred.add_argument('--energy-predictor-n-layers', default=2, type=int,
+                            help='Number of conv-1D layers')
+
    cond = parser.add_argument_group('conditioning parameters')
    cond.add_argument('--pitch-embedding-kernel-size', default=3, type=int,
                      help='Pitch embedding conv-1D kernel size')
+    cond.add_argument('--energy-embedding-kernel-size', default=3, type=int,
+                      help='Energy embedding conv-1D kernel size')
    cond.add_argument('--speaker-emb-weight', type=float, default=1.0,
                      help='Scale speaker embedding')

--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/attention.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/attention.py
@ -0,0 +1,220 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numpy as np
+import torch
+from torch import nn
+from torch.nn import functional as F
+
+
+class ConvNorm(torch.nn.Module):
+    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
+                 padding=None, dilation=1, bias=True, w_init_gain='linear'):
+        super(ConvNorm, self).__init__()
+        if padding is None:
+            assert(kernel_size % 2 == 1)
+            padding = int(dilation * (kernel_size - 1) / 2)
+
+        self.conv = torch.nn.Conv1d(in_channels, out_channels,
+                                    kernel_size=kernel_size, stride=stride,
+                                    padding=padding, dilation=dilation,
+                                    bias=bias)
+
+        torch.nn.init.xavier_uniform_(
+            self.conv.weight, gain=torch.nn.init.calculate_gain(w_init_gain))
+
+    def forward(self, signal):
+        conv_signal = self.conv(signal)
+        return conv_signal
+
+
+class Invertible1x1ConvLUS(torch.nn.Module):
+    def __init__(self, c):
+        super(Invertible1x1ConvLUS, self).__init__()
+        # Sample a random orthonormal matrix to initialize weights
+        W, _ = torch.linalg.qr(torch.randn(c, c))
+        # Ensure determinant is 1.0 not -1.0
+        if torch.det(W) < 0:
+            W[:, 0] = -1*W[:, 0]
+        p, lower, upper = torch.lu_unpack(*torch.lu(W))
+
+        self.register_buffer('p', p)
+        # diagonals of lower will always be 1s anyway
+        lower = torch.tril(lower, -1)
+        lower_diag = torch.diag(torch.eye(c, c))
+        self.register_buffer('lower_diag', lower_diag)
+        self.lower = nn.Parameter(lower)
+        self.upper_diag = nn.Parameter(torch.diag(upper))
+        self.upper = nn.Parameter(torch.triu(upper, 1))
+
+    def forward(self, z, reverse=False):
+        U = torch.triu(self.upper, 1) + torch.diag(self.upper_diag)
+        L = torch.tril(self.lower, -1) + torch.diag(self.lower_diag)
+        W = torch.mm(self.p, torch.mm(L, U))
+        if reverse:
+            if not hasattr(self, 'W_inverse'):
+                # Reverse computation
+                W_inverse = W.float().inverse()
+                if z.type() == 'torch.cuda.HalfTensor':
+                    W_inverse = W_inverse.half()
+
+                self.W_inverse = W_inverse[..., None]
+            z = F.conv1d(z, self.W_inverse, bias=None, stride=1, padding=0)
+            return z
+        else:
+            W = W[..., None]
+            z = F.conv1d(z, W, bias=None, stride=1, padding=0)
+            log_det_W = torch.sum(torch.log(torch.abs(self.upper_diag)))
+            return z, log_det_W
+
+
+class ConvAttention(torch.nn.Module):
+    def __init__(self, n_mel_channels=80, n_speaker_dim=128,
+                 n_text_channels=512, n_att_channels=80, temperature=1.0,
+                 n_mel_convs=2, align_query_enc_type='3xconv',
+                 use_query_proj=True):
+        super(ConvAttention, self).__init__()
+        self.temperature = temperature
+        self.att_scaling_factor = np.sqrt(n_att_channels)
+        self.softmax = torch.nn.Softmax(dim=3)
+        self.log_softmax = torch.nn.LogSoftmax(dim=3)
+        self.query_proj = Invertible1x1ConvLUS(n_mel_channels)
+        self.attn_proj = torch.nn.Conv2d(n_att_channels, 1, kernel_size=1)
+        self.align_query_enc_type = align_query_enc_type
+        self.use_query_proj = bool(use_query_proj)
+
+        self.key_proj = nn.Sequential(
+            ConvNorm(n_text_channels,
+                     n_text_channels * 2,
+                     kernel_size=3,
+                     bias=True,
+                     w_init_gain='relu'),
+            torch.nn.ReLU(),
+            ConvNorm(n_text_channels * 2,
+                     n_att_channels,
+                     kernel_size=1,
+                     bias=True))
+
+        self.align_query_enc_type = align_query_enc_type
+
+        if align_query_enc_type == "inv_conv":
+            self.query_proj = Invertible1x1ConvLUS(n_mel_channels)
+        elif align_query_enc_type == "3xconv":
+            self.query_proj = nn.Sequential(
+                ConvNorm(n_mel_channels,
+                         n_mel_channels * 2,
+                         kernel_size=3,
+                         bias=True,
+                         w_init_gain='relu'),
+                torch.nn.ReLU(),
+                ConvNorm(n_mel_channels * 2,
+                         n_mel_channels,
+                         kernel_size=1,
+                         bias=True),
+                torch.nn.ReLU(),
+                ConvNorm(n_mel_channels,
+                         n_att_channels,
+                         kernel_size=1,
+                         bias=True))
+        else:
+            raise ValueError("Unknown query encoder type specified")
+
+    def run_padded_sequence(self, sorted_idx, unsort_idx, lens, padded_data,
+                            recurrent_model):
+        """Sorts input data by provided ordering (and un-ordering) and runs the
+        packed data through the recurrent model
+
+        Args:
+            sorted_idx (torch.tensor): 1D sorting index
+            unsort_idx (torch.tensor): 1D unsorting index (inverse of sorted_idx)
+            lens: lengths of input data (sorted in descending order)
+            padded_data (torch.tensor): input sequences (padded)
+            recurrent_model (nn.Module): recurrent model to run data through
+        Returns:
+            hidden_vectors (torch.tensor): outputs of the RNN, in the original,
+            unsorted, ordering
+        """
+
+        # sort the data by decreasing length using provided index
+        # we assume batch index is in dim=1
+        padded_data = padded_data[:, sorted_idx]
+        padded_data = nn.utils.rnn.pack_padded_sequence(padded_data, lens)
+        hidden_vectors = recurrent_model(padded_data)[0]
+        hidden_vectors, _ = nn.utils.rnn.pad_packed_sequence(hidden_vectors)
+        # unsort the results at dim=1 and return
+        hidden_vectors = hidden_vectors[:, unsort_idx]
+        return hidden_vectors
+
+    def encode_query(self, query, query_lens):
+        query = query.permute(2, 0, 1)  # seq_len, batch, feature dim
+        lens, ids = torch.sort(query_lens, descending=True)
+        original_ids = [0] * lens.size(0)
+        for i in range(len(ids)):
+            original_ids[ids[i]] = i
+
+        query_encoded = self.run_padded_sequence(ids, original_ids, lens,
+                                                 query, self.query_lstm)
+        query_encoded = query_encoded.permute(1, 2, 0)
+        return query_encoded
+
+    def forward(self, queries, keys, query_lens, mask=None, key_lens=None,
+                keys_encoded=None, attn_prior=None):
+        """Attention mechanism for flowtron parallel
+        Unlike in Flowtron, we have no restrictions such as causality etc,
+        since we only need this during training.
+
+        Args:
+            queries (torch.tensor): B x C x T1 tensor
+                (probably going to be mel data)
+            keys (torch.tensor): B x C2 x T2 tensor (text data)
+            query_lens: lengths for sorting the queries in descending order
+            mask (torch.tensor): uint8 binary mask for variable length entries
+                (should be in the T2 domain)
+        Output:
+            attn (torch.tensor): B x 1 x T1 x T2 attention mask.
+                Final dim T2 should sum to 1
+        """
+        keys_enc = self.key_proj(keys)  # B x n_attn_dims x T2
+
+        # Beware can only do this since query_dim = attn_dim = n_mel_channels
+        if self.use_query_proj:
+            if self.align_query_enc_type == "inv_conv":
+                queries_enc, log_det_W = self.query_proj(queries)
+            elif self.align_query_enc_type == "3xconv":
+                queries_enc = self.query_proj(queries)
+                log_det_W = 0.0
+            else:
+                queries_enc, log_det_W = self.query_proj(queries)
+        else:
+            queries_enc, log_det_W = queries, 0.0
+
+        # different ways of computing attn,
+        # one is isotropic gaussians (per phoneme)
+        # Simplistic Gaussian Isotropic Attention
+
+        # B x n_attn_dims x T1 x T2
+        attn = (queries_enc[:, :, :, None] - keys_enc[:, :, None]) ** 2
+        # compute log likelihood from a gaussian
+        attn = -0.0005 * attn.sum(1, keepdim=True)
+        if attn_prior is not None:
+            attn = self.log_softmax(attn) + torch.log(attn_prior[:, None]+1e-8)
+
+        attn_logprob = attn.clone()
+
+        if mask is not None:
+            attn.data.masked_fill_(mask.permute(0, 2, 1).unsqueeze(2),
+                                   -float("inf"))
+
+        attn = self.softmax(attn)  # Softmax along T2
+        return attn, attn_logprob
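Review note: the core of `ConvAttention.forward` scores every (mel frame, text token) pair by the negative squared Euclidean distance between their encodings, scaled by 0.0005, and then applies a softmax over the text axis. The sketch below reproduces only that scoring step in plain Python, leaving out the convolutional projections, the prior, and the mask. `gaussian_attention` and its list-based inputs are assumptions made for illustration.

```python
import math


def gaussian_attention(queries, keys, scale=0.0005):
    """Soft attention from scaled negative squared distances.

    queries: list of T1 vectors (encoded mel frames).
    keys:    list of T2 vectors (encoded text tokens), same dimension.
    Returns a T1 x T2 list of rows, each summing to 1.
    Illustrative helper only; the real module works on batched tensors.
    """
    attn = []
    for q in queries:
        logits = [-scale * sum((a - b) ** 2 for a, b in zip(q, k))
                  for k in keys]
        # Numerically stable softmax over the text axis.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        z = sum(exps)
        attn.append([e / z for e in exps])
    return attn


if __name__ == "__main__":
    queries = [[0.0, 0.0], [100.0, 100.0]]
    keys = [[0.0, 0.0], [100.0, 100.0]]
    for row in gaussian_attention(queries, keys):
        print([round(w, 5) for w in row])
```

Each frame puts most of its weight on the nearest text encoding, and the small scale factor keeps the distribution soft when the encodings are close together.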
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/attn_loss_function.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/attn_loss_function.py
@ -0,0 +1,54 @@
+# Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class AttentionCTCLoss(torch.nn.Module):
+    def __init__(self, blank_logprob=-1):
+        super(AttentionCTCLoss, self).__init__()
+        self.log_softmax = torch.nn.LogSoftmax(dim=3)
+        self.blank_logprob = blank_logprob
+        self.CTCLoss = nn.CTCLoss(zero_infinity=True)
+
+    def forward(self, attn_logprob, in_lens, out_lens):
+        key_lens = in_lens
+        query_lens = out_lens
+        attn_logprob_padded = F.pad(input=attn_logprob,
+                                    pad=(1, 0, 0, 0, 0, 0, 0, 0),
+                                    value=self.blank_logprob)
+        cost_total = 0.0
+        for bid in range(attn_logprob.shape[0]):
+            target_seq = torch.arange(1, key_lens[bid]+1).unsqueeze(0)
+            curr_logprob = attn_logprob_padded[bid].permute(1, 0, 2)
+            curr_logprob = curr_logprob[:query_lens[bid], :, :key_lens[bid]+1]
+            curr_logprob = self.log_softmax(curr_logprob[None])[0]
+            ctc_cost = self.CTCLoss(
+                curr_logprob, target_seq, input_lengths=query_lens[bid:bid+1],
+                target_lengths=key_lens[bid:bid+1])
+            cost_total += ctc_cost
+        cost = cost_total/attn_logprob.shape[0]
+        return cost
+
+
+class AttentionBinarizationLoss(torch.nn.Module):
+    def __init__(self):
+        super(AttentionBinarizationLoss, self).__init__()
+
+    def forward(self, hard_attention, soft_attention, eps=1e-12):
+        log_sum = torch.log(torch.clamp(soft_attention[hard_attention == 1],
+                            min=eps)).sum()
+        return -log_sum / hard_attention.sum()
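Review note: `AttentionBinarizationLoss` is the mean negative log of the soft attention weights at the positions selected by the hard (0/1) alignment, with the soft weights clamped at `eps` to avoid `log(0)`. Below is a minimal pure-Python sketch of that computation on nested lists; `binarization_loss` is a hypothetical name, and the real module operates on batched torch tensors.

```python
import math


def binarization_loss(hard, soft, eps=1e-12):
    """Mean negative log-probability of soft attention on the hard path.

    hard: 0/1 matrix (list of rows) from the monotonic alignment search.
    soft: matrix of the same shape with soft attention weights.
    Illustrative helper only, not repository code.
    """
    total, count = 0.0, 0
    for h_row, s_row in zip(hard, soft):
        for h, s in zip(h_row, s_row):
            if h == 1:
                total += math.log(max(s, eps))  # clamp before the log
                count += 1
    return -total / count


if __name__ == "__main__":
    hard = [[1, 0], [0, 1]]
    print(binarization_loss(hard, [[1.0, 0.0], [0.0, 1.0]]))
    print(binarization_loss(hard, [[0.5, 0.5], [0.5, 0.5]]))
```

The loss is zero when the soft attention already matches the hard path and grows as probability mass moves off the path, which is what pushes the soft alignment toward a binary one during training.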
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/data_function.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/data_function.py
@ -25,55 +25,340 @@
 #
 # *****************************************************************************

+import functools
+import json
+import re
+from pathlib import Path
+
+import librosa
 import numpy as np
-
+import parselmouth
 import torch
+import torch.nn.functional as F
+from scipy import ndimage
+from scipy.stats import betabinom

-from common.utils import to_gpu
-from tacotron2.data_function import TextMelLoader
+import common.layers as layers
 from common.text.text_processing import TextProcessing
+from common.utils import load_wav_to_torch, load_filepaths_and_text, to_gpu


-class TextMelAliLoader(TextMelLoader):
+class BetaBinomialInterpolator:
+    """Interpolates alignment prior matrices to save computation.
+
+    Calculating beta-binomial priors is costly. Instead, cache popular sizes
+    and use image interpolation to get priors faster.
    """
+    def __init__(self, round_mel_len_to=100, round_text_len_to=20):
+        self.round_mel_len_to = round_mel_len_to
+        self.round_text_len_to = round_text_len_to
+        self.bank = functools.lru_cache(beta_binomial_prior_distribution)
+
+    def round(self, val, to):
+        return max(1, int(np.round((val + 1) / to))) * to
+
+    def __call__(self, w, h):
+        bw = self.round(w, to=self.round_mel_len_to)
+        bh = self.round(h, to=self.round_text_len_to)
+        ret = ndimage.zoom(self.bank(bw, bh).T, zoom=(w / bw, h / bh), order=1)
+        assert ret.shape[0] == w, ret.shape
+        assert ret.shape[1] == h, ret.shape
+        return ret
+
+
+def beta_binomial_prior_distribution(phoneme_count, mel_count, scaling=1.0):
+    P = phoneme_count
+    M = mel_count
+    x = np.arange(0, P)
+    mel_text_probs = []
+    for i in range(1, M+1):
+        a, b = scaling * i, scaling * (M + 1 - i)
+        rv = betabinom(P, a, b)
+        mel_i_prob = rv.pmf(x)
+        mel_text_probs.append(mel_i_prob)
+    return torch.tensor(np.array(mel_text_probs))
+
+
+def estimate_pitch(wav, mel_len, method='pyin', normalize_mean=None,
+                   normalize_std=None, n_formants=1):
+
+    if type(normalize_mean) is float or type(normalize_mean) is list:
+        normalize_mean = torch.tensor(normalize_mean)
+
+    if type(normalize_std) is float or type(normalize_std) is list:
+        normalize_std = torch.tensor(normalize_std)
+
+    if method == 'praat':
+
+        snd = parselmouth.Sound(wav)
+        pitch_mel = snd.to_pitch(time_step=snd.duration / (mel_len + 3)
+                                 ).selected_array['frequency']
+        assert np.abs(mel_len - pitch_mel.shape[0]) <= 1.0
+
+        pitch_mel = torch.from_numpy(pitch_mel).unsqueeze(0)
+
+        if n_formants > 1:
+            formant = snd.to_formant_burg(
+                time_step=snd.duration / (mel_len + 3))
+            formant_n_frames = formant.get_number_of_frames()
+            assert np.abs(mel_len - formant_n_frames) <= 1.0
+
+            formants_mel = np.zeros((formant_n_frames + 1, n_formants - 1))
+            for i in range(1, formant_n_frames + 1):
+                formants_mel[i] = np.asarray([
+                    formant.get_value_at_time(
+                        formant_number=f,
+                        time=formant.get_time_from_frame_number(i))
+                    for f in range(1, n_formants)
+                ])
+
+            pitch_mel = torch.cat(
+                [pitch_mel, torch.from_numpy(formants_mel).permute(1, 0)],
+                dim=0)
+
+    elif method == 'pyin':
+
+        snd, sr = librosa.load(wav)
+        pitch_mel, voiced_flag, voiced_probs = librosa.pyin(
+            snd, fmin=librosa.note_to_hz('C2'),
+            fmax=librosa.note_to_hz('C7'), frame_length=1024)
+        assert np.abs(mel_len - pitch_mel.shape[0]) <= 1.0
+
+        pitch_mel = np.where(np.isnan(pitch_mel), 0.0, pitch_mel)
+        pitch_mel = torch.from_numpy(pitch_mel).unsqueeze(0)
+        pitch_mel = F.pad(pitch_mel, (0, mel_len - pitch_mel.size(1)))
+
+        if n_formants > 1:
+            raise NotImplementedError
+
+    else:
+        raise ValueError
+
+    pitch_mel = pitch_mel.float()
+
+    if normalize_mean is not None:
+        assert normalize_std is not None
+        pitch_mel = normalize_pitch(pitch_mel, normalize_mean, normalize_std)
+
+    return pitch_mel
+
+
+def normalize_pitch(pitch, mean, std):
+    zeros = (pitch == 0.0)
+    pitch -= mean[:, None]
+    pitch /= std[:, None]
+    pitch[zeros] = 0.0
+    return pitch
+
+
+class TTSDataset(torch.utils.data.Dataset):
    """
-    def __init__(self, **kwargs):
-        super(TextMelAliLoader, self).__init__(**kwargs)
-        self.tp = TextProcessing(kwargs['symbol_set'], kwargs['text_cleaners'])
-        self.n_speakers = kwargs['n_speakers']
-        if len(self.audiopaths_and_text[0]) != 4 + (kwargs['n_speakers'] > 1):
-            raise ValueError('Expected four columns in audiopaths file for single speaker model. \n'
-                             'For multispeaker model, the filelist format is '
-                             '<mel>|<dur>|<pitch>|<text>|<speaker_id>')
+        1) loads audio,text pairs
+        2) normalizes text and converts it to sequences of symbol IDs
+        3) computes mel-spectrograms from audio files.
+    """
+    def __init__(self,
+                 dataset_path,
+                 audiopaths_and_text,
+                 text_cleaners,
+                 n_mel_channels,
+                 symbol_set='english_basic',
+                 p_arpabet=1.0,
+                 n_speakers=1,
+                 load_mel_from_disk=True,
+                 load_pitch_from_disk=True,
+                 pitch_mean=214.72203,  # LJSpeech defaults
+                 pitch_std=65.72038,
+                 max_wav_value=None,
+                 sampling_rate=None,
+                 filter_length=None,
+                 hop_length=None,
+                 win_length=None,
+                 mel_fmin=None,
+                 mel_fmax=None,
+                 prepend_space_to_text=False,
+                 append_space_to_text=False,
+                 pitch_online_dir=None,
+                 betabinomial_online_dir=None,
+                 use_betabinomial_interpolator=True,
+                 pitch_online_method='praat',
+                 **ignored):
+
+        # Expect a list of filenames
+        if type(audiopaths_and_text) is str:
+            audiopaths_and_text = [audiopaths_and_text]
+
+        self.dataset_path = dataset_path
+        self.audiopaths_and_text = load_filepaths_and_text(
+            dataset_path, audiopaths_and_text,
+            has_speakers=(n_speakers > 1))
+        self.load_mel_from_disk = load_mel_from_disk
+        if not load_mel_from_disk:
+            self.max_wav_value = max_wav_value
+            self.sampling_rate = sampling_rate
+            self.stft = layers.TacotronSTFT(
+                filter_length, hop_length, win_length,
+                n_mel_channels, sampling_rate, mel_fmin, mel_fmax)
+        self.load_pitch_from_disk = load_pitch_from_disk
+
+        self.prepend_space_to_text = prepend_space_to_text
+        self.append_space_to_text = append_space_to_text
+
+        assert p_arpabet == 0.0 or p_arpabet == 1.0, (
+            'Only 0.0 and 1.0 p_arpabet is currently supported. '
+            'Variable probability breaks caching of betabinomial matrices.')
+
+        self.tp = TextProcessing(symbol_set, text_cleaners, p_arpabet=p_arpabet)
+        self.n_speakers = n_speakers
+        self.pitch_tmp_dir = pitch_online_dir
+        self.f0_method = pitch_online_method
+        self.betabinomial_tmp_dir = betabinomial_online_dir
+        self.use_betabinomial_interpolator = use_betabinomial_interpolator
+
+        if use_betabinomial_interpolator:
+            self.betabinomial_interpolator = BetaBinomialInterpolator()
+
+        expected_columns = (2 + int(load_pitch_from_disk) + (n_speakers > 1))
+
+        assert not (load_pitch_from_disk and self.pitch_tmp_dir is not None)
+
+        if len(self.audiopaths_and_text[0]) < expected_columns:
+            raise ValueError(f'Expected {expected_columns} columns in audiopaths file. '
+                             'The format is <mel_or_wav>|[<pitch>|]<text>[|<speaker_id>]')
+
+        if len(self.audiopaths_and_text[0]) > expected_columns:
+            print('WARNING: Audiopaths file has more columns than expected')
+
+        to_tensor = lambda x: torch.Tensor([x]) if type(x) is float else x
+        self.pitch_mean = to_tensor(pitch_mean)
+        self.pitch_std = to_tensor(pitch_std)

    def __getitem__(self, index):
-        # separate filename and text
+        # Separate filename and text
        if self.n_speakers > 1:
-            audiopath, durpath, pitchpath, text, speaker = self.audiopaths_and_text[index]
+            audiopath, *extra, text, speaker = self.audiopaths_and_text[index]
            speaker = int(speaker)
        else:
-            audiopath, durpath, pitchpath, text = self.audiopaths_and_text[index]
+            audiopath, *extra, text = self.audiopaths_and_text[index]
            speaker = None
-        len_text = len(text)
-        text = self.get_text(text)
+
        mel = self.get_mel(audiopath)
-        dur = torch.load(durpath)
-        pitch = torch.load(pitchpath)
-        return (text, mel, len_text, dur, pitch, speaker)
+        text = self.get_text(text)
+        pitch = self.get_pitch(index, mel.size(-1))
+        energy = torch.norm(mel.float(), dim=0, p=2)
+        attn_prior = self.get_prior(index, mel.shape[1], text.shape[0])
+
+        assert pitch.size(-1) == mel.size(-1)
+
+        # No higher formants?
+        if len(pitch.size()) == 1:
+            pitch = pitch[None, :]
+
+        return (text, mel, len(text), pitch, energy, speaker, attn_prior,
+                audiopath)
+
+    def __len__(self):
+        return len(self.audiopaths_and_text)
+
+    def get_mel(self, filename):
+        if not self.load_mel_from_disk:
+            audio, sampling_rate = load_wav_to_torch(filename)
+            if sampling_rate != self.stft.sampling_rate:
+                raise ValueError("{} SR doesn't match target {} SR".format(
+                    sampling_rate, self.stft.sampling_rate))
+            audio_norm = audio / self.max_wav_value
+            audio_norm = audio_norm.unsqueeze(0)
+            audio_norm = torch.autograd.Variable(audio_norm,
+                                                 requires_grad=False)
+            melspec = self.stft.mel_spectrogram(audio_norm)
+            melspec = torch.squeeze(melspec, 0)
+        else:
+            melspec = torch.load(filename)
+            # assert melspec.size(0) == self.stft.n_mel_channels, (
+            #     'Mel dimension mismatch: given {}, expected {}'.format(
+            #         melspec.size(0), self.stft.n_mel_channels))
+
+        return melspec
+
+    def get_text(self, text):
+        text = self.tp.encode_text(text)
+        space = [self.tp.encode_text("A A")[1]]
+
+        if self.prepend_space_to_text:
+            text = space + text
+
+        if self.append_space_to_text:
+            text = text + space
+
+        return torch.LongTensor(text)
+
+    def get_prior(self, index, mel_len, text_len):
+
+        if self.use_betabinomial_interpolator:
+            return torch.from_numpy(self.betabinomial_interpolator(mel_len,
+                                                                   text_len))
+
+        if self.betabinomial_tmp_dir is not None:
+            audiopath, *_ = self.audiopaths_and_text[index]
+            fname = Path(audiopath).relative_to(self.dataset_path)
+            fname = fname.with_suffix('.pt')
+            cached_fpath = Path(self.betabinomial_tmp_dir, fname)
+
+            if cached_fpath.is_file():
+                return torch.load(cached_fpath)
+
+        attn_prior = beta_binomial_prior_distribution(text_len, mel_len)
+
+        if self.betabinomial_tmp_dir is not None:
+            cached_fpath.parent.mkdir(parents=True, exist_ok=True)
+            torch.save(attn_prior, cached_fpath)
+
+        return attn_prior
+
+    def get_pitch(self, index, mel_len=None):
+        audiopath, *fields = self.audiopaths_and_text[index]
+
+        if self.n_speakers > 1:
+            spk = int(fields[-1])
+        else:
+            spk = 0
+
+        if self.load_pitch_from_disk:
+            pitchpath = fields[0]
+            pitch = torch.load(pitchpath)
+            if self.pitch_mean is not None:
+                assert self.pitch_std is not None
+                pitch = normalize_pitch(pitch, self.pitch_mean, self.pitch_std)
+            return pitch
+
+        if self.pitch_tmp_dir is not None:
+            fname = Path(audiopath).relative_to(self.dataset_path)
+            fname_method = fname.with_suffix('.pt')
+            cached_fpath = Path(self.pitch_tmp_dir, fname_method)
+            if cached_fpath.is_file():
+                return torch.load(cached_fpath)
+
+        # Not loaded from disk or cache - estimate pitch with self.f0_method
+        wav = audiopath
+        if not wav.endswith('.wav'):
+            wav = re.sub('/mels/', '/wavs/', wav)
+            wav = re.sub(r'\.pt$', '.wav', wav)
+
+        pitch_mel = estimate_pitch(wav, mel_len, self.f0_method,
+                                   self.pitch_mean, self.pitch_std)
+
+        if self.pitch_tmp_dir is not None and not cached_fpath.is_file():
+            cached_fpath.parent.mkdir(parents=True, exist_ok=True)
+            torch.save(pitch_mel, cached_fpath)
+
+        return pitch_mel


-class TextMelAliCollate():
-    """ Zero-pads model inputs and targets based on number of frames per setep
-    """
-    def __init__(self):
-        self.n_frames_per_step = 1  # Taco2 bckwd compat
+class TTSCollate:
+    """Zero-pads model inputs and targets based on number of frames per step"""

    def __call__(self, batch):
-        """Collate's training batch from normalized text and mel-spectrogram
-        PARAMS
-        ------
-        batch: [text_normalized, mel_normalized]
-        """
+        """Collate training batch from normalized text and mel-spec"""
        # Right zero-pad all one-hot text sequences to max input length
        input_lengths, ids_sorted_decreasing = torch.sort(
            torch.LongTensor([len(x[0]) for x in batch]),
@ -86,23 +371,11 @@ class TextMelAliCollate():
            text = batch[ids_sorted_decreasing[i]][0]
            text_padded[i, :text.size(0)] = text

-        dur_padded = torch.zeros_like(text_padded, dtype=batch[0][3].dtype)
-        dur_lens = torch.zeros(dur_padded.size(0), dtype=torch.int32)
-        for i in range(len(ids_sorted_decreasing)):
-            dur = batch[ids_sorted_decreasing[i]][3]
-            dur_padded[i, :dur.shape[0]] = dur
-            dur_lens[i] = dur.shape[0]
-            assert dur_lens[i] == input_lengths[i]
-
        # Right zero-pad mel-spec
        num_mels = batch[0][1].size(0)
        max_target_len = max([x[1].size(1) for x in batch])
-        if max_target_len % self.n_frames_per_step != 0:
-            max_target_len += (self.n_frames_per_step - max_target_len
-                               % self.n_frames_per_step)
-            assert max_target_len % self.n_frames_per_step == 0

-        # include mel padded and gate padded
+        # Include mel padded and gate padded
        mel_padded = torch.FloatTensor(len(batch), num_mels, max_target_len)
        mel_padded.zero_()
        output_lengths = torch.LongTensor(len(batch))
@ -111,11 +384,16 @@ class TextMelAliCollate():
            mel_padded[i, :, :mel.size(1)] = mel
            output_lengths[i] = mel.size(1)

-        pitch_padded = torch.zeros(dur_padded.size(0), dur_padded.size(1),
-                                   dtype=batch[0][4].dtype)
+        n_formants = batch[0][3].shape[0]
+        pitch_padded = torch.zeros(mel_padded.size(0), n_formants,
+                                   mel_padded.size(2), dtype=batch[0][3].dtype)
+        energy_padded = torch.zeros_like(pitch_padded[:, 0, :])
+
        for i in range(len(ids_sorted_decreasing)):
-            pitch = batch[ids_sorted_decreasing[i]][4]
-            pitch_padded[i, :pitch.shape[0]] = pitch
+            pitch = batch[ids_sorted_decreasing[i]][3]
+            energy = batch[ids_sorted_decreasing[i]][4]
+            pitch_padded[i, :, :pitch.shape[1]] = pitch
+            energy_padded[i, :energy.shape[0]] = energy

        if batch[0][5] is not None:
            speaker = torch.zeros_like(input_lengths)
@ -124,29 +402,41 @@ class TextMelAliCollate():
        else:
            speaker = None

-        # count number of items - characters in text
+        attn_prior_padded = torch.zeros(len(batch), max_target_len,
+                                        max_input_len)
+        attn_prior_padded.zero_()
+        for i in range(len(ids_sorted_decreasing)):
+            prior = batch[ids_sorted_decreasing[i]][6]
+            attn_prior_padded[i, :prior.size(0), :prior.size(1)] = prior
+
+        # Count number of items - characters in text
        len_x = [x[2] for x in batch]
        len_x = torch.Tensor(len_x)

-        return (text_padded, input_lengths, mel_padded, output_lengths,
-                len_x, dur_padded, dur_lens, pitch_padded, speaker)
+        audiopaths = [batch[i][7] for i in ids_sorted_decreasing]
+
+        return (text_padded, input_lengths, mel_padded, output_lengths, len_x,
+                pitch_padded, energy_padded, speaker, attn_prior_padded,
+                audiopaths)


 def batch_to_gpu(batch):
-    text_padded, input_lengths, mel_padded, output_lengths, \
-        len_x, dur_padded, dur_lens, pitch_padded, speaker = batch
+    (text_padded, input_lengths, mel_padded, output_lengths, len_x,
+     pitch_padded, energy_padded, speaker, attn_prior, audiopaths) = batch
+
    text_padded = to_gpu(text_padded).long()
    input_lengths = to_gpu(input_lengths).long()
    mel_padded = to_gpu(mel_padded).float()
    output_lengths = to_gpu(output_lengths).long()
-    dur_padded = to_gpu(dur_padded).long()
-    dur_lens = to_gpu(dur_lens).long()
    pitch_padded = to_gpu(pitch_padded).float()
+    energy_padded = to_gpu(energy_padded).float()
+    attn_prior = to_gpu(attn_prior).float()
    if speaker is not None:
        speaker = to_gpu(speaker).long()
+
    # Alignments act as both inputs and targets - pass shallow copies
    x = [text_padded, input_lengths, mel_padded, output_lengths,
-         dur_padded, dur_lens, pitch_padded, speaker]
-    y = [mel_padded, dur_padded, dur_lens, pitch_padded]
+         pitch_padded, energy_padded, speaker, attn_prior, audiopaths]
+    y = [mel_padded, input_lengths, output_lengths]
    len_x = torch.sum(output_lengths)
    return (x, y, len_x)
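
Reviewer note on `normalize_pitch` in the hunk above: it standardizes pitch with the dataset mean and std, but restores frames that were exactly 0.0 (unvoiced) back to 0.0 afterwards. A minimal pure-Python sketch of that behavior for a single pitch track follows; the name `normalize_pitch_list` and the scalar `mean`/`std` arguments are illustrative assumptions, not part of this commit, which operates on tensors in place.

```python
def normalize_pitch_list(pitch, mean, std):
    """Hypothetical list-based analogue of normalize_pitch.

    Voiced frames are z-scored; unvoiced frames (exactly 0.0) stay 0.0,
    mirroring the `zeros` mask used in the tensor version.
    """
    return [0.0 if p == 0.0 else (p - mean) / std for p in pitch]


# Unvoiced frame stays at zero, voiced frames are standardized.
normalized = normalize_pitch_list([0.0, 200.0, 240.0], mean=220.0, std=20.0)
```

Keeping unvoiced frames at zero means they can still be told apart from voiced frames after normalization, which `average_pitch` in `model.py` relies on when it counts non-zero frames.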
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/loss_function.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/loss_function.py
@ -30,18 +30,30 @@ import torch.nn.functional as F
 from torch import nn

 from common.utils import mask_from_lens
+from fastpitch.attn_loss_function import AttentionCTCLoss


 class FastPitchLoss(nn.Module):
    def __init__(self, dur_predictor_loss_scale=1.0,
-                 pitch_predictor_loss_scale=1.0):
+                 pitch_predictor_loss_scale=1.0, attn_loss_scale=1.0,
+                 energy_predictor_loss_scale=0.1):
        super(FastPitchLoss, self).__init__()
        self.dur_predictor_loss_scale = dur_predictor_loss_scale
        self.pitch_predictor_loss_scale = pitch_predictor_loss_scale
+        self.energy_predictor_loss_scale = energy_predictor_loss_scale
+        self.attn_loss_scale = attn_loss_scale
+        self.attn_ctc_loss = AttentionCTCLoss()

    def forward(self, model_out, targets, is_training=True, meta_agg='mean'):
-        mel_out, dec_mask, dur_pred, log_dur_pred, pitch_pred = model_out
-        mel_tgt, dur_tgt, dur_lens, pitch_tgt = targets
+        (mel_out, dec_mask, dur_pred, log_dur_pred, pitch_pred, pitch_tgt,
+         energy_pred, energy_tgt, attn_soft, attn_hard, attn_dur,
+         attn_logprob) = model_out
+
+        (mel_tgt, in_lens, out_lens) = targets
+
+        dur_tgt = attn_dur
+        dur_lens = in_lens
+
        mel_tgt.requires_grad = False
        # (B,H,T) => (B,T,H)
        mel_tgt = mel_tgt.transpose(1, 2)
@ -59,25 +71,42 @@ class FastPitchLoss(nn.Module):
        mel_loss = loss_fn(mel_out, mel_tgt, reduction='none')
        mel_loss = (mel_loss * mel_mask).sum() / mel_mask.sum()

-        ldiff = pitch_tgt.size(1) - pitch_pred.size(1)
-        pitch_pred = F.pad(pitch_pred, (0, ldiff, 0, 0), value=0.0)
+        ldiff = pitch_tgt.size(2) - pitch_pred.size(2)
+        pitch_pred = F.pad(pitch_pred, (0, ldiff, 0, 0, 0, 0), value=0.0)
        pitch_loss = F.mse_loss(pitch_tgt, pitch_pred, reduction='none')
-        pitch_loss = (pitch_loss * dur_mask).sum() / dur_mask.sum()
+        pitch_loss = (pitch_loss * dur_mask.unsqueeze(1)).sum() / dur_mask.sum()

-        loss = mel_loss
-        loss = (mel_loss + pitch_loss * self.pitch_predictor_loss_scale
-                + dur_pred_loss * self.dur_predictor_loss_scale)
+        if energy_pred is not None:
+            energy_pred = F.pad(energy_pred, (0, ldiff, 0, 0), value=0.0)
+            energy_loss = F.mse_loss(energy_tgt, energy_pred, reduction='none')
+            energy_loss = (energy_loss * dur_mask).sum() / dur_mask.sum()
+        else:
+            energy_loss = 0
+
+        # Attention loss
+        attn_loss = self.attn_ctc_loss(attn_logprob, in_lens, out_lens)
+
+        loss = (mel_loss
+                + dur_pred_loss * self.dur_predictor_loss_scale
+                + pitch_loss * self.pitch_predictor_loss_scale
+                + energy_loss * self.energy_predictor_loss_scale
+                + attn_loss * self.attn_loss_scale)

        meta = {
            'loss': loss.clone().detach(),
            'mel_loss': mel_loss.clone().detach(),
            'duration_predictor_loss': dur_pred_loss.clone().detach(),
            'pitch_loss': pitch_loss.clone().detach(),
+            'attn_loss': attn_loss.clone().detach(),
            'dur_error': (torch.abs(dur_pred - dur_tgt).sum()
                          / dur_mask.sum()).detach(),
        }
+
+        if energy_pred is not None:
+            meta['energy_loss'] = energy_loss.clone().detach()
+
        assert meta_agg in ('sum', 'mean')
        if meta_agg == 'sum':
            bsz = mel_out.size(0)
-            meta = {k: v * bsz for k,v in meta.items()}
+            meta = {k: v * bsz for k, v in meta.items()}
        return loss, meta
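
Reviewer note on the masked losses above: the pitch and energy terms are computed as `(loss * dur_mask).sum() / dur_mask.sum()`, so padded token positions contribute nothing to the average. A minimal sketch of the same idea for a single-channel target, using plain lists; `masked_mse` and its length-based masking are illustrative assumptions, not code from this commit.

```python
def masked_mse(pred, tgt, lens):
    """Hypothetical list-based masked MSE.

    pred, tgt: lists of equal-length rows (one row per batch item).
    lens: number of valid (unpadded) positions in each row.
    Only the first lens[i] positions of row i enter the average.
    """
    total, count = 0.0, 0
    for pred_row, tgt_row, n in zip(pred, tgt, lens):
        for p, t in zip(pred_row[:n], tgt_row[:n]):
            total += (p - t) ** 2
            count += 1
    return total / count


# The third position is padding and is ignored: (0 + 4) / 2 = 2.0
loss = masked_mse([[1.0, 2.0, 9.0]], [[1.0, 4.0, 0.0]], [2])
```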
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/model.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/model.py
@ -25,17 +25,21 @@
 #
 # *****************************************************************************

+from typing import Optional
+
 import torch
-from torch import nn as nn
-from torch.nn.utils.rnn import pad_sequence
+import torch.nn as nn
 import torch.nn.functional as F

 from common.layers import ConvReLUNorm
 from common.utils import mask_from_lens
+from fastpitch.alignment import b_mas, mas_width1
+from fastpitch.attention import ConvAttention
 from fastpitch.transformer import FFTransformer


-def regulate_len(durations, enc_out, pace=1.0, mel_max_len=None):
+def regulate_len(durations, enc_out, pace: float = 1.0,
+                 mel_max_len: Optional[int] = None):
    """If target=None, then predicted durations are applied"""
    dtype = enc_out.dtype
    reps = durations.float() / pace
@ -43,7 +47,8 @@ def regulate_len(durations, enc_out, pace=1.0, mel_max_len=None):
    dec_lens = reps.sum(dim=1)

    max_len = dec_lens.max()
-    reps_cumsum = torch.cumsum(F.pad(reps, (1, 0, 0, 0), value=0.0), dim=1)[:, None, :]
+    reps_cumsum = torch.cumsum(F.pad(reps, (1, 0, 0, 0), value=0.0),
+                               dim=1)[:, None, :]
    reps_cumsum = reps_cumsum.to(dtype)

    range_ = torch.arange(max_len).to(enc_out.device)[None, :, None]
@ -52,17 +57,38 @@ def regulate_len(durations, enc_out, pace=1.0, mel_max_len=None):
    mult = mult.to(dtype)
    enc_rep = torch.matmul(mult, enc_out)

-    if mel_max_len:
+    if mel_max_len is not None:
        enc_rep = enc_rep[:, :mel_max_len]
        dec_lens = torch.clamp_max(dec_lens, mel_max_len)
    return enc_rep, dec_lens


+def average_pitch(pitch, durs):
+    durs_cums_ends = torch.cumsum(durs, dim=1).long()
+    durs_cums_starts = F.pad(durs_cums_ends[:, :-1], (1, 0))
+    pitch_nonzero_cums = F.pad(torch.cumsum(pitch != 0.0, dim=2), (1, 0))
+    pitch_cums = F.pad(torch.cumsum(pitch, dim=2), (1, 0))
+
+    bs, l = durs_cums_ends.size()
+    n_formants = pitch.size(1)
+    dcs = durs_cums_starts[:, None, :].expand(bs, n_formants, l)
+    dce = durs_cums_ends[:, None, :].expand(bs, n_formants, l)
+
+    pitch_sums = (torch.gather(pitch_cums, 2, dce)
+                  - torch.gather(pitch_cums, 2, dcs)).float()
+    pitch_nelems = (torch.gather(pitch_nonzero_cums, 2, dce)
+                    - torch.gather(pitch_nonzero_cums, 2, dcs)).float()
+
+    pitch_avg = torch.where(pitch_nelems == 0.0, pitch_nelems,
+                            pitch_sums / pitch_nelems)
+    return pitch_avg
+
+
 class TemporalPredictor(nn.Module):
    """Predicts a single float per each temporal location"""

    def __init__(self, input_size, filter_size, kernel_size, dropout,
-                 n_layers=2):
+                 n_layers=2, n_predictions=1):
        super(TemporalPredictor, self).__init__()

        self.layers = nn.Sequential(*[
@ -70,17 +96,18 @@ class TemporalPredictor(nn.Module):
                         kernel_size=kernel_size, dropout=dropout)
            for i in range(n_layers)]
        )
-        self.fc = nn.Linear(filter_size, 1, bias=True)
+        self.n_predictions = n_predictions
+        self.fc = nn.Linear(filter_size, self.n_predictions, bias=True)

    def forward(self, enc_out, enc_out_mask):
        out = enc_out * enc_out_mask
        out = self.layers(out.transpose(1, 2)).transpose(1, 2)
        out = self.fc(out) * enc_out_mask
-        return out.squeeze(-1)
+        return out


 class FastPitch(nn.Module):
-    def __init__(self, n_mel_channels, max_seq_len, n_symbols, padding_idx,
+    def __init__(self, n_mel_channels, n_symbols, padding_idx,
                 symbols_embedding_dim, in_fft_n_layers, in_fft_n_heads,
                 in_fft_d_head,
                 in_fft_conv1d_kernel_size, in_fft_conv1d_filter_size,
@ -94,9 +121,13 @@ class FastPitch(nn.Module):
                 p_dur_predictor_dropout, dur_predictor_n_layers,
                 pitch_predictor_kernel_size, pitch_predictor_filter_size,
                 p_pitch_predictor_dropout, pitch_predictor_n_layers,
-                 pitch_embedding_kernel_size, n_speakers, speaker_emb_weight):
+                 pitch_embedding_kernel_size,
+                 energy_conditioning,
+                 energy_predictor_kernel_size, energy_predictor_filter_size,
+                 p_energy_predictor_dropout, energy_predictor_n_layers,
+                 energy_embedding_kernel_size,
+                 n_speakers, speaker_emb_weight, pitch_conditioning_formants=1):
        super(FastPitch, self).__init__()
-        del max_seq_len  # unused

        self.encoder = FFTransformer(
            n_layer=in_fft_n_layers, n_head=in_fft_n_heads,
@ -142,11 +173,12 @@ class FastPitch(nn.Module):
            in_fft_output_size,
            filter_size=pitch_predictor_filter_size,
            kernel_size=pitch_predictor_kernel_size,
-            dropout=p_pitch_predictor_dropout, n_layers=pitch_predictor_n_layers
+            dropout=p_pitch_predictor_dropout, n_layers=pitch_predictor_n_layers,
+            n_predictions=pitch_conditioning_formants
        )

        self.pitch_emb = nn.Conv1d(
-            1, symbols_embedding_dim,
+            pitch_conditioning_formants, symbols_embedding_dim,
            kernel_size=pitch_embedding_kernel_size,
            padding=int((pitch_embedding_kernel_size - 1) / 2))

@ -154,11 +186,64 @@ class FastPitch(nn.Module):
        self.register_buffer('pitch_mean', torch.zeros(1))
        self.register_buffer('pitch_std', torch.zeros(1))

+        self.energy_conditioning = energy_conditioning
+        if energy_conditioning:
+            self.energy_predictor = TemporalPredictor(
+                in_fft_output_size,
+                filter_size=energy_predictor_filter_size,
+                kernel_size=energy_predictor_kernel_size,
+                dropout=p_energy_predictor_dropout,
+                n_layers=energy_predictor_n_layers,
+                n_predictions=1
+            )
+
+            self.energy_emb = nn.Conv1d(
+                1, symbols_embedding_dim,
+                kernel_size=energy_embedding_kernel_size,
+                padding=int((energy_embedding_kernel_size - 1) / 2))
+
        self.proj = nn.Linear(out_fft_output_size, n_mel_channels, bias=True)

-    def forward(self, inputs, use_gt_durations=True, use_gt_pitch=True,
-                pace=1.0, max_duration=75):
-        inputs, _, mel_tgt, _, dur_tgt, _, pitch_tgt, speaker = inputs
+        self.attention = ConvAttention(
+            n_mel_channels, 0, symbols_embedding_dim,
+            use_query_proj=True, align_query_enc_type='3xconv')
+
+    def binarize_attention(self, attn, in_lens, out_lens):
+        """For training purposes only. Binarizes attention with MAS.
+           These will no longer receive a gradient.
+
+        Args:
+            attn: B x 1 x max_mel_len x max_text_len
+        """
+        b_size = attn.shape[0]
+        with torch.no_grad():
+            attn_cpu = attn.data.cpu().numpy()
+            attn_out = torch.zeros_like(attn)
+            for ind in range(b_size):
+                hard_attn = mas_width1(
+                    attn_cpu[ind, 0, :out_lens[ind], :in_lens[ind]])
+                attn_out[ind, 0, :out_lens[ind], :in_lens[ind]] = torch.tensor(
+                    hard_attn, device=attn.get_device())
+        return attn_out
+
+    def binarize_attention_parallel(self, attn, in_lens, out_lens):
+        """For training purposes only. Binarizes attention with MAS.
+           These will no longer receive a gradient.
+
+        Args:
+            attn: B x 1 x max_mel_len x max_text_len
+        """
+        with torch.no_grad():
+            attn_cpu = attn.data.cpu().numpy()
+            attn_out = b_mas(attn_cpu, in_lens.cpu().numpy(),
+                             out_lens.cpu().numpy(), width=1)
+        return torch.from_numpy(attn_out).to(attn.get_device())
+
+    def forward(self, inputs, use_gt_pitch=True, pace=1.0, max_duration=75):
+
+        (inputs, input_lens, mel_tgt, mel_lens, pitch_dense, energy_dense,
+         speaker, attn_prior, audiopaths) = inputs
+
        mel_max_len = mel_tgt.size(2)

        # Calculate speaker embedding
@ -171,54 +256,88 @@ class FastPitch(nn.Module):
        # Input FFT
        enc_out, enc_mask = self.encoder(inputs, conditioning=spk_emb)

-        # Embedded for predictors
-        pred_enc_out, pred_enc_mask = enc_out, enc_mask
+        # Alignment
+        text_emb = self.encoder.word_emb(inputs)
+
+        # make sure to do the alignments before folding
+        attn_mask = mask_from_lens(input_lens)[..., None] == 0
+        # attn_mask should be 1 for unused timesteps in the text_enc_w_spkvec tensor
+
+        attn_soft, attn_logprob = self.attention(
+            mel_tgt, text_emb.permute(0, 2, 1), mel_lens, attn_mask,
+            key_lens=input_lens, keys_encoded=enc_out, attn_prior=attn_prior)
+
+        attn_hard = self.binarize_attention_parallel(
+            attn_soft, input_lens, mel_lens)
+
+        # Viterbi --> durations
+        attn_hard_dur = attn_hard.sum(2)[:, 0, :]
+        dur_tgt = attn_hard_dur
+
+        assert torch.all(torch.eq(dur_tgt.sum(dim=1), mel_lens))

        # Predict durations
-        log_dur_pred = self.duration_predictor(pred_enc_out, pred_enc_mask)
+        log_dur_pred = self.duration_predictor(enc_out, enc_mask).squeeze(-1)
        dur_pred = torch.clamp(torch.exp(log_dur_pred) - 1, 0, max_duration)

        # Predict pitch
-        pitch_pred = self.pitch_predictor(enc_out, enc_mask)
+        pitch_pred = self.pitch_predictor(enc_out, enc_mask).permute(0, 2, 1)
+
+        # Average pitch over characters
+        pitch_tgt = average_pitch(pitch_dense, dur_tgt)

        if use_gt_pitch and pitch_tgt is not None:
-            pitch_emb = self.pitch_emb(pitch_tgt.unsqueeze(1))
+            pitch_emb = self.pitch_emb(pitch_tgt)
        else:
-            pitch_emb = self.pitch_emb(pitch_pred.unsqueeze(1))
+            pitch_emb = self.pitch_emb(pitch_pred)
        enc_out = enc_out + pitch_emb.transpose(1, 2)

+        # Predict energy
+        if self.energy_conditioning:
+            energy_pred = self.energy_predictor(enc_out, enc_mask).squeeze(-1)
+
+            # Average energy over characters
+            energy_tgt = average_pitch(energy_dense.unsqueeze(1), dur_tgt)
+            energy_tgt = torch.log(1.0 + energy_tgt)
+
+            energy_emb = self.energy_emb(energy_tgt)
+            energy_tgt = energy_tgt.squeeze(1)
+            enc_out = enc_out + energy_emb.transpose(1, 2)
+        else:
+            energy_pred = None
+            energy_tgt = None
+
        len_regulated, dec_lens = regulate_len(
-            dur_tgt if use_gt_durations else dur_pred,
-            enc_out, pace, mel_max_len)
+            dur_tgt, enc_out, pace, mel_max_len)

        # Output FFT
        dec_out, dec_mask = self.decoder(len_regulated, dec_lens)
        mel_out = self.proj(dec_out)
-        return mel_out, dec_mask, dur_pred, log_dur_pred, pitch_pred
+        return (mel_out, dec_mask, dur_pred, log_dur_pred, pitch_pred,
+                pitch_tgt, energy_pred, energy_tgt, attn_soft, attn_hard,
+                attn_hard_dur, attn_logprob)

-    def infer(self, inputs, input_lens, pace=1.0, dur_tgt=None, pitch_tgt=None,
-              pitch_transform=None, max_duration=75, speaker=0):
-        del input_lens  # unused
+    def infer(self, inputs, pace=1.0, dur_tgt=None, pitch_tgt=None,
+              energy_tgt=None, pitch_transform=None, max_duration=75,
+              speaker=0):

        if self.speaker_emb is None:
            spk_emb = 0
        else:
-            speaker = torch.ones(inputs.size(0)).long().to(inputs.device) * speaker
+            speaker = (torch.ones(inputs.size(0)).long().to(inputs.device)
+                       * speaker)
            spk_emb = self.speaker_emb(speaker).unsqueeze(1)
            spk_emb.mul_(self.speaker_emb_weight)

        # Input FFT
        enc_out, enc_mask = self.encoder(inputs, conditioning=spk_emb)

-        # Embedded for predictors
-        pred_enc_out, pred_enc_mask = enc_out, enc_mask
-
        # Predict durations
-        log_dur_pred = self.duration_predictor(pred_enc_out, pred_enc_mask)
+        log_dur_pred = self.duration_predictor(enc_out, enc_mask).squeeze(-1)
        dur_pred = torch.clamp(torch.exp(log_dur_pred) - 1, 0, max_duration)

        # Pitch over chars
-        pitch_pred = self.pitch_predictor(enc_out, enc_mask)
+        pitch_pred = self.pitch_predictor(enc_out, enc_mask).permute(0, 2, 1)

        if pitch_transform is not None:
            if self.pitch_std[0] == 0.0:
@ -226,15 +345,28 @@ class FastPitch(nn.Module):
                mean, std = 218.14, 67.24
            else:
                mean, std = self.pitch_mean[0], self.pitch_std[0]
-            pitch_pred = pitch_transform(pitch_pred, enc_mask.sum(dim=(1,2)), mean, std)
-
+            pitch_pred = pitch_transform(pitch_pred, enc_mask.sum(dim=(1,2)),
+                                         mean, std)
        if pitch_tgt is None:
-            pitch_emb = self.pitch_emb(pitch_pred.unsqueeze(1)).transpose(1, 2)
+            pitch_emb = self.pitch_emb(pitch_pred).transpose(1, 2)
        else:
-            pitch_emb = self.pitch_emb(pitch_tgt.unsqueeze(1)).transpose(1, 2)
+            pitch_emb = self.pitch_emb(pitch_tgt).transpose(1, 2)

        enc_out = enc_out + pitch_emb

+        # Predict energy
+        if self.energy_conditioning:
+            energy_pred = None  # stays None when energy_tgt is supplied
+            if energy_tgt is None:
+                energy_pred = self.energy_predictor(enc_out, enc_mask).squeeze(-1)
+                energy_emb = self.energy_emb(energy_pred.unsqueeze(1)).transpose(1, 2)
+            else:
+                energy_emb = self.energy_emb(energy_tgt).transpose(1, 2)
+
+            enc_out = enc_out + energy_emb
+        else:
+            energy_pred = None
+
        len_regulated, dec_lens = regulate_len(
            dur_pred if dur_tgt is None else dur_tgt,
            enc_out, pace, mel_max_len=None)
@ -243,4 +375,4 @@ class FastPitch(nn.Module):
        mel_out = self.proj(dec_out)
        # mel_lens = dec_mask.squeeze(2).sum(axis=1).long()
        mel_out = mel_out.permute(0, 2, 1)  # For inference.py
-        return mel_out, dec_lens, dur_pred, pitch_pred
+        return mel_out, dec_lens, dur_pred, pitch_pred, energy_pred
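
Note on the duration decoding in `infer` above: the predictor output is exponentiated, shifted by one, and clamped to `[0, max_duration]`, which suggests the network is trained on something like `log(1 + duration)`. That reading is an assumption drawn from the inference code only. A minimal plain-Python sketch of the same arithmetic follows; `decode_durations` is a hypothetical helper name, and lists stand in for tensors so the example has no dependencies.

```python
import math


def decode_durations(log_dur_pred, max_duration=75):
    """Invert log-domain duration predictions, then clamp.

    Mirrors torch.clamp(torch.exp(x) - 1, 0, max_duration) element-wise,
    using plain floats instead of tensors (illustration only).
    """
    decoded = []
    for x in log_dur_pred:
        dur = math.exp(x) - 1.0          # undo the assumed log(1 + d) encoding
        dur = max(dur, 0.0)              # negative durations are not meaningful
        dur = min(dur, max_duration)     # cap runaway predictions
        decoded.append(dur)
    return decoded


if __name__ == "__main__":
    print(decode_durations([-1.0, 0.0, math.log(4.0), 10.0]))
```

The clamp matters at both ends: a strongly negative prediction would otherwise give a negative frame count, and a large one would stretch a single symbol over thousands of frames.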
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/model_jit.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/model_jit.py
@ -25,62 +25,41 @@
 #
 # *****************************************************************************

-from typing import List, Optional
+from typing import Optional

 import torch
 from torch import nn as nn
-from torch.nn.utils.rnn import pad_sequence
-import torch.nn.functional as F

-from common.layers import ConvReLUNorm
+from fastpitch.model import TemporalPredictor
 from fastpitch.transformer_jit import FFTransformer


-def regulate_len(durations, enc_out, pace=1.0, mel_max_len=0):
-    # type: (Tensor, Tensor, float, int)
+def regulate_len(durations, enc_out, pace: float = 1.0,
+                 mel_max_len: Optional[int] = None):
    """If target=None, then predicted durations are applied"""
-    dtype = enc_out.dtype
-    reps = (durations.float() / pace + 0.5).long()
+    reps = torch.round(durations.float() / pace).long()
    dec_lens = reps.sum(dim=1)

    max_len = dec_lens.max()
-    reps_cumsum = torch.cumsum(F.pad(reps, (1, 0, 0, 0)), dim=1)[:, None, :]
+    bsz, _, hid = enc_out.size()

-    range_ = torch.arange(max_len).to(enc_out.device)[None, :, None]
-    mult = ((reps_cumsum[:, :, :-1] <= range_) &
-            (reps_cumsum[:, :, 1:] > range_))
-    mult = mult.to(dtype)
-    enc_rep = torch.matmul(mult, enc_out)
+    reps_padded = torch.cat([reps, (max_len - dec_lens)[:, None]], dim=1)
+    pad_vec = torch.zeros(bsz, 1, hid, dtype=enc_out.dtype,
+                          device=enc_out.device)

-    if mel_max_len != 0:
+    enc_rep = torch.cat([enc_out, pad_vec], dim=1)
+    enc_rep = torch.repeat_interleave(
+        enc_rep.view(-1, hid), reps_padded.view(-1), dim=0
+    ).view(bsz, -1, hid)
+
+    if mel_max_len is not None:
        enc_rep = enc_rep[:, :mel_max_len]
        dec_lens = torch.clamp_max(dec_lens, mel_max_len)
    return enc_rep, dec_lens


-class TemporalPredictor(nn.Module):
-    """Predicts a single float per each temporal location"""
-
-    def __init__(self, input_size, filter_size, kernel_size, dropout,
-                 n_layers=2):
-        super(TemporalPredictor, self).__init__()
-
-        self.layers = nn.Sequential(*[
-            ConvReLUNorm(input_size if i == 0 else filter_size, filter_size,
-                         kernel_size=kernel_size, dropout=dropout)
-            for i in range(n_layers)]
-        )
-        self.fc = nn.Linear(filter_size, 1, bias=True)
-
-    def forward(self, enc_out, enc_out_mask):
-        out = enc_out * enc_out_mask
-        out = self.layers(out.transpose(1, 2)).transpose(1, 2)
-        out = self.fc(out) * enc_out_mask
-        return out.squeeze(-1)
-
-
-class FastPitch(nn.Module):
-    def __init__(self, n_mel_channels, max_seq_len, n_symbols, padding_idx,
+class FastPitchJIT(nn.Module):
+    def __init__(self, n_mel_channels, n_symbols, padding_idx,
                 symbols_embedding_dim, in_fft_n_layers, in_fft_n_heads,
                 in_fft_d_head,
                 in_fft_conv1d_kernel_size, in_fft_conv1d_filter_size,
@ -94,9 +73,13 @@ class FastPitch(nn.Module):
                 p_dur_predictor_dropout, dur_predictor_n_layers,
                 pitch_predictor_kernel_size, pitch_predictor_filter_size,
                 p_pitch_predictor_dropout, pitch_predictor_n_layers,
-                 pitch_embedding_kernel_size, n_speakers, speaker_emb_weight):
-        super(FastPitch, self).__init__()
-        del max_seq_len  # unused
+                 pitch_embedding_kernel_size,
+                 energy_conditioning,
+                 energy_predictor_kernel_size, energy_predictor_filter_size,
+                 p_energy_predictor_dropout, energy_predictor_n_layers,
+                 energy_embedding_kernel_size,
+                 n_speakers, speaker_emb_weight, pitch_conditioning_formants=1):
+        super(FastPitchJIT, self).__init__()

        self.encoder = FFTransformer(
            n_layer=in_fft_n_layers, n_head=in_fft_n_heads,
@ -142,11 +125,12 @@ class FastPitch(nn.Module):
            in_fft_output_size,
            filter_size=pitch_predictor_filter_size,
            kernel_size=pitch_predictor_kernel_size,
-            dropout=p_pitch_predictor_dropout, n_layers=pitch_predictor_n_layers
+            dropout=p_pitch_predictor_dropout, n_layers=pitch_predictor_n_layers,
+            n_predictions=pitch_conditioning_formants
        )

        self.pitch_emb = nn.Conv1d(
-            1, symbols_embedding_dim,
+            pitch_conditioning_formants, symbols_embedding_dim,
            kernel_size=pitch_embedding_kernel_size,
            padding=int((pitch_embedding_kernel_size - 1) / 2))

@ -154,89 +138,77 @@ class FastPitch(nn.Module):
        self.register_buffer('pitch_mean', torch.zeros(1))
        self.register_buffer('pitch_std', torch.zeros(1))

+        self.energy_conditioning = energy_conditioning
+        if energy_conditioning:
+            self.energy_predictor = TemporalPredictor(
+                in_fft_output_size,
+                filter_size=energy_predictor_filter_size,
+                kernel_size=energy_predictor_kernel_size,
+                dropout=p_energy_predictor_dropout,
+                n_layers=energy_predictor_n_layers,
+                n_predictions=1
+            )
+
+            self.energy_emb = nn.Conv1d(
+                1, symbols_embedding_dim,
+                kernel_size=energy_embedding_kernel_size,
+                padding=int((energy_embedding_kernel_size - 1) / 2))
+
        self.proj = nn.Linear(out_fft_output_size, n_mel_channels, bias=True)

-    def forward(self, inputs: List[torch.Tensor], use_gt_durations: bool = True,
-                use_gt_pitch: bool = True, pace: float = 1.0,
-                max_duration: int = 75):
-        inputs, _, mel_tgt, _, dur_tgt, _, pitch_tgt, speaker = inputs
-        mel_max_len = mel_tgt.size(2)
+        # skip self.attention (used only in training)

-        # Calculate speaker embedding
-        if self.speaker_emb is None:
-            spk_emb = 0
-        else:
-            spk_emb = self.speaker_emb(speaker).unsqueeze(1)
-            spk_emb.mul_(self.speaker_emb_weight)
-
-        # Input FFT
-        enc_out, enc_mask = self.encoder(inputs, conditioning=spk_emb)
-
-        # Embedded for predictors
-        pred_enc_out, pred_enc_mask = enc_out, enc_mask
-
-        # Predict durations
-        log_dur_pred = self.duration_predictor(pred_enc_out, pred_enc_mask)
-        dur_pred = torch.clamp(torch.exp(log_dur_pred) - 1, 0, max_duration)
-
-        # Predict pitch
-        pitch_pred = self.pitch_predictor(enc_out, enc_mask)
-
-        if use_gt_pitch and pitch_tgt is not None:
-            pitch_emb = self.pitch_emb(pitch_tgt.unsqueeze(1))
-        else:
-            pitch_emb = self.pitch_emb(pitch_pred.unsqueeze(1))
-        enc_out = enc_out + pitch_emb.transpose(1, 2)
-
-        len_regulated, dec_lens = regulate_len(
-            dur_tgt if use_gt_durations else dur_pred,
-            enc_out, pace, mel_max_len)
-
-        # Output FFT
-        dec_out, dec_mask = self.decoder(len_regulated, dec_lens)
-        mel_out = self.proj(dec_out)
-        return mel_out, dec_mask, dur_pred, log_dur_pred, pitch_pred
-
-    def infer(self, inputs, input_lens, pace: float = 1.0,
+    def infer(self, inputs, pace: float = 1.0,
              dur_tgt: Optional[torch.Tensor] = None,
              pitch_tgt: Optional[torch.Tensor] = None,
-              max_duration: float = 75,
+              energy_tgt: Optional[torch.Tensor] = None,
              speaker: int = 0):
-        del input_lens  # unused

        if self.speaker_emb is None:
            spk_emb = None
        else:
-            speaker = torch.ones(inputs.size(0), dtype=torch.long, device=inputs.device).fill_(speaker)
+            speaker = (torch.ones(inputs.size(0)).long().to(inputs.device)
+                       * speaker)
            spk_emb = self.speaker_emb(speaker).unsqueeze(1)
            spk_emb.mul_(self.speaker_emb_weight)

        # Input FFT
        enc_out, enc_mask = self.encoder(inputs, conditioning=spk_emb)

-        # Embedded for predictors
-        pred_enc_out, pred_enc_mask = enc_out, enc_mask
-
        # Predict durations
-        log_dur_pred = self.duration_predictor(pred_enc_out, pred_enc_mask)
-        dur_pred = torch.clamp(torch.exp(log_dur_pred) - 1, 0, max_duration)
+        log_dur_pred = self.duration_predictor(enc_out, enc_mask).squeeze(-1)
+        dur_pred = torch.clamp(torch.exp(log_dur_pred) - 1, 0, 100.0)

        # Pitch over chars
-        pitch_pred = self.pitch_predictor(enc_out, enc_mask)
+        pitch_pred = self.pitch_predictor(enc_out, enc_mask).permute(0, 2, 1)

        if pitch_tgt is None:
-            pitch_emb = self.pitch_emb(pitch_pred.unsqueeze(1)).transpose(1, 2)
+            pitch_emb = self.pitch_emb(pitch_pred).transpose(1, 2)
        else:
-            pitch_emb = self.pitch_emb(pitch_tgt.unsqueeze(1)).transpose(1, 2)
+            pitch_emb = self.pitch_emb(pitch_tgt).transpose(1, 2)

        enc_out = enc_out + pitch_emb

+        # Predict energy
+        if self.energy_conditioning:
+
+            if energy_tgt is None:
+                energy_pred = self.energy_predictor(enc_out, enc_mask).squeeze(-1)
+                energy_emb = self.energy_emb(energy_pred.unsqueeze(1)).transpose(1, 2)
+            else:
+                energy_pred = None
+                energy_emb = self.energy_emb(energy_tgt).transpose(1, 2)
+
+            enc_out = enc_out + energy_emb
+        else:
+            energy_pred = None
+
        len_regulated, dec_lens = regulate_len(
            dur_pred if dur_tgt is None else dur_tgt,
-            enc_out, pace, mel_max_len=0)
+            enc_out, pace, mel_max_len=None)

        dec_out, dec_mask = self.decoder(len_regulated, dec_lens)
        mel_out = self.proj(dec_out)
        # mel_lens = dec_mask.squeeze(2).sum(axis=1).long()
        mel_out = mel_out.permute(0, 2, 1)  # For inference.py
-        return mel_out, dec_lens, dur_pred, pitch_pred
+        return mel_out, dec_lens, dur_pred, pitch_pred, energy_pred
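
Note on the rewritten `regulate_len` in `model_jit.py` above: each encoder frame is repeated by its rounded duration, and shorter batch items are padded with zero vectors up to the longest item. The sketch below restates that idea with plain Python lists so it runs without PyTorch; `regulate_len_lists` is a hypothetical name, not part of the repository. It also shows one behavioural difference worth knowing: Python's `round`, like `torch.round`, rounds halves to the nearest even integer, whereas the previous `(x + 0.5).long()` rounded halves up.

```python
def regulate_len_lists(durations, enc_out, pace=1.0, mel_max_len=None):
    """List-based analogue of the tensor length regulator (illustration only).

    durations: per-item lists of per-symbol durations (floats, >= 0)
    enc_out:   per-item lists of per-symbol feature vectors (lists of floats)
    """
    # Rounded repeat counts; halves round to even, as torch.round does.
    reps = [[int(round(d / pace)) for d in row] for row in durations]
    dec_lens = [sum(row) for row in reps]
    max_len = max(dec_lens)
    hid = len(enc_out[0][0])

    enc_rep = []
    for row_reps, row_enc, dec_len in zip(reps, enc_out, dec_lens):
        frames = []
        for count, vec in zip(row_reps, row_enc):
            frames.extend(list(vec) for _ in range(count))
        # Pad with zero vectors so every item reaches the batch maximum.
        frames.extend([0.0] * hid for _ in range(max_len - dec_len))
        enc_rep.append(frames)

    if mel_max_len is not None:
        enc_rep = [frames[:mel_max_len] for frames in enc_rep]
        dec_lens = [min(n, mel_max_len) for n in dec_lens]
    return enc_rep, dec_lens


if __name__ == "__main__":
    enc, lens = regulate_len_lists(
        [[2.0, 1.0], [1.0, 0.0]],
        [[[1.0], [2.0]], [[3.0], [4.0]]],
    )
    print(enc, lens)
```

With `pace=2.0` and a duration of `5.0`, this version yields 2 frames, while the older `+ 0.5` truncation would have yielded 3.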
--- a/PyTorch/SpeechSynthesis/FastPitch/fastpitch/transformer_jit.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/fastpitch/transformer_jit.py
@ -21,11 +21,6 @@ import torch.nn.functional as F
 from common.utils import mask_from_lens


-class NoOp(nn.Module):
-    def forward(self, x):
-        return x
-
-
 class PositionalEmbedding(nn.Module):
    def __init__(self, demb):
        super(PositionalEmbedding, self).__init__()
@ -221,7 +216,7 @@ class FFTransformer(nn.Module):
            self.word_emb = nn.Embedding(n_embed, d_embed or d_model,
                                         padding_idx=self.padding_idx)
        else:
-            self.word_emb = NoOp()
+            self.word_emb = nn.Identity()

        self.pos_emb = PositionalEmbedding(self.d_model)
        self.drop = nn.Dropout(dropemb)
--- a/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_pitch_text_test.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_pitch_text_test.txt
@ -0,0 +1,500 @@
+wavs/LJ045-0096.wav|pitch/LJ045-0096.pt|Mrs. De Mohrenschildt thought that Oswald,
+wavs/LJ049-0022.wav|pitch/LJ049-0022.pt|The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.
+wavs/LJ033-0042.wav|pitch/LJ033-0042.pt|Between the hours of eight and nine p.m. they were occupied with the children in the bedrooms located at the extreme east end of the house.
+wavs/LJ016-0117.wav|pitch/LJ016-0117.pt|The prisoner had nothing to deal with but wooden panels, and by dint of cutting and chopping he got both the lower panels out.
+wavs/LJ025-0157.wav|pitch/LJ025-0157.pt|Under these circumstances, unnatural as they are, with proper management, the bean will thrust forth its radicle and its plumule;
+wavs/LJ042-0219.wav|pitch/LJ042-0219.pt|Oswald demonstrated his thinking in connection with his return to the United States by preparing two sets of identical questions of the type which he might have thought
+wavs/LJ032-0164.wav|pitch/LJ032-0164.pt|it is not possible to state with scientific certainty that a particular small group of fibers come from a certain piece of clothing
+wavs/LJ046-0092.wav|pitch/LJ046-0092.pt|has confidence in the dedicated Secret Service men who are ready to lay down their lives for him
+wavs/LJ050-0118.wav|pitch/LJ050-0118.pt|Since these agencies are already obliged constantly to evaluate the activities of such groups,
+wavs/LJ043-0016.wav|pitch/LJ043-0016.pt|Jeanne De Mohrenschildt said, quote,
+wavs/LJ021-0078.wav|pitch/LJ021-0078.pt|no economic panacea, which could simply revive over-night the heavy industries and the trades dependent upon them.
+wavs/LJ039-0148.wav|pitch/LJ039-0148.pt|Examination of the cartridge cases found on the sixth floor of the Depository Building
+wavs/LJ047-0202.wav|pitch/LJ047-0202.pt|testified that the information available to the Federal Government about Oswald before the assassination would, if known to PRS,
+wavs/LJ023-0056.wav|pitch/LJ023-0056.pt|It is an easy document to understand when you remember that it was called into being
+wavs/LJ021-0025.wav|pitch/LJ021-0025.pt|And in many directions, the intervention of that organized control which we call government
+wavs/LJ030-0105.wav|pitch/LJ030-0105.pt|Communications in the motorcade.
+wavs/LJ021-0012.wav|pitch/LJ021-0012.pt|with respect to industry and business, but nearly all are agreed that private enterprise in times such as these
+wavs/LJ019-0169.wav|pitch/LJ019-0169.pt|and one or two men were allowed to mend clothes and make shoes. The rules made by the Secretary of State were hung up in conspicuous parts of the prison;
+wavs/LJ039-0088.wav|pitch/LJ039-0088.pt|It just is an aid in seeing in the fact that you only have the one element, the crosshair,
+wavs/LJ016-0192.wav|pitch/LJ016-0192.pt|"I think I could do that sort of job," said Calcraft, on the spur of the moment.
+wavs/LJ014-0142.wav|pitch/LJ014-0142.pt|was strewn in front of the dock, and sprinkled it towards the bench with a contemptuous gesture.
+wavs/LJ012-0015.wav|pitch/LJ012-0015.pt|Weedon and Lecasser to twelve and six months respectively in Coldbath Fields.
+wavs/LJ048-0033.wav|pitch/LJ048-0033.pt|Prior to November twenty-two, nineteen sixty-three
+wavs/LJ028-0349.wav|pitch/LJ028-0349.pt|who were each required to send so large a number to Babylon, that in all there were collected no fewer than fifty thousand.
+wavs/LJ030-0197.wav|pitch/LJ030-0197.pt|At first Mrs. Connally thought that her husband had been killed,
+wavs/LJ017-0133.wav|pitch/LJ017-0133.pt|Palmer speedily found imitators.
+wavs/LJ034-0123.wav|pitch/LJ034-0123.pt|Although Brennan testified that the man in the window was standing when he fired the shots, most probably he was either sitting or kneeling.
+wavs/LJ003-0282.wav|pitch/LJ003-0282.pt|Many years were to elapse before these objections should be fairly met and universally overcome.
+wavs/LJ032-0204.wav|pitch/LJ032-0204.pt|Special Agent Lyndal L. Shaneyfelt, a photography expert with the FBI,
+wavs/LJ016-0241.wav|pitch/LJ016-0241.pt|Calcraft served the city of London till eighteen seventy-four, when he was pensioned at the rate of twenty-five shillings per week.
+wavs/LJ023-0033.wav|pitch/LJ023-0033.pt|we will not allow ourselves to run around in new circles of futile discussion and debate, always postponing the day of decision.
+wavs/LJ009-0286.wav|pitch/LJ009-0286.pt|There has never been much science in the system of carrying out the extreme penalty in this country; the "finisher of the law"
+wavs/LJ008-0181.wav|pitch/LJ008-0181.pt|he had his pockets filled with bread and cheese, and it was generally supposed that he had come a long distance to see the fatal show.
+wavs/LJ015-0052.wav|pitch/LJ015-0052.pt|to the value of twenty thousand pounds.
+wavs/LJ016-0314.wav|pitch/LJ016-0314.pt|Sir George Grey thought there was a growing feeling in favor of executions within the prison precincts.
+wavs/LJ047-0056.wav|pitch/LJ047-0056.pt|From August nineteen sixty-two
+wavs/LJ010-0027.wav|pitch/LJ010-0027.pt|Nor did the methods by which they were perpetrated greatly vary from those in times past.
+wavs/LJ010-0065.wav|pitch/LJ010-0065.pt|At the former the "Provisional Government" was to be established,
+wavs/LJ046-0113.wav|pitch/LJ046-0113.pt|The Commission has concluded that at the time of the assassination
+wavs/LJ028-0410.wav|pitch/LJ028-0410.pt|There among the ruins they still live in the same kind of houses,
+wavs/LJ044-0137.wav|pitch/LJ044-0137.pt|More seriously, the facts of his defection had become known, leaving him open to almost unanswerable attack by those who opposed his views.
+wavs/LJ008-0215.wav|pitch/LJ008-0215.pt|One by one the huge uprights of black timber were fitted together,
+wavs/LJ030-0084.wav|pitch/LJ030-0084.pt|or when the press of the crowd made it impossible for the escort motorcycles to stay in position on the car's rear flanks.
+wavs/LJ020-0092.wav|pitch/LJ020-0092.pt|Have yourself called on biscuit mornings an hour earlier than usual.
+wavs/LJ029-0096.wav|pitch/LJ029-0096.pt|On November fourteen, Lawson and Sorrels attended a meeting at Love Field
+wavs/LJ015-0308.wav|pitch/LJ015-0308.pt|and others who swore to the meetings of the conspirators and their movements. Saward was found guilty,
+wavs/LJ012-0067.wav|pitch/LJ012-0067.pt|But Mrs. Solomons could not resist the temptation to dabble in stolen goods, and she was found shipping watches of the wrong category to New York.
+wavs/LJ018-0231.wav|pitch/LJ018-0231.pt|namely, to suppress it and substitute another.
+wavs/LJ014-0265.wav|pitch/LJ014-0265.pt|and later he became manager of the newly rebuilt Olympic at Wych Street.
+wavs/LJ024-0102.wav|pitch/LJ024-0102.pt|would be the first to exclaim as soon as an amendment was proposed
+wavs/LJ007-0233.wav|pitch/LJ007-0233.pt|it consists of several circular perforations, about two inches in diameter,
+wavs/LJ013-0213.wav|pitch/LJ013-0213.pt|This seems to have decided Courvoisier,
+wavs/LJ032-0045.wav|pitch/LJ032-0045.pt|This price included nineteen dollars, ninety-five cents for the rifle and the scope, and one dollar, fifty cents for postage and handling.
+wavs/LJ011-0048.wav|pitch/LJ011-0048.pt|Wherefore let him that thinketh he standeth take heed lest he fall," and was full of the most pointed allusions to the culprit.
+wavs/LJ005-0294.wav|pitch/LJ005-0294.pt|It was frequently stated in evidence that the jail of the borough was in so unfit a state for the reception of prisoners,
+wavs/LJ016-0007.wav|pitch/LJ016-0007.pt|There were others less successful.
+wavs/LJ028-0138.wav|pitch/LJ028-0138.pt|perhaps the tales that travelers told him were exaggerated as travelers' tales are likely to be,
+wavs/LJ050-0029.wav|pitch/LJ050-0029.pt|that is reflected in definite and comprehensive operating procedures.
+wavs/LJ014-0121.wav|pitch/LJ014-0121.pt|The prisoners were in due course transferred to Newgate, to be put upon their trial at the Central Criminal Court.
+wavs/LJ014-0146.wav|pitch/LJ014-0146.pt|They had to handcuff her by force against the most violent resistance, and still she raged and stormed,
+wavs/LJ046-0111.wav|pitch/LJ046-0111.pt|The Secret Service has attempted to perform this function through the activities of its Protective Research Section
+wavs/LJ012-0257.wav|pitch/LJ012-0257.pt|But the affair still remained a profound mystery. No light was thrown upon it till, towards the end of March,
+wavs/LJ002-0260.wav|pitch/LJ002-0260.pt|Yet the public opinion of the whole body seems to have checked dissipation.
+wavs/LJ031-0014.wav|pitch/LJ031-0014.pt|the Presidential limousine arrived at the emergency entrance of the Parkland Hospital at about twelve:thirty-five p.m.
+wavs/LJ047-0093.wav|pitch/LJ047-0093.pt|Oswald was arrested and jailed by the New Orleans Police Department for disturbing the peace, in connection with a street fight which broke out when he was accosted
+wavs/LJ003-0324.wav|pitch/LJ003-0324.pt|gaming of all sorts should be peremptorily forbidden under heavy pains and penalties.
+wavs/LJ021-0115.wav|pitch/LJ021-0115.pt|we have reached into the heart of the problem which is to provide such annual earnings for the lowest paid worker as will meet his minimum needs.
+wavs/LJ046-0191.wav|pitch/LJ046-0191.pt|it had established periodic regular review of the status of four hundred individuals;
+wavs/LJ034-0197.wav|pitch/LJ034-0197.pt|who was one of the first witnesses to alert the police to the Depository as the source of the shots, as has been discussed in chapter three.
+wavs/LJ002-0253.wav|pitch/LJ002-0253.pt|were governed by rules which they themselves had framed, and under which subscriptions were levied
+wavs/LJ048-0288.wav|pitch/LJ048-0288.pt|might have been more alert in the Dallas motorcade if they had retired promptly in Fort Worth.
+wavs/LJ007-0112.wav|pitch/LJ007-0112.pt|Many of the old customs once prevalent in the State Side, so properly condemned and abolished,
+wavs/LJ017-0189.wav|pitch/LJ017-0189.pt|who was presently attacked in the same way as the others, but, but, thanks to the prompt administration of remedies, he recovered.
+wavs/LJ042-0230.wav|pitch/LJ042-0230.pt|basically, although I hate the USSR and socialist system I still think marxism can work under different circumstances, end quote.
+wavs/LJ050-0161.wav|pitch/LJ050-0161.pt|The Secret Service should not and does not plan to develop its own intelligence gathering facilities to duplicate the existing facilities of other Federal agencies.
+wavs/LJ003-0011.wav|pitch/LJ003-0011.pt|that not more than one bottle of wine or one quart of beer could be issued at one time. No account was taken of the amount of liquors admitted in one day,
+wavs/LJ008-0206.wav|pitch/LJ008-0206.pt|and caused a number of stout additional barriers to be erected in front of the scaffold,
+wavs/LJ002-0261.wav|pitch/LJ002-0261.pt|The poorer prisoners were not in abject want, as in other prisons,
+wavs/LJ012-0189.wav|pitch/LJ012-0189.pt|Hunt, in consideration of the information he had given, escaped death, and was sentenced to transportation for life.
+wavs/LJ019-0317.wav|pitch/LJ019-0317.pt|The former, which consisted principally of the tread-wheel, cranks, capstans, shot-drill,
+wavs/LJ011-0041.wav|pitch/LJ011-0041.pt|Visited Mr. Fauntleroy. My application for books for him not having been attended, I had no prayer-book to give him.
+wavs/LJ023-0089.wav|pitch/LJ023-0089.pt|That is not only my accusation.
+wavs/LJ044-0224.wav|pitch/LJ044-0224.pt|would not agree with that particular wording, end quote.
+wavs/LJ013-0104.wav|pitch/LJ013-0104.pt|He found them at length residing at the latter place, one as a landed proprietor, the other as a publican.
+wavs/LJ013-0055.wav|pitch/LJ013-0055.pt|The jury did not believe him, and the verdict was for the defendants.
+wavs/LJ014-0306.wav|pitch/LJ014-0306.pt|These had been attributed to political action; some thought that the large purchases in foreign grains, effected at losing prices,
+wavs/LJ029-0052.wav|pitch/LJ029-0052.pt|To supplement the PRS files, the Secret Service depends largely on local police departments and local offices of other Federal agencies
+wavs/LJ028-0459.wav|pitch/LJ028-0459.pt|Its bricks, measuring about thirteen inches square and three inches in thickness, were burned and stamped with the usual short inscription:
+wavs/LJ017-0183.wav|pitch/LJ017-0183.pt|Soon afterwards Dixon died, showing all the symptoms already described.
+wavs/LJ009-0084.wav|pitch/LJ009-0084.pt|At length the ordinary pauses, and then, in a deep tone, which, though hardly above a whisper, is audible to all, says,
+wavs/LJ007-0170.wav|pitch/LJ007-0170.pt|That in this vast metropolis, the center of wealth, civilization, and information;
+wavs/LJ016-0277.wav|pitch/LJ016-0277.pt|This is proved by contemporary accounts, especially one graphic and realistic article which appeared in the 'Times,'
+wavs/LJ009-0061.wav|pitch/LJ009-0061.pt|He staggers towards the pew, reels into it, stumbles forward, flings himself on the ground, and, by a curious twist of the spine,
+wavs/LJ019-0201.wav|pitch/LJ019-0201.pt|to select a sufficiently spacious piece of ground, and erect a prison which from foundations to roofs should be in conformity with the newest ideas.
+wavs/LJ030-0063.wav|pitch/LJ030-0063.pt|He had repeated this wish only a few days before, during his visit to Tampa, Florida.
+wavs/LJ010-0257.wav|pitch/LJ010-0257.pt|a third miscreant made a similar but far less serious attempt in the month of July following.
+wavs/LJ009-0106.wav|pitch/LJ009-0106.pt|The keeper tries to appear unmoved, but his eye wanders anxiously over the combustible assembly.
+wavs/LJ008-0121.wav|pitch/LJ008-0121.pt|After the construction and action of the machine had been explained, the doctor asked the governor what kind of men he had commanded at Goree,
+wavs/LJ050-0069.wav|pitch/LJ050-0069.pt|the Secret Service had received from the FBI some nine thousand reports on members of the Communist Party.
+wavs/LJ006-0202.wav|pitch/LJ006-0202.pt|The news-vendor was also a tobacconist,
+wavs/LJ012-0230.wav|pitch/LJ012-0230.pt|Shortly before the day fixed for execution, Bishop made a full confession, the bulk of which bore the impress of truth,
+wavs/LJ005-0248.wav|pitch/LJ005-0248.pt|and stated that in his opinion Newgate, as the common jail of Middlesex, was wholly inadequate to the proper confinement of its prisoners.
+wavs/LJ037-0053.wav|pitch/LJ037-0053.pt|who had been greatly upset by her experience, was able to view a lineup of four men handcuffed together at the police station.
+wavs/LJ045-0177.wav|pitch/LJ045-0177.pt|For the first time
+wavs/LJ004-0036.wav|pitch/LJ004-0036.pt|it was hoped that their rulers would hire accommodation in the county prisons, and that the inferior establishments would in course of time disappear.
+wavs/LJ026-0054.wav|pitch/LJ026-0054.pt|carbohydrates (starch, cellulose) and fats.
+wavs/LJ020-0085.wav|pitch/LJ020-0085.pt|Break apart from one another and pile on a plate, throwing a clean doily or a small napkin over them. Break open at table.
+wavs/LJ046-0226.wav|pitch/LJ046-0226.pt|The several military intelligence agencies reported crank mail and similar threats involving the President.
+wavs/LJ014-0233.wav|pitch/LJ014-0233.pt|he shot an old soldier who had attempted to detain him. He was convicted and executed.
+wavs/LJ033-0152.wav|pitch/LJ033-0152.pt|The portion of the palm which was identified was the heel of the right palm, i.e., the area near the wrist, on the little finger side.
+wavs/LJ004-0009.wav|pitch/LJ004-0009.pt|as indefatigable and self-sacrificing, found by personal visitation that the condition of jails throughout the kingdom was,
+wavs/LJ017-0134.wav|pitch/LJ017-0134.pt|Within a few weeks occurred the Leeds poisoning case, in which the murderer undoubtedly was inspired by the facts made public at Palmer's trial.
+wavs/LJ019-0318.wav|pitch/LJ019-0318.pt|was to be the rule for all convicted prisoners throughout the early stages of their detention;
+wavs/LJ020-0093.wav|pitch/LJ020-0093.pt|Rise, wash face and hands, rinse the mouth out and brush back the hair.
+wavs/LJ012-0188.wav|pitch/LJ012-0188.pt|Probert was then admitted as a witness, and the case was fully proved against Thurtell, who was hanged in front of Hertford Jail.
+wavs/LJ019-0202.wav|pitch/LJ019-0202.pt|The preference given to the Pentonville system destroyed all hopes of a complete reformation of Newgate.
+wavs/LJ039-0027.wav|pitch/LJ039-0027.pt|Oswald's revolver
+wavs/LJ040-0176.wav|pitch/LJ040-0176.pt|He admitted to fantasies about being powerful and sometimes hurting and killing people, but refused to elaborate on them.
+wavs/LJ018-0354.wav|pitch/LJ018-0354.pt|Doubts were long entertained whether Thomas Wainwright,
+wavs/LJ031-0185.wav|pitch/LJ031-0185.pt|From the Presidential airplane, the Vice President telephoned Attorney General Robert F. Kennedy,
+wavs/LJ006-0137.wav|pitch/LJ006-0137.pt|They were not obliged to attend chapel, and seldom if ever went; "prisoners," said one of them under examination, "did not like the trouble of going to chapel."
+wavs/LJ032-0085.wav|pitch/LJ032-0085.pt|The Hidell signature on the notice of classification was in the handwriting of Oswald.
+wavs/LJ009-0037.wav|pitch/LJ009-0037.pt|the schoolmaster and the juvenile prisoners being seated round the communion-table, opposite the pulpit.
+wavs/LJ006-0021.wav|pitch/LJ006-0021.pt|Later on he had devoted himself to the personal investigation of the prisons of the United States.
+wavs/LJ006-0082.wav|pitch/LJ006-0082.pt|and this particular official took excellent care to select as residents for his own ward those most suitable from his own point of view.
+wavs/LJ016-0380.wav|pitch/LJ016-0380.pt|with hope to the last. There is always the chance of a flaw in the indictment, of a missing witness, or extenuating circumstances.
+wavs/LJ019-0344.wav|pitch/LJ019-0344.pt|monitor, or schoolmaster, nor to be engaged in the service of any officer of the prison.
+wavs/LJ019-0161.wav|pitch/LJ019-0161.pt|These disciplinary improvements were, however, only slowly and gradually introduced.
+wavs/LJ028-0145.wav|pitch/LJ028-0145.pt|And here I may not omit to tell the use to which the mould dug out of the great moat was turned, nor the manner wherein the wall was wrought.
+wavs/LJ018-0349.wav|pitch/LJ018-0349.pt|His disclaimer, distinct and detailed on every point, was intended simply for effect.
+wavs/LJ043-0010.wav|pitch/LJ043-0010.pt|Some of the members of that group saw a good deal of the Oswalds through the fall of nineteen sixty-three,
+wavs/LJ027-0178.wav|pitch/LJ027-0178.pt|These were undoubtedly perennibranchs. In the Permian and Triassic higher forms appeared, which were certainly caducibranch.
+wavs/LJ041-0070.wav|pitch/LJ041-0070.pt|He did not rise above the rank of private first class, even though he had passed a qualifying examination for the rank of corporal.
+wavs/LJ008-0266.wav|pitch/LJ008-0266.pt|Thus in the years between May first, eighteen twenty-seven, and thirtieth April, eighteen thirty-one,
+wavs/LJ021-0091.wav|pitch/LJ021-0091.pt|In this recent reorganization we have recognized three distinct functions:
+wavs/LJ019-0129.wav|pitch/LJ019-0129.pt|which marked the growth of public interest in prison affairs, and which was the germ of the new system
+wavs/LJ018-0215.wav|pitch/LJ018-0215.pt|William Roupell was the eldest but illegitimate son of a wealthy man who subsequently married Roupell's mother, and had further legitimate issue.
+wavs/LJ015-0194.wav|pitch/LJ015-0194.pt|and behaved so as to justify a belief that he had been a jail-bird all his life.
+wavs/LJ016-0137.wav|pitch/LJ016-0137.pt|that numbers of men, "lifers," and others with ten, fourteen, or twenty years to do, can be trusted to work out of doors without bolts and bars
+wavs/LJ002-0289.wav|pitch/LJ002-0289.pt|the latter raised eighteen pence among them to pay for a truss of straw for the poor woman to lie on.
+wavs/LJ023-0016.wav|pitch/LJ023-0016.pt|In nineteen thirty-three you and I knew that we must never let our economic system get completely out of joint again
+wavs/LJ011-0141.wav|pitch/LJ011-0141.pt|There were at the moment in Newgate six convicts sentenced to death for forging wills.
+wavs/LJ016-0283.wav|pitch/LJ016-0283.pt|to do them mere justice, there was at least till then a half-drunken ribald gaiety among the crowd that made them all akin."
+wavs/LJ035-0082.wav|pitch/LJ035-0082.pt|The only interval was the time necessary to ride in the elevator from the second to the sixth floor and walk back to the southeast corner.
+wavs/LJ045-0194.wav|pitch/LJ045-0194.pt|Anyone who was familiar with that area of Dallas would have known that the motorcade would probably pass the Texas School Book Depository to get from Main Street
+wavs/LJ009-0124.wav|pitch/LJ009-0124.pt|occupied when they saw it last, but a few hours ago, by their comrades who are now dead;
+wavs/LJ030-0162.wav|pitch/LJ030-0162.pt|In the Presidential Limousine
+wavs/LJ050-0223.wav|pitch/LJ050-0223.pt|The plan provides for an additional two hundred five agents for the Secret Service. Seventeen of this number are proposed for the Protective Research Section;
+wavs/LJ008-0228.wav|pitch/LJ008-0228.pt|their harsh and half-cracked voices full of maudlin, besotted sympathy for those about to die.
+wavs/LJ002-0096.wav|pitch/LJ002-0096.pt|The eight courts above enumerated were well supplied with water;
+wavs/LJ018-0288.wav|pitch/LJ018-0288.pt|After this the other conspirators traveled to obtain genuine bills and master the system of the leading houses at home and abroad.
+wavs/LJ002-0106.wav|pitch/LJ002-0106.pt|in which latterly a copper had been fixed for the cooking of provisions sent in by charitable persons.
+wavs/LJ025-0129.wav|pitch/LJ025-0129.pt|On each lobe of the bi-lobed leaf of Venus flytrap are three delicate filaments which stand out at right angles from the surface of the leaf.
+wavs/LJ044-0013.wav|pitch/LJ044-0013.pt|Hands Off Cuba, end quote, an application form for, and a membership card in,
+wavs/LJ049-0115.wav|pitch/LJ049-0115.pt|of the person who is actually in the exercise of the executive power, or
+wavs/LJ019-0145.wav|pitch/LJ019-0145.pt|But reformation was only skin deep. Below the surface many of the old evils still rankled.
+wavs/LJ019-0355.wav|pitch/LJ019-0355.pt|came up in all respects to modern requirements.
+wavs/LJ019-0289.wav|pitch/LJ019-0289.pt|There was unrestrained association of untried and convicted, juvenile with adult prisoners, vagrants, misdemeanants, felons.
+wavs/LJ048-0222.wav|pitch/LJ048-0222.pt|in Fort Worth, there occurred a breach of discipline by some members of the Secret Service who were officially traveling with the President.
+wavs/LJ016-0367.wav|pitch/LJ016-0367.pt|Under the new system the whole of the arrangements from first to last fell upon the officers.
+wavs/LJ047-0097.wav|pitch/LJ047-0097.pt|Agent Quigley did not know of Oswald's prior FBI record when he interviewed him,
+wavs/LJ007-0075.wav|pitch/LJ007-0075.pt|as effectually to rebuke and abash the profane spirit of the more insolent and daring of the criminals.
+wavs/LJ047-0022.wav|pitch/LJ047-0022.pt|provided by other agencies.
+wavs/LJ007-0085.wav|pitch/LJ007-0085.pt|at Newgate and York Castle as long as five years; "at Ilchester and Morpeth for seven years; at Warwick for eight years,
+wavs/LJ047-0075.wav|pitch/LJ047-0075.pt|Hosty had inquired earlier and found no evidence that it was functioning in the Dallas area.
+wavs/LJ008-0098.wav|pitch/LJ008-0098.pt|One was the "yeoman of the halter," a Newgate official, the executioner's assistant, whom Mr. J. T. Smith, who was present at the execution,
+wavs/LJ017-0102.wav|pitch/LJ017-0102.pt|The second attack was fatal, and ended in Cook's death from tetanus.
+wavs/LJ046-0105.wav|pitch/LJ046-0105.pt|Second, the adequacy of other advance preparations for the security of the President, during his visit to Dallas,
+wavs/LJ018-0206.wav|pitch/LJ018-0206.pt|He was a tall, slender man, with a long face and iron-gray hair.
+wavs/LJ012-0271.wav|pitch/LJ012-0271.pt|Whether it was greed or a quarrel that drove Greenacre to the desperate deed remains obscure.
+wavs/LJ005-0086.wav|pitch/LJ005-0086.pt|with such further separation as the justices should deem conducive to good order and discipline.
+wavs/LJ042-0097.wav|pitch/LJ042-0097.pt|and considerably better living quarters than those accorded to Soviet citizens of equal age and station.
+wavs/LJ047-0126.wav|pitch/LJ047-0126.pt|we would handle it in due course, in accord with the whole context of the investigation. End quote.
+wavs/LJ041-0022.wav|pitch/LJ041-0022.pt|Oswald first wrote, quote, Edward Vogel, end quote, an obvious misspelling of Voebel's name,
+wavs/LJ015-0025.wav|pitch/LJ015-0025.pt|The bank enjoyed an excellent reputation, it had a good connection, and was supposed to be perfectly sound.
+wavs/LJ012-0194.wav|pitch/LJ012-0194.pt|But Burke and Hare had their imitators further south,
+wavs/LJ028-0416.wav|pitch/LJ028-0416.pt|(if man may speak so confidently of His great impenetrable counsels), for an eternal Testimony of His great work in the confusion of Man's pride,
+wavs/LJ007-0130.wav|pitch/LJ007-0130.pt|are all huddled together without discrimination, oversight, or control."
+wavs/LJ015-0005.wav|pitch/LJ015-0005.pt|About this time Davidson and Gordon, the people above-mentioned,
+wavs/LJ016-0125.wav|pitch/LJ016-0125.pt|with this, placed against the wall near the chevaux-de-frise, he made an escalade.
+wavs/LJ014-0224.wav|pitch/LJ014-0224.pt|As Dwyer survived, Cannon escaped the death sentence, which was commuted to penal servitude for life.
+wavs/LJ005-0019.wav|pitch/LJ005-0019.pt|refuted by abundant evidence, and having no foundation whatever in truth.
+wavs/LJ042-0221.wav|pitch/LJ042-0221.pt|With either great ambivalence, or cold calculation he prepared completely different answers to the same questions.
+wavs/LJ001-0063.wav|pitch/LJ001-0063.pt|which was generally more formally Gothic than the printing of the German workmen,
+wavs/LJ030-0006.wav|pitch/LJ030-0006.pt|They took off in the Presidential plane, Air Force One, at eleven a.m., arriving at San Antonio at one:thirty p.m., Eastern Standard Time.
+wavs/LJ024-0054.wav|pitch/LJ024-0054.pt|democracy will have failed far beyond the importance to it of any king of precedent concerning the judiciary.
+wavs/LJ006-0044.wav|pitch/LJ006-0044.pt|the same callous indifference to the moral well-being of the prisoners, the same want of employment and of all disciplinary control.
+wavs/LJ039-0154.wav|pitch/LJ039-0154.pt|four point eight to five point six seconds if the second shot missed,
+wavs/LJ050-0090.wav|pitch/LJ050-0090.pt|they seem unduly restrictive in continuing to require some manifestation of animus against a Government official.
+wavs/LJ028-0421.wav|pitch/LJ028-0421.pt|it was the beginning of the great collections of Babylonian antiquities in the museums of the Western world.
+wavs/LJ033-0205.wav|pitch/LJ033-0205.pt|then I would say the possibility exists, these fibers could have come from this blanket, end quote.
+wavs/LJ019-0335.wav|pitch/LJ019-0335.pt|The books and journals he was to keep were minutely specified, and his constant presence in or near the jail was insisted upon.
+wavs/LJ013-0045.wav|pitch/LJ013-0045.pt|Wallace's relations warned him against his Liverpool friend,
+wavs/LJ037-0002.wav|pitch/LJ037-0002.pt|Chapter four. The Assassin: Part six.
+wavs/LJ018-0159.wav|pitch/LJ018-0159.pt|This was all the police wanted to know.
+wavs/LJ026-0140.wav|pitch/LJ026-0140.pt|In the plant as in the animal metabolism must consist of anabolic and catabolic processes.
+wavs/LJ014-0171.wav|pitch/LJ014-0171.pt|I will briefly describe one or two of the more remarkable murders in the years immediately following, then pass on to another branch of crime.
+wavs/LJ037-0007.wav|pitch/LJ037-0007.pt|Three others subsequently identified Oswald from a photograph.
+wavs/LJ033-0174.wav|pitch/LJ033-0174.pt|microscopic and UV (ultra violet) characteristics, end quote.
+wavs/LJ040-0110.wav|pitch/LJ040-0110.pt|he apparently adjusted well enough there to have had an average, although gradually deteriorating, school record
+wavs/LJ039-0192.wav|pitch/LJ039-0192.pt|he had a total of between four point eight and five point six seconds between the two shots which hit
+wavs/LJ032-0261.wav|pitch/LJ032-0261.pt|When he appeared before the Commission, Michael Paine lifted the blanket
+wavs/LJ040-0097.wav|pitch/LJ040-0097.pt|Lee was brought up in this atmosphere of constant money problems, and I am sure it had quite an effect on him, and also Robert, end quote.
+wavs/LJ037-0249.wav|pitch/LJ037-0249.pt|Mrs. Earlene Roberts, the housekeeper at Oswald's roominghouse and the last person known to have seen him before he reached tenth Street and Patton Avenue,
+wavs/LJ016-0248.wav|pitch/LJ016-0248.pt|Marwood was proud of his calling, and when questioned as to whether his process was satisfactory, replied that he heard "no complaints."
+wavs/LJ004-0083.wav|pitch/LJ004-0083.pt|As Mr. Buxton pointed out, many old acts of parliament designed to protect the prisoner were still in full force.
+wavs/LJ014-0029.wav|pitch/LJ014-0029.pt|This was Delarue's watch, fully identified as such, which Hocker told his brother Delarue had given him the morning of the murder.
+wavs/LJ021-0110.wav|pitch/LJ021-0110.pt|have been best calculated to promote industrial recovery and a permanent improvement of business and labor conditions.
+wavs/LJ003-0107.wav|pitch/LJ003-0107.pt|he slept in the same bed with a highwayman on one side, and a man charged with murder on the other.
+wavs/LJ039-0076.wav|pitch/LJ039-0076.pt|Ronald Simmons, chief of the U.S. Army Infantry Weapons Evaluation Branch of the Ballistics Research Laboratory, said, quote,
+wavs/LJ016-0347.wav|pitch/LJ016-0347.pt|had undoubtedly a solemn, impressive effect upon those outside.
+wavs/LJ001-0072.wav|pitch/LJ001-0072.pt|After the end of the fifteenth century the degradation of printing, especially in Germany and Italy,
+wavs/LJ024-0018.wav|pitch/LJ024-0018.pt|Consequently, although there never can be more than fifteen, there may be only fourteen, or thirteen, or twelve.
+wavs/LJ032-0180.wav|pitch/LJ032-0180.pt|that the fibers were caught in the crevice of the rifle's butt plate, quote, in the recent past, end quote,
+wavs/LJ010-0083.wav|pitch/LJ010-0083.pt|and measures taken to arrest them when their plans were so far developed that no doubt could remain as to their guilt.
+wavs/LJ002-0299.wav|pitch/LJ002-0299.pt|and gave the garnish for the common side at that sum, which is five shillings more than Mr. Neild says was extorted on the common side.
+wavs/LJ048-0143.wav|pitch/LJ048-0143.pt|the Secret Service did not at the time of the assassination have any established procedure governing its relationships with them.
+wavs/LJ012-0054.wav|pitch/LJ012-0054.pt|Solomons, while waiting to appear in court, persuaded the turnkeys to take him to a public-house, where all might "refresh."
+wavs/LJ019-0270.wav|pitch/LJ019-0270.pt|Vegetables, especially the potato, that most valuable anti-scorbutic, was too often omitted.
+wavs/LJ035-0164.wav|pitch/LJ035-0164.pt|three minutes after the shooting.
+wavs/LJ014-0326.wav|pitch/LJ014-0326.pt|Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses.
+wavs/LJ001-0173.wav|pitch/LJ001-0173.pt|The essential point to be remembered is that the ornament, whatever it is, whether picture or pattern-work, should form part of the page,
+wavs/LJ050-0056.wav|pitch/LJ050-0056.pt|On December twenty-six, nineteen sixty-three, the FBI circulated additional instructions to all its agents,
+wavs/LJ003-0319.wav|pitch/LJ003-0319.pt|provided only that their security was not jeopardized, and dependent upon the enforcement of another new rule,
+wavs/LJ006-0040.wav|pitch/LJ006-0040.pt|The fact was that the years as they passed, nearly twenty in all, had worked but little permanent improvement in this detestable prison.
+wavs/LJ017-0231.wav|pitch/LJ017-0231.pt|His body was found lying in a pool of blood in a night-dress, stabbed over and over again in the left side.
+wavs/LJ017-0226.wav|pitch/LJ017-0226.pt|One half of the mutineers fell upon him unawares with handspikes and capstan-bars.
+wavs/LJ004-0239.wav|pitch/LJ004-0239.pt|He had been committed for an offense for which he was acquitted.
+wavs/LJ048-0112.wav|pitch/LJ048-0112.pt|The Commission also regards the security arrangements worked out by Lawson and Sorrels at Love Field as entirely adequate.
+wavs/LJ039-0125.wav|pitch/LJ039-0125.pt|that Oswald was a good shot, somewhat better than or equal to -- better than the average let us say.
+wavs/LJ030-0196.wav|pitch/LJ030-0196.pt|He cried out, quote, Oh, no, no, no. My God, they are going to kill us all, end quote,
+wavs/LJ010-0228.wav|pitch/LJ010-0228.pt|He was released from Broadmoor in eighteen seventy-eight, and went abroad.
+wavs/LJ045-0228.wav|pitch/LJ045-0228.pt|On the other hand, he could have traveled some distance with the money he did have and he did return to his room where he obtained his revolver.
+wavs/LJ028-0168.wav|pitch/LJ028-0168.pt|in the other was the sacred precinct of Jupiter Belus,
+wavs/LJ021-0140.wav|pitch/LJ021-0140.pt|and in such an effort we should be able to secure for employers and employees and consumers
+wavs/LJ009-0280.wav|pitch/LJ009-0280.pt|Again the wretched creature succeeded in obtaining foothold, but this time on the left side of the drop.
+wavs/LJ003-0159.wav|pitch/LJ003-0159.pt|To constitute this the aristocratic quarter, unwarrantable demands were made upon the space properly allotted to the female felons,
+wavs/LJ016-0274.wav|pitch/LJ016-0274.pt|and the windows of the opposite houses, which commanded a good view, as usual fetched high prices.
+wavs/LJ035-0014.wav|pitch/LJ035-0014.pt|it sounded high and I immediately kind of looked up,
+wavs/LJ033-0120.wav|pitch/LJ033-0120.pt|which he believed was where the bag reached when it was laid on the seat with one edge against the door.
+wavs/LJ045-0015.wav|pitch/LJ045-0015.pt|which Johnson said he did not receive until after the assassination. The letter said in part, quote,
+wavs/LJ003-0299.wav|pitch/LJ003-0299.pt|the latter end of the nineteenth century, several of which still fall far short of our English ideal,
+wavs/LJ032-0206.wav|pitch/LJ032-0206.pt|After comparing the rifle in the simulated photograph with the rifle in Exhibit Number one thirty-three A, Shaneyfelt testified, quote,
+wavs/LJ028-0494.wav|pitch/LJ028-0494.pt|Between the several sections were wide spaces where foot soldiers and charioteers might fight.
+wavs/LJ005-0099.wav|pitch/LJ005-0099.pt|and report at length upon the condition of the prisons of the country.
+wavs/LJ015-0144.wav|pitch/LJ015-0144.pt|developed to a colossal extent the frauds he had already practiced as a subordinate.
+wavs/LJ019-0221.wav|pitch/LJ019-0221.pt|It was intended as far as possible that, except awaiting trial, no prisoner should find himself relegated to Newgate.
+wavs/LJ003-0088.wav|pitch/LJ003-0088.pt|in one, for seven years -- that of a man sentenced to death, for whom great interest had been made, but whom it was not thought right to pardon.
+wavs/LJ045-0216.wav|pitch/LJ045-0216.pt|nineteen sixty-three, merely to disarm her and to provide a justification of sorts,
+wavs/LJ042-0135.wav|pitch/LJ042-0135.pt|that he was not yet twenty years old when he went to the Soviet Union with such high hopes and not quite twenty-three when he returned bitterly disappointed.
+wavs/LJ049-0196.wav|pitch/LJ049-0196.pt|On the other hand, it is urged that all features of the protection of the President and his family should be committed to an elite and independent corps.
+wavs/LJ018-0278.wav|pitch/LJ018-0278.pt|This was the well and astutely devised plot of the brothers Bidwell,
+wavs/LJ030-0238.wav|pitch/LJ030-0238.pt|and then looked around again and saw more of this movement, and so I proceeded to go to the back seat and get on top of him.
+wavs/LJ018-0309.wav|pitch/LJ018-0309.pt|where probably the money still remains.
+wavs/LJ041-0199.wav|pitch/LJ041-0199.pt|is shown most clearly by his employment relations after his return from the Soviet Union. Of course, he made his real problems worse to the extent
+wavs/LJ007-0076.wav|pitch/LJ007-0076.pt|The lax discipline maintained in Newgate was still further deteriorated by the presence of two other classes of prisoners who ought never to have been inmates of such a jail.
+wavs/LJ039-0118.wav|pitch/LJ039-0118.pt|He had high motivation. He had presumably a good to excellent rifle and good ammunition.
+wavs/LJ024-0019.wav|pitch/LJ024-0019.pt|And there may be only nine.
+wavs/LJ008-0085.wav|pitch/LJ008-0085.pt|The fire had not quite burnt out at twelve, in nearly four hours, that is to say.
+wavs/LJ018-0031.wav|pitch/LJ018-0031.pt|This fixed the crime pretty certainly upon Müller, who had already left the country, thus increasing suspicion under which he lay.
+wavs/LJ030-0032.wav|pitch/LJ030-0032.pt|Dallas police stood at intervals along the fence and Dallas plain clothes men mixed in the crowd.
+wavs/LJ050-0004.wav|pitch/LJ050-0004.pt|General Supervision of the Secret Service
+wavs/LJ039-0096.wav|pitch/LJ039-0096.pt|This is a definite advantage to the shooter, the vehicle moving directly away from him and the downgrade of the street, and he being in an elevated position
+wavs/LJ041-0195.wav|pitch/LJ041-0195.pt|Oswald's interest in Marxism led some people to avoid him,
+wavs/LJ047-0158.wav|pitch/LJ047-0158.pt|After a moment's hesitation, she told me that he worked at the Texas School Book Depository near the downtown area of Dallas.
+wavs/LJ050-0162.wav|pitch/LJ050-0162.pt|In planning its data processing techniques,
+wavs/LJ001-0051.wav|pitch/LJ001-0051.pt|and paying great attention to the "press work" or actual process of printing,
+wavs/LJ028-0136.wav|pitch/LJ028-0136.pt|Of all the ancient descriptions of the famous walls and the city they protected, that of Herodotus is the fullest.
+wavs/LJ034-0134.wav|pitch/LJ034-0134.pt|Shortly after the assassination Brennan noticed
+wavs/LJ019-0348.wav|pitch/LJ019-0348.pt|Every facility was promised. The sanction of the Secretary of State would not be withheld if plans and estimates were duly submitted,
+wavs/LJ010-0219.wav|pitch/LJ010-0219.pt|While one stood over the fire with the papers, another stood with lighted torch to fire the house.
+wavs/LJ011-0245.wav|pitch/LJ011-0245.pt|Mr. Mullay called again, taking with him five hundred pounds in cash. Howard discovered this, and his manner was very suspicious;
+wavs/LJ030-0035.wav|pitch/LJ030-0035.pt|Organization of the Motorcade
+wavs/LJ044-0135.wav|pitch/LJ044-0135.pt|While he had drawn some attention to himself and had actually appeared on two radio programs, he had been attacked by Cuban exiles and arrested,
+wavs/LJ045-0090.wav|pitch/LJ045-0090.pt|He was very much interested in autobiographical works of outstanding statesmen of the United States, to whom his wife thought he compared himself.
+wavs/LJ026-0034.wav|pitch/LJ026-0034.pt|When any given "protist" has to be classified the case must be decided on its individual merits;
+wavs/LJ045-0092.wav|pitch/LJ045-0092.pt|as to the fact that he was an outstanding man, end quote.
+wavs/LJ017-0050.wav|pitch/LJ017-0050.pt|Palmer, who was only thirty-one at the time of his trial, was in appearance short and stout, with a round head
+wavs/LJ036-0104.wav|pitch/LJ036-0104.pt|Whaley picked Oswald.
+wavs/LJ019-0055.wav|pitch/LJ019-0055.pt|High authorities were in favor of continuous separation.
+wavs/LJ010-0030.wav|pitch/LJ010-0030.pt|The brutal ferocity of the wild beast once aroused, the same means, the same weapons were employed to do the dreadful deed,
+wavs/LJ038-0047.wav|pitch/LJ038-0047.pt|Some of the officers saw Oswald strike McDonald with his fist. Most of them heard a click which they assumed to be a click of the hammer of the revolver.
+wavs/LJ009-0074.wav|pitch/LJ009-0074.pt|Let us pass on.
+wavs/LJ048-0069.wav|pitch/LJ048-0069.pt|Efforts made by the Bureau since the assassination, on the other hand,
+wavs/LJ003-0211.wav|pitch/LJ003-0211.pt|They were never left quite alone for fear of suicide, and for the same reason they were searched for weapons or poisons.
+wavs/LJ048-0053.wav|pitch/LJ048-0053.pt|It is the conclusion of the Commission that, even in the absence of Secret Service criteria
+wavs/LJ033-0093.wav|pitch/LJ033-0093.pt|Frazier estimated that the bag was two feet long, quote, give and take a few inches, end quote, and about five or six inches wide.
+wavs/LJ006-0149.wav|pitch/LJ006-0149.pt|The turnkeys left the prisoners very much to themselves, never entering the wards after locking-up time, at dusk, till unlocking next morning,
+wavs/LJ018-0211.wav|pitch/LJ018-0211.pt|The false coin was bought by an agent from an agent, and dealings were carried on secretly at the "Clock House" in Seven Dials.
+wavs/LJ008-0054.wav|pitch/LJ008-0054.pt|This contrivance appears to have been copied with improvements from that which had been used in Dublin at a still earlier date,
+wavs/LJ040-0052.wav|pitch/LJ040-0052.pt|that his commitment to Marxism was an important factor influencing his conduct during his adult years.
+wavs/LJ028-0023.wav|pitch/LJ028-0023.pt|Two weeks pass, and at last you stand on the eastern edge of the plateau
+wavs/LJ009-0184.wav|pitch/LJ009-0184.pt|Lord Ferrers' body was brought to Surgeons' Hall after execution in his own carriage and six;
+wavs/LJ005-0252.wav|pitch/LJ005-0252.pt|A committee was appointed, under the presidency of the Duke of Richmond
+wavs/LJ015-0266.wav|pitch/LJ015-0266.pt|has probably no parallel in the annals of crime. Saward himself is a striking and in some respects an unique figure in criminal history.
+wavs/LJ017-0059.wav|pitch/LJ017-0059.pt|even after sentence, and until within a few hours of execution, he was buoyed up with the hope of reprieve.
+wavs/LJ024-0034.wav|pitch/LJ024-0034.pt|What do they mean by the words "packing the Court"?
+wavs/LJ016-0089.wav|pitch/LJ016-0089.pt|He was engaged in whitewashing and cleaning; the officer who had him in charge left him on the stairs leading to the gallery.
+wavs/LJ039-0227.wav|pitch/LJ039-0227.pt|with two hits, within four point eight and five point six seconds.
+wavs/LJ001-0096.wav|pitch/LJ001-0096.pt|have now come into general use and are obviously a great improvement on the ordinary "modern style" in use in England, which is in fact the Bodoni type
+wavs/LJ018-0129.wav|pitch/LJ018-0129.pt|who threatened to betray the theft. But Brewer, either before or after this, succumbed to temptation,
+wavs/LJ010-0157.wav|pitch/LJ010-0157.pt|and that, as he was starving, he had resolved on this desperate deed,
+wavs/LJ038-0264.wav|pitch/LJ038-0264.pt|He concluded that, quote, the general rifling characteristics of the rifle are of the same type as those found on the bullet
+wavs/LJ031-0165.wav|pitch/LJ031-0165.pt|When security arrangements at the airport were complete, the Secret Service made the necessary arrangements for the Vice President to leave the hospital.
+wavs/LJ018-0244.wav|pitch/LJ018-0244.pt|The effect of establishing the forgeries would be to restore to the Roupell family lands for which a price had already been paid
+wavs/LJ007-0071.wav|pitch/LJ007-0071.pt|in the face of impediments confessedly discouraging
+wavs/LJ028-0340.wav|pitch/LJ028-0340.pt|Such of the Babylonians as witnessed the treachery took refuge in the temple of Jupiter Belus;
+wavs/LJ017-0164.wav|pitch/LJ017-0164.pt|with the idea of subjecting her to the irritant poison slowly but surely until the desired effect, death, was achieved.
+wavs/LJ048-0197.wav|pitch/LJ048-0197.pt|I then told the officers that their primary duty was traffic and crowd control and that they should be alert for any persons who might attempt to throw anything
+wavs/LJ013-0098.wav|pitch/LJ013-0098.pt|Mr. Oxenford having denied that he had made any transfer of stock, the matter was at once put into the hands of the police.
+wavs/LJ012-0049.wav|pitch/LJ012-0049.pt|led him to think seriously of trying his fortunes in another land.
+wavs/LJ030-0014.wav|pitch/LJ030-0014.pt|quote, that the crowd was about the same as the one which came to see him before but there were one hundred thousand extra people on hand who came to see Mrs. Kennedy.
+wavs/LJ014-0186.wav|pitch/LJ014-0186.pt|A milliner's porter,
+wavs/LJ015-0027.wav|pitch/LJ015-0027.pt|Yet even so early as the death of the first Sir John Paul,
+wavs/LJ047-0049.wav|pitch/LJ047-0049.pt|Marina Oswald, however, recalled that her husband was upset by this interview.
+wavs/LJ012-0021.wav|pitch/LJ012-0021.pt|at fourteen he was a pickpocket and a "duffer," or a seller of sham goods.
+wavs/LJ003-0140.wav|pitch/LJ003-0140.pt|otherwise he would have been stripped of his clothes. End quote.
+wavs/LJ042-0130.wav|pitch/LJ042-0130.pt|Shortly thereafter, less than eighteen months after his defection, about six weeks before he met Marina Prusakova,
+wavs/LJ019-0180.wav|pitch/LJ019-0180.pt|His letter to the Corporation, under date fourth June,
+wavs/LJ017-0108.wav|pitch/LJ017-0108.pt|He was struck with the appearance of the corpse, which was not emaciated, as after a long disease ending in death;
+wavs/LJ006-0268.wav|pitch/LJ006-0268.pt|Women saw men if they merely pretended to be wives; even boys were visited by their sweethearts.
+wavs/LJ044-0125.wav|pitch/LJ044-0125.pt|of residence in the U.S.S.R. against any cause which I join, by association,
+wavs/LJ015-0231.wav|pitch/LJ015-0231.pt|It was Tester's business, who had access to the railway company's books, to watch for this.
+wavs/LJ002-0225.wav|pitch/LJ002-0225.pt|The rentals of rooms and fees went to the warden, whose income was two thousand three hundred seventy-two pounds.
+wavs/LJ034-0072.wav|pitch/LJ034-0072.pt|The employees raced the elevators to the first floor. Givens saw Oswald standing at the gate on the fifth floor as the elevator went by.
+wavs/LJ045-0033.wav|pitch/LJ045-0033.pt|He began to treat me better. He helped me more -- although he always did help. But he was more attentive, end quote.
+wavs/LJ031-0058.wav|pitch/LJ031-0058.pt|to infuse blood and fluids into the circulatory system.
+wavs/LJ029-0197.wav|pitch/LJ029-0197.pt|During November the Dallas papers reported frequently on the plans for protecting the President, stressing the thoroughness of the preparations.
+wavs/LJ043-0047.wav|pitch/LJ043-0047.pt|Oswald and his family lived for a brief period with his mother at her urging, but Oswald soon decided to move out.
+wavs/LJ021-0026.wav|pitch/LJ021-0026.pt|seems necessary to produce the same result of justice and right conduct
+wavs/LJ003-0230.wav|pitch/LJ003-0230.pt|The prison allowances were eked out by the broken victuals generously given by several eating-house keepers in the city,
+wavs/LJ037-0252.wav|pitch/LJ037-0252.pt|Ted Callaway, who saw the gunman moments after the shooting, testified that Commission Exhibit Number one sixty-two
+wavs/LJ031-0008.wav|pitch/LJ031-0008.pt|Meanwhile, Chief Curry ordered the police base station to notify Parkland Hospital that the wounded President was en route.
+wavs/LJ030-0021.wav|pitch/LJ030-0021.pt|all one had to do was get a high building someday with a telescopic rifle, and there was nothing anybody could do to defend against such an attempt.
+wavs/LJ046-0179.wav|pitch/LJ046-0179.pt|being reviewed regularly.
+wavs/LJ025-0118.wav|pitch/LJ025-0118.pt|and that, however diverse may be the fabrics or tissues of which their bodies are composed, all these varied structures result
+wavs/LJ028-0278.wav|pitch/LJ028-0278.pt|Zopyrus, when they told him, not thinking that it could be true, went and saw the colt with his own eyes;
+wavs/LJ007-0090.wav|pitch/LJ007-0090.pt|Not only did their presence tend greatly to interfere with the discipline of the prison, but their condition was deplorable in the extreme.
+wavs/LJ045-0045.wav|pitch/LJ045-0045.pt|that she would be able to leave the Soviet Union. Marina Oswald has denied this.
+wavs/LJ028-0289.wav|pitch/LJ028-0289.pt|For he cut off his own nose and ears, and then, clipping his hair close and flogging himself with a scourge,
+wavs/LJ009-0276.wav|pitch/LJ009-0276.pt|Calcraft, the moment he had adjusted the cap and rope, ran down the steps, drew the bolt, and disappeared.
+wavs/LJ031-0122.wav|pitch/LJ031-0122.pt|treated the gunshot wound in the left thigh.
+wavs/LJ016-0205.wav|pitch/LJ016-0205.pt|he received a retaining fee of five pounds, five shillings, with the usual guinea for each job;
+wavs/LJ019-0248.wav|pitch/LJ019-0248.pt|leading to an inequality, uncertainty, and inefficiency of punishment productive of the most prejudicial results.
+wavs/LJ033-0183.wav|pitch/LJ033-0183.pt|it was not surprising that the replica sack made on December one, nineteen sixty-three,
+wavs/LJ037-0001.wav|pitch/LJ037-0001.pt|Report of the President's Commission on the Assassination of President Kennedy. The Warren Commission Report. By The President's Commission on the Assassination of President Kennedy.
+wavs/LJ018-0218.wav|pitch/LJ018-0218.pt|In eighteen fifty-five
+wavs/LJ001-0102.wav|pitch/LJ001-0102.pt|Here and there a book is printed in France or Germany with some pretension to good taste,
+wavs/LJ007-0125.wav|pitch/LJ007-0125.pt|It was diverted from its proper uses, and, as the "place of the greatest comfort," was allotted to persons who should not have been sent to Newgate at all.
+wavs/LJ050-0022.wav|pitch/LJ050-0022.pt|A formal and thorough description of the responsibilities of the advance agent is now in preparation by the Service.
+wavs/LJ028-0212.wav|pitch/LJ028-0212.pt|On the night of the eleventh day Gobrias killed the son of the King.
+wavs/LJ028-0357.wav|pitch/LJ028-0357.pt|yet we may be sure that Babylon was taken by Darius only by use of stratagem. Its walls were impregnable.
+wavs/LJ014-0199.wav|pitch/LJ014-0199.pt|there was no case to make out; why waste money on lawyers for the defense? His demeanor was cool and collected throughout;
+wavs/LJ016-0077.wav|pitch/LJ016-0077.pt|A man named Lears, under sentence of transportation for an attempt at murder on board ship, got up part of the way,
+wavs/LJ009-0194.wav|pitch/LJ009-0194.pt|and that executors or persons having lawful possession of the bodies
+wavs/LJ014-0094.wav|pitch/LJ014-0094.pt|Discovery of the murder came in this wise. O'Connor, a punctual and well-conducted official, was at once missed at the London Docks.
+wavs/LJ001-0079.wav|pitch/LJ001-0079.pt|Caslon's type is clear and neat, and fairly well designed;
+wavs/LJ026-0052.wav|pitch/LJ026-0052.pt|In the nutrition of the animal the most essential and characteristic part of the food supply is derived from vegetable
+wavs/LJ013-0005.wav|pitch/LJ013-0005.pt|One of the earliest of the big operators in fraudulent finance was Edward Beaumont Smith,
+wavs/LJ033-0072.wav|pitch/LJ033-0072.pt|I then stepped off of it and the officer picked it up in the middle and it bent so.
+wavs/LJ036-0067.wav|pitch/LJ036-0067.pt|According to McWatters, the Beckley bus was behind the Marsalis bus, but he did not actually see it.
+wavs/LJ025-0098.wav|pitch/LJ025-0098.pt|and it is probable that amyloid substances are universally present in the animal organism, though not in the precise form of starch.
+wavs/LJ005-0257.wav|pitch/LJ005-0257.pt|during which time a host of witnesses were examined, and the committee presented three separate reports,
+wavs/LJ004-0024.wav|pitch/LJ004-0024.pt|Thus in eighteen thirteen the exaction of jail fees had been forbidden by law,
+wavs/LJ049-0154.wav|pitch/LJ049-0154.pt|In eighteen ninety-four,
+wavs/LJ039-0059.wav|pitch/LJ039-0059.pt|(three) his experience and practice after leaving the Marine Corps, and (four) the accuracy of the weapon and the quality of the ammunition.
+wavs/LJ007-0150.wav|pitch/LJ007-0150.pt|He is allowed intercourse with prostitutes who, in nine cases out of ten, have originally conduced to his ruin;
+wavs/LJ015-0001.wav|pitch/LJ015-0001.pt|Chronicles of Newgate, Volume two. By Arthur Griffiths. Section eighteen: Newgate notorieties continued, part three.
+wavs/LJ010-0158.wav|pitch/LJ010-0158.pt|feeling, as he said, that he might as well be shot or hanged as remain in such a state.
+wavs/LJ010-0281.wav|pitch/LJ010-0281.pt|who had borne the Queen's commission, first as cornet, and then lieutenant, in the tenth Hussars.
+wavs/LJ033-0055.wav|pitch/LJ033-0055.pt|and he could disassemble it more rapidly.
+wavs/LJ015-0218.wav|pitch/LJ015-0218.pt|A new accomplice was now needed within the company's establishment, and Pierce looked about long before he found the right person.
+wavs/LJ027-0006.wav|pitch/LJ027-0006.pt|In all these lines the facts are drawn together by a strong thread of unity.
+wavs/LJ016-0049.wav|pitch/LJ016-0049.pt|He had here completed his ascent.
+wavs/LJ006-0088.wav|pitch/LJ006-0088.pt|It was not likely that a system which left innocent men -- for the great bulk of new arrivals were still untried
+wavs/LJ042-0133.wav|pitch/LJ042-0133.pt|a great change must have occurred in Oswald's thinking to induce him to return to the United States.
+wavs/LJ045-0234.wav|pitch/LJ045-0234.pt|While he did become enraged at at least one point in his interrogation,
+wavs/LJ046-0033.wav|pitch/LJ046-0033.pt|The adequacy of existing procedures can fairly be assessed only after full consideration of the difficulty of the protective assignment,
+wavs/LJ037-0061.wav|pitch/LJ037-0061.pt|and having, quote, somewhat bushy, end quote, hair.
+wavs/LJ032-0025.wav|pitch/LJ032-0025.pt|the officers of Klein's discovered that a rifle bearing serial number C two seven six six had been shipped to one A. Hidell,
+wavs/LJ047-0197.wav|pitch/LJ047-0197.pt|in view of all the information concerning Oswald in its files, should have alerted the Secret Service to Oswald's presence in Dallas
+wavs/LJ018-0130.wav|pitch/LJ018-0130.pt|and stole paper on a much larger scale than Brown.
+wavs/LJ005-0265.wav|pitch/LJ005-0265.pt|It was recommended that the dietaries should be submitted and approved like the rules; that convicted prisoners should not receive any food but the jail allowance;
+wavs/LJ044-0105.wav|pitch/LJ044-0105.pt|He presented Arnold Johnson, Gus Hall,
+wavs/LJ015-0043.wav|pitch/LJ015-0043.pt|This went on for some time, and might never have been discovered had some good stroke of luck provided any of the partners
+wavs/LJ030-0125.wav|pitch/LJ030-0125.pt|On several occasions when the Vice President's car was slowed down by the throng, Special Agent Youngblood stepped out to hold the crowd back.
+wavs/LJ043-0140.wav|pitch/LJ043-0140.pt|He also studied Dallas bus schedules to prepare for his later use of buses to travel to and from General Walker's house.
+wavs/LJ002-0220.wav|pitch/LJ002-0220.pt|In consequence of these disclosures, both Bambridge and Huggin, his predecessor in the office, were committed to Newgate,
+wavs/LJ034-0117.wav|pitch/LJ034-0117.pt|At one:twenty-nine p.m. the police radio reported
+wavs/LJ018-0276.wav|pitch/LJ018-0276.pt|The first plot was against Mr. Harry Emmanuel, but he escaped, and the attempt was made upon Loudon and Ryder.
+wavs/LJ004-0077.wav|pitch/LJ004-0077.pt|nor has he a right to poison or starve his fellow-creatures."
+wavs/LJ042-0194.wav|pitch/LJ042-0194.pt|they should not be confused with slowness, indecision or fear. Only the intellectually fearless could even be remotely attracted to our doctrine,
+wavs/LJ029-0114.wav|pitch/LJ029-0114.pt|The route chosen from the airport to Main Street was the normal one, except where Harwood Street was selected as the means of access to Main Street
+wavs/LJ014-0194.wav|pitch/LJ014-0194.pt|The policemen were now in possession;
+wavs/LJ032-0027.wav|pitch/LJ032-0027.pt|According to its microfilm records, Klein's received an order for a rifle on March thirteen, nineteen sixty-three,
+wavs/LJ048-0289.wav|pitch/LJ048-0289.pt|However, there is no evidence that these men failed to take any action in Dallas within their power that would have averted the tragedy.
+wavs/LJ043-0188.wav|pitch/LJ043-0188.pt|that he was the leader of a fascist organization, and when I said that even though all of that might be true, just the same he had no right to take his life,
+wavs/LJ011-0118.wav|pitch/LJ011-0118.pt|In eighteen twenty-nine the gallows claimed two more victims for this offense.
+wavs/LJ040-0201.wav|pitch/LJ040-0201.pt|After her interview with Mrs. Oswald,
+wavs/LJ033-0056.wav|pitch/LJ033-0056.pt|While the rifle may have already been disassembled when Oswald arrived home on Thursday, he had ample time that evening to disassemble the rifle
+wavs/LJ047-0073.wav|pitch/LJ047-0073.pt|Hosty considered the information to be, quote, stale, unquote, by that time, and did not attempt to verify Oswald's reported statement.
+wavs/LJ001-0153.wav|pitch/LJ001-0153.pt|only nominally so, however, in many cases, since when he uses a headline he counts that in,
+wavs/LJ007-0158.wav|pitch/LJ007-0158.pt|or any kind of moral improvement was impossible; the prisoner's career was inevitably downward, till he struck the lowest depths.
+wavs/LJ028-0502.wav|pitch/LJ028-0502.pt|The Ishtar gateway leading to the palace was encased with beautiful blue glazed bricks,
+wavs/LJ028-0226.wav|pitch/LJ028-0226.pt|Though Herodotus wrote nearly a hundred years after Babylon fell, his story seems to bear the stamp of truth.
+wavs/LJ010-0038.wav|pitch/LJ010-0038.pt|as there had been before; as in the year eighteen forty-nine, a year memorable for the Rush murders at Norwich,
+wavs/LJ019-0241.wav|pitch/LJ019-0241.pt|But in the interval very comprehensive and, I think it must be admitted, salutary changes were successively introduced into the management of prisons.
+wavs/LJ001-0094.wav|pitch/LJ001-0094.pt|were induced to cut punches for a series of "old style" letters.
+wavs/LJ001-0015.wav|pitch/LJ001-0015.pt|the forms of printed letters should be beautiful, and that their arrangement on the page should be reasonable and a help to the shapeliness of the letters themselves.
+wavs/LJ047-0015.wav|pitch/LJ047-0015.pt|From defection to return to Fort Worth.
+wavs/LJ044-0139.wav|pitch/LJ044-0139.pt|since there was no background to the New Orleans FPCC, quote, organization, end quote, which consisted solely of Oswald.
+wavs/LJ050-0031.wav|pitch/LJ050-0031.pt|that the Secret Service consciously set about the task of inculcating and maintaining the highest standard of excellence and esprit, for all of its personnel.
+wavs/LJ050-0235.wav|pitch/LJ050-0235.pt|It has also used other Federal law enforcement agents during Presidential visits to cities in which such agents are stationed.
+wavs/LJ050-0137.wav|pitch/LJ050-0137.pt|FBI, and the Secret Service.
+wavs/LJ031-0109.wav|pitch/LJ031-0109.pt|At one:thirty-five p.m., after Governor Connally had been moved to the operating room, Dr. Shaw started the first operation
+wavs/LJ031-0041.wav|pitch/LJ031-0041.pt|He noted that the President was blue-white or ashen in color; had slow, spasmodic, agonal respiration without any coordination;
+wavs/LJ021-0139.wav|pitch/LJ021-0139.pt|There should be at least a full and fair trial given to these means of ending industrial warfare;
+wavs/LJ029-0004.wav|pitch/LJ029-0004.pt|The narrative of these events is based largely on the recollections of the participants,
+wavs/LJ023-0122.wav|pitch/LJ023-0122.pt|It was said in last year's Democratic platform,
+wavs/LJ005-0264.wav|pitch/LJ005-0264.pt|inspectors of prisons should be appointed, who should visit all the prisons from time to time and report to the Secretary of State.
+wavs/LJ002-0105.wav|pitch/LJ002-0105.pt|and beyond it was a room called the "wine room," because formerly used for the sale of wine, but
+wavs/LJ017-0035.wav|pitch/LJ017-0035.pt|in the interests and for the due protection of the public, that the fullest and fairest inquiry should be made,
+wavs/LJ048-0252.wav|pitch/LJ048-0252.pt|Three of these agents occupied positions on the running boards of the car, and the fourth was seated in the car.
+wavs/LJ013-0109.wav|pitch/LJ013-0109.pt|The proceeds of the robbery were lodged in a Boston bank,
+wavs/LJ039-0139.wav|pitch/LJ039-0139.pt|Oswald obtained a hunting license, joined a hunting club and went hunting about six times, as discussed more fully in chapter six.
+wavs/LJ044-0047.wav|pitch/LJ044-0047.pt|that anyone ever attacked any street demonstration in which Oswald was involved, except for the Bringuier incident mentioned above,
+wavs/LJ016-0417.wav|pitch/LJ016-0417.pt|Catherine Wilson, the poisoner, was reserved and reticent to the last, expressing no contrition, but also no fear --
+wavs/LJ045-0178.wav|pitch/LJ045-0178.pt|he left his wedding ring in a cup on the dresser in his room. He also left one hundred seventy dollars in a wallet in one of the dresser drawers.
+wavs/LJ009-0172.wav|pitch/LJ009-0172.pt|While in London, for instance, in eighteen twenty-nine, twenty-four persons had been executed for crimes other than murder,
+wavs/LJ049-0202.wav|pitch/LJ049-0202.pt|incident to its responsibilities.
+wavs/LJ032-0103.wav|pitch/LJ032-0103.pt|The name "Hidell" was stamped on some of the "Chapter's" printed literature and on the membership application blanks.
+wavs/LJ013-0091.wav|pitch/LJ013-0091.pt|and Elder had to be assisted by two bank porters, who carried it for him to a carriage waiting near the Mansion House.
+wavs/LJ037-0208.wav|pitch/LJ037-0208.pt|nineteen dollars, ninety-five cents, plus one dollar, twenty-seven cents shipping charge, had been collected from the consignee, Hidell.
+wavs/LJ014-0128.wav|pitch/LJ014-0128.pt|her hair was dressed in long crepe bands. She had lace ruffles at her wrist, and wore primrose-colored kid gloves.
+wavs/LJ015-0007.wav|pitch/LJ015-0007.pt|This affected Cole's credit, and ugly reports were in circulation charging him with the issue of simulated warrants.
+wavs/LJ036-0169.wav|pitch/LJ036-0169.pt|he would have reached his destination at approximately twelve:fifty-four p.m.
+wavs/LJ021-0040.wav|pitch/LJ021-0040.pt|The second step we have taken in the restoration of normal business enterprise
+wavs/LJ015-0036.wav|pitch/LJ015-0036.pt|The bank was already insolvent,
+wavs/LJ034-0041.wav|pitch/LJ034-0041.pt|Although Bureau experiments had shown that twenty-four hours was a likely maximum time, Latona stated
+wavs/LJ009-0192.wav|pitch/LJ009-0192.pt|The dissection of executed criminals was abolished soon after the discovery of the crime of burking,
+wavs/LJ037-0248.wav|pitch/LJ037-0248.pt|The eyewitnesses vary in their identification of the jacket.
+wavs/LJ015-0289.wav|pitch/LJ015-0289.pt|As each transaction was carried out from a different address, and a different messenger always employed,
+wavs/LJ005-0072.wav|pitch/LJ005-0072.pt|After a few years of active exertion the Society was rewarded by fresh legislation.
+wavs/LJ023-0047.wav|pitch/LJ023-0047.pt|The three horses are, of course, the three branches of government -- the Congress, the Executive and the courts.
+wavs/LJ009-0126.wav|pitch/LJ009-0126.pt|Hardly any one.
+wavs/LJ034-0097.wav|pitch/LJ034-0097.pt|The window was approximately one hundred twenty feet away.
+wavs/LJ028-0462.wav|pitch/LJ028-0462.pt|They were laid in bitumen.
+wavs/LJ046-0055.wav|pitch/LJ046-0055.pt|It is now possible for Presidents to travel the length and breadth of a land far larger than the United States
+wavs/LJ019-0371.wav|pitch/LJ019-0371.pt|Yet the law was seldom if ever enforced.
+wavs/LJ039-0207.wav|pitch/LJ039-0207.pt|Although all of the shots were a few inches high and to the right of the target,
+wavs/LJ002-0174.wav|pitch/LJ002-0174.pt|Mr. Buxton's friends at once paid the forty shillings, and the boy was released.
+wavs/LJ016-0233.wav|pitch/LJ016-0233.pt|In his own profession
+wavs/LJ026-0108.wav|pitch/LJ026-0108.pt|It is clear that there are upward and downward currents of water containing food (comparable to blood of an animal),
+wavs/LJ038-0035.wav|pitch/LJ038-0035.pt|Oswald rose from his seat, bringing up both hands.
+wavs/LJ026-0148.wav|pitch/LJ026-0148.pt|water which is lost by evaporation, especially from the leaf surface through the stomata;
+wavs/LJ001-0186.wav|pitch/LJ001-0186.pt|the position of our Society that a work of utility might be also a work of art, if we cared to make it so.
+wavs/LJ016-0264.wav|pitch/LJ016-0264.pt|The upturned faces of the eager spectators resembled those of the 'gods' at Drury Lane on Boxing Night;
+wavs/LJ009-0041.wav|pitch/LJ009-0041.pt|The occupants of this terrible black pew were the last always to enter the chapel.
+wavs/LJ010-0297.wav|pitch/LJ010-0297.pt|But there were other notorious cases of forgery.
+wavs/LJ040-0018.wav|pitch/LJ040-0018.pt|the Commission is not able to reach any definite conclusions as to whether or not he was, quote, sane, unquote, under prevailing legal standards.
+wavs/LJ005-0253.wav|pitch/LJ005-0253.pt|"to inquire into and report upon the several jails and houses of correction in the counties, cities, and corporate towns within England and Wales
+wavs/LJ027-0176.wav|pitch/LJ027-0176.pt|Fishes first appeared in the Devonian and Upper Silurian in very reptilian or rather amphibian forms.
+wavs/LJ034-0035.wav|pitch/LJ034-0035.pt|The position of this palmprint on the carton was parallel with the long axis of the box, and at right angles with the short axis;
+wavs/LJ016-0054.wav|pitch/LJ016-0054.pt|But he did not like the risk of entering a room by the fireplace, and the chances of detection it offered.
+wavs/LJ018-0262.wav|pitch/LJ018-0262.pt|Roupell received the announcement with a cheerful countenance,
+wavs/LJ044-0237.wav|pitch/LJ044-0237.pt|with thirteen dollars, eighty-seven cents when considerably greater resources were available to him.
+wavs/LJ034-0166.wav|pitch/LJ034-0166.pt|Two other witnesses were able to offer partial descriptions of a man they saw in the southeast corner window
+wavs/LJ016-0238.wav|pitch/LJ016-0238.pt|"just to steady their legs a little;" in other words, to add his weight to that of the hanging bodies.
+wavs/LJ042-0198.wav|pitch/LJ042-0198.pt|The discussion above has already set forth examples of his expression of hatred for the United States.
+wavs/LJ031-0189.wav|pitch/LJ031-0189.pt|At two:thirty-eight p.m., Eastern Standard Time, Lyndon Baines Johnson took the oath of office as the thirty-sixth President of the United States.
+wavs/LJ050-0084.wav|pitch/LJ050-0084.pt|or, quote, other high government officials in the nature of a complaint coupled with an expressed or implied determination to use a means,
+wavs/LJ044-0158.wav|pitch/LJ044-0158.pt|As for my return entrance visa please consider it separately. End quote.
+wavs/LJ045-0082.wav|pitch/LJ045-0082.pt|it appears that Marina Oswald also complained that her husband was not able to provide more material things for her.
+wavs/LJ045-0190.wav|pitch/LJ045-0190.pt|appeared in The Dallas Times Herald on November fifteen, nineteen sixty-three.
+wavs/LJ035-0155.wav|pitch/LJ035-0155.pt|The only exit from the office in the direction Oswald was moving was through the door to the front stairway.
+wavs/LJ044-0004.wav|pitch/LJ044-0004.pt|Political Activities
+wavs/LJ046-0016.wav|pitch/LJ046-0016.pt|The Commission has not undertaken a comprehensive examination of all facets of this subject;
+wavs/LJ019-0368.wav|pitch/LJ019-0368.pt|The latter too was to be laid before the House of Commons.
+wavs/LJ010-0062.wav|pitch/LJ010-0062.pt|But they proceeded in all seriousness, and would have shrunk from no outrage or atrocity in furtherance of their foolhardy enterprise.
+wavs/LJ033-0159.wav|pitch/LJ033-0159.pt|It was from Oswald's right hand, in which he carried the long package as he walked from Frazier's car to the building.
+wavs/LJ002-0171.wav|pitch/LJ002-0171.pt|The boy declared he saw no one, and accordingly passed through without paying the toll of a penny.
+wavs/LJ002-0298.wav|pitch/LJ002-0298.pt|in his evidence in eighteen fourteen, said it was more,
+wavs/LJ012-0219.wav|pitch/LJ012-0219.pt|and in one corner, at some depth, a bundle of clothes were unearthed, which, with a hairy cap,
+wavs/LJ017-0190.wav|pitch/LJ017-0190.pt|After this came the charge of administering oil of vitriol, which failed, as has been described.
+wavs/LJ019-0179.wav|pitch/LJ019-0179.pt|This, with a scheme for limiting the jail to untried prisoners, had been urgently recommended by Lord John Russell in eighteen thirty.
+wavs/LJ050-0188.wav|pitch/LJ050-0188.pt|each patrolman might be given a prepared booklet of instructions explaining what is expected of him. The Secret Service has expressed concern
+wavs/LJ006-0043.wav|pitch/LJ006-0043.pt|The disgraceful overcrowding had been partially ended, but the same evils of indiscriminate association were still present; there was the old neglect of decency,
+wavs/LJ029-0060.wav|pitch/LJ029-0060.pt|A number of people who resembled some of those in the photographs were placed under surveillance at the Trade Mart.
+wavs/LJ019-0052.wav|pitch/LJ019-0052.pt|Both systems came to us from the United States. The difference was really more in degree than in principle,
+wavs/LJ037-0081.wav|pitch/LJ037-0081.pt|Later in the day each woman found an empty shell on the ground near the house. These two shells were delivered to the police.
+wavs/LJ048-0200.wav|pitch/LJ048-0200.pt|paying particular attention to the crowd for any unusual activity.
+wavs/LJ016-0426.wav|pitch/LJ016-0426.pt|come along, gallows.
+wavs/LJ008-0182.wav|pitch/LJ008-0182.pt|A tremendous crowd assembled when Bellingham was executed in eighteen twelve for the murder of Spencer Percival, at that time prime minister;
+wavs/LJ043-0107.wav|pitch/LJ043-0107.pt|Upon moving to New Orleans on April twenty-four, nineteen sixty-three,
+wavs/LJ006-0084.wav|pitch/LJ006-0084.pt|and so numerous were his opportunities of showing favoritism, that all the prisoners may be said to be in his power.
+wavs/LJ025-0081.wav|pitch/LJ025-0081.pt|has no permanent digestive cavity or mouth, but takes in its food anywhere and digests, so to speak, all over its body.
+wavs/LJ019-0042.wav|pitch/LJ019-0042.pt|These were either satisfied with a makeshift, and modified existing buildings, without close regard to their suitability, or for a long time did nothing at all.
+wavs/LJ047-0240.wav|pitch/LJ047-0240.pt|They agree that Hosty told Revill
+wavs/LJ032-0012.wav|pitch/LJ032-0012.pt|the resistance to arrest and the attempted shooting of another police officer by the man (Lee Harvey Oswald) subsequently accused of assassinating President Kennedy
+wavs/LJ050-0209.wav|pitch/LJ050-0209.pt|The assistant to the Director of the FBI testified that
--- a/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_pitch_text_train_v3.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_pitch_text_train_v3.txt
--- a/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_pitch_text_val.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_pitch_text_val.txt
@ -0,0 +1,100 @@
+wavs/LJ016-0288.wav|pitch/LJ016-0288.pt|"Müller, Müller, He's the man," till a diversion was created by the appearance of the gallows, which was received with continuous yells.
+wavs/LJ028-0275.wav|pitch/LJ028-0275.pt|At last, in the twentieth month,
+wavs/LJ019-0273.wav|pitch/LJ019-0273.pt|which Sir Joshua Jebb told the committee he considered the proper elements of penal discipline.
+wavs/LJ021-0145.wav|pitch/LJ021-0145.pt|From those willing to join in establishing this hoped-for period of peace,
+wavs/LJ009-0076.wav|pitch/LJ009-0076.pt|We come to the sermon.
+wavs/LJ048-0194.wav|pitch/LJ048-0194.pt|during the morning of November twenty-two prior to the motorcade.
+wavs/LJ049-0050.wav|pitch/LJ049-0050.pt|Hill had both feet on the car and was climbing aboard to assist President and Mrs. Kennedy.
+wavs/LJ022-0023.wav|pitch/LJ022-0023.pt|The overwhelming majority of people in this country know how to sift the wheat from the chaff in what they hear and what they read.
+wavs/LJ034-0053.wav|pitch/LJ034-0053.pt|reached the same conclusion as Latona that the prints found on the cartons were those of Lee Harvey Oswald.
+wavs/LJ035-0129.wav|pitch/LJ035-0129.pt|and she must have run down the stairs ahead of Oswald and would probably have seen or heard him.
+wavs/LJ039-0075.wav|pitch/LJ039-0075.pt|once you know that you must put the crosshairs on the target and that is all that is necessary.
+wavs/LJ046-0184.wav|pitch/LJ046-0184.pt|but there is a system for the immediate notification of the Secret Service by the confining institution when a subject is released or escapes.
+wavs/LJ003-0111.wav|pitch/LJ003-0111.pt|He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.
+wavs/LJ037-0234.wav|pitch/LJ037-0234.pt|Mrs. Mary Brock, the wife of a mechanic who worked at the station, was there at the time and she saw a white male,
+wavs/LJ047-0044.wav|pitch/LJ047-0044.pt|Oswald was, however, willing to discuss his contacts with Soviet authorities. He denied having any involvement with Soviet intelligence agencies
+wavs/LJ028-0081.wav|pitch/LJ028-0081.pt|Years later, when the archaeologists could readily distinguish the false from the true,
+wavs/LJ012-0161.wav|pitch/LJ012-0161.pt|he was reported to have fallen away to a shadow.
+wavs/LJ009-0114.wav|pitch/LJ009-0114.pt|Mr. Wakefield winds up his graphic but somewhat sensational account by describing another religious service, which may appropriately be inserted here.
+wavs/LJ028-0335.wav|pitch/LJ028-0335.pt|accordingly they committed to him the command of their whole army, and put the keys of their city into his hands.
+wavs/LJ005-0014.wav|pitch/LJ005-0014.pt|Speaking on a debate on prison matters, he declared that
+wavs/LJ008-0294.wav|pitch/LJ008-0294.pt|nearly indefinitely deferred.
+wavs/LJ028-0307.wav|pitch/LJ028-0307.pt|then let twenty days pass, and at the end of that time station near the Chaldasan gates a body of four thousand.
+wavs/LJ046-0058.wav|pitch/LJ046-0058.pt|During his Presidency, Franklin D. Roosevelt made almost four hundred journeys and traveled more than three hundred fifty thousand miles.
+wavs/LJ046-0146.wav|pitch/LJ046-0146.pt|The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files
+wavs/LJ017-0131.wav|pitch/LJ017-0131.pt|even when the high sheriff had told him there was no possibility of a reprieve, and within a few hours of execution.
+wavs/LJ002-0018.wav|pitch/LJ002-0018.pt|The inadequacy of the jail was noticed and reported upon again and again by the grand juries of the city of London,
+wavs/LJ019-0257.wav|pitch/LJ019-0257.pt|Here the tread-wheel was in use, there cellular cranks, or hard-labor machines.
+wavs/LJ034-0042.wav|pitch/LJ034-0042.pt|that he could only testify with certainty that the print was less than three days old.
+wavs/LJ031-0070.wav|pitch/LJ031-0070.pt|Dr. Clark, who most closely observed the head wound,
+wavs/LJ012-0035.wav|pitch/LJ012-0035.pt|the number and names on watches, were carefully removed or obliterated after the goods passed out of his hands.
+wavs/LJ050-0168.wav|pitch/LJ050-0168.pt|with the particular purposes of the agency involved. The Commission recognizes that this is a controversial area
+wavs/LJ036-0103.wav|pitch/LJ036-0103.pt|The police asked him whether he could pick out his passenger from the lineup.
+wavs/LJ016-0318.wav|pitch/LJ016-0318.pt|Other officials, great lawyers, governors of prisons, and chaplains supported this view.
+wavs/LJ034-0198.wav|pitch/LJ034-0198.pt|Euins, who was on the southwest corner of Elm and Houston Streets testified that he could not describe the man he saw in the window.
+wavs/LJ049-0026.wav|pitch/LJ049-0026.pt|On occasion the Secret Service has been permitted to have an agent riding in the passenger compartment with the President.
+wavs/LJ011-0096.wav|pitch/LJ011-0096.pt|He married a lady also belonging to the Society of Friends, who brought him a large fortune, which, and his own money, he put into a city firm,
+wavs/LJ040-0002.wav|pitch/LJ040-0002.pt|Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one.
+wavs/LJ014-0030.wav|pitch/LJ014-0030.pt|These were damnatory facts which well supported the prosecution.
+wavs/LJ043-0002.wav|pitch/LJ043-0002.pt|The Warren Commission Report. By The President's Commission on the Assassination of President Kennedy. Chapter seven. Lee Harvey Oswald:
+wavs/LJ029-0022.wav|pitch/LJ029-0022.pt|The original plan called for the President to spend only one day in the State, making whirlwind visits to Dallas, Fort Worth, San Antonio, and Houston.
+wavs/LJ014-0020.wav|pitch/LJ014-0020.pt|He was soon afterwards arrested on suspicion, and a search of his lodgings brought to light several garments saturated with blood;
+wavs/LJ040-0027.wav|pitch/LJ040-0027.pt|He was never satisfied with anything.
+wavs/LJ028-0093.wav|pitch/LJ028-0093.pt|but his scribe wrote it in the manner customary for the scribes of those days to write of their royal masters.
+wavs/LJ004-0152.wav|pitch/LJ004-0152.pt|although at Mr. Buxton's visit a new jail was in process of erection, the first step towards reform since Howard's visitation in seventeen seventy-four.
+wavs/LJ008-0111.wav|pitch/LJ008-0111.pt|They entered a "stone cold room," and were presently joined by the prisoner.
+wavs/LJ017-0044.wav|pitch/LJ017-0044.pt|and the deepest anxiety was felt that the crime, if crime there had been, should be brought home to its perpetrator.
+wavs/LJ033-0047.wav|pitch/LJ033-0047.pt|I noticed when I went out that the light was on, end quote,
+wavs/LJ028-0008.wav|pitch/LJ028-0008.pt|you tap gently with your heel upon the shoulder of the dromedary to urge her on.
+wavs/LJ016-0179.wav|pitch/LJ016-0179.pt|contracted with sheriffs and conveners to work by the job.
+wavs/LJ005-0201.wav|pitch/LJ005-0201.pt|as is shown by the report of the Commissioners to inquire into the state of the municipal corporations in eighteen thirty-five.
+wavs/LJ035-0019.wav|pitch/LJ035-0019.pt|drove to the northwest corner of Elm and Houston, and parked approximately ten feet from the traffic signal.
+wavs/LJ031-0038.wav|pitch/LJ031-0038.pt|The first physician to see the President at Parkland Hospital was Dr. Charles J. Carrico, a resident in general surgery.
+wavs/LJ017-0070.wav|pitch/LJ017-0070.pt|but his sporting operations did not prosper, and he became a needy man, always driven to desperate straits for cash.
+wavs/LJ007-0154.wav|pitch/LJ007-0154.pt|These pungent and well-grounded strictures applied with still greater force to the unconvicted prisoner, the man who came to the prison innocent, and still uncontaminated,
+wavs/LJ002-0043.wav|pitch/LJ002-0043.pt|long narrow rooms -- one thirty-six feet, six twenty-three feet, and the eighth eighteen,
+wavs/LJ004-0096.wav|pitch/LJ004-0096.pt|the fatal consequences whereof might be prevented if the justices of the peace were duly authorized
+wavs/LJ018-0081.wav|pitch/LJ018-0081.pt|his defense being that he had intended to commit suicide, but that, on the appearance of this officer who had wronged him,
+wavs/LJ042-0129.wav|pitch/LJ042-0129.pt|No night clubs or bowling alleys, no places of recreation except the trade union dances. I have had enough.
+wavs/LJ008-0278.wav|pitch/LJ008-0278.pt|or theirs might be one of many, and it might be considered necessary to "make an example."
+wavs/LJ015-0203.wav|pitch/LJ015-0203.pt|but were the precautions too minute, the vigilance too close to be eluded or overcome?
+wavs/LJ018-0239.wav|pitch/LJ018-0239.pt|His disappearance gave color and substance to evil reports already in circulation that the will and conveyance above referred to
+wavs/LJ021-0066.wav|pitch/LJ021-0066.pt|together with a great increase in the payrolls, there has come a substantial rise in the total of industrial profits
+wavs/LJ024-0083.wav|pitch/LJ024-0083.pt|This plan of mine is no attack on the Court;
+wavs/LJ008-0258.wav|pitch/LJ008-0258.pt|Let me retrace my steps, and speak more in detail of the treatment of the condemned in those bloodthirsty and brutally indifferent days,
+wavs/LJ038-0199.wav|pitch/LJ038-0199.pt|eleven. If I am alive and taken prisoner,
+wavs/LJ045-0230.wav|pitch/LJ045-0230.pt|when he was finally apprehended in the Texas Theatre. Although it is not fully corroborated by others who were present,
+wavs/LJ027-0141.wav|pitch/LJ027-0141.pt|is closely reproduced in the life-history of existing deer. Or, in other words,
+wavs/LJ016-0020.wav|pitch/LJ016-0020.pt|He never reached the cistern, but fell back into the yard, injuring his legs severely.
+wavs/LJ012-0250.wav|pitch/LJ012-0250.pt|On the seventh July, eighteen thirty-seven,
+wavs/LJ001-0110.wav|pitch/LJ001-0110.pt|Even the Caslon type when enlarged shows great shortcomings in this respect:
+wavs/LJ047-0148.wav|pitch/LJ047-0148.pt|On October twenty-five,
+wavs/LJ031-0134.wav|pitch/LJ031-0134.pt|On one occasion Mrs. Johnson, accompanied by two Secret Service agents, left the room to see Mrs. Kennedy and Mrs. Connally.
+wavs/LJ036-0174.wav|pitch/LJ036-0174.pt|This is the approximate time he entered the roominghouse, according to Earlene Roberts, the housekeeper there.
+wavs/LJ026-0068.wav|pitch/LJ026-0068.pt|Energy enters the plant, to a small extent,
+wavs/LJ034-0160.wav|pitch/LJ034-0160.pt|on Brennan's subsequent certain identification of Lee Harvey Oswald as the man he saw fire the rifle.
+wavs/LJ013-0164.wav|pitch/LJ013-0164.pt|who came from his room ready dressed, a suspicious circumstance, as he was always late in the morning.
+wavs/LJ014-0263.wav|pitch/LJ014-0263.pt|When other pleasures palled he took a theatre, and posed as a munificent patron of the dramatic art.
+wavs/LJ005-0079.wav|pitch/LJ005-0079.pt|and improve the morals of the prisoners, and shall insure the proper measure of punishment to convicted offenders.
+wavs/LJ048-0228.wav|pitch/LJ048-0228.pt|and others who were present say that no agent was inebriated or acted improperly.
+wavs/LJ027-0052.wav|pitch/LJ027-0052.pt|These principles of homology are essential to a correct interpretation of the facts of morphology.
+wavs/LJ004-0045.wav|pitch/LJ004-0045.pt|Mr. Sturges Bourne, Sir James Mackintosh, Sir James Scarlett, and William Wilberforce.
+wavs/LJ012-0042.wav|pitch/LJ012-0042.pt|which he kept concealed in a hiding-place with a trap-door just under his bed.
+wavs/LJ014-0110.wav|pitch/LJ014-0110.pt|At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects.
+wavs/LJ028-0506.wav|pitch/LJ028-0506.pt|A modern artist would have difficulty in doing such accurate work.
+wavs/LJ014-0010.wav|pitch/LJ014-0010.pt|yet he could not overcome the strange fascination it had for him, and remained by the side of the corpse till the stretcher came.
+wavs/LJ042-0096.wav|pitch/LJ042-0096.pt|(old exchange rate) in addition to his factory salary of approximately equal amount
+wavs/LJ031-0202.wav|pitch/LJ031-0202.pt|Mrs. Kennedy chose the hospital in Bethesda for the autopsy because the President had served in the Navy.
+wavs/LJ012-0235.wav|pitch/LJ012-0235.pt|While they were in a state of insensibility the murder was committed.
+wavs/LJ019-0186.wav|pitch/LJ019-0186.pt|seeing that since the establishment of the Central Criminal Court, Newgate received prisoners for trial from several counties,
+wavs/LJ018-0098.wav|pitch/LJ018-0098.pt|and recognized as one of the frequenters of the bogus law-stationers. His arrest led to that of others.
+wavs/LJ036-0077.wav|pitch/LJ036-0077.pt|Roger D. Craig, a deputy sheriff of Dallas County,
+wavs/LJ045-0140.wav|pitch/LJ045-0140.pt|The arguments he used to justify his use of the alias suggest that Oswald may have come to think that the whole world was becoming involved
+wavs/LJ029-0032.wav|pitch/LJ029-0032.pt|According to O'Donnell, quote, we had a motorcade wherever we went, end quote.
+wavs/LJ003-0345.wav|pitch/LJ003-0345.pt|All the committee could do in this respect was to throw the responsibility on others.
+wavs/LJ008-0307.wav|pitch/LJ008-0307.pt|afterwards express a wish to murder the Recorder for having kept them so long in suspense.
+wavs/LJ043-0030.wav|pitch/LJ043-0030.pt|If somebody did that to me, a lousy trick like that, to take my wife away, and all the furniture, I would be mad as hell, too.
+wavs/LJ009-0238.wav|pitch/LJ009-0238.pt|After this the sheriffs sent for another rope, but the spectators interfered, and the man was carried back to jail.
+wavs/LJ039-0223.wav|pitch/LJ039-0223.pt|Oswald's Marine training in marksmanship, his other rifle experience and his established familiarity with this particular weapon
+wavs/LJ014-0076.wav|pitch/LJ014-0076.pt|He was seen afterwards smoking and talking with his hosts in their back parlor, and never seen again alive.
+wavs/LJ016-0138.wav|pitch/LJ016-0138.pt|at a distance from the prison.
--- a/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text.txt
--- a/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text_test_filelist.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text_test_filelist.txt
--- a/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text_train_filelist.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text_train_filelist.txt
@ -1369,7 +1369,6 @@ wavs/LJ045-0097.wav|whom she compared to, quote, a puppy dog that everybody kick
 wavs/LJ038-0217.wav|indicated that the note was written when they were living in a rented apartment; therefore it could not have been written while Marina Oswald was living with the Paines.
 wavs/LJ044-0102.wav|samples of his photographic work, offering to contribute that sort of service without charge.
 wavs/LJ004-0114.wav|King's evidences were also to be lodged apart.
-wavs/LJ016-0257.wav|and the raison d'être of the penalty, which in principle so many opposed, would be gone.
 wavs/LJ028-0130.wav|And after he had walled the city, and adorned its gates, he built another palace before his father's palace; but so that they joined to it:
 wavs/LJ019-0017.wav|This building was a costly affair. The site was uneven, and had to be leveled;
 wavs/LJ043-0021.wav|While the exact sequence of events is not clear because of conflicting testimony,
@ -2494,7 +2493,6 @@ wavs/LJ012-0268.wav|Suspicion grew almost to certainty as the evidence was unfol
 wavs/LJ049-0058.wav|Secondly, agents are instructed to remove the President as quickly as possible from known or impending danger.
 wavs/LJ010-0313.wav|the sum total amounting to some one hundred seventy thousand pounds, with a declaration in his own handwriting to the following effect.
 wavs/LJ007-0202.wav|But the one great and most crying evil remained unremedied.
-wavs/LJ011-0058.wav|He was summoned to the Mansion House, where he repeated his request, crying, "Accordez moi cette grâce," with much urgency.
 wavs/LJ010-0250.wav|of the criminal intent to kill.
 wavs/LJ047-0051.wav|particularly since his employment did not involve any sensitive information.
 wavs/LJ021-0106.wav|the representatives of trade and industry were permitted to write their ideas into the codes.
@ -2525,7 +2523,7 @@ wavs/LJ017-0171.wav|Smethurst was therefore given a free pardon for the offense
 wavs/LJ038-0038.wav|McDonald struck back with his right hand and grabbed the gun with his left hand. They both fell into the seats.
 wavs/LJ037-0228.wav|that the man was a, quote, white male, about twenty-five, about five feet eight, brown hair, medium, end quote, and wearing a, quote,
 wavs/LJ032-0184.wav|For ten days prior to the eve of the assassination Oswald had not been present at Ruth Paine's house in Irving, Texas, where the rifle was kept.
-wavs/LJ047-0160.wav|found it to be four one one Elm  Street. End quote.
+wavs/LJ047-0160.wav|found it to be four one one Elm Street. End quote.
 wavs/LJ032-0153.wav|A palmprint could not be placed on this portion of the rifle, when assembled, because the wooden foregrip covers the barrel at this point.
 wavs/LJ045-0143.wav|and use it against him as had been done in New Orleans.
 wavs/LJ049-0064.wav|because the variations possible preclude effective planning.
@ -5683,7 +5681,7 @@ wavs/LJ043-0056.wav|Even though Oswald cut off relations with his mother,
 wavs/LJ034-0055.wav|was probably made within a day or a day and a half of the examination on November twenty-two.
 wavs/LJ018-0252.wav|to deeds involving on the whole some three hundred fifty thousand pounds.
 wavs/LJ018-0380.wav|They paid close attention to the counsels of the archimandrite, and died quite penitent. A story is told of one of them, "Big Harry,"
-wavs/LJ031-0175.wav|O'Donnell tried to persuade Mrs. Kennedy to leave the area, but she refused.
+wavs/LJ031-0175.wav|O'Donnell tried to persuade Mrs. Kennedy to leave the area, but she refused. She said that she intended to stay with her husband.
 wavs/LJ031-0056.wav|Observing that an effective airway had to be established if treatment was to be effective, Dr. Perry performed a tracheotomy, which required three to five minutes.
 wavs/LJ045-0224.wav|and such a favorable opportunity to strike at a figure as great as the President would probably never have come to him again.
 wavs/LJ016-0160.wav|The noose was one of his hammock straps, which he buckled round his throat.
@ -6438,7 +6436,7 @@ wavs/LJ019-0015.wav|The first stone of Pentonville prison was laid on the tenth
 wavs/LJ048-0198.wav|and although it was not a violation of the law to carry a placard, that they were not to tolerate any actions such as the Stevenson incident
 wavs/LJ028-0279.wav|after which he commanded his servants to tell no one what had come to pass, while he himself pondered the matter.
 wavs/LJ008-0144.wav|in eighteen oh seven, an event long remembered from the fatal and disastrous consequences which followed it.
-wavs/LJ014-0165.wav|and presented by no heathen land under the sun.
+wavs/LJ014-0165.wav|and presented by no heathen land under the sun. The horrors of the gibbet and of the crime which brought the wretched murderers to it
 wavs/LJ008-0142.wav|the attendance at the execution was certain to be tumultuous, and the conduct of the mob disorderly.
 wavs/LJ048-0265.wav|after which time a very moderate use of liquor will not be considered a violation. However, all members of the White House Detail
 wavs/LJ003-0065.wav|All the misdemeanants, whatever their offense, were lodged in this chapel ward.
@ -6787,7 +6785,7 @@ wavs/LJ033-0030.wav|Mrs. A. C. Johnson, his landlady, testified that Oswald's ro
 wavs/LJ045-0117.wav|it would appear to be unlikely that his landlady in Dallas
 wavs/LJ047-0096.wav|The police called the local FBI office and an agent, John L. Quigley, was sent to the police station.
 wavs/LJ033-0212.wav|(two) took paper and tape from the wrapping bench of the Depository and fashioned a bag large enough to carry the disassembled rifle;
-wavs/LJ034-0138.wav|they saw and heard Brennan describing what he had seen.
+wavs/LJ034-0138.wav|they saw and heard Brennan describing what he had seen. Norman stated, quote
 wavs/LJ017-0250.wav|from Peru bound to Bordeaux, which had foundered at sea;
 wavs/LJ027-0049.wav|the structural elements remain, but are profoundly modified to perform totally different functions.
 wavs/LJ014-0185.wav|and bystanders peeped in through the shutters, but no one entered or sought to interfere in what seemed only a domestic quarrel.
@ -8732,7 +8730,6 @@ wavs/LJ028-0178.wav|the walls of Babylon were so long and wide and high that all
 wavs/LJ032-0089.wav|It certified that Lee Harvey Oswald had been vaccinated for smallpox on June eight, nineteen sixty-three.
 wavs/LJ049-0209.wav|The Commission was not asked to apply itself as did the Hoover Commission in nineteen forty-nine,
 wavs/LJ002-0069.wav|to the various wards their friends occupied.
-wavs/LJ018-0396.wav|The greatest causes célèbre, however, of recent times were the turf frauds by which the Comtesse de Goncourt was swindled
 wavs/LJ022-0158.wav|By far the greater part of the general decline in utility securities had occurred before I was inaugurated.
 wavs/LJ027-0065.wav|enclosed and protected by the skeleton, viz., the neural cavity above, and the visceral or body cavity below, the vertebral column.
 wavs/LJ032-0238.wav|Marina Oswald testified that the photographs were taken on a Sunday about two weeks before the attempted shooting of Maj. Gen. Edwin A. Walker
@ -11555,7 +11552,6 @@ wavs/LJ033-0145.wav|May have been used to carry gun. Lt. J. C. Day, end quote.
 wavs/LJ048-0255.wav|with full possession of his mental and physical capabilities and entirely ready for the performance of his assigned duties. Chief Rowley testified that,
 wavs/LJ006-0260.wav|From the same source came the two or three strong files which the inspectors found in one ward,
 wavs/LJ045-0040.wav|They had been married after a courtship of only about six weeks, a part of which Oswald spent in the hospital.
-wavs/LJ012-0205.wav|an avowed "snatcher" and habitué of the Fortune of War, a public-house in Smithfield frequented openly by men of this awful profession.
 wavs/LJ018-0337.wav|of Good, and Greenacre, and Catherine Hayes.
 wavs/LJ037-0033.wav|When questioned by police officers on the evening of November twenty-two, Benavides told them that he did not think that he could identify the man who fired the shots.
 wavs/LJ029-0153.wav|Under standard procedures, the responsibility for watching the windows of buildings was shared by local police stationed along the route
--- a/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text_val_filelist.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/filelists/ljs_audio_text_val_filelist.txt
@ -1,100 +1,100 @@
-wavs/LJ022-0023.wav|The overwhelming majority of people in this country know how to sift the wheat from the chaff in what they hear and what they read.
-wavs/LJ043-0030.wav|If somebody did that to me, a lousy trick like that, to take my wife away, and all the furniture, I would be mad as hell, too.
-wavs/LJ005-0201.wav|as is shown by the report of the Commissioners to inquire into the state of the municipal corporations in eighteen thirty-five.
-wavs/LJ001-0110.wav|Even the Caslon type when enlarged shows great shortcomings in this respect:
-wavs/LJ003-0345.wav|All the committee could do in this respect was to throw the responsibility on others.
-wavs/LJ007-0154.wav|These pungent and well-grounded strictures applied with still greater force to the unconvicted prisoner, the man who came to the prison innocent, and still uncontaminated,
-wavs/LJ018-0098.wav|and recognized as one of the frequenters of the bogus law-stationers. His arrest led to that of others.
-wavs/LJ047-0044.wav|Oswald was, however, willing to discuss his contacts with Soviet authorities. He denied having any involvement with Soviet intelligence agencies
-wavs/LJ031-0038.wav|The first physician to see the President at Parkland Hospital was Dr. Charles J. Carrico, a resident in general surgery.
-wavs/LJ048-0194.wav|during the morning of November twenty-two prior to the motorcade.
-wavs/LJ049-0026.wav|On occasion the Secret Service has been permitted to have an agent riding in the passenger compartment with the President.
-wavs/LJ004-0152.wav|although at Mr. Buxton's visit a new jail was in process of erection, the first step towards reform since Howard's visitation in seventeen seventy-four.
-wavs/LJ008-0278.wav|or theirs might be one of many, and it might be considered necessary to "make an example."
-wavs/LJ043-0002.wav|The Warren Commission Report. By The President's Commission on the Assassination of President Kennedy. Chapter seven. Lee Harvey Oswald:
-wavs/LJ009-0114.wav|Mr. Wakefield winds up his graphic but somewhat sensational account by describing another religious service, which may appropriately be inserted here.
-wavs/LJ028-0506.wav|A modern artist would have difficulty in doing such accurate work.
-wavs/LJ050-0168.wav|with the particular purposes of the agency involved. The Commission recognizes that this is a controversial area
-wavs/LJ039-0223.wav|Oswald's Marine training in marksmanship, his other rifle experience and his established familiarity with this particular weapon
-wavs/LJ029-0032.wav|According to O'Donnell, quote, we had a motorcade wherever we went, end quote.
-wavs/LJ031-0070.wav|Dr. Clark, who most closely observed the head wound,
-wavs/LJ034-0198.wav|Euins, who was on the southwest corner of Elm and Houston Streets testified that he could not describe the man he saw in the window.
-wavs/LJ026-0068.wav|Energy enters the plant, to a small extent,
-wavs/LJ039-0075.wav|once you know that you must put the crosshairs on the target and that is all that is necessary.
-wavs/LJ004-0096.wav|the fatal consequences whereof might be prevented if the justices of the peace were duly authorized
-wavs/LJ005-0014.wav|Speaking on a debate on prison matters, he declared that
-wavs/LJ012-0161.wav|he was reported to have fallen away to a shadow.
-wavs/LJ018-0239.wav|His disappearance gave color and substance to evil reports already in circulation that the will and conveyance above referred to
-wavs/LJ019-0257.wav|Here the tread-wheel was in use, there cellular cranks, or hard-labor machines.
-wavs/LJ028-0008.wav|you tap gently with your heel upon the shoulder of the dromedary to urge her on.
-wavs/LJ024-0083.wav|This plan of mine is no attack on the Court;
-wavs/LJ042-0129.wav|No night clubs or bowling alleys, no places of recreation except the trade union dances. I have had enough.
-wavs/LJ036-0103.wav|The police asked him whether he could pick out his passenger from the lineup.
-wavs/LJ046-0058.wav|During his Presidency, Franklin D. Roosevelt made almost four hundred journeys and traveled more than three hundred fifty thousand miles.
-wavs/LJ014-0076.wav|He was seen afterwards smoking and talking with his hosts in their back parlor, and never seen again alive.
-wavs/LJ002-0043.wav|long narrow rooms -- one thirty-six feet, six twenty-three feet, and the eighth eighteen,
-wavs/LJ009-0076.wav|We come to the sermon.
-wavs/LJ017-0131.wav|even when the high sheriff had told him there was no possibility of a reprieve, and within a few hours of execution.
-wavs/LJ046-0184.wav|but there is a system for the immediate notification of the Secret Service by the confining institution when a subject is released or escapes.
-wavs/LJ014-0263.wav|When other pleasures palled he took a theatre, and posed as a munificent patron of the dramatic art.
-wavs/LJ042-0096.wav|(old exchange rate) in addition to his factory salary of approximately equal amount
-wavs/LJ049-0050.wav|Hill had both feet on the car and was climbing aboard to assist President and Mrs. Kennedy.
-wavs/LJ019-0186.wav|seeing that since the establishment of the Central Criminal Court, Newgate received prisoners for trial from several counties,
-wavs/LJ028-0307.wav|then let twenty days pass, and at the end of that time station near the Chaldasan gates a body of four thousand.
-wavs/LJ012-0235.wav|While they were in a state of insensibility the murder was committed.
-wavs/LJ034-0053.wav|reached the same conclusion as Latona that the prints found on the cartons were those of Lee Harvey Oswald.
-wavs/LJ014-0030.wav|These were damnatory facts which well supported the prosecution.
-wavs/LJ015-0203.wav|but were the precautions too minute, the vigilance too close to be eluded or overcome?
-wavs/LJ028-0093.wav|but his scribe wrote it in the manner customary for the scribes of those days to write of their royal masters.
-wavs/LJ002-0018.wav|The inadequacy of the jail was noticed and reported upon again and again by the grand juries of the city of London,
-wavs/LJ028-0275.wav|At last, in the twentieth month,
-wavs/LJ012-0042.wav|which he kept concealed in a hiding-place with a trap-door just under his bed.
-wavs/LJ011-0096.wav|He married a lady also belonging to the Society of Friends, who brought him a large fortune, which, and his own money, he put into a city firm,
-wavs/LJ036-0077.wav|Roger D. Craig, a deputy sheriff of Dallas County,
-wavs/LJ016-0318.wav|Other officials, great lawyers, governors of prisons, and chaplains supported this view.
-wavs/LJ013-0164.wav|who came from his room ready dressed, a suspicious circumstance, as he was always late in the morning.
-wavs/LJ027-0141.wav|is closely reproduced in the life-history of existing deer. Or, in other words,
-wavs/LJ028-0335.wav|accordingly they committed to him the command of their whole army, and put the keys of their city into his hands.
-wavs/LJ031-0202.wav|Mrs. Kennedy chose the hospital in Bethesda for the autopsy because the President had served in the Navy.
-wavs/LJ021-0145.wav|From those willing to join in establishing this hoped-for period of peace,
 wavs/LJ016-0288.wav|"Müller, Müller, He's the man," till a diversion was created by the appearance of the gallows, which was received with continuous yells.
-wavs/LJ028-0081.wav|Years later, when the archaeologists could readily distinguish the false from the true,
-wavs/LJ018-0081.wav|his defense being that he had intended to commit suicide, but that, on the appearance of this officer who had wronged him,
-wavs/LJ021-0066.wav|together with a great increase in the payrolls, there has come a substantial rise in the total of industrial profits
-wavs/LJ009-0238.wav|After this the sheriffs sent for another rope, but the spectators interfered, and the man was carried back to jail.
-wavs/LJ005-0079.wav|and improve the morals of the prisoners, and shall insure the proper measure of punishment to convicted offenders.
-wavs/LJ035-0019.wav|drove to the northwest corner of Elm and Houston, and parked approximately ten feet from the traffic signal.
-wavs/LJ036-0174.wav|This is the approximate time he entered the roominghouse, according to Earlene Roberts, the housekeeper there.
-wavs/LJ046-0146.wav|The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files
-wavs/LJ017-0044.wav|and the deepest anxiety was felt that the crime, if crime there had been, should be brought home to its perpetrator.
-wavs/LJ017-0070.wav|but his sporting operations did not prosper, and he became a needy man, always driven to desperate straits for cash.
-wavs/LJ014-0020.wav|He was soon afterwards arrested on suspicion, and a search of his lodgings brought to light several garments saturated with blood;
-wavs/LJ016-0020.wav|He never reached the cistern, but fell back into the yard, injuring his legs severely.
-wavs/LJ045-0230.wav|when he was finally apprehended in the Texas Theatre. Although it is not fully corroborated by others who were present,
-wavs/LJ035-0129.wav|and she must have run down the stairs ahead of Oswald and would probably have seen or heard him.
-wavs/LJ008-0307.wav|afterwards express a wish to murder the Recorder for having kept them so long in suspense.
-wavs/LJ008-0294.wav|nearly indefinitely deferred.
-wavs/LJ047-0148.wav|On October twenty-five,
-wavs/LJ008-0111.wav|They entered a "stone cold room," and were presently joined by the prisoner.
-wavs/LJ034-0042.wav|that he could only testify with certainty that the print was less than three days old.
-wavs/LJ037-0234.wav|Mrs. Mary Brock, the wife of a mechanic who worked at the station, was there at the time and she saw a white male,
-wavs/LJ040-0002.wav|Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one.
-wavs/LJ045-0140.wav|The arguments he used to justify his use of the alias suggest that Oswald may have come to think that the whole world was becoming involved
-wavs/LJ012-0035.wav|the number and names on watches, were carefully removed or obliterated after the goods passed out of his hands.
-wavs/LJ012-0250.wav|On the seventh July, eighteen thirty-seven,
-wavs/LJ016-0179.wav|contracted with sheriffs and conveners to work by the job.
-wavs/LJ016-0138.wav|at a distance from the prison.
-wavs/LJ027-0052.wav|These principles of homology are essential to a correct interpretation of the facts of morphology.
-wavs/LJ031-0134.wav|On one occasion Mrs. Johnson, accompanied by two Secret Service agents, left the room to see Mrs. Kennedy and Mrs. Connally.
+wavs/LJ028-0275.wav|At last, in the twentieth month,
 wavs/LJ019-0273.wav|which Sir Joshua Jebb told the committee he considered the proper elements of penal discipline.
-wavs/LJ014-0110.wav|At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects.
-wavs/LJ034-0160.wav|on Brennan's subsequent certain identification of Lee Harvey Oswald as the man he saw fire the rifle.
-wavs/LJ038-0199.wav|eleven. If I am alive and taken prisoner,
-wavs/LJ014-0010.wav|yet he could not overcome the strange fascination it had for him, and remained by the side of the corpse till the stretcher came.
-wavs/LJ033-0047.wav|I noticed when I went out that the light was on, end quote,
-wavs/LJ040-0027.wav|He was never satisfied with anything.
-wavs/LJ048-0228.wav|and others who were present say that no agent was inebriated or acted improperly.
+wavs/LJ021-0145.wav|From those willing to join in establishing this hoped-for period of peace,
+wavs/LJ009-0076.wav|We come to the sermon.
+wavs/LJ048-0194.wav|during the morning of November twenty-two prior to the motorcade.
+wavs/LJ049-0050.wav|Hill had both feet on the car and was climbing aboard to assist President and Mrs. Kennedy.
+wavs/LJ022-0023.wav|The overwhelming majority of people in this country know how to sift the wheat from the chaff in what they hear and what they read.
+wavs/LJ034-0053.wav|reached the same conclusion as Latona that the prints found on the cartons were those of Lee Harvey Oswald.
+wavs/LJ035-0129.wav|and she must have run down the stairs ahead of Oswald and would probably have seen or heard him.
+wavs/LJ039-0075.wav|once you know that you must put the crosshairs on the target and that is all that is necessary.
+wavs/LJ046-0184.wav|but there is a system for the immediate notification of the Secret Service by the confining institution when a subject is released or escapes.
 wavs/LJ003-0111.wav|He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.
-wavs/LJ008-0258.wav|Let me retrace my steps, and speak more in detail of the treatment of the condemned in those bloodthirsty and brutally indifferent days,
+wavs/LJ037-0234.wav|Mrs. Mary Brock, the wife of a mechanic who worked at the station, was there at the time and she saw a white male,
+wavs/LJ047-0044.wav|Oswald was, however, willing to discuss his contacts with Soviet authorities. He denied having any involvement with Soviet intelligence agencies
+wavs/LJ028-0081.wav|Years later, when the archaeologists could readily distinguish the false from the true,
+wavs/LJ012-0161.wav|he was reported to have fallen away to a shadow.
+wavs/LJ009-0114.wav|Mr. Wakefield winds up his graphic but somewhat sensational account by describing another religious service, which may appropriately be inserted here.
+wavs/LJ028-0335.wav|accordingly they committed to him the command of their whole army, and put the keys of their city into his hands.
+wavs/LJ005-0014.wav|Speaking on a debate on prison matters, he declared that
+wavs/LJ008-0294.wav|nearly indefinitely deferred.
+wavs/LJ028-0307.wav|then let twenty days pass, and at the end of that time station near the Chaldasan gates a body of four thousand.
+wavs/LJ046-0058.wav|During his Presidency, Franklin D. Roosevelt made almost four hundred journeys and traveled more than three hundred fifty thousand miles.
+wavs/LJ046-0146.wav|The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files
+wavs/LJ017-0131.wav|even when the high sheriff had told him there was no possibility of a reprieve, and within a few hours of execution.
+wavs/LJ002-0018.wav|The inadequacy of the jail was noticed and reported upon again and again by the grand juries of the city of London,
+wavs/LJ019-0257.wav|Here the tread-wheel was in use, there cellular cranks, or hard-labor machines.
+wavs/LJ034-0042.wav|that he could only testify with certainty that the print was less than three days old.
+wavs/LJ031-0070.wav|Dr. Clark, who most closely observed the head wound,
+wavs/LJ012-0035.wav|the number and names on watches, were carefully removed or obliterated after the goods passed out of his hands.
+wavs/LJ050-0168.wav|with the particular purposes of the agency involved. The Commission recognizes that this is a controversial area
+wavs/LJ036-0103.wav|The police asked him whether he could pick out his passenger from the lineup.
+wavs/LJ016-0318.wav|Other officials, great lawyers, governors of prisons, and chaplains supported this view.
+wavs/LJ034-0198.wav|Euins, who was on the southwest corner of Elm and Houston Streets testified that he could not describe the man he saw in the window.
+wavs/LJ049-0026.wav|On occasion the Secret Service has been permitted to have an agent riding in the passenger compartment with the President.
+wavs/LJ011-0096.wav|He married a lady also belonging to the Society of Friends, who brought him a large fortune, which, and his own money, he put into a city firm,
+wavs/LJ040-0002.wav|Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one.
+wavs/LJ014-0030.wav|These were damnatory facts which well supported the prosecution.
+wavs/LJ043-0002.wav|The Warren Commission Report. By The President's Commission on the Assassination of President Kennedy. Chapter seven. Lee Harvey Oswald:
 wavs/LJ029-0022.wav|The original plan called for the President to spend only one day in the State, making whirlwind visits to Dallas, Fort Worth, San Antonio, and Houston.
+wavs/LJ014-0020.wav|He was soon afterwards arrested on suspicion, and a search of his lodgings brought to light several garments saturated with blood;
+wavs/LJ040-0027.wav|He was never satisfied with anything.
+wavs/LJ028-0093.wav|but his scribe wrote it in the manner customary for the scribes of those days to write of their royal masters.
+wavs/LJ004-0152.wav|although at Mr. Buxton's visit a new jail was in process of erection, the first step towards reform since Howard's visitation in seventeen seventy-four.
+wavs/LJ008-0111.wav|They entered a "stone cold room," and were presently joined by the prisoner.
+wavs/LJ017-0044.wav|and the deepest anxiety was felt that the crime, if crime there had been, should be brought home to its perpetrator.
+wavs/LJ033-0047.wav|I noticed when I went out that the light was on, end quote,
+wavs/LJ028-0008.wav|you tap gently with your heel upon the shoulder of the dromedary to urge her on.
+wavs/LJ016-0179.wav|contracted with sheriffs and conveners to work by the job.
+wavs/LJ005-0201.wav|as is shown by the report of the Commissioners to inquire into the state of the municipal corporations in eighteen thirty-five.
+wavs/LJ035-0019.wav|drove to the northwest corner of Elm and Houston, and parked approximately ten feet from the traffic signal.
+wavs/LJ031-0038.wav|The first physician to see the President at Parkland Hospital was Dr. Charles J. Carrico, a resident in general surgery.
+wavs/LJ017-0070.wav|but his sporting operations did not prosper, and he became a needy man, always driven to desperate straits for cash.
+wavs/LJ007-0154.wav|These pungent and well-grounded strictures applied with still greater force to the unconvicted prisoner, the man who came to the prison innocent, and still uncontaminated,
+wavs/LJ002-0043.wav|long narrow rooms -- one thirty-six feet, six twenty-three feet, and the eighth eighteen,
+wavs/LJ004-0096.wav|the fatal consequences whereof might be prevented if the justices of the peace were duly authorized
+wavs/LJ018-0081.wav|his defense being that he had intended to commit suicide, but that, on the appearance of this officer who had wronged him,
+wavs/LJ042-0129.wav|No night clubs or bowling alleys, no places of recreation except the trade union dances. I have had enough.
+wavs/LJ008-0278.wav|or theirs might be one of many, and it might be considered necessary to "make an example."
+wavs/LJ015-0203.wav|but were the precautions too minute, the vigilance too close to be eluded or overcome?
+wavs/LJ018-0239.wav|His disappearance gave color and substance to evil reports already in circulation that the will and conveyance above referred to
+wavs/LJ021-0066.wav|together with a great increase in the payrolls, there has come a substantial rise in the total of industrial profits
+wavs/LJ024-0083.wav|This plan of mine is no attack on the Court;
+wavs/LJ008-0258.wav|Let me retrace my steps, and speak more in detail of the treatment of the condemned in those bloodthirsty and brutally indifferent days,
+wavs/LJ038-0199.wav|eleven. If I am alive and taken prisoner,
+wavs/LJ045-0230.wav|when he was finally apprehended in the Texas Theatre. Although it is not fully corroborated by others who were present,
+wavs/LJ027-0141.wav|is closely reproduced in the life-history of existing deer. Or, in other words,
+wavs/LJ016-0020.wav|He never reached the cistern, but fell back into the yard, injuring his legs severely.
+wavs/LJ012-0250.wav|On the seventh July, eighteen thirty-seven,
+wavs/LJ001-0110.wav|Even the Caslon type when enlarged shows great shortcomings in this respect:
+wavs/LJ047-0148.wav|On October twenty-five,
+wavs/LJ031-0134.wav|On one occasion Mrs. Johnson, accompanied by two Secret Service agents, left the room to see Mrs. Kennedy and Mrs. Connally.
+wavs/LJ036-0174.wav|This is the approximate time he entered the roominghouse, according to Earlene Roberts, the housekeeper there.
+wavs/LJ026-0068.wav|Energy enters the plant, to a small extent,
+wavs/LJ034-0160.wav|on Brennan's subsequent certain identification of Lee Harvey Oswald as the man he saw fire the rifle.
+wavs/LJ013-0164.wav|who came from his room ready dressed, a suspicious circumstance, as he was always late in the morning.
+wavs/LJ014-0263.wav|When other pleasures palled he took a theatre, and posed as a munificent patron of the dramatic art.
+wavs/LJ005-0079.wav|and improve the morals of the prisoners, and shall insure the proper measure of punishment to convicted offenders.
+wavs/LJ048-0228.wav|and others who were present say that no agent was inebriated or acted improperly.
+wavs/LJ027-0052.wav|These principles of homology are essential to a correct interpretation of the facts of morphology.
 wavs/LJ004-0045.wav|Mr. Sturges Bourne, Sir James Mackintosh, Sir James Scarlett, and William Wilberforce.
+wavs/LJ012-0042.wav|which he kept concealed in a hiding-place with a trap-door just under his bed.
+wavs/LJ014-0110.wav|At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects.
+wavs/LJ028-0506.wav|A modern artist would have difficulty in doing such accurate work.
+wavs/LJ014-0010.wav|yet he could not overcome the strange fascination it had for him, and remained by the side of the corpse till the stretcher came.
+wavs/LJ042-0096.wav|(old exchange rate) in addition to his factory salary of approximately equal amount
+wavs/LJ031-0202.wav|Mrs. Kennedy chose the hospital in Bethesda for the autopsy because the President had served in the Navy.
+wavs/LJ012-0235.wav|While they were in a state of insensibility the murder was committed.
+wavs/LJ019-0186.wav|seeing that since the establishment of the Central Criminal Court, Newgate received prisoners for trial from several counties,
+wavs/LJ018-0098.wav|and recognized as one of the frequenters of the bogus law-stationers. His arrest led to that of others.
+wavs/LJ036-0077.wav|Roger D. Craig, a deputy sheriff of Dallas County,
+wavs/LJ045-0140.wav|The arguments he used to justify his use of the alias suggest that Oswald may have come to think that the whole world was becoming involved
+wavs/LJ029-0032.wav|According to O'Donnell, quote, we had a motorcade wherever we went, end quote.
+wavs/LJ003-0345.wav|All the committee could do in this respect was to throw the responsibility on others.
+wavs/LJ008-0307.wav|afterwards express a wish to murder the Recorder for having kept them so long in suspense.
+wavs/LJ043-0030.wav|If somebody did that to me, a lousy trick like that, to take my wife away, and all the furniture, I would be mad as hell, too.
+wavs/LJ009-0238.wav|After this the sheriffs sent for another rope, but the spectators interfered, and the man was carried back to jail.
+wavs/LJ039-0223.wav|Oswald's Marine training in marksmanship, his other rifle experience and his established familiarity with this particular weapon
+wavs/LJ014-0076.wav|He was seen afterwards smoking and talking with his hosts in their back parlor, and never seen again alive.
+wavs/LJ016-0138.wav|at a distance from the prison.
--- a/PyTorch/SpeechSynthesis/FastPitch/img/loss.png
+++ b/PyTorch/SpeechSynthesis/FastPitch/img/loss.png
--- a/PyTorch/SpeechSynthesis/FastPitch/inference.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/inference.py
@ -28,10 +28,10 @@
 import argparse
 import models
 import time
-import tqdm
 import sys
 import warnings
 from pathlib import Path
+from tqdm import tqdm

 import torch
 import numpy as np
@ -45,6 +45,7 @@ from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
 from common import utils
 from common.tb_dllogger import (init_inference_metadata, stdout_metric_format,
                                unique_log_fpath)
+from common.text import cmudict
 from common.text.text_processing import TextProcessing
 from pitch_transform import pitch_transform_custom
 from waveglow import model as glow
@ -63,6 +64,7 @@ def parse_args(parser):
                        help='Output folder to save audio (file per phrase)')
    parser.add_argument('--log-file', type=str, default=None,
                        help='Path to a DLLogger log file')
+    parser.add_argument('--save-mels', action='store_true', help='Save generated mels as .npy files in the output folder')
    parser.add_argument('--cuda', action='store_true',
                        help='Run inference on a GPU using CUDA')
    parser.add_argument('--cudnn-benchmark', action='store_true',
@ -82,8 +84,8 @@ def parse_args(parser):
    parser.add_argument('--amp', action='store_true',
                        help='Inference with AMP')
    parser.add_argument('-bs', '--batch-size', type=int, default=64)
-    parser.add_argument('--include-warmup', action='store_true',
-                        help='Include warmup')
+    parser.add_argument('--warmup-steps', type=int, default=0,
+                        help='Warmup iterations before measuring performance')
    parser.add_argument('--repeats', type=int, default=1,
                        help='Repeat inference for benchmarking')
    parser.add_argument('--torchscript', action='store_true',
@ -95,6 +97,11 @@ def parse_args(parser):
    parser.add_argument('--speaker', type=int, default=0,
                        help='Speaker ID for a multi-speaker model')

+    parser.add_argument('--p-arpabet', type=float, default=0.0, help='Probability of encoding each word as ARPAbet phonemes')
+    parser.add_argument('--heteronyms-path', type=str, default='cmudict/heteronyms',
+                        help='Path to the list of heteronyms')
+    parser.add_argument('--cmudict-path', type=str, default='cmudict/cmudict-0.7b',
+                        help='Path to the CMUdict pronunciation dictionary')
    transform = parser.add_argument_group('transform')
    transform.add_argument('--fade-out', type=int, default=10,
                           help='Number of fadeout frames at the end')
@ -151,6 +158,7 @@ def load_model_from_ckpt(checkpoint_path, ema, model):
 def load_and_setup_model(model_name, parser, checkpoint, amp, device,
                         unk_args=[], forward_is_infer=False, ema=True,
                         jitable=False):
+
    model_parser = models.parse_model_args(model_name, parser, add_help=False)
    model_args, model_unk_args = model_parser.parse_known_args()
    unk_args[:] = list(set(unk_args) & set(model_unk_args))
@ -165,7 +173,11 @@ def load_and_setup_model(model_name, parser, checkpoint, amp, device,
        model = load_model_from_ckpt(checkpoint, ema, model)

    if model_name == "WaveGlow":
+        for k, m in model.named_modules():
+            m._non_persistent_buffers_set = set()  # PyTorch 1.6.0 compatibility
+
        model = model.remove_weightnorm(model)
+
    if amp:
        model.half()
    model.eval()
@ -185,8 +197,8 @@ def load_fields(fpath):

 def prepare_input_sequence(fields, device, symbol_set, text_cleaners,
                           batch_size=128, dataset=None, load_mels=False,
-                           load_pitch=False):
-    tp = TextProcessing(symbol_set, text_cleaners)
+                           load_pitch=False, p_arpabet=0.0):
+    tp = TextProcessing(symbol_set, text_cleaners, p_arpabet=p_arpabet)

    fields['text'] = [torch.LongTensor(tp.encode_text(text))
                      for text in fields['text']]
@ -195,6 +207,9 @@ def prepare_input_sequence(fields, device, symbol_set, text_cleaners,
    fields['text'] = [fields['text'][i] for i in order]
    fields['text_lens'] = torch.LongTensor([t.size(0) for t in fields['text']])

+    for t in fields['text']:
+        print(tp.sequence_to_text(t.numpy()))
+
    if load_mels:
        assert 'mel' in fields
        fields['mel'] = [
@ -251,17 +266,23 @@ def build_pitch_transformation(args):


 class MeasureTime(list):
+    def __init__(self, *args, cuda=True, **kwargs):
+        super(MeasureTime, self).__init__(*args, **kwargs)
+        self.cuda = cuda
+
    def __enter__(self):
-        torch.cuda.synchronize()
+        if self.cuda:
+            torch.cuda.synchronize()
        self.t0 = time.perf_counter()

    def __exit__(self, exc_type, exc_value, exc_traceback):
-        torch.cuda.synchronize()
+        if self.cuda:
+            torch.cuda.synchronize()
        self.append(time.perf_counter() - self.t0)

    def __add__(self, other):
        assert len(self) == len(other)
-        return MeasureTime(sum(ab) for ab in zip(self, other))
+        return MeasureTime((sum(ab) for ab in zip(self, other)), cuda=self.cuda)


 def main():
@ -274,6 +295,9 @@ def main():
    parser = parse_args(parser)
    args, unk_args = parser.parse_known_args()

+    if args.p_arpabet > 0.0:
+        cmudict.initialize(args.cmudict_path, keep_ambiguous=True)
+
    torch.backends.cudnn.benchmark = args.cudnn_benchmark

    if args.output is not None:
@ -317,21 +341,20 @@ def main():
    fields = load_fields(args.input)
    batches = prepare_input_sequence(
        fields, device, args.symbol_set, args.text_cleaners, args.batch_size,
-        args.dataset_path, load_mels=(generator is None))
+        args.dataset_path, load_mels=(generator is None), p_arpabet=args.p_arpabet)

-    if args.include_warmup:
-        # Use real data rather than synthetic - FastPitch predicts len
-        for i in range(3):
-            with torch.no_grad():
-                if generator is not None:
-                    b = batches[0]
-                    mel, *_ = generator(b['text'])
-                if waveglow is not None:
-                    audios = waveglow(mel, sigma=args.sigma_infer).float()
-                    _ = denoiser(audios, strength=args.denoising_strength)
+    # Use real data rather than synthetic - FastPitch predicts len
+    for _ in tqdm(range(args.warmup_steps), 'Warmup'):
+        with torch.no_grad():
+            if generator is not None:
+                b = batches[0]
+                mel, *_ = generator(b['text'])
+            if waveglow is not None:
+                audios = waveglow(mel, sigma=args.sigma_infer).float()
+                _ = denoiser(audios, strength=args.denoising_strength)

-    gen_measures = MeasureTime()
-    waveglow_measures = MeasureTime()
+    gen_measures = MeasureTime(cuda=args.cuda)
+    waveglow_measures = MeasureTime(cuda=args.cuda)

    gen_kw = {'pace': args.pace,
              'speaker': args.speaker,
@ -351,15 +374,14 @@ def main():
    log_enabled = reps == 1
    log = lambda s, d: DLLogger.log(step=s, data=d) if log_enabled else None

-    for rep in (tqdm.tqdm(range(reps)) if reps > 1 else range(reps)):
+    for rep in (tqdm(range(reps), 'Inference') if reps > 1 else range(reps)):
        for b in batches:
            if generator is None:
                log(rep, {'Synthesizing from ground truth mels'})
-                mel = b['mel']
+                mel, mel_lens = b['mel'], b['mel_lens']
            else:
                with torch.no_grad(), gen_measures:
-                    mel, mel_lens, dur_pred, pitch_pred = generator(
-                        b['text'], b['text_lens'], **gen_kw)
+                    mel, mel_lens, *_ = generator(b['text'], **gen_kw)

                gen_infer_perf = mel.size(0) * mel.size(2) / gen_measures[-1]
                all_letters += b['text_lens'].sum().item()
@ -367,6 +389,13 @@ def main():
                log(rep, {"fastpitch_frames/s": gen_infer_perf})
                log(rep, {"fastpitch_latency": gen_measures[-1]})

+                if args.save_mels:
+                    for i, mel_ in enumerate(mel):
+                        m = mel_[:, :mel_lens[i].item()].permute(1, 0)
+                        fname = b['output'][i] if 'output' in b else f'mel_{i}.npy'
+                        mel_path = Path(args.output, Path(fname).stem + '.npy')
+                        np.save(mel_path, m.cpu().numpy())
+
            if waveglow is not None:
                with torch.no_grad(), waveglow_measures:
                    audios = waveglow(mel, sigma=args.sigma_infer)
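
Note on the `MeasureTime` change above: the pattern is a list subclass used as a context manager, where each `with` block appends one elapsed time and `+` sums two series element-wise. The following is a minimal, torch-free sketch of that pattern, not the code in `inference.py`. The `sync` hook is a hypothetical stand-in for `torch.cuda.synchronize`, and `__add__` passes the instance's own setting on to the result.

```python
import time


class MeasureTime(list):
    """List of elapsed times; each `with` block appends one measurement."""

    def __init__(self, *args, sync=None, **kwargs):
        super().__init__(*args, **kwargs)
        # `sync` is a hypothetical hook standing in for torch.cuda.synchronize;
        # None means "CPU only, nothing to wait for".
        self.sync = sync

    def __enter__(self):
        if self.sync is not None:
            self.sync()
        self.t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        if self.sync is not None:
            self.sync()
        self.append(time.perf_counter() - self.t0)

    def __add__(self, other):
        assert len(self) == len(other)
        # Propagate this instance's setting; a bare name would be undefined here.
        return MeasureTime((a + b for a, b in zip(self, other)), sync=self.sync)


gen = MeasureTime()
voc = MeasureTime()
for _ in range(3):
    with gen:
        sum(range(1000))   # placeholder for the generator forward pass
    with voc:
        sum(range(1000))   # placeholder for the vocoder forward pass

total = gen + voc          # per-iteration end-to-end time
```

Synchronizing before reading the clock matters on a GPU because kernel launches are asynchronous. On CPU the hook is skipped, which is what the `cuda` flag in the diff achieves.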
--- a/PyTorch/SpeechSynthesis/FastPitch/loss_functions.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/loss_functions.py
@ -1,46 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import torch
-import torch.nn as nn
-
-from fastpitch.loss_function import FastPitchLoss
-from tacotron2.loss_function import Tacotron2Loss
-from waveglow.loss_function import WaveGlowLoss
-
-
-def get_loss_function(loss_function, **kw):
-    if loss_function == 'Tacotron2':
-        loss = Tacotron2Loss()
-    elif loss_function == 'WaveGlow':
-        loss = WaveGlowLoss(**kw)
-    elif loss_function == 'FastPitch':
-        loss = FastPitchLoss(**kw)
-    else:
-        raise NotImplementedError(
-            "unknown loss function requested: {}".format(loss_function))
-    return loss.cuda()
--- a/PyTorch/SpeechSynthesis/FastPitch/models.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/models.py
@ -25,25 +25,15 @@
 #
 # *****************************************************************************

-import sys
-from typing import Optional
-from os.path import abspath, dirname
-
 import torch

-# enabling modules discovery from global entrypoint
-sys.path.append(abspath(dirname(__file__)+'/'))
-from fastpitch.model import FastPitch as _FastPitch
-from fastpitch.model_jit import FastPitch as _FastPitchJIT
-from tacotron2.model import Tacotron2
-from waveglow.model import WaveGlow
 from common.text.symbols import get_symbols, get_pad_idx
+from fastpitch.model import FastPitch
+from fastpitch.model_jit import FastPitchJIT
+from waveglow.model import WaveGlow


 def parse_model_args(model_name, parser, add_help=False):
-    if model_name == 'Tacotron2':
-        from tacotron2.arg_parser import parse_tacotron2_args
-        return parse_tacotron2_args(parser, add_help)
    if model_name == 'WaveGlow':
        from waveglow.arg_parser import parse_waveglow_args
        return parse_waveglow_args(parser, add_help)
@ -54,15 +44,6 @@ def parse_model_args(model_name, parser, add_help=False):
        raise NotImplementedError(model_name)


-def batchnorm_to_float(module):
-    """Converts batch norm to FP32"""
-    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
-        module.float()
-    for child in module.children():
-        batchnorm_to_float(child)
-    return module
-
-
 def init_bn(module):
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        if module.affine:
@ -74,56 +55,22 @@ def init_bn(module):
 def get_model(model_name, model_config, device,
              uniform_initialize_bn_weight=False, forward_is_infer=False,
              jitable=False):
-    """ Code chooses a model based on name"""
-    model = None
-    if model_name == 'Tacotron2':
-        if forward_is_infer:
-            class Tacotron2__forward_is_infer(Tacotron2):
-                def forward(self, inputs, input_lengths):
-                    return self.infer(inputs, input_lengths)
-            model = Tacotron2__forward_is_infer(**model_config)
-        else:
-            model = Tacotron2(**model_config)

-    elif model_name == 'WaveGlow':
-        if forward_is_infer:
-            class WaveGlow__forward_is_infer(WaveGlow):
-                def forward(self, spect, sigma=1.0):
-                    return self.infer(spect, sigma)
-            model = WaveGlow__forward_is_infer(**model_config)
-        else:
-            model = WaveGlow(**model_config)
+    if model_name == 'WaveGlow':
+        model = WaveGlow(**model_config)

    elif model_name == 'FastPitch':
-        if forward_is_infer:
-            if jitable:
-                class FastPitch__forward_is_infer(_FastPitchJIT):
-                    def forward(self, inputs, input_lengths, pace: float = 1.0,
-                                dur_tgt: Optional[torch.Tensor] = None,
-                                pitch_tgt: Optional[torch.Tensor] = None,
-                                speaker: int = 0):
-                        return self.infer(inputs, input_lengths, pace=pace,
-                                          dur_tgt=dur_tgt, pitch_tgt=pitch_tgt,
-                                          speaker=speaker)
-            else:
-                class FastPitch__forward_is_infer(_FastPitch):
-                    def forward(self, inputs, input_lengths, pace: float = 1.0,
-                                dur_tgt: Optional[torch.Tensor] = None,
-                                pitch_tgt: Optional[torch.Tensor] = None,
-                                pitch_transform=None,
-                                speaker: Optional[int] = None):
-                        return self.infer(inputs, input_lengths, pace=pace,
-                                          dur_tgt=dur_tgt, pitch_tgt=pitch_tgt,
-                                          pitch_transform=pitch_transform,
-                                          speaker=speaker)
-
-            model = FastPitch__forward_is_infer(**model_config)
+        if jitable:
+            model = FastPitchJIT(**model_config)
        else:
-            model = _FastPitch(**model_config)
+            model = FastPitch(**model_config)

    else:
        raise NotImplementedError(model_name)

+    if forward_is_infer:
+        model.forward = model.infer
+
    if uniform_initialize_bn_weight:
        init_bn(model)

@ -132,41 +79,7 @@ def get_model(model_name, model_config, device,

 def get_model_config(model_name, args):
    """ Code chooses a model based on name"""
-    if model_name == 'Tacotron2':
-        model_config = dict(
-            # optimization
-            mask_padding=args.mask_padding,
-            # audio
-            n_mel_channels=args.n_mel_channels,
-            # symbols
-            n_symbols=len(get_symbols(args.symbol_set)),
-            symbols_embedding_dim=args.symbols_embedding_dim,
-            # encoder
-            encoder_kernel_size=args.encoder_kernel_size,
-            encoder_n_convolutions=args.encoder_n_convolutions,
-            encoder_embedding_dim=args.encoder_embedding_dim,
-            # attention
-            attention_rnn_dim=args.attention_rnn_dim,
-            attention_dim=args.attention_dim,
-            # attention location
-            attention_location_n_filters=args.attention_location_n_filters,
-            attention_location_kernel_size=args.attention_location_kernel_size,
-            # decoder
-            n_frames_per_step=args.n_frames_per_step,
-            decoder_rnn_dim=args.decoder_rnn_dim,
-            prenet_dim=args.prenet_dim,
-            max_decoder_steps=args.max_decoder_steps,
-            gate_threshold=args.gate_threshold,
-            p_attention_dropout=args.p_attention_dropout,
-            p_decoder_dropout=args.p_decoder_dropout,
-            # postnet
-            postnet_embedding_dim=args.postnet_embedding_dim,
-            postnet_kernel_size=args.postnet_kernel_size,
-            postnet_n_convolutions=args.postnet_n_convolutions,
-            decoder_no_early_stopping=args.decoder_no_early_stopping,
-        )
-        return model_config
-    elif model_name == 'WaveGlow':
+    if model_name == 'WaveGlow':
        model_config = dict(
            n_mel_channels=args.n_mel_channels,
            n_flows=args.flows,
@ -184,7 +97,6 @@ def get_model_config(model_name, args):
        model_config = dict(
            # io
            n_mel_channels=args.n_mel_channels,
-            max_seq_len=args.max_seq_len,
            # symbols
            n_symbols=len(get_symbols(args.symbol_set)),
            padding_idx=get_pad_idx(args.symbol_set),
@ -223,7 +135,15 @@ def get_model_config(model_name, args):
            pitch_embedding_kernel_size=args.pitch_embedding_kernel_size,
            # speakers parameters
            n_speakers=args.n_speakers,
-            speaker_emb_weight=args.speaker_emb_weight
+            speaker_emb_weight=args.speaker_emb_weight,
+            # energy predictor
+            energy_predictor_kernel_size=args.energy_predictor_kernel_size,
+            energy_predictor_filter_size=args.energy_predictor_filter_size,
+            p_energy_predictor_dropout=args.p_energy_predictor_dropout,
+            energy_predictor_n_layers=args.energy_predictor_n_layers,
+            # energy conditioning
+            energy_conditioning=args.energy_conditioning,
+            energy_embedding_kernel_size=args.energy_embedding_kernel_size,
        )
        return model_config

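
Note on `model.forward = model.infer` in `get_model` above: the removed `*__forward_is_infer` subclasses are replaced by a single instance-level rebinding. The sketch below illustrates why that works, using a plain Python class as a stand-in for `torch.nn.Module`. `TinyModule` is hypothetical; the real `nn.Module.__call__` additionally runs hooks before dispatching to `forward`.

```python
class TinyModule:
    """Stand-in for torch.nn.Module: calling the object dispatches to forward()."""

    def __call__(self, *args, **kwargs):
        # Attribute lookup starts on the instance, so an instance-level
        # `forward` takes precedence over the method defined on the class.
        return self.forward(*args, **kwargs)

    def forward(self, x):
        return ("train", x)

    def infer(self, x, pace=1.0):
        return ("infer", x, pace)


model = TinyModule()
model.forward = model.infer   # bound method stored as an instance attribute

other = TinyModule()          # untouched instance keeps the class-level forward
```

Because the rebinding is per instance, other instances of the same class are unaffected, and keyword arguments accepted by `infer` (here `pace`) pass straight through the call.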
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_1GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_1GPU.sh
@ -1,23 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python train.py \
-    --amp \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 64 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 4
+set -a
+
+: ${NUM_GPUS:=1}
+: ${BATCH_SIZE:=16}
+: ${GRAD_ACCUMULATION:=16}
+: ${AMP:=true}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_4GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_4GPU.sh
@ -1,23 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 4 train.py \
-    --amp \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 64 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 1
+set -a
+
+: ${NUM_GPUS:=4}
+: ${BATCH_SIZE:=16}
+: ${GRAD_ACCUMULATION:=4}
+: ${AMP:=true}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_8GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_8GPU.sh
@ -1,23 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 8 train.py \
-    --amp \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 1
+set -a
+
+: ${NUM_GPUS:=8}
+: ${BATCH_SIZE:=16}
+: ${GRAD_ACCUMULATION:=2}
+: ${AMP:=true}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_1GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_1GPU.sh
@ -1,22 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python train.py \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 8
+set -a
+
+: ${NUM_GPUS:=1}
+: ${BATCH_SIZE:=16}
+: ${GRAD_ACCUMULATION:=16}
+: ${AMP:=false}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_4GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_4GPU.sh
@ -1,22 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 4 train.py \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 2
+set -a
+
+: ${NUM_GPUS:=4}
+: ${BATCH_SIZE:=16}
+: ${GRAD_ACCUMULATION:=4}
+: ${AMP:=false}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_8GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_8GPU.sh
@ -1,22 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 8 train.py \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 1
+set -a
+
+: ${NUM_GPUS:=8}
+: ${BATCH_SIZE:=16}
+: ${GRAD_ACCUMULATION:=2}
+: ${AMP:=false}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_1GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_1GPU.sh
@ -1,23 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python train.py \
-    --amp \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 128 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 2
+set -a
+
+: ${NUM_GPUS:=1}
+: ${BATCH_SIZE:=32}
+: ${GRAD_ACCUMULATION:=8}
+: ${AMP:=true}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_4GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_4GPU.sh
@ -1,23 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 4 train.py \
-    --amp \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 64 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 1
+set -a
+
+: ${NUM_GPUS:=4}
+: ${BATCH_SIZE:=32}
+: ${GRAD_ACCUMULATION:=2}
+: ${AMP:=true}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_8GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_8GPU.sh
@ -1,23 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 8 train.py \
-    --amp \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 1
+set -a
+
+: ${NUM_GPUS:=8}
+: ${BATCH_SIZE:=32}
+: ${GRAD_ACCUMULATION:=1}
+: ${AMP:=true}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_1GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_1GPU.sh
@ -1,22 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python train.py \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 8
+set -a
+
+: ${NUM_GPUS:=1}
+: ${BATCH_SIZE:=32}
+: ${GRAD_ACCUMULATION:=8}
+: ${AMP:=false}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_4GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_4GPU.sh
@ -1,22 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 4 train.py \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 2
+set -a
+
+: ${NUM_GPUS:=4}
+: ${BATCH_SIZE:=32}
+: ${GRAD_ACCUMULATION:=2}
+: ${AMP:=false}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_8GPU.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_8GPU.sh
@ -1,22 +1,10 @@
 #!/bin/bash

-mkdir -p output
-python -m torch.distributed.launch --nproc_per_node 8 train.py \
-    --cuda \
-    -o ./output/ \
-    --log-file output/nvlog.json \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs 32 \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps 1
+set -a
+
+: ${NUM_GPUS:=8}
+: ${BATCH_SIZE:=32}
+: ${GRAD_ACCUMULATION:=1}
+: ${AMP:=false}
+
+bash scripts/train.sh "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/prepare_dataset.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/prepare_dataset.py
@ -0,0 +1,174 @@
+# *****************************************************************************
+#  Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
+#
+#  Redistribution and use in source and binary forms, with or without
+#  modification, are permitted provided that the following conditions are met:
+#      * Redistributions of source code must retain the above copyright
+#        notice, this list of conditions and the following disclaimer.
+#      * Redistributions in binary form must reproduce the above copyright
+#        notice, this list of conditions and the following disclaimer in the
+#        documentation and/or other materials provided with the distribution.
+#      * Neither the name of the NVIDIA CORPORATION nor the
+#        names of its contributors may be used to endorse or promote products
+#        derived from this software without specific prior written permission.
+#
+#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
+#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# *****************************************************************************
+
+import argparse
+import time
+from pathlib import Path
+
+import torch
+import tqdm
+import dllogger as DLLogger
+from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
+from torch.utils.data import DataLoader
+
+from fastpitch.data_function import TTSCollate, TTSDataset
+
+
+def parse_args(parser):
+    """
+    Parse commandline arguments.
+    """
+    parser.add_argument('-d', '--dataset-path', type=str,
+                        default='./', help='Path to dataset')
+    parser.add_argument('--wav-text-filelists', required=True, nargs='+',
+                        type=str, help='Files with audio paths and text')
+    parser.add_argument('--extract-mels', action='store_true',
+                        help='Calculate spectrograms from .wav files')
+    parser.add_argument('--extract-pitch', action='store_true',
+                        help='Extract pitch')
+    parser.add_argument('--save-alignment-priors', action='store_true',
+                        help='Pre-calculate diagonal matrices of alignment of text to audio')
+    parser.add_argument('--log-file', type=str, default='preproc_log.json',
+                        help='Filename for logging')
+    parser.add_argument('--n-speakers', type=int, default=1)
+    # Mel extraction
+    parser.add_argument('--max-wav-value', default=32768.0, type=float,
+                        help='Maximum audiowave value')
+    parser.add_argument('--sampling-rate', default=22050, type=int,
+                        help='Sampling rate')
+    parser.add_argument('--filter-length', default=1024, type=int,
+                        help='Filter length')
+    parser.add_argument('--hop-length', default=256, type=int,
+                        help='Hop (stride) length')
+    parser.add_argument('--win-length', default=1024, type=int,
+                        help='Window length')
+    parser.add_argument('--mel-fmin', default=0.0, type=float,
+                        help='Minimum mel frequency')
+    parser.add_argument('--mel-fmax', default=8000.0, type=float,
+                        help='Maximum mel frequency')
+    parser.add_argument('--n-mel-channels', type=int, default=80)
+    # Pitch extraction
+    parser.add_argument('--f0-method', default='pyin', type=str,
+                        choices=('pyin', 'praat'), help='F0 estimation method')
+    # Performance
+    parser.add_argument('-b', '--batch-size', default=1, type=int)
+    parser.add_argument('--n-workers', type=int, default=16)
+    return parser
+
+
+def main():
+    parser = argparse.ArgumentParser(description='FastPitch Data Pre-processing')
+    parser = parse_args(parser)
+    args, unk_args = parser.parse_known_args()
+    if len(unk_args) > 0:
+        raise ValueError(f'Invalid options {unk_args}')
+
+    DLLogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT, Path(args.dataset_path, args.log_file)),
+                            StdOutBackend(Verbosity.VERBOSE)])
+    for k, v in vars(args).items():
+        DLLogger.log(step="PARAMETER", data={k: v})
+    DLLogger.flush()
+
+    if args.extract_mels:
+        Path(args.dataset_path, 'mels').mkdir(parents=False, exist_ok=True)
+
+    if args.extract_pitch:
+        Path(args.dataset_path, 'pitch').mkdir(parents=False, exist_ok=True)
+
+    if args.save_alignment_priors:
+        Path(args.dataset_path, 'alignment_priors').mkdir(parents=False, exist_ok=True)
+
+    for filelist in args.wav_text_filelists:
+
+        print(f'Processing {filelist}...')
+
+        dataset = TTSDataset(
+            args.dataset_path,
+            filelist,
+            text_cleaners=['english_cleaners_v2'],
+            n_mel_channels=args.n_mel_channels,
+            p_arpabet=0.0,
+            n_speakers=args.n_speakers,
+            load_mel_from_disk=False,
+            load_pitch_from_disk=False,
+            pitch_mean=None,
+            pitch_std=None,
+            max_wav_value=args.max_wav_value,
+            sampling_rate=args.sampling_rate,
+            filter_length=args.filter_length,
+            hop_length=args.hop_length,
+            win_length=args.win_length,
+            mel_fmin=args.mel_fmin,
+            mel_fmax=args.mel_fmax,
+            betabinomial_online_dir=None,
+            pitch_online_dir=None,
+            pitch_online_method=args.f0_method)
+
+        data_loader = DataLoader(
+            dataset,
+            batch_size=args.batch_size,
+            shuffle=False,
+            sampler=None,
+            num_workers=args.n_workers,
+            collate_fn=TTSCollate(),
+            pin_memory=False,
+            drop_last=False)
+
+        all_filenames = set()
+        for i, batch in enumerate(tqdm.tqdm(data_loader)):
+            tik = time.time()
+
+            _, input_lens, mels, mel_lens, _, pitch, _, _, attn_prior, fpaths = batch
+
+            # Ensure filenames are unique
+            for p in fpaths:
+                fname = Path(p).name
+                if fname in all_filenames:
+                    raise ValueError(f'Filename is not unique: {fname}')
+                all_filenames.add(fname)
+
+            if args.extract_mels:
+                for j, mel in enumerate(mels):
+                    fname = Path(fpaths[j]).with_suffix('.pt').name
+                    fpath = Path(args.dataset_path, 'mels', fname)
+                    torch.save(mel[:, :mel_lens[j]], fpath)
+
+            if args.extract_pitch:
+                for j, p in enumerate(pitch):
+                    fname = Path(fpaths[j]).with_suffix('.pt').name
+                    fpath = Path(args.dataset_path, 'pitch', fname)
+                    torch.save(p[:mel_lens[j]], fpath)
+
+            if args.save_alignment_priors:
+                for j, prior in enumerate(attn_prior):
+                    fname = Path(fpaths[j]).with_suffix('.pt').name
+                    fpath = Path(args.dataset_path, 'alignment_priors', fname)
+                    torch.save(prior[:mel_lens[j], :input_lens[j]], fpath)
+
+
+if __name__ == '__main__':
+    main()
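Review note on `prepare_dataset.py`: the uniqueness check exists because every feature is saved under a flat directory (`mels/`, `pitch/`, `alignment_priors/`) keyed only by the basename of the wav file, so two inputs in different subdirectories with the same filename would overwrite each other. A minimal sketch of that naming scheme follows. The helper names `find_duplicate_basenames` and `output_path` are hypothetical and not part of the commit, and the example paths are made up for illustration.

```python
from pathlib import Path


def find_duplicate_basenames(paths):
    """Return the sorted basenames that occur more than once in `paths`."""
    seen = set()
    duplicates = set()
    for p in paths:
        name = Path(p).name
        if name in seen:
            duplicates.add(name)
        seen.add(name)
    return sorted(duplicates)


def output_path(dataset_path, subdir, wav_path):
    """Mirror the script's naming: <dataset_path>/<subdir>/<basename>.pt"""
    fname = Path(wav_path).with_suffix('.pt').name
    return Path(dataset_path, subdir, fname)


if __name__ == '__main__':
    # Hypothetical filelist entries; only the basenames matter here.
    wavs = ['wavs/a/LJ001.wav', 'wavs/b/LJ001.wav', 'wavs/LJ002.wav']
    print(find_duplicate_basenames(wavs))
    print(output_path('LJSpeech-1.1', 'mels', wavs[0]))
```

Unlike this sketch, the script raises `ValueError` on the first collision, which stops preprocessing before any file is silently overwritten. Running a check like this over a filelist beforehand would report all collisions at once.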
--- a/PyTorch/SpeechSynthesis/FastPitch/requirements.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/requirements.txt
@ -1,7 +1,7 @@
 matplotlib
 numpy
 inflect
-librosa
+librosa==0.8.0
 scipy
 Unidecode
 praat-parselmouth==0.3.3
--- a/PyTorch/SpeechSynthesis/FastPitch/scripts/download_dataset.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/scripts/download_dataset.sh
@ -2,6 +2,9 @@

 set -e

+echo "Downloading cmudict-0.7b ..."
+wget https://github.com/Alexir/CMUdict/raw/master/cmudict-0.7b -qO cmudict/cmudict-0.7b
+
 DATA_DIR="LJSpeech-1.1"
 LJS_ARCH="LJSpeech-1.1.tar.bz2"
 LJS_URL="http://data.keithito.com/data/speech/${LJS_ARCH}"
@ -13,4 +16,3 @@ if [ ! -d ${DATA_DIR} ]; then
  tar jxvf ${LJS_ARCH}
  rm -f ${LJS_ARCH}
 fi
-
--- a/PyTorch/SpeechSynthesis/FastPitch/scripts/download_tacotron2.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/scripts/download_tacotron2.sh
@ -1,19 +0,0 @@
-#!/usr/bin/env bash
-
-set -e
-
-: ${MODEL_DIR:="pretrained_models/tacotron2"}
-MODEL="nvidia_tacotron2pyt_fp16.pt"
-MODEL_URL="https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_amp/versions/19.12.0/files/nvidia_tacotron2pyt_fp16.pt"
-
-mkdir -p "$MODEL_DIR"
-
-if [ ! -f "${MODEL_DIR}/${MODEL}" ]; then
-  echo "Downloading ${MODEL} ..."
-  wget --content-disposition -qO ${MODEL_DIR}/${MODEL} ${MODEL_URL} \
-       || { echo "ERROR: Failed to download ${MODEL} from NGC"; exit 1; }
-  echo "OK"
-
-else
-  echo "${MODEL} already downloaded."
-fi
--- a/PyTorch/SpeechSynthesis/FastPitch/scripts/inference_benchmark.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/scripts/inference_benchmark.sh
@ -1,30 +1,15 @@
-#!/bin/bash
+#!/usr/bin/env bash
+
+set -a

-: ${WAVEGLOW:="pretrained_models/waveglow/nvidia_waveglow256pyt_fp16.pt"}
-: ${FASTPITCH:="output/FastPitch_checkpoint_1500.pt"}
-: ${REPEATS:=1000}
-: ${BS_SEQUENCE:="1 4 8"}
 : ${PHRASES:="phrases/benchmark_8_128.tsv"}
 : ${OUTPUT_DIR:="./output/audio_$(basename ${PHRASES} .tsv)"}
-: ${AMP:=false}
+: ${TORCHSCRIPT:=true}
+: ${REPEATS:=100}
+: ${BS_SEQUENCE:="1 4 8"}
+: ${WARMUP:=100}

-[ "$AMP" = true ] && AMP_FLAG="--amp"
-
-mkdir -p "$OUTPUT_DIR"
-
-for BS in $BS_SEQUENCE ; do
-
-  echo -e "\nAMP: ${AMP}, batch size: ${BS}\n"
-
-  python inference.py --cuda --cudnn-benchmark \
-                      -i ${PHRASES} \
-                      -o ${OUTPUT_DIR} \
-                      --fastpitch ${FASTPITCH} \
-                      --waveglow ${WAVEGLOW} \
-                      --wn-channels 256 \
-                      --include-warmup \
-                      --batch-size ${BS} \
-                      --repeats ${REPEATS} \
-                      --torchscript \
-                      ${AMP_FLAG}
+for BATCH_SIZE in $BS_SEQUENCE ; do
+    LOG_FILE="$OUTPUT_DIR"/perf-infer_amp-${AMP}_bs${BATCH_SIZE}.json
+    bash scripts/inference_example.sh "$@"
 done
--- a/PyTorch/SpeechSynthesis/FastPitch/scripts/inference_example.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/scripts/inference_example.sh
@ -1,21 +1,45 @@
 #!/usr/bin/env bash

 : ${WAVEGLOW:="pretrained_models/waveglow/nvidia_waveglow256pyt_fp16.pt"}
-: ${FASTPITCH:="output/FastPitch_checkpoint_1500.pt"}
-: ${BS:=32}
+: ${FASTPITCH:="output/FastPitch_checkpoint_1000.pt"}
+: ${BATCH_SIZE:=32}
 : ${PHRASES:="phrases/devset10.tsv"}
 : ${OUTPUT_DIR:="./output/audio_$(basename ${PHRASES} .tsv)"}
+: ${LOG_FILE:="$OUTPUT_DIR/nvlog_infer.json"}
 : ${AMP:=false}
+: ${TORCHSCRIPT:=false}
+: ${PHONE:=true}
+: ${ENERGY:=true}
+: ${DENOISING:=0.01}
+: ${WARMUP:=0}
+: ${REPEATS:=1}

-[ "$AMP" = true ] && AMP_FLAG="--amp"
+: ${SPEAKER:=0}
+: ${NUM_SPEAKERS:=1}
+
+echo -e "\nAMP=$AMP, batch_size=$BATCH_SIZE\n"
+
+ARGS=""
+ARGS+=" --cuda"
+ARGS+=" --cudnn-benchmark"
+ARGS+=" -i $PHRASES"
+ARGS+=" -o $OUTPUT_DIR"
+ARGS+=" --log-file $LOG_FILE"
+ARGS+=" --fastpitch $FASTPITCH"
+ARGS+=" --waveglow $WAVEGLOW"
+ARGS+=" --wn-channels 256"
+ARGS+=" --batch-size $BATCH_SIZE"
+ARGS+=" --text-cleaners english_cleaners_v2"
+ARGS+=" --denoising-strength $DENOISING"
+ARGS+=" --repeats $REPEATS"
+ARGS+=" --warmup-steps $WARMUP"
+ARGS+=" --speaker $SPEAKER"
+ARGS+=" --n-speakers $NUM_SPEAKERS"
+[ "$AMP" = true ]           && ARGS+=" --amp"
+[ "$PHONE" = "true" ]       && ARGS+=" --p-arpabet 1.0"
+[ "$ENERGY" = "true" ]      && ARGS+=" --energy-conditioning"
+[ "$TORCHSCRIPT" = "true" ] && ARGS+=" --torchscript"

 mkdir -p "$OUTPUT_DIR"

-python inference.py --cuda \
-                    -i ${PHRASES} \
-                    -o ${OUTPUT_DIR} \
-                    --fastpitch ${FASTPITCH} \
-                    --waveglow ${WAVEGLOW} \
-		    --wn-channels 256 \
-                    --batch-size ${BS} \
-                    ${AMP_FLAG}
+python inference.py $ARGS "$@"
--- a/PyTorch/SpeechSynthesis/FastPitch/scripts/prepare_dataset.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/scripts/prepare_dataset.sh
@ -2,19 +2,15 @@

 set -e

-DATA_DIR="LJSpeech-1.1"
-TACO_CH=${TACO_CH:-"pretrained_models/tacotron2/nvidia_tacotron2pyt_fp16.pt"}
-for FILELIST in ljs_audio_text_train_filelist.txt \
-                ljs_audio_text_val_filelist.txt \
-                ljs_audio_text_test_filelist.txt \
-; do
-    python extract_mels.py \
-        --cuda \
-        --dataset-path ${DATA_DIR} \
-        --wav-text-filelist filelists/${FILELIST} \
-        --batch-size 256 \
-        --extract-mels \
-        --extract-durations \
-        --extract-pitch-char \
-        --tacotron2-checkpoint ${TACO_CH}
-done
+: ${DATA_DIR:=LJSpeech-1.1}
+: ${F0_METHOD:="pyin"}
+: ${ARGS="--extract-mels"}
+
+python prepare_dataset.py \
+    --wav-text-filelists filelists/ljs_audio_text.txt \
+    --n-workers 16 \
+    --batch-size 1 \
+    --dataset-path $DATA_DIR \
+    --extract-pitch \
+    --f0-method $F0_METHOD \
+    $ARGS
--- a/PyTorch/SpeechSynthesis/FastPitch/scripts/train.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/scripts/train.sh
@ -1,43 +1,98 @@
-#!/bin/bash
+#!/usr/bin/env bash

 export OMP_NUM_THREADS=1

 : ${NUM_GPUS:=8}
-: ${BS:=32}
-: ${GRAD_ACCUMULATION:=1}
+: ${BATCH_SIZE:=16}
+: ${GRAD_ACCUMULATION:=2}
 : ${OUTPUT_DIR:="./output"}
+: ${LOG_FILE:=$OUTPUT_DIR/nvlog.json}
+: ${DATASET_PATH:=LJSpeech-1.1}
+: ${TRAIN_FILELIST:=filelists/ljs_audio_pitch_text_train_v3.txt}
+: ${VAL_FILELIST:=filelists/ljs_audio_pitch_text_val.txt}
 : ${AMP:=false}
-: ${EPOCHS:=1500}
+: ${SEED:=""}

-[ "$AMP" == "true" ] && AMP_FLAG="--amp"
+: ${LEARNING_RATE:=0.1}

-# Adjust env variables to maintain the global batch size
-#
-#    NGPU x BS x GRAD_ACC = 256.
-#
-GBS=$(($NUM_GPUS * $BS * $GRAD_ACCUMULATION))
-[ $GBS -ne 256 ] && echo -e "\nWARNING: Global batch size changed from 256 to ${GBS}.\n"
+# Adjust these when the amount of data changes
+: ${EPOCHS:=1000}
+: ${EPOCHS_PER_CHECKPOINT:=100}
+: ${WARMUP_STEPS:=1000}
+: ${KL_LOSS_WARMUP:=100}

-echo -e "\nSetup: ${NUM_GPUS}x${BS}x${GRAD_ACCUMULATION} - global batch size ${GBS}\n"
+# Train a mixed phoneme/grapheme model
+: ${PHONE:=true}
+# Enable energy conditioning
+: ${ENERGY:=true}
+: ${TEXT_CLEANERS:=english_cleaners_v2}
+# Add dummy space prefix/suffix if audio is not precisely trimmed
+: ${APPEND_SPACES:=false}
+
+: ${LOAD_PITCH_FROM_DISK:=true}
+: ${LOAD_MEL_FROM_DISK:=false}
+
+# For multispeaker models, add speaker ID = {0, 1, ...} as the last filelist column
+: ${NSPEAKERS:=1}
+: ${SAMPLING_RATE:=22050}
+
+# Adjust env variables to maintain the global batch size: NUM_GPUS x BATCH_SIZE x GRAD_ACCUMULATION = 256.
+GBS=$(($NUM_GPUS * $BATCH_SIZE * $GRAD_ACCUMULATION))
+[ $GBS -ne 256 ] && echo -e "\nWARNING: Global batch size changed from 256 to ${GBS}."
+echo -e "\nAMP=$AMP, ${NUM_GPUS}x${BATCH_SIZE}x${GRAD_ACCUMULATION}" \
+        "(global batch size ${GBS})\n"
+
+ARGS=""
+ARGS+=" --cuda"
+ARGS+=" -o $OUTPUT_DIR"
+ARGS+=" --log-file $LOG_FILE"
+ARGS+=" --dataset-path $DATASET_PATH"
+ARGS+=" --training-files $TRAIN_FILELIST"
+ARGS+=" --validation-files $VAL_FILELIST"
+ARGS+=" -bs $BATCH_SIZE"
+ARGS+=" --grad-accumulation $GRAD_ACCUMULATION"
+ARGS+=" --optimizer lamb"
+ARGS+=" --epochs $EPOCHS"
+ARGS+=" --epochs-per-checkpoint $EPOCHS_PER_CHECKPOINT"
+ARGS+=" --resume"
+ARGS+=" --warmup-steps $WARMUP_STEPS"
+ARGS+=" -lr $LEARNING_RATE"
+ARGS+=" --weight-decay 1e-6"
+ARGS+=" --grad-clip-thresh 1000.0"
+ARGS+=" --dur-predictor-loss-scale 0.1"
+ARGS+=" --pitch-predictor-loss-scale 0.1"
+
+# Autoalign & new features
+ARGS+=" --kl-loss-start-epoch 0"
+ARGS+=" --kl-loss-warmup-epochs $KL_LOSS_WARMUP"
+ARGS+=" --text-cleaners $TEXT_CLEANERS"
+ARGS+=" --n-speakers $NSPEAKERS"
+
+[ "$AMP" = "true" ]                && ARGS+=" --amp"
+[ "$PHONE" = "true" ]              && ARGS+=" --p-arpabet 1.0"
+[ "$ENERGY" = "true" ]             && ARGS+=" --energy-conditioning"
+[ "$SEED" != "" ]                  && ARGS+=" --seed $SEED"
+[ "$LOAD_MEL_FROM_DISK" = true ]   && ARGS+=" --load-mel-from-disk"
+[ "$LOAD_PITCH_FROM_DISK" = true ] && ARGS+=" --load-pitch-from-disk"
+[ "$PITCH_ONLINE_DIR" != "" ]      && ARGS+=" --pitch-online-dir $PITCH_ONLINE_DIR"  # e.g., /dev/shm/pitch
+[ "$PITCH_ONLINE_METHOD" != "" ]   && ARGS+=" --pitch-online-method $PITCH_ONLINE_METHOD"
+[ "$APPEND_SPACES" = true ]        && ARGS+=" --prepend-space-to-text"
+[ "$APPEND_SPACES" = true ]        && ARGS+=" --append-space-to-text"
+
+if [ "$SAMPLING_RATE" == "44100" ]; then
+  ARGS+=" --sampling-rate 44100"
+  ARGS+=" --filter-length 2048"
+  ARGS+=" --hop-length 512"
+  ARGS+=" --win-length 2048"
+  ARGS+=" --mel-fmin 0.0"
+  ARGS+=" --mel-fmax 22050.0"
+
+elif [ "$SAMPLING_RATE" != "22050" ]; then
+  echo "Unknown sampling rate $SAMPLING_RATE"
+  exit 1
+fi

 mkdir -p "$OUTPUT_DIR"
-python -m torch.distributed.launch --nproc_per_node ${NUM_GPUS} train.py \
-    --cuda \
-    -o "$OUTPUT_DIR/" \
-    --log-file "$OUTPUT_DIR/nvlog.json" \
-    --dataset-path LJSpeech-1.1 \
-    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
-    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
-    --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs ${EPOCHS} \
-    --epochs-per-checkpoint 100 \
-    --warmup-steps 1000 \
-    -lr 0.1 \
-    -bs ${BS} \
-    --optimizer lamb \
-    --grad-clip-thresh 1000.0 \
-    --dur-predictor-loss-scale 0.1 \
-    --pitch-predictor-loss-scale 0.1 \
-    --weight-decay 1e-6 \
-    --gradient-accumulation-steps ${GRAD_ACCUMULATION} \
-    ${AMP_FLAG}
+
+: ${DISTRIBUTED:="-m torch.distributed.launch --nproc_per_node $NUM_GPUS"}
+python $DISTRIBUTED train.py $ARGS "$@"
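Review note on the `SAMPLING_RATE` branch in `scripts/train.sh`: the 44100 Hz settings double the filter, hop, and window lengths, which appears intended to keep the mel frame rate and window duration the same as at 22050 Hz. The sketch below checks that arithmetic. The 22050 Hz values are taken from the defaults in `prepare_dataset.py`; that `train.py` uses the same defaults is an assumption, since `train.sh` passes no STFT flags in that case. The function names are hypothetical.

```python
# STFT settings per sampling rate. The 44100 Hz row comes from scripts/train.sh;
# the 22050 Hz row is assumed from the prepare_dataset.py argument defaults.
SETTINGS = {
    22050: {'filter_length': 1024, 'hop_length': 256, 'win_length': 1024,
            'mel_fmax': 8000.0},
    44100: {'filter_length': 2048, 'hop_length': 512, 'win_length': 2048,
            'mel_fmax': 22050.0},
}


def frames_per_second(sampling_rate, hop_length):
    """Number of mel frames produced per second of audio."""
    return sampling_rate / hop_length


def window_ms(sampling_rate, win_length):
    """Duration of one analysis window in milliseconds."""
    return 1000.0 * win_length / sampling_rate


if __name__ == '__main__':
    for sr, cfg in SETTINGS.items():
        fps = frames_per_second(sr, cfg['hop_length'])
        win = window_ms(sr, cfg['win_length'])
        nyquist = sr / 2
        print(f'{sr} Hz: {fps:.2f} frames/s, {win:.1f} ms window, '
              f'mel_fmax={cfg["mel_fmax"]} (Nyquist {nyquist})')
```

Both rows give the same frame rate and window duration, so durations predicted in frames keep the same meaning in seconds. The one difference is the mel range: `--mel-fmax 22050.0` reaches the Nyquist frequency at 44100 Hz, whereas the 22050 Hz default of 8000 Hz stops well below it.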
--- a/PyTorch/SpeechSynthesis/FastPitch/scripts/train_benchmark.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/scripts/train_benchmark.sh
@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+
+set -a
+
+: ${AMP:=false}
+: ${NUM_GPUS_SEQUENCE:="1 4 8"}
+: ${EPOCHS:=30}
+: ${OUTPUT_DIR:="./output"}
+: ${F0_METHOD:=praat}
+: ${BATCH_SIZE:=16}
+
+for NUM_GPUS in $NUM_GPUS_SEQUENCE ; do
+    GRAD_ACCUMULATION=$((256 / $BATCH_SIZE / $NUM_GPUS ))
+    LOG_FILE=$OUTPUT_DIR/perf-train_amp-${AMP}_${NUM_GPUS}x${BATCH_SIZE}x${GRAD_ACCUMULATION}.json
+    bash scripts/train.sh "$@"
+    rm -f $OUTPUT_DIR/FastPitch*.pt
+done
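Review note on `scripts/train_benchmark.sh`: the loop derives `GRAD_ACCUMULATION` so that `NUM_GPUS x BATCH_SIZE x GRAD_ACCUMULATION` stays at 256, the value `scripts/train.sh` warns about when it changes. A small sketch of the same integer arithmetic, with hypothetical function names, shows that it reproduces the values hard-coded in the `platform/` scripts and where it drifts:

```python
TARGET_GLOBAL_BATCH_SIZE = 256


def grad_accumulation(num_gpus, batch_size, target=TARGET_GLOBAL_BATCH_SIZE):
    """Same left-to-right integer division as $((256 / BATCH_SIZE / NUM_GPUS))."""
    return target // batch_size // num_gpus


def global_batch_size(num_gpus, batch_size, grad_acc):
    """Effective number of samples per optimizer step."""
    return num_gpus * batch_size * grad_acc


if __name__ == '__main__':
    # Benchmark default (BATCH_SIZE=16) and platform-script value (BATCH_SIZE=32).
    for batch_size in (16, 32):
        for num_gpus in (1, 4, 8):
            acc = grad_accumulation(num_gpus, batch_size)
            gbs = global_batch_size(num_gpus, batch_size, acc)
            print(f'{num_gpus}x{batch_size}x{acc} -> global batch size {gbs}')
```

Because the division truncates, a batch size that does not divide 256 changes the global batch size (for example, one GPU with a batch size of 24 yields 10 accumulation steps and a global batch of 240), and a per-step batch above 256 yields 0 accumulation steps. In both cases `train.sh` would print its warning rather than correct the value.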
--- a/PyTorch/SpeechSynthesis/FastPitch/tacotron2/arg_parser.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/tacotron2/arg_parser.py
@ -1,101 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import argparse
-
-def parse_tacotron2_args(parent, add_help=False):
-    """
-    Parse commandline arguments.
-    """
-    parser = argparse.ArgumentParser(parents=[parent], add_help=add_help)
-
-    # misc parameters
-    parser.add_argument('--mask-padding', default=False, type=bool,
-                        help='Use mask padding')
-    parser.add_argument('--n-mel-channels', default=80, type=int,
-                        help='Number of bins in mel-spectrograms')
-
-    # symbols parameters
-    symbols = parser.add_argument_group('symbols parameters')
-    symbols.add_argument('--symbols-embedding-dim', default=512, type=int,
-                         help='Input embedding dimension')
-
-    # encoder parameters
-    encoder = parser.add_argument_group('encoder parameters')
-    encoder.add_argument('--encoder-kernel-size', default=5, type=int,
-                         help='Encoder kernel size')
-    encoder.add_argument('--encoder-n-convolutions', default=3, type=int,
-                         help='Number of encoder convolutions')
-    encoder.add_argument('--encoder-embedding-dim', default=512, type=int,
-                         help='Encoder embedding dimension')
-
-    # decoder parameters
-    decoder = parser.add_argument_group('decoder parameters')
-    decoder.add_argument('--n-frames-per-step', default=1,
-                         type=int,
-                         help='Number of frames processed per step') # currently only 1 is supported
-    decoder.add_argument('--decoder-rnn-dim', default=1024, type=int,
-                         help='Number of units in decoder LSTM')
-    decoder.add_argument('--prenet-dim', default=256, type=int,
-                         help='Number of ReLU units in prenet layers')
-    decoder.add_argument('--max-decoder-steps', default=2000, type=int,
-                         help='Maximum number of output mel spectrograms')
-    decoder.add_argument('--gate-threshold', default=0.5, type=float,
-                         help='Probability threshold for stop token')
-    decoder.add_argument('--p-attention-dropout', default=0.1, type=float,
-                         help='Dropout probability for attention LSTM')
-    decoder.add_argument('--p-decoder-dropout', default=0.1, type=float,
-                         help='Dropout probability for decoder LSTM')
-    decoder.add_argument('--decoder-no-early-stopping', action='store_true',
-                         help='Stop decoding once all samples are finished')
-
-    # attention parameters
-    attention = parser.add_argument_group('attention parameters')
-    attention.add_argument('--attention-rnn-dim', default=1024, type=int,
-                           help='Number of units in attention LSTM')
-    attention.add_argument('--attention-dim', default=128, type=int,
-                           help='Dimension of attention hidden representation')
-
-    # location layer parameters
-    location = parser.add_argument_group('location parameters')
-    location.add_argument(
-        '--attention-location-n-filters', default=32, type=int,
-        help='Number of filters for location-sensitive attention')
-    location.add_argument(
-        '--attention-location-kernel-size', default=31, type=int,
-        help='Kernel size for location-sensitive attention')
-
-    # Mel-post processing network parameters
-    postnet = parser.add_argument_group('postnet parameters')
-    postnet.add_argument('--postnet-embedding-dim', default=512, type=int,
-                         help='Postnet embedding dimension')
-    postnet.add_argument('--postnet-kernel-size', default=5, type=int,
-                         help='Postnet kernel size')
-    postnet.add_argument('--postnet-n-convolutions', default=5, type=int,
-                         help='Number of postnet convolutions')
-
-    return parser
--- a/PyTorch/SpeechSynthesis/FastPitch/tacotron2/data_function.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/tacotron2/data_function.py
@ -1,173 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import random
-import numpy as np
-
-import torch
-import torch.utils.data
-
-import common.layers as layers
-from common.utils import load_wav_to_torch, load_filepaths_and_text, to_gpu
-
-class TextMelLoader(torch.utils.data.Dataset):
-    """
-        1) loads audio,text pairs
-        2) normalizes text and converts them to sequences of one-hot vectors
-        3) computes mel-spectrograms from audio files.
-    """
-    def __init__(self,
-                 dataset_path,
-                 audiopaths_and_text,
-                 text_cleaners,
-                 n_mel_channels,
-                 symbol_set='english_basic',
-                 n_speakers=1,
-                 load_mel_from_disk=True,
-                 max_wav_value=None,
-                 sampling_rate=None,
-                 filter_length=None,
-                 hop_length=None,
-                 win_length=None,
-                 mel_fmin=None,
-                 mel_fmax=None,
-                 **ignored):
-        self.audiopaths_and_text = load_filepaths_and_text(
-            dataset_path, audiopaths_and_text,
-            has_speakers=(n_speakers > 1))
-        self.load_mel_from_disk = load_mel_from_disk
-        if not load_mel_from_disk:
-            self.max_wav_value = max_wav_value
-            self.sampling_rate = sampling_rate
-            self.stft = layers.TacotronSTFT(
-                filter_length, hop_length, win_length,
-                n_mel_channels, sampling_rate, mel_fmin, mel_fmax)
-
-    def get_mel(self, filename):
-        if not self.load_mel_from_disk:
-            audio, sampling_rate = load_wav_to_torch(filename)
-            if sampling_rate != self.stft.sampling_rate:
-                raise ValueError("{} {} SR doesn't match target {} SR".format(
-                    sampling_rate, self.stft.sampling_rate))
-            audio_norm = audio / self.max_wav_value
-            audio_norm = audio_norm.unsqueeze(0)
-            audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)
-            melspec = self.stft.mel_spectrogram(audio_norm)
-            melspec = torch.squeeze(melspec, 0)
-        else:
-            melspec = torch.load(filename)
-            # assert melspec.size(0) == self.stft.n_mel_channels, (
-            #     'Mel dimension mismatch: given {}, expected {}'.format(
-            #         melspec.size(0), self.stft.n_mel_channels))
-
-        return melspec
-
-    def get_text(self, text):
-        text_encoded = torch.IntTensor(self.tp.encode_text(text))
-        return text_encoded
-
-    def __getitem__(self, index):
-        # separate filename and text
-        audiopath, text = self.audiopaths_and_text[index]
-        text = self.get_text(text)
-        len_text = len(text)
-        mel = self.get_mel(audiopath)
-        return (text, mel, len_text)
-
-    def __len__(self):
-        return len(self.audiopaths_and_text)
-
-
-class TextMelCollate():
-    """ Zero-pads model inputs and targets based on number of frames per step
-    """
-    def __init__(self, n_frames_per_step):
-        self.n_frames_per_step = n_frames_per_step
-
-    def __call__(self, batch):
-        """Collate's training batch from normalized text and mel-spectrogram
-        PARAMS
-        ------
-        batch: [text_normalized, mel_normalized]
-        """
-        # Right zero-pad all one-hot text sequences to max input length
-        input_lengths, ids_sorted_decreasing = torch.sort(
-            torch.LongTensor([len(x[0]) for x in batch]),
-            dim=0, descending=True)
-        max_input_len = input_lengths[0]
-
-        text_padded = torch.LongTensor(len(batch), max_input_len)
-        text_padded.zero_()
-        for i in range(len(ids_sorted_decreasing)):
-            text = batch[ids_sorted_decreasing[i]][0]
-            text_padded[i, :text.size(0)] = text
-
-        # Right zero-pad mel-spec
-        num_mels = batch[0][1].size(0)
-        max_target_len = max([x[1].size(1) for x in batch])
-        if max_target_len % self.n_frames_per_step != 0:
-            max_target_len += self.n_frames_per_step - max_target_len % self.n_frames_per_step
-            assert max_target_len % self.n_frames_per_step == 0
-
-        # include mel padded and gate padded
-        mel_padded = torch.FloatTensor(len(batch), num_mels, max_target_len)
-        mel_padded.zero_()
-        gate_padded = torch.FloatTensor(len(batch), max_target_len)
-        gate_padded.zero_()
-        output_lengths = torch.LongTensor(len(batch))
-        for i in range(len(ids_sorted_decreasing)):
-            mel = batch[ids_sorted_decreasing[i]][1]
-            mel_padded[i, :, :mel.size(1)] = mel
-            gate_padded[i, mel.size(1)-1:] = 1
-            output_lengths[i] = mel.size(1)
-
-        # count number of items - characters in text
-        len_x = [x[2] for x in batch]
-        len_x = torch.Tensor(len_x)
-
-        # Return any extra fields as sorted lists
-        num_fields = len(batch[0])
-        extra_fields = tuple([batch[i][f] for i in ids_sorted_decreasing]
-                             for f in range(3, num_fields))
-
-        return (text_padded, input_lengths, mel_padded, gate_padded, \
-            output_lengths, len_x) + extra_fields
-
-
-def batch_to_gpu(batch):
-    text_padded, input_lengths, mel_padded, gate_padded, \
-        output_lengths, len_x = batch
-    text_padded = to_gpu(text_padded).long()
-    input_lengths = to_gpu(input_lengths).long()
-    max_len = torch.max(input_lengths.data).item()
-    mel_padded = to_gpu(mel_padded).float()
-    gate_padded = to_gpu(gate_padded).float()
-    output_lengths = to_gpu(output_lengths).long()
-    x = (text_padded, input_lengths, mel_padded, max_len, output_lengths)
-    y = (mel_padded, gate_padded)
-    len_x = torch.sum(output_lengths)
-    return (x, y, len_x)
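
For readers following the removed `TextMelCollate`, the padding rules it applies can be sketched without tensors. This is a minimal illustration, not the repository's code: `pad_plan` and its arguments are hypothetical names, and plain Python lists stand in for the `torch` tensors used above. Note that Python's `sorted` is stable, while the removed code relied on `torch.sort`, so the order of equal-length samples may differ.

```python
def pad_plan(text_lens, mel_lens, n_frames_per_step):
    """Return (order, max_text_len, max_mel_len, gates) for one batch.

    Hypothetical helper that mirrors the removed collate logic on lengths only.
    """
    # Sort sample indices by text length, longest first.
    order = sorted(range(len(text_lens)),
                   key=lambda i: text_lens[i], reverse=True)
    max_text_len = text_lens[order[0]]

    # Round the longest mel length up to a multiple of n_frames_per_step.
    max_mel_len = max(mel_lens)
    remainder = max_mel_len % n_frames_per_step
    if remainder != 0:
        max_mel_len += n_frames_per_step - remainder

    # Gate target: 0 before the last real frame, 1 from that frame onward
    # (this corresponds to gate_padded[i, mel.size(1)-1:] = 1).
    gates = []
    for i in order:
        n = mel_lens[i]
        gates.append([0.0] * (n - 1) + [1.0] * (max_mel_len - n + 1))
    return order, max_text_len, max_mel_len, gates
```

With text lengths `[3, 5, 4]`, mel lengths `[7, 10, 9]` and `n_frames_per_step=4`, the samples are reordered to `[1, 2, 0]` and the mel axis is padded from 10 to 12 frames.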
--- a/PyTorch/SpeechSynthesis/FastPitch/tacotron2/loss_function.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/tacotron2/loss_function.py
@ -1,47 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-from torch import nn
-
-
-class Tacotron2Loss(nn.Module):
-    def __init__(self):
-        super(Tacotron2Loss, self).__init__()
-
-    def forward(self, model_output, targets):
-        mel_target, gate_target = targets[0], targets[1]
-        mel_target.requires_grad = False
-        gate_target.requires_grad = False
-        gate_target = gate_target.view(-1, 1)
-
-        mel_out, mel_out_postnet, gate_out, _ = model_output
-        gate_out = gate_out.view(-1, 1)
-        mel_loss = nn.MSELoss()(mel_out, mel_target) + \
-            nn.MSELoss()(mel_out_postnet, mel_target)
-        gate_loss = nn.BCEWithLogitsLoss()(gate_out, gate_target)
-        meta = {}
-        return mel_loss + gate_loss, meta
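
The removed `Tacotron2Loss` sums two mean-squared errors (decoder output and post-net output against the same mel target) and a binary cross-entropy on the gate logits. The sketch below restates that sum in plain Python over flat lists, assuming the default mean reduction of `nn.MSELoss` and `nn.BCEWithLogitsLoss`; the function names are illustrative and not part of the repository.

```python
import math


def mse(pred, target):
    # Mean of squared differences, as with the default 'mean' reduction.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce_with_logits(logits, targets):
    # Numerically stable form: max(x, 0) - x * t + log(1 + exp(-|x|)).
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def tacotron2_loss(mel_out, mel_out_postnet, gate_out, mel_target, gate_target):
    # Both mel predictions are compared against the same target.
    mel_loss = mse(mel_out, mel_target) + mse(mel_out_postnet, mel_target)
    gate_loss = bce_with_logits(gate_out, gate_target)
    return mel_loss + gate_loss
```

When both mel predictions match the target exactly and every gate logit is zero, only the gate term remains and it equals `log(2)`.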
--- a/PyTorch/SpeechSynthesis/FastPitch/tacotron2/model.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/tacotron2/model.py
@ -1,695 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import sys
-from math import sqrt
-from os.path import abspath, dirname
-
-import torch
-from torch import nn
-from torch.nn import functional as F
-
-# enabling modules discovery from global entrypoint
-sys.path.append(abspath(dirname(__file__)+'/../'))
-from common.layers import ConvNorm, LinearNorm
-from common.utils import mask_from_lens, to_gpu
-
-
-class LocationLayer(nn.Module):
-    def __init__(self, attention_n_filters, attention_kernel_size,
-                 attention_dim):
-        super(LocationLayer, self).__init__()
-        padding = int((attention_kernel_size - 1) / 2)
-        self.location_conv = ConvNorm(2, attention_n_filters,
-                                      kernel_size=attention_kernel_size,
-                                      padding=padding, bias=False, stride=1,
-                                      dilation=1)
-        self.location_dense = LinearNorm(attention_n_filters, attention_dim,
-                                         bias=False, w_init_gain='tanh')
-
-    def forward(self, attention_weights_cat):
-        processed_attention = self.location_conv(attention_weights_cat)
-        processed_attention = processed_attention.transpose(1, 2)
-        processed_attention = self.location_dense(processed_attention)
-        return processed_attention
-
-
-class Attention(nn.Module):
-    def __init__(self, attention_rnn_dim, embedding_dim,
-                 attention_dim, attention_location_n_filters,
-                 attention_location_kernel_size):
-        super(Attention, self).__init__()
-        self.query_layer = LinearNorm(attention_rnn_dim, attention_dim,
-                                      bias=False, w_init_gain='tanh')
-        self.memory_layer = LinearNorm(embedding_dim, attention_dim, bias=False,
-                                       w_init_gain='tanh')
-        self.v = LinearNorm(attention_dim, 1, bias=False)
-        self.location_layer = LocationLayer(attention_location_n_filters,
-                                            attention_location_kernel_size,
-                                            attention_dim)
-        self.score_mask_value = -float("inf")
-
-    def get_alignment_energies(self, query, processed_memory,
-                               attention_weights_cat):
-        """
-        PARAMS
-        ------
-        query: decoder output (batch, n_mel_channels * n_frames_per_step)
-        processed_memory: processed encoder outputs (B, T_in, attention_dim)
-        attention_weights_cat: cumulative and prev. att weights (B, 2, max_time)
-
-        RETURNS
-        -------
-        alignment (batch, max_time)
-        """
-
-        processed_query = self.query_layer(query.unsqueeze(1))
-        processed_attention_weights = self.location_layer(attention_weights_cat)
-        energies = self.v(torch.tanh(
-            processed_query + processed_attention_weights + processed_memory))
-
-        energies = energies.squeeze(-1)
-        return energies
-
-    def forward(self, attention_hidden_state, memory, processed_memory,
-                attention_weights_cat, mask):
-        """
-        PARAMS
-        ------
-        attention_hidden_state: attention rnn last output
-        memory: encoder outputs
-        processed_memory: processed encoder outputs
-        attention_weights_cat: previous and cumulative attention weights
-        mask: binary mask for padded data
-        """
-        alignment = self.get_alignment_energies(
-            attention_hidden_state, processed_memory, attention_weights_cat)
-
-        alignment.masked_fill_(mask, self.score_mask_value)
-
-        attention_weights = F.softmax(alignment, dim=1)
-        attention_context = torch.bmm(attention_weights.unsqueeze(1), memory)
-        attention_context = attention_context.squeeze(1)
-
-        return attention_context, attention_weights
-
-
-class Prenet(nn.Module):
-    def __init__(self, in_dim, sizes):
-        super(Prenet, self).__init__()
-        in_sizes = [in_dim] + sizes[:-1]
-        self.layers = nn.ModuleList(
-            [LinearNorm(in_size, out_size, bias=False)
-             for (in_size, out_size) in zip(in_sizes, sizes)])
-
-    def forward(self, x):
-        for linear in self.layers:
-            x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
-        return x
-
-
-class Postnet(nn.Module):
-    """Postnet
-        - Five 1-d convolutions with 512 channels and kernel size 5
-    """
-
-    def __init__(self, n_mel_channels, postnet_embedding_dim,
-                 postnet_kernel_size, postnet_n_convolutions):
-        super(Postnet, self).__init__()
-        self.convolutions = nn.ModuleList()
-
-        self.convolutions.append(
-            nn.Sequential(
-                ConvNorm(n_mel_channels, postnet_embedding_dim,
-                         kernel_size=postnet_kernel_size, stride=1,
-                         padding=int((postnet_kernel_size - 1) / 2),
-                         dilation=1, w_init_gain='tanh'),
-                nn.BatchNorm1d(postnet_embedding_dim))
-        )
-
-        for i in range(1, postnet_n_convolutions - 1):
-            self.convolutions.append(
-                nn.Sequential(
-                    ConvNorm(postnet_embedding_dim,
-                             postnet_embedding_dim,
-                             kernel_size=postnet_kernel_size, stride=1,
-                             padding=int((postnet_kernel_size - 1) / 2),
-                             dilation=1, w_init_gain='tanh'),
-                    nn.BatchNorm1d(postnet_embedding_dim))
-            )
-
-        self.convolutions.append(
-            nn.Sequential(
-                ConvNorm(postnet_embedding_dim, n_mel_channels,
-                         kernel_size=postnet_kernel_size, stride=1,
-                         padding=int((postnet_kernel_size - 1) / 2),
-                         dilation=1, w_init_gain='linear'),
-                nn.BatchNorm1d(n_mel_channels))
-        )
-        self.n_convs = len(self.convolutions)
-
-    def forward(self, x):
-        i = 0
-        for conv in self.convolutions:
-            if i < self.n_convs - 1:
-                x = F.dropout(torch.tanh(conv(x)), 0.5, training=self.training)
-            else:
-                x = F.dropout(conv(x), 0.5, training=self.training)
-            i += 1
-
-        return x
-
-
-class Encoder(nn.Module):
-    """Encoder module:
-        - Three 1-d convolution banks
-        - Bidirectional LSTM
-    """
-    def __init__(self, encoder_n_convolutions,
-                 encoder_embedding_dim, encoder_kernel_size):
-        super(Encoder, self).__init__()
-
-        convolutions = []
-        for _ in range(encoder_n_convolutions):
-            conv_layer = nn.Sequential(
-                ConvNorm(encoder_embedding_dim,
-                         encoder_embedding_dim,
-                         kernel_size=encoder_kernel_size, stride=1,
-                         padding=int((encoder_kernel_size - 1) / 2),
-                         dilation=1, w_init_gain='relu'),
-                nn.BatchNorm1d(encoder_embedding_dim))
-            convolutions.append(conv_layer)
-        self.convolutions = nn.ModuleList(convolutions)
-
-        self.lstm = nn.LSTM(encoder_embedding_dim,
-                            int(encoder_embedding_dim / 2), 1,
-                            batch_first=True, bidirectional=True)
-
-    @torch.jit.ignore
-    def forward(self, x, input_lengths):
-        for conv in self.convolutions:
-            x = F.dropout(F.relu(conv(x)), 0.5, self.training)
-
-        x = x.transpose(1, 2)
-
-        # PyTorch tensors are not reversible, hence the conversion
-        input_lengths = input_lengths.cpu().numpy()
-        x = nn.utils.rnn.pack_padded_sequence(
-            x, input_lengths, batch_first=True)
-
-        self.lstm.flatten_parameters()
-        outputs, _ = self.lstm(x)
-
-        outputs, _ = nn.utils.rnn.pad_packed_sequence(
-            outputs, batch_first=True)
-
-        return outputs
-
-    @torch.jit.export
-    def infer(self, x, input_lengths):
-        device = x.device
-        for conv in self.convolutions:
-            x = F.dropout(F.relu(conv(x.to(device))), 0.5, self.training)
-
-        x = x.transpose(1, 2)
-
-        input_lengths = input_lengths.cpu()
-        x = nn.utils.rnn.pack_padded_sequence(
-            x, input_lengths, batch_first=True)
-
-        outputs, _ = self.lstm(x)
-
-        outputs, _ = nn.utils.rnn.pad_packed_sequence(
-            outputs, batch_first=True)
-
-        return outputs
-
-
-class Decoder(nn.Module):
-    def __init__(self, n_mel_channels, n_frames_per_step,
-                 encoder_embedding_dim, attention_dim,
-                 attention_location_n_filters,
-                 attention_location_kernel_size,
-                 attention_rnn_dim, decoder_rnn_dim,
-                 prenet_dim, max_decoder_steps, gate_threshold,
-                 p_attention_dropout, p_decoder_dropout,
-                 early_stopping):
-        super(Decoder, self).__init__()
-        self.n_mel_channels = n_mel_channels
-        self.n_frames_per_step = n_frames_per_step
-        self.encoder_embedding_dim = encoder_embedding_dim
-        self.attention_rnn_dim = attention_rnn_dim
-        self.decoder_rnn_dim = decoder_rnn_dim
-        self.prenet_dim = prenet_dim
-        self.max_decoder_steps = max_decoder_steps
-        self.gate_threshold = gate_threshold
-        self.p_attention_dropout = p_attention_dropout
-        self.p_decoder_dropout = p_decoder_dropout
-        self.early_stopping = early_stopping
-
-        self.prenet = Prenet(
-            n_mel_channels * n_frames_per_step,
-            [prenet_dim, prenet_dim])
-
-        self.attention_rnn = nn.LSTMCell(
-            prenet_dim + encoder_embedding_dim,
-            attention_rnn_dim)
-
-        self.attention_layer = Attention(
-            attention_rnn_dim, encoder_embedding_dim,
-            attention_dim, attention_location_n_filters,
-            attention_location_kernel_size)
-
-        self.decoder_rnn = nn.LSTMCell(
-            attention_rnn_dim + encoder_embedding_dim,
-            decoder_rnn_dim, 1)
-
-        self.linear_projection = LinearNorm(
-            decoder_rnn_dim + encoder_embedding_dim,
-            n_mel_channels * n_frames_per_step)
-
-        self.gate_layer = LinearNorm(
-            decoder_rnn_dim + encoder_embedding_dim, 1,
-            bias=True, w_init_gain='sigmoid')
-
-    def get_go_frame(self, memory):
-        """ Gets all zeros frames to use as first decoder input
-        PARAMS
-        ------
-        memory: encoder outputs
-
-        RETURNS
-        -------
-        decoder_input: all zeros frames
-        """
-        B = memory.size(0)
-        dtype = memory.dtype
-        device = memory.device
-        decoder_input = torch.zeros(
-            B, self.n_mel_channels*self.n_frames_per_step,
-            dtype=dtype, device=device)
-        return decoder_input
-
-    def initialize_decoder_states(self, memory):
-        """ Initializes attention rnn states, decoder rnn states, attention
-        weights, attention cumulative weights, attention context, stores memory
-        and stores processed memory
-        PARAMS
-        ------
-        memory: Encoder outputs
-        mask: Mask for padded data if training, expects None for inference
-        """
-        B = memory.size(0)
-        MAX_TIME = memory.size(1)
-        dtype = memory.dtype
-        device = memory.device
-
-        attention_hidden = torch.zeros(
-            B, self.attention_rnn_dim, dtype=dtype, device=device)
-        attention_cell = torch.zeros(
-            B, self.attention_rnn_dim, dtype=dtype, device=device)
-
-        decoder_hidden = torch.zeros(
-            B, self.decoder_rnn_dim, dtype=dtype, device=device)
-        decoder_cell = torch.zeros(
-            B, self.decoder_rnn_dim, dtype=dtype, device=device)
-
-        attention_weights = torch.zeros(
-            B, MAX_TIME, dtype=dtype, device=device)
-        attention_weights_cum = torch.zeros(
-            B, MAX_TIME, dtype=dtype, device=device)
-        attention_context = torch.zeros(
-            B, self.encoder_embedding_dim, dtype=dtype, device=device)
-
-        processed_memory = self.attention_layer.memory_layer(memory)
-
-        return (attention_hidden, attention_cell, decoder_hidden,
-                decoder_cell, attention_weights, attention_weights_cum,
-                attention_context, processed_memory)
-
-    def parse_decoder_inputs(self, decoder_inputs):
-        """ Prepares decoder inputs, i.e. mel outputs
-        PARAMS
-        ------
-        decoder_inputs: inputs used for teacher-forced training, i.e. mel-specs
-
-        RETURNS
-        -------
-        inputs: processed decoder inputs
-
-        """
-        # (B, n_mel_channels, T_out) -> (B, T_out, n_mel_channels)
-        decoder_inputs = decoder_inputs.transpose(1, 2)
-        decoder_inputs = decoder_inputs.view(
-            decoder_inputs.size(0),
-            int(decoder_inputs.size(1)/self.n_frames_per_step), -1)
-        # (B, T_out, n_mel_channels) -> (T_out, B, n_mel_channels)
-        decoder_inputs = decoder_inputs.transpose(0, 1)
-        return decoder_inputs
-
-    def parse_decoder_outputs(self, mel_outputs, gate_outputs, alignments):
-        """ Prepares decoder outputs for output
-        PARAMS
-        ------
-        mel_outputs:
-        gate_outputs: gate output energies
-        alignments:
-
-        RETURNS
-        -------
-        mel_outputs:
-        gate_outputs: gate output energies
-        alignments:
-        """
-        # (T_out, B) -> (B, T_out)
-        alignments = alignments.transpose(0, 1).contiguous()
-        # (T_out, B) -> (B, T_out)
-        gate_outputs = gate_outputs.transpose(0, 1).contiguous()
-        # (T_out, B, n_mel_channels) -> (B, T_out, n_mel_channels)
-        mel_outputs = mel_outputs.transpose(0, 1).contiguous()
-        # decouple frames per step
-        shape = (mel_outputs.shape[0], -1, self.n_mel_channels)
-        mel_outputs = mel_outputs.view(*shape)
-        # (B, T_out, n_mel_channels) -> (B, n_mel_channels, T_out)
-        mel_outputs = mel_outputs.transpose(1, 2)
-
-        return mel_outputs, gate_outputs, alignments
-
-    def decode(self, decoder_input, attention_hidden, attention_cell,
-               decoder_hidden, decoder_cell, attention_weights,
-               attention_weights_cum, attention_context, memory,
-               processed_memory, mask):
-        """ Decoder step using stored states, attention and memory
-        PARAMS
-        ------
-        decoder_input: previous mel output
-
-        RETURNS
-        -------
-        mel_output:
-        gate_output: gate output energies
-        attention_weights:
-        """
-        cell_input = torch.cat((decoder_input, attention_context), -1)
-
-        attention_hidden, attention_cell = self.attention_rnn(
-            cell_input, (attention_hidden, attention_cell))
-        attention_hidden = F.dropout(
-            attention_hidden, self.p_attention_dropout, self.training)
-
-        attention_weights_cat = torch.cat(
-            (attention_weights.unsqueeze(1),
-             attention_weights_cum.unsqueeze(1)), dim=1)
-        attention_context, attention_weights = self.attention_layer(
-            attention_hidden, memory, processed_memory,
-            attention_weights_cat, mask)
-
-        attention_weights_cum += attention_weights
-        decoder_input = torch.cat(
-            (attention_hidden, attention_context), -1)
-
-        decoder_hidden, decoder_cell = self.decoder_rnn(
-            decoder_input, (decoder_hidden, decoder_cell))
-        decoder_hidden = F.dropout(
-            decoder_hidden, self.p_decoder_dropout, self.training)
-
-        decoder_hidden_attention_context = torch.cat(
-            (decoder_hidden, attention_context), dim=1)
-        decoder_output = self.linear_projection(
-            decoder_hidden_attention_context)
-
-        gate_prediction = self.gate_layer(decoder_hidden_attention_context)
-
-        return (decoder_output, gate_prediction, attention_hidden,
-                attention_cell, decoder_hidden, decoder_cell, attention_weights,
-                attention_weights_cum, attention_context)
-
-    @torch.jit.ignore
-    def forward(self, memory, decoder_inputs, memory_lengths):
-        """ Decoder forward pass for training
-        PARAMS
-        ------
-        memory: Encoder outputs
-        decoder_inputs: Decoder inputs for teacher forcing. i.e. mel-specs
-        memory_lengths: Encoder output lengths for attention masking.
-
-        RETURNS
-        -------
-        mel_outputs: mel outputs from the decoder
-        gate_outputs: gate outputs from the decoder
-        alignments: sequence of attention weights from the decoder
-        """
-
-        decoder_input = self.get_go_frame(memory).unsqueeze(0)
-        decoder_inputs = self.parse_decoder_inputs(decoder_inputs)
-        decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0)
-        decoder_inputs = self.prenet(decoder_inputs)
-
-        mask = ~mask_from_lens(memory_lengths)
-        (attention_hidden,
-         attention_cell,
-         decoder_hidden,
-         decoder_cell,
-         attention_weights,
-         attention_weights_cum,
-         attention_context,
-         processed_memory) = self.initialize_decoder_states(memory)
-
-        mel_outputs, gate_outputs, alignments = [], [], []
-        while len(mel_outputs) < decoder_inputs.size(0) - 1:
-            decoder_input = decoder_inputs[len(mel_outputs)]
-            (mel_output,
-             gate_output,
-             attention_hidden,
-             attention_cell,
-             decoder_hidden,
-             decoder_cell,
-             attention_weights,
-             attention_weights_cum,
-             attention_context) = self.decode(decoder_input,
-                                              attention_hidden,
-                                              attention_cell,
-                                              decoder_hidden,
-                                              decoder_cell,
-                                              attention_weights,
-                                              attention_weights_cum,
-                                              attention_context,
-                                              memory,
-                                              processed_memory,
-                                              mask)
-
-            mel_outputs += [mel_output.squeeze(1)]
-            gate_outputs += [gate_output.squeeze()]
-            alignments += [attention_weights]
-
-        mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs(
-            torch.stack(mel_outputs),
-            torch.stack(gate_outputs),
-            torch.stack(alignments))
-
-        return mel_outputs, gate_outputs, alignments
-
-    @torch.jit.export
-    def infer(self, memory, memory_lengths):
-        """ Decoder inference
-        PARAMS
-        ------
-        memory: Encoder outputs
-
-        RETURNS
-        -------
-        mel_outputs: mel outputs from the decoder
-        gate_outputs: gate outputs from the decoder
-        alignments: sequence of attention weights from the decoder
-        """
-        decoder_input = self.get_go_frame(memory)
-
-        mask = ~mask_from_lens(memory_lengths)
-        (attention_hidden,
-         attention_cell,
-         decoder_hidden,
-         decoder_cell,
-         attention_weights,
-         attention_weights_cum,
-         attention_context,
-         processed_memory) = self.initialize_decoder_states(memory)
-
-        mel_lengths = torch.zeros([memory.size(0)], dtype=torch.int32).cuda()
-        not_finished = torch.ones([memory.size(0)], dtype=torch.int32).cuda()
-
-        mel_outputs, gate_outputs, alignments = (
-            torch.zeros(1), torch.zeros(1), torch.zeros(1))
-        first_iter = True
-        while True:
-            decoder_input = self.prenet(decoder_input)
-            (mel_output,
-             gate_output,
-             attention_hidden,
-             attention_cell,
-             decoder_hidden,
-             decoder_cell,
-             attention_weights,
-             attention_weights_cum,
-             attention_context) = self.decode(decoder_input,
-                                              attention_hidden,
-                                              attention_cell,
-                                              decoder_hidden,
-                                              decoder_cell,
-                                              attention_weights,
-                                              attention_weights_cum,
-                                              attention_context,
-                                              memory,
-                                              processed_memory,
-                                              mask)
-
-            if first_iter:
-                mel_outputs = mel_output.unsqueeze(0)
-                gate_outputs = gate_output
-                alignments = attention_weights
-                first_iter = False
-            else:
-                mel_outputs = torch.cat(
-                    (mel_outputs, mel_output.unsqueeze(0)), dim=0)
-                gate_outputs = torch.cat((gate_outputs, gate_output), dim=0)
-                alignments = torch.cat((alignments, attention_weights), dim=0)
-
-            dec = torch.le(torch.sigmoid(gate_output),
-                           self.gate_threshold).to(torch.int32).squeeze(1)
-
-            not_finished = not_finished*dec
-            mel_lengths += not_finished
-
-            if self.early_stopping and torch.sum(not_finished) == 0:
-                break
-            if len(mel_outputs) == self.max_decoder_steps:
-                print("Warning! Reached max decoder steps")
-                break
-
-            decoder_input = mel_output
-
-        # NOTE(Adrian): This makes it consistent with training-time dims
-        # (ML x B) x L --> ML x B x L
-        mel_len, bsz, _ = mel_outputs.size()
-        alignments = alignments.view(mel_len, bsz, -1)
-
-        mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs(
-            mel_outputs, gate_outputs, alignments)
-
-        return mel_outputs, gate_outputs, alignments, mel_lengths
-
-
-class Tacotron2(nn.Module):
-    def __init__(self, mask_padding, n_mel_channels,
-                 n_symbols, symbols_embedding_dim, encoder_kernel_size,
-                 encoder_n_convolutions, encoder_embedding_dim,
-                 attention_rnn_dim, attention_dim, attention_location_n_filters,
-                 attention_location_kernel_size, n_frames_per_step,
-                 decoder_rnn_dim, prenet_dim, max_decoder_steps, gate_threshold,
-                 p_attention_dropout, p_decoder_dropout,
-                 postnet_embedding_dim, postnet_kernel_size,
-                 postnet_n_convolutions, decoder_no_early_stopping):
-        super(Tacotron2, self).__init__()
-        self.mask_padding = mask_padding
-        self.n_mel_channels = n_mel_channels
-        self.n_frames_per_step = n_frames_per_step
-        self.embedding = nn.Embedding(n_symbols, symbols_embedding_dim)
-        std = sqrt(2.0 / (n_symbols + symbols_embedding_dim))
-        val = sqrt(3.0) * std  # uniform bounds for std
-        self.embedding.weight.data.uniform_(-val, val)
-        self.encoder = Encoder(encoder_n_convolutions,
-                               encoder_embedding_dim,
-                               encoder_kernel_size)
-        self.decoder = Decoder(n_mel_channels, n_frames_per_step,
-                               encoder_embedding_dim, attention_dim,
-                               attention_location_n_filters,
-                               attention_location_kernel_size,
-                               attention_rnn_dim, decoder_rnn_dim,
-                               prenet_dim, max_decoder_steps,
-                               gate_threshold, p_attention_dropout,
-                               p_decoder_dropout,
-                               not decoder_no_early_stopping)
-        self.postnet = Postnet(n_mel_channels, postnet_embedding_dim,
-                               postnet_kernel_size,
-                               postnet_n_convolutions)
-
-    def parse_batch(self, batch):
-        text_padded, input_lengths, mel_padded, gate_padded, \
-            output_lengths = batch
-        text_padded = to_gpu(text_padded).long()
-        input_lengths = to_gpu(input_lengths).long()
-        max_len = torch.max(input_lengths.data).item()
-        mel_padded = to_gpu(mel_padded).float()
-        gate_padded = to_gpu(gate_padded).float()
-        output_lengths = to_gpu(output_lengths).long()
-
-        return (
-            (text_padded, input_lengths, mel_padded, max_len, output_lengths),
-            (mel_padded, gate_padded))
-
-    def parse_output(self, outputs, output_lengths):
-        # type: (List[Tensor], Tensor) -> List[Tensor]
-        if self.mask_padding and output_lengths is not None:
-            mask = ~mask_from_lens(output_lengths)
-            mask = mask.expand(self.n_mel_channels, mask.size(0), mask.size(1))
-            mask = mask.permute(1, 0, 2)
-
-            outputs[0].masked_fill_(mask, 0.0)
-            outputs[1].masked_fill_(mask, 0.0)
-            outputs[2].masked_fill_(mask[:, 0, :], 1e3)  # gate energies
-
-        return outputs
-
-    def forward(self, inputs):
-        inputs, input_lengths, targets, max_len, output_lengths = inputs
-        input_lengths, output_lengths = input_lengths.data, output_lengths.data
-
-        embedded_inputs = self.embedding(inputs).transpose(1, 2)
-
-        encoder_outputs = self.encoder(embedded_inputs, input_lengths)
-
-        mel_outputs, gate_outputs, alignments = self.decoder(
-            encoder_outputs, targets, memory_lengths=input_lengths)
-
-        mel_outputs_postnet = self.postnet(mel_outputs)
-        mel_outputs_postnet = mel_outputs + mel_outputs_postnet
-
-        return self.parse_output(
-            [mel_outputs, mel_outputs_postnet, gate_outputs, alignments],
-            output_lengths)
-
-
-    def infer(self, inputs, input_lengths):
-
-        embedded_inputs = self.embedding(inputs).transpose(1, 2)
-        encoder_outputs = self.encoder.infer(embedded_inputs, input_lengths)
-        mel_outputs, gate_outputs, alignments, mel_lengths = self.decoder.infer(
-            encoder_outputs, input_lengths)
-
-        mel_outputs_postnet = self.postnet(mel_outputs)
-        mel_outputs_postnet = mel_outputs + mel_outputs_postnet
-
-        return mel_outputs_postnet, mel_lengths  # XXX , alignments
--- a/PyTorch/SpeechSynthesis/FastPitch/train.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/train.py
@ -27,112 +27,159 @@

 import argparse
 import copy
-import json
 import glob
 import os
 import re
 import time
+import warnings
 from collections import defaultdict, OrderedDict
-from contextlib import contextmanager
+
+try:
+    import nvidia_dlprof_pytorch_nvtx as pyprof
+except ModuleNotFoundError:
+    try:
+        import pyprof
+    except ModuleNotFoundError:
+        warnings.warn('PyProf is unavailable')

 import numpy as np
-import nvidia_dlprof_pytorch_nvtx as pyprof
 import torch
 import torch.cuda.profiler as profiler
 import torch.distributed as dist
-from scipy.io.wavfile import write as write_wav
-from torch.autograd import Variable
+import amp_C
+from apex.optimizers import FusedAdam, FusedLAMB
 from torch.nn.parallel import DistributedDataParallel
-from torch.nn.parameter import Parameter
 from torch.utils.data import DataLoader
 from torch.utils.data.distributed import DistributedSampler

 import common.tb_dllogger as logger
-from apex import amp
-from apex.optimizers import FusedAdam, FusedLAMB
-from apex.multi_tensor_apply import multi_tensor_applier
-import amp_C
-
-import common
-import data_functions
-import loss_functions
 import models
+from common.text import cmudict
+from common.utils import prepare_tmp
+from fastpitch.attn_loss_function import AttentionBinarizationLoss
+from fastpitch.data_function import batch_to_gpu, TTSCollate, TTSDataset
+from fastpitch.loss_function import FastPitchLoss


 def parse_args(parser):
-    """
-    Parse commandline arguments.
-    """
    parser.add_argument('-o', '--output', type=str, required=True,
                        help='Directory to save checkpoints')
    parser.add_argument('-d', '--dataset-path', type=str, default='./',
                        help='Path to dataset')
    parser.add_argument('--log-file', type=str, default=None,
                        help='Path to a DLLogger log file')
-    parser.add_argument('--pyprof', action='store_true', help='Enable pyprof profiling')
+    parser.add_argument('--pyprof', action='store_true',
+                        help='Enable pyprof profiling')

-    training = parser.add_argument_group('training setup')
-    training.add_argument('--epochs', type=int, required=True,
-                          help='Number of total epochs to run')
-    training.add_argument('--epochs-per-checkpoint', type=int, default=50,
-                          help='Number of epochs per checkpoint')
-    training.add_argument('--checkpoint-path', type=str, default=None,
-                          help='Checkpoint path to resume training')
-    training.add_argument('--resume', action='store_true',
-                          help='Resume training from the last available checkpoint')
-    training.add_argument('--seed', type=int, default=1234,
-                          help='Seed for PyTorch random number generators')
-    training.add_argument('--amp', action='store_true',
-                          help='Enable AMP')
-    training.add_argument('--cuda', action='store_true',
-                          help='Run on GPU using CUDA')
-    training.add_argument('--cudnn-benchmark', action='store_true',
-                          help='Enable cudnn benchmark mode')
-    training.add_argument('--ema-decay', type=float, default=0,
-                          help='Discounting factor for training weights EMA')
-    training.add_argument('--gradient-accumulation-steps', type=int, default=1,
-                          help='Training steps to accumulate gradients for')
+    train = parser.add_argument_group('training setup')
+    train.add_argument('--epochs', type=int, required=True,
+                       help='Number of total epochs to run')
+    train.add_argument('--epochs-per-checkpoint', type=int, default=50,
+                       help='Number of epochs per checkpoint')
+    train.add_argument('--checkpoint-path', type=str, default=None,
+                       help='Checkpoint path to resume training')
+    train.add_argument('--resume', action='store_true',
+                       help='Resume training from the last checkpoint')
+    train.add_argument('--seed', type=int, default=1234,
+                       help='Seed for PyTorch random number generators')
+    train.add_argument('--amp', action='store_true',
+                       help='Enable AMP')
+    train.add_argument('--cuda', action='store_true',
+                       help='Run on GPU using CUDA')
+    train.add_argument('--cudnn-benchmark', action='store_true',
+                       help='Enable cudnn benchmark mode')
+    train.add_argument('--ema-decay', type=float, default=0,
+                       help='Discounting factor for training weights EMA')
+    train.add_argument('--grad-accumulation', type=int, default=1,
+                       help='Training steps to accumulate gradients for')
+    train.add_argument('--kl-loss-start-epoch', type=int, default=250,
+                       help='Start adding the hard attention loss term')
+    train.add_argument('--kl-loss-warmup-epochs', type=int, default=100,
+                       help='Epochs over which the hard attention loss term is ramped up')
+    train.add_argument('--kl-loss-weight', type=float, default=1.0,
+                       help='Final weight of the hard attention loss term')

-    optimization = parser.add_argument_group('optimization setup')
-    optimization.add_argument('--optimizer', type=str, default='lamb',
-                              help='Optimization algorithm')
-    optimization.add_argument('-lr', '--learning-rate', type=float, required=True,
-                              help='Learing rate')
-    optimization.add_argument('--weight-decay', default=1e-6, type=float,
-                              help='Weight decay')
-    optimization.add_argument('--grad-clip-thresh', default=1000.0, type=float,
-                              help='Clip threshold for gradients')
-    optimization.add_argument('-bs', '--batch-size', type=int, required=True,
-                              help='Batch size per GPU')
-    optimization.add_argument('--warmup-steps', type=int, default=1000,
-                              help='Number of steps for lr warmup')
-    optimization.add_argument('--dur-predictor-loss-scale', type=float,
-                              default=1.0, help='Rescale duration predictor loss')
-    optimization.add_argument('--pitch-predictor-loss-scale', type=float,
-                              default=1.0, help='Rescale pitch predictor loss')
+    opt = parser.add_argument_group('optimization setup')
+    opt.add_argument('--optimizer', type=str, default='lamb',
+                     help='Optimization algorithm')
+    opt.add_argument('-lr', '--learning-rate', type=float, required=True,
+                     help='Learning rate')
+    opt.add_argument('--weight-decay', default=1e-6, type=float,
+                     help='Weight decay')
+    opt.add_argument('--grad-clip-thresh', default=1000.0, type=float,
+                     help='Clip threshold for gradients')
+    opt.add_argument('-bs', '--batch-size', type=int, required=True,
+                     help='Batch size per GPU')
+    opt.add_argument('--warmup-steps', type=int, default=1000,
+                     help='Number of steps for lr warmup')
+    opt.add_argument('--dur-predictor-loss-scale', type=float,
+                     default=1.0, help='Rescale duration predictor loss')
+    opt.add_argument('--pitch-predictor-loss-scale', type=float,
+                     default=1.0, help='Rescale pitch predictor loss')
+    opt.add_argument('--attn-loss-scale', type=float,
+                     default=1.0, help='Rescale alignment loss')

-    dataset = parser.add_argument_group('dataset parameters')
-    dataset.add_argument('--training-files', type=str, required=True,
-                         help='Path to training filelist. Separate multiple paths with commas.')
-    dataset.add_argument('--validation-files', type=str, required=True,
-                         help='Path to validation filelist. Separate multiple paths with commas.')
-    dataset.add_argument('--pitch-mean-std-file', type=str, default=None,
-                         help='Path to pitch stats to be stored in the model')
-    dataset.add_argument('--text-cleaners', nargs='*',
-                         default=['english_cleaners'], type=str,
-                         help='Type of text cleaners for input text')
-    dataset.add_argument('--symbol-set', type=str, default='english_basic',
-                         help='Define symbol set for input text')
+    data = parser.add_argument_group('dataset parameters')
+    data.add_argument('--training-files', type=str, nargs='*', required=True,
+                      help='Paths to training filelists.')
+    data.add_argument('--validation-files', type=str, nargs='*',
+                      required=True, help='Paths to validation filelists')
+    data.add_argument('--text-cleaners', nargs='*',
+                      default=['english_cleaners'], type=str,
+                      help='Type of text cleaners for input text')
+    data.add_argument('--symbol-set', type=str, default='english_basic',
+                      help='Define symbol set for input text')
+    data.add_argument('--p-arpabet', type=float, default=0.0,
+                      help='Probability of using arpabets instead of graphemes '
+                           'for each word; set 0 for pure grapheme training')
+    data.add_argument('--heteronyms-path', type=str, default='cmudict/heteronyms',
+                      help='Path to the list of heteronyms')
+    data.add_argument('--cmudict-path', type=str, default='cmudict/cmudict-0.7b',
+                      help='Path to the pronouncing dictionary')
+    data.add_argument('--prepend-space-to-text', action='store_true',
+                      help='Capture leading silence with a space token')
+    data.add_argument('--append-space-to-text', action='store_true',
+                      help='Capture trailing silence with a space token')

-    cond = parser.add_argument_group('conditioning on additional attributes')
+    cond = parser.add_argument_group('data for conditioning')
    cond.add_argument('--n-speakers', type=int, default=1,
-                      help='Condition on speaker, value > 1 enables trainable speaker embeddings.')
+                      help='Number of speakers in the dataset. '
+                           'n_speakers > 1 enables speaker embeddings')
+    cond.add_argument('--load-pitch-from-disk', action='store_true',
+                      help='Use pitch cached on disk with prepare_dataset.py')
+    cond.add_argument('--pitch-online-method', default='praat',
+                      choices=['praat', 'pyin'],
+                      help='Calculate pitch on the fly during training')
+    cond.add_argument('--pitch-online-dir', type=str, default=None,
+                      help='A directory for storing pitch calculated on-line')
+    cond.add_argument('--pitch-mean', type=float, default=214.72203,
+                      help='Normalization value for pitch')
+    cond.add_argument('--pitch-std', type=float, default=65.72038,
+                      help='Normalization value for pitch')
+    cond.add_argument('--load-mel-from-disk', action='store_true',
+                      help='Use mel-spectrograms cached on the disk')  # XXX

-    distributed = parser.add_argument_group('distributed setup')
-    distributed.add_argument('--local_rank', type=int, default=os.getenv('LOCAL_RANK', 0),
-                             help='Rank of the process for multiproc. Do not set manually.')
-    distributed.add_argument('--world_size', type=int, default=os.getenv('WORLD_SIZE', 1),
-                             help='Number of processes for multiproc. Do not set manually.')
+    audio = parser.add_argument_group('audio parameters')
+    audio.add_argument('--max-wav-value', default=32768.0, type=float,
+                       help='Maximum audio waveform value')
+    audio.add_argument('--sampling-rate', default=22050, type=int,
+                       help='Sampling rate')
+    audio.add_argument('--filter-length', default=1024, type=int,
+                       help='Filter length')
+    audio.add_argument('--hop-length', default=256, type=int,
+                       help='Hop (stride) length')
+    audio.add_argument('--win-length', default=1024, type=int,
+                       help='Window length')
+    audio.add_argument('--mel-fmin', default=0.0, type=float,
+                       help='Minimum mel frequency')
+    audio.add_argument('--mel-fmax', default=8000.0, type=float,
+                       help='Maximum mel frequency')
+
+    dist = parser.add_argument_group('distributed setup')
+    dist.add_argument('--local_rank', type=int, default=os.getenv('LOCAL_RANK', 0),
+                      help='Rank of the process for multiproc; do not set manually')
+    dist.add_argument('--world_size', type=int, default=os.getenv('WORLD_SIZE', 1),
+                      help='Number of processes for multiproc; do not set manually')
    return parser


@ -162,7 +209,7 @@ def last_checkpoint(output):
            torch.load(fpath, map_location='cpu')
            return False
        except:
-            print(f'WARNING: Cannot load {fpath}')
+            warnings.warn(f'Cannot load {fpath}')
            return True

    saved = sorted(
@ -177,12 +224,19 @@ def last_checkpoint(output):
        return None


-def save_checkpoint(local_rank, model, ema_model, optimizer, epoch, total_iter,
-                    config, amp_run, filepath):
-    if local_rank != 0:
+def maybe_save_checkpoint(args, model, ema_model, optimizer, scaler, epoch,
+                          total_iter, config, final_checkpoint=False):
+    if args.local_rank != 0:
        return

-    print(f"Saving model and optimizer state at epoch {epoch} to {filepath}")
+    intermediate = (args.epochs_per_checkpoint > 0
+                    and epoch % args.epochs_per_checkpoint == 0)
+
+    if not intermediate and epoch < args.epochs:
+        return
+
+    fpath = os.path.join(args.output, f"FastPitch_checkpoint_{epoch}.pt")
+    print(f"Saving model and optimizer state at epoch {epoch} to {fpath}")
    ema_dict = None if ema_model is None else ema_model.state_dict()
    checkpoint = {'epoch': epoch,
                  'iteration': total_iter,
@ -190,35 +244,33 @@ def save_checkpoint(local_rank, model, ema_model, optimizer, epoch, total_iter,
                  'state_dict': model.state_dict(),
                  'ema_state_dict': ema_dict,
                  'optimizer': optimizer.state_dict()}
-    if amp_run:
-        checkpoint['amp'] = amp.state_dict()
-    torch.save(checkpoint, filepath)
+    if args.amp:
+        checkpoint['scaler'] = scaler.state_dict()
+    torch.save(checkpoint, fpath)


-def load_checkpoint(local_rank, model, ema_model, optimizer, epoch, total_iter,
-                    config, amp_run, filepath, world_size):
-    if local_rank == 0:
+def load_checkpoint(args, model, ema_model, optimizer, scaler, epoch,
+                    total_iter, config, filepath):
+    if args.local_rank == 0:
        print(f'Loading model and optimizer state from {filepath}')
    checkpoint = torch.load(filepath, map_location='cpu')
    epoch[0] = checkpoint['epoch'] + 1
    total_iter[0] = checkpoint['iteration']
-    config = checkpoint['config']

    sd = {k.replace('module.', ''): v
          for k, v in checkpoint['state_dict'].items()}
    getattr(model, 'module', model).load_state_dict(sd)
    optimizer.load_state_dict(checkpoint['optimizer'])

-    if amp_run:
-        amp.load_state_dict(checkpoint['amp'])
+    if args.amp:
+        scaler.load_state_dict(checkpoint['scaler'])

    if ema_model is not None:
        ema_model.load_state_dict(checkpoint['ema_state_dict'])


 def validate(model, epoch, total_iter, criterion, valset, batch_size,
-             collate_fn, distributed_run, batch_to_gpu, use_gt_durations=False,
-             ema=False):
+             collate_fn, distributed_run, batch_to_gpu, ema=False):
    """Handles all the validation scoring and printing"""
    was_training = model.training
    model.eval()
@ -226,7 +278,7 @@ def validate(model, epoch, total_iter, criterion, valset, batch_size,
    tik = time.perf_counter()
    with torch.no_grad():
        val_sampler = DistributedSampler(valset) if distributed_run else None
-        val_loader = DataLoader(valset, num_workers=8, shuffle=False,
+        val_loader = DataLoader(valset, num_workers=4, shuffle=False,
                                sampler=val_sampler,
                                batch_size=batch_size, pin_memory=False,
                                collate_fn=collate_fn)
@ -234,19 +286,19 @@ def validate(model, epoch, total_iter, criterion, valset, batch_size,
        val_num_frames = 0
        for i, batch in enumerate(val_loader):
            x, y, num_frames = batch_to_gpu(batch)
-            y_pred = model(x, use_gt_durations=use_gt_durations)
+            y_pred = model(x)
            loss, meta = criterion(y_pred, y, is_training=False, meta_agg='sum')

            if distributed_run:
-                for k,v in meta.items():
+                for k, v in meta.items():
                    val_meta[k] += reduce_tensor(v, 1)
                val_num_frames += reduce_tensor(num_frames.data, 1).item()
            else:
-                for k,v in meta.items():
+                for k, v in meta.items():
                    val_meta[k] += v
                val_num_frames = num_frames.item()

-        val_meta = {k: v / len(valset) for k,v in val_meta.items()}
+        val_meta = {k: v / len(valset) for k, v in val_meta.items()}

    val_meta['took'] = time.perf_counter() - tik

@ -258,7 +310,7 @@ def validate(model, epoch, total_iter, criterion, valset, batch_size,
                   ('mel_loss', val_meta['mel_loss'].item()),
                   ('frames/s', num_frames.item() / val_meta['took']),
                   ('took', val_meta['took'])]),
-    )
+               )

    if was_training:
        model.train()
@ -282,16 +334,23 @@ def apply_ema_decay(model, ema_model, decay):
        return
    st = model.state_dict()
    add_module = hasattr(model, 'module') and not hasattr(ema_model, 'module')
-    for k,v in ema_model.state_dict().items():
+    for k, v in ema_model.state_dict().items():
        if add_module and not k.startswith('module.'):
            k = 'module.' + k
        v.copy_(decay * v + (1 - decay) * st[k])


-def apply_multi_tensor_ema(model_weight_list, ema_model_weight_list, decay, overflow_buf):
-    if not decay:
-        return
-    amp_C.multi_tensor_axpby(65536, overflow_buf, [ema_model_weight_list, model_weight_list, ema_model_weight_list], decay, 1-decay, -1)
+def init_multi_tensor_ema(model, ema_model):
+    model_weights = list(model.state_dict().values())
+    ema_model_weights = list(ema_model.state_dict().values())
+    ema_overflow_buf = torch.cuda.IntTensor([0])
+    return model_weights, ema_model_weights, ema_overflow_buf
+
+
+def apply_multi_tensor_ema(decay, model_weights, ema_weights, overflow_buf):
+    amp_C.multi_tensor_axpby(
+        65536, overflow_buf, [ema_weights, model_weights, ema_weights],
+        decay, 1-decay, -1)


 def main():
@ -300,6 +359,9 @@ def main():
    parser = parse_args(parser)
    args, _ = parser.parse_known_args()

+    if args.p_arpabet > 0.0:
+        cmudict.initialize(args.cmudict_path, keep_ambiguous=True)
+
    distributed_run = args.world_size > 1

    torch.manual_seed(args.seed + args.local_rank)
@ -332,11 +394,11 @@ def main():
    model_config = models.get_model_config('FastPitch', args)
    model = models.get_model('FastPitch', model_config, device)

+    attention_kl_loss = AttentionBinarizationLoss()
+
    # Store pitch mean/std as params to translate from Hz during inference
-    with open(args.pitch_mean_std_file, 'r') as f:
-        stats = json.load(f)
-    model.pitch_mean[0] = stats['mean']
-    model.pitch_std[0] = stats['std']
+    model.pitch_mean[0] = args.pitch_mean
+    model.pitch_std[0] = args.pitch_std

    kw = dict(lr=args.learning_rate, betas=(0.9, 0.98), eps=1e-9,
              weight_decay=args.weight_decay)
@ -347,8 +409,7 @@ def main():
    else:
        raise ValueError

-    if args.amp:
-        model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
+    scaler = torch.cuda.amp.GradScaler(enabled=args.amp)

    if args.ema_decay > 0:
        ema_model = copy.deepcopy(model)
@ -376,40 +437,38 @@ def main():
        ch_fpath = None

    if ch_fpath is not None:
-        load_checkpoint(args.local_rank, model, ema_model, optimizer, start_epoch,
-                        start_iter, model_config, args.amp, ch_fpath,
-                        args.world_size)
+        load_checkpoint(args, model, ema_model, optimizer, scaler,
+                        start_epoch, start_iter, model_config, ch_fpath)

    start_epoch = start_epoch[0]
    total_iter = start_iter[0]

-    criterion = loss_functions.get_loss_function('FastPitch',
+    criterion = FastPitchLoss(
        dur_predictor_loss_scale=args.dur_predictor_loss_scale,
-        pitch_predictor_loss_scale=args.pitch_predictor_loss_scale)
+        pitch_predictor_loss_scale=args.pitch_predictor_loss_scale,
+        attn_loss_scale=args.attn_loss_scale)
+
+    collate_fn = TTSCollate()
+
+    if args.local_rank == 0:
+        prepare_tmp(args.pitch_online_dir)
+
+    trainset = TTSDataset(audiopaths_and_text=args.training_files, **vars(args))
+    valset = TTSDataset(audiopaths_and_text=args.validation_files, **vars(args))

-    collate_fn = data_functions.get_collate_function('FastPitch')
-    trainset = data_functions.get_data_loader('FastPitch',
-                                              audiopaths_and_text=args.training_files,
-                                              **vars(args))
-    valset = data_functions.get_data_loader('FastPitch',
-                                            audiopaths_and_text=args.validation_files,
-                                            **vars(args))
    if distributed_run:
        train_sampler, shuffle = DistributedSampler(trainset), False
    else:
        train_sampler, shuffle = None, True

-    train_loader = DataLoader(trainset, num_workers=16, shuffle=shuffle,
+    # 4 workers are optimal on DGX-1 (from epoch 2 onwards)
+    train_loader = DataLoader(trainset, num_workers=4, shuffle=shuffle,
                              sampler=train_sampler, batch_size=args.batch_size,
-                              pin_memory=False, drop_last=True,
-                              collate_fn=collate_fn)
-
-    batch_to_gpu = data_functions.get_batch_to_gpu('FastPitch')
+                              pin_memory=True, persistent_workers=True,
+                              drop_last=True, collate_fn=collate_fn)

    if args.ema_decay:
-        ema_model_weight_list, model_weight_list, overflow_buf_for_ema = init_multi_tensor_ema(model, ema_model)
-    else:
-        ema_model_weight_list, model_weight_list, overflow_buf_for_ema = None, None, None
+        mt_ema_params = init_multi_tensor_ema(model, ema_model)

    model.train()

@ -417,14 +476,20 @@ def main():
        torch.autograd.profiler.emit_nvtx().__enter__()
        profiler.start()

+    epoch_loss = []
+    epoch_mel_loss = []
+    epoch_num_frames = []
+    epoch_frames_per_sec = []
+    epoch_time = []
+
    torch.cuda.synchronize()
    for epoch in range(start_epoch, args.epochs + 1):
        epoch_start_time = time.perf_counter()

-        epoch_loss = 0.0
-        epoch_mel_loss = 0.0
-        epoch_num_frames = 0
-        epoch_frames_per_sec = 0.0
+        epoch_loss += [0.0]
+        epoch_mel_loss += [0.0]
+        epoch_num_frames += [0]
+        epoch_frames_per_sec += [0.0]

        if distributed_run:
            train_loader.sampler.set_epoch(epoch)
@ -433,9 +498,10 @@ def main():
        iter_loss = 0
        iter_num_frames = 0
        iter_meta = {}
+        iter_start_time = None

        epoch_iter = 0
-        num_iters = len(train_loader) // args.gradient_accumulation_steps
+        num_iters = len(train_loader) // args.grad_accumulation
        for batch in train_loader:

            if accumulated_steps == 0:
@ -443,31 +509,51 @@ def main():
                    break
                total_iter += 1
                epoch_iter += 1
-                iter_start_time = time.perf_counter()
+                if iter_start_time is None:
+                    iter_start_time = time.perf_counter()

                adjust_learning_rate(total_iter, optimizer, args.learning_rate,
                                     args.warmup_steps)

-                model.zero_grad()
+                model.zero_grad(set_to_none=True)

            x, y, num_frames = batch_to_gpu(batch)
-            y_pred = model(x, use_gt_durations=True)
-            loss, meta = criterion(y_pred, y)

-            loss /= args.gradient_accumulation_steps
-            meta = {k: v / args.gradient_accumulation_steps
+            with torch.cuda.amp.autocast(enabled=args.amp):
+                y_pred = model(x)
+                loss, meta = criterion(y_pred, y)
+
+                if (args.kl_loss_start_epoch is not None
+                        and epoch >= args.kl_loss_start_epoch):
+
+                    if args.kl_loss_start_epoch == epoch and epoch_iter == 1:
+                        print('Begin hard_attn loss')
+
+                    _, _, _, _, _, _, _, _, attn_soft, attn_hard, _, _ = y_pred
+                    binarization_loss = attention_kl_loss(attn_hard, attn_soft)
+                    kl_weight = min((epoch - args.kl_loss_start_epoch) / args.kl_loss_warmup_epochs, 1.0) * args.kl_loss_weight
+                    meta['kl_loss'] = binarization_loss.clone().detach() * kl_weight
+                    loss += kl_weight * binarization_loss
+
+                else:
+                    meta['kl_loss'] = torch.zeros_like(loss)
+                    kl_weight = 0
+                    binarization_loss = 0
+
+                loss /= args.grad_accumulation
+
+            meta = {k: v / args.grad_accumulation
                    for k, v in meta.items()}

            if args.amp:
-                with amp.scale_loss(loss, optimizer) as scaled_loss:
-                    scaled_loss.backward()
+                scaler.scale(loss).backward()
            else:
                loss.backward()

            if distributed_run:
                reduced_loss = reduce_tensor(loss.data, args.world_size).item()
                reduced_num_frames = reduce_tensor(num_frames.data, 1).item()
-                meta = {k: reduce_tensor(v, args.world_size) for k,v in meta.items()}
+                meta = {k: reduce_tensor(v, args.world_size) for k, v in meta.items()}
            else:
                reduced_loss = loss.item()
                reduced_num_frames = num_frames.item()
@ -479,25 +565,30 @@ def main():
            iter_num_frames += reduced_num_frames
            iter_meta = {k: iter_meta.get(k, 0) + meta.get(k, 0) for k in meta}

-            if accumulated_steps % args.gradient_accumulation_steps == 0:
+            if accumulated_steps % args.grad_accumulation == 0:

                logger.log_grads_tb(total_iter, model)
                if args.amp:
+                    scaler.unscale_(optimizer)
                    torch.nn.utils.clip_grad_norm_(
-                        amp.master_params(optimizer), args.grad_clip_thresh)
+                        model.parameters(), args.grad_clip_thresh)
+                    scaler.step(optimizer)
+                    scaler.update()
                else:
                    torch.nn.utils.clip_grad_norm_(
                        model.parameters(), args.grad_clip_thresh)
+                    optimizer.step()

-                optimizer.step()
-                apply_multi_tensor_ema(model_weight_list, ema_model_weight_list, args.ema_decay, overflow_buf_for_ema)
+                if args.ema_decay > 0.0:
+                    apply_multi_tensor_ema(args.ema_decay, *mt_ema_params)

                iter_time = time.perf_counter() - iter_start_time
                iter_mel_loss = iter_meta['mel_loss'].item()
-                epoch_frames_per_sec += iter_num_frames / iter_time
-                epoch_loss += iter_loss
-                epoch_num_frames += iter_num_frames
-                epoch_mel_loss += iter_mel_loss
+                iter_kl_loss = iter_meta['kl_loss'].item()
+                epoch_frames_per_sec[-1] += iter_num_frames / iter_time
+                epoch_loss[-1] += iter_loss
+                epoch_num_frames[-1] += iter_num_frames
+                epoch_mel_loss[-1] += iter_mel_loss

                logger.log((epoch, epoch_iter, num_iters),
                           tb_total_steps=total_iter,
@ -505,45 +596,45 @@ def main():
                           data=OrderedDict([
                               ('loss', iter_loss),
                               ('mel_loss', iter_mel_loss),
+                               ('kl_loss', iter_kl_loss),
+                               ('kl_weight', kl_weight),
                               ('frames/s', iter_num_frames / iter_time),
                               ('took', iter_time),
                               ('lrate', optimizer.param_groups[0]['lr'])]),
-                )
+                           )

                accumulated_steps = 0
                iter_loss = 0
                iter_num_frames = 0
                iter_meta = {}
+                iter_start_time = time.perf_counter()

        # Finished epoch
-        epoch_time = time.perf_counter() - epoch_start_time
+        epoch_loss[-1] /= epoch_iter
+        epoch_mel_loss[-1] /= epoch_iter
+        epoch_time += [time.perf_counter() - epoch_start_time]
+        iter_start_time = None

        logger.log((epoch,),
                   tb_total_steps=None,
                   subset='train_avg',
                   data=OrderedDict([
-                       ('loss', epoch_loss / epoch_iter),
-                       ('mel_loss', epoch_mel_loss / epoch_iter),
-                       ('frames/s', epoch_num_frames / epoch_time),
-                       ('took', epoch_time)]),
-        )
+                       ('loss', epoch_loss[-1]),
+                       ('mel_loss', epoch_mel_loss[-1]),
+                       ('frames/s', epoch_num_frames[-1] / epoch_time[-1]),
+                       ('took', epoch_time[-1])]),
+                   )

        validate(model, epoch, total_iter, criterion, valset, args.batch_size,
-                 collate_fn, distributed_run, batch_to_gpu,
-                 use_gt_durations=True)
+                 collate_fn, distributed_run, batch_to_gpu)

        if args.ema_decay > 0:
            validate(ema_model, epoch, total_iter, criterion, valset,
                     args.batch_size, collate_fn, distributed_run, batch_to_gpu,
-                     use_gt_durations=True, ema=True)
+                     ema=True)

-        if (epoch > 0 and args.epochs_per_checkpoint > 0 and
-            (epoch % args.epochs_per_checkpoint == 0) and args.local_rank == 0):
-
-            checkpoint_path = os.path.join(
-                args.output, f"FastPitch_checkpoint_{epoch}.pt")
-            save_checkpoint(args.local_rank, model, ema_model, optimizer, epoch,
-                            total_iter, model_config, args.amp, checkpoint_path)
+        maybe_save_checkpoint(args, model, ema_model, optimizer, scaler, epoch,
+                              total_iter, model_config)
        logger.flush()

    # Finished training
@ -551,24 +642,25 @@ def main():
        profiler.stop()
        torch.autograd.profiler.emit_nvtx().__exit__(None, None, None)

-    logger.log((),
-               tb_total_steps=None,
-               subset='train_avg',
-               data=OrderedDict([
-                   ('loss', epoch_loss / epoch_iter),
-                   ('mel_loss', epoch_mel_loss / epoch_iter),
-                   ('frames/s', epoch_num_frames / epoch_time),
-                   ('took', epoch_time)]),
-    )
-    validate(model, None, total_iter, criterion, valset, args.batch_size,
-             collate_fn, distributed_run, batch_to_gpu, use_gt_durations=True)
+    if len(epoch_loss) > 0:
+        # Was trained - average the last 20 measurements
+        last_ = lambda l: np.asarray(l[-20:])
+        epoch_loss = last_(epoch_loss)
+        epoch_mel_loss = last_(epoch_mel_loss)
+        epoch_num_frames = last_(epoch_num_frames)
+        epoch_time = last_(epoch_time)
+        logger.log((),
+                   tb_total_steps=None,
+                   subset='train_avg',
+                   data=OrderedDict([
+                       ('loss', epoch_loss.mean()),
+                       ('mel_loss', epoch_mel_loss.mean()),
+                       ('frames/s', epoch_num_frames.sum() / epoch_time.sum()),
+                       ('took', epoch_time.mean())]),
+                   )

-    if (epoch > 0 and args.epochs_per_checkpoint > 0 and
-        (epoch % args.epochs_per_checkpoint != 0) and args.local_rank == 0):
-        checkpoint_path = os.path.join(
-            args.output, f"FastPitch_checkpoint_{epoch}.pt")
-        save_checkpoint(args.local_rank, model, ema_model, optimizer, epoch,
-                        total_iter, model_config, args.amp, checkpoint_path)
+    validate(model, None, total_iter, criterion, valset, args.batch_size,
+             collate_fn, distributed_run, batch_to_gpu)


 if __name__ == '__main__':
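
Note on the final `train_avg` summary in the `train.py` hunk above: `loss`, `mel_loss`, and `took` are averaged over the last 20 epochs, while `frames/s` is the total frame count divided by the total time. This can differ from the mean of per-epoch throughputs when epochs take unequal time. The sketch below reproduces that arithmetic with plain lists instead of NumPy; the helper name `summarize_last_epochs` and the sample numbers are made up for illustration and are not part of the commit.

```python
def summarize_last_epochs(epoch_loss, epoch_num_frames, epoch_time, window=20):
    """Hypothetical helper mirroring the final train_avg log with plain lists."""
    # Keep only the most recent `window` measurements, like l[-20:] in the diff.
    loss = epoch_loss[-window:]
    frames = epoch_num_frames[-window:]
    times = epoch_time[-window:]
    return {
        'loss': sum(loss) / len(loss),          # mean of per-epoch losses
        'frames/s': sum(frames) / sum(times),   # total frames over total time
        'took': sum(times) / len(times),        # mean epoch duration
    }


# Illustrative values only: two epochs with different durations.
summary = summarize_last_epochs(
    epoch_loss=[0.5, 0.25],
    epoch_num_frames=[1000, 6000],
    epoch_time=[10.0, 30.0],
)

# Mean of per-epoch throughputs, shown for contrast with 'frames/s'.
mean_of_ratios = (1000 / 10.0 + 6000 / 30.0) / 2
```

With these numbers the aggregate throughput is 175 frames/s, whereas the mean of the per-epoch ratios is 150, so the two definitions should not be mixed when comparing logs.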
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/Dockerfile-model
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/Dockerfile-model
@ -31,6 +31,8 @@ ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}
 ENV PYTHONPATH /workspace/fastpitch
 WORKDIR /workspace/fastpitch

+# RUN conda install -y tqdm
+
 ADD requirements.txt .
 ADD triton/requirements.txt triton/requirements.txt
 RUN pip install -r requirements.txt
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/README.md
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/README.md
@ -106,12 +106,12 @@ Ensure you have the following components:

 ## Quick Start Guide

-Running the following scripts will build and launch the container with all required dependencies for native PyTorch as well as Triton Inference Server. This is necessary for running inference and can also be used for data download, processing, and training of the model. 
- 
+Running the following scripts will build and launch the container with all required dependencies for native PyTorch as well as Triton Inference Server. This is necessary for running inference and can also be used for data download, processing, and training of the model.
+
 1. Clone the repository.

   IMPORTANT: This step is executed on the host computer.
- 
+
   ```
    git clone https://github.com/NVIDIA/DeepLearningExamples.git
    cd DeepLearningExamples/PyTorch/SpeechSynthesis/FastPitch
@ -120,28 +120,28 @@ Running the following scripts will build and launch the container with all requi

   ```
    source triton/scripts/setup_environment.sh
-    bash triton/scripts/docker/triton_inference_server.sh 
+    bash triton/scripts/docker/triton_inference_server.sh
   ```

 1. Build and run a container that extends the NGC PyTorch container with the Triton Inference Server client libraries and dependencies.
- 
+
   ```
    bash triton/scripts/docker/build.sh
    bash triton/scripts/docker/interactive.sh
   ```
- 
- 
+
+
 1. Prepare the deployment configuration and create folders in Docker.
- 
+
   IMPORTANT: These and the following commands must be executed in the PyTorch NGC container.
- 
+
   ```
    source triton/scripts/setup_environment.sh
   ```

 1. Download and pre-process the dataset.

- 
+
   ```
    bash triton/scripts/download_data.sh
    bash triton/scripts/process_dataset.sh
@ -154,8 +154,8 @@ Running the following scripts will build and launch the container with all requi
   ```

 1. Convert the model from training to inference format (e.g. TensorRT).
- 
- 
+
+
   ```
    python3 triton/convert_model.py \
        --input-path ./triton/model.py \
@ -174,31 +174,34 @@ Running the following scripts will build and launch the container with all requi
        --precision ${PRECISION} \
        --ignore-unknown-parameters
   ```
- 
- 
+
+
 1. Configure the model on Triton Inference Server.
- 
+
   Generate the configuration from your model repository.
- 
+
   ```
-    python3 triton/config_model_on_triton.py \
-        --model-repository ${MODEL_REPOSITORY_PATH} \
-        --model-path ${SHARED_DIR}/model \
-        --model-format ${FORMAT} \
-        --model-name ${MODEL_NAME} \
-        --model-version 1 \
-        --max-batch-size ${MAX_BATCH_SIZE} \
-        --precision ${PRECISION} \
-        --number-of-model-instances ${NUMBER_OF_MODEL_INSTANCES} \
-        --max-queue-delay-us ${TRITON_MAX_QUEUE_DELAY} \
-        --capture-cuda-graph 0 \
-        --backend-accelerator ${BACKEND_ACCELERATOR} \
-        --load-model ${TRITON_LOAD_MODEL_METHOD} \
-        --verbose true
+    model-navigator triton-config-model \
+        --model-repository ${MODEL_REPOSITORY_PATH} \
+        --model-name ${MODEL_NAME} \
+        --model-version 1 \
+        --model-path ${SHARED_DIR}/model \
+        --model-format ${CONFIG_FORMAT} \
+        --model-control-mode ${TRITON_LOAD_MODEL_METHOD} \
+        --load-model \
+        --load-model-timeout-s 100 \
+        --verbose \
+        \
+        --backend-accelerator ${BACKEND_ACCELERATOR} \
+        --tensorrt-precision ${PRECISION} \
+        --max-batch-size ${MAX_BATCH_SIZE} \
+        --preferred-batch-sizes ${TRITON_PREFERRED_BATCH_SIZES} \
+        --max-queue-delay-us ${TRITON_MAX_QUEUE_DELAY} \
+        --engine-count-per-device gpu=${NUMBER_OF_MODEL_INSTANCES}
   ```
- 
+
 1. Run the Triton Inference Server accuracy tests.
- 
+
   ```
    python3 triton/run_inference_on_triton.py \
        --server-url localhost:8001 \
@ -218,11 +221,11 @@ Running the following scripts will build and launch the container with all requi
        --output-used-for-metrics OUTPUT__0

    cat ${SHARED_DIR}/accuracy_metrics.csv
- 
+
   ```
- 
+
 1. Prepare performance input.
- 
+
   ```
    mkdir -p ${SHARED_DIR}/input_data

@ -236,7 +239,7 @@ Running the following scripts will build and launch the container with all requi


 1. Run the Triton Inference Server performance online tests.
- 
+
   We want to maximize throughput within latency budget constraints.
   Dynamic batching is a feature of Triton Inference Server that allows
   inference requests to be combined by the server, so that a batch is
@ -247,7 +250,7 @@ Running the following scripts will build and launch the container with all requi
   in the Triton Inference Server model configuration. The measurements
   presented below set the maximum latency to zero to achieve the best latency
   possible with good performance.
- 
+

   ```
    python triton/run_online_performance_test_on_triton.py \
@ -262,7 +265,7 @@ Running the following scripts will build and launch the container with all requi


 1. Run the Triton Inference Server performance offline tests.
- 
+
   We want to maximize throughput. It assumes you have your data available
   for inference or that your data saturate to maximum batch size quickly.
   Triton Inference Server supports offline scenarios with static batching.
@ -342,8 +345,6 @@ we can consider that all clients are local.

 ## Performance

-The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
-


 ### Offline scenario
@ -626,7 +627,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_1.svg)
- 
+
 <details>

 <summary>
@ -666,7 +667,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_2.svg)
- 
+
 <details>

 <summary>
@ -707,7 +708,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_3.svg)
- 
+
 <details>

 <summary>
@ -747,7 +748,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_4.svg)
- 
+
 <details>

 <summary>
@ -787,7 +788,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_5.svg)
- 
+
 <details>

 <summary>
@ -828,7 +829,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_6.svg)
- 
+
 <details>

 <summary>
@ -869,7 +870,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_7.svg)
- 
+
 <details>

 <summary>
@ -910,7 +911,7 @@ Our results were obtained using the following configuration:


 ![](plots/graph_performance_online_8.svg)
- 
+
 <details>

 <summary>
@ -957,5 +958,3 @@ April 2021
 ### Known issues

 There are no known issues with this model.
-
-
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/config_model_on_triton.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/config_model_on_triton.py
@ -71,8 +71,9 @@ import argparse
 import logging
 import time

-from model_navigator import Accelerator, Format, Precision
-from model_navigator.args import str2bool
+from model_navigator.triton.config import BackendAccelerator as Accelerator
+from model_navigator.triton.config import TensorRTOptPrecision as Precision
+from model_navigator.model import Format
 from model_navigator.log import set_logger, log_dict
 from model_navigator.triton import ModelConfig, TritonClient, TritonModelStore

@ -160,7 +161,7 @@ def main():
        help="Use cuda capture graph (used only by TensorRT platform)",
    )

-    parser.add_argument("-v", "--verbose", help="Provide verbose logs", type=str2bool, default=False)
+    parser.add_argument("-v", "--verbose", help="Provide verbose logs", action='store_true')
    args = parser.parse_args()

    set_logger(verbose=args.verbose)
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/convert_model.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/convert_model.py
@ -145,6 +145,11 @@ def main():
    if Converter:  # if conversion is needed
        # dataloader must much source model precision - so not recovering it yet
        if args.dataloader is not None:
+
+            if args.p_arpabet > 0.0:
+                from common.text import cmudict
+                cmudict.initialize(args.cmudict_path, keep_ambiguous=True)
+
            get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
            dataloader_fn = ArgParserGenerator(get_dataloader_fn).from_args(args)

--- a/PyTorch/SpeechSynthesis/FastPitch/triton/dataloader.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/dataloader.py
@ -16,29 +16,59 @@ import sys
 from os.path import abspath, dirname
 sys.path.append(abspath(dirname(__file__)+'/../'))

-from fastpitch.data_function import TextMelAliCollate, TextMelAliLoader
+from fastpitch.data_function import TTSCollate, TTSDataset
 from torch.utils.data import DataLoader
 import numpy as np
 import inspect
 import torch
 from typing import List
+from common.text import cmudict

-def get_dataloader_fn(text_cleaners: List = ['english_cleaners'],
-                      n_mel_channels: int = 80,
-                      n_speakers: int = 1,
-                      symbol_set: str ='english_basic',
+def get_dataloader_fn(batch_size: int = 8,
+                      precision: str = "fp16",
+                      heteronyms_path: str = 'cmudict/heteronyms',
+                      cmudict_path: str = 'cmudict/cmudict-0.7b',
                      dataset_path: str = './LJSpeech_1.1',
-                      filelist: str ="filelists/ljs_mel_dur_pitch_text_test_filelist.txt",
-                      batch_size: int = 8,
-                      precision: str = "fp16"):
+                      filelist: str ="filelists/ljs_audio_pitch_text_test.txt",
+                      text_cleaners: List = ['english_cleaners_v2'],
+                      n_mel_channels: int = 80,
+                      symbol_set: str ='english_basic',
+                      p_arpabet: float = 1.0,
+                      n_speakers: int = 1,
+                      load_mel_from_disk: bool = False,
+                      load_pitch_from_disk: bool = True,
+                      pitch_mean: float = 214.72203,  # LJSpeech defaults
+                      pitch_std: float = 65.72038,
+                      max_wav_value: float = 32768.0,
+                      sampling_rate: int = 22050,
+                      filter_length: int = 1024,
+                      hop_length: int = 256,
+                      win_length: int = 1024,
+                      mel_fmin: float = 0.0,
+                      mel_fmax: float = 8000.0):

-    dataset = TextMelAliLoader(dataset_path=dataset_path,
-                               audiopaths_and_text=filelist,
-                               text_cleaners=text_cleaners,
-                               n_mel_channels=n_mel_channels,
-                               symbol_set=symbol_set,
-                               n_speakers=n_speakers)
-    collate_fn = TextMelAliCollate()
+    if p_arpabet > 0.0:
+        cmudict.initialize(cmudict_path, keep_ambiguous=True)
+
+    dataset = TTSDataset(dataset_path=dataset_path,
+                         audiopaths_and_text=filelist,
+                         text_cleaners=text_cleaners,
+                         n_mel_channels=n_mel_channels,
+                         symbol_set=symbol_set,
+                         p_arpabet=p_arpabet,
+                         n_speakers=n_speakers,
+                         load_mel_from_disk=load_mel_from_disk,
+                         load_pitch_from_disk=load_pitch_from_disk,
+                         pitch_mean=pitch_mean,
+                         pitch_std=pitch_std,
+                         max_wav_value=max_wav_value,
+                         sampling_rate=sampling_rate,
+                         filter_length=filter_length,
+                         hop_length=hop_length,
+                         win_length=win_length,
+                         mel_fmin=mel_fmin,
+                         mel_fmax=mel_fmax)
+    collate_fn = TTSCollate()
    dataloader = DataLoader(dataset, num_workers=8, shuffle=False,
                            sampler=None,
                            batch_size=batch_size, pin_memory=False,
@ -46,24 +76,27 @@ def get_dataloader_fn(text_cleaners: List = ['english_cleaners'],

    def _get_dataloader():
        for idx, batch in enumerate(dataloader):
-            text_padded, input_lengths, mel_padded, output_lengths, \
-                len_x, dur_padded, dur_lens, pitch_padded, speaker = batch
-            input_lengths = input_lengths.unsqueeze(-1)
+
+            text_padded, _, mel_padded, output_lengths, _, \
+            pitch_padded, energy_padded, *_ = batch
+
            pitch_padded = pitch_padded.float()
-            dur_padded = dur_padded.float()
+            energy_padded = energy_padded.float()
+            dur_padded = torch.zeros_like(pitch_padded)

            if precision == "fp16":
                pitch_padded = pitch_padded.half()
                dur_padded = dur_padded.half()
                mel_padded = mel_padded.half()
+                energy_padded = energy_padded.half()

            ids = np.arange(idx*batch_size, idx*batch_size + batch_size)
-            x = {"INPUT__0": text_padded.cpu().numpy(),
-                 "INPUT__1": input_lengths.cpu().numpy()}
+            x = {"INPUT__0": text_padded.cpu().numpy()}
            y_real = {"OUTPUT__0": mel_padded.cpu().numpy(),
                      "OUTPUT__1": output_lengths.cpu().numpy(),
                      "OUTPUT__2": dur_padded.cpu().numpy(),
-                      "OUTPUT__3": pitch_padded.cpu().numpy()}
+                      "OUTPUT__3": pitch_padded.cpu().numpy(),
+                      "OUTPUT__4": energy_padded.cpu().numpy()}

            yield (ids, x, y_real)

--- a/PyTorch/SpeechSynthesis/FastPitch/triton/model.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/model.py
@ -20,7 +20,6 @@ sys.path.append(abspath(dirname(__file__)+'/../'))
 from common.text import symbols
 from inference import load_model_from_ckpt
 import models
-import data_functions
 from torch.utils.data import DataLoader
 import torch
 import numpy as np
@ -31,8 +30,6 @@ def update_argparser(parser):
    io = parser.add_argument_group('io parameters')
    io.add_argument('--n-mel-channels', default=80, type=int,
                    help='Number of bins in mel-spectrograms')
-    io.add_argument('--max-seq-len', default=2048, type=int,
-                    help='')

    symbols = parser.add_argument_group('symbols parameters')
    symbols.add_argument('--n-symbols', default=148, type=int,
@ -106,6 +103,17 @@ def update_argparser(parser):
    pitch_pred.add_argument('--pitch-predictor-n-layers', default=2, type=int,
                            help='Number of conv-1D layers')

+    energy_pred = parser.add_argument_group('energy predictor parameters')
+    energy_pred.add_argument('--energy-conditioning', type=bool, default=True)
+    energy_pred.add_argument('--energy-predictor-kernel-size', default=3, type=int,
+                            help='Energy predictor conv-1D kernel size')
+    energy_pred.add_argument('--energy-predictor-filter-size', default=256, type=int,
+                            help='Energy predictor conv-1D filter size')
+    energy_pred.add_argument('--p-energy-predictor-dropout', default=0.1, type=float,
+                            help='Dropout probability for energy predictor')
+    energy_pred.add_argument('--energy-predictor-n-layers', default=2, type=int,
+                            help='Number of conv-1D layers')
+
    ###~copy-paste from ./fastpitch/arg_parser.py

    parser.add_argument('--checkpoint', type=str,
@ -119,10 +127,14 @@ def update_argparser(parser):
    cond = parser.add_argument_group('conditioning parameters')
    cond.add_argument('--pitch-embedding-kernel-size', default=3, type=int,
                      help='Pitch embedding conv-1D kernel size')
+    cond.add_argument('--energy-embedding-kernel-size', default=3, type=int,
+                      help='Energy embedding conv-1D kernel size')
    cond.add_argument('--speaker-emb-weight', type=float, default=1.0,
                      help='Scale speaker embedding')
    cond.add_argument('--n-speakers', type=int, default=1,
                      help='Number of speakers in the model.')
+    cond.add_argument('--pitch-conditioning-formants', default=1, type=int,
+                      help='Number of speech formants to condition on.')
    parser.add_argument("--precision", type=str, default="fp32",
                        choices=["fp32", "fp16"],
                        help="PyTorch model precision")
@ -139,7 +151,7 @@ def get_model(**model_args):
                                           args=args)

    jittable = True if 'ts-' in args.output_format else False
-    
+
    model = models.get_model(model_name="FastPitch",
                             model_config=model_config,
                             device='cuda',
@ -149,8 +161,8 @@ def get_model(**model_args):
    if args.precision == "fp16":
        model = model.half()
    model.eval()
-    tensor_names = {"inputs": ["INPUT__0", "INPUT__1"],
+    tensor_names = {"inputs": ["INPUT__0"],
                    "outputs" : ["OUTPUT__0", "OUTPUT__1",
-                                 "OUTPUT__2", "OUTPUT__3"]}
+                                 "OUTPUT__2", "OUTPUT__3", "OUTPUT__4"]}

    return model, tensor_names
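
Note on `--energy-conditioning` in the `triton/model.py` hunk above: the option is declared with `type=bool`, and `argparse` applies `bool()` to the raw string, so any non-empty value (including `False`) parses as `True`. The sketch below demonstrates this with the standard library only; the `str_to_bool` converter is a hypothetical alternative and is not part of the commit.

```python
import argparse


def str_to_bool(value):
    """Hypothetical converter that parses common boolean spellings."""
    lowered = value.lower()
    if lowered in ('true', '1', 'yes'):
        return True
    if lowered in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError(f'expected a boolean, got {value!r}')


# Same declaration as in the diff: bool() is applied to the raw string.
naive_parser = argparse.ArgumentParser()
naive_parser.add_argument('--energy-conditioning', type=bool, default=True)

# Alternative declaration using the hypothetical converter.
strict_parser = argparse.ArgumentParser()
strict_parser.add_argument('--energy-conditioning', type=str_to_bool, default=True)

cli = ['--energy-conditioning', 'False']
naive = naive_parser.parse_args(cli).energy_conditioning    # bool('False') is True
strict = strict_parser.parse_args(cli).energy_conditioning  # parsed as False
```

As declared in the diff, the flag therefore cannot be switched off from the command line except by passing an empty string.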
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/requirements.txt
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/requirements.txt
@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 networkx==2.5
-numpy<1.20.0,>=1.19.1  # # numpy 1.20+ requires py37
+numpy
 onnx==1.8.0
 onnxruntime==1.5.2
 pycuda>=2019.1.2
@ -21,4 +21,4 @@ tqdm>=4.44.1
 tabulate>=0.8.7
 natsort>=7.0.0
 # use tags instead of branch names - because there might be docker cache hit causing not fetching most recent changes on branch
-model_navigator @ git+https://github.com/triton-inference-server/model_navigator.git@v0.1.0#egg=model_navigator
+model_navigator @ git+https://github.com/triton-inference-server/model_navigator.git@v0.2.1#egg=model_navigator
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/run_inference_on_triton.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/run_inference_on_triton.py
@ -222,9 +222,9 @@ def _parse_args():
    parser.add_argument("--dump-inputs", help="Dump inputs to output dir", action="store_true", default=False)
    parser.add_argument("-v", "--verbose", help="Verbose logs", action="store_true", default=False)
    parser.add_argument("--output-dir", required=True, help="Path to directory where outputs will be saved")
-    parser.add_argument("--response-wait-time", required=False, help="Maximal time to wait for response", default=120)
+    parser.add_argument("--response-wait-time", required=False, help="Maximal time to wait for response", type=int, default=120)
    parser.add_argument(
-        "--max-unresponded-requests", required=False, help="Maximal number of unresponded requests", default=128
+        "--max-unresponded-requests", required=False, help="Maximal number of unresponded requests", type=int, default=128
    )

    args, *_ = parser.parse_known_args()
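
Note on the `type=int` additions in the `run_inference_on_triton.py` hunk above: without an explicit `type`, `argparse` leaves a value given on the command line as a string, while the default stays an integer, so downstream code sees two different types. A minimal standard-library sketch of that behavior follows; the two parsers are built here only for comparison and do not reproduce the full parser from the script.

```python
import argparse

# Declaration as it was before the change: no type, integer default.
untyped = argparse.ArgumentParser()
untyped.add_argument('--response-wait-time', required=False, default=120)

# Declaration after the change: values are converted with int().
typed = argparse.ArgumentParser()
typed.add_argument('--response-wait-time', required=False, type=int, default=120)

cli = ['--response-wait-time', '30']
untyped_value = untyped.parse_args(cli).response_wait_time  # stays a string
typed_value = typed.parse_args(cli).response_wait_time      # converted to int
default_value = untyped.parse_args([]).response_wait_time   # default is untouched
```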
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/docker/build.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/docker/build.sh
@ -13,4 +13,4 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-docker build -t fastpitch . -f Dockerfile
+docker build -t fastpitch . -f triton/Dockerfile-model
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/download_data.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/download_data.sh
@ -58,3 +58,6 @@ else
  rm LJSpeech-1.1.tar.bz2
  echo "ok"
 fi
+
+echo "Downloading cmudict-0.7b ..."
+wget https://github.com/Alexir/CMUdict/raw/master/cmudict-0.7b -qO cmudict/cmudict-0.7b
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/process_dataset.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/process_dataset.sh
@ -1,5 +1,4 @@
 #!/usr/bin/env bash
-#!/usr/bin/env bash
 # Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@ -17,45 +16,20 @@
 set -e

 DATASET_DIR="${DATASETS_DIR}/LJSpeech-1.1/LJSpeech-1.1_fastpitch"
+: ${F0_METHOD:="pyin"}
+: ${ARGS="--extract-mels"}

-TACO_DIR="${DATASETS_DIR}/tacotron2"
-TACO_MODEL="nvidia_tacotron2pyt_fp16.pt"
-TACO_MODEL_URL="https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_amp/versions/19.12.0/files/nvidia_tacotron2pyt_fp16.pt"
-
-if [ -f "${TACO_DIR}/${TACO_MODEL}" ]; then
-  echo "${TACO_MODEL} already downloaded."
-elif [ -f "${WORKDIR}/pretrained_models/tacotron2/${TACO_MODEL}" ]; then
-  echo "Linking existing model from ${WORKDIR}/pretrained_models/tacotron2/${TACO_MODEL}"
-  mkdir -p ${TACO_DIR}
-  ln -s "${WORKDIR}/pretrained_models/tacotron2/${TACO_MODEL}" "${TACO_DIR}/${TACO_MODEL}"
-elif [ -f "${PWD}/pretrained_models/tacotron2/${TACO_MODEL}" ]; then
-  echo "Linking existing model from ${PWD}/pretrained_models/tacotron2/${TACO_MODEL}"
-  mkdir -p ${TACO_DIR}
-  ln -s "${PWD}/pretrained_models/tacotron2/${TACO_MODEL}" "${TACO_DIR}/${TACO_MODEL}"
-else
-  echo "Downloading ${TACO_MODEL} ..."
-  mkdir -p ${TACO_DIR}
-  wget --content-disposition -qO ${TACO_DIR}/${TACO_MODEL} ${TACO_MODEL_URL} ||
-    {
-      echo "ERROR: Failed to download ${TACO_MODEL} from NGC"
-      exit 1
-    }
-  echo "OK"
-fi

 if [ ! -d "${DATASET_DIR}/mels" ]; then

-  for FILELIST in ljs_audio_text_train_filelist.txt \
-    ljs_audio_text_val_filelist.txt \
-    ljs_audio_text_test_filelist.txt; do
-    python extract_mels.py \
-      --cuda \
-      --dataset-path ${DATASET_DIR} \
-      --wav-text-filelist filelists/${FILELIST} \
-      --batch-size 256 \
-      --extract-mels \
-      --extract-durations \
-      --extract-pitch-char \
-      --tacotron2-checkpoint ${TACO_DIR}/${TACO_MODEL}
-  done
+    python extract_pitch.py \
+        --wav-text-filelists filelists/ljs_audio_text_train_v3.txt \
+        filelists/ljs_audio_text_val.txt \
+        filelists/ljs_audio_text_test.txt \
+        --n-workers 16 \
+        --batch-size 1 \
+        --dataset-path $DATASET_DIR \
+        --extract-pitch \
+        --f0-method $F0_METHOD \
+        $ARGS
 fi
--- a/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/setup_parameters.sh
+++ b/PyTorch/SpeechSynthesis/FastPitch/triton/scripts/setup_parameters.sh
@ -15,9 +15,10 @@
 export PRECISION="fp16"
 export FORMAT="ts-trace"
 export BATCH_SIZE="1,2,4,8"
-export BACKEND_ACCELERATOR="cuda"
+export BACKEND_ACCELERATOR="none"
 export MAX_BATCH_SIZE="8"
 export NUMBER_OF_MODEL_INSTANCES="2"
 export TRITON_MAX_QUEUE_DELAY="1"
 export TRITON_PREFERRED_BATCH_SIZES="4 8"
 export SEQUENCE_LENGTH="128"
+export CONFIG_FORMAT="torchscript"
--- a/PyTorch/SpeechSynthesis/FastPitch/waveglow/denoiser.py
+++ b/PyTorch/SpeechSynthesis/FastPitch/waveglow/denoiser.py
@ -25,8 +25,6 @@
 #
 # *****************************************************************************

-import sys
-sys.path.append('tacotron2')
 import torch
 from common.layers import STFT