diff --git a/CUDA-Optimized/FastSpeech/.gitmodules b/CUDA-Optimized/FastSpeech/.gitmodules deleted file mode 100644 index 404a7af7..00000000 --- a/CUDA-Optimized/FastSpeech/.gitmodules +++ /dev/null @@ -1,6 +0,0 @@ -[submodule "waveglow"] - path = waveglow - url = https://github.com/NVIDIA/waveglow.git -[submodule "cub"] - path = cub - url = https://github.com/NVlabs/cub.git diff --git a/CUDA-Optimized/FastSpeech/Dockerfile b/CUDA-Optimized/FastSpeech/Dockerfile index f37fe00f..b62691d6 100644 --- a/CUDA-Optimized/FastSpeech/Dockerfile +++ b/CUDA-Optimized/FastSpeech/Dockerfile @@ -1,7 +1,14 @@ -ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.03-py3 +ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.10-py3 FROM ${FROM_IMAGE_NAME} +# ARG UNAME +# ARG UID +# ARG GID +# RUN groupadd -g $GID -o $UNAME +# RUN useradd -m -u $UID -g $GID -o -s /bin/bash $UNAME +# USER $UNAME + ADD . /workspace/fastspeech WORKDIR /workspace/fastspeech -RUN sh ./scripts/install.sh \ No newline at end of file +RUN sh ./scripts/install.sh diff --git a/CUDA-Optimized/FastSpeech/README.md b/CUDA-Optimized/FastSpeech/README.md index a5d4e6ef..e79fb89f 100644 --- a/CUDA-Optimized/FastSpeech/README.md +++ b/CUDA-Optimized/FastSpeech/README.md @@ -95,9 +95,9 @@ and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components: * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) -* [PyTorch 20.03-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) +* [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer -* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU +* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/), [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning @@ -120,11 +120,6 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo git clone https://github.com/NVIDIA/DeepLearningExamples.git cd DeepLearningExamples/CUDA-Optimized/FastSpeech ``` - and pull submodules. - ``` - git submodule init - git submodule update - ``` 2. Download and preprocess the dataset. Data is downloaded to the ./LJSpeech-1.1 directory (on the host). The ./LJSpeech-1.1 directory is mounted to the /workspace/fastspeech/LJSpeech-1.1 location in the NGC container. ``` @@ -148,7 +143,7 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo The preprocessed mel-spectrograms are stored in the ./mels_ljspeech1.1 directory. -Next, calculate alignments on the LJSpeech dataset using a pre-trained [NVIDIA Tacotron2 checkpoint](https://drive.google.com/file/d/1c5ZTuT7J08wLUoVZ2KkUs_VdZuJ86ZqA/view). The output directory is specified with `--aligns_path`. + Next, precompute alignments on the LJSpeech dataset with feed-forward passes through the teacher model. Download the NVIDIA [pretrained Tacotron2 checkpoint](https://drive.google.com/file/d/1c5ZTuT7J08wLUoVZ2KkUs_VdZuJ86ZqA/view) to use as the teacher model, set --tacotron2_path to the Tacotron2 checkpoint file path, and set --aligns_path to the directory where the resulting alignments are stored.
``` python fastspeech/align_tacotron2.py --dataset_path="./LJSpeech-1.1" --tacotron2_path="tacotron2_statedict.pt" --aligns_path="aligns_ljspeech1.1" ``` @@ -169,23 +164,23 @@ python fastspeech/train.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1" --aligns_path="./aligns_ljspeech1.1" --log_path="./logs" --checkpoint_path="./checkpoints" --use_amp ``` -6. Start generation. To generate waveforms with WaveGlow Vocoder, Get [pretrained WaveGlow model](https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF) in the home directory, for example, ./waveglow_256channels.pt. +6. Start generation. To generate waveforms with the WaveGlow vocoder, download the [pretrained WaveGlow model](https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ckpt_amp_256/files?version=19.10.0) from NGC to the home directory, for example, ./nvidia_waveglow256pyt_fp16. After you have trained the FastSpeech model, you can perform generation using the checkpoint stored in ./checkpoints. Then run: ``` - python generate.py --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" + python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" ``` The script automatically loads the latest checkpoint (if any exists), or you can pass a checkpoint file through --ckpt_file. It reads input texts from ./test_sentences.txt and stores the results in the ./results directory. You can also set the result directory path with --results_path. You can also run with a sample text: ``` - python generate.py --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="The more you buy, the more you save." + python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="The more you buy, the more you save." ``` -7. Accelerate generation(inferencing of FastSpeech and WaveGlow) with TensorRT. Set parameters config file with --hparam=trt.yaml to enable TensorRT inference mode. To prepare for running WaveGlow on TensorRT, first extract a TensorRT engine file via [DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/trt](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/trt) and copy this in the home directory, for example, ./waveglow.fp16.trt. Then run with --waveglow_engine_path: +7. Accelerate generation (inference of FastSpeech and WaveGlow) with TensorRT. Set the parameters config file with --hparam=trt.yaml to enable TensorRT inference mode. To prepare for running WaveGlow on TensorRT, first get an ONNX file via [DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/tensorrt](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt), convert it to a TensorRT engine using scripts/waveglow/convert_onnx2trt.py, and copy it to the home directory, for example, ./waveglow.fp16.trt.
Then run with --waveglow_engine_path: ``` - python generate.py --hparam=trt.yaml --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" --waveglow_engine_path="waveglow.fp16.trt" + python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" --waveglow_engine_path="waveglow.fp16.trt" ``` ## Advanced @@ -293,33 +288,29 @@ For more details, refer to [accelerating inference with TensorRT](fastspeech/trt #### Generation -To generate waveforms with WaveGlow Vocoder, 1) Make sure to pull [Nvidia WaveGlow](https://github.com/NVIDIA/waveglow) through git submodule, 2) get [pretrained WaveGlow model](https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF) in the home directory, for example, ./waveglow_256channels.pt. - ``` - git submodule init - git submodule update - ``` +To generate waveforms with the WaveGlow vocoder, download the [pretrained WaveGlow model](https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ckpt_amp_256/files?version=19.10.0) from NGC to the home directory, for example, ./nvidia_waveglow256pyt_fp16. Run generate.py with: * --text - an input text or the text file path. * --results_path - result waveforms directory path. (default=./results). * --ckpt_file - checkpoint file path. (default checkpoint file is the latest file in --checkpoint_path) ``` - python generate.py --waveglow_path="./waveglow_256channels.pt" --text="The more you buy, the more you save." + python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --text="The more you buy, the more you save." ``` or ``` - python generate.py --waveglow_path="./waveglow_256channels.pt" --text=test_sentences.txt + python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --text=test_sentences.txt ``` -Sample result waveforms are [here](https://gitlab-master.nvidia.com/dahn/fastspeech/tree/master/samples). +Sample result waveforms are [here](samples). -To generate waveforms with the whole pipeline of FastSpeech and WaveGlow with TensorRT, extract a WaveGlow TRT engine file through https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/trt and run generate.py with --hparam=trt.yaml and --waveglow_engine_path. +To generate waveforms with the whole FastSpeech and WaveGlow pipeline on TensorRT, extract a WaveGlow TRT engine file through https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt and run generate.py with --hparam=trt.yaml and --waveglow_engine_path. ``` -python generate.py --hparam=trt.yaml --waveglow_path="./waveglow_256channels.pt" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save." +python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save." ``` -Sample result waveforms are [FP32](https://gitlab-master.nvidia.com/dahn/fastspeech/-/tree/master/fastspeech/trt/samples) and [FP16](https://gitlab-master.nvidia.com/dahn/fastspeech/-/tree/master/fastspeech/trt/samples_fp16). +Sample result waveforms are [FP32](fastspeech/trt/samples) and [FP16](fastspeech/trt/samples_fp16).
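Note that the NGC WaveGlow checkpoint referenced above is packaged differently from the old Google Drive one: it stores a DistributedDataParallel `state_dict` (keys prefixed with `module.`) rather than a pickled model object. Below is a minimal sketch of the loading step, mirroring the `WaveGlowInferencer` changes later in this diff; the checkpoint path is an example.

```python
# Sketch of loading the NGC WaveGlow checkpoint (path is an example).
import torch

ckpt = torch.load("./nvidia_waveglow256pyt_fp16", map_location="cpu")
# NGC checkpoints store a DDP-wrapped state_dict; strip the "module." prefix
# so it can be loaded into a bare WaveGlow model for single-GPU inference.
state_dict = {k.replace("module.", ""): v for k, v in ckpt["state_dict"].items()}
```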
## Performance @@ -391,7 +382,17 @@ The following sections provide details on how we achieved our performance and ac #### Training performance results -Our results were obtained by running the script in [training performance benchmark](#training-performance-benchmark) in the PyTorch-20.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in number of mels per second) were averaged over an entire training epoch. +Our results were obtained by running the script in [training performance benchmark](#training-performance-benchmark) on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in number of mels per second) were averaged over an entire training epoch. + + + +##### Training performance: NVIDIA DGX-1 (8x V100 16GB) | GPUs | Batch size / GPU | Throughput(mels/s) - FP32 | Throughput(mels/s) - mixed precision | Throughput speedup (FP32 - mixed precision) | Multi-GPU Weak scaling - FP32 | Multi-GPU Weak scaling - mixed precision |---|----|--------|--------|------|-----|------| @@ -401,7 +402,7 @@ Our results were obtained by running the script in [training performance benchma #### Inference performance results -Our results were obtained by running the script in [inference performance benchmark](#inference-performance-benchmark) in the PyTorch-20.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU and a NVIDIA T4. The following tables show inference statistics for the FastSpeech and WaveGlow text-to-speech system on PyTorch and comparisons by framework with batch size 1 in FP16, gathered from 1000 inference runs. Latency is measured from the start of FastSpeech inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor which tells how many seconds of speech are generated in 1 second of compute. The used WaveGlow model is a 256-channel model. The numbers reported below were taken with a moderate length of 128 characters. +Our results were obtained by running the script in [inference performance benchmark](#inference-performance-benchmark) on NVIDIA DGX-1 with 1x V100 16GB GPU and a NVIDIA T4. The following tables show inference statistics for the FastSpeech and WaveGlow text-to-speech system on PyTorch and comparisons by framework with batch size 1 in FP16, gathered from 1000 inference runs. Latency is measured from the start of FastSpeech inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor which tells how many seconds of speech are generated in 1 second of compute. The used WaveGlow model is a 256-channel model. The numbers reported below were taken with a moderate length of 128 characters. ##### Inference performance: NVIDIA DGX-1 (1x V100 16GB) @@ -442,9 +443,12 @@ Our results were obtained by running the script in [inference performance benchm ## Release notes ### Changelog +Oct 2020 +- PyTorch 1.7, TensorRT 7.2 support + July 2020 - Initial release ### Known issues -There are no known issues in this release. +There are no known issues in this release. 
\ No newline at end of file diff --git a/CUDA-Optimized/FastSpeech/fastspeech/dataset/ljspeech_dataset.py b/CUDA-Optimized/FastSpeech/fastspeech/dataset/ljspeech_dataset.py index 97ef6b6e..ae51f193 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/dataset/ljspeech_dataset.py +++ b/CUDA-Optimized/FastSpeech/fastspeech/dataset/ljspeech_dataset.py @@ -24,11 +24,15 @@ import csv +import pprint + import librosa from torch.utils.data import Dataset import pandas as pd from fastspeech.text_norm import text_to_sequence from fastspeech import audio +from fastspeech.utils.logging import tprint + import os import pathlib @@ -38,6 +42,8 @@ from tqdm import tqdm from fastspeech import hparam as hp +pp = pprint.PrettyPrinter(indent=4, width=1000) + class LJSpeechDataset(Dataset): def __init__(self, root_path, meta_file="metadata.csv", @@ -130,7 +136,7 @@ class LJSpeechDataset(Dataset): return data -def preprocess_mel(hparam="base.yaml"): +def preprocess_mel(hparam="base.yaml", **kwargs): """The script for preprocessing mel-spectrograms from the dataset. By default, this script assumes to load parameters in the default config file, fastspeech/hparams/base.yaml. @@ -147,8 +153,9 @@ def preprocess_mel(hparam="base.yaml"): hparam (str, optional): Path to default config file. Defaults to "base.yaml". """ - hp.set_hparam(hparam) - + hp.set_hparam(hparam, kwargs) + tprint("Hparams:\n{}".format(pp.pformat(hp))) + pathlib.Path(hp.mels_path).mkdir(parents=True, exist_ok=True) dataset = LJSpeechDataset(hp.dataset_path, mels_path=None) diff --git a/CUDA-Optimized/FastSpeech/fastspeech/hparams/base.yaml b/CUDA-Optimized/FastSpeech/fastspeech/hparams/base.yaml index 4a0ab8bf..f3a1fe12 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/hparams/base.yaml +++ b/CUDA-Optimized/FastSpeech/fastspeech/hparams/base.yaml @@ -1,7 +1,7 @@ # Path dataset_path: "/workspace/fastspeech/LJSpeech-1.1" tacotron2_path: "/workspace/fastspeech/tacotron2_statedict.pt" -waveglow_path: "/workspace/fastspeech/waveglow_256channels.pt" +waveglow_path: "/workspace/fastspeech/nvidia_waveglow256pyt_fp16" mels_path: "/workspace/fastspeech/mels_ljspeech1.1" aligns_path: "/workspace/fastspeech/aligns_ljspeech1.1" log_path: "/workspace/fastspeech/logs" diff --git a/CUDA-Optimized/FastSpeech/fastspeech/hparams/trt.yaml b/CUDA-Optimized/FastSpeech/fastspeech/hparams/trt.yaml index e2c919c2..eb26b59f 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/hparams/trt.yaml +++ b/CUDA-Optimized/FastSpeech/fastspeech/hparams/trt.yaml @@ -3,7 +3,7 @@ parent_yaml: 'infer.yaml' # Inference batch_size: 1 # Batch size. use_trt: True # Usage of TensorRT. Must be True to enable TensorRT. -use_fp16: True # Usage of FP16. Set to True to enable half precision for the engine. +use_fp16: True # Usage of FP16. Set to True to enable half precision for the engine. # TRT trt_file_path: "/workspace/fastspeech/fastspeech.fp16.b1.trt" # Built TensorRT engine file path. 
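The `preprocess_mel` change above forwards arbitrary keyword arguments into `hp.set_hparam`, so any value in fastspeech/hparams/base.yaml can be overridden from the command line. A minimal sketch of that flow, assuming the fire-based CLI used by the repo's other entry points (the `--mels_path` override is illustrative):

```python
# Sketch: command-line flags flowing into the hparam object via fire.
import fire
from fastspeech import hparam as hp

def preprocess_mel(hparam="base.yaml", **kwargs):
    # kwargs such as mels_path="./mels_custom" override the values
    # loaded from fastspeech/hparams/base.yaml.
    hp.set_hparam(hparam, kwargs)
    print(hp.mels_path)

if __name__ == "__main__":
    # e.g. python this_script.py --mels_path="./mels_custom"
    fire.Fire(preprocess_mel)
```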
diff --git a/CUDA-Optimized/FastSpeech/fastspeech/inferencer/waveglow_inferencer.py b/CUDA-Optimized/FastSpeech/fastspeech/inferencer/waveglow_inferencer.py index 4b85ea86..2f5a1c33 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/inferencer/waveglow_inferencer.py +++ b/CUDA-Optimized/FastSpeech/fastspeech/inferencer/waveglow_inferencer.py @@ -30,6 +30,21 @@ from fastspeech.utils.logging import tprint from fastspeech.utils.pytorch import to_cpu_numpy, to_device_async from fastspeech.inferencer.denoiser import Denoiser +from waveglow.model import WaveGlow +import argparse + +def unwrap_distributed(state_dict): + """ + Unwraps model from DistributedDataParallel. + DDP wraps model in additional "module.", it needs to be removed for single + GPU inference. + :param state_dict: model's state dict + """ + new_state_dict = {} + for key, value in state_dict.items(): + new_key = key.replace('module.', '') + new_state_dict[new_key] = value + return new_state_dict class WaveGlowInferencer(object): @@ -40,11 +55,36 @@ class WaveGlowInferencer(object): self.use_denoiser = use_denoiser # model - sys.path.append('waveglow') - self.model = torch.load(self.ckpt_file, map_location=self.device)['model'] - self.model = self.model.remove_weightnorm(self.model) - self.model.eval() + # sys.path.append('waveglow') + + from waveglow.arg_parser import parse_waveglow_args + parser = argparse.ArgumentParser() + model_parser = parse_waveglow_args(parser) + args, _ = model_parser.parse_known_args() + model_config = dict( + n_mel_channels=args.n_mel_channels, + n_flows=args.flows, + n_group=args.groups, + n_early_every=args.early_every, + n_early_size=args.early_size, + WN_config=dict( + n_layers=args.wn_layers, + kernel_size=args.wn_kernel_size, + n_channels=args.wn_channels + ) + ) + self.model = WaveGlow(**model_config) + + state_dict = torch.load(self.ckpt_file, map_location=self.device)['state_dict'] + state_dict = unwrap_distributed(state_dict) + self.model.load_state_dict(state_dict) + self.model = to_device_async(self.model, self.device) + + self.model = self.model.remove_weightnorm(self.model) + + self.model.eval() + if self.use_fp16: self.model = self.model.half() self.model = self.model diff --git a/CUDA-Optimized/FastSpeech/fastspeech/perf_infer_ljspeech.py b/CUDA-Optimized/FastSpeech/fastspeech/perf_infer_ljspeech.py new file mode 100644 index 00000000..70766902 --- /dev/null +++ b/CUDA-Optimized/FastSpeech/fastspeech/perf_infer_ljspeech.py @@ -0,0 +1,202 @@ +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of the NVIDIA CORPORATION nor the +# names of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED.
IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY +# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +import pprint +import sys +import time + +import fire +import torch +from tqdm import tqdm + +from fastspeech import DEFAULT_DEVICE +from fastspeech import hparam as hp +from fastspeech.data_load import PadDataLoader +from fastspeech.dataset.ljspeech_dataset import LJSpeechDataset +from fastspeech.model.fastspeech import Fastspeech +from fastspeech.utils.logging import tprint +from fastspeech.utils.pytorch import to_cpu_numpy, to_device_async +from fastspeech.infer import get_inferencer +from fastspeech.inferencer.waveglow_inferencer import WaveGlowInferencer +from contextlib import ExitStack +import numpy as np + +try: + from apex import amp +except ImportError: + raise ImportError('Required to install apex.') + +pp = pprint.PrettyPrinter(indent=4, width=1000) + +WARMUP_ITERS = 3 + + +def perf_inference(hparam="infer.yaml", + with_vocoder=False, + n_iters=None, + device=DEFAULT_DEVICE, + **kwargs): + """The script for estimating inference performance. + + By default, this script assumes to load parameters in the default config file, fastspeech/hparams/infer.yaml. + + Besides the flags, you can also set parameters in the config file via the command-line. For example, + --dataset_path=DATASET_PATH + Path to dataset directory. + --checkpoint_path=CHECKPOINT_PATH + Path to checkpoint directory. The latest checkpoint will be loaded. + --batch_size=BATCH_SIZE + Batch size to use. Defaults to 1. + + Refer to fastspeech/hparams/infer.yaml to see more parameters. + + Args: + hparam (str, optional): Path to default config file. Defaults to "infer.yaml". + with_vocoder (bool, optional): Whether or not to estimate with a vocoder. Defaults to False. + n_iters (int, optional): Number of batches to estimate. Defaults to None (an epoch). + device (str, optional): Device to use. Defaults to "cuda" if available, or "cpu".
+ + """ + + hp.set_hparam(hparam, kwargs) + tprint("Hparams:\n{}".format(pp.pformat(hp))) + tprint("Device count: {}".format(torch.cuda.device_count())) + + model = Fastspeech( + max_seq_len=hp.max_seq_len, + d_model=hp.d_model, + phoneme_side_n_layer=hp.phoneme_side_n_layer, + phoneme_side_head=hp.phoneme_side_head, + phoneme_side_conv1d_filter_size=hp.phoneme_side_conv1d_filter_size, + phoneme_side_output_size=hp.phoneme_side_output_size, + mel_side_n_layer=hp.mel_side_n_layer, + mel_side_head=hp.mel_side_head, + mel_side_conv1d_filter_size=hp.mel_side_conv1d_filter_size, + mel_side_output_size=hp.mel_side_output_size, + duration_predictor_filter_size=hp.duration_predictor_filter_size, + duration_predictor_kernel_size=hp.duration_predictor_kernel_size, + fft_conv1d_kernel=hp.fft_conv1d_kernel, + fft_conv1d_padding=hp.fft_conv1d_padding, + dropout=hp.dropout, + n_mels=hp.num_mels, + fused_layernorm=hp.fused_layernorm + ) + + dataset = LJSpeechDataset(root_path=hp.dataset_path, + sr=hp.sr, + n_fft=hp.n_fft, + win_len=hp.win_len, + hop_len=hp.hop_len, + n_mels=hp.num_mels, + mel_fmin=hp.mel_fmin, + mel_fmax=hp.mel_fmax, + exclude_mels=True, + sort_by_length=True if hp.use_trt and hp.trt_multi_engine else False + ) + tprint("Dataset size: {}".format(len(dataset))) + + data_loader = PadDataLoader(dataset, + batch_size=hp.batch_size, + num_workers=hp.n_workers, + shuffle=False if hp.use_trt and hp.trt_multi_engine else True, + drop_last=True, + ) + + fs_inferencer = get_inferencer(model, data_loader, device) + + if with_vocoder: + if hp.use_trt: + from fastspeech.trt.waveglow_trt_inferencer import WaveGlowTRTInferencer + wb_inferencer = WaveGlowTRTInferencer(ckpt_file=hp.waveglow_path, engine_file=hp.waveglow_engine_path, use_fp16=hp.use_fp16) + else: + wb_inferencer = WaveGlowInferencer(ckpt_file=hp.waveglow_path, device=device, use_fp16=hp.use_fp16) + + with fs_inferencer, wb_inferencer if with_vocoder else ExitStack(): + + tprint("Perf started. 
Batch size={}.".format(hp.batch_size)) + + latencies = [] + throughputs = [] + + n_iters = min(n_iters, len(data_loader)) if n_iters else len(data_loader) + assert(n_iters > WARMUP_ITERS) + for i in tqdm(range(n_iters)): + start = time.time() + + outputs = fs_inferencer.infer() + + mels = outputs['mel'] + mel_masks = outputs['mel_mask'] + assert(mels.is_cuda) + + if with_vocoder: + # remove padding + max_len = mel_masks.sum(axis=1).max() + mels = mels[..., :max_len] + mel_masks = mel_masks[..., :max_len] + + with torch.no_grad(): + wavs = wb_inferencer.infer(mels) + wavs = to_cpu_numpy(wavs) + else: + # include time for DtoH copy + to_cpu_numpy(mels) + to_cpu_numpy(mel_masks) + + end = time.time() + + if i > WARMUP_ITERS-1: + time_elapsed = end - start + generated_samples = len(mel_masks.nonzero()) * hp.hop_len + throughput = generated_samples / time_elapsed + + latencies.append(time_elapsed) + throughputs.append(throughput) + + latencies.sort() + + avg_latency = np.mean(latencies) + std_latency = np.std(latencies) + latency_90 = max(latencies[:int(len(latencies)*0.90)]) if n_iters > 1 else 0 + latency_95 = max(latencies[:int(len(latencies)*0.95)]) if n_iters > 1 else 0 + latency_99 = max(latencies[:int(len(latencies)*0.99)]) if n_iters > 1 else 0 + + throughput = np.mean(throughputs) + rtf = throughput / (hp.sr * hp.batch_size) + + tprint("Batch size\tPrecision\tAvg Latency(s)\tStd Latency(s)\tLatency 90%(s)\tLatency 95%(s)\tLatency 99%(s)\tThroughput(samples/s)\tAvg RTF\n\ + {}\t{}\t{:.4f}\t{:.4f}\t{:.4f}\t{:.4f}\t{:.4f}\t{}\t{:.2f}".format( + hp.batch_size, + "FP16" if hp.use_fp16 else "FP32", + avg_latency, + std_latency, + latency_90, + latency_95, + latency_99, + int(throughput), + rtf)) + + +if __name__ == '__main__': + fire.Fire(perf_inference) diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trainer/trainer.py b/CUDA-Optimized/FastSpeech/fastspeech/trainer/trainer.py index c48bc8c4..e9d12ca9 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/trainer/trainer.py +++ b/CUDA-Optimized/FastSpeech/fastspeech/trainer/trainer.py @@ -41,6 +41,7 @@ from fastspeech.utils.fp16 import cast_model_to_half import torch.cuda.profiler as profiler from fastspeech.utils.logging import tprint +from fastspeech.utils.time import TimeElapsed plt.switch_backend('Agg') @@ -136,12 +137,15 @@ class Trainer(object): if self.nvprof_iter_start and i == self.nvprof_iter_start: profiler.start() + timer = TimeElapsed(name="Training time during profiling", format=":.6f") + timer.start() with Nvtx("step #{}".format(self.step)): loss, meta = self.do_step() if self.nvprof_iter_end and i == self.nvprof_iter_end: profiler.stop() + timer.end() if self.lr_scheduler: for param_group in self.optimizer.param_groups: diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/fastspeech_trt_inferencer.py b/CUDA-Optimized/FastSpeech/fastspeech/trt/fastspeech_trt_inferencer.py index 17f7e953..0f4db331 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/trt/fastspeech_trt_inferencer.py +++ b/CUDA-Optimized/FastSpeech/fastspeech/trt/fastspeech_trt_inferencer.py @@ -61,27 +61,30 @@ class FastSpeechTRTInferencer(TRTInferencer): super(FastSpeechTRTInferencer, self).__init__(model_name, model, data_loader, ckpt_path, ckpt_file, trt_max_ws_size, trt_file_path, trt_force_build, use_fp16) def build_engine(self): + engine = None if self.trt_file_path and os.path.isfile(self.trt_file_path) and not self.trt_force_build: with open(self.trt_file_path, 'rb') as f: engine_str = f.read() with trt.Runtime(TRT_LOGGER) as runtime: - self.engine = 
runtime.deserialize_cuda_engine(engine_str) + engine = runtime.deserialize_cuda_engine(engine_str) - if self.engine: + if engine: tprint('TRT Engine Loaded from {} successfully.'.format(self.trt_file_path)) - return + return engine else: tprint('Loading TRT Engine from {} failed.'.format(self.trt_file_path)) tprint('Building a TRT Engine..') - self.engine = self.do_build_engine() + engine = self.do_build_engine() tprint('TRT Engine Built.') if self.trt_file_path: with open(self.trt_file_path, 'wb') as f: - f.write(self.engine.serialize()) + f.write(engine.serialize()) tprint('TRT Engine Saved in {}.'.format(self.trt_file_path)) + return engine + def create_plugins(self): # create "adding positional encoding" plugin self.plugins['AddPosEncPlugin'] = self.get_plugin_creator( diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.d b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.d index a8e15b44..47d95e84 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.d +++ b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.d @@ -143,6 +143,8 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/sm_61_intrinsics.hpp \ /usr/local/cuda/include/crt/sm_70_rt.h \ /usr/local/cuda/include/crt/sm_70_rt.hpp \ + /usr/local/cuda/include/crt/sm_80_rt.h \ + /usr/local/cuda/include/crt/sm_80_rt.hpp \ /usr/local/cuda/include/surface_functions.h \ /usr/local/cuda/include/texture_fetch_functions.h \ /usr/local/cuda/include/texture_indirect_functions.h \ @@ -280,6 +282,7 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/detail/config/compiler.h \ /usr/local/cuda/include/thrust/detail/config/cpp_dialect.h \ /usr/local/cuda/include/thrust/detail/config/cpp_compatibility.h \ + /usr/local/cuda/include/thrust/detail/config/deprecated.h \ /usr/local/cuda/include/thrust/detail/config/host_system.h \ /usr/local/cuda/include/thrust/detail/config/device_system.h \ /usr/local/cuda/include/thrust/detail/config/host_device.h \ @@ -301,6 +304,8 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/cuda/detail/execution_policy.h \ /usr/local/cuda/include/thrust/iterator/detail/any_system_tag.h \ /usr/local/cuda/include/thrust/system/cuda/config.h \ + /usr/local/cuda/include/cub/util_namespace.cuh \ + /usr/local/cuda/include/cub/version.cuh \ /usr/local/cuda/include/thrust/detail/allocator_aware_execution_policy.h \ /usr/local/cuda/include/thrust/detail/execute_with_allocator_fwd.h \ /usr/local/cuda/include/thrust/detail/type_traits.h \ @@ -379,8 +384,10 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/cpp/detail/for_each.h \ /usr/local/cuda/include/thrust/system/cuda/detail/for_each.h \ /usr/local/cuda/include/thrust/system/cuda/detail/util.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/util_arch.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/util_namespace.cuh \ + /usr/local/cuda/include/cub/util_arch.cuh \ + /usr/local/cuda/include/cub/util_cpp_dialect.cuh \ + /usr/local/cuda/include/cub/util_compiler.cuh \ + /usr/local/cuda/include/cub/util_macro.cuh \ /usr/local/cuda/include/thrust/system_error.h \ /usr/local/cuda/include/thrust/system/error_code.h \ /usr/local/cuda/include/thrust/system/detail/errno.h \ @@ -404,35 +411,37 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/cuda/detail/core/util.h \ /usr/local/cuda/include/cuda_occupancy.h \ 
/usr/local/cuda/include/thrust/type_traits/is_contiguous_iterator.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/block_load.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/block_exchange.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../util_ptx.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../util_type.cuh \ + /usr/local/cuda/include/cub/block/block_load.cuh \ + /usr/local/cuda/include/cub/block/block_exchange.cuh \ + /usr/local/cuda/include/cub/block/../config.cuh \ + /usr/local/cuda/include/cub/block/../util_deprecated.cuh \ + /usr/local/cuda/include/cub/block/../util_ptx.cuh \ + /usr/local/cuda/include/cub/block/../util_type.cuh \ /usr/include/c++/7/cfloat \ /usr/lib/gcc/x86_64-linux-gnu/7/include/float.h \ /usr/local/cuda/include/cuda_fp16.h \ /usr/local/cuda/include/cuda_fp16.hpp \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../util_macro.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../util_debug.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../iterator/cache_modified_input_iterator.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../iterator/../thread/thread_load.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../iterator/../thread/thread_store.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/../iterator/../util_device.cuh \ + /usr/local/cuda/include/cub/block/../util_debug.cuh \ + /usr/local/cuda/include/cub/block/../iterator/cache_modified_input_iterator.cuh \ + /usr/local/cuda/include/cub/block/../iterator/../thread/thread_load.cuh \ + /usr/local/cuda/include/cub/block/../iterator/../thread/thread_store.cuh \ + /usr/local/cuda/include/cub/block/../iterator/../util_device.cuh \ + /usr/include/c++/7/atomic \ /usr/local/cuda/include/thrust/iterator/iterator_facade.h \ /usr/local/cuda/include/thrust/iterator/detail/iterator_facade_category.h \ /usr/local/cuda/include/thrust/iterator/detail/is_iterator_category.h \ /usr/local/cuda/include/thrust/iterator/detail/distance_from_result.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/block_store.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/block_scan.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_raking.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/../../block/block_raking_layout.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/../../thread/thread_reduce.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/../../thread/../thread/thread_operators.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/../../thread/thread_scan.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/../../warp/warp_scan.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/../../warp/specializations/warp_scan_shfl.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/../../warp/specializations/warp_scan_smem.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/specializations/block_scan_warp_scans.cuh \ + /usr/local/cuda/include/cub/block/block_store.cuh \ + /usr/local/cuda/include/cub/block/block_scan.cuh \ + /usr/local/cuda/include/cub/block/specializations/block_scan_raking.cuh \ + 
/usr/local/cuda/include/cub/block/specializations/../../block/block_raking_layout.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../thread/thread_reduce.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../thread/../thread/thread_operators.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../thread/thread_scan.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../warp/warp_scan.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../warp/specializations/warp_scan_shfl.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../warp/specializations/warp_scan_smem.cuh \ + /usr/local/cuda/include/cub/block/specializations/block_scan_warp_scans.cuh \ /usr/local/cuda/include/thrust/distance.h \ /usr/local/cuda/include/thrust/detail/distance.inl \ /usr/local/cuda/include/thrust/advance.h \ @@ -446,6 +455,7 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/iterator/detail/minimum_category.h \ /usr/local/cuda/include/thrust/iterator/detail/zip_iterator.inl \ /usr/local/cuda/include/thrust/detail/internal_functional.h \ + /usr/local/cuda/include/thrust/detail/memory_wrapper.h \ /usr/local/cuda/include/thrust/system/detail/adl/transform.h \ /usr/local/cuda/include/thrust/system/detail/sequential/transform.h \ /usr/local/cuda/include/thrust/system/cpp/detail/transform.h \ @@ -568,14 +578,15 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/cpp/detail/scan.h \ /usr/local/cuda/include/thrust/system/cuda/detail/scan.h \ /usr/local/cuda/include/thrust/detail/cstdint.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/device_scan.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_scan.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/agent_scan.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/single_pass_scan_operators.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../warp/warp_reduce.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../warp/specializations/warp_reduce_shfl.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../warp/specializations/warp_reduce_smem.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../grid/grid_queue.cuh \ + /usr/local/cuda/include/cub/device/device_scan.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_scan.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_scan.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/single_pass_scan_operators.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../warp/warp_reduce.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../warp/specializations/warp_reduce_shfl.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../warp/specializations/warp_reduce_smem.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../grid/grid_queue.cuh \ + /usr/local/cuda/include/thrust/system/cuda/detail/dispatch.h \ /usr/local/cuda/include/thrust/detail/mpl/math.h \ /usr/local/cuda/include/thrust/detail/minmax.h \ /usr/local/cuda/include/thrust/system/detail/adl/scan_by_key.h \ @@ -584,11 +595,11 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/cuda/detail/scan_by_key.h \ /usr/local/cuda/include/thrust/system/cuda/execution_policy.h \ 
/usr/local/cuda/include/thrust/system/cuda/detail/adjacent_difference.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/device_select.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_select_if.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/agent_select_if.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../block/block_discontinuity.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/block/block_adjacent_difference.cuh \ + /usr/local/cuda/include/cub/device/device_select.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_select_if.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_select_if.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../block/block_discontinuity.cuh \ + /usr/local/cuda/include/cub/block/block_adjacent_difference.cuh \ /usr/local/cuda/include/thrust/adjacent_difference.h \ /usr/local/cuda/include/thrust/detail/adjacent_difference.inl \ /usr/local/cuda/include/thrust/system/detail/generic/adjacent_difference.h \ @@ -616,19 +627,20 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/cpp/detail/copy_if.h \ /usr/local/cuda/include/thrust/system/cuda/detail/count.h \ /usr/local/cuda/include/thrust/system/cuda/detail/reduce.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/device_reduce.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/../iterator/arg_index_input_iterator.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/agent_reduce.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../block/block_reduce.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../block/specializations/block_reduce_raking.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../block/specializations/block_reduce_raking_commutative_only.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../block/specializations/block_reduce_warp_reductions.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../grid/grid_mapping.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../grid/grid_even_share.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_reduce_by_key.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/agent_reduce_by_key.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../iterator/constant_input_iterator.cuh \ + /usr/local/cuda/include/cub/device/device_reduce.cuh \ + /usr/local/cuda/include/cub/device/../iterator/arg_index_input_iterator.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_reduce.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_reduce.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../block/block_reduce.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../block/specializations/block_reduce_raking.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../block/specializations/block_reduce_raking_commutative_only.cuh \ + 
/usr/local/cuda/include/cub/device/dispatch/../../agent/../block/specializations/block_reduce_warp_reductions.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../grid/grid_mapping.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../grid/grid_even_share.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_reduce_by_key.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_reduce_by_key.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../iterator/constant_input_iterator.cuh \ + /usr/local/cuda/include/thrust/system/cuda/detail/make_unsigned_special.h \ /usr/local/cuda/include/thrust/reduce.h \ /usr/local/cuda/include/thrust/detail/reduce.inl \ /usr/local/cuda/include/thrust/system/detail/generic/reduce.h \ @@ -717,7 +729,7 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/cuda/detail/gather.h \ /usr/local/cuda/include/thrust/system/cuda/detail/inner_product.h \ /usr/local/cuda/include/thrust/system/cuda/detail/partition.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/device_partition.cuh \ + /usr/local/cuda/include/cub/device/device_partition.cuh \ /usr/local/cuda/include/thrust/partition.h \ /usr/local/cuda/include/thrust/detail/partition.inl \ /usr/local/cuda/include/thrust/system/detail/generic/partition.h \ @@ -745,12 +757,12 @@ AddPosEncPlugin.o : AddPosEncPlugin.cu \ /usr/local/cuda/include/thrust/system/detail/adl/find.h \ /usr/local/cuda/include/thrust/system/detail/adl/sort.h \ /usr/local/cuda/include/thrust/system/cuda/detail/sort.h \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/device_radix_sort.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/dispatch_radix_sort.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/agent_radix_sort_upsweep.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/agent_radix_sort_downsweep.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../agent/../block/block_radix_rank.cuh \ - /usr/local/cuda/include/thrust/system/cuda/detail/cub/device/dispatch/../../block/block_radix_sort.cuh \ + /usr/local/cuda/include/cub/device/device_radix_sort.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_radix_sort.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_radix_sort_upsweep.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_radix_sort_downsweep.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../block/block_radix_rank.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../block/block_radix_sort.cuh \ /usr/local/cuda/include/thrust/detail/trivial_sequence.h \ /usr/local/cuda/include/thrust/sequence.h \ /usr/local/cuda/include/thrust/detail/sequence.inl \ diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.o b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.o index c9eebe33..c9d26c70 100644 Binary files a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.o and b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.o differ diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.so b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.so index 4e62a986..c0c1a303 100755 Binary files 
a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.so and b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/AddPosEncPlugin.so differ diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/Makefile b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/Makefile index 04708ff4..75a83e26 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/Makefile +++ b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/add_pos_enc/Makefile @@ -22,6 +22,10 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +ARCH = "sm_70" # Volta +# ARCH = "sm_75" # Turing +# ARCH = "sm_80" # Ampere + CUDA_PATH =/usr/local/cuda CUDA_INC_PATH = $(CUDA_PATH)/include @@ -30,7 +34,7 @@ CUDA_COM_PATH = $(CUDA_PATH)/samples/common/inc GCC = g++ NVCC = $(CUDA_PATH)/bin/nvcc CCFLAGS = -w -std=c++11 -DNDEBUG -INCLUDES := -I. -I$(CUDA_COM_PATH) -I$(CUDA_INC_PATH) -I/usr/include +INCLUDES := -I$(CUDA_COM_PATH) -I$(CUDA_INC_PATH) -I/usr/include CUDA_LIB_PATH = $(CUDA_PATH)/lib64 LDFLAGS := -L$(CUDA_LIB_PATH) @@ -56,10 +60,10 @@ clean: %.o: %.cu $(NVCC) $(CCFLAGS) -M -MT $@ $(INCLUDES) -o $(@:.o=.d) $< - $(NVCC) $(CCFLAGS) $(INCLUDES) -Xcompiler -fPIC -arch=sm_70 -o $@ -c $< + $(NVCC) $(CCFLAGS) $(INCLUDES) -Xcompiler -fPIC -arch=$(ARCH) -o $@ -c $< $(SO): $(GCC) $(CCFLAGS) -shared -o $@ $+ $(LDFLAGS) test: all - python3 test_add_pos_enc_plugin.py \ No newline at end of file + python3 test_add_pos_enc_plugin.py diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/Makefile b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/Makefile index 4c0078de..9a4bc6cb 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/Makefile +++ b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/Makefile @@ -22,6 +22,10 @@ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS # SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ARCH = "sm_70" # Volta +# ARCH = "sm_75" # Turing +# ARCH = "sm_80" # Ampere + CUDA_PATH =/usr/local/cuda CUDA_INC_PATH = $(CUDA_PATH)/include @@ -32,7 +36,8 @@ NVCC = $(CUDA_PATH)/bin/nvcc # CCFLAGS = -g -std=c++11 -DNDEBUG CCFLAGS = -w -std=c++11 -DNDEBUG # CCFLAGS+= -DDEBUG_ME -INCLUDES := -I../../../../cub -I$(CUDA_COM_PATH) -I$(CUDA_INC_PATH) -I/usr/include +INCLUDES := -I$(CUDA_COM_PATH) -I$(CUDA_INC_PATH) -I/usr/include # cuda 11 +# INCLUDES := -I../../../../cub -I$(CUDA_COM_PATH) -I$(CUDA_INC_PATH) -I/usr/include # cuda 10 CUDA_LIB_PATH = $(CUDA_PATH)/lib64 LDFLAGS := -L$(CUDA_LIB_PATH) @@ -58,10 +63,10 @@ clean: %.o: %.cu $(NVCC) $(CCFLAGS) -M -MT $@ $(INCLUDES) -o $(@:.o=.d) $< - $(NVCC) $(CCFLAGS) $(INCLUDES) -Xcompiler -fPIC -arch=sm_70 -o $@ -c $< + $(NVCC) $(CCFLAGS) $(INCLUDES) -Xcompiler -fPIC -arch=$(ARCH) -o $@ -c $< $(SO): $(GCC) $(CCFLAGS) -shared -o $@ $+ $(LDFLAGS) test: all - python3 test_repeat_plugin.py \ No newline at end of file + python3 test_repeat_plugin.py diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.d b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.d index bdf48c9b..84842a5f 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.d +++ b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.d @@ -143,6 +143,8 @@ RepeatPlugin.o : RepeatPlugin.cu \ /usr/local/cuda/include/sm_61_intrinsics.hpp \ /usr/local/cuda/include/crt/sm_70_rt.h \ /usr/local/cuda/include/crt/sm_70_rt.hpp \ + /usr/local/cuda/include/crt/sm_80_rt.h \ + /usr/local/cuda/include/crt/sm_80_rt.hpp \ /usr/local/cuda/include/surface_functions.h \ /usr/local/cuda/include/texture_fetch_functions.h \ /usr/local/cuda/include/texture_indirect_functions.h \ @@ -274,75 +276,68 @@ RepeatPlugin.o : RepeatPlugin.cu \ /usr/include/c++/7/bits/atomic_base.h \ /usr/include/c++/7/bits/atomic_lockfree_defines.h \ /usr/include/c++/7/backward/auto_ptr.h \ - ../../../../cub/cub/cub.cuh \ - ../../../../cub/cub/block/block_histogram.cuh \ - ../../../../cub/cub/block/specializations/block_histogram_sort.cuh \ - ../../../../cub/cub/block/specializations/../../block/block_radix_sort.cuh \ - ../../../../cub/cub/block/specializations/../../block/block_exchange.cuh \ - ../../../../cub/cub/block/specializations/../../block/../util_ptx.cuh \ - ../../../../cub/cub/block/specializations/../../block/../util_type.cuh \ + /usr/local/cuda/include/cub/cub.cuh \ + /usr/local/cuda/include/cub/config.cuh \ + /usr/local/cuda/include/cub/util_arch.cuh \ + /usr/local/cuda/include/cub/util_cpp_dialect.cuh \ + /usr/local/cuda/include/cub/util_compiler.cuh \ + /usr/local/cuda/include/cub/util_namespace.cuh \ + /usr/local/cuda/include/cub/version.cuh \ + /usr/local/cuda/include/cub/util_macro.cuh \ + /usr/local/cuda/include/cub/util_deprecated.cuh \ + /usr/local/cuda/include/cub/block/block_histogram.cuh \ + /usr/local/cuda/include/cub/block/specializations/block_histogram_sort.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/block_radix_sort.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/block_exchange.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../util_ptx.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../util_type.cuh \ /usr/include/c++/7/cfloat \ /usr/lib/gcc/x86_64-linux-gnu/7/include/float.h \ - ../../../../cub/cub/block/specializations/../../block/../util_macro.cuh \ - ../../../../cub/cub/block/specializations/../../block/../util_namespace.cuh \ 
- ../../../../cub/cub/block/specializations/../../block/../util_arch.cuh \ - ../../../../cub/cub/block/specializations/../../block/../util_debug.cuh \ - ../../../../cub/cub/block/specializations/../../block/block_radix_rank.cuh \ - ../../../../cub/cub/block/specializations/../../block/../thread/thread_reduce.cuh \ - ../../../../cub/cub/block/specializations/../../block/../thread/../thread/thread_operators.cuh \ - ../../../../cub/cub/block/specializations/../../block/../thread/thread_scan.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/block_scan.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/block_scan_raking.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/../../block/block_raking_layout.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/../../warp/warp_scan.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/warp_scan_shfl.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/warp_scan_smem.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/../../thread/thread_load.cuh \ - /usr/local/cuda/include/cuda.h \ + /usr/local/cuda/include/cub/block/specializations/../../block/../util_debug.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/block_radix_rank.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../thread/thread_reduce.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../thread/../thread/thread_operators.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../thread/thread_scan.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/block_scan.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/block_scan_raking.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/../../block/block_raking_layout.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/../../warp/warp_scan.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/warp_scan_shfl.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/warp_scan_smem.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/../../thread/thread_load.cuh \ /usr/include/c++/7/iterator \ /usr/include/c++/7/bits/stream_iterator.h \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/../../thread/thread_store.cuh \ - ../../../../cub/cub/block/specializations/../../block/../block/specializations/block_scan_warp_scans.cuh \ - ../../../../cub/cub/block/specializations/../../block/block_discontinuity.cuh \ - ../../../../cub/cub/block/specializations/block_histogram_atomic.cuh \ - ../../../../cub/cub/block/block_load.cuh \ - ../../../../cub/cub/block/../iterator/cache_modified_input_iterator.cuh \ - ../../../../cub/cub/block/../iterator/../util_device.cuh \ - ../../../../cub/cub/block/block_reduce.cuh \ - ../../../../cub/cub/block/specializations/block_reduce_raking.cuh \ - ../../../../cub/cub/block/specializations/../../warp/warp_reduce.cuh \ - 
../../../../cub/cub/block/specializations/../../warp/specializations/warp_reduce_shfl.cuh \ - ../../../../cub/cub/block/specializations/../../warp/specializations/warp_reduce_smem.cuh \ - ../../../../cub/cub/block/specializations/block_reduce_raking_commutative_only.cuh \ - ../../../../cub/cub/block/specializations/block_reduce_warp_reductions.cuh \ - ../../../../cub/cub/block/block_store.cuh \ - ../../../../cub/cub/device/device_histogram.cuh \ - ../../../../cub/cub/device/dispatch/dispatch_histogram.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_histogram.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/../grid/grid_queue.cuh \ - ../../../../cub/cub/device/dispatch/../../thread/thread_search.cuh \ - ../../../../cub/cub/device/device_partition.cuh \ - ../../../../cub/cub/device/dispatch/dispatch_select_if.cuh \ - ../../../../cub/cub/device/dispatch/dispatch_scan.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_scan.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/single_pass_scan_operators.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_select_if.cuh \ - ../../../../cub/cub/device/device_radix_sort.cuh \ - ../../../../cub/cub/device/dispatch/dispatch_radix_sort.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_radix_sort_upsweep.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_radix_sort_downsweep.cuh \ - ../../../../cub/cub/device/dispatch/../../grid/grid_even_share.cuh \ - ../../../../cub/cub/device/dispatch/../../grid/grid_mapping.cuh \ - ../../../../cub/cub/device/device_reduce.cuh \ - ../../../../cub/cub/device/../iterator/arg_index_input_iterator.cuh \ - /usr/local/cuda/include/thrust/version.h \ - /usr/local/cuda/include/thrust/iterator/iterator_facade.h \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/../../warp/specializations/../../thread/thread_store.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/../block/specializations/block_scan_warp_scans.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../block/block_discontinuity.cuh \ + /usr/local/cuda/include/cub/block/specializations/block_histogram_atomic.cuh \ + /usr/local/cuda/include/cub/block/block_load.cuh \ + /usr/local/cuda/include/cub/block/../iterator/cache_modified_input_iterator.cuh \ + /usr/local/cuda/include/cub/block/../iterator/../util_device.cuh \ + /usr/include/c++/7/atomic \ + /usr/include/c++/7/cassert \ + /usr/local/cuda/include/cub/block/block_reduce.cuh \ + /usr/local/cuda/include/cub/block/specializations/block_reduce_raking.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../warp/warp_reduce.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../warp/specializations/warp_reduce_shfl.cuh \ + /usr/local/cuda/include/cub/block/specializations/../../warp/specializations/warp_reduce_smem.cuh \ + /usr/local/cuda/include/cub/block/specializations/block_reduce_raking_commutative_only.cuh \ + /usr/local/cuda/include/cub/block/specializations/block_reduce_warp_reductions.cuh \ + /usr/local/cuda/include/cub/block/block_store.cuh \ + /usr/local/cuda/include/cub/device/device_histogram.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_histogram.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_histogram.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../grid/grid_queue.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../thread/thread_search.cuh \ + 
/usr/local/cuda/include/thrust/system/cuda/detail/core/triple_chevron_launch.h \ /usr/local/cuda/include/thrust/detail/config.h \ + /usr/local/cuda/include/thrust/version.h \ /usr/local/cuda/include/thrust/detail/config/config.h \ /usr/local/cuda/include/thrust/detail/config/simple_defines.h \ /usr/local/cuda/include/thrust/detail/config/compiler.h \ /usr/local/cuda/include/thrust/detail/config/cpp_dialect.h \ /usr/local/cuda/include/thrust/detail/config/cpp_compatibility.h \ + /usr/local/cuda/include/thrust/detail/config/deprecated.h \ /usr/local/cuda/include/thrust/detail/config/host_system.h \ /usr/local/cuda/include/thrust/detail/config/device_system.h \ /usr/local/cuda/include/thrust/detail/config/host_device.h \ @@ -350,9 +345,11 @@ RepeatPlugin.o : RepeatPlugin.cu \ /usr/local/cuda/include/thrust/detail/config/forceinline.h \ /usr/local/cuda/include/thrust/detail/config/exec_check_disable.h \ /usr/local/cuda/include/thrust/detail/config/global_workarounds.h \ - /usr/local/cuda/include/thrust/detail/type_traits.h \ - /usr/local/cuda/include/thrust/detail/type_traits/has_trivial_assign.h \ - /usr/local/cuda/include/thrust/iterator/detail/iterator_facade_category.h \ + /usr/local/cuda/include/thrust/system/cuda/detail/core/alignment.h \ + /usr/local/cuda/include/thrust/system/cuda/detail/util.h \ + /usr/local/cuda/include/thrust/iterator/iterator_traits.h \ + /usr/local/cuda/include/thrust/type_traits/void_t.h \ + /usr/local/cuda/include/thrust/iterator/detail/iterator_traversal_tags.h \ /usr/local/cuda/include/thrust/iterator/detail/host_system_tag.h \ /usr/local/cuda/include/thrust/system/cpp/detail/execution_policy.h \ /usr/local/cuda/include/thrust/system/detail/sequential/execution_policy.h \ @@ -363,6 +360,8 @@ RepeatPlugin.o : RepeatPlugin.cu \ /usr/local/cuda/include/thrust/system/cuda/config.h \ /usr/local/cuda/include/thrust/detail/allocator_aware_execution_policy.h \ /usr/local/cuda/include/thrust/detail/execute_with_allocator_fwd.h \ + /usr/local/cuda/include/thrust/detail/type_traits.h \ + /usr/local/cuda/include/thrust/detail/type_traits/has_trivial_assign.h \ /usr/local/cuda/include/thrust/detail/execute_with_dependencies.h \ /usr/local/cuda/include/thrust/detail/cpp11_required.h \ /usr/local/cuda/include/thrust/detail/type_deduction.h \ @@ -370,35 +369,98 @@ RepeatPlugin.o : RepeatPlugin.cu \ /usr/local/cuda/include/thrust/type_traits/remove_cvref.h \ /usr/local/cuda/include/thrust/detail/alignment.h \ /usr/local/cuda/include/thrust/detail/dependencies_aware_execution_policy.h \ + /usr/local/cuda/include/thrust/iterator/detail/iterator_traits.inl \ /usr/local/cuda/include/thrust/iterator/iterator_categories.h \ /usr/local/cuda/include/thrust/iterator/detail/iterator_category_with_system_and_traversal.h \ - /usr/local/cuda/include/thrust/iterator/detail/iterator_traversal_tags.h \ /usr/local/cuda/include/thrust/iterator/detail/universal_categories.h \ - /usr/local/cuda/include/thrust/iterator/detail/is_iterator_category.h \ /usr/local/cuda/include/thrust/iterator/detail/iterator_category_to_traversal.h \ /usr/local/cuda/include/thrust/iterator/detail/iterator_category_to_system.h \ + /usr/local/cuda/include/thrust/system_error.h \ + /usr/local/cuda/include/thrust/system/error_code.h \ + /usr/local/cuda/include/thrust/system/detail/errno.h \ + /usr/local/cuda/include/thrust/system/detail/error_category.inl \ + /usr/local/cuda/include/thrust/functional.h \ + /usr/include/c++/7/functional \ + /usr/include/c++/7/bits/std_function.h \ + 
/usr/local/cuda/include/thrust/detail/functional/placeholder.h \ + /usr/local/cuda/include/thrust/detail/functional/actor.h \ + /usr/local/cuda/include/thrust/tuple.h \ + /usr/local/cuda/include/thrust/detail/tuple.inl \ + /usr/local/cuda/include/thrust/detail/swap.h \ + /usr/local/cuda/include/thrust/pair.h \ + /usr/local/cuda/include/thrust/detail/pair.inl \ + /usr/local/cuda/include/thrust/detail/functional/value.h \ + /usr/local/cuda/include/thrust/detail/functional/composite.h \ + /usr/local/cuda/include/thrust/detail/functional/operators/assignment_operator.h \ + /usr/local/cuda/include/thrust/detail/functional/operators/operator_adaptors.h \ + /usr/local/cuda/include/thrust/detail/type_traits/result_of_adaptable_function.h \ + /usr/local/cuda/include/thrust/detail/type_traits/function_traits.h \ + /usr/local/cuda/include/thrust/detail/type_traits/has_nested_type.h \ + /usr/local/cuda/include/thrust/detail/functional/actor.inl \ + /usr/local/cuda/include/thrust/detail/functional/argument.h \ + /usr/local/cuda/include/thrust/detail/functional.inl \ + /usr/local/cuda/include/thrust/detail/functional/operators.h \ + /usr/local/cuda/include/thrust/detail/functional/operators/arithmetic_operators.h \ + /usr/local/cuda/include/thrust/detail/functional/operators/relational_operators.h \ + /usr/local/cuda/include/thrust/detail/functional/operators/logical_operators.h \ + /usr/local/cuda/include/thrust/detail/functional/operators/bitwise_operators.h \ + /usr/local/cuda/include/thrust/detail/functional/operators/compound_assignment_operators.h \ + /usr/local/cuda/include/thrust/system/detail/error_code.inl \ + /usr/local/cuda/include/thrust/system/detail/error_condition.inl \ + /usr/local/cuda/include/thrust/system/system_error.h \ + /usr/local/cuda/include/thrust/system/detail/system_error.inl \ + /usr/local/cuda/include/thrust/system/cuda/error.h \ + /usr/local/cuda/include/thrust/system/cuda/detail/guarded_driver_types.h \ + /usr/local/cuda/include/thrust/system/cuda/detail/error.inl \ + /usr/local/cuda/include/thrust/system/cuda/detail/guarded_cuda_runtime_api.h \ + /usr/local/cuda/include/cub/device/device_partition.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_select_if.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_scan.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_scan.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/single_pass_scan_operators.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_select_if.cuh \ + /usr/local/cuda/include/cub/device/device_radix_sort.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_radix_sort.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_radix_sort_upsweep.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_radix_sort_downsweep.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../grid/grid_even_share.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../grid/grid_mapping.cuh \ + /usr/local/cuda/include/cub/device/device_reduce.cuh \ + /usr/local/cuda/include/cub/device/../iterator/arg_index_input_iterator.cuh \ + /usr/local/cuda/include/thrust/iterator/iterator_facade.h \ + /usr/local/cuda/include/thrust/iterator/detail/iterator_facade_category.h \ + /usr/local/cuda/include/thrust/iterator/detail/is_iterator_category.h \ /usr/local/cuda/include/thrust/iterator/detail/distance_from_result.h \ - /usr/local/cuda/include/thrust/iterator/iterator_traits.h \ - /usr/local/cuda/include/thrust/type_traits/void_t.h \ - 
/usr/local/cuda/include/thrust/iterator/detail/iterator_traits.inl \ - ../../../../cub/cub/device/dispatch/dispatch_reduce.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_reduce.cuh \ - ../../../../cub/cub/device/dispatch/dispatch_reduce_by_key.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_reduce_by_key.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/../iterator/constant_input_iterator.cuh \ - ../../../../cub/cub/device/device_run_length_encode.cuh \ - ../../../../cub/cub/device/dispatch/dispatch_rle.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_rle.cuh \ - ../../../../cub/cub/device/device_scan.cuh \ - ../../../../cub/cub/device/device_segmented_radix_sort.cuh \ - ../../../../cub/cub/device/device_segmented_reduce.cuh \ - ../../../../cub/cub/device/device_select.cuh \ - ../../../../cub/cub/device/device_spmv.cuh \ - ../../../../cub/cub/device/dispatch/dispatch_spmv_orig.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_segment_fixup.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/agent_spmv_orig.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/../iterator/counting_input_iterator.cuh \ - ../../../../cub/cub/device/dispatch/../../agent/../iterator/tex_ref_input_iterator.cuh \ - ../../../../cub/cub/iterator/cache_modified_output_iterator.cuh \ - ../../../../cub/cub/iterator/tex_obj_input_iterator.cuh \ - ../../../../cub/cub/iterator/transform_input_iterator.cuh + /usr/local/cuda/include/cub/device/dispatch/dispatch_reduce.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_reduce.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_reduce_by_key.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_reduce_by_key.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../iterator/constant_input_iterator.cuh \ + /usr/local/cuda/include/cub/device/device_run_length_encode.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_rle.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_rle.cuh \ + /usr/local/cuda/include/cub/device/device_scan.cuh \ + /usr/local/cuda/include/cub/device/device_segmented_radix_sort.cuh \ + /usr/local/cuda/include/cub/device/device_segmented_reduce.cuh \ + /usr/local/cuda/include/cub/device/device_select.cuh \ + /usr/local/cuda/include/cub/device/device_spmv.cuh \ + /usr/local/cuda/include/cub/device/dispatch/dispatch_spmv_orig.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_segment_fixup.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/agent_spmv_orig.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../iterator/counting_input_iterator.cuh \ + /usr/local/cuda/include/cub/device/dispatch/../../agent/../iterator/tex_ref_input_iterator.cuh \ + /usr/local/cuda/include/cub/iterator/cache_modified_output_iterator.cuh \ + /usr/local/cuda/include/cub/iterator/discard_output_iterator.cuh \ + /usr/local/cuda/include/cub/iterator/tex_obj_input_iterator.cuh \ + /usr/local/cuda/include/cub/iterator/transform_input_iterator.cuh \ + /usr/local/cuda/include/cub/util_allocator.cuh \ + /usr/include/c++/7/set \ + /usr/include/c++/7/bits/stl_tree.h \ + /usr/include/c++/7/bits/stl_set.h \ + /usr/include/c++/7/bits/stl_multiset.h \ + /usr/include/c++/7/map \ + /usr/include/c++/7/bits/stl_map.h \ + /usr/include/c++/7/bits/stl_multimap.h \ + /usr/local/cuda/include/cub/host/mutex.cuh \ + /usr/include/c++/7/mutex \ + /usr/include/c++/7/bits/std_mutex.h diff --git 
a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.o b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.o index e4b7338a..72139471 100644 Binary files a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.o and b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.o differ diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.so b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.so index c92fe235..4acd7176 100755 Binary files a/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.so and b/CUDA-Optimized/FastSpeech/fastspeech/trt/plugins/repeat/RepeatPlugin.so differ diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC nine seventy seven, and MIME content..wav b/CUDA-Optimized/FastSpeech/fastspeech/trt/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC.wav similarity index 100% rename from CUDA-Optimized/FastSpeech/fastspeech/trt/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC nine seventy seven, and MIME content..wav rename to CUDA-Optimized/FastSpeech/fastspeech/trt/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC.wav diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information..wav b/CUDA-Optimized/FastSpeech/fastspeech/trt/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven f.wav similarity index 100% rename from CUDA-Optimized/FastSpeech/fastspeech/trt/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information..wav rename to CUDA-Optimized/FastSpeech/fastspeech/trt/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven f.wav diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC nine seventy seven, and MIME content..wav b/CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC.wav similarity index 100% rename from CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC nine seventy seven, and MIME content..wav rename to CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC.wav diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/You can call me directly at four two five seven zero three seven three four four 
or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information..wav b/CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven f.wav similarity index 100% rename from CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information..wav rename to CUDA-Optimized/FastSpeech/fastspeech/trt/samples_fp16/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven f.wav diff --git a/CUDA-Optimized/FastSpeech/fastspeech/trt/trt_inferencer.py b/CUDA-Optimized/FastSpeech/fastspeech/trt/trt_inferencer.py index fef0a01e..9fd1006f 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/trt/trt_inferencer.py +++ b/CUDA-Optimized/FastSpeech/fastspeech/trt/trt_inferencer.py @@ -70,7 +70,7 @@ class TRTInferencer(object): # load checkpoint self.load(ckpt_file) - self.build_engine() + self.engine = self.build_engine() def __enter__(self): self.context = self.engine.create_execution_context() diff --git a/CUDA-Optimized/FastSpeech/fastspeech/utils/time.py b/CUDA-Optimized/FastSpeech/fastspeech/utils/time.py index 86b22cc4..90b012cd 100644 --- a/CUDA-Optimized/FastSpeech/fastspeech/utils/time.py +++ b/CUDA-Optimized/FastSpeech/fastspeech/utils/time.py @@ -35,13 +35,21 @@ class TimeElapsed(object): self.format = format def __enter__(self): + self.start() + + def __exit__(self, *exc_info): + self.end() + + def start(self): if self.device == 'cuda' and self.cuda_sync: torch.cuda.synchronize() self.start_time = time.time() - - def __exit__(self, *exc_info): + + def end(self): + if not hasattr(self, "start_time"): + return if self.device == 'cuda' and self.cuda_sync: torch.cuda.synchronize() self.end_time = time.time() self.time_elapsed = self.end_time - self.start_time - tprint(("[{}] Time elapsed: {" + self.format + "}").format(self.name, self.time_elapsed)) + tprint(("[{}] Time elapsed: {" + self.format + "}").format(self.name, self.time_elapsed)) \ No newline at end of file diff --git a/CUDA-Optimized/FastSpeech/generate.py b/CUDA-Optimized/FastSpeech/generate.py index 7fa859f7..66114c6c 100644 --- a/CUDA-Optimized/FastSpeech/generate.py +++ b/CUDA-Optimized/FastSpeech/generate.py @@ -42,6 +42,7 @@ from fastspeech.utils.pytorch import to_device_async, to_cpu_numpy from fastspeech.infer import get_inferencer from fastspeech.inferencer.waveglow_inferencer import WaveGlowInferencer +MAX_FILESIZE=128 # TODO test with different speeds def generate(hparam='infer.yaml', @@ -156,7 +157,7 @@ def generate(hparam='infer.yaml', wav_len = mel_lens[i] * hp.hop_len wav = wav[:wav_len] - path = os.path.join(results_path, text + ".wav") + path = os.path.join(results_path, text[:MAX_FILESIZE] + ".wav") librosa.output.write_wav(path, wav, hp.sr) except StopIteration: diff --git a/CUDA-Optimized/FastSpeech/requirements.txt b/CUDA-Optimized/FastSpeech/requirements.txt index 2f6c62bb..1e9c2044 100644 --- a/CUDA-Optimized/FastSpeech/requirements.txt +++ b/CUDA-Optimized/FastSpeech/requirements.txt @@ -1,4 +1,3 @@ -# torch == 1.5 librosa >= 0.7.0 tensorboardX matplotlib @@ -10,4 +9,5 @@ torchvision opencv-python pyyaml tqdm -data \ No newline at end of file +data +pycuda \ No newline at end of 
file diff --git a/CUDA-Optimized/FastSpeech/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC nine seventy seven, and MIME content..wav b/CUDA-Optimized/FastSpeech/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC.wav similarity index 100% rename from CUDA-Optimized/FastSpeech/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC nine seventy seven, and MIME content..wav rename to CUDA-Optimized/FastSpeech/samples/To deliver interfaces that are significantly better suited to create and process RFC eight twenty one, RFC eight twenty two, RFC.wav diff --git a/CUDA-Optimized/FastSpeech/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information..wav b/CUDA-Optimized/FastSpeech/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven f.wav similarity index 100% rename from CUDA-Optimized/FastSpeech/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information..wav rename to CUDA-Optimized/FastSpeech/samples/You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven f.wav diff --git a/CUDA-Optimized/FastSpeech/scripts/docker/build.sh b/CUDA-Optimized/FastSpeech/scripts/docker/build.sh index c64addd6..f1e55eb3 100644 --- a/CUDA-Optimized/FastSpeech/scripts/docker/build.sh +++ b/CUDA-Optimized/FastSpeech/scripts/docker/build.sh @@ -1,3 +1,4 @@ #!/bin/bash -docker build . --rm -t fastspeech \ No newline at end of file +docker build . --rm -t fastspeech +# docker build . --build-arg UID=$(id -u) --build-arg GID=$(id -g) --build-arg UNAME=$(id -un) --rm -t fastspeech \ No newline at end of file diff --git a/CUDA-Optimized/FastSpeech/scripts/docker/interactive.sh b/CUDA-Optimized/FastSpeech/scripts/docker/interactive.sh index 57b093af..a1cd0096 100644 --- a/CUDA-Optimized/FastSpeech/scripts/docker/interactive.sh +++ b/CUDA-Optimized/FastSpeech/scripts/docker/interactive.sh @@ -1,4 +1,5 @@ #!/bin/bash nvidia-docker run -it --rm --shm-size=16g -v $PWD:/workspace/fastspeech/ fastspeech bash -# --ulimit memlock=-1 --ulimit stack=67108864 -it --rm --ipc=host \ No newline at end of file +# nvidia-docker run -it -u $(id -u):$(id -g) --rm --shm-size=16g -v $PWD:/workspace/fastspeech/ fastspeech bash +# --ulimit memlock=-1 --ulimit stack=67108864 --ipc=host diff --git a/CUDA-Optimized/FastSpeech/scripts/install.sh b/CUDA-Optimized/FastSpeech/scripts/install.sh index e900eeee..53a16fee 100644 --- a/CUDA-Optimized/FastSpeech/scripts/install.sh +++ b/CUDA-Optimized/FastSpeech/scripts/install.sh @@ -1,8 +1,8 @@ #!/bin/bash # install ffmpeg -apt-get update -apt-get install -y ffmpeg +# apt-get update +# apt-get install -y ffmpeg # install the project and its dependencies -pip install --no-cache-dir . \ No newline at end of file +pip install --user --no-cache-dir . 
\ No newline at end of file diff --git a/CUDA-Optimized/FastSpeech/scripts/waveglow/export_onnx2trt.py b/CUDA-Optimized/FastSpeech/scripts/waveglow/export_onnx2trt.py deleted file mode 100644 index 583b7574..00000000 --- a/CUDA-Optimized/FastSpeech/scripts/waveglow/export_onnx2trt.py +++ /dev/null @@ -1,107 +0,0 @@ -# ***************************************************************************** -# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. -# -# Redistribution and use in source and binary forms, with or without -# modification, are permitted provided that the following conditions are met: -# * Redistributions of source code must retain the above copyright -# notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright -# notice, this list of conditions and the following disclaimer in the -# documentation and/or other materials provided with the distribution. -# * Neither the name of the NVIDIA CORPORATION nor the -# names of its contributors may be used to endorse or promote products -# derived from this software without specific prior written permission. -# -# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND -# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE -# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY -# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES -# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND -# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -# -# ***************************************************************************** - -# Edited from https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/Tacotron2/trt/export_onnx2trt.py - -import pycuda.driver as cuda -import pycuda.autoinit -import tensorrt as trt -import onnx -import argparse - -import sys -sys.path.append('./') - -def parse_args(parser): - """ - Parse commandline arguments. 
- """ - parser.add_argument('-o', '--output', required=True, - help='output folder to save audio (file per phrase)') - parser.add_argument('--waveglow', type=str, default="", - help='full path to the WaveGlow ONNX') - parser.add_argument('--fp16', action='store_true', - help='inference with FP16') - parser.add_argument('-b', '--batch_size', default=1, type=int, - help='batch size for inference.') - parser.add_argument('-w', '--max_ws', default=1, type=int, - help='max workspace size in GB.') - - return parser - - -def build_engine(model_file, shapes, max_ws=512*1024*1024, fp16=False): - TRT_LOGGER = trt.Logger(trt.Logger.WARNING) - builder = trt.Builder(TRT_LOGGER) - builder.fp16_mode = fp16 - - config = builder.create_builder_config() - config.max_workspace_size = max_ws - if fp16: - config.flags |= 1 << int(trt.BuilderFlag.FP16) - profile = builder.create_optimization_profile() - for s in shapes: - profile.set_shape(s['name'], min=s['min'], opt=s['opt'], max=s['max']) - config.add_optimization_profile(profile) - explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH) - network = builder.create_network(explicit_batch) - - with trt.OnnxParser(network, TRT_LOGGER) as parser: - with open(model_file, 'rb') as model: - parsed = parser.parse(model.read()) - for i in range(parser.num_errors): - print("TensorRT ONNX parser error:", parser.get_error(i)) - engine = builder.build_engine(network, config=config) - - return engine - - -def main(): - parser = argparse.ArgumentParser( - description='Export from ONNX to TensorRT for WaveGlow') - parser = parse_args(parser) - args = parser.parse_args() - - engine_prec = ".fp16" if args.fp16 else ".fp32" - - # WaveGlow - batch_size = args.batch_size - shapes=[{"name": "mel", "min": (batch_size,80,32,1), "opt": (batch_size,80,768,1), "max": (batch_size,80,1024,1)}, - {"name": "z", "min": (batch_size,8,1024,1), "opt": (batch_size,8,24576,1), "max": (batch_size,8,32768,1)}] - if args.waveglow != "": - print("Building WaveGlow ...") - waveglow_engine = build_engine(args.waveglow, shapes=shapes, fp16=args.fp16, max_ws=args.max_ws * 1<<30) - if waveglow_engine is not None: - with open(args.output+"/"+"waveglow"+engine_prec+".b"+str(batch_size), 'wb') as f: - f.write(waveglow_engine.serialize()) - else: - print("Failed to build engine from", args.waveglow) - sys.exit() - - -if __name__ == '__main__': - main() diff --git a/CUDA-Optimized/FastSpeech/setup.py b/CUDA-Optimized/FastSpeech/setup.py index 11178ee6..221ca8cd 100644 --- a/CUDA-Optimized/FastSpeech/setup.py +++ b/CUDA-Optimized/FastSpeech/setup.py @@ -37,7 +37,7 @@ def get_requirements(filename='requirements.txt'): setup( name='fastspeech', - version='0.2.1', + version='0.2.2', description='FastSpeech training and inference in PyTorch and TensorRT', author='Dabi Ahn', keywords='tts', diff --git a/CUDA-Optimized/FastSpeech/waveglow/__init__.py b/CUDA-Optimized/FastSpeech/waveglow/__init__.py new file mode 100644 index 00000000..0a2d298f --- /dev/null +++ b/CUDA-Optimized/FastSpeech/waveglow/__init__.py @@ -0,0 +1,23 @@ +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. 
+# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of the NVIDIA CORPORATION nor the +# names of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. + +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY +# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. \ No newline at end of file diff --git a/CUDA-Optimized/FastSpeech/waveglow/arg_parser.py b/CUDA-Optimized/FastSpeech/waveglow/arg_parser.py new file mode 100644 index 00000000..e4673b16 --- /dev/null +++ b/CUDA-Optimized/FastSpeech/waveglow/arg_parser.py @@ -0,0 +1,65 @@ +# ***************************************************************************** +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of the NVIDIA CORPORATION nor the +# names of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY +# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# +# ***************************************************************************** + +import argparse + +def parse_waveglow_args(parent, add_help=False): + """ + Parse commandline arguments. 
+ """ + parser = argparse.ArgumentParser(parents=[parent], add_help=add_help) + + # misc parameters + parser.add_argument('--n-mel-channels', default=80, type=int, + help='Number of bins in mel-spectrograms') + + # glow parameters + parser.add_argument('--flows', default=12, type=int, + help='Number of steps of flow') + parser.add_argument('--groups', default=8, type=int, + help='Number of samples in a group processed by the steps of flow') + parser.add_argument('--early-every', default=4, type=int, + help='Determines how often (i.e., after how many coupling layers) \ + a number of channels (defined by --early-size parameter) are output\ + to the loss function') + parser.add_argument('--early-size', default=2, type=int, + help='Number of channels output to the loss function') + parser.add_argument('--sigma', default=1.0, type=float, + help='Standard deviation used for sampling from Gaussian') + parser.add_argument('--segment-length', default=8000, type=int, + help='Segment length (audio samples) processed per iteration') + + # wavenet parameters + wavenet = parser.add_argument_group('WaveNet parameters') + wavenet.add_argument('--wn-kernel-size', default=3, type=int, + help='Kernel size for dialted convolution in the affine coupling layer (WN)') + wavenet.add_argument('--wn-channels', default=256, type=int, + help='Number of channels in WN') + wavenet.add_argument('--wn-layers', default=8, type=int, + help='Number of layers in WN') + + return parser diff --git a/CUDA-Optimized/FastSpeech/waveglow/data_function.py b/CUDA-Optimized/FastSpeech/waveglow/data_function.py new file mode 100644 index 00000000..1a425189 --- /dev/null +++ b/CUDA-Optimized/FastSpeech/waveglow/data_function.py @@ -0,0 +1,88 @@ +# ***************************************************************************** +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of the NVIDIA CORPORATION nor the +# names of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY +# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+# +# ***************************************************************************** + +import torch +import random +import common.layers as layers +from common.utils import load_wav_to_torch, load_filepaths_and_text, to_gpu + + +class MelAudioLoader(torch.utils.data.Dataset): + """ + 1) loads audio, text pairs + 2) computes mel-spectrograms from audio files. + """ + + def __init__(self, dataset_path, audiopaths_and_text, args): + self.audiopaths_and_text = load_filepaths_and_text(dataset_path, audiopaths_and_text) + self.max_wav_value = args.max_wav_value + self.sampling_rate = args.sampling_rate + self.stft = layers.TacotronSTFT( + args.filter_length, args.hop_length, args.win_length, + args.n_mel_channels, args.sampling_rate, args.mel_fmin, + args.mel_fmax) + self.segment_length = args.segment_length + random.seed(1234) + random.shuffle(self.audiopaths_and_text) + + def get_mel_audio_pair(self, filename): + audio, sampling_rate = load_wav_to_torch(filename) + + if sampling_rate != self.stft.sampling_rate: + raise ValueError("{} SR doesn't match target {} SR".format( + sampling_rate, self.stft.sampling_rate)) + + # Take segment + if audio.size(0) >= self.segment_length: + max_audio_start = audio.size(0) - self.segment_length + audio_start = random.randint(0, max_audio_start) + audio = audio[audio_start:audio_start+self.segment_length] + else: + audio = torch.nn.functional.pad( + audio, (0, self.segment_length - audio.size(0)), 'constant').data + + audio = audio / self.max_wav_value + audio_norm = audio.unsqueeze(0) + audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False) + melspec = self.stft.mel_spectrogram(audio_norm) + melspec = melspec.squeeze(0) + + return (melspec, audio, len(audio)) + + def __getitem__(self, index): + return self.get_mel_audio_pair(self.audiopaths_and_text[index][0]) + + def __len__(self): + return len(self.audiopaths_and_text) + + +def batch_to_gpu(batch): + x, y, len_y = batch + x = to_gpu(x).float() + y = to_gpu(y).float() + len_y = to_gpu(torch.sum(len_y)) + return ((x, y), y, len_y) diff --git a/CUDA-Optimized/FastSpeech/waveglow/denoiser.py b/CUDA-Optimized/FastSpeech/waveglow/denoiser.py new file mode 100644 index 00000000..e7c17657 --- /dev/null +++ b/CUDA-Optimized/FastSpeech/waveglow/denoiser.py @@ -0,0 +1,72 @@ +# ***************************************************************************** +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of the NVIDIA CORPORATION nor the +# names of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED.
IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY +# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# +# ***************************************************************************** + +import sys +sys.path.append('tacotron2') +import torch +from common.layers import STFT + + +class Denoiser(torch.nn.Module): + """ Removes model bias from audio produced with waveglow """ + + def __init__(self, waveglow, cpu_run=False, filter_length=1024, n_overlap=4, + win_length=1024, mode='zeros'): + super(Denoiser, self).__init__() + if cpu_run: + self.stft = STFT(filter_length=filter_length, + hop_length=int(filter_length/n_overlap), + win_length=win_length) + else: + self.stft = STFT(filter_length=filter_length, + hop_length=int(filter_length/n_overlap), + win_length=win_length).cuda() + if mode == 'zeros': + mel_input = torch.zeros( + (1, 80, 88), + dtype=waveglow.upsample.weight.dtype, + device=waveglow.upsample.weight.device) + elif mode == 'normal': + mel_input = torch.randn( + (1, 80, 88), + dtype=waveglow.upsample.weight.dtype, + device=waveglow.upsample.weight.device) + else: + raise Exception("Mode {} is not supported".format(mode)) + + with torch.no_grad(): + bias_audio = waveglow.infer(mel_input, sigma=0.0).float() + bias_spec, _ = self.stft.transform(bias_audio) + + self.register_buffer('bias_spec', bias_spec[:, :, 0][:, :, None]) + + def forward(self, audio, strength=0.1): + audio_spec, audio_angles = self.stft.transform(audio.float()) + audio_spec_denoised = audio_spec - self.bias_spec * strength + audio_spec_denoised = torch.clamp(audio_spec_denoised, 0.0) + audio_denoised = self.stft.inverse(audio_spec_denoised, audio_angles) + return audio_denoised diff --git a/CUDA-Optimized/FastSpeech/waveglow/loss_function.py b/CUDA-Optimized/FastSpeech/waveglow/loss_function.py new file mode 100644 index 00000000..efc00db7 --- /dev/null +++ b/CUDA-Optimized/FastSpeech/waveglow/loss_function.py @@ -0,0 +1,48 @@ +# ***************************************************************************** +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of the NVIDIA CORPORATION nor the +# names of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED.
IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY +# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +# +# ***************************************************************************** + +import torch + +class WaveGlowLoss(torch.nn.Module): + def __init__(self, sigma=1.0): + super(WaveGlowLoss, self).__init__() + self.sigma = sigma + + def forward(self, model_output, clean_audio): + # clean_audio is unused; + z, log_s_list, log_det_W_list = model_output + for i, log_s in enumerate(log_s_list): + if i == 0: + log_s_total = torch.sum(log_s) + log_det_W_total = log_det_W_list[i] + else: + log_s_total = log_s_total + torch.sum(log_s) + log_det_W_total += log_det_W_list[i] + + loss = torch.sum( + z * z) / (2 * self.sigma * self.sigma) - log_s_total - log_det_W_total # noqa: E501 + return loss / (z.size(0) * z.size(1) * z.size(2)) diff --git a/CUDA-Optimized/FastSpeech/waveglow/model.py b/CUDA-Optimized/FastSpeech/waveglow/model.py new file mode 100644 index 00000000..c799709a --- /dev/null +++ b/CUDA-Optimized/FastSpeech/waveglow/model.py @@ -0,0 +1,292 @@ +# ***************************************************************************** +# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# * Neither the name of the NVIDIA CORPORATION nor the +# names of its contributors may be used to endorse or promote products +# derived from this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY +# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES +# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; +# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND +# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+# +# ***************************************************************************** +import torch +from torch.autograd import Variable +import torch.nn.functional as F + + +@torch.jit.script +def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): + n_channels_int = n_channels[0] + in_act = input_a + input_b + t_act = torch.tanh(in_act[:, :n_channels_int, :]) + s_act = torch.sigmoid(in_act[:, n_channels_int:, :]) + acts = t_act * s_act + return acts + + +class Invertible1x1Conv(torch.nn.Module): + """ + The layer outputs both the convolution, and the log determinant + of its weight matrix. If reverse=True it does convolution with + inverse + """ + + def __init__(self, c): + super(Invertible1x1Conv, self).__init__() + self.conv = torch.nn.Conv1d(c, c, kernel_size=1, stride=1, padding=0, + bias=False) + + # Sample a random orthonormal matrix to initialize weights + W = torch.qr(torch.FloatTensor(c, c).normal_())[0] + + # Ensure determinant is 1.0 not -1.0 + if torch.det(W) < 0: + W[:, 0] = -1 * W[:, 0] + W = W.view(c, c, 1) + self.conv.weight.data = W + + def forward(self, z, reverse=False): + # shape + batch_size, group_size, n_of_groups = z.size() + + W = self.conv.weight.squeeze() + + if reverse: + if not hasattr(self, 'W_inverse'): + # Reverse computation + W_inverse = W.float().inverse() + W_inverse = Variable(W_inverse[..., None]) + if z.type() == 'torch.cuda.HalfTensor' or z.type() == 'torch.HalfTensor': + W_inverse = W_inverse.half() + self.W_inverse = W_inverse + z = F.conv1d(z, self.W_inverse, bias=None, stride=1, padding=0) + return z + else: + # Forward computation + log_det_W = batch_size * n_of_groups * torch.logdet(W.unsqueeze(0).float()).squeeze() + z = self.conv(z) + return z, log_det_W + + +class WN(torch.nn.Module): + """ + This is the WaveNet like layer for the affine coupling. The primary + difference from WaveNet is the convolutions need not be causal. There is + also no dilation size reset. The dilation only doubles on each layer + """ + + def __init__(self, n_in_channels, n_mel_channels, n_layers, n_channels, + kernel_size): + super(WN, self).__init__() + assert(kernel_size % 2 == 1) + assert(n_channels % 2 == 0) + self.n_layers = n_layers + self.n_channels = n_channels + self.in_layers = torch.nn.ModuleList() + self.res_skip_layers = torch.nn.ModuleList() + self.cond_layers = torch.nn.ModuleList() + + start = torch.nn.Conv1d(n_in_channels, n_channels, 1) + start = torch.nn.utils.weight_norm(start, name='weight') + self.start = start + + # Initializing last layer to 0 makes the affine coupling layers + # do nothing at first. 
This helps with training stability + end = torch.nn.Conv1d(n_channels, 2 * n_in_channels, 1) + end.weight.data.zero_() + end.bias.data.zero_() + self.end = end + + for i in range(n_layers): + dilation = 2 ** i + padding = int((kernel_size * dilation - dilation) / 2) + in_layer = torch.nn.Conv1d(n_channels, 2 * n_channels, kernel_size, + dilation=dilation, padding=padding) + in_layer = torch.nn.utils.weight_norm(in_layer, name='weight') + self.in_layers.append(in_layer) + + cond_layer = torch.nn.Conv1d(n_mel_channels, 2 * n_channels, 1) + cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight') + self.cond_layers.append(cond_layer) + + # last one is not necessary + if i < n_layers - 1: + res_skip_channels = 2 * n_channels + else: + res_skip_channels = n_channels + res_skip_layer = torch.nn.Conv1d(n_channels, res_skip_channels, 1) + res_skip_layer = torch.nn.utils.weight_norm( + res_skip_layer, name='weight') + self.res_skip_layers.append(res_skip_layer) + + def forward(self, forward_input): + audio, spect = forward_input + audio = self.start(audio) + + for i in range(self.n_layers): + acts = fused_add_tanh_sigmoid_multiply( + self.in_layers[i](audio), + self.cond_layers[i](spect), + torch.IntTensor([self.n_channels])) + + res_skip_acts = self.res_skip_layers[i](acts) + if i < self.n_layers - 1: + audio = res_skip_acts[:, :self.n_channels, :] + audio + skip_acts = res_skip_acts[:, self.n_channels:, :] + else: + skip_acts = res_skip_acts + + if i == 0: + output = skip_acts + else: + output = skip_acts + output + return self.end(output) + + +class WaveGlow(torch.nn.Module): + def __init__(self, n_mel_channels, n_flows, n_group, n_early_every, + n_early_size, WN_config): + super(WaveGlow, self).__init__() + + self.upsample = torch.nn.ConvTranspose1d(n_mel_channels, + n_mel_channels, + 1024, stride=256) + assert(n_group % 2 == 0) + self.n_flows = n_flows + self.n_group = n_group + self.n_early_every = n_early_every + self.n_early_size = n_early_size + self.WN = torch.nn.ModuleList() + self.convinv = torch.nn.ModuleList() + + n_half = int(n_group / 2) + + # Set up layers with the right sizes based on how many dimensions + # have been output already + n_remaining_channels = n_group + for k in range(n_flows): + if k % self.n_early_every == 0 and k > 0: + n_half = n_half - int(self.n_early_size / 2) + n_remaining_channels = n_remaining_channels - self.n_early_size + self.convinv.append(Invertible1x1Conv(n_remaining_channels)) + self.WN.append(WN(n_half, n_mel_channels * n_group, **WN_config)) + self.n_remaining_channels = n_remaining_channels + + def forward(self, forward_input): + """ + forward_input[0] = mel_spectrogram: batch x n_mel_channels x frames + forward_input[1] = audio: batch x time + """ + spect, audio = forward_input + + # Upsample spectrogram to size of audio + spect = self.upsample(spect) + assert(spect.size(2) >= audio.size(1)) + if spect.size(2) > audio.size(1): + spect = spect[:, :, :audio.size(1)] + + spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3) + spect = spect.contiguous().view(spect.size(0), spect.size(1), -1) + spect = spect.permute(0, 2, 1) + + audio = audio.unfold(1, self.n_group, self.n_group).permute(0, 2, 1) + output_audio = [] + log_s_list = [] + log_det_W_list = [] + + for k in range(self.n_flows): + if k % self.n_early_every == 0 and k > 0: + output_audio.append(audio[:, :self.n_early_size, :]) + audio = audio[:, self.n_early_size:, :] + + audio, log_det_W = self.convinv[k](audio) + log_det_W_list.append(log_det_W) + + n_half = 
int(audio.size(1) / 2) + audio_0 = audio[:, :n_half, :] + audio_1 = audio[:, n_half:, :] + + output = self.WN[k]((audio_0, spect)) + log_s = output[:, n_half:, :] + b = output[:, :n_half, :] + audio_1 = torch.exp(log_s) * audio_1 + b + log_s_list.append(log_s) + + audio = torch.cat([audio_0, audio_1], 1) + + output_audio.append(audio) + return torch.cat(output_audio, 1), log_s_list, log_det_W_list + + def infer(self, spect, sigma=1.0): + + spect = self.upsample(spect) + # trim conv artifacts. maybe pad spec to kernel multiple + time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0] + spect = spect[:, :, :-time_cutoff] + + spect = spect.unfold(2, self.n_group, self.n_group).permute(0, 2, 1, 3) + spect = spect.contiguous().view(spect.size(0), spect.size(1), -1) + spect = spect.permute(0, 2, 1) + + audio = torch.randn(spect.size(0), + self.n_remaining_channels, + spect.size(2), device=spect.device).to(spect.dtype) + + audio = torch.autograd.Variable(sigma * audio) + + for k in reversed(range(self.n_flows)): + n_half = int(audio.size(1) / 2) + audio_0 = audio[:, :n_half, :] + audio_1 = audio[:, n_half:, :] + + output = self.WN[k]((audio_0, spect)) + s = output[:, n_half:, :] + b = output[:, :n_half, :] + audio_1 = (audio_1 - b) / torch.exp(s) + audio = torch.cat([audio_0, audio_1], 1) + + audio = self.convinv[k](audio, reverse=True) + + if k % self.n_early_every == 0 and k > 0: + z = torch.randn(spect.size(0), self.n_early_size, spect.size( + 2), device=spect.device).to(spect.dtype) + audio = torch.cat((sigma * z, audio), 1) + + audio = audio.permute( + 0, 2, 1).contiguous().view( + audio.size(0), -1).data + return audio + + + @staticmethod + def remove_weightnorm(model): + waveglow = model + for WN in waveglow.WN: + WN.start = torch.nn.utils.remove_weight_norm(WN.start) + WN.in_layers = remove(WN.in_layers) + WN.cond_layers = remove(WN.cond_layers) + WN.res_skip_layers = remove(WN.res_skip_layers) + return waveglow + + +def remove(conv_list): + new_conv_list = torch.nn.ModuleList() + for old_conv in conv_list: + old_conv = torch.nn.utils.remove_weight_norm(old_conv) + new_conv_list.append(old_conv) + return new_conv_list
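
For reference, a minimal training-side sketch of how the vendored waveglow package fits together; this is not taken from the diff. It assumes the `common` package from the Tacotron2 example is importable, and the dataset path, filelist, and audio settings below are hypothetical placeholders (the values follow the usual LJSpeech conventions):
```
import torch
from argparse import Namespace

from waveglow.data_function import MelAudioLoader, batch_to_gpu
from waveglow.loss_function import WaveGlowLoss
from waveglow.model import WaveGlow

# Audio/STFT settings read by MelAudioLoader; assumed values, not part of the diff.
args = Namespace(max_wav_value=32768.0, sampling_rate=22050,
                 filter_length=1024, hop_length=256, win_length=1024,
                 n_mel_channels=80, mel_fmin=0.0, mel_fmax=8000.0,
                 segment_length=8000)

# Model hyperparameters mirror the defaults in waveglow/arg_parser.py.
model = WaveGlow(n_mel_channels=80, n_flows=12, n_group=8,
                 n_early_every=4, n_early_size=2,
                 WN_config={'n_layers': 8, 'n_channels': 256,
                            'kernel_size': 3}).cuda()
criterion = WaveGlowLoss(sigma=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Hypothetical dataset location and filelist.
loader = torch.utils.data.DataLoader(
    MelAudioLoader('./LJSpeech-1.1', 'filelists/train.txt', args),
    batch_size=4, shuffle=True)

for batch in loader:
    inputs, audio, num_samples = batch_to_gpu(batch)  # inputs == (mel, audio)
    outputs = model(inputs)             # (z, log_s_list, log_det_W_list)
    loss = criterion(outputs, audio)    # the clean audio is unused by the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```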
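
A corresponding inference-side sketch. The checkpoint path and its 'state_dict' key describe an assumed save layout; match them to the WaveGlow checkpoint actually downloaded (for example, the NGC file referenced in the README). Denoiser additionally requires common.layers to be importable:
```
import torch
from waveglow.model import WaveGlow
from waveglow.denoiser import Denoiser

model = WaveGlow(n_mel_channels=80, n_flows=12, n_group=8,
                 n_early_every=4, n_early_size=2,
                 WN_config={'n_layers': 8, 'n_channels': 256,
                            'kernel_size': 3}).cuda()
ckpt = torch.load('./waveglow_checkpoint.pt', map_location='cuda')  # hypothetical path
model.load_state_dict(ckpt['state_dict'])  # save layout assumed
model = WaveGlow.remove_weightnorm(model).eval()
denoiser = Denoiser(model)

mel = torch.randn(1, 80, 620, device='cuda')  # stand-in mel-spectrogram
with torch.no_grad():
    audio = model.infer(mel, sigma=0.6)           # (1, 620 * 256) samples
    audio = denoiser(audio, strength=0.01)[:, 0]  # drop the STFT channel dim
```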