Merge pull request #828 from NVIDIA/gh/release

[Jasper/PyT] Update DALI, perf, Triton, container, major refactor
This commit is contained in:
nv-kkudrynski 2021-02-10 10:46:53 +01:00 committed by GitHub
commit b901312732
128 changed files with 7496 additions and 12071 deletions

View file

@ -1,3 +1,4 @@
*.pt
results/
*__pycache__
checkpoints/
@ -5,5 +6,3 @@ checkpoints/
datasets/
external/tensorrt-inference-server/
checkpoints/
triton/model_repo
triton/deploy

View file

@ -1,4 +0,0 @@
[submodule "external/triton-inference-server"]
path = external/triton-inference-server
url = https://github.com/NVIDIA/triton-inference-server
branch = r19.12

View file

@ -12,11 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.06-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.10-py3
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && apt-get install -y libsndfile1 && apt-get install -y sox && rm -rf /var/lib/apt/lists/*
RUN apt update && apt install -y libsndfile1 && apt install -y sox && rm -rf /var/lib/apt/lists/*
WORKDIR /workspace/jasper
@ -24,5 +23,7 @@ WORKDIR /workspace/jasper
COPY requirements.txt .
RUN pip install --disable-pip-version-check -U -r requirements.txt
RUN pip install --force-reinstall --extra-index-url https://developer.download.nvidia.com/compute/redist/nightly nvidia-dali-nightly-cuda110==0.28.0.dev20201026
# Copy rest of files
COPY . .

View file

@ -24,7 +24,6 @@ This repository provides scripts to train the Jasper model to achieve near state
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Evaluation process](#evaluation-process)
* [Deploying Jasper using TensorRT](#deploying-jasper-using-tensorrt)
* [Deploying Jasper using Triton Inference Server](#deploying-jasper-using-triton-inference)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
@ -32,16 +31,16 @@ This repository provides scripts to train the Jasper model to achieve near state
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 32GB)](#training-accuracy-nvidia-dgx-1-8x-v100-32gb)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 (8x V100 32GB)](#training-performance-nvidia-dgx-1-8x-v100-32gb)
* [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-40gb)
* [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-80gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 32GB)](#inference-performance-nvidia-dgx-1-1x-v100-32gb)
* [Inference performance: NVIDIA DGX-2 (1x V100 32GB)](#inference-performance-nvidia-dgx-2-1x-v100-32gb)
@ -217,10 +216,10 @@ Uses both acoustic model and language model to output the transcript of an input
The following section lists the requirements in order to start training and evaluating the Jasper model.
### Requirements
This repository contains a `Dockerfile` which extends the PyTorch 20.06-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
This repository contains a `Dockerfile` which extends the PyTorch 20.10-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
- Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -260,10 +259,10 @@ bash scripts/docker/build.sh
3. Start an interactive session in the NGC container to run data download/training/inference
```bash
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <OUTPUT_DIR>
```
Within the container, the contents of this repository will be copied to the `/workspace/jasper` directory. The `/datasets`, `/checkpoints`, `/results` directories are mounted as volumes
and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host.
and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<OUTPUT_DIR>` on the host.
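For example, assuming the data, checkpoints, and outputs live under `/raid` on the host (the paths below are purely illustrative):
```bash
bash scripts/docker/launch.sh /raid/datasets /raid/checkpoints /raid/results
```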
4. Download and preprocess the dataset.
@ -282,40 +281,49 @@ Inside the container, download and extract the datasets into the required format
bash scripts/download_librispeech.sh
```
Once the data download is complete, the following folders should exist:
* `/datasets/LibriSpeech/`
* `train-clean-100/`
* `train-clean-360/`
* `train-other-500/`
* `dev-clean/`
* `dev-other/`
* `test-clean/`
* `test-other/`
```bash
datasets/LibriSpeech/
├── dev-clean
├── dev-other
├── test-clean
├── test-other
├── train-clean-100
├── train-clean-360
└── train-other-500
```
Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see Step 3), once the dataset is downloaded it will be accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.
Next, convert the data into WAV files and add speed perturbation with 0.9 and 1.1 to the training files:
Next, convert the data into WAV files:
```bash
bash scripts/preprocess_librispeech.sh
```
Once the data is converted, the following additional files and folders should exist:
* `datasets/LibriSpeech/`
* `librispeech-train-clean-100-wav.json`
* `librispeech-train-clean-360-wav.json`
* `librispeech-train-other-500-wav.json`
* `librispeech-dev-clean-wav.json`
* `librispeech-dev-other-wav.json`
* `librispeech-test-clean-wav.json`
* `librispeech-test-other-wav.json`
* `train-clean-100-wav/` contains WAV files with original speed, 0.9 and 1.1
* `train-clean-360-wav/` contains WAV files with original speed, 0.9 and 1.1
* `train-other-500-wav/` contains WAV files with original speed, 0.9 and 1.1
* `dev-clean-wav/`
* `dev-other-wav/`
* `test-clean-wav/`
* `test-other-wav/`
```bash
datasets/LibriSpeech/
├── dev-clean-wav
├── dev-other-wav
├── librispeech-train-clean-100-wav.json
├── librispeech-train-clean-360-wav.json
├── librispeech-train-other-500-wav.json
├── librispeech-dev-clean-wav.json
├── librispeech-dev-other-wav.json
├── librispeech-test-clean-wav.json
├── librispeech-test-other-wav.json
├── test-clean-wav
├── test-other-wav
├── train-clean-100-wav
├── train-clean-360-wav
└── train-other-500-wav
```
The DALI data pre-processing pipeline, which is enabled by default, performs speed perturbation on-line during training.
Without DALI, on-line speed perturbation might slow down the training.
If you wish to disable DALI, speed perturbation can be computed off-line with:
```bash
SPEEDS="0.9 1.1" bash scripts/preprocess_librispeech.sh
```
5. Start training.
@ -323,22 +331,22 @@ Inside the container, use the following script to start training.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/train.sh [OPTIONS]
[OPTION1=value1 OPTION2=value2 ...] bash scripts/train.sh
```
By default, automatic mixed precision is disabled, the batch size is 64 over two gradient accumulation steps, and the recipe is run on a total of 8 GPUs. The hyperparameters are tuned for a GPU with at least 32GB of memory and will require adjustment for 16GB GPUs (e.g., by lowering the batch size and using more gradient accumulation steps).
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Training process](#training-process).
Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Training process](#training-process).
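For example, a sketch of a mixed precision run on 8 GPUs with the default effective batch size (the values are illustrative; see [Parameters](#parameters) for the full list):
```bash
AMP=true NUM_GPUS=8 BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=2 bash scripts/train.sh
```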
6. Start validation/evaluation.
Inside the container, use the following script to run evaluation.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/evaluation.sh [OPTIONS]
[OPTION1=value1 OPTION2=value2 ...] bash scripts/evaluation.sh
```
By default, this will use full precision, a batch size of 64 and run on a single GPU.
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Evaluation process](#evaluation-process).
Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Evaluation process](#evaluation-process).
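For example, to evaluate a specific checkpoint on `dev-clean` with AMP enabled (the checkpoint path is illustrative):
```bash
CHECKPOINT=/checkpoints/jasper_fp16.pt DATASET=dev-clean AMP=true bash scripts/evaluation.sh
```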
7. Start inference/predictions.
@ -348,11 +356,11 @@ Inside the container, use the following script to run inference.
A pretrained model checkpoint can be downloaded from [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
```bash
bash scripts/inference.sh [OPTIONS]
[OPTION1=value1 OPTION2=value2 ...] bash scripts/inference.sh
```
By default, this will use single precision, a batch size of 64 and run on a single GPU.
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Inference process](#inference-process).
Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Inference process](#inference-process).
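For example, to transcribe `test-other` with greedy decoding and save the predictions (the paths are illustrative):
```bash
CHECKPOINT=/checkpoints/jasper_fp16.pt DATASET=test-other \
PREDICTION_FILE=/results/test-other.predictions bash scripts/inference.sh
```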
## Advanced
@ -362,31 +370,27 @@ The following sections provide greater details of the dataset, running training
### Scripts and sample code
In the `root` directory, the most important files are:
* `train.py` - Serves as entry point for training
* `inference.py` - Serves as entry point for inference and evaluation
* `model.py` - Contains the model architecture
* `dataset.py` - Contains the data loader and related functionality
* `optimizer.py` - Contains the optimizer
* `inference_benchmark.py` - Serves as inference benchmarking script that measures the latency of pre-processing and the acoustic model
* `requirements.txt` - Contains the required dependencies that are installed when building the Docker container
* `Dockerfile` - Container with the basic set of dependencies to run Jasper
The `scripts/` folder encapsulates all the one-click scripts required for running various supported functionalities, such as:
* `train.sh` - Runs training using the `train.py` script
* `inference.sh` - Runs inference using the `inference.py` script
* `evaluation.sh` - Runs evaluation using the `inference.py` script
* `download_librispeech.sh` - Downloads LibriSpeech dataset
* `preprocess_librispeech.sh` - Preprocess LibriSpeech raw data files to be ready for training and inference
* `inference_benchmark.sh` - Runs the inference benchmark using the `inference_benchmark.py` script
* `train_benchmark.sh` - Runs the training performance benchmark using the `train.py` script
* `docker/` - Contains the scripts for building and launching the container
Other folders included in the `root` directory are:
* `notebooks/` - Jupyter notebooks and example audio files
* `configs/` - model configurations
* `utils/` - data downloading and common routines
* `parts/` - data pre-processing
```
jasper
├── common # data pre-processing, logging, etc.
├── configs # model configurations
├── Dockerfile # container with the basic set of dependencies to run Jasper
├── inference.py # entry point for inference
├── jasper # model-specific code
├── notebooks # jupyter notebooks and example audio files
├── scripts # one-click scripts required for running various supported functionalities
│   ├── docker # contains the scripts for building and launching the container
│   ├── download_librispeech.sh # downloads LibriSpeech dataset
│   ├── evaluation.sh # runs evaluation using the `inference.py` script
│   ├── inference_benchmark.sh # runs the inference benchmark using the `inference_benchmark.py` script
│   ├── inference.sh # runs inference using the `inference.py` script
│   ├── preprocess_librispeech.sh # preprocess LibriSpeech raw data files for training and inference
│   ├── train_benchmark.sh # runs the training performance benchmark using the `train.py` script
│   └── train.sh # runs training using the `train.py` script
├── train.py # entry point for training
├── triton # example of inference using Triton Inference Server
└── utils # data downloading and common routines
```
### Parameters
@ -394,77 +398,94 @@ Parameters could be set as env variables, or passed as positional arguments.
The complete list of available parameters for `scripts/train.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: relative path to model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results, logs, and created checkpoints. (default: '/results')
CHECKPOINT: model checkpoint to continue training from. A model checkpoint is a dictionary object that contains, apart from the model weights, the optimizer state as well as the epoch number. If CHECKPOINT is not set, training starts from scratch. (default: "")
CREATE_LOGFILE: boolean that indicates whether to create a training log that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: true)
NUM_GPUS: number of GPUs to use. (default: 8)
AMP: if set to `true`, enables automatic mixed precision (default: false)
EPOCHS: number of training epochs. (default: 400)
SEED: seed for random number generator and used for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size. (default: 64)
LEARNING_RATE: Initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 2)
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: relative path to model configuration. (default: 'configs/jasper10x5dr_speedp-online_speca.yaml')
OUTPUT_DIR: directory for results, logs, and created checkpoints. (default: '/results')
CHECKPOINT: a specific model checkpoint to continue training from. To resume training from the last checkpoint, see the RESUME option.
RESUME: resume training from the last checkpoint found in OUTPUT_DIR, or from scratch if there are no checkpoints (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: true)
NUM_GPUS: number of GPUs to use. (default: 8)
AMP: if set to `true`, enables automatic mixed precision (default: false)
BATCH_SIZE: effective data batch size. The real batch size per GPU might be lower, if gradient accumulation is enabled (default: 64)
GRAD_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 2)
LEARNING_RATE: initial learning rate. (default: 0.01)
MIN_LEARNING_RATE: minimum learning rate, despite LR scheduling (default: 1e-5)
LR_POLICY: how to decay LR (default: exponential)
LR_EXP_GAMMA: decay factor for the exponential LR schedule (default: 0.981)
EMA: decay factor for exponential averages of checkpoints (default: 0.999)
SEED: seed for random number generator and used for ensuring reproducibility. (default: 0)
EPOCHS: number of training epochs. (default: 440)
WARMUP_EPOCHS: number of initial epochs of linearly increasing LR. (default: 2)
HOLD_EPOCHS: number of epochs to hold maximum LR after warmup. (default: 140)
SAVE_FREQUENCY: number of epochs between saving the model to disk. (default: 10)
EPOCHS_THIS_JOB: run training for this number of epochs. Unlike EPOCHS, does not affect the LR schedule. (default: 0)
DALI_DEVICE: device to run the DALI pipeline on for calculation of filterbanks. Valid choices: cpu, gpu, none. (default: gpu)
PAD_TO_MAX_DURATION: pad all sequences with zeros to maximum length. (default: false)
EVAL_FREQUENCY: number of steps between evaluations on the validation set. (default: 544)
PREDICTION_FREQUENCY: the number of steps between writing a sample prediction to stdout. (default: 544)
TRAIN_MANIFESTS: list of .json training set files
VAL_MANIFESTS: list of .json validation set files
```
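As an illustration, the sketch below resumes training from the latest checkpoint in `OUTPUT_DIR` and overrides a few of the schedule-related defaults listed above (all values are examples only):
```bash
RESUME=true OUTPUT_DIR=/results LEARNING_RATE=0.01 LR_POLICY=exponential EPOCHS=440 bash scripts/train.sh
```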
The complete list of available parameters for `scripts/inference.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: false)
AMP: if set to `true`, enables FP16 inference with AMP (default: false)
NUM_STEPS: number of inference steps. If -1 runs inference on entire dataset. (default: -1)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size.(default: 64)
LOGITS_FILE: destination path for serialized model output with binary protocol. If 'none' does not save model output. (default: 'none')
PREDICTION_FILE: destination path for saving predictions. If 'none' does not save predictions. (default: '${RESULT_DIR}/${DATASET}.predictions')
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_speedp-online_speca.yaml')
OUTPUT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
DATASET: name of the LibriSpeech subset to use. (default: 'dev-clean')
LOG_FILE: path to the DLLogger .json logfile. (default: '')
CUDNN_BENCHMARK: enable cudnn benchmark mode for using more optimized kernels. (default: false)
MAX_DURATION: filter out recordings longer than MAX_DURATION seconds. (default: "")
PAD_TO_MAX_DURATION: pad all sequences with zeros to maximum length. (default: false)
NUM_GPUS: number of GPUs to use. Note that with > 1 GPUs WER results might be inaccurate due to the batching policy. (default: 1)
NUM_STEPS: number of batches to evaluate, looping over the dataset if necessary. (default: 0)
NUM_WARMUP_STEPS: number of initial steps before measuring performance. (default: 0)
AMP: enable FP16 inference with AMP. (default: false)
BATCH_SIZE: data batch size. (default: 64)
EMA: Attempt to load exponentially averaged weights from a checkpoint. (default: true)
SEED: seed for random number generator and used for ensuring reproducibility. (default: 0)
DALI_DEVICE: device to run the DALI pipeline on for calculation of filterbanks. Valid choices: cpu, gpu, none. (default: gpu)
CPU: run inference on CPU. (default: false)
LOGITS_FILE: dump logit matrices to a file. (default: "")
PREDICTION_FILE: save predictions to a file. (default: "${OUTPUT_DIR}/${DATASET}.predictions")
```
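For example, a hypothetical run that additionally dumps logits for later rescoring alongside the transcripts (the output paths are illustrative):
```bash
CHECKPOINT=/checkpoints/jasper_fp16.pt DATASET=dev-other \
LOGITS_FILE=/results/dev-other.logits PREDICTION_FILE=/results/dev-other.predictions \
bash scripts/inference.sh
```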
The complete list of available parameters for `scripts/evaluation.sh` script contains:
The complete list of available parameters for `scripts/evaluation.sh` is the same as for `scripts/inference.sh`, except for a few changed defaults:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: false)
NUM_GPUS: number of GPUs to run evaluation on (default: 1)
AMP: if set to `true`, enables FP16 with AMP (default: false)
NUM_STEPS: number of inference steps per GPU. If -1 runs inference on entire dataset (default: -1)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size.(default: 64)
PREDICTION_FILE: (default: "")
DATASET: (default: "test-other")
```
The `scripts/inference_benchmark.sh` script pads all input to the same length and computes the mean and the 90%, 95%, and 99% percentiles of latency for the specified number of inference steps. Latency is measured in milliseconds per batch. The `scripts/inference_benchmark.sh`
script measures latency for a single GPU and extends `scripts/inference.sh` by:
The `scripts/inference_benchmark.sh` script pads all input to a fixed duration and computes the mean and the 90%, 95%, and 99% percentiles of latency for the specified number of inference steps. Latency is measured in milliseconds per batch. The `scripts/inference_benchmark.sh` script measures latency for a single GPU and loops over a number of batch sizes and durations. It extends `scripts/inference.sh` and changes the following defaults:
```bash
MAX_DURATION: filters out input audio data that exceeds a maximum number of seconds. This ensures that when all filtered audio samples are padded to maximum length, that length will stay under this specified threshold (default: 36)
BATCH_SIZE_SEQ: batch sizes to measure on. (default: "1 2 4 8 16")
MAX_DURATION_SEQ: input durations (in seconds) to measure on (default: "2 7 16.7")
CUDNN_BENCHMARK: (default: true)
PAD_TO_MAX_DURATION: (default: true)
NUM_WARMUP_STEPS: (default: 10)
NUM_STEPS: (default: 500)
DALI_DEVICE: (default: cpu)
```
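For instance, the sweep can be narrowed to a single batch size and duration and shortened to fewer steps (the values below are illustrative):
```bash
BATCH_SIZE_SEQ="8" MAX_DURATION_SEQ="16.7" NUM_STEPS=100 bash scripts/inference_benchmark.sh
```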
The `scripts/train_benchmark.sh` script pads all input to the same length according to the input argument `MAX_DURATION` and measures average training latency and throughput performance. Latency is measured in seconds per batch, throughput in sequences per second.
The complete list of available parameters for `scripts/train_benchmark.sh` script contains:
Training performance is measured with on-line speed perturbation and cuDNN benchmark mode enabled.
The script `scripts/train_benchmark.sh` loops over a number of batch sizes and GPU counts.
It extends `scripts/train.sh`, and the complete list of available parameters for `scripts/train_benchmark.sh` contains:
```bash
DATA_DIR: directory of dataset.(default: '/datasets/LibriSpeech')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: true)
NUM_GPUS: number of GPUs to use. (default: 8)
AMP: if set to `true`, enables automatic mixed precision with AMP (default: false)
NUM_STEPS: number of training iterations. If -1 runs full training for 400 epochs. (default: -1)
MAX_DURATION: filters out input audio data that exceeds a maximum number of seconds. This ensures that when all filtered audio samples are padded to maximum length, that length will stay under this specified threshold (default: 16.7)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size.(default: 32)
LEARNING_RATE: Initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 1)
PRINT_FREQUENCY: number of iterations after which training progress is printed. (default: 1)
BATCH_SIZE_SEQ: batch sizes to measure on. (default: "1 2 4 8 16")
NUM_GPUS_SEQ: number of GPUs to run the training on. (default: "1 4 8")
MODEL_CONFIG: (default: "configs/jasper10x5dr_speedp-online_train-benchmark.yaml")
TRAIN_MANIFESTS: (default: "$DATA_DIR/librispeech-train-clean-100-wav.json")
RESUME: (default: false)
EPOCHS_THIS_JOB: (default: 2)
EPOCHS: (default: 100000)
SAVE_FREQUENCY: (default: 100000)
EVAL_FREQUENCY: (default: 100000)
GRAD_ACCUMULATION_STEPS: (default: 1)
PAD_TO_MAX_DURATION: (default: true)
EMA: (default: 0)
```
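For example, a shortened sweep with AMP enabled on a single GPU count could look like this (values are illustrative):
```bash
AMP=true BATCH_SIZE_SEQ="32" NUM_GPUS_SEQ="8" bash scripts/train_benchmark.sh
```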
### Command-line options
@ -478,11 +499,10 @@ python inference.py --help
### Getting the data
The Jasper model was trained on the LibriSpeech dataset. We use the concatenation of `train-clean-100`, `train-clean-360` and `train-other-500` for training and `dev-clean` for validation.
This repository contains the `scripts/download_librispeech.sh` and `scripts/preprocess_librispeech.sh` scripts which will automatically download and preprocess the training, test and development datasets. By default, data will be downloaded to the `/datasets/LibriSpeech` directory; a minimum of 500GB of free space is required for download and preprocessing, and the final preprocessed dataset is 320GB.
This repository contains the `scripts/download_librispeech.sh` and `scripts/preprocess_librispeech.sh` scripts which will automatically download and preprocess the training, test and development datasets. By default, data will be downloaded to the `/datasets/LibriSpeech` directory; a minimum of 250GB of free space is required for download and preprocessing, and the final preprocessed dataset is approximately 100GB. With offline speed perturbation, the dataset will be about 3x larger.
#### Dataset guidelines
The `scripts/preprocess_librispeech.sh` script converts the input audio files to WAV format with a sample rate of 16kHz; target transcripts are stripped of whitespace characters, then lower-cased. For `train-clean-100`, `train-clean-360` and `train-other-500` it also creates speed-perturbed versions with rates of 0.9 and 1.1 for data augmentation.
The `scripts/preprocess_librispeech.sh` script converts the input audio files to WAV format with a sample rate of 16kHz; target transcripts are stripped of whitespace characters, then lower-cased. For `train-clean-100`, `train-clean-360` and `train-other-500`, it can optionally create speed-perturbed versions with rates of 0.9 and 1.1 for data augmentation. In the current version, those augmentations are applied on-line by the DALI pipeline without any impact on training time.
After preprocessing, the script creates JSON files with output file paths, sample rate, target transcript and other metadata. These JSON files are used by the training script to identify training and validation datasets.
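To get a quick look at the manifest structure, the first entry of a generated file can be printed from inside the container (the exact field names come from the preprocessing scripts):
```bash
# Pretty-print the first entry of the dev-clean manifest
python -c "import json; print(json.dumps(json.load(open('/datasets/LibriSpeech/librispeech-dev-clean-wav.json'))[0], indent=2))"
```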
@ -500,21 +520,23 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
* Trains on the concatenation of all 3 LibriSpeech training datasets and evaluates on the LibriSpeech dev-clean dataset
* Maintains an exponential moving average of parameters for evaluation
* Has cudnn benchmark enabled
* Runs for 400 epochs
* Uses an initial learning rate of 0.015 and polynomial (quadratic) learning rate decay
* Runs for 440 epochs
* Uses an initial learning rate of 0.01 and an exponential learning rate decay
* Saves a checkpoint every 10 epochs
* Runs evaluation on the development dataset every 100 iterations and at the end of training
* Prints out training progress every 25 iterations
* Creates a log file with training progress
* Uses offline speed perturbed data
* Automatically removes old checkpoints and preserves milestone checkpoints
* Runs evaluation on the development dataset every 544 iterations and at the end of training
* Maintains a separate checkpoint with the lowest WER on development set
* Prints out training progress every iteration to stdout
* Creates a DLLogger logfile and a Tensorboard log
* Calculates speed perturbation on-line during training
* Uses SpecAugment in data pre-processing
* Filters out audio samples longer than 16.7 seconds
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Pads each batch so its length is divisible by 16
* Uses masked convolutions and dense residuals as described in the paper
* Uses weight decay of 0.001
* Uses [Novograd](https://arxiv.org/pdf/1905.11286.pdf) as optimizer with betas=(0.95, 0)
Enabling AMP permits batch size 64 with one gradient accumulation step. Such a setup will match the greedy WER [Results](#results) of the Jasper paper on a DGX-1 with 32GB V100 GPUs.
Enabling AMP permits batch size 64 with one gradient accumulation step. In the current setup, it will improve upon the greedy WER [Results](#results) of the Jasper paper on a DGX-1 with 32GB V100 GPUs.
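A minimal sketch of that setup, expressed with the environment variables from [Parameters](#parameters):
```bash
AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh
```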
### Inference process
Inference is performed using the `inference.py` script along with parameters defined in `scripts/inference.sh`.
@ -525,7 +547,7 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
* Uses a batch size of 64
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which will be stored in the results folder
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Pads each batch so its length is divisible by 16
* Does not use data augmentation
* Does greedy decoding and saves the transcription in the results folder
* Has the option to save the model output tensors for more complex decoding, for example, beam search
@ -533,24 +555,14 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
### Evaluation process
Evaluation is performed using the `inference.py` script along with parameters defined in `scripts/evaluation.sh`.
The `scripts/evaluation.sh` script runs a job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the evaluation script:
The setup is similar to `scripts/inference.sh`, with two differences:
* Uses a batch size of 64
* Evaluates the LibriSpeech dev-clean dataset
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which is saved in the results folder
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Does not use data augmentation
* Has cudnn benchmark disabled
### Deploying Jasper using TensorRT
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. Jasper's architecture, which is deeply convolutional, is designed to facilitate fast GPU inference. After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch.
More information on how to perform inference using TensorRT and speed up comparison between TensorRT and native PyTorch can be found in the subfolder [./trt/README.md](trt/README.md)
* Evaluates the LibriSpeech test-other dataset
* Model outputs are not saved
### Deploying Jasper using Triton Inference Server
The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
More information on how to perform inference using TensorRT Inference Server with different model backends can be found in the subfolder [./trtis/README.md](trtis/README.md)
More information on how to perform inference using Triton Inference Server with different model backends can be found in the subfolder [./triton/README.md](triton/README.md)
## Performance
@ -559,162 +571,121 @@ More information on how to perform inference using TensorRT Inference Server wit
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance on a specific batch size and audio length, for `NUM_STEPS` run:
To benchmark the training performance in a specific setting on the `train-clean-100` subset of LibriSpeech, run:
```bash
export NUM_STEPS=<NUM_STEPS>
export MAX_DURATION=<DURATION>
export BATCH_SIZE=<BATCH_SIZE>
bash scripts/train_benchmark.sh
BATCH_SIZE_SEQ=<BATCH_SIZES> NUM_GPUS_SEQ=<NUMS_OF_GPUS> bash scripts/train_benchmark.sh
```
By default, this script runs 400 epochs on the configuration `configs/jasper10x5dr_sp_offline_specaugment.toml`
using batch size 32 on a single node with 8x GPUs with at least 32GB of memory.
By default, `NUM_STEPS=-1` means training runs for 400 epochs. If `$NUM_STEPS > 0` is specified, training runs only for a user-defined number of iterations. Audio samples longer than `MAX_DURATION` are filtered out, and the remaining ones are padded to this duration such that all batches have the same length. At the end of training, the script saves the model checkpoint to the results folder, runs evaluation on the LibriSpeech dev-clean dataset, and prints out information such as average training latency in seconds, average training throughput in sequences per second, final training loss, final training WER, evaluation loss and evaluation WER.
By default, this script runs 2 epochs on the configuration `configs/jasper10x5dr_speedp-online_train-benchmark.yaml`,
which applies gentle speed perturbation that does not change the length of the output, enabling immediate stabilization of training step times in cuDNN benchmark mode. The script runs benchmarks with batch size 32 on 1, 4, and 8 GPUs, and requires an 8x 32GB GPU machine.
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size and audio length, run:
```bash
bash scripts/inference_benchmark.sh
BATCH_SIZE_SEQ=<BATCH_SIZES> MAX_DURATION_SEQ=<DURATIONS> bash scripts/inference_benchmark.sh
```
By default, the script runs on a single GPU and evaluates on the entire dataset using the model configuration `configs/jasper10x5dr_sp_offline_specaugment.toml` and batch size 32.
By default, `MAX_DURATION` is set to 36 seconds, which covers the maximum audio length. All audio samples are padded to this length. The script prints out `MAX_DURATION`, `BATCH_SIZE` and latency performance in milliseconds per batch.
By default, the script runs on a single GPU and evaluates on the dataset limited to utterances shorter than MAX_DURATION. It uses the model configuration `configs/jasper10x5dr_speedp-online_speca.yaml`.
Adjustments can be made with env variables, e.g.,
```bash
export SEED=42
export BATCH_SIZE=1
bash scripts/inference_benchmark.sh
```
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
All models are trained on 960 hours of LibriSpeech with a maximum audio length of 16.7s. The training is evaluated
on LibriSpeech dev-clean, dev-other, test-clean, test-other.
The results for Jasper Large's word error rate from the original paper after greedy decoding are shown below:
| **Number of GPUs** | **dev-clean WER** | **dev-other WER** | **test-clean WER** | **test-other WER** |
|--- |--- |--- |--- |--- |
| 8 | 3.64 | 11.89 | 3.86 | 11.95 |
on LibriSpeech dev-clean, dev-other, test-clean, test-other. Checkpoints for evaluation are chosen based on their
word error rate on dev-clean.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the `scripts/train.sh` training script in the 20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
| **Number of GPUs** | **Batch size per GPU** | **Precision** | **dev-clean WER** | **dev-other WER** | **test-clean WER** | **test-other WER** | **Time to train** | **Time to train speedup (TF32 to mixed precision)** |
|-----|-----|-------|-------|-------|------|-------|-------|-----|
| 8 | 64 | mixed | 3.53 | 11.11 | 3.75 | 11.07 | 60 h | 1.9 |
| 8 | 64 | TF32 | 3.55 | 11.30 | 3.81 | 11.17 | 115 h | - |
For each precision, we show the best of 8 runs chosen based on dev-clean WER. For TF32, two gradient accumulation steps have been used.
| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
|-----|-----|-------|-------|-------|------|-------|------|
| 8 | 64 | mixed | 3.20 | 9.78 | 3.41 | 9.71 | 70 h |
##### Training accuracy: NVIDIA DGX-1 (8x V100 32GB)
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX-1 (8x V100 32GB) GPUs.
The following tables report the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX-1 (8x V100 32GB) GPUs.
The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
| **Number of GPUs** | **Batch size per GPU** | **Precision** | **dev-clean WER** | **dev-other WER** | **test-clean WER** | **test-other WER** | **Time to train** | **Time to train speedup (FP32 to mixed precision)** |
|-----|-----|-------|-------|-------|------|-------|-------|-----|
| 8 | 64 | mixed | 3.49 | 11.22 | 3.74 | 10.94 | 105 h | 3.1 |
| 8 | 64 | FP32 | 3.65 | 11.47 | 3.86 | 11.30 | 330 h | - |
| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
|-----|-----|-------|-------|-------|------|-------|-------|
| 8 | 64 | mixed | 3.26 | 10.00 | 3.54 | 9.80 | 130 h |
We show the best of 5 runs (mixed precision) and 2 runs (FP32) chosen based on dev-clean WER. For FP32, two gradient accumulation steps have been used.
##### Training stability test
The following table compares greedy decoding word error rates across 8 different training runs with different seeds for mixed precision training.
| **DGX A100, FP16, 8x GPU** | **Seed #1** | **Seed #2** | **Seed #3** | **Seed #4** | **Seed #5** | **Seed #6** | **Seed #7** | **Seed #8** | **Mean** | **Std** |
|-----------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|-----:|
| dev-clean | 3.69 | 3.71 | 3.64 | 3.53 | 3.71 | 3.66 | 3.77 | 3.70 | 3.68 | 0.07 |
| dev-other | 11.39 | 11.65 | 11.46 | 11.11 | 11.23 | 11.18 | 11.43 | 11.60 | 11.38 | 0.19 |
| test-clean | 3.97 | 3.96 | 3.81 | 3.75 | 3.90 | 3.82 | 3.93 | 3.82 | 3.87 | 0.08 |
| test-other | 11.27 | 11.34 | 11.40 | 11.07 | 11.24 | 11.29 | 11.58 | 11.58 | 11.35 | 0.17 |
| DGX A100 80GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
|-----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|------:|
| dev-clean | 3.46 | 3.55 | 3.45 | 3.44 | 3.25 | 3.34 | 3.20 | 3.40 | 3.39 | 0.11 |
| dev-other | 10.30 | 10.77 | 10.36 | 10.26 | 9.99 | 10.18 | 9.78 | 10.32 | 10.25 | 0.27 |
| test-clean | 3.84 | 3.81 | 3.66 | 3.64 | 3.58 | 3.55 | 3.41 | 3.73 | 3.65 | 0.13 |
| test-other | 10.61 | 10.52 | 10.49 | 10.47 | 9.89 | 10.09 | 9.71 | 10.26 | 10.26 | 0.31 |
| **DGX A100, TF32, 8x GPU** | **Seed #1** | **Seed #2** | **Seed #3** | **Seed #4** | **Seed #5** | **Seed #6** | **Seed #7** | **Seed #8** | **Mean** | **Std** |
|-----------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|-----:|
| dev-clean | 3.56 | 3.60 | 3.60 | 3.55 | 3.65 | 3.57 | 3.89 | 3.67 | 3.64 | 0.11 |
| dev-other | 11.27 | 11.41 | 11.65 | 11.30 | 11.51 | 11.11 | 12.18 | 11.50 | 11.49 | 0.32 |
| test-clean | 3.80 | 3.79 | 3.88 | 3.81 | 3.94 | 3.82 | 4.13 | 3.85 | 3.88 | 0.11 |
| test-other | 11.40 | 11.26 | 11.47 | 11.17 | 11.36 | 11.16 | 12.15 | 11.46 | 11.43 | 0.32 |
| **DGX-1 32GB, FP16, 8x GPU** | **Seed #1** | **Seed #2** | **Seed #3** | **Seed #4** | **Seed #5** | **Mean** | **Std** |
|-----------:|------:|------:|------:|------:|------:|------:|-----:|
| dev-clean | 3.69 | 3.75 | 3.63 | 3.86 | 3.49 | 3.68 | 0.14 |
| dev-other | 11.35 | 11.63 | 11.60 | 11.68 | 11.22 | 11.50 | 0.20 |
| test-clean | 3.90 | 3.84 | 3.94 | 3.96 | 3.74 | 3.88 | 0.09 |
| test-other | 11.17 | 11.45 | 11.31 | 11.60 | 10.94 | 11.29 | 0.26 |
| DGX-1 32GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
|-----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|------:|
| dev-clean | 3.31 | 3.31 | 3.26 | 3.44 | 3.40 | 3.35 | 3.36 | 3.28 | 3.34 | 0.06 |
| dev-other | 10.02 | 10.01 | 10.00 | 10.06 | 10.05 | 10.03 | 10.10 | 10.04 | 10.04 | 0.03 |
| test-clean | 3.49 | 3.50 | 3.54 | 3.61 | 3.57 | 3.58 | 3.48 | 3.51 | 3.54 | 0.04 |
| test-other | 10.11 | 10.14 | 9.80 | 10.09 | 10.17 | 9.99 | 9.86 | 10.00 | 10.02 | 0.13 |
#### Training performance results
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.06-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
| **GPUs** | **Batch size / GPU** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **Weak scaling - TF32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 32 | 36.09 | 69.33 | 1.92 | 1.00 | 1.00 |
| 4 | 32 | 143.05 | 264.91 | 1.85 | 3.96 | 3.82 |
| 8 | 32 | 285.25 | 524.33 | 1.84 | 7.90 | 7.56 |
| **GPUs** | **Batch size / GPU** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **Weak scaling - TF32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 64 | - | 77.79 | - | - | 1.00 |
| 4 | 64 | - | 304.32 | - | - | 3.91 |
| 8 | 64 | - | 602.88 | - | - | 7.75 |
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| Batch size / GPU | GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|----:|----:|-------:|-------:|-----:|-----:|-----:|
| 32 | 1 | 42.18 | 64.32 | 1.52 | 1.00 | 1.00 |
| 32 | 4 | 157.49 | 239.23 | 1.52 | 3.73 | 3.72 |
| 32 | 8 | 310.10 | 470.09 | 1.52 | 7.35 | 7.31 |
| 64 | 1 | 49.64 | 75.59 | 1.52 | 1.00 | 1.00 |
| 64 | 4 | 192.66 | 289.16 | 1.50 | 3.88 | 3.83 |
| 64 | 8 | 371.41 | 547.91 | 1.48 | 7.48 | 7.25 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 16 | 11.12 | 28.87 | 2.60 | 1.00 | 1.00 |
| 4 | 16 | 42.39 | 109.40 | 2.58 | 3.81 | 3.79 |
| 8 | 16 | 84.45 | 194.30 | 2.30 | 7.59 | 6.73 |
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 32 | - | 37.57 | - | - | 1.00 |
| 4 | 32 | - | 134.80 | - | - | 3.59 |
| 8 | 32 | - | 276.14 | - | - | 7.35 |
| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|----:|----:|------:|-------:|-----:|-----:|-----:|
| 16 | 1 | 10.71 | 27.87 | 2.60 | 1.00 | 1.00 |
| 16 | 4 | 40.28 | 99.80 | 2.48 | 3.76 | 3.58 |
| 16 | 8 | 78.23 | 193.89 | 2.48 | 7.30 | 6.96 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 32GB)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 32 | 13.15 | 35.63 | 2.71 | 1.00 | 1.00 |
| 4 | 32 | 51.21 | 134.01 | 2.62 | 3.90 | 3.76 |
| 8 | 32 | 99.88 | 247.97 | 2.48 | 7.60 | 6.96 |
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 64 | - | 41.74 | - | - | 1.00 |
| 4 | 64 | - | 158.44 | - | - | 3.80 |
| 8 | 64 | - | 312.22 | - | - | 7.48 |
| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|----:|----:|------:|-------:|-----:|-----:|-----:|
| 32 | 1 | 12.22 | 34.08 | 2.79 | 1.00 | 1.00 |
| 32 | 4 | 46.97 | 128.39 | 2.73 | 3.84 | 3.77 |
| 32 | 8 | 92.44 | 249.00 | 2.69 | 7.57 | 7.31 |
| 64 | 1 | N/A | 39.30 | N/A | N/A | 1.00 |
| 64 | 4 | N/A | 150.18 | N/A | N/A | 3.82 |
| 64 | 8 | N/A | 282.68 | N/A | N/A | 7.19 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-2 (16x V100 32GB)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---:|---:|-------:|-------:|-----:|------:|------:|
| 1 | 32 | 14.13 | 41.05 | 2.90 | 1.00 | 1.00 |
| 4 | 32 | 54.32 | 156.47 | 2.88 | 3.84 | 3.81 |
| 8 | 32 | 110.26 | 307.13 | 2.79 | 7.80 | 7.48 |
| 16 | 32 | 218.14 | 561.85 | 2.58 | 15.44 | 13.69 |
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---:|---:|-------:|-------:|-----:|------:|------:|
| 1 | 64 | - | 46.41 | - | - | 1.00 |
| 4 | 64 | - | 147.90 | - | - | 3.19 |
| 8 | 64 | - | 359.15 | - | - | 7.74 |
| 16 | 64 | - | 703.13 | - | - | 15.15 |
| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|----:|----:|-------:|-------:|-----:|------:|------:|
| 32 | 1 | 13.46 | 38.94 | 2.89 | 1.00 | 1.00 |
| 32 | 4 | 51.38 | 143.44 | 2.79 | 3.82 | 3.68 |
| 32 | 8 | 100.54 | 280.48 | 2.79 | 7.47 | 7.20 |
| 32 | 16 | 188.14 | 515.90 | 2.74 | 13.98 | 13.25 |
| 64 | 1 | N/A | 43.86 | N/A | N/A | 1.00 |
| 64 | 4 | N/A | 165.27 | N/A | N/A | 3.77 |
| 64 | 8 | N/A | 318.10 | N/A | N/A | 7.25 |
| 64 | 16 | N/A | 567.47 | N/A | N/A | 12.94 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
@ -722,121 +693,130 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
#### Inference performance results
Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX A100, DGX-1, DGX-2 and T4 on a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 1000 iterations.
Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX A100, DGX-1, DGX-2 and T4 on a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 500 iterations.
##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
| | |FP16 Latency (ms) Percentiles | | | | TF32 Latency (ms) Percentiles | | | | FP16/TF32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 36.31 | 36.85 | 43.18 | 35.96 | 41.16 | 41.63 | 47.90 | 40.89 | 1.14 |
| 2 | 2 | 37.56 | 43.32 | 45.23 | 37.11 | 42.53 | 47.79 | 49.62 | 42.07 | 1.13 |
| 4 | 2 | 43.10 | 44.85 | 47.22 | 41.43 | 47.88 | 49.75 | 51.55 | 43.25 | 1.04 |
| 8 | 2 | 44.02 | 44.30 | 45.21 | 39.51 | 50.14 | 50.47 | 51.50 | 45.63 | 1.16 |
| 16 | 2 | 48.04 | 48.38 | 49.12 | 42.76 | 70.90 | 71.22 | 72.50 | 60.78 | 1.42 |
| 1 | 7 | 37.74 | 37.88 | 38.92 | 37.02 | 41.53 | 42.17 | 44.75 | 40.79 | 1.10 |
| 2 | 7 | 40.91 | 41.11 | 42.35 | 40.02 | 46.44 | 46.80 | 49.67 | 45.67 | 1.14 |
| 4 | 7 | 43.94 | 44.32 | 46.71 | 43.00 | 54.39 | 54.80 | 56.63 | 53.53 | 1.24 |
| 8 | 7 | 50.01 | 50.19 | 52.92 | 48.62 | 68.55 | 69.25 | 72.28 | 67.61 | 1.39 |
| 16 | 7 | 60.38 | 60.76 | 62.44 | 57.92 | 93.17 | 94.15 | 98.84 | 92.21 | 1.59 |
| 1 | 16.7 | 41.39 | 41.75 | 43.62 | 40.73 | 45.79 | 46.10 | 47.76 | 45.21 | 1.11 |
| 2 | 16.7 | 46.43 | 46.76 | 47.72 | 45.81 | 52.53 | 53.13 | 55.60 | 51.71 | 1.13 |
| 4 | 16.7 | 50.88 | 51.68 | 54.74 | 50.11 | 66.29 | 66.96 | 70.45 | 65.00 | 1.30 |
| 8 | 16.7 | 62.09 | 62.76 | 65.08 | 61.40 | 94.16 | 94.67 | 97.46 | 93.00 | 1.51 |
| 16 | 16.7 | 75.22 | 76.86 | 80.76 | 73.99 | 139.51 | 140.88 | 144.10 | 137.94 | 1.86 |
##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
| | | FP16 Latency (ms) Percentiles | | | | TF32 Latency (ms) Percentiles | | | | FP16/TF32 speed up |
|-----:|---------------:|------:|------:|------:|------:|------:|------:|-------:|------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 32.40 | 32.50 | 32.82 | 32.30 | 33.30 | 33.64 | 34.65 | 33.25 | 1.03 |
| 2 | 2.0 | 32.90 | 33.51 | 34.35 | 32.69 | 34.48 | 34.65 | 35.66 | 34.27 | 1.05 |
| 4 | 2.0 | 32.85 | 33.01 | 33.89 | 32.60 | 34.09 | 34.46 | 35.22 | 34.00 | 1.04 |
| 8 | 2.0 | 35.51 | 35.89 | 37.10 | 35.33 | 34.86 | 35.36 | 36.08 | 34.45 | 0.98 |
| 16 | 2.0 | 36.00 | 36.57 | 37.40 | 35.77 | 43.83 | 44.12 | 44.77 | 43.39 | 1.21 |
| 1 | 7.0 | 33.50 | 33.99 | 34.91 | 33.03 | 33.83 | 34.25 | 34.95 | 33.70 | 1.02 |
| 2 | 7.0 | 34.43 | 34.89 | 35.72 | 34.22 | 34.41 | 34.73 | 35.69 | 34.28 | 1.00 |
| 4 | 7.0 | 34.30 | 34.59 | 35.43 | 34.07 | 37.95 | 38.18 | 38.87 | 37.55 | 1.10 |
| 8 | 7.0 | 35.98 | 36.28 | 37.11 | 35.28 | 44.64 | 44.79 | 45.37 | 44.29 | 1.26 |
| 16 | 7.0 | 39.86 | 40.08 | 41.16 | 39.33 | 55.17 | 55.46 | 57.24 | 54.56 | 1.39 |
| 1 | 16.7 | 35.20 | 35.80 | 38.71 | 34.36 | 35.36 | 35.76 | 36.55 | 34.64 | 1.01 |
| 2 | 16.7 | 35.40 | 35.81 | 36.50 | 34.76 | 36.34 | 36.53 | 37.40 | 35.87 | 1.03 |
| 4 | 16.7 | 36.01 | 36.38 | 37.37 | 35.57 | 44.69 | 45.09 | 45.88 | 43.92 | 1.23 |
| 8 | 16.7 | 41.48 | 41.78 | 44.22 | 40.69 | 58.57 | 58.74 | 59.62 | 58.11 | 1.43 |
| 16 | 16.7 | 61.37 | 61.93 | 66.32 | 60.92 | 97.33 | 97.71 | 100.04 | 96.56 | 1.59 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 52.26 | 59.93 | 66.62 | 50.34 | 70.90 | 76.47 | 79.84 | 68.61 | 1.36 |
| 2 | 2 | 62.04 | 67.68 | 70.91 | 58.65 | 75.72 | 80.15 | 83.50 | 71.33 | 1.22 |
| 4 | 2 | 75.12 | 77.12 | 82.80 | 66.55 | 80.88 | 82.60 | 86.63 | 73.65 | 1.11 |
| 8 | 2 | 71.62 | 72.99 | 81.10 | 66.39 | 99.57 | 101.43 | 107.16 | 92.34 | 1.39 |
| 16 | 2 | 78.51 | 80.33 | 87.31 | 72.91 | 104.79 | 107.22 | 114.21 | 96.18 | 1.32 |
| 1 | 7 | 52.67 | 54.40 | 64.27 | 50.47 | 73.86 | 75.61 | 84.93 | 72.08 | 1.43 |
| 2 | 7 | 60.49 | 62.41 | 72.87 | 58.45 | 93.07 | 94.51 | 102.40 | 91.55 | 1.57 |
| 4 | 7 | 70.55 | 72.95 | 82.59 | 68.43 | 131.48 | 137.60 | 149.06 | 129.23 | 1.89 |
| 8 | 7 | 83.91 | 85.28 | 93.08 | 76.40 | 152.49 | 157.92 | 166.80 | 150.49 | 1.97 |
| 16 | 7 | 100.21 | 103.12 | 109.00 | 96.31 | 178.45 | 181.46 | 187.20 | 174.33 | 1.81 |
| 1 | 16.7 | 56.84 | 60.05 | 66.54 | 54.69 | 109.55 | 111.19 | 120.40 | 102.25 | 1.87 |
| 2 | 16.7 | 69.39 | 70.97 | 75.34 | 67.39 | 149.93 | 150.79 | 154.06 | 147.45 | 2.19 |
| 4 | 16.7 | 87.48 | 93.96 | 102.73 | 85.09 | 211.78 | 219.66 | 232.99 | 208.38 | 2.45 |
| 8 | 16.7 | 106.91 | 111.92 | 116.55 | 104.13 | 246.92 | 250.94 | 268.44 | 243.34 | 2.34 |
| 16 | 16.7 | 149.08 | 153.86 | 166.17 | 146.28 | 292.84 | 298.02 | 313.04 | 288.54 | 1.97 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 45.42 | 45.62 | 49.54 | 45.02 | 48.83 | 48.99 | 51.66 | 48.44 | 1.08 |
| 2 | 2.0 | 50.31 | 50.53 | 53.66 | 49.10 | 49.87 | 50.04 | 52.99 | 49.41 | 1.01 |
| 4 | 2.0 | 49.17 | 49.48 | 52.13 | 48.73 | 52.92 | 53.21 | 55.28 | 52.31 | 1.07 |
| 8 | 2.0 | 51.20 | 51.40 | 52.32 | 49.01 | 73.02 | 73.30 | 75.00 | 71.99 | 1.47 |
| 16 | 2.0 | 51.75 | 52.24 | 56.36 | 51.27 | 83.99 | 84.57 | 86.69 | 83.24 | 1.62 |
| 1 | 7.0 | 48.13 | 48.53 | 50.95 | 46.78 | 48.52 | 48.75 | 50.89 | 48.01 | 1.03 |
| 2 | 7.0 | 49.52 | 50.10 | 52.35 | 48.00 | 65.27 | 65.41 | 66.59 | 64.79 | 1.35 |
| 4 | 7.0 | 51.75 | 52.01 | 54.39 | 50.38 | 93.75 | 94.77 | 97.04 | 92.27 | 1.83 |
| 8 | 7.0 | 54.80 | 56.27 | 66.23 | 52.95 | 130.65 | 131.09 | 132.91 | 129.82 | 2.45 |
| 16 | 7.0 | 73.02 | 73.42 | 75.83 | 71.96 | 157.53 | 158.20 | 160.73 | 155.51 | 2.16 |
| 1 | 16.7 | 48.10 | 48.52 | 52.71 | 47.20 | 73.34 | 73.56 | 74.19 | 72.69 | 1.54 |
| 2 | 16.7 | 64.21 | 64.52 | 65.56 | 56.06 | 129.48 | 129.97 | 131.78 | 126.36 | 2.25 |
| 4 | 16.7 | 60.38 | 61.03 | 63.18 | 58.87 | 183.33 | 183.85 | 185.53 | 181.90 | 3.09 |
| 8 | 16.7 | 85.88 | 86.34 | 87.70 | 84.46 | 227.42 | 228.21 | 229.63 | 225.71 | 2.67 |
| 16 | 16.7 | 135.62 | 136.40 | 137.69 | 131.58 | 276.90 | 277.59 | 281.16 | 275.08 | 2.09 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-1 (1x V100 32GB)
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 64.60 | 67.34 | 79.87 | 60.73 | 84.69 | 86.78 | 96.02 | 79.32 | 1.31 |
| 2 | 2 | 71.52 | 73.32 | 82.00 | 63.93 | 85.33 | 87.65 | 96.34 | 78.09 | 1.22 |
| 4 | 2 | 80.38 | 84.62 | 93.09 | 74.95 | 90.29 | 97.59 | 100.61 | 84.44 | 1.13 |
| 8 | 2 | 83.43 | 85.51 | 91.17 | 74.09 | 107.28 | 111.89 | 115.19 | 98.76 | 1.33 |
| 16 | 2 | 90.01 | 90.81 | 96.48 | 79.85 | 115.39 | 116.95 | 123.71 | 103.26 | 1.29 |
| 1 | 7 | 53.74 | 54.09 | 56.67 | 53.07 | 86.07 | 86.55 | 91.59 | 78.79 | 1.48 |
| 2 | 7 | 63.34 | 63.67 | 66.08 | 62.62 | 96.25 | 96.82 | 99.72 | 95.44 | 1.52 |
| 4 | 7 | 80.35 | 80.86 | 83.80 | 73.41 | 132.19 | 132.94 | 135.59 | 131.46 | 1.79 |
| 8 | 7 | 77.68 | 78.11 | 86.71 | 75.72 | 156.30 | 157.72 | 165.55 | 154.87 | 2.05 |
| 16 | 7 | 103.52 | 106.66 | 111.93 | 98.15 | 180.71 | 182.82 | 191.12 | 178.61 | 1.82 |
| 1 | 16.7 | 57.58 | 57.79 | 59.75 | 56.58 | 104.51 | 104.87 | 108.01 | 104.04 | 1.84 |
| 2 | 16.7 | 69.19 | 69.58 | 71.49 | 68.58 | 151.25 | 152.07 | 155.21 | 149.30 | 2.18 |
| 4 | 16.7 | 87.17 | 88.53 | 97.41 | 86.56 | 211.28 | 212.41 | 214.97 | 208.54 | 2.41 |
| 8 | 16.7 | 116.25 | 116.90 | 120.14 | 109.21 | 247.63 | 248.93 | 254.77 | 245.19 | 2.25 |
| 16 | 16.7 | 151.99 | 154.79 | 163.36 | 149.80 | 293.99 | 296.05 | 303.04 | 291.00 | 1.94 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 52.74 | 53.01 | 54.40 | 51.47 | 55.97 | 56.22 | 57.93 | 54.93 | 1.07 |
| 2 | 2.0 | 51.77 | 52.15 | 54.69 | 50.98 | 56.58 | 56.87 | 58.88 | 55.35 | 1.09 |
| 4 | 2.0 | 51.41 | 51.76 | 53.47 | 50.55 | 61.56 | 61.87 | 63.81 | 60.74 | 1.20 |
| 8 | 2.0 | 51.83 | 52.15 | 54.08 | 50.85 | 80.20 | 80.69 | 81.67 | 77.69 | 1.53 |
| 16 | 2.0 | 70.48 | 70.96 | 72.11 | 62.98 | 93.00 | 93.44 | 94.17 | 89.05 | 1.41 |
| 1 | 7.0 | 49.77 | 50.21 | 51.88 | 48.73 | 52.74 | 52.99 | 54.54 | 51.67 | 1.06 |
| 2 | 7.0 | 51.12 | 51.47 | 52.84 | 49.98 | 65.33 | 65.63 | 67.07 | 64.64 | 1.29 |
| 4 | 7.0 | 53.13 | 53.56 | 55.68 | 52.15 | 93.54 | 93.85 | 94.72 | 92.76 | 1.78 |
| 8 | 7.0 | 57.67 | 58.07 | 59.89 | 56.41 | 133.93 | 134.18 | 134.88 | 133.15 | 2.36 |
| 16 | 7.0 | 76.09 | 76.48 | 79.13 | 75.27 | 162.35 | 162.77 | 164.63 | 161.30 | 2.14 |
| 1 | 16.7 | 54.78 | 55.29 | 56.83 | 52.51 | 75.37 | 76.27 | 78.05 | 74.32 | 1.42 |
| 2 | 16.7 | 56.80 | 57.20 | 59.01 | 55.49 | 130.60 | 131.36 | 132.93 | 128.55 | 2.32 |
| 4 | 16.7 | 64.19 | 64.84 | 66.47 | 62.87 | 188.09 | 188.76 | 190.07 | 185.76 | 2.95 |
| 8 | 16.7 | 87.46 | 87.86 | 89.99 | 86.47 | 232.33 | 232.89 | 234.43 | 230.44 | 2.67 |
| 16 | 16.7 | 136.02 | 136.52 | 139.44 | 134.78 | 283.87 | 284.59 | 286.70 | 282.01 | 2.09 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-2 (1x V100 32GB)
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 47.25 | 48.24 | 50.28 | 41.53 | 67.03 | 68.15 | 70.17 | 61.82 | 1.49 |
| 2 | 2 | 54.11 | 55.20 | 60.44 | 48.82 | 69.11 | 70.38 | 75.93 | 64.45 | 1.32 |
| 4 | 2 | 63.82 | 67.64 | 71.58 | 61.47 | 71.51 | 74.55 | 79.31 | 67.85 | 1.10 |
| 8 | 2 | 64.78 | 65.86 | 67.68 | 59.07 | 90.84 | 91.99 | 94.10 | 84.28 | 1.43 |
| 16 | 2 | 70.59 | 71.49 | 73.58 | 63.85 | 96.92 | 97.58 | 99.98 | 87.73 | 1.37 |
| 1 | 7 | 42.35 | 42.55 | 43.50 | 41.08 | 63.87 | 64.02 | 64.73 | 62.54 | 1.52 |
| 2 | 7 | 47.82 | 48.04 | 49.43 | 46.79 | 81.17 | 81.43 | 82.28 | 80.02 | 1.71 |
| 4 | 7 | 58.27 | 58.54 | 59.69 | 56.96 | 116.00 | 116.46 | 118.79 | 114.82 | 2.02 |
| 8 | 7 | 62.88 | 63.62 | 67.16 | 61.47 | 143.90 | 144.34 | 147.36 | 139.54 | 2.27 |
| 16 | 7 | 88.04 | 88.57 | 90.96 | 82.84 | 163.04 | 164.04 | 167.30 | 161.36 | 1.95 |
| 1 | 16.7 | 44.54 | 44.86 | 45.86 | 43.53 | 88.10 | 88.41 | 89.37 | 87.21 | 2.00 |
| 2 | 16.7 | 55.21 | 55.55 | 56.92 | 54.33 | 134.99 | 135.69 | 137.87 | 132.97 | 2.45 |
| 4 | 16.7 | 72.93 | 73.58 | 74.95 | 72.02 | 193.50 | 194.21 | 196.04 | 191.24 | 2.66 |
| 8 | 16.7 | 96.94 | 97.66 | 99.58 | 92.73 | 227.70 | 228.74 | 231.59 | 225.35 | 2.43 |
| 16 | 16.7 | 138.25 | 139.75 | 143.71 | 133.82 | 273.69 | 274.53 | 279.50 | 269.13 | 2.01 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 35.88 | 36.12 | 39.80 | 35.20 | 42.95 | 43.67 | 46.65 | 42.23 | 1.20 |
| 2 | 2.0 | 36.36 | 36.57 | 40.97 | 35.60 | 41.83 | 42.21 | 45.60 | 40.97 | 1.15 |
| 4 | 2.0 | 36.69 | 36.89 | 41.25 | 36.05 | 48.35 | 48.52 | 52.35 | 47.80 | 1.33 |
| 8 | 2.0 | 37.49 | 37.70 | 41.37 | 36.88 | 65.41 | 65.64 | 66.50 | 64.96 | 1.76 |
| 16 | 2.0 | 41.35 | 41.79 | 45.58 | 40.91 | 77.22 | 77.51 | 79.48 | 76.54 | 1.87 |
| 1 | 7.0 | 36.07 | 36.55 | 40.31 | 35.62 | 39.52 | 39.84 | 43.07 | 38.93 | 1.09 |
| 2 | 7.0 | 37.42 | 37.66 | 41.36 | 36.79 | 55.94 | 56.19 | 58.33 | 55.60 | 1.51 |
| 4 | 7.0 | 38.51 | 38.95 | 42.55 | 37.98 | 86.62 | 87.08 | 87.50 | 86.20 | 2.27 |
| 8 | 7.0 | 42.82 | 43.00 | 47.11 | 42.55 | 122.05 | 122.29 | 122.70 | 121.59 | 2.86 |
| 16 | 7.0 | 67.74 | 67.92 | 69.05 | 65.69 | 149.92 | 150.16 | 151.03 | 149.49 | 2.28 |
| 1 | 16.7 | 39.28 | 39.78 | 43.34 | 38.35 | 66.73 | 67.16 | 69.80 | 66.01 | 1.72 |
| 2 | 16.7 | 43.05 | 43.42 | 47.18 | 42.43 | 120.04 | 121.12 | 123.32 | 118.14 | 2.78 |
| 4 | 16.7 | 52.18 | 52.49 | 56.11 | 51.63 | 176.09 | 176.51 | 178.70 | 174.60 | 3.38 |
| 8 | 16.7 | 78.55 | 78.79 | 81.66 | 78.04 | 216.19 | 216.68 | 217.63 | 214.48 | 2.75 |
| 16 | 16.7 | 125.57 | 125.92 | 128.78 | 124.33 | 264.11 | 264.49 | 266.14 | 262.80 | 2.11 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA T4
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 64.13 | 65.25 | 76.11 | 59.08 | 94.69 | 98.23 | 109.86 | 89.00 | 1.51 |
| 2 | 2 | 67.59 | 70.77 | 84.06 | 57.47 | 103.88 | 105.37 | 114.59 | 93.30 | 1.62 |
| 4 | 2 | 75.19 | 81.05 | 87.01 | 65.79 | 120.73 | 128.29 | 146.83 | 112.96 | 1.72 |
| 8 | 2 | 74.15 | 77.69 | 84.96 | 62.77 | 161.97 | 163.46 | 170.25 | 153.07 | 2.44 |
| 16 | 2 | 100.62 | 105.08 | 113.00 | 82.06 | 216.18 | 217.92 | 222.46 | 188.57 | 2.30 |
| 1 | 7 | 77.88 | 79.61 | 81.90 | 70.22 | 110.37 | 113.93 | 121.39 | 107.17 | 1.53 |
| 2 | 7 | 81.09 | 83.94 | 87.28 | 78.06 | 148.30 | 151.21 | 158.55 | 141.26 | 1.81 |
| 4 | 7 | 99.85 | 100.83 | 104.24 | 96.81 | 229.94 | 232.34 | 238.11 | 225.43 | 2.33 |
| 8 | 7 | 147.38 | 150.37 | 153.66 | 142.64 | 394.26 | 396.35 | 398.89 | 390.77 | 2.74 |
| 16 | 7 | 280.32 | 281.37 | 282.74 | 278.01 | 484.20 | 485.74 | 499.89 | 482.67 | 1.74 |
| 1 | 16.7 | 76.97 | 79.78 | 81.61 | 75.55 | 171.45 | 176.90 | 179.18 | 167.95 | 2.22 |
| 2 | 16.7 | 96.48 | 99.42 | 101.21 | 92.74 | 276.12 | 278.67 | 282.06 | 270.05 | 2.91 |
| 4 | 16.7 | 129.63 | 131.67 | 134.42 | 124.55 | 522.23 | 524.79 | 527.32 | 509.75 | 4.09 |
| 8 | 16.7 | 209.64 | 211.36 | 214.66 | 204.83 | 706.84 | 709.21 | 715.57 | 697.97 | 3.41 |
| 16 | 16.7 | 342.23 | 344.62 | 350.84 | 337.42 | 848.02 | 849.83 | 858.22 | 834.38 | 2.47 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 43.62 | 46.95 | 50.46 | 37.23 | 51.31 | 52.37 | 56.21 | 49.77 | 1.34 |
| 2 | 2.0 | 49.09 | 50.46 | 53.11 | 40.61 | 81.85 | 82.22 | 83.94 | 80.81 | 1.99 |
| 4 | 2.0 | 47.71 | 51.14 | 55.09 | 41.29 | 112.56 | 115.13 | 118.56 | 111.60 | 2.70 |
| 8 | 2.0 | 51.37 | 53.11 | 55.48 | 45.94 | 198.95 | 199.48 | 200.28 | 197.22 | 4.29 |
| 16 | 2.0 | 63.59 | 64.30 | 66.90 | 61.77 | 221.75 | 222.07 | 223.22 | 220.09 | 3.56 |
| 1 | 7.0 | 47.49 | 48.66 | 53.36 | 40.76 | 73.63 | 74.41 | 77.65 | 72.41 | 1.78 |
| 2 | 7.0 | 48.63 | 50.01 | 58.35 | 43.44 | 114.66 | 115.28 | 117.63 | 112.41 | 2.59 |
| 4 | 7.0 | 52.19 | 52.85 | 54.22 | 49.94 | 200.38 | 201.29 | 202.97 | 197.21 | 3.95 |
| 8 | 7.0 | 84.90 | 85.56 | 87.52 | 83.41 | 404.00 | 404.72 | 405.70 | 400.25 | 4.80 |
| 16 | 7.0 | 157.12 | 157.58 | 159.19 | 155.01 | 490.93 | 492.09 | 493.44 | 486.45 | 3.14 |
| 1 | 16.7 | 50.57 | 51.57 | 57.58 | 46.27 | 150.39 | 151.84 | 153.54 | 147.31 | 3.18 |
| 2 | 16.7 | 63.64 | 64.55 | 66.31 | 61.98 | 256.54 | 258.16 | 262.71 | 250.34 | 4.04 |
| 4 | 16.7 | 140.44 | 141.06 | 142.00 | 138.14 | 519.59 | 521.41 | 523.86 | 512.74 | 3.71 |
| 8 | 16.7 | 267.03 | 268.06 | 270.01 | 263.15 | 727.33 | 728.61 | 731.36 | 722.62 | 2.75 |
| 16 | 16.7 | 362.40 | 364.02 | 367.80 | 358.75 | 867.92 | 869.19 | 871.46 | 860.37 | 2.40 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
February 2021
* Added DALI data-processing pipeline for on-the-fly data processing and augmentation on CPU or GPU
* Revised training recipe: ~10% relative improvement in Word Error Rate (WER)
* Updated Triton scripts for compatibility with Triton V2 API, updated Triton inference results
* Refactored codebase
* Updated performance results for the PyTorch 20.10-py3 NGC container
June 2020
* Updated performance tables to include A100 results
December 2019
* Inference support for TRT 6 with dynamic shapes
View file
@ -12,13 +12,28 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import soundfile as sf
import librosa
import torch
import numpy as np
import sox
def audio_from_file(file_path, offset=0, duration=0, trim=False, target_sr=16000):
audio = AudioSegment(file_path, target_sr=target_sr, int_values=False,
offset=offset, duration=duration, trim=trim)
samples = torch.tensor(audio.samples, dtype=torch.float).cuda()
num_samples = torch.tensor(samples.shape[0]).int().cuda()
return (samples.unsqueeze(0), num_samples.unsqueeze(0))
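# A minimal usage sketch (hypothetical path; assumes a CUDA device is available,
# since the samples are moved to the GPU):
#   samples, num_samples = audio_from_file('/datasets/sample.wav', target_sr=16000)
#   # samples: torch.FloatTensor of shape (1, num_audio_samples) on the GPU
#   # num_samples: torch.IntTensor of shape (1,) holding the sample count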
class AudioSegment(object):
"""Monaural audio segment abstraction.
:param samples: Audio samples [num_samples x num_channels].
:type samples: ndarray.float32
:param sample_rate: Audio sample rate.
@ -26,11 +41,30 @@ class AudioSegment(object):
:raises TypeError: If the sample data type is not float or int.
"""
def __init__(self, samples, sample_rate, target_sr=None, trim=False,
trim_db=60):
def __init__(self, filename, target_sr=None, int_values=False, offset=0,
duration=0, trim=False, trim_db=60):
"""Create audio segment from samples.
Samples are converted to float32 internally, with int scaled to [-1, 1].
Load a file supported by librosa and return as an AudioSegment.
:param filename: path of file to load
:param target_sr: the desired sample rate
:param int_values: if true, load samples as 32-bit integers
:param offset: offset in seconds when loading audio
:param duration: duration in seconds when loading audio
:return: numpy array of samples
"""
with sf.SoundFile(filename, 'r') as f:
dtype = 'int32' if int_values else 'float32'
sample_rate = f.samplerate
if offset > 0:
f.seek(int(offset * sample_rate))
if duration > 0:
samples = f.read(int(duration * sample_rate), dtype=dtype)
else:
samples = f.read(dtype=dtype)
samples = samples.transpose()
samples = self._convert_samples_to_float32(samples)
if target_sr is not None and target_sr != sample_rate:
samples = librosa.core.resample(samples, sample_rate, target_sr)
@ -67,6 +101,7 @@ class AudioSegment(object):
@staticmethod
def _convert_samples_to_float32(samples):
"""Convert sample type to float32.
Audio sample type is usually integer or float-point.
Integers will be scaled to [-1, 1] in float32.
"""
@ -80,30 +115,6 @@ class AudioSegment(object):
raise TypeError("Unsupported sample type: %s." % samples.dtype)
return float32_samples
@classmethod
def from_file(cls, filename, target_sr=None, int_values=False, offset=0,
duration=0, trim=False):
"""
Load a file supported by librosa and return as an AudioSegment.
:param filename: path of file to load
:param target_sr: the desired sample rate
:param int_values: if true, load samples as 32-bit integers
:param offset: offset in seconds when loading audio
:param duration: duration in seconds when loading audio
:return: numpy array of samples
"""
with sf.SoundFile(filename, 'r') as f:
dtype = 'int32' if int_values else 'float32'
sample_rate = f.samplerate
if offset > 0:
f.seek(int(offset * sample_rate))
if duration > 0:
samples = f.read(int(duration * sample_rate), dtype=dtype)
else:
samples = f.read(dtype=dtype)
samples = samples.transpose()
return cls(samples, sample_rate, target_sr=target_sr, trim=trim)
@property
def samples(self):
return self._samples.copy()
@ -129,9 +140,11 @@ class AudioSegment(object):
self._samples *= 10. ** (gain / 20.)
def pad(self, pad_size, symmetric=False):
"""Add zero padding to the sample. The pad size is given in number of samples.
If symmetric=True, `pad_size` will be added to both sides. If false, `pad_size`
zeros will be added only to the end.
"""Add zero padding to the sample.
The pad size is given in number of samples. If symmetric=True,
`pad_size` will be added to both sides. If false, `pad_size` zeros
will be added only to the end.
"""
self._samples = np.pad(self._samples,
(pad_size if symmetric else 0, pad_size),
@ -139,6 +152,7 @@ class AudioSegment(object):
def subsegment(self, start_time=None, end_time=None):
"""Cut the AudioSegment between given boundaries.
Note that this is an in-place transformation.
:param start_time: Beginning of subsegment in seconds.
:type start_time: float
@ -168,3 +182,66 @@ class AudioSegment(object):
start_sample = int(round(start_time * self._sample_rate))
end_sample = int(round(end_time * self._sample_rate))
self._samples = self._samples[start_sample:end_sample]
class Perturbation:
def __init__(self, p=0.1, rng=None):
self.p = p
self._rng = random.Random() if rng is None else rng
def maybe_apply(self, segment, sample_rate=None):
if self._rng.random() < self.p:
self(segment, sample_rate)
class SpeedPerturbation(Perturbation):
def __init__(self, min_rate=0.85, max_rate=1.15, discrete=False, p=0.1, rng=None):
super(SpeedPerturbation, self).__init__(p, rng)
assert 0 < min_rate < max_rate
self.min_rate = min_rate
self.max_rate = max_rate
self.discrete = discrete
def __call__(self, data, sample_rate):
if self.discrete:
rate = np.random.choice([self.min_rate, None, self.max_rate])
else:
rate = self._rng.uniform(self.min_rate, self.max_rate)
if rate is not None:
data._samples = sox.Transformer().speed(factor=rate).build_array(
input_array=data._samples, sample_rate_in=sample_rate)
class GainPerturbation(Perturbation):
def __init__(self, min_gain_dbfs=-10, max_gain_dbfs=10, p=0.1, rng=None):
super(GainPerturbation, self).__init__(p, rng)
self._rng = random.Random() if rng is None else rng
self._min_gain_dbfs = min_gain_dbfs
self._max_gain_dbfs = max_gain_dbfs
def __call__(self, data, sample_rate=None):
del sample_rate
gain = self._rng.uniform(self._min_gain_dbfs, self._max_gain_dbfs)
data._samples = data._samples * (10. ** (gain / 20.))
class ShiftPerturbation(Perturbation):
def __init__(self, min_shift_ms=-5.0, max_shift_ms=5.0, p=0.1, rng=None):
super(ShiftPerturbation, self).__init__(p, rng)
self._min_shift_ms = min_shift_ms
self._max_shift_ms = max_shift_ms
def __call__(self, data, sample_rate):
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
if abs(shift_ms) / 1000 > data.duration:
# TODO: do something smarter than just ignore this condition
return
shift_samples = int(shift_ms * data.sample_rate // 1000)
# print("DEBUG: shift:", shift_samples)
if shift_samples < 0:
data._samples[-shift_samples:] = data._samples[:shift_samples]
data._samples[:-shift_samples] = 0
elif shift_samples > 0:
data._samples[:-shift_samples] = data._samples[shift_samples:]
data._samples[-shift_samples:] = 0
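# Illustrative use of the perturbations above (hypothetical path, not taken from
# the training scripts): each perturbation mutates the segment in place with
# probability `p`.
#   segment = AudioSegment('/datasets/sample.wav', target_sr=16000)
#   SpeedPerturbation(min_rate=0.85, max_rate=1.15, p=0.5).maybe_apply(segment, 16000)
#   GainPerturbation(min_gain_dbfs=-10, max_gain_dbfs=10, p=0.5).maybe_apply(segment)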
View file
@ -0,0 +1,13 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
View file
@ -0,0 +1,158 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import math
import numpy as np
import torch.distributed as dist
from .iterator import DaliJasperIterator, SyntheticDataIterator
from .pipeline import DaliPipeline
from common.helpers import print_once
def _parse_json(json_path: str, start_label=0, predicate=lambda json: True):
"""
Parses json file to the format required by DALI
Args:
json_path: path to json file
start_label: the label from which DALI starts assigning consecutive int numbers to every transcript
predicate: function that accepts a sample descriptor (i.e. a json dictionary) as an argument.
If the predicate for a given sample returns True, it will be included in the dataset.
Returns:
output_files: dictionary that maps a file name to the label assigned by DALI
transcripts: dictionary that maps a label assigned by DALI to the transcript
"""
import json
global cnt
with open(json_path) as f:
librispeech_json = json.load(f)
output_files = {}
transcripts = {}
curr_label = start_label
for original_sample in librispeech_json:
if not predicate(original_sample):
continue
transcripts[curr_label] = original_sample['transcript']
output_files[original_sample['files'][-1]['fname']] = curr_label
curr_label += 1
return output_files, transcripts
def _dict_to_file(dict: dict, filename: str):
with open(filename, "w") as f:
for key, value in dict.items():
f.write("{} {}\n".format(key, value))
class DaliDataLoader:
"""
DataLoader is the main entry point to the data preprocessing pipeline.
To use, create an object and then just iterate over `data_iterator`.
DataLoader will do the rest for you.
Example:
data_layer = DataLoader(DaliTrainPipeline, path, json, bs, ngpu)
data_it = data_layer.data_iterator
for data in data_it:
print(data) # Here's your preprocessed data
Args:
device_type: Which device to use for preprocessing. Choose: "cpu", "gpu"
pipeline_type: Choose: "train", "val", "synth"
"""
def __init__(self, gpu_id, dataset_path: str, config_data: dict, config_features: dict, json_names: list,
symbols: list, batch_size: int, pipeline_type: str, grad_accumulation_steps: int = 1,
synth_iters_per_epoch: int = 544, device_type: str = "gpu"):
import torch
self.batch_size = batch_size
self.grad_accumulation_steps = grad_accumulation_steps
self.drop_last = (pipeline_type == 'train')
self.device_type = device_type
pipeline_type = self._parse_pipeline_type(pipeline_type)
if pipeline_type == "synth":
self._dali_data_iterator = self._init_synth_iterator(self.batch_size, config_features['nfilt'],
iters_per_epoch=synth_iters_per_epoch,
ngpus=torch.distributed.get_world_size())
else:
self._dali_data_iterator = self._init_iterator(gpu_id=gpu_id, dataset_path=dataset_path,
config_data=config_data,
config_features=config_features,
json_names=json_names, symbols=symbols,
train_pipeline=pipeline_type == "train")
def _init_iterator(self, gpu_id, dataset_path, config_data, config_features, json_names: list, symbols: list,
train_pipeline: bool):
"""
Returns data iterator. Data underneath this operator is preprocessed within Dali
"""
def hash_list_of_strings(li):
return str(abs(hash(''.join(li))))
output_files, transcripts = {}, {}
max_duration = config_data['max_duration']
for jname in json_names:
of, tr = _parse_json(jname if jname[0] == '/' else os.path.join(dataset_path, jname), len(output_files),
predicate=lambda json: json['original_duration'] <= max_duration)
output_files.update(of)
transcripts.update(tr)
file_list_path = os.path.join("/tmp", "jasper_dali.file_list." + hash_list_of_strings(json_names))
_dict_to_file(output_files, file_list_path)
self.dataset_size = len(output_files)
print_once(f"Dataset read by DALI. Number of samples: {self.dataset_size}")
pipeline = DaliPipeline.from_config(config_data=config_data, config_features=config_features, device_id=gpu_id,
file_root=dataset_path, file_list=file_list_path,
device_type=self.device_type, batch_size=self.batch_size,
train_pipeline=train_pipeline)
return DaliJasperIterator([pipeline], transcripts=transcripts, symbols=symbols, batch_size=self.batch_size,
shard_size=self._shard_size(), train_iterator=train_pipeline)
def _init_synth_iterator(self, batch_size, nfeatures, iters_per_epoch, ngpus):
self.dataset_size = ngpus * iters_per_epoch * batch_size
return SyntheticDataIterator(batch_size, nfeatures, regenerate=True)
@staticmethod
def _parse_pipeline_type(pipeline_type):
pipe = pipeline_type.lower()
assert pipe in ("train", "val", "synth"), 'Invalid pipeline type (choices: "train", "val", "synth").'
return pipe
def _shard_size(self):
"""
Total number of samples handled by a single GPU in a single epoch.
"""
world_size = dist.get_world_size() if dist.is_initialized() else 1
if self.drop_last:
divisor = world_size * self.batch_size * self.grad_accumulation_steps
return self.dataset_size // divisor * divisor // world_size
else:
return int(math.ceil(self.dataset_size / world_size))
def __len__(self):
"""
Number of batches handled by each GPU.
"""
if self.drop_last:
assert self._shard_size() % self.batch_size == 0, f'{self._shard_size()} {self.batch_size}'
return int(math.ceil(self._shard_size() / self.batch_size))
def data_iterator(self):
return self._dali_data_iterator
def __iter__(self):
return self._dali_data_iterator
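# Illustrative construction (argument values are placeholders, not a verified
# recipe; `config_data`/`config_features` are the dataset and feature sections
# parsed from the model's YAML config):
#   loader = DaliDataLoader(gpu_id=0, dataset_path='/datasets/LibriSpeech',
#                           config_data=config_data, config_features=config_features,
#                           json_names=['librispeech-train-clean-100-wav.json'],
#                           symbols=symbols, batch_size=64, pipeline_type='train')
#   for audio, audio_lens, transcripts, transcript_lens in loader:
#       ...  # tensors arrive on the GPU, already preprocessed by DALI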
View file
@ -0,0 +1,162 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.distributed as dist
import numpy as np
from common.helpers import print_once
from common.text import _clean_text, punctuation_map
def normalize_string(s, symbols, punct_map):
"""
Normalizes string.
Example:
'call me at 8:00 pm!' -> 'call me at eight zero pm'
"""
labels = set(symbols)
try:
text = _clean_text(s, ["english_cleaners"], punct_map).strip()
return ''.join([tok for tok in text if all(t in labels for t in tok)])
except Exception as e:
print_once("WARNING: Normalizing failed: {s} {e}")
class DaliJasperIterator(object):
"""
Returns batches of data for Jasper training:
preprocessed_signal, preprocessed_signal_length, transcript, transcript_length
This iterator is not meant to be the entry point to Dali processing pipeline.
Use DataLoader instead.
"""
def __init__(self, dali_pipelines, transcripts, symbols, batch_size, shard_size, train_iterator: bool):
self.transcripts = transcripts
self.symbols = symbols
self.batch_size = batch_size
from nvidia.dali.plugin.pytorch import DALIGenericIterator
from nvidia.dali.plugin.base_iterator import LastBatchPolicy
# in train pipeline shard_size is set to be divisible by batch_size, so PARTIAL policy is safe
self.dali_it = DALIGenericIterator(
dali_pipelines, ["audio", "label", "audio_shape"], size=shard_size,
dynamic_shape=True, auto_reset=True, last_batch_padded=True,
last_batch_policy=LastBatchPolicy.PARTIAL)
@staticmethod
def _str2list(s: str):
"""
Returns a list of floats that represents the given string.
'0.' denotes separator
'1.' denotes 'a'
'27.' denotes "'"
Assumes that the string is lower case.
"""
list = []
for c in s:
if c == "'":
list.append(27.)
else:
list.append(max(0., ord(c) - 96.))
return list
@staticmethod
def _pad_lists(lists: list, pad_val=0):
"""
Pads lists so that all have the same size.
Returns a list with the actual sizes of the corresponding input lists
"""
max_length = 0
sizes = []
for li in lists:
sizes.append(len(li))
max_length = max_length if len(li) < max_length else len(li)
for li in lists:
li += [pad_val] * (max_length - len(li))
return sizes
def _gen_transcripts(self, labels, normalize_transcripts: bool = True):
"""
Generate transcripts in format expected by NN
"""
lists = [
self._str2list(normalize_string(self.transcripts[lab.item()], self.symbols, punctuation_map(self.symbols)))
for lab in labels
] if normalize_transcripts else [self._str2list(self.transcripts[lab.item()]) for lab in labels]
sizes = self._pad_lists(lists)
return torch.tensor(lists).cuda(), torch.tensor(sizes, dtype=torch.int32).cuda()
def __next__(self):
data = self.dali_it.__next__()
transcripts, transcripts_lengths = self._gen_transcripts(data[0]["label"])
return data[0]["audio"], data[0]["audio_shape"][:, 1], transcripts, transcripts_lengths
def next(self):
return self.__next__()
def __iter__(self):
return self
# TODO: refactor
class SyntheticDataIterator(object):
def __init__(self, batch_size, nfeatures, feat_min=-5., feat_max=0., txt_min=0., txt_max=23., feat_lens_max=1760,
txt_lens_max=231, regenerate=False):
"""
Args:
batch_size
nfeatures: number of features for melfbanks
feat_min: minimum value in `feat` tensor, used for randomization
feat_max: maximum value in `feat` tensor, used for randomization
txt_min: minimum value in `txt` tensor, used for randomization
txt_max: maximum value in `txt` tensor, used for randomization
regenerate: If True, regenerate random tensors for every iterator step.
If False, generate them only at start.
"""
self.batch_size = batch_size
self.nfeatures = nfeatures
self.feat_min = feat_min
self.feat_max = feat_max
self.feat_lens_max = feat_lens_max
self.txt_min = txt_min
self.txt_max = txt_max
self.txt_lens_max = txt_lens_max
self.regenerate = regenerate
if not self.regenerate:
self.feat, self.feat_lens, self.txt, self.txt_lens = self._generate_sample()
def _generate_sample(self):
feat = (self.feat_max - self.feat_min) * np.random.random_sample(
(self.batch_size, self.nfeatures, self.feat_lens_max)) + self.feat_min
feat_lens = np.random.randint(0, int(self.feat_lens_max) - 1, size=self.batch_size)
txt = (self.txt_max - self.txt_min) * np.random.random_sample(
(self.batch_size, self.txt_lens_max)) + self.txt_min
txt_lens = np.random.randint(0, int(self.txt_lens_max) - 1, size=self.batch_size)
return torch.Tensor(feat).cuda(), \
torch.Tensor(feat_lens).cuda(), \
torch.Tensor(txt).cuda(), \
torch.Tensor(txt_lens).cuda()
def __next__(self):
if self.regenerate:
return self._generate_sample()
return self.feat, self.feat_lens, self.txt, self.txt_lens
def next(self):
return self.__next__()
def __iter__(self):
return self
View file
@ -0,0 +1,397 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import nvidia.dali
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import multiprocessing
import numpy as np
import torch
import math
import itertools
class DaliPipeline(nvidia.dali.pipeline.Pipeline):
def __init__(self, *,
train_pipeline: bool, # True if train pipeline, False if validation pipeline
device_id,
num_threads,
batch_size,
file_root: str,
file_list: str,
sample_rate,
discrete_resample_range: bool,
resample_range: list,
window_size,
window_stride,
nfeatures,
nfft,
frame_splicing_factor,
dither_coeff,
silence_threshold,
preemph_coeff,
pad_align,
max_duration,
mask_time_num_regions,
mask_time_min,
mask_time_max,
mask_freq_num_regions,
mask_freq_min,
mask_freq_max,
mask_both_num_regions,
mask_both_min_time,
mask_both_max_time,
mask_both_min_freq,
mask_both_max_freq,
preprocessing_device="gpu"):
super().__init__(batch_size, num_threads, device_id)
self._dali_init_log(locals())
if torch.distributed.is_initialized():
shard_id = torch.distributed.get_rank()
n_shards = torch.distributed.get_world_size()
else:
shard_id = 0
n_shards = 1
self.preprocessing_device = preprocessing_device.lower()
assert self.preprocessing_device == "cpu" or self.preprocessing_device == "gpu", \
"Incorrect preprocessing device. Please choose either 'cpu' or 'gpu'"
self.frame_splicing_factor = frame_splicing_factor
assert frame_splicing_factor == 1, "DALI doesn't support frame splicing operation"
self.resample_range = resample_range
self.discrete_resample_range = discrete_resample_range
self.train = train_pipeline
self.sample_rate = sample_rate
self.dither_coeff = dither_coeff
self.nfeatures = nfeatures
self.max_duration = max_duration
self.mask_params = {
'time_num_regions': mask_time_num_regions,
'time_min': mask_time_min,
'time_max': mask_time_max,
'freq_num_regions': mask_freq_num_regions,
'freq_min': mask_freq_min,
'freq_max': mask_freq_max,
'both_num_regions': mask_both_num_regions,
'both_min_time': mask_both_min_time,
'both_max_time': mask_both_max_time,
'both_min_freq': mask_both_min_freq,
'both_max_freq': mask_both_max_freq,
}
self.do_remove_silence = True if silence_threshold is not None else False
self.read = ops.FileReader(device="cpu", file_root=file_root, file_list=file_list, shard_id=shard_id,
num_shards=n_shards, shuffle_after_epoch=train_pipeline)
# TODO change ExternalSource to Uniform for new DALI release
if discrete_resample_range and resample_range is not None:
self.speed_perturbation_coeffs = ops.ExternalSource(device="cpu", cycle=True,
source=self._discrete_resample_coeffs_generator)
elif resample_range is not None:
self.speed_perturbation_coeffs = ops.Uniform(device="cpu", range=resample_range)
else:
self.speed_perturbation_coeffs = None
self.decode = ops.AudioDecoder(device="cpu", sample_rate=self.sample_rate if resample_range is None else None,
dtype=types.FLOAT, downmix=True)
self.normal_distribution = ops.NormalDistribution(device=preprocessing_device)
self.preemph = ops.PreemphasisFilter(device=preprocessing_device, preemph_coeff=preemph_coeff)
self.spectrogram = ops.Spectrogram(device=preprocessing_device, nfft=nfft,
window_length=window_size * sample_rate,
window_step=window_stride * sample_rate)
self.mel_fbank = ops.MelFilterBank(device=preprocessing_device, sample_rate=sample_rate, nfilter=self.nfeatures,
normalize=True)
self.log_features = ops.ToDecibels(device=preprocessing_device, multiplier=np.log(10), reference=1.0,
cutoff_db=math.log(1e-20))
self.get_shape = ops.Shapes(device=preprocessing_device)
self.normalize = ops.Normalize(device=preprocessing_device, axes=[1])
self.pad = ops.Pad(device=preprocessing_device, axes=[1], fill_value=0, align=pad_align)
# Silence trimming
self.get_nonsilent_region = ops.NonsilentRegion(device="cpu", cutoff_db=silence_threshold)
self.trim_silence = ops.Slice(device="cpu", normalized_anchor=False, normalized_shape=False, axes=[0])
self.to_float = ops.Cast(device="cpu", dtype=types.FLOAT)
# Spectrogram masking
self.spectrogram_cutouts = ops.ExternalSource(source=self._cutouts_generator, num_outputs=2, cycle=True)
self.mask_spectrogram = ops.Erase(device=preprocessing_device, axes=[0, 1], fill_value=0,
normalized_anchor=True)
@classmethod
def from_config(cls, train_pipeline: bool, device_id, batch_size, file_root: str, file_list: str, config_data: dict,
config_features: dict, device_type: str = "gpu", do_resampling: bool = True,
num_cpu_threads=multiprocessing.cpu_count()):
max_duration = config_data['max_duration']
sample_rate = config_data['sample_rate']
silence_threshold = -60 if config_data['trim_silence'] else None
# TODO Take into account resampling probability
# TODO config_features['speed_perturbation']['p']
if do_resampling and config_data['speed_perturbation'] is not None:
resample_range = [config_data['speed_perturbation']['min_rate'],
config_data['speed_perturbation']['max_rate']]
discrete_resample_range = config_data['speed_perturbation']['discrete']
else:
resample_range = None
discrete_resample_range = False
window_size = config_features['window_size']
window_stride = config_features['window_stride']
nfeatures = config_features['n_filt']
nfft = config_features['n_fft']
frame_splicing_factor = config_features['frame_splicing']
dither_coeff = config_features['dither']
pad_align = config_features['pad_align']
pad_to_max_duration = config_features['pad_to_max_duration']
assert not pad_to_max_duration, "Padding to max duration currently not supported in DALI"
preemph_coeff = .97
config_spec = config_features['spec_augment']
if config_spec is not None:
mask_time_num_regions = config_spec['time_masks']
mask_time_min = config_spec['min_time']
mask_time_max = config_spec['max_time']
mask_freq_num_regions = config_spec['freq_masks']
mask_freq_min = config_spec['min_freq']
mask_freq_max = config_spec['max_freq']
else:
mask_time_num_regions = 0
mask_time_min = 0
mask_time_max = 0
mask_freq_num_regions = 0
mask_freq_min = 0
mask_freq_max = 0
config_cutout = config_features['cutout_augment']
if config_cutout is not None:
mask_both_num_regions = config_cutout['masks']
mask_both_min_time = config_cutout['min_time']
mask_both_max_time = config_cutout['max_time']
mask_both_min_freq = config_cutout['min_freq']
mask_both_max_freq = config_cutout['max_freq']
else:
mask_both_num_regions = 0
mask_both_min_time = 0
mask_both_max_time = 0
mask_both_min_freq = 0
mask_both_max_freq = 0
return cls(train_pipeline=train_pipeline,
device_id=device_id,
preprocessing_device=device_type,
num_threads=num_cpu_threads,
batch_size=batch_size,
file_root=file_root,
file_list=file_list,
sample_rate=sample_rate,
discrete_resample_range=discrete_resample_range,
resample_range=resample_range,
window_size=window_size,
window_stride=window_stride,
nfeatures=nfeatures,
nfft=nfft,
frame_splicing_factor=frame_splicing_factor,
dither_coeff=dither_coeff,
silence_threshold=silence_threshold,
preemph_coeff=preemph_coeff,
pad_align=pad_align,
max_duration=max_duration,
mask_time_num_regions=mask_time_num_regions,
mask_time_min=mask_time_min,
mask_time_max=mask_time_max,
mask_freq_num_regions=mask_freq_num_regions,
mask_freq_min=mask_freq_min,
mask_freq_max=mask_freq_max,
mask_both_num_regions=mask_both_num_regions,
mask_both_min_time=mask_both_min_time,
mask_both_max_time=mask_both_max_time,
mask_both_min_freq=mask_both_min_freq,
mask_both_max_freq=mask_both_max_freq)
@staticmethod
def _dali_init_log(args: dict):
if (not torch.distributed.is_initialized() or (
torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)): # print once
max_len = max([len(ii) for ii in args.keys()])
fmt_string = '\t%' + str(max_len) + 's : %s'
print('Initializing DALI with parameters:')
for keyPair in sorted(args.items()):
print(fmt_string % keyPair)
@staticmethod
def _div_ceil(dividend, divisor):
return (dividend + (divisor - 1)) // divisor
def _get_audio_len(self, inp):
return self.get_shape(inp) if self.frame_splicing_factor == 1 else \
self._div_ceil(self.get_shape(inp), self.frame_splicing_factor)
def _remove_silence(self, inp):
begin, length = self.get_nonsilent_region(inp)
out = self.trim_silence(inp, self.to_float(begin), self.to_float(length))
return out
def _do_spectrogram_masking(self):
return self.mask_params['time_num_regions'] > 0 or self.mask_params['freq_num_regions'] > 0 or \
self.mask_params['both_num_regions'] > 0
@staticmethod
def _interleave_lists(*lists):
"""
[*, **, ***], [1, 2, 3], [a, b, c] -> [*, 1, a, **, 2, b, ***, 3, c]
Returns:
iterator over interleaved list
"""
assert all((len(lists[0]) == len(test_l) for test_l in lists)), "All lists have to have the same length"
return itertools.chain(*zip(*lists))
def _generate_cutouts(self):
"""
Returns:
Generates anchors and shapes of the cutout regions.
Single call generates one batch of data.
The output shall be passed to DALI's Erase operator
anchors = [f0 t0 f1 t1 ...]
shapes = [f0w t0h f1w t1h ...]
"""
MAX_TIME_DIMENSION = 20 * 16000
freq_anchors = np.random.random(self.mask_params['freq_num_regions'])
time_anchors = np.random.random(self.mask_params['time_num_regions'])
both_anchors_freq = np.random.random(self.mask_params['both_num_regions'])
both_anchors_time = np.random.random(self.mask_params['both_num_regions'])
anchors = []
for anch in freq_anchors:
anchors.extend([anch, 0])
for anch in time_anchors:
anchors.extend([0, anch])
for t, f in zip(both_anchors_time, both_anchors_freq):
anchors.extend([f, t])
shapes = []
shapes.extend(
self._interleave_lists(
np.random.randint(self.mask_params['freq_min'], self.mask_params['freq_max'] + 1,
self.mask_params['freq_num_regions']),
# XXX: Here, a time dimension of the spectrogram shall be passed.
# However, in DALI ArgumentInput can't come from GPU.
# So we leave the job for Erase (masking operator) to get it together.
[int(MAX_TIME_DIMENSION)] * self.mask_params['freq_num_regions']
)
)
shapes.extend(
self._interleave_lists(
[self.nfeatures] * self.mask_params['time_num_regions'],
np.random.randint(self.mask_params['time_min'], self.mask_params['time_max'] + 1,
self.mask_params['time_num_regions'])
)
)
shapes.extend(
self._interleave_lists(
np.random.randint(self.mask_params['both_min_freq'], self.mask_params['both_max_freq'] + 1,
self.mask_params['both_num_regions']),
np.random.randint(self.mask_params['both_min_time'], self.mask_params['both_max_time'] + 1,
self.mask_params['both_num_regions'])
)
)
return anchors, shapes
def _discrete_resample_coeffs_generator(self):
"""
Generate resample coeffs from discrete set
"""
yield np.random.choice([self.resample_range[0], 1.0, self.resample_range[1]],
size=self.batch_size).astype('float32')
def _cutouts_generator(self):
"""
Generator that wraps cutout creation in order to randomize inputs
and allow passing them to DALI's ExternalSource operator
"""
def tuples2list(tuples: list):
"""
[(a, b), (c, d)] -> [[a, c], [b, d]]
"""
return map(list, zip(*tuples))
[anchors, shapes] = tuples2list([self._generate_cutouts() for _ in range(self.batch_size)])
yield np.array(anchors, dtype=np.float32), np.array(shapes, dtype=np.float32)
def define_graph(self):
audio, label = self.read()
if not self.train or self.speed_perturbation_coeffs is None:
audio, sr = self.decode(audio)
else:
resample_coeffs = self.speed_perturbation_coeffs() * self.sample_rate
audio, sr = self.decode(audio, sample_rate=resample_coeffs)
if self.do_remove_silence:
audio = self._remove_silence(audio)
# Max duration drop is performed at DataLayer stage
if self.preprocessing_device == "gpu":
audio = audio.gpu()
if self.dither_coeff != 0.:
audio = audio + self.normal_distribution(audio) * self.dither_coeff
audio = self.preemph(audio)
audio = self.spectrogram(audio)
audio = self.mel_fbank(audio)
audio = self.log_features(audio)
audio_len = self._get_audio_len(audio)
audio = self.normalize(audio)
audio = self.pad(audio)
if self.train and self._do_spectrogram_masking():
anchors, shapes = self.spectrogram_cutouts()
audio = self.mask_spectrogram(audio, anchor=anchors, shape=shapes)
# When modifying DALI pipeline returns, make sure you update `output_map` in DALIGenericIterator invocation
return audio.gpu(), label.gpu(), audio_len.gpu()
class DaliTritonPipeline(DaliPipeline):
def __init__(self, **kwargs):
super().__init__(**kwargs)
assert not kwargs['train_pipeline'], "Pipeline for Triton shall be a validation pipeline"
if torch.distributed.is_initialized():
raise RuntimeError(
"You're creating Triton pipeline, using multi-process mode. Please use single-process mode.")
self.read = ops.ExternalSource(name="DALI_INPUT_0", no_copy=True, device="cpu")
def serialize_dali_triton_pipeline(output_path: str, config_data: dict, config_features: dict):
pipe = DaliTritonPipeline.from_config(train_pipeline=False, device_id=-1, batch_size=-1, file_root=None,
file_list=None, config_data=config_data, config_features=config_features,
do_resampling=False, num_cpu_threads=-1)
pipe.serialize(filename=output_path)
View file
@ -0,0 +1,234 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from pathlib import Path
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from .audio import (audio_from_file, AudioSegment, GainPerturbation,
ShiftPerturbation, SpeedPerturbation)
from .text import _clean_text, punctuation_map
def normalize_string(s, labels, punct_map):
"""Normalizes string.
Example:
'call me at 8:00 pm!' -> 'call me at eight zero pm'
"""
labels = set(labels)
try:
text = _clean_text(s, ["english_cleaners"], punct_map).strip()
return ''.join([tok for tok in text if all(t in labels for t in tok)])
except:
print(f"WARNING: Normalizing failed: {s}")
return None
class FilelistDataset(Dataset):
def __init__(self, filelist_fpath):
self.samples = [line.strip() for line in open(filelist_fpath, 'r')]
def __len__(self):
return len(self.samples)
def __getitem__(self, index):
audio, audio_len = audio_from_file(self.samples[index])
return (audio.squeeze(0), audio_len, torch.LongTensor([0]),
torch.LongTensor([0]))
class SingleAudioDataset(FilelistDataset):
def __init__(self, audio_fpath):
self.samples = [audio_fpath]
class AudioDataset(Dataset):
def __init__(self, data_dir, manifest_fpaths, labels,
sample_rate=16000, min_duration=0.1, max_duration=float("inf"),
pad_to_max_duration=False, max_utts=0, normalize_transcripts=True,
sort_by_duration=False, trim_silence=False,
speed_perturbation=None, gain_perturbation=None,
shift_perturbation=None, ignore_offline_speed_perturbation=False):
"""Loads audio, transcript and durations listed in a .json file.
Args:
data_dir: absolute path to dataset folder
manifest_fpaths: relative path(s) from the dataset folder
to the manifest JSON file(s) described above.
labels (str): all possible output symbols
min_duration (int): skip audio shorter than threshold
max_duration (int): skip audio longer than threshold
pad_to_max_duration (bool): pad all sequences to max_duration
max_utts (int): limit number of utterances
normalize_transcripts (bool): normalize transcript text
sort_by_duration (bool): sort sequences by increasing duration
trim_silence (bool): trim leading and trailing silence from audio
ignore_offline_speed_perturbation (bool): use precomputed speed perturbation
Returns:
tuple of Tensors
"""
self.data_dir = data_dir
self.labels = labels
self.labels_map = dict([(labels[i], i) for i in range(len(labels))])
self.punctuation_map = punctuation_map(labels)
self.blank_index = len(labels)
self.pad_to_max_duration = pad_to_max_duration
self.sort_by_duration = sort_by_duration
self.max_utts = max_utts
self.normalize_transcripts = normalize_transcripts
self.ignore_offline_speed_perturbation = ignore_offline_speed_perturbation
self.min_duration = min_duration
self.max_duration = max_duration
self.trim_silence = trim_silence
self.sample_rate = sample_rate
perturbations = []
if speed_perturbation is not None:
perturbations.append(SpeedPerturbation(**speed_perturbation))
if gain_perturbation is not None:
perturbations.append(GainPerturbation(**gain_perturbation))
if shift_perturbation is not None:
perturbations.append(ShiftPerturbation(**shift_perturbation))
self.perturbations = perturbations
self.max_duration = max_duration
self.samples = []
self.duration = 0.0
self.duration_filtered = 0.0
for fpath in manifest_fpaths:
self._load_json_manifest(fpath)
if sort_by_duration:
self.samples = sorted(self.samples, key=lambda s: s['duration'])
def __getitem__(self, index):
s = self.samples[index]
rn_indx = np.random.randint(len(s['audio_filepath']))
duration = s['audio_duration'][rn_indx] if 'audio_duration' in s else 0
offset = s.get('offset', 0)
segment = AudioSegment(
s['audio_filepath'][rn_indx], target_sr=self.sample_rate,
offset=offset, duration=duration, trim=self.trim_silence)
for p in self.perturbations:
p.maybe_apply(segment, self.sample_rate)
segment = torch.FloatTensor(segment.samples)
return (segment,
torch.tensor(segment.shape[0]).int(),
torch.tensor(s["transcript"]),
torch.tensor(len(s["transcript"])).int())
def __len__(self):
return len(self.samples)
def _load_json_manifest(self, fpath):
for s in json.load(open(fpath, "r", encoding="utf-8")):
if self.pad_to_max_duration and not self.ignore_offline_speed_perturbation:
# require all perturbed samples to be < self.max_duration
s_max_duration = max(f['duration'] for f in s['files'])
else:
# otherwise we allow perturbed samples to be > self.max_duration
s_max_duration = s['original_duration']
s['duration'] = s.pop('original_duration')
if not (self.min_duration <= s_max_duration <= self.max_duration):
self.duration_filtered += s['duration']
continue
# Prune and normalize according to transcript
tr = (s.get('transcript', None) or
self.load_transcript(s['text_filepath']))
if not isinstance(tr, str):
print(f'WARNING: Skipped sample (transcript not a str): {tr}.')
self.duration_filtered += s['duration']
continue
if self.normalize_transcripts:
tr = normalize_string(tr, self.labels, self.punctuation_map)
s["transcript"] = self.to_vocab_inds(tr)
files = s.pop('files')
if self.ignore_offline_speed_perturbation:
files = [f for f in files if f['speed'] == 1.0]
s['audio_duration'] = [f['duration'] for f in files]
s['audio_filepath'] = [str(Path(self.data_dir, f['fname']))
for f in files]
self.samples.append(s)
self.duration += s['duration']
if self.max_utts > 0 and len(self.samples) >= self.max_utts:
print(f'Reached max_utts={self.max_utts}. Finished parsing {fpath}.')
break
def load_transcript(self, transcript_path):
with open(transcript_path, 'r', encoding="utf-8") as transcript_file:
transcript = transcript_file.read().replace('\n', '')
return transcript
def to_vocab_inds(self, transcript):
chars = [self.labels_map.get(x, self.blank_index) for x in list(transcript)]
transcript = list(filter(lambda x: x != self.blank_index, chars))
return transcript
def collate_fn(batch):
bs = len(batch)
max_len = lambda l, idx: max(el[idx].size(0) for el in l)
audio = torch.zeros(bs, max_len(batch, 0))
audio_lens = torch.zeros(bs, dtype=torch.int32)
transcript = torch.zeros(bs, max_len(batch, 2))
transcript_lens = torch.zeros(bs, dtype=torch.int32)
for i, sample in enumerate(batch):
audio[i].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
audio_lens[i] = sample[1]
transcript[i].narrow(0, 0, sample[2].size(0)).copy_(sample[2])
transcript_lens[i] = sample[3]
return audio, audio_lens, transcript, transcript_lens
def get_data_loader(dataset, batch_size, multi_gpu=True, shuffle=True,
drop_last=True, num_workers=4):
kw = {'dataset': dataset, 'collate_fn': collate_fn,
'num_workers': num_workers, 'pin_memory': True}
if multi_gpu:
loader_shuffle = False
sampler = DistributedSampler(dataset, shuffle=shuffle)
else:
loader_shuffle = shuffle
sampler = None
return DataLoader(batch_size=batch_size, drop_last=drop_last,
sampler=sampler, shuffle=loader_shuffle, **kw)
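# Illustrative wiring of the native (non-DALI) data pipeline; paths and values
# are placeholders, not a verified recipe:
#   dataset = AudioDataset('/datasets/LibriSpeech',
#                          ['librispeech-train-clean-100-wav.json'],
#                          labels=symbols, max_duration=16.7)
#   loader = get_data_loader(dataset, batch_size=64, multi_gpu=False)
#   for audio, audio_lens, transcript, transcript_lens in loader:
#       ...  # raw padded waveforms; feature extraction happens later on the GPU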
View file
@ -0,0 +1,293 @@
import math
import random
import librosa
import torch
import torch.nn as nn
from apex import amp
class BaseFeatures(nn.Module):
"""Base class for GPU accelerated audio preprocessing."""
__constants__ = ["pad_align", "pad_to_max_duration", "max_len"]
def __init__(self, pad_align, pad_to_max_duration, max_duration,
sample_rate, window_size, window_stride, spec_augment=None,
cutout_augment=None):
super(BaseFeatures, self).__init__()
self.pad_align = pad_align
self.pad_to_max_duration = pad_to_max_duration
self.win_length = int(sample_rate * window_size) # frame size
self.hop_length = int(sample_rate * window_stride)
# Calculate maximum sequence length (# frames)
if pad_to_max_duration:
self.max_len = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
if spec_augment is not None:
self.spec_augment = SpecAugment(**spec_augment)
else:
self.spec_augment = None
if cutout_augment is not None:
self.cutout_augment = CutoutAugment(**cutout_augment)
else:
self.cutout_augment = None
@torch.no_grad()
def calculate_features(self, audio, audio_lens):
return audio, audio_lens
def __call__(self, audio, audio_lens, optim_level=0):
dtype = audio.dtype
audio = audio.float()
if optim_level == 1:
with amp.disable_casts():
feat, feat_lens = self.calculate_features(audio, audio_lens)
else:
feat, feat_lens = self.calculate_features(audio, audio_lens)
feat = self.apply_padding(feat)
if self.cutout_augment is not None:
feat = self.cutout_augment(feat)
if self.spec_augment is not None:
feat = self.spec_augment(feat)
feat = feat.to(dtype)
return feat, feat_lens
def apply_padding(self, x):
if self.pad_to_max_duration:
x_size = max(x.size(-1), self.max_len)
else:
x_size = x.size(-1)
if self.pad_align > 0:
pad_amt = x_size % self.pad_align
else:
pad_amt = 0
padded_len = x_size + (self.pad_align - pad_amt if pad_amt > 0 else 0)
return nn.functional.pad(x, (0, padded_len - x.size(-1)))
class SpecAugment(nn.Module):
"""Spec augment. refer to https://arxiv.org/abs/1904.08779
"""
def __init__(self, freq_masks=0, min_freq=0, max_freq=10, time_masks=0,
min_time=0, max_time=10):
super(SpecAugment, self).__init__()
assert 0 <= min_freq <= max_freq
assert 0 <= min_time <= max_time
self.freq_masks = freq_masks
self.min_freq = min_freq
self.max_freq = max_freq
self.time_masks = time_masks
self.min_time = min_time
self.max_time = max_time
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool, device=x.device)
for idx in range(sh[0]):
for _ in range(self.freq_masks):
w = torch.randint(self.min_freq, self.max_freq + 1, size=(1,)).item()
f0 = torch.randint(0, max(1, sh[1] - w), size=(1,))
mask[idx, f0:f0+w] = 1
for _ in range(self.time_masks):
w = torch.randint(self.min_time, self.max_time + 1, size=(1,)).item()
t0 = torch.randint(0, max(1, sh[2] - w), size=(1,))
mask[idx, :, t0:t0+w] = 1
return x.masked_fill(mask, 0)
class CutoutAugment(nn.Module):
"""Cutout. refer to https://arxiv.org/pdf/1708.04552.pdf
"""
def __init__(self, masks=0, min_freq=20, max_freq=20, min_time=5, max_time=5):
super(CutoutAugment, self).__init__()
assert 0 <= min_freq <= max_freq
assert 0 <= min_time <= max_time
self.masks = masks
self.min_freq = min_freq
self.max_freq = max_freq
self.min_time = min_time
self.max_time = max_time
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool, device=x.device)
for idx in range(sh[0]):
for i in range(self.masks):
w = torch.randint(self.min_freq, self.max_freq + 1, size=(1,)).item()
h = torch.randint(self.min_time, self.max_time + 1, size=(1,)).item()
f0 = int(random.uniform(0, sh[1] - w))
t0 = int(random.uniform(0, sh[2] - h))
mask[idx, f0:f0+w, t0:t0+h] = 1
return x.masked_fill(mask, 0)
@torch.jit.script
def normalize_batch(x, seq_len, normalize_type: str):
# print ("normalize_batch: x, seq_len, shapes: ", x.shape, seq_len, seq_len.shape)
if normalize_type == "per_feature":
x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
for i in range(x.shape[0]):
x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
elif normalize_type == "all_features":
x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
for i in range(x.shape[0]):
x_mean[i] = x[i, :, :int(seq_len[i])].mean()
x_std[i] = x[i, :, :int(seq_len[i])].std()
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
else:
return x
@torch.jit.script
def splice_frames(x, frame_splicing: int):
""" Stacks frames together across feature dim
input is batch_size, feature_dim, num_frames
output is batch_size, feature_dim*frame_splicing, num_frames
"""
seq = [x]
# TORCHSCRIPT: JIT doesn't like range(start, stop)
for n in range(frame_splicing - 1):
seq.append(torch.cat([x[:, :, :n + 1], x[:, :, n + 1:]], dim=2))
return torch.cat(seq, dim=1)
class FilterbankFeatures(BaseFeatures):
# For JIT, https://pytorch.org/docs/stable/jit.html#python-defined-constants
__constants__ = ["dither", "preemph", "n_fft", "hop_length", "win_length",
"log", "frame_splicing", "normalize"]
# torchscript: "center" removed due to a bug
def __init__(self, spec_augment=None, cutout_augment=None,
sample_rate=8000, window_size=0.02, window_stride=0.01,
window="hamming", normalize="per_feature", n_fft=None,
preemph=0.97, n_filt=64, lowfreq=0, highfreq=None, log=True,
dither=1e-5, pad_align=8, pad_to_max_duration=False,
max_duration=float('inf'), frame_splicing=1):
super(FilterbankFeatures, self).__init__(
pad_align=pad_align, pad_to_max_duration=pad_to_max_duration,
max_duration=max_duration, sample_rate=sample_rate,
window_size=window_size, window_stride=window_stride,
spec_augment=spec_augment, cutout_augment=cutout_augment)
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
self.normalize = normalize
self.log = log
#TORCHSCRIPT: Check whether or not we need this
self.dither = dither
self.frame_splicing = frame_splicing
self.n_filt = n_filt
self.preemph = preemph
highfreq = highfreq or sample_rate / 2
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
filterbanks = torch.tensor(
librosa.filters.mel(sample_rate, self.n_fft, n_mels=n_filt,
fmin=lowfreq, fmax=highfreq),
dtype=torch.float).unsqueeze(0)
# torchscript
self.register_buffer("fb", filterbanks)
self.register_buffer("window", window_tensor)
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
# do stft
# TORCHSCRIPT: center removed due to bug
def stft(self, x):
return torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length,
window=self.window.to(dtype=torch.float))
@torch.no_grad()
def calculate_features(self, x, seq_len):
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if self.preemph is not None:
x = torch.cat(
(x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]), dim=1)
x = self.stft(x)
# get power spectrum
x = x.pow(2).sum(-1)
# dot with filterbank energies
x = torch.matmul(self.fb.to(x.dtype), x)
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
raise ValueError('Frame splicing not supported')
# normalize if required
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch,
# pad to multiple of `pad_align` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len, dtype=seq_len.dtype, device=x.device)
mask = mask.expand(x.size(0), max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1), 0)
# TORCHSCRIPT: Is this del important? It breaks scripting
# del mask
return x.to(dtype), seq_len
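# A minimal usage sketch with hypothetical values, assuming a padded batch of
# raw 16 kHz waveforms and their true lengths in samples.
import torch
featurizer = FilterbankFeatures(sample_rate=16000, n_filt=64, normalize='per_feature')
audio = torch.randn(2, 16000)                  # two 1-second waveforms
audio_lens = torch.tensor([16000, 12800])      # valid lengths in samples
feats, feat_lens = featurizer.calculate_features(audio, audio_lens)
# feats has shape (batch, n_filt, frames) and is zero-masked past feat_lens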

View file

@ -0,0 +1,300 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import os
import re
from collections import OrderedDict
from apex import amp
import torch
import torch.distributed as dist
from .metrics import word_error_rate
def print_once(msg):
if not dist.is_initialized() or dist.get_rank() == 0:
print(msg)
def add_ctc_blank(symbols):
return symbols + ['<BLANK>']
def ctc_decoder_predictions_tensor(tensor, labels):
"""
Takes output of greedy ctc decoder and performs ctc decoding algorithm to
remove duplicates and special symbol. Returns prediction
Args:
tensor: model output tensor
        labels: A list of labels
Returns:
prediction
"""
blank_id = len(labels) - 1
hypotheses = []
labels_map = {i: labels[i] for i in range(len(labels))}
prediction_cpu_tensor = tensor.long().cpu()
# iterate over batch
for ind in range(prediction_cpu_tensor.shape[0]):
prediction = prediction_cpu_tensor[ind].numpy().tolist()
# CTC decoding procedure
decoded_prediction = []
previous = len(labels) - 1 # id of a blank symbol
for p in prediction:
if (p != previous or previous == blank_id) and p != blank_id:
decoded_prediction.append(p)
previous = p
hypothesis = ''.join([labels_map[c] for c in decoded_prediction])
hypotheses.append(hypothesis)
return hypotheses
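# A toy decoding sketch (hypothetical values). With labels ['a', 'b', 'c'] plus
# the blank appended by add_ctc_blank, the greedy sequence a a <blank> a b
# collapses to "aab": repeats are merged unless separated by a blank.
import torch
labels = ['a', 'b', 'c', '<BLANK>']
greedy_ids = torch.tensor([[0, 0, 3, 0, 1]])   # argmax over the model logits
print(ctc_decoder_predictions_tensor(greedy_ids, labels))   # ['aab']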
def greedy_wer(preds, tgt, tgt_lens, labels):
"""
    Takes output of the greedy CTC decoder, removes duplicates and the blank
    symbol, and computes the WER against the reference transcripts.
    Args:
        preds: predictions tensor from the greedy decoder
        tgt: targets tensor
        tgt_lens: target lengths tensor
        labels: A list of labels
    Returns:
        word error rate, an example hypothesis, an example reference
"""
with torch.no_grad():
references = gather_transcripts([tgt], [tgt_lens], labels)
hypotheses = ctc_decoder_predictions_tensor(preds, labels)
wer, _, _ = word_error_rate(hypotheses, references)
return wer, hypotheses[0], references[0]
def gather_losses(losses_list):
return [torch.mean(torch.stack(losses_list))]
def gather_predictions(predictions_list, labels):
results = []
for prediction in predictions_list:
results += ctc_decoder_predictions_tensor(prediction, labels=labels)
return results
def gather_transcripts(transcript_list, transcript_len_list, labels):
results = []
labels_map = {i: labels[i] for i in range(len(labels))}
# iterate over workers
for txt, lens in zip(transcript_list, transcript_len_list):
for t, l in zip(txt.long().cpu(), lens.long().cpu()):
t = list(t.numpy())
results.append(''.join([labels_map[c] for c in t[:l]]))
return results
def process_evaluation_batch(tensors, global_vars, labels):
"""
    Processes results of an iteration and saves them in global_vars
    Args:
        tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcripts, and output
        global_vars: dictionary where processed results of the iteration are saved
labels: A list of labels
"""
for kv, v in tensors.items():
if kv.startswith('loss'):
global_vars['EvalLoss'] += gather_losses(v)
elif kv.startswith('predictions'):
global_vars['preds'] += gather_predictions(v, labels)
elif kv.startswith('transcript_length'):
transcript_len_list = v
elif kv.startswith('transcript'):
transcript_list = v
elif kv.startswith('output'):
global_vars['logits'] += v
global_vars['txts'] += gather_transcripts(
transcript_list, transcript_len_list, labels)
def process_evaluation_epoch(aggregates, tag=None):
"""
    Processes results from each worker at the end of evaluation and combines them into the final result
    Args:
        aggregates: dictionary containing information from the entire evaluation
    Returns:
wer: final word error rate
loss: final loss
"""
if 'losses' in aggregates:
eloss = torch.mean(torch.stack(aggregates['losses'])).item()
else:
eloss = None
hypotheses = aggregates['preds']
references = aggregates['txts']
wer, scores, num_words = word_error_rate(hypotheses, references)
multi_gpu = dist.is_initialized()
if multi_gpu:
if eloss is not None:
eloss /= dist.get_world_size()
eloss_tensor = torch.tensor(eloss).cuda()
dist.all_reduce(eloss_tensor)
eloss = eloss_tensor.item()
scores_tensor = torch.tensor(scores).cuda()
dist.all_reduce(scores_tensor)
scores = scores_tensor.item()
num_words_tensor = torch.tensor(num_words).cuda()
dist.all_reduce(num_words_tensor)
num_words = num_words_tensor.item()
wer = scores * 1.0 / num_words
return wer, eloss
def num_weights(module):
return sum(p.numel() for p in module.parameters() if p.requires_grad)
def convert_v1_state_dict(state_dict):
rules = [
('^jasper_encoder.encoder.', 'encoder.layers.'),
('^jasper_decoder.decoder_layers.', 'decoder.layers.'),
]
ret = {}
for k, v in state_dict.items():
if k.startswith('acoustic_model.'):
continue
if k.startswith('audio_preprocessor.'):
continue
for pattern, to in rules:
k = re.sub(pattern, to, k)
ret[k] = v
return ret
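# A hedged usage sketch (the checkpoint path is hypothetical): converting a
# pre-refactor ("v1") state dict renames encoder/decoder keys and drops the
# acoustic_model/audio_preprocessor entries.
import torch
ckpt = torch.load('jasper_v1_checkpoint.pt', map_location='cpu')
new_state_dict = convert_v1_state_dict(ckpt['state_dict'])
# e.g. 'jasper_encoder.encoder.0.conv.weight' -> 'encoder.layers.0.conv.weight'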
class Checkpointer(object):
def __init__(self, save_dir, model_name, keep_milestones=[100,200,300],
use_amp=False):
self.save_dir = save_dir
self.keep_milestones = keep_milestones
self.use_amp = use_amp
self.model_name = model_name
tracked = [
            (int(re.search(r'epoch(\d+)_', f).group(1)), f)
for f in glob.glob(f'{save_dir}/{self.model_name}_epoch*_checkpoint.pt')]
tracked = sorted(tracked, key=lambda t: t[0])
self.tracked = OrderedDict(tracked)
def save(self, model, ema_model, optimizer, epoch, step, best_wer,
is_best=False):
"""Saves model checkpoint for inference/resuming training.
Args:
model: the model, optionally wrapped by DistributedDataParallel
ema_model: model with averaged weights, can be None
optimizer: optimizer
epoch (int): epoch during which the model is saved
step (int): number of steps since beginning of training
best_wer (float): lowest recorded WER on the dev set
is_best (bool, optional): set name of checkpoint to 'best'
and overwrite the previous one
"""
rank = 0
if dist.is_initialized():
dist.barrier()
rank = dist.get_rank()
if rank != 0:
return
# Checkpoint already saved
if not is_best and epoch in self.tracked:
return
unwrap_ddp = lambda model: getattr(model, 'module', model)
state = {
'epoch': epoch,
'step': step,
'best_wer': best_wer,
'state_dict': unwrap_ddp(model).state_dict(),
'ema_state_dict': unwrap_ddp(ema_model).state_dict() if ema_model is not None else None,
'optimizer': optimizer.state_dict(),
'amp': amp.state_dict() if self.use_amp else None,
}
if is_best:
fpath = os.path.join(
self.save_dir, f"{self.model_name}_best_checkpoint.pt")
else:
fpath = os.path.join(
self.save_dir, f"{self.model_name}_epoch{epoch}_checkpoint.pt")
print_once(f"Saving {fpath}...")
torch.save(state, fpath)
if not is_best:
# Remove old checkpoints; keep milestones and the last two
self.tracked[epoch] = fpath
for epoch in set(list(self.tracked)[:-2]) - set(self.keep_milestones):
try:
os.remove(self.tracked[epoch])
except:
pass
del self.tracked[epoch]
    def last_checkpoint(self):
        tracked = list(self.tracked.values())
        if len(tracked) >= 1:
            try:
                torch.load(tracked[-1], map_location='cpu')
                return tracked[-1]
            except Exception:
                print_once(f'Last checkpoint {tracked[-1]} appears corrupted.')
        # Fall back to the previous checkpoint if the newest one is unreadable
        if len(tracked) >= 2:
            return tracked[-2]
        return None
def load(self, fpath, model, ema_model, optimizer, meta):
print_once(f'Loading model from {fpath}')
checkpoint = torch.load(fpath, map_location="cpu")
unwrap_ddp = lambda model: getattr(model, 'module', model)
state_dict = convert_v1_state_dict(checkpoint['state_dict'])
unwrap_ddp(model).load_state_dict(state_dict, strict=True)
if ema_model is not None:
if checkpoint.get('ema_state_dict') is not None:
key = 'ema_state_dict'
else:
key = 'state_dict'
print_once('WARNING: EMA weights not found in the checkpoint.')
print_once('WARNING: Initializing EMA model with regular params.')
state_dict = convert_v1_state_dict(checkpoint[key])
unwrap_ddp(ema_model).load_state_dict(state_dict, strict=True)
optimizer.load_state_dict(checkpoint['optimizer'])
if self.use_amp:
amp.load_state_dict(checkpoint['amp'])
meta['start_epoch'] = checkpoint.get('epoch')
meta['best_wer'] = checkpoint.get('best_wer', meta['best_wer'])
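# A minimal usage sketch (hypothetical names and paths). The Checkpointer keeps
# the milestone epochs plus the two most recent checkpoints.
ckpt = Checkpointer(save_dir='results', model_name='Jasper',
                    keep_milestones=[100, 200, 300])
# inside the training loop:
#   ckpt.save(model, ema_model, optimizer, epoch, step, best_wer)
# to resume:
#   last = ckpt.last_checkpoint()
#   if last is not None:
#       ckpt.load(last, model, ema_model, optimizer,
#                 meta={'best_wer': float('inf'), 'start_epoch': 0})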

View file

@ -12,12 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
def __levenshtein(a, b):
"""Calculates the Levenshtein distance between two sequences."""
def __levenshtein(a: List, b: List) -> int:
"""Calculates the Levenshtein distance between a and b.
"""
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
@ -37,28 +35,18 @@ def __levenshtein(a: List, b: List) -> int:
return current[n]
def word_error_rate(hypotheses: List[str], references: List[str]) -> float:
"""
    Computes average Word Error Rate between two texts represented as
    corresponding lists of strings. Hypotheses and references must have the same length.
def word_error_rate(hypotheses, references):
"""Computes average Word Error Rate (WER) between two text lists."""
Args:
hypotheses: list of hypotheses
references: list of references
Returns:
(float) average word error rate
"""
scores = 0
words = 0
len_diff = len(references) - len(hypotheses)
len_diff = len(references) - len(hypotheses)
if len_diff > 0:
raise ValueError("In word error rate calculation, hypotheses and reference"
" lists must have the same number of elements. But I got:"
"{0} and {1} correspondingly".format(len(hypotheses), len(references)))
raise ValueError("Uneqal number of hypthoses and references: "
"{0} and {1}".format(len(hypotheses), len(references)))
elif len_diff < 0:
hypotheses = hypotheses[:len_diff]
for h, r in zip(hypotheses, references):
h_list = h.split()
r_list = r.split()

View file

@ -16,6 +16,51 @@ import torch
from torch.optim import Optimizer
import math
def lr_policy(step, epoch, initial_lr, optimizer, steps_per_epoch, warmup_epochs,
hold_epochs, num_epochs=None, policy='linear', min_lr=1e-5,
exp_gamma=None):
"""
learning rate decay
Args:
initial_lr: base learning rate
step: current iteration number
N: total number of iterations over which learning rate is decayed
lr_steps: list of steps to apply exp_gamma
"""
warmup_steps = warmup_epochs * steps_per_epoch
hold_steps = hold_epochs * steps_per_epoch
if policy == 'legacy':
assert num_epochs is not None
tot_steps = num_epochs * steps_per_epoch
if step < warmup_steps:
a = (step + 1) / (warmup_steps + 1)
elif step < warmup_steps + hold_steps:
a = 1.0
else:
a = (((tot_steps - step)
/ (tot_steps - warmup_steps - hold_steps)) ** 2)
elif policy == 'exponential':
assert exp_gamma is not None
if step < warmup_steps:
a = (step + 1) / (warmup_steps + 1)
elif step < warmup_steps + hold_steps:
a = 1.0
else:
a = exp_gamma ** (epoch - warmup_epochs - hold_epochs)
else:
        raise ValueError(f'Unknown learning rate policy: {policy}')
new_lr = max(a * initial_lr, min_lr)
for param_group in optimizer.param_groups:
param_group['lr'] = new_lr
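# A small sketch of the schedule (hypothetical values): with the 'exponential'
# policy the rate warms up linearly, holds, then decays by exp_gamma each epoch;
# lr_policy writes the new rate directly into optimizer.param_groups.
import torch
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.01)
for epoch in range(5):
    for it in range(100):
        step = epoch * 100 + it
        lr_policy(step, epoch, initial_lr=0.01, optimizer=opt,
                  steps_per_epoch=100, warmup_epochs=1, hold_epochs=1,
                  policy='exponential', exp_gamma=0.95)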
class AdamW(Optimizer):
"""Implements AdamW algorithm.
@ -114,6 +159,7 @@ class AdamW(Optimizer):
p.data.add_(torch.mul(p.data, group['weight_decay']).addcdiv_(1, exp_avg, denom), alpha=-step_size)
return loss
class Novograd(Optimizer):
"""

View file

@ -0,0 +1,159 @@
import atexit
import glob
import os
import re
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter
import dllogger
from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
tb_loggers = {}
class TBLogger:
"""
    dummies: adds empty 'aaa'/'zzz' plots that stretch the screen so the
    legend always fits for the other plots
"""
def __init__(self, enabled, log_dir, name, interval=1, dummies=True):
self.enabled = enabled
self.interval = interval
self.cache = {}
if self.enabled:
self.summary_writer = SummaryWriter(
log_dir=os.path.join(log_dir, name),
flush_secs=120, max_queue=200)
atexit.register(self.summary_writer.close)
if dummies:
for key in ('aaa', 'zzz'):
self.summary_writer.add_scalar(key, 0.0, 1)
def log(self, step, data):
for k, v in data.items():
self.log_value(step, k, v.item() if type(v) is torch.Tensor else v)
def log_value(self, step, key, val, stat='mean'):
if self.enabled:
if key not in self.cache:
self.cache[key] = []
self.cache[key].append(val)
if len(self.cache[key]) == self.interval:
agg_val = getattr(np, stat)(self.cache[key])
self.summary_writer.add_scalar(key, agg_val, step)
del self.cache[key]
def log_grads(self, step, model):
if self.enabled:
norms = [p.grad.norm().item() for p in model.parameters()
if p.grad is not None]
for stat in ('max', 'min', 'mean'):
self.log_value(step, f'grad_{stat}', getattr(np, stat)(norms),
stat=stat)
def unique_log_fpath(log_fpath):
if not os.path.isfile(log_fpath):
return log_fpath
# Avoid overwriting old logs
    saved = sorted([int(re.search(r'\.(\d+)', f).group(1))
for f in glob.glob(f'{log_fpath}.*')])
log_num = (saved[-1] if saved else 0) + 1
return f'{log_fpath}.{log_num}'
def stdout_step_format(step):
if isinstance(step, str):
return step
fields = []
if len(step) > 0:
fields.append("epoch {:>4}".format(step[0]))
if len(step) > 1:
fields.append("iter {:>4}".format(step[1]))
if len(step) > 2:
fields[-1] += "/{}".format(step[2])
return " | ".join(fields)
def stdout_metric_format(metric, metadata, value):
name = metadata.get("name", metric + " : ")
unit = metadata.get("unit", None)
format = f'{{{metadata.get("format", "")}}}'
fields = [name, format.format(value) if value is not None else value, unit]
fields = [f for f in fields if f is not None]
return "| " + " ".join(fields)
def init_log(args):
enabled = (args.local_rank == 0)
if enabled:
fpath = args.log_file or os.path.join(args.output_dir, 'nvlog.json')
backends = [JSONStreamBackend(Verbosity.DEFAULT,
unique_log_fpath(fpath)),
StdOutBackend(Verbosity.VERBOSE,
step_format=stdout_step_format,
metric_format=stdout_metric_format)]
else:
backends = []
dllogger.init(backends=backends)
dllogger.metadata("train_lrate", {"name": "lrate", "format": ":>3.2e"})
for id_, pref in [('train', ''), ('train_avg', 'avg train '),
('dev', ' avg dev '), ('dev_ema', ' EMA dev ')]:
dllogger.metadata(f"{id_}_loss",
{"name": f"{pref}loss", "format": ":>7.2f"})
dllogger.metadata(f"{id_}_wer",
{"name": f"{pref}wer", "format": ":>6.2f"})
dllogger.metadata(f"{id_}_throughput",
{"name": f"{pref}utts/s", "format": ":>5.0f"})
dllogger.metadata(f"{id_}_took",
{"name": "took", "unit": "s", "format": ":>5.2f"})
tb_subsets = ['train', 'dev', 'dev_ema'] if args.ema else ['train', 'dev']
global tb_loggers
tb_loggers = {s: TBLogger(enabled, args.output_dir, name=s)
for s in tb_subsets}
log_parameters(vars(args), tb_subset='train')
def log(step, tb_total_steps=None, subset='train', data={}):
if tb_total_steps is not None:
tb_loggers[subset].log(tb_total_steps, data)
if subset != '':
data = {f'{subset}_{key}': v for key,v in data.items()}
dllogger.log(step, data=data)
def log_grads_tb(tb_total_steps, grads, tb_subset='train'):
tb_loggers[tb_subset].log_grads(tb_total_steps, grads)
def log_parameters(data, verbosity=0, tb_subset=None):
for k,v in data.items():
dllogger.log(step="PARAMETER", data={k:v}, verbosity=verbosity)
if tb_subset is not None and tb_loggers[tb_subset].enabled:
tb_data = {k:v for k,v in data.items()
if type(v) in (str, bool, int, float)}
tb_loggers[tb_subset].summary_writer.add_hparams(tb_data, {})
def flush_log():
dllogger.flush()
for tbl in tb_loggers.values():
if tbl.enabled:
tbl.summary_writer.flush()
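# A hedged usage sketch for this logging module, assuming an argparse namespace
# with the fields init_log() reads and an existing output_dir (values below are
# hypothetical).
from argparse import Namespace
args = Namespace(local_rank=0, log_file=None, output_dir='results', ema=0)
init_log(args)
log((1, 10, 100), tb_total_steps=10, subset='train',
    data={'loss': 1.23, 'lrate': 1e-4})
flush_log()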

View file

@ -0,0 +1,32 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """
import re
import string
from . import cleaners
def _clean_text(text, cleaner_names, *args):
for name in cleaner_names:
cleaner = getattr(cleaners, name)
if not cleaner:
raise Exception('Unknown cleaner: %s' % name)
text = cleaner(text, *args)
return text
def punctuation_map(labels):
# Punctuation to remove
punctuation = string.punctuation
punctuation = punctuation.replace("+", "")
punctuation = punctuation.replace("&", "")
# TODO We might also want to consider:
# @ -> at
# # -> number, pound, hashtag
# ~ -> tilde
# _ -> underscore
# % -> percent
    # If a punctuation symbol is inside our vocab, we do not remove it from the text
for l in labels:
punctuation = punctuation.replace(l, "")
# Turn all punctuation to whitespace
table = str.maketrans(punctuation, " " * len(punctuation))
return table
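# A short usage sketch (hypothetical labels): the table maps punctuation that is
# not part of the vocabulary to spaces, leaving '+', '&' and in-vocab symbols alone.
labels = [" ", "a", "b", "c", "'"]
table = punctuation_map(labels)
print("it's a test, isn't it?".translate(table))   # "it's a test  isn't it "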

View file

@ -1,194 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,203 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = false
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,203 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = false
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = false
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,204 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,204 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 2
cutout_y_regions = 2
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform
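# A hedged sketch of how a YAML config like the one above could be consumed;
# the filename is hypothetical and the key layout simply mirrors the file shown.
import yaml
with open('jasper_speedp-online_speca.yaml') as f:
    cfg = yaml.safe_load(f)
print(cfg['labels'])                              # character vocabulary
print(len(cfg['jasper']['encoder']['blocks']))    # encoder blocks (YAML anchors expanded)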

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: false
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 0
max_freq: 20
time_masks: 0
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: false
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: false
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: false
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,144 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
speed_perturbation:
discrete: true
min_rate: 0.9
max_rate: 1.1
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 0
max_freq: 20
time_masks: 0
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,144 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
speed_perturbation:
discrete: true
min_rate: 0.9
max_rate: 1.1
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,144 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
speed_perturbation:
discrete: false
min_rate: 0.85
max_rate: 1.15
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -1,266 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file contains classes and functions related to data loading
"""
import torch
import numpy as np
import math
from torch.utils.data import Dataset, Sampler
import torch.distributed as dist
from parts.manifest import Manifest
from parts.features import WaveformFeaturizer
class DistributedBucketBatchSampler(Sampler):
def __init__(self, dataset, batch_size, num_replicas=None, rank=None):
"""Distributed sampler that buckets samples with similar length to minimize padding,
similar concept as pytorch BucketBatchSampler https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.samplers.html#torchnlp.samplers.BucketBatchSampler
Args:
dataset: Dataset used for sampling.
batch_size: data batch size
num_replicas (optional): Number of processes participating in
distributed training.
rank (optional): Rank of the current process within num_replicas.
"""
if num_replicas is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
num_replicas = dist.get_world_size()
if rank is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
rank = dist.get_rank()
self.dataset = dataset
self.dataset_size = len(dataset)
self.num_replicas = num_replicas
self.rank = rank
self.epoch = 0
self.batch_size = batch_size
self.tile_size = batch_size * self.num_replicas
self.num_buckets = 6
self.bucket_size = self.round_up_to(math.ceil(self.dataset_size / self.num_buckets), self.tile_size)
self.index_count = self.round_up_to(self.dataset_size, self.tile_size)
self.num_samples = self.index_count // self.num_replicas
def round_up_to(self, x, mod):
return (x + mod - 1) // mod * mod
def __iter__(self):
g = torch.Generator()
g.manual_seed(self.epoch)
indices = np.arange(self.index_count) % self.dataset_size
for bucket in range(self.num_buckets):
bucket_start = self.bucket_size * bucket
bucket_end = min(bucket_start + self.bucket_size, self.index_count)
indices[bucket_start:bucket_end] = indices[bucket_start:bucket_end][torch.randperm(bucket_end - bucket_start, generator=g)]
tile_indices = torch.randperm(self.index_count // self.tile_size, generator=g)
for tile_index in tile_indices:
start_index = self.tile_size * tile_index + self.batch_size * self.rank
end_index = start_index + self.batch_size
yield indices[start_index:end_index]
def __len__(self):
return self.num_samples
def set_epoch(self, epoch):
self.epoch = epoch
class data_prefetcher():
def __init__(self, loader):
self.loader = iter(loader)
self.stream = torch.cuda.Stream()
self.preload()
def preload(self):
try:
self.next_input = next(self.loader)
except StopIteration:
self.next_input = None
return
with torch.cuda.stream(self.stream):
self.next_input = [ x.cuda(non_blocking=True) for x in self.next_input]
def __next__(self):
torch.cuda.current_stream().wait_stream(self.stream)
input = self.next_input
self.preload()
return input
def next(self):
return self.__next__()
def __iter__(self):
return self
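# A minimal usage sketch (the dataset name is hypothetical; seq_collate_fn is
# defined below). The prefetcher stages the next batch on a side CUDA stream so
# host-to-device copies overlap with compute; iteration ends when it yields None.
loader = torch.utils.data.DataLoader(dataset, batch_size=16,
                                     collate_fn=seq_collate_fn, pin_memory=True)
prefetcher = data_prefetcher(loader)
batch = prefetcher.next()
while batch is not None:
    audio, audio_lens, transcript, transcript_lens = batch   # already on GPU
    # ... training / inference step ...
    batch = prefetcher.next()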
def seq_collate_fn(batch):
"""batches samples and returns as tensors
Args:
batch : list of samples
Returns
batches of tensors
"""
batch_size = len(batch)
def _find_max_len(lst, ind):
max_len = -1
for item in lst:
if item[ind].size(0) > max_len:
max_len = item[ind].size(0)
return max_len
max_audio_len = _find_max_len(batch, 0)
max_transcript_len = _find_max_len(batch, 2)
batched_audio_signal = torch.zeros(batch_size, max_audio_len)
batched_transcript = torch.zeros(batch_size, max_transcript_len)
audio_lengths = []
transcript_lengths = []
for ind, sample in enumerate(batch):
batched_audio_signal[ind].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
audio_lengths.append(sample[1])
batched_transcript[ind].narrow(0, 0, sample[2].size(0)).copy_(sample[2])
transcript_lengths.append(sample[3])
return batched_audio_signal, torch.stack(audio_lengths), batched_transcript, \
torch.stack(transcript_lengths)
class AudioToTextDataLayer:
"""Data layer with data loader
"""
def __init__(self, **kwargs):
self._device = torch.device("cuda")
featurizer_config = kwargs['featurizer_config']
pad_to_max = kwargs.get('pad_to_max', False)
perturb_config = kwargs.get('perturb_config', None)
manifest_filepath = kwargs['manifest_filepath']
dataset_dir = kwargs['dataset_dir']
labels = kwargs['labels']
batch_size = kwargs['batch_size']
drop_last = kwargs.get('drop_last', False)
shuffle = kwargs.get('shuffle', True)
min_duration = featurizer_config.get('min_duration', 0.1)
max_duration = featurizer_config.get('max_duration', None)
normalize_transcripts = kwargs.get('normalize_transcripts', True)
trim_silence = kwargs.get('trim_silence', False)
multi_gpu = kwargs.get('multi_gpu', False)
sampler_type = kwargs.get('sampler', 'default')
speed_perturbation = featurizer_config.get('speed_perturbation', False)
        sort_by_duration = (sampler_type == 'bucket')
self._featurizer = WaveformFeaturizer.from_config(featurizer_config, perturbation_configs=perturb_config)
self._dataset = AudioDataset(
dataset_dir=dataset_dir,
manifest_filepath=manifest_filepath,
labels=labels, blank_index=len(labels),
sort_by_duration=sort_by_duration,
pad_to_max=pad_to_max,
featurizer=self._featurizer, max_duration=max_duration,
min_duration=min_duration, normalize=normalize_transcripts,
trim=trim_silence, speed_perturbation=speed_perturbation)
print('sort_by_duration', sort_by_duration)
if not multi_gpu:
self.sampler = None
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
batch_size=batch_size,
collate_fn=lambda b: seq_collate_fn(b),
drop_last=drop_last,
shuffle=shuffle if self.sampler is None else False,
num_workers=4,
pin_memory=True,
sampler=self.sampler
)
elif sampler_type == 'bucket':
self.sampler = DistributedBucketBatchSampler(self._dataset, batch_size=batch_size)
print("DDBucketSampler")
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
collate_fn=lambda b: seq_collate_fn(b),
num_workers=4,
pin_memory=True,
batch_sampler=self.sampler
)
elif sampler_type == 'default':
self.sampler = torch.utils.data.distributed.DistributedSampler(self._dataset)
print("DDSampler")
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
batch_size=batch_size,
collate_fn=lambda b: seq_collate_fn(b),
drop_last=drop_last,
shuffle=shuffle if self.sampler is None else False,
num_workers=4,
pin_memory=True,
sampler=self.sampler
)
else:
raise RuntimeError("Sampler {} not supported".format(sampler_type))
def __len__(self):
return len(self._dataset)
@property
def data_iterator(self):
return self._dataloader
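# Construction sketch (illustrative; values are placeholders): the layer is
# driven entirely by keyword arguments, mirroring how it is built from the model
# configuration elsewhere in the repository:
#   data_layer = AudioToTextDataLayer(
#       dataset_dir='/datasets/LibriSpeech',
#       featurizer_config=featurizer_config,   # e.g. the model's 'input_eval' section
#       manifest_filepath='librispeech-dev-clean-wav.json',
#       labels=dataset_vocab,
#       batch_size=16,
#       shuffle=False,
#       multi_gpu=False)
#   for batch in data_layer.data_iterator:
#       ...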
class AudioDataset(Dataset):
def __init__(self, dataset_dir, manifest_filepath, labels, featurizer, max_duration=None, pad_to_max=False,
min_duration=None, blank_index=0, max_utts=0, normalize=True, sort_by_duration=False,
trim=False, speed_perturbation=False):
"""Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations
(in seconds). Each entry is a different audio sample.
Args:
dataset_dir: absolute path to dataset folder
manifest_filepath: relative path from dataset folder to manifest json as described above. Can be comma-separated paths.
labels: String containing all the possible characters to map to
featurizer: Initialized featurizer class that converts paths of audio to feature tensors
max_duration: If audio exceeds this length, do not include in dataset
min_duration: If audio is less than this length, do not include in dataset
pad_to_max: if specified, input sequences to the DNN model will be padded to max_duration
blank_index: blank index for ctc loss / decoder
max_utts: Limit number of utterances
normalize: whether to normalize transcript text
sort_by_duration: whether or not to sort sequences by increasing duration
trim: if specified trims leading and trailing silence from an audio signal.
speed_perturbation: whether the data contains speed perturbation
"""
m_paths = manifest_filepath.split(',')
self.manifest = Manifest(dataset_dir, m_paths, labels, blank_index, pad_to_max=pad_to_max,
max_duration=max_duration,
sort_by_duration=sort_by_duration,
min_duration=min_duration, max_utts=max_utts,
normalize=normalize, speed_perturbation=speed_perturbation)
self.featurizer = featurizer
self.blank_index = blank_index
self.trim = trim
print(
"Dataset loaded with {0:.2f} hours. Filtered {1:.2f} hours.".format(
self.manifest.duration / 3600,
self.manifest.filtered_duration / 3600))
def __getitem__(self, index):
sample = self.manifest[index]
rn_indx = np.random.randint(len(sample['audio_filepath']))
duration = sample['audio_duration'][rn_indx] if 'audio_duration' in sample else 0
offset = sample['offset'] if 'offset' in sample else 0
features = self.featurizer.process(sample['audio_filepath'][rn_indx],
offset=offset, duration=duration,
trim=self.trim)
return features, torch.tensor(features.shape[0]).int(), \
torch.tensor(sample["transcript"]), torch.tensor(
len(sample["transcript"])).int()
def __len__(self):
return len(self.manifest)
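# Note (an interpretation of the code above, not part of the original file):
# after Manifest parsing each sample exposes an 'audio_filepath' list (one path
# is picked at random per access, presumably to sample among pre-generated
# speed-perturbed copies), optional 'audio_duration' and 'offset' fields, and a
# 'transcript' already encoded as label indices. __getitem__ returns the 4-tuple
#   (features, feature_length, transcript_indices, transcript_length)
# which matches what seq_collate_fn above expects.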

View file

@ -1,95 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Default setting is building on nvidia/cuda:10.1-devel-ubuntu18.04
ARG BASE_IMAGE=nvidia/cuda:10.1-devel-ubuntu18.04
FROM ${BASE_IMAGE}
# Default to use Python3. Allowed values are "2" and "3".
ARG PYVER=3
# Ensure apt-get won't prompt for selecting options
ENV DEBIAN_FRONTEND=noninteractive
ENV PYVER=$PYVER
RUN PYSFX=`[ "$PYVER" != "2" ] && echo "$PYVER" || echo ""` && \
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
autoconf \
automake \
build-essential \
cmake \
curl \
git \
libopencv-dev \
libopencv-core-dev \
libssl-dev \
libtool \
pkg-config \
python${PYSFX} \
python${PYSFX}-pip \
python${PYSFX}-dev && \
pip${PYSFX} install --upgrade setuptools wheel
RUN PYSFX=`[ "$PYVER" != "2" ] && echo "$PYVER" || echo ""` && \
pip${PYSFX} install --upgrade grpcio-tools
# Build expects "python" executable (not python3).
RUN rm -f /usr/bin/python && \
ln -s /usr/bin/python$PYVER /usr/bin/python
# Build the client library and examples
WORKDIR /workspace
COPY VERSION .
COPY build build
COPY src/clients src/clients
COPY src/core src/core
RUN cd build && \
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX:PATH=/workspace/install && \
make -j16 trtis-clients
RUN cd install && \
export VERSION=`cat /workspace/VERSION` && \
tar zcf /workspace/v$VERSION.clients.tar.gz *
# For CI testing need to install a test script.
COPY qa/L0_client_tar/test.sh /tmp/test.sh
# Install an image needed by the quickstart and other documentation.
COPY qa/images/mug.jpg images/mug.jpg
# Install the dependencies needed to run the client examples. These
# are not needed for building but including them allows this image to
# be used to run the client examples. The special upgrade and handling
# of pip is needed to get numpy to install correctly with python2 on
# ubuntu 16.04.
RUN python -m pip install --user --upgrade pip && \
python -m pip install --upgrade install/python/tensorrtserver-*.whl numpy pillow
ENV PATH //workspace/install/bin:${PATH}
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}

@ -1 +0,0 @@
Subproject commit a1f3860ba65c0fd8f2be3adfcab2673efd039348

View file

@ -1,207 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from enum import Enum
from metrics import word_error_rate
def print_once(msg):
if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
print(msg)
def add_ctc_labels(labels):
if not isinstance(labels, list):
raise ValueError("labels must be a list of symbols")
labels.append("<BLANK>")
return labels
def __ctc_decoder_predictions_tensor(tensor, labels):
"""
Takes output of greedy ctc decoder and performs ctc decoding algorithm to
remove duplicates and special symbols. Returns predictions
Args:
tensor: model output tensor
labels: A list of labels
Returns:
prediction
"""
blank_id = len(labels) - 1
hypotheses = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
prediction_cpu_tensor = tensor.long().cpu()
# iterate over batch
for ind in range(prediction_cpu_tensor.shape[0]):
prediction = prediction_cpu_tensor[ind].numpy().tolist()
# CTC decoding procedure
decoded_prediction = []
previous = len(labels) - 1 # id of a blank symbol
for p in prediction:
if (p != previous or previous == blank_id) and p != blank_id:
decoded_prediction.append(p)
previous = p
hypothesis = ''.join([labels_map[c] for c in decoded_prediction])
hypotheses.append(hypothesis)
return hypotheses
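# Worked example (illustrative): with labels = ['a', 'b', ..., '<BLANK>'], the
# blank symbol is the last index. A per-frame argmax sequence such as
#   [a, a, <BLANK>, b, b, <BLANK>, a]
# first drops repeated symbols, then drops blanks, yielding the hypothesis "aba".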
def monitor_asr_train_progress(tensors: list, labels: list):
"""
Takes output of greedy ctc decoder and performs ctc decoding algorithm to
remove duplicates and special symbols. Prints WER and prediction examples to screen
Args:
tensors: A list of 3 tensors (predictions, targets, target_lengths)
labels: A list of labels
Returns:
word error rate
"""
references = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
with torch.no_grad():
targets_cpu_tensor = tensors[1].long().cpu()
tgt_lenths_cpu_tensor = tensors[2].long().cpu()
# iterate over batch
for ind in range(targets_cpu_tensor.shape[0]):
tgt_len = tgt_lenths_cpu_tensor[ind].item()
target = targets_cpu_tensor[ind][:tgt_len].numpy().tolist()
reference = ''.join([labels_map[c] for c in target])
references.append(reference)
hypotheses = __ctc_decoder_predictions_tensor(tensors[0], labels=labels)
tag = "training_batch_WER"
wer, _, _ = word_error_rate(hypotheses, references)
print_once('{0}: {1}'.format(tag, wer))
print_once('Prediction: {0}'.format(hypotheses[0]))
print_once('Reference: {0}'.format(references[0]))
return wer
def __gather_losses(losses_list: list) -> list:
return [torch.mean(torch.stack(losses_list))]
def __gather_predictions(predictions_list: list, labels: list) -> list:
results = []
for prediction in predictions_list:
results += __ctc_decoder_predictions_tensor(prediction, labels=labels)
return results
def __gather_transcripts(transcript_list: list, transcript_len_list: list,
labels: list) -> list:
results = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
# iterate over workers
for t, ln in zip(transcript_list, transcript_len_list):
# iterate over batch
t_lc = t.long().cpu()
ln_lc = ln.long().cpu()
for ind in range(t.shape[0]):
tgt_len = ln_lc[ind].item()
target = t_lc[ind][:tgt_len].numpy().tolist()
reference = ''.join([labels_map[c] for c in target])
results.append(reference)
return results
def process_evaluation_batch(tensors: dict, global_vars: dict, labels: list):
"""
Processes results of an iteration and saves it in global_vars
Args:
tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcript, and output
global_vars: dictionary where processed results of each iteration are saved
labels: A list of labels
"""
for kv, v in tensors.items():
if kv.startswith('loss'):
global_vars['EvalLoss'] += __gather_losses(v)
elif kv.startswith('predictions'):
global_vars['predictions'] += __gather_predictions(v, labels=labels)
elif kv.startswith('transcript_length'):
transcript_len_list = v
elif kv.startswith('transcript'):
transcript_list = v
elif kv.startswith('output'):
global_vars['logits'] += v
global_vars['transcripts'] += __gather_transcripts(transcript_list,
transcript_len_list,
labels=labels)
def process_evaluation_epoch(global_vars: dict, tag=None):
"""
Processes results from each worker at the end of evaluation and combines them into the final result
Args:
global_vars: dictionary containing information of entire evaluation
Return:
wer: final word error rate
loss: final loss
"""
if 'EvalLoss' in global_vars:
eloss = torch.mean(torch.stack(global_vars['EvalLoss'])).item()
else:
eloss = None
hypotheses = global_vars['predictions']
references = global_vars['transcripts']
wer, scores, num_words = word_error_rate(hypotheses=hypotheses, references=references)
multi_gpu = torch.distributed.is_initialized()
if multi_gpu:
if eloss is not None:
eloss /= torch.distributed.get_world_size()
eloss_tensor = torch.tensor(eloss).cuda()
dist.all_reduce(eloss_tensor)
eloss = eloss_tensor.item()
del eloss_tensor
scores_tensor = torch.tensor(scores).cuda()
dist.all_reduce(scores_tensor)
scores = scores_tensor.item()
del scores_tensor
num_words_tensor = torch.tensor(num_words).cuda()
dist.all_reduce(num_words_tensor)
num_words = num_words_tensor.item()
del num_words_tensor
wer = scores *1.0/num_words
return wer, eloss
def norm(x):
if not isinstance(x, list):
if not isinstance(x, tuple):
return x
return x[0]
def print_dict(d):
maxLen = max([len(ii) for ii in d.keys()])
fmtString = '\t%' + str(maxLen) + 's : %s'
print('Arguments:')
for keyPair in sorted(d.items()):
print(fmtString % keyPair)
def model_multi_gpu(model, multi_gpu=False):
if multi_gpu:
model = DDP(model)
print('DDP(model)')
return model

Binary image files not shown: six documentation images added (68–149 KiB each) and four removed (20–32 KiB each).

View file

@ -13,328 +13,385 @@
# limitations under the License.
import argparse
import itertools
from typing import List
from tqdm import tqdm
import math
import toml
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, add_ctc_labels, print_dict, model_multi_gpu, __ctc_decoder_predictions_tensor
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder
from parts.features import audio_from_file
import torch
import torch.nn as nn
import apex
from apex import amp
import random
import numpy as np
import pickle
import time
import os
import random
import time
from heapq import nlargest
from itertools import chain, repeat
from pathlib import Path
from tqdm import tqdm
def parse_args():
import dllogger
import torch
import numpy as np
import torch.distributed as distrib
from apex import amp
from apex.parallel import DistributedDataParallel
from dllogger import JSONStreamBackend, StdOutBackend, Verbosity
from jasper import config
from common import helpers
from common.dali.data_loader import DaliDataLoader
from common.dataset import (AudioDataset, FilelistDataset, get_data_loader,
SingleAudioDataset)
from common.features import BaseFeatures, FilterbankFeatures
from common.helpers import print_once, process_evaluation_epoch
from jasper.model import GreedyCTCDecoder, Jasper
from common.tb_dllogger import stdout_metric_format, unique_log_fpath
def get_parser():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument('--batch_size', default=16, type=int,
help='Data batch size')
parser.add_argument('--steps', default=0, type=int,
help='Eval this many steps for every worker')
parser.add_argument('--warmup_steps', default=0, type=int,
help='Burn-in period before measuring latencies')
parser.add_argument('--model_config', type=str,
help='Relative model config path given dataset folder')
parser.add_argument('--dataset_dir', type=str,
help='Absolute path to dataset folder')
parser.add_argument('--val_manifests', type=str, nargs='+',
help='Relative path to evaluation dataset manifest files')
parser.add_argument('--ckpt', default=None, type=str,
help='Path to model checkpoint')
parser.add_argument('--max_duration', default=None, type=float,
help='Filter out longer inputs (in seconds)')
parser.add_argument('--pad_to_max_duration', action='store_true',
help='Pads every batch to max_duration')
parser.add_argument('--amp', '--fp16', action='store_true',
help='Use FP16 precision')
parser.add_argument('--cudnn_benchmark', action='store_true',
help='Enable cudnn benchmark')
parser.add_argument('--cpu', action='store_true',
help='Run inference on CPU')
parser.add_argument("--seed", default=None, type=int, help='Random seed')
parser.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0),
type=int, help='GPU id used for distributed training')
parser.register("type", "bool", lambda x: x.lower() in ("yes", "true", "t", "1"))
parser.add_argument("--local_rank", default=None, type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--steps", default=None, help='if not specified do evaluation on full dataset. otherwise only evaluates the specified number of iterations for each worker', type=int)
parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None uses attribute from model configuration file')
parser.add_argument("--pad_to", default=None, type=int, help="default is pad to value as specified in model configurations. if -1 pad to maximum duration. If > 0 pad batch to next multiple of value")
parser.add_argument("--amp", "--fp16", action='store_true', help='use half precision')
parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
parser.add_argument("--save_prediction", type=str, default=None, help="if specified saves predictions in text form at this location")
parser.add_argument("--logits_save_to", default=None, type=str, help="if specified will save logits to path")
parser.add_argument("--seed", default=42, type=int, help='seed')
parser.add_argument("--output_dir", default="results/", type=str, help="Output directory to store exported models. Only used if --export_model is used")
parser.add_argument("--export_model", action='store_true', help="Exports the audio_featurizer, encoder and decoder using torch.jit to the output_dir")
parser.add_argument("--wav", type=str, help='absolute path to .wav file (16KHz)')
parser.add_argument("--cpu", action="store_true", help="Run inference on CPU")
parser.add_argument("--ema", action="store_true", help="If available, load EMA model weights")
# FIXME Unused, but passed by Triton helper scripts
parser.add_argument("--pyt_fp16", action='store_true', help='use half precision')
return parser.parse_args()
def calc_wer(data_layer, audio_processor,
encoderdecoder, greedy_decoder,
labels, args, device):
encoderdecoder = encoderdecoder.module if hasattr(encoderdecoder, 'module') else encoderdecoder
with torch.no_grad():
# reset global_var_dict - results of evaluation will be stored there
_global_var_dict = {
'predictions': [],
'transcripts': [],
'logits' : [],
}
# Evaluation mini-batch for loop
for it, data in enumerate(tqdm(data_layer.data_iterator)):
tensors = [t.to(device) for t in data]
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
t_processed_signal = audio_processor(t_audio_signal_e, t_a_sig_length_e)
t_log_probs_e, _ = encoderdecoder.infer(t_processed_signal)
t_predictions_e = greedy_decoder(t_log_probs_e)
values_dict = dict(
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e],
output=[t_log_probs_e]
)
# values_dict will contain results from all workers
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
if args.steps is not None and it + 1 >= args.steps:
break
# final aggregation (over minibatches) and logging of results
wer, _ = process_evaluation_epoch(_global_var_dict)
return wer, _global_var_dict
io = parser.add_argument_group('feature and checkpointing setup')
io.add_argument('--dali_device', type=str, choices=['none', 'cpu', 'gpu'],
default='gpu', help='Use DALI pipeline for fast data processing')
io.add_argument('--save_predictions', type=str, default=None,
help='Save predictions in text form at this location')
io.add_argument('--save_logits', default=None, type=str,
help='Save output logits under specified path')
io.add_argument('--transcribe_wav', type=str,
help='Path to a single .wav file (16KHz)')
io.add_argument('--transcribe_filelist', type=str,
help='Path to a filelist with one .wav path per line')
io.add_argument('-o', '--output_dir', default='results/',
help='Output folder to save audio (file per phrase)')
io.add_argument('--log_file', type=str, default=None,
help='Path to a DLLogger log file')
io.add_argument('--ema', action='store_true',
help='Load averaged model weights')
io.add_argument('--torchscript', action='store_true',
help='Evaluate with a TorchScripted model')
io.add_argument('--torchscript_export', action='store_true',
help='Export the model with torch.jit to the output_dir')
return parser
def jit_export(audio, audio_len, audio_processor, encoderdecoder, greedy_decoder, args):
def durs_to_percentiles(durations, ratios):
durations = np.asarray(durations) * 1000 # in ms
latency = durations
print("##############")
latency = latency[5:]
mean_latency = np.mean(latency)
module_name = "{}_{}".format(os.path.basename(args.model_toml), "fp16" if args.amp else "fp32")
if args.use_conv_mask:
module_name = module_name + "_noMaskConv"
# Export just the featurizer
print("exporting featurizer ...")
traced_module_feat = torch.jit.script(audio_processor)
traced_module_feat.save(os.path.join(args.output_dir, module_name + "_feat.pt"))
# Export just the acoustic model
print("exporting acoustic model ...")
inp_postFeat, _ = audio_processor(audio, audio_len)
traced_module_acoustic = torch.jit.trace(encoderdecoder, inp_postFeat)
traced_module_acoustic.save(os.path.join(args.output_dir, module_name + "_acoustic.pt"))
# Export just the decoder
print("exporting decoder ...")
inp_postAcoustic = encoderdecoder(inp_postFeat)
traced_module_decode = torch.jit.script(greedy_decoder, inp_postAcoustic)
traced_module_decode.save(os.path.join(args.output_dir, module_name + "_decoder.pt"))
print("JIT export complete")
return traced_module_feat, traced_module_acoustic, traced_module_decode
def run_once(audio_processor, encoderdecoder, greedy_decoder, audio, audio_len, labels, device):
features, lens = audio_processor(audio, audio_len)
if not device.type == 'cpu':
torch.cuda.synchronize()
t0 = time.perf_counter()
# TorchScripted model does not support (features, lengths)
if isinstance(encoderdecoder, torch.jit.TracedModule):
t_log_probs_e = encoderdecoder(features)
else:
t_log_probs_e, _ = encoderdecoder.infer((features, lens))
if not device.type == 'cpu':
torch.cuda.synchronize()
t1 = time.perf_counter()
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
hypotheses = __ctc_decoder_predictions_tensor(t_predictions_e, labels=labels)
print("INFERENCE TIME\t\t: {} ms".format((t1-t0)*1000.0))
print("TRANSCRIPT\t\t:", hypotheses[0])
latency_worst = nlargest(math.ceil((1 - min(ratios)) * len(latency)), latency)
latency_ranges = get_percentile(ratios, latency_worst, len(latency))
latency_ranges[0.5] = mean_latency
return latency_ranges
def eval(
data_layer,
audio_processor,
encoderdecoder,
greedy_decoder,
labels,
multi_gpu,
device,
args):
"""performs inference / evaluation
Args:
data_layer: data layer object that holds data loader
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
labels: list of labels as output vocabulary
multi_gpu: true if using multiple gpus
args: script input arguments
"""
logits_save_to=args.logits_save_to
with torch.no_grad():
if args.wav:
audio, audio_len = audio_from_file(args.wav)
run_once(audio_processor, encoderdecoder, greedy_decoder, audio, audio_len, labels, device)
if args.export_model:
jit_audio_processor, jit_encoderdecoder, jit_greedy_decoder = jit_export(audio, audio_len, audio_processor, encoderdecoder,greedy_decoder,args)
run_once(jit_audio_processor, jit_encoderdecoder, jit_greedy_decoder, audio, audio_len, labels, device)
return
wer, _global_var_dict = calc_wer(data_layer, audio_processor, encoderdecoder, greedy_decoder, labels, args, device)
if (not multi_gpu or (multi_gpu and torch.distributed.get_rank() == 0)):
print("==========>>>>>>Evaluation WER: {0}\n".format(wer))
if args.save_prediction is not None:
with open(args.save_prediction, 'w') as fp:
fp.write('\n'.join(_global_var_dict['predictions']))
if logits_save_to is not None:
logits = []
for batch in _global_var_dict["logits"]:
for i in range(batch.shape[0]):
logits.append(batch[i].cpu().numpy())
with open(logits_save_to, 'wb') as f:
pickle.dump(logits, f, protocol=pickle.HIGHEST_PROTOCOL)
# if args.export_model:
# feat, acoustic, decoder = jit_export(inp, audio_processor, encoderdecoder, greedy_decoder,args)
# wer_after = calc_wer(data_layer, feat, acoustic, decoder, labels, args)
# print("===>>>Before WER: {0}".format(wer))
# print("===>>>Traced WER: {0}".format(wer_after))
# print("===>>>Diff : {0} %".format((wer_after - wer_before) * 100.0 / wer_before))
# print("")
def get_percentile(ratios, arr, nsamples):
res = {}
for a in ratios:
idx = max(int(nsamples * (1 - a)), 0)
res[a] = arr[idx]
return res
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
def torchscript_export(data_loader, audio_processor, model, greedy_decoder,
output_dir, use_amp, use_conv_masks, model_config, device,
save):
multi_gpu = args.local_rank is not None
audio_processor.to(device)
for batch in data_loader:
batch = [t.to(device, non_blocking=True) for t in batch]
audio, audio_len, _, _ = batch
feats, feat_lens = audio_processor(audio, audio_len)
break
print("\nExporting featurizer...")
print("\nNOTE: Dithering causes warnings about non-determinism.\n")
ts_feat = torch.jit.trace(audio_processor, (audio, audio_len))
print("\nExporting acoustic model...")
model(feats, feat_lens)
ts_acoustic = torch.jit.trace(model, (feats, feat_lens))
print("\nExporting decoder...")
log_probs = model(feats, feat_lens)
ts_decoder = torch.jit.script(greedy_decoder, log_probs)
print("\nJIT export complete.")
if save:
precision = "fp16" if use_amp else "fp32"
module_name = f'{os.path.basename(model_config)}_{precision}'
ts_feat.save(os.path.join(output_dir, module_name + "_feat.pt"))
ts_acoustic.save(os.path.join(output_dir, module_name + "_acoustic.pt"))
ts_decoder.save(os.path.join(output_dir, module_name + "_decoder.pt"))
return ts_feat, ts_acoustic, ts_decoder
def main():
parser = get_parser()
args = parser.parse_args()
log_fpath = args.log_file or str(Path(args.output_dir, 'nvlog_infer.json'))
log_fpath = unique_log_fpath(log_fpath)
dllogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT, log_fpath),
StdOutBackend(Verbosity.VERBOSE,
metric_format=stdout_metric_format)])
[dllogger.log("PARAMETER", {k: v}) for k, v in vars(args).items()]
for step in ['DNN', 'data+DNN', 'data']:
for c in [0.99, 0.95, 0.9, 0.5]:
cs = 'avg' if c == 0.5 else f'{int(100*c)}%'
dllogger.metadata(f'{step.lower()}_latency_{c}',
{'name': f'{step} latency {cs}',
'format': ':>7.2f', 'unit': 'ms'})
dllogger.metadata(
'eval_wer', {'name': 'WER', 'format': ':>3.2f', 'unit': '%'})
if args.cpu:
assert(not multi_gpu)
device = torch.device('cpu')
else:
assert(torch.cuda.is_available())
assert torch.cuda.is_available()
device = torch.device('cuda')
torch.backends.cudnn.benchmark = args.cudnn_benchmark
print("CUDNN BENCHMARK ", args.cudnn_benchmark)
if multi_gpu:
print("DISTRIBUTED with ", torch.distributed.get_world_size())
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
if args.seed is not None:
torch.manual_seed(args.seed + args.local_rank)
np.random.seed(args.seed + args.local_rank)
random.seed(args.seed + args.local_rank)
optim_level = 3 if args.amp else 0
# set up distributed training
multi_gpu = not args.cpu and int(os.environ.get('WORLD_SIZE', 1)) > 1
if multi_gpu:
torch.cuda.set_device(args.local_rank)
distrib.init_process_group(backend='nccl', init_method='env://')
print_once(f'Inference with {distrib.get_world_size()} GPUs')
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
featurizer_config["fp16"] = args.amp
args.use_conv_mask = jasper_model_definition['encoder'].get('convmask', True)
if args.use_conv_mask and args.export_model:
print('WARNING: Masked convs currently not supported for TorchScript. Disabling.')
jasper_model_definition['encoder']['convmask'] = False
cfg = config.load(args.model_config)
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
if args.pad_to is not None:
featurizer_config['pad_to'] = args.pad_to
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if featurizer_config['pad_to'] == "max":
featurizer_config['pad_to'] = -1
if args.pad_to_max_duration:
assert cfg['input_val']['audio_dataset']['max_duration'] > 0
cfg['input_val']['audio_dataset']['pad_to_max_duration'] = True
cfg['input_val']['filterbank_features']['pad_to_max_duration'] = True
print('=== model_config ===')
print_dict(jasper_model_definition)
print()
print('=== feature_config ===')
print_dict(featurizer_config)
print()
data_layer = None
symbols = helpers.add_ctc_blank(cfg['labels'])
if args.wav is None:
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=val_manifest,
labels=dataset_vocab,
use_dali = args.dali_device in ('cpu', 'gpu')
dataset_kw, features_kw = config.input(cfg, 'val')
measure_perf = args.steps > 0
# dataset
if args.transcribe_wav or args.transcribe_filelist:
assert not use_dali, "DALI is not supported for a single audio"
assert not args.transcribe_filelist
assert not args.pad_to_max_duration
assert not (args.transcribe_wav and args.transcribe_filelist)
if args.transcribe_wav:
dataset = SingleAudioDataset(args.transcribe_wav)
else:
dataset = FilelistDataset(args.transcribe_filelist)
data_loader = get_data_loader(dataset,
batch_size=1,
multi_gpu=multi_gpu,
shuffle=False,
num_workers=0,
drop_last=(True if measure_perf else False))
_, features_kw = config.input(cfg, 'val')
feat_proc = FilterbankFeatures(**features_kw)
elif use_dali:
# pad_to_max_duration is not supported by DALI - have simple padders
if features_kw['pad_to_max_duration']:
feat_proc = BaseFeatures(
pad_align=features_kw['pad_align'],
pad_to_max_duration=True,
max_duration=features_kw['max_duration'],
sample_rate=features_kw['sample_rate'],
window_size=features_kw['window_size'],
window_stride=features_kw['window_stride'])
features_kw['pad_to_max_duration'] = False
else:
feat_proc = None
data_loader = DaliDataLoader(
gpu_id=args.local_rank or 0,
dataset_path=args.dataset_dir,
config_data=dataset_kw,
config_features=features_kw,
json_names=args.val_manifests,
batch_size=args.batch_size,
pad_to_max=featurizer_config['pad_to'] == -1,
shuffle=False,
multi_gpu=multi_gpu)
audio_preprocessor = AudioPreprocessing(**featurizer_config)
encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
pipeline_type=("train" if measure_perf else "val"), # no drop_last
device_type=args.dali_device,
symbols=symbols)
else:
dataset = AudioDataset(args.dataset_dir,
args.val_manifests,
symbols,
**dataset_kw)
data_loader = get_data_loader(dataset,
args.batch_size,
multi_gpu=multi_gpu,
shuffle=False,
num_workers=4,
drop_last=False)
feat_proc = FilterbankFeatures(**features_kw)
model = Jasper(encoder_kw=config.encoder(cfg),
decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
if args.ckpt is not None:
print("loading model from ", args.ckpt)
print(f'Loading the model from {args.ckpt} ...')
checkpoint = torch.load(args.ckpt, map_location="cpu")
key = 'ema_state_dict' if args.ema else 'state_dict'
state_dict = helpers.convert_v1_state_dict(checkpoint[key])
model.load_state_dict(state_dict, strict=True)
if os.path.isdir(args.ckpt):
exit(0)
else:
checkpoint = torch.load(args.ckpt, map_location="cpu")
if args.ema and 'ema_state_dict' in checkpoint:
print('Loading EMA state dict')
sd = 'ema_state_dict'
else:
sd = 'state_dict'
model.to(device)
model.eval()
for k in audio_preprocessor.state_dict().keys():
checkpoint[sd][k] = checkpoint[sd].pop("audio_preprocessor." + k)
audio_preprocessor.load_state_dict(checkpoint[sd], strict=False)
encoderdecoder.load_state_dict(checkpoint[sd], strict=False)
greedy_decoder = GreedyCTCDecoder()
# print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
if args.wav is None:
N = len(data_layer)
step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
if args.steps is not None:
print('-----------------')
print('Have {0} examples to eval on.'.format(args.steps * args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
print('Have {0} steps / (gpu * epoch).'.format(args.steps))
print('-----------------')
else:
print('-----------------')
print('Have {0} examples to eval on.'.format(N))
print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
print('-----------------')
print ("audio_preprocessor.normalize: ", audio_preprocessor.featurizer.normalize)
audio_preprocessor.to(device)
encoderdecoder.to(device)
if feat_proc is not None:
feat_proc.to(device)
feat_proc.eval()
if args.amp:
encoderdecoder = amp.initialize(models=encoderdecoder,
opt_level='O'+str(optim_level))
model = model.half()
encoderdecoder = model_multi_gpu(encoderdecoder, multi_gpu)
audio_preprocessor.eval()
encoderdecoder.eval()
greedy_decoder.eval()
if args.torchscript:
greedy_decoder = GreedyCTCDecoder()
eval(
data_layer=data_layer,
audio_processor=audio_preprocessor,
encoderdecoder=encoderdecoder,
greedy_decoder=greedy_decoder,
labels=ctc_vocab,
args=args,
device=device,
multi_gpu=multi_gpu)
feat_proc, model, greedy_decoder = torchscript_export(
data_loader, feat_proc, model, greedy_decoder, args.output_dir,
use_amp=args.amp, use_conv_masks=True, model_toml=args.model_toml,
device=device, save=args.torchscript_export)
if __name__=="__main__":
args = parse_args()
if multi_gpu:
model = DistributedDataParallel(model)
print_dict(vars(args))
agg = {'txts': [], 'preds': [], 'logits': []}
dur = {'data': [], 'dnn': [], 'data+dnn': []}
main(args)
looped_loader = chain.from_iterable(repeat(data_loader))
greedy_decoder = GreedyCTCDecoder()
sync = lambda: torch.cuda.synchronize() if device.type == 'cuda' else None
steps = args.steps + args.warmup_steps or len(data_loader)
with torch.no_grad():
for it, batch in enumerate(tqdm(looped_loader, initial=1, total=steps)):
if use_dali:
feats, feat_lens, txt, txt_lens = batch
if feat_proc is not None:
feats, feat_lens = feat_proc(feats, feat_lens)
else:
batch = [t.cuda(non_blocking=True) for t in batch]
audio, audio_lens, txt, txt_lens = batch
feats, feat_lens = feat_proc(audio, audio_lens)
sync()
t1 = time.perf_counter()
if args.amp:
feats = feats.half()
if model.encoder.use_conv_masks:
log_probs, log_prob_lens = model(feats, feat_lens)
else:
log_probs = model(feats, feat_lens)
preds = greedy_decoder(log_probs)
sync()
t2 = time.perf_counter()
# burn-in period; wait for a new loader due to num_workers
if it >= 1 and (args.steps == 0 or it >= args.warmup_steps):
dur['data'].append(t1 - t0)
dur['dnn'].append(t2 - t1)
dur['data+dnn'].append(t2 - t0)
if txt is not None:
agg['txts'] += helpers.gather_transcripts([txt], [txt_lens],
symbols)
agg['preds'] += helpers.gather_predictions([preds], symbols)
agg['logits'].append(log_probs)
if it + 1 == steps:
break
sync()
t0 = time.perf_counter()
# communicate the results
if args.transcribe_wav:
for idx, p in enumerate(agg['preds']):
print_once(f'Prediction {idx+1: >3}: {p}')
elif args.transcribe_filelist:
pass
elif not multi_gpu or distrib.get_rank() == 0:
wer, _ = process_evaluation_epoch(agg)
dllogger.log(step=(), data={'eval_wer': 100 * wer})
if args.save_predictions:
with open(args.save_predictions, 'w') as f:
f.write('\n'.join(agg['preds']))
if args.save_logits:
logits = torch.cat(agg['logits'], dim=0).cpu()
torch.save(logits, args.save_logits)
# report timings
if len(dur['data']) >= 20:
ratios = [0.9, 0.95, 0.99]
for stage in dur:
lat = durs_to_percentiles(dur[stage], ratios)
for k in [0.99, 0.95, 0.9, 0.5]:
kk = str(k).replace('.', '_')
dllogger.log(step=(), data={f'{stage.lower()}_latency_{kk}': lat[k]})
else:
print_once('Not enough samples to measure latencies.')
if __name__ == "__main__":
main()
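# Example invocation (a sketch; the script name and all paths are placeholders,
# while the flags come from the parser defined above):
#   python inference.py \
#       --dataset_dir /datasets/LibriSpeech \
#       --val_manifests /datasets/LibriSpeech/librispeech-dev-clean-wav.json \
#       --model_config configs/jasper_config.yaml \
#       --ckpt /checkpoints/jasper_fp16.pt \
#       --batch_size 16 --amp --dali_device gpu --cudnn_benchmark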

View file

@ -1,301 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import itertools
import os
import sys
import time
import random
import numpy as np
from heapq import nlargest
import math
from tqdm import tqdm
import toml
import torch
from apex import amp
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, add_ctc_labels, print_dict
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder
from parts.features import audio_from_file
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument("--steps", default=None, help='if not specified do evaluation on full dataset. otherwise only evaluates the specified number of iterations for each worker', type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None uses attribute from model configuration file')
parser.add_argument("--pad_to", default=None, type=int, help="default is pad to value as specified in model configurations. if -1 pad to maximum duration. If > 0 pad batch to next multiple of value")
parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
parser.add_argument("--amp", "--fp16", action='store_true', help='use half precision')
parser.add_argument("--seed", default=42, type=int, help='seed')
parser.add_argument("--cpu", action='store_true', help='run inference on CPU')
parser.add_argument("--torch_script", action='store_true', help='export model')
parser.add_argument("--sample_audio", default="/datasets/LibriSpeech/dev-clean-wav/1272/128104/1272-128104-0000.wav", type=str, help='audio sample path for torchscript, points to one of the files in /datasets/LibriSpeech/dev-clean-wav/ if not defined')
return parser.parse_args()
def jit_export(
audio,
audio_len,
audio_processor,
encoderdecoder,
greedy_decoder,
args):
"""applies torchscript
Args:
audio:
audio_len:
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
args: script input arguments
"""
# Export just the featurizer
print("torchscripting featurizer ...")
traced_module_feat = torch.jit.script(audio_processor)
# Export just the acoustic model
print("torchscripting acoustic model ...")
inp_postFeat, _ = audio_processor(audio, audio_len)
traced_module_acoustic = torch.jit.trace(encoderdecoder, inp_postFeat)
# Export just the decoder
print("torchscripting decoder ...")
inp_postAcoustic = encoderdecoder(inp_postFeat)
traced_module_decode = torch.jit.script(greedy_decoder, inp_postAcoustic)
print("JIT process complete")
return traced_module_feat, traced_module_acoustic, traced_module_decode
def eval(
data_layer,
audio_processor,
encoderdecoder,
greedy_decoder,
labels,
device,
args):
"""performs evaluation and prints performance statistics
Args:
data_layer: data layer object that holds data loader
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
labels: list of labels as output vocabulary
args: script input arguments
"""
batch_size=args.batch_size
steps=args.steps
audio_processor.eval()
encoderdecoder.eval()
greedy_decoder.eval()
if args.torch_script:
audio, audio_len = audio_from_file(args.sample_audio, device=device)
audio_processor, encoderdecoder, greedy_decoder = jit_export(audio, audio_len, audio_processor, encoderdecoder, greedy_decoder, args)
with torch.no_grad():
_global_var_dict = {
'predictions': [],
'transcripts': [],
}
it = 0
ep = 0
if steps is None:
steps = math.ceil(len(data_layer) / batch_size)
durations_dnn = []
durations_dnn_and_prep = []
seq_lens = []
sync = lambda: torch.cuda.synchronize() if device.type == 'cuda' else None
while True:
ep += 1
for data in tqdm(data_layer.data_iterator):
it += 1
if it > steps:
break
tensors = [t.to(device) for t in data]
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
sync()
t0 = time.perf_counter()
features, lens = audio_processor(t_audio_signal_e, t_a_sig_length_e)
sync()
t1 = time.perf_counter()
if isinstance(encoderdecoder, torch.jit.TracedModule):
t_log_probs_e = encoderdecoder(features)
else:
t_log_probs_e, _ = encoderdecoder.infer((features, lens))
sync()
stop_time = time.perf_counter()
time_prep_and_dnn = stop_time - t0
time_dnn = stop_time - t1
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
values_dict = dict(
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e],
)
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
durations_dnn.append(time_dnn)
durations_dnn_and_prep.append(time_prep_and_dnn)
seq_lens.append(features[0].shape[-1])
if it >= steps:
wer, _ = process_evaluation_epoch(_global_var_dict)
print("==========>>>>>>Evaluation of all iterations WER: {0}\n".format(wer))
break
ratios = [0.9, 0.95,0.99, 1.]
latencies_dnn = take_durations_and_output_percentile(durations_dnn, ratios)
latencies_dnn_and_prep = take_durations_and_output_percentile(durations_dnn_and_prep, ratios)
print("\n using batch size {} and {} frames ".format(batch_size, seq_lens[-1]))
print("\n".join(["dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn.items()]))
print("\n".join(["prep + dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn_and_prep.items()]))
def take_durations_and_output_percentile(durations, ratios):
durations = np.asarray(durations) * 1000 # in ms
latency = durations
latency = latency[5:]
mean_latency = np.mean(latency)
latency_worst = nlargest(math.ceil( (1 - min(ratios))* len(latency)), latency)
latency_ranges=get_percentile(ratios, latency_worst, len(latency))
latency_ranges["0.5"] = mean_latency
return latency_ranges
def get_percentile(ratios, arr, nsamples):
res = {}
for a in ratios:
idx = max(int(nsamples * (1 - a)), 0)
res[a] = arr[idx]
return res
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
assert(args.steps is None or args.steps > 5)
if args.cpu:
device = torch.device('cpu')
else:
assert(torch.cuda.is_available())
device = torch.device('cuda')
torch.backends.cudnn.benchmark = args.cudnn_benchmark
print("CUDNN BENCHMARK ", args.cudnn_benchmark)
optim_level = 3 if args.amp else 0
batch_size = args.batch_size
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
# TORCHSCRIPT: Can't use mixed types. Using -1 for "max"
if args.pad_to is not None:
featurizer_config['pad_to'] = args.pad_to if args.pad_to >= 0 else -1
if featurizer_config['pad_to'] == "max":
featurizer_config['pad_to'] = -1
args.use_conv_mask = jasper_model_definition['encoder'].get('convmask', True)
if args.use_conv_mask and args.torch_script:
print('WARNING: Masked convs currently not supported for TorchScript. Disabling.')
jasper_model_definition['encoder']['convmask'] = False
print('model_config')
print_dict(jasper_model_definition)
print('feature_config')
print_dict(featurizer_config)
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=val_manifest,
labels=dataset_vocab,
batch_size=batch_size,
pad_to_max=featurizer_config['pad_to'] == -1,
shuffle=False,
multi_gpu=False)
audio_preprocessor = AudioPreprocessing(**featurizer_config)
encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
if args.ckpt is not None:
print("loading model from ", args.ckpt)
checkpoint = torch.load(args.ckpt, map_location="cpu")
for k in audio_preprocessor.state_dict().keys():
checkpoint['state_dict'][k] = checkpoint['state_dict'].pop("audio_preprocessor." + k)
audio_preprocessor.load_state_dict(checkpoint['state_dict'], strict=False)
encoderdecoder.load_state_dict(checkpoint['state_dict'], strict=False)
greedy_decoder = GreedyCTCDecoder()
# print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
N = len(data_layer)
step_per_epoch = math.ceil(N / args.batch_size)
print('-----------------')
if args.steps is None:
print('Have {0} examples to eval on.'.format(N))
print('Have {0} steps / (epoch).'.format(step_per_epoch))
else:
print('Have {0} examples to eval on.'.format(args.steps * args.batch_size))
print('Have {0} steps / (epoch).'.format(args.steps))
print('-----------------')
audio_preprocessor.to(device)
encoderdecoder.to(device)
if args.amp:
encoderdecoder = amp.initialize(
models=encoderdecoder, opt_level='O'+str(optim_level))
eval(
data_layer=data_layer,
audio_processor=audio_preprocessor,
encoderdecoder=encoderdecoder,
greedy_decoder=greedy_decoder,
labels=ctc_vocab,
device=device,
args=args)
if __name__=="__main__":
args = parse_args()
print_dict(vars(args))
main(args)

View file

@ -0,0 +1,110 @@
import copy
import inspect
import yaml
from .model import JasperDecoderForCTC, JasperBlock, JasperEncoder
from common.audio import GainPerturbation, ShiftPerturbation, SpeedPerturbation
from common.dataset import AudioDataset
from common.features import CutoutAugment, FilterbankFeatures, SpecAugment
from common.helpers import print_once
def default_args(klass):
sig = inspect.signature(klass.__init__)
return {k: v.default for k,v in sig.parameters.items() if k != 'self'}
def load(fpath):
if fpath.endswith('.toml'):
raise ValueError('.toml config format has been changed to .yaml')
cfg = yaml.safe_load(open(fpath, 'r'))
# Reload to deep copy shallow copies, which were made with yaml anchors
yaml.Dumper.ignore_aliases = lambda *args: True
cfg = yaml.dump(cfg)
cfg = yaml.safe_load(cfg)
return cfg
def validate_and_fill(klass, user_conf, ignore_unk=[], optional=[]):
conf = default_args(klass)
for k,v in user_conf.items():
assert k in conf or k in ignore_unk, f'Unknown parameter {k} for {klass}'
conf[k] = v
# Keep only mandatory or optional-nonempty
conf = {k:v for k,v in conf.items()
if k not in optional or v is not inspect.Parameter.empty}
# Validate
for k,v in conf.items():
assert v is not inspect.Parameter.empty, \
f'Value for {k} not specified for {klass}'
return conf
def input(conf_yaml, split='train'):
conf = copy.deepcopy(conf_yaml[f'input_{split}'])
conf_dataset = conf.pop('audio_dataset')
conf_features = conf.pop('filterbank_features')
# Validate known inner classes
inner_classes = [
(conf_dataset, 'speed_perturbation', SpeedPerturbation),
(conf_dataset, 'gain_perturbation', GainPerturbation),
(conf_dataset, 'shift_perturbation', ShiftPerturbation),
(conf_features, 'spec_augment', SpecAugment),
(conf_features, 'cutout_augment', CutoutAugment),
]
for conf_tgt, key, klass in inner_classes:
if key in conf_tgt:
conf_tgt[key] = validate_and_fill(klass, conf_tgt[key])
for k in conf:
raise ValueError(f'Unknown key {k}')
# Validate outer classes
conf_dataset = validate_and_fill(
AudioDataset, conf_dataset,
optional=['data_dir', 'labels', 'manifest_fpaths'])
conf_features = validate_and_fill(
FilterbankFeatures, conf_features)
# Check params shared between classes
shared = ['sample_rate', 'max_duration', 'pad_to_max_duration']
for sh in shared:
assert conf_dataset[sh] == conf_features[sh], (
f'{sh} should match in Dataset and FeatureProcessor: '
f'{conf_dataset[sh]}, {conf_features[sh]}')
return conf_dataset, conf_features
def encoder(conf):
"""Validate config for JasperEncoder and subsequent JasperBlocks"""
# Validate, but don't overwrite with defaults
for blk in conf['jasper']['encoder']['blocks']:
validate_and_fill(JasperBlock, blk, optional=['infilters'],
ignore_unk=['residual_dense'])
return validate_and_fill(JasperEncoder, conf['jasper']['encoder'])
def decoder(conf, n_classes):
decoder_kw = {'n_classes': n_classes, **conf['jasper']['decoder']}
return validate_and_fill(JasperDecoderForCTC, decoder_kw)
def apply_duration_flags(cfg, max_duration, pad_to_max_duration):
if max_duration is not None:
cfg['input_train']['audio_dataset']['max_duration'] = max_duration
cfg['input_train']['filterbank_features']['max_duration'] = max_duration
if pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
cfg['input_train']['filterbank_features']['pad_to_max_duration'] = True
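# Expected YAML layout (an illustrative outline inferred from the keys accessed
# above; the exact parameter names inside each section are the constructor
# arguments of the corresponding classes, and the values shown are placeholders):
#
#   labels: [" ", "a", "b", ...]
#   input_train:
#     audio_dataset:
#       sample_rate: 16000
#       max_duration: 16.7
#       pad_to_max_duration: false
#       speed_perturbation: {...}   # validated against SpeedPerturbation
#       gain_perturbation: {...}    # validated against GainPerturbation
#       shift_perturbation: {...}   # validated against ShiftPerturbation
#     filterbank_features:
#       sample_rate: 16000          # must match audio_dataset
#       max_duration: 16.7
#       pad_to_max_duration: false
#       spec_augment: {...}         # validated against SpecAugment
#       cutout_augment: {...}       # validated against CutoutAugment
#   input_val:
#     audio_dataset: {...}
#     filterbank_features: {...}
#   jasper:
#     encoder:
#       blocks: [...]               # one entry per JasperBlock
#     decoder: {...}                # JasperDecoderForCTC arguments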

View file

@ -0,0 +1,275 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import torch.nn.functional as F
activations = {
"hardtanh": nn.Hardtanh,
"relu": nn.ReLU,
"selu": nn.SELU,
}
def init_weights(m, mode='xavier_uniform'):
if type(m) == nn.Conv1d or type(m) == MaskedConv1d:
if mode == 'xavier_uniform':
nn.init.xavier_uniform_(m.weight, gain=1.0)
elif mode == 'xavier_normal':
nn.init.xavier_normal_(m.weight, gain=1.0)
elif mode == 'kaiming_uniform':
nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
elif mode == 'kaiming_normal':
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
else:
raise ValueError("Unknown Initialization mode: {0}".format(mode))
elif type(m) == nn.BatchNorm1d:
if m.track_running_stats:
m.running_mean.zero_()
m.running_var.fill_(1)
m.num_batches_tracked.zero_()
if m.affine:
nn.init.ones_(m.weight)
nn.init.zeros_(m.bias)
def get_same_padding(kernel_size, stride, dilation):
if stride > 1 and dilation > 1:
raise ValueError("Only stride OR dilation may be greater than 1")
return (kernel_size // 2) * dilation
class MaskedConv1d(nn.Conv1d):
"""1D convolution with sequence masking
"""
__constants__ = ["masked"]
def __init__(self, in_channels, out_channels, kernel_size, stride=1,
padding=0, dilation=1, groups=1, bias=False, masked=True):
super(MaskedConv1d, self).__init__(
in_channels, out_channels, kernel_size, stride=stride,
padding=padding, dilation=dilation, groups=groups, bias=bias)
self.masked = masked
def get_seq_len(self, lens):
return ((lens + 2 * self.padding[0] - self.dilation[0]
* (self.kernel_size[0] - 1) - 1) // self.stride[0] + 1)
def forward(self, x, x_lens=None):
if self.masked:
max_len = x.size(2)
idxs = torch.arange(max_len, dtype=x_lens.dtype, device=x_lens.device)
mask = idxs.expand(x_lens.size(0), max_len) >= x_lens.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
x_lens = self.get_seq_len(x_lens)
return super(MaskedConv1d, self).forward(x), x_lens
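# Worked example (illustrative): with kernel_size=11, stride=2, dilation=1 and
# 'same' padding (get_same_padding(11, 2, 1) == 5), an input of length 100 gives
#   get_seq_len(100) = (100 + 2*5 - 1*(11-1) - 1) // 2 + 1 = 50,
# and every time step at or beyond a sequence's true length is zeroed before the
# convolution, so padded frames cannot leak into valid outputs.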
class JasperBlock(nn.Module):
__constants__ = ["use_conv_masks"]
"""Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
"""
def __init__(self, infilters, filters, repeat=3, kernel_size=11, stride=1,
dilation=1, padding='same', dropout=0.2, activation=None,
residual=True, residual_panes=[], use_conv_masks=False):
super(JasperBlock, self).__init__()
assert padding == "same", "Only 'same' padding is supported."
padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
self.use_conv_masks = use_conv_masks
self.conv = nn.ModuleList()
for i in range(repeat):
self.conv.extend(self._conv_bn(infilters if i == 0 else filters,
filters,
kernel_size=kernel_size,
stride=stride,
dilation=dilation,
padding=padding_val))
if i < repeat - 1:
self.conv.extend(self._act_dropout(dropout, activation))
self.res = nn.ModuleList() if residual else None
res_panes = residual_panes.copy()
self.dense_residual = residual
if residual:
if len(residual_panes) == 0:
res_panes = [infilters]
self.dense_residual = False
for ip in res_panes:
self.res.append(nn.ModuleList(
self._conv_bn(ip, filters, kernel_size=1)))
self.out = nn.Sequential(*self._act_dropout(dropout, activation))
def _conv_bn(self, in_channels, out_channels, **kw):
return [MaskedConv1d(in_channels, out_channels,
masked=self.use_conv_masks, **kw),
nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)]
def _act_dropout(self, dropout=0.2, activation=None):
return [activation or nn.Hardtanh(min_val=0.0, max_val=20.0),
nn.Dropout(p=dropout)]
def forward(self, xs, xs_lens=None):
if not self.use_conv_masks:
xs_lens = 0
# forward convolutions
out = xs[-1]
lens = xs_lens
for i, l in enumerate(self.conv):
if isinstance(l, MaskedConv1d):
out, lens = l(out, lens)
else:
out = l(out)
# residuals
if self.res is not None:
for i, layer in enumerate(self.res):
res_out = xs[i]
for j, res_layer in enumerate(layer):
if j == 0: # and self.use_conv_mask:
res_out, _ = res_layer(res_out, xs_lens)
else:
res_out = res_layer(res_out)
out += res_out
# output
out = self.out(out)
if self.res is not None and self.dense_residual:
out = xs + [out]
else:
out = [out]
if self.use_conv_masks:
return out, lens
else:
return out, None
class JasperEncoder(nn.Module):
__constants__ = ["use_conv_masks"]
def __init__(self, in_feats, activation, frame_splicing=1,
init='xavier_uniform', use_conv_masks=False, blocks=[]):
super(JasperEncoder, self).__init__()
self.use_conv_masks = use_conv_masks
self.layers = nn.ModuleList()
in_feats *= frame_splicing
all_residual_panes = []
for i,blk in enumerate(blocks):
blk['activation'] = activations[activation]()
has_residual_dense = blk.pop('residual_dense', False)
if has_residual_dense:
all_residual_panes += [in_feats]
blk['residual_panes'] = all_residual_panes
else:
blk['residual_panes'] = []
self.layers.append(
JasperBlock(in_feats, use_conv_masks=use_conv_masks, **blk))
in_feats = blk['filters']
self.apply(lambda x: init_weights(x, mode=init))
def forward(self, x, x_lens=None):
out, out_lens = [x], x_lens
for l in self.layers:
out, out_lens = l(out, out_lens)
return out, out_lens
class JasperDecoderForCTC(nn.Module):
def __init__(self, in_feats, n_classes, init='xavier_uniform'):
super(JasperDecoderForCTC, self).__init__()
self.layers = nn.Sequential(
nn.Conv1d(in_feats, n_classes, kernel_size=1, bias=True),)
self.apply(lambda x: init_weights(x, mode=init))
def forward(self, enc_out):
out = self.layers(enc_out[-1]).transpose(1, 2)
return F.log_softmax(out, dim=2)
class GreedyCTCDecoder(nn.Module):
@torch.no_grad()
def forward(self, log_probs, log_prob_lens=None):
if log_prob_lens is not None:
max_len = log_probs.size(1)
idxs = torch.arange(max_len, dtype=log_prob_lens.dtype,
device=log_prob_lens.device)
mask = idxs.unsqueeze(0) >= log_prob_lens.unsqueeze(1)
log_probs[:,:,-1] = log_probs[:,:,-1].masked_fill(mask, float("Inf"))
return log_probs.argmax(dim=-1, keepdim=False).int()
class Jasper(nn.Module):
def __init__(self, encoder_kw, decoder_kw, transpose_in=False):
super(Jasper, self).__init__()
self.transpose_in = transpose_in
self.encoder = JasperEncoder(**encoder_kw)
self.decoder = JasperDecoderForCTC(**decoder_kw)
def forward(self, x, x_lens=None):
if self.encoder.use_conv_masks:
assert x_lens is not None
enc, enc_lens = self.encoder(x, x_lens)
out = self.decoder(enc)
return out, enc_lens
else:
if self.transpose_in:
x = x.transpose(1, 2)
enc, _ = self.encoder(x)
out = self.decoder(enc)
return out # torchscript refuses to output None
# TODO Explicitly add x_lens=None for inference (now x can be a Tensor or tuple)
def infer(self, x, x_lens=None):
if self.encoder.use_conv_masks:
return self.forward(x, x_lens)
else:
ret = self.forward(x)
return ret, len(ret)
class CTCLossNM:
def __init__(self, n_classes):
self._criterion = nn.CTCLoss(blank=n_classes-1, reduction='none')
def __call__(self, log_probs, targets, input_length, target_length):
input_length = input_length.long()
target_length = target_length.long()
targets = targets.long()
loss = self._criterion(log_probs.transpose(1, 0), targets, input_length,
target_length)
# note that this is different from reduction = 'mean'
# because we are not dividing by target lengths
return torch.mean(loss)
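
The closing comment in `CTCLossNM` is worth unpacking: averaging the per-utterance losses directly is not the same as PyTorch's `reduction='mean'`, which first divides each loss by its target length before averaging. The following standalone sketch (illustrative only, not part of the repository; the tensor shapes and values are arbitrary) makes the difference concrete. `log_probs` is built time-major here, so the transpose used in `CTCLossNM` is not needed.

```
# Standalone sketch: batch-mean of per-utterance CTC losses (as in CTCLossNM)
# vs. PyTorch's reduction='mean', which also divides by each target length.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_classes = 29                                                   # vocabulary + blank at index n_classes - 1
log_probs = torch.randn(50, 2, n_classes).log_softmax(dim=-1)    # (T, N, C), time-major
targets = torch.randint(0, n_classes - 1, (2, 20))
input_lengths = torch.tensor([50, 50])
target_lengths = torch.tensor([20, 7])

# reduction='none' + mean over the batch: every utterance contributes equally,
# regardless of how long its transcript is (this is what CTCLossNM computes).
per_utt = nn.CTCLoss(blank=n_classes - 1, reduction='none')(
    log_probs, targets, input_lengths, target_lengths)
print(per_utt.mean().item())

# reduction='mean': each per-utterance loss is divided by its target length first,
# so utterances with short transcripts are weighted up relative to long ones.
print(nn.CTCLoss(blank=n_classes - 1, reduction='mean')(
    log_probs, targets, input_lengths, target_lengths).item())
```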

View file

@ -1,423 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from apex import amp
import torch
import torch.nn as nn
from parts.features import FeatureFactory
import random
jasper_activations = {
"hardtanh": nn.Hardtanh,
"relu": nn.ReLU,
"selu": nn.SELU,
}
def init_weights(m, mode='xavier_uniform'):
if type(m) == nn.Conv1d or type(m) == MaskedConv1d:
if mode == 'xavier_uniform':
nn.init.xavier_uniform_(m.weight, gain=1.0)
elif mode == 'xavier_normal':
nn.init.xavier_normal_(m.weight, gain=1.0)
elif mode == 'kaiming_uniform':
nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
elif mode == 'kaiming_normal':
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
else:
raise ValueError("Unknown Initialization mode: {0}".format(mode))
elif type(m) == nn.BatchNorm1d:
if m.track_running_stats:
m.running_mean.zero_()
m.running_var.fill_(1)
m.num_batches_tracked.zero_()
if m.affine:
nn.init.ones_(m.weight)
nn.init.zeros_(m.bias)
def get_same_padding(kernel_size, stride, dilation):
if stride > 1 and dilation > 1:
raise ValueError("Only stride OR dilation may be greater than 1")
return (kernel_size // 2) * dilation
class AudioPreprocessing(nn.Module):
"""GPU accelerated audio preprocessing
"""
__constants__ = ["optim_level"]
def __init__(self, **kwargs):
nn.Module.__init__(self) # For PyTorch API
self.optim_level = kwargs.get('optimization_level', 0)
self.featurizer = FeatureFactory.from_config(kwargs)
self.transpose_out = kwargs.get("transpose_out", False)
@torch.no_grad()
def forward(self, input_signal, length):
processed_signal = self.featurizer(input_signal, length)
processed_length = self.featurizer.get_seq_len(length)
if self.transpose_out:
processed_signal.transpose_(2,1)
return processed_signal, processed_length
else:
return processed_signal, processed_length
class SpectrogramAugmentation(nn.Module):
"""Spectrogram augmentation
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self.spec_cutout_regions = SpecCutoutRegions(kwargs)
self.spec_augment = SpecAugment(kwargs)
@torch.no_grad()
def forward(self, input_spec):
augmented_spec = self.spec_cutout_regions(input_spec)
augmented_spec = self.spec_augment(augmented_spec)
return augmented_spec
class SpecAugment(nn.Module):
"""Spec augment. refer to https://arxiv.org/abs/1904.08779
"""
def __init__(self, cfg):
super(SpecAugment, self).__init__()
self.cutout_x_regions = cfg.get('cutout_x_regions', 0)
self.cutout_y_regions = cfg.get('cutout_y_regions', 0)
self.cutout_x_width = cfg.get('cutout_x_width', 10)
self.cutout_y_width = cfg.get('cutout_y_width', 10)
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool)
for idx in range(sh[0]):
for _ in range(self.cutout_x_regions):
cutout_x_left = int(random.uniform(0, sh[1] - self.cutout_x_width))
mask[idx, cutout_x_left:cutout_x_left + self.cutout_x_width, :] = 1
for _ in range(self.cutout_y_regions):
cutout_y_left = int(random.uniform(0, sh[2] - self.cutout_y_width))
mask[idx, :, cutout_y_left:cutout_y_left + self.cutout_y_width] = 1
x = x.masked_fill(mask.to(device=x.device), 0)
return x
class SpecCutoutRegions(nn.Module):
"""Cutout. refer to https://arxiv.org/pdf/1708.04552.pdf
"""
def __init__(self, cfg):
super(SpecCutoutRegions, self).__init__()
self.cutout_rect_regions = cfg.get('cutout_rect_regions', 0)
self.cutout_rect_time = cfg.get('cutout_rect_time', 5)
self.cutout_rect_freq = cfg.get('cutout_rect_freq', 20)
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool)
for idx in range(sh[0]):
for i in range(self.cutout_rect_regions):
cutout_rect_x = int(random.uniform(
0, sh[1] - self.cutout_rect_freq))
cutout_rect_y = int(random.uniform(
0, sh[2] - self.cutout_rect_time))
mask[idx, cutout_rect_x:cutout_rect_x + self.cutout_rect_freq,
cutout_rect_y:cutout_rect_y + self.cutout_rect_time] = 1
x = x.masked_fill(mask.to(device=x.device), 0)
return x
class JasperEncoder(nn.Module):
__constants__ = ["use_conv_mask"]
"""Jasper encoder
"""
def __init__(self, **kwargs):
cfg = {}
for key, value in kwargs.items():
cfg[key] = value
nn.Module.__init__(self)
self._cfg = cfg
activation = jasper_activations[cfg['encoder']['activation']]()
self.use_conv_mask = cfg['encoder'].get('convmask', False)
feat_in = cfg['input']['features'] * cfg['input'].get('frame_splicing', 1)
init_mode = cfg.get('init_mode', 'xavier_uniform')
residual_panes = []
encoder_layers = []
self.dense_residual = False
for lcfg in cfg['jasper']:
dense_res = []
if lcfg.get('residual_dense', False):
residual_panes.append(feat_in)
dense_res = residual_panes
self.dense_residual = True
encoder_layers.append(
JasperBlock(feat_in, lcfg['filters'], repeat=lcfg['repeat'],
kernel_size=lcfg['kernel'], stride=lcfg['stride'],
dilation=lcfg['dilation'], dropout=lcfg['dropout'],
residual=lcfg['residual'], activation=activation,
residual_panes=dense_res, use_conv_mask=self.use_conv_mask))
feat_in = lcfg['filters']
self.encoder = nn.Sequential(*encoder_layers)
self.apply(lambda x: init_weights(x, mode=init_mode))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, x):
if self.use_conv_mask:
audio_signal, length = x
return self.encoder(([audio_signal], length))
else:
return self.encoder([x])
class JasperDecoderForCTC(nn.Module):
"""Jasper decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self._feat_in = kwargs.get("feat_in")
self._num_classes = kwargs.get("num_classes")
init_mode = kwargs.get('init_mode', 'xavier_uniform')
self.decoder_layers = nn.Sequential(
nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),)
self.apply(lambda x: init_weights(x, mode=init_mode))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, encoder_output):
out = self.decoder_layers(encoder_output[-1]).transpose(1, 2)
return nn.functional.log_softmax(out, dim=2)
class JasperEncoderDecoder(nn.Module):
"""Contains jasper encoder and decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self.transpose_in=kwargs.get("transpose_in", False)
self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
num_classes=kwargs.get("num_classes"))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, x):
if self.jasper_encoder.use_conv_mask:
t_encoded_t, t_encoded_len_t = self.jasper_encoder(x)
else:
if self.transpose_in:
x = x.transpose(1, 2)
t_encoded_t = self.jasper_encoder(x)
out = self.jasper_decoder(t_encoded_t)
if self.jasper_encoder.use_conv_mask:
return out, t_encoded_len_t
else:
return out
def infer(self, x):
if self.jasper_encoder.use_conv_mask:
return self.forward(x)
else:
ret = self.forward(x[0])
return ret, len(ret)
class Jasper(JasperEncoderDecoder):
"""Contains data preprocessing, spectrogram augmentation, jasper encoder and decoder
"""
def __init__(self, **kwargs):
JasperEncoderDecoder.__init__(self, **kwargs)
feature_config = kwargs.get("feature_config")
if self.transpose_in:
feature_config["transpose"] = True
self.audio_preprocessor = AudioPreprocessing(**feature_config)
self.data_spectr_augmentation = SpectrogramAugmentation(**feature_config)
class MaskedConv1d(nn.Conv1d):
"""1D convolution with sequence masking
"""
__constants__ = ["use_conv_mask"]
def __init__(self, in_channels, out_channels, kernel_size, stride=1,
padding=0, dilation=1, groups=1, bias=False, use_conv_mask=True):
super(MaskedConv1d, self).__init__(in_channels, out_channels, kernel_size,
stride=stride,
padding=padding, dilation=dilation,
groups=groups, bias=bias)
self.use_conv_mask = use_conv_mask
def get_seq_len(self, lens):
return ((lens + 2 * self.padding[0] - self.dilation[0] * (
self.kernel_size[0] - 1) - 1) // self.stride[0] + 1)
def forward(self, inp):
if self.use_conv_mask:
x, lens = inp
max_len = x.size(2)
idxs = torch.arange(max_len).to(lens.dtype).to(lens.device).expand(len(lens), max_len)
mask = idxs >= lens.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
del mask
del idxs
lens = self.get_seq_len(lens)
return super(MaskedConv1d, self).forward(x), lens
else:
return super(MaskedConv1d, self).forward(inp)
class JasperBlock(nn.Module):
__constants__ = ["use_conv_mask", "conv"]
"""Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
"""
def __init__(self, inplanes, planes, repeat=3, kernel_size=11, stride=1,
dilation=1, padding='same', dropout=0.2, activation=None,
residual=True, residual_panes=[], use_conv_mask=False):
super(JasperBlock, self).__init__()
if padding != "same":
raise ValueError("currently only 'same' padding is supported")
padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
self.use_conv_mask = use_conv_mask
self.conv = nn.ModuleList()
inplanes_loop = inplanes
for _ in range(repeat - 1):
self.conv.extend(
self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
stride=stride, dilation=dilation,
padding=padding_val))
self.conv.extend(
self._get_act_dropout_layer(drop_prob=dropout, activation=activation))
inplanes_loop = planes
self.conv.extend(
self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
stride=stride, dilation=dilation,
padding=padding_val))
self.res = nn.ModuleList() if residual else None
res_panes = residual_panes.copy()
self.dense_residual = residual
if residual:
if len(residual_panes) == 0:
res_panes = [inplanes]
self.dense_residual = False
for ip in res_panes:
self.res.append(nn.ModuleList(
modules=self._get_conv_bn_layer(ip, planes, kernel_size=1)))
self.out = nn.Sequential(
*self._get_act_dropout_layer(drop_prob=dropout, activation=activation))
def _get_conv_bn_layer(self, in_channels, out_channels, kernel_size=11,
stride=1, dilation=1, padding=0, bias=False):
layers = [
MaskedConv1d(in_channels, out_channels, kernel_size, stride=stride,
dilation=dilation, padding=padding, bias=bias,
use_conv_mask=self.use_conv_mask),
nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)
]
return layers
def _get_act_dropout_layer(self, drop_prob=0.2, activation=None):
if activation is None:
activation = nn.Hardtanh(min_val=0.0, max_val=20.0)
layers = [
activation,
nn.Dropout(p=drop_prob)
]
return layers
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, input_):
if self.use_conv_mask:
xs, lens_orig = input_
else:
xs = input_
lens_orig = 0
# compute forward convolutions
out = xs[-1]
lens = lens_orig
for i, l in enumerate(self.conv):
if self.use_conv_mask and isinstance(l, MaskedConv1d):
out, lens = l((out, lens))
else:
out = l(out)
# compute the residuals
if self.res is not None:
for i, layer in enumerate(self.res):
res_out = xs[i]
for j, res_layer in enumerate(layer):
if j == 0 and self.use_conv_mask:
res_out, _ = res_layer((res_out, lens_orig))
else:
res_out = res_layer(res_out)
out += res_out
# compute the output
out = self.out(out)
if self.res is not None and self.dense_residual:
out = xs + [out]
else:
out = [out]
if self.use_conv_mask:
return out, lens
else:
return out
class GreedyCTCDecoder(nn.Module):
""" Greedy CTC Decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self) # For PyTorch API
@torch.no_grad()
def forward(self, log_probs):
argmx = log_probs.argmax(dim=-1, keepdim=False).int()
return argmx
class CTCLossNM:
""" CTC loss
"""
def __init__(self, **kwargs):
self._blank = kwargs['num_classes'] - 1
self._criterion = nn.CTCLoss(blank=self._blank, reduction='none')
def __call__(self, log_probs, targets, input_length, target_length):
input_length = input_length.long()
target_length = target_length.long()
targets = targets.long()
loss = self._criterion(log_probs.transpose(1, 0), targets, input_length,
target_length)
# note that this is different from reduction = 'mean'
# because we are not dividing by target lengths
return torch.mean(loss)

File diff suppressed because one or more lines are too long

View file

@ -1,274 +0,0 @@
{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png style=\"width: 90px; float: right;\">\n",
"\n",
"# Jasper inference using TensorRT Inference Server\n",
"This Jupyter notebook provides scripts to deploy high-performance inference in NVIDIA TensorRT Inference Server offering different options for the model backend, among others NVIDIA TensorRT. \n",
"Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference. \n",
"NVIDIA TensorRT Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server\n",
"NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.\n",
"## 1. Overview\n",
"\n",
"The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting strict real-time requirements of ASR systems in deployment.The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.\n",
"\n",
"The original paper is Jasper: An End-to-End Convolutional Neural Acoustic Model https://arxiv.org/pdf/1904.03288.pdf.\n",
"\n",
"### 1.1 Model architecture\n",
"By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.\n",
"Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout. \n",
"In the original paper Jasper is trained with masked convolutions, which masks out the padded part of an input sequence in a batch before the 1D-Convolution.\n",
"For inference masking is not used. The reason for this is that in inference, the original mask operation does not achieve better accuracy than without the mask operation on the test and development dataset. However, no masking achieves better inference performance especially after TensorRT optimization.\n",
"More information on the model architecture can be found in the [Jasper Pytorch README](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper)\n",
"\n",
"### 1.2 TensorRT Inference Server Overview\n",
"\n",
"A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:\n",
"1. Client serializes the inference request into a message and sends it to the server (Client Send)\n",
"2. Message travels over the network from the client to the server (Network)\n",
"3. Message arrives at server, and is deserialized (Server Receive)\n",
"4. Request is placed on the queue (Server Queue)\n",
"5. Request is removed from the queue and computed (Server Compute)\n",
"6. Completed request is serialized in a message and sent back to the client (Server Send)\n",
"7. Completed message travels over network from the server to the client (Network)\n",
"8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)\n",
"\n",
"Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time, compared to steps 5-6. As backend deep learning systems like Jasper are rarely exposed directly to end users, but instead only interfacing with local front-end servers, for the sake of Jasper, we can consider that all clients are local.\n",
"In this section, we will go over how to launch TensorRT Inference Server and client and get the best performant solution that fits your specific application needs.\n",
"\n",
"Note: The following instructions are run from outside the container and call `docker run` commands as required.\n",
"\n",
"### 1.3 Inference Pipeline in TensorRT Inference Server\n",
"The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend: \n",
"\n",
"**Data preprocessor**\n",
"\n",
"The data processor transforms an input raw audio file into a spectrogram. By default the pipeline uses mel filter banks as spectrogram features. This part does not have any learnable weights.\n",
"\n",
"**Acoustic model**\n",
"\n",
"The acoustic model takes in the spectrogram and outputs a probability over a list of characters. This part is the most compute intensive, taking more than 90% of the entire end-to-end pipeline. The acoustic model is the only component with learnable parameters and what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (More details in [Jasper PyTorch README](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/README.md)). We do not use masking for inference . \n",
"\n",
"**Greedy decoder**\n",
"\n",
"The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability. \n",
"\n",
"To run a model with TensorRT, we first construct the model in PyTorch, which is then exported into a ONNX static graph. Finally, a TensorRT engine is constructed from the ONNX file and can be launched to do inference. The following table shows which backends are supported for each part along the model pipeline.\n",
"\n",
"|Backend\\Pipeline component|Data preprocessor|Acoustic Model|Decoder|\n",
"|---|---|---|---|\n",
"|PyTorch JIT|x|x|x|\n",
"|ONNX|-|x|-|\n",
"|TensorRT|-|x|-|\n",
"\n",
"In order to run inference with TensorRT outside of the inference server, refer to the [Jasper TensorRT README](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/trt/README.md)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Learning objectives\n",
"\n",
"This notebook demonstrates:\n",
"- Speed up Jasper Inference with TensorRT in TensorRT Inference Server\n",
"- Use of Mixed Precision for Inference"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Requirements\n",
"\n",
"Please refer to Jasper TensorRT Inference Server README.md"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Jasper Inference\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Prepare Working Directory"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"if not 'workbookDir' in globals():\n",
" workbookDir = os.getcwd() + \"/../\"\n",
"print('workbookDir: ' + workbookDir)\n",
"os.chdir(workbookDir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Generate TRTIS Model Checkpoints\n",
"Use the PyTorch model checkpoint to generate all 3 model backends. You can find a pretrained checkpoint at https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16.\n",
"\n",
"Set the following parameters:\n",
"\n",
"* `ARCH`: hardware architecture. use 70 for Volta, 75 for Turing.\n",
"* `CHECKPOINT_DIR`: absolute path to model checkpoint directory.\n",
"* `CHECKPOINT`: model checkpoint name. (default: jasper10x5dr.pt)\n",
"* `PRECISION`: model precision. Default is using mixed precision.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env ARCH=70\n",
"# replace with absolute path to checkpoint directory, which should include CHECKPOINT file\n",
"%env CHECKPOINT_DIR=<CHECKPOINT_DIR> \n",
"# CHECKPOINT file name\n",
"%env CHECKPOINT=jasper_fp16.pt \n",
"%env PRECISION=fp16\n",
"!echo \"ARCH=${ARCH} CHECKPOINT_DIR=${CHECKPOINT_DIR} CHECKPOINT=${CHECKPOINT} PRECISION=${PRECISION} trtis/scripts/export_model.sh\"\n",
"!ARCH=${ARCH} CHECKPOINT_DIR=${CHECKPOINT_DIR} CHECKPOINT=${CHECKPOINT} PRECISION=${PRECISION} trtis/scripts/export_model.sh"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!bash trtis/scripts/prepare_model_repository.sh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.3 Start the TensorRT Inference Server using Docker"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!bash trtis/scripts/run_server.sh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.4. Start inference prediction in TRTIS\n",
"\n",
"Use the following script to run inference with TensorRT Inference Server.\n",
"You will need to set the parameters such as: \n",
"\n",
"\n",
"* `MODEL_TYPE`: Model pipeline type. Choose from [pyt, onnx, trt] for Pytorch JIT, ONNX, or TensorRT model pipeline.\n",
"* `DATA_DIR`: absolute path to directory with audio files\n",
"* `FILE`: relative path of audio file to `DATA_DIR`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_TYPE=\"trt\"\n",
"DATA_DIR=os.path.join(workbookDir, \"notebooks/\")\n",
"FILE=\"example1.wav\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!bash trtis/scripts/run_client.sh $MODEL_TYPE $DATA_DIR $FILE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can play with other examples from the 'notebooks' directory. You can also add your own audio files and generate the output text files in this way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.5. Stop your container in the end"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!docker stop jasper-trtis"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
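
The notebook above describes the greedy decoder as simply picking the character with the maximum probability in every frame; turning that frame-level output into a transcript additionally requires the standard CTC post-processing of collapsing repeated symbols and dropping blanks. Below is a minimal standalone sketch of that decoding step (illustrative only; the label set and blank index are hypothetical placeholders, not the repository's vocabulary).

```
# Greedy CTC decoding for a single utterance: per-frame argmax, then collapse
# repeats and remove the blank symbol. Not repository code; labels are made up.
import torch

labels = list("abcdefghijklmnopqrstuvwxyz' _")        # last symbol stands in for the CTC blank
blank = len(labels) - 1

def greedy_ctc_decode(log_probs):
    """log_probs: (T, C) per-frame log-probabilities for one utterance."""
    best = log_probs.argmax(dim=-1).tolist()           # most likely class per frame
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:               # collapse repeats, skip blanks
            out.append(labels[idx])
        prev = idx
    return "".join(out)

torch.manual_seed(0)
print(greedy_ctc_decode(torch.randn(20, len(labels)).log_softmax(dim=-1)))
```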

View file

@ -150,4 +150,54 @@ Use the token listed in the output from running the jupyter command to log in, f
## Jasper Jupyter Notebook for TensorRT Inference Server
This notebook can be executed from Google [Colab](https://colab.research.google.com) by supplying the notebook GitHub [URL](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/notebooks/Colab_Jasper_TRT_inference_demo.ipynb) or by opening this [link](https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/notebooks/Colab_Jasper_TRT_inference_demo.ipynb) directly.
### Requirements
`./trtis/` contains a Dockerfile which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) or [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [TensorRT Inference Server 19.09 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorrtserver)
* [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA cuda repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
### Quick Start Guide
#### 1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
#### 2. Build a container that extends NGC PyTorch 19.09, TensorRT, TensorRT Inference Server, and TensorRT Inference Client.
```
bash trtis/scripts/docker/build.sh
```
#### 3. Download the checkpoint
Download the checkpoint file jasper_fp16.pt from NGC Model Repository:
- https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16
to a user-specified directory _CHECKPOINT_DIR_
#### 4. Run the notebook
To run the notebook on your local machine, run:
```
jupyter notebook -- notebooks/JasperTRTIS.ipynb
```
To run the notebook remotely on another machine, run:
```
jupyter notebook --ip=0.0.0.0 --allow-root
```
Then navigate a web browser to the IP address or hostname of the host machine at port 8888: `http://[host machine]:8888`
Use the token listed in the output from running the jupyter command to log in, for example: `http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b`

View file

@ -1,368 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import math
import librosa
from .perturb import AudioAugmentor
from .segment import AudioSegment
from apex import amp
def audio_from_file(file_path, offset=0, duration=0, trim=False, target_sr=16000,
device=torch.device('cuda')):
audio = AudioSegment.from_file(file_path,
target_sr=target_sr,
int_values=False,
offset=offset, duration=duration, trim=trim)
samples=torch.tensor(audio.samples, dtype=torch.float, device=device)
num_samples = torch.tensor(samples.shape[0], device=device).int()
return (samples.unsqueeze(0), num_samples.unsqueeze(0))
class WaveformFeaturizer(object):
def __init__(self, input_cfg, augmentor=None):
self.augmentor = augmentor if augmentor is not None else AudioAugmentor()
self.cfg = input_cfg
def max_augmentation_length(self, length):
return self.augmentor.max_augmentation_length(length)
def process(self, file_path, offset=0, duration=0, trim=False):
audio = AudioSegment.from_file(file_path,
target_sr=self.cfg['sample_rate'],
int_values=self.cfg.get('int_values', False),
offset=offset, duration=duration, trim=trim)
return self.process_segment(audio)
def process_segment(self, audio_segment):
self.augmentor.perturb(audio_segment)
return torch.tensor(audio_segment.samples, dtype=torch.float)
@classmethod
def from_config(cls, input_config, perturbation_configs=None):
if perturbation_configs is not None:
aa = AudioAugmentor.from_config(perturbation_configs)
else:
aa = None
return cls(input_config, augmentor=aa)
# @torch.jit.script
# def normalize_batch_per_feature(x, seq_len):
# x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype, device=x.device)
# x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype, device=x.device)
# for i in range(x.shape[0]):
# x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
# x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
# # make sure x_std is not zero
# x_std += 1e-5
# return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
# @torch.jit.script
# def normalize_batch_all_features(x, seq_len):
# x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
# x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
# for i in range(x.shape[0]):
# x_mean[i] = x[i, :, :int(seq_len[i])].mean()
# x_std[i] = x[i, :, :int(seq_len[i])].std()
# # make sure x_std is not zero
# x_std += 1e-5
# return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
@torch.jit.script
def normalize_batch(x, seq_len, normalize_type: str):
# print ("normalize_batch: x, seq_len, shapes: ", x.shape, seq_len, seq_len.shape)
if normalize_type == "per_feature":
x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
for i in range(x.shape[0]):
x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
elif normalize_type == "all_features":
x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
for i in range(x.shape[0]):
x_mean[i] = x[i, :, :int(seq_len[i])].mean()
x_std[i] = x[i, :, :int(seq_len[i])].std()
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
else:
return x
@torch.jit.script
def splice_frames(x, frame_splicing: int):
""" Stacks frames together across feature dim
input is batch_size, feature_dim, num_frames
output is batch_size, feature_dim*frame_splicing, num_frames
"""
seq = [x]
# TORCHSCRIPT: JIT doesn't like range(start, stop)
for n in range(frame_splicing - 1):
seq.append(torch.cat([x[:, :, :n + 1], x[:, :, n + 1:]], dim=2))
return torch.cat(seq, dim=1)
class SpectrogramFeatures(nn.Module):
# For JIT. See https://pytorch.org/docs/stable/jit.html#python-defined-constants
__constants__ = ["dither", "preemph", "n_fft", "hop_length", "win_length", "log", "frame_splicing", "window", "normalize", "pad_to", "max_duration", "do_normalize"]
def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
n_fft=None,
window="hamming", normalize="per_feature", log=True,
dither=1e-5, pad_to=8, max_duration=16.7,
frame_splicing=1):
super(SpectrogramFeatures, self).__init__()
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.win_length = int(sample_rate * window_size)
self.hop_length = int(sample_rate * window_stride)
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
self.window = window_tensor
self.normalize = normalize
self.log = log
self.dither = dither
self.pad_to = pad_to
self.frame_splicing = frame_splicing
max_length = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
max_pad = 16 - (max_length % 16)
self.max_length = max_length + max_pad
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
@torch.no_grad()
def forward(self, x, seq_len):
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if hasattr(self,'preemph') and self.preemph is not None:
x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
dim=1)
# get spectrogram
x = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length,
window=self.window.to(torch.float))
x = torch.sqrt(x.pow(2).sum(-1))
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
x = splice_frames(x, self.frame_splicing)
# normalize if required
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len, dtype=seq_len.dtype).to(seq_len.device).expand(x.size(0), max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
# TORCHSCRIPT: Is this del important? It breaks scripting
#del mask
pad_to = self.pad_to
# TORCHSCRIPT: Can't have mixed types. Using pad_to < 0 for "max"
if pad_to < 0:
x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
elif pad_to > 0:
pad_amt = x.size(-1) % pad_to
if pad_amt != 0:
x = nn.functional.pad(x, (0, pad_to - pad_amt))
return x.to(dtype)
@classmethod
def from_config(cls, cfg, log=False):
return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
window_stride=cfg['window_stride'],
n_fft=cfg['n_fft'], window=cfg['window'],
normalize=cfg['normalize'],
max_duration=cfg.get('max_duration', 16.7),
dither=cfg.get('dither', 1e-5), pad_to=cfg.get("pad_to", 0),
frame_splicing=cfg.get("frame_splicing", 1), log=log)
constant=1e-5
class FilterbankFeatures(nn.Module):
# For JIT. See https://pytorch.org/docs/stable/jit.html#python-defined-constants
__constants__ = ["dither", "preemph", "n_fft", "hop_length", "win_length", "center", "log", "frame_splicing", "window", "normalize", "pad_to", "max_duration", "max_length"]
def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
window="hamming", normalize="per_feature", n_fft=None,
preemph=0.97,
nfilt=64, lowfreq=0, highfreq=None, log=True, dither=constant,
pad_to=8,
max_duration=16.7,
frame_splicing=1):
super(FilterbankFeatures, self).__init__()
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.win_length = int(sample_rate * window_size) # frame size
self.hop_length = int(sample_rate * window_stride)
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
self.normalize = normalize
self.log = log
# TORCHSCRIPT: Check whether or not we need this
self.dither = dither
self.frame_splicing = frame_splicing
self.nfilt = nfilt
self.preemph = preemph
self.pad_to = pad_to
highfreq = highfreq or sample_rate / 2
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
filterbanks = torch.tensor(
librosa.filters.mel(sample_rate, self.n_fft, n_mels=nfilt, fmin=lowfreq,
fmax=highfreq), dtype=torch.float).unsqueeze(0)
# self.fb = filterbanks
# self.window = window_tensor
self.register_buffer("fb", filterbanks)
self.register_buffer("window", window_tensor)
# Calculate maximum sequence length (# frames)
max_length = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
max_pad = 16 - (max_length % 16)
self.max_length = max_length + max_pad
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
# do stft
# TORCHSCRIPT: center removed due to bug
def stft(self, x):
return torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length,
window=self.window.to(dtype=torch.float))
def forward(self, x, seq_len):
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if self.preemph is not None:
x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
dim=1)
x = self.stft(x)
# get power spectrum
x = x.pow(2).sum(-1)
# dot with filterbank energies
x = torch.matmul(self.fb.to(x.dtype), x)
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
x = splice_frames(x, self.frame_splicing)
# normalize if required
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len, dtype=seq_len.dtype).to(x.device).expand(x.size(0),
max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1), 0)
# TORCHSCRIPT: Is this del important? It breaks scripting
# del mask
# TORCHSCRIPT: Can't have mixed types. Using pad_to < 0 for "max"
if self.pad_to < 0:
x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
elif self.pad_to > 0:
pad_amt = x.size(-1) % self.pad_to
# if pad_amt != 0:
x = nn.functional.pad(x, (0, self.pad_to - pad_amt))
return x # .to(dtype)
@classmethod
def from_config(cls, cfg, log=False):
return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
window_stride=cfg['window_stride'], n_fft=cfg['n_fft'],
nfilt=cfg['features'], window=cfg['window'],
normalize=cfg['normalize'],
max_duration=cfg.get('max_duration', 16.7),
dither=cfg['dither'], pad_to=cfg.get("pad_to", 0),
frame_splicing=cfg.get("frame_splicing", 1), log=log)
class FeatureFactory(object):
featurizers = {
"logfbank": FilterbankFeatures,
"fbank": FilterbankFeatures,
"stft": SpectrogramFeatures,
"logspect": SpectrogramFeatures,
"logstft": SpectrogramFeatures
}
def __init__(self):
pass
@classmethod
def from_config(cls, cfg):
feat_type = cfg.get('feat_type', "logspect")
featurizer = cls.featurizers[feat_type]
#return featurizer.from_config(cfg, log="log" in cfg['feat_type'])
return featurizer.from_config(cfg, log="log" in feat_type)
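
The deleted `FilterbankFeatures` above chains dithering, preemphasis, an STFT, a power spectrum, a mel filterbank, a log, and per-feature normalization. A compact NumPy/librosa sketch of the same data flow may be easier to follow than the TorchScript-constrained module; it is illustrative only, uses the module's default parameter values, and assumes the librosa 0.8-era call signatures that the original file relies on.

```
# Minimal sketch of the log-mel pipeline implemented by FilterbankFeatures above
# (illustrative only; not repository code).
import numpy as np
import librosa

def log_mel_features(samples, sample_rate=16000, window_size=0.02,
                     window_stride=0.01, n_mels=64, preemph=0.97, dither=1e-5):
    win_length = int(sample_rate * window_size)
    hop_length = int(sample_rate * window_stride)
    n_fft = 2 ** int(np.ceil(np.log2(win_length)))

    x = samples + dither * np.random.randn(len(samples))           # dither
    x = np.concatenate([x[:1], x[1:] - preemph * x[:-1]])          # preemphasis
    spec = librosa.stft(x, n_fft=n_fft, hop_length=hop_length,
                        win_length=win_length, window='hamming')
    power = np.abs(spec) ** 2                                       # power spectrum
    fb = librosa.filters.mel(sample_rate, n_fft, n_mels=n_mels)     # mel filterbank (0.8-era API)
    log_mel = np.log(fb @ power + 1e-20)
    # per-feature normalization over time, as in normalize_batch(..., "per_feature")
    mean = log_mel.mean(axis=1, keepdims=True)
    std = log_mel.std(axis=1, keepdims=True) + 1e-5
    return (log_mel - mean) / std                                    # (n_mels, frames)

# Toy usage on one second of random noise.
feats = log_mel_features(np.random.randn(16000).astype(np.float32))
print(feats.shape)
```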

View file

@ -1,170 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import re
import string
import numpy as np
import os
from .text import _clean_text
def normalize_string(s, labels, table, **unused_kwargs):
"""
Normalizes string. For example:
'call me at 8:00 pm!' -> 'call me at eight zero pm'
Args:
s: string to normalize
labels: labels used during model training.
Returns:
Normalized string
"""
def good_token(token, labels):
s = set(labels)
for t in token:
if not t in s:
return False
return True
try:
text = _clean_text(s, ["english_cleaners"], table).strip()
return ''.join([t for t in text if good_token(t, labels=labels)])
except:
print("WARNING: Normalizing {} failed".format(s))
return None
class Manifest(object):
def __init__(self, data_dir, manifest_paths, labels, blank_index, max_duration=None, pad_to_max=False,
min_duration=None, sort_by_duration=False, max_utts=0,
normalize=True, speed_perturbation=False, filter_speed=1.0):
self.labels_map = dict([(labels[i], i) for i in range(len(labels))])
self.blank_index = blank_index
self.max_duration= max_duration
ids = []
duration = 0.0
filtered_duration = 0.0
# If removing punctuation, make a list of punctuation to remove
table = None
if normalize:
# Punctuation to remove
punctuation = string.punctuation
punctuation = punctuation.replace("+", "")
punctuation = punctuation.replace("&", "")
### We might also want to consider:
### @ -> at
### # -> number, pound, hashtag
### ~ -> tilde
### _ -> underscore
### % -> percent
# If a punctuation symbol is inside our vocab, we do not remove from text
for l in labels:
punctuation = punctuation.replace(l, "")
# Turn all punctuation to whitespace
table = str.maketrans(punctuation, " " * len(punctuation))
for manifest_path in manifest_paths:
with open(manifest_path, "r", encoding="utf-8") as fh:
a=json.load(fh)
for data in a:
files_and_speeds = data['files']
if pad_to_max:
if not speed_perturbation:
min_speed = filter_speed
else:
min_speed = min(x['speed'] for x in files_and_speeds)
max_duration = self.max_duration * min_speed
data['duration'] = data['original_duration']
if min_duration is not None and data['duration'] < min_duration:
filtered_duration += data['duration']
continue
if max_duration is not None and data['duration'] > max_duration:
filtered_duration += data['duration']
continue
# Prune and normalize according to transcript
transcript_text = data[
'transcript'] if "transcript" in data else self.load_transcript(
data['text_filepath'])
if normalize:
transcript_text = normalize_string(transcript_text, labels=labels,
table=table)
if not isinstance(transcript_text, str):
print(
"WARNING: Got transcript: {}. It is not a string. Dropping data point".format(
transcript_text))
filtered_duration += data['duration']
continue
data["transcript"] = self.parse_transcript(transcript_text) # convert to vocab indices
if speed_perturbation:
audio_paths = [x['fname'] for x in files_and_speeds]
data['audio_duration'] = [x['duration'] for x in files_and_speeds]
else:
audio_paths = [x['fname'] for x in files_and_speeds if x['speed'] == filter_speed]
data['audio_duration'] = [x['duration'] for x in files_and_speeds if x['speed'] == filter_speed]
data['audio_filepath'] = [os.path.join(data_dir, x) for x in audio_paths]
data.pop('files')
data.pop('original_duration')
ids.append(data)
duration += data['duration']
if max_utts > 0 and len(ids) >= max_utts:
print(
'Stopping parsing %s as max_utts=%d' % (manifest_path, max_utts))
break
if sort_by_duration:
ids = sorted(ids, key=lambda x: x['duration'])
self._data = ids
self._size = len(ids)
self._duration = duration
self._filtered_duration = filtered_duration
def load_transcript(self, transcript_path):
with open(transcript_path, 'r', encoding="utf-8") as transcript_file:
transcript = transcript_file.read().replace('\n', '')
return transcript
def parse_transcript(self, transcript):
chars = [self.labels_map.get(x, self.blank_index) for x in list(transcript)]
transcript = list(filter(lambda x: x != self.blank_index, chars))
return transcript
def __getitem__(self, item):
return self._data[item]
def __len__(self):
return self._size
def __iter__(self):
return iter(self._data)
@property
def duration(self):
return self._duration
@property
def filtered_duration(self):
return self._filtered_duration
@property
def data(self):
return list(self._data)
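
The punctuation handling in `Manifest` above is easy to misread: punctuation characters that belong to the training vocabulary are kept, everything else is mapped to whitespace, and `normalize_string` then drops any remaining character outside the vocabulary. Below is a small standalone sketch (not the repository code; the vocabulary is a hypothetical placeholder, and a simple lowercase-and-collapse step stands in for `_clean_text` with `english_cleaners`).

```
# How the punctuation table and vocabulary filtering above work together.
import string

labels = list("abcdefghijklmnopqrstuvwxyz' ")          # hypothetical training vocabulary
vocab = set(labels)

punctuation = string.punctuation.replace("+", "").replace("&", "")  # as in Manifest, keep "+" and "&"
for l in labels:                                        # keep punctuation that is in the vocabulary
    punctuation = punctuation.replace(l, "")
table = str.maketrans(punctuation, " " * len(punctuation))

def normalize(text):
    text = text.lower().translate(table)                # stand-in for _clean_text(..., table)
    text = " ".join(text.split())                        # collapse whitespace
    return "".join(t for t in text if t in vocab)

print(normalize("Don't stop!"))                          # -> "don't stop"
```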

View file

@ -1,111 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import librosa
from .manifest import Manifest
from .segment import AudioSegment
class Perturbation(object):
def max_augmentation_length(self, length):
return length
def perturb(self, data):
raise NotImplementedError
class SpeedPerturbation(Perturbation):
def __init__(self, min_speed_rate=0.85, max_speed_rate=1.15, rng=None):
self._min_rate = min_speed_rate
self._max_rate = max_speed_rate
self._rng = random.Random() if rng is None else rng
def max_augmentation_length(self, length):
return length * self._max_rate
def perturb(self, data):
speed_rate = self._rng.uniform(self._min_rate, self._max_rate)
if speed_rate <= 0:
raise ValueError("speed_rate should be greater than zero.")
data._samples = librosa.effects.time_stretch(data._samples, speed_rate)
class GainPerturbation(Perturbation):
def __init__(self, min_gain_dbfs=-10, max_gain_dbfs=10, rng=None):
self._min_gain_dbfs = min_gain_dbfs
self._max_gain_dbfs = max_gain_dbfs
self._rng = random.Random() if rng is None else rng
def perturb(self, data):
gain = self._rng.uniform(self._min_gain_dbfs, self._max_gain_dbfs)
data._samples = data._samples * (10. ** (gain / 20.))
class ShiftPerturbation(Perturbation):
def __init__(self, min_shift_ms=-5.0, max_shift_ms=5.0, rng=None):
self._min_shift_ms = min_shift_ms
self._max_shift_ms = max_shift_ms
self._rng = random.Random() if rng is None else rng
def perturb(self, data):
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
if abs(shift_ms) / 1000 > data.duration:
# TODO: do something smarter than just ignore this condition
return
shift_samples = int(shift_ms * data.sample_rate // 1000)
# print("DEBUG: shift:", shift_samples)
if shift_samples < 0:
data._samples[-shift_samples:] = data._samples[:shift_samples]
data._samples[:-shift_samples] = 0
elif shift_samples > 0:
data._samples[:-shift_samples] = data._samples[shift_samples:]
data._samples[-shift_samples:] = 0
perturbation_types = {
"speed": SpeedPerturbation,
"gain": GainPerturbation,
"shift": ShiftPerturbation,
}
class AudioAugmentor(object):
def __init__(self, perturbations=None, rng=None):
self._rng = random.Random() if rng is None else rng
self._pipeline = perturbations if perturbations is not None else []
def perturb(self, segment):
for (prob, p) in self._pipeline:
if self._rng.random() < prob:
p.perturb(segment)
return
def max_augmentation_length(self, length):
newlen = length
for (prob, p) in self._pipeline:
newlen = p.max_augmentation_length(newlen)
return newlen
@classmethod
def from_config(cls, config):
ptbs = []
for p in config:
if p['aug_type'] not in perturbation_types:
print(p['aug_type'], "perturbation not known. Skipping.")
continue
perturbation = perturbation_types[p['aug_type']]
ptbs.append((p['prob'], perturbation(**p['cfg'])))
return cls(perturbations=ptbs)
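
Each perturbation above is applied independently with a configured probability, and the gain perturbation converts a random offset in dBFS into a linear amplitude factor of 10 ** (gain / 20). The following standalone sketch shows that probability-gated pipeline (illustrative only, not the repository code; the shift variant here rolls the signal instead of zero-filling as the original does).

```
# Probability-gated augmentation pipeline in the spirit of AudioAugmentor above.
import random
import numpy as np

def gain_perturb(samples, min_gain_dbfs=-10, max_gain_dbfs=10):
    gain_db = random.uniform(min_gain_dbfs, max_gain_dbfs)
    return samples * (10.0 ** (gain_db / 20.0))          # e.g. +6 dB is roughly a 2x amplitude

def shift_perturb(samples, sample_rate=16000, min_shift_ms=-5.0, max_shift_ms=5.0):
    shift = int(random.uniform(min_shift_ms, max_shift_ms) * sample_rate // 1000)
    return np.roll(samples, shift)                       # simplified: roll instead of zero-fill

pipeline = [(0.5, gain_perturb), (0.5, shift_perturb)]

samples = np.random.randn(16000).astype(np.float32)      # one second of dummy audio
for prob, perturb in pipeline:                            # mirrors AudioAugmentor.perturb
    if random.random() < prob:
        samples = perturb(samples)
```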

View file

@ -1,12 +0,0 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """
import re
from . import cleaners
def _clean_text(text, cleaner_names, *args):
for name in cleaner_names:
cleaner = getattr(cleaners, name)
if not cleaner:
raise Exception('Unknown cleaner: %s' % name)
text = cleaner(text, *args)
return text

View file

@ -1,3 +1,3 @@
#!/bin/bash
NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRADIENT_ACCUMULATION_STEPS=2 bash scripts/train.sh "$@"
NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRADIENT_ACCUMULATION_STEPS=4 bash scripts/train.sh "$@"

View file

@ -1,9 +1,10 @@
pandas==0.24.2
tqdm==4.31.1
ascii-graph==1.5.1
wrapt==1.10.11
librosa
toml
soundfile
ipdb
sox
librosa==0.8.0
pandas==1.1.4
pycuda==2020.1
pyyaml
soundfile
sox==1.4.1
tqdm==4.53.0
wrapt==1.10.11

View file

@ -1,26 +1,30 @@
#!/bin/bash
SCRIPT_DIR=$(cd $(dirname $0); pwd)
JASPER_REPO=${JASPER_REPO:-"${SCRIPT_DIR}/../.."}
: ${JASPER_REPO:="$SCRIPT_DIR/../.."}
# Launch TRT JASPER container.
: ${DATA_DIR:=${1:-"$JASPER_REPO/datasets"}}
: ${CHECKPOINT_DIR:=${2:-"$JASPER_REPO/checkpoints"}}
: ${OUTPUT_DIR:=${3:-"$JASPER_REPO/results"}}
: ${SCRIPT:=${4:-}}
DATA_DIR=${1:-${DATA_DIR-"/datasets"}}
CHECKPOINT_DIR=${2:-${CHECKPOINT_DIR:-"/checkpoints"}}
RESULT_DIR=${3:-${RESULT_DIR:-"/results"}}
PROGRAM_PATH=${PROGRAM_PATH}
mkdir -p $DATA_DIR
mkdir -p $CHECKPOINT_DIR
mkdir -p $OUTPUT_DIR
MOUNTS=""
MOUNTS+=" -v $DATA_DIR:/datasets"
MOUNTS+=" -v $CHECKPOINT_DIR:/checkpoints"
MOUNTS+=" -v $RESULT_DIR:/results"
MOUNTS+=" -v ${JASPER_REPO}:/jasper"
MOUNTS+=" -v $OUTPUT_DIR:/results"
MOUNTS+=" -v $JASPER_REPO:/workspace/jasper"
echo $MOUNTS
nvidia-docker run -it --rm \
--runtime=nvidia \
docker run -it --rm --gpus all \
--env PYTHONDONTWRITEBYTECODE=1 \
--shm-size=4g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
${MOUNTS} \
${EXTRA_JASPER_ENV} \
jasper:latest bash $PROGRAM_PATH
$MOUNTS \
$EXTRA_JASPER_ENV \
-w /workspace/jasper \
jasper:latest bash $SCRIPT

View file

@ -14,58 +14,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
set -a
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
: ${PREDICTION_FILE:=}
: ${DATASET:="test-other"}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${7:-${CUDNN_BENCHMARK:-"false"}}
NUM_GPUS=${8:-${NUM_GPUS:-1}}
AMP=${9:-${AMP:-"false"}}
NUM_STEPS=${10:-${NUM_STEPS:-"-1"}}
SEED=${11:-${SEED:-0}}
BATCH_SIZE=${12:-${BATCH_SIZE:-64}}
mkdir -p "$RESULT_DIR"
CMD=" inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
CMD+=" --ckpt $CHECKPOINT "
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
if [ "$NUM_GPUS" -gt 1 ] ; then
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
CMD="python3 $CMD"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
fi
set +x
bash ./scripts/inference.sh "$@"

View file

@ -14,66 +14,48 @@
# See the License for the specific language governing permissions and
# limitations under the License.
: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${CHECKPOINT:=${4:-"/checkpoints/jasper_fp16.pt"}}
: ${DATASET:="test-other"}
: ${LOG_FILE:=""}
: ${CUDNN_BENCHMARK:=false}
: ${MAX_DURATION:=""}
: ${PAD_TO_MAX_DURATION:=false}
: ${NUM_GPUS:=1}
: ${NUM_STEPS:=0}
: ${NUM_WARMUP_STEPS:=0}
: ${AMP:=false}
: ${BATCH_SIZE:=64}
: ${EMA:=true}
: ${SEED:=0}
: ${DALI_DEVICE:="gpu"}
: ${CPU:=false}
: ${LOGITS_FILE:=}
: ${PREDICTION_FILE:="${OUTPUT_DIR}/${DATASET}.predictions"}
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
mkdir -p "$OUTPUT_DIR"
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${7:-${CUDNN_BENCHMARK:-"false"}}
AMP=${8:-${AMP:-"false"}}
NUM_STEPS=${9:-${NUM_STEPS:-"-1"}}
SEED=${10:-${SEED:-0}}
BATCH_SIZE=${11:-${BATCH_SIZE:-64}}
LOGITS_FILE=${12:-${LOGITS_FILE:-""}}
PREDICTION_FILE=${13:-${PREDICTION_FILE:-"${RESULT_DIR}/${DATASET}.predictions"}}
CPU=${14:-${CPU:-"false"}}
EMA=${14:-${EMA:-"false"}}
ARGS="--dataset_dir=$DATA_DIR"
ARGS+=" --val_manifest=$DATA_DIR/librispeech-${DATASET}-wav.json"
ARGS+=" --model_config=$MODEL_CONFIG"
ARGS+=" --output_dir=$OUTPUT_DIR"
ARGS+=" --batch_size=$BATCH_SIZE"
ARGS+=" --seed=$SEED"
ARGS+=" --dali_device=$DALI_DEVICE"
ARGS+=" --steps $NUM_STEPS"
ARGS+=" --warmup_steps $NUM_WARMUP_STEPS"
mkdir -p "$RESULT_DIR"
[ "$AMP" = true ] && ARGS+=" --amp"
[ "$EMA" = true ] && ARGS+=" --ema"
[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=${CHECKPOINT}"
[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
[ -n "$PREDICTION_FILE" ] && ARGS+=" --save_prediction $PREDICTION_FILE"
[ -n "$LOGITS_FILE" ] && ARGS+=" --logits_save_to $LOGITS_FILE"
[ "$CPU" == "true" ] && ARGS+=" --cpu"
[ -n "$MAX_DURATION" ] && ARGS+=" --max_duration $MAX_DURATION"
[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --pad_to_max_duration"
CMD="python inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$CPU" == "true" ] && \
CMD+=" --cpu"
[ "$EMA" == "true" ] && \
CMD+=" --ema"
[ -n "$CHECKPOINT" ] && \
CMD+=" --ckpt=${CHECKPOINT}"
[ -n "$PREDICTION_FILE" ] && \
CMD+=" --save_prediction $PREDICTION_FILE"
[ -n "$LOGITS_FILE" ] && \
CMD+=" --logits_save_to $LOGITS_FILE"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
fi
set +x
[ -n "$PREDICTION_FILE" ] && echo "PREDICTION_FILE: ${PREDICTION_FILE}"
[ -n "$LOGITS_FILE" ] && echo "LOGITS_FILE: ${LOGITS_FILE}"
python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS inference.py $ARGS
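A minimal example of driving the refactored script through its environment-variable interface (values are illustrative only, not defaults mandated by this change):
```bash
# Evaluate the EMA checkpoint on test-other with AMP on a single GPU
DATASET=test-other \
BATCH_SIZE=32 \
AMP=true \
CHECKPOINT=/checkpoints/jasper_fp16.pt \
bash ./scripts/inference.sh
```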

View file

@ -14,55 +14,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
set -a
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${CUDNN_BENCHMARK:=true}
: ${PAD_TO_MAX_DURATION:=true}
: ${NUM_WARMUP_STEPS:=10}
: ${NUM_STEPS:=500}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${7:-${CUDNN_BENCHMARK:-"true"}}
AMP=${8:-${AMP:-"false"}}
NUM_STEPS=${9:-${NUM_STEPS:-"-1"}}
MAX_DURATION=${10:-${MAX_DURATION:-"36"}}
SEED=${11:-${SEED:-0}}
BATCH_SIZE=${12:-${BATCH_SIZE:-64}}
: ${AMP:=false}
: ${DALI_DEVICE:="cpu"}
: ${BATCH_SIZE_SEQ:="1 2 4 8 16"}
: ${MAX_DURATION_SEQ:="2 7 16.7"}
mkdir -p "$RESULT_DIR"
for MAX_DURATION in $MAX_DURATION_SEQ; do
for BATCH_SIZE in $BATCH_SIZE_SEQ; do
CMD=" python inference_benchmark.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --seed=$SEED"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt=$CHECKPOINT"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to=-1"
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
LOG_FILE="$OUTPUT_DIR/perf-infer_dali-${DALI_DEVICE}_amp-${AMP}_dur${MAX_DURATION}_bs${BATCH_SIZE}.json"
bash ./scripts/inference.sh "$@"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE )
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
grep 'latency' "$LOGFILE"
fi
set +x
done
done

View file

@ -1,66 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
CUDA_VISIBLE_DEVICES=""
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
NUM_STEPS=${7:-${NUM_STEPS:-"-1"}}
MAX_DURATION=${8:-${MAX_DURATION:-"36"}}
SEED=${9:-${SEED:-0}}
BATCH_SIZE=${10:-${BATCH_SIZE:-32}}
SAMPLE_AUDIO=${11:-${SAMPLE_AUDIO:-"/datasets/LibriSpeech/dev-clean-wav/1272/128104/1272-128104-0000.wav"}}
mkdir -p "$RESULT_DIR"
CMD=" python inference_benchmark.py"
CMD+=" --cpu"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --seed=$SEED"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt=$CHECKPOINT"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to=-1"
CMD+=" --sample_audio=$SAMPLE_AUDIO"
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE )
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
grep 'latency' "$LOGFILE"
fi
set +x

View file

@ -14,21 +14,24 @@
#!/usr/bin/env bash
SPEEDS=$1
[ -n "$SPEEDS" ] && SPEED_FLAG="--speed $SPEEDS"
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-clean-100 \
--dest_dir /datasets/LibriSpeech/train-clean-100-wav \
--output_json /datasets/LibriSpeech/librispeech-train-clean-100-wav.json \
--speed 0.9 1.1
$SPEED_FLAG
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-clean-360 \
--dest_dir /datasets/LibriSpeech/train-clean-360-wav \
--output_json /datasets/LibriSpeech/librispeech-train-clean-360-wav.json \
--speed 0.9 1.1
$SPEED_FLAG
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-other-500 \
--dest_dir /datasets/LibriSpeech/train-other-500-wav \
--output_json /datasets/LibriSpeech/librispeech-train-other-500-wav.json \
--speed 0.9 1.1
$SPEED_FLAG
python ./utils/convert_librispeech.py \

View file

@ -14,70 +14,74 @@
# See the License for the specific language governing permissions and
# limitations under the License.
export OMP_NUM_THREADS=1
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${CHECKPOINT:=${4:-}}
: ${RESUME:=true}
: ${CUDNN_BENCHMARK:=true}
: ${NUM_GPUS:=8}
: ${AMP:=false}
: ${BATCH_SIZE:=64}
: ${GRAD_ACCUMULATION_STEPS:=2}
: ${LEARNING_RATE:=0.01}
: ${MIN_LEARNING_RATE:=0.00001}
: ${LR_POLICY:=exponential}
: ${LR_EXP_GAMMA:=0.981}
: ${EMA:=0.999}
: ${SEED:=0}
: ${EPOCHS:=440}
: ${WARMUP_EPOCHS:=2}
: ${HOLD_EPOCHS:=140}
: ${SAVE_FREQUENCY:=10}
: ${EPOCHS_THIS_JOB:=0}
: ${DALI_DEVICE:="gpu"}
: ${PAD_TO_MAX_DURATION:=false}
: ${EVAL_FREQUENCY:=544}
: ${PREDICTION_FREQUENCY:=544}
: ${TRAIN_MANIFESTS:="$DATA_DIR/librispeech-train-clean-100-wav.json \
$DATA_DIR/librispeech-train-clean-360-wav.json \
$DATA_DIR/librispeech-train-other-500-wav.json"}
: ${VAL_MANIFESTS:="$DATA_DIR/librispeech-dev-clean-wav.json"}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
MODEL_CONFIG=${2:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${3:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${4:-${CHECKPOINT:-""}}
CREATE_LOGFILE=${5:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${6:-${CUDNN_BENCHMARK:-"true"}}
NUM_GPUS=${7:-${NUM_GPUS:-8}}
AMP=${8:-${AMP:-"false"}}
EPOCHS=${9:-${EPOCHS:-400}}
SEED=${10:-${SEED:-6}}
BATCH_SIZE=${11:-${BATCH_SIZE:-64}}
LEARNING_RATE=${12:-${LEARNING_RATE:-"0.015"}}
GRADIENT_ACCUMULATION_STEPS=${13:-${GRADIENT_ACCUMULATION_STEPS:-2}}
EMA=${EMA:-0.999}
SAVE_FREQUENCY=${SAVE_FREQUENCY:-10}
mkdir -p "$OUTPUT_DIR"
mkdir -p "$RESULT_DIR"
ARGS="--dataset_dir=$DATA_DIR"
ARGS+=" --val_manifests $VAL_MANIFESTS"
ARGS+=" --train_manifests $TRAIN_MANIFESTS"
ARGS+=" --model_config=$MODEL_CONFIG"
ARGS+=" --output_dir=$OUTPUT_DIR"
ARGS+=" --lr=$LEARNING_RATE"
ARGS+=" --batch_size=$BATCH_SIZE"
ARGS+=" --min_lr=$MIN_LEARNING_RATE"
ARGS+=" --lr_policy=$LR_POLICY"
ARGS+=" --lr_exp_gamma=$LR_EXP_GAMMA"
ARGS+=" --epochs=$EPOCHS"
ARGS+=" --warmup_epochs=$WARMUP_EPOCHS"
ARGS+=" --hold_epochs=$HOLD_EPOCHS"
ARGS+=" --epochs_this_job=$EPOCHS_THIS_JOB"
ARGS+=" --ema=$EMA"
ARGS+=" --seed=$SEED"
ARGS+=" --optimizer=novograd"
ARGS+=" --weight_decay=1e-3"
ARGS+=" --save_frequency=$SAVE_FREQUENCY"
ARGS+=" --keep_milestones 100 200 300 400"
ARGS+=" --save_best_from=380"
ARGS+=" --log_frequency=1"
ARGS+=" --eval_frequency=$EVAL_FREQUENCY"
ARGS+=" --prediction_frequency=$PREDICTION_FREQUENCY"
ARGS+=" --grad_accumulation_steps=$GRAD_ACCUMULATION_STEPS "
ARGS+=" --dali_device=$DALI_DEVICE"
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
CMD+=" train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=$EPOCHS"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --ema=$EMA"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json"
CMD+=",$DATA_DIR/librispeech-train-clean-360-wav.json"
CMD+=",$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=$SAVE_FREQUENCY"
CMD+=" --eval_freq=100"
CMD+=" --train_freq=1"
CMD+=" --lr_decay"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS "
[ "$AMP" = true ] && ARGS+=" --amp"
[ "$RESUME" = true ] && ARGS+=" --resume"
[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --pad_to_max_duration"
[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=$CHECKPOINT"
[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
[ -n "$PRE_ALLOCATE" ] && ARGS+=" --pre_allocate_range $PRE_ALLOCATE"
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
[ -n "$CHECKPOINT" ] && \
CMD+=" --ckpt=${CHECKPOINT}"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULT_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee $LOGFILE
fi
set +x
DISTRIBUTED="-m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
python $DISTRIBUTED train.py $ARGS
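For reference, the refactored training script is configured the same way; a hedged example overriding a few of the environment defaults above (values illustrative only):
```bash
# Train with AMP on 4 GPUs, compensating with larger gradient accumulation
NUM_GPUS=4 \
AMP=true \
BATCH_SIZE=64 \
GRAD_ACCUMULATION_STEPS=4 \
bash ./scripts/train.sh
```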

View file

@ -14,100 +14,36 @@
# See the License for the specific language governing permissions and
# limitations under the License.
set -a
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
# measure on speed-perturbed data, but only so slightly that the fbank length remains the same
# with pad_to_max_duration, this reduces cuDNN benchmark's burn-in period to a single step
: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${TRAIN_MANIFESTS:="$DATA_DIR/librispeech-train-clean-100-wav.json"}
SCRIPT_DIR=$(cd $(dirname $0); pwd)
PROJECT_DIR=${SCRIPT_DIR}/..
# run for a number of epochs, but don't finalize the training
: ${EPOCHS_THIS_JOB:=2}
: ${EPOCHS:=100000}
: ${RESUME:=false}
: ${SAVE_FREQUENCY:=100000}
: ${EVAL_FREQUENCY:=100000}
: ${GRAD_ACCUMULATION_STEPS:=1}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
MODEL_CONFIG=${2:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${3:-${RESULT_DIR:-"/results"}}
CREATE_LOGFILE=${4:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${5:-${CUDNN_BENCHMARK:-"true"}}
NUM_GPUS=${6:-${NUM_GPUS:-8}}
AMP=${7:-${AMP:-"false"}}
NUM_STEPS=${8:-${NUM_STEPS:-"-1"}}
MAX_DURATION=${9:-${MAX_DURATION:-16.7}}
SEED=${10:-${SEED:-0}}
BATCH_SIZE=${11:-${BATCH_SIZE:-32}}
LEARNING_RATE=${12:-${LEARNING_RATE:-"0.015"}}
GRADIENT_ACCUMULATION_STEPS=${13:-${GRADIENT_ACCUMULATION_STEPS:-1}}
PRINT_FREQUENCY=${14:-${PRINT_FREQUENCY:-1}}
USE_PROFILER=${USE_PROFILER:-"false"}
: ${AMP:=false}
: ${EMA:=0}
: ${DALI_DEVICE:="gpu"}
: ${NUM_GPUS_SEQ:="1 4 8"}
: ${BATCH_SIZE_SEQ:="32"}
# A probable range of batch lengths for LibriSpeech
# with BS=64 and continuous speed perturbation (0.85, 1.15)
: ${PRE_ALLOCATE:="1408 1920"}
mkdir -p "$RESULT_DIR"
for NUM_GPUS in $NUM_GPUS_SEQ; do
for BATCH_SIZE in $BATCH_SIZE_SEQ; do
[ "${USE_PROFILER}" = "true" ] && PYTHON_ARGS="-m cProfile -s cumtime"
LOG_FILE="$OUTPUT_DIR/perf-train_dali-${DALI_DEVICE}_amp-${AMP}_ngpus${NUM_GPUS}_bs${BATCH_SIZE}.json"
bash ./scripts/train.sh "$@"
CMD="${PYTHON_ARGS} ${PROJECT_DIR}/train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=400"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json,"
CMD+="$DATA_DIR/librispeech-train-clean-360-wav.json,"
CMD+="$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=100000"
CMD+=" --eval_freq=100000"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to_max"
CMD+=" --train_freq=$PRINT_FREQUENCY"
CMD+=" --lr_decay "
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
[ "$NUM_STEPS" -gt 1 ] && \
CMD+=" --num_steps=$NUM_STEPS"
if [ "$NUM_GPUS" -gt 1 ] ; then
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
CMD="python3 $CMD"
fi
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
if [ -z "$LOGFILE" ] ; then
set -x
$CMD
set +x
else
set -x
(
$CMD
) |& tee "$LOGFILE"
set +x
mean_latency=`cat "$LOGFILE" | grep 'Step time' | awk '{print $3}' | tail -n +2 | egrep -o '[0-9.]+'| awk 'BEGIN {total=0} {total+=$1} END {printf("%.2f\n",total/NR)}'`
mean_throughput=`python -c "print($BATCH_SIZE*$NUM_GPUS/${mean_latency})"`
training_wer_per_gpu=`cat "$LOGFILE" | grep 'training_batch_WER'| awk '{print $2}' | tail -n 1 | egrep -o '[0-9.]+'`
training_loss_per_gpu=`cat "$LOGFILE" | grep 'Loss@Step'| awk '{print $4}' | tail -n 1 | egrep -o '[0-9.]+'`
final_eval_wer=`cat "$LOGFILE" | grep 'Evaluation WER'| tail -n 1 | egrep -o '[0-9.]+'`
final_eval_loss=`cat "$LOGFILE" | grep 'Evaluation Loss'| tail -n 1 | egrep -o '[0-9.]+'`
echo "max duration: $MAX_DURATION s" | tee -a "$LOGFILE"
echo "mean_latency: $mean_latency s" | tee -a "$LOGFILE"
echo "mean_throughput: $mean_throughput sequences/s" | tee -a "$LOGFILE"
echo "training_wer_per_pgu: $training_wer_per_pgu" | tee -a "$LOGFILE"
echo "training_loss_per_pgu: $training_loss_per_pgu" | tee -a "$LOGFILE"
echo "final_eval_loss: $final_eval_loss" | tee -a "$LOGFILE"
echo "final_eval_wer: $final_eval_wer" | tee -a "$LOGFILE"
fi
done
done

View file

@ -1,14 +0,0 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.08-py3
FROM ${FROM_IMAGE_NAME}
# Here's a good place to install pip reqs from JoC repo.
# At the same step, also install TRT pip reqs
WORKDIR /tmp/pipReqs
COPY requirements.txt /tmp/pipReqs/jocRequirements.txt
COPY tensorrt/requirements.txt /tmp/pipReqs/trtRequirements.txt
RUN pip install --disable-pip-version-check -U -r jocRequirements.txt -r trtRequirements.txt
WORKDIR /workspace/jasper
COPY . .

View file

@ -1,300 +0,0 @@
# Jasper Inference For TensorRT
This is a subfolder of the Jasper for PyTorch repository, tested and maintained by NVIDIA, that provides scripts to perform high-performance inference using NVIDIA TensorRT. Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference. More information about Jasper and its training can be found in the [Jasper PyTorch README](../README.md).
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [TensorRT Inference pipeline](#tensorrt-inference-pipeline)
* [Version Info](#version-info)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [TensorRT Inference Benchmark Process](#tensorrt-inference-benchmark-process)
* [TensorRT Inference Process](#tensorrt-inference-process)
- [Performance](#performance)
* [Results](#results)
* [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
## Model overview
### Model architecture
By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
In the original paper, Jasper is trained with masked convolutions, which mask out the padded part of an input sequence in a batch before the 1D-Convolution.
For inference, masking is not used: on the development and test datasets it does not improve accuracy, while dropping it yields better inference performance, especially after TensorRT optimization.
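For orientation only, a single sub-block can be sketched in PyTorch roughly as follows; this is a minimal illustration of the Conv-BN-ReLU-Dropout pattern described above, not the exact module used in this repository, and the channel counts, kernel size, and dropout rate are placeholders:
```python
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """One Jasper sub-block: 1D-Convolution -> BatchNorm -> ReLU -> Dropout."""
    def __init__(self, in_channels, out_channels, kernel_size, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)  # keep the time dimension
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, channels, time)
        return self.drop(self.relu(self.bn(self.conv(x))))
```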
### TensorRT Inference pipeline
The Jasper inference pipeline consists of three components: data preprocessor, acoustic model, and greedy decoder. The acoustic model is the most compute-intensive, accounting for more than 90% of the end-to-end runtime. It is also the only component with learnable parameters and the part that differentiates Jasper from the competition, so we focus mostly on the acoustic model.
For the non-TensorRT Jasper inference pipeline, all 3 components are implemented and run with native PyTorch. For the TensorRT inference pipeline, we show the speedup of running the acoustic model with TensorRT, while preprocessing and decoding are reused from the native PyTorch pipeline.
To run a model with TensorRT, we first construct the model in PyTorch and export it to an ONNX file. A TensorRT engine is then constructed from the ONNX file, serialized to a TensorRT engine file, and launched to do inference.
Note that the TensorRT engine is runtime-optimized before serialization: TensorRT tries a vast set of options to find the strategy that performs best on the user's GPU, so building takes a few minutes. After the TensorRT engine file is created, it can be reused.
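A minimal sketch of this export-and-build flow is shown below, assuming `acoustic_model` is the already-constructed PyTorch Jasper model and using the TensorRT 7.x Python API listed under Version Info; the input/output names and opset mirror `tensorrt/perfutils.py`, while the shapes, file names, and workspace size are illustrative (the repository's own scripts additionally handle dynamic shapes and FP16):
```python
import torch
import tensorrt as trt

# 1. Export the PyTorch acoustic model to ONNX (batch, n_filters, frames are placeholders).
dummy_features = torch.zeros(1, 64, 512, device="cuda")
torch.onnx.export(acoustic_model, dummy_features, "jasper.onnx",
                  input_names=["FEATURES"], output_names=["LOGITS"],
                  opset_version=10)

# 2. Parse the ONNX file and build a TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("jasper.onnx", "rb") as f:
    parser.parse(f.read())
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB of workspace for tactic selection
engine = builder.build_engine(network, config)  # runtime optimization happens here

# 3. Serialize the engine so it can be reused without rebuilding.
with open("jasper.plan", "wb") as f:
    f.write(engine.serialize())
```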
### Version Info
The following software version configuration has been tested and known to work:
|Software|Version|
|--------|-------|
|Python|3.6.10|
|PyTorch|1.7.0a0+8deb4fe|
|TensorRT|7.1.3.4|
|CUDA|11.0.221|
## Setup
The following section lists the requirements in order to start inference on the Jasper model with TensorRT.
### Requirements
This repository contains a `Dockerfile` which extends the PyTorch 20.08-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.08-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* NVIDIA [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/), [Turing](https://www.nvidia.com/en-us/geforce/turing/), or [Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/) based GPU
* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
Required Python packages are listed in `requirements.txt` and `tensorrt/requirements.txt`. These packages are automatically installed when the Docker container is built. To manually install them, run:
```bash
pip install -r requirements.txt
pip install -r tensorrt/requirements.txt
```
## Quick Start Guide
Running the following scripts will build and launch a container with all required dependencies for both TensorRT and native PyTorch. This is necessary for running inference with TensorRT and can also be used for data download, processing, and training of the model.
1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
2. Build the Jasper PyTorch with TensorRT container:
```bash
bash tensorrt/scripts/docker/build.sh
```
3. Start an interactive session in the NGC docker container:
```bash
bash tensorrt/scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
```
Alternatively, to start a script in the docker container:
```bash
bash tensorrt/scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR> <SCRIPT_PATH>
```
The `/datasets`, `/checkpoints`, `/results` directories will be mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host. **These three paths should be absolute and should already exist.** The contents of this repository will be mounted to the `/workspace/jasper` directory. Note that `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` directly correspond to the same arguments in `scripts/docker/launch.sh` mentioned in the [Jasper PyTorch README](../README.md).
Briefly, `<DATA_DIR>` should contain, or be prepared to contain a `LibriSpeech` sub-directory (created in [Acquiring Dataset](#acquiring-dataset)), `<CHECKPOINT_DIR>` should contain a PyTorch model checkpoint (`*.pt`) file obtained through training described in [Jasper PyTorch README](../README.md), and `<RESULT_DIR>` should be prepared to contain timing results, logs, serialized TensorRT engines, and ONNX files.
4. Acquiring dataset
If LibriSpeech has already been downloaded and preprocessed as defined in the [Jasper PyTorch README](../README.md), no further steps in this subsection need to be taken.
If LibriSpeech has not been downloaded already, note that only a subset of LibriSpeech is typically used for inference (`dev-*` and `test-*`). To acquire the inference subset of LibriSpeech, run the following command inside the container (this step does not require a GPU):
```bash
bash tensorrt/scripts/download_inference_librispeech.sh
```
Once the data download is complete, the following folders should exist:
* `/datasets/LibriSpeech/`
* `dev-clean/`
* `dev-other/`
* `test-clean/`
* `test-other/`
Next, preprocessing the data can be performed with the following command:
```bash
bash tensorrt/scripts/preprocess_inference_librispeech.sh
```
Once the data is preprocessed, the following additional files should now exist:
* `/datasets/LibriSpeech/`
* `librispeech-dev-clean-wav.json`
* `librispeech-dev-other-wav.json`
* `librispeech-test-clean-wav.json`
* `librispeech-test-other-wav.json`
* `dev-clean-wav/`
* `dev-other-wav/`
* `test-clean-wav/`
* `test-other-wav/`
5. Start TensorRT inference prediction
Inside the container, use the following script to run inference with TensorRT. To learn more about the following env variables see `tensorrt/scripts/inference.sh`.
```bash
export CHECKPOINT=<CHECKPOINT>
export TRT_PRECISION=<PRECISION>
export PYTORCH_PRECISION=<PRECISION>
export TRT_PREDICTION_PATH=<TRT_PREDICTION_PATH>
bash tensorrt/scripts/inference.sh
```
A pretrained model checkpoint can be downloaded from [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
More details can be found in [Advanced](#advanced) under [Scripts and sample code](#scripts-and-sample-code), [Parameters](#parameters) and [TensorRT Inference process](#tensorrt-inference).
6. Start TensorRT inference benchmark
Inside the container, use the following script to run inference benchmark with TensorRT.
```bash
export CHECKPOINT=<CHECKPOINT>
export NUM_STEPS=<NUM_STEPS>
export NUM_FRAMES=<NUM_FRAMES>
export BATCH_SIZE=<BATCH_SIZE>
export TRT_PRECISION=<PRECISION>
export PYTORCH_PRECISION=<PRECISION>
export CSV_PATH=<CSV_PATH>
bash tensorrt/scripts/inference_benchmark.sh
```
A pretrained model checkpoint can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
More details can be found in [Advanced](#advanced) under [Scripts and sample code](#scripts-and-sample-code), [Parameters](#parameters) and [TensorRT Inference Benchmark process](#tensorrt-inference-benchmark).
7. Start Jupyter notebook to run inference interactively
The Jupyter notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
The notebook located at `notebooks/JasperTRT.ipynb` offers an interactive way to run Steps 2, 3, 4 and 5. In addition, the notebook shows examples of how to use TensorRT to transcribe a single audio file into text. To launch the application, please follow the instructions under [../notebooks/README.md](../notebooks/README.md).
A pretrained model checkpoint can be downloaded from [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
## Advanced
The following sections provide greater detail on inference benchmarking with TensorRT and show inference results.
### Scripts and sample code
In the `tensorrt/` directory, the most important files are:
* `Dockerfile`: Container to run Jasper inference with TensorRT.
* `requirements.txt`: Python package dependencies. Installed when building the Docker container.
* `perf.py`: Entry point for inference pipeline using TensorRT.
* `perfprocedures.py`: Contains functionality to run inference through both the PyTorch model and TensorRT Engine, taking runtime measurements of each component of the inference process for comparison.
* `trtutils.py`: Helper functions for TensorRT components of Jasper inference.
* `perfutils.py`: Helper functions for non-TensorRT components of Jasper inference.
The `tensorrt/scripts/` directory has one-click scripts to run supported functionalities, such as:
* `download_inference_librispeech.sh`: Downloads the LibriSpeech inference subset (`dev-*` and `test-*`).
* `preprocess_inference_librispeech.sh`: Preprocesses the raw LibriSpeech data files to be ready for inference.
* `inference_benchmark.sh`: Benchmarks and compares TensorRT and PyTorch inference pipelines using the `perf.py` script.
* `inference.sh`: Runs TensorRT and PyTorch inference using the `inference_benchmark.sh` script.
* `walk_benchmark.sh`: Illustrates an example of using `tensorrt/scripts/inference_benchmark.sh`, which *walks* a variety of values for `BATCH_SIZE` and `NUM_FRAMES`.
* `docker/`: Contains the scripts for building and launching the container.
### Parameters
The list of parameters available for `tensorrt/scripts/inference_benchmark.sh` is:
```
Required:
--------
CHECKPOINT: Model checkpoint path
Arguments with Defaults:
--------
DATA_DIR: directory of the dataset (default: `/datasets/LibriSpeech`)
DATASET: name of dataset to use (default: `dev-clean`)
RESULT_DIR: directory for results including TensorRT engines, ONNX files, logs, and CSVs (default: `/results`)
CREATE_LOGFILE: boolean that indicates whether to create log of session to be stored in `$RESULT_DIR` (default: "true")
CSV_PATH: file to store CSV results (default: `/results/res.csv`)
TRT_PREDICTION_PATH: file to store inference prediction results generated with TensorRT (default: `none`)
PYT_PREDICTION_PATH: file to store inference prediction results generated with native PyTorch (default: `none`)
VERBOSE: boolean that indicates whether to verbosely describe TensorRT engine building/deserialization and TensorRT inference (default: "false")
TRT_PRECISION: "fp32" or "fp16". Defines which precision kernels will be used for TensorRT engine (default: "fp32")
PYTORCH_PRECISION: "fp32" or "fp16". Defines which precision will be used for inference in PyTorch (default: "fp32")
NUM_STEPS: Number of inference steps. If -1 runs inference on entire dataset (default: 100)
BATCH_SIZE: data batch size (default: 64)
NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 512)
FORCE_ENGINE_REBUILD: boolean that indicates whether to rebuild the TensorRT engine even if an already-built engine of equivalent precision, batch size, and number of frames exists. Engines are specific to the GPU, library versions, TensorRT versions, and CUDA versions they were built with and cannot be used in a different environment. (default: "true")
USE_DYNAMIC_SHAPE: if 'yes', uses dynamic shapes (default: yes). Dynamic shape is always preferred since it allows engines to be reused.
```
The complete list of parameters available for `tensorrt/scripts/inference.sh` is the same as `tensorrt/scripts/inference_benchmark.sh` only with different default input arguments. In the following, only the parameters with different default values are listed:
```
TRT_PREDICTION_PATH: file to store inference prediction results generated with TensorRT (default: `/results/trt_predictions.txt`)
PYT_PREDICTION_PATH: file to store inference prediction results generated with native PyTorch (default: `/results/pyt_predictions.txt`)
NUM_STEPS: Number of inference steps. If -1 runs inference on entire dataset (default: -1)
BATCH_SIZE: data batch size (default: 1)
NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 3600)
```
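For example, a single FP16 benchmark run at a fixed batch size and frame count could look like this; `CHECKPOINT` is the only required variable and all values below are illustrative:
```bash
export CHECKPOINT=/checkpoints/jasper_fp16.pt
export BATCH_SIZE=16
export NUM_FRAMES=512
export TRT_PRECISION=fp16
export PYTORCH_PRECISION=fp16
export CSV_PATH=/results/res_bs16_frames512.csv
bash tensorrt/scripts/inference_benchmark.sh
```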
### TensorRT Inference Benchmark process
The inference benchmarking is performed on a single GPU by `tensorrt/scripts/inference_benchmark.sh`, which delegates to `tensorrt/perf.py`. The latter takes the following steps:
1. Construct Jasper acoustic model in PyTorch.
2. Construct TensorRT Engine of Jasper acoustic model
1. Perform ONNX export on the PyTorch model, if its ONNX file does not already exist.
2. Construct TensorRT engine from ONNX export, if a saved engine file does not already exist or `FORCE_ENGINE_REBUILD` is `true`.
3. For each batch in the dataset, run inference through both the PyTorch model and TensorRT Engine, taking runtime measurements of each component of the inference process.
4. Compile performance and WER accuracy results in CSV format, written to `CSV_PATH` file.
`tensorrt/perf.py` relies on `tensorrt/trtutils.py` and `tensorrt/perfutils.py`, which provide helper functions for the TensorRT and non-TensorRT components of Jasper inference, respectively.
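Although the benchmark is normally driven through `inference_benchmark.sh`, `perf.py` can also be invoked directly from the repository root; a hedged example using the arguments defined in its `parse_args()` (all values illustrative):
```bash
python tensorrt/perf.py \
  --model_toml configs/jasper10x5dr_sp_offline_specaugment.toml \
  --ckpt_path /checkpoints/jasper_fp16.pt \
  --make_onnx --onnx_path /results/jasper.onnx \
  --seq_len 512 --batch_size 16 --engine_batch_size 16 \
  --trt_fp16 --pyt_fp16 \
  --dataset_dir /datasets/LibriSpeech \
  --val_manifest /datasets/LibriSpeech/librispeech-dev-clean-wav.json \
  --csv_path /results/res.csv
```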
### TensorRT Inference process
The inference is performed by `tensorrt/scripts/inference.sh`, which delegates to `tensorrt/scripts/inference_benchmark.sh`. The script runs on a single GPU. To run inference prediction on the entire dataset, `NUM_FRAMES` is set to 3600, which roughly corresponds to 36 seconds of audio. This covers the longest sentences in both the LibriSpeech dev and test datasets. By default, `BATCH_SIZE` is set to 1 to simulate the online inference scenario in deployment; other batch sizes can be tried by setting this parameter to a different value. By default, `TRT_PRECISION` is set to full precision and can be changed by setting `export TRT_PRECISION=fp16`. The prediction results are stored at `/results/trt_predictions.txt` and `/results/pyt_predictions.txt`.
## Performance
To benchmark the inference performance on a specific batch size and audio length refer to [Quick-Start-Guide](#quick-start-guide). To do a sweep over multiple batch sizes and audio durations run:
```bash
bash tensorrt/scripts/walk_benchmark.sh
```
The results are obtained by running inference on the LibriSpeech dev-clean dataset on a single T4 GPU using half precision with AMP. We compare the throughput of the acoustic model between TensorRT and native PyTorch.
### Results
#### Inference performance: NVIDIA T4
| Sequence Length (in seconds) | Batch size | TensorRT FP16 Throughput (#sequences/second) Percentiles | | | | PyTorch FP16 Throughput (#sequences/second) Percentiles | | | | TRT/PyT Speedup |
|---------------|------------|---------------------|---------|---------|---------|-----------------|---------|---------|---------|-----------------|
| | | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | |
|2|1|71.002|70.897|70.535|71.987|42.974|42.932|42.861|43.166|1.668|
||2|136.369|135.915|135.232|139.266|81.398|77.826|57.408|81.254|1.714|
||4|231.528|228.875|220.085|239.686|130.055|117.779|104.529|135.660|1.767|
||8|310.224|308.870|289.132|316.536|215.401|202.902|148.240|228.805|1.383|
||16|389.086|366.839|358.419|401.267|288.353|278.708|230.790|307.070|1.307|
|7|1|61.792|61.273|59.842|63.537|34.098|33.963|33.785|34.639|1.834|
||2|93.869|92.480|91.528|97.082|59.397|59.221|51.050|60.934|1.593|
||4|113.108|112.950|112.531|114.507|66.947|66.479|59.926|67.704|1.691|
||8|118.878|118.542|117.619|120.367|83.208|82.998|82.698|84.187|1.430|
||16|122.909|122.718|121.547|124.190|102.212|102.000|101.187|103.049|1.205|
|16.7|1|38.665|38.404|37.946|39.363|21.267|21.197|21.127|21.456|1.835|
||2|44.960|44.867|44.382|45.583|30.218|30.156|29.970|30.679|1.486|
||4|47.754|47.667|47.541|48.287|29.146|29.079|28.941|29.470|1.639|
||8|51.051|50.969|50.620|51.489|37.565|37.497|37.373|37.834|1.361|
||16|53.316|53.288|53.188|53.773|45.217|45.090|44.946|45.560|1.180|

View file

@ -1,132 +0,0 @@
#!/usr/bin/env python3
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''Constructs TensorRT engine for JASPER and evaluates inference latency'''
import argparse
import sys, os
# Get local modules in parent directory and current directory (assuming this was called from root of repository)
sys.path.append("./")
sys.path.append("./tensorrt")
import perfutils
import trtutils
import perfprocedures
from model import GreedyCTCDecoder
from helpers import __ctc_decoder_predictions_tensor
import caffe2.python.onnx.backend as c2backend
import onnxruntime as ort
import torch
from torch import nn
from torch.nn import functional as F
def main(args):
print ("Getting component")
# Get shared utility across PyTorch and TRT
pyt_components, saved_onnx = perfutils.get_pytorch_components_and_onnx(args)
print ("Getting engine")
# Get a TRT engine. See function for argument parsing logic
engine = trtutils.get_engine(args)
print ("Got engine.")
if args.wav:
with torch.no_grad():
audio_processor = pyt_components['audio_preprocessor']
audio_processor.eval()
greedy_decoder = GreedyCTCDecoder()
input_wav, num_audio_samples = pyt_components['input_wav']
features = audio_processor(input_wav, num_audio_samples)
features = perfutils.adjust_shape(features, args)
if not args.engine_path:
outputs = engine.run(None, {'FEATURES': features[0].data.cpu().numpy()})
inference = 1.0
t_log_probs_e = outputs[0]
t_log_probs_e=perfutils.torchify_trt_out(t_log_probs_e, t_log_probs_e.shape)
else:
with engine.create_execution_context() as context:
t_log_probs_e, copyto, inference, copyfrom= perfprocedures.do_inference(context, [features[0]])
t_predictions_e = greedy_decoder(t_log_probs_e)
hypotheses = __ctc_decoder_predictions_tensor(t_predictions_e, labels=perfutils.get_vocab())
print("INTERENCE TIME: {} ms".format(inference*1000.0))
print("TRANSCRIPT: ", hypotheses)
return
wer, preds, times = perfprocedures.compare_times_trt_pyt_exhaustive(engine,
pyt_components,
args)
string_header, string_data = perfutils.do_csv_export(wer, times, args.batch_size, args.seq_len)
if args.csv_path is not None:
print ("Exporting to " + args.csv_path)
with open(args.csv_path, 'a+') as f:
# See if header is there, if so, check that it matches
f.seek(0) # Read from start of file
existing_header = f.readline()
if existing_header == "":
f.write(string_header)
f.write("\n")
elif existing_header[:-1] != string_header:
raise Exception(f"Writing to existing CSV with incorrect format\nProduced:\n{string_header}\nFound:\n{existing_header}\nIf you intended to write to a new results csv, please change the csv_path argument")
f.seek(0,2) # Write to end of file
f.write(string_data)
f.write("\n")
else:
print(string_header)
print(string_data)
if args.trt_prediction_path is not None:
with open(args.trt_prediction_path, 'w') as fp:
fp.write('\n'.join(preds['trt']))
if args.pyt_prediction_path is not None:
with open(args.pyt_prediction_path, 'w') as fp:
fp.write('\n'.join(preds['pyt']))
def parse_args():
parser = argparse.ArgumentParser(description="Performance test of TRT")
parser.add_argument("--engine_path", default=None, type=str, help="Path to serialized TRT engine")
parser.add_argument("--use_existing_engine", action="store_true", default=False, help="If set, will deserialize engine at --engine_path" )
parser.add_argument("--engine_batch_size", default=16, type=int, help="Maximum batch size for constructed engine; needed when building")
parser.add_argument("--batch_size", default=16, type=int, help="Batch size for data when running inference.")
parser.add_argument("--dataset_dir", type=str, help="Root directory of dataset")
parser.add_argument("--model_toml", type=str, required=True, help="Config toml to use. A selection can be found in configs/")
parser.add_argument("--val_manifest", type=str, help="JSON manifest of dataset.")
parser.add_argument("--onnx_path", default=None, type=str, help="Path to onnx model for engine creation")
parser.add_argument("--seq_len", default=None, type=int, help="Generate an ONNX export with this fixed sequence length, and save to --onnx_path. Requires also using --onnx_path and --ckpt_path.")
parser.add_argument("--max_seq_len", default=3600, type=int, help="Max sequence length for TRT engine build. Default works with TRTIS benchmark. Set it larger than seq_len")
parser.add_argument("--ckpt_path", default=None, type=str, help="If provided, will also construct pytorch acoustic model")
parser.add_argument("--max_duration", default=None, type=float, help="Maximum possible length of audio data in seconds")
parser.add_argument("--num_steps", default=-1, type=int, help="Number of inference steps to run")
parser.add_argument("--trt_fp16", action="store_true", default=False, help="If set, will allow TRT engine builder to select fp16 kernels as well as fp32")
parser.add_argument("--pyt_fp16", action="store_true", default=False, help="If set, will construct pytorch model with fp16 weights")
parser.add_argument("--make_onnx", action="store_true", default=False, help="If set, will create an ONNX model and store it at the path specified by --onnx_path")
parser.add_argument("--csv_path", type=str, default=None, help="File to append csv info about inference time")
parser.add_argument("--trt_prediction_path", type=str, default=None, help="File to write predictions inferred with tensorrt")
parser.add_argument("--pyt_prediction_path", type=str, default=None, help="File to write predictions inferred with pytorch")
parser.add_argument("--verbose", action="store_true", default=False, help="If set, will verbosely describe TRT engine building and deserialization as well as TRT inference")
parser.add_argument("--wav", type=str, help='absolute path to .wav file (16KHz)')
parser.add_argument("--max_workspace_size", default=0, type=int, help="Maximum GPU memory workspace size for constructed engine; needed when building")
parser.add_argument("--transpose", action="store_true", default=False, help="If set, will transpose input")
parser.add_argument("--static_shape", action="store_true", default=False, help="If set, use static shape otherwise dynamic shape. Dynamic shape is always preferred.")
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
main(args)

View file

@ -1,182 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''A collection of accuracy and latency evaluation procedures for JASPER on PyTorch and TRT.
'''
import pycuda.driver as cuda
import pycuda.autoinit
import perfutils
import trtutils
import time
import torch
from tqdm import tqdm
def compare_times_trt_pyt_exhaustive(engine, pyt_components, args):
'''Compares execution times and WER between TRT and PyTorch'''
# The engine has a fixed-size sequence length, which needs to be known for slicing/padding input
preprocess_times = []
inputadjust_times = []
outputadjust_times = []
process_batch_times = []
trt_solo_times = []
trt_async_times = []
tohost_sync_times =[]
pyt_infer_times = []
step_counter = 0
with engine.create_execution_context() as context, torch.no_grad():
for data in tqdm(pyt_components['data_layer'].data_iterator):
if args.num_steps >= 1:
if step_counter > args.num_steps:
break
step_counter +=1
tensors = []
for d in data:
tensors.append(d.cuda())
preprocess_start = time.perf_counter()
am_input = pyt_components['audio_preprocessor'](tensors[0], tensors[1])
torch.cuda.synchronize()
preprocess_end = time.perf_counter()
# Pad or cut to the necessary engine length
inputadjust_start = time.perf_counter()
am_input = perfutils.adjust_shape(am_input, args)
torch.cuda.synchronize()
inputadjust_end = time.perf_counter()
batch_size = am_input[0].shape[0]
inp = [am_input[0]]
# Run TRT inference 1: Async copying and inference
# import ipdb; ipdb.set_trace()
trt_out, time_taken= do_inference_overlap(context, inp)
torch.cuda.synchronize()
outputadjust_start = time.perf_counter()
outputadjust_end = time.perf_counter()
process_batch_start = time.perf_counter()
perfutils.global_process_batch(log_probs=trt_out,
original_tensors=tensors,
batch_size=batch_size,
is_trt=True)
torch.cuda.synchronize()
process_batch_end = time.perf_counter()
# Create explicit stream so pytorch doesn't complete asynchronously
pyt_infer_start = time.perf_counter()
pyt_out = pyt_components['acoustic_model'](am_input[0])
torch.cuda.synchronize()
pyt_infer_end = time.perf_counter()
perfutils.global_process_batch(log_probs=pyt_out,
original_tensors=tensors,
batch_size=batch_size,
is_trt=False)
# Run TRT inference 2: Synchronous copying and inference
sync_out, time_to, time_infer, time_from = do_inference(context,inp)
del sync_out
preprocess_times.append(preprocess_end - preprocess_start)
inputadjust_times.append(inputadjust_end - inputadjust_start)
outputadjust_times.append(outputadjust_end - outputadjust_start)
process_batch_times.append(process_batch_end - process_batch_start)
trt_solo_times.append(time_infer)
trt_async_times.append(time_taken)
tohost_sync_times.append(time_from)
pyt_infer_times.append(pyt_infer_end - pyt_infer_start)
trt_wer = perfutils.global_process_epoch(is_trt=True)
pyt_wer = perfutils.global_process_epoch(is_trt=False)
trt_preds = perfutils._global_trt_dict['predictions']
pyt_preds = perfutils._global_pyt_dict['predictions']
times = {
'preprocess': preprocess_times, # Time to go through preprocessing
'pyt_infer': pyt_infer_times, # Time for batch completion through pytorch
'input_adjust': inputadjust_times, # Time to pad/cut for TRT engine size requirements
'output_adjust' : outputadjust_times, # Time to reshape output of TRT and copy from host to device
'post_process': process_batch_times, # Time to run greedy decoding and do CTC conversion
'trt_solo_infer': trt_solo_times, # Time to execute just TRT acoustic model
'to_host': tohost_sync_times, # Time to execute device to host copy synchronously
'trt_async_infer': trt_async_times, # Time to execute combined async TRT acoustic model + device to host copy
}
wer = {
'trt': trt_wer,
'pyt': pyt_wer
}
preds = {
'trt': trt_preds,
'pyt': pyt_preds
}
return wer, preds, times
def do_inference(context, inp):
'''Do inference using a TRT engine and time it
Execution and device-to-host copy are completed synchronously
'''
# Typical Python-TRT used in samples would copy input data from host to device.
# Because the PyTorch Tensor is already on the device, such a copy is unneeded.
t0 = time.perf_counter()
stream = cuda.Stream()
# Create output buffers and stream
outputs, bindings, out_shape = trtutils.allocate_buffers_with_existing_inputs(context, inp)
t01 = time.perf_counter()
# simulate sync call here
context.execute_async_v2(
bindings=bindings,
stream_handle=stream.handle)
stream.synchronize()
t2 = time.perf_counter()
# for out in outputs:
# cuda.memcpy_dtoh(out.host, out.device)
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()
t3 = time.perf_counter()
copyto = t01-t0
inference = t2-t01
copyfrom = t3-t2
out = outputs[0].host
outputs[0].device.free()
out = perfutils.torchify_trt_out(outputs[0].host, out_shape)
return out, copyto, inference, copyfrom
def do_inference_overlap(context, inp):
'''Do inference using a TRT engine and time it
Execution and device-to-host copy are completed asynchronously
'''
# Typical Python-TRT used in samples would copy input data from host to device.
# Because the PyTorch Tensor is already on the device, such a copy is unneeded.
t0 = time.perf_counter()
# Create output buffers and stream
stream = cuda.Stream()
outputs, bindings, out_shape = trtutils.allocate_buffers_with_existing_inputs(context, inp)
t01 = time.perf_counter()
t1 = time.perf_counter()
# Run inference and transfer outputs to host asynchronously
context.execute_async_v2(
bindings=bindings,
stream_handle=stream.handle)
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()
t2 = time.perf_counter()
copyto = t1-t0
inference = t2-t1
outputs[0].device.free()
out = perfutils.torchify_trt_out(outputs[0].host, out_shape)
return out, t2-t1

View file

@ -1,317 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''Contains helper functions for non-TRT components of JASPER inference
'''
from model import GreedyCTCDecoder, AudioPreprocessing, JasperEncoderDecoder
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, add_ctc_labels, norm
from apex import amp
import torch
import torch.nn as nn
import toml
from parts.features import audio_from_file
import onnx
import os
_global_ctc_labels = None
def get_vocab():
''' Gets the CTC vocab
Requires calling get_pytorch_components_and_onnx() to setup global labels.
'''
if _global_ctc_labels is None:
raise Exception("Feature labels have not been found. Execute `get_pytorch_components_and_onnx()` first")
return _global_ctc_labels
def get_results(log_probs, original_tensors, batch_size):
''' Returns WER and predictions for the outputs of the acoustic model
Used for one-off batches. Epoch-wide evaluation should use
global_process_batch and global_process_epoch
'''
# Used to get WER and predictions for one-off batches
greedy_decoder = GreedyCTCDecoder()
predicts = norm(greedy_decoder(log_probs=log_probs))
values_dict = dict(
predictions=[predicts],
transcript=[original_tensors[2][0:batch_size,...]],
transcript_length=[original_tensors[3][0:batch_size,...]],
)
temp_dict = {
'predictions': [],
'transcripts': [],
}
process_evaluation_batch(values_dict, temp_dict, labels=get_vocab())
predictions = temp_dict['predictions']
wer, _ = process_evaluation_epoch(temp_dict)
return wer, predictions
_global_trt_dict = {
'predictions': [],
'transcripts': [],
}
_global_pyt_dict = {
'predictions': [],
'transcripts': [],
}
def global_process_batch(log_probs, original_tensors, batch_size, is_trt=True):
'''Accumulates prediction evaluations for batches across an epoch
is_trt determines which global dictionary will be used.
To get WER at any point, use global_process_epoch.
For one-off WER evaluations, use get_results()
'''
# State-based approach for full WER comparison across a dataset.
greedy_decoder = GreedyCTCDecoder()
predicts = norm(greedy_decoder(log_probs=log_probs))
values_dict = dict(
predictions=[predicts],
transcript=[original_tensors[2][0:batch_size,...]],
transcript_length=[original_tensors[3][0:batch_size,...]],
)
dict_to_process = _global_trt_dict if is_trt else _global_pyt_dict
process_evaluation_batch(values_dict, dict_to_process, labels=get_vocab())
def global_process_epoch(is_trt=True):
'''Returns WER in accumulated global dictionary
'''
dict_to_process = _global_trt_dict if is_trt else _global_pyt_dict
wer, _ = process_evaluation_epoch(dict_to_process)
return wer
def get_onnx(path, acoustic_model, args):
''' Get an ONNX model with float weights
Requires an --onnx_save_path and --ckpt_path (so that an acoustic model could be constructed).
Fixed-length --seq_len must be provided as well.
'''
dynamic_dim = 0
if not args.static_shape:
dynamic_dim = 1 if args.transpose else 2
if args.transpose:
signal_shape=(args.engine_batch_size, int(args.seq_len), 64)
else:
signal_shape=(args.engine_batch_size, 64, int(args.seq_len))
with torch.no_grad():
phony_signal = torch.zeros(signal_shape, dtype=torch.float, device=torch.device("cuda"))
phony_len = torch.IntTensor(len(phony_signal))
phony_out = acoustic_model.infer((phony_signal, phony_len))
input_names=["FEATURES"]
output_names=["LOGITS"]
if acoustic_model.jasper_encoder.use_conv_mask:
input_names.append("FETURES_LEN")
output_names.append("LOGITS_LEN")
phony_signal = [phony_signal, phony_len]
if dynamic_dim > 0:
dynamic_axes={
"FEATURES" : {0 : "BATCHSIZE", dynamic_dim : "NUM_FEATURES"},
"LOGITS" : { 0: "BATCHSIZE", 1 : "NUM_LOGITS"}
}
else:
dynamic_axes = None
jitted_model = acoustic_model
torch.onnx.export(jitted_model, phony_signal, path,
input_names=input_names, output_names=output_names,
opset_version=10,
do_constant_folding=True,
verbose=True,
dynamic_axes=dynamic_axes,
example_outputs = phony_out
)
fn=path+".readable"
with open(fn, 'w') as f:
#Write human-readable graph representation to file as well.
tempModel = onnx.load(path)
onnx.checker.check_model(tempModel)
pgraph = onnx.helper.printable_graph(tempModel.graph)
f.write(pgraph)
return path
def get_pytorch_components_and_onnx(args):
'''Returns PyTorch components used for inference
'''
model_definition = toml.load(args.model_toml)
dataset_vocab = model_definition['labels']['labels']
# Set up global labels for future vocab calls
global _global_ctc_labels
_global_ctc_labels= add_ctc_labels(dataset_vocab)
featurizer_config = model_definition['input_eval']
optim_level = 3 if args.pyt_fp16 else 0
featurizer_config["optimization_level"] = optim_level
audio_preprocessor = None
onnx_path = None
data_layer = None
wav = None
seq_len = None
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
if args.dataset_dir is not None:
data_layer = AudioToTextDataLayer(dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=args.val_manifest,
labels=dataset_vocab,
batch_size=args.batch_size,
shuffle=False)
if args.wav is not None:
args.batch_size=1
wav, seq_len = audio_from_file(args.wav)
if args.seq_len is None or args.seq_len == 0:
args.seq_len = seq_len/(featurizer_config['sample_rate']/100)
args.seq_len = int(args.seq_len)
if args.transpose:
featurizer_config["transpose_out"] = True
model_definition["transpose_in"] = True
model = JasperEncoderDecoder(jasper_model_definition=model_definition, feat_in=1024, num_classes=len(get_vocab()), transpose_in=args.transpose)
model = model.cuda()
model.eval()
audio_preprocessor = AudioPreprocessing(**featurizer_config)
audio_preprocessor = audio_preprocessor.cuda()
audio_preprocessor.eval()
if args.ckpt_path is not None:
if os.path.isdir(args.ckpt_path):
d_checkpoint = torch.load(args.ckpt_path+"/decoder.pt", map_location="cpu")
e_checkpoint = torch.load(args.ckpt_path+"/encoder.pt", map_location="cpu")
model.jasper_encoder.load_state_dict(e_checkpoint, strict=False)
model.jasper_decoder.load_state_dict(d_checkpoint, strict=False)
else:
checkpoint = torch.load(args.ckpt_path, map_location="cpu")
model.load_state_dict(checkpoint['state_dict'], strict=False)
# if we are to produce engine, not run/create ONNX, postpone AMP initialization
# (ONNX parser cannot handle mixed FP16 ONNX yet)
if args.pyt_fp16 and args.engine_path is None:
amp.initialize(models=model, opt_level='O'+str(optim_level))
if args.make_onnx:
if args.onnx_path is None or args.ckpt_path is None:
raise Exception("--ckpt_path, --onnx_path must be provided when using --make_onnx")
onnx_path = get_onnx(args.onnx_path, model, args)
if args.pyt_fp16 and args.engine_path is not None:
amp.initialize(models=model, opt_level='O'+str(optim_level))
return {'data_layer': data_layer,
'audio_preprocessor': audio_preprocessor,
'acoustic_model': model,
'input_wav' : (wav, seq_len) }, onnx_path
def adjust_shape(am_input, args):
'''Pads or cuts acoustic model input tensor to some fixed_length
'''
input = am_input[0]
baked_length = int(args.seq_len)
if args.transpose:
in_seq_len = input.shape[1]
else:
in_seq_len = input.shape[2]
if baked_length is None or in_seq_len == baked_length:
return (input, am_input[1])
if args.transpose:
return (input.resize_(input.shape[0], baked_length, 64), am_input[1])
newSeq=input
if in_seq_len > baked_length:
# Cut extra bits off, no inference done
newSeq = input[...,0:baked_length].contiguous()
elif in_seq_len < baked_length:
# Zero-pad to satisfy length
pad_length = baked_length - in_seq_len
newSeq = nn.functional.pad(input, (0, pad_length), 'constant', 0)
return (newSeq, am_input[1])
def torchify_trt_out(trt_out, desired_shape):
'''Reshapes flat data to format for greedy+CTC decoding
Used to convert numpy array on host to PyT Tensor
'''
# Predictions must be reshaped.
ret = torch.from_numpy(trt_out)
return ret.reshape((desired_shape[0], desired_shape[1], desired_shape[2]))
def do_csv_export(wers, times, batch_size, num_frames):
'''Produces CSV header and data for input data
wers: dictionary of WER with keys={'trt', 'pyt'}
times: dictionary of execution times
'''
def take_durations_and_output_percentile(durations, ratios):
from heapq import nlargest, nsmallest
import numpy as np
import math
durations = np.asarray(durations) * 1000 # in ms
latency = durations
# The first few entries may not be representative due to warm-up effects
# The last entry might not be representative if dataset_size % batch_size != 0
latency = latency[5:-1]
mean_latency = np.mean(latency)
latency_worst = nlargest(math.ceil( (1 - min(ratios))* len(latency)), latency)
latency_ranges=get_percentile(ratios, latency_worst, len(latency))
latency_ranges["0.5"] = mean_latency
return latency_ranges
def get_percentile(ratios, arr, nsamples):
res = {}
for a in ratios:
idx = max(int(nsamples * (1 - a)), 0)
res[a] = arr[idx]
return res
ratios = [0.9, 0.95, 0.99, 1.]
header=[]
data=[]
header.append("BatchSize")
header.append("NumFrames")
data.append(f"{batch_size}")
data.append(f"{num_frames}")
for title, wer in wers.items():
header.append(title)
data.append(f"{wer}")
for title, durations in times.items():
ratio_latencies_dict = take_durations_and_output_percentile(durations, ratios)
for ratio, latency in ratio_latencies_dict.items():
header.append(f"{title}_{ratio}")
data.append(f"{latency}")
string_header = ", ".join(header)
string_data = ", ".join(data)
return string_header, string_data

View file

@ -1,4 +0,0 @@
pycuda
pillow
onnx==1.6.0
onnxruntime==1.4.0

View file

@ -1,5 +0,0 @@
#!/bin/bash
# Constructs a docker image containing dependencies for execution of JASPER through TensorRT
echo "docker build . -f ./tensorrt/Dockerfile -t jasper:tensorrt"
docker build . -f ./tensorrt/Dockerfile -t jasper:tensorrt

View file

@ -1,43 +0,0 @@
#!/bin/bash
SCRIPT_DIR=$(cd $(dirname $0); pwd)
JASPER_REPO=${JASPER_REPO:-"${SCRIPT_DIR}/../../.."}
# Launch TRT JASPER container.
DATA_DIR=$1
CHECKPOINT_DIR=$2
RESULT_DIR=$3
PROGRAM_PATH=${PROGRAM_PATH}
if [ $# -lt 3 ]; then
echo "Usage: ./launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR> (<SCRIPT_PATH>)"
echo "All directory paths must be absolute paths and exist"
exit 1
fi
for dir in $DATA_DIR $CHECKPOINT_DIR $RESULT_DIR; do
if [[ $dir != /* ]]; then
echo "All directory paths must be absolute paths!"
echo "${dir} is not an absolute path"
exit 1
fi
if [ ! -d $dir ]; then
echo "All directory paths must exist!"
echo "${dir} does not exist"
exit 1
fi
done
nvidia-docker run -it --rm \
--runtime=nvidia \
--shm-size=4g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v $DATA_DIR:/datasets \
-v $CHECKPOINT_DIR:/checkpoints/ \
-v $RESULT_DIR:/results/ \
-v ${JASPER_REPO}:/jasper \
${EXTRA_JASPER_ENV} \
jasper:tensorrt bash $PROGRAM_PATH
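# Example (hypothetical absolute paths):
#   bash <path to this script> /data/LibriSpeech /checkpoints /results
# Optionally set PROGRAM_PATH to a script inside the container to run it on start.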

View file

@ -1,67 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Performs inference and measures latency and accuracy of TRT and PyTorch implementations of JASPER.
echo "Container nvidia build = " $NVIDIA_BUILD_ID
# Mandatory Arguments
CHECKPOINT=$CHECKPOINT
# Arguments with Defaults
DATA_DIR=${DATA_DIR:-"/datasets/LibriSpeech"}
DATASET=${DATASET:-"dev-clean"}
RESULT_DIR=${RESULT_DIR:-"/results"}
CREATE_LOGFILE=${CREATE_LOGFILE:-"true"}
TRT_PRECISION=${TRT_PRECISION:-"fp32"}
PYTORCH_PRECISION=${PYTORCH_PRECISION:-"fp32"}
NUM_STEPS=${NUM_STEPS:-"-1"}
BATCH_SIZE=${BATCH_SIZE:-1}
NUM_FRAMES=${NUM_FRAMES:-3600}
MAX_SEQUENCE_LENGTH_FOR_ENGINE=${MAX_SEQUENCE_LENGTH_FOR_ENGINE:-$NUM_FRAMES}
FORCE_ENGINE_REBUILD=${FORCE_ENGINE_REBUILD:-"true"}
CSV_PATH=${CSV_PATH:-"/results/res.csv"}
TRT_PREDICTION_PATH=${TRT_PREDICTION_PATH:-"/results/trt_predictions.txt"}
PYT_PREDICTION_PATH=${PYT_PREDICTION_PATH:-"/results/pyt_predictions.txt"}
VERBOSE=${VERBOSE:-"false"}
export CHECKPOINT="$CHECKPOINT"
export DATA_DIR="$DATA_DIR"
export DATASET="$DATASET"
export RESULT_DIR="$RESULT_DIR"
export CREATE_LOGFILE="$CREATE_LOGFILE"
export TRT_PRECISION="$TRT_PRECISION"
export PYTORCH_PRECISION="$PYTORCH_PRECISION"
export NUM_STEPS="$NUM_STEPS"
export BATCH_SIZE="$BATCH_SIZE"
export NUM_FRAMES="$NUM_FRAMES"
export MAX_SEQUENCE_LENGTH_FOR_ENGINE="$MAX_SEQUENCE_LENGTH_FOR_ENGINE"
export FORCE_ENGINE_REBUILD="$FORCE_ENGINE_REBUILD"
export CSV_PATH="$CSV_PATH"
export TRT_PREDICTION_PATH="$TRT_PREDICTION_PATH"
export PYT_PREDICTION_PATH="$PYT_PREDICTION_PATH"
export VERBOSE="$VERBOSE"
bash ./tensorrt/scripts/inference_benchmark.sh $1 $2 $3 $4 $5 $6 $7
trt_word_error_rate=`cat "$CSV_PATH" | awk '{print $3}'`
pyt_word_error_rate=`cat "$CSV_PATH" | awk '{print $4}'`
echo "word error rate for native PyTorch inference: "
echo "${pyt_word_error_rate}"
echo "word error rate for native TRT inference: "
echo "${trt_word_error_rate}"

View file

@ -1,173 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Measures latency and accuracy of TRT and PyTorch implementations of JASPER.
echo "Container nvidia build = " $NVIDIA_BUILD_ID
trap "exit" INT
# Mandatory Arguments
CHECKPOINT=${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}
# Arguments with Defaults
DATA_DIR=${DATA_DIR:-"/datasets/LibriSpeech"}
DATASET=${DATASET:-"dev-clean"}
RESULT_DIR=${RESULT_DIR:-"/results"}
LOG_DIR=${RESULT_DIR}/logs
CREATE_LOGFILE=${CREATE_LOGFILE:-"true"}
TRT_PRECISION=${TRT_PRECISION:-"fp16"}
PYTORCH_PRECISION=${PYTORCH_PRECISION:-"fp16"}
NUM_STEPS=${NUM_STEPS:-"100"}
BATCH_SIZE=${BATCH_SIZE:-64}
NUM_FRAMES=${NUM_FRAMES:-512}
FORCE_ENGINE_REBUILD=${FORCE_ENGINE_REBUILD:-"false"}
CSV_PATH=${CSV_PATH:-"/results/res.csv"}
TRT_PREDICTION_PATH=${TRT_PREDICTION_PATH:-"none"}
PYT_PREDICTION_PATH=${PYT_PREDICTION_PATH:-"none"}
VERBOSE=${VERBOSE:-"false"}
USE_DYNAMIC_SHAPE=${USE_DYNAMIC_SHAPE:-"yes"}
# Set up flag-based arguments
TRT_PREC=""
if [ "$TRT_PRECISION" = "fp16" ] ; then
TRT_PREC="--trt_fp16"
elif [ "$TRT_PRECISION" = "fp32" ] ; then
TRT_PREC=""
else
echo "Unknown <trt_precision> argument"
exit -2
fi
PYTORCH_PREC=""
if [ "$PYTORCH_PRECISION" = "fp16" ] ; then
PYTORCH_PREC="--pyt_fp16"
elif [ "$PYTORCH_PRECISION" = "fp32" ] ; then
PYTORCH_PREC=""
else
echo "Unknown <pytorch_precision> argument"
exit -2
fi
SHOULD_VERBOSE=""
if [ "$VERBOSE" = "true" ] ; then
SHOULD_VERBOSE="--verbose"
fi
STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
STEPS=" --num_steps $NUM_STEPS"
fi
# Making engine and onnx directories in RESULT_DIR if they don't already exist
ONNX_DIR=$RESULT_DIR/onnxs
ENGINE_DIR=$RESULT_DIR/engines
mkdir -p $ONNX_DIR
mkdir -p $ENGINE_DIR
mkdir -p $LOG_DIR
if [ "$USE_DYNAMIC_SHAPE" = "no" ] ; then
PREFIX=BS${BATCH_SIZE}_NF${NUM_FRAMES}
DYNAMIC_PREFIX=" --static_shape "
else
PREFIX=DYNAMIC
fi
# Currently, TRT parser for ONNX can't parse mixed-precision weights, so ONNX
# export will always be FP32. This is also enforced in perf.py
ONNX_FILE=fp32_${PREFIX}.onnx
ENGINE_FILE=${TRT_PRECISION}_${PREFIX}.engine
# If an ONNX with the same precision and number of frames exists, don't recreate it because
# TRT engine construction can be done on an onnx of any batch size
# "%P" only prints filenames (rather than absolute/relative path names)
EXISTING_ONNX=$(find $ONNX_DIR -name ${ONNX_FILE} -printf "%P\n" | head -n 1)
SHOULD_MAKE_ONNX=""
if [ -z "$EXISTING_ONNX" ] ; then
SHOULD_MAKE_ONNX="--make_onnx"
else
ONNX_FILE=${EXISTING_ONNX}
fi
# Follow FORCE_ENGINE_REBUILD about reusing existing engines.
# If false, the existing engine must match precision, batch size, and number of frames
SHOULD_MAKE_ENGINE=""
if [ "$FORCE_ENGINE_REBUILD" != "true" ] ; then
EXISTING_ENGINE=$(find $ENGINE_DIR -name "${ENGINE_FILE}")
if [ -n "$EXISTING_ENGINE" ] ; then
SHOULD_MAKE_ENGINE="--use_existing_engine"
fi
fi
if [ "$TRT_PREDICTION_PATH" = "none" ] ; then
TRT_PREDICTION_PATH=""
else
TRT_PREDICTION_PATH=" --trt_prediction_path=${TRT_PREDICTION_PATH}"
fi
if [ "$PYT_PREDICTION_PATH" = "none" ] ; then
PYT_PREDICTION_PATH=""
else
PYT_PREDICTION_PATH=" --pyt_prediction_path=${PYT_PREDICTION_PATH}"
fi
CMD="python tensorrt/perf.py"
CMD+=" --batch_size $BATCH_SIZE"
CMD+=" --engine_batch_size $BATCH_SIZE"
CMD+=" --model_toml configs/jasper10x5dr_nomask.toml"
CMD+=" --dataset_dir $DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt_path $CHECKPOINT"
CMD+=" $SHOULD_VERBOSE"
CMD+=" $TRT_PREC"
CMD+=" $PYTORCH_PREC"
CMD+=" $STEPS"
CMD+=" --engine_path ${RESULT_DIR}/engines/${ENGINE_FILE}"
CMD+=" --onnx_path ${RESULT_DIR}/onnxs/${ONNX_FILE}"
CMD+=" --seq_len $NUM_FRAMES"
CMD+=" $SHOULD_MAKE_ONNX"
CMD+=" $SHOULD_MAKE_ENGINE"
CMD+=" $DYNAMIC_PREFIX"
CMD+=" --csv_path $CSV_PATH"
CMD+=" $1 $2 $3 $4 $5 $6 $7 $8 $9"
CMD+=" $TRT_PREDICTION_PATH"
CMD+=" $PYT_PREDICTION_PATH"
if [ "$CREATE_LOGFILE" == "true" ] ; then
export GBS=$(expr $BATCH_SIZE )
printf -v TAG "jasper_tensorrt_inference_benchmark_%s_gbs%d" "$PYTORCH_PRECISION" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$LOG_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi
mkdir -p ${RESULT_DIR}/logs
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
$CMD |& tee $LOGFILE
grep 'latency' $LOGFILE
fi
set +x

View file

@ -1,43 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# A usage example of inference_benchmark.sh.
export NUM_STEPS=100
export FORCE_ENGINE_REBUILD="false"
export CHECKPOINT=${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}
export CREATE_LOGFILE="true"
prec=fp16
export TRT_PRECISION=$prec
export PYTORCH_PRECISION=$prec
trap "exit" INT
for use_dynamic in yes no;
do
export USE_DYNAMIC_SHAPE=${use_dynamic}
export CSV_PATH="/results/${prec}.csv"
for nf in 208 304 512 704 1008 1680;
do
export NUM_FRAMES=$nf
for bs in 1 2 4 8 16 32 64;
do
export BATCH_SIZE=$bs
echo "Doing batch size ${bs}, sequence length ${nf}, precision ${prec}"
bash tensorrt/scripts/inference_benchmark.sh $1 $2 $3 $4 $5 $6
done
done
done

View file

@ -1,155 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''Contains helper functions for TRT components of JASPER inference
'''
import pycuda.driver as cuda
import tensorrt as trt
import onnxruntime as ort
import numpy as np
class HostDeviceMem(object):
'''Type for managing host and device buffers
A simple class which is more explicit than dealing with a 2-tuple.
'''
def __init__(self, host_mem, device_mem):
self.host = host_mem
self.device = device_mem
def __str__(self):
return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
def __repr__(self):
return self.__str__()
def build_engine_from_parser(args):
'''Builds TRT engine from an ONNX file
Note that network output 1 is unmarked so that the engine will not use
vestigial length calculations associated with masked_fill
'''
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if args.verbose else trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
builder.max_batch_size = 64
if args.trt_fp16:
builder.fp16_mode = True
print("Optimizing for FP16")
config_flags = 1 << int(trt.BuilderFlag.FP16) # | 1 << int(trt.BuilderFlag.STRICT_TYPES)
max_size = 4*1024*1024*1024
max_len = args.max_seq_len
else:
config_flags = 0
max_size = 4*1024*1024*1024
max_len = args.max_seq_len
if args.max_workspace_size > 0:
builder.max_workspace_size = args.max_workspace_size
else:
builder.max_workspace_size = max_size
config = builder.create_builder_config()
config.flags = config_flags
if not args.static_shape:
profile = builder.create_optimization_profile()
if args.transpose:
profile.set_shape("FEATURES", min=(1,192,64), opt=(args.engine_batch_size,256,64), max=(builder.max_batch_size, max_len, 64))
else:
profile.set_shape("FEATURES", min=(1,64,192), opt=(args.engine_batch_size,64,256), max=(builder.max_batch_size, 64, max_len))
config.add_optimization_profile(profile)
explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(explicit_batch)
with trt.OnnxParser(network, TRT_LOGGER) as parser:
with open(args.onnx_path, 'rb') as model:
parsed = parser.parse(model.read())
print ("Parsing returned ", parsed, "dynamic_shape= " , not args.static_shape, "\n")
return builder.build_engine(network, config=config)
def deserialize_engine(engine_path, is_verbose):
'''Deserializes TRT engine at engine_path
'''
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if is_verbose else trt.Logger(trt.Logger.WARNING)
with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
engine = runtime.deserialize_cuda_engine(f.read())
return engine
def allocate_buffers_with_existing_inputs(context, inp):
'''
allocate_buffers() (see TRT python samples) but uses existing inputs on the device
inp: List of pointers to device memory. Pointers are in the same order as
would be produced by allocate_buffers(). That is, inputs are in the
order defined by iterating through `engine`
'''
# Add input to bindings
bindings = [0,0]
outputs = []
engine = context.engine
batch_size = inp[0].shape
inp_idx = engine.get_binding_index("FEATURES")
inp_b = inp[0].data_ptr()
assert(inp[0].is_contiguous())
bindings[inp_idx] = inp_b
sh = inp[0].shape
batch_size = sh[0]
orig_shape = context.get_binding_shape(inp_idx)
if orig_shape[0]==-1:
context.set_binding_shape(inp_idx, trt.Dims([batch_size, sh[1], sh[2]]))
assert context.all_binding_shapes_specified
out_idx = engine.get_binding_index("LOGITS")
# Allocate output buffer by querying the size from the context. This may be different for different input shapes.
out_shape = context.get_binding_shape(out_idx)
#print ("Out_shape: ", out_shape)
h_output = cuda.pagelocked_empty(tuple(out_shape), dtype=np.float32())
# print ("Out bytes: " , h_output.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
bindings[out_idx] = int(d_output)
hdm = HostDeviceMem(h_output, d_output)
outputs.append(hdm)
return outputs, bindings, out_shape
def get_engine(args):
'''Acquire an inference engine.
If --use_existing_engine is set, deserialize the engine at --engine_path.
Else, if both --engine_path and --onnx_path are given, build a new TRT engine from the ONNX file and serialize it to --engine_path.
Else, if only --onnx_path is given, fall back to an ONNX Runtime session.
'''
engine = None
if args.engine_path is not None and args.use_existing_engine:
engine = deserialize_engine(args.engine_path, args.verbose)
elif args.engine_path is not None and args.onnx_path is not None:
# Build a new engine and serialize it.
print("Building TRT engine ....")
engine = build_engine_from_parser(args)
if engine is not None:
with open(args.engine_path, 'wb') as f:
f.write(engine.serialize())
print("TRT engine saved at " + args.engine_path + " ...")
elif args.onnx_path is not None:
ort_session = ort.InferenceSession(args.onnx_path)
return ort_session
else:
raise Exception("One of the following sets of arguments must be provided:\n"+
"<engine_path> + --use_existing_engine\n"+
"<engine_path> + <onnx_path>\n"+
"in order to construct a TRT engine")
if engine is None:
raise Exception("Failed to acquire TRT engine")
return engine
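# Hypothetical usage sketch (attribute names match the checks above; values are
# illustrative only):
#   from argparse import Namespace
#   args = Namespace(engine_path='/results/engines/fp16_DYNAMIC.engine',
#                    onnx_path=None, use_existing_engine=True, verbose=False)
#   engine = get_engine(args)   # deserializes the prebuilt TRT engine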

View file

@ -14,449 +14,488 @@
import argparse
import copy
import itertools
import math
import os
import random
import time
import toml
import pyprof
import torch
import numpy as np
import torch.cuda.profiler as profiler
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel
from dataset import AudioToTextDataLayer
from helpers import (add_ctc_labels, model_multi_gpu, monitor_asr_train_progress,
print_dict, print_once, process_evaluation_batch,
process_evaluation_epoch)
from model import AudioPreprocessing, CTCLossNM, GreedyCTCDecoder, Jasper
from optimizers import Novograd, AdamW
from common import helpers
from common.dali.data_loader import DaliDataLoader
from common.dataset import AudioDataset, get_data_loader
from common.features import BaseFeatures, FilterbankFeatures
from common.helpers import (Checkpointer, greedy_wer, num_weights, print_once,
process_evaluation_epoch)
from common.optimizers import AdamW, lr_policy, Novograd
from common.tb_dllogger import flush_log, init_log, log
from jasper import config
from jasper.model import CTCLossNM, GreedyCTCDecoder, Jasper
def lr_policy(initial_lr, step, N):
"""
learning rate decay
Args:
initial_lr: base learning rate
step: current iteration number
N: total number of iterations over which learning rate is decayed
"""
min_lr = 0.00001
res = initial_lr * ((N - step) / N) ** 2
return max(res, min_lr)
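# Worked example of this legacy policy: with initial_lr=1e-3 and N=10,000 total
# steps, at step=5,000 the rate is 1e-3 * ((10000 - 5000) / 10000) ** 2 = 2.5e-4,
# still above the 1e-5 floor.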
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
training = parser.add_argument_group('training setup')
training.add_argument('--epochs', default=400, type=int,
help='Number of epochs for the entire training; influences the lr schedule')
training.add_argument("--warmup_epochs", default=0, type=int,
help='Initial epochs of increasing learning rate')
training.add_argument("--hold_epochs", default=0, type=int,
help='Constant max learning rate epochs after warmup')
training.add_argument('--epochs_this_job', default=0, type=int,
help=('Run for a number of epochs with no effect on the lr schedule. '
'Useful for re-starting the training.'))
training.add_argument('--cudnn_benchmark', action='store_true', default=True,
help='Enable cudnn benchmark')
training.add_argument('--amp', '--fp16', action='store_true', default=False,
help='Use mixed precision training')
training.add_argument('--seed', default=42, type=int, help='Random seed')
training.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0),
type=int, help='GPU id used for distributed training')
training.add_argument('--pre_allocate_range', default=None, type=int, nargs=2,
help='Warmup with batches of length [min, max] before training')
training.add_argument('--pyprof', action='store_true', help='Enable pyprof profiling')
optim = parser.add_argument_group('optimization setup')
optim.add_argument('--batch_size', default=32, type=int,
help='Global batch size')
optim.add_argument('--lr', default=1e-3, type=float,
help='Peak learning rate')
optim.add_argument("--min_lr", default=1e-5, type=float,
help='minimum learning rate')
optim.add_argument("--lr_policy", default='exponential', type=str,
choices=['exponential', 'legacy'], help='lr scheduler')
optim.add_argument("--lr_exp_gamma", default=0.99, type=float,
help='gamma factor for exponential lr scheduler')
optim.add_argument('--weight_decay', default=1e-3, type=float,
help='Weight decay for the optimizer')
optim.add_argument('--grad_accumulation_steps', default=1, type=int,
help='Number of accumulation steps')
optim.add_argument('--optimizer', default='novograd', type=str,
choices=['novograd', 'adamw'], help='Optimization algorithm')
optim.add_argument('--ema', type=float, default=0.0,
help='Discount factor for exp averaging of model weights')
io = parser.add_argument_group('feature and checkpointing setup')
io.add_argument('--dali_device', type=str, choices=['none', 'cpu', 'gpu'],
default='gpu', help='Use DALI pipeline for fast data processing')
io.add_argument('--resume', action='store_true',
help='Try to resume from last saved checkpoint.')
io.add_argument('--ckpt', default=None, type=str,
help='Path to a checkpoint for resuming training')
io.add_argument('--save_frequency', default=10, type=int,
help='Checkpoint saving frequency in epochs')
io.add_argument('--keep_milestones', default=[100, 200, 300], type=int, nargs='+',
help='Milestone checkpoints to keep from removing')
io.add_argument('--save_best_from', default=380, type=int,
help='Epoch on which to begin tracking best checkpoint (dev WER)')
io.add_argument('--eval_frequency', default=200, type=int,
help='Number of steps between evaluations on dev set')
io.add_argument('--log_frequency', default=25, type=int,
help='Number of steps between printing training stats')
io.add_argument('--prediction_frequency', default=100, type=int,
help='Number of steps between printing sample decodings')
io.add_argument('--model_config', type=str, required=True,
help='Path of the model configuration file')
io.add_argument('--train_manifests', type=str, required=True, nargs='+',
help='Paths of the training dataset manifest file')
io.add_argument('--val_manifests', type=str, required=True, nargs='+',
help='Paths of the evaluation datasets manifest files')
io.add_argument('--max_duration', type=float,
help='Discard samples longer than max_duration')
io.add_argument('--pad_to_max_duration', action='store_true', default=False,
help='Pad training sequences to max_duration')
io.add_argument('--dataset_dir', required=True, type=str,
help='Root dir of dataset')
io.add_argument('--output_dir', type=str, required=True,
help='Directory for logs and checkpoints')
io.add_argument('--log_file', type=str, default=None,
help='Path to save the training logfile.')
return parser.parse_args()
def save(model, ema_model, optimizer, epoch, output_dir, optim_level):
"""
Saves model checkpoint
Args:
model: model
ema_model: model with exponential averages of weights
optimizer: optimizer
epoch: epoch of model training
output_dir: path to save model checkpoint
"""
out_fpath = os.path.join(output_dir, f"Jasper_epoch{epoch}_checkpoint.pt")
print_once(f"Saving {out_fpath}...")
if torch.distributed.is_initialized():
torch.distributed.barrier()
rank = torch.distributed.get_rank()
else:
rank = 0
if rank == 0:
checkpoint = {
'epoch': epoch,
'state_dict': getattr(model, 'module', model).state_dict(),
'optimizer': optimizer.state_dict(),
'amp': amp.state_dict() if optim_level > 0 else None,
}
if ema_model is not None:
checkpoint['ema_state_dict'] = getattr(ema_model, 'module', ema_model).state_dict()
torch.save(checkpoint, out_fpath)
print_once('Saved.')
def reduce_tensor(tensor, num_gpus):
rt = tensor.clone()
dist.all_reduce(rt, op=dist.ReduceOp.SUM)
return rt.true_divide(num_gpus)
def apply_ema(model, ema_model, decay):
if not decay:
return
st = model.state_dict()
add_module = hasattr(model, 'module') and not hasattr(ema_model, 'module')
for k,v in ema_model.state_dict().items():
if add_module and not k.startswith('module.'):
k = 'module.' + k
v.copy_(decay * v + (1 - decay) * st[k])
sd = getattr(model, 'module', model).state_dict()
for k, v in ema_model.state_dict().items():
v.copy_(decay * v + (1 - decay) * sd[k])
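# Example: with decay=0.999, each call moves an EMA weight to
# 0.999 * ema_value + 0.001 * current_value.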
def train(
data_layer,
data_layer_eval,
model,
ema_model,
ctc_loss,
greedy_decoder,
optimizer,
optim_level,
labels,
multi_gpu,
args,
fn_lr_policy=None):
"""Trains model
Args:
data_layer: training data layer
data_layer_eval: evaluation data layer
model: model ( encapsulates data processing, encoder, decoder)
ctc_loss: loss function
greedy_decoder: greedy ctc decoder
optimizer: optimizer
optim_level: AMP optimization level
labels: list of output labels
multi_gpu: true if multi gpu training
args: script input argument list
fn_lr_policy: learning rate adjustment function
"""
def eval(model, name=''):
"""Evaluates model on evaluation dataset
"""
with torch.no_grad():
_global_var_dict = {
'EvalLoss': [],
'predictions': [],
'transcripts': [],
}
eval_dataloader = data_layer_eval.data_iterator
for data in eval_dataloader:
tensors = []
for d in data:
if isinstance(d, torch.Tensor):
tensors.append(d.cuda())
else:
tensors.append(d)
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
@torch.no_grad()
def evaluate(epoch, step, val_loader, val_feat_proc, labels, model,
ema_model, ctc_loss, greedy_decoder, use_amp, use_dali=False):
model.eval()
if optim_level == 1:
with amp.disable_casts():
t_processed_signal_e, t_processed_sig_length_e = audio_preprocessor(t_audio_signal_e, t_a_sig_length_e)
else:
t_processed_signal_e, t_processed_sig_length_e = audio_preprocessor(t_audio_signal_e, t_a_sig_length_e)
if jasper_encoder.use_conv_mask:
t_log_probs_e, t_encoded_len_e = model.forward((t_processed_signal_e, t_processed_sig_length_e))
else:
t_log_probs_e = model.forward(t_processed_signal_e)
t_loss_e = ctc_loss(log_probs=t_log_probs_e, targets=t_transcript_e, input_length=t_encoded_len_e, target_length=t_transcript_len_e)
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
for model, subset in [(model, 'dev'), (ema_model, 'dev_ema')]:
if model is None:
continue
values_dict = dict(
loss=[t_loss_e],
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e]
)
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
model.eval()
start_time = time.time()
agg = {'losses': [], 'preds': [], 'txts': []}
# final aggregation (across all workers and minibatches) and logging of results
wer, eloss = process_evaluation_epoch(_global_var_dict)
if name != '':
name = '_' + name
print_once(f"==========>>>>>>Evaluation{name} Loss: {eloss}\n")
print_once(f"==========>>>>>>Evaluation{name} WER: {wer}\n")
print_once("Starting .....")
start_time = time.time()
train_dataloader = data_layer.data_iterator
epoch = args.start_epoch
step = epoch * args.step_per_epoch
audio_preprocessor = model.module.audio_preprocessor if hasattr(model, 'module') else model.audio_preprocessor
data_spectr_augmentation = model.module.data_spectr_augmentation if hasattr(model, 'module') else model.data_spectr_augmentation
jasper_encoder = model.module.jasper_encoder if hasattr(model, 'module') else model.jasper_encoder
while True:
if multi_gpu:
data_layer.sampler.set_epoch(epoch)
print_once("Starting epoch {0}, step {1}".format(epoch, step))
last_epoch_start = time.time()
batch_counter = 0
average_loss = 0
for data in train_dataloader:
tensors = []
for d in data:
if isinstance(d, torch.Tensor):
tensors.append(d.cuda())
else:
tensors.append(d)
if batch_counter == 0:
if fn_lr_policy is not None:
adjusted_lr = fn_lr_policy(step)
for param_group in optimizer.param_groups:
param_group['lr'] = adjusted_lr
optimizer.zero_grad()
last_iter_start = time.time()
t_audio_signal_t, t_a_sig_length_t, t_transcript_t, t_transcript_len_t = tensors
model.train()
if optim_level == 1:
with amp.disable_casts():
t_processed_signal_t, t_processed_sig_length_t = audio_preprocessor(t_audio_signal_t, t_a_sig_length_t)
for batch in val_loader:
if use_dali:
# with DALI, the data is already on GPU
feat, feat_lens, txt, txt_lens = batch
if val_feat_proc is not None:
feat, feat_lens = val_feat_proc(feat, feat_lens, use_amp)
else:
t_processed_signal_t, t_processed_sig_length_t = audio_preprocessor(t_audio_signal_t, t_a_sig_length_t)
t_processed_signal_t = data_spectr_augmentation(t_processed_signal_t)
if jasper_encoder.use_conv_mask:
t_log_probs_t, t_encoded_len_t = model.forward((t_processed_signal_t, t_processed_sig_length_t))
else:
t_log_probs_t = model.forward(t_processed_signal_t)
batch = [t.cuda(non_blocking=True) for t in batch]
audio, audio_lens, txt, txt_lens = batch
feat, feat_lens = val_feat_proc(audio, audio_lens, use_amp)
t_loss_t = ctc_loss(log_probs=t_log_probs_t, targets=t_transcript_t, input_length=t_encoded_len_t, target_length=t_transcript_len_t)
if args.gradient_accumulation_steps > 1:
t_loss_t = t_loss_t / args.gradient_accumulation_steps
log_probs, enc_lens = model.forward(feat, feat_lens)
loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
pred = greedy_decoder(log_probs)
if 0 < optim_level <= 3:
with amp.scale_loss(t_loss_t, optimizer) as scaled_loss:
scaled_loss.backward()
else:
t_loss_t.backward()
batch_counter += 1
average_loss += t_loss_t.item()
agg['losses'] += helpers.gather_losses([loss])
agg['preds'] += helpers.gather_predictions([pred], labels)
agg['txts'] += helpers.gather_transcripts([txt], [txt_lens], labels)
if batch_counter % args.gradient_accumulation_steps == 0:
optimizer.step()
if step % args.train_frequency == 0:
t_predictions_t = greedy_decoder(log_probs=t_log_probs_t)
e_tensors = [t_predictions_t, t_transcript_t, t_transcript_len_t]
train_wer = monitor_asr_train_progress(e_tensors, labels=labels)
print_once("Loss@Step: {0} ::::::: {1}".format(step, str(average_loss)))
print_once("Step time: {0} seconds".format(time.time() - last_iter_start))
if step > 0 and step % args.eval_frequency == 0:
print_once("Doing Evaluation ....................... ...... ... .. . .")
eval(model)
if args.ema > 0:
eval(ema_model, 'EMA')
step += 1
batch_counter = 0
average_loss = 0
if args.num_steps is not None and step >= args.num_steps:
break
if args.num_steps is not None and step >= args.num_steps:
break
print_once("Finished epoch {0} in {1}".format(epoch, time.time() - last_epoch_start))
epoch += 1
if epoch % args.save_frequency == 0 and epoch > 0:
save(model, ema_model, optimizer, epoch, args.output_dir, optim_level)
if args.num_steps is None and epoch >= args.num_epochs:
break
print_once("Done in {0}".format(time.time() - start_time))
print_once("Final Evaluation ....................... ...... ... .. . .")
eval(model)
if args.ema > 0:
eval(ema_model, 'EMA')
save(model, ema_model, optimizer, epoch, args.output_dir, optim_level)
wer, loss = process_evaluation_epoch(agg)
log((epoch,), step, subset, {'loss': loss, 'wer': 100.0 * wer,
'took': time.time() - start_time})
model.train()
return wer
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
def main():
args = parse_args()
assert(torch.cuda.is_available())
torch.backends.cudnn.benchmark = args.cudnn
assert args.prediction_frequency % args.log_frequency == 0
torch.backends.cudnn.benchmark = args.cudnn_benchmark
# set up distributed training
if args.local_rank is not None:
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
multi_gpu = torch.distributed.is_initialized()
multi_gpu = int(os.environ.get('WORLD_SIZE', 1)) > 1
if multi_gpu:
print_once("DISTRIBUTED TRAINING with {} gpus".format(torch.distributed.get_world_size()))
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')
world_size = dist.get_world_size()
print_once(f'Distributed training with {world_size} GPUs\n')
else:
world_size = 1
# define amp optimization level
optim_level = 1 if args.amp else 0
torch.manual_seed(args.seed + args.local_rank)
np.random.seed(args.seed + args.local_rank)
random.seed(args.seed + args.local_rank)
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
init_log(args)
train_manifest = args.train_manifest
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input']
featurizer_config_eval = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
featurizer_config_eval["optimization_level"] = optim_level
cfg = config.load(args.model_config)
config.apply_duration_flags(cfg, args.max_duration, args.pad_to_max_duration)
sampler_type = featurizer_config.get("sampler", 'default')
perturb_config = jasper_model_definition.get('perturb', None)
if args.pad_to_max:
assert(args.max_duration > 0)
featurizer_config['max_duration'] = args.max_duration
featurizer_config_eval['max_duration'] = args.max_duration
featurizer_config['pad_to'] = -1
featurizer_config_eval['pad_to'] = -1
symbols = helpers.add_ctc_blank(cfg['labels'])
print_once('model_config')
print_dict(jasper_model_definition)
assert args.grad_accumulation_steps >= 1
assert args.batch_size % args.grad_accumulation_steps == 0
batch_size = args.batch_size // args.grad_accumulation_steps
if args.gradient_accumulation_steps < 1:
raise ValueError('Invalid gradient accumulation steps parameter {}'.format(args.gradient_accumulation_steps))
if args.batch_size % args.gradient_accumulation_steps != 0:
raise ValueError('gradient accumulation step {} is not divisible by batch size {}'.format(args.gradient_accumulation_steps, args.batch_size))
print_once('Setting up datasets...')
train_dataset_kw, train_features_kw = config.input(cfg, 'train')
val_dataset_kw, val_features_kw = config.input(cfg, 'val')
use_dali = args.dali_device in ('cpu', 'gpu')
if use_dali:
assert train_dataset_kw['ignore_offline_speed_perturbation'], \
"DALI doesn't support offline speed perturbation"
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
perturb_config=perturb_config,
manifest_filepath=train_manifest,
labels=dataset_vocab,
batch_size=args.batch_size // args.gradient_accumulation_steps,
multi_gpu=multi_gpu,
pad_to_max=args.pad_to_max,
sampler=sampler_type)
# pad_to_max_duration is not supported by DALI - have simple padders
if train_features_kw['pad_to_max_duration']:
train_feat_proc = BaseFeatures(
pad_align=train_features_kw['pad_align'],
pad_to_max_duration=True,
max_duration=train_features_kw['max_duration'],
sample_rate=train_features_kw['sample_rate'],
window_size=train_features_kw['window_size'],
window_stride=train_features_kw['window_stride'])
train_features_kw['pad_to_max_duration'] = False
else:
train_feat_proc = None
data_layer_eval = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config_eval,
manifest_filepath=val_manifest,
labels=dataset_vocab,
batch_size=args.batch_size,
multi_gpu=multi_gpu,
pad_to_max=args.pad_to_max
)
if val_features_kw['pad_to_max_duration']:
val_feat_proc = BaseFeatures(
pad_align=val_features_kw['pad_align'],
pad_to_max_duration=True,
max_duration=val_features_kw['max_duration'],
sample_rate=val_features_kw['sample_rate'],
window_size=val_features_kw['window_size'],
window_stride=val_features_kw['window_stride'])
val_features_kw['pad_to_max_duration'] = False
else:
val_feat_proc = None
model = Jasper(feature_config=featurizer_config, jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
train_loader = DaliDataLoader(gpu_id=args.local_rank,
dataset_path=args.dataset_dir,
config_data=train_dataset_kw,
config_features=train_features_kw,
json_names=args.train_manifests,
batch_size=batch_size,
grad_accumulation_steps=args.grad_accumulation_steps,
pipeline_type="train",
device_type=args.dali_device,
symbols=symbols)
ctc_loss = CTCLossNM( num_classes=len(ctc_vocab))
val_loader = DaliDataLoader(gpu_id=args.local_rank,
dataset_path=args.dataset_dir,
config_data=val_dataset_kw,
config_features=val_features_kw,
json_names=args.val_manifests,
batch_size=batch_size,
pipeline_type="val",
device_type=args.dali_device,
symbols=symbols)
else:
train_dataset_kw, train_features_kw = config.input(cfg, 'train')
train_dataset = AudioDataset(args.dataset_dir,
args.train_manifests,
symbols,
**train_dataset_kw)
train_loader = get_data_loader(train_dataset,
batch_size,
multi_gpu=multi_gpu,
shuffle=True,
num_workers=4)
train_feat_proc = FilterbankFeatures(**train_features_kw)
val_dataset_kw, val_features_kw = config.input(cfg, 'val')
val_dataset = AudioDataset(args.dataset_dir,
args.val_manifests,
symbols,
**val_dataset_kw)
val_loader = get_data_loader(val_dataset,
batch_size,
multi_gpu=multi_gpu,
shuffle=False,
num_workers=4,
drop_last=False)
val_feat_proc = FilterbankFeatures(**val_features_kw)
dur = train_dataset.duration / 3600
dur_f = train_dataset.duration_filtered / 3600
nsampl = len(train_dataset)
print_once(f'Training samples: {nsampl} ({dur:.1f}h, '
f'filtered {dur_f:.1f}h)')
if train_feat_proc is not None:
train_feat_proc.cuda()
if val_feat_proc is not None:
val_feat_proc.cuda()
steps_per_epoch = len(train_loader) // args.grad_accumulation_steps
# set up the model
model = Jasper(encoder_kw=config.encoder(cfg),
decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
model.cuda()
ctc_loss = CTCLossNM(n_classes=len(symbols))
greedy_decoder = GreedyCTCDecoder()
print_once("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
print_once("Number of parameters in decode: {0}".format(model.jasper_decoder.num_weights()))
print_once(f'Model size: {num_weights(model) / 10**6:.1f}M params\n')
N = len(data_layer)
if sampler_type == 'default':
args.step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
elif sampler_type == 'bucket':
args.step_per_epoch = int(len(data_layer.sampler) / args.batch_size )
print_once('-----------------')
print_once('Have {0} examples to train on.'.format(N))
print_once('Have {0} steps / (gpu * epoch).'.format(args.step_per_epoch))
print_once('-----------------')
fn_lr_policy = lambda s: lr_policy(args.lr, s, args.num_epochs * args.step_per_epoch)
model.cuda()
if args.optimizer_kind == "novograd":
optimizer = Novograd(model.parameters(),
lr=args.lr,
weight_decay=args.weight_decay)
elif args.optimizer_kind == "adam":
optimizer = AdamW(model.parameters(),
lr=args.lr,
weight_decay=args.weight_decay)
# optimization
kw = {'lr': args.lr, 'weight_decay': args.weight_decay}
if args.optimizer == "novograd":
optimizer = Novograd(model.parameters(), **kw)
elif args.optimizer == "adamw":
optimizer = AdamW(model.parameters(), **kw)
else:
raise ValueError("invalid optimizer choice: {}".format(args.optimizer_kind))
raise ValueError(f'Invalid optimizer "{args.optimizer}"')
if 0 < optim_level <= 3:
adjust_lr = lambda step, epoch, optimizer: lr_policy(
step, epoch, args.lr, optimizer, steps_per_epoch=steps_per_epoch,
warmup_epochs=args.warmup_epochs, hold_epochs=args.hold_epochs,
num_epochs=args.epochs, policy=args.lr_policy, min_lr=args.min_lr,
exp_gamma=args.lr_exp_gamma)
if args.amp:
model, optimizer = amp.initialize(
min_loss_scale=1.0,
models=model,
optimizers=optimizer,
opt_level='O' + str(optim_level))
min_loss_scale=1.0, models=model, optimizers=optimizer,
opt_level='O1', max_loss_scale=512.0)
if args.ema > 0:
ema_model = copy.deepcopy(model)
else:
ema_model = None
model = model_multi_gpu(model, multi_gpu)
if multi_gpu:
model = DistributedDataParallel(model)
if args.pyprof:
pyprof.init(enable_function_stack=True)
# load checkpoint
meta = {'best_wer': 10**6, 'start_epoch': 0}
checkpointer = Checkpointer(args.output_dir, 'Jasper',
args.keep_milestones, args.amp)
if args.resume:
args.ckpt = checkpointer.last_checkpoint() or args.ckpt
if args.ckpt is not None:
print_once("loading model from {}".format(args.ckpt))
checkpoint = torch.load(args.ckpt, map_location="cpu")
if hasattr(model, 'module'):
model.module.load_state_dict(checkpoint['state_dict'], strict=True)
else:
model.load_state_dict(checkpoint['state_dict'], strict=True)
checkpointer.load(args.ckpt, model, ema_model, optimizer, meta)
if args.ema > 0:
if 'ema_state_dict' in checkpoint:
if hasattr(ema_model, 'module'):
ema_model.module.load_state_dict(checkpoint['ema_state_dict'], strict=True)
else:
ema_model.load_state_dict(checkpoint['ema_state_dict'], strict=True)
start_epoch = meta['start_epoch']
best_wer = meta['best_wer']
epoch = 1
step = start_epoch * steps_per_epoch + 1
if args.pyprof:
torch.autograd.profiler.emit_nvtx().__enter__()
profiler.start()
# training loop
model.train()
# pre-allocate
if args.pre_allocate_range is not None:
n_feats = train_features_kw['n_filt']
pad_align = train_features_kw['pad_align']
a, b = args.pre_allocate_range
for n_frames in range(a, b + pad_align, pad_align):
print_once(f'Pre-allocation ({batch_size}x{n_feats}x{n_frames})...')
feat = torch.randn(batch_size, n_feats, n_frames, device='cuda')
feat_lens = torch.ones(batch_size, device='cuda').fill_(n_frames)
txt = torch.randint(high=len(symbols)-1, size=(batch_size, 100),
device='cuda')
txt_lens = torch.ones(batch_size, device='cuda').fill_(100)
log_probs, enc_lens = model(feat, feat_lens)
del feat
loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
loss.backward()
model.zero_grad()
for epoch in range(start_epoch + 1, args.epochs + 1):
if multi_gpu and not use_dali:
train_loader.sampler.set_epoch(epoch)
epoch_utts = 0
accumulated_batches = 0
epoch_start_time = time.time()
for batch in train_loader:
if accumulated_batches == 0:
adjust_lr(step, epoch, optimizer)
optimizer.zero_grad()
step_loss = 0
step_utts = 0
step_start_time = time.time()
if use_dali:
# with DALI, the data is already on GPU
feat, feat_lens, txt, txt_lens = batch
if train_feat_proc is not None:
feat, feat_lens = train_feat_proc(feat, feat_lens, args.amp)
else:
print_once('WARNING: ema_state_dict not found in the checkpoint')
print_once('WARNING: initializing EMA model with regular params')
if hasattr(ema_model, 'module'):
ema_model.module.load_state_dict(checkpoint['state_dict'], strict=True)
batch = [t.cuda(non_blocking=True) for t in batch]
audio, audio_lens, txt, txt_lens = batch
feat, feat_lens = train_feat_proc(audio, audio_lens, args.amp)
log_probs, enc_lens = model(feat, feat_lens)
loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
loss /= args.grad_accumulation_steps
if torch.isnan(loss).any():
print_once(f'WARNING: loss is NaN; skipping update')
else:
if multi_gpu:
step_loss += reduce_tensor(loss.data, world_size).item()
else:
ema_model.load_state_dict(checkpoint['state_dict'], strict=True)
step_loss += loss.item()
optimizer.load_state_dict(checkpoint['optimizer'])
if args.amp:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
step_utts += batch[0].size(0) * world_size
epoch_utts += batch[0].size(0) * world_size
accumulated_batches += 1
if optim_level > 0:
amp.load_state_dict(checkpoint['amp'])
if accumulated_batches % args.grad_accumulation_steps == 0:
optimizer.step()
apply_ema(model, ema_model, args.ema)
args.start_epoch = checkpoint['epoch']
else:
args.start_epoch = 0
if step % args.log_frequency == 0:
preds = greedy_decoder(log_probs)
wer, pred_utt, ref = greedy_wer(preds, txt, txt_lens, symbols)
train(data_layer, data_layer_eval, model, ema_model,
ctc_loss=ctc_loss, \
greedy_decoder=greedy_decoder, \
optimizer=optimizer, \
labels=ctc_vocab, \
optim_level=optim_level, \
multi_gpu=multi_gpu, \
fn_lr_policy=fn_lr_policy if args.lr_decay else None, \
args=args)
if step % args.prediction_frequency == 0:
print_once(f' Decoded: {pred_utt[:90]}')
print_once(f' Reference: {ref[:90]}')
step_time = time.time() - step_start_time
log((epoch, step % steps_per_epoch or steps_per_epoch, steps_per_epoch),
step, 'train',
{'loss': step_loss,
'wer': 100.0 * wer,
'throughput': step_utts / step_time,
'took': step_time,
'lrate': optimizer.param_groups[0]['lr']})
step_start_time = time.time()
if step % args.eval_frequency == 0:
wer = evaluate(epoch, step, val_loader, val_feat_proc,
symbols, model, ema_model, ctc_loss,
greedy_decoder, args.amp, use_dali)
if wer < best_wer and epoch >= args.save_best_from:
checkpointer.save(model, ema_model, optimizer, epoch,
step, best_wer, is_best=True)
best_wer = wer
step += 1
accumulated_batches = 0
# end of step
# DALI iterator needs to be exhausted;
# if not using DALI, simulate drop_last=True with grad accumulation
if not use_dali and step > steps_per_epoch * epoch:
break
epoch_time = time.time() - epoch_start_time
log((epoch,), None, 'train_avg', {'throughput': epoch_utts / epoch_time,
'took': epoch_time})
if epoch % args.save_frequency == 0 or epoch in args.keep_milestones:
checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
if 0 < args.epochs_this_job <= epoch - start_epoch:
print_once(f'Finished after {args.epochs_this_job} epochs.')
break
# end of epoch
if args.pyprof:
profiler.stop()
torch.autograd.profiler.emit_nvtx().__exit__(None, None, None)
log((), None, 'train_avg', {'throughput': epoch_utts / epoch_time})
if epoch == args.epochs:
evaluate(epoch, step, val_loader, val_feat_proc, symbols, model,
ema_model, ctc_loss, greedy_decoder, args.amp, use_dali)
checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
flush_log()
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument("--local_rank", default=None, type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--num_epochs", default=10, type=int, help='number of training epochs. if number of steps if specified will overwrite this')
parser.add_argument("--num_steps", default=None, type=int, help='if specified overwrites num_epochs and will only train for this number of iterations')
parser.add_argument("--save_freq", dest="save_frequency", default=300, type=int, help='number of epochs until saving checkpoint. will save at the end of training too.')
parser.add_argument("--eval_freq", dest="eval_frequency", default=200, type=int, help='number of iterations until doing evaluation on full dataset')
parser.add_argument("--train_freq", dest="train_frequency", default=25, type=int, help='number of iterations until printing training statistics on the past iteration')
parser.add_argument("--lr", default=1e-3, type=float, help='learning rate')
parser.add_argument("--weight_decay", default=1e-3, type=float, help='weight decay rate')
parser.add_argument("--train_manifest", type=str, required=True, help='relative path given dataset folder of training manifest file')
parser.add_argument("--model_toml", type=str, required=True, help='relative path given dataset folder of model configuration file')
parser.add_argument("--val_manifest", type=str, required=True, help='relative path given dataset folder of evaluation manifest file')
parser.add_argument("--max_duration", type=float, help='maximum duration of audio samples for training and evaluation')
parser.add_argument("--pad_to_max", action="store_true", default=False, help="pad sequence to max_duration")
parser.add_argument("--gradient_accumulation_steps", default=1, type=int, help='number of accumulation steps')
parser.add_argument("--optimizer", dest="optimizer_kind", default="novograd", type=str, help='optimizer')
parser.add_argument("--dataset_dir", dest="dataset_dir", required=True, type=str, help='root dir of dataset')
parser.add_argument("--lr_decay", action="store_true", default=False, help='use learning rate decay')
parser.add_argument("--cudnn", action="store_true", default=False, help="enable cudnn benchmark")
parser.add_argument("--amp", "--fp16", action="store_true", default=False, help="use mixed precision training")
parser.add_argument("--output_dir", type=str, required=True, help='saves results in this directory')
parser.add_argument("--ckpt", default=None, type=str, help="if specified continues training from given checkpoint. Otherwise starts from beginning")
parser.add_argument("--seed", default=42, type=int, help='seed')
parser.add_argument("--ema", type=float, default=0.0, help='discount factor for exponential averaging of model weights during training')
args=parser.parse_args()
return args
if __name__=="__main__":
args = parse_args()
print_dict(vars(args))
main(args)
if __name__ == "__main__":
main()

View file

@ -1,38 +1,10 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.09-py3
FROM tensorrtserver_client as trtis-client
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tritonserver:20.10-py3-clientsdk
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && apt-get install -y python3
ARG version=6.0.1-1+cuda10.1
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb \
&& dpkg -i cuda-repo-*.deb \
&& wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb \
&& dpkg -i nvidia-machine-learning-repo-*.deb \
&& apt-get update \
&& apt-get install -y --no-install-recommends libnvinfer6=${version} libnvonnxparsers6=${version} libnvparsers6=${version} libnvinfer-plugin6=${version} libnvinfer-dev=${version} libnvonnxparsers-dev=${version} libnvparsers-dev=${version} libnvinfer-plugin-dev=${version} python-libnvinfer=${version} python3-libnvinfer=${version}
RUN cp -r /usr/lib/python3.6/dist-packages/tensorrt /opt/conda/lib/python3.6/site-packages/tensorrt
RUN apt update && apt install -y python3-pyaudio libsndfile1
ENV PATH=$PATH:/usr/src/tensorrt/bin
WORKDIR /tmp/onnx-trt
COPY tensorrt/onnx-trt.patch .
RUN git clone https://github.com/onnx/onnx-tensorrt.git && cd onnx-tensorrt && git checkout b677b9cbf19af803fa6f76d05ce558e657e4d8b6 && git submodule update --init --recursive && \
patch -f < ../onnx-trt.patch && mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DGPU_ARCHS="60 70 75" && make -j16 && make install && mv -f /usr/lib/libnvonnx* /usr/lib/x86_64-linux-gnu/ && ldconfig
# Here's a good place to install pip reqs from JoC repo.
# At the same step, also install TRT pip reqs
WORKDIR /tmp/pipReqs
COPY requirements.txt /tmp/pipReqs/pytRequirements.txt
COPY tensorrt/requirements.txt /tmp/pipReqs/trtRequirements.txt
COPY triton/requirements.txt /tmp/pipReqs/trtisRequirements.txt
RUN apt-get update && apt-get install -y --no-install-recommends portaudio19-dev && pip install -r pytRequirements.txt && pip install -r trtRequirements.txt && pip install -r trtisRequirements.txt
#Copy the perf_client over
COPY --from=trtis-client /workspace/install/bin/perf_client /workspace/install/bin/perf_client
#Copy the python wheel and install with pip
COPY --from=trtis-client /workspace/install/python/tensorrtserver*.whl /tmp/
RUN pip install /tmp/tensorrtserver*.whl && rm /tmp/tensorrtserver*.whl
RUN pip3 install -U pip
RUN pip3 install onnxruntime unidecode inflect soundfile
WORKDIR /workspace/jasper
COPY . .

View file

@ -1,55 +1,37 @@
# Jasper Inference Using Triton Inference Server
# Deploying the Jasper Inference model using Triton Inference Server
This is a subfolder of the Jasper for PyTorch repository that provides scripts to deploy high-performance inference using NVIDIA Triton Inference Server (formerly NVIDIA TensorRT Inference Server). It offers different options for the inference model pipeline.
This subfolder of the Jasper for PyTorch repository contains scripts for deployment of high-performance inference on NVIDIA Triton Inference Server as well as detailed performance analysis. It offers different options for the inference model pipeline.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Triton Inference Server Overview](#triton-inference-server-overview)
* [Inference Pipeline in Triton Inference Server](#inference-pipeline-in-triton-inference-server)
- [Solution overview](#solution-overview)
- [Inference Pipeline in Triton Inference Server](#inference-pipeline-in-triton-inference-server)
- [Setup](#setup)
* [Supported Software](#supported-software)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Scripts and sample code](#scripts-and-sample-code)
- [Performance](#performance)
* [Inference Benchmarking in Triton Inference Server](#inference-benchmarking-in-triton-inference-server)
* [Results](#results)
* [Performance analysis for Triton Inference Server: NVIDIA T4](#performance-analysis-for-triton-inference-server-nvidia-t4)
* [Maximum Batch Size](#maximum-batch-size)
* [Batching techniques: Static versus Dynamic Batching](#batching-techniques-static-versus-dynamic-batching)
* [TensorRT/ONNX/PyTorch JIT comparisons](#tensorrt/onnx/pytorch-jit-comparisons)
* [Throughput Comparison](#throughput-comparison)
* [Latency Comparison](#latency-comparison)
## Model overview
### Model architecture
* [Inference Benchmarking in Triton Inference Server](#inference-benchmarking-in-triton-inference-server)
* [Results](#results)
* [Performance Analysis for Triton Inference Server: NVIDIA T4](#performance-analysis-for-triton-inference-server-nvidia-t4)
* [Maximum batch size](#maximum-batch-size)
* [Batching techniques: Static versus Dynamic Batching](#batching-techniques-static-versus-dynamic)
* [TensorRT, ONNX, and PyTorch JIT comparisons](#tensorrt-onnx-and-pytorch-jit-comparisons)
- [Release Notes](#release-notes)
* [Changelog](#change-log)
* [Known issues](#known-issues)
Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference. More information about Jasper and its training can be found in the [Jasper PyTorch README](../README.md).
By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
In the original paper Jasper is trained with masked convolutions, which masks out the padded part of an input sequence in a batch before the 1D-Convolution.
Masking is not used for inference. In inference, the mask operation does not improve accuracy on the test and development datasets, while omitting it gives better inference performance, especially after TensorRT optimization.
More information on the Jasper model architecture can be found in the [Jasper PyTorch README](../README.md).
### Triton Inference Server Overview
## Solution Overview
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains detailed performance analysis as well as scripts to run Jasper inference using Triton Inference Server.
A typical Triton Inference Server pipeline can be broken down into the following steps:
1. The client serializes the inference request into a message and sends it to the server (Client Send).
1. The client serializes the inference request into a message and sends it to the server (Client Send).
2. The message travels over the network from the client to the server (Network).
3. The message arrives at the server, and is deserialized (Server Receive).
4. The request is placed on the queue (Server Queue).
@ -58,15 +40,16 @@ A typical Triton Inference Server pipeline can be broken down into the following
7. The completed message then travels over the network from the server to the client (Network).
8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).
Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time, compared to steps 5-6. As backend deep learning systems like Jasper are rarely exposed directly to end users, but instead only interfacing with local front-end servers, for the sake of Jasper, we can consider that all clients are local.
Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time, compared to step 5. As backend deep learning systems like Jasper are rarely exposed directly to end users, but instead only interfacing with local front-end servers, for the sake of Jasper, we can consider that all clients are local.
In this section, we will go over how to launch both the Triton Inference Server and the client, and how to find the best-performing solution for your specific application needs.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
More information on how to perform inference using NVIDIA Triton Inference Server can be found in [triton/README.md](https://github.com/triton-inference-server/server/blob/master/README.md).
## Inference Pipeline in Triton Inference Server
The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend:
The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend:
**Data preprocessor**
@ -74,11 +57,11 @@ The data processor transforms an input raw audio file into a spectrogram. By def
**Acoustic model**
The acoustic model takes in the spectrogram and outputs a probability over a list of characters. This part is the most compute intensive, taking more than 90% of the entire end-to-end pipeline. The acoustic model is the only component with learnable parameters and what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (More details in [../README.md]). We do not use masking for inference .
The acoustic model takes in the spectrogram and outputs a probability over a list of characters. This part is the most compute intensive, taking more than 90% of the entire end-to-end pipeline. The acoustic model is the only component with learnable parameters and what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (More details in [Jasper PyTorch README](../README.md)). We do not use masking for inference.
**Greedy decoder**
The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability.
The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability.
To run a model with TensorRT, we first construct the model in PyTorch, which is then exported into an ONNX static graph. Finally, a TensorRT engine is constructed from the ONNX file and can be launched to do inference; a rough sketch of the export step is shown below. The following table shows which backends are supported for each part along the model pipeline.
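A rough, hypothetical sketch of the export step (the stand-in module, file names, and shapes below are illustrative only; the repository's own scripts handle the real export):

```python
import torch
import torch.nn as nn

# Stand-in for the Jasper acoustic model: any module mapping a
# (batch, n_filt, time) spectrogram to (batch, time, n_classes) log-probs.
class TinyAcousticModel(nn.Module):
    def __init__(self, n_filt=64, n_classes=29):
        super().__init__()
        self.conv = nn.Conv1d(n_filt, n_classes, kernel_size=11, padding=5)

    def forward(self, feat):
        return self.conv(feat).transpose(1, 2).log_softmax(dim=-1)

model = TinyAcousticModel().eval()
dummy_feat = torch.randn(1, 64, 256)  # (batch, n_filt, time)
torch.onnx.export(
    model, dummy_feat, "jasper_like.onnx",
    input_names=["FEATURES"], output_names=["LOGITS"],
    dynamic_axes={"FEATURES": {0: "batch", 2: "time"},
                  "LOGITS":   {0: "batch", 1: "time"}},
    opset_version=11)
```

A TensorRT engine can then be built from the resulting ONNX file (for example with `trtexec` or the TensorRT ONNX parser) and served by Triton.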
@ -93,37 +76,24 @@ In order to run inference with TensorRT outside of the inference server, refer t
## Setup
### Supported Software
The repository contains a folder `./triton` with a `Dockerfile` which extends the PyTorch 20.10-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
The following software version configuration is supported and has been tested.
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
- [Triton Inference Server 20.10 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
- Access to [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA CUDA repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
- Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
- [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp)
|Software|Version|
|--------|-------|
|Python|3.6.9|
|PyTorch|1.2.0|
|TensorRT|6.0.1.5|
|CUDA|10.1.243|
The following section lists the requirements in order to start inference with Jasper in Triton Inference Server.
### Requirements
The repository contains a folder `./trtis/` with a `Dockerfile` which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [Triton Inference Server 20.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
* Access to [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA cuda repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
Required Python packages are listed in `requirements.txt`. These packages are automatically installed when the Docker container is built.
## Quick Start Guide
Running the following scripts will build and launch the container containing all required dependencies for native PyTorch as well as Triton. This is necessary for using inference and can also be used for data download, processing, and training of the model. For more information on the scripts and arguments, refer to the [Advanced](#advanced) section.
1. Clone the repository.
@ -132,114 +102,103 @@ Running the following scripts will build and launch the container containing all
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
2. Build the Jasper PyTorch container.
Running the following scripts will build the container which contains all the required dependencies for data download and processing as well as converting the model.
```bash
bash scripts/docker/build.sh
```
3. Start an interactive session in the Docker container:
```bash
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
```
Where <DATA_DIR>, <CHECKPOINT_DIR> and <RESULT_DIR> can be either empty or absolute directory paths to the dataset, existing checkpoints, or potential output files. When left empty, they default to `datasets/`, `checkpoints/`, and `results/`, respectively. The `/datasets`, `/checkpoints`, `/results` directories will be mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host.
Note that `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` directly correspond to the same arguments in `scripts/docker/launch.sh` and `trt/scripts/docker/launch.sh` mentioned in the [Jasper PyTorch README](../README.md) and [Jasper TensorRT README](../tensorrt/README.md).
Briefly, `<DATA_DIR>` should contain, or be prepared to contain a `LibriSpeech` sub-directory (created in [Acquiring Dataset](../trt/README.md)), `<CHECKPOINT_DIR>` should contain a PyTorch model checkpoint (`*.pt`) file obtained through training described in [Jasper PyTorch README](../README.md), and `<RESULT_DIR>` should be prepared to contain converted model and logs.
4. Downloading the `test-clean` part of `LibriSpeech` is required for model conversion, but it is not required for inference on Triton Inference Server, which can use a single .wav audio file. To download and preprocess LibriSpeech, run the following inside the container:
```bash
bash triton/scripts/download_triton_librispeech.sh
bash triton/scripts/preprocess_triton_librispeech.sh
```
5. (Option 1) Convert pretrained PyTorch model checkpoint into Triton Inference Server compatible model backends.
Inside the container, run:
```bash
export CHECKPOINT_PATH=<CHECKPOINT_PATH>
export CONVERT_PRECISIONS=<CONVERT_PRECISIONS>
export CONVERTS=<CONVERTS>
bash triton/scripts/export_model.sh
```
Where `<CHECKPOINT_PATH>` (`"/checkpoints/jasper_fp16.pt"`) is the absolute file path of the pretrained checkpoint, `<CONVERT_PRECISIONS>` (`"fp16" "fp32"`) is the list of precisions used for conversion, and `<CONVERTS>` (`"feature-extractor" "decoder" "ts-trace" "onnx" "tensorrt"`) is the list of conversions to be applied. The feature extractor converts only to TorchScript trace module (`feature-extractor`), the decoder only to TorchScript script module (`decoder`), and the Jasper model can convert to TorchScript trace module (`ts-trace`), ONNX (`onnx`), or TensorRT (`tensorrt`).
A pretrained PyTorch model checkpoint for model conversion can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp).
More details can be found in the [Advanced](#advanced) section under [Scripts and sample code](#scripts-and-sample-code).
6. (Option 2) Download pre-exported inference checkpoints from NGC.
Alternatively, you can skip the manual model export and download already generated model backends for every version of the model pipeline.
* [Jasper_ONNX](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_onnx_fp16_amp/version),
* [Jasper_TorchScript](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_torchscript_fp16_amp/version),
* [Jasper_TensorRT_Turing](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_turing/version),
* [Jasper_TensorRT_Volta](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_volta/version).
If you wish to use the TensorRT pipeline, make sure to download the correct version for your hardware. The extracted model folder should contain 3 subfolders `feature-extractor-ts-trace`, `decoder-ts-script` and `jasper-x`, where `x` can be `ts-trace`, `onnx`, or `tensorrt`, depending on the model backend. Copy the 3 model folders to the directory `./triton/model_repo/fp16` in your Jasper project.
7. Build a container that extends Triton Inference Client:
From outside the container, run:
```bash
bash triton/scripts/docker/build_triton_client.sh
```
Once the above steps are completed you can either run inference benchmarks or perform inference on real data.
8. (Option 1) Run all inference benchmarks.
From outside the container, run:
```bash
export RESULT_DIR=<RESULT_DIR>
export PRECISION_TESTS=<PRECISION_TESTS>
export BATCH_SIZES=<BATCH_SIZES>
export SEQ_LENS=<SEQ_LENS>
bash triton/scripts/execute_all_perf_runs.sh
```
Where `<RESULT_DIR>` is the absolute path to potential output files (`./results`), `<PRECISION_TESTS>` is a list of precisions to be tested (`"fp16" "fp32"`), `<BATCH_SIZES>` is a list of tested batch sizes (`"1" "2" "4" "8"`), and `<SEQ_LENS>` are the tested sequence lengths (`"32000" "112000" "267200"`).
Note: This can take several hours to complete due to the extensiveness of the benchmark. More details about the benchmark are found in the [Advanced](#advanced) section under [Performance](#performance).
9. (Option 2) Run inference on real data using the Client and Triton Inference Server.
9.1 From outside the container, restart the server:
```bash
bash triton/scripts/run_server.sh <MODEL_TYPE> <PRECISION>
```
9.2 From outside the container, submit the client request using:
```bash
bash triton/scripts/run_client.sh <MODEL_TYPE> <DATA_DIR> <FILE>
```
Where `<MODEL_TYPE>` can be either "ts-trace", "tensorrt" or "onnx", and `<PRECISION>` is either "fp32" or "fp16". `<DATA_DIR>` is an absolute local path to the directory of files. <FILE> is the relative path to <DATA_DIR> to either an audio file in .wav format or a manifest file in .json format.
Note: If <FILE> is *.json, <DATA_DIR> should be the path to the LibriSpeech dataset. In this case, this script will do both inference and evaluation on the corresponding LibriSpeech dataset.
## Advanced
@ -249,130 +208,177 @@ The following sections provide greater details about the Triton Inference Server
### Scripts and sample code
The `triton/` directory contains the following files:
* `jasper-client.py`: Python client script that takes an audio file and a specific model pipeline type and submits a client request to the server to run inference with the model on the given audio file.
* `speech_utils.py`: helper functions for `jasper-client.py`.
* `converter.py`: Python script for model conversion to different backends.
* `jasper_module.py`: helper functions for `converter.py`.
* `model_repo_configs/`: directory with Triton model config files for different backend and precision configurations.
The `triton/scripts/` directory has easy to use scripts to run supported functionalities, such as:
* `./docker/build_triton_client.sh`: builds container
* `execute_all_perf_runs.sh`: runs all benchmarks using Triton Inference Server performance client; calls `generate_perf_results.sh`
* `export_model.sh`: from pretrained PyTorch checkpoint generates backends for every version of the model inference pipeline.
* `prepare_model_repository.sh`: copies model config files from `./model_repo_configs/` to `./deploy/model_repo` and creates links to generated model backends, setting up the model repository for Triton Inference Server
* `generate_perf_results.sh`: runs benchmark with `perf-client` for specific configuration and calls `run_perf_client.sh`
* `run_server.sh`: launches Triton Inference Server
* `run_client.sh`: launches client by using `jasper-client.py` to submit inference requests to server
### Running the Triton Inference Server
Launch the Triton Inference Server in detached mode to run in the background by default:
```bash
bash triton/scripts/run_server.sh
```
To run in the foreground interactively, for debugging purposes, run:
```bash
DAEMON="--detach=false" bash trinton/scripts/run_server.sh
```
The script mounts and loads models at `$PWD/triton/deploy/model_repo` to the server with all visible GPUs. In order to selectively choose the devices, set `NVIDIA_VISIBLE_DEVICES`.
### Running the Triton Inference Client
*Real data*
In order to run the client with real data, run:
```bash
bash triton/scripts/run_client.sh <backend> <data directory> <audio file>
```
The script calls `triton/jasper-client.py` which preprocesses data and sends/receives requests to/from the server.
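For reference, a request to the deployed ensemble carries two inputs (`AUDIO_SIGNAL`, `NUM_SAMPLES`) and one output (`TRANSCRIPT`), matching the ensemble configs under `triton/model_repo_configs/`. The snippet below is a hedged sketch using the `tritonclient` HTTP API and is an assumption for illustration only; the supported path is `triton/jasper-client.py` via `run_client.sh`.
```python
# Minimal sketch of a synchronous request to the Jasper TensorRT ensemble.
# Assumes the tritonclient package and a server started by run_server.sh.
import numpy as np
import soundfile
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")

audio, _ = soundfile.read("example.wav", dtype="float32")
signal = np.expand_dims(audio, 0).astype(np.float16)    # [1, num_samples], FP16 input
num_samples = np.array([[len(audio)]], dtype=np.int32)  # [1, 1]

inputs = [
    triton_http.InferInput("AUDIO_SIGNAL", list(signal.shape), "FP16"),
    triton_http.InferInput("NUM_SAMPLES", list(num_samples.shape), "INT32"),
]
inputs[0].set_data_from_numpy(signal)
inputs[1].set_data_from_numpy(num_samples)

result = client.infer("jasper-tensorrt-ensemble", inputs,
                      outputs=[triton_http.InferRequestedOutput("TRANSCRIPT")])
# Per-frame character indices; repeats and blanks still need to be collapsed
# and mapped to characters, as jasper-client.py does.
transcript_ids = result.as_numpy("TRANSCRIPT")
```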
*Synthetic data*
In order to run the client with synthetic data for performance measurements, run:
```bash
export MODEL_NAME=jasper-tensorrt-ensemble
export MODEL_VERSION=1
export BATCH_SIZE=1
export MAX_LATENCY=500
export MAX_CONCURRENCY=64
export AUDIO_LENGTH=32000
export SERVER_HOSTNAME=localhost
export RESULT_DIR_H=${PWD}/results/perf_client/${MODEL_NAME}/batch_${BATCH_SIZE}_len_${AUDIO_LENGTH}
bash triton/scripts/run_perf_client.sh
```
The export values above are default values. The script waits until the server is up and running, sends requests as per the constraints set, and writes results to `/results/results_${TIMESTAMP}.csv`, where `TIMESTAMP=$(date "+%y%m%d_%H%M")` and `/results/` is the results directory mounted in the Docker container.
For more information about `perf_client`, refer to the [official documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/optimization.html#perf-client).
## Performance
### Inference Benchmarking in Triton Inference Server
To benchmark the inference performance on a Volta, Turing, or Ampere GPU, run `bash triton/scripts/execute_all_perf_runs.sh` according to [Quick-Start-Guide](#quick-start-guide) Step 8.
By default, this script measures inference performance for all 3 model pipelines: the PyTorch JIT (`ts-trace`) pipeline, the ONNX (`onnx`) pipeline, and the TensorRT (`tensorrt`) pipeline, both with FP32 and FP16 precision. Each pipeline is measured for different audio input lengths (2 sec, 7 sec, 16.7 sec) and a range of server batch sizes (up to 8). This takes place in `triton/scripts/generate_perf_results.sh`. For a specific audio length and batch size, a static versus dynamic batching comparison is performed.
### Results
In the following section, we analyze the results using the example of the Triton pipeline.
#### Performance Analysis for Triton Inference Server: NVIDIA T4
Based on the figures below, we recommend using the Dynamic Batcher with `max_batch_size=8` and `max_queue_delay_microseconds` as large as possible to fit within your latency window (the values used below are extremely large to exaggerate their effect). The largest improvements to both throughput and latency come from increasing the batch size, due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of concurrent requests, while also keeping latency down for infrequent requests.
All results below are obtained using the following configurations:
* Single T4 16GB GPU on a local server
* Jasper Large
* Audio length = 7 seconds
* FP16 precision
* Python 3.6.10
* PyTorch 1.7.0a0+7036e91
* TensorRT 7.2.1.4
* CUDA 11.1.0.024
* CUDNN 8.0.4.30
Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. X-axis indicates the concurrency - the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput.
##### Batching techniques: Static Batching
Static batching is a feature of the inference server that allows inference requests to be served as they are received. The largest improvements to throughput come from increasing the batch size due to efficiency gains in the GPU with larger batches.
<img src="../images/triton_throughput_latency_summary.png" width="100%" height="100%">
![](../images/static_fp16_2s.png)
Figure 1: Throughput vs. Latency for Jasper, Audio Length = 2sec using various model backends available in Triton Inference Server and static batching.
![](../images/static_fp16_7s.png)
Figure 2: Throughput vs. Latency for Jasper, Audio Length = 7sec using various model backends available in Triton Inference Server and static batching.
![](../images/static_fp16_16.7s.png)
Figure 3: Throughput vs. Latency for Jasper, Audio Length = 16.7sec using various model backends available in Triton Inference Server and static batching.
##### Maximum Batch Size
In general, increasing batch size leads to higher throughput at the cost of higher latency. In the following sections, we analyze the results using the example of the Triton pipeline.
These charts can be used to establish the optimal batch size to use in dynamic batching, given a latency budget. For example, in Figure 2 (Audio length = 7s), given a budget of 50ms, the optimal batch size to use for the TensorRT backend is 4. This will result in a maximum throughput of 100 inf/s under the latency constraint. In all three charts, TensorRT shows the best throughput and latency performance for a given batch size.
##### Batching techniques: Dynamic Batching
<img src="../images/triton_static_batching_bs1.png" width="80%" height="80%">
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in an increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the Dynamic Batcher parameter `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and `preferred_batch_size` to indicate your maximum server batch size in the Triton Inference Server model config.
Figures 4, 5, and 6 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency as the requests are queued up to max_queue_delay_microseconds.
<img src="../images/triton_static_batching_bs8.png" width="80%" height="80%">
![](../images/tensorrt_2s.png)
Figure 4: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at maximum server batch size = 8, max_queue_delay_microseconds = 5000, input audio length = 2 seconds, TensorRT backend.
![](../images/tensorrt_7s.png)
Figure 5: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at maximum server batch size = 8, max_queue_delay_microseconds = 5000, input audio length = 7 seconds, TensorRT backend.
![](../images/tensorrt_16.7s.png)
Figure 6: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at maximum server batch size = 8, max_queue_delay_microseconds = 5000, input audio length = 16.7 seconds, TensorRT backend.
<img src="../images/triton_dynamic_batching.png" width="80%" height="80%">
Figure 4: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at client Batch size = 1, maximum server batch size=4, max_queue_delay_microseconds = 5000
##### TensorRT, ONNX, and PyTorch JIT comparisons
The following tables show inference throughput and latency comparisons across all 3 backends for mixed precision and static batching. The main observations are:
* Increasing the batch size leads to higher inference throughput and latency up to a certain batch size, after which both slowly saturate.
* TensorRT is faster than both the PyTorch and ONNX pipelines, achieving a speedup of up to ~1.5x and ~2.4x, respectively.
* The longer the audio length, the lower the throughput and the higher the latency.
###### Throughput Comparison
The following table shows the throughput benchmark results for all 3 model backends in Triton Inference Server using static batching under optimal concurrency.
|Audio length in seconds|Batch Size|TensorRT (inf/s)|PyTorch (inf/s)|ONNX (inf/s)|TensorRT/PyTorch Speedup|TensorRT/Onnx Speedup|
|--- |--- |--- |--- |--- |--- |--- |
| 2.0| 1| 49.67| 55.67| 41.67| 0.89| 1.19|
| 2.0| 2| 98.67| 96.00| 77.33| 1.03| 1.28|
| 2.0| 4| 180.00| 141.33| 118.67| 1.27| 1.52|
| 2.0| 8| 285.33| 202.67| 136.00| 1.41| 2.10|
| 7.0| 1| 47.67| 37.00| 18.00| 1.29| 2.65|
| 7.0| 2| 79.33| 47.33| 46.00| 1.68| 1.72|
| 7.0| 4| 100.00| 73.33| 36.00| 1.36| 2.78|
| 7.0| 8| 117.33| 82.67| 40.00| 1.42| 2.93|
| 16.7| 1| 36.33| 21.67| 11.33| 1.68| 3.21|
| 16.7| 2| 40.67| 25.33| 16.00| 1.61| 2.54|
| 16.7| 4| 46.67| 37.33| 16.00| 1.25| 2.92|
| 16.7| 8| 48.00| 40.00| 18.67| 1.20| 2.57|
###### Latency Comparison
The following table shows the latency benchmark results for all 3 model backends in Triton Inference Server using static batching and a single concurrent request.
|Audio length in seconds|Batch Size|TensorRT (ms)|PyTorch (ms)|ONNX (ms)|TensorRT/PyTorch Speedup|TensorRT/Onnx Speedup|
|--- |--- |--- |--- |--- |--- |--- |
| 2.0| 1| 23.61| 25.06| 31.84| 1.06| 1.35|
| 2.0| 2| 24.56| 25.11| 37.54| 1.02| 1.53|
| 2.0| 4| 25.90| 31.00| 37.20| 1.20| 1.44|
| 2.0| 8| 31.57| 41.76| 37.13| 1.32| 1.18|
| 7.0| 1| 24.79| 30.55| 32.16| 1.23| 1.30|
| 7.0| 2| 28.48| 45.05| 37.47| 1.58| 1.32|
| 7.0| 4| 41.71| 57.71| 37.92| 1.38| 0.91|
| 7.0| 8| 72.19| 98.84| 38.13| 1.37| 0.53|
| 16.7| 1| 30.66| 48.42| 32.74| 1.58| 1.07|
| 16.7| 2| 52.79| 81.89| 37.82| 1.55| 0.72|
| 16.7| 4| 92.86| 115.03| 37.91| 1.24| 0.41|
| 16.7| 8| 170.34| 203.52| 37.84| 2.36| 0.22|
## Release Notes
### Changelog
February 2021
* Updated Triton scripts for compatibility with Triton Inference Server version 2
* Updated Quick Start Guide
* Updated performance results
### Known issues
There are no known issues in this deployment.

View file

@ -0,0 +1,254 @@
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
import os
import json
import torch
import argparse
import importlib
from pytorch.utils import extract_io_props, load_io_props
import logging
def get_parser():
parser = argparse.ArgumentParser()
# required args
parser.add_argument("--model-module", type=str, default="", required=True,
help="Module with model initializer and data loader")
parser.add_argument('--convert', choices=['ts-script', 'ts-trace',
'onnx', 'tensorrt'],
required=True, help='convert to '
'ts-script: TorchScript using torch.jit.script, '
'ts-trace: TorchScript using torch.jit.trace, '
'onnx: ONNX using torch.onnx.export, '
'tensorrt: TensorRT using OnnxParser, ')
parser.add_argument("--max_workspace_size", type=int,
default=512*1024*1024,
help="set the size of the workspace for TensorRT \
conversion")
parser.add_argument("--precision", choices=['fp16', 'fp32'],
default='fp32', help="convert TensorRT or \
TorchScript model in a given precision")
parser.add_argument('--convert-filename', type=str, default='',
help='Saved model name')
parser.add_argument('--save-dir', type=str, default='',
help='Saved model directory')
parser.add_argument("--max-batch-size", type=int, default=1,
help="Specifies the 'max_batch_size' in the Triton \
model config and in TensorRT builder. See the \
Triton and TensorRT documentation for more info.")
parser.add_argument('--device', type=str, default='cuda',
help='Select device for conversion.')
parser.add_argument('model_arguments', nargs=argparse.REMAINDER,
help='arguments that will be ignored by \
converter lib and will be forwarded to your convert \
script')
return parser
class Converter:
def __init__(self, model, dataloader):
self.model = model
self.dataloader = dataloader
self.convert_props = {
'ts-script': {
'convert_func': self.to_torchscript,
'convert_filename': 'model.pt'
},
'ts-trace': {
'convert_func' : self.to_torchscript,
'convert_filename': 'model.pt'
},
'onnx': {
'convert_func' : self.to_onnx,
'convert_filename': 'model.onnx'
},
'tensorrt': {
'convert_func' : self.to_tensorrt,
'convert_filename': 'model.plan'
}
}
def convert(self, convert_type, save_dir, convert_filename,
device, precision='fp32',
max_batch_size=1,
# args for TensorRT:
max_workspace_size=None):
''' convert the model '''
self.convert_type = convert_type
self.max_workspace_size = max_workspace_size
self.max_batch_size = max_batch_size
self.precision = precision
# override default name if user provided name
if convert_filename != '':
self.convert_props[convert_type]['convert_filename'] = convert_filename
# setup device
torch_device = torch.device(device)
# prepare model
self.model.to(torch_device)
self.model.eval()
assert (not self.model.training), \
"[Converter error]: could not set the model to eval() mode!"
io_props = None
if self.dataloader is not None:
io_props = extract_io_props(self.model, self.dataloader,
torch_device, precision, max_batch_size)
assert self.convert_type == "ts-script" or io_props is not None, \
"Input and output properties are empty. For conversion types \
other than \'ts-script\' input shapes are required to generate dummy input. \
Make sure that dataloader works correctly or that IO props file is provided."
# prepare save path
model_name = self.convert_props[convert_type]['convert_filename']
convert_model_path = os.path.join(save_dir, model_name)
# get convert method depending on the convert type
convert_func = self.convert_props[convert_type]['convert_func']
# convert the model - will be saved to disk
if self.convert_type == "tensorrt":
io_filepath = "triton/tensorrt_io_props_" + str(precision) + ".json"
io_props = load_io_props(io_filepath)
convert_func(self.model, torch_device, io_props, convert_model_path)
assert (os.path.isfile(convert_model_path)), \
f"[Converter error]: saving model to {convert_model_path} failed!"
def generate_dummy_input(self, io_props, device):
from pytorch.utils import triton_type_to_torch_type
dummy_input = []
for s,t in zip(io_props['opt_shapes'], io_props['input_types']):
t = triton_type_to_torch_type[t]
tensor = torch.empty(size=s, dtype=t, device=device).random_()
dummy_input.append(tensor)
dummy_input = tuple(dummy_input)
return dummy_input
def to_onnx(self, model, device, io_props, convert_model_path):
''' convert the model to onnx '''
dummy_input = self.generate_dummy_input(io_props, device)
opset_version = 11
# convert the model to onnx
with torch.no_grad():
torch.onnx.export(model, dummy_input,
convert_model_path,
do_constant_folding=True,
input_names=io_props['input_names'],
output_names=io_props['output_names'],
dynamic_axes=io_props['dynamic_axes'],
opset_version=opset_version,
enable_onnx_checker=True)
def to_tensorrt(self, model, device, io_props, convert_model_path):
''' convert the model to tensorrt '''
assert (self.max_workspace_size), "[Converter error]: for TensorRT conversion you must provide \'max_workspace_size\'."
import tensorrt as trt
from pytorch.utils import build_tensorrt_engine
# convert the model to onnx first
self.to_onnx(model, device, io_props, convert_model_path)
del model
torch.cuda.empty_cache()
zipped = zip(io_props['input_names'], io_props['min_shapes'],
io_props['opt_shapes'], io_props['max_shapes'])
shapes = []
for name,min_shape,opt_shape,max_shape in zipped:
d = {"name":name, "min": min_shape,
"opt": opt_shape, "max": max_shape}
shapes.append(d)
tensorrt_fp16 = True if self.precision == 'fp16' else False
# build tensorrt engine
engine = build_tensorrt_engine(convert_model_path, shapes,
self.max_workspace_size,
self.max_batch_size,
tensorrt_fp16)
assert engine is not None, "[Converter error]: TensorRT build failure"
# write tensorrt engine
with open(convert_model_path, 'wb') as f:
f.write(engine.serialize())
def to_torchscript(self, model, device, io_props, convert_model_path):
''' convert the model to torchscript '''
if self.convert_type == 'ts-script':
model_ts = torch.jit.script(model)
else: # self.convert_type == 'ts-trace'
dummy_input = self.generate_dummy_input(io_props, device)
with torch.no_grad():
model_ts = torch.jit.trace(model, dummy_input)
# save the model
torch.jit.save(model_ts, convert_model_path)
if __name__=='__main__':
parser = get_parser()
args = parser.parse_args()
model_args_list = args.model_arguments[1:]
logging.basicConfig(level=logging.INFO)
mm = importlib.import_module(args.model_module)
model = mm.init_model(model_args_list, args.precision, args.device)
dataloader = mm.get_dataloader(model_args_list)
converter = Converter(model, dataloader)
converter.convert(args.convert, args.save_dir, args.convert_filename,
args.device, args.precision,
args.max_batch_size,
args.max_workspace_size)

View file

@ -18,7 +18,6 @@ import sys
import argparse
import numpy as np
import os
from speech_utils import AudioSegment, SpeechClient
import soundfile
import pyaudio as pa
@ -303,9 +302,9 @@ if __name__ == '__main__':
FLAGS = parser.parse_args()
protocol = FLAGS.protocol.lower()
valid_model_platforms = {"pyt","onnx", "trt"}
valid_model_platforms = {"ts-trace","onnx", "tensorrt"}
if FLAGS.model_platform not in valid_model_platforms:
raise ValueError("Invalid model_platform {}. Valid choices are {"
@ -321,8 +320,6 @@ if __name__ == '__main__':
verbose=FLAGS.verbose, mode="synchronous",
from_features=False
)
filenames = []
transcripts = []
@ -361,8 +358,7 @@ if __name__ == '__main__':
files_and_speeds = data['files']
audio_path = [x['fname'] for x in files_and_speeds if x['speed'] == filter_speed][0]
filenames.append(os.path.join(data_dir, audio_path))
transcript_text = data['transcript']
transcript_text = normalize_string(transcript_text, labels=labels, table=table)
transcripts.append(transcript_text) #parse_transcript(transcript_text, labels_map, blank_index)) # convert to vocab indices

View file

@ -0,0 +1,196 @@
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
import torch
import sys
sys.path.append("./")
class FeatureCollate:
def __init__(self, feature_proc):
self.feature_proc = feature_proc
def __call__(self, batch):
bs = len(batch)
max_len = lambda l,idx: max(el[idx].size(0) for el in l)
audio = torch.zeros(bs, max_len(batch, 0))
audio_lens = torch.zeros(bs, dtype=torch.int32)
for i, sample in enumerate(batch):
audio[i].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
audio_lens[i] = sample[1]
ret = (audio, audio_lens)
if self.feature_proc is not None:
feats, feat_lens = self.feature_proc(audio, audio_lens)
ret = (feats,)
return ret
def get_dataloader(model_args_list):
''' return dataloader for inference '''
from inference import get_parser
from common.helpers import add_ctc_blank
from jasper import config
from common.dataset import (AudioDataset, FilelistDataset, get_data_loader,
SingleAudioDataset)
from common.features import FilterbankFeatures
parser = get_parser()
parser.add_argument('--component', type=str, default="model",
choices=["feature-extractor", "model", "decoder"],
help='Component to convert')
args = parser.parse_args(model_args_list)
if args.component == "decoder":
return None
cfg = config.load(args.model_config)
if args.max_duration is not None:
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if args.pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
symbols = add_ctc_blank(cfg['labels'])
dataset_kw, features_kw = config.input(cfg, 'val')
dataset = AudioDataset(args.dataset_dir, args.val_manifests,
symbols, **dataset_kw)
data_loader = get_data_loader(dataset, args.batch_size, multi_gpu=False,
shuffle=False, num_workers=4, drop_last=False)
feature_proc = None
if args.component == "model":
feature_proc = FilterbankFeatures(**features_kw)
data_loader.collate_fn = FeatureCollate(feature_proc)
return data_loader
def init_feature_extractor(args):
from jasper import config
from common.features import FilterbankFeatures
cfg = config.load(args.model_config)
if args.max_duration is not None:
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if args.pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
_, features_kw = config.input(cfg, 'val')
feature_proc = FilterbankFeatures(**features_kw)
return feature_proc
def init_acoustic_model(args):
from common.helpers import add_ctc_blank
from jasper.model import Jasper
from jasper import config
cfg = config.load(args.model_config)
if args.max_duration is not None:
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if args.pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
if cfg['jasper']['encoder']['use_conv_masks'] == True:
print("[Jasper module]: Warning: setting 'use_conv_masks' \
to False; masked convolutions are not supported.")
cfg['jasper']['encoder']['use_conv_masks'] = False
symbols = add_ctc_blank(cfg['labels'])
model = Jasper(encoder_kw=config.encoder(cfg),
decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
if args.ckpt is not None:
checkpoint = torch.load(args.ckpt, map_location="cpu")
key = 'ema_state_dict' if args.ema else 'state_dict'
state_dict = checkpoint[key]
model.load_state_dict(state_dict, strict=True)
return model
def init_decoder(args):
class GreedyCTCDecoderSimple(torch.nn.Module):
@torch.no_grad()
def forward(self, log_probs):
return log_probs.argmax(dim=-1, keepdim=False).int()
return GreedyCTCDecoderSimple()
def init_model(model_args_list, precision, device):
''' Return either of the components: feature-extractor, model, or decoder.
The returned component is ready to convert '''
from inference import get_parser
parser = get_parser()
parser.add_argument('--component', type=str, default="model",
choices=["feature-extractor", "model", "decoder"],
help='Component to convert')
args = parser.parse_args(model_args_list)
init_comp = {"feature-extractor": init_feature_extractor,
"model": init_acoustic_model,
"decoder": init_decoder}
comp = init_comp[args.component](args)
torch_device = torch.device(device)
print(f"[Jasper module]: using device {torch_device}")
comp.to(torch_device)
comp.eval()
if precision == "fp16":
print("[Jasper module]: using mixed precision")
comp.half()
else:
print("[Jasper module]: using fp32 precision")
return comp

View file

@ -1,32 +0,0 @@
name: "jasper-feature-extractor"
platform: "pytorch_libtorch"
default_model_filename: "jasper-feature-extractor.pt"
max_batch_size: 64
input [ {
name: "AUDIO_SIGNAL__0"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "NUM_SAMPLES__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape { shape: [] }
}
]
output [
{
name: "AUDIO_FEATURES__0"
data_type: TYPE_FP32
dims: [64, -1]
}
,
{
name: "NUM_TIME_STEPS__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [] }
}
]

View file

@ -1,60 +0,0 @@
name: "jasper-onnx-cpu-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-onnx-cpu"
model_version: -1
input_map {
key: "FEATURES"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOGITS"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -1,60 +0,0 @@
name: "jasper-onnx-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-onnx"
model_version: -1
input_map {
key: "FEATURES"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOGITS"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -1,61 +0,0 @@
name: "jasper-pyt-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-pyt"
model_version: -1
input_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOG_PROBS__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -1,60 +0,0 @@
name: "jasper-trt-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-trt"
model_version: -1
input_map {
key: "FEATURES"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOGITS"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -23,23 +23,24 @@
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
default_model_filename: "jasper-decoder.pt"
name: "jasper-decoder"
name: "decoder-ts-script"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 64
input [
{
name: "CLASS_LOGITS__0"
data_type: TYPE_FP32
dims: [ -1, 29 ]
}
]
output [
{
name: "CANDIDATE_TRANSCRIPT__0"
data_type: TYPE_INT32
dims: [ -1]
}
{
name: "input__0"
data_type: TYPE_FP16
dims: [ -1, 29 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_INT32
dims: [-1]
}
]

View file

@ -0,0 +1,32 @@
name: "feature-extractor-ts-trace"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 64
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [ -1 ]
},
{
name: "input__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape { shape: [] }
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [64, -1]
},
{
name: "output__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [] }
}
]

View file

@ -0,0 +1,63 @@
name: "jasper-onnx-ensemble"
platform: "ensemble"
max_batch_size: 8#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP16
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "feature-extractor-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "input__1"
value: "NUM_SAMPLES"
}
output_map {
key: "output__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-onnx"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "output__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "decoder-ts-script"
model_version: -1
input_map {
key: "input__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "output__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -26,23 +26,25 @@
name: "jasper-onnx"
platform: "onnxruntime_onnx"
default_model_filename: "jasper.onnx"
default_model_filename: "model.onnx"
max_batch_size : 64#MAX_BATCH
input [
{
name: "FEATURES"
data_type: TYPE_FP32
dims: [ 64, -1 ]
}
]
output [
{
name: "LOGITS"
data_type: TYPE_FP32
dims: [ -1, 29 ]
}
]
max_batch_size: 8#MAX_BATCH
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [64, -1]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [-1, 29 ]
}
]
instance_group {
count: 1#NUM_ENGINES
@ -51,6 +53,6 @@ instance_group {
}
#db#dynamic_batching {
#db# preferred_batch_size: 64#MAX_BATCH
#db# preferred_batch_size: 8#MAX_BATCH
#db# max_queue_delay_microseconds: #MAX_QUEUE
#db#}

View file

@ -0,0 +1,63 @@
name: "jasper-tensorrt-ensemble"
platform: "ensemble"
max_batch_size: 8#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP16
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "feature-extractor-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "input__1"
value: "NUM_SAMPLES"
}
output_map {
key: "output__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-tensorrt"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "output__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "decoder-ts-script"
model_version: -1
input_map {
key: "input__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "output__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -0,0 +1,58 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "jasper-tensorrt"
platform: "tensorrt_plan"
default_model_filename: "model.plan"
max_batch_size: 8#MAX_BATCH
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [64, -1]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [-1, 29 ]
}
]
instance_group {
count: 1#NUM_ENGINES
gpus: 0
kind: KIND_GPU
}
#db#dynamic_batching {
#db# preferred_batch_size: 8#MAX_BATCH
#db# max_queue_delay_microseconds: #MAX_QUEUE
#db#}

View file

@ -0,0 +1,63 @@
name: "jasper-ts-trace-ensemble"
platform: "ensemble"
max_batch_size: 8#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP16
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "feature-extractor-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "input__1"
value: "NUM_SAMPLES"
}
output_map {
key: "output__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "output__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "decoder-ts-script"
model_version: -1
input_map {
key: "input__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "output__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -0,0 +1,32 @@
name: "jasper-ts-trace"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 8#MAX_BATCH
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [64, -1]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [-1, 29]
}
]
instance_group {
count: 1#NUM_ENGINES
gpus: 0
kind: KIND_GPU
}
#db#dynamic_batching {
#db# preferred_batch_size: 8#MAX_BATCH
#db# max_queue_delay_microseconds: #MAX_QUEUE
#db#}

Some files were not shown because too many files have changed in this diff Show more