Merge pull request #828 from NVIDIA/gh/release

[Jasper/PyT] Update DALI, perf, Triton, container, major refactor
This commit is contained in:
nv-kkudrynski 2021-02-10 10:46:53 +01:00 committed by GitHub
commit b901312732
128 changed files with 7496 additions and 12071 deletions

View file

@ -1,3 +1,4 @@
*.pt
results/
*__pycache__
checkpoints/
@ -5,5 +6,3 @@ checkpoints/
datasets/
external/tensorrt-inference-server/
checkpoints/
triton/model_repo
triton/deploy

View file

@ -1,4 +0,0 @@
[submodule "external/triton-inference-server"]
path = external/triton-inference-server
url = https://github.com/NVIDIA/triton-inference-server
branch = r19.12

View file

@ -12,11 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.06-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.10-py3
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && apt-get install -y libsndfile1 && apt-get install -y sox && rm -rf /var/lib/apt/lists/*
RUN apt update && apt install -y libsndfile1 && apt install -y sox && rm -rf /var/lib/apt/lists/*
WORKDIR /workspace/jasper
@ -24,5 +23,7 @@ WORKDIR /workspace/jasper
COPY requirements.txt .
RUN pip install --disable-pip-version-check -U -r requirements.txt
RUN pip install --force-reinstall --extra-index-url https://developer.download.nvidia.com/compute/redist/nightly nvidia-dali-nightly-cuda110==0.28.0.dev20201026
# Copy rest of files
COPY . .

View file

@ -24,7 +24,6 @@ This repository provides scripts to train the Jasper model to achieve near state
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Evaluation process](#evaluation-process)
* [Deploying Jasper using TensorRT](#deploying-jasper-using-tensorrt)
* [Deploying Jasper using Triton Inference Server](#deploying-jasper-using-triton-inference)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
@ -32,16 +31,16 @@ This repository provides scripts to train the Jasper model to achieve near state
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
* [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 32GB)](#training-accuracy-nvidia-dgx-1-8x-v100-32gb)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
* [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 (8x V100 32GB)](#training-performance-nvidia-dgx-1-8x-v100-32gb)
* [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-40gb)
* [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-80gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
* [Inference performance: NVIDIA DGX-1 (1x V100 32GB)](#inference-performance-nvidia-dgx-1-1x-v100-32gb)
* [Inference performance: NVIDIA DGX-2 (1x V100 32GB)](#inference-performance-nvidia-dgx-2-1x-v100-32gb)
@ -217,10 +216,10 @@ Uses both acoustic model and language model to output the transcript of an input
The following section lists the requirements in order to start training and evaluating the Jasper model.
### Requirements
This repository contains a `Dockerfile` which extends the PyTorch 20.06-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
This repository contains a `Dockerfile` which extends the PyTorch 20.10-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
- Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -260,10 +259,10 @@ bash scripts/docker/build.sh
3. Start an interactive session in the NGC container to run data download/training/inference
```bash
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <OUTPUT_DIR>
```
Within the container, the contents of this repository will be copied to the `/workspace/jasper` directory. The `/datasets`, `/checkpoints`, `/results` directories are mounted as volumes
and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host.
and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<OUTPUT_DIR>` on the host.
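For example, assuming the data, checkpoints, and outputs live under `/raid` on the host (the paths below are purely illustrative):
```bash
bash scripts/docker/launch.sh /raid/datasets /raid/checkpoints /raid/results
```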
4. Download and preprocess the dataset.
@ -282,40 +281,49 @@ Inside the container, download and extract the datasets into the required format
bash scripts/download_librispeech.sh
```
Once the data download is complete, the following folders should exist:
* `/datasets/LibriSpeech/`
* `train-clean-100/`
* `train-clean-360/`
* `train-other-500/`
* `dev-clean/`
* `dev-other/`
* `test-clean/`
* `test-other/`
```bash
datasets/LibriSpeech/
├── dev-clean
├── dev-other
├── test-clean
├── test-other
├── train-clean-100
├── train-clean-360
└── train-other-500
```
Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see Step 3), once the dataset is downloaded it will be accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.
Next, convert the data into WAV files and add speed perturbation with 0.9 and 1.1 to the training files:
Next, convert the data into WAV files:
```bash
bash scripts/preprocess_librispeech.sh
```
Once the data is converted, the following additional files and folders should exist:
* `datasets/LibriSpeech/`
* `librispeech-train-clean-100-wav.json`
* `librispeech-train-clean-360-wav.json`
* `librispeech-train-other-500-wav.json`
* `librispeech-dev-clean-wav.json`
* `librispeech-dev-other-wav.json`
* `librispeech-test-clean-wav.json`
* `librispeech-test-other-wav.json`
* `train-clean-100-wav/` contains WAV files with original speed, 0.9 and 1.1
* `train-clean-360-wav/` contains WAV files with original speed, 0.9 and 1.1
* `train-other-500-wav/` contains WAV files with original speed, 0.9 and 1.1
* `dev-clean-wav/`
* `dev-other-wav/`
* `test-clean-wav/`
* `test-other-wav/`
```bash
datasets/LibriSpeech/
├── dev-clean-wav
├── dev-other-wav
├── librispeech-train-clean-100-wav.json
├── librispeech-train-clean-360-wav.json
├── librispeech-train-other-500-wav.json
├── librispeech-dev-clean-wav.json
├── librispeech-dev-other-wav.json
├── librispeech-test-clean-wav.json
├── librispeech-test-other-wav.json
├── test-clean-wav
├── test-other-wav
├── train-clean-100-wav
├── train-clean-360-wav
└── train-other-500-wav
```
The DALI data pre-processing pipeline, which is enabled by default, performs speed perturbation on-line during training.
Without DALI, on-line speed perturbation might slow down the training.
If you wish to disable DALI, speed perturbation can be computed off-line with:
```bash
SPEEDS="0.9 1.1" bash scripts/preprocess_librispeech.sh
```
5. Start training.
@ -323,22 +331,22 @@ Inside the container, use the following script to start training.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/train.sh [OPTIONS]
[OPTION1=value1 OPTION2=value2 ...] bash scripts/train.sh
```
By default, automatic mixed precision is disabled, the batch size is 64 over two gradient accumulation steps, and the recipe is run on a total of 8 GPUs. The hyperparameters are tuned for a GPU with at least 32GB of memory and will require adjustment for 16GB GPUs (e.g., by lowering the batch size and using more gradient accumulation steps).
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Training process](#training-process).
Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Training process](#training-process).
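For example, a sketch of a mixed precision run on 8 GPUs with the default effective batch size (the values are illustrative; see [Parameters](#parameters) for the full list):
```bash
AMP=true NUM_GPUS=8 BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=2 bash scripts/train.sh
```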
6. Start validation/evaluation.
Inside the container, use the following script to run evaluation.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/evaluation.sh [OPTIONS]
[OPTION1=value1 OPTION2=value2 ...] bash scripts/evaluation.sh
```
By default, this will use full precision, a batch size of 64 and run on a single GPU.
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Evaluation process](#evaluation-process).
Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Evaluation process](#evaluation-process).
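For example, to evaluate a specific checkpoint on `dev-clean` with AMP enabled (the checkpoint path is illustrative):
```bash
CHECKPOINT=/checkpoints/jasper_fp16.pt DATASET=dev-clean AMP=true bash scripts/evaluation.sh
```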
7. Start inference/predictions.
@ -348,11 +356,11 @@ Inside the container, use the following script to run inference.
A pretrained model checkpoint can be downloaded from [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
```bash
bash scripts/inference.sh [OPTIONS]
[OPTION1=value1 OPTION2=value2 ...] bash scripts/inference.sh
```
By default, this will use single precision, a batch size of 64 and run on a single GPU.
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Inference process](#inference-process).
Options are passed as environment variables. More details on available options can be found in [Parameters](#parameters) and [Inference process](#inference-process).
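For example, to transcribe `test-other` with greedy decoding and save the predictions (the paths are illustrative):
```bash
CHECKPOINT=/checkpoints/jasper_fp16.pt DATASET=test-other \
PREDICTION_FILE=/results/test-other.predictions bash scripts/inference.sh
```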
## Advanced
@ -362,31 +370,27 @@ The following sections provide greater details of the dataset, running training
### Scripts and sample code
In the `root` directory, the most important files are:
* `train.py` - Serves as entry point for training
* `inference.py` - Serves as entry point for inference and evaluation
* `model.py` - Contains the model architecture
* `dataset.py` - Contains the data loader and related functionality
* `optimizer.py` - Contains the optimizer
* `inference_benchmark.py` - Serves as inference benchmarking script that measures the latency of pre-processing and the acoustic model
* `requirements.txt` - Contains the required dependencies that are installed when building the Docker container
* `Dockerfile` - Container with the basic set of dependencies to run Jasper
The `scripts/` folder encapsulates all the one-click scripts required for running various supported functionalities, such as:
* `train.sh` - Runs training using the `train.py` script
* `inference.sh` - Runs inference using the `inference.py` script
* `evaluation.sh` - Runs evaluation using the `inference.py` script
* `download_librispeech.sh` - Downloads LibriSpeech dataset
* `preprocess_librispeech.sh` - Preprocess LibriSpeech raw data files to be ready for training and inference
* `inference_benchmark.sh` - Runs the inference benchmark using the `inference_benchmark.py` script
* `train_benchmark.sh` - Runs the training performance benchmark using the `train.py` script
* `docker/` - Contains the scripts for building and launching the container
Other folders included in the `root` directory are:
* `notebooks/` - Jupyter notebooks and example audio files
* `configs/` - model configurations
* `utils/` - data downloading and common routines
* `parts/` - data pre-processing
```
jasper
├── common # data pre-processing, logging, etc.
├── configs # model configurations
├── Dockerfile # container with the basic set of dependencies to run Jasper
├── inference.py # entry point for inference
├── jasper # model-specific code
├── notebooks # jupyter notebooks and example audio files
├── scripts # one-click scripts required for running various supported functionalities
│   ├── docker # contains the scripts for building and launching the container
│   ├── download_librispeech.sh # downloads LibriSpeech dataset
│   ├── evaluation.sh # runs evaluation using the `inference.py` script
│   ├── inference_benchmark.sh # runs the inference benchmark using the `inference_benchmark.py` script
│   ├── inference.sh # runs inference using the `inference.py` script
│   ├── preprocess_librispeech.sh # preprocess LibriSpeech raw data files for training and inference
│   ├── train_benchmark.sh # runs the training performance benchmark using the `train.py` script
│   └── train.sh # runs training using the `train.py` script
├── train.py # entry point for training
├── triton # example of inference using Triton Inference Server
└── utils # data downloading and common routines
```
### Parameters
@ -394,77 +398,94 @@ Parameters could be set as env variables, or passed as positional arguments.
The complete list of available parameters for `scripts/train.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: relative path to model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results, logs, and created checkpoints. (default: '/results')
CHECKPOINT: model checkpoint to continue training from. A model checkpoint is a dictionary object that contains, apart from the model weights, the optimizer state as well as the epoch number. If CHECKPOINT is not set, training starts from scratch. (default: "")
CREATE_LOGFILE: boolean that indicates whether to create a training log that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: true)
NUM_GPUS: number of GPUs to use. (default: 8)
AMP: if set to `true`, enables automatic mixed precision (default: false)
EPOCHS: number of training epochs. (default: 400)
SEED: seed for random number generator and used for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size. (default: 64)
LEARNING_RATE: Initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 2)
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: relative path to model configuration. (default: 'configs/jasper10x5dr_speedp-online_speca.yaml')
OUTPUT_DIR: directory for results, logs, and created checkpoints. (default: '/results')
CHECKPOINT: a specific model checkpoint to continue training from. To resume training from the last checkpoint, see the RESUME option.
RESUME: resume training from the last checkpoint found in OUTPUT_DIR, or from scratch if there are no checkpoints (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: true)
NUM_GPUS: number of GPUs to use. (default: 8)
AMP: if set to `true`, enables automatic mixed precision (default: false)
BATCH_SIZE: effective data batch size. The real batch size per GPU might be lower, if gradient accumulation is enabled (default: 64)
GRAD_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 2)
LEARNING_RATE: initial learning rate. (default: 0.01)
MIN_LEARNING_RATE: minimum learning rate, despite LR scheduling (default: 1e-5)
LR_POLICY: how to decay LR (default: exponential)
LR_EXP_GAMMA: decay factor for the exponential LR schedule (default: 0.981)
EMA: decay factor for exponential averages of checkpoints (default: 0.999)
SEED: seed for random number generator and used for ensuring reproducibility. (default: 0)
EPOCHS: number of training epochs. (default: 440)
WARMUP_EPOCHS: number of initial epochs of linearly increasing LR. (default: 2)
HOLD_EPOCHS: number of epochs to hold maximum LR after warmup. (default: 140)
SAVE_FREQUENCY: number of epochs between saving the model to disk. (default: 10)
EPOCHS_THIS_JOB: run training for this number of epochs. Unlike EPOCHS, does not affect the LR schedule. (default: 0)
DALI_DEVICE: device to run the DALI pipeline on for calculation of filterbanks. Valid choices: cpu, gpu, none. (default: gpu)
PAD_TO_MAX_DURATION: pad all sequences with zeros to maximum length. (default: false)
EVAL_FREQUENCY: number of steps between evaluations on the validation set. (default: 544)
PREDICTION_FREQUENCY: the number of steps between writing a sample prediction to stdout. (default: 544)
TRAIN_MANIFESTS: list of .json training set files
VAL_MANIFESTS: list of .json validation set files
```
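As an illustration, the sketch below resumes training from the latest checkpoint in `OUTPUT_DIR` and overrides a few of the schedule-related defaults listed above (all values are examples only):
```bash
RESUME=true OUTPUT_DIR=/results LEARNING_RATE=0.01 LR_POLICY=exponential EPOCHS=440 bash scripts/train.sh
```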
The complete list of available parameters for `scripts/inference.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: false)
AMP: if set to `true`, enables FP16 inference with AMP (default: false)
NUM_STEPS: number of inference steps. If -1 runs inference on entire dataset. (default: -1)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size.(default: 64)
LOGITS_FILE: destination path for serialized model output with binary protocol. If 'none' does not save model output. (default: 'none')
PREDICTION_FILE: destination path for saving predictions. If 'none' does not save predictions. (default: '${RESULT_DIR}/${DATASET}.predictions')
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_speedp-online_speca.yaml')
OUTPUT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
DATASET: name of the LibriSpeech subset to use. (default: 'dev-clean')
LOG_FILE: path to the DLLogger .json logfile. (default: '')
CUDNN_BENCHMARK: enable cudnn benchmark mode for using more optimized kernels. (default: false)
MAX_DURATION: filter out recordings longer than MAX_DURATION seconds. (default: "")
PAD_TO_MAX_DURATION: pad all sequences with zeros to maximum length. (default: false)
NUM_GPUS: number of GPUs to use. Note that with > 1 GPUs WER results might be inaccurate due to the batching policy. (default: 1)
NUM_STEPS: number of batches to evaluate, looping over the dataset if necessary. (default: 0)
NUM_WARMUP_STEPS: number of initial steps before measuring performance. (default: 0)
AMP: enable FP16 inference with AMP. (default: false)
BATCH_SIZE: data batch size. (default: 64)
EMA: Attempt to load exponentially averaged weights from a checkpoint. (default: true)
SEED: seed for random number generator and used for ensuring reproducibility. (default: 0)
DALI_DEVICE: device to run the DALI pipeline on for calculation of filterbanks. Valid choices: cpu, gpu, none. (default: gpu)
CPU: run inference on CPU. (default: false)
LOGITS_FILE: dump logit matrices to a file. (default: "")
PREDICTION_FILE: save predictions to a file. (default: "${OUTPUT_DIR}/${DATASET}.predictions")
```
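For example, a hypothetical run that additionally dumps logits for later rescoring alongside the transcripts (the output paths are illustrative):
```bash
CHECKPOINT=/checkpoints/jasper_fp16.pt DATASET=dev-other \
LOGITS_FILE=/results/dev-other.logits PREDICTION_FILE=/results/dev-other.predictions \
bash scripts/inference.sh
```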
The complete list of available parameters for `scripts/evaluation.sh` script contains:
The complete list of available parameters for `scripts/evaluation.sh` is the same as for `scripts/inference.sh`, except for a few changed defaults:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: false)
NUM_GPUS: number of GPUs to run evaluation on (default: 1)
AMP: if set to `true`, enables FP16 with AMP (default: false)
NUM_STEPS: number of inference steps per GPU. If -1 runs inference on entire dataset (default: -1)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size.(default: 64)
PREDICTION_FILE: (default: "")
DATASET: (default: "test-other")
```
The `scripts/inference_benchmark.sh` script pads all input to the same length and computes the mean and the 90%, 95%, and 99% percentiles of latency for the specified number of inference steps. Latency is measured in milliseconds per batch. The `scripts/inference_benchmark.sh`
script measures latency for a single GPU and extends `scripts/inference.sh` by:
The `scripts/inference_benchmark.sh` script pads all input to a fixed duration and computes the mean and the 90%, 95%, and 99% percentiles of latency for the specified number of inference steps. Latency is measured in milliseconds per batch. The `scripts/inference_benchmark.sh` script measures latency for a single GPU and loops over a number of batch sizes and durations. It extends `scripts/inference.sh` and changes the following defaults:
```bash
MAX_DURATION: filters out input audio data that exceeds a maximum number of seconds. This ensures that when all filtered audio samples are padded to maximum length, that length will stay under this specified threshold (default: 36)
BATCH_SIZE_SEQ: batch sizes to measure on. (default: "1 2 4 8 16")
MAX_DURATION_SEQ: input durations (in seconds) to measure on (default: "2 7 16.7")
CUDNN_BENCHMARK: (default: true)
PAD_TO_MAX_DURATION: (default: true)
NUM_WARMUP_STEPS: (default: 10)
NUM_STEPS: (default: 500)
DALI_DEVICE: (default: cpu)
```
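For instance, the sweep can be narrowed to a single batch size and duration and shortened to fewer steps (the values below are illustrative):
```bash
BATCH_SIZE_SEQ="8" MAX_DURATION_SEQ="16.7" NUM_STEPS=100 bash scripts/inference_benchmark.sh
```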
The `scripts/train_benchmark.sh` script pads all input to the same length according to the input argument `MAX_DURATION` and measures average training latency and throughput performance. Latency is measured in seconds per batch, throughput in sequences per second.
The complete list of available parameters for `scripts/train_benchmark.sh` script contains:
Training performance is measured with on-line speed perturbation and cuDNN benchmark mode enabled.
The script `scripts/train_benchmark.sh` loops over a number of batch sizes and GPU counts.
It extends `scripts/train.sh`, and the complete list of available parameters for `scripts/train_benchmark.sh` contains:
```bash
DATA_DIR: directory of dataset.(default: '/datasets/LibriSpeech')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: true)
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: true)
NUM_GPUS: number of GPUs to use. (default: 8)
AMP: if set to `true`, enables automatic mixed precision with AMP (default: false)
NUM_STEPS: number of training iterations. If -1 runs full training for 400 epochs. (default: -1)
MAX_DURATION: filters out input audio data that exceeds a maximum number of seconds. This ensures that when all filtered audio samples are padded to maximum length, that length will stay under this specified threshold (default: 16.7)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size.(default: 32)
LEARNING_RATE: Initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 1)
PRINT_FREQUENCY: number of iterations after which training progress is printed. (default: 1)
BATCH_SIZE_SEQ: batch sizes to measure on. (default: "1 2 4 8 16")
NUM_GPUS_SEQ: number of GPUs to run the training on. (default: "1 4 8")
MODEL_CONFIG: (default: "configs/jasper10x5dr_speedp-online_train-benchmark.yaml")
TRAIN_MANIFESTS: (default: "$DATA_DIR/librispeech-train-clean-100-wav.json")
RESUME: (default: false)
EPOCHS_THIS_JOB: (default: 2)
EPOCHS: (default: 100000)
SAVE_FREQUENCY: (default: 100000)
EVAL_FREQUENCY: (default: 100000)
GRAD_ACCUMULATION_STEPS: (default: 1)
PAD_TO_MAX_DURATION: (default: true)
EMA: (default: 0)
```
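For example, a shortened sweep with AMP enabled on a single GPU count could look like this (values are illustrative):
```bash
AMP=true BATCH_SIZE_SEQ="32" NUM_GPUS_SEQ="8" bash scripts/train_benchmark.sh
```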
### Command-line options
@ -478,11 +499,10 @@ python inference.py --help
### Getting the data
The Jasper model was trained on the LibriSpeech dataset. We use the concatenation of `train-clean-100`, `train-clean-360` and `train-other-500` for training and `dev-clean` for validation.
This repository contains the `scripts/download_librispeech.sh` and `scripts/preprocess_librispeech.sh` scripts which will automatically download and preprocess the training, test and development datasets. By default, data will be downloaded to the `/datasets/LibriSpeech` directory; a minimum of 500GB of free space is required for download and preprocessing, and the final preprocessed dataset is 320GB.
This repository contains the `scripts/download_librispeech.sh` and `scripts/preprocess_librispeech.sh` scripts which will automatically download and preprocess the training, test and development datasets. By default, data will be downloaded to the `/datasets/LibriSpeech` directory; a minimum of 250GB of free space is required for download and preprocessing, and the final preprocessed dataset is approximately 100GB. With offline speed perturbation, the dataset will be about 3x larger.
#### Dataset guidelines
The `scripts/preprocess_librispeech.sh` script converts the input audio files to WAV format with a sample rate of 16kHz; target transcripts are stripped of whitespace characters, then lower-cased. For `train-clean-100`, `train-clean-360` and `train-other-500` it also creates speed-perturbed versions with rates of 0.9 and 1.1 for data augmentation.
The `scripts/preprocess_librispeech.sh` script converts the input audio files to WAV format with a sample rate of 16kHz; target transcripts are stripped of whitespace characters, then lower-cased. For `train-clean-100`, `train-clean-360` and `train-other-500`, it can optionally create speed-perturbed versions with rates of 0.9 and 1.1 for data augmentation. In the current version, those augmentations are applied on-line by the DALI pipeline without any impact on training time.
After preprocessing, the script creates JSON files with output file paths, sample rate, target transcript and other metadata. These JSON files are used by the training script to identify training and validation datasets.
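To get a quick look at the manifest structure, the first entry of a generated file can be printed from inside the container (the exact field names come from the preprocessing scripts):
```bash
# Pretty-print the first entry of the dev-clean manifest
python -c "import json; print(json.dumps(json.load(open('/datasets/LibriSpeech/librispeech-dev-clean-wav.json'))[0], indent=2))"
```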
@ -500,21 +520,23 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
* Trains on the concatenation of all 3 LibriSpeech training datasets and evaluates on the LibriSpeech dev-clean dataset
* Maintains an exponential moving average of parameters for evaluation
* Has cudnn benchmark enabled
* Runs for 400 epochs
* Uses an initial learning rate of 0.015 and polynomial (quadratic) learning rate decay
* Runs for 440 epochs
* Uses an initial learning rate of 0.01 and an exponential learning rate decay
* Saves a checkpoint every 10 epochs
* Runs evaluation on the development dataset every 100 iterations and at the end of training
* Prints out training progress every 25 iterations
* Creates a log file with training progress
* Uses offline speed perturbed data
* Automatically removes old checkpoints and preserves milestone checkpoints
* Runs evaluation on the development dataset every 544 iterations and at the end of training
* Maintains a separate checkpoint with the lowest WER on development set
* Prints out training progress every iteration to stdout
* Creates a DLLogger logfile and a Tensorboard log
* Calculates speed perturbation on-line during training
* Uses SpecAugment in data pre-processing
* Filters out audio samples longer than 16.7 seconds
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Pads each batch so its length is divisible by 16
* Uses masked convolutions and dense residuals as described in the paper
* Uses weight decay of 0.001
* Uses [Novograd](https://arxiv.org/pdf/1905.11286.pdf) as optimizer with betas=(0.95, 0)
Enabling AMP permits batch size 64 with one gradient accumulation step. Such a setup will match the greedy WER [Results](#results) of the Jasper paper on a DGX-1 with 32GB V100 GPUs.
Enabling AMP permits batch size 64 with one gradient accumulation step. In the current setup, it will improve upon the greedy WER [Results](#results) of the Jasper paper on a DGX-1 with 32GB V100 GPUs.
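A minimal sketch of that setup, expressed with the environment variables from [Parameters](#parameters):
```bash
AMP=true BATCH_SIZE=64 GRAD_ACCUMULATION_STEPS=1 bash scripts/train.sh
```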
### Inference process
Inference is performed using the `inference.py` script along with parameters defined in `scripts/inference.sh`.
@ -525,7 +547,7 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
* Uses a batch size of 64
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which will be stored in the results folder
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Pads each batch so its length is divisible by 16
* Does not use data augmentation
* Does greedy decoding and saves the transcription in the results folder
* Has the option to save the model output tensors for more complex decoding, for example, beam search
@ -533,24 +555,14 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
### Evaluation process
Evaluation is performed using the `inference.py` script along with parameters defined in `scripts/evaluation.sh`.
The `scripts/evaluation.sh` script runs a job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the evaluation script:
The setup is similar to `scripts/inference.sh`, with two differences:
* Uses a batch size of 64
* Evaluates the LibriSpeech dev-clean dataset
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which is saved in the results folder
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Does not use data augmentation
* Has cudnn benchmark disabled
### Deploying Jasper using TensorRT
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. Jasper's architecture, which is deeply convolutional, is designed to facilitate fast GPU inference. After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch.
More information on how to perform inference using TensorRT and speed up comparison between TensorRT and native PyTorch can be found in the subfolder [./trt/README.md](trt/README.md)
* Evaluates the LibriSpeech test-other dataset
* Model outputs are not saved
### Deploying Jasper using Triton Inference Server
The NVIDIA Triton Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
More information on how to perform inference using TensorRT Inference Server with different model backends can be found in the subfolder [./trtis/README.md](trtis/README.md)
More information on how to perform inference using Triton Inference Server with different model backends can be found in the subfolder [./triton/README.md](triton/README.md)
## Performance
@ -559,162 +571,121 @@ More information on how to perform inference using TensorRT Inference Server wit
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance on a specific batch size and audio length, for `NUM_STEPS` run:
To benchmark the training performance in a specific setting on the `train-clean-100` subset of LibriSpeech, run:
```bash
export NUM_STEPS=<NUM_STEPS>
export MAX_DURATION=<DURATION>
export BATCH_SIZE=<BATCH_SIZE>
bash scripts/train_benchmark.sh
BATCH_SIZE_SEQ=<BATCH_SIZES> NUM_GPUS_SEQ=<NUMS_OF_GPUS> bash scripts/train_benchmark.sh
```
By default, this script runs 400 epochs on the configuration `configs/jasper10x5dr_sp_offline_specaugment.toml`
using batch size 32 on a single node with 8x GPUs with at least 32GB of memory.
By default, `NUM_STEPS=-1` means training runs for 400 epochs. If `$NUM_STEPS > 0` is specified, training runs only for a user-defined number of iterations. Audio samples longer than `MAX_DURATION` are filtered out, and the remaining ones are padded to this duration such that all batches have the same length. At the end of training, the script saves the model checkpoint to the results folder, runs evaluation on the LibriSpeech dev-clean dataset, and prints out information such as average training latency in seconds, average training throughput in sequences per second, final training loss, final training WER, evaluation loss and evaluation WER.
By default, this script runs 2 epochs on the configuration `configs/jasper10x5dr_speedp-online_train-benchmark.yaml`,
which applies gentle speed perturbation that does not change the length of the output, enabling immediate stabilization of training step times in cuDNN benchmark mode. The script runs benchmarks with batch size 32 on 1, 4, and 8 GPUs, and requires an 8x 32GB GPU machine.
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size and audio length, run:
```bash
bash scripts/inference_benchmark.sh
BATCH_SIZE_SEQ=<BATCH_SIZES> MAX_DURATION_SEQ=<DURATIONS> bash scripts/inference_benchmark.sh
```
By default, the script runs on a single GPU and evaluates on the entire dataset using the model configuration `configs/jasper10x5dr_sp_offline_specaugment.toml` and batch size 32.
By default, `MAX_DURATION` is set to 36 seconds, which covers the maximum audio length. All audio samples are padded to this length. The script prints out `MAX_DURATION`, `BATCH_SIZE` and latency performance in milliseconds per batch.
By default, the script runs on a single GPU and evaluates on the dataset limited to utterances shorter than MAX_DURATION. It uses the model configuration `configs/jasper10x5dr_speedp-online_speca.yaml`.
Adjustments can be made with env variables, e.g.,
```bash
export SEED=42
export BATCH_SIZE=1
bash scripts/inference_benchmark.sh
```
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
All models are trained on 960 hours of LibriSpeech with a maximum audio length of 16.7s. The training is evaluated
on LibriSpeech dev-clean, dev-other, test-clean, test-other.
The results for Jasper Large's word error rate from the original paper after greedy decoding are shown below:
| **Number of GPUs** | **dev-clean WER** | **dev-other WER** | **test-clean WER** | **test-other WER** |
|--- |--- |--- |--- |--- |
| 8 | 3.64 | 11.89 | 3.86 | 11.95 |
on LibriSpeech dev-clean, dev-other, test-clean, test-other. Checkpoints for evaluation are chosen based on their
word error rate on dev-clean.
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
Our results were obtained by running the `scripts/train.sh` training script in the 20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
| **Number of GPUs** | **Batch size per GPU** | **Precision** | **dev-clean WER** | **dev-other WER** | **test-clean WER** | **test-other WER** | **Time to train** | **Time to train speedup (TF32 to mixed precision)** |
|-----|-----|-------|-------|-------|------|-------|-------|-----|
| 8 | 64 | mixed | 3.53 | 11.11 | 3.75 | 11.07 | 60 h | 1.9 |
| 8 | 64 | TF32 | 3.55 | 11.30 | 3.81 | 11.17 | 115 h | - |
For each precision, we show the best of 8 runs chosen based on dev-clean WER. For TF32, two gradient accumulation steps have been used.
| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
|-----|-----|-------|-------|-------|------|-------|------|
| 8 | 64 | mixed | 3.20 | 9.78 | 3.41 | 9.71 | 70 h |
##### Training accuracy: NVIDIA DGX-1 (8x V100 32GB)
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX-1 (8x V100 32GB) GPUs.
The following tables report the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX-1 (8x V100 32GB) GPUs.
The following table reports the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
| **Number of GPUs** | **Batch size per GPU** | **Precision** | **dev-clean WER** | **dev-other WER** | **test-clean WER** | **test-other WER** | **Time to train** | **Time to train speedup (FP32 to mixed precision)** |
|-----|-----|-------|-------|-------|------|-------|-------|-----|
| 8 | 64 | mixed | 3.49 | 11.22 | 3.74 | 10.94 | 105 h | 3.1 |
| 8 | 64 | FP32 | 3.65 | 11.47 | 3.86 | 11.30 | 330 h | - |
| Number of GPUs | Batch size per GPU | Precision | dev-clean WER | dev-other WER | test-clean WER | test-other WER | Time to train |
|-----|-----|-------|-------|-------|------|-------|-------|
| 8 | 64 | mixed | 3.26 | 10.00 | 3.54 | 9.80 | 130 h |
We show the best of 5 runs (mixed precision) and 2 runs (FP32) chosen based on dev-clean WER. For FP32, two gradient accumulation steps have been used.
##### Training stability test
The following table compares greedy decoding word error rates across 8 different training runs with different seeds for mixed precision training.
| **DGX A100, FP16, 8x GPU** | **Seed #1** | **Seed #2** | **Seed #3** | **Seed #4** | **Seed #5** | **Seed #6** | **Seed #7** | **Seed #8** | **Mean** | **Std** |
|-----------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|-----:|
| dev-clean | 3.69 | 3.71 | 3.64 | 3.53 | 3.71 | 3.66 | 3.77 | 3.70 | 3.68 | 0.07 |
| dev-other | 11.39 | 11.65 | 11.46 | 11.11 | 11.23 | 11.18 | 11.43 | 11.60 | 11.38 | 0.19 |
| test-clean | 3.97 | 3.96 | 3.81 | 3.75 | 3.90 | 3.82 | 3.93 | 3.82 | 3.87 | 0.08 |
| test-other | 11.27 | 11.34 | 11.40 | 11.07 | 11.24 | 11.29 | 11.58 | 11.58 | 11.35 | 0.17 |
| DGX A100 80GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
|-----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|------:|
| dev-clean | 3.46 | 3.55 | 3.45 | 3.44 | 3.25 | 3.34 | 3.20 | 3.40 | 3.39 | 0.11 |
| dev-other | 10.30 | 10.77 | 10.36 | 10.26 | 9.99 | 10.18 | 9.78 | 10.32 | 10.25 | 0.27 |
| test-clean | 3.84 | 3.81 | 3.66 | 3.64 | 3.58 | 3.55 | 3.41 | 3.73 | 3.65 | 0.13 |
| test-other | 10.61 | 10.52 | 10.49 | 10.47 | 9.89 | 10.09 | 9.71 | 10.26 | 10.26 | 0.31 |
| **DGX A100, TF32, 8x GPU** | **Seed #1** | **Seed #2** | **Seed #3** | **Seed #4** | **Seed #5** | **Seed #6** | **Seed #7** | **Seed #8** | **Mean** | **Std** |
|-----------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|-----:|
| dev-clean | 3.56 | 3.60 | 3.60 | 3.55 | 3.65 | 3.57 | 3.89 | 3.67 | 3.64 | 0.11 |
| dev-other | 11.27 | 11.41 | 11.65 | 11.30 | 11.51 | 11.11 | 12.18 | 11.50 | 11.49 | 0.32 |
| test-clean | 3.80 | 3.79 | 3.88 | 3.81 | 3.94 | 3.82 | 4.13 | 3.85 | 3.88 | 0.11 |
| test-other | 11.40 | 11.26 | 11.47 | 11.17 | 11.36 | 11.16 | 12.15 | 11.46 | 11.43 | 0.32 |
| **DGX-1 32GB, FP16, 8x GPU** | **Seed #1** | **Seed #2** | **Seed #3** | **Seed #4** | **Seed #5** | **Mean** | **Std** |
|-----------:|------:|------:|------:|------:|------:|------:|-----:|
| dev-clean | 3.69 | 3.75 | 3.63 | 3.86 | 3.49 | 3.68 | 0.14 |
| dev-other | 11.35 | 11.63 | 11.60 | 11.68 | 11.22 | 11.50 | 0.20 |
| test-clean | 3.90 | 3.84 | 3.94 | 3.96 | 3.74 | 3.88 | 0.09 |
| test-other | 11.17 | 11.45 | 11.31 | 11.60 | 10.94 | 11.29 | 0.26 |
| DGX-1 32GB, FP16, 8x GPU | Seed #1 | Seed #2 | Seed #3 | Seed #4 | Seed #5 | Seed #6 | Seed #7 | Seed #8 | Mean | Std |
|-----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-------:|------:|
| dev-clean | 3.31 | 3.31 | 3.26 | 3.44 | 3.40 | 3.35 | 3.36 | 3.28 | 3.34 | 0.06 |
| dev-other | 10.02 | 10.01 | 10.00 | 10.06 | 10.05 | 10.03 | 10.10 | 10.04 | 10.04 | 0.03 |
| test-clean | 3.49 | 3.50 | 3.54 | 3.61 | 3.57 | 3.58 | 3.48 | 3.51 | 3.54 | 0.04 |
| test-other | 10.11 | 10.14 | 9.80 | 10.09 | 10.17 | 9.99 | 9.86 | 10.00 | 10.02 | 0.13 |
#### Training performance results
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.06-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 20.10-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
| **GPUs** | **Batch size / GPU** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **Weak scaling - TF32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 32 | 36.09 | 69.33 | 1.92 | 1.00 | 1.00 |
| 4 | 32 | 143.05 | 264.91 | 1.85 | 3.96 | 3.82 |
| 8 | 32 | 285.25 | 524.33 | 1.84 | 7.90 | 7.56 |
| **GPUs** | **Batch size / GPU** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **Weak scaling - TF32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 64 | - | 77.79 | - | - | 1.00 |
| 4 | 64 | - | 304.32 | - | - | 3.91 |
| 8 | 64 | - | 602.88 | - | - | 7.75 |
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| Batch size / GPU | GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|----:|----:|-------:|-------:|-----:|-----:|-----:|
| 32 | 1 | 42.18 | 64.32 | 1.52 | 1.00 | 1.00 |
| 32 | 4 | 157.49 | 239.23 | 1.52 | 3.73 | 3.72 |
| 32 | 8 | 310.10 | 470.09 | 1.52 | 7.35 | 7.31 |
| 64 | 1 | 49.64 | 75.59 | 1.52 | 1.00 | 1.00 |
| 64 | 4 | 192.66 | 289.16 | 1.50 | 3.88 | 3.83 |
| 64 | 8 | 371.41 | 547.91 | 1.48 | 7.48 | 7.25 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 16 | 11.12 | 28.87 | 2.60 | 1.00 | 1.00 |
| 4 | 16 | 42.39 | 109.40 | 2.58 | 3.81 | 3.79 |
| 8 | 16 | 84.45 | 194.30 | 2.30 | 7.59 | 6.73 |
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 32 | - | 37.57 | - | - | 1.00 |
| 4 | 32 | - | 134.80 | - | - | 3.59 |
| 8 | 32 | - | 276.14 | - | - | 7.35 |
| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|----:|----:|------:|-------:|-----:|-----:|-----:|
| 16 | 1 | 10.71 | 27.87 | 2.60 | 1.00 | 1.00 |
| 16 | 4 | 40.28 | 99.80 | 2.48 | 3.76 | 3.58 |
| 16 | 8 | 78.23 | 193.89 | 2.48 | 7.30 | 6.96 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 32GB)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 32 | 13.15 | 35.63 | 2.71 | 1.00 | 1.00 |
| 4 | 32 | 51.21 | 134.01 | 2.62 | 3.90 | 3.76 |
| 8 | 32 | 99.88 | 247.97 | 2.48 | 7.60 | 6.96 |
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|--:|---:|------:|-------:|-----:|-----:|-----:|
| 1 | 64 | - | 41.74 | - | - | 1.00 |
| 4 | 64 | - | 158.44 | - | - | 3.80 |
| 8 | 64 | - | 312.22 | - | - | 7.48 |
| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|----:|----:|------:|-------:|-----:|-----:|-----:|
| 32 | 1 | 12.22 | 34.08 | 2.79 | 1.00 | 1.00 |
| 32 | 4 | 46.97 | 128.39 | 2.73 | 3.84 | 3.77 |
| 32 | 8 | 92.44 | 249.00 | 2.69 | 7.57 | 7.31 |
| 64 | 1 | N/A | 39.30 | N/A | N/A | 1.00 |
| 64 | 4 | N/A | 150.18 | N/A | N/A | 3.82 |
| 64 | 8 | N/A | 282.68 | N/A | N/A | 7.19 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-2 (16x V100 32GB)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---:|---:|-------:|-------:|-----:|------:|------:|
| 1 | 32 | 14.13 | 41.05 | 2.90 | 1.00 | 1.00 |
| 4 | 32 | 54.32 | 156.47 | 2.88 | 3.84 | 3.81 |
| 8 | 32 | 110.26 | 307.13 | 2.79 | 7.80 | 7.48 |
| 16 | 32 | 218.14 | 561.85 | 2.58 | 15.44 | 13.69 |
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---:|---:|-------:|-------:|-----:|------:|------:|
| 1 | 64 | - | 46.41 | - | - | 1.00 |
| 4 | 64 | - | 147.90 | - | - | 3.19 |
| 8 | 64 | - | 359.15 | - | - | 7.74 |
| 16 | 64 | - | 703.13 | - | - | 15.15 |
| Batch size / GPU | GPUs | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|----:|----:|-------:|-------:|-----:|------:|------:|
| 32 | 1 | 13.46 | 38.94 | 2.89 | 1.00 | 1.00 |
| 32 | 4 | 51.38 | 143.44 | 2.79 | 3.82 | 3.68 |
| 32 | 8 | 100.54 | 280.48 | 2.79 | 7.47 | 7.20 |
| 32 | 16 | 188.14 | 515.90 | 2.74 | 13.98 | 13.25 |
| 64 | 1 | N/A | 43.86 | N/A | N/A | 1.00 |
| 64 | 4 | N/A | 165.27 | N/A | N/A | 3.77 |
| 64 | 8 | N/A | 318.10 | N/A | N/A | 7.25 |
| 64 | 16 | N/A | 567.47 | N/A | N/A | 12.94 |
Note: Mixed precision permits higher batch sizes during training. We report the maximum batch sizes (as powers of 2) that are allowed without gradient accumulation.
@ -722,121 +693,130 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
#### Inference performance results
Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX A100, DGX-1, DGX-2 and T4 on a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 1000 iterations.
Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 20.10-py3 NGC container on NVIDIA DGX A100, DGX-1, DGX-2 and T4 on a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 500 iterations.
##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
| | |FP16 Latency (ms) Percentiles | | | | TF32 Latency (ms) Percentiles | | | | FP16/TF32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 36.31 | 36.85 | 43.18 | 35.96 | 41.16 | 41.63 | 47.90 | 40.89 | 1.14 |
| 2 | 2 | 37.56 | 43.32 | 45.23 | 37.11 | 42.53 | 47.79 | 49.62 | 42.07 | 1.13 |
| 4 | 2 | 43.10 | 44.85 | 47.22 | 41.43 | 47.88 | 49.75 | 51.55 | 43.25 | 1.04 |
| 8 | 2 | 44.02 | 44.30 | 45.21 | 39.51 | 50.14 | 50.47 | 51.50 | 45.63 | 1.16 |
| 16 | 2 | 48.04 | 48.38 | 49.12 | 42.76 | 70.90 | 71.22 | 72.50 | 60.78 | 1.42 |
| 1 | 7 | 37.74 | 37.88 | 38.92 | 37.02 | 41.53 | 42.17 | 44.75 | 40.79 | 1.10 |
| 2 | 7 | 40.91 | 41.11 | 42.35 | 40.02 | 46.44 | 46.80 | 49.67 | 45.67 | 1.14 |
| 4 | 7 | 43.94 | 44.32 | 46.71 | 43.00 | 54.39 | 54.80 | 56.63 | 53.53 | 1.24 |
| 8 | 7 | 50.01 | 50.19 | 52.92 | 48.62 | 68.55 | 69.25 | 72.28 | 67.61 | 1.39 |
| 16 | 7 | 60.38 | 60.76 | 62.44 | 57.92 | 93.17 | 94.15 | 98.84 | 92.21 | 1.59 |
| 1 | 16.7 | 41.39 | 41.75 | 43.62 | 40.73 | 45.79 | 46.10 | 47.76 | 45.21 | 1.11 |
| 2 | 16.7 | 46.43 | 46.76 | 47.72 | 45.81 | 52.53 | 53.13 | 55.60 | 51.71 | 1.13 |
| 4 | 16.7 | 50.88 | 51.68 | 54.74 | 50.11 | 66.29 | 66.96 | 70.45 | 65.00 | 1.30 |
| 8 | 16.7 | 62.09 | 62.76 | 65.08 | 61.40 | 94.16 | 94.67 | 97.46 | 93.00 | 1.51 |
| 16 | 16.7 | 75.22 | 76.86 | 80.76 | 73.99 | 139.51 | 140.88 | 144.10 | 137.94 | 1.86 |
##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
| | | FP16 Latency (ms) Percentiles | | | | TF32 Latency (ms) Percentiles | | | | FP16/TF32 speed up |
|-----:|---------------:|------:|------:|------:|------:|------:|------:|-------:|------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 32.40 | 32.50 | 32.82 | 32.30 | 33.30 | 33.64 | 34.65 | 33.25 | 1.03 |
| 2 | 2.0 | 32.90 | 33.51 | 34.35 | 32.69 | 34.48 | 34.65 | 35.66 | 34.27 | 1.05 |
| 4 | 2.0 | 32.85 | 33.01 | 33.89 | 32.60 | 34.09 | 34.46 | 35.22 | 34.00 | 1.04 |
| 8 | 2.0 | 35.51 | 35.89 | 37.10 | 35.33 | 34.86 | 35.36 | 36.08 | 34.45 | 0.98 |
| 16 | 2.0 | 36.00 | 36.57 | 37.40 | 35.77 | 43.83 | 44.12 | 44.77 | 43.39 | 1.21 |
| 1 | 7.0 | 33.50 | 33.99 | 34.91 | 33.03 | 33.83 | 34.25 | 34.95 | 33.70 | 1.02 |
| 2 | 7.0 | 34.43 | 34.89 | 35.72 | 34.22 | 34.41 | 34.73 | 35.69 | 34.28 | 1.00 |
| 4 | 7.0 | 34.30 | 34.59 | 35.43 | 34.07 | 37.95 | 38.18 | 38.87 | 37.55 | 1.10 |
| 8 | 7.0 | 35.98 | 36.28 | 37.11 | 35.28 | 44.64 | 44.79 | 45.37 | 44.29 | 1.26 |
| 16 | 7.0 | 39.86 | 40.08 | 41.16 | 39.33 | 55.17 | 55.46 | 57.24 | 54.56 | 1.39 |
| 1 | 16.7 | 35.20 | 35.80 | 38.71 | 34.36 | 35.36 | 35.76 | 36.55 | 34.64 | 1.01 |
| 2 | 16.7 | 35.40 | 35.81 | 36.50 | 34.76 | 36.34 | 36.53 | 37.40 | 35.87 | 1.03 |
| 4 | 16.7 | 36.01 | 36.38 | 37.37 | 35.57 | 44.69 | 45.09 | 45.88 | 43.92 | 1.23 |
| 8 | 16.7 | 41.48 | 41.78 | 44.22 | 40.69 | 58.57 | 58.74 | 59.62 | 58.11 | 1.43 |
| 16 | 16.7 | 61.37 | 61.93 | 66.32 | 60.92 | 97.33 | 97.71 | 100.04 | 96.56 | 1.59 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 52.26 | 59.93 | 66.62 | 50.34 | 70.90 | 76.47 | 79.84 | 68.61 | 1.36 |
| 2 | 2 | 62.04 | 67.68 | 70.91 | 58.65 | 75.72 | 80.15 | 83.50 | 71.33 | 1.22 |
| 4 | 2 | 75.12 | 77.12 | 82.80 | 66.55 | 80.88 | 82.60 | 86.63 | 73.65 | 1.11 |
| 8 | 2 | 71.62 | 72.99 | 81.10 | 66.39 | 99.57 | 101.43 | 107.16 | 92.34 | 1.39 |
| 16 | 2 | 78.51 | 80.33 | 87.31 | 72.91 | 104.79 | 107.22 | 114.21 | 96.18 | 1.32 |
| 1 | 7 | 52.67 | 54.40 | 64.27 | 50.47 | 73.86 | 75.61 | 84.93 | 72.08 | 1.43 |
| 2 | 7 | 60.49 | 62.41 | 72.87 | 58.45 | 93.07 | 94.51 | 102.40 | 91.55 | 1.57 |
| 4 | 7 | 70.55 | 72.95 | 82.59 | 68.43 | 131.48 | 137.60 | 149.06 | 129.23 | 1.89 |
| 8 | 7 | 83.91 | 85.28 | 93.08 | 76.40 | 152.49 | 157.92 | 166.80 | 150.49 | 1.97 |
| 16 | 7 | 100.21 | 103.12 | 109.00 | 96.31 | 178.45 | 181.46 | 187.20 | 174.33 | 1.81 |
| 1 | 16.7 | 56.84 | 60.05 | 66.54 | 54.69 | 109.55 | 111.19 | 120.40 | 102.25 | 1.87 |
| 2 | 16.7 | 69.39 | 70.97 | 75.34 | 67.39 | 149.93 | 150.79 | 154.06 | 147.45 | 2.19 |
| 4 | 16.7 | 87.48 | 93.96 | 102.73 | 85.09 | 211.78 | 219.66 | 232.99 | 208.38 | 2.45 |
| 8 | 16.7 | 106.91 | 111.92 | 116.55 | 104.13 | 246.92 | 250.94 | 268.44 | 243.34 | 2.34 |
| 16 | 16.7 | 149.08 | 153.86 | 166.17 | 146.28 | 292.84 | 298.02 | 313.04 | 288.54 | 1.97 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 45.42 | 45.62 | 49.54 | 45.02 | 48.83 | 48.99 | 51.66 | 48.44 | 1.08 |
| 2 | 2.0 | 50.31 | 50.53 | 53.66 | 49.10 | 49.87 | 50.04 | 52.99 | 49.41 | 1.01 |
| 4 | 2.0 | 49.17 | 49.48 | 52.13 | 48.73 | 52.92 | 53.21 | 55.28 | 52.31 | 1.07 |
| 8 | 2.0 | 51.20 | 51.40 | 52.32 | 49.01 | 73.02 | 73.30 | 75.00 | 71.99 | 1.47 |
| 16 | 2.0 | 51.75 | 52.24 | 56.36 | 51.27 | 83.99 | 84.57 | 86.69 | 83.24 | 1.62 |
| 1 | 7.0 | 48.13 | 48.53 | 50.95 | 46.78 | 48.52 | 48.75 | 50.89 | 48.01 | 1.03 |
| 2 | 7.0 | 49.52 | 50.10 | 52.35 | 48.00 | 65.27 | 65.41 | 66.59 | 64.79 | 1.35 |
| 4 | 7.0 | 51.75 | 52.01 | 54.39 | 50.38 | 93.75 | 94.77 | 97.04 | 92.27 | 1.83 |
| 8 | 7.0 | 54.80 | 56.27 | 66.23 | 52.95 | 130.65 | 131.09 | 132.91 | 129.82 | 2.45 |
| 16 | 7.0 | 73.02 | 73.42 | 75.83 | 71.96 | 157.53 | 158.20 | 160.73 | 155.51 | 2.16 |
| 1 | 16.7 | 48.10 | 48.52 | 52.71 | 47.20 | 73.34 | 73.56 | 74.19 | 72.69 | 1.54 |
| 2 | 16.7 | 64.21 | 64.52 | 65.56 | 56.06 | 129.48 | 129.97 | 131.78 | 126.36 | 2.25 |
| 4 | 16.7 | 60.38 | 61.03 | 63.18 | 58.87 | 183.33 | 183.85 | 185.53 | 181.90 | 3.09 |
| 8 | 16.7 | 85.88 | 86.34 | 87.70 | 84.46 | 227.42 | 228.21 | 229.63 | 225.71 | 2.67 |
| 16 | 16.7 | 135.62 | 136.40 | 137.69 | 131.58 | 276.90 | 277.59 | 281.16 | 275.08 | 2.09 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-1 (1x V100 32GB)
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 64.60 | 67.34 | 79.87 | 60.73 | 84.69 | 86.78 | 96.02 | 79.32 | 1.31 |
| 2 | 2 | 71.52 | 73.32 | 82.00 | 63.93 | 85.33 | 87.65 | 96.34 | 78.09 | 1.22 |
| 4 | 2 | 80.38 | 84.62 | 93.09 | 74.95 | 90.29 | 97.59 | 100.61 | 84.44 | 1.13 |
| 8 | 2 | 83.43 | 85.51 | 91.17 | 74.09 | 107.28 | 111.89 | 115.19 | 98.76 | 1.33 |
| 16 | 2 | 90.01 | 90.81 | 96.48 | 79.85 | 115.39 | 116.95 | 123.71 | 103.26 | 1.29 |
| 1 | 7 | 53.74 | 54.09 | 56.67 | 53.07 | 86.07 | 86.55 | 91.59 | 78.79 | 1.48 |
| 2 | 7 | 63.34 | 63.67 | 66.08 | 62.62 | 96.25 | 96.82 | 99.72 | 95.44 | 1.52 |
| 4 | 7 | 80.35 | 80.86 | 83.80 | 73.41 | 132.19 | 132.94 | 135.59 | 131.46 | 1.79 |
| 8 | 7 | 77.68 | 78.11 | 86.71 | 75.72 | 156.30 | 157.72 | 165.55 | 154.87 | 2.05 |
| 16 | 7 | 103.52 | 106.66 | 111.93 | 98.15 | 180.71 | 182.82 | 191.12 | 178.61 | 1.82 |
| 1 | 16.7 | 57.58 | 57.79 | 59.75 | 56.58 | 104.51 | 104.87 | 108.01 | 104.04 | 1.84 |
| 2 | 16.7 | 69.19 | 69.58 | 71.49 | 68.58 | 151.25 | 152.07 | 155.21 | 149.30 | 2.18 |
| 4 | 16.7 | 87.17 | 88.53 | 97.41 | 86.56 | 211.28 | 212.41 | 214.97 | 208.54 | 2.41 |
| 8 | 16.7 | 116.25 | 116.90 | 120.14 | 109.21 | 247.63 | 248.93 | 254.77 | 245.19 | 2.25 |
| 16 | 16.7 | 151.99 | 154.79 | 163.36 | 149.80 | 293.99 | 296.05 | 303.04 | 291.00 | 1.94 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 52.74 | 53.01 | 54.40 | 51.47 | 55.97 | 56.22 | 57.93 | 54.93 | 1.07 |
| 2 | 2.0 | 51.77 | 52.15 | 54.69 | 50.98 | 56.58 | 56.87 | 58.88 | 55.35 | 1.09 |
| 4 | 2.0 | 51.41 | 51.76 | 53.47 | 50.55 | 61.56 | 61.87 | 63.81 | 60.74 | 1.20 |
| 8 | 2.0 | 51.83 | 52.15 | 54.08 | 50.85 | 80.20 | 80.69 | 81.67 | 77.69 | 1.53 |
| 16 | 2.0 | 70.48 | 70.96 | 72.11 | 62.98 | 93.00 | 93.44 | 94.17 | 89.05 | 1.41 |
| 1 | 7.0 | 49.77 | 50.21 | 51.88 | 48.73 | 52.74 | 52.99 | 54.54 | 51.67 | 1.06 |
| 2 | 7.0 | 51.12 | 51.47 | 52.84 | 49.98 | 65.33 | 65.63 | 67.07 | 64.64 | 1.29 |
| 4 | 7.0 | 53.13 | 53.56 | 55.68 | 52.15 | 93.54 | 93.85 | 94.72 | 92.76 | 1.78 |
| 8 | 7.0 | 57.67 | 58.07 | 59.89 | 56.41 | 133.93 | 134.18 | 134.88 | 133.15 | 2.36 |
| 16 | 7.0 | 76.09 | 76.48 | 79.13 | 75.27 | 162.35 | 162.77 | 164.63 | 161.30 | 2.14 |
| 1 | 16.7 | 54.78 | 55.29 | 56.83 | 52.51 | 75.37 | 76.27 | 78.05 | 74.32 | 1.42 |
| 2 | 16.7 | 56.80 | 57.20 | 59.01 | 55.49 | 130.60 | 131.36 | 132.93 | 128.55 | 2.32 |
| 4 | 16.7 | 64.19 | 64.84 | 66.47 | 62.87 | 188.09 | 188.76 | 190.07 | 185.76 | 2.95 |
| 8 | 16.7 | 87.46 | 87.86 | 89.99 | 86.47 | 232.33 | 232.89 | 234.43 | 230.44 | 2.67 |
| 16 | 16.7 | 136.02 | 136.52 | 139.44 | 134.78 | 283.87 | 284.59 | 286.70 | 282.01 | 2.09 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-2 (1x V100 32GB)
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 47.25 | 48.24 | 50.28 | 41.53 | 67.03 | 68.15 | 70.17 | 61.82 | 1.49 |
| 2 | 2 | 54.11 | 55.20 | 60.44 | 48.82 | 69.11 | 70.38 | 75.93 | 64.45 | 1.32 |
| 4 | 2 | 63.82 | 67.64 | 71.58 | 61.47 | 71.51 | 74.55 | 79.31 | 67.85 | 1.10 |
| 8 | 2 | 64.78 | 65.86 | 67.68 | 59.07 | 90.84 | 91.99 | 94.10 | 84.28 | 1.43 |
| 16 | 2 | 70.59 | 71.49 | 73.58 | 63.85 | 96.92 | 97.58 | 99.98 | 87.73 | 1.37 |
| 1 | 7 | 42.35 | 42.55 | 43.50 | 41.08 | 63.87 | 64.02 | 64.73 | 62.54 | 1.52 |
| 2 | 7 | 47.82 | 48.04 | 49.43 | 46.79 | 81.17 | 81.43 | 82.28 | 80.02 | 1.71 |
| 4 | 7 | 58.27 | 58.54 | 59.69 | 56.96 | 116.00 | 116.46 | 118.79 | 114.82 | 2.02 |
| 8 | 7 | 62.88 | 63.62 | 67.16 | 61.47 | 143.90 | 144.34 | 147.36 | 139.54 | 2.27 |
| 16 | 7 | 88.04 | 88.57 | 90.96 | 82.84 | 163.04 | 164.04 | 167.30 | 161.36 | 1.95 |
| 1 | 16.7 | 44.54 | 44.86 | 45.86 | 43.53 | 88.10 | 88.41 | 89.37 | 87.21 | 2.00 |
| 2 | 16.7 | 55.21 | 55.55 | 56.92 | 54.33 | 134.99 | 135.69 | 137.87 | 132.97 | 2.45 |
| 4 | 16.7 | 72.93 | 73.58 | 74.95 | 72.02 | 193.50 | 194.21 | 196.04 | 191.24 | 2.66 |
| 8 | 16.7 | 96.94 | 97.66 | 99.58 | 92.73 | 227.70 | 228.74 | 231.59 | 225.35 | 2.43 |
| 16 | 16.7 | 138.25 | 139.75 | 143.71 | 133.82 | 273.69 | 274.53 | 279.50 | 269.13 | 2.01 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 35.88 | 36.12 | 39.80 | 35.20 | 42.95 | 43.67 | 46.65 | 42.23 | 1.20 |
| 2 | 2.0 | 36.36 | 36.57 | 40.97 | 35.60 | 41.83 | 42.21 | 45.60 | 40.97 | 1.15 |
| 4 | 2.0 | 36.69 | 36.89 | 41.25 | 36.05 | 48.35 | 48.52 | 52.35 | 47.80 | 1.33 |
| 8 | 2.0 | 37.49 | 37.70 | 41.37 | 36.88 | 65.41 | 65.64 | 66.50 | 64.96 | 1.76 |
| 16 | 2.0 | 41.35 | 41.79 | 45.58 | 40.91 | 77.22 | 77.51 | 79.48 | 76.54 | 1.87 |
| 1 | 7.0 | 36.07 | 36.55 | 40.31 | 35.62 | 39.52 | 39.84 | 43.07 | 38.93 | 1.09 |
| 2 | 7.0 | 37.42 | 37.66 | 41.36 | 36.79 | 55.94 | 56.19 | 58.33 | 55.60 | 1.51 |
| 4 | 7.0 | 38.51 | 38.95 | 42.55 | 37.98 | 86.62 | 87.08 | 87.50 | 86.20 | 2.27 |
| 8 | 7.0 | 42.82 | 43.00 | 47.11 | 42.55 | 122.05 | 122.29 | 122.70 | 121.59 | 2.86 |
| 16 | 7.0 | 67.74 | 67.92 | 69.05 | 65.69 | 149.92 | 150.16 | 151.03 | 149.49 | 2.28 |
| 1 | 16.7 | 39.28 | 39.78 | 43.34 | 38.35 | 66.73 | 67.16 | 69.80 | 66.01 | 1.72 |
| 2 | 16.7 | 43.05 | 43.42 | 47.18 | 42.43 | 120.04 | 121.12 | 123.32 | 118.14 | 2.78 |
| 4 | 16.7 | 52.18 | 52.49 | 56.11 | 51.63 | 176.09 | 176.51 | 178.70 | 174.60 | 3.38 |
| 8 | 16.7 | 78.55 | 78.79 | 81.66 | 78.04 | 216.19 | 216.68 | 217.63 | 214.48 | 2.75 |
| 16 | 16.7 | 125.57 | 125.92 | 128.78 | 124.33 | 264.11 | 264.49 | 266.14 | 262.80 | 2.11 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA T4
| | |FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|---:|-------------:|------:|------:|------:|------:|-------:|-------:|-------:|-------:|-----:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 64.13 | 65.25 | 76.11 | 59.08 | 94.69 | 98.23 | 109.86 | 89.00 | 1.51 |
| 2 | 2 | 67.59 | 70.77 | 84.06 | 57.47 | 103.88 | 105.37 | 114.59 | 93.30 | 1.62 |
| 4 | 2 | 75.19 | 81.05 | 87.01 | 65.79 | 120.73 | 128.29 | 146.83 | 112.96 | 1.72 |
| 8 | 2 | 74.15 | 77.69 | 84.96 | 62.77 | 161.97 | 163.46 | 170.25 | 153.07 | 2.44 |
| 16 | 2 | 100.62 | 105.08 | 113.00 | 82.06 | 216.18 | 217.92 | 222.46 | 188.57 | 2.30 |
| 1 | 7 | 77.88 | 79.61 | 81.90 | 70.22 | 110.37 | 113.93 | 121.39 | 107.17 | 1.53 |
| 2 | 7 | 81.09 | 83.94 | 87.28 | 78.06 | 148.30 | 151.21 | 158.55 | 141.26 | 1.81 |
| 4 | 7 | 99.85 | 100.83 | 104.24 | 96.81 | 229.94 | 232.34 | 238.11 | 225.43 | 2.33 |
| 8 | 7 | 147.38 | 150.37 | 153.66 | 142.64 | 394.26 | 396.35 | 398.89 | 390.77 | 2.74 |
| 16 | 7 | 280.32 | 281.37 | 282.74 | 278.01 | 484.20 | 485.74 | 499.89 | 482.67 | 1.74 |
| 1 | 16.7 | 76.97 | 79.78 | 81.61 | 75.55 | 171.45 | 176.90 | 179.18 | 167.95 | 2.22 |
| 2 | 16.7 | 96.48 | 99.42 | 101.21 | 92.74 | 276.12 | 278.67 | 282.06 | 270.05 | 2.91 |
| 4 | 16.7 | 129.63 | 131.67 | 134.42 | 124.55 | 522.23 | 524.79 | 527.32 | 509.75 | 4.09 |
| 8 | 16.7 | 209.64 | 211.36 | 214.66 | 204.83 | 706.84 | 709.21 | 715.57 | 697.97 | 3.41 |
| 16 | 16.7 | 342.23 | 344.62 | 350.84 | 337.42 | 848.02 | 849.83 | 858.22 | 834.38 | 2.47 |
| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up |
|-----:|---------------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|-------:|------:|
| BS | Duration (s) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2.0 | 43.62 | 46.95 | 50.46 | 37.23 | 51.31 | 52.37 | 56.21 | 49.77 | 1.34 |
| 2 | 2.0 | 49.09 | 50.46 | 53.11 | 40.61 | 81.85 | 82.22 | 83.94 | 80.81 | 1.99 |
| 4 | 2.0 | 47.71 | 51.14 | 55.09 | 41.29 | 112.56 | 115.13 | 118.56 | 111.60 | 2.70 |
| 8 | 2.0 | 51.37 | 53.11 | 55.48 | 45.94 | 198.95 | 199.48 | 200.28 | 197.22 | 4.29 |
| 16 | 2.0 | 63.59 | 64.30 | 66.90 | 61.77 | 221.75 | 222.07 | 223.22 | 220.09 | 3.56 |
| 1 | 7.0 | 47.49 | 48.66 | 53.36 | 40.76 | 73.63 | 74.41 | 77.65 | 72.41 | 1.78 |
| 2 | 7.0 | 48.63 | 50.01 | 58.35 | 43.44 | 114.66 | 115.28 | 117.63 | 112.41 | 2.59 |
| 4 | 7.0 | 52.19 | 52.85 | 54.22 | 49.94 | 200.38 | 201.29 | 202.97 | 197.21 | 3.95 |
| 8 | 7.0 | 84.90 | 85.56 | 87.52 | 83.41 | 404.00 | 404.72 | 405.70 | 400.25 | 4.80 |
| 16 | 7.0 | 157.12 | 157.58 | 159.19 | 155.01 | 490.93 | 492.09 | 493.44 | 486.45 | 3.14 |
| 1 | 16.7 | 50.57 | 51.57 | 57.58 | 46.27 | 150.39 | 151.84 | 153.54 | 147.31 | 3.18 |
| 2 | 16.7 | 63.64 | 64.55 | 66.31 | 61.98 | 256.54 | 258.16 | 262.71 | 250.34 | 4.04 |
| 4 | 16.7 | 140.44 | 141.06 | 142.00 | 138.14 | 519.59 | 521.41 | 523.86 | 512.74 | 3.71 |
| 8 | 16.7 | 267.03 | 268.06 | 270.01 | 263.15 | 727.33 | 728.61 | 731.36 | 722.62 | 2.75 |
| 16 | 16.7 | 362.40 | 364.02 | 367.80 | 358.75 | 867.92 | 869.19 | 871.46 | 860.37 | 2.40 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
February 2021
* Added DALI data-processing pipeline for on-the-fly data processing and augmentation on CPU or GPU
* Revised training recipe: ~10% relative improvement in Word Error Rate (WER)
* Updated Triton scripts for compatibility with Triton V2 API, updated Triton inference results
* Refactored codebase
* Updated performance results for the PyTorch 20.10-py3 NGC container
June 2020
* Updated performance tables to include A100 results
December 2019
* Inference support for TRT 6 with dynamic shapes
View file
@ -12,13 +12,28 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import soundfile as sf
import librosa
import torch
import numpy as np
import sox
def audio_from_file(file_path, offset=0, duration=0, trim=False, target_sr=16000):
audio = AudioSegment(file_path, target_sr=target_sr, int_values=False,
offset=offset, duration=duration, trim=trim)
samples = torch.tensor(audio.samples, dtype=torch.float).cuda()
num_samples = torch.tensor(samples.shape[0]).int().cuda()
return (samples.unsqueeze(0), num_samples.unsqueeze(0))
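# A minimal usage sketch (hypothetical path; assumes a CUDA device is available,
# since the samples are moved to the GPU):
#   samples, num_samples = audio_from_file('/datasets/sample.wav', target_sr=16000)
#   # samples: torch.FloatTensor of shape (1, num_audio_samples) on the GPU
#   # num_samples: torch.IntTensor of shape (1,) holding the sample count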
class AudioSegment(object):
"""Monaural audio segment abstraction.
:param samples: Audio samples [num_samples x num_channels].
:type samples: ndarray.float32
:param sample_rate: Audio sample rate.
@ -26,11 +41,30 @@ class AudioSegment(object):
:raises TypeError: If the sample data type is not float or int.
"""
def __init__(self, samples, sample_rate, target_sr=None, trim=False,
trim_db=60):
def __init__(self, filename, target_sr=None, int_values=False, offset=0,
duration=0, trim=False, trim_db=60):
"""Create audio segment from samples.
Samples are converted to float32 internally, with int scaled to [-1, 1].
Load a file supported by librosa and return as an AudioSegment.
:param filename: path of file to load
:param target_sr: the desired sample rate
:param int_values: if true, load samples as 32-bit integers
:param offset: offset in seconds when loading audio
:param duration: duration in seconds when loading audio
:return: numpy array of samples
"""
with sf.SoundFile(filename, 'r') as f:
dtype = 'int32' if int_values else 'float32'
sample_rate = f.samplerate
if offset > 0:
f.seek(int(offset * sample_rate))
if duration > 0:
samples = f.read(int(duration * sample_rate), dtype=dtype)
else:
samples = f.read(dtype=dtype)
samples = samples.transpose()
samples = self._convert_samples_to_float32(samples)
if target_sr is not None and target_sr != sample_rate:
samples = librosa.core.resample(samples, sample_rate, target_sr)
@ -67,6 +101,7 @@ class AudioSegment(object):
@staticmethod
def _convert_samples_to_float32(samples):
"""Convert sample type to float32.
Audio sample type is usually integer or float-point.
Integers will be scaled to [-1, 1] in float32.
"""
@ -80,30 +115,6 @@ class AudioSegment(object):
raise TypeError("Unsupported sample type: %s." % samples.dtype)
return float32_samples
@classmethod
def from_file(cls, filename, target_sr=None, int_values=False, offset=0,
duration=0, trim=False):
"""
Load a file supported by librosa and return as an AudioSegment.
:param filename: path of file to load
:param target_sr: the desired sample rate
:param int_values: if true, load samples as 32-bit integers
:param offset: offset in seconds when loading audio
:param duration: duration in seconds when loading audio
:return: numpy array of samples
"""
with sf.SoundFile(filename, 'r') as f:
dtype = 'int32' if int_values else 'float32'
sample_rate = f.samplerate
if offset > 0:
f.seek(int(offset * sample_rate))
if duration > 0:
samples = f.read(int(duration * sample_rate), dtype=dtype)
else:
samples = f.read(dtype=dtype)
samples = samples.transpose()
return cls(samples, sample_rate, target_sr=target_sr, trim=trim)
@property
def samples(self):
return self._samples.copy()
@ -129,9 +140,11 @@ class AudioSegment(object):
self._samples *= 10. ** (gain / 20.)
def pad(self, pad_size, symmetric=False):
"""Add zero padding to the sample. The pad size is given in number of samples.
If symmetric=True, `pad_size` will be added to both sides. If false, `pad_size`
zeros will be added only to the end.
"""Add zero padding to the sample.
The pad size is given in number of samples. If symmetric=True,
`pad_size` will be added to both sides. If false, `pad_size` zeros
will be added only to the end.
"""
self._samples = np.pad(self._samples,
(pad_size if symmetric else 0, pad_size),
@ -139,6 +152,7 @@ class AudioSegment(object):
def subsegment(self, start_time=None, end_time=None):
"""Cut the AudioSegment between given boundaries.
Note that this is an in-place transformation.
:param start_time: Beginning of subsegment in seconds.
:type start_time: float
@ -168,3 +182,66 @@ class AudioSegment(object):
start_sample = int(round(start_time * self._sample_rate))
end_sample = int(round(end_time * self._sample_rate))
self._samples = self._samples[start_sample:end_sample]
class Perturbation:
def __init__(self, p=0.1, rng=None):
self.p = p
self._rng = random.Random() if rng is None else rng
def maybe_apply(self, segment, sample_rate=None):
if self._rng.random() < self.p:
self(segment, sample_rate)
class SpeedPerturbation(Perturbation):
def __init__(self, min_rate=0.85, max_rate=1.15, discrete=False, p=0.1, rng=None):
super(SpeedPerturbation, self).__init__(p, rng)
assert 0 < min_rate < max_rate
self.min_rate = min_rate
self.max_rate = max_rate
self.discrete = discrete
def __call__(self, data, sample_rate):
if self.discrete:
rate = np.random.choice([self.min_rate, None, self.max_rate])
else:
rate = self._rng.uniform(self.min_rate, self.max_rate)
if rate is not None:
data._samples = sox.Transformer().speed(factor=rate).build_array(
input_array=data._samples, sample_rate_in=sample_rate)
class GainPerturbation(Perturbation):
def __init__(self, min_gain_dbfs=-10, max_gain_dbfs=10, p=0.1, rng=None):
super(GainPerturbation, self).__init__(p, rng)
self._rng = random.Random() if rng is None else rng
self._min_gain_dbfs = min_gain_dbfs
self._max_gain_dbfs = max_gain_dbfs
def __call__(self, data, sample_rate=None):
del sample_rate
gain = self._rng.uniform(self._min_gain_dbfs, self._max_gain_dbfs)
data._samples = data._samples * (10. ** (gain / 20.))
class ShiftPerturbation(Perturbation):
def __init__(self, min_shift_ms=-5.0, max_shift_ms=5.0, p=0.1, rng=None):
super(ShiftPerturbation, self).__init__(p, rng)
self._min_shift_ms = min_shift_ms
self._max_shift_ms = max_shift_ms
def __call__(self, data, sample_rate):
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
if abs(shift_ms) / 1000 > data.duration:
# TODO: do something smarter than just ignore this condition
return
shift_samples = int(shift_ms * data.sample_rate // 1000)
# print("DEBUG: shift:", shift_samples)
if shift_samples < 0:
data._samples[-shift_samples:] = data._samples[:shift_samples]
data._samples[:-shift_samples] = 0
elif shift_samples > 0:
data._samples[:-shift_samples] = data._samples[shift_samples:]
data._samples[-shift_samples:] = 0
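# Illustrative use of the perturbations above (hypothetical path, not taken from
# the training scripts): each perturbation mutates the segment in place with
# probability `p`.
#   segment = AudioSegment('/datasets/sample.wav', target_sr=16000)
#   SpeedPerturbation(min_rate=0.85, max_rate=1.15, p=0.5).maybe_apply(segment, 16000)
#   GainPerturbation(min_gain_dbfs=-10, max_gain_dbfs=10, p=0.5).maybe_apply(segment)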
View file
@ -0,0 +1,13 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
View file
@ -0,0 +1,158 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import math
import numpy as np
import torch.distributed as dist
from .iterator import DaliJasperIterator, SyntheticDataIterator
from .pipeline import DaliPipeline
from common.helpers import print_once
def _parse_json(json_path: str, start_label=0, predicate=lambda json: True):
"""
Parses json file to the format required by DALI
Args:
json_path: path to json file
start_label: the label from which DALI starts assigning consecutive int numbers to every transcript
predicate: function that accepts a sample descriptor (i.e. a json dictionary) as an argument.
If the predicate for a given sample returns True, it will be included in the dataset.
Returns:
output_files: dictionary that maps a file name to the label assigned by DALI
transcripts: dictionary that maps a label assigned by DALI to the transcript
"""
import json
global cnt
with open(json_path) as f:
librispeech_json = json.load(f)
output_files = {}
transcripts = {}
curr_label = start_label
for original_sample in librispeech_json:
if not predicate(original_sample):
continue
transcripts[curr_label] = original_sample['transcript']
output_files[original_sample['files'][-1]['fname']] = curr_label
curr_label += 1
return output_files, transcripts
def _dict_to_file(dict: dict, filename: str):
with open(filename, "w") as f:
for key, value in dict.items():
f.write("{} {}\n".format(key, value))
class DaliDataLoader:
"""
DataLoader is the main entry point to the data preprocessing pipeline.
To use, create an object and then just iterate over `data_iterator`.
DataLoader will do the rest for you.
Example:
data_layer = DataLoader(DaliTrainPipeline, path, json, bs, ngpu)
data_it = data_layer.data_iterator
for data in data_it:
print(data) # Here's your preprocessed data
Args:
device_type: Which device to use for preprocessing. Choose: "cpu", "gpu"
pipeline_type: Choose: "train", "val", "synth"
"""
def __init__(self, gpu_id, dataset_path: str, config_data: dict, config_features: dict, json_names: list,
symbols: list, batch_size: int, pipeline_type: str, grad_accumulation_steps: int = 1,
synth_iters_per_epoch: int = 544, device_type: str = "gpu"):
import torch
self.batch_size = batch_size
self.grad_accumulation_steps = grad_accumulation_steps
self.drop_last = (pipeline_type == 'train')
self.device_type = device_type
pipeline_type = self._parse_pipeline_type(pipeline_type)
if pipeline_type == "synth":
self._dali_data_iterator = self._init_synth_iterator(self.batch_size, config_features['nfilt'],
iters_per_epoch=synth_iters_per_epoch,
ngpus=torch.distributed.get_world_size())
else:
self._dali_data_iterator = self._init_iterator(gpu_id=gpu_id, dataset_path=dataset_path,
config_data=config_data,
config_features=config_features,
json_names=json_names, symbols=symbols,
train_pipeline=pipeline_type == "train")
def _init_iterator(self, gpu_id, dataset_path, config_data, config_features, json_names: list, symbols: list,
train_pipeline: bool):
"""
Returns data iterator. Data underneath this operator is preprocessed within Dali
"""
def hash_list_of_strings(li):
return str(abs(hash(''.join(li))))
output_files, transcripts = {}, {}
max_duration = config_data['max_duration']
for jname in json_names:
of, tr = _parse_json(jname if jname[0] == '/' else os.path.join(dataset_path, jname), len(output_files),
predicate=lambda json: json['original_duration'] <= max_duration)
output_files.update(of)
transcripts.update(tr)
file_list_path = os.path.join("/tmp", "jasper_dali.file_list." + hash_list_of_strings(json_names))
_dict_to_file(output_files, file_list_path)
self.dataset_size = len(output_files)
print_once(f"Dataset read by DALI. Number of samples: {self.dataset_size}")
pipeline = DaliPipeline.from_config(config_data=config_data, config_features=config_features, device_id=gpu_id,
file_root=dataset_path, file_list=file_list_path,
device_type=self.device_type, batch_size=self.batch_size,
train_pipeline=train_pipeline)
return DaliJasperIterator([pipeline], transcripts=transcripts, symbols=symbols, batch_size=self.batch_size,
shard_size=self._shard_size(), train_iterator=train_pipeline)
def _init_synth_iterator(self, batch_size, nfeatures, iters_per_epoch, ngpus):
self.dataset_size = ngpus * iters_per_epoch * batch_size
return SyntheticDataIterator(batch_size, nfeatures, regenerate=True)
@staticmethod
def _parse_pipeline_type(pipeline_type):
pipe = pipeline_type.lower()
assert pipe in ("train", "val", "synth"), 'Invalid pipeline type (choices: "train", "val", "synth").'
return pipe
def _shard_size(self):
"""
Total number of samples handled by a single GPU in a single epoch.
"""
world_size = dist.get_world_size() if dist.is_initialized() else 1
if self.drop_last:
divisor = world_size * self.batch_size * self.grad_accumulation_steps
return self.dataset_size // divisor * divisor // world_size
else:
return int(math.ceil(self.dataset_size / world_size))
def __len__(self):
"""
Number of batches handled by each GPU.
"""
if self.drop_last:
assert self._shard_size() % self.batch_size == 0, f'{self._shard_size()} {self.batch_size}'
return int(math.ceil(self._shard_size() / self.batch_size))
def data_iterator(self):
return self._dali_data_iterator
def __iter__(self):
return self._dali_data_iterator
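# Illustrative construction (argument values are placeholders, not a verified
# recipe; `config_data`/`config_features` are the dataset and feature sections
# parsed from the model's YAML config):
#   loader = DaliDataLoader(gpu_id=0, dataset_path='/datasets/LibriSpeech',
#                           config_data=config_data, config_features=config_features,
#                           json_names=['librispeech-train-clean-100-wav.json'],
#                           symbols=symbols, batch_size=64, pipeline_type='train')
#   for audio, audio_lens, transcripts, transcript_lens in loader:
#       ...  # tensors arrive on the GPU, already preprocessed by DALI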
View file
@ -0,0 +1,162 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.distributed as dist
import numpy as np
from common.helpers import print_once
from common.text import _clean_text, punctuation_map
def normalize_string(s, symbols, punct_map):
"""
Normalizes string.
Example:
'call me at 8:00 pm!' -> 'call me at eight zero pm'
"""
labels = set(symbols)
try:
text = _clean_text(s, ["english_cleaners"], punct_map).strip()
return ''.join([tok for tok in text if all(t in labels for t in tok)])
except Exception as e:
print_once("WARNING: Normalizing failed: {s} {e}")
class DaliJasperIterator(object):
"""
Returns batches of data for Jasper training:
preprocessed_signal, preprocessed_signal_length, transcript, transcript_length
This iterator is not meant to be the entry point to Dali processing pipeline.
Use DataLoader instead.
"""
def __init__(self, dali_pipelines, transcripts, symbols, batch_size, shard_size, train_iterator: bool):
self.transcripts = transcripts
self.symbols = symbols
self.batch_size = batch_size
from nvidia.dali.plugin.pytorch import DALIGenericIterator
from nvidia.dali.plugin.base_iterator import LastBatchPolicy
# in train pipeline shard_size is set to be divisible by batch_size, so PARTIAL policy is safe
self.dali_it = DALIGenericIterator(
dali_pipelines, ["audio", "label", "audio_shape"], size=shard_size,
dynamic_shape=True, auto_reset=True, last_batch_padded=True,
last_batch_policy=LastBatchPolicy.PARTIAL)
@staticmethod
def _str2list(s: str):
"""
Returns a list of floats that represents the given string.
'0.' denotes separator
'1.' denotes 'a'
'27.' denotes "'"
Assumes that the string is lower case.
"""
list = []
for c in s:
if c == "'":
list.append(27.)
else:
list.append(max(0., ord(c) - 96.))
return list
@staticmethod
def _pad_lists(lists: list, pad_val=0):
"""
Pads lists so that all have the same size.
Returns a list with the actual sizes of the corresponding input lists
"""
max_length = 0
sizes = []
for li in lists:
sizes.append(len(li))
max_length = max_length if len(li) < max_length else len(li)
for li in lists:
li += [pad_val] * (max_length - len(li))
return sizes
def _gen_transcripts(self, labels, normalize_transcripts: bool = True):
"""
Generate transcripts in format expected by NN
"""
lists = [
self._str2list(normalize_string(self.transcripts[lab.item()], self.symbols, punctuation_map(self.symbols)))
for lab in labels
] if normalize_transcripts else [self._str2list(self.transcripts[lab.item()]) for lab in labels]
sizes = self._pad_lists(lists)
return torch.tensor(lists).cuda(), torch.tensor(sizes, dtype=torch.int32).cuda()
def __next__(self):
data = self.dali_it.__next__()
transcripts, transcripts_lengths = self._gen_transcripts(data[0]["label"])
return data[0]["audio"], data[0]["audio_shape"][:, 1], transcripts, transcripts_lengths
def next(self):
return self.__next__()
def __iter__(self):
return self
# TODO: refactor
class SyntheticDataIterator(object):
def __init__(self, batch_size, nfeatures, feat_min=-5., feat_max=0., txt_min=0., txt_max=23., feat_lens_max=1760,
txt_lens_max=231, regenerate=False):
"""
Args:
batch_size
nfeatures: number of features for melfbanks
feat_min: minimum value in `feat` tensor, used for randomization
feat_max: maximum value in `feat` tensor, used for randomization
txt_min: minimum value in `txt` tensor, used for randomization
txt_max: maximum value in `txt` tensor, used for randomization
regenerate: If True, regenerate random tensors for every iterator step.
If False, generate them only at start.
"""
self.batch_size = batch_size
self.nfeatures = nfeatures
self.feat_min = feat_min
self.feat_max = feat_max
self.feat_lens_max = feat_lens_max
self.txt_min = txt_min
self.txt_max = txt_max
self.txt_lens_max = txt_lens_max
self.regenerate = regenerate
if not self.regenerate:
self.feat, self.feat_lens, self.txt, self.txt_lens = self._generate_sample()
def _generate_sample(self):
feat = (self.feat_max - self.feat_min) * np.random.random_sample(
(self.batch_size, self.nfeatures, self.feat_lens_max)) + self.feat_min
feat_lens = np.random.randint(0, int(self.feat_lens_max) - 1, size=self.batch_size)
txt = (self.txt_max - self.txt_min) * np.random.random_sample(
(self.batch_size, self.txt_lens_max)) + self.txt_min
txt_lens = np.random.randint(0, int(self.txt_lens_max) - 1, size=self.batch_size)
return torch.Tensor(feat).cuda(), \
torch.Tensor(feat_lens).cuda(), \
torch.Tensor(txt).cuda(), \
torch.Tensor(txt_lens).cuda()
def __next__(self):
if self.regenerate:
return self._generate_sample()
return self.feat, self.feat_lens, self.txt, self.txt_lens
def next(self):
return self.__next__()
def __iter__(self):
return self
View file
@ -0,0 +1,397 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import nvidia.dali
import nvidia.dali.ops as ops
import nvidia.dali.types as types
import multiprocessing
import numpy as np
import torch
import math
import itertools
class DaliPipeline(nvidia.dali.pipeline.Pipeline):
def __init__(self, *,
train_pipeline: bool, # True if train pipeline, False if validation pipeline
device_id,
num_threads,
batch_size,
file_root: str,
file_list: str,
sample_rate,
discrete_resample_range: bool,
resample_range: list,
window_size,
window_stride,
nfeatures,
nfft,
frame_splicing_factor,
dither_coeff,
silence_threshold,
preemph_coeff,
pad_align,
max_duration,
mask_time_num_regions,
mask_time_min,
mask_time_max,
mask_freq_num_regions,
mask_freq_min,
mask_freq_max,
mask_both_num_regions,
mask_both_min_time,
mask_both_max_time,
mask_both_min_freq,
mask_both_max_freq,
preprocessing_device="gpu"):
super().__init__(batch_size, num_threads, device_id)
self._dali_init_log(locals())
if torch.distributed.is_initialized():
shard_id = torch.distributed.get_rank()
n_shards = torch.distributed.get_world_size()
else:
shard_id = 0
n_shards = 1
self.preprocessing_device = preprocessing_device.lower()
assert self.preprocessing_device == "cpu" or self.preprocessing_device == "gpu", \
"Incorrect preprocessing device. Please choose either 'cpu' or 'gpu'"
self.frame_splicing_factor = frame_splicing_factor
assert frame_splicing_factor == 1, "DALI doesn't support frame splicing operation"
self.resample_range = resample_range
self.discrete_resample_range = discrete_resample_range
self.train = train_pipeline
self.sample_rate = sample_rate
self.dither_coeff = dither_coeff
self.nfeatures = nfeatures
self.max_duration = max_duration
self.mask_params = {
'time_num_regions': mask_time_num_regions,
'time_min': mask_time_min,
'time_max': mask_time_max,
'freq_num_regions': mask_freq_num_regions,
'freq_min': mask_freq_min,
'freq_max': mask_freq_max,
'both_num_regions': mask_both_num_regions,
'both_min_time': mask_both_min_time,
'both_max_time': mask_both_max_time,
'both_min_freq': mask_both_min_freq,
'both_max_freq': mask_both_max_freq,
}
self.do_remove_silence = True if silence_threshold is not None else False
self.read = ops.FileReader(device="cpu", file_root=file_root, file_list=file_list, shard_id=shard_id,
num_shards=n_shards, shuffle_after_epoch=train_pipeline)
# TODO change ExternalSource to Uniform for new DALI release
if discrete_resample_range and resample_range is not None:
self.speed_perturbation_coeffs = ops.ExternalSource(device="cpu", cycle=True,
source=self._discrete_resample_coeffs_generator)
elif resample_range is not None:
self.speed_perturbation_coeffs = ops.Uniform(device="cpu", range=resample_range)
else:
self.speed_perturbation_coeffs = None
self.decode = ops.AudioDecoder(device="cpu", sample_rate=self.sample_rate if resample_range is None else None,
dtype=types.FLOAT, downmix=True)
self.normal_distribution = ops.NormalDistribution(device=preprocessing_device)
self.preemph = ops.PreemphasisFilter(device=preprocessing_device, preemph_coeff=preemph_coeff)
self.spectrogram = ops.Spectrogram(device=preprocessing_device, nfft=nfft,
window_length=window_size * sample_rate,
window_step=window_stride * sample_rate)
self.mel_fbank = ops.MelFilterBank(device=preprocessing_device, sample_rate=sample_rate, nfilter=self.nfeatures,
normalize=True)
self.log_features = ops.ToDecibels(device=preprocessing_device, multiplier=np.log(10), reference=1.0,
cutoff_db=math.log(1e-20))
self.get_shape = ops.Shapes(device=preprocessing_device)
self.normalize = ops.Normalize(device=preprocessing_device, axes=[1])
self.pad = ops.Pad(device=preprocessing_device, axes=[1], fill_value=0, align=pad_align)
# Silence trimming
self.get_nonsilent_region = ops.NonsilentRegion(device="cpu", cutoff_db=silence_threshold)
self.trim_silence = ops.Slice(device="cpu", normalized_anchor=False, normalized_shape=False, axes=[0])
self.to_float = ops.Cast(device="cpu", dtype=types.FLOAT)
# Spectrogram masking
self.spectrogram_cutouts = ops.ExternalSource(source=self._cutouts_generator, num_outputs=2, cycle=True)
self.mask_spectrogram = ops.Erase(device=preprocessing_device, axes=[0, 1], fill_value=0,
normalized_anchor=True)
@classmethod
def from_config(cls, train_pipeline: bool, device_id, batch_size, file_root: str, file_list: str, config_data: dict,
config_features: dict, device_type: str = "gpu", do_resampling: bool = True,
num_cpu_threads=multiprocessing.cpu_count()):
max_duration = config_data['max_duration']
sample_rate = config_data['sample_rate']
silence_threshold = -60 if config_data['trim_silence'] else None
# TODO Take into account resampling probability
# TODO config_features['speed_perturbation']['p']
if do_resampling and config_data['speed_perturbation'] is not None:
resample_range = [config_data['speed_perturbation']['min_rate'],
config_data['speed_perturbation']['max_rate']]
discrete_resample_range = config_data['speed_perturbation']['discrete']
else:
resample_range = None
discrete_resample_range = False
window_size = config_features['window_size']
window_stride = config_features['window_stride']
nfeatures = config_features['n_filt']
nfft = config_features['n_fft']
frame_splicing_factor = config_features['frame_splicing']
dither_coeff = config_features['dither']
pad_align = config_features['pad_align']
pad_to_max_duration = config_features['pad_to_max_duration']
assert not pad_to_max_duration, "Padding to max duration currently not supported in DALI"
preemph_coeff = .97
config_spec = config_features['spec_augment']
if config_spec is not None:
mask_time_num_regions = config_spec['time_masks']
mask_time_min = config_spec['min_time']
mask_time_max = config_spec['max_time']
mask_freq_num_regions = config_spec['freq_masks']
mask_freq_min = config_spec['min_freq']
mask_freq_max = config_spec['max_freq']
else:
mask_time_num_regions = 0
mask_time_min = 0
mask_time_max = 0
mask_freq_num_regions = 0
mask_freq_min = 0
mask_freq_max = 0
config_cutout = config_features['cutout_augment']
if config_cutout is not None:
mask_both_num_regions = config_cutout['masks']
mask_both_min_time = config_cutout['min_time']
mask_both_max_time = config_cutout['max_time']
mask_both_min_freq = config_cutout['min_freq']
mask_both_max_freq = config_cutout['max_freq']
else:
mask_both_num_regions = 0
mask_both_min_time = 0
mask_both_max_time = 0
mask_both_min_freq = 0
mask_both_max_freq = 0
return cls(train_pipeline=train_pipeline,
device_id=device_id,
preprocessing_device=device_type,
num_threads=num_cpu_threads,
batch_size=batch_size,
file_root=file_root,
file_list=file_list,
sample_rate=sample_rate,
discrete_resample_range=discrete_resample_range,
resample_range=resample_range,
window_size=window_size,
window_stride=window_stride,
nfeatures=nfeatures,
nfft=nfft,
frame_splicing_factor=frame_splicing_factor,
dither_coeff=dither_coeff,
silence_threshold=silence_threshold,
preemph_coeff=preemph_coeff,
pad_align=pad_align,
max_duration=max_duration,
mask_time_num_regions=mask_time_num_regions,
mask_time_min=mask_time_min,
mask_time_max=mask_time_max,
mask_freq_num_regions=mask_freq_num_regions,
mask_freq_min=mask_freq_min,
mask_freq_max=mask_freq_max,
mask_both_num_regions=mask_both_num_regions,
mask_both_min_time=mask_both_min_time,
mask_both_max_time=mask_both_max_time,
mask_both_min_freq=mask_both_min_freq,
mask_both_max_freq=mask_both_max_freq)
@staticmethod
def _dali_init_log(args: dict):
if (not torch.distributed.is_initialized() or (
torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)): # print once
max_len = max([len(ii) for ii in args.keys()])
fmt_string = '\t%' + str(max_len) + 's : %s'
print('Initializing DALI with parameters:')
for keyPair in sorted(args.items()):
print(fmt_string % keyPair)
@staticmethod
def _div_ceil(dividend, divisor):
return (dividend + (divisor - 1)) // divisor
def _get_audio_len(self, inp):
return self.get_shape(inp) if self.frame_splicing_factor == 1 else \
self._div_ceil(self.get_shape(inp), self.frame_splicing_factor)
def _remove_silence(self, inp):
begin, length = self.get_nonsilent_region(inp)
out = self.trim_silence(inp, self.to_float(begin), self.to_float(length))
return out
def _do_spectrogram_masking(self):
return self.mask_params['time_num_regions'] > 0 or self.mask_params['freq_num_regions'] > 0 or \
self.mask_params['both_num_regions'] > 0
@staticmethod
def _interleave_lists(*lists):
"""
[*, **, ***], [1, 2, 3], [a, b, c] -> [*, 1, a, **, 2, b, ***, 3, c]
Returns:
iterator over interleaved list
"""
assert all((len(lists[0]) == len(test_l) for test_l in lists)), "All lists have to have the same length"
return itertools.chain(*zip(*lists))
def _generate_cutouts(self):
"""
Returns:
Generates anchors and shapes of the cutout regions.
Single call generates one batch of data.
The output shall be passed to DALI's Erase operator
anchors = [f0 t0 f1 t1 ...]
shapes = [f0w t0h f1w t1h ...]
"""
MAX_TIME_DIMENSION = 20 * 16000
freq_anchors = np.random.random(self.mask_params['freq_num_regions'])
time_anchors = np.random.random(self.mask_params['time_num_regions'])
both_anchors_freq = np.random.random(self.mask_params['both_num_regions'])
both_anchors_time = np.random.random(self.mask_params['both_num_regions'])
anchors = []
for anch in freq_anchors:
anchors.extend([anch, 0])
for anch in time_anchors:
anchors.extend([0, anch])
for t, f in zip(both_anchors_time, both_anchors_freq):
anchors.extend([f, t])
shapes = []
shapes.extend(
self._interleave_lists(
np.random.randint(self.mask_params['freq_min'], self.mask_params['freq_max'] + 1,
self.mask_params['freq_num_regions']),
# XXX: Here, a time dimension of the spectrogram shall be passed.
# However, in DALI ArgumentInput can't come from GPU.
# So we leave the job for Erase (masking operator) to get it together.
[int(MAX_TIME_DIMENSION)] * self.mask_params['freq_num_regions']
)
)
shapes.extend(
self._interleave_lists(
[self.nfeatures] * self.mask_params['time_num_regions'],
np.random.randint(self.mask_params['time_min'], self.mask_params['time_max'] + 1,
self.mask_params['time_num_regions'])
)
)
shapes.extend(
self._interleave_lists(
np.random.randint(self.mask_params['both_min_freq'], self.mask_params['both_max_freq'] + 1,
self.mask_params['both_num_regions']),
np.random.randint(self.mask_params['both_min_time'], self.mask_params['both_max_time'] + 1,
self.mask_params['both_num_regions'])
)
)
return anchors, shapes
def _discrete_resample_coeffs_generator(self):
"""
Generate resample coeffs from discrete set
"""
yield np.random.choice([self.resample_range[0], 1.0, self.resample_range[1]],
size=self.batch_size).astype('float32')
def _cutouts_generator(self):
"""
Generator that wraps cutout creation in order to randomize inputs
and allow passing them to DALI's ExternalSource operator
"""
def tuples2list(tuples: list):
"""
[(a, b), (c, d)] -> [[a, c], [b, d]]
"""
return map(list, zip(*tuples))
[anchors, shapes] = tuples2list([self._generate_cutouts() for _ in range(self.batch_size)])
yield np.array(anchors, dtype=np.float32), np.array(shapes, dtype=np.float32)
def define_graph(self):
audio, label = self.read()
if not self.train or self.speed_perturbation_coeffs is None:
audio, sr = self.decode(audio)
else:
resample_coeffs = self.speed_perturbation_coeffs() * self.sample_rate
audio, sr = self.decode(audio, sample_rate=resample_coeffs)
if self.do_remove_silence:
audio = self._remove_silence(audio)
# Max duration drop is performed at DataLayer stage
if self.preprocessing_device == "gpu":
audio = audio.gpu()
if self.dither_coeff != 0.:
audio = audio + self.normal_distribution(audio) * self.dither_coeff
audio = self.preemph(audio)
audio = self.spectrogram(audio)
audio = self.mel_fbank(audio)
audio = self.log_features(audio)
audio_len = self._get_audio_len(audio)
audio = self.normalize(audio)
audio = self.pad(audio)
if self.train and self._do_spectrogram_masking():
anchors, shapes = self.spectrogram_cutouts()
audio = self.mask_spectrogram(audio, anchor=anchors, shape=shapes)
# When modifying DALI pipeline returns, make sure you update `output_map` in DALIGenericIterator invocation
return audio.gpu(), label.gpu(), audio_len.gpu()
class DaliTritonPipeline(DaliPipeline):
def __init__(self, **kwargs):
super().__init__(**kwargs)
assert not kwargs['train_pipeline'], "Pipeline for Triton shall be a validation pipeline"
if torch.distributed.is_initialized():
raise RuntimeError(
"You're creating Triton pipeline, using multi-process mode. Please use single-process mode.")
self.read = ops.ExternalSource(name="DALI_INPUT_0", no_copy=True, device="cpu")
def serialize_dali_triton_pipeline(output_path: str, config_data: dict, config_features: dict):
pipe = DaliTritonPipeline.from_config(train_pipeline=False, device_id=-1, batch_size=-1, file_root=None,
file_list=None, config_data=config_data, config_features=config_features,
do_resampling=False, num_cpu_threads=-1)
pipe.serialize(filename=output_path)
View file
@ -0,0 +1,234 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from pathlib import Path
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from .audio import (audio_from_file, AudioSegment, GainPerturbation,
ShiftPerturbation, SpeedPerturbation)
from .text import _clean_text, punctuation_map
def normalize_string(s, labels, punct_map):
"""Normalizes string.
Example:
'call me at 8:00 pm!' -> 'call me at eight zero pm'
"""
labels = set(labels)
try:
text = _clean_text(s, ["english_cleaners"], punct_map).strip()
return ''.join([tok for tok in text if all(t in labels for t in tok)])
except:
print(f"WARNING: Normalizing failed: {s}")
return None
class FilelistDataset(Dataset):
def __init__(self, filelist_fpath):
self.samples = [line.strip() for line in open(filelist_fpath, 'r')]
def __len__(self):
return len(self.samples)
def __getitem__(self, index):
audio, audio_len = audio_from_file(self.samples[index])
return (audio.squeeze(0), audio_len, torch.LongTensor([0]),
torch.LongTensor([0]))
class SingleAudioDataset(FilelistDataset):
def __init__(self, audio_fpath):
self.samples = [audio_fpath]
class AudioDataset(Dataset):
def __init__(self, data_dir, manifest_fpaths, labels,
sample_rate=16000, min_duration=0.1, max_duration=float("inf"),
pad_to_max_duration=False, max_utts=0, normalize_transcripts=True,
sort_by_duration=False, trim_silence=False,
speed_perturbation=None, gain_perturbation=None,
shift_perturbation=None, ignore_offline_speed_perturbation=False):
"""Loads audio, transcript and durations listed in a .json file.
Args:
data_dir: absolute path to dataset folder
manifest_fpaths: relative path(s) from the dataset folder
to the manifest JSON file(s) described above.
labels (str): all possible output symbols
min_duration (int): skip audio shorter than threshold
max_duration (int): skip audio longer than threshold
pad_to_max_duration (bool): pad all sequences to max_duration
max_utts (int): limit number of utterances
normalize_transcripts (bool): normalize transcript text
sort_by_duration (bool): sort sequences by increasing duration
trim_silence (bool): trim leading and trailing silence from audio
ignore_offline_speed_perturbation (bool): use precomputed speed perturbation
Returns:
tuple of Tensors
"""
self.data_dir = data_dir
self.labels = labels
self.labels_map = dict([(labels[i], i) for i in range(len(labels))])
self.punctuation_map = punctuation_map(labels)
self.blank_index = len(labels)
self.pad_to_max_duration = pad_to_max_duration
self.sort_by_duration = sort_by_duration
self.max_utts = max_utts
self.normalize_transcripts = normalize_transcripts
self.ignore_offline_speed_perturbation = ignore_offline_speed_perturbation
self.min_duration = min_duration
self.max_duration = max_duration
self.trim_silence = trim_silence
self.sample_rate = sample_rate
perturbations = []
if speed_perturbation is not None:
perturbations.append(SpeedPerturbation(**speed_perturbation))
if gain_perturbation is not None:
perturbations.append(GainPerturbation(**gain_perturbation))
if shift_perturbation is not None:
perturbations.append(ShiftPerturbation(**shift_perturbation))
self.perturbations = perturbations
self.max_duration = max_duration
self.samples = []
self.duration = 0.0
self.duration_filtered = 0.0
for fpath in manifest_fpaths:
self._load_json_manifest(fpath)
if sort_by_duration:
self.samples = sorted(self.samples, key=lambda s: s['duration'])
def __getitem__(self, index):
s = self.samples[index]
rn_indx = np.random.randint(len(s['audio_filepath']))
duration = s['audio_duration'][rn_indx] if 'audio_duration' in s else 0
offset = s.get('offset', 0)
segment = AudioSegment(
s['audio_filepath'][rn_indx], target_sr=self.sample_rate,
offset=offset, duration=duration, trim=self.trim_silence)
for p in self.perturbations:
p.maybe_apply(segment, self.sample_rate)
segment = torch.FloatTensor(segment.samples)
return (segment,
torch.tensor(segment.shape[0]).int(),
torch.tensor(s["transcript"]),
torch.tensor(len(s["transcript"])).int())
def __len__(self):
return len(self.samples)
def _load_json_manifest(self, fpath):
for s in json.load(open(fpath, "r", encoding="utf-8")):
if self.pad_to_max_duration and not self.ignore_offline_speed_perturbation:
# require all perturbed samples to be < self.max_duration
s_max_duration = max(f['duration'] for f in s['files'])
else:
# otherwise we allow perturbed samples to be > self.max_duration
s_max_duration = s['original_duration']
s['duration'] = s.pop('original_duration')
if not (self.min_duration <= s_max_duration <= self.max_duration):
self.duration_filtered += s['duration']
continue
# Prune and normalize according to transcript
tr = (s.get('transcript', None) or
self.load_transcript(s['text_filepath']))
if not isinstance(tr, str):
print(f'WARNING: Skipped sample (transcript not a str): {tr}.')
self.duration_filtered += s['duration']
continue
if self.normalize_transcripts:
tr = normalize_string(tr, self.labels, self.punctuation_map)
s["transcript"] = self.to_vocab_inds(tr)
files = s.pop('files')
if self.ignore_offline_speed_perturbation:
files = [f for f in files if f['speed'] == 1.0]
s['audio_duration'] = [f['duration'] for f in files]
s['audio_filepath'] = [str(Path(self.data_dir, f['fname']))
for f in files]
self.samples.append(s)
self.duration += s['duration']
if self.max_utts > 0 and len(self.samples) >= self.max_utts:
print(f'Reached max_utts={self.max_utts}. Finished parsing {fpath}.')
break
def load_transcript(self, transcript_path):
with open(transcript_path, 'r', encoding="utf-8") as transcript_file:
transcript = transcript_file.read().replace('\n', '')
return transcript
def to_vocab_inds(self, transcript):
chars = [self.labels_map.get(x, self.blank_index) for x in list(transcript)]
transcript = list(filter(lambda x: x != self.blank_index, chars))
return transcript
def collate_fn(batch):
bs = len(batch)
max_len = lambda l, idx: max(el[idx].size(0) for el in l)
audio = torch.zeros(bs, max_len(batch, 0))
audio_lens = torch.zeros(bs, dtype=torch.int32)
transcript = torch.zeros(bs, max_len(batch, 2))
transcript_lens = torch.zeros(bs, dtype=torch.int32)
for i, sample in enumerate(batch):
audio[i].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
audio_lens[i] = sample[1]
transcript[i].narrow(0, 0, sample[2].size(0)).copy_(sample[2])
transcript_lens[i] = sample[3]
return audio, audio_lens, transcript, transcript_lens
def get_data_loader(dataset, batch_size, multi_gpu=True, shuffle=True,
drop_last=True, num_workers=4):
kw = {'dataset': dataset, 'collate_fn': collate_fn,
'num_workers': num_workers, 'pin_memory': True}
if multi_gpu:
loader_shuffle = False
sampler = DistributedSampler(dataset, shuffle=shuffle)
else:
loader_shuffle = shuffle
sampler = None
return DataLoader(batch_size=batch_size, drop_last=drop_last,
sampler=sampler, shuffle=loader_shuffle, **kw)
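# Illustrative wiring of the native (non-DALI) data pipeline; paths and values
# are placeholders, not a verified recipe:
#   dataset = AudioDataset('/datasets/LibriSpeech',
#                          ['librispeech-train-clean-100-wav.json'],
#                          labels=symbols, max_duration=16.7)
#   loader = get_data_loader(dataset, batch_size=64, multi_gpu=False)
#   for audio, audio_lens, transcript, transcript_lens in loader:
#       ...  # raw padded waveforms; feature extraction happens later on the GPU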
View file
@ -0,0 +1,293 @@
import math
import random
import librosa
import torch
import torch.nn as nn
from apex import amp
class BaseFeatures(nn.Module):
"""Base class for GPU accelerated audio preprocessing."""
__constants__ = ["pad_align", "pad_to_max_duration", "max_len"]
def __init__(self, pad_align, pad_to_max_duration, max_duration,
sample_rate, window_size, window_stride, spec_augment=None,
cutout_augment=None):
super(BaseFeatures, self).__init__()
self.pad_align = pad_align
self.pad_to_max_duration = pad_to_max_duration
self.win_length = int(sample_rate * window_size) # frame size
self.hop_length = int(sample_rate * window_stride)
# Calculate maximum sequence length (# frames)
if pad_to_max_duration:
self.max_len = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
if spec_augment is not None:
self.spec_augment = SpecAugment(**spec_augment)
else:
self.spec_augment = None
if cutout_augment is not None:
self.cutout_augment = CutoutAugment(**cutout_augment)
else:
self.cutout_augment = None
@torch.no_grad()
def calculate_features(self, audio, audio_lens):
return audio, audio_lens
def __call__(self, audio, audio_lens, optim_level=0):
dtype = audio.dtype
audio = audio.float()
if optim_level == 1:
with amp.disable_casts():
feat, feat_lens = self.calculate_features(audio, audio_lens)
else:
feat, feat_lens = self.calculate_features(audio, audio_lens)
feat = self.apply_padding(feat)
if self.cutout_augment is not None:
feat = self.cutout_augment(feat)
if self.spec_augment is not None:
feat = self.spec_augment(feat)
feat = feat.to(dtype)
return feat, feat_lens
def apply_padding(self, x):
if self.pad_to_max_duration:
x_size = max(x.size(-1), self.max_len)
else:
x_size = x.size(-1)
if self.pad_align > 0:
pad_amt = x_size % self.pad_align
else:
pad_amt = 0
padded_len = x_size + (self.pad_align - pad_amt if pad_amt > 0 else 0)
return nn.functional.pad(x, (0, padded_len - x.size(-1)))
class SpecAugment(nn.Module):
"""Spec augment. refer to https://arxiv.org/abs/1904.08779
"""
def __init__(self, freq_masks=0, min_freq=0, max_freq=10, time_masks=0,
min_time=0, max_time=10):
super(SpecAugment, self).__init__()
assert 0 <= min_freq <= max_freq
assert 0 <= min_time <= max_time
self.freq_masks = freq_masks
self.min_freq = min_freq
self.max_freq = max_freq
self.time_masks = time_masks
self.min_time = min_time
self.max_time = max_time
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool, device=x.device)
for idx in range(sh[0]):
for _ in range(self.freq_masks):
w = torch.randint(self.min_freq, self.max_freq + 1, size=(1,)).item()
f0 = torch.randint(0, max(1, sh[1] - w), size=(1,))
mask[idx, f0:f0+w] = 1
for _ in range(self.time_masks):
w = torch.randint(self.min_time, self.max_time + 1, size=(1,)).item()
t0 = torch.randint(0, max(1, sh[2] - w), size=(1,))
mask[idx, :, t0:t0+w] = 1
return x.masked_fill(mask, 0)
class CutoutAugment(nn.Module):
"""Cutout. refer to https://arxiv.org/pdf/1708.04552.pdf
"""
def __init__(self, masks=0, min_freq=20, max_freq=20, min_time=5, max_time=5):
super(CutoutAugment, self).__init__()
assert 0 <= min_freq <= max_freq
assert 0 <= min_time <= max_time
self.masks = masks
self.min_freq = min_freq
self.max_freq = max_freq
self.min_time = min_time
self.max_time = max_time
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool, device=x.device)
for idx in range(sh[0]):
for i in range(self.masks):
w = torch.randint(self.min_freq, self.max_freq + 1, size=(1,)).item()
h = torch.randint(self.min_time, self.max_time + 1, size=(1,)).item()
f0 = int(random.uniform(0, sh[1] - w))
t0 = int(random.uniform(0, sh[2] - h))
mask[idx, f0:f0+w, t0:t0+h] = 1
return x.masked_fill(mask, 0)
@torch.jit.script
def normalize_batch(x, seq_len, normalize_type: str):
# print ("normalize_batch: x, seq_len, shapes: ", x.shape, seq_len, seq_len.shape)
if normalize_type == "per_feature":
x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
for i in range(x.shape[0]):
x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
elif normalize_type == "all_features":
x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
for i in range(x.shape[0]):
x_mean[i] = x[i, :, :int(seq_len[i])].mean()
x_std[i] = x[i, :, :int(seq_len[i])].std()
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
else:
return x
@torch.jit.script
def splice_frames(x, frame_splicing: int):
""" Stacks frames together across feature dim
input is batch_size, feature_dim, num_frames
output is batch_size, feature_dim*frame_splicing, num_frames
"""
seq = [x]
# TORCHSCRIPT: JIT doesn't like range(start, stop)
for n in range(frame_splicing - 1):
seq.append(torch.cat([x[:, :, :n + 1], x[:, :, n + 1:]], dim=2))
return torch.cat(seq, dim=1)
class FilterbankFeatures(BaseFeatures):
# For JIT, https://pytorch.org/docs/stable/jit.html#python-defined-constants
__constants__ = ["dither", "preemph", "n_fft", "hop_length", "win_length",
"log", "frame_splicing", "normalize"]
# torchscript: "center" removed due to a bug
def __init__(self, spec_augment=None, cutout_augment=None,
sample_rate=8000, window_size=0.02, window_stride=0.01,
window="hamming", normalize="per_feature", n_fft=None,
preemph=0.97, n_filt=64, lowfreq=0, highfreq=None, log=True,
dither=1e-5, pad_align=8, pad_to_max_duration=False,
max_duration=float('inf'), frame_splicing=1):
super(FilterbankFeatures, self).__init__(
pad_align=pad_align, pad_to_max_duration=pad_to_max_duration,
max_duration=max_duration, sample_rate=sample_rate,
window_size=window_size, window_stride=window_stride,
spec_augment=spec_augment, cutout_augment=cutout_augment)
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
self.normalize = normalize
self.log = log
#TORCHSCRIPT: Check whether or not we need this
self.dither = dither
self.frame_splicing = frame_splicing
self.n_filt = n_filt
self.preemph = preemph
highfreq = highfreq or sample_rate / 2
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
filterbanks = torch.tensor(
librosa.filters.mel(sample_rate, self.n_fft, n_mels=n_filt,
fmin=lowfreq, fmax=highfreq),
dtype=torch.float).unsqueeze(0)
# torchscript
self.register_buffer("fb", filterbanks)
self.register_buffer("window", window_tensor)
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
# do stft
# TORCHSCRIPT: center removed due to bug
def stft(self, x):
return torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length,
window=self.window.to(dtype=torch.float))
@torch.no_grad()
def calculate_features(self, x, seq_len):
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if self.preemph is not None:
x = torch.cat(
(x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]), dim=1)
x = self.stft(x)
# get power spectrum
x = x.pow(2).sum(-1)
# dot with filterbank energies
x = torch.matmul(self.fb.to(x.dtype), x)
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
raise ValueError('Frame splicing not supported')
# normalize if required
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch,
# pad to multiple of `pad_align` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len, dtype=seq_len.dtype, device=x.device)
mask = mask.expand(x.size(0), max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1), 0)
# TORCHSCRIPT: Is this del important? It breaks scripting
# del mask
return x.to(dtype), seq_len
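# A minimal usage sketch with hypothetical values, assuming a padded batch of
# raw 16 kHz waveforms and their true lengths in samples.
import torch
featurizer = FilterbankFeatures(sample_rate=16000, n_filt=64, normalize='per_feature')
audio = torch.randn(2, 16000)                  # two 1-second waveforms
audio_lens = torch.tensor([16000, 12800])      # valid lengths in samples
feats, feat_lens = featurizer.calculate_features(audio, audio_lens)
# feats has shape (batch, n_filt, frames) and is zero-masked past feat_lens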

View file

@ -0,0 +1,300 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import os
import re
from collections import OrderedDict
from apex import amp
import torch
import torch.distributed as dist
from .metrics import word_error_rate
def print_once(msg):
if not dist.is_initialized() or dist.get_rank() == 0:
print(msg)
def add_ctc_blank(symbols):
return symbols + ['<BLANK>']
def ctc_decoder_predictions_tensor(tensor, labels):
"""
Takes output of greedy ctc decoder and performs ctc decoding algorithm to
remove duplicates and special symbol. Returns prediction
Args:
tensor: model output tensor
        labels: A list of labels
Returns:
prediction
"""
blank_id = len(labels) - 1
hypotheses = []
labels_map = {i: labels[i] for i in range(len(labels))}
prediction_cpu_tensor = tensor.long().cpu()
# iterate over batch
for ind in range(prediction_cpu_tensor.shape[0]):
prediction = prediction_cpu_tensor[ind].numpy().tolist()
# CTC decoding procedure
decoded_prediction = []
previous = len(labels) - 1 # id of a blank symbol
for p in prediction:
if (p != previous or previous == blank_id) and p != blank_id:
decoded_prediction.append(p)
previous = p
hypothesis = ''.join([labels_map[c] for c in decoded_prediction])
hypotheses.append(hypothesis)
return hypotheses
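# A toy decoding sketch (hypothetical values). With labels ['a', 'b', 'c'] plus
# the blank appended by add_ctc_blank, the greedy sequence a a <blank> a b
# collapses to "aab": repeats are merged unless separated by a blank.
import torch
labels = ['a', 'b', 'c', '<BLANK>']
greedy_ids = torch.tensor([[0, 0, 3, 0, 1]])   # argmax over the model logits
print(ctc_decoder_predictions_tensor(greedy_ids, labels))   # ['aab']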
def greedy_wer(preds, tgt, tgt_lens, labels):
"""
    Takes output of the greedy CTC decoder, removes duplicates and the blank
    symbol, and computes the WER against the reference transcripts.
    Args:
        preds: predictions tensor from the greedy decoder
        tgt: targets tensor
        tgt_lens: target lengths tensor
        labels: A list of labels
    Returns:
        word error rate, an example hypothesis, an example reference
"""
with torch.no_grad():
references = gather_transcripts([tgt], [tgt_lens], labels)
hypotheses = ctc_decoder_predictions_tensor(preds, labels)
wer, _, _ = word_error_rate(hypotheses, references)
return wer, hypotheses[0], references[0]
def gather_losses(losses_list):
return [torch.mean(torch.stack(losses_list))]
def gather_predictions(predictions_list, labels):
results = []
for prediction in predictions_list:
results += ctc_decoder_predictions_tensor(prediction, labels=labels)
return results
def gather_transcripts(transcript_list, transcript_len_list, labels):
results = []
labels_map = {i: labels[i] for i in range(len(labels))}
# iterate over workers
for txt, lens in zip(transcript_list, transcript_len_list):
for t, l in zip(txt.long().cpu(), lens.long().cpu()):
t = list(t.numpy())
results.append(''.join([labels_map[c] for c in t[:l]]))
return results
def process_evaluation_batch(tensors, global_vars, labels):
"""
    Processes results of an iteration and saves them in global_vars
    Args:
        tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcripts, and output
        global_vars: dictionary where processed results of the iteration are saved
labels: A list of labels
"""
for kv, v in tensors.items():
if kv.startswith('loss'):
global_vars['EvalLoss'] += gather_losses(v)
elif kv.startswith('predictions'):
global_vars['preds'] += gather_predictions(v, labels)
elif kv.startswith('transcript_length'):
transcript_len_list = v
elif kv.startswith('transcript'):
transcript_list = v
elif kv.startswith('output'):
global_vars['logits'] += v
global_vars['txts'] += gather_transcripts(
transcript_list, transcript_len_list, labels)
def process_evaluation_epoch(aggregates, tag=None):
"""
    Processes results from each worker at the end of evaluation and combines them into the final result
    Args:
        aggregates: dictionary containing information from the entire evaluation
    Returns:
wer: final word error rate
loss: final loss
"""
if 'losses' in aggregates:
eloss = torch.mean(torch.stack(aggregates['losses'])).item()
else:
eloss = None
hypotheses = aggregates['preds']
references = aggregates['txts']
wer, scores, num_words = word_error_rate(hypotheses, references)
multi_gpu = dist.is_initialized()
if multi_gpu:
if eloss is not None:
eloss /= dist.get_world_size()
eloss_tensor = torch.tensor(eloss).cuda()
dist.all_reduce(eloss_tensor)
eloss = eloss_tensor.item()
scores_tensor = torch.tensor(scores).cuda()
dist.all_reduce(scores_tensor)
scores = scores_tensor.item()
num_words_tensor = torch.tensor(num_words).cuda()
dist.all_reduce(num_words_tensor)
num_words = num_words_tensor.item()
wer = scores * 1.0 / num_words
return wer, eloss
def num_weights(module):
return sum(p.numel() for p in module.parameters() if p.requires_grad)
def convert_v1_state_dict(state_dict):
rules = [
('^jasper_encoder.encoder.', 'encoder.layers.'),
('^jasper_decoder.decoder_layers.', 'decoder.layers.'),
]
ret = {}
for k, v in state_dict.items():
if k.startswith('acoustic_model.'):
continue
if k.startswith('audio_preprocessor.'):
continue
for pattern, to in rules:
k = re.sub(pattern, to, k)
ret[k] = v
return ret
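# A hedged usage sketch (the checkpoint path is hypothetical): converting a
# pre-refactor ("v1") state dict renames encoder/decoder keys and drops the
# acoustic_model/audio_preprocessor entries.
import torch
ckpt = torch.load('jasper_v1_checkpoint.pt', map_location='cpu')
new_state_dict = convert_v1_state_dict(ckpt['state_dict'])
# e.g. 'jasper_encoder.encoder.0.conv.weight' -> 'encoder.layers.0.conv.weight'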
class Checkpointer(object):
def __init__(self, save_dir, model_name, keep_milestones=[100,200,300],
use_amp=False):
self.save_dir = save_dir
self.keep_milestones = keep_milestones
self.use_amp = use_amp
self.model_name = model_name
tracked = [
            (int(re.search(r'epoch(\d+)_', f).group(1)), f)
for f in glob.glob(f'{save_dir}/{self.model_name}_epoch*_checkpoint.pt')]
tracked = sorted(tracked, key=lambda t: t[0])
self.tracked = OrderedDict(tracked)
def save(self, model, ema_model, optimizer, epoch, step, best_wer,
is_best=False):
"""Saves model checkpoint for inference/resuming training.
Args:
model: the model, optionally wrapped by DistributedDataParallel
ema_model: model with averaged weights, can be None
optimizer: optimizer
epoch (int): epoch during which the model is saved
step (int): number of steps since beginning of training
best_wer (float): lowest recorded WER on the dev set
is_best (bool, optional): set name of checkpoint to 'best'
and overwrite the previous one
"""
rank = 0
if dist.is_initialized():
dist.barrier()
rank = dist.get_rank()
if rank != 0:
return
# Checkpoint already saved
if not is_best and epoch in self.tracked:
return
unwrap_ddp = lambda model: getattr(model, 'module', model)
state = {
'epoch': epoch,
'step': step,
'best_wer': best_wer,
'state_dict': unwrap_ddp(model).state_dict(),
'ema_state_dict': unwrap_ddp(ema_model).state_dict() if ema_model is not None else None,
'optimizer': optimizer.state_dict(),
'amp': amp.state_dict() if self.use_amp else None,
}
if is_best:
fpath = os.path.join(
self.save_dir, f"{self.model_name}_best_checkpoint.pt")
else:
fpath = os.path.join(
self.save_dir, f"{self.model_name}_epoch{epoch}_checkpoint.pt")
print_once(f"Saving {fpath}...")
torch.save(state, fpath)
if not is_best:
# Remove old checkpoints; keep milestones and the last two
self.tracked[epoch] = fpath
for epoch in set(list(self.tracked)[:-2]) - set(self.keep_milestones):
try:
os.remove(self.tracked[epoch])
except:
pass
del self.tracked[epoch]
    def last_checkpoint(self):
        tracked = list(self.tracked.values())
        if len(tracked) >= 1:
            try:
                torch.load(tracked[-1], map_location='cpu')
                return tracked[-1]
            except Exception:
                print_once(f'Last checkpoint {tracked[-1]} appears corrupted.')
        # Fall back to the previous checkpoint if the newest one is unreadable
        if len(tracked) >= 2:
            return tracked[-2]
        return None
def load(self, fpath, model, ema_model, optimizer, meta):
print_once(f'Loading model from {fpath}')
checkpoint = torch.load(fpath, map_location="cpu")
unwrap_ddp = lambda model: getattr(model, 'module', model)
state_dict = convert_v1_state_dict(checkpoint['state_dict'])
unwrap_ddp(model).load_state_dict(state_dict, strict=True)
if ema_model is not None:
if checkpoint.get('ema_state_dict') is not None:
key = 'ema_state_dict'
else:
key = 'state_dict'
print_once('WARNING: EMA weights not found in the checkpoint.')
print_once('WARNING: Initializing EMA model with regular params.')
state_dict = convert_v1_state_dict(checkpoint[key])
unwrap_ddp(ema_model).load_state_dict(state_dict, strict=True)
optimizer.load_state_dict(checkpoint['optimizer'])
if self.use_amp:
amp.load_state_dict(checkpoint['amp'])
meta['start_epoch'] = checkpoint.get('epoch')
meta['best_wer'] = checkpoint.get('best_wer', meta['best_wer'])
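# A minimal usage sketch (hypothetical names and paths). The Checkpointer keeps
# the milestone epochs plus the two most recent checkpoints.
ckpt = Checkpointer(save_dir='results', model_name='Jasper',
                    keep_milestones=[100, 200, 300])
# inside the training loop:
#   ckpt.save(model, ema_model, optimizer, epoch, step, best_wer)
# to resume:
#   last = ckpt.last_checkpoint()
#   if last is not None:
#       ckpt.load(last, model, ema_model, optimizer,
#                 meta={'best_wer': float('inf'), 'start_epoch': 0})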

View file

@ -12,12 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List
def __levenshtein(a, b):
"""Calculates the Levenshtein distance between two sequences."""
def __levenshtein(a: List, b: List) -> int:
"""Calculates the Levenshtein distance between a and b.
"""
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
@ -37,28 +35,18 @@ def __levenshtein(a: List, b: List) -> int:
return current[n]
def word_error_rate(hypotheses: List[str], references: List[str]) -> float:
"""
    Computes average Word Error Rate between two texts represented as
    corresponding lists of strings. Hypotheses and references must have the same length.
def word_error_rate(hypotheses, references):
"""Computes average Word Error Rate (WER) between two text lists."""
Args:
hypotheses: list of hypotheses
references: list of references
Returns:
(float) average word error rate
"""
scores = 0
words = 0
len_diff = len(references) - len(hypotheses)
len_diff = len(references) - len(hypotheses)
if len_diff > 0:
raise ValueError("In word error rate calculation, hypotheses and reference"
" lists must have the same number of elements. But I got:"
"{0} and {1} correspondingly".format(len(hypotheses), len(references)))
raise ValueError("Uneqal number of hypthoses and references: "
"{0} and {1}".format(len(hypotheses), len(references)))
elif len_diff < 0:
hypotheses = hypotheses[:len_diff]
for h, r in zip(hypotheses, references):
h_list = h.split()
r_list = r.split()

View file

@ -16,6 +16,51 @@ import torch
from torch.optim import Optimizer
import math
def lr_policy(step, epoch, initial_lr, optimizer, steps_per_epoch, warmup_epochs,
hold_epochs, num_epochs=None, policy='linear', min_lr=1e-5,
exp_gamma=None):
"""
learning rate decay
Args:
initial_lr: base learning rate
step: current iteration number
N: total number of iterations over which learning rate is decayed
lr_steps: list of steps to apply exp_gamma
"""
warmup_steps = warmup_epochs * steps_per_epoch
hold_steps = hold_epochs * steps_per_epoch
if policy == 'legacy':
assert num_epochs is not None
tot_steps = num_epochs * steps_per_epoch
if step < warmup_steps:
a = (step + 1) / (warmup_steps + 1)
elif step < warmup_steps + hold_steps:
a = 1.0
else:
a = (((tot_steps - step)
/ (tot_steps - warmup_steps - hold_steps)) ** 2)
elif policy == 'exponential':
assert exp_gamma is not None
if step < warmup_steps:
a = (step + 1) / (warmup_steps + 1)
elif step < warmup_steps + hold_steps:
a = 1.0
else:
a = exp_gamma ** (epoch - warmup_epochs - hold_epochs)
else:
        raise ValueError(f'Unknown learning rate policy: {policy}')
new_lr = max(a * initial_lr, min_lr)
for param_group in optimizer.param_groups:
param_group['lr'] = new_lr
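# A small sketch of the schedule (hypothetical values): with the 'exponential'
# policy the rate warms up linearly, holds, then decays by exp_gamma each epoch;
# lr_policy writes the new rate directly into optimizer.param_groups.
import torch
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.01)
for epoch in range(5):
    for it in range(100):
        step = epoch * 100 + it
        lr_policy(step, epoch, initial_lr=0.01, optimizer=opt,
                  steps_per_epoch=100, warmup_epochs=1, hold_epochs=1,
                  policy='exponential', exp_gamma=0.95)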
class AdamW(Optimizer):
"""Implements AdamW algorithm.
@ -114,6 +159,7 @@ class AdamW(Optimizer):
p.data.add_(torch.mul(p.data, group['weight_decay']).addcdiv_(1, exp_avg, denom), alpha=-step_size)
return loss
class Novograd(Optimizer):
"""

View file

@ -0,0 +1,159 @@
import atexit
import glob
import os
import re
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter
import dllogger
from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
tb_loggers = {}
class TBLogger:
"""
    dummies: adds empty 'aaa'/'zzz' plots that stretch the screen so the
    legend always fits for the other plots
"""
def __init__(self, enabled, log_dir, name, interval=1, dummies=True):
self.enabled = enabled
self.interval = interval
self.cache = {}
if self.enabled:
self.summary_writer = SummaryWriter(
log_dir=os.path.join(log_dir, name),
flush_secs=120, max_queue=200)
atexit.register(self.summary_writer.close)
if dummies:
for key in ('aaa', 'zzz'):
self.summary_writer.add_scalar(key, 0.0, 1)
def log(self, step, data):
for k, v in data.items():
self.log_value(step, k, v.item() if type(v) is torch.Tensor else v)
def log_value(self, step, key, val, stat='mean'):
if self.enabled:
if key not in self.cache:
self.cache[key] = []
self.cache[key].append(val)
if len(self.cache[key]) == self.interval:
agg_val = getattr(np, stat)(self.cache[key])
self.summary_writer.add_scalar(key, agg_val, step)
del self.cache[key]
def log_grads(self, step, model):
if self.enabled:
norms = [p.grad.norm().item() for p in model.parameters()
if p.grad is not None]
for stat in ('max', 'min', 'mean'):
self.log_value(step, f'grad_{stat}', getattr(np, stat)(norms),
stat=stat)
def unique_log_fpath(log_fpath):
if not os.path.isfile(log_fpath):
return log_fpath
# Avoid overwriting old logs
    saved = sorted([int(re.search(r'\.(\d+)', f).group(1))
for f in glob.glob(f'{log_fpath}.*')])
log_num = (saved[-1] if saved else 0) + 1
return f'{log_fpath}.{log_num}'
def stdout_step_format(step):
if isinstance(step, str):
return step
fields = []
if len(step) > 0:
fields.append("epoch {:>4}".format(step[0]))
if len(step) > 1:
fields.append("iter {:>4}".format(step[1]))
if len(step) > 2:
fields[-1] += "/{}".format(step[2])
return " | ".join(fields)
def stdout_metric_format(metric, metadata, value):
name = metadata.get("name", metric + " : ")
unit = metadata.get("unit", None)
format = f'{{{metadata.get("format", "")}}}'
fields = [name, format.format(value) if value is not None else value, unit]
fields = [f for f in fields if f is not None]
return "| " + " ".join(fields)
def init_log(args):
enabled = (args.local_rank == 0)
if enabled:
fpath = args.log_file or os.path.join(args.output_dir, 'nvlog.json')
backends = [JSONStreamBackend(Verbosity.DEFAULT,
unique_log_fpath(fpath)),
StdOutBackend(Verbosity.VERBOSE,
step_format=stdout_step_format,
metric_format=stdout_metric_format)]
else:
backends = []
dllogger.init(backends=backends)
dllogger.metadata("train_lrate", {"name": "lrate", "format": ":>3.2e"})
for id_, pref in [('train', ''), ('train_avg', 'avg train '),
('dev', ' avg dev '), ('dev_ema', ' EMA dev ')]:
dllogger.metadata(f"{id_}_loss",
{"name": f"{pref}loss", "format": ":>7.2f"})
dllogger.metadata(f"{id_}_wer",
{"name": f"{pref}wer", "format": ":>6.2f"})
dllogger.metadata(f"{id_}_throughput",
{"name": f"{pref}utts/s", "format": ":>5.0f"})
dllogger.metadata(f"{id_}_took",
{"name": "took", "unit": "s", "format": ":>5.2f"})
tb_subsets = ['train', 'dev', 'dev_ema'] if args.ema else ['train', 'dev']
global tb_loggers
tb_loggers = {s: TBLogger(enabled, args.output_dir, name=s)
for s in tb_subsets}
log_parameters(vars(args), tb_subset='train')
def log(step, tb_total_steps=None, subset='train', data={}):
if tb_total_steps is not None:
tb_loggers[subset].log(tb_total_steps, data)
if subset != '':
data = {f'{subset}_{key}': v for key,v in data.items()}
dllogger.log(step, data=data)
def log_grads_tb(tb_total_steps, grads, tb_subset='train'):
tb_loggers[tb_subset].log_grads(tb_total_steps, grads)
def log_parameters(data, verbosity=0, tb_subset=None):
for k,v in data.items():
dllogger.log(step="PARAMETER", data={k:v}, verbosity=verbosity)
if tb_subset is not None and tb_loggers[tb_subset].enabled:
tb_data = {k:v for k,v in data.items()
if type(v) in (str, bool, int, float)}
tb_loggers[tb_subset].summary_writer.add_hparams(tb_data, {})
def flush_log():
dllogger.flush()
for tbl in tb_loggers.values():
if tbl.enabled:
tbl.summary_writer.flush()
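# A hedged usage sketch for this logging module, assuming an argparse namespace
# with the fields init_log() reads and an existing output_dir (values below are
# hypothetical).
from argparse import Namespace
args = Namespace(local_rank=0, log_file=None, output_dir='results', ema=0)
init_log(args)
log((1, 10, 100), tb_total_steps=10, subset='train',
    data={'loss': 1.23, 'lrate': 1e-4})
flush_log()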

View file

@ -0,0 +1,32 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """
import re
import string
from . import cleaners
def _clean_text(text, cleaner_names, *args):
for name in cleaner_names:
cleaner = getattr(cleaners, name)
if not cleaner:
raise Exception('Unknown cleaner: %s' % name)
text = cleaner(text, *args)
return text
def punctuation_map(labels):
# Punctuation to remove
punctuation = string.punctuation
punctuation = punctuation.replace("+", "")
punctuation = punctuation.replace("&", "")
# TODO We might also want to consider:
# @ -> at
# # -> number, pound, hashtag
# ~ -> tilde
# _ -> underscore
# % -> percent
    # If a punctuation symbol is inside our vocab, we do not remove it from the text
for l in labels:
punctuation = punctuation.replace(l, "")
# Turn all punctuation to whitespace
table = str.maketrans(punctuation, " " * len(punctuation))
return table
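# A short usage sketch (hypothetical labels): the table maps punctuation that is
# not part of the vocabulary to spaces, leaving '+', '&' and in-vocab symbols alone.
labels = [" ", "a", "b", "c", "'"]
table = punctuation_map(labels)
print("it's a test, isn't it?".translate(table))   # "it's a test  isn't it "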

View file

@ -1,194 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,203 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = false
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,203 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = false
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = false
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,204 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -1,204 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 2
cutout_y_regions = 2
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform
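# A hedged sketch of how a YAML config like the one above could be consumed;
# the filename is hypothetical and the key layout simply mirrors the file shown.
import yaml
with open('jasper_speedp-online_speca.yaml') as f:
    cfg = yaml.safe_load(f)
print(cfg['labels'])                              # character vocabulary
print(len(cfg['jasper']['encoder']['blocks']))    # encoder blocks (YAML anchors expanded)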

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: false
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 0
max_freq: 20
time_masks: 0
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: false
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,139 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: false
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: false
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,144 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
speed_perturbation:
discrete: true
min_rate: 0.9
max_rate: 1.1
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 0
max_freq: 20
time_masks: 0
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,144 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
speed_perturbation:
discrete: true
min_rate: 0.9
max_rate: 1.1
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -0,0 +1,144 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: "Jasper"
labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
"n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
input_val:
audio_dataset: &val_dataset
sample_rate: &sample_rate 16000
trim_silence: true
normalize_transcripts: true
filterbank_features: &val_features
normalize: per_feature
sample_rate: *sample_rate
window_size: 0.02
window_stride: 0.01
window: hann
n_filt: &n_filt 64
n_fft: 512
frame_splicing: &frame_splicing 1
dither: 0.00001
pad_align: 16
# For training we keep samples < 16.7s and apply augmentation
input_train:
audio_dataset:
<<: *val_dataset
max_duration: 16.7
ignore_offline_speed_perturbation: true
speed_perturbation:
discrete: false
min_rate: 0.85
max_rate: 1.15
filterbank_features:
<<: *val_features
max_duration: 16.7
spec_augment:
freq_masks: 2
max_freq: 20
time_masks: 2
max_time: 75
jasper:
encoder:
init: xavier_uniform
in_feats: *n_filt
frame_splicing: *frame_splicing
activation: relu
use_conv_masks: true
blocks:
- &Conv1
filters: 256
repeat: 1
kernel_size: [11]
stride: [2]
dilation: [1]
dropout: 0.2
residual: false
- &B1
filters: 256
repeat: 5
kernel_size: [11]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B1
- &B2
filters: 384
repeat: 5
kernel_size: [13]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B2
- &B3
filters: 512
repeat: 5
kernel_size: [17]
stride: [1]
dilation: [1]
dropout: 0.2
residual: true
residual_dense: true
- *B3
- &B4
filters: 640
repeat: 5
kernel_size: [21]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B4
- &B5
filters: 768
repeat: 5
kernel_size: [25]
stride: [1]
dilation: [1]
dropout: 0.3
residual: true
residual_dense: true
- *B5
- &Conv2
filters: 896
repeat: 1
kernel_size: [29]
stride: [1]
dilation: [2]
dropout: 0.4
residual: false
- &Conv3
filters: &enc_feats 1024
repeat: 1
kernel_size: [1]
stride: [1]
dilation: [1]
dropout: 0.4
residual: false
decoder:
in_feats: *enc_feats
init: xavier_uniform

View file

@ -1,266 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file contains classes and functions related to data loading
"""
import torch
import numpy as np
import math
from torch.utils.data import Dataset, Sampler
import torch.distributed as dist
from parts.manifest import Manifest
from parts.features import WaveformFeaturizer
class DistributedBucketBatchSampler(Sampler):
def __init__(self, dataset, batch_size, num_replicas=None, rank=None):
"""Distributed sampler that buckets samples with similar length to minimize padding,
similar concept as pytorch BucketBatchSampler https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.samplers.html#torchnlp.samplers.BucketBatchSampler
Args:
dataset: Dataset used for sampling.
batch_size: data batch size
num_replicas (optional): Number of processes participating in
distributed training.
rank (optional): Rank of the current process within num_replicas.
"""
if num_replicas is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
num_replicas = dist.get_world_size()
if rank is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
rank = dist.get_rank()
self.dataset = dataset
self.dataset_size = len(dataset)
self.num_replicas = num_replicas
self.rank = rank
self.epoch = 0
self.batch_size = batch_size
self.tile_size = batch_size * self.num_replicas
self.num_buckets = 6
self.bucket_size = self.round_up_to(math.ceil(self.dataset_size / self.num_buckets), self.tile_size)
self.index_count = self.round_up_to(self.dataset_size, self.tile_size)
self.num_samples = self.index_count // self.num_replicas
def round_up_to(self, x, mod):
return (x + mod - 1) // mod * mod
def __iter__(self):
g = torch.Generator()
g.manual_seed(self.epoch)
indices = np.arange(self.index_count) % self.dataset_size
for bucket in range(self.num_buckets):
bucket_start = self.bucket_size * bucket
bucket_end = min(bucket_start + self.bucket_size, self.index_count)
indices[bucket_start:bucket_end] = indices[bucket_start:bucket_end][torch.randperm(bucket_end - bucket_start, generator=g)]
tile_indices = torch.randperm(self.index_count // self.tile_size, generator=g)
for tile_index in tile_indices:
start_index = self.tile_size * tile_index + self.batch_size * self.rank
end_index = start_index + self.batch_size
yield indices[start_index:end_index]
def __len__(self):
return self.num_samples
def set_epoch(self, epoch):
self.epoch = epoch
class data_prefetcher():
def __init__(self, loader):
self.loader = iter(loader)
self.stream = torch.cuda.Stream()
self.preload()
def preload(self):
try:
self.next_input = next(self.loader)
except StopIteration:
self.next_input = None
return
with torch.cuda.stream(self.stream):
self.next_input = [ x.cuda(non_blocking=True) for x in self.next_input]
def __next__(self):
torch.cuda.current_stream().wait_stream(self.stream)
input = self.next_input
self.preload()
return input
def next(self):
return self.__next__()
def __iter__(self):
return self
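# A minimal usage sketch (the dataset name is hypothetical; seq_collate_fn is
# defined below). The prefetcher stages the next batch on a side CUDA stream so
# host-to-device copies overlap with compute; iteration ends when it yields None.
loader = torch.utils.data.DataLoader(dataset, batch_size=16,
                                     collate_fn=seq_collate_fn, pin_memory=True)
prefetcher = data_prefetcher(loader)
batch = prefetcher.next()
while batch is not None:
    audio, audio_lens, transcript, transcript_lens = batch   # already on GPU
    # ... training / inference step ...
    batch = prefetcher.next()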
def seq_collate_fn(batch):
"""batches samples and returns as tensors
Args:
batch : list of samples
Returns
batches of tensors
"""
batch_size = len(batch)
def _find_max_len(lst, ind):
max_len = -1
for item in lst:
if item[ind].size(0) > max_len:
max_len = item[ind].size(0)
return max_len
max_audio_len = _find_max_len(batch, 0)
max_transcript_len = _find_max_len(batch, 2)
batched_audio_signal = torch.zeros(batch_size, max_audio_len)
batched_transcript = torch.zeros(batch_size, max_transcript_len)
audio_lengths = []
transcript_lengths = []
for ind, sample in enumerate(batch):
batched_audio_signal[ind].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
audio_lengths.append(sample[1])
batched_transcript[ind].narrow(0, 0, sample[2].size(0)).copy_(sample[2])
transcript_lengths.append(sample[3])
return batched_audio_signal, torch.stack(audio_lengths), batched_transcript, \
torch.stack(transcript_lengths)
class AudioToTextDataLayer:
"""Data layer with data loader
"""
def __init__(self, **kwargs):
self._device = torch.device("cuda")
featurizer_config = kwargs['featurizer_config']
pad_to_max = kwargs.get('pad_to_max', False)
perturb_config = kwargs.get('perturb_config', None)
manifest_filepath = kwargs['manifest_filepath']
dataset_dir = kwargs['dataset_dir']
labels = kwargs['labels']
batch_size = kwargs['batch_size']
drop_last = kwargs.get('drop_last', False)
shuffle = kwargs.get('shuffle', True)
min_duration = featurizer_config.get('min_duration', 0.1)
max_duration = featurizer_config.get('max_duration', None)
normalize_transcripts = kwargs.get('normalize_transcripts', True)
trim_silence = kwargs.get('trim_silence', False)
multi_gpu = kwargs.get('multi_gpu', False)
sampler_type = kwargs.get('sampler', 'default')
speed_perturbation = featurizer_config.get('speed_perturbation', False)
        sort_by_duration = (sampler_type == 'bucket')
self._featurizer = WaveformFeaturizer.from_config(featurizer_config, perturbation_configs=perturb_config)
self._dataset = AudioDataset(
dataset_dir=dataset_dir,
manifest_filepath=manifest_filepath,
labels=labels, blank_index=len(labels),
sort_by_duration=sort_by_duration,
pad_to_max=pad_to_max,
featurizer=self._featurizer, max_duration=max_duration,
min_duration=min_duration, normalize=normalize_transcripts,
trim=trim_silence, speed_perturbation=speed_perturbation)
print('sort_by_duration', sort_by_duration)
if not multi_gpu:
self.sampler = None
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
batch_size=batch_size,
collate_fn=lambda b: seq_collate_fn(b),
drop_last=drop_last,
shuffle=shuffle if self.sampler is None else False,
num_workers=4,
pin_memory=True,
sampler=self.sampler
)
elif sampler_type == 'bucket':
self.sampler = DistributedBucketBatchSampler(self._dataset, batch_size=batch_size)
print("DDBucketSampler")
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
collate_fn=lambda b: seq_collate_fn(b),
num_workers=4,
pin_memory=True,
batch_sampler=self.sampler
)
elif sampler_type == 'default':
self.sampler = torch.utils.data.distributed.DistributedSampler(self._dataset)
print("DDSampler")
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
batch_size=batch_size,
collate_fn=lambda b: seq_collate_fn(b),
drop_last=drop_last,
shuffle=shuffle if self.sampler is None else False,
num_workers=4,
pin_memory=True,
sampler=self.sampler
)
else:
raise RuntimeError("Sampler {} not supported".format(sampler_type))
def __len__(self):
return len(self._dataset)
@property
def data_iterator(self):
return self._dataloader
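# Construction sketch (illustrative; values are placeholders): the layer is
# driven entirely by keyword arguments, mirroring how it is built from the model
# configuration elsewhere in the repository:
#   data_layer = AudioToTextDataLayer(
#       dataset_dir='/datasets/LibriSpeech',
#       featurizer_config=featurizer_config,   # e.g. the model's 'input_eval' section
#       manifest_filepath='librispeech-dev-clean-wav.json',
#       labels=dataset_vocab,
#       batch_size=16,
#       shuffle=False,
#       multi_gpu=False)
#   for batch in data_layer.data_iterator:
#       ...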
class AudioDataset(Dataset):
def __init__(self, dataset_dir, manifest_filepath, labels, featurizer, max_duration=None, pad_to_max=False,
min_duration=None, blank_index=0, max_utts=0, normalize=True, sort_by_duration=False,
trim=False, speed_perturbation=False):
"""Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations
(in seconds). Each entry is a different audio sample.
Args:
dataset_dir: absolute path to dataset folder
manifest_filepath: relative path from dataset folder to manifest json as described above. Can be comma-separated paths.
labels: String containing all the possible characters to map to
featurizer: Initialized featurizer class that converts paths of audio to feature tensors
max_duration: If audio exceeds this length, do not include in dataset
min_duration: If audio is less than this length, do not include in dataset
pad_to_max: if specified, input sequences to the DNN model will be padded to max_duration
blank_index: blank index for ctc loss / decoder
max_utts: Limit number of utterances
normalize: whether to normalize transcript text
sort_by_duration: whether or not to sort sequences by increasing duration
trim: if specified trims leading and trailing silence from an audio signal.
speed_perturbation: whether the data contains speed perturbation
"""
m_paths = manifest_filepath.split(',')
self.manifest = Manifest(dataset_dir, m_paths, labels, blank_index, pad_to_max=pad_to_max,
max_duration=max_duration,
sort_by_duration=sort_by_duration,
min_duration=min_duration, max_utts=max_utts,
normalize=normalize, speed_perturbation=speed_perturbation)
self.featurizer = featurizer
self.blank_index = blank_index
self.trim = trim
print(
"Dataset loaded with {0:.2f} hours. Filtered {1:.2f} hours.".format(
self.manifest.duration / 3600,
self.manifest.filtered_duration / 3600))
def __getitem__(self, index):
sample = self.manifest[index]
rn_indx = np.random.randint(len(sample['audio_filepath']))
duration = sample['audio_duration'][rn_indx] if 'audio_duration' in sample else 0
offset = sample['offset'] if 'offset' in sample else 0
features = self.featurizer.process(sample['audio_filepath'][rn_indx],
offset=offset, duration=duration,
trim=self.trim)
return features, torch.tensor(features.shape[0]).int(), \
torch.tensor(sample["transcript"]), torch.tensor(
len(sample["transcript"])).int()
def __len__(self):
return len(self.manifest)
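# Note (an interpretation of the code above, not part of the original file):
# after Manifest parsing each sample exposes an 'audio_filepath' list (one path
# is picked at random per access, presumably to sample among pre-generated
# speed-perturbed copies), optional 'audio_duration' and 'offset' fields, and a
# 'transcript' already encoded as label indices. __getitem__ returns the 4-tuple
#   (features, feature_length, transcript_indices, transcript_length)
# which matches what seq_collate_fn above expects.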

View file

@ -1,95 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# Default setting is building on nvidia/cuda:10.1-devel-ubuntu18.04
ARG BASE_IMAGE=nvidia/cuda:10.1-devel-ubuntu18.04
FROM ${BASE_IMAGE}
# Default to use Python3. Allowed values are "2" and "3".
ARG PYVER=3
# Ensure apt-get won't prompt for selecting options
ENV DEBIAN_FRONTEND=noninteractive
ENV PYVER=$PYVER
RUN PYSFX=`[ "$PYVER" != "2" ] && echo "$PYVER" || echo ""` && \
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common \
autoconf \
automake \
build-essential \
cmake \
curl \
git \
libopencv-dev \
libopencv-core-dev \
libssl-dev \
libtool \
pkg-config \
python${PYSFX} \
python${PYSFX}-pip \
python${PYSFX}-dev && \
pip${PYSFX} install --upgrade setuptools wheel
RUN PYSFX=`[ "$PYVER" != "2" ] && echo "$PYVER" || echo ""` && \
pip${PYSFX} install --upgrade grpcio-tools
# Build expects "python" executable (not python3).
RUN rm -f /usr/bin/python && \
ln -s /usr/bin/python$PYVER /usr/bin/python
# Build the client library and examples
WORKDIR /workspace
COPY VERSION .
COPY build build
COPY src/clients src/clients
COPY src/core src/core
RUN cd build && \
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX:PATH=/workspace/install && \
make -j16 trtis-clients
RUN cd install && \
export VERSION=`cat /workspace/VERSION` && \
tar zcf /workspace/v$VERSION.clients.tar.gz *
# For CI testing need to install a test script.
COPY qa/L0_client_tar/test.sh /tmp/test.sh
# Install an image needed by the quickstart and other documentation.
COPY qa/images/mug.jpg images/mug.jpg
# Install the dependencies needed to run the client examples. These
# are not needed for building but including them allows this image to
# be used to run the client examples. The special upgrade and handling
# of pip is needed to get numpy to install correctly with python2 on
# ubuntu 16.04.
RUN python -m pip install --user --upgrade pip && \
python -m pip install --upgrade install/python/tensorrtserver-*.whl numpy pillow
ENV PATH //workspace/install/bin:${PATH}
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}

@ -1 +0,0 @@
Subproject commit a1f3860ba65c0fd8f2be3adfcab2673efd039348

View file

@ -1,207 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from enum import Enum
from metrics import word_error_rate
def print_once(msg):
if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
print(msg)
def add_ctc_labels(labels):
if not isinstance(labels, list):
raise ValueError("labels must be a list of symbols")
labels.append("<BLANK>")
return labels
def __ctc_decoder_predictions_tensor(tensor, labels):
"""
Takes output of greedy ctc decoder and performs ctc decoding algorithm to
remove duplicates and special symbols. Returns predictions
Args:
tensor: model output tensor
labels: A list of labels
Returns:
prediction
"""
blank_id = len(labels) - 1
hypotheses = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
prediction_cpu_tensor = tensor.long().cpu()
# iterate over batch
for ind in range(prediction_cpu_tensor.shape[0]):
prediction = prediction_cpu_tensor[ind].numpy().tolist()
# CTC decoding procedure
decoded_prediction = []
previous = len(labels) - 1 # id of a blank symbol
for p in prediction:
if (p != previous or previous == blank_id) and p != blank_id:
decoded_prediction.append(p)
previous = p
hypothesis = ''.join([labels_map[c] for c in decoded_prediction])
hypotheses.append(hypothesis)
return hypotheses
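# Worked example (illustrative): with labels = ['a', 'b', ..., '<BLANK>'], the
# blank symbol is the last index. A per-frame argmax sequence such as
#   [a, a, <BLANK>, b, b, <BLANK>, a]
# first drops repeated symbols, then drops blanks, yielding the hypothesis "aba".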
def monitor_asr_train_progress(tensors: list, labels: list):
"""
Takes output of greedy ctc decoder and performs ctc decoding algorithm to
remove duplicates and special symbols. Prints WER and prediction examples to screen
Args:
tensors: A list of 3 tensors (predictions, targets, target_lengths)
labels: A list of labels
Returns:
word error rate
"""
references = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
with torch.no_grad():
targets_cpu_tensor = tensors[1].long().cpu()
tgt_lenths_cpu_tensor = tensors[2].long().cpu()
# iterate over batch
for ind in range(targets_cpu_tensor.shape[0]):
tgt_len = tgt_lenths_cpu_tensor[ind].item()
target = targets_cpu_tensor[ind][:tgt_len].numpy().tolist()
reference = ''.join([labels_map[c] for c in target])
references.append(reference)
hypotheses = __ctc_decoder_predictions_tensor(tensors[0], labels=labels)
tag = "training_batch_WER"
wer, _, _ = word_error_rate(hypotheses, references)
print_once('{0}: {1}'.format(tag, wer))
print_once('Prediction: {0}'.format(hypotheses[0]))
print_once('Reference: {0}'.format(references[0]))
return wer
def __gather_losses(losses_list: list) -> list:
return [torch.mean(torch.stack(losses_list))]
def __gather_predictions(predictions_list: list, labels: list) -> list:
results = []
for prediction in predictions_list:
results += __ctc_decoder_predictions_tensor(prediction, labels=labels)
return results
def __gather_transcripts(transcript_list: list, transcript_len_list: list,
labels: list) -> list:
results = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
# iterate over workers
for t, ln in zip(transcript_list, transcript_len_list):
# iterate over batch
t_lc = t.long().cpu()
ln_lc = ln.long().cpu()
for ind in range(t.shape[0]):
tgt_len = ln_lc[ind].item()
target = t_lc[ind][:tgt_len].numpy().tolist()
reference = ''.join([labels_map[c] for c in target])
results.append(reference)
return results
def process_evaluation_batch(tensors: dict, global_vars: dict, labels: list):
"""
Processes results of an iteration and saves it in global_vars
Args:
tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcript, and output
global_vars: dictionary where processed results of each iteration are saved
labels: A list of labels
"""
for kv, v in tensors.items():
if kv.startswith('loss'):
global_vars['EvalLoss'] += __gather_losses(v)
elif kv.startswith('predictions'):
global_vars['predictions'] += __gather_predictions(v, labels=labels)
elif kv.startswith('transcript_length'):
transcript_len_list = v
elif kv.startswith('transcript'):
transcript_list = v
elif kv.startswith('output'):
global_vars['logits'] += v
global_vars['transcripts'] += __gather_transcripts(transcript_list,
transcript_len_list,
labels=labels)
def process_evaluation_epoch(global_vars: dict, tag=None):
"""
Processes results from each worker at the end of evaluation and combines them into the final result
Args:
global_vars: dictionary containing information of entire evaluation
Return:
wer: final word error rate
loss: final loss
"""
if 'EvalLoss' in global_vars:
eloss = torch.mean(torch.stack(global_vars['EvalLoss'])).item()
else:
eloss = None
hypotheses = global_vars['predictions']
references = global_vars['transcripts']
wer, scores, num_words = word_error_rate(hypotheses=hypotheses, references=references)
multi_gpu = torch.distributed.is_initialized()
if multi_gpu:
if eloss is not None:
eloss /= torch.distributed.get_world_size()
eloss_tensor = torch.tensor(eloss).cuda()
dist.all_reduce(eloss_tensor)
eloss = eloss_tensor.item()
del eloss_tensor
scores_tensor = torch.tensor(scores).cuda()
dist.all_reduce(scores_tensor)
scores = scores_tensor.item()
del scores_tensor
num_words_tensor = torch.tensor(num_words).cuda()
dist.all_reduce(num_words_tensor)
num_words = num_words_tensor.item()
del num_words_tensor
wer = scores *1.0/num_words
return wer, eloss
def norm(x):
if not isinstance(x, list):
if not isinstance(x, tuple):
return x
return x[0]
def print_dict(d):
maxLen = max([len(ii) for ii in d.keys()])
fmtString = '\t%' + str(maxLen) + 's : %s'
print('Arguments:')
for keyPair in sorted(d.items()):
print(fmtString % keyPair)
def model_multi_gpu(model, multi_gpu=False):
if multi_gpu:
model = DDP(model)
print('DDP(model)')
return model

Binary image files not shown: six documentation images added (68–149 KiB each) and four removed (20–32 KiB each).

View file

@ -13,328 +13,385 @@
# limitations under the License.
import argparse
import itertools
from typing import List
from tqdm import tqdm
import math
import toml
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, add_ctc_labels, print_dict, model_multi_gpu, __ctc_decoder_predictions_tensor
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder
from parts.features import audio_from_file
import torch
import torch.nn as nn
import apex
from apex import amp
import random
import numpy as np
import pickle
import time
import os
import random
import time
from heapq import nlargest
from itertools import chain, repeat
from pathlib import Path
from tqdm import tqdm
def parse_args():
import dllogger
import torch
import numpy as np
import torch.distributed as distrib
from apex import amp
from apex.parallel import DistributedDataParallel
from dllogger import JSONStreamBackend, StdOutBackend, Verbosity
from jasper import config
from common import helpers
from common.dali.data_loader import DaliDataLoader
from common.dataset import (AudioDataset, FilelistDataset, get_data_loader,
SingleAudioDataset)
from common.features import BaseFeatures, FilterbankFeatures
from common.helpers import print_once, process_evaluation_epoch
from jasper.model import GreedyCTCDecoder, Jasper
from common.tb_dllogger import stdout_metric_format, unique_log_fpath
def get_parser():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument('--batch_size', default=16, type=int,
help='Data batch size')
parser.add_argument('--steps', default=0, type=int,
help='Eval this many steps for every worker')
parser.add_argument('--warmup_steps', default=0, type=int,
help='Burn-in period before measuring latencies')
parser.add_argument('--model_config', type=str,
help='Relative model config path given dataset folder')
parser.add_argument('--dataset_dir', type=str,
help='Absolute path to dataset folder')
parser.add_argument('--val_manifests', type=str, nargs='+',
help='Relative path to evaluation dataset manifest files')
parser.add_argument('--ckpt', default=None, type=str,
help='Path to model checkpoint')
parser.add_argument('--max_duration', default=None, type=float,
help='Filter out longer inputs (in seconds)')
parser.add_argument('--pad_to_max_duration', action='store_true',
help='Pads every batch to max_duration')
parser.add_argument('--amp', '--fp16', action='store_true',
help='Use FP16 precision')
parser.add_argument('--cudnn_benchmark', action='store_true',
help='Enable cudnn benchmark')
parser.add_argument('--cpu', action='store_true',
help='Run inference on CPU')
parser.add_argument("--seed", default=None, type=int, help='Random seed')
parser.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0),
type=int, help='GPU id used for distributed training')
parser.register("type", "bool", lambda x: x.lower() in ("yes", "true", "t", "1"))
parser.add_argument("--local_rank", default=None, type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--steps", default=None, help='if not specified do evaluation on full dataset. otherwise only evaluates the specified number of iterations for each worker', type=int)
parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None uses attribute from model configuration file')
parser.add_argument("--pad_to", default=None, type=int, help="default is pad to value as specified in model configurations. if -1 pad to maximum duration. If > 0 pad batch to next multiple of value")
parser.add_argument("--amp", "--fp16", action='store_true', help='use half precision')
parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
parser.add_argument("--save_prediction", type=str, default=None, help="if specified saves predictions in text form at this location")
parser.add_argument("--logits_save_to", default=None, type=str, help="if specified will save logits to path")
parser.add_argument("--seed", default=42, type=int, help='seed')
parser.add_argument("--output_dir", default="results/", type=str, help="Output directory to store exported models. Only used if --export_model is used")
parser.add_argument("--export_model", action='store_true', help="Exports the audio_featurizer, encoder and decoder using torch.jit to the output_dir")
parser.add_argument("--wav", type=str, help='absolute path to .wav file (16KHz)')
parser.add_argument("--cpu", action="store_true", help="Run inference on CPU")
parser.add_argument("--ema", action="store_true", help="If available, load EMA model weights")
# FIXME Unused, but passed by Triton helper scripts
parser.add_argument("--pyt_fp16", action='store_true', help='use half precision')
return parser.parse_args()
def calc_wer(data_layer, audio_processor,
encoderdecoder, greedy_decoder,
labels, args, device):
encoderdecoder = encoderdecoder.module if hasattr(encoderdecoder, 'module') else encoderdecoder
with torch.no_grad():
# reset global_var_dict - results of evaluation will be stored there
_global_var_dict = {
'predictions': [],
'transcripts': [],
'logits' : [],
}
# Evaluation mini-batch for loop
for it, data in enumerate(tqdm(data_layer.data_iterator)):
tensors = [t.to(device) for t in data]
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
t_processed_signal = audio_processor(t_audio_signal_e, t_a_sig_length_e)
t_log_probs_e, _ = encoderdecoder.infer(t_processed_signal)
t_predictions_e = greedy_decoder(t_log_probs_e)
values_dict = dict(
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e],
output=[t_log_probs_e]
)
# values_dict will contain results from all workers
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
if args.steps is not None and it + 1 >= args.steps:
break
# final aggregation (over minibatches) and logging of results
wer, _ = process_evaluation_epoch(_global_var_dict)
return wer, _global_var_dict
io = parser.add_argument_group('feature and checkpointing setup')
io.add_argument('--dali_device', type=str, choices=['none', 'cpu', 'gpu'],
default='gpu', help='Use DALI pipeline for fast data processing')
io.add_argument('--save_predictions', type=str, default=None,
help='Save predictions in text form at this location')
io.add_argument('--save_logits', default=None, type=str,
help='Save output logits under specified path')
io.add_argument('--transcribe_wav', type=str,
help='Path to a single .wav file (16KHz)')
io.add_argument('--transcribe_filelist', type=str,
help='Path to a filelist with one .wav path per line')
io.add_argument('-o', '--output_dir', default='results/',
help='Output folder to save audio (file per phrase)')
io.add_argument('--log_file', type=str, default=None,
help='Path to a DLLogger log file')
io.add_argument('--ema', action='store_true',
help='Load averaged model weights')
io.add_argument('--torchscript', action='store_true',
help='Evaluate with a TorchScripted model')
io.add_argument('--torchscript_export', action='store_true',
help='Export the model with torch.jit to the output_dir')
return parser
def jit_export(audio, audio_len, audio_processor, encoderdecoder, greedy_decoder, args):
def durs_to_percentiles(durations, ratios):
durations = np.asarray(durations) * 1000 # in ms
latency = durations
print("##############")
latency = latency[5:]
mean_latency = np.mean(latency)
module_name = "{}_{}".format(os.path.basename(args.model_toml), "fp16" if args.amp else "fp32")
if args.use_conv_mask:
module_name = module_name + "_noMaskConv"
# Export just the featurizer
print("exporting featurizer ...")
traced_module_feat = torch.jit.script(audio_processor)
traced_module_feat.save(os.path.join(args.output_dir, module_name + "_feat.pt"))
# Export just the acoustic model
print("exporting acoustic model ...")
inp_postFeat, _ = audio_processor(audio, audio_len)
traced_module_acoustic = torch.jit.trace(encoderdecoder, inp_postFeat)
traced_module_acoustic.save(os.path.join(args.output_dir, module_name + "_acoustic.pt"))
# Export just the decoder
print("exporting decoder ...")
inp_postAcoustic = encoderdecoder(inp_postFeat)
traced_module_decode = torch.jit.script(greedy_decoder, inp_postAcoustic)
traced_module_decode.save(os.path.join(args.output_dir, module_name + "_decoder.pt"))
print("JIT export complete")
return traced_module_feat, traced_module_acoustic, traced_module_decode
def run_once(audio_processor, encoderdecoder, greedy_decoder, audio, audio_len, labels, device):
features, lens = audio_processor(audio, audio_len)
if not device.type == 'cpu':
torch.cuda.synchronize()
t0 = time.perf_counter()
# TorchScripted model does not support (features, lengths)
if isinstance(encoderdecoder, torch.jit.TracedModule):
t_log_probs_e = encoderdecoder(features)
else:
t_log_probs_e, _ = encoderdecoder.infer((features, lens))
if not device.type == 'cpu':
torch.cuda.synchronize()
t1 = time.perf_counter()
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
hypotheses = __ctc_decoder_predictions_tensor(t_predictions_e, labels=labels)
print("INFERENCE TIME\t\t: {} ms".format((t1-t0)*1000.0))
print("TRANSCRIPT\t\t:", hypotheses[0])
latency_worst = nlargest(math.ceil((1 - min(ratios)) * len(latency)), latency)
latency_ranges = get_percentile(ratios, latency_worst, len(latency))
latency_ranges[0.5] = mean_latency
return latency_ranges
def eval(
data_layer,
audio_processor,
encoderdecoder,
greedy_decoder,
labels,
multi_gpu,
device,
args):
"""performs inference / evaluation
Args:
data_layer: data layer object that holds data loader
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
labels: list of labels as output vocabulary
multi_gpu: true if using multiple gpus
args: script input arguments
"""
logits_save_to=args.logits_save_to
with torch.no_grad():
if args.wav:
audio, audio_len = audio_from_file(args.wav)
run_once(audio_processor, encoderdecoder, greedy_decoder, audio, audio_len, labels, device)
if args.export_model:
jit_audio_processor, jit_encoderdecoder, jit_greedy_decoder = jit_export(audio, audio_len, audio_processor, encoderdecoder,greedy_decoder,args)
run_once(jit_audio_processor, jit_encoderdecoder, jit_greedy_decoder, audio, audio_len, labels, device)
return
wer, _global_var_dict = calc_wer(data_layer, audio_processor, encoderdecoder, greedy_decoder, labels, args, device)
if (not multi_gpu or (multi_gpu and torch.distributed.get_rank() == 0)):
print("==========>>>>>>Evaluation WER: {0}\n".format(wer))
if args.save_prediction is not None:
with open(args.save_prediction, 'w') as fp:
fp.write('\n'.join(_global_var_dict['predictions']))
if logits_save_to is not None:
logits = []
for batch in _global_var_dict["logits"]:
for i in range(batch.shape[0]):
logits.append(batch[i].cpu().numpy())
with open(logits_save_to, 'wb') as f:
pickle.dump(logits, f, protocol=pickle.HIGHEST_PROTOCOL)
# if args.export_model:
# feat, acoustic, decoder = jit_export(inp, audio_processor, encoderdecoder, greedy_decoder,args)
# wer_after = calc_wer(data_layer, feat, acoustic, decoder, labels, args)
# print("===>>>Before WER: {0}".format(wer))
# print("===>>>Traced WER: {0}".format(wer_after))
# print("===>>>Diff : {0} %".format((wer_after - wer_before) * 100.0 / wer_before))
# print("")
def get_percentile(ratios, arr, nsamples):
res = {}
for a in ratios:
idx = max(int(nsamples * (1 - a)), 0)
res[a] = arr[idx]
return res
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
def torchscript_export(data_loader, audio_processor, model, greedy_decoder,
output_dir, use_amp, use_conv_masks, model_config, device,
save):
multi_gpu = args.local_rank is not None
audio_processor.to(device)
for batch in data_loader:
batch = [t.to(device, non_blocking=True) for t in batch]
audio, audio_len, _, _ = batch
feats, feat_lens = audio_processor(audio, audio_len)
break
print("\nExporting featurizer...")
print("\nNOTE: Dithering causes warnings about non-determinism.\n")
ts_feat = torch.jit.trace(audio_processor, (audio, audio_len))
print("\nExporting acoustic model...")
model(feats, feat_lens)
ts_acoustic = torch.jit.trace(model, (feats, feat_lens))
print("\nExporting decoder...")
log_probs = model(feats, feat_lens)
ts_decoder = torch.jit.script(greedy_decoder, log_probs)
print("\nJIT export complete.")
if save:
precision = "fp16" if use_amp else "fp32"
module_name = f'{os.path.basename(model_config)}_{precision}'
ts_feat.save(os.path.join(output_dir, module_name + "_feat.pt"))
ts_acoustic.save(os.path.join(output_dir, module_name + "_acoustic.pt"))
ts_decoder.save(os.path.join(output_dir, module_name + "_decoder.pt"))
return ts_feat, ts_acoustic, ts_decoder
def main():
parser = get_parser()
args = parser.parse_args()
log_fpath = args.log_file or str(Path(args.output_dir, 'nvlog_infer.json'))
log_fpath = unique_log_fpath(log_fpath)
dllogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT, log_fpath),
StdOutBackend(Verbosity.VERBOSE,
metric_format=stdout_metric_format)])
[dllogger.log("PARAMETER", {k: v}) for k, v in vars(args).items()]
for step in ['DNN', 'data+DNN', 'data']:
for c in [0.99, 0.95, 0.9, 0.5]:
cs = 'avg' if c == 0.5 else f'{int(100*c)}%'
dllogger.metadata(f'{step.lower()}_latency_{c}',
{'name': f'{step} latency {cs}',
'format': ':>7.2f', 'unit': 'ms'})
dllogger.metadata(
'eval_wer', {'name': 'WER', 'format': ':>3.2f', 'unit': '%'})
if args.cpu:
assert(not multi_gpu)
device = torch.device('cpu')
else:
assert(torch.cuda.is_available())
assert torch.cuda.is_available()
device = torch.device('cuda')
torch.backends.cudnn.benchmark = args.cudnn_benchmark
print("CUDNN BENCHMARK ", args.cudnn_benchmark)
if multi_gpu:
print("DISTRIBUTED with ", torch.distributed.get_world_size())
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
if args.seed is not None:
torch.manual_seed(args.seed + args.local_rank)
np.random.seed(args.seed + args.local_rank)
random.seed(args.seed + args.local_rank)
optim_level = 3 if args.amp else 0
# set up distributed training
multi_gpu = not args.cpu and int(os.environ.get('WORLD_SIZE', 1)) > 1
if multi_gpu:
torch.cuda.set_device(args.local_rank)
distrib.init_process_group(backend='nccl', init_method='env://')
print_once(f'Inference with {distrib.get_world_size()} GPUs')
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
featurizer_config["fp16"] = args.amp
args.use_conv_mask = jasper_model_definition['encoder'].get('convmask', True)
if args.use_conv_mask and args.export_model:
print('WARNING: Masked convs currently not supported for TorchScript. Disabling.')
jasper_model_definition['encoder']['convmask'] = False
cfg = config.load(args.model_config)
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
if args.pad_to is not None:
featurizer_config['pad_to'] = args.pad_to
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if featurizer_config['pad_to'] == "max":
featurizer_config['pad_to'] = -1
if args.pad_to_max_duration:
assert cfg['input_val']['audio_dataset']['max_duration'] > 0
cfg['input_val']['audio_dataset']['pad_to_max_duration'] = True
cfg['input_val']['filterbank_features']['pad_to_max_duration'] = True
print('=== model_config ===')
print_dict(jasper_model_definition)
print()
print('=== feature_config ===')
print_dict(featurizer_config)
print()
data_layer = None
symbols = helpers.add_ctc_blank(cfg['labels'])
if args.wav is None:
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=val_manifest,
labels=dataset_vocab,
use_dali = args.dali_device in ('cpu', 'gpu')
dataset_kw, features_kw = config.input(cfg, 'val')
measure_perf = args.steps > 0
# dataset
if args.transcribe_wav or args.transcribe_filelist:
assert not use_dali, "DALI is not supported for a single audio"
assert not args.transcribe_filelist
assert not args.pad_to_max_duration
assert not (args.transcribe_wav and args.transcribe_filelist)
if args.transcribe_wav:
dataset = SingleAudioDataset(args.transcribe_wav)
else:
dataset = FilelistDataset(args.transcribe_filelist)
data_loader = get_data_loader(dataset,
batch_size=1,
multi_gpu=multi_gpu,
shuffle=False,
num_workers=0,
drop_last=(True if measure_perf else False))
_, features_kw = config.input(cfg, 'val')
feat_proc = FilterbankFeatures(**features_kw)
elif use_dali:
# pad_to_max_duration is not supported by DALI - have simple padders
if features_kw['pad_to_max_duration']:
feat_proc = BaseFeatures(
pad_align=features_kw['pad_align'],
pad_to_max_duration=True,
max_duration=features_kw['max_duration'],
sample_rate=features_kw['sample_rate'],
window_size=features_kw['window_size'],
window_stride=features_kw['window_stride'])
features_kw['pad_to_max_duration'] = False
else:
feat_proc = None
data_loader = DaliDataLoader(
gpu_id=args.local_rank or 0,
dataset_path=args.dataset_dir,
config_data=dataset_kw,
config_features=features_kw,
json_names=args.val_manifests,
batch_size=args.batch_size,
pad_to_max=featurizer_config['pad_to'] == -1,
shuffle=False,
multi_gpu=multi_gpu)
audio_preprocessor = AudioPreprocessing(**featurizer_config)
encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
pipeline_type=("train" if measure_perf else "val"), # no drop_last
device_type=args.dali_device,
symbols=symbols)
else:
dataset = AudioDataset(args.dataset_dir,
args.val_manifests,
symbols,
**dataset_kw)
data_loader = get_data_loader(dataset,
args.batch_size,
multi_gpu=multi_gpu,
shuffle=False,
num_workers=4,
drop_last=False)
feat_proc = FilterbankFeatures(**features_kw)
model = Jasper(encoder_kw=config.encoder(cfg),
decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
if args.ckpt is not None:
print("loading model from ", args.ckpt)
print(f'Loading the model from {args.ckpt} ...')
checkpoint = torch.load(args.ckpt, map_location="cpu")
key = 'ema_state_dict' if args.ema else 'state_dict'
state_dict = helpers.convert_v1_state_dict(checkpoint[key])
model.load_state_dict(state_dict, strict=True)
if os.path.isdir(args.ckpt):
exit(0)
else:
checkpoint = torch.load(args.ckpt, map_location="cpu")
if args.ema and 'ema_state_dict' in checkpoint:
print('Loading EMA state dict')
sd = 'ema_state_dict'
else:
sd = 'state_dict'
model.to(device)
model.eval()
for k in audio_preprocessor.state_dict().keys():
checkpoint[sd][k] = checkpoint[sd].pop("audio_preprocessor." + k)
audio_preprocessor.load_state_dict(checkpoint[sd], strict=False)
encoderdecoder.load_state_dict(checkpoint[sd], strict=False)
greedy_decoder = GreedyCTCDecoder()
# print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
if args.wav is None:
N = len(data_layer)
step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
if args.steps is not None:
print('-----------------')
print('Have {0} examples to eval on.'.format(args.steps * args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
print('Have {0} steps / (gpu * epoch).'.format(args.steps))
print('-----------------')
else:
print('-----------------')
print('Have {0} examples to eval on.'.format(N))
print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
print('-----------------')
print ("audio_preprocessor.normalize: ", audio_preprocessor.featurizer.normalize)
audio_preprocessor.to(device)
encoderdecoder.to(device)
if feat_proc is not None:
feat_proc.to(device)
feat_proc.eval()
if args.amp:
encoderdecoder = amp.initialize(models=encoderdecoder,
opt_level='O'+str(optim_level))
model = model.half()
encoderdecoder = model_multi_gpu(encoderdecoder, multi_gpu)
audio_preprocessor.eval()
encoderdecoder.eval()
greedy_decoder.eval()
if args.torchscript:
greedy_decoder = GreedyCTCDecoder()
eval(
data_layer=data_layer,
audio_processor=audio_preprocessor,
encoderdecoder=encoderdecoder,
greedy_decoder=greedy_decoder,
labels=ctc_vocab,
args=args,
device=device,
multi_gpu=multi_gpu)
feat_proc, model, greedy_decoder = torchscript_export(
data_loader, feat_proc, model, greedy_decoder, args.output_dir,
use_amp=args.amp, use_conv_masks=True, model_toml=args.model_toml,
device=device, save=args.torchscript_export)
if __name__=="__main__":
args = parse_args()
if multi_gpu:
model = DistributedDataParallel(model)
print_dict(vars(args))
agg = {'txts': [], 'preds': [], 'logits': []}
dur = {'data': [], 'dnn': [], 'data+dnn': []}
main(args)
looped_loader = chain.from_iterable(repeat(data_loader))
greedy_decoder = GreedyCTCDecoder()
sync = lambda: torch.cuda.synchronize() if device.type == 'cuda' else None
steps = args.steps + args.warmup_steps or len(data_loader)
with torch.no_grad():
for it, batch in enumerate(tqdm(looped_loader, initial=1, total=steps)):
if use_dali:
feats, feat_lens, txt, txt_lens = batch
if feat_proc is not None:
feats, feat_lens = feat_proc(feats, feat_lens)
else:
batch = [t.cuda(non_blocking=True) for t in batch]
audio, audio_lens, txt, txt_lens = batch
feats, feat_lens = feat_proc(audio, audio_lens)
sync()
t1 = time.perf_counter()
if args.amp:
feats = feats.half()
if model.encoder.use_conv_masks:
log_probs, log_prob_lens = model(feats, feat_lens)
else:
log_probs = model(feats, feat_lens)
preds = greedy_decoder(log_probs)
sync()
t2 = time.perf_counter()
# burn-in period; wait for a new loader due to num_workers
if it >= 1 and (args.steps == 0 or it >= args.warmup_steps):
dur['data'].append(t1 - t0)
dur['dnn'].append(t2 - t1)
dur['data+dnn'].append(t2 - t0)
if txt is not None:
agg['txts'] += helpers.gather_transcripts([txt], [txt_lens],
symbols)
agg['preds'] += helpers.gather_predictions([preds], symbols)
agg['logits'].append(log_probs)
if it + 1 == steps:
break
sync()
t0 = time.perf_counter()
# communicate the results
if args.transcribe_wav:
for idx, p in enumerate(agg['preds']):
print_once(f'Prediction {idx+1: >3}: {p}')
elif args.transcribe_filelist:
pass
elif not multi_gpu or distrib.get_rank() == 0:
wer, _ = process_evaluation_epoch(agg)
dllogger.log(step=(), data={'eval_wer': 100 * wer})
if args.save_predictions:
with open(args.save_predictions, 'w') as f:
f.write('\n'.join(agg['preds']))
if args.save_logits:
logits = torch.cat(agg['logits'], dim=0).cpu()
torch.save(logits, args.save_logits)
# report timings
if len(dur['data']) >= 20:
ratios = [0.9, 0.95, 0.99]
for stage in dur:
lat = durs_to_percentiles(dur[stage], ratios)
for k in [0.99, 0.95, 0.9, 0.5]:
kk = str(k).replace('.', '_')
dllogger.log(step=(), data={f'{stage.lower()}_latency_{kk}': lat[k]})
else:
print_once('Not enough samples to measure latencies.')
if __name__ == "__main__":
main()
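# Example invocation (a sketch; the script name and all paths are placeholders,
# while the flags come from the parser defined above):
#   python inference.py \
#       --dataset_dir /datasets/LibriSpeech \
#       --val_manifests /datasets/LibriSpeech/librispeech-dev-clean-wav.json \
#       --model_config configs/jasper_config.yaml \
#       --ckpt /checkpoints/jasper_fp16.pt \
#       --batch_size 16 --amp --dali_device gpu --cudnn_benchmark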

View file

@ -1,301 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import itertools
import os
import sys
import time
import random
import numpy as np
from heapq import nlargest
import math
from tqdm import tqdm
import toml
import torch
from apex import amp
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, add_ctc_labels, print_dict
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder
from parts.features import audio_from_file
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument("--steps", default=None, help='if not specified do evaluation on full dataset. otherwise only evaluates the specified number of iterations for each worker', type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None uses attribute from model configuration file')
parser.add_argument("--pad_to", default=None, type=int, help="default is pad to value as specified in model configurations. if -1 pad to maximum duration. If > 0 pad batch to next multiple of value")
parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
parser.add_argument("--amp", "--fp16", action='store_true', help='use half precision')
parser.add_argument("--seed", default=42, type=int, help='seed')
parser.add_argument("--cpu", action='store_true', help='run inference on CPU')
parser.add_argument("--torch_script", action='store_true', help='export model')
parser.add_argument("--sample_audio", default="/datasets/LibriSpeech/dev-clean-wav/1272/128104/1272-128104-0000.wav", type=str, help='audio sample path for torchscript, points to one of the files in /datasets/LibriSpeech/dev-clean-wav/ if not defined')
return parser.parse_args()
def jit_export(
audio,
audio_len,
audio_processor,
encoderdecoder,
greedy_decoder,
args):
"""applies torchscript
Args:
audio:
audio_len:
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
args: script input arguments
"""
# Export just the featurizer
print("torchscripting featurizer ...")
traced_module_feat = torch.jit.script(audio_processor)
# Export just the acoustic model
print("torchscripting acoustic model ...")
inp_postFeat, _ = audio_processor(audio, audio_len)
traced_module_acoustic = torch.jit.trace(encoderdecoder, inp_postFeat)
# Export just the decoder
print("torchscripting decoder ...")
inp_postAcoustic = encoderdecoder(inp_postFeat)
traced_module_decode = torch.jit.script(greedy_decoder, inp_postAcoustic)
print("JIT process complete")
return traced_module_feat, traced_module_acoustic, traced_module_decode
def eval(
data_layer,
audio_processor,
encoderdecoder,
greedy_decoder,
labels,
device,
args):
"""performs evaluation and prints performance statistics
Args:
data_layer: data layer object that holds data loader
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
labels: list of labels as output vocabulary
args: script input arguments
"""
batch_size=args.batch_size
steps=args.steps
audio_processor.eval()
encoderdecoder.eval()
greedy_decoder.eval()
if args.torch_script:
audio, audio_len = audio_from_file(args.sample_audio, device=device)
audio_processor, encoderdecoder, greedy_decoder = jit_export(audio, audio_len, audio_processor, encoderdecoder, greedy_decoder, args)
with torch.no_grad():
_global_var_dict = {
'predictions': [],
'transcripts': [],
}
it = 0
ep = 0
if steps is None:
steps = math.ceil(len(data_layer) / batch_size)
durations_dnn = []
durations_dnn_and_prep = []
seq_lens = []
sync = lambda: torch.cuda.synchronize() if device.type == 'cuda' else None
while True:
ep += 1
for data in tqdm(data_layer.data_iterator):
it += 1
if it > steps:
break
tensors = [t.to(device) for t in data]
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
sync()
t0 = time.perf_counter()
features, lens = audio_processor(t_audio_signal_e, t_a_sig_length_e)
sync()
t1 = time.perf_counter()
if isinstance(encoderdecoder, torch.jit.TracedModule):
t_log_probs_e = encoderdecoder(features)
else:
t_log_probs_e, _ = encoderdecoder.infer((features, lens))
sync()
stop_time = time.perf_counter()
time_prep_and_dnn = stop_time - t0
time_dnn = stop_time - t1
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
values_dict = dict(
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e],
)
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
durations_dnn.append(time_dnn)
durations_dnn_and_prep.append(time_prep_and_dnn)
seq_lens.append(features[0].shape[-1])
if it >= steps:
wer, _ = process_evaluation_epoch(_global_var_dict)
print("==========>>>>>>Evaluation of all iterations WER: {0}\n".format(wer))
break
ratios = [0.9, 0.95,0.99, 1.]
latencies_dnn = take_durations_and_output_percentile(durations_dnn, ratios)
latencies_dnn_and_prep = take_durations_and_output_percentile(durations_dnn_and_prep, ratios)
print("\n using batch size {} and {} frames ".format(batch_size, seq_lens[-1]))
print("\n".join(["dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn.items()]))
print("\n".join(["prep + dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn_and_prep.items()]))
def take_durations_and_output_percentile(durations, ratios):
durations = np.asarray(durations) * 1000 # in ms
latency = durations
latency = latency[5:]
mean_latency = np.mean(latency)
latency_worst = nlargest(math.ceil( (1 - min(ratios))* len(latency)), latency)
latency_ranges=get_percentile(ratios, latency_worst, len(latency))
latency_ranges["0.5"] = mean_latency
return latency_ranges
def get_percentile(ratios, arr, nsamples):
res = {}
for a in ratios:
idx = max(int(nsamples * (1 - a)), 0)
res[a] = arr[idx]
return res
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
assert(args.steps is None or args.steps > 5)
if args.cpu:
device = torch.device('cpu')
else:
assert(torch.cuda.is_available())
device = torch.device('cuda')
torch.backends.cudnn.benchmark = args.cudnn_benchmark
print("CUDNN BENCHMARK ", args.cudnn_benchmark)
optim_level = 3 if args.amp else 0
batch_size = args.batch_size
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
# TORCHSCRIPT: Can't use mixed types. Using -1 for "max"
if args.pad_to is not None:
featurizer_config['pad_to'] = args.pad_to if args.pad_to >= 0 else -1
if featurizer_config['pad_to'] == "max":
featurizer_config['pad_to'] = -1
args.use_conv_mask = jasper_model_definition['encoder'].get('convmask', True)
if args.use_conv_mask and args.torch_script:
print('WARNING: Masked convs currently not supported for TorchScript. Disabling.')
jasper_model_definition['encoder']['convmask'] = False
print('model_config')
print_dict(jasper_model_definition)
print('feature_config')
print_dict(featurizer_config)
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=val_manifest,
labels=dataset_vocab,
batch_size=batch_size,
pad_to_max=featurizer_config['pad_to'] == -1,
shuffle=False,
multi_gpu=False)
audio_preprocessor = AudioPreprocessing(**featurizer_config)
encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
if args.ckpt is not None:
print("loading model from ", args.ckpt)
checkpoint = torch.load(args.ckpt, map_location="cpu")
for k in audio_preprocessor.state_dict().keys():
checkpoint['state_dict'][k] = checkpoint['state_dict'].pop("audio_preprocessor." + k)
audio_preprocessor.load_state_dict(checkpoint['state_dict'], strict=False)
encoderdecoder.load_state_dict(checkpoint['state_dict'], strict=False)
greedy_decoder = GreedyCTCDecoder()
# print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
N = len(data_layer)
step_per_epoch = math.ceil(N / args.batch_size)
print('-----------------')
if args.steps is None:
print('Have {0} examples to eval on.'.format(N))
print('Have {0} steps / (epoch).'.format(step_per_epoch))
else:
print('Have {0} examples to eval on.'.format(args.steps * args.batch_size))
print('Have {0} steps / (epoch).'.format(args.steps))
print('-----------------')
audio_preprocessor.to(device)
encoderdecoder.to(device)
if args.amp:
encoderdecoder = amp.initialize(
models=encoderdecoder, opt_level='O'+str(optim_level))
eval(
data_layer=data_layer,
audio_processor=audio_preprocessor,
encoderdecoder=encoderdecoder,
greedy_decoder=greedy_decoder,
labels=ctc_vocab,
device=device,
args=args)
if __name__=="__main__":
args = parse_args()
print_dict(vars(args))
main(args)

View file

@ -0,0 +1,110 @@
import copy
import inspect
import yaml
from .model import JasperDecoderForCTC, JasperBlock, JasperEncoder
from common.audio import GainPerturbation, ShiftPerturbation, SpeedPerturbation
from common.dataset import AudioDataset
from common.features import CutoutAugment, FilterbankFeatures, SpecAugment
from common.helpers import print_once
def default_args(klass):
sig = inspect.signature(klass.__init__)
return {k: v.default for k,v in sig.parameters.items() if k != 'self'}
def load(fpath):
if fpath.endswith('.toml'):
raise ValueError('.toml config format has been changed to .yaml')
cfg = yaml.safe_load(open(fpath, 'r'))
# Reload to deep copy shallow copies, which were made with yaml anchors
yaml.Dumper.ignore_aliases = lambda *args: True
cfg = yaml.dump(cfg)
cfg = yaml.safe_load(cfg)
return cfg
def validate_and_fill(klass, user_conf, ignore_unk=[], optional=[]):
conf = default_args(klass)
for k,v in user_conf.items():
assert k in conf or k in ignore_unk, f'Unknown parameter {k} for {klass}'
conf[k] = v
# Keep only mandatory or optional-nonempty
conf = {k:v for k,v in conf.items()
if k not in optional or v is not inspect.Parameter.empty}
# Validate
for k,v in conf.items():
assert v is not inspect.Parameter.empty, \
f'Value for {k} not specified for {klass}'
return conf
def input(conf_yaml, split='train'):
conf = copy.deepcopy(conf_yaml[f'input_{split}'])
conf_dataset = conf.pop('audio_dataset')
conf_features = conf.pop('filterbank_features')
# Validate known inner classes
inner_classes = [
(conf_dataset, 'speed_perturbation', SpeedPerturbation),
(conf_dataset, 'gain_perturbation', GainPerturbation),
(conf_dataset, 'shift_perturbation', ShiftPerturbation),
(conf_features, 'spec_augment', SpecAugment),
(conf_features, 'cutout_augment', CutoutAugment),
]
for conf_tgt, key, klass in inner_classes:
if key in conf_tgt:
conf_tgt[key] = validate_and_fill(klass, conf_tgt[key])
for k in conf:
raise ValueError(f'Unknown key {k}')
# Validate outer classes
conf_dataset = validate_and_fill(
AudioDataset, conf_dataset,
optional=['data_dir', 'labels', 'manifest_fpaths'])
conf_features = validate_and_fill(
FilterbankFeatures, conf_features)
# Check params shared between classes
shared = ['sample_rate', 'max_duration', 'pad_to_max_duration']
for sh in shared:
assert conf_dataset[sh] == conf_features[sh], (
f'{sh} should match in Dataset and FeatureProcessor: '
f'{conf_dataset[sh]}, {conf_features[sh]}')
return conf_dataset, conf_features
def encoder(conf):
"""Validate config for JasperEncoder and subsequent JasperBlocks"""
# Validate, but don't overwrite with defaults
for blk in conf['jasper']['encoder']['blocks']:
validate_and_fill(JasperBlock, blk, optional=['infilters'],
ignore_unk=['residual_dense'])
return validate_and_fill(JasperEncoder, conf['jasper']['encoder'])
def decoder(conf, n_classes):
decoder_kw = {'n_classes': n_classes, **conf['jasper']['decoder']}
return validate_and_fill(JasperDecoderForCTC, decoder_kw)
def apply_duration_flags(cfg, max_duration, pad_to_max_duration):
if max_duration is not None:
cfg['input_train']['audio_dataset']['max_duration'] = max_duration
cfg['input_train']['filterbank_features']['max_duration'] = max_duration
if pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
cfg['input_train']['filterbank_features']['pad_to_max_duration'] = True
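# Expected YAML layout (an illustrative outline inferred from the keys accessed
# above; the exact parameter names inside each section are the constructor
# arguments of the corresponding classes, and the values shown are placeholders):
#
#   labels: [" ", "a", "b", ...]
#   input_train:
#     audio_dataset:
#       sample_rate: 16000
#       max_duration: 16.7
#       pad_to_max_duration: false
#       speed_perturbation: {...}   # validated against SpeedPerturbation
#       gain_perturbation: {...}    # validated against GainPerturbation
#       shift_perturbation: {...}   # validated against ShiftPerturbation
#     filterbank_features:
#       sample_rate: 16000          # must match audio_dataset
#       max_duration: 16.7
#       pad_to_max_duration: false
#       spec_augment: {...}         # validated against SpecAugment
#       cutout_augment: {...}       # validated against CutoutAugment
#   input_val:
#     audio_dataset: {...}
#     filterbank_features: {...}
#   jasper:
#     encoder:
#       blocks: [...]               # one entry per JasperBlock
#     decoder: {...}                # JasperDecoderForCTC arguments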

View file

@ -0,0 +1,275 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import torch.nn.functional as F
activations = {
"hardtanh": nn.Hardtanh,
"relu": nn.ReLU,
"selu": nn.SELU,
}
def init_weights(m, mode='xavier_uniform'):
if type(m) == nn.Conv1d or type(m) == MaskedConv1d:
if mode == 'xavier_uniform':
nn.init.xavier_uniform_(m.weight, gain=1.0)
elif mode == 'xavier_normal':
nn.init.xavier_normal_(m.weight, gain=1.0)
elif mode == 'kaiming_uniform':
nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
elif mode == 'kaiming_normal':
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
else:
raise ValueError("Unknown Initialization mode: {0}".format(mode))
elif type(m) == nn.BatchNorm1d:
if m.track_running_stats:
m.running_mean.zero_()
m.running_var.fill_(1)
m.num_batches_tracked.zero_()
if m.affine:
nn.init.ones_(m.weight)
nn.init.zeros_(m.bias)
def get_same_padding(kernel_size, stride, dilation):
if stride > 1 and dilation > 1:
raise ValueError("Only stride OR dilation may be greater than 1")
return (kernel_size // 2) * dilation
class MaskedConv1d(nn.Conv1d):
"""1D convolution with sequence masking
"""
__constants__ = ["masked"]
def __init__(self, in_channels, out_channels, kernel_size, stride=1,
padding=0, dilation=1, groups=1, bias=False, masked=True):
super(MaskedConv1d, self).__init__(
in_channels, out_channels, kernel_size, stride=stride,
padding=padding, dilation=dilation, groups=groups, bias=bias)
self.masked = masked
def get_seq_len(self, lens):
return ((lens + 2 * self.padding[0] - self.dilation[0]
* (self.kernel_size[0] - 1) - 1) // self.stride[0] + 1)
def forward(self, x, x_lens=None):
if self.masked:
max_len = x.size(2)
idxs = torch.arange(max_len, dtype=x_lens.dtype, device=x_lens.device)
mask = idxs.expand(x_lens.size(0), max_len) >= x_lens.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
x_lens = self.get_seq_len(x_lens)
return super(MaskedConv1d, self).forward(x), x_lens
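# Worked example (illustrative): with kernel_size=11, stride=2, dilation=1 and
# 'same' padding (get_same_padding(11, 2, 1) == 5), an input of length 100 gives
#   get_seq_len(100) = (100 + 2*5 - 1*(11-1) - 1) // 2 + 1 = 50,
# and every time step at or beyond a sequence's true length is zeroed before the
# convolution, so padded frames cannot leak into valid outputs.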
class JasperBlock(nn.Module):
__constants__ = ["use_conv_masks"]
"""Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
"""
def __init__(self, infilters, filters, repeat=3, kernel_size=11, stride=1,
dilation=1, padding='same', dropout=0.2, activation=None,
residual=True, residual_panes=[], use_conv_masks=False):
super(JasperBlock, self).__init__()
assert padding == "same", "Only 'same' padding is supported."
padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
self.use_conv_masks = use_conv_masks
self.conv = nn.ModuleList()
for i in range(repeat):
self.conv.extend(self._conv_bn(infilters if i == 0 else filters,
filters,
kernel_size=kernel_size,
stride=stride,
dilation=dilation,
padding=padding_val))
if i < repeat - 1:
self.conv.extend(self._act_dropout(dropout, activation))
self.res = nn.ModuleList() if residual else None
res_panes = residual_panes.copy()
self.dense_residual = residual
if residual:
if len(residual_panes) == 0:
res_panes = [infilters]
self.dense_residual = False
for ip in res_panes:
self.res.append(nn.ModuleList(
self._conv_bn(ip, filters, kernel_size=1)))
self.out = nn.Sequential(*self._act_dropout(dropout, activation))
def _conv_bn(self, in_channels, out_channels, **kw):
return [MaskedConv1d(in_channels, out_channels,
masked=self.use_conv_masks, **kw),
nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)]
def _act_dropout(self, dropout=0.2, activation=None):
return [activation or nn.Hardtanh(min_val=0.0, max_val=20.0),
nn.Dropout(p=dropout)]
def forward(self, xs, xs_lens=None):
if not self.use_conv_masks:
xs_lens = 0
# forward convolutions
out = xs[-1]
lens = xs_lens
for i, l in enumerate(self.conv):
if isinstance(l, MaskedConv1d):
out, lens = l(out, lens)
else:
out = l(out)
# residuals
if self.res is not None:
for i, layer in enumerate(self.res):
res_out = xs[i]
for j, res_layer in enumerate(layer):
if j == 0: # and self.use_conv_mask:
res_out, _ = res_layer(res_out, xs_lens)
else:
res_out = res_layer(res_out)
out += res_out
# output
out = self.out(out)
if self.res is not None and self.dense_residual:
out = xs + [out]
else:
out = [out]
if self.use_conv_masks:
return out, lens
else:
return out, None
class JasperEncoder(nn.Module):
__constants__ = ["use_conv_masks"]
def __init__(self, in_feats, activation, frame_splicing=1,
init='xavier_uniform', use_conv_masks=False, blocks=[]):
super(JasperEncoder, self).__init__()
self.use_conv_masks = use_conv_masks
self.layers = nn.ModuleList()
in_feats *= frame_splicing
all_residual_panes = []
for i,blk in enumerate(blocks):
blk['activation'] = activations[activation]()
has_residual_dense = blk.pop('residual_dense', False)
if has_residual_dense:
all_residual_panes += [in_feats]
blk['residual_panes'] = all_residual_panes
else:
blk['residual_panes'] = []
self.layers.append(
JasperBlock(in_feats, use_conv_masks=use_conv_masks, **blk))
in_feats = blk['filters']
self.apply(lambda x: init_weights(x, mode=init))
def forward(self, x, x_lens=None):
out, out_lens = [x], x_lens
for l in self.layers:
out, out_lens = l(out, out_lens)
return out, out_lens
class JasperDecoderForCTC(nn.Module):
def __init__(self, in_feats, n_classes, init='xavier_uniform'):
super(JasperDecoderForCTC, self).__init__()
self.layers = nn.Sequential(
nn.Conv1d(in_feats, n_classes, kernel_size=1, bias=True),)
self.apply(lambda x: init_weights(x, mode=init))
def forward(self, enc_out):
out = self.layers(enc_out[-1]).transpose(1, 2)
return F.log_softmax(out, dim=2)
class GreedyCTCDecoder(nn.Module):
@torch.no_grad()
def forward(self, log_probs, log_prob_lens=None):
if log_prob_lens is not None:
max_len = log_probs.size(1)
idxs = torch.arange(max_len, dtype=log_prob_lens.dtype,
device=log_prob_lens.device)
mask = idxs.unsqueeze(0) >= log_prob_lens.unsqueeze(1)
log_probs[:,:,-1] = log_probs[:,:,-1].masked_fill(mask, float("Inf"))
return log_probs.argmax(dim=-1, keepdim=False).int()
class Jasper(nn.Module):
def __init__(self, encoder_kw, decoder_kw, transpose_in=False):
super(Jasper, self).__init__()
self.transpose_in = transpose_in
self.encoder = JasperEncoder(**encoder_kw)
self.decoder = JasperDecoderForCTC(**decoder_kw)
def forward(self, x, x_lens=None):
if self.encoder.use_conv_masks:
assert x_lens is not None
enc, enc_lens = self.encoder(x, x_lens)
out = self.decoder(enc)
return out, enc_lens
else:
if self.transpose_in:
x = x.transpose(1, 2)
enc, _ = self.encoder(x)
out = self.decoder(enc)
return out # torchscript refuses to output None
# TODO Explicitly add x_lens=None for inference (now x can be a Tensor or tuple)
def infer(self, x, x_lens=None):
if self.encoder.use_conv_masks:
return self.forward(x, x_lens)
else:
ret = self.forward(x)
return ret, len(ret)
class CTCLossNM:
def __init__(self, n_classes):
self._criterion = nn.CTCLoss(blank=n_classes-1, reduction='none')
def __call__(self, log_probs, targets, input_length, target_length):
input_length = input_length.long()
target_length = target_length.long()
targets = targets.long()
loss = self._criterion(log_probs.transpose(1, 0), targets, input_length,
target_length)
# note that this is different from reduction = 'mean'
# because we are not dividing by target lengths
return torch.mean(loss)
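
The closing comment in `CTCLossNM` is worth unpacking: averaging the per-utterance losses directly is not the same as PyTorch's `reduction='mean'`, which first divides each loss by its target length before averaging. The following standalone sketch (illustrative only, not part of the repository; the tensor shapes and values are arbitrary) makes the difference concrete. `log_probs` is built time-major here, so the transpose used in `CTCLossNM` is not needed.

```
# Standalone sketch: batch-mean of per-utterance CTC losses (as in CTCLossNM)
# vs. PyTorch's reduction='mean', which also divides by each target length.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_classes = 29                                                   # vocabulary + blank at index n_classes - 1
log_probs = torch.randn(50, 2, n_classes).log_softmax(dim=-1)    # (T, N, C), time-major
targets = torch.randint(0, n_classes - 1, (2, 20))
input_lengths = torch.tensor([50, 50])
target_lengths = torch.tensor([20, 7])

# reduction='none' + mean over the batch: every utterance contributes equally,
# regardless of how long its transcript is (this is what CTCLossNM computes).
per_utt = nn.CTCLoss(blank=n_classes - 1, reduction='none')(
    log_probs, targets, input_lengths, target_lengths)
print(per_utt.mean().item())

# reduction='mean': each per-utterance loss is divided by its target length first,
# so utterances with short transcripts are weighted up relative to long ones.
print(nn.CTCLoss(blank=n_classes - 1, reduction='mean')(
    log_probs, targets, input_lengths, target_lengths).item())
```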

View file

@ -1,423 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from apex import amp
import torch
import torch.nn as nn
from parts.features import FeatureFactory
import random
jasper_activations = {
"hardtanh": nn.Hardtanh,
"relu": nn.ReLU,
"selu": nn.SELU,
}
def init_weights(m, mode='xavier_uniform'):
if type(m) == nn.Conv1d or type(m) == MaskedConv1d:
if mode == 'xavier_uniform':
nn.init.xavier_uniform_(m.weight, gain=1.0)
elif mode == 'xavier_normal':
nn.init.xavier_normal_(m.weight, gain=1.0)
elif mode == 'kaiming_uniform':
nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
elif mode == 'kaiming_normal':
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
else:
raise ValueError("Unknown Initialization mode: {0}".format(mode))
elif type(m) == nn.BatchNorm1d:
if m.track_running_stats:
m.running_mean.zero_()
m.running_var.fill_(1)
m.num_batches_tracked.zero_()
if m.affine:
nn.init.ones_(m.weight)
nn.init.zeros_(m.bias)
def get_same_padding(kernel_size, stride, dilation):
if stride > 1 and dilation > 1:
raise ValueError("Only stride OR dilation may be greater than 1")
return (kernel_size // 2) * dilation
class AudioPreprocessing(nn.Module):
"""GPU accelerated audio preprocessing
"""
__constants__ = ["optim_level"]
def __init__(self, **kwargs):
nn.Module.__init__(self) # For PyTorch API
self.optim_level = kwargs.get('optimization_level', 0)
self.featurizer = FeatureFactory.from_config(kwargs)
self.transpose_out = kwargs.get("transpose_out", False)
@torch.no_grad()
def forward(self, input_signal, length):
processed_signal = self.featurizer(input_signal, length)
processed_length = self.featurizer.get_seq_len(length)
if self.transpose_out:
processed_signal.transpose_(2,1)
return processed_signal, processed_length
else:
return processed_signal, processed_length
class SpectrogramAugmentation(nn.Module):
"""Spectrogram augmentation
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self.spec_cutout_regions = SpecCutoutRegions(kwargs)
self.spec_augment = SpecAugment(kwargs)
@torch.no_grad()
def forward(self, input_spec):
augmented_spec = self.spec_cutout_regions(input_spec)
augmented_spec = self.spec_augment(augmented_spec)
return augmented_spec
class SpecAugment(nn.Module):
"""Spec augment. refer to https://arxiv.org/abs/1904.08779
"""
def __init__(self, cfg):
super(SpecAugment, self).__init__()
self.cutout_x_regions = cfg.get('cutout_x_regions', 0)
self.cutout_y_regions = cfg.get('cutout_y_regions', 0)
self.cutout_x_width = cfg.get('cutout_x_width', 10)
self.cutout_y_width = cfg.get('cutout_y_width', 10)
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool)
for idx in range(sh[0]):
for _ in range(self.cutout_x_regions):
cutout_x_left = int(random.uniform(0, sh[1] - self.cutout_x_width))
mask[idx, cutout_x_left:cutout_x_left + self.cutout_x_width, :] = 1
for _ in range(self.cutout_y_regions):
cutout_y_left = int(random.uniform(0, sh[2] - self.cutout_y_width))
mask[idx, :, cutout_y_left:cutout_y_left + self.cutout_y_width] = 1
x = x.masked_fill(mask.to(device=x.device), 0)
return x
class SpecCutoutRegions(nn.Module):
"""Cutout. refer to https://arxiv.org/pdf/1708.04552.pdf
"""
def __init__(self, cfg):
super(SpecCutoutRegions, self).__init__()
self.cutout_rect_regions = cfg.get('cutout_rect_regions', 0)
self.cutout_rect_time = cfg.get('cutout_rect_time', 5)
self.cutout_rect_freq = cfg.get('cutout_rect_freq', 20)
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape, dtype=torch.bool)
for idx in range(sh[0]):
for i in range(self.cutout_rect_regions):
cutout_rect_x = int(random.uniform(
0, sh[1] - self.cutout_rect_freq))
cutout_rect_y = int(random.uniform(
0, sh[2] - self.cutout_rect_time))
mask[idx, cutout_rect_x:cutout_rect_x + self.cutout_rect_freq,
cutout_rect_y:cutout_rect_y + self.cutout_rect_time] = 1
x = x.masked_fill(mask.to(device=x.device), 0)
return x
class JasperEncoder(nn.Module):
__constants__ = ["use_conv_mask"]
"""Jasper encoder
"""
def __init__(self, **kwargs):
cfg = {}
for key, value in kwargs.items():
cfg[key] = value
nn.Module.__init__(self)
self._cfg = cfg
activation = jasper_activations[cfg['encoder']['activation']]()
self.use_conv_mask = cfg['encoder'].get('convmask', False)
feat_in = cfg['input']['features'] * cfg['input'].get('frame_splicing', 1)
init_mode = cfg.get('init_mode', 'xavier_uniform')
residual_panes = []
encoder_layers = []
self.dense_residual = False
for lcfg in cfg['jasper']:
dense_res = []
if lcfg.get('residual_dense', False):
residual_panes.append(feat_in)
dense_res = residual_panes
self.dense_residual = True
encoder_layers.append(
JasperBlock(feat_in, lcfg['filters'], repeat=lcfg['repeat'],
kernel_size=lcfg['kernel'], stride=lcfg['stride'],
dilation=lcfg['dilation'], dropout=lcfg['dropout'],
residual=lcfg['residual'], activation=activation,
residual_panes=dense_res, use_conv_mask=self.use_conv_mask))
feat_in = lcfg['filters']
self.encoder = nn.Sequential(*encoder_layers)
self.apply(lambda x: init_weights(x, mode=init_mode))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, x):
if self.use_conv_mask:
audio_signal, length = x
return self.encoder(([audio_signal], length))
else:
return self.encoder([x])
class JasperDecoderForCTC(nn.Module):
"""Jasper decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self._feat_in = kwargs.get("feat_in")
self._num_classes = kwargs.get("num_classes")
init_mode = kwargs.get('init_mode', 'xavier_uniform')
self.decoder_layers = nn.Sequential(
nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),)
self.apply(lambda x: init_weights(x, mode=init_mode))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, encoder_output):
out = self.decoder_layers(encoder_output[-1]).transpose(1, 2)
return nn.functional.log_softmax(out, dim=2)
class JasperEncoderDecoder(nn.Module):
"""Contains jasper encoder and decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self.transpose_in=kwargs.get("transpose_in", False)
self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
num_classes=kwargs.get("num_classes"))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, x):
if self.jasper_encoder.use_conv_mask:
t_encoded_t, t_encoded_len_t = self.jasper_encoder(x)
else:
if self.transpose_in:
x = x.transpose(1, 2)
t_encoded_t = self.jasper_encoder(x)
out = self.jasper_decoder(t_encoded_t)
if self.jasper_encoder.use_conv_mask:
return out, t_encoded_len_t
else:
return out
def infer(self, x):
if self.jasper_encoder.use_conv_mask:
return self.forward(x)
else:
ret = self.forward(x[0])
return ret, len(ret)
class Jasper(JasperEncoderDecoder):
"""Contains data preprocessing, spectrogram augmentation, jasper encoder and decoder
"""
def __init__(self, **kwargs):
JasperEncoderDecoder.__init__(self, **kwargs)
feature_config = kwargs.get("feature_config")
if self.transpose_in:
feature_config["transpose"] = True
self.audio_preprocessor = AudioPreprocessing(**feature_config)
self.data_spectr_augmentation = SpectrogramAugmentation(**feature_config)
class MaskedConv1d(nn.Conv1d):
"""1D convolution with sequence masking
"""
__constants__ = ["use_conv_mask"]
def __init__(self, in_channels, out_channels, kernel_size, stride=1,
padding=0, dilation=1, groups=1, bias=False, use_conv_mask=True):
super(MaskedConv1d, self).__init__(in_channels, out_channels, kernel_size,
stride=stride,
padding=padding, dilation=dilation,
groups=groups, bias=bias)
self.use_conv_mask = use_conv_mask
def get_seq_len(self, lens):
return ((lens + 2 * self.padding[0] - self.dilation[0] * (
self.kernel_size[0] - 1) - 1) // self.stride[0] + 1)
def forward(self, inp):
if self.use_conv_mask:
x, lens = inp
max_len = x.size(2)
idxs = torch.arange(max_len).to(lens.dtype).to(lens.device).expand(len(lens), max_len)
mask = idxs >= lens.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
del mask
del idxs
lens = self.get_seq_len(lens)
return super(MaskedConv1d, self).forward(x), lens
else:
return super(MaskedConv1d, self).forward(inp)
class JasperBlock(nn.Module):
__constants__ = ["use_conv_mask", "conv"]
"""Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
"""
def __init__(self, inplanes, planes, repeat=3, kernel_size=11, stride=1,
dilation=1, padding='same', dropout=0.2, activation=None,
residual=True, residual_panes=[], use_conv_mask=False):
super(JasperBlock, self).__init__()
if padding != "same":
raise ValueError("currently only 'same' padding is supported")
padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
self.use_conv_mask = use_conv_mask
self.conv = nn.ModuleList()
inplanes_loop = inplanes
for _ in range(repeat - 1):
self.conv.extend(
self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
stride=stride, dilation=dilation,
padding=padding_val))
self.conv.extend(
self._get_act_dropout_layer(drop_prob=dropout, activation=activation))
inplanes_loop = planes
self.conv.extend(
self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
stride=stride, dilation=dilation,
padding=padding_val))
self.res = nn.ModuleList() if residual else None
res_panes = residual_panes.copy()
self.dense_residual = residual
if residual:
if len(residual_panes) == 0:
res_panes = [inplanes]
self.dense_residual = False
for ip in res_panes:
self.res.append(nn.ModuleList(
modules=self._get_conv_bn_layer(ip, planes, kernel_size=1)))
self.out = nn.Sequential(
*self._get_act_dropout_layer(drop_prob=dropout, activation=activation))
def _get_conv_bn_layer(self, in_channels, out_channels, kernel_size=11,
stride=1, dilation=1, padding=0, bias=False):
layers = [
MaskedConv1d(in_channels, out_channels, kernel_size, stride=stride,
dilation=dilation, padding=padding, bias=bias,
use_conv_mask=self.use_conv_mask),
nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)
]
return layers
def _get_act_dropout_layer(self, drop_prob=0.2, activation=None):
if activation is None:
activation = nn.Hardtanh(min_val=0.0, max_val=20.0)
layers = [
activation,
nn.Dropout(p=drop_prob)
]
return layers
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, input_):
if self.use_conv_mask:
xs, lens_orig = input_
else:
xs = input_
lens_orig = 0
# compute forward convolutions
out = xs[-1]
lens = lens_orig
for i, l in enumerate(self.conv):
if self.use_conv_mask and isinstance(l, MaskedConv1d):
out, lens = l((out, lens))
else:
out = l(out)
# compute the residuals
if self.res is not None:
for i, layer in enumerate(self.res):
res_out = xs[i]
for j, res_layer in enumerate(layer):
if j == 0 and self.use_conv_mask:
res_out, _ = res_layer((res_out, lens_orig))
else:
res_out = res_layer(res_out)
out += res_out
# compute the output
out = self.out(out)
if self.res is not None and self.dense_residual:
out = xs + [out]
else:
out = [out]
if self.use_conv_mask:
return out, lens
else:
return out
class GreedyCTCDecoder(nn.Module):
""" Greedy CTC Decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self) # For PyTorch API
@torch.no_grad()
def forward(self, log_probs):
argmx = log_probs.argmax(dim=-1, keepdim=False).int()
return argmx
class CTCLossNM:
""" CTC loss
"""
def __init__(self, **kwargs):
self._blank = kwargs['num_classes'] - 1
self._criterion = nn.CTCLoss(blank=self._blank, reduction='none')
def __call__(self, log_probs, targets, input_length, target_length):
input_length = input_length.long()
target_length = target_length.long()
targets = targets.long()
loss = self._criterion(log_probs.transpose(1, 0), targets, input_length,
target_length)
# note that this is different from reduction = 'mean'
# because we are not dividing by target lengths
return torch.mean(loss)

File diff suppressed because one or more lines are too long

View file

@ -1,274 +0,0 @@
{
"cells": [
{
"cell_type": "raw",
"metadata": {},
"source": [
"# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png style=\"width: 90px; float: right;\">\n",
"\n",
"# Jasper inference using TensorRT Inference Server\n",
"This Jupyter notebook provides scripts to deploy high-performance inference in NVIDIA TensorRT Inference Server offering different options for the model backend, among others NVIDIA TensorRT. \n",
"Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference. \n",
"NVIDIA TensorRT Inference Server provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server\n",
"NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.\n",
"## 1. Overview\n",
"\n",
"The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting strict real-time requirements of ASR systems in deployment.The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.\n",
"\n",
"The original paper is Jasper: An End-to-End Convolutional Neural Acoustic Model https://arxiv.org/pdf/1904.03288.pdf.\n",
"\n",
"### 1.1 Model architecture\n",
"By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.\n",
"Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout. \n",
"In the original paper Jasper is trained with masked convolutions, which masks out the padded part of an input sequence in a batch before the 1D-Convolution.\n",
"For inference masking is not used. The reason for this is that in inference, the original mask operation does not achieve better accuracy than without the mask operation on the test and development dataset. However, no masking achieves better inference performance especially after TensorRT optimization.\n",
"More information on the model architecture can be found in the [Jasper Pytorch README](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper)\n",
"\n",
"### 1.2 TensorRT Inference Server Overview\n",
"\n",
"A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:\n",
"1. Client serializes the inference request into a message and sends it to the server (Client Send)\n",
"2. Message travels over the network from the client to the server (Network)\n",
"3. Message arrives at server, and is deserialized (Server Receive)\n",
"4. Request is placed on the queue (Server Queue)\n",
"5. Request is removed from the queue and computed (Server Compute)\n",
"6. Completed request is serialized in a message and sent back to the client (Server Send)\n",
"7. Completed message travels over network from the server to the client (Network)\n",
"8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)\n",
"\n",
"Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time, compared to steps 5-6. As backend deep learning systems like Jasper are rarely exposed directly to end users, but instead only interfacing with local front-end servers, for the sake of Jasper, we can consider that all clients are local.\n",
"In this section, we will go over how to launch TensorRT Inference Server and client and get the best performant solution that fits your specific application needs.\n",
"\n",
"Note: The following instructions are run from outside the container and call `docker run` commands as required.\n",
"\n",
"### 1.3 Inference Pipeline in TensorRT Inference Server\n",
"The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend: \n",
"\n",
"**Data preprocessor**\n",
"\n",
"The data processor transforms an input raw audio file into a spectrogram. By default the pipeline uses mel filter banks as spectrogram features. This part does not have any learnable weights.\n",
"\n",
"**Acoustic model**\n",
"\n",
"The acoustic model takes in the spectrogram and outputs a probability over a list of characters. This part is the most compute intensive, taking more than 90% of the entire end-to-end pipeline. The acoustic model is the only component with learnable parameters and what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (More details in [Jasper PyTorch README](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/README.md)). We do not use masking for inference . \n",
"\n",
"**Greedy decoder**\n",
"\n",
"The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability. \n",
"\n",
"To run a model with TensorRT, we first construct the model in PyTorch, which is then exported into a ONNX static graph. Finally, a TensorRT engine is constructed from the ONNX file and can be launched to do inference. The following table shows which backends are supported for each part along the model pipeline.\n",
"\n",
"|Backend\\Pipeline component|Data preprocessor|Acoustic Model|Decoder|\n",
"|---|---|---|---|\n",
"|PyTorch JIT|x|x|x|\n",
"|ONNX|-|x|-|\n",
"|TensorRT|-|x|-|\n",
"\n",
"In order to run inference with TensorRT outside of the inference server, refer to the [Jasper TensorRT README](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/trt/README.md)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3 Learning objectives\n",
"\n",
"This notebook demonstrates:\n",
"- Speed up Jasper Inference with TensorRT in TensorRT Inference Server\n",
"- Use of Mixed Precision for Inference"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Requirements\n",
"\n",
"Please refer to Jasper TensorRT Inference Server README.md"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Jasper Inference\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.1 Prepare Working Directory"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"if not 'workbookDir' in globals():\n",
" workbookDir = os.getcwd() + \"/../\"\n",
"print('workbookDir: ' + workbookDir)\n",
"os.chdir(workbookDir)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Generate TRTIS Model Checkpoints\n",
"Use the PyTorch model checkpoint to generate all 3 model backends. You can find a pretrained checkpoint at https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16.\n",
"\n",
"Set the following parameters:\n",
"\n",
"* `ARCH`: hardware architecture. use 70 for Volta, 75 for Turing.\n",
"* `CHECKPOINT_DIR`: absolute path to model checkpoint directory.\n",
"* `CHECKPOINT`: model checkpoint name. (default: jasper10x5dr.pt)\n",
"* `PRECISION`: model precision. Default is using mixed precision.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env ARCH=70\n",
"# replace with absolute path to checkpoint directory, which should include CHECKPOINT file\n",
"%env CHECKPOINT_DIR=<CHECKPOINT_DIR> \n",
"# CHECKPOINT file name\n",
"%env CHECKPOINT=jasper_fp16.pt \n",
"%env PRECISION=fp16\n",
"!echo \"ARCH=${ARCH} CHECKPOINT_DIR=${CHECKPOINT_DIR} CHECKPOINT=${CHECKPOINT} PRECISION=${PRECISION} trtis/scripts/export_model.sh\"\n",
"!ARCH=${ARCH} CHECKPOINT_DIR=${CHECKPOINT_DIR} CHECKPOINT=${CHECKPOINT} PRECISION=${PRECISION} trtis/scripts/export_model.sh"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!bash trtis/scripts/prepare_model_repository.sh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.3 Start the TensorRT Inference Server using Docker"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!bash trtis/scripts/run_server.sh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.4. Start inference prediction in TRTIS\n",
"\n",
"Use the following script to run inference with TensorRT Inference Server.\n",
"You will need to set the parameters such as: \n",
"\n",
"\n",
"* `MODEL_TYPE`: Model pipeline type. Choose from [pyt, onnx, trt] for Pytorch JIT, ONNX, or TensorRT model pipeline.\n",
"* `DATA_DIR`: absolute path to directory with audio files\n",
"* `FILE`: relative path of audio file to `DATA_DIR`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_TYPE=\"trt\"\n",
"DATA_DIR=os.path.join(workbookDir, \"notebooks/\")\n",
"FILE=\"example1.wav\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!bash trtis/scripts/run_client.sh $MODEL_TYPE $DATA_DIR $FILE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can play with other examples from the 'notebooks' directory. You can also add your own audio files and generate the output text files in this way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.5. Stop your container in the end"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!docker stop jasper-trtis"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
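
The notebook above describes the greedy decoder as simply picking the character with the maximum probability in every frame; turning that frame-level output into a transcript additionally requires the standard CTC post-processing of collapsing repeated symbols and dropping blanks. Below is a minimal standalone sketch of that decoding step (illustrative only; the label set and blank index are hypothetical placeholders, not the repository's vocabulary).

```
# Greedy CTC decoding for a single utterance: per-frame argmax, then collapse
# repeats and remove the blank symbol. Not repository code; labels are made up.
import torch

labels = list("abcdefghijklmnopqrstuvwxyz' _")        # last symbol stands in for the CTC blank
blank = len(labels) - 1

def greedy_ctc_decode(log_probs):
    """log_probs: (T, C) per-frame log-probabilities for one utterance."""
    best = log_probs.argmax(dim=-1).tolist()           # most likely class per frame
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:               # collapse repeats, skip blanks
            out.append(labels[idx])
        prev = idx
    return "".join(out)

torch.manual_seed(0)
print(greedy_ctc_decode(torch.randn(20, len(labels)).log_softmax(dim=-1)))
```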

View file

@ -150,4 +150,54 @@ Use the token listed in the output from running the jupyter command to log in, f
## Jasper Jupyter Notebook for TensorRT Inference Server
This notebook can be executed from Google [Colab](https://colab.research.google.com) by supplying the notebook GitHub [URL](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/notebooks/Colab_Jasper_TRT_inference_demo.ipynb) or by opening this [link](https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/notebooks/Colab_Jasper_TRT_inference_demo.ipynb) directly.
### Requirements
`./trtis/` contains a Dockerfile which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) or [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [TensorRT Inference Server 19.09 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorrtserver)
* [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA cuda repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
### Quick Start Guide
#### 1. Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
#### 2. Build a container that extends NGC PyTorch 19.09, TensorRT, TensorRT Inference Server, and TensorRT Inference Client.
```
bash trtis/scripts/docker/build.sh
```
#### 3. Download the checkpoint
Download the checkpoint file jasper_fp16.pt from NGC Model Repository:
- https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16
to a user-specified directory _CHECKPOINT_DIR_
#### 4. Run the notebook
To run the notebook on your local machine, run:
```
jupyter notebook -- notebooks/JasperTRTIS.ipynb
```
To run the notebook remotely on another machine, run:
```
jupyter notebook --ip=0.0.0.0 --allow-root
```
Then navigate a web browser to the IP address or hostname of the host machine at port 8888: `http://[host machine]:8888`
Use the token listed in the output from running the jupyter command to log in, for example: `http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b`

View file

@ -1,368 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import math
import librosa
from .perturb import AudioAugmentor
from .segment import AudioSegment
from apex import amp
def audio_from_file(file_path, offset=0, duration=0, trim=False, target_sr=16000,
device=torch.device('cuda')):
audio = AudioSegment.from_file(file_path,
target_sr=target_sr,
int_values=False,
offset=offset, duration=duration, trim=trim)
samples=torch.tensor(audio.samples, dtype=torch.float, device=device)
num_samples = torch.tensor(samples.shape[0], device=device).int()
return (samples.unsqueeze(0), num_samples.unsqueeze(0))
class WaveformFeaturizer(object):
def __init__(self, input_cfg, augmentor=None):
self.augmentor = augmentor if augmentor is not None else AudioAugmentor()
self.cfg = input_cfg
def max_augmentation_length(self, length):
return self.augmentor.max_augmentation_length(length)
def process(self, file_path, offset=0, duration=0, trim=False):
audio = AudioSegment.from_file(file_path,
target_sr=self.cfg['sample_rate'],
int_values=self.cfg.get('int_values', False),
offset=offset, duration=duration, trim=trim)
return self.process_segment(audio)
def process_segment(self, audio_segment):
self.augmentor.perturb(audio_segment)
return torch.tensor(audio_segment.samples, dtype=torch.float)
@classmethod
def from_config(cls, input_config, perturbation_configs=None):
if perturbation_configs is not None:
aa = AudioAugmentor.from_config(perturbation_configs)
else:
aa = None
return cls(input_config, augmentor=aa)
# @torch.jit.script
# def normalize_batch_per_feature(x, seq_len):
# x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype, device=x.device)
# x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype, device=x.device)
# for i in range(x.shape[0]):
# x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
# x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
# # make sure x_std is not zero
# x_std += 1e-5
# return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
# @torch.jit.script
# def normalize_batch_all_features(x, seq_len):
# x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
# x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
# for i in range(x.shape[0]):
# x_mean[i] = x[i, :, :int(seq_len[i])].mean()
# x_std[i] = x[i, :, :int(seq_len[i])].std()
# # make sure x_std is not zero
# x_std += 1e-5
# return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
@torch.jit.script
def normalize_batch(x, seq_len, normalize_type: str):
# print ("normalize_batch: x, seq_len, shapes: ", x.shape, seq_len, seq_len.shape)
if normalize_type == "per_feature":
x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
for i in range(x.shape[0]):
x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
elif normalize_type == "all_features":
x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
for i in range(x.shape[0]):
x_mean[i] = x[i, :, :int(seq_len[i])].mean()
x_std[i] = x[i, :, :int(seq_len[i])].std()
# make sure x_std is not zero
x_std += 1e-5
return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
else:
return x
@torch.jit.script
def splice_frames(x, frame_splicing: int):
""" Stacks frames together across feature dim
input is batch_size, feature_dim, num_frames
output is batch_size, feature_dim*frame_splicing, num_frames
"""
seq = [x]
# TORCHSCRIPT: JIT doesn't like range(start, stop)
for n in range(frame_splicing - 1):
seq.append(torch.cat([x[:, :, :n + 1], x[:, :, n + 1:]], dim=2))
return torch.cat(seq, dim=1)
class SpectrogramFeatures(nn.Module):
# For JIT. See https://pytorch.org/docs/stable/jit.html#python-defined-constants
__constants__ = ["dither", "preemph", "n_fft", "hop_length", "win_length", "log", "frame_splicing", "window", "normalize", "pad_to", "max_duration", "do_normalize"]
def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
n_fft=None,
window="hamming", normalize="per_feature", log=True,
dither=1e-5, pad_to=8, max_duration=16.7,
frame_splicing=1):
super(SpectrogramFeatures, self).__init__()
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.win_length = int(sample_rate * window_size)
self.hop_length = int(sample_rate * window_stride)
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
self.window = window_tensor
self.normalize = normalize
self.log = log
self.dither = dither
self.pad_to = pad_to
self.frame_splicing = frame_splicing
max_length = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
max_pad = 16 - (max_length % 16)
self.max_length = max_length + max_pad
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
@torch.no_grad()
def forward(self, x, seq_len):
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if hasattr(self,'preemph') and self.preemph is not None:
x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
dim=1)
# get spectrogram
x = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length,
window=self.window.to(torch.float))
x = torch.sqrt(x.pow(2).sum(-1))
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
x = splice_frames(x, self.frame_splicing)
# normalize if required
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len, dtype=seq_len.dtype).to(seq_len.device).expand(x.size(0), max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
# TORCHSCRIPT: Is this del important? It breaks scripting
#del mask
pad_to = self.pad_to
# TORCHSCRIPT: Can't have mixed types. Using pad_to < 0 for "max"
if pad_to < 0:
x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
elif pad_to > 0:
pad_amt = x.size(-1) % pad_to
if pad_amt != 0:
x = nn.functional.pad(x, (0, pad_to - pad_amt))
return x.to(dtype)
@classmethod
def from_config(cls, cfg, log=False):
return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
window_stride=cfg['window_stride'],
n_fft=cfg['n_fft'], window=cfg['window'],
normalize=cfg['normalize'],
max_duration=cfg.get('max_duration', 16.7),
dither=cfg.get('dither', 1e-5), pad_to=cfg.get("pad_to", 0),
frame_splicing=cfg.get("frame_splicing", 1), log=log)
constant=1e-5
class FilterbankFeatures(nn.Module):
# For JIT. See https://pytorch.org/docs/stable/jit.html#python-defined-constants
__constants__ = ["dither", "preemph", "n_fft", "hop_length", "win_length", "center", "log", "frame_splicing", "window", "normalize", "pad_to", "max_duration", "max_length"]
def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
window="hamming", normalize="per_feature", n_fft=None,
preemph=0.97,
nfilt=64, lowfreq=0, highfreq=None, log=True, dither=constant,
pad_to=8,
max_duration=16.7,
frame_splicing=1):
super(FilterbankFeatures, self).__init__()
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.win_length = int(sample_rate * window_size) # frame size
self.hop_length = int(sample_rate * window_stride)
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
self.normalize = normalize
self.log = log
# TORCHSCRIPT: Check whether or not we need this
self.dither = dither
self.frame_splicing = frame_splicing
self.nfilt = nfilt
self.preemph = preemph
self.pad_to = pad_to
highfreq = highfreq or sample_rate / 2
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
filterbanks = torch.tensor(
librosa.filters.mel(sample_rate, self.n_fft, n_mels=nfilt, fmin=lowfreq,
fmax=highfreq), dtype=torch.float).unsqueeze(0)
# self.fb = filterbanks
# self.window = window_tensor
self.register_buffer("fb", filterbanks)
self.register_buffer("window", window_tensor)
# Calculate maximum sequence length (# frames)
max_length = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
max_pad = 16 - (max_length % 16)
self.max_length = max_length + max_pad
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
# do stft
# TORCHSCRIPT: center removed due to bug
def stft(self, x):
return torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length,
window=self.window.to(dtype=torch.float))
def forward(self, x, seq_len):
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if self.preemph is not None:
x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
dim=1)
x = self.stft(x)
# get power spectrum
x = x.pow(2).sum(-1)
# dot with filterbank energies
x = torch.matmul(self.fb.to(x.dtype), x)
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
x = splice_frames(x, self.frame_splicing)
# normalize if required
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len, dtype=seq_len.dtype).to(x.device).expand(x.size(0),
max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1), 0)
# TORCHSCRIPT: Is this del important? It breaks scripting
# del mask
# TORCHSCRIPT: Can't have mixed types. Using pad_to < 0 for "max"
if self.pad_to < 0:
x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
elif self.pad_to > 0:
pad_amt = x.size(-1) % self.pad_to
# if pad_amt != 0:
x = nn.functional.pad(x, (0, self.pad_to - pad_amt))
return x # .to(dtype)
@classmethod
def from_config(cls, cfg, log=False):
return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
window_stride=cfg['window_stride'], n_fft=cfg['n_fft'],
nfilt=cfg['features'], window=cfg['window'],
normalize=cfg['normalize'],
max_duration=cfg.get('max_duration', 16.7),
dither=cfg['dither'], pad_to=cfg.get("pad_to", 0),
frame_splicing=cfg.get("frame_splicing", 1), log=log)
class FeatureFactory(object):
featurizers = {
"logfbank": FilterbankFeatures,
"fbank": FilterbankFeatures,
"stft": SpectrogramFeatures,
"logspect": SpectrogramFeatures,
"logstft": SpectrogramFeatures
}
def __init__(self):
pass
@classmethod
def from_config(cls, cfg):
feat_type = cfg.get('feat_type', "logspect")
featurizer = cls.featurizers[feat_type]
#return featurizer.from_config(cfg, log="log" in cfg['feat_type'])
return featurizer.from_config(cfg, log="log" in feat_type)
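
The deleted `FilterbankFeatures` above chains dithering, preemphasis, an STFT, a power spectrum, a mel filterbank, a log, and per-feature normalization. A compact NumPy/librosa sketch of the same data flow may be easier to follow than the TorchScript-constrained module; it is illustrative only, uses the module's default parameter values, and assumes the librosa 0.8-era call signatures that the original file relies on.

```
# Minimal sketch of the log-mel pipeline implemented by FilterbankFeatures above
# (illustrative only; not repository code).
import numpy as np
import librosa

def log_mel_features(samples, sample_rate=16000, window_size=0.02,
                     window_stride=0.01, n_mels=64, preemph=0.97, dither=1e-5):
    win_length = int(sample_rate * window_size)
    hop_length = int(sample_rate * window_stride)
    n_fft = 2 ** int(np.ceil(np.log2(win_length)))

    x = samples + dither * np.random.randn(len(samples))           # dither
    x = np.concatenate([x[:1], x[1:] - preemph * x[:-1]])          # preemphasis
    spec = librosa.stft(x, n_fft=n_fft, hop_length=hop_length,
                        win_length=win_length, window='hamming')
    power = np.abs(spec) ** 2                                       # power spectrum
    fb = librosa.filters.mel(sample_rate, n_fft, n_mels=n_mels)     # mel filterbank (0.8-era API)
    log_mel = np.log(fb @ power + 1e-20)
    # per-feature normalization over time, as in normalize_batch(..., "per_feature")
    mean = log_mel.mean(axis=1, keepdims=True)
    std = log_mel.std(axis=1, keepdims=True) + 1e-5
    return (log_mel - mean) / std                                    # (n_mels, frames)

# Toy usage on one second of random noise.
feats = log_mel_features(np.random.randn(16000).astype(np.float32))
print(feats.shape)
```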

View file

@ -1,170 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import re
import string
import numpy as np
import os
from .text import _clean_text
def normalize_string(s, labels, table, **unused_kwargs):
"""
Normalizes string. For example:
'call me at 8:00 pm!' -> 'call me at eight zero pm'
Args:
s: string to normalize
labels: labels used during model training.
Returns:
Normalized string
"""
def good_token(token, labels):
s = set(labels)
for t in token:
if not t in s:
return False
return True
try:
text = _clean_text(s, ["english_cleaners"], table).strip()
return ''.join([t for t in text if good_token(t, labels=labels)])
except:
print("WARNING: Normalizing {} failed".format(s))
return None
class Manifest(object):
def __init__(self, data_dir, manifest_paths, labels, blank_index, max_duration=None, pad_to_max=False,
min_duration=None, sort_by_duration=False, max_utts=0,
normalize=True, speed_perturbation=False, filter_speed=1.0):
self.labels_map = dict([(labels[i], i) for i in range(len(labels))])
self.blank_index = blank_index
self.max_duration= max_duration
ids = []
duration = 0.0
filtered_duration = 0.0
# If removing punctuation, make a list of punctuation to remove
table = None
if normalize:
# Punctuation to remove
punctuation = string.punctuation
punctuation = punctuation.replace("+", "")
punctuation = punctuation.replace("&", "")
### We might also want to consider:
### @ -> at
### # -> number, pound, hashtag
### ~ -> tilde
### _ -> underscore
### % -> percent
# If a punctuation symbol is inside our vocab, we do not remove from text
for l in labels:
punctuation = punctuation.replace(l, "")
# Turn all punctuation to whitespace
table = str.maketrans(punctuation, " " * len(punctuation))
for manifest_path in manifest_paths:
with open(manifest_path, "r", encoding="utf-8") as fh:
a=json.load(fh)
for data in a:
files_and_speeds = data['files']
if pad_to_max:
if not speed_perturbation:
min_speed = filter_speed
else:
min_speed = min(x['speed'] for x in files_and_speeds)
max_duration = self.max_duration * min_speed
data['duration'] = data['original_duration']
if min_duration is not None and data['duration'] < min_duration:
filtered_duration += data['duration']
continue
if max_duration is not None and data['duration'] > max_duration:
filtered_duration += data['duration']
continue
# Prune and normalize according to transcript
transcript_text = data[
'transcript'] if "transcript" in data else self.load_transcript(
data['text_filepath'])
if normalize:
transcript_text = normalize_string(transcript_text, labels=labels,
table=table)
if not isinstance(transcript_text, str):
print(
"WARNING: Got transcript: {}. It is not a string. Dropping data point".format(
transcript_text))
filtered_duration += data['duration']
continue
data["transcript"] = self.parse_transcript(transcript_text) # convert to vocab indices
if speed_perturbation:
audio_paths = [x['fname'] for x in files_and_speeds]
data['audio_duration'] = [x['duration'] for x in files_and_speeds]
else:
audio_paths = [x['fname'] for x in files_and_speeds if x['speed'] == filter_speed]
data['audio_duration'] = [x['duration'] for x in files_and_speeds if x['speed'] == filter_speed]
data['audio_filepath'] = [os.path.join(data_dir, x) for x in audio_paths]
data.pop('files')
data.pop('original_duration')
ids.append(data)
duration += data['duration']
if max_utts > 0 and len(ids) >= max_utts:
print(
'Stopping parsing %s as max_utts=%d' % (manifest_path, max_utts))
break
if sort_by_duration:
ids = sorted(ids, key=lambda x: x['duration'])
self._data = ids
self._size = len(ids)
self._duration = duration
self._filtered_duration = filtered_duration
def load_transcript(self, transcript_path):
with open(transcript_path, 'r', encoding="utf-8") as transcript_file:
transcript = transcript_file.read().replace('\n', '')
return transcript
def parse_transcript(self, transcript):
chars = [self.labels_map.get(x, self.blank_index) for x in list(transcript)]
transcript = list(filter(lambda x: x != self.blank_index, chars))
return transcript
def __getitem__(self, item):
return self._data[item]
def __len__(self):
return self._size
def __iter__(self):
return iter(self._data)
@property
def duration(self):
return self._duration
@property
def filtered_duration(self):
return self._filtered_duration
@property
def data(self):
return list(self._data)
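
The punctuation handling in `Manifest` above is easy to misread: punctuation characters that belong to the training vocabulary are kept, everything else is mapped to whitespace, and `normalize_string` then drops any remaining character outside the vocabulary. Below is a small standalone sketch (not the repository code; the vocabulary is a hypothetical placeholder, and a simple lowercase-and-collapse step stands in for `_clean_text` with `english_cleaners`).

```
# How the punctuation table and vocabulary filtering above work together.
import string

labels = list("abcdefghijklmnopqrstuvwxyz' ")          # hypothetical training vocabulary
vocab = set(labels)

punctuation = string.punctuation.replace("+", "").replace("&", "")  # as in Manifest, keep "+" and "&"
for l in labels:                                        # keep punctuation that is in the vocabulary
    punctuation = punctuation.replace(l, "")
table = str.maketrans(punctuation, " " * len(punctuation))

def normalize(text):
    text = text.lower().translate(table)                # stand-in for _clean_text(..., table)
    text = " ".join(text.split())                        # collapse whitespace
    return "".join(t for t in text if t in vocab)

print(normalize("Don't stop!"))                          # -> "don't stop"
```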

View file

@ -1,111 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import librosa
from .manifest import Manifest
from .segment import AudioSegment
class Perturbation(object):
def max_augmentation_length(self, length):
return length
def perturb(self, data):
raise NotImplementedError
class SpeedPerturbation(Perturbation):
def __init__(self, min_speed_rate=0.85, max_speed_rate=1.15, rng=None):
self._min_rate = min_speed_rate
self._max_rate = max_speed_rate
self._rng = random.Random() if rng is None else rng
def max_augmentation_length(self, length):
return length * self._max_rate
def perturb(self, data):
speed_rate = self._rng.uniform(self._min_rate, self._max_rate)
if speed_rate <= 0:
raise ValueError("speed_rate should be greater than zero.")
data._samples = librosa.effects.time_stretch(data._samples, speed_rate)
class GainPerturbation(Perturbation):
def __init__(self, min_gain_dbfs=-10, max_gain_dbfs=10, rng=None):
self._min_gain_dbfs = min_gain_dbfs
self._max_gain_dbfs = max_gain_dbfs
self._rng = random.Random() if rng is None else rng
def perturb(self, data):
gain = self._rng.uniform(self._min_gain_dbfs, self._max_gain_dbfs)
data._samples = data._samples * (10. ** (gain / 20.))
class ShiftPerturbation(Perturbation):
def __init__(self, min_shift_ms=-5.0, max_shift_ms=5.0, rng=None):
self._min_shift_ms = min_shift_ms
self._max_shift_ms = max_shift_ms
self._rng = random.Random() if rng is None else rng
def perturb(self, data):
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
if abs(shift_ms) / 1000 > data.duration:
# TODO: do something smarter than just ignore this condition
return
shift_samples = int(shift_ms * data.sample_rate // 1000)
# print("DEBUG: shift:", shift_samples)
if shift_samples < 0:
data._samples[-shift_samples:] = data._samples[:shift_samples]
data._samples[:-shift_samples] = 0
elif shift_samples > 0:
data._samples[:-shift_samples] = data._samples[shift_samples:]
data._samples[-shift_samples:] = 0
perturbation_types = {
"speed": SpeedPerturbation,
"gain": GainPerturbation,
"shift": ShiftPerturbation,
}
class AudioAugmentor(object):
def __init__(self, perturbations=None, rng=None):
self._rng = random.Random() if rng is None else rng
self._pipeline = perturbations if perturbations is not None else []
def perturb(self, segment):
for (prob, p) in self._pipeline:
if self._rng.random() < prob:
p.perturb(segment)
return
def max_augmentation_length(self, length):
newlen = length
for (prob, p) in self._pipeline:
newlen = p.max_augmentation_length(newlen)
return newlen
@classmethod
def from_config(cls, config):
ptbs = []
for p in config:
if p['aug_type'] not in perturbation_types:
print(p['aug_type'], "perturbation not known. Skipping.")
continue
perturbation = perturbation_types[p['aug_type']]
ptbs.append((p['prob'], perturbation(**p['cfg'])))
return cls(perturbations=ptbs)
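
Each perturbation above is applied independently with a configured probability, and the gain perturbation converts a random offset in dBFS into a linear amplitude factor of 10 ** (gain / 20). The following standalone sketch shows that probability-gated pipeline (illustrative only, not the repository code; the shift variant here rolls the signal instead of zero-filling as the original does).

```
# Probability-gated augmentation pipeline in the spirit of AudioAugmentor above.
import random
import numpy as np

def gain_perturb(samples, min_gain_dbfs=-10, max_gain_dbfs=10):
    gain_db = random.uniform(min_gain_dbfs, max_gain_dbfs)
    return samples * (10.0 ** (gain_db / 20.0))          # e.g. +6 dB is roughly a 2x amplitude

def shift_perturb(samples, sample_rate=16000, min_shift_ms=-5.0, max_shift_ms=5.0):
    shift = int(random.uniform(min_shift_ms, max_shift_ms) * sample_rate // 1000)
    return np.roll(samples, shift)                       # simplified: roll instead of zero-fill

pipeline = [(0.5, gain_perturb), (0.5, shift_perturb)]

samples = np.random.randn(16000).astype(np.float32)      # one second of dummy audio
for prob, perturb in pipeline:                            # mirrors AudioAugmentor.perturb
    if random.random() < prob:
        samples = perturb(samples)
```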

View file

@ -1,12 +0,0 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """
import re
from . import cleaners
def _clean_text(text, cleaner_names, *args):
for name in cleaner_names:
cleaner = getattr(cleaners, name)
if not cleaner:
raise Exception('Unknown cleaner: %s' % name)
text = cleaner(text, *args)
return text

View file

@ -1,3 +1,3 @@
#!/bin/bash
NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRADIENT_ACCUMULATION_STEPS=2 bash scripts/train.sh "$@"
NUM_GPUS=8 AMP=true BATCH_SIZE=64 GRADIENT_ACCUMULATION_STEPS=4 bash scripts/train.sh "$@"

View file

@ -1,9 +1,10 @@
pandas==0.24.2
tqdm==4.31.1
ascii-graph==1.5.1
wrapt==1.10.11
librosa
toml
soundfile
ipdb
sox
librosa==0.8.0
pandas==1.1.4
pycuda==2020.1
pyyaml
soundfile
sox==1.4.1
tqdm==4.53.0
wrapt==1.10.11

View file

@ -1,26 +1,30 @@
#!/bin/bash
SCRIPT_DIR=$(cd $(dirname $0); pwd)
JASPER_REPO=${JASPER_REPO:-"${SCRIPT_DIR}/../.."}
: ${JASPER_REPO:="$SCRIPT_DIR/../.."}
# Launch TRT JASPER container.
: ${DATA_DIR:=${1:-"$JASPER_REPO/datasets"}}
: ${CHECKPOINT_DIR:=${2:-"$JASPER_REPO/checkpoints"}}
: ${OUTPUT_DIR:=${3:-"$JASPER_REPO/results"}}
: ${SCRIPT:=${4:-}}
DATA_DIR=${1:-${DATA_DIR-"/datasets"}}
CHECKPOINT_DIR=${2:-${CHECKPOINT_DIR:-"/checkpoints"}}
RESULT_DIR=${3:-${RESULT_DIR:-"/results"}}
PROGRAM_PATH=${PROGRAM_PATH}
mkdir -p $DATA_DIR
mkdir -p $CHECKPOINT_DIR
mkdir -p $OUTPUT_DIR
MOUNTS=""
MOUNTS+=" -v $DATA_DIR:/datasets"
MOUNTS+=" -v $CHECKPOINT_DIR:/checkpoints"
MOUNTS+=" -v $RESULT_DIR:/results"
MOUNTS+=" -v ${JASPER_REPO}:/jasper"
MOUNTS+=" -v $OUTPUT_DIR:/results"
MOUNTS+=" -v $JASPER_REPO:/workspace/jasper"
echo $MOUNTS
nvidia-docker run -it --rm \
--runtime=nvidia \
docker run -it --rm --gpus all \
--env PYTHONDONTWRITEBYTECODE=1 \
--shm-size=4g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
${MOUNTS} \
${EXTRA_JASPER_ENV} \
jasper:latest bash $PROGRAM_PATH
$MOUNTS \
$EXTRA_JASPER_ENV \
-w /workspace/jasper \
jasper:latest bash $SCRIPT

View file

@ -14,58 +14,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
set -a
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
: ${PREDICTION_FILE:=}
: ${DATASET:="test-other"}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${7:-${CUDNN_BENCHMARK:-"false"}}
NUM_GPUS=${8:-${NUM_GPUS:-1}}
AMP=${9:-${AMP:-"false"}}
NUM_STEPS=${10:-${NUM_STEPS:-"-1"}}
SEED=${11:-${SEED:-0}}
BATCH_SIZE=${12:-${BATCH_SIZE:-64}}
mkdir -p "$RESULT_DIR"
CMD=" inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
CMD+=" --ckpt $CHECKPOINT "
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
if [ "$NUM_GPUS" -gt 1 ] ; then
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
CMD="python3 $CMD"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
fi
set +x
bash ./scripts/inference.sh "$@"

View file

@ -14,66 +14,48 @@
# See the License for the specific language governing permissions and
# limitations under the License.
: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${CHECKPOINT:=${4:-"/checkpoints/jasper_fp16.pt"}}
: ${DATASET:="test-other"}
: ${LOG_FILE:=""}
: ${CUDNN_BENCHMARK:=false}
: ${MAX_DURATION:=""}
: ${PAD_TO_MAX_DURATION:=false}
: ${NUM_GPUS:=1}
: ${NUM_STEPS:=0}
: ${NUM_WARMUP_STEPS:=0}
: ${AMP:=false}
: ${BATCH_SIZE:=64}
: ${EMA:=true}
: ${SEED:=0}
: ${DALI_DEVICE:="gpu"}
: ${CPU:=false}
: ${LOGITS_FILE:=}
: ${PREDICTION_FILE:="${OUTPUT_DIR}/${DATASET}.predictions"}
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
mkdir -p "$OUTPUT_DIR"
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${7:-${CUDNN_BENCHMARK:-"false"}}
AMP=${8:-${AMP:-"false"}}
NUM_STEPS=${9:-${NUM_STEPS:-"-1"}}
SEED=${10:-${SEED:-0}}
BATCH_SIZE=${11:-${BATCH_SIZE:-64}}
LOGITS_FILE=${12:-${LOGITS_FILE:-""}}
PREDICTION_FILE=${13:-${PREDICTION_FILE:-"${RESULT_DIR}/${DATASET}.predictions"}}
CPU=${14:-${CPU:-"false"}}
EMA=${14:-${EMA:-"false"}}
ARGS="--dataset_dir=$DATA_DIR"
ARGS+=" --val_manifest=$DATA_DIR/librispeech-${DATASET}-wav.json"
ARGS+=" --model_config=$MODEL_CONFIG"
ARGS+=" --output_dir=$OUTPUT_DIR"
ARGS+=" --batch_size=$BATCH_SIZE"
ARGS+=" --seed=$SEED"
ARGS+=" --dali_device=$DALI_DEVICE"
ARGS+=" --steps $NUM_STEPS"
ARGS+=" --warmup_steps $NUM_WARMUP_STEPS"
mkdir -p "$RESULT_DIR"
[ "$AMP" = true ] && ARGS+=" --amp"
[ "$EMA" = true ] && ARGS+=" --ema"
[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=${CHECKPOINT}"
[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
[ -n "$PREDICTION_FILE" ] && ARGS+=" --save_prediction $PREDICTION_FILE"
[ -n "$LOGITS_FILE" ] && ARGS+=" --logits_save_to $LOGITS_FILE"
[ "$CPU" == "true" ] && ARGS+=" --cpu"
[ -n "$MAX_DURATION" ] && ARGS+=" --max_duration $MAX_DURATION"
[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --pad_to_max_duration"
CMD="python inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$CPU" == "true" ] && \
CMD+=" --cpu"
[ "$EMA" == "true" ] && \
CMD+=" --ema"
[ -n "$CHECKPOINT" ] && \
CMD+=" --ckpt=${CHECKPOINT}"
[ -n "$PREDICTION_FILE" ] && \
CMD+=" --save_prediction $PREDICTION_FILE"
[ -n "$LOGITS_FILE" ] && \
CMD+=" --logits_save_to $LOGITS_FILE"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
fi
set +x
[ -n "$PREDICTION_FILE" ] && echo "PREDICTION_FILE: ${PREDICTION_FILE}"
[ -n "$LOGITS_FILE" ] && echo "LOGITS_FILE: ${LOGITS_FILE}"
python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS inference.py $ARGS
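A minimal example of driving the refactored script through its environment-variable interface (values are illustrative only, not defaults mandated by this change):
```bash
# Evaluate the EMA checkpoint on test-other with AMP on a single GPU
DATASET=test-other \
BATCH_SIZE=32 \
AMP=true \
CHECKPOINT=/checkpoints/jasper_fp16.pt \
bash ./scripts/inference.sh
```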

View file

@ -14,55 +14,24 @@
# See the License for the specific language governing permissions and
# limitations under the License.
set -a
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${CUDNN_BENCHMARK:=true}
: ${PAD_TO_MAX_DURATION:=true}
: ${NUM_WARMUP_STEPS:=10}
: ${NUM_STEPS:=500}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${7:-${CUDNN_BENCHMARK:-"true"}}
AMP=${8:-${AMP:-"false"}}
NUM_STEPS=${9:-${NUM_STEPS:-"-1"}}
MAX_DURATION=${10:-${MAX_DURATION:-"36"}}
SEED=${11:-${SEED:-0}}
BATCH_SIZE=${12:-${BATCH_SIZE:-64}}
: ${AMP:=false}
: ${DALI_DEVICE:="cpu"}
: ${BATCH_SIZE_SEQ:="1 2 4 8 16"}
: ${MAX_DURATION_SEQ:="2 7 16.7"}
mkdir -p "$RESULT_DIR"
for MAX_DURATION in $MAX_DURATION_SEQ; do
for BATCH_SIZE in $BATCH_SIZE_SEQ; do
CMD=" python inference_benchmark.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --seed=$SEED"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt=$CHECKPOINT"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to=-1"
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
LOG_FILE="$OUTPUT_DIR/perf-infer_dali-${DALI_DEVICE}_amp-${AMP}_dur${MAX_DURATION}_bs${BATCH_SIZE}.json"
bash ./scripts/inference.sh "$@"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE )
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
grep 'latency' "$LOGFILE"
fi
set +x
done
done

View file

@ -1,66 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
CUDA_VISIBLE_DEVICES=""
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
DATASET=${2:-${DATASET:-"dev-clean"}}
MODEL_CONFIG=${3:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${4:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${5:-${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}}
CREATE_LOGFILE=${6:-${CREATE_LOGFILE:-"true"}}
NUM_STEPS=${7:-${NUM_STEPS:-"-1"}}
MAX_DURATION=${8:-${MAX_DURATION:-"36"}}
SEED=${9:-${SEED:-0}}
BATCH_SIZE=${10:-${BATCH_SIZE:-32}}
SAMPLE_AUDIO=${11:-${SAMPLE_AUDIO:-"/datasets/LibriSpeech/dev-clean-wav/1272/128104/1272-128104-0000.wav"}}
mkdir -p "$RESULT_DIR"
CMD=" python inference_benchmark.py"
CMD+=" --cpu"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --seed=$SEED"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt=$CHECKPOINT"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to=-1"
CMD+=" --sample_audio=$SAMPLE_AUDIO"
[ "$NUM_STEPS" -gt 0 ] && \
CMD+=" --steps $NUM_STEPS"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE )
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
grep 'latency' "$LOGFILE"
fi
set +x

View file

@ -14,21 +14,24 @@
#!/usr/bin/env bash
SPEEDS=$1
[ -n "$SPEEDS" ] && SPEED_FLAG="--speed $SPEEDS"
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-clean-100 \
--dest_dir /datasets/LibriSpeech/train-clean-100-wav \
--output_json /datasets/LibriSpeech/librispeech-train-clean-100-wav.json \
--speed 0.9 1.1
$SPEED_FLAG
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-clean-360 \
--dest_dir /datasets/LibriSpeech/train-clean-360-wav \
--output_json /datasets/LibriSpeech/librispeech-train-clean-360-wav.json \
--speed 0.9 1.1
$SPEED_FLAG
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-other-500 \
--dest_dir /datasets/LibriSpeech/train-other-500-wav \
--output_json /datasets/LibriSpeech/librispeech-train-other-500-wav.json \
--speed 0.9 1.1
$SPEED_FLAG
python ./utils/convert_librispeech.py \

View file

@ -14,70 +14,74 @@
# See the License for the specific language governing permissions and
# limitations under the License.
export OMP_NUM_THREADS=1
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
: ${MODEL_CONFIG:=${2:-"configs/jasper10x5dr_speedp-online_speca.yaml"}}
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${CHECKPOINT:=${4:-}}
: ${RESUME:=true}
: ${CUDNN_BENCHMARK:=true}
: ${NUM_GPUS:=8}
: ${AMP:=false}
: ${BATCH_SIZE:=64}
: ${GRAD_ACCUMULATION_STEPS:=2}
: ${LEARNING_RATE:=0.01}
: ${MIN_LEARNING_RATE:=0.00001}
: ${LR_POLICY:=exponential}
: ${LR_EXP_GAMMA:=0.981}
: ${EMA:=0.999}
: ${SEED:=0}
: ${EPOCHS:=440}
: ${WARMUP_EPOCHS:=2}
: ${HOLD_EPOCHS:=140}
: ${SAVE_FREQUENCY:=10}
: ${EPOCHS_THIS_JOB:=0}
: ${DALI_DEVICE:="gpu"}
: ${PAD_TO_MAX_DURATION:=false}
: ${EVAL_FREQUENCY:=544}
: ${PREDICTION_FREQUENCY:=544}
: ${TRAIN_MANIFESTS:="$DATA_DIR/librispeech-train-clean-100-wav.json \
$DATA_DIR/librispeech-train-clean-360-wav.json \
$DATA_DIR/librispeech-train-other-500-wav.json"}
: ${VAL_MANIFESTS:="$DATA_DIR/librispeech-dev-clean-wav.json"}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
MODEL_CONFIG=${2:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${3:-${RESULT_DIR:-"/results"}}
CHECKPOINT=${4:-${CHECKPOINT:-""}}
CREATE_LOGFILE=${5:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${6:-${CUDNN_BENCHMARK:-"true"}}
NUM_GPUS=${7:-${NUM_GPUS:-8}}
AMP=${8:-${AMP:-"false"}}
EPOCHS=${9:-${EPOCHS:-400}}
SEED=${10:-${SEED:-6}}
BATCH_SIZE=${11:-${BATCH_SIZE:-64}}
LEARNING_RATE=${12:-${LEARNING_RATE:-"0.015"}}
GRADIENT_ACCUMULATION_STEPS=${13:-${GRADIENT_ACCUMULATION_STEPS:-2}}
EMA=${EMA:-0.999}
SAVE_FREQUENCY=${SAVE_FREQUENCY:-10}
mkdir -p "$OUTPUT_DIR"
mkdir -p "$RESULT_DIR"
ARGS="--dataset_dir=$DATA_DIR"
ARGS+=" --val_manifests $VAL_MANIFESTS"
ARGS+=" --train_manifests $TRAIN_MANIFESTS"
ARGS+=" --model_config=$MODEL_CONFIG"
ARGS+=" --output_dir=$OUTPUT_DIR"
ARGS+=" --lr=$LEARNING_RATE"
ARGS+=" --batch_size=$BATCH_SIZE"
ARGS+=" --min_lr=$MIN_LEARNING_RATE"
ARGS+=" --lr_policy=$LR_POLICY"
ARGS+=" --lr_exp_gamma=$LR_EXP_GAMMA"
ARGS+=" --epochs=$EPOCHS"
ARGS+=" --warmup_epochs=$WARMUP_EPOCHS"
ARGS+=" --hold_epochs=$HOLD_EPOCHS"
ARGS+=" --epochs_this_job=$EPOCHS_THIS_JOB"
ARGS+=" --ema=$EMA"
ARGS+=" --seed=$SEED"
ARGS+=" --optimizer=novograd"
ARGS+=" --weight_decay=1e-3"
ARGS+=" --save_frequency=$SAVE_FREQUENCY"
ARGS+=" --keep_milestones 100 200 300 400"
ARGS+=" --save_best_from=380"
ARGS+=" --log_frequency=1"
ARGS+=" --eval_frequency=$EVAL_FREQUENCY"
ARGS+=" --prediction_frequency=$PREDICTION_FREQUENCY"
ARGS+=" --grad_accumulation_steps=$GRAD_ACCUMULATION_STEPS "
ARGS+=" --dali_device=$DALI_DEVICE"
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
CMD+=" train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=$EPOCHS"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --ema=$EMA"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json"
CMD+=",$DATA_DIR/librispeech-train-clean-360-wav.json"
CMD+=",$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=$SAVE_FREQUENCY"
CMD+=" --eval_freq=100"
CMD+=" --train_freq=1"
CMD+=" --lr_decay"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS "
[ "$AMP" = true ] && ARGS+=" --amp"
[ "$RESUME" = true ] && ARGS+=" --resume"
[ "$CUDNN_BENCHMARK" = true ] && ARGS+=" --cudnn_benchmark"
[ "$PAD_TO_MAX_DURATION" = true ] && ARGS+=" --pad_to_max_duration"
[ -n "$CHECKPOINT" ] && ARGS+=" --ckpt=$CHECKPOINT"
[ -n "$LOG_FILE" ] && ARGS+=" --log_file $LOG_FILE"
[ -n "$PRE_ALLOCATE" ] && ARGS+=" --pre_allocate_range $PRE_ALLOCATE"
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
[ -n "$CHECKPOINT" ] && \
CMD+=" --ckpt=${CHECKPOINT}"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULT_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee $LOGFILE
fi
set +x
DISTRIBUTED="-m torch.distributed.launch --nproc_per_node=$NUM_GPUS"
python $DISTRIBUTED train.py $ARGS
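For reference, the refactored training script is configured the same way; a hedged example overriding a few of the environment defaults above (values illustrative only):
```bash
# Train with AMP on 4 GPUs, compensating with larger gradient accumulation
NUM_GPUS=4 \
AMP=true \
BATCH_SIZE=64 \
GRAD_ACCUMULATION_STEPS=4 \
bash ./scripts/train.sh
```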

View file

@ -14,100 +14,36 @@
# See the License for the specific language governing permissions and
# limitations under the License.
set -a
echo "NVIDIA container build: ${NVIDIA_BUILD_ID}"
# measure on speed-perturbed data, but only so slightly that the fbank length remains the same
# with pad_to_max_duration, this reduces cuDNN benchmark's burn-in period to a single step
: ${DATA_DIR:=${1:-"/datasets/LibriSpeech"}}
: ${OUTPUT_DIR:=${3:-"/results"}}
: ${TRAIN_MANIFESTS:="$DATA_DIR/librispeech-train-clean-100-wav.json"}
SCRIPT_DIR=$(cd $(dirname $0); pwd)
PROJECT_DIR=${SCRIPT_DIR}/..
# run for a number of epochs, but don't finalize the training
: ${EPOCHS_THIS_JOB:=2}
: ${EPOCHS:=100000}
: ${RESUME:=false}
: ${SAVE_FREQUENCY:=100000}
: ${EVAL_FREQUENCY:=100000}
: ${GRAD_ACCUMULATION_STEPS:=1}
DATA_DIR=${1:-${DATA_DIR:-"/datasets/LibriSpeech"}}
MODEL_CONFIG=${2:-${MODEL_CONFIG:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}}
RESULT_DIR=${3:-${RESULT_DIR:-"/results"}}
CREATE_LOGFILE=${4:-${CREATE_LOGFILE:-"true"}}
CUDNN_BENCHMARK=${5:-${CUDNN_BENCHMARK:-"true"}}
NUM_GPUS=${6:-${NUM_GPUS:-8}}
AMP=${7:-${AMP:-"false"}}
NUM_STEPS=${8:-${NUM_STEPS:-"-1"}}
MAX_DURATION=${9:-${MAX_DURATION:-16.7}}
SEED=${10:-${SEED:-0}}
BATCH_SIZE=${11:-${BATCH_SIZE:-32}}
LEARNING_RATE=${12:-${LEARNING_RATE:-"0.015"}}
GRADIENT_ACCUMULATION_STEPS=${13:-${GRADIENT_ACCUMULATION_STEPS:-1}}
PRINT_FREQUENCY=${14:-${PRINT_FREQUENCY:-1}}
USE_PROFILER=${USE_PROFILER:-"false"}
: ${AMP:=false}
: ${EMA:=0}
: ${DALI_DEVICE:="gpu"}
: ${NUM_GPUS_SEQ:="1 4 8"}
: ${BATCH_SIZE_SEQ:="32"}
# A probable range of batch lengths for LibriSpeech
# with BS=64 and continuous speed perturbation (0.85, 1.15)
: ${PRE_ALLOCATE:="1408 1920"}
mkdir -p "$RESULT_DIR"
for NUM_GPUS in $NUM_GPUS_SEQ; do
for BATCH_SIZE in $BATCH_SIZE_SEQ; do
[ "${USE_PROFILER}" = "true" ] && PYTHON_ARGS="-m cProfile -s cumtime"
LOG_FILE="$OUTPUT_DIR/perf-train_dali-${DALI_DEVICE}_amp-${AMP}_ngpus${NUM_GPUS}_bs${BATCH_SIZE}.json"
bash ./scripts/train.sh "$@"
CMD="${PYTHON_ARGS} ${PROJECT_DIR}/train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=400"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json,"
CMD+="$DATA_DIR/librispeech-train-clean-360-wav.json,"
CMD+="$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=100000"
CMD+=" --eval_freq=100000"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to_max"
CMD+=" --train_freq=$PRINT_FREQUENCY"
CMD+=" --lr_decay "
[ "$AMP" == "true" ] && \
CMD+=" --amp"
[ "$CUDNN_BENCHMARK" = "true" ] && \
CMD+=" --cudnn"
[ "$NUM_STEPS" -gt 1 ] && \
CMD+=" --num_steps=$NUM_STEPS"
if [ "$NUM_GPUS" -gt 1 ] ; then
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
CMD="python3 $CMD"
fi
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_benchmark_amp-%s_gbs%d" "$AMP" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
if [ -z "$LOGFILE" ] ; then
set -x
$CMD
set +x
else
set -x
(
$CMD
) |& tee "$LOGFILE"
set +x
mean_latency=`cat "$LOGFILE" | grep 'Step time' | awk '{print $3}' | tail -n +2 | egrep -o '[0-9.]+'| awk 'BEGIN {total=0} {total+=$1} END {printf("%.2f\n",total/NR)}'`
mean_throughput=`python -c "print($BATCH_SIZE*$NUM_GPUS/${mean_latency})"`
training_wer_per_gpu=`cat "$LOGFILE" | grep 'training_batch_WER'| awk '{print $2}' | tail -n 1 | egrep -o '[0-9.]+'`
training_loss_per_gpu=`cat "$LOGFILE" | grep 'Loss@Step'| awk '{print $4}' | tail -n 1 | egrep -o '[0-9.]+'`
final_eval_wer=`cat "$LOGFILE" | grep 'Evaluation WER'| tail -n 1 | egrep -o '[0-9.]+'`
final_eval_loss=`cat "$LOGFILE" | grep 'Evaluation Loss'| tail -n 1 | egrep -o '[0-9.]+'`
echo "max duration: $MAX_DURATION s" | tee -a "$LOGFILE"
echo "mean_latency: $mean_latency s" | tee -a "$LOGFILE"
echo "mean_throughput: $mean_throughput sequences/s" | tee -a "$LOGFILE"
echo "training_wer_per_pgu: $training_wer_per_pgu" | tee -a "$LOGFILE"
echo "training_loss_per_pgu: $training_loss_per_pgu" | tee -a "$LOGFILE"
echo "final_eval_loss: $final_eval_loss" | tee -a "$LOGFILE"
echo "final_eval_wer: $final_eval_wer" | tee -a "$LOGFILE"
fi
done
done

View file

@ -1,14 +0,0 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.08-py3
FROM ${FROM_IMAGE_NAME}
# Here's a good place to install pip reqs from JoC repo.
# At the same step, also install TRT pip reqs
WORKDIR /tmp/pipReqs
COPY requirements.txt /tmp/pipReqs/jocRequirements.txt
COPY tensorrt/requirements.txt /tmp/pipReqs/trtRequirements.txt
RUN pip install --disable-pip-version-check -U -r jocRequirements.txt -r trtRequirements.txt
WORKDIR /workspace/jasper
COPY . .

View file

@ -1,300 +0,0 @@
# Jasper Inference For TensorRT
This is a subfolder of the Jasper for PyTorch repository, tested and maintained by NVIDIA, that provides scripts to perform high-performance inference using NVIDIA TensorRT. Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference. More information about Jasper and its training can be found in the [Jasper PyTorch README](../README.md).
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [TensorRT Inference pipeline](#tensorrt-inference-pipeline)
* [Version Info](#version-info)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [TensorRT Inference Benchmark Process](#tensorrt-inference-benchmark-process)
* [TensorRT Inference Process](#tensorrt-inference-process)
- [Performance](#performance)
* [Results](#results)
* [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
## Model overview
### Model architecture
By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
In the original paper, Jasper is trained with masked convolutions, which mask out the padded part of an input sequence in a batch before the 1D-Convolution.
For inference, masking is not used: on the development and test datasets it does not improve accuracy, while dropping it yields better inference performance, especially after TensorRT optimization.
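For orientation only, a single sub-block can be sketched in PyTorch roughly as follows; this is a minimal illustration of the Conv-BN-ReLU-Dropout pattern described above, not the exact module used in this repository, and the channel counts, kernel size, and dropout rate are placeholders:
```python
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """One Jasper sub-block: 1D-Convolution -> BatchNorm -> ReLU -> Dropout."""
    def __init__(self, in_channels, out_channels, kernel_size, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)  # keep the time dimension
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):  # x: (batch, channels, time)
        return self.drop(self.relu(self.bn(self.conv(x))))
```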
### TensorRT Inference pipeline
The Jasper inference pipeline consists of three components: data preprocessor, acoustic model, and greedy decoder. The acoustic model is the most compute-intensive, accounting for more than 90% of the end-to-end runtime. It is also the only component with learnable parameters and the part that differentiates Jasper from the competition, so we focus mostly on the acoustic model.
For the non-TensorRT Jasper inference pipeline, all 3 components are implemented and run with native PyTorch. For the TensorRT inference pipeline, we show the speedup of running the acoustic model with TensorRT, while preprocessing and decoding are reused from the native PyTorch pipeline.
To run a model with TensorRT, we first construct the model in PyTorch and export it to an ONNX file. A TensorRT engine is then constructed from the ONNX file, serialized to a TensorRT engine file, and launched to do inference.
Note that the TensorRT engine is runtime-optimized before serialization: TensorRT tries a vast set of options to find the strategy that performs best on the user's GPU, so building takes a few minutes. After the TensorRT engine file is created, it can be reused.
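A minimal sketch of this export-and-build flow is shown below, assuming `acoustic_model` is the already-constructed PyTorch Jasper model and using the TensorRT 7.x Python API listed under Version Info; the input/output names and opset mirror `tensorrt/perfutils.py`, while the shapes, file names, and workspace size are illustrative (the repository's own scripts additionally handle dynamic shapes and FP16):
```python
import torch
import tensorrt as trt

# 1. Export the PyTorch acoustic model to ONNX (batch, n_filters, frames are placeholders).
dummy_features = torch.zeros(1, 64, 512, device="cuda")
torch.onnx.export(acoustic_model, dummy_features, "jasper.onnx",
                  input_names=["FEATURES"], output_names=["LOGITS"],
                  opset_version=10)

# 2. Parse the ONNX file and build a TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("jasper.onnx", "rb") as f:
    parser.parse(f.read())
config = builder.create_builder_config()
config.max_workspace_size = 1 << 30  # 1 GiB of workspace for tactic selection
engine = builder.build_engine(network, config)  # runtime optimization happens here

# 3. Serialize the engine so it can be reused without rebuilding.
with open("jasper.plan", "wb") as f:
    f.write(engine.serialize())
```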
### Version Info
The following software version configuration has been tested and known to work:
|Software|Version|
|--------|-------|
|Python|3.6.10|
|PyTorch|1.7.0a0+8deb4fe|
|TensorRT|7.1.3.4|
|CUDA|11.0.221|
## Setup
The following section lists the requirements in order to start inference on the Jasper model with TensorRT.
### Requirements
This repository contains a `Dockerfile` which extends the PyTorch 20.08-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.08-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* NVIDIA [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/), [Turing](https://www.nvidia.com/en-us/geforce/turing/), or [Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/) based GPU
* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
Required Python packages are listed in `requirements.txt` and `tensorrt/requirements.txt`. These packages are automatically installed when the Docker container is built. To manually install them, run:
```bash
pip install -r requirements.txt
pip install -r tensorrt/requirements.txt
```
## Quick Start Guide
Running the following scripts will build and launch a container with all required dependencies for both TensorRT and native PyTorch. This is necessary for running inference with TensorRT and can also be used for data download, processing, and training of the model.
1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
2. Build the Jasper PyTorch with TensorRT container:
```bash
bash tensorrt/scripts/docker/build.sh
```
3. Start an interactive session in the NGC docker container:
```bash
bash tensorrt/scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
```
Alternatively, to start a script in the docker container:
```bash
bash tensorrt/scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR> <SCRIPT_PATH>
```
The `/datasets`, `/checkpoints`, `/results` directories will be mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host. **These three paths should be absolute and should already exist.** The contents of this repository will be mounted to the `/workspace/jasper` directory. Note that `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` directly correspond to the same arguments in `scripts/docker/launch.sh` mentioned in the [Jasper PyTorch README](../README.md).
Briefly, `<DATA_DIR>` should contain, or be prepared to contain a `LibriSpeech` sub-directory (created in [Acquiring Dataset](#acquiring-dataset)), `<CHECKPOINT_DIR>` should contain a PyTorch model checkpoint (`*.pt`) file obtained through training described in [Jasper PyTorch README](../README.md), and `<RESULT_DIR>` should be prepared to contain timing results, logs, serialized TensorRT engines, and ONNX files.
4. Acquiring dataset
If LibriSpeech has already been downloaded and preprocessed as defined in the [Jasper PyTorch README](../README.md), no further steps in this subsection need to be taken.
If LibriSpeech has not been downloaded already, note that only a subset of LibriSpeech is typically used for inference (`dev-*` and `test-*`). To acquire the inference subset of LibriSpeech, run the following command inside the container (this step does not require a GPU):
```bash
bash tensorrt/scripts/download_inference_librispeech.sh
```
Once the data download is complete, the following folders should exist:
* `/datasets/LibriSpeech/`
* `dev-clean/`
* `dev-other/`
* `test-clean/`
* `test-other/`
Next, preprocessing the data can be performed with the following command:
```bash
bash tensorrt/scripts/preprocess_inference_librispeech.sh
```
Once the data is preprocessed, the following additional files should now exist:
* `/datasets/LibriSpeech/`
* `librispeech-dev-clean-wav.json`
* `librispeech-dev-other-wav.json`
* `librispeech-test-clean-wav.json`
* `librispeech-test-other-wav.json`
* `dev-clean-wav/`
* `dev-other-wav/`
* `test-clean-wav/`
* `test-other-wav/`
5. Start TensorRT inference prediction
Inside the container, use the following script to run inference with TensorRT. To learn more about the following env variables see `tensorrt/scripts/inference.sh`.
```bash
export CHECKPOINT=<CHECKPOINT>
export TRT_PRECISION=<PRECISION>
export PYTORCH_PRECISION=<PRECISION>
export TRT_PREDICTION_PATH=<TRT_PREDICTION_PATH>
bash tensorrt/scripts/inference.sh
```
A pretrained model checkpoint can be downloaded from [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
More details can be found in [Advanced](#advanced) under [Scripts and sample code](#scripts-and-sample-code), [Parameters](#parameters) and [TensorRT Inference process](#tensorrt-inference).
6. Start TensorRT inference benchmark
Inside the container, use the following script to run inference benchmark with TensorRT.
```bash
export CHECKPOINT=<CHECKPOINT>
export NUM_STEPS=<NUM_STEPS>
export NUM_FRAMES=<NUM_FRAMES>
export BATCH_SIZE=<BATCH_SIZE>
export TRT_PRECISION=<PRECISION>
export PYTORCH_PRECISION=<PRECISION>
export CSV_PATH=<CSV_PATH>
bash tensorrt/scripts/inference_benchmark.sh
```
A pretrained model checkpoint can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
More details can be found in [Advanced](#advanced) under [Scripts and sample code](#scripts-and-sample-code), [Parameters](#parameters) and [TensorRT Inference Benchmark process](#tensorrt-inference-benchmark).
7. Start Jupyter notebook to run inference interactively
The Jupyter notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
The notebook located at `notebooks/JasperTRT.ipynb` offers an interactive way to run Steps 2, 3, 4 and 5. In addition, the notebook shows examples of how to use TensorRT to transcribe a single audio file into text. To launch the application, please follow the instructions under [../notebooks/README.md](../notebooks/README.md).
A pretrained model checkpoint can be downloaded from [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
## Advanced
The following sections provide greater detail on inference benchmarking with TensorRT and show inference results.
### Scripts and sample code
In the `tensorrt/` directory, the most important files are:
* `Dockerfile`: Container to run Jasper inference with TensorRT.
* `requirements.txt`: Python package dependencies. Installed when building the Docker container.
* `perf.py`: Entry point for inference pipeline using TensorRT.
* `perfprocedures.py`: Contains functionality to run inference through both the PyTorch model and TensorRT Engine, taking runtime measurements of each component of the inference process for comparison.
* `trtutils.py`: Helper functions for TensorRT components of Jasper inference.
* `perfutils.py`: Helper functions for non-TensorRT components of Jasper inference.
The `tensorrt/scripts/` directory has one-click scripts to run supported functionalities, such as:
* `download_inference_librispeech.sh`: Downloads the LibriSpeech inference subset (`dev-*` and `test-*`).
* `preprocess_inference_librispeech.sh`: Preprocesses the raw LibriSpeech data files to be ready for inference.
* `inference_benchmark.sh`: Benchmarks and compares TensorRT and PyTorch inference pipelines using the `perf.py` script.
* `inference.sh`: Runs TensorRT and PyTorch inference using the `inference_benchmark.sh` script.
* `walk_benchmark.sh`: Illustrates an example of using `tensorrt/scripts/inference_benchmark.sh`, which *walks* a variety of values for `BATCH_SIZE` and `NUM_FRAMES`.
* `docker/`: Contains the scripts for building and launching the container.
### Parameters
The list of parameters available for `tensorrt/scripts/inference_benchmark.sh` is:
```
Required:
--------
CHECKPOINT: Model checkpoint path
Arguments with Defaults:
--------
DATA_DIR: directory of the dataset (default: `/datasets/LibriSpeech`)
DATASET: name of dataset to use (default: `dev-clean`)
RESULT_DIR: directory for results including TensorRT engines, ONNX files, logs, and CSVs (default: `/results`)
CREATE_LOGFILE: boolean that indicates whether to create log of session to be stored in `$RESULT_DIR` (default: "true")
CSV_PATH: file to store CSV results (default: `/results/res.csv`)
TRT_PREDICTION_PATH: file to store inference prediction results generated with TensorRT (default: `none`)
PYT_PREDICTION_PATH: file to store inference prediction results generated with native PyTorch (default: `none`)
VERBOSE: boolean that indicates whether to verbosely describe TensorRT engine building/deserialization and TensorRT inference (default: "false")
TRT_PRECISION: "fp32" or "fp16". Defines which precision kernels will be used for TensorRT engine (default: "fp32")
PYTORCH_PRECISION: "fp32" or "fp16". Defines which precision will be used for inference in PyTorch (default: "fp32")
NUM_STEPS: Number of inference steps. If -1 runs inference on entire dataset (default: 100)
BATCH_SIZE: data batch size (default: 64)
NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 512)
FORCE_ENGINE_REBUILD: boolean that indicates whether to rebuild the TensorRT engine even if an already-built engine of equivalent precision, batch size, and number of frames exists. Engines are specific to the GPU, library versions, TensorRT versions, and CUDA versions they were built with and cannot be used in a different environment. (default: "true")
USE_DYNAMIC_SHAPE: if 'yes', uses dynamic shapes (default: yes). Dynamic shape is always preferred since it allows engines to be reused.
```
The complete list of parameters available for `tensorrt/scripts/inference.sh` is the same as `tensorrt/scripts/inference_benchmark.sh` only with different default input arguments. In the following, only the parameters with different default values are listed:
```
TRT_PREDICTION_PATH: file to store inference prediction results generated with TensorRT (default: `/results/trt_predictions.txt`)
PYT_PREDICTION_PATH: file to store inference prediction results generated with native PyTorch (default: `/results/pyt_predictions.txt`)
NUM_STEPS: Number of inference steps. If -1 runs inference on entire dataset (default: -1)
BATCH_SIZE: data batch size (default: 1)
NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 3600)
```
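For example, a single FP16 benchmark run at a fixed batch size and frame count could look like this; `CHECKPOINT` is the only required variable and all values below are illustrative:
```bash
export CHECKPOINT=/checkpoints/jasper_fp16.pt
export BATCH_SIZE=16
export NUM_FRAMES=512
export TRT_PRECISION=fp16
export PYTORCH_PRECISION=fp16
export CSV_PATH=/results/res_bs16_frames512.csv
bash tensorrt/scripts/inference_benchmark.sh
```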
### TensorRT Inference Benchmark process
The inference benchmarking is performed on a single GPU by `tensorrt/scripts/inference_benchmark.sh`, which delegates to `tensorrt/perf.py`. The latter takes the following steps:
1. Construct Jasper acoustic model in PyTorch.
2. Construct TensorRT Engine of Jasper acoustic model
1. Perform ONNX export on the PyTorch model, if its ONNX file does not already exist.
2. Construct TensorRT engine from ONNX export, if a saved engine file does not already exist or `FORCE_ENGINE_REBUILD` is `true`.
3. For each batch in the dataset, run inference through both the PyTorch model and TensorRT Engine, taking runtime measurements of each component of the inference process.
4. Compile performance and WER accuracy results in CSV format, written to `CSV_PATH` file.
`tensorrt/perf.py` relies on `tensorrt/trtutils.py` and `tensorrt/perfutils.py`, which provide helper functions for the TensorRT and non-TensorRT components of Jasper inference, respectively.
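Although the benchmark is normally driven through `inference_benchmark.sh`, `perf.py` can also be invoked directly from the repository root; a hedged example using the arguments defined in its `parse_args()` (all values illustrative):
```bash
python tensorrt/perf.py \
  --model_toml configs/jasper10x5dr_sp_offline_specaugment.toml \
  --ckpt_path /checkpoints/jasper_fp16.pt \
  --make_onnx --onnx_path /results/jasper.onnx \
  --seq_len 512 --batch_size 16 --engine_batch_size 16 \
  --trt_fp16 --pyt_fp16 \
  --dataset_dir /datasets/LibriSpeech \
  --val_manifest /datasets/LibriSpeech/librispeech-dev-clean-wav.json \
  --csv_path /results/res.csv
```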
### TensorRT Inference process
The inference is performed by `tensorrt/scripts/inference.sh`, which delegates to `tensorrt/scripts/inference_benchmark.sh`. The script runs on a single GPU. To run inference prediction on the entire dataset, `NUM_FRAMES` is set to 3600, which roughly corresponds to 36 seconds of audio. This covers the longest sentences in both the LibriSpeech dev and test datasets. By default, `BATCH_SIZE` is set to 1 to simulate the online inference scenario in deployment; other batch sizes can be tried by setting this parameter to a different value. By default, `TRT_PRECISION` is set to full precision and can be changed by setting `export TRT_PRECISION=fp16`. The prediction results are stored at `/results/trt_predictions.txt` and `/results/pyt_predictions.txt`.
## Performance
To benchmark the inference performance on a specific batch size and audio length refer to [Quick-Start-Guide](#quick-start-guide). To do a sweep over multiple batch sizes and audio durations run:
```bash
bash tensorrt/scripts/walk_benchmark.sh
```
The results are obtained by running inference on the LibriSpeech dev-clean dataset on a single T4 GPU using half precision with AMP. We compare the throughput of the acoustic model between TensorRT and native PyTorch.
### Results
#### Inference performance: NVIDIA T4
| Sequence Length (in seconds) | Batch size | TensorRT FP16 Throughput (#sequences/second) Percentiles | | | | PyTorch FP16 Throughput (#sequences/second) Percentiles | | | | TRT/PyT Speedup |
|---------------|------------|---------------------|---------|---------|---------|-----------------|---------|---------|---------|-----------------|
| | | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | |
|2|1|71.002|70.897|70.535|71.987|42.974|42.932|42.861|43.166|1.668|
||2|136.369|135.915|135.232|139.266|81.398|77.826|57.408|81.254|1.714|
||4|231.528|228.875|220.085|239.686|130.055|117.779|104.529|135.660|1.767|
||8|310.224|308.870|289.132|316.536|215.401|202.902|148.240|228.805|1.383|
||16|389.086|366.839|358.419|401.267|288.353|278.708|230.790|307.070|1.307|
|7|1|61.792|61.273|59.842|63.537|34.098|33.963|33.785|34.639|1.834|
||2|93.869|92.480|91.528|97.082|59.397|59.221|51.050|60.934|1.593|
||4|113.108|112.950|112.531|114.507|66.947|66.479|59.926|67.704|1.691|
||8|118.878|118.542|117.619|120.367|83.208|82.998|82.698|84.187|1.430|
||16|122.909|122.718|121.547|124.190|102.212|102.000|101.187|103.049|1.205|
|16.7|1|38.665|38.404|37.946|39.363|21.267|21.197|21.127|21.456|1.835|
||2|44.960|44.867|44.382|45.583|30.218|30.156|29.970|30.679|1.486|
||4|47.754|47.667|47.541|48.287|29.146|29.079|28.941|29.470|1.639|
||8|51.051|50.969|50.620|51.489|37.565|37.497|37.373|37.834|1.361|
||16|53.316|53.288|53.188|53.773|45.217|45.090|44.946|45.560|1.180|

View file

@ -1,132 +0,0 @@
#!/usr/bin/env python3
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''Constructs TensorRT engine for JASPER and evaluates inference latency'''
import argparse
import sys, os
# Get local modules in parent directory and current directory (assuming this was called from root of repository)
sys.path.append("./")
sys.path.append("./tensorrt")
import perfutils
import trtutils
import perfprocedures
from model import GreedyCTCDecoder
from helpers import __ctc_decoder_predictions_tensor
import caffe2.python.onnx.backend as c2backend
import onnxruntime as ort
import torch
from torch import nn
from torch.nn import functional as F
def main(args):
print ("Getting component")
# Get shared utility across PyTorch and TRT
pyt_components, saved_onnx = perfutils.get_pytorch_components_and_onnx(args)
print ("Getting engine")
# Get a TRT engine. See function for argument parsing logic
engine = trtutils.get_engine(args)
print ("Got engine.")
if args.wav:
with torch.no_grad():
audio_processor = pyt_components['audio_preprocessor']
audio_processor.eval()
greedy_decoder = GreedyCTCDecoder()
input_wav, num_audio_samples = pyt_components['input_wav']
features = audio_processor(input_wav, num_audio_samples)
features = perfutils.adjust_shape(features, args)
if not args.engine_path:
outputs = engine.run(None, {'FEATURES': features[0].data.cpu().numpy()})
inference = 1.0
t_log_probs_e = outputs[0]
t_log_probs_e=perfutils.torchify_trt_out(t_log_probs_e, t_log_probs_e.shape)
else:
with engine.create_execution_context() as context:
t_log_probs_e, copyto, inference, copyfrom= perfprocedures.do_inference(context, [features[0]])
t_predictions_e = greedy_decoder(t_log_probs_e)
hypotheses = __ctc_decoder_predictions_tensor(t_predictions_e, labels=perfutils.get_vocab())
print("INTERENCE TIME: {} ms".format(inference*1000.0))
print("TRANSCRIPT: ", hypotheses)
return
wer, preds, times = perfprocedures.compare_times_trt_pyt_exhaustive(engine,
pyt_components,
args)
string_header, string_data = perfutils.do_csv_export(wer, times, args.batch_size, args.seq_len)
if args.csv_path is not None:
print ("Exporting to " + args.csv_path)
with open(args.csv_path, 'a+') as f:
# See if header is there, if so, check that it matches
f.seek(0) # Read from start of file
existing_header = f.readline()
if existing_header == "":
f.write(string_header)
f.write("\n")
elif existing_header[:-1] != string_header:
raise Exception(f"Writing to existing CSV with incorrect format\nProduced:\n{string_header}\nFound:\n{existing_header}\nIf you intended to write to a new results csv, please change the csv_path argument")
f.seek(0,2) # Write to end of file
f.write(string_data)
f.write("\n")
else:
print(string_header)
print(string_data)
if args.trt_prediction_path is not None:
with open(args.trt_prediction_path, 'w') as fp:
fp.write('\n'.join(preds['trt']))
if args.pyt_prediction_path is not None:
with open(args.pyt_prediction_path, 'w') as fp:
fp.write('\n'.join(preds['pyt']))
def parse_args():
parser = argparse.ArgumentParser(description="Performance test of TRT")
parser.add_argument("--engine_path", default=None, type=str, help="Path to serialized TRT engine")
parser.add_argument("--use_existing_engine", action="store_true", default=False, help="If set, will deserialize engine at --engine_path" )
parser.add_argument("--engine_batch_size", default=16, type=int, help="Maximum batch size for constructed engine; needed when building")
parser.add_argument("--batch_size", default=16, type=int, help="Batch size for data when running inference.")
parser.add_argument("--dataset_dir", type=str, help="Root directory of dataset")
parser.add_argument("--model_toml", type=str, required=True, help="Config toml to use. A selection can be found in configs/")
parser.add_argument("--val_manifest", type=str, help="JSON manifest of dataset.")
parser.add_argument("--onnx_path", default=None, type=str, help="Path to onnx model for engine creation")
parser.add_argument("--seq_len", default=None, type=int, help="Generate an ONNX export with this fixed sequence length, and save to --onnx_path. Requires also using --onnx_path and --ckpt_path.")
parser.add_argument("--max_seq_len", default=3600, type=int, help="Max sequence length for TRT engine build. Default works with TRTIS benchmark. Set it larger than seq_len")
parser.add_argument("--ckpt_path", default=None, type=str, help="If provided, will also construct pytorch acoustic model")
parser.add_argument("--max_duration", default=None, type=float, help="Maximum possible length of audio data in seconds")
parser.add_argument("--num_steps", default=-1, type=int, help="Number of inference steps to run")
parser.add_argument("--trt_fp16", action="store_true", default=False, help="If set, will allow TRT engine builder to select fp16 kernels as well as fp32")
parser.add_argument("--pyt_fp16", action="store_true", default=False, help="If set, will construct pytorch model with fp16 weights")
parser.add_argument("--make_onnx", action="store_true", default=False, help="If set, will create an ONNX model and store it at the path specified by --onnx_path")
parser.add_argument("--csv_path", type=str, default=None, help="File to append csv info about inference time")
parser.add_argument("--trt_prediction_path", type=str, default=None, help="File to write predictions inferred with tensorrt")
parser.add_argument("--pyt_prediction_path", type=str, default=None, help="File to write predictions inferred with pytorch")
parser.add_argument("--verbose", action="store_true", default=False, help="If set, will verbosely describe TRT engine building and deserialization as well as TRT inference")
parser.add_argument("--wav", type=str, help='absolute path to .wav file (16KHz)')
parser.add_argument("--max_workspace_size", default=0, type=int, help="Maximum GPU memory workspace size for constructed engine; needed when building")
parser.add_argument("--transpose", action="store_true", default=False, help="If set, will transpose input")
parser.add_argument("--static_shape", action="store_true", default=False, help="If set, use static shape otherwise dynamic shape. Dynamic shape is always preferred.")
return parser.parse_args()
if __name__ == "__main__":
args = parse_args()
main(args)

View file

@ -1,182 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''A collection of accuracy and latency evaluation procedures for JASPER on PyTorch and TRT.
'''
import pycuda.driver as cuda
import pycuda.autoinit
import perfutils
import trtutils
import time
import torch
from tqdm import tqdm
def compare_times_trt_pyt_exhaustive(engine, pyt_components, args):
'''Compares execution times and WER between TRT and PyTorch'''
# The engine has a fixed-size sequence length, which needs to be known for slicing/padding input
preprocess_times = []
inputadjust_times = []
outputadjust_times = []
process_batch_times = []
trt_solo_times = []
trt_async_times = []
tohost_sync_times =[]
pyt_infer_times = []
step_counter = 0
with engine.create_execution_context() as context, torch.no_grad():
for data in tqdm(pyt_components['data_layer'].data_iterator):
if args.num_steps >= 1:
if step_counter > args.num_steps:
break
step_counter +=1
tensors = []
for d in data:
tensors.append(d.cuda())
preprocess_start = time.perf_counter()
am_input = pyt_components['audio_preprocessor'](tensors[0], tensors[1])
torch.cuda.synchronize()
preprocess_end = time.perf_counter()
# Pad or cut to the necessary engine length
inputadjust_start = time.perf_counter()
am_input = perfutils.adjust_shape(am_input, args)
torch.cuda.synchronize()
inputadjust_end = time.perf_counter()
batch_size = am_input[0].shape[0]
inp = [am_input[0]]
# Run TRT inference 1: Async copying and inference
# import ipdb; ipdb.set_trace()
trt_out, time_taken= do_inference_overlap(context, inp)
torch.cuda.synchronize()
outputadjust_start = time.perf_counter()
outputadjust_end = time.perf_counter()
process_batch_start = time.perf_counter()
perfutils.global_process_batch(log_probs=trt_out,
original_tensors=tensors,
batch_size=batch_size,
is_trt=True)
torch.cuda.synchronize()
process_batch_end = time.perf_counter()
# Create explicit stream so pytorch doesn't complete asynchronously
pyt_infer_start = time.perf_counter()
pyt_out = pyt_components['acoustic_model'](am_input[0])
torch.cuda.synchronize()
pyt_infer_end = time.perf_counter()
perfutils.global_process_batch(log_probs=pyt_out,
original_tensors=tensors,
batch_size=batch_size,
is_trt=False)
# Run TRT inference 2: Synchronous copying and inference
sync_out, time_to, time_infer, time_from = do_inference(context,inp)
del sync_out
preprocess_times.append(preprocess_end - preprocess_start)
inputadjust_times.append(inputadjust_end - inputadjust_start)
outputadjust_times.append(outputadjust_end - outputadjust_start)
process_batch_times.append(process_batch_end - process_batch_start)
trt_solo_times.append(time_infer)
trt_async_times.append(time_taken)
tohost_sync_times.append(time_from)
pyt_infer_times.append(pyt_infer_end - pyt_infer_start)
trt_wer = perfutils.global_process_epoch(is_trt=True)
pyt_wer = perfutils.global_process_epoch(is_trt=False)
trt_preds = perfutils._global_trt_dict['predictions']
pyt_preds = perfutils._global_pyt_dict['predictions']
times = {
'preprocess': preprocess_times, # Time to go through preprocessing
'pyt_infer': pyt_infer_times, # Time for batch completion through pytorch
'input_adjust': inputadjust_times, # Time to pad/cut for TRT engine size requirements
'output_adjust' : outputadjust_times, # Time to reshape output of TRT and copy from host to device
'post_process': process_batch_times, # Time to run greedy decoding and do CTC conversion
'trt_solo_infer': trt_solo_times, # Time to execute just TRT acoustic model
'to_host': tohost_sync_times, # Time to execute device to host copy synchronously
'trt_async_infer': trt_async_times, # Time to execute combined async TRT acoustic model + device to host copy
}
wer = {
'trt': trt_wer,
'pyt': pyt_wer
}
preds = {
'trt': trt_preds,
'pyt': pyt_preds
}
return wer, preds, times
def do_inference(context, inp):
'''Do inference using a TRT engine and time it
Execution and device-to-host copy are completed synchronously
'''
# Typical Python-TRT used in samples would copy input data from host to device.
# Because the PyTorch Tensor is already on the device, such a copy is unneeded.
t0 = time.perf_counter()
stream = cuda.Stream()
# Create output buffers and stream
outputs, bindings, out_shape = trtutils.allocate_buffers_with_existing_inputs(context, inp)
t01 = time.perf_counter()
# simulate sync call here
context.execute_async_v2(
bindings=bindings,
stream_handle=stream.handle)
stream.synchronize()
t2 = time.perf_counter()
# for out in outputs:
# cuda.memcpy_dtoh(out.host, out.device)
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()
t3 = time.perf_counter()
copyto = t01-t0
inference = t2-t01
copyfrom = t3-t2
out = outputs[0].host
outputs[0].device.free()
out = perfutils.torchify_trt_out(outputs[0].host, out_shape)
return out, copyto, inference, copyfrom
def do_inference_overlap(context, inp):
'''Do inference using a TRT engine and time it
Execution and device-to-host copy are completed asynchronously
'''
# Typical Python-TRT used in samples would copy input data from host to device.
# Because the PyTorch Tensor is already on the device, such a copy is unneeded.
t0 = time.perf_counter()
# Create output buffers and stream
stream = cuda.Stream()
outputs, bindings, out_shape = trtutils.allocate_buffers_with_existing_inputs(context, inp)
t01 = time.perf_counter()
t1 = time.perf_counter()
# Run inference and transfer outputs to host asynchronously
context.execute_async_v2(
bindings=bindings,
stream_handle=stream.handle)
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
stream.synchronize()
t2 = time.perf_counter()
copyto = t1-t0
inference = t2-t1
outputs[0].device.free()
out = perfutils.torchify_trt_out(outputs[0].host, out_shape)
return out, t2-t1

View file

@ -1,317 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''Contains helper functions for non-TRT components of JASPER inference
'''
from model import GreedyCTCDecoder, AudioPreprocessing, JasperEncoderDecoder
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, add_ctc_labels, norm
from apex import amp
import torch
import torch.nn as nn
import toml
from parts.features import audio_from_file
import onnx
import os
_global_ctc_labels = None
def get_vocab():
''' Gets the CTC vocab
Requires calling get_pytorch_components_and_onnx() to setup global labels.
'''
if _global_ctc_labels is None:
raise Exception("Feature labels have not been found. Execute `get_pytorch_components_and_onnx()` first")
return _global_ctc_labels
def get_results(log_probs, original_tensors, batch_size):
''' Returns WER and predictions for the outputs of the acoustic model
Used for one-off batches. Epoch-wide evaluation should use
global_process_batch and global_process_epoch
'''
# Used to get WER and predictions for one-off batches
greedy_decoder = GreedyCTCDecoder()
predicts = norm(greedy_decoder(log_probs=log_probs))
values_dict = dict(
predictions=[predicts],
transcript=[original_tensors[2][0:batch_size,...]],
transcript_length=[original_tensors[3][0:batch_size,...]],
)
temp_dict = {
'predictions': [],
'transcripts': [],
}
process_evaluation_batch(values_dict, temp_dict, labels=get_vocab())
predictions = temp_dict['predictions']
wer, _ = process_evaluation_epoch(temp_dict)
return wer, predictions
_global_trt_dict = {
'predictions': [],
'transcripts': [],
}
_global_pyt_dict = {
'predictions': [],
'transcripts': [],
}
def global_process_batch(log_probs, original_tensors, batch_size, is_trt=True):
'''Accumulates prediction evaluations for batches across an epoch
is_trt determines which global dictionary will be used.
To get WER at any point, use global_process_epoch.
For one-off WER evaluations, use get_results()
'''
# State-based approach for full WER comparison across a dataset.
greedy_decoder = GreedyCTCDecoder()
predicts = norm(greedy_decoder(log_probs=log_probs))
values_dict = dict(
predictions=[predicts],
transcript=[original_tensors[2][0:batch_size,...]],
transcript_length=[original_tensors[3][0:batch_size,...]],
)
dict_to_process = _global_trt_dict if is_trt else _global_pyt_dict
process_evaluation_batch(values_dict, dict_to_process, labels=get_vocab())
def global_process_epoch(is_trt=True):
'''Returns WER in accumulated global dictionary
'''
dict_to_process = _global_trt_dict if is_trt else _global_pyt_dict
wer, _ = process_evaluation_epoch(dict_to_process)
return wer
def get_onnx(path, acoustic_model, args):
''' Get an ONNX model with float weights
Requires an --onnx_save_path and --ckpt_path (so that an acoustic model could be constructed).
Fixed-length --seq_len must be provided as well.
'''
dynamic_dim = 0
if not args.static_shape:
dynamic_dim = 1 if args.transpose else 2
if args.transpose:
signal_shape=(args.engine_batch_size, int(args.seq_len), 64)
else:
signal_shape=(args.engine_batch_size, 64, int(args.seq_len))
with torch.no_grad():
phony_signal = torch.zeros(signal_shape, dtype=torch.float, device=torch.device("cuda"))
phony_len = torch.IntTensor(len(phony_signal))
phony_out = acoustic_model.infer((phony_signal, phony_len))
input_names=["FEATURES"]
output_names=["LOGITS"]
if acoustic_model.jasper_encoder.use_conv_mask:
input_names.append("FETURES_LEN")
output_names.append("LOGITS_LEN")
phony_signal = [phony_signal, phony_len]
if dynamic_dim > 0:
dynamic_axes={
"FEATURES" : {0 : "BATCHSIZE", dynamic_dim : "NUM_FEATURES"},
"LOGITS" : { 0: "BATCHSIZE", 1 : "NUM_LOGITS"}
}
else:
dynamic_axes = None
jitted_model = acoustic_model
torch.onnx.export(jitted_model, phony_signal, path,
input_names=input_names, output_names=output_names,
opset_version=10,
do_constant_folding=True,
verbose=True,
dynamic_axes=dynamic_axes,
example_outputs = phony_out
)
fn=path+".readable"
with open(fn, 'w') as f:
#Write human-readable graph representation to file as well.
tempModel = onnx.load(path)
onnx.checker.check_model(tempModel)
pgraph = onnx.helper.printable_graph(tempModel.graph)
f.write(pgraph)
return path
def get_pytorch_components_and_onnx(args):
'''Returns PyTorch components used for inference
'''
model_definition = toml.load(args.model_toml)
dataset_vocab = model_definition['labels']['labels']
# Set up global labels for future vocab calls
global _global_ctc_labels
_global_ctc_labels= add_ctc_labels(dataset_vocab)
featurizer_config = model_definition['input_eval']
optim_level = 3 if args.pyt_fp16 else 0
featurizer_config["optimization_level"] = optim_level
audio_preprocessor = None
onnx_path = None
data_layer = None
wav = None
seq_len = None
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
if args.dataset_dir is not None:
data_layer = AudioToTextDataLayer(dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=args.val_manifest,
labels=dataset_vocab,
batch_size=args.batch_size,
shuffle=False)
if args.wav is not None:
args.batch_size=1
wav, seq_len = audio_from_file(args.wav)
if args.seq_len is None or args.seq_len == 0:
args.seq_len = seq_len/(featurizer_config['sample_rate']/100)
args.seq_len = int(args.seq_len)
if args.transpose:
featurizer_config["transpose_out"] = True
model_definition["transpose_in"] = True
model = JasperEncoderDecoder(jasper_model_definition=model_definition, feat_in=1024, num_classes=len(get_vocab()), transpose_in=args.transpose)
model = model.cuda()
model.eval()
audio_preprocessor = AudioPreprocessing(**featurizer_config)
audio_preprocessor = audio_preprocessor.cuda()
audio_preprocessor.eval()
if args.ckpt_path is not None:
if os.path.isdir(args.ckpt_path):
d_checkpoint = torch.load(args.ckpt_path+"/decoder.pt", map_location="cpu")
e_checkpoint = torch.load(args.ckpt_path+"/encoder.pt", map_location="cpu")
model.jasper_encoder.load_state_dict(e_checkpoint, strict=False)
model.jasper_decoder.load_state_dict(d_checkpoint, strict=False)
else:
checkpoint = torch.load(args.ckpt_path, map_location="cpu")
model.load_state_dict(checkpoint['state_dict'], strict=False)
# if we are to produce engine, not run/create ONNX, postpone AMP initialization
# (ONNX parser cannot handle mixed FP16 ONNX yet)
if args.pyt_fp16 and args.engine_path is None:
amp.initialize(models=model, opt_level='O'+str(optim_level))
if args.make_onnx:
if args.onnx_path is None or args.ckpt_path is None:
raise Exception("--ckpt_path, --onnx_path must be provided when using --make_onnx")
onnx_path = get_onnx(args.onnx_path, model, args)
if args.pyt_fp16 and args.engine_path is not None:
amp.initialize(models=model, opt_level='O'+str(optim_level))
return {'data_layer': data_layer,
'audio_preprocessor': audio_preprocessor,
'acoustic_model': model,
'input_wav' : (wav, seq_len) }, onnx_path
def adjust_shape(am_input, args):
'''Pads or cuts acoustic model input tensor to some fixed_length
'''
input = am_input[0]
baked_length = int(args.seq_len)
if args.transpose:
in_seq_len = input.shape[1]
else:
in_seq_len = input.shape[2]
if baked_length is None or in_seq_len == baked_length:
return (input, am_input[1])
if args.transpose:
return (input.resize_(input.shape[0], baked_length, 64), am_input[1])
newSeq=input
if in_seq_len > baked_length:
# Cut extra bits off, no inference done
newSeq = input[...,0:baked_length].contiguous()
elif in_seq_len < baked_length:
# Zero-pad to satisfy length
pad_length = baked_length - in_seq_len
newSeq = nn.functional.pad(input, (0, pad_length), 'constant', 0)
return (newSeq, am_input[1])
def torchify_trt_out(trt_out, desired_shape):
'''Reshapes flat data to format for greedy+CTC decoding
Used to convert numpy array on host to PyT Tensor
'''
# Predictions must be reshaped.
ret = torch.from_numpy(trt_out)
return ret.reshape((desired_shape[0], desired_shape[1], desired_shape[2]))
def do_csv_export(wers, times, batch_size, num_frames):
'''Produces CSV header and data for input data
wers: dictionary of WER with keys={'trt', 'pyt'}
times: dictionary of execution times
'''
def take_durations_and_output_percentile(durations, ratios):
from heapq import nlargest, nsmallest
import numpy as np
import math
durations = np.asarray(durations) * 1000 # in ms
latency = durations
# The first few entries may not be representative due to warm-up effects
# The last entry might not be representative if dataset_size % batch_size != 0
latency = latency[5:-1]
mean_latency = np.mean(latency)
latency_worst = nlargest(math.ceil( (1 - min(ratios))* len(latency)), latency)
latency_ranges=get_percentile(ratios, latency_worst, len(latency))
latency_ranges["0.5"] = mean_latency
return latency_ranges
def get_percentile(ratios, arr, nsamples):
res = {}
for a in ratios:
idx = max(int(nsamples * (1 - a)), 0)
res[a] = arr[idx]
return res
ratios = [0.9, 0.95, 0.99, 1.]
header=[]
data=[]
header.append("BatchSize")
header.append("NumFrames")
data.append(f"{batch_size}")
data.append(f"{num_frames}")
for title, wer in wers.items():
header.append(title)
data.append(f"{wer}")
for title, durations in times.items():
ratio_latencies_dict = take_durations_and_output_percentile(durations, ratios)
for ratio, latency in ratio_latencies_dict.items():
header.append(f"{title}_{ratio}")
data.append(f"{latency}")
string_header = ", ".join(header)
string_data = ", ".join(data)
return string_header, string_data

View file

@ -1,4 +0,0 @@
pycuda
pillow
onnx==1.6.0
onnxruntime==1.4.0

View file

@ -1,5 +0,0 @@
#!/bin/bash
# Constructs a docker image containing dependencies for execution of JASPER through TensorRT
echo "docker build . -f ./tensorrt/Dockerfile -t jasper:tensorrt"
docker build . -f ./tensorrt/Dockerfile -t jasper:tensorrt

View file

@ -1,43 +0,0 @@
#!/bin/bash
SCRIPT_DIR=$(cd $(dirname $0); pwd)
JASPER_REPO=${JASPER_REPO:-"${SCRIPT_DIR}/../../.."}
# Launch TRT JASPER container.
DATA_DIR=$1
CHECKPOINT_DIR=$2
RESULT_DIR=$3
PROGRAM_PATH=${PROGRAM_PATH}
if [ $# -lt 3 ]; then
echo "Usage: ./launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR> (<SCRIPT_PATH>)"
echo "All directory paths must be absolute paths and exist"
exit 1
fi
for dir in $DATA_DIR $CHECKPOINT_DIR $RESULT_DIR; do
if [[ $dir != /* ]]; then
echo "All directory paths must be absolute paths!"
echo "${dir} is not an absolute path"
exit 1
fi
if [ ! -d $dir ]; then
echo "All directory paths must exist!"
echo "${dir} does not exist"
exit 1
fi
done
nvidia-docker run -it --rm \
--runtime=nvidia \
--shm-size=4g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v $DATA_DIR:/datasets \
-v $CHECKPOINT_DIR:/checkpoints/ \
-v $RESULT_DIR:/results/ \
-v ${JASPER_REPO}:/jasper \
${EXTRA_JASPER_ENV} \
jasper:tensorrt bash $PROGRAM_PATH
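# Example (hypothetical absolute paths):
#   bash <path to this script> /data/LibriSpeech /checkpoints /results
# Optionally set PROGRAM_PATH to a script inside the container to run it on start.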

View file

@ -1,67 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Performs inference and measures latency and accuracy of TRT and PyTorch implementations of JASPER.
echo "Container nvidia build = " $NVIDIA_BUILD_ID
# Mandatory Arguments
CHECKPOINT=$CHECKPOINT
# Arguments with Defaults
DATA_DIR=${DATA_DIR:-"/datasets/LibriSpeech"}
DATASET=${DATASET:-"dev-clean"}
RESULT_DIR=${RESULT_DIR:-"/results"}
CREATE_LOGFILE=${CREATE_LOGFILE:-"true"}
TRT_PRECISION=${TRT_PRECISION:-"fp32"}
PYTORCH_PRECISION=${PYTORCH_PRECISION:-"fp32"}
NUM_STEPS=${NUM_STEPS:-"-1"}
BATCH_SIZE=${BATCH_SIZE:-1}
NUM_FRAMES=${NUM_FRAMES:-3600}
MAX_SEQUENCE_LENGTH_FOR_ENGINE=${MAX_SEQUENCE_LENGTH_FOR_ENGINE:-$NUM_FRAMES}
FORCE_ENGINE_REBUILD=${FORCE_ENGINE_REBUILD:-"true"}
CSV_PATH=${CSV_PATH:-"/results/res.csv"}
TRT_PREDICTION_PATH=${TRT_PREDICTION_PATH:-"/results/trt_predictions.txt"}
PYT_PREDICTION_PATH=${PYT_PREDICTION_PATH:-"/results/pyt_predictions.txt"}
VERBOSE=${VERBOSE:-"false"}
export CHECKPOINT="$CHECKPOINT"
export DATA_DIR="$DATA_DIR"
export DATASET="$DATASET"
export RESULT_DIR="$RESULT_DIR"
export CREATE_LOGFILE="$CREATE_LOGFILE"
export TRT_PRECISION="$TRT_PRECISION"
export PYTORCH_PRECISION="$PYTORCH_PRECISION"
export NUM_STEPS="$NUM_STEPS"
export BATCH_SIZE="$BATCH_SIZE"
export NUM_FRAMES="$NUM_FRAMES"
export MAX_SEQUENCE_LENGTH_FOR_ENGINE="$MAX_SEQUENCE_LENGTH_FOR_ENGINE"
export FORCE_ENGINE_REBUILD="$FORCE_ENGINE_REBUILD"
export CSV_PATH="$CSV_PATH"
export TRT_PREDICTION_PATH="$TRT_PREDICTION_PATH"
export PYT_PREDICTION_PATH="$PYT_PREDICTION_PATH"
export VERBOSE="$VERBOSE"
bash ./tensorrt/scripts/inference_benchmark.sh $1 $2 $3 $4 $5 $6 $7
trt_word_error_rate=`cat "$CSV_PATH" | awk '{print $3}'`
pyt_word_error_rate=`cat "$CSV_PATH" | awk '{print $4}'`
echo "word error rate for native PyTorch inference: "
echo "${pyt_word_error_rate}"
echo "word error rate for native TRT inference: "
echo "${trt_word_error_rate}"

View file

@ -1,173 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Measures latency and accuracy of TRT and PyTorch implementations of JASPER.
echo "Container nvidia build = " $NVIDIA_BUILD_ID
trap "exit" INT
# Mandatory Arguments
CHECKPOINT=${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}
# Arguments with Defaults
DATA_DIR=${DATA_DIR:-"/datasets/LibriSpeech"}
DATASET=${DATASET:-"dev-clean"}
RESULT_DIR=${RESULT_DIR:-"/results"}
LOG_DIR=${RESULT_DIR}/logs
CREATE_LOGFILE=${CREATE_LOGFILE:-"true"}
TRT_PRECISION=${TRT_PRECISION:-"fp16"}
PYTORCH_PRECISION=${PYTORCH_PRECISION:-"fp16"}
NUM_STEPS=${NUM_STEPS:-"100"}
BATCH_SIZE=${BATCH_SIZE:-64}
NUM_FRAMES=${NUM_FRAMES:-512}
FORCE_ENGINE_REBUILD=${FORCE_ENGINE_REBUILD:-"false"}
CSV_PATH=${CSV_PATH:-"/results/res.csv"}
TRT_PREDICTION_PATH=${TRT_PREDICTION_PATH:-"none"}
PYT_PREDICTION_PATH=${PYT_PREDICTION_PATH:-"none"}
VERBOSE=${VERBOSE:-"false"}
USE_DYNAMIC_SHAPE=${USE_DYNAMIC_SHAPE:-"yes"}
# Set up flag-based arguments
TRT_PREC=""
if [ "$TRT_PRECISION" = "fp16" ] ; then
TRT_PREC="--trt_fp16"
elif [ "$TRT_PRECISION" = "fp32" ] ; then
TRT_PREC=""
else
echo "Unknown <trt_precision> argument"
exit -2
fi
PYTORCH_PREC=""
if [ "$PYTORCH_PRECISION" = "fp16" ] ; then
PYTORCH_PREC="--pyt_fp16"
elif [ "$PYTORCH_PRECISION" = "fp32" ] ; then
PYTORCH_PREC=""
else
echo "Unknown <pytorch_precision> argument"
exit -2
fi
SHOULD_VERBOSE=""
if [ "$VERBOSE" = "true" ] ; then
SHOULD_VERBOSE="--verbose"
fi
STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
STEPS=" --num_steps $NUM_STEPS"
fi
# Making engine and onnx directories in RESULT_DIR if they don't already exist
ONNX_DIR=$RESULT_DIR/onnxs
ENGINE_DIR=$RESULT_DIR/engines
mkdir -p $ONNX_DIR
mkdir -p $ENGINE_DIR
mkdir -p $LOG_DIR
if [ "$USE_DYNAMIC_SHAPE" = "no" ] ; then
PREFIX=BS${BATCH_SIZE}_NF${NUM_FRAMES}
DYNAMIC_PREFIX=" --static_shape "
else
PREFIX=DYNAMIC
fi
# Currently, TRT parser for ONNX can't parse mixed-precision weights, so ONNX
# export will always be FP32. This is also enforced in perf.py
ONNX_FILE=fp32_${PREFIX}.onnx
ENGINE_FILE=${TRT_PRECISION}_${PREFIX}.engine
# If an ONNX with the same precision and number of frames exists, don't recreate it because
# TRT engine construction can be done on an onnx of any batch size
# "%P" only prints filenames (rather than absolute/relative path names)
EXISTING_ONNX=$(find $ONNX_DIR -name ${ONNX_FILE} -printf "%P\n" | head -n 1)
SHOULD_MAKE_ONNX=""
if [ -z "$EXISTING_ONNX" ] ; then
SHOULD_MAKE_ONNX="--make_onnx"
else
ONNX_FILE=${EXISTING_ONNX}
fi
# Follow FORCE_ENGINE_REBUILD about reusing existing engines.
# If false, the existing engine must match precision, batch size, and number of frames
SHOULD_MAKE_ENGINE=""
if [ "$FORCE_ENGINE_REBUILD" != "true" ] ; then
EXISTING_ENGINE=$(find $ENGINE_DIR -name "${ENGINE_FILE}")
if [ -n "$EXISTING_ENGINE" ] ; then
SHOULD_MAKE_ENGINE="--use_existing_engine"
fi
fi
if [ "$TRT_PREDICTION_PATH" = "none" ] ; then
TRT_PREDICTION_PATH=""
else
TRT_PREDICTION_PATH=" --trt_prediction_path=${TRT_PREDICTION_PATH}"
fi
if [ "$PYT_PREDICTION_PATH" = "none" ] ; then
PYT_PREDICTION_PATH=""
else
PYT_PREDICTION_PATH=" --pyt_prediction_path=${PYT_PREDICTION_PATH}"
fi
CMD="python tensorrt/perf.py"
CMD+=" --batch_size $BATCH_SIZE"
CMD+=" --engine_batch_size $BATCH_SIZE"
CMD+=" --model_toml configs/jasper10x5dr_nomask.toml"
CMD+=" --dataset_dir $DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt_path $CHECKPOINT"
CMD+=" $SHOULD_VERBOSE"
CMD+=" $TRT_PREC"
CMD+=" $PYTORCH_PREC"
CMD+=" $STEPS"
CMD+=" --engine_path ${RESULT_DIR}/engines/${ENGINE_FILE}"
CMD+=" --onnx_path ${RESULT_DIR}/onnxs/${ONNX_FILE}"
CMD+=" --seq_len $NUM_FRAMES"
CMD+=" $SHOULD_MAKE_ONNX"
CMD+=" $SHOULD_MAKE_ENGINE"
CMD+=" $DYNAMIC_PREFIX"
CMD+=" --csv_path $CSV_PATH"
CMD+=" $1 $2 $3 $4 $5 $6 $7 $8 $9"
CMD+=" $TRT_PREDICTION_PATH"
CMD+=" $PYT_PREDICTION_PATH"
if [ "$CREATE_LOGFILE" == "true" ] ; then
export GBS=$(expr $BATCH_SIZE )
printf -v TAG "jasper_tensorrt_inference_benchmark_%s_gbs%d" "$PYTORCH_PRECISION" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$LOG_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi
mkdir -p ${RESULT_DIR}/logs
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
$CMD |& tee $LOGFILE
grep 'latency' $LOGFILE
fi
set +x

View file

@ -1,43 +0,0 @@
#!/bin/bash
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# A usage example of inference_benchmark.sh.
export NUM_STEPS=100
export FORCE_ENGINE_REBUILD="false"
export CHECKPOINT=${CHECKPOINT:-"/checkpoints/jasper_fp16.pt"}
export CREATE_LOGFILE="true"
prec=fp16
export TRT_PRECISION=$prec
export PYTORCH_PRECISION=$prec
trap "exit" INT
for use_dynamic in yes no;
do
export USE_DYNAMIC_SHAPE=${use_dynamic}
export CSV_PATH="/results/${prec}.csv"
for nf in 208 304 512 704 1008 1680;
do
export NUM_FRAMES=$nf
for bs in 1 2 4 8 16 32 64;
do
export BATCH_SIZE=$bs
echo "Doing batch size ${bs}, sequence length ${nf}, precision ${prec}"
bash tensorrt/scripts/inference_benchmark.sh $1 $2 $3 $4 $5 $6
done
done
done

View file

@ -1,155 +0,0 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''Contains helper functions for TRT components of JASPER inference
'''
import pycuda.driver as cuda
import tensorrt as trt
import onnxruntime as ort
import numpy as np
class HostDeviceMem(object):
'''Type for managing host and device buffers
A simple class which is more explicit than dealing with a 2-tuple.
'''
def __init__(self, host_mem, device_mem):
self.host = host_mem
self.device = device_mem
def __str__(self):
return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
def __repr__(self):
return self.__str__()
def build_engine_from_parser(args):
'''Builds TRT engine from an ONNX file
Note that network output 1 is unmarked so that the engine will not use
vestigial length calculations associated with masked_fill
'''
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if args.verbose else trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
builder.max_batch_size = 64
if args.trt_fp16:
builder.fp16_mode = True
print("Optimizing for FP16")
config_flags = 1 << int(trt.BuilderFlag.FP16) # | 1 << int(trt.BuilderFlag.STRICT_TYPES)
max_size = 4*1024*1024*1024
max_len = args.max_seq_len
else:
config_flags = 0
max_size = 4*1024*1024*1024
max_len = args.max_seq_len
if args.max_workspace_size > 0:
builder.max_workspace_size = args.max_workspace_size
else:
builder.max_workspace_size = max_size
config = builder.create_builder_config()
config.flags = config_flags
if not args.static_shape:
profile = builder.create_optimization_profile()
if args.transpose:
profile.set_shape("FEATURES", min=(1,192,64), opt=(args.engine_batch_size,256,64), max=(builder.max_batch_size, max_len, 64))
else:
profile.set_shape("FEATURES", min=(1,64,192), opt=(args.engine_batch_size,64,256), max=(builder.max_batch_size, 64, max_len))
config.add_optimization_profile(profile)
explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(explicit_batch)
with trt.OnnxParser(network, TRT_LOGGER) as parser:
with open(args.onnx_path, 'rb') as model:
parsed = parser.parse(model.read())
print ("Parsing returned ", parsed, "dynamic_shape= " , not args.static_shape, "\n")
return builder.build_engine(network, config=config)
def deserialize_engine(engine_path, is_verbose):
'''Deserializes TRT engine at engine_path
'''
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if is_verbose else trt.Logger(trt.Logger.WARNING)
with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
engine = runtime.deserialize_cuda_engine(f.read())
return engine
def allocate_buffers_with_existing_inputs(context, inp):
'''
allocate_buffers() (see TRT python samples) but uses existing inputs on the device
inp: List of pointers to device memory. Pointers are in the same order as
would be produced by allocate_buffers(). That is, inputs are in the
order defined by iterating through `engine`
'''
# Add input to bindings
bindings = [0,0]
outputs = []
engine = context.engine
batch_size = inp[0].shape
inp_idx = engine.get_binding_index("FEATURES")
inp_b = inp[0].data_ptr()
assert(inp[0].is_contiguous())
bindings[inp_idx] = inp_b
sh = inp[0].shape
batch_size = sh[0]
orig_shape = context.get_binding_shape(inp_idx)
if orig_shape[0]==-1:
context.set_binding_shape(inp_idx, trt.Dims([batch_size, sh[1], sh[2]]))
assert context.all_binding_shapes_specified
out_idx = engine.get_binding_index("LOGITS")
# Allocate output buffer by querying the size from the context. This may be different for different input shapes.
out_shape = context.get_binding_shape(out_idx)
#print ("Out_shape: ", out_shape)
h_output = cuda.pagelocked_empty(tuple(out_shape), dtype=np.float32())
# print ("Out bytes: " , h_output.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
bindings[out_idx] = int(d_output)
hdm = HostDeviceMem(h_output, d_output)
outputs.append(hdm)
return outputs, bindings, out_shape
def get_engine(args):
'''Acquire an inference engine.
If --use_existing_engine is set, deserialize the engine at --engine_path.
Else, if both --engine_path and --onnx_path are given, build a new TRT engine from the ONNX file and serialize it to --engine_path.
Else, if only --onnx_path is given, fall back to an ONNX Runtime session.
'''
engine = None
if args.engine_path is not None and args.use_existing_engine:
engine = deserialize_engine(args.engine_path, args.verbose)
elif args.engine_path is not None and args.onnx_path is not None:
# Build a new engine and serialize it.
print("Building TRT engine ....")
engine = build_engine_from_parser(args)
if engine is not None:
with open(args.engine_path, 'wb') as f:
f.write(engine.serialize())
print("TRT engine saved at " + args.engine_path + " ...")
elif args.onnx_path is not None:
ort_session = ort.InferenceSession(args.onnx_path)
return ort_session
else:
raise Exception("One of the following sets of arguments must be provided:\n"+
"<engine_path> + --use_existing_engine\n"+
"<engine_path> + <onnx_path>\n"+
"in order to construct a TRT engine")
if engine is None:
raise Exception("Failed to acquire TRT engine")
return engine
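# Hypothetical usage sketch (attribute names match the checks above; values are
# illustrative only):
#   from argparse import Namespace
#   args = Namespace(engine_path='/results/engines/fp16_DYNAMIC.engine',
#                    onnx_path=None, use_existing_engine=True, verbose=False)
#   engine = get_engine(args)   # deserializes the prebuilt TRT engine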

View file

@ -14,449 +14,488 @@
import argparse
import copy
import itertools
import math
import os
import random
import time
import toml
import pyprof
import torch
import numpy as np
import torch.cuda.profiler as profiler
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel
from dataset import AudioToTextDataLayer
from helpers import (add_ctc_labels, model_multi_gpu, monitor_asr_train_progress,
print_dict, print_once, process_evaluation_batch,
process_evaluation_epoch)
from model import AudioPreprocessing, CTCLossNM, GreedyCTCDecoder, Jasper
from optimizers import Novograd, AdamW
from common import helpers
from common.dali.data_loader import DaliDataLoader
from common.dataset import AudioDataset, get_data_loader
from common.features import BaseFeatures, FilterbankFeatures
from common.helpers import (Checkpointer, greedy_wer, num_weights, print_once,
process_evaluation_epoch)
from common.optimizers import AdamW, lr_policy, Novograd
from common.tb_dllogger import flush_log, init_log, log
from jasper import config
from jasper.model import CTCLossNM, GreedyCTCDecoder, Jasper
def lr_policy(initial_lr, step, N):
"""
learning rate decay
Args:
initial_lr: base learning rate
step: current iteration number
N: total number of iterations over which learning rate is decayed
"""
min_lr = 0.00001
res = initial_lr * ((N - step) / N) ** 2
return max(res, min_lr)
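# Worked example of this legacy policy: with initial_lr=1e-3 and N=10,000 total
# steps, at step=5,000 the rate is 1e-3 * ((10000 - 5000) / 10000) ** 2 = 2.5e-4,
# still above the 1e-5 floor.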
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
training = parser.add_argument_group('training setup')
training.add_argument('--epochs', default=400, type=int,
help='Number of epochs for the entire training; influences the lr schedule')
training.add_argument("--warmup_epochs", default=0, type=int,
help='Initial epochs of increasing learning rate')
training.add_argument("--hold_epochs", default=0, type=int,
help='Constant max learning rate epochs after warmup')
training.add_argument('--epochs_this_job', default=0, type=int,
help=('Run for a number of epochs with no effect on the lr schedule. '
'Useful for re-starting the training.'))
training.add_argument('--cudnn_benchmark', action='store_true', default=True,
help='Enable cudnn benchmark')
training.add_argument('--amp', '--fp16', action='store_true', default=False,
help='Use mixed precision training')
training.add_argument('--seed', default=42, type=int, help='Random seed')
training.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0),
type=int, help='GPU id used for distributed training')
training.add_argument('--pre_allocate_range', default=None, type=int, nargs=2,
help='Warmup with batches of length [min, max] before training')
training.add_argument('--pyprof', action='store_true', help='Enable pyprof profiling')
optim = parser.add_argument_group('optimization setup')
optim.add_argument('--batch_size', default=32, type=int,
help='Global batch size')
optim.add_argument('--lr', default=1e-3, type=float,
help='Peak learning rate')
optim.add_argument("--min_lr", default=1e-5, type=float,
help='minimum learning rate')
optim.add_argument("--lr_policy", default='exponential', type=str,
choices=['exponential', 'legacy'], help='lr scheduler')
optim.add_argument("--lr_exp_gamma", default=0.99, type=float,
help='gamma factor for exponential lr scheduler')
optim.add_argument('--weight_decay', default=1e-3, type=float,
help='Weight decay for the optimizer')
optim.add_argument('--grad_accumulation_steps', default=1, type=int,
help='Number of accumulation steps')
optim.add_argument('--optimizer', default='novograd', type=str,
choices=['novograd', 'adamw'], help='Optimization algorithm')
optim.add_argument('--ema', type=float, default=0.0,
help='Discount factor for exp averaging of model weights')
io = parser.add_argument_group('feature and checkpointing setup')
io.add_argument('--dali_device', type=str, choices=['none', 'cpu', 'gpu'],
default='gpu', help='Use DALI pipeline for fast data processing')
io.add_argument('--resume', action='store_true',
help='Try to resume from last saved checkpoint.')
io.add_argument('--ckpt', default=None, type=str,
help='Path to a checkpoint for resuming training')
io.add_argument('--save_frequency', default=10, type=int,
help='Checkpoint saving frequency in epochs')
io.add_argument('--keep_milestones', default=[100, 200, 300], type=int, nargs='+',
help='Milestone checkpoints to keep from removing')
io.add_argument('--save_best_from', default=380, type=int,
help='Epoch on which to begin tracking best checkpoint (dev WER)')
io.add_argument('--eval_frequency', default=200, type=int,
help='Number of steps between evaluations on dev set')
io.add_argument('--log_frequency', default=25, type=int,
help='Number of steps between printing training stats')
io.add_argument('--prediction_frequency', default=100, type=int,
help='Number of steps between printing sample decodings')
io.add_argument('--model_config', type=str, required=True,
help='Path of the model configuration file')
io.add_argument('--train_manifests', type=str, required=True, nargs='+',
help='Paths of the training dataset manifest file')
io.add_argument('--val_manifests', type=str, required=True, nargs='+',
help='Paths of the evaluation datasets manifest files')
io.add_argument('--max_duration', type=float,
help='Discard samples longer than max_duration')
io.add_argument('--pad_to_max_duration', action='store_true', default=False,
help='Pad training sequences to max_duration')
io.add_argument('--dataset_dir', required=True, type=str,
help='Root dir of dataset')
io.add_argument('--output_dir', type=str, required=True,
help='Directory for logs and checkpoints')
io.add_argument('--log_file', type=str, default=None,
help='Path to save the training logfile.')
return parser.parse_args()
def save(model, ema_model, optimizer, epoch, output_dir, optim_level):
"""
Saves model checkpoint
Args:
model: model
ema_model: model with exponential averages of weights
optimizer: optimizer
epoch: epoch of model training
output_dir: path to save model checkpoint
"""
out_fpath = os.path.join(output_dir, f"Jasper_epoch{epoch}_checkpoint.pt")
print_once(f"Saving {out_fpath}...")
if torch.distributed.is_initialized():
torch.distributed.barrier()
rank = torch.distributed.get_rank()
else:
rank = 0
if rank == 0:
checkpoint = {
'epoch': epoch,
'state_dict': getattr(model, 'module', model).state_dict(),
'optimizer': optimizer.state_dict(),
'amp': amp.state_dict() if optim_level > 0 else None,
}
if ema_model is not None:
checkpoint['ema_state_dict'] = getattr(ema_model, 'module', ema_model).state_dict()
torch.save(checkpoint, out_fpath)
print_once('Saved.')
def reduce_tensor(tensor, num_gpus):
rt = tensor.clone()
dist.all_reduce(rt, op=dist.ReduceOp.SUM)
return rt.true_divide(num_gpus)
def apply_ema(model, ema_model, decay):
if not decay:
return
st = model.state_dict()
add_module = hasattr(model, 'module') and not hasattr(ema_model, 'module')
for k,v in ema_model.state_dict().items():
if add_module and not k.startswith('module.'):
k = 'module.' + k
v.copy_(decay * v + (1 - decay) * st[k])
sd = getattr(model, 'module', model).state_dict()
for k, v in ema_model.state_dict().items():
v.copy_(decay * v + (1 - decay) * sd[k])
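# Example: with decay=0.999, each call moves an EMA weight to
# 0.999 * ema_value + 0.001 * current_value.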
def train(
data_layer,
data_layer_eval,
model,
ema_model,
ctc_loss,
greedy_decoder,
optimizer,
optim_level,
labels,
multi_gpu,
args,
fn_lr_policy=None):
"""Trains model
Args:
data_layer: training data layer
data_layer_eval: evaluation data layer
model: model ( encapsulates data processing, encoder, decoder)
ctc_loss: loss function
greedy_decoder: greedy ctc decoder
optimizer: optimizer
optim_level: AMP optimization level
labels: list of output labels
multi_gpu: true if multi gpu training
args: script input argument list
fn_lr_policy: learning rate adjustment function
"""
def eval(model, name=''):
"""Evaluates model on evaluation dataset
"""
with torch.no_grad():
_global_var_dict = {
'EvalLoss': [],
'predictions': [],
'transcripts': [],
}
eval_dataloader = data_layer_eval.data_iterator
for data in eval_dataloader:
tensors = []
for d in data:
if isinstance(d, torch.Tensor):
tensors.append(d.cuda())
else:
tensors.append(d)
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
@torch.no_grad()
def evaluate(epoch, step, val_loader, val_feat_proc, labels, model,
ema_model, ctc_loss, greedy_decoder, use_amp, use_dali=False):
model.eval()
if optim_level == 1:
with amp.disable_casts():
t_processed_signal_e, t_processed_sig_length_e = audio_preprocessor(t_audio_signal_e, t_a_sig_length_e)
else:
t_processed_signal_e, t_processed_sig_length_e = audio_preprocessor(t_audio_signal_e, t_a_sig_length_e)
if jasper_encoder.use_conv_mask:
t_log_probs_e, t_encoded_len_e = model.forward((t_processed_signal_e, t_processed_sig_length_e))
else:
t_log_probs_e = model.forward(t_processed_signal_e)
t_loss_e = ctc_loss(log_probs=t_log_probs_e, targets=t_transcript_e, input_length=t_encoded_len_e, target_length=t_transcript_len_e)
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
for model, subset in [(model, 'dev'), (ema_model, 'dev_ema')]:
if model is None:
continue
values_dict = dict(
loss=[t_loss_e],
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e]
)
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
model.eval()
start_time = time.time()
agg = {'losses': [], 'preds': [], 'txts': []}
# final aggregation (across all workers and minibatches) and logging of results
wer, eloss = process_evaluation_epoch(_global_var_dict)
if name != '':
name = '_' + name
print_once(f"==========>>>>>>Evaluation{name} Loss: {eloss}\n")
print_once(f"==========>>>>>>Evaluation{name} WER: {wer}\n")
print_once("Starting .....")
start_time = time.time()
train_dataloader = data_layer.data_iterator
epoch = args.start_epoch
step = epoch * args.step_per_epoch
audio_preprocessor = model.module.audio_preprocessor if hasattr(model, 'module') else model.audio_preprocessor
data_spectr_augmentation = model.module.data_spectr_augmentation if hasattr(model, 'module') else model.data_spectr_augmentation
jasper_encoder = model.module.jasper_encoder if hasattr(model, 'module') else model.jasper_encoder
while True:
if multi_gpu:
data_layer.sampler.set_epoch(epoch)
print_once("Starting epoch {0}, step {1}".format(epoch, step))
last_epoch_start = time.time()
batch_counter = 0
average_loss = 0
for data in train_dataloader:
tensors = []
for d in data:
if isinstance(d, torch.Tensor):
tensors.append(d.cuda())
else:
tensors.append(d)
if batch_counter == 0:
if fn_lr_policy is not None:
adjusted_lr = fn_lr_policy(step)
for param_group in optimizer.param_groups:
param_group['lr'] = adjusted_lr
optimizer.zero_grad()
last_iter_start = time.time()
t_audio_signal_t, t_a_sig_length_t, t_transcript_t, t_transcript_len_t = tensors
model.train()
if optim_level == 1:
with amp.disable_casts():
t_processed_signal_t, t_processed_sig_length_t = audio_preprocessor(t_audio_signal_t, t_a_sig_length_t)
for batch in val_loader:
if use_dali:
# with DALI, the data is already on GPU
feat, feat_lens, txt, txt_lens = batch
if val_feat_proc is not None:
feat, feat_lens = val_feat_proc(feat, feat_lens, use_amp)
else:
t_processed_signal_t, t_processed_sig_length_t = audio_preprocessor(t_audio_signal_t, t_a_sig_length_t)
t_processed_signal_t = data_spectr_augmentation(t_processed_signal_t)
if jasper_encoder.use_conv_mask:
t_log_probs_t, t_encoded_len_t = model.forward((t_processed_signal_t, t_processed_sig_length_t))
else:
t_log_probs_t = model.forward(t_processed_signal_t)
batch = [t.cuda(non_blocking=True) for t in batch]
audio, audio_lens, txt, txt_lens = batch
feat, feat_lens = val_feat_proc(audio, audio_lens, use_amp)
t_loss_t = ctc_loss(log_probs=t_log_probs_t, targets=t_transcript_t, input_length=t_encoded_len_t, target_length=t_transcript_len_t)
if args.gradient_accumulation_steps > 1:
t_loss_t = t_loss_t / args.gradient_accumulation_steps
log_probs, enc_lens = model.forward(feat, feat_lens)
loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
pred = greedy_decoder(log_probs)
if 0 < optim_level <= 3:
with amp.scale_loss(t_loss_t, optimizer) as scaled_loss:
scaled_loss.backward()
else:
t_loss_t.backward()
batch_counter += 1
average_loss += t_loss_t.item()
agg['losses'] += helpers.gather_losses([loss])
agg['preds'] += helpers.gather_predictions([pred], labels)
agg['txts'] += helpers.gather_transcripts([txt], [txt_lens], labels)
if batch_counter % args.gradient_accumulation_steps == 0:
optimizer.step()
if step % args.train_frequency == 0:
t_predictions_t = greedy_decoder(log_probs=t_log_probs_t)
e_tensors = [t_predictions_t, t_transcript_t, t_transcript_len_t]
train_wer = monitor_asr_train_progress(e_tensors, labels=labels)
print_once("Loss@Step: {0} ::::::: {1}".format(step, str(average_loss)))
print_once("Step time: {0} seconds".format(time.time() - last_iter_start))
if step > 0 and step % args.eval_frequency == 0:
print_once("Doing Evaluation ....................... ...... ... .. . .")
eval(model)
if args.ema > 0:
eval(ema_model, 'EMA')
step += 1
batch_counter = 0
average_loss = 0
if args.num_steps is not None and step >= args.num_steps:
break
if args.num_steps is not None and step >= args.num_steps:
break
print_once("Finished epoch {0} in {1}".format(epoch, time.time() - last_epoch_start))
epoch += 1
if epoch % args.save_frequency == 0 and epoch > 0:
save(model, ema_model, optimizer, epoch, args.output_dir, optim_level)
if args.num_steps is None and epoch >= args.num_epochs:
break
print_once("Done in {0}".format(time.time() - start_time))
print_once("Final Evaluation ....................... ...... ... .. . .")
eval(model)
if args.ema > 0:
eval(ema_model, 'EMA')
save(model, ema_model, optimizer, epoch, args.output_dir, optim_level)
wer, loss = process_evaluation_epoch(agg)
log((epoch,), step, subset, {'loss': loss, 'wer': 100.0 * wer,
'took': time.time() - start_time})
model.train()
return wer
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
def main():
args = parse_args()
assert(torch.cuda.is_available())
torch.backends.cudnn.benchmark = args.cudnn
assert args.prediction_frequency % args.log_frequency == 0
torch.backends.cudnn.benchmark = args.cudnn_benchmark
# set up distributed training
if args.local_rank is not None:
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
multi_gpu = torch.distributed.is_initialized()
multi_gpu = int(os.environ.get('WORLD_SIZE', 1)) > 1
if multi_gpu:
print_once("DISTRIBUTED TRAINING with {} gpus".format(torch.distributed.get_world_size()))
torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')
world_size = dist.get_world_size()
print_once(f'Distributed training with {world_size} GPUs\n')
else:
world_size = 1
# define amp optimization level
optim_level = 1 if args.amp else 0
torch.manual_seed(args.seed + args.local_rank)
np.random.seed(args.seed + args.local_rank)
random.seed(args.seed + args.local_rank)
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
init_log(args)
train_manifest = args.train_manifest
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input']
featurizer_config_eval = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
featurizer_config_eval["optimization_level"] = optim_level
cfg = config.load(args.model_config)
config.apply_duration_flags(cfg, args.max_duration, args.pad_to_max_duration)
sampler_type = featurizer_config.get("sampler", 'default')
perturb_config = jasper_model_definition.get('perturb', None)
if args.pad_to_max:
assert(args.max_duration > 0)
featurizer_config['max_duration'] = args.max_duration
featurizer_config_eval['max_duration'] = args.max_duration
featurizer_config['pad_to'] = -1
featurizer_config_eval['pad_to'] = -1
symbols = helpers.add_ctc_blank(cfg['labels'])
print_once('model_config')
print_dict(jasper_model_definition)
assert args.grad_accumulation_steps >= 1
assert args.batch_size % args.grad_accumulation_steps == 0
batch_size = args.batch_size // args.grad_accumulation_steps
if args.gradient_accumulation_steps < 1:
raise ValueError('Invalid gradient accumulation steps parameter {}'.format(args.gradient_accumulation_steps))
if args.batch_size % args.gradient_accumulation_steps != 0:
raise ValueError('gradient accumulation step {} is not divisible by batch size {}'.format(args.gradient_accumulation_steps, args.batch_size))
print_once('Setting up datasets...')
train_dataset_kw, train_features_kw = config.input(cfg, 'train')
val_dataset_kw, val_features_kw = config.input(cfg, 'val')
use_dali = args.dali_device in ('cpu', 'gpu')
if use_dali:
assert train_dataset_kw['ignore_offline_speed_perturbation'], \
"DALI doesn't support offline speed perturbation"
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
perturb_config=perturb_config,
manifest_filepath=train_manifest,
labels=dataset_vocab,
batch_size=args.batch_size // args.gradient_accumulation_steps,
multi_gpu=multi_gpu,
pad_to_max=args.pad_to_max,
sampler=sampler_type)
# pad_to_max_duration is not supported by DALI - have simple padders
if train_features_kw['pad_to_max_duration']:
train_feat_proc = BaseFeatures(
pad_align=train_features_kw['pad_align'],
pad_to_max_duration=True,
max_duration=train_features_kw['max_duration'],
sample_rate=train_features_kw['sample_rate'],
window_size=train_features_kw['window_size'],
window_stride=train_features_kw['window_stride'])
train_features_kw['pad_to_max_duration'] = False
else:
train_feat_proc = None
data_layer_eval = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config_eval,
manifest_filepath=val_manifest,
labels=dataset_vocab,
batch_size=args.batch_size,
multi_gpu=multi_gpu,
pad_to_max=args.pad_to_max
)
if val_features_kw['pad_to_max_duration']:
val_feat_proc = BaseFeatures(
pad_align=val_features_kw['pad_align'],
pad_to_max_duration=True,
max_duration=val_features_kw['max_duration'],
sample_rate=val_features_kw['sample_rate'],
window_size=val_features_kw['window_size'],
window_stride=val_features_kw['window_stride'])
val_features_kw['pad_to_max_duration'] = False
else:
val_feat_proc = None
model = Jasper(feature_config=featurizer_config, jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
train_loader = DaliDataLoader(gpu_id=args.local_rank,
dataset_path=args.dataset_dir,
config_data=train_dataset_kw,
config_features=train_features_kw,
json_names=args.train_manifests,
batch_size=batch_size,
grad_accumulation_steps=args.grad_accumulation_steps,
pipeline_type="train",
device_type=args.dali_device,
symbols=symbols)
ctc_loss = CTCLossNM( num_classes=len(ctc_vocab))
val_loader = DaliDataLoader(gpu_id=args.local_rank,
dataset_path=args.dataset_dir,
config_data=val_dataset_kw,
config_features=val_features_kw,
json_names=args.val_manifests,
batch_size=batch_size,
pipeline_type="val",
device_type=args.dali_device,
symbols=symbols)
else:
train_dataset_kw, train_features_kw = config.input(cfg, 'train')
train_dataset = AudioDataset(args.dataset_dir,
args.train_manifests,
symbols,
**train_dataset_kw)
train_loader = get_data_loader(train_dataset,
batch_size,
multi_gpu=multi_gpu,
shuffle=True,
num_workers=4)
train_feat_proc = FilterbankFeatures(**train_features_kw)
val_dataset_kw, val_features_kw = config.input(cfg, 'val')
val_dataset = AudioDataset(args.dataset_dir,
args.val_manifests,
symbols,
**val_dataset_kw)
val_loader = get_data_loader(val_dataset,
batch_size,
multi_gpu=multi_gpu,
shuffle=False,
num_workers=4,
drop_last=False)
val_feat_proc = FilterbankFeatures(**val_features_kw)
dur = train_dataset.duration / 3600
dur_f = train_dataset.duration_filtered / 3600
nsampl = len(train_dataset)
print_once(f'Training samples: {nsampl} ({dur:.1f}h, '
f'filtered {dur_f:.1f}h)')
if train_feat_proc is not None:
train_feat_proc.cuda()
if val_feat_proc is not None:
val_feat_proc.cuda()
steps_per_epoch = len(train_loader) // args.grad_accumulation_steps
# set up the model
model = Jasper(encoder_kw=config.encoder(cfg),
decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
model.cuda()
ctc_loss = CTCLossNM(n_classes=len(symbols))
greedy_decoder = GreedyCTCDecoder()
print_once("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
print_once("Number of parameters in decode: {0}".format(model.jasper_decoder.num_weights()))
print_once(f'Model size: {num_weights(model) / 10**6:.1f}M params\n')
N = len(data_layer)
if sampler_type == 'default':
args.step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
elif sampler_type == 'bucket':
args.step_per_epoch = int(len(data_layer.sampler) / args.batch_size )
print_once('-----------------')
print_once('Have {0} examples to train on.'.format(N))
print_once('Have {0} steps / (gpu * epoch).'.format(args.step_per_epoch))
print_once('-----------------')
fn_lr_policy = lambda s: lr_policy(args.lr, s, args.num_epochs * args.step_per_epoch)
model.cuda()
if args.optimizer_kind == "novograd":
optimizer = Novograd(model.parameters(),
lr=args.lr,
weight_decay=args.weight_decay)
elif args.optimizer_kind == "adam":
optimizer = AdamW(model.parameters(),
lr=args.lr,
weight_decay=args.weight_decay)
# optimization
kw = {'lr': args.lr, 'weight_decay': args.weight_decay}
if args.optimizer == "novograd":
optimizer = Novograd(model.parameters(), **kw)
elif args.optimizer == "adamw":
optimizer = AdamW(model.parameters(), **kw)
else:
raise ValueError("invalid optimizer choice: {}".format(args.optimizer_kind))
raise ValueError(f'Invalid optimizer "{args.optimizer}"')
if 0 < optim_level <= 3:
adjust_lr = lambda step, epoch, optimizer: lr_policy(
step, epoch, args.lr, optimizer, steps_per_epoch=steps_per_epoch,
warmup_epochs=args.warmup_epochs, hold_epochs=args.hold_epochs,
num_epochs=args.epochs, policy=args.lr_policy, min_lr=args.min_lr,
exp_gamma=args.lr_exp_gamma)
if args.amp:
model, optimizer = amp.initialize(
min_loss_scale=1.0,
models=model,
optimizers=optimizer,
opt_level='O' + str(optim_level))
min_loss_scale=1.0, models=model, optimizers=optimizer,
opt_level='O1', max_loss_scale=512.0)
if args.ema > 0:
ema_model = copy.deepcopy(model)
else:
ema_model = None
model = model_multi_gpu(model, multi_gpu)
if multi_gpu:
model = DistributedDataParallel(model)
if args.pyprof:
pyprof.init(enable_function_stack=True)
# load checkpoint
meta = {'best_wer': 10**6, 'start_epoch': 0}
checkpointer = Checkpointer(args.output_dir, 'Jasper',
args.keep_milestones, args.amp)
if args.resume:
args.ckpt = checkpointer.last_checkpoint() or args.ckpt
if args.ckpt is not None:
print_once("loading model from {}".format(args.ckpt))
checkpoint = torch.load(args.ckpt, map_location="cpu")
if hasattr(model, 'module'):
model.module.load_state_dict(checkpoint['state_dict'], strict=True)
else:
model.load_state_dict(checkpoint['state_dict'], strict=True)
checkpointer.load(args.ckpt, model, ema_model, optimizer, meta)
if args.ema > 0:
if 'ema_state_dict' in checkpoint:
if hasattr(ema_model, 'module'):
ema_model.module.load_state_dict(checkpoint['ema_state_dict'], strict=True)
else:
ema_model.load_state_dict(checkpoint['ema_state_dict'], strict=True)
start_epoch = meta['start_epoch']
best_wer = meta['best_wer']
epoch = 1
step = start_epoch * steps_per_epoch + 1
if args.pyprof:
torch.autograd.profiler.emit_nvtx().__enter__()
profiler.start()
# training loop
model.train()
# pre-allocate
if args.pre_allocate_range is not None:
n_feats = train_features_kw['n_filt']
pad_align = train_features_kw['pad_align']
a, b = args.pre_allocate_range
for n_frames in range(a, b + pad_align, pad_align):
print_once(f'Pre-allocation ({batch_size}x{n_feats}x{n_frames})...')
feat = torch.randn(batch_size, n_feats, n_frames, device='cuda')
feat_lens = torch.ones(batch_size, device='cuda').fill_(n_frames)
txt = torch.randint(high=len(symbols)-1, size=(batch_size, 100),
device='cuda')
txt_lens = torch.ones(batch_size, device='cuda').fill_(100)
log_probs, enc_lens = model(feat, feat_lens)
del feat
loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
loss.backward()
model.zero_grad()
for epoch in range(start_epoch + 1, args.epochs + 1):
if multi_gpu and not use_dali:
train_loader.sampler.set_epoch(epoch)
epoch_utts = 0
accumulated_batches = 0
epoch_start_time = time.time()
for batch in train_loader:
if accumulated_batches == 0:
adjust_lr(step, epoch, optimizer)
optimizer.zero_grad()
step_loss = 0
step_utts = 0
step_start_time = time.time()
if use_dali:
# with DALI, the data is already on GPU
feat, feat_lens, txt, txt_lens = batch
if train_feat_proc is not None:
feat, feat_lens = train_feat_proc(feat, feat_lens, args.amp)
else:
print_once('WARNING: ema_state_dict not found in the checkpoint')
print_once('WARNING: initializing EMA model with regular params')
if hasattr(ema_model, 'module'):
ema_model.module.load_state_dict(checkpoint['state_dict'], strict=True)
batch = [t.cuda(non_blocking=True) for t in batch]
audio, audio_lens, txt, txt_lens = batch
feat, feat_lens = train_feat_proc(audio, audio_lens, args.amp)
log_probs, enc_lens = model(feat, feat_lens)
loss = ctc_loss(log_probs, txt, enc_lens, txt_lens)
loss /= args.grad_accumulation_steps
if torch.isnan(loss).any():
print_once(f'WARNING: loss is NaN; skipping update')
else:
if multi_gpu:
step_loss += reduce_tensor(loss.data, world_size).item()
else:
ema_model.load_state_dict(checkpoint['state_dict'], strict=True)
step_loss += loss.item()
optimizer.load_state_dict(checkpoint['optimizer'])
if args.amp:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
step_utts += batch[0].size(0) * world_size
epoch_utts += batch[0].size(0) * world_size
accumulated_batches += 1
if optim_level > 0:
amp.load_state_dict(checkpoint['amp'])
if accumulated_batches % args.grad_accumulation_steps == 0:
optimizer.step()
apply_ema(model, ema_model, args.ema)
args.start_epoch = checkpoint['epoch']
else:
args.start_epoch = 0
if step % args.log_frequency == 0:
preds = greedy_decoder(log_probs)
wer, pred_utt, ref = greedy_wer(preds, txt, txt_lens, symbols)
train(data_layer, data_layer_eval, model, ema_model,
ctc_loss=ctc_loss, \
greedy_decoder=greedy_decoder, \
optimizer=optimizer, \
labels=ctc_vocab, \
optim_level=optim_level, \
multi_gpu=multi_gpu, \
fn_lr_policy=fn_lr_policy if args.lr_decay else None, \
args=args)
if step % args.prediction_frequency == 0:
print_once(f' Decoded: {pred_utt[:90]}')
print_once(f' Reference: {ref[:90]}')
step_time = time.time() - step_start_time
log((epoch, step % steps_per_epoch or steps_per_epoch, steps_per_epoch),
step, 'train',
{'loss': step_loss,
'wer': 100.0 * wer,
'throughput': step_utts / step_time,
'took': step_time,
'lrate': optimizer.param_groups[0]['lr']})
step_start_time = time.time()
if step % args.eval_frequency == 0:
wer = evaluate(epoch, step, val_loader, val_feat_proc,
symbols, model, ema_model, ctc_loss,
greedy_decoder, args.amp, use_dali)
if wer < best_wer and epoch >= args.save_best_from:
checkpointer.save(model, ema_model, optimizer, epoch,
step, best_wer, is_best=True)
best_wer = wer
step += 1
accumulated_batches = 0
# end of step
# DALI iterator needs to be exhausted;
# if not using DALI, simulate drop_last=True with grad accumulation
if not use_dali and step > steps_per_epoch * epoch:
break
epoch_time = time.time() - epoch_start_time
log((epoch,), None, 'train_avg', {'throughput': epoch_utts / epoch_time,
'took': epoch_time})
if epoch % args.save_frequency == 0 or epoch in args.keep_milestones:
checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
if 0 < args.epochs_this_job <= epoch - start_epoch:
print_once(f'Finished after {args.epochs_this_job} epochs.')
break
# end of epoch
if args.pyprof:
profiler.stop()
torch.autograd.profiler.emit_nvtx().__exit__(None, None, None)
log((), None, 'train_avg', {'throughput': epoch_utts / epoch_time})
if epoch == args.epochs:
evaluate(epoch, step, val_loader, val_feat_proc, symbols, model,
ema_model, ctc_loss, greedy_decoder, args.amp, use_dali)
checkpointer.save(model, ema_model, optimizer, epoch, step, best_wer)
flush_log()
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument("--local_rank", default=None, type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--num_epochs", default=10, type=int, help='number of training epochs. if number of steps if specified will overwrite this')
parser.add_argument("--num_steps", default=None, type=int, help='if specified overwrites num_epochs and will only train for this number of iterations')
parser.add_argument("--save_freq", dest="save_frequency", default=300, type=int, help='number of epochs until saving checkpoint. will save at the end of training too.')
parser.add_argument("--eval_freq", dest="eval_frequency", default=200, type=int, help='number of iterations until doing evaluation on full dataset')
parser.add_argument("--train_freq", dest="train_frequency", default=25, type=int, help='number of iterations until printing training statistics on the past iteration')
parser.add_argument("--lr", default=1e-3, type=float, help='learning rate')
parser.add_argument("--weight_decay", default=1e-3, type=float, help='weight decay rate')
parser.add_argument("--train_manifest", type=str, required=True, help='relative path given dataset folder of training manifest file')
parser.add_argument("--model_toml", type=str, required=True, help='relative path given dataset folder of model configuration file')
parser.add_argument("--val_manifest", type=str, required=True, help='relative path given dataset folder of evaluation manifest file')
parser.add_argument("--max_duration", type=float, help='maximum duration of audio samples for training and evaluation')
parser.add_argument("--pad_to_max", action="store_true", default=False, help="pad sequence to max_duration")
parser.add_argument("--gradient_accumulation_steps", default=1, type=int, help='number of accumulation steps')
parser.add_argument("--optimizer", dest="optimizer_kind", default="novograd", type=str, help='optimizer')
parser.add_argument("--dataset_dir", dest="dataset_dir", required=True, type=str, help='root dir of dataset')
parser.add_argument("--lr_decay", action="store_true", default=False, help='use learning rate decay')
parser.add_argument("--cudnn", action="store_true", default=False, help="enable cudnn benchmark")
parser.add_argument("--amp", "--fp16", action="store_true", default=False, help="use mixed precision training")
parser.add_argument("--output_dir", type=str, required=True, help='saves results in this directory')
parser.add_argument("--ckpt", default=None, type=str, help="if specified continues training from given checkpoint. Otherwise starts from beginning")
parser.add_argument("--seed", default=42, type=int, help='seed')
parser.add_argument("--ema", type=float, default=0.0, help='discount factor for exponential averaging of model weights during training')
args=parser.parse_args()
return args
if __name__=="__main__":
args = parse_args()
print_dict(vars(args))
main(args)
if __name__ == "__main__":
main()

View file

@ -1,38 +1,10 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.09-py3
FROM tensorrtserver_client as trtis-client
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tritonserver:20.10-py3-clientsdk
FROM ${FROM_IMAGE_NAME}
RUN apt-get update && apt-get install -y python3
ARG version=6.0.1-1+cuda10.1
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb \
&& dpkg -i cuda-repo-*.deb \
&& wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb \
&& dpkg -i nvidia-machine-learning-repo-*.deb \
&& apt-get update \
&& apt-get install -y --no-install-recommends libnvinfer6=${version} libnvonnxparsers6=${version} libnvparsers6=${version} libnvinfer-plugin6=${version} libnvinfer-dev=${version} libnvonnxparsers-dev=${version} libnvparsers-dev=${version} libnvinfer-plugin-dev=${version} python-libnvinfer=${version} python3-libnvinfer=${version}
RUN cp -r /usr/lib/python3.6/dist-packages/tensorrt /opt/conda/lib/python3.6/site-packages/tensorrt
RUN apt update && apt install -y python3-pyaudio libsndfile1
ENV PATH=$PATH:/usr/src/tensorrt/bin
WORKDIR /tmp/onnx-trt
COPY tensorrt/onnx-trt.patch .
RUN git clone https://github.com/onnx/onnx-tensorrt.git && cd onnx-tensorrt && git checkout b677b9cbf19af803fa6f76d05ce558e657e4d8b6 && git submodule update --init --recursive && \
patch -f < ../onnx-trt.patch && mkdir build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DGPU_ARCHS="60 70 75" && make -j16 && make install && mv -f /usr/lib/libnvonnx* /usr/lib/x86_64-linux-gnu/ && ldconfig
# Here's a good place to install pip reqs from JoC repo.
# At the same step, also install TRT pip reqs
WORKDIR /tmp/pipReqs
COPY requirements.txt /tmp/pipReqs/pytRequirements.txt
COPY tensorrt/requirements.txt /tmp/pipReqs/trtRequirements.txt
COPY triton/requirements.txt /tmp/pipReqs/trtisRequirements.txt
RUN apt-get update && apt-get install -y --no-install-recommends portaudio19-dev && pip install -r pytRequirements.txt && pip install -r trtRequirements.txt && pip install -r trtisRequirements.txt
#Copy the perf_client over
COPY --from=trtis-client /workspace/install/bin/perf_client /workspace/install/bin/perf_client
#Copy the python wheel and install with pip
COPY --from=trtis-client /workspace/install/python/tensorrtserver*.whl /tmp/
RUN pip install /tmp/tensorrtserver*.whl && rm /tmp/tensorrtserver*.whl
RUN pip3 install -U pip
RUN pip3 install onnxruntime unidecode inflect soundfile
WORKDIR /workspace/jasper
COPY . .

View file

@ -1,55 +1,37 @@
# Jasper Inference Using Triton Inference Server
# Deploying the Jasper Inference model using Triton Inference Server
This is a subfolder of the Jasper for PyTorch repository that provides scripts to deploy high-performance inference using NVIDIA Triton Inference Server (formerly NVIDIA TensorRT Inference Server). It offers different options for the inference model pipeline.
This subfolder of the Jasper for PyTorch repository contains scripts for deployment of high-performance inference on NVIDIA Triton Inference Server as well as detailed performance analysis. It offers different options for the inference model pipeline.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Triton Inference Server Overview](#triton-inference-server-overview)
* [Inference Pipeline in Triton Inference Server](#inference-pipeline-in-triton-inference-server)
- [Solution overview](#solution-overview)
- [Inference Pipeline in Triton Inference Server](#inference-pipeline-in-triton-inference-server)
- [Setup](#setup)
* [Supported Software](#supported-software)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Scripts and sample code](#scripts-and-sample-code)
- [Performance](#performance)
* [Inference Benchmarking in Triton Inference Server](#inference-benchmarking-in-triton-inference-server)
* [Results](#results)
* [Performance analysis for Triton Inference Server: NVIDIA T4](#performance-analysis-for-triton-inference-server-nvidia-t4)
* [Maximum Batch Size](#maximum-batch-size)
* [Batching techniques: Static versus Dynamic Batching](#batching-techniques-static-versus-dynamic-batching)
* [TensorRT/ONNX/PyTorch JIT comparisons](#tensorrt/onnx/pytorch-jit-comparisons)
* [Throughput Comparison](#throughput-comparison)
* [Latency Comparison](#latency-comparison)
## Model overview
### Model architecture
* [Inference Benchmarking in Triton Inference Server](#inference-benchmarking-in-triton-inference-server)
* [Results](#results)
* [Performance Analysis for Triton Inference Server: NVIDIA T4](#performance-analysis-for-triton-inference-server-nvidia-t4)
* [Maximum batch size](#maximum-batch-size)
* [Batching techniques: Static versus Dynamic Batching](#batching-techniques-static-versus-dynamic)
* [TensorRT, ONNX, and PyTorch JIT comparisons](#tensorrt-onnx-and-pytorch-jit-comparisons)
- [Release Notes](#release-notes)
* [Changelog](#change-log)
* [Known issues](#known-issues)
Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference. More information about Jasper and its training can be found in the [Jasper PyTorch README](../README.md).
By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
In the original paper Jasper is trained with masked convolutions, which masks out the padded part of an input sequence in a batch before the 1D-Convolution.
Masking is not used for inference. In inference, the mask operation does not improve accuracy on the test and development datasets, while omitting it gives better inference performance, especially after TensorRT optimization.
More information on the Jasper model architecture can be found in the [Jasper PyTorch README](../README.md).
### Triton Inference Server Overview
## Solution Overview
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains detailed performance analysis as well as scripts to run Jasper inference using Triton Inference Server.
A typical Triton Inference Server pipeline can be broken down into the following steps:
1. The client serializes the inference request into a message and sends it to the server (Client Send).
1. The client serializes the inference request into a message and sends it to the server (Client Send).
2. The message travels over the network from the client to the server (Network).
3. The message arrives at the server, and is deserialized (Server Receive).
4. The request is placed on the queue (Server Queue).
@ -58,15 +40,16 @@ A typical Triton Inference Server pipeline can be broken down into the following
7. The completed message then travels over the network from the server to the client (Network).
8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).
Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time, compared to steps 5-6. As backend deep learning systems like Jasper are rarely exposed directly to end users, but instead only interfacing with local front-end servers, for the sake of Jasper, we can consider that all clients are local.
Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time, compared to step 5. As backend deep learning systems like Jasper are rarely exposed directly to end users, but instead only interfacing with local front-end servers, for the sake of Jasper, we can consider that all clients are local.
In this section, we will go over how to launch both the Triton Inference Server and the client, and how to find the best-performing solution for your specific application needs.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
More information on how to perform inference using NVIDIA Triton Inference Server can be found in [triton/README.md](https://github.com/triton-inference-server/server/blob/master/README.md).
## Inference Pipeline in Triton Inference Server
The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend:
The Jasper model pipeline consists of 3 components, where each part can be customized to be a different backend:
**Data preprocessor**
@ -74,11 +57,11 @@ The data processor transforms an input raw audio file into a spectrogram. By def
**Acoustic model**
The acoustic model takes in the spectrogram and outputs a probability over a list of characters. This part is the most compute intensive, taking more than 90% of the entire end-to-end pipeline. The acoustic model is the only component with learnable parameters and what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (More details in [../README.md]). We do not use masking for inference .
The acoustic model takes in the spectrogram and outputs a probability over a list of characters. This part is the most compute intensive, taking more than 90% of the entire end-to-end pipeline. The acoustic model is the only component with learnable parameters and what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation for training (More details in [Jasper PyTorch README](../README.md)). We do not use masking for inference.
**Greedy decoder**
The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability.
The decoder takes the probabilities over the list of characters and outputs the final transcription. Greedy decoding is a fast and simple way of doing this by always choosing the character with the maximum probability.
To run a model with TensorRT, we first construct the model in PyTorch, which is then exported into an ONNX static graph. Finally, a TensorRT engine is constructed from the ONNX file and can be launched to do inference; a rough sketch of the export step is shown below. The following table shows which backends are supported for each part along the model pipeline.
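A rough, hypothetical sketch of the export step (the stand-in module, file names, and shapes below are illustrative only; the repository's own scripts handle the real export):

```python
import torch
import torch.nn as nn

# Stand-in for the Jasper acoustic model: any module mapping a
# (batch, n_filt, time) spectrogram to (batch, time, n_classes) log-probs.
class TinyAcousticModel(nn.Module):
    def __init__(self, n_filt=64, n_classes=29):
        super().__init__()
        self.conv = nn.Conv1d(n_filt, n_classes, kernel_size=11, padding=5)

    def forward(self, feat):
        return self.conv(feat).transpose(1, 2).log_softmax(dim=-1)

model = TinyAcousticModel().eval()
dummy_feat = torch.randn(1, 64, 256)  # (batch, n_filt, time)
torch.onnx.export(
    model, dummy_feat, "jasper_like.onnx",
    input_names=["FEATURES"], output_names=["LOGITS"],
    dynamic_axes={"FEATURES": {0: "batch", 2: "time"},
                  "LOGITS":   {0: "batch", 1: "time"}},
    opset_version=11)
```

A TensorRT engine can then be built from the resulting ONNX file (for example with `trtexec` or the TensorRT ONNX parser) and served by Triton.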
@ -93,37 +76,24 @@ In order to run inference with TensorRT outside of the inference server, refer t
## Setup
### Supported Software
The repository contains a folder `./triton` with a `Dockerfile` which extends the PyTorch 20.10-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
The following software version configuration is supported and has been tested.
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
- [Triton Inference Server 20.10 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
- Access to [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA CUDA repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
- Supported GPUs:
- [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
- [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
- [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
- [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp)
|Software|Version|
|--------|-------|
|Python|3.6.9|
|PyTorch|1.2.0|
|TensorRT|6.0.1.5|
|CUDA|10.1.243|
The following section lists the requirements in order to start inference with Jasper in Triton Inference Server.
### Requirements
The repository contains a folder `./trtis/` with a `Dockerfile` which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [Triton Inference Server 20.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
* Access to [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA cuda repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
Required Python packages are listed in `requirements.txt`. These packages are automatically installed when the Docker container is built.
## Quick Start Guide
Running the following scripts will build and launch the container containing all required dependencies for native PyTorch as well as Triton. This is necessary for using inference and can also be used for data download, processing, and training of the model. For more information on the scripts and arguments, refer to the [Advanced](#advanced) section.
1. Clone the repository.
@ -132,114 +102,103 @@ Running the following scripts will build and launch the container containing all
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
2. Build the Jasper PyTorch container.
Running the following scripts will build the container which contains all the required dependencies for data download and processing as well as converting the model.
```bash
bash scripts/docker/build.sh
```
3. Start an interactive session in the Docker container:
```bash
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
```
Where <DATA_DIR>, <CHECKPOINT_DIR> and <RESULT_DIR> can be either empty or absolute directory paths to the dataset, existing checkpoints, or potential output files. When left empty, they default to `datasets/`, `checkpoints/`, and `results/`, respectively. The `/datasets`, `/checkpoints`, `/results` directories will be mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host.
Note that `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` directly correspond to the same arguments in `scripts/docker/launch.sh` and `trt/scripts/docker/launch.sh` mentioned in the [Jasper PyTorch README](../README.md) and [Jasper TensorRT README](../tensorrt/README.md).
Briefly, `<DATA_DIR>` should contain, or be prepared to contain a `LibriSpeech` sub-directory (created in [Acquiring Dataset](../trt/README.md)), `<CHECKPOINT_DIR>` should contain a PyTorch model checkpoint (`*.pt`) file obtained through training described in [Jasper PyTorch README](../README.md), and `<RESULT_DIR>` should be prepared to contain converted model and logs.
4. Downloading the `test-clean` part of `LibriSpeech` is required for model conversion, but it is not required for inference on Triton Inference Server, which can use a single .wav audio file. To download and preprocess LibriSpeech, run the following inside the container:
```bash
bash triton/scripts/download_triton_librispeech.sh
bash triton/scripts/preprocess_triton_librispeech.sh
```
5. (Option 1) Convert pretrained PyTorch model checkpoint into Triton Inference Server compatible model backends.
Inside the container, run:
```bash
export CHECKPOINT_PATH=<CHECKPOINT_PATH>
export CONVERT_PRECISIONS=<CONVERT_PRECISIONS>
export CONVERTS=<CONVERTS>
bash triton/scripts/export_model.sh
```
Where `<CHECKPOINT_PATH>` (`"/checkpoints/jasper_fp16.pt"`) is the absolute file path of the pretrained checkpoint, `<CONVERT_PRECISIONS>` (`"fp16" "fp32"`) is the list of precisions used for conversion, and `<CONVERTS>` (`"feature-extractor" "decoder" "ts-trace" "onnx" "tensorrt"`) is the list of conversions to be applied. The feature extractor converts only to TorchScript trace module (`feature-extractor`), the decoder only to TorchScript script module (`decoder`), and the Jasper model can convert to TorchScript trace module (`ts-trace`), ONNX (`onnx`), or TensorRT (`tensorrt`).
A pretrained PyTorch model checkpoint for model conversion can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp).
More details can be found in the [Advanced](#advanced) section under [Scripts and sample code](#scripts-and-sample-code).
6. (Option 2) Download pre-exported inference checkpoints from NGC.
Alternatively, you can skip the manual model export and download already generated model backends for every version of the model pipeline.
* [Jasper_ONNX](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_onnx_fp16_amp/version),
* [Jasper_TorchScript](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_torchscript_fp16_amp/version),
* [Jasper_TensorRT_Turing](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_turing/version),
* [Jasper_TensorRT_Volta](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_volta/version).
If you wish to use the TensorRT pipeline, make sure to download the correct version for your hardware. The extracted model folder should contain 3 subfolders `feature-extractor-ts-trace`, `decoder-ts-script` and `jasper-x`, where `x` can be `ts-trace`, `onnx`, or `tensorrt`, depending on the model backend. Copy the 3 model folders to the directory `./triton/model_repo/fp16` in your Jasper project.
7. Build a container that extends Triton Inference Client:
From outside the container, run:
```bash
bash triton/scripts/docker/build_triton_client.sh
```
Once the above steps are completed you can either run inference benchmarks or perform inference on real data.
8. (Option 1) Run all inference benchmarks.
From outside the container, run:
```bash
export RESULT_DIR=<RESULT_DIR>
export PRECISION_TESTS=<PRECISION_TESTS>
export BATCH_SIZES=<BATCH_SIZES>
export SEQ_LENS=<SEQ_LENS>
bash triton/scripts/execute_all_perf_runs.sh
```
Where `<RESULT_DIR>` is the absolute path to potential output files (`./results`), `<PRECISION_TESTS>` is a list of precisions to be tested (`"fp16" "fp32"`), `<BATCH_SIZES>` is a list of tested batch sizes (`"1" "2" "4" "8"`), and `<SEQ_LENS>` are the tested sequence lengths (`"32000" "112000" "267200"`).
Note: This can take several hours to complete due to the extensiveness of the benchmark. More details about the benchmark are found in the [Advanced](#advanced) section under [Performance](#performance).
9. (Option 2) Run inference on real data using the Client and Triton Inference Server.
9.1 From outside the container, restart the server:
```bash
bash triton/scripts/run_server.sh <MODEL_TYPE> <PRECISION>
```
9.2 From outside the container, submit the client request using:
```bash
bash triton/scripts/run_client.sh <MODEL_TYPE> <DATA_DIR> <FILE>
```
Where `<MODEL_TYPE>` can be either "ts-trace", "tensorrt" or "onnx", and `<PRECISION>` is either "fp32" or "fp16". `<DATA_DIR>` is an absolute local path to the directory of files. <FILE> is the relative path to <DATA_DIR> to either an audio file in .wav format or a manifest file in .json format.
Note: If <FILE> is *.json, <DATA_DIR> should be the path to the LibriSpeech dataset. In this case, this script will do both inference and evaluation on the corresponding LibriSpeech dataset.
## Advanced
@ -249,130 +208,177 @@ The following sections provide greater details about the Triton Inference Server
### Scripts and sample code
The `triton/` directory contains the following files:
* `jasper-client.py`: Python client script that takes an audio file and a specific model pipeline type and submits a client request to the server to run inference with the model on the given audio file.
* `speech_utils.py`: helper functions for `jasper-client.py`.
* `converter.py`: Python script for model conversion to different backends.
* `jasper_module.py`: helper functions for `converter.py`.
* `model_repo_configs/`: directory with Triton model config files for different backend and precision configurations.
The `triton/scripts/` directory has easy to use scripts to run supported functionalities, such as:
* `./docker/build_triton_client.sh`: builds container
* `execute_all_perf_runs.sh`: runs all benchmarks using Triton Inference Server performance client; calls `generate_perf_results.sh`
* `export_model.sh`: from pretrained PyTorch checkpoint generates backends for every version of the model inference pipeline.
* `prepare_model_repository.sh`: copies model config files from `./model_repo_configs/` to `./deploy/model_repo` and creates links to generated model backends, setting up the model repository for Triton Inference Server
* `generate_perf_results.sh`: runs benchmark with `perf-client` for specific configuration and calls `run_perf_client.sh`
* `run_server.sh`: launches Triton Inference Server
* `run_client.sh`: launches client by using `jasper-client.py` to submit inference requests to server
### Running the Triton Inference Server
Launch the Triton Inference Server in detached mode to run in the background by default:
```bash
bash triton/scripts/run_server.sh
```
To run in the foreground interactively, for debugging purposes, run:
```bash
DAEMON="--detach=false" bash trinton/scripts/run_server.sh
```
The script mounts and loads models at `$PWD/triton/deploy/model_repo` to the server with all visible GPUs. In order to selectively choose the devices, set `NVIDIA_VISIBLE_DEVICES`.
### Running the Triton Inference Client
*Real data*
In order to run the client with real data, run:
```bash
bash triton/scripts/run_client.sh <backend> <data directory> <audio file>
```
The script calls `triton/jasper-client.py` which preprocesses data and sends/receives requests to/from the server.
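For reference, a request to the deployed ensemble carries two inputs (`AUDIO_SIGNAL`, `NUM_SAMPLES`) and one output (`TRANSCRIPT`), matching the ensemble configs under `triton/model_repo_configs/`. The snippet below is a hedged sketch using the `tritonclient` HTTP API and is an assumption for illustration only; the supported path is `triton/jasper-client.py` via `run_client.sh`.
```python
# Minimal sketch of a synchronous request to the Jasper TensorRT ensemble.
# Assumes the tritonclient package and a server started by run_server.sh.
import numpy as np
import soundfile
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")

audio, _ = soundfile.read("example.wav", dtype="float32")
signal = np.expand_dims(audio, 0).astype(np.float16)    # [1, num_samples], FP16 input
num_samples = np.array([[len(audio)]], dtype=np.int32)  # [1, 1]

inputs = [
    triton_http.InferInput("AUDIO_SIGNAL", list(signal.shape), "FP16"),
    triton_http.InferInput("NUM_SAMPLES", list(num_samples.shape), "INT32"),
]
inputs[0].set_data_from_numpy(signal)
inputs[1].set_data_from_numpy(num_samples)

result = client.infer("jasper-tensorrt-ensemble", inputs,
                      outputs=[triton_http.InferRequestedOutput("TRANSCRIPT")])
# Per-frame character indices; repeats and blanks still need to be collapsed
# and mapped to characters, as jasper-client.py does.
transcript_ids = result.as_numpy("TRANSCRIPT")
```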
*Synthetic data*
In order to run the client with synthetic data for performance measurements, run:
```bash
export MODEL_NAME=jasper-tensorrt-ensemble
export MODEL_VERSION=1
export BATCH_SIZE=1
export MAX_LATENCY=500
export MAX_CONCURRENCY=64
export AUDIO_LENGTH=32000
export SERVER_HOSTNAME=localhost
export RESULT_DIR_H=${PWD}/results/perf_client/${MODEL_NAME}/batch_${BATCH_SIZE}_len_${AUDIO_LENGTH}
bash triton/scripts/run_perf_client.sh
```
The export values above are default values. The script waits until the server is up and running, sends requests as per the constraints set, and writes results to `/results/results_${TIMESTAMP}.csv`, where `TIMESTAMP=$(date "+%y%m%d_%H%M")` and `/results/` is the results directory mounted in the Docker container.
For more information about `perf_client`, refer to the [official documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/optimization.html#perf-client).
## Performance
### Inference Benchmarking in Triton Inference Server
To benchmark the inference performance on a Volta, Turing, or Ampere GPU, run `bash triton/scripts/execute_all_perf_runs.sh` according to [Quick-Start-Guide](#quick-start-guide) Step 8.
By default, this script measures inference performance for all 3 model pipelines: the PyTorch JIT (`ts-trace`) pipeline, the ONNX (`onnx`) pipeline, and the TensorRT (`tensorrt`) pipeline, both with FP32 and FP16 precision. Each pipeline is measured for different audio input lengths (2 sec, 7 sec, 16.7 sec) and a range of server batch sizes (up to 8). This takes place in `triton/scripts/generate_perf_results.sh`. For a specific audio length and batch size, a static versus dynamic batching comparison is performed.
### Results
In the following section, we analyze the results using the example of the Triton pipeline.
#### Performance Analysis for Triton Inference Server: NVIDIA T4
Based on the figures below, we recommend using the Dynamic Batcher with `max_batch_size=8` and `max_queue_delay_microseconds` as large as possible to fit within your latency window (the values used below are extremely large to exaggerate their effect). The largest improvements to both throughput and latency come from increasing the batch size, due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of concurrent requests, while also keeping latency down for infrequent requests.
All results below are obtained using the following configurations:
* Single T4 16GB GPU on a local server
* Jasper Large
* Audio length = 7 seconds
* FP16 precision
* Python 3.6.10
* PyTorch 1.7.0a0+7036e91
* TensorRT 7.2.1.4
* CUDA 11.1.0.024
* CUDNN 8.0.4.30
Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. X-axis indicates the concurrency - the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput.
##### Batching techniques: Static Batching
Static batching is a feature of the inference server that allows inference requests to be served as they are received. The largest improvements to throughput come from increasing the batch size due to efficiency gains in the GPU with larger batches.
<img src="../images/triton_throughput_latency_summary.png" width="100%" height="100%">
![](../images/static_fp16_2s.png)
Figure 1: Throughput vs. Latency for Jasper, Audio Length = 2sec using various model backends available in Triton Inference Server and static batching.
![](../images/static_fp16_7s.png)
Figure 2: Throughput vs. Latency for Jasper, Audio Length = 7sec using various model backends available in Triton Inference Server and static batching.
![](../images/static_fp16_16.7s.png)
Figure 3: Throughput vs. Latency for Jasper, Audio Length = 16.7sec using various model backends available in Triton Inference Server and static batching.
##### Maximum Batch Size
In general, increasing batch size leads to higher throughput at the cost of higher latency. In the following sections, we analyze the results using the example of the Triton pipeline.
These charts can be used to establish the optimal batch size to use in dynamic batching, given a latency budget. For example, in Figure 2 (Audio length = 7s), given a budget of 50ms, the optimal batch size to use for the TensorRT backend is 4. This will result in a maximum throughput of 100 inf/s under the latency constraint. In all three charts, TensorRT shows the best throughput and latency performance for a given batch size.
##### Batching techniques: Dynamic Batching
<img src="../images/triton_static_batching_bs1.png" width="80%" height="80%">
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in an increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the Dynamic Batcher parameter `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and `preferred_batch_size` to indicate your maximum server batch size in the Triton Inference Server model config.
Figures 4, 5, and 6 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency as the requests are queued up to max_queue_delay_microseconds.
<img src="../images/triton_static_batching_bs8.png" width="80%" height="80%">
![](../images/tensorrt_2s.png)
Figure 4: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at maximum server batch size = 8, max_queue_delay_microseconds = 5000, input audio length = 2 seconds, TensorRT backend.
![](../images/tensorrt_7s.png)
Figure 5: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at maximum server batch size = 8, max_queue_delay_microseconds = 5000, input audio length = 7 seconds, TensorRT backend.
![](../images/tensorrt_16.7s.png)
Figure 6: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at maximum server batch size = 8, max_queue_delay_microseconds = 5000, input audio length = 16.7 seconds, TensorRT backend.
<img src="../images/triton_dynamic_batching.png" width="80%" height="80%">
Figure 4: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at client Batch size = 1, maximum server batch size=4, max_queue_delay_microseconds = 5000
##### TensorRT, ONNX, and PyTorch JIT comparisons
The following tables show inference throughput and latency comparisons across all 3 backends for mixed precision and static batching. The main observations are:
* Increasing the batch size leads to higher inference throughput and latency up to a certain batch size, after which both slowly saturate.
* TensorRT is faster than both the PyTorch and ONNX pipelines, achieving a speedup of up to ~1.5x and ~2.4x, respectively.
* The longer the audio length, the lower the throughput and the higher the latency.
###### Throughput Comparison
The following table shows the throughput benchmark results for all 3 model backends in Triton Inference Server using static batching under optimal concurrency.
|Audio length in seconds|Batch Size|TensorRT (inf/s)|PyTorch (inf/s)|ONNX (inf/s)|TensorRT/PyTorch Speedup|TensorRT/Onnx Speedup|
|--- |--- |--- |--- |--- |--- |--- |
| 2.0| 1| 49.67| 55.67| 41.67| 0.89| 1.19|
| 2.0| 2| 98.67| 96.00| 77.33| 1.03| 1.28|
| 2.0| 4| 180.00| 141.33| 118.67| 1.27| 1.52|
| 2.0| 8| 285.33| 202.67| 136.00| 1.41| 2.10|
| 7.0| 1| 47.67| 37.00| 18.00| 1.29| 2.65|
| 7.0| 2| 79.33| 47.33| 46.00| 1.68| 1.72|
| 7.0| 4| 100.00| 73.33| 36.00| 1.36| 2.78|
| 7.0| 8| 117.33| 82.67| 40.00| 1.42| 2.93|
| 16.7| 1| 36.33| 21.67| 11.33| 1.68| 3.21|
| 16.7| 2| 40.67| 25.33| 16.00| 1.61| 2.54|
| 16.7| 4| 46.67| 37.33| 16.00| 1.25| 2.92|
| 16.7| 8| 48.00| 40.00| 18.67| 1.20| 2.57|
###### Latency Comparison
The following table shows the latency benchmark results for all 3 model backends in Triton Inference Server using static batching and a single concurrent request.
|Audio length in seconds|Batch Size|TensorRT (ms)|PyTorch (ms)|ONNX (ms)|TensorRT/PyTorch Speedup|TensorRT/Onnx Speedup|
|--- |--- |--- |--- |--- |--- |--- |
| 2.0| 1| 23.61| 25.06| 31.84| 1.06| 1.35|
| 2.0| 2| 24.56| 25.11| 37.54| 1.02| 1.53|
| 2.0| 4| 25.90| 31.00| 37.20| 1.20| 1.44|
| 2.0| 8| 31.57| 41.76| 37.13| 1.32| 1.18|
| 7.0| 1| 24.79| 30.55| 32.16| 1.23| 1.30|
| 7.0| 2| 28.48| 45.05| 37.47| 1.58| 1.32|
| 7.0| 4| 41.71| 57.71| 37.92| 1.38| 0.91|
| 7.0| 8| 72.19| 98.84| 38.13| 1.37| 0.53|
| 16.7| 1| 30.66| 48.42| 32.74| 1.58| 1.07|
| 16.7| 2| 52.79| 81.89| 37.82| 1.55| 0.72|
| 16.7| 4| 92.86| 115.03| 37.91| 1.24| 0.41|
| 16.7| 8| 170.34| 203.52| 37.84| 2.36| 0.22|
## Release Notes
### Changelog
February 2021
* Updated Triton scripts for compatibility with Triton Inference Server version 2
* Updated Quick Start Guide
* Updated performance results
### Known issues
There are no known issues in this deployment.

View file

@ -0,0 +1,254 @@
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
import os
import json
import torch
import argparse
import importlib
from pytorch.utils import extract_io_props, load_io_props
import logging
def get_parser():
parser = argparse.ArgumentParser()
# required args
parser.add_argument("--model-module", type=str, default="", required=True,
help="Module with model initializer and data loader")
parser.add_argument('--convert', choices=['ts-script', 'ts-trace',
'onnx', 'tensorrt'],
required=True, help='convert to '
'ts-script: TorchScript using torch.jit.script, '
'ts-trace: TorchScript using torch.jit.trace, '
'onnx: ONNX using torch.onnx.export, '
'tensorrt: TensorRT using OnnxParser, ')
parser.add_argument("--max_workspace_size", type=int,
default=512*1024*1024,
help="set the size of the workspace for TensorRT \
conversion")
parser.add_argument("--precision", choices=['fp16', 'fp32'],
default='fp32', help="convert TensorRT or \
TorchScript model in a given precision")
parser.add_argument('--convert-filename', type=str, default='',
help='Saved model name')
parser.add_argument('--save-dir', type=str, default='',
help='Saved model directory')
parser.add_argument("--max-batch-size", type=int, default=1,
help="Specifies the 'max_batch_size' in the Triton \
model config and in TensorRT builder. See the \
Triton and TensorRT documentation for more info.")
parser.add_argument('--device', type=str, default='cuda',
help='Select device for conversion.')
parser.add_argument('model_arguments', nargs=argparse.REMAINDER,
help='arguments that will be ignored by \
converter lib and will be forwarded to your convert \
script')
return parser
class Converter:
def __init__(self, model, dataloader):
self.model = model
self.dataloader = dataloader
self.convert_props = {
'ts-script': {
'convert_func': self.to_torchscript,
'convert_filename': 'model.pt'
},
'ts-trace': {
'convert_func' : self.to_torchscript,
'convert_filename': 'model.pt'
},
'onnx': {
'convert_func' : self.to_onnx,
'convert_filename': 'model.onnx'
},
'tensorrt': {
'convert_func' : self.to_tensorrt,
'convert_filename': 'model.plan'
}
}
def convert(self, convert_type, save_dir, convert_filename,
device, precision='fp32',
max_batch_size=1,
# args for TensorRT:
max_workspace_size=None):
''' convert the model '''
self.convert_type = convert_type
self.max_workspace_size = max_workspace_size
self.max_batch_size = max_batch_size
self.precision = precision
# override default name if user provided name
if convert_filename != '':
self.convert_props[convert_type]['convert_filename'] = convert_filename
# setup device
torch_device = torch.device(device)
# prepare model
self.model.to(torch_device)
self.model.eval()
assert (not self.model.training), \
"[Converter error]: could not set the model to eval() mode!"
io_props = None
if self.dataloader is not None:
io_props = extract_io_props(self.model, self.dataloader,
torch_device, precision, max_batch_size)
assert self.convert_type == "ts-script" or io_props is not None, \
"Input and output properties are empty. For conversion types \
other than \'ts-script\' input shapes are required to generate dummy input. \
Make sure that dataloader works correctly or that IO props file is provided."
# prepare save path
model_name = self.convert_props[convert_type]['convert_filename']
convert_model_path = os.path.join(save_dir, model_name)
# get convert method depending on the convert type
convert_func = self.convert_props[convert_type]['convert_func']
# convert the model - will be saved to disk
if self.convert_type == "tensorrt":
io_filepath = "triton/tensorrt_io_props_" + str(precision) + ".json"
io_props = load_io_props(io_filepath)
convert_func(self.model, torch_device, io_props, convert_model_path)
assert (os.path.isfile(convert_model_path)), \
f"[Converter error]: saving model to {convert_model_path} failed!"
def generate_dummy_input(self, io_props, device):
from pytorch.utils import triton_type_to_torch_type
dummy_input = []
for s,t in zip(io_props['opt_shapes'], io_props['input_types']):
t = triton_type_to_torch_type[t]
tensor = torch.empty(size=s, dtype=t, device=device).random_()
dummy_input.append(tensor)
dummy_input = tuple(dummy_input)
return dummy_input
def to_onnx(self, model, device, io_props, convert_model_path):
''' convert the model to onnx '''
dummy_input = self.generate_dummy_input(io_props, device)
opset_version = 11
# convert the model to onnx
with torch.no_grad():
torch.onnx.export(model, dummy_input,
convert_model_path,
do_constant_folding=True,
input_names=io_props['input_names'],
output_names=io_props['output_names'],
dynamic_axes=io_props['dynamic_axes'],
opset_version=opset_version,
enable_onnx_checker=True)
def to_tensorrt(self, model, device, io_props, convert_model_path):
''' convert the model to tensorrt '''
assert (self.max_workspace_size), "[Converter error]: for TensorRT conversion you must provide \'max_workspace_size\'."
import tensorrt as trt
from pytorch.utils import build_tensorrt_engine
# convert the model to onnx first
self.to_onnx(model, device, io_props, convert_model_path)
del model
torch.cuda.empty_cache()
zipped = zip(io_props['input_names'], io_props['min_shapes'],
io_props['opt_shapes'], io_props['max_shapes'])
shapes = []
for name,min_shape,opt_shape,max_shape in zipped:
d = {"name":name, "min": min_shape,
"opt": opt_shape, "max": max_shape}
shapes.append(d)
tensorrt_fp16 = True if self.precision == 'fp16' else False
# build tensorrt engine
engine = build_tensorrt_engine(convert_model_path, shapes,
self.max_workspace_size,
self.max_batch_size,
tensorrt_fp16)
assert engine is not None, "[Converter error]: TensorRT build failure"
# write tensorrt engine
with open(convert_model_path, 'wb') as f:
f.write(engine.serialize())
def to_torchscript(self, model, device, io_props, convert_model_path):
''' convert the model to torchscript '''
if self.convert_type == 'ts-script':
model_ts = torch.jit.script(model)
else: # self.convert_type == 'ts-trace'
dummy_input = self.generate_dummy_input(io_props, device)
with torch.no_grad():
model_ts = torch.jit.trace(model, dummy_input)
# save the model
torch.jit.save(model_ts, convert_model_path)
if __name__=='__main__':
parser = get_parser()
args = parser.parse_args()
model_args_list = args.model_arguments[1:]
logging.basicConfig(level=logging.INFO)
mm = importlib.import_module(args.model_module)
model = mm.init_model(model_args_list, args.precision, args.device)
dataloader = mm.get_dataloader(model_args_list)
converter = Converter(model, dataloader)
converter.convert(args.convert, args.save_dir, args.convert_filename,
args.device, args.precision,
args.max_batch_size,
args.max_workspace_size)

View file

@ -18,7 +18,6 @@ import sys
import argparse
import numpy as np
import os
from speech_utils import AudioSegment, SpeechClient
import soundfile
import pyaudio as pa
@ -303,9 +302,9 @@ if __name__ == '__main__':
FLAGS = parser.parse_args()
protocol = FLAGS.protocol.lower()
valid_model_platforms = {"pyt","onnx", "trt"}
valid_model_platforms = {"ts-trace","onnx", "tensorrt"}
if FLAGS.model_platform not in valid_model_platforms:
raise ValueError("Invalid model_platform {}. Valid choices are {"
@ -321,8 +320,6 @@ if __name__ == '__main__':
verbose=FLAGS.verbose, mode="synchronous",
from_features=False
)
filenames = []
transcripts = []
@ -361,8 +358,7 @@ if __name__ == '__main__':
files_and_speeds = data['files']
audio_path = [x['fname'] for x in files_and_speeds if x['speed'] == filter_speed][0]
filenames.append(os.path.join(data_dir, audio_path))
transcript_text = data['transcript']
transcript_text = normalize_string(transcript_text, labels=labels, table=table)
transcripts.append(transcript_text) #parse_transcript(transcript_text, labels_map, blank_index)) # convert to vocab indices

View file

@ -0,0 +1,196 @@
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************
import torch
import sys
sys.path.append("./")
class FeatureCollate:
def __init__(self, feature_proc):
self.feature_proc = feature_proc
def __call__(self, batch):
bs = len(batch)
max_len = lambda l,idx: max(el[idx].size(0) for el in l)
audio = torch.zeros(bs, max_len(batch, 0))
audio_lens = torch.zeros(bs, dtype=torch.int32)
for i, sample in enumerate(batch):
audio[i].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
audio_lens[i] = sample[1]
ret = (audio, audio_lens)
if self.feature_proc is not None:
feats, feat_lens = self.feature_proc(audio, audio_lens)
ret = (feats,)
return ret
def get_dataloader(model_args_list):
''' return dataloader for inference '''
from inference import get_parser
from common.helpers import add_ctc_blank
from jasper import config
from common.dataset import (AudioDataset, FilelistDataset, get_data_loader,
SingleAudioDataset)
from common.features import FilterbankFeatures
parser = get_parser()
parser.add_argument('--component', type=str, default="model",
choices=["feature-extractor", "model", "decoder"],
help='Component to convert')
args = parser.parse_args(model_args_list)
if args.component == "decoder":
return None
cfg = config.load(args.model_config)
if args.max_duration is not None:
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if args.pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
symbols = add_ctc_blank(cfg['labels'])
dataset_kw, features_kw = config.input(cfg, 'val')
dataset = AudioDataset(args.dataset_dir, args.val_manifests,
symbols, **dataset_kw)
data_loader = get_data_loader(dataset, args.batch_size, multi_gpu=False,
shuffle=False, num_workers=4, drop_last=False)
feature_proc = None
if args.component == "model":
feature_proc = FilterbankFeatures(**features_kw)
data_loader.collate_fn = FeatureCollate(feature_proc)
return data_loader
def init_feature_extractor(args):
from jasper import config
from common.features import FilterbankFeatures
cfg = config.load(args.model_config)
if args.max_duration is not None:
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if args.pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
_, features_kw = config.input(cfg, 'val')
feature_proc = FilterbankFeatures(**features_kw)
return feature_proc
def init_acoustic_model(args):
from common.helpers import add_ctc_blank
from jasper.model import Jasper
from jasper import config
cfg = config.load(args.model_config)
if args.max_duration is not None:
cfg['input_val']['audio_dataset']['max_duration'] = args.max_duration
cfg['input_val']['filterbank_features']['max_duration'] = args.max_duration
if args.pad_to_max_duration:
assert cfg['input_train']['audio_dataset']['max_duration'] > 0
cfg['input_train']['audio_dataset']['pad_to_max_duration'] = True
if cfg['jasper']['encoder']['use_conv_masks'] == True:
print("[Jasper module]: Warning: setting 'use_conv_masks' \
to False; masked convolutions are not supported.")
cfg['jasper']['encoder']['use_conv_masks'] = False
symbols = add_ctc_blank(cfg['labels'])
model = Jasper(encoder_kw=config.encoder(cfg),
decoder_kw=config.decoder(cfg, n_classes=len(symbols)))
if args.ckpt is not None:
checkpoint = torch.load(args.ckpt, map_location="cpu")
key = 'ema_state_dict' if args.ema else 'state_dict'
state_dict = checkpoint[key]
model.load_state_dict(state_dict, strict=True)
return model
def init_decoder(args):
class GreedyCTCDecoderSimple(torch.nn.Module):
@torch.no_grad()
def forward(self, log_probs):
return log_probs.argmax(dim=-1, keepdim=False).int()
return GreedyCTCDecoderSimple()
def init_model(model_args_list, precision, device):
''' Return either of the components: feature-extractor, model, or decoder.
The returned component is ready to convert '''
from inference import get_parser
parser = get_parser()
parser.add_argument('--component', type=str, default="model",
choices=["feature-extractor", "model", "decoder"],
help='Component to convert')
args = parser.parse_args(model_args_list)
init_comp = {"feature-extractor": init_feature_extractor,
"model": init_acoustic_model,
"decoder": init_decoder}
comp = init_comp[args.component](args)
torch_device = torch.device(device)
print(f"[Jasper module]: using device {torch_device}")
comp.to(torch_device)
comp.eval()
if precision == "fp16":
print("[Jasper module]: using mixed precision")
comp.half()
else:
print("[Jasper module]: using fp32 precision")
return comp

View file

@ -1,32 +0,0 @@
name: "jasper-feature-extractor"
platform: "pytorch_libtorch"
default_model_filename: "jasper-feature-extractor.pt"
max_batch_size: 64
input [ {
name: "AUDIO_SIGNAL__0"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "NUM_SAMPLES__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape { shape: [] }
}
]
output [
{
name: "AUDIO_FEATURES__0"
data_type: TYPE_FP32
dims: [64, -1]
}
,
{
name: "NUM_TIME_STEPS__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [] }
}
]

View file

@ -1,60 +0,0 @@
name: "jasper-onnx-cpu-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-onnx-cpu"
model_version: -1
input_map {
key: "FEATURES"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOGITS"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -1,60 +0,0 @@
name: "jasper-onnx-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-onnx"
model_version: -1
input_map {
key: "FEATURES"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOGITS"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -1,61 +0,0 @@
name: "jasper-pyt-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-pyt"
model_version: -1
input_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOG_PROBS__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -1,60 +0,0 @@
name: "jasper-trt-ensemble"
platform: "ensemble"
max_batch_size: 64#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP32
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "jasper-feature-extractor"
model_version: -1
input_map {
key: "AUDIO_SIGNAL__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "NUM_SAMPLES__1"
value: "NUM_SAMPLES"
}
output_map {
key: "AUDIO_FEATURES__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-trt"
model_version: -1
input_map {
key: "FEATURES"
value: "AUDIO_FEATURES"
}
output_map {
key: "LOGITS"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "jasper-decoder"
model_version: -1
input_map {
key: "CLASS_LOGITS__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "CANDIDATE_TRANSCRIPT__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -23,23 +23,24 @@
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
default_model_filename: "jasper-decoder.pt"
name: "jasper-decoder"
name: "decoder-ts-script"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 64
input [
{
name: "CLASS_LOGITS__0"
data_type: TYPE_FP32
dims: [ -1, 29 ]
}
]
output [
{
name: "CANDIDATE_TRANSCRIPT__0"
data_type: TYPE_INT32
dims: [ -1]
}
{
name: "input__0"
data_type: TYPE_FP16
dims: [ -1, 29 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_INT32
dims: [-1]
}
]

View file

@ -0,0 +1,32 @@
name: "feature-extractor-ts-trace"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 64
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [ -1 ]
},
{
name: "input__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape { shape: [] }
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [64, -1]
},
{
name: "output__1"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [] }
}
]

View file

@ -0,0 +1,63 @@
name: "jasper-onnx-ensemble"
platform: "ensemble"
max_batch_size: 8#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP16
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "feature-extractor-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "input__1"
value: "NUM_SAMPLES"
}
output_map {
key: "output__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-onnx"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "output__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "decoder-ts-script"
model_version: -1
input_map {
key: "input__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "output__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -26,23 +26,25 @@
name: "jasper-onnx"
platform: "onnxruntime_onnx"
default_model_filename: "jasper.onnx"
default_model_filename: "model.onnx"
max_batch_size : 64#MAX_BATCH
input [
{
name: "FEATURES"
data_type: TYPE_FP32
dims: [ 64, -1 ]
}
]
output [
{
name: "LOGITS"
data_type: TYPE_FP32
dims: [ -1, 29 ]
}
]
max_batch_size: 8#MAX_BATCH
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [64, -1]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [-1, 29 ]
}
]
instance_group {
count: 1#NUM_ENGINES
@ -51,6 +53,6 @@ instance_group {
}
#db#dynamic_batching {
#db# preferred_batch_size: 64#MAX_BATCH
#db# preferred_batch_size: 8#MAX_BATCH
#db# max_queue_delay_microseconds: #MAX_QUEUE
#db#}

View file

@ -0,0 +1,63 @@
name: "jasper-tensorrt-ensemble"
platform: "ensemble"
max_batch_size: 8#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP16
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "feature-extractor-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "input__1"
value: "NUM_SAMPLES"
}
output_map {
key: "output__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-tensorrt"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "output__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "decoder-ts-script"
model_version: -1
input_map {
key: "input__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "output__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -0,0 +1,58 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "jasper-tensorrt"
platform: "tensorrt_plan"
default_model_filename: "model.plan"
max_batch_size: 8#MAX_BATCH
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [64, -1]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [-1, 29 ]
}
]
instance_group {
count: 1#NUM_ENGINES
gpus: 0
kind: KIND_GPU
}
#db#dynamic_batching {
#db# preferred_batch_size: 8#MAX_BATCH
#db# max_queue_delay_microseconds: #MAX_QUEUE
#db#}

View file

@ -0,0 +1,63 @@
name: "jasper-ts-trace-ensemble"
platform: "ensemble"
max_batch_size: 8#MAX_BATCH
input {
name: "AUDIO_SIGNAL"
data_type: TYPE_FP16
dims: -1#AUDIO_LENGTH
}
input {
name: "NUM_SAMPLES"
data_type: TYPE_INT32
dims: [ 1 ]
}
output {
name: "TRANSCRIPT"
data_type: TYPE_INT32
dims: [-1]
}
ensemble_scheduling {
step {
model_name: "feature-extractor-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_SIGNAL"
}
input_map {
key: "input__1"
value: "NUM_SAMPLES"
}
output_map {
key: "output__0"
value: "AUDIO_FEATURES"
}
}
step {
model_name: "jasper-ts-trace"
model_version: -1
input_map {
key: "input__0"
value: "AUDIO_FEATURES"
}
output_map {
key: "output__0"
value: "CHARACTER_PROBABILITIES"
}
}
step {
model_name: "decoder-ts-script"
model_version: -1
input_map {
key: "input__0"
value: "CHARACTER_PROBABILITIES"
}
output_map {
key: "output__0"
value: "TRANSCRIPT"
}
}
}

View file

@ -0,0 +1,32 @@
name: "jasper-ts-trace"
platform: "pytorch_libtorch"
default_model_filename: "model.pt"
max_batch_size: 8#MAX_BATCH
input [
{
name: "input__0"
data_type: TYPE_FP16
dims: [64, -1]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP16
dims: [-1, 29]
}
]
instance_group {
count: 1#NUM_ENGINES
gpus: 0
kind: KIND_GPU
}
#db#dynamic_batching {
#db# preferred_batch_size: 8#MAX_BATCH
#db# max_queue_delay_microseconds: #MAX_QUEUE
#db#}

Some files were not shown because too many files have changed in this diff Show more