Adding Jasper/PyT

Przemek Strzelczyk 2019-07-26 20:08:16 +02:00
parent 71fdde7200
commit fa400a7367
49 changed files with 5894 additions and 0 deletions


@@ -0,0 +1,4 @@
results/
*__pycache__
checkpoints/
datasets/

PyTorch/SpeechRecognition/Jasper/.gitignore

@@ -0,0 +1,5 @@
__pycache__
*.pt
results/
datasets/
checkpoints/


@@ -0,0 +1,34 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}
WORKDIR /tmp/unique_for_apex
RUN pip uninstall -y apex || :
RUN pip uninstall -y apex || :
RUN SHA=ToUcHMe git clone https://github.com/NVIDIA/apex.git
WORKDIR /tmp/unique_for_apex/apex
RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
RUN apt-get update && apt-get install -y libsndfile1 && apt-get install -y sox && rm -rf /var/lib/apt/lists/*
WORKDIR /workspace/jasper
COPY . .
RUN pip install --disable-pip-version-check -U -r requirements.txt


@@ -0,0 +1,203 @@
Except where otherwise noted, the following license applies to all files in this repo.
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2019 NVIDIA Corporation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


@@ -0,0 +1,5 @@
Jasper in PyTorch
This repository includes source code (in "parts/") from:
* https://github.com/keithito/tacotron and https://github.com/ryanleary/patter, licensed under the MIT license.


@@ -0,0 +1,816 @@
# Jasper For PyTorch
This repository provides a script and recipe to train the Jasper model to achieve the state-of-the-art accuracy of the acoustic model reported in the paper, and is tested and maintained by NVIDIA.
## Table Of Contents
- [Model overview](#model-overview)
* [Model architecture](#model-architecture)
* [Default configuration](#default-configuration)
* [Feature support matrix](#feature-support-matrix)
* [Features](#features)
* [Mixed precision training](#mixed-precision-training)
* [Enabling mixed precision](#enabling-mixed-precision)
* [Glossary](#glossary)
- [Setup](#setup)
* [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Training process](#training-process)
* [Inference process](#inference-process)
* [Evaluation process](#evaluation-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy: NVIDIA DGX-1 (8x V100 32G)](#training-accuracy-nvidia-dgx-1-8x-v100-32G)
* [Training stability test](#training-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16G)
* [Training performance: NVIDIA DGX-1 (8x V100 32G)](#training-performance-nvidia-dgx-1-8x-v100-32G)
* [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32G)
* [Inference performance results](#inference-performance-results)
* [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16G)
* [Inference performance: NVIDIA DGX-1 (1x V100 32G)](#inference-performance-nvidia-dgx-1-1x-v100-32G)
* [Inference performance: NVIDIA DGX-2 (1x V100 32G)](#inference-performance-nvidia-dgx-2-1x-v100-32G)
* [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
This repository provides an implementation of the Jasper model in PyTorch from the paper `Jasper: An End-to-End Convolutional Neural Acoustic Model` [https://arxiv.org/pdf/1904.03288.pdf](https://arxiv.org/pdf/1904.03288.pdf).
The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting strict real-time requirements of ASR systems in deployment.
The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences
corresponding to a given audio segment. This post-processing step is called decoding.
This repository is a PyTorch implementation of Jasper and provides scripts to train the Jasper 10x5 model with dense residuals from scratch on the [Librispeech](http://www.openslr.org/12) dataset to achieve the greedy decoding results of the original paper.
The original reference code provides Jasper as part of a research toolkit in TensorFlow [openseq2seq](https://github.com/NVIDIA/OpenSeq2Seq).
This repository provides a simple implementation of Jasper with scripts for training and replicating the Jasper paper results.
This includes data preparation scripts, training and inference scripts.
Both training and inference scripts offer the option to use Automatic Mixed Precision (AMP) to benefit from Tensor Cores for better performance.
In addition to providing the hyperparameters for training a model checkpoint, we publish a thorough inference analysis across different NVIDIA GPU platforms, for example, DGX-1, DGX-2 and T4.
This model is trained with mixed precision using Tensor Cores on NVIDIA Volta GPUs and evaluated on Volta and Turing GPUs. Therefore, researchers can get results 3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
The original paper takes the output of the Jasper acoustic model and shows results for 3 different decoding variations: greedy decoding, beam search with a 6-gram language model and beam search with further rescoring of the best ranked hypotheses with Transformer XL, which is a neural language model. Beam search and the rescoring with the neural language model scores are run on CPU and result in better word error rates compared to greedy decoding.
This repository provides instructions to reproduce greedy decoding results. To run beam search or rescoring with TransformerXL, use the following scripts from the [openseq2seq](https://github.com/NVIDIA/OpenSeq2Seq) repository:
https://github.com/NVIDIA/OpenSeq2Seq/blob/master/scripts/decode.py
https://github.com/NVIDIA/OpenSeq2Seq/tree/master/external_lm_rescore
### Model architecture
Details on the model architecture can be found in the paper [Jasper: An End-to-End Convolutional Neural Acoustic Model](https://arxiv.org/pdf/1904.03288.pdf).
|<img src="images/jasper_model.png" width="100%" height="40%"> | <img src="images/jasper_dense_residual.png" width="100%" height="40%">|
|:---:|:---:|
|Figure 1: Jasper BxR model: B- number of blocks, R- number of sub-blocks | Figure 2: Jasper Dense Residual |
Jasper is an end-to-end neural acoustic model that is based on convolutions.
In the audio processing stage, each frame is transformed into mel-scale spectrogram features, which the acoustic model takes as input and outputs a probability distribution over the vocabulary for each frame.
The acoustic model has a modular block structure and can be parametrized accordingly:
a Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
Each block input is connected directly to the last sub-block of all following blocks via a residual connection, which is referred to as `dense residual` in the paper.
Every block differs in kernel size and number of filters, which increase in size from the bottom to the top layers.
Irrespective of the exact block configuration parameters B and R, every Jasper model has four additional convolutional blocks:
one immediately succeeding the input layer (Prologue) and three at the end of the B blocks (Epilogue).
The Prologue decimates the audio signal
in time so that a shorter time sequence is processed, for efficiency. The Epilogue with dilation captures a bigger context around an audio time step, which decreases the model word error rate (WER).
The paper achieves its best results with Jasper 10x5 with dense residual connections, which is also the focus of this repository and is referred to in the following as Jasper Large.
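As an illustration of the sub-block structure described above, here is a minimal PyTorch sketch using only standard `torch.nn` layers. It is a simplified example, not the implementation in `model.py` (which additionally uses masked convolutions and dense residual connections); the kernel size and dropout values are illustrative.
```
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """One Jasper sub-block: 1D convolution -> batch norm -> ReLU -> dropout."""
    def __init__(self, in_channels, out_channels, kernel_size, dropout=0.2):
        super().__init__()
        # 'same' padding keeps the time dimension unchanged for odd kernel sizes
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # x has shape (batch, channels, time)
        return self.dropout(self.relu(self.bn(self.conv(x))))

# 64 mel features in, 256 filters out (as in the first block of the provided config)
sub_block = JasperSubBlock(in_channels=64, out_channels=256, kernel_size=11)
features = torch.randn(8, 64, 300)   # batch of 8 utterances, 300 frames each
output = sub_block(features)         # -> shape (8, 256, 300)
```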
### Default configuration
The following features were implemented in this model:
* GPU-supported feature extraction with data augmentation options [SpecAugment](https://arxiv.org/abs/1904.08779) and [Cutout](https://arxiv.org/pdf/1708.04552.pdf)
* offline and online [Speed Perturbation](https://www.danielpovey.com/files/2015_interspeech_augmentation.pdf)
* data-parallel multi-GPU training and evaluation
* AMP with dynamic loss scaling for Tensor Core training
* FP16 inference with AMP
Competitive training results and analysis are provided for the following Jasper model configuration:
| **Model** | **Number of Blocks** | **Number of Subblocks** | **Max sequence length** | **Number of Parameters** |
|--- |--- |--- |--- |--- |
| Jasper Large | 10 | 5 | 16.7s | 333M |
### Feature support matrix
The following features are supported by this model.
| **Feature** | **Jasper** |
|--- |--- |
|[Apex AMP](https://nvidia.github.io/apex/amp.html) | Yes |
|[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) | Yes |
#### Features
[Apex AMP](https://nvidia.github.io/apex/amp.html) - a tool that enables Tensor Core-accelerated training. Refer to the [Enabling mixed precision](#enabling-mixed-precision) section for more details.
[Apex
DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) -
a module wrapper that enables easy multiprocess distributed data parallel
training, similar to
[torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel).
`DistributedDataParallel` is optimized for use with
[NCCL](https://github.com/NVIDIA/nccl). It achieves high performance by
overlapping communication with computation during `backward()` and bucketing
smaller gradient transfers to reduce the total number of transfers required.
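A minimal sketch of wrapping a model with Apex `DistributedDataParallel` is shown below. It assumes the script is launched with one process per GPU (for example via `torch.distributed.launch`, which sets the environment variables read by `init_method="env://"`); the model is a stand-in, not the repository's Jasper model.
```
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

# Assumes launch with one process per GPU, e.g. python -m torch.distributed.launch ...
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Conv1d(64, 256, kernel_size=11, padding=5).cuda()  # stand-in for the Jasper model
model = DDP(model)  # gradients are averaged across processes during backward()
```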
### Mixed precision training
*Mixed precision* is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
#### Enabling mixed precision
For training, mixed precision is enabled by passing the `--fp16` flag to the `train.py` training script; remove the flag to train in single precision. In the bash scripts `scripts/train.sh`, `scripts/inference.sh`, etc., the precision can be specified with the variable `PRECISION` by setting it to either `PRECISION=fp16` or `PRECISION=fp32`.
Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
(AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
to half-precision upon retrieval, while storing variables in single-precision
format. Furthermore, to preserve small gradient magnitudes in backpropagation,
a [loss
scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
step must be included when applying gradients. In PyTorch, loss scaling can be
easily applied by using the `scale_loss()` method provided by AMP. The scaling
value to be used can be
[dynamic](https://nvidia.github.io/apex/amp.html#apex.amp.initialize) or fixed.
For an in-depth walkthrough of AMP, see the sample usage
[here](https://nvidia.github.io/apex/amp.html#). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
utility libraries, such as AMP, which require minimal network code changes to
leverage Tensor Core performance.
The following steps were needed to enable mixed precision training in Jasper:
* Import AMP from APEX (file: `train.py`):
```
from apex import amp
```
* Initialize AMP and wrap the model and the optimizer
```
model, optimizer = amp.initialize(
    min_loss_scale=1.0,
    models=model,
    optimizers=optimizer,
    opt_level="O1")
```
* Apply `scale_loss` context manager
```
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```
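Putting the three steps together, a minimal, self-contained training-step sketch could look like the following. The model, optimizer, loss, and data are stand-ins for illustration only; they are not the ones defined in `train.py`.
```
import torch
import torch.nn as nn
from apex import amp

model = nn.Conv1d(64, 256, kernel_size=11, padding=5).cuda()  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.015)     # stand-in optimizer
criterion = nn.MSELoss()                                      # stand-in loss

# Initialize AMP as shown above; "O1" casts operations to FP16 where it is safe to do so.
model, optimizer = amp.initialize(models=model, optimizers=optimizer,
                                  opt_level="O1", min_loss_scale=1.0)

features = torch.randn(8, 64, 300).cuda()
target = torch.randn(8, 256, 300).cuda()

optimizer.zero_grad()
loss = criterion(model(features), target)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # backpropagate on the scaled loss
optimizer.step()
```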
### Glossary
Acoustic model:
Assigns a probability distribution over a vocabulary of characters given an audio frame.
Language Model:
Assigns a probability distribution over a sequence of words. Given a sequence of words, it assigns a probability to the whole sequence.
Pre-training:
Training a model on vast amounts of data on the same (or a different) task to build general understanding.
Automatic Speech Recognition (ASR):
Uses both acoustic model and language model to output the transcript of an input audio signal.
## Setup
The following section lists the requirements in order to start training and evaluating the Jasper model.
### Requirements
This repository contains a `Dockerfile` which extends the PyTorch 19.06-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
Further required Python packages are listed in `requirements.txt` and are automatically installed when the Docker container is built. To install them manually, run:
```bash
pip install -r requirements.txt
```
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html).
## Quick Start Guide
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Jasper model on the LibriSpeech dataset. For details concerning training and inference, see [Advanced](#advanced).
1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
2. Build the Jasper PyTorch container.
Running the following scripts will build and launch the container which contains all the required dependencies for data download and processing as well as training and inference of the model.
```bash
bash scripts/docker/build.sh
```
3. Start an interactive session in the NGC container to run data download/training/inference.
```bash
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
```
Within the container, the contents of this repository will be copied to the `/workspace/jasper` directory. The `/datasets`, `/checkpoints`, `/results` directories are mounted as volumes
and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host.
4. Download and preprocess the dataset.
No GPU is required for data download and preprocessing. Therefore, if GPU resources are limited, you can launch the container for this step on a CPU-only machine by following Steps 2 and 3.
Note: Downloading and preprocessing the dataset requires 500GB of free disk space and can take several hours to complete.
This repository provides scripts to download and extract the following datasets:
* LibriSpeech [http://www.openslr.org/12](http://www.openslr.org/12)
LibriSpeech contains 1000 hours of 16kHz read English speech derived from public domain audiobooks from the LibriVox project and has been carefully segmented and aligned. For more information, see the [LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS](http://www.danielpovey.com/files/2015_icassp_librispeech.pdf) paper.
Inside the container, download and extract the datasets into the required format for later training and inference:
```bash
bash scripts/download_librispeech.sh
```
Once the data download is complete, the following folders should exist:
* `/datasets/LibriSpeech/`
* `train-clean-100/`
* `train-clean-360/`
* `train-other-500/`
* `dev-clean/`
* `dev-other/`
* `test-clean/`
* `test-other/`
Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see Step 3), once the dataset is downloaded it is accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.
Next, convert the data into WAV files and add speed perturbation with rates 0.9 and 1.1 to the training files:
```bash
bash scripts/preprocess_librispeech.sh
```
Once the data is converted, the following additional files and folders should exist:
* `datasets/LibriSpeech/`
* `librispeech-train-clean-100-wav.json`
* `librispeech-train-clean-360-wav.json`
* `librispeech-train-other-500-wav.json`
* `librispeech-dev-clean-wav.json`
* `librispeech-dev-other-wav.json`
* `librispeech-test-clean-wav.json`
* `librispeech-test-other-wav.json`
* `train-clean-100-wav/` contains WAV files with original speed, 0.9 and 1.1
* `train-clean-360-wav/` contains WAV files with original speed, 0.9 and 1.1
* `train-other-500-wav/` contains WAV files with original speed, 0.9 and 1.1
* `dev-clean-wav/`
* `dev-other-wav/`
* `test-clean-wav/`
* `test-other-wav/`
5. Start training.
Inside the container, use the following script to start training.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/train.sh [OPTIONS]
```
By default, this will use automatic mixed precision, a batch size of 64, and run on a total of 8 GPUs. The hyperparameters are tuned for DGX-1 32GB 8x V100 GPUs and will require adjustment for 16GB GPUs (e.g. by using more gradient accumulation steps).
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Training process](#training-process).
6. Start validation/evaluation.
Inside the container, use the following script to run evaluation.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/evaluation.sh [OPTIONS]
```
By default, this will use full precision, a batch size of 64 and run on a single GPU.
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Evaluation process](#evaluation-process).
7. Start inference/predictions.
Inside the container, use the following script to run inference.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/inference.sh [OPTIONS]
```
By default, this will use full precision, a batch size of 64 and run on a single GPU.
More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Inference process](#inference-process).
## Advanced
The following sections provide greater details of the dataset, running training and inference, and getting training and inference results.
### Scripts and sample code
In the `root` directory, the most important files are:
* `train.py` - Serves as entry point for training
* `inference.py` - Serves as entry point for inference and evaluation
* `model.py` - Contains the model architecture
* `dataset.py` - Contains the data loader and related functionality
* `optimizer.py` - Contains the optimizer
* `inference_benchmark.py` - Serves as an inference benchmarking script that measures the latency of pre-processing and the acoustic model
* `requirements.txt` - Contains the required dependencies that are installed when building the Docker container
* `Dockerfile` - Container with the basic set of dependencies to run Jasper
The `scripts/` folder encapsulates all the one-click scripts required for running various supported functionalities, such as:
* `train.sh` - Runs training using the `train.py` script
* `inference.sh` - Runs inference using the `inference.py` script
* `evaluation.sh` - Runs evaluation using the `inference.py` script
* `download_librispeech.sh` - Downloads LibriSpeech dataset
* `preprocess_librispeech.sh` - Preprocesses LibriSpeech raw data files for training and inference
* `inference_benchmark.sh` - Runs the inference benchmark using the `inference_benchmark.py` script
* `train_benchmark.sh` - Runs the training performance benchmark using the `train.py` script
* `docker/` - Contains the scripts for building and launching the container
Other folders included in the `root` directory are:
* `configs/` - Model configurations
* `utils/` - Contains the necessary files for data download and processing
* `parts/` - Contains the necessary files for data pre-processing
### Parameters
The complete list of available parameters for `scripts/train.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: relative path to model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results, logs, and created checkpoints. (default: '/results')
CHECKPOINT: model checkpoint to continue training from. A model checkpoint is a dictionary object that contains, apart from the model weights, the optimizer state and the epoch number. If CHECKPOINT is "none", training starts from scratch. (default: "none")
CREATE_LOGFILE: boolean that indicates whether to create a training log that will be stored in `$RESULT_DIR`. (default: "true")
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'true')
NUM_GPUS: number of GPUs to use. (default: 8)
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp16')
EPOCHS: number of training epochs. (default: 400)
SEED: seed for random number generator and used for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size. (default: 64)
LEARNING_RATE: Initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 1)
LAUNCH_OPT: additional launch options. (default: "none")
```
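For example, a hedged invocation is sketched below. It assumes `scripts/train.sh` accepts these parameters positionally in the order listed above (the same convention used for `scripts/train_benchmark.sh` and `scripts/inference_benchmark.sh` later in this document); check the script itself if your copy differs.
```bash
# Assumed positional order: DATA_DIR MODEL_CONFIG RESULT_DIR CHECKPOINT CREATE_LOGFILE
#                           CUDNN_BENCHMARK NUM_GPUS PRECISION ...
# Train on 4 GPUs in fp32, keeping the remaining defaults:
bash scripts/train.sh /datasets/LibriSpeech \
    configs/jasper10x5dr_sp_offline_specaugment.toml \
    /results none true true 4 fp32
```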
The complete list of available parameters for `scripts/inference.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: "true")
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'false')
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp32')
NUM_STEPS: number of inference steps. If -1 runs inference on entire dataset. (default: -1)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size. (default: 64)
MODELOUTPUT_FILE: destination path for serialized model output with binary protocol. If 'none' does not save model output. (default: 'none')
PREDICTION_FILE: destination path for saving predictions. If 'none' does not save predictions. (default: '${RESULT_DIR}/${DATASET}.predictions')
```
The complete list of available parameters for `scripts/evaluation.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: 'true')
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'false')
NUM_GPUS: number of GPUs to run evaluation on (default: 1)
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp32')
NUM_STEPS: number of inference steps per GPU. If -1 runs inference on entire dataset. (default: -1)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size. (default: 64)
```
The `scripts/inference_benchmark.sh` script pads all input to the same length and computes the mean and the 90%, 95% and 99% percentiles of latency for the specified number of inference steps. Latency is measured in milliseconds per batch. The `scripts/inference_benchmark.sh` script
measures latency for a single GPU and extends `scripts/inference.sh` by:
```bash
MAX_DURATION: filters out input audio data that exceeds a maximum duration in seconds. This ensures that, when all remaining audio samples are padded to the maximum length, that length stays under this threshold. (default: 36)
```
The `scripts/train_benchmark.sh` script pads all input to the same length according to the input argument `MAX_DURATION` and measures average training latency and throughput performance. Latency is measured in seconds per batch, throughput in sequences per second.
The complete list of available parameters for `scripts/train_benchmark.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: 'true')
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'true')
NUM_GPUS: number of GPUs to use. (default: 8)
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp16')
NUM_STEPS: number of training iterations. If -1 runs full training for 400 epochs. (default: -1)
MAX_DURATION: filters out input audio data that exceeds a maximum duration in seconds. This ensures that, when all remaining audio samples are padded to the maximum length, that length stays under this threshold. (default: 16.7)
SEED: seed for random number generator and useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size. (default: 64)
LEARNING_RATE: Initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until optimizer updates weights. (default: 1)
PRINT_FREQUENCY: number of iterations after which training progress is printed. (default: 1)
```
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option with the Python file, for example:
```bash
python train.py --help
python inference.py --help
```
### Getting the data
The Jasper model was trained on the LibriSpeech dataset. We use the concatenation of `train-clean-100`, `train-clean-360` and `train-other-500` for training and `dev-clean` for validation.
This repository contains the `scripts/download_librispeech.sh` and `scripts/preprocess_librispeech.sh` scripts, which automatically download and preprocess the training, test and development datasets. By default, data is downloaded to the `/datasets/LibriSpeech` directory. A minimum of 500GB of free space is required for download and preprocessing; the final preprocessed dataset is 320GB.
#### Dataset guidelines
The `scripts/preprocess_librispeech.sh` script converts the input audio files to WAV format with a sample rate of 16kHz; target transcripts are stripped of surrounding whitespace and lower-cased. For `train-clean-100`, `train-clean-360` and `train-other-500` it also creates speed-perturbed versions with rates of 0.9 and 1.1 for data augmentation.
After preprocessing, the script creates JSON files with output file paths, sample rate, target transcript and other metadata. These JSON files are used by the training script to identify training and validation datasets.
The Jasper model was tuned for audio signals with a sample rate of 16kHz. If you wish to use a different sampling rate, some hyperparameters might need to be changed, specifically the window size and step size.
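As a small worked example of that relationship, the sketch below uses the values from the model configuration included in this repository (`sample_rate = 16000`, `window_size = 0.02`, `window_stride = 0.01`, `n_fft = 512`); the variable names are illustrative.
```
# How the window size and step size (in seconds) translate into samples at 16 kHz.
sample_rate = 16000      # Hz
window_size = 0.02       # seconds per analysis window
window_stride = 0.01     # seconds between successive windows
n_fft = 512              # FFT length used for the spectrogram

win_length = int(sample_rate * window_size)    # 320 samples
hop_length = int(sample_rate * window_stride)  # 160 samples
assert win_length <= n_fft                     # the FFT must cover the window

# At a different sample rate (e.g. 8 kHz) the same window_size gives only 160
# samples per window, so window_size/window_stride (and possibly n_fft) would
# need to be revisited, as noted above.
```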
### Training process
Training is performed using the `train.py` script along with parameters defined in `scripts/train.sh`.
The `scripts/train.sh` script runs a job on a single node that trains the Jasper model from scratch using LibriSpeech as training data. To make training more efficient, we discard audio samples longer than 16.7 seconds from the training dataset; these account for less than 1% of all samples. Such filtering does not degrade accuracy, but it allows us to decrease the number of time steps in a batch, which requires less GPU memory and increases training speed.
Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the training script:
* Runs on 8 32GB V100 GPUs with training and evaluation batch size 64
* Uses FP16 precision with AMP optimization level O1 (default)
* Enables cudnn benchmark to make mixed precision training faster
* Trains on the concatenation of all 3 LibriSpeech training datasets and evaluates on the LibriSpeech dev-clean dataset
* Uses a seed of 6
* Runs for 400 epochs
* Uses an initial learning rate of 0.015 and polynomial (quadratic) learning rate decay
* Saves a checkpoint every 10 epochs
* Runs evaluation on the development dataset every 100 iterations and at the end of training
* Prints out training progress every 25 iterations
* Creates a log file with training progress
* Uses offline speed perturbed data
* Uses SpecAugment in data pre-processing
* Filters out audio samples longer than 16.7 seconds
* Pads each sequence in a batch to the same length (the smallest multiple of 16 that is at least the length of the longest sequence in the batch); see the short sketch after this list
* Uses masked convolutions and dense residuals as described in the paper
* Uses weight decay of 0.001
* Uses 1 gradient accumulation step
* Uses [Novograd](https://arxiv.org/pdf/1905.11286.pdf) as optimizer with betas=(0.95, 0.98)
These parameters will match the greedy WER [Results](#results) of the Jasper paper on a DGX-1 with 32GB V100 GPUs.
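The padding rule referenced in the list above amounts to the following short sketch:
```
def padded_length(longest_in_batch, multiple=16):
    """Smallest multiple of `multiple` that is at least `longest_in_batch`."""
    return ((longest_in_batch + multiple - 1) // multiple) * multiple

# A batch whose longest sequence has 1007 frames is padded to 1008 frames,
# while one whose longest sequence has 1008 frames stays at 1008.
assert padded_length(1007) == 1008
assert padded_length(1008) == 1008
```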
### Inference process
Inference is performed using the `inference.py` script along with parameters defined in `scripts/inference.sh`.
The `scripts/inference.sh` script runs the job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the inference script:
* Evaluates on the LibriSpeech dev-clean dataset
* Uses full precision
* Uses a batch size of 64
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which will be stored in the results folder
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Does not use data augmentation
* Does greedy decoding and saves the transcription in the results folder
* Has the option to save the model output tensors for more complex decoding, for example, beam search
* Has cudnn benchmark disabled
### Evaluation process
Evaluation is performed using the `inference.py` script along with parameters defined in `scripts/evaluation.sh`.
The `scripts/evaluation.sh` script runs a job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the evaluation script:
* Uses a batch size of 64
* Evaluates the LibriSpeech dev-clean dataset
* Uses full precision
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which is saved in the results folder
* Pads each sequence in a batch to the same length (smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Does not use data augmentation
* Has cudnn benchmark disabled
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
#### Training performance benchmark
To benchmark the training performance on a specific batch size and audio length, run:
```bash
bash scripts/train_benchmark.sh <DATA_DIR> <MODEL_CONFIG> <RESULT_DIR> <CREATE_LOGFILE> <CUDNN_BENCHMARK> <NUM_GPUS> <PRECISION> <NUM_STEPS> <MAX_DURATION> <SEED> <BATCH_SIZE>
<LEARNING_RATE> <GRADIENT_ACCUMULATION_STEPS> <PRINT_FREQUENCY>
```
By default, this script runs 400 epochs on the configuration `configs/jasper10x5dr_sp_offline_specaugment.toml` using full precision
and batch size 64 on a single node with 8x 32GB V100 GPUs.
By default, `NUM_STEPS=-1` means training runs for 400 epochs. If `$NUM_STEPS > 0` is specified, training is only run for a user-defined number of iterations. Audio samples longer than `MAX_DURATION` are filtered out; the remaining ones are padded to this duration so that all batches have the same length. At the end of training, the script saves the model checkpoint to the results folder, runs evaluation on the LibriSpeech dev-clean dataset, and prints out information such as average training latency in seconds, average training throughput in sequences per second, final training loss, final training WER, evaluation loss and evaluation WER.
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size and audio length, run:
```bash
bash scripts/inference_benchmark.sh <DATA_DIR> <DATASET> <MODEL_CONFIG> <RESULT_DIR> <CHECKPOINT> <CREATE_LOGFILE> <CUDNN_BENCHMARK> <PRECISION> <NUM_GPUS> <MAX_DURATION>
<SEED> <BATCH_SIZE>
```
By default, the script runs on a single GPU and evaluates on the entire dataset using the model configuration `configs/jasper10x5dr_sp_offline_specaugment.toml`, full precision, cudnn benchmark for faster fp16 inference and batch size 64.
By default, `MAX_DURATION` is set to 36 seconds, which covers the maximum audio length. All audio samples are padded to this length. The script prints out `MAX_DURATION`, `BATCH_SIZE` and latency performance in milliseconds per batch.
### Results
The following sections provide details on how we achieved our performance and accuracy in training and inference.
All results are obtained by training on 960 hours of LibriSpeech with a maximum audio length of 16.7s. The trained models are evaluated
on LibriSpeech dev-clean, dev-other, test-clean and test-other.
The results for Jasper Large's word error rate from the original paper after greedy decoding are shown below:
| **Number of GPUs** | **dev-clean WER** | **dev-other WER**| **test-clean WER**| **test-other WER**
|--- |--- |--- |--- |--- |
|8 | 3.64| 11.89| 3.86 | 11.95
#### Training accuracy results
##### Training accuracy: NVIDIA DGX-1 (8x V100 32G)
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.06-py3 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs.
The following tables report the word error rate (WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
FP16 (seed #6)
| **Number of GPUs** | **Batch size per GPU** | **dev-clean WER** | **dev-other WER**| **test-clean WER**| **test-other WER**| **Total time to train with FP16 (Hrs)** |
|--- |--- |--- |--- |--- |--- |--- |
|8 |64| 3.51|11.14|3.74|11.06|100
FP32 training matches the results of mixed precision training and takes approximately 330 hours.
##### Training stability test
The following table compares greedy decoding word error rates across 8 different training runs with different seeds for mixed precision training.
| **FP16, 8x GPUs** | **seed #1** | **seed #2** | **seed #3** | **seed #4** | **seed #5** | **seed #6** | **seed #7** | **seed #8** | **mean** | **std** |
|:-----------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
|dev-clean|3.74|3.75|3.77|3.68|3.75|3.51|3.71|3.58|3.69|0.09
|dev-other|11.56|11.62|11.5|11.36|11.62|11.14|11.8|11.3|11.49|0.21
|test-clean|3.9|3.95|3.88|3.79|3.95|3.74|4.03|3.85|3.89|0.09
|test-other|11.47|11.54|11.51|11.29|11.54|11.06|11.68|11.29|11.42|0.20
#### Training performance results
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.06-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 16 | 10| 29.63| 2.96| 1.00| 1.00|
| 4 | 16 | 38.79| 106.67| 2.75| 3.88| 3.60|
| 8 | 16 | 76.64| 209.84| 2.74| 7.66| 7.08|
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 32 | - | 35.16 | - | - | 1.00 |
| 4 | 32 | - | 134.74 | - | - | 3.83 |
| 8 | 32 | - | 263.92 | - | - | 7.51 |
Note: The respective values for FP32 runs that use a batch size of 32 are not available due to out of memory errors that arise. Batch size of 32 is only available when using FP16.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 32G)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 32 | 12.26| 34.04| 2.78| 1.00| 1.00|
| 4 | 32 | 48.67| 131.96| 2.71| 3.97| 3.88|
| 8 | 32 | 95.88| 253.47| 2.64| 7.82| 7.45|
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 64 | - | 41.03 | - | - | 1.00 |
| 4 | 64 | - | 159.01 | - | - | 3.88 |
| 8 | 64 | - | 312.20 | - | - | 7.61 |
Note: The respective values for FP32 runs that use a batch size of 64 are not available due to out of memory errors that arise. Batch size of 64 is only available when using FP16.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-2 (16x V100 32G)
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 32 | 8.12| 24.24| 2.98| 1.00| 1.00|
| 4 | 32 | 32.16| 92.09| 2.86| 3.96| 3.80|
| 8 | 32 | 63.68| 181.56| 2.85| 7.84| 7.49|
|16 | 32 | 124.88| 275.67| 2.20| 15.38| 11.35|
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 64 | - | 29.22 | - | - | 1.00 |
| 4 | 64 | - | 114.29 | - | - | 3.91 |
| 8 | 64 | - | 222.61 | - | - | 7.62 |
|16 | 64 | - | 414.57 | - | - | 14.19 |
Note: The respective values for FP32 runs that use a batch size of 64 are not available due to out of memory errors that arise. Batch size of 64 is only available when using FP16.
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
#### Inference performance results
Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 19.06-py3 NGC container on NVIDIA DGX-1, DGX-2 and T4 on a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 1000 iterations.
##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
| | |FP16 Latency (ms) Percentiles | | | |FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|BS |Sequence Length (in seconds) |90% |95% |99% |Avg |90% |95% |99% |Avg |Avg |
|1|2|62.16|64.71|67.29|61.31|69.37|69.75|75.38|68.95|1.12
|2|2|60.94|63.60|68.03|59.57|82.18|83.12|84.26|75.33|1.26
|4|2|68.38|69.55|75.85|64.82|85.74|86.85|93.78|82.55|1.27
|8|2|68.80|71.54|73.28|62.83|104.22|106.58|109.41|95.77|1.52
|16|2|72.33|72.85|74.55|64.69|127.11|129.34|131.46|109.80|1.70
|1|7|59.06|60.51|62.83|58.10|75.41|75.72|78.64|74.70|1.29
|2|7|61.68|67.73|68.58|59.53|97.85|98.59|98.99|91.60|1.54
|4|7|60.88|62.13|65.23|60.38|119.08|119.80|121.28|118.67|1.97
|8|7|70.71|71.82|74.23|70.16|181.48|185.00|186.20|177.98|2.54
|16|7|93.75|94.70|100.58|92.96|219.72|220.25|221.28|215.09|2.31
|1|16.7|68.87|69.48|71.75|63.63|101.03|101.66|104.00|100.32|1.58
|2|16.7|73.00|73.76|75.58|66.44|145.64|146.64|152.41|143.69|2.16
|4|16.7|77.71|78.75|79.90|77.34|224.62|225.43|226.43|223.96|2.90
|8|16.7|96.34|97.07|104.46|95.94|318.52|319.13|320.74|316.14|3.30
|16|16.7|154.63|156.81|159.25|151.05|375.67|377.00|381.79|371.83|2.46
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-1 (1x V100 32G)
| | |FP16 Latency (ms) Percentiles | | | |FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|BS |Sequence Length (in seconds) |90% |95% |99% |Avg |90% |95% |99% |Avg |Avg |
|1|2|61.60|62.81|69.62|60.71|82.32|83.03|85.72|77.48|1.28
|2|2|68.82|70.10|72.08|61.91|77.99|81.99|85.13|76.93|1.24
|4|2|70.06|70.69|72.58|74.76|88.36|89.67|95.61|94.50|1.26
|8|2|69.98|71.51|74.20|64.20|105.82|107.16|110.04|98.02|1.53
|16|2|72.05|74.16|75.51|65.46|130.49|130.97|132.83|112.74|1.72
|1|7|61.40|61.78|65.53|60.93|75.72|75.83|76.55|75.35|1.24
|2|7|60.50|60.63|61.77|60.15|91.05|91.16|92.39|90.75|1.51
|4|7|64.67|71.41|72.10|64.19|123.77|123.99|124.92|123.38|1.92
|8|7|67.96|68.04|69.38|67.60|176.43|176.65|177.25|175.39|2.59
|16|7|95.41|95.80|100.94|93.86|213.04|213.38|215.52|212.05|2.26
|1|16.7|61.28|61.67|62.52|60.63|104.37|104.56|105.22|103.83|1.71
|2|16.7|66.88|67.31|68.09|66.40|151.08|151.61|152.26|146.73|2.21
|4|16.7|80.51|80.79|81.95|80.12|226.75|227.07|228.76|225.82|2.82
|8|16.7|95.66|95.89|98.86|95.62|314.74|316.74|318.66|312.10|3.26
|16|16.7|156.60|157.07|160.15|151.13|366.70|367.41|370.98|364.05|2.41
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA DGX-2 (1x V100 32G)
| | |FP16 Latency (ms) Percentiles | | | |FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|BS |Sequence Length (in seconds) |90% |95% |99% |Avg |90% |95% |99% |Avg |Avg |
|1|2|56.11|56.76|62.18|51.77|67.75|68.91|73.80|64.96|1.25
|2|2|55.56|56.96|61.72|50.63|65.84|69.88|74.05|63.57|1.26
|4|2|54.84|57.69|61.16|60.74|74.00|76.58|81.62|81.01|1.33
|8|2|57.15|57.92|60.80|52.47|90.56|91.83|93.79|84.58|1.61
|16|2|58.27|58.54|60.24|53.26|113.25|113.55|115.41|98.56|1.85
|1|7|49.16|49.39|50.82|48.31|64.53|64.84|65.79|63.90|1.32
|2|7|53.54|54.07|55.28|49.11|78.64|79.46|81.25|78.17|1.59
|4|7|50.87|51.15|53.36|50.07|109.33|110.61|114.00|108.17|2.16
|8|7|63.57|64.18|65.55|60.64|163.95|164.19|165.75|163.49|2.70
|16|7|82.15|83.66|87.01|81.46|196.15|197.18|202.09|195.36|2.40
|1|16.7|49.68|50.00|51.39|48.76|89.10|89.42|90.41|88.57|1.82
|2|16.7|52.47|52.91|54.27|51.51|128.58|129.09|130.34|127.36|2.47
|4|16.7|66.60|67.52|68.88|65.88|220.50|221.50|223.14|219.42|3.33
|8|16.7|85.42|86.03|88.37|85.11|293.80|294.39|296.21|290.58|3.41
|16|16.7|140.76|141.74|147.25|137.31|345.26|346.29|351.15|342.64|2.50
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Inference performance: NVIDIA T4
| | |FP16 Latency (ms) Percentiles | | | |FP32 Latency (ms) Percentiles | | | | FP16/FP32 speed up|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|BS |Sequence Length (in seconds) |90% |95% |99% |Avg |90% |95% |99% |Avg |Avg |
|1|2|57.30|57.50|74.62|56.74|73.71|73.98|88.79|72.95|1.29
|2|2|53.68|69.69|76.08|52.63|82.83|93.38|97.67|78.23|1.49
|4|2|72.26|76.49|83.92|57.60|116.06|121.25|125.98|104.17|1.81
|8|2|70.52|71.85|76.26|58.16|159.92|161.22|164.76|148.34|2.55
|16|2|78.29|79.04|82.86|66.97|251.96|252.67|253.64|206.41|3.08
|1|7|54.83|54.94|55.50|54.58|85.57|89.11|89.71|84.08|1.54
|2|7|55.17|55.38|67.09|54.87|134.28|135.76|138.23|131.01|2.39
|4|7|74.24|78.09|79.51|73.75|214.77|215.65|217.28|211.66|2.87
|8|7|99.99|100.34|104.26|98.84|379.67|380.96|382.70|375.12|3.80
|16|7|167.48|168.07|177.29|166.53|623.36|624.11|625.89|619.34|3.72
|1|16.7|72.23|72.65|80.13|67.77|155.76|157.11|160.05|151.85|2.24
|2|16.7|75.43|76.04|80.41|74.65|259.56|261.23|266.09|252.80|3.39
|4|16.7|131.71|132.45|134.92|129.63|481.40|484.17|486.88|469.05|3.62
|8|16.7|197.10|197.94|200.15|193.88|806.76|812.73|822.27|792.85|4.09
|16|16.7|364.22|365.22|372.17|358.62|1165.78|1167.11|1171.02|1150.44|3.21
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
July 2019
* Initial release
### Known issues
There are no known issues in this release.

View file

@ -0,0 +1,194 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -0,0 +1,203 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = false
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -0,0 +1,204 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -0,0 +1,204 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
model = "Jasper"
[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true
cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25
cutout_x_regions = 2
cutout_y_regions = 2
cutout_x_width = 6
cutout_y_width = 6
[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
[encoder]
activation = "relu"
convmask = true
[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true
[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false
[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false
[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

View file

@ -0,0 +1,269 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file contains classes and functions related to data loading
"""
import torch
import numpy as np
import math
from torch.utils.data import Dataset, Sampler
import torch.distributed as dist
from parts.manifest import Manifest
from parts.features import WaveformFeaturizer
class DistributedBucketBatchSampler(Sampler):
def __init__(self, dataset, batch_size, num_replicas=None, rank=None):
"""Distributed sampler that buckets samples with similar length to minimize padding,
similar concept as pytorch BucketBatchSampler https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.samplers.html#torchnlp.samplers.BucketBatchSampler
Args:
dataset: Dataset used for sampling.
batch_size: data batch size
num_replicas (optional): Number of processes participating in
distributed training.
rank (optional): Rank of the current process within num_replicas.
"""
if num_replicas is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
num_replicas = dist.get_world_size()
if rank is None:
if not dist.is_available():
raise RuntimeError("Requires distributed package to be available")
rank = dist.get_rank()
self.dataset = dataset
self.dataset_size = len(dataset)
self.num_replicas = num_replicas
self.rank = rank
self.epoch = 0
self.batch_size = batch_size
self.tile_size = batch_size * self.num_replicas
self.num_buckets = 6
self.bucket_size = self.round_up_to(math.ceil(self.dataset_size / self.num_buckets), self.tile_size)
self.index_count = self.round_up_to(self.dataset_size, self.tile_size)
self.num_samples = self.index_count // self.num_replicas
def round_up_to(self, x, mod):
return (x + mod - 1) // mod * mod
def __iter__(self):
g = torch.Generator()
g.manual_seed(self.epoch)
indices = np.arange(self.index_count) % self.dataset_size
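        # Shuffle indices independently within each bucket. When this sampler is used the dataset is
        # sorted by duration, so each bucket holds samples of similar length and per-batch padding stays small.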
for bucket in range(self.num_buckets):
bucket_start = self.bucket_size * bucket
bucket_end = min(bucket_start + self.bucket_size, self.index_count)
indices[bucket_start:bucket_end] = indices[bucket_start:bucket_end][torch.randperm(bucket_end - bucket_start, generator=g)]
tile_indices = torch.randperm(self.index_count // self.tile_size, generator=g)
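        # Shuffle the order of tiles (each tile is one global batch covering all ranks); every rank then
        # reads its own batch_size-sized slice from each tile.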
for tile_index in tile_indices:
start_index = self.tile_size * tile_index + self.batch_size * self.rank
end_index = start_index + self.batch_size
yield indices[start_index:end_index]
def __len__(self):
return self.num_samples
def set_epoch(self, epoch):
self.epoch = epoch
class data_prefetcher():
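    # Prefetches the next batch on a dedicated CUDA stream so host-to-device copies overlap with compute
    # on the default stream; __next__ waits on that stream before handing the batch over.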
def __init__(self, loader):
self.loader = iter(loader)
self.stream = torch.cuda.Stream()
self.preload()
def preload(self):
try:
self.next_input = next(self.loader)
except StopIteration:
self.next_input = None
return
with torch.cuda.stream(self.stream):
self.next_input = [ x.cuda(non_blocking=True) for x in self.next_input]
def __next__(self):
torch.cuda.current_stream().wait_stream(self.stream)
input = self.next_input
self.preload()
return input
def next(self):
return self.__next__()
def __iter__(self):
return self
def seq_collate_fn(batch):
"""batches samples and returns as tensors
Args:
batch : list of samples
Returns
batches of tensors
"""
batch_size = len(batch)
def _find_max_len(lst, ind):
max_len = -1
for item in lst:
if item[ind].size(0) > max_len:
max_len = item[ind].size(0)
return max_len
max_audio_len = _find_max_len(batch, 0)
max_transcript_len = _find_max_len(batch, 2)
batched_audio_signal = torch.zeros(batch_size, max_audio_len)
batched_transcript = torch.zeros(batch_size, max_transcript_len)
audio_lengths = []
transcript_lengths = []
for ind, sample in enumerate(batch):
batched_audio_signal[ind].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
audio_lengths.append(sample[1])
batched_transcript[ind].narrow(0, 0, sample[2].size(0)).copy_(sample[2])
transcript_lengths.append(sample[3])
return batched_audio_signal, torch.stack(audio_lengths), batched_transcript, \
torch.stack(transcript_lengths)
class AudioToTextDataLayer:
"""Data layer with data loader
"""
def __init__(self, **kwargs):
self._device = torch.device("cuda")
featurizer_config = kwargs['featurizer_config']
pad_to_max = kwargs.get('pad_to_max', False)
perturb_config = kwargs.get('perturb_config', None)
manifest_filepath = kwargs['manifest_filepath']
dataset_dir = kwargs['dataset_dir']
labels = kwargs['labels']
batch_size = kwargs['batch_size']
drop_last = kwargs.get('drop_last', False)
shuffle = kwargs.get('shuffle', True)
min_duration = featurizer_config.get('min_duration', 0.1)
max_duration = featurizer_config.get('max_duration', None)
normalize_transcripts = kwargs.get('normalize_transcripts', True)
trim_silence = kwargs.get('trim_silence', False)
multi_gpu = kwargs.get('multi_gpu', False)
sampler_type = kwargs.get('sampler', 'default')
speed_perturbation = featurizer_config.get('speed_perturbation', False)
        sort_by_duration = (sampler_type == 'bucket')
self._featurizer = WaveformFeaturizer.from_config(featurizer_config, perturbation_configs=perturb_config)
self._dataset = AudioDataset(
dataset_dir=dataset_dir,
manifest_filepath=manifest_filepath,
labels=labels, blank_index=len(labels),
sort_by_duration=sort_by_duration,
pad_to_max=pad_to_max,
featurizer=self._featurizer, max_duration=max_duration,
min_duration=min_duration, normalize=normalize_transcripts,
trim=trim_silence, speed_perturbation=speed_perturbation)
print('sort_by_duration', sort_by_duration)
if not multi_gpu:
self.sampler = None
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
batch_size=batch_size,
collate_fn=lambda b: seq_collate_fn(b),
drop_last=drop_last,
shuffle=shuffle if self.sampler is None else False,
num_workers=4,
pin_memory=True,
sampler=self.sampler
)
elif sampler_type == 'bucket':
self.sampler = DistributedBucketBatchSampler(self._dataset, batch_size=batch_size)
print("DDBucketSampler")
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
collate_fn=lambda b: seq_collate_fn(b),
num_workers=4,
pin_memory=True,
batch_sampler=self.sampler
)
elif sampler_type == 'default':
self.sampler = torch.utils.data.distributed.DistributedSampler(self._dataset)
print("DDSampler")
self._dataloader = torch.utils.data.DataLoader(
dataset=self._dataset,
batch_size=batch_size,
collate_fn=lambda b: seq_collate_fn(b),
drop_last=drop_last,
shuffle=shuffle if self.sampler is None else False,
num_workers=4,
pin_memory=True,
sampler=self.sampler
)
else:
raise RuntimeError("Sampler {} not supported".format(sampler_type))
def __len__(self):
return len(self._dataset)
@property
def data_iterator(self):
return self._dataloader
class AudioDataset(Dataset):
def __init__(self, dataset_dir, manifest_filepath, labels, featurizer, max_duration=None, pad_to_max=False,
min_duration=None, blank_index=0, max_utts=0, normalize=True, sort_by_duration=False,
trim=False, speed_perturbation=False):
"""Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations
(in seconds). Each entry is a different audio sample.
Args:
dataset_dir: absolute path to dataset folder
            manifest_filepath: relative path from dataset folder to manifest json as described above. Can be comma-separated paths.
labels: String containing all the possible characters to map to
featurizer: Initialized featurizer class that converts paths of audio to feature tensors
max_duration: If audio exceeds this length, do not include in dataset
min_duration: If audio is less than this length, do not include in dataset
            pad_to_max: if True, input sequences fed to the model are padded to max_duration
blank_index: blank index for ctc loss / decoder
max_utts: Limit number of utterances
normalize: whether to normalize transcript text
sort_by_duration: whether or not to sort sequences by increasing duration
trim: if specified trims leading and trailing silence from an audio signal.
            speed_perturbation: set to True if the dataset uses speed perturbation
"""
m_paths = manifest_filepath.split(',')
self.manifest = Manifest(dataset_dir, m_paths, labels, blank_index, pad_to_max=pad_to_max,
max_duration=max_duration,
sort_by_duration=sort_by_duration,
min_duration=min_duration, max_utts=max_utts,
normalize=normalize, speed_perturbation=speed_perturbation)
self.featurizer = featurizer
self.blank_index = blank_index
self.trim = trim
print(
"Dataset loaded with {0:.2f} hours. Filtered {1:.2f} hours.".format(
self.manifest.duration / 3600,
self.manifest.filtered_duration / 3600))
def __getitem__(self, index):
sample = self.manifest[index]
rn_indx = np.random.randint(len(sample['audio_filepath']))
duration = sample['audio_duration'][rn_indx] if 'audio_duration' in sample else 0
offset = sample['offset'] if 'offset' in sample else 0
features = self.featurizer.process(sample['audio_filepath'][rn_indx],
offset=offset, duration=duration,
trim=self.trim)
return features, torch.tensor(features.shape[0]).int(), \
torch.tensor(sample["transcript"]), torch.tensor(
len(sample["transcript"])).int()
def __len__(self):
return len(self.manifest)

View file

@ -0,0 +1,223 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from enum import Enum
from metrics import word_error_rate
class Optimization(Enum):
"""Various levels of Optimization.
    WARNING: This might affect model accuracy."""
nothing = 0
mxprO0 = 1
mxprO1 = 2
mxprO2 = 3
mxprO3 = 4
AmpOptimizations = {Optimization.mxprO0: "O0",
Optimization.mxprO1: "O1",
Optimization.mxprO2: "O2",
Optimization.mxprO3: "O3"}
def print_once(msg):
if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
print(msg)
def add_ctc_labels(labels):
if not isinstance(labels, list):
raise ValueError("labels must be a list of symbols")
labels.append("<BLANK>")
return labels
def __ctc_decoder_predictions_tensor(tensor, labels):
"""
    Takes the output of the greedy CTC decoder and applies the CTC collapse rule to
    remove repeated symbols and blanks. Returns the decoded predictions.
Args:
tensor: model output tensor
        labels: A list of labels
Returns:
prediction
"""
blank_id = len(labels) - 1
hypotheses = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
prediction_cpu_tensor = tensor.long().cpu()
# iterate over batch
for ind in range(prediction_cpu_tensor.shape[0]):
prediction = prediction_cpu_tensor[ind].numpy().tolist()
# CTC decoding procedure
decoded_prediction = []
previous = len(labels) - 1 # id of a blank symbol
for p in prediction:
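            # Standard CTC collapse: drop blanks and merge consecutive repeats. A symbol repeated after a
            # blank is emitted again, because `previous` tracks the raw (uncollapsed) prediction stream.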
if (p != previous or previous == blank_id) and p != blank_id:
decoded_prediction.append(p)
previous = p
hypothesis = ''.join([labels_map[c] for c in decoded_prediction])
hypotheses.append(hypothesis)
return hypotheses
def monitor_asr_train_progress(tensors: list, labels: list):
"""
    Takes the output of the greedy CTC decoder and applies the CTC collapse rule to
    remove repeated symbols and blanks. Prints the WER and example predictions to the screen.
Args:
tensors: A list of 3 tensors (predictions, targets, target_lengths)
labels: A list of labels
Returns:
word error rate
"""
references = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
with torch.no_grad():
targets_cpu_tensor = tensors[1].long().cpu()
tgt_lenths_cpu_tensor = tensors[2].long().cpu()
# iterate over batch
for ind in range(targets_cpu_tensor.shape[0]):
tgt_len = tgt_lenths_cpu_tensor[ind].item()
target = targets_cpu_tensor[ind][:tgt_len].numpy().tolist()
reference = ''.join([labels_map[c] for c in target])
references.append(reference)
hypotheses = __ctc_decoder_predictions_tensor(tensors[0], labels=labels)
tag = "training_batch_WER"
wer, _, _ = word_error_rate(hypotheses, references)
print_once('{0}: {1}'.format(tag, wer))
print_once('Prediction: {0}'.format(hypotheses[0]))
print_once('Reference: {0}'.format(references[0]))
return wer
def __gather_losses(losses_list: list) -> list:
return [torch.mean(torch.stack(losses_list))]
def __gather_predictions(predictions_list: list, labels: list) -> list:
results = []
for prediction in predictions_list:
results += __ctc_decoder_predictions_tensor(prediction, labels=labels)
return results
def __gather_transcripts(transcript_list: list, transcript_len_list: list,
labels: list) -> list:
results = []
labels_map = dict([(i, labels[i]) for i in range(len(labels))])
# iterate over workers
for t, ln in zip(transcript_list, transcript_len_list):
# iterate over batch
t_lc = t.long().cpu()
ln_lc = ln.long().cpu()
for ind in range(t.shape[0]):
tgt_len = ln_lc[ind].item()
target = t_lc[ind][:tgt_len].numpy().tolist()
reference = ''.join([labels_map[c] for c in target])
results.append(reference)
return results
def process_evaluation_batch(tensors: dict, global_vars: dict, labels: list):
"""
    Processes the results of an iteration and saves them in global_vars
Args:
tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcript, and output
        global_vars: dictionary where processed results of the iteration are accumulated
labels: A list of labels
"""
for kv, v in tensors.items():
if kv.startswith('loss'):
global_vars['EvalLoss'] += __gather_losses(v)
elif kv.startswith('predictions'):
global_vars['predictions'] += __gather_predictions(v, labels=labels)
elif kv.startswith('transcript_length'):
transcript_len_list = v
elif kv.startswith('transcript'):
transcript_list = v
elif kv.startswith('output'):
global_vars['logits'] += v
global_vars['transcripts'] += __gather_transcripts(transcript_list,
transcript_len_list,
labels=labels)
def process_evaluation_epoch(global_vars: dict, tag=None):
"""
    Processes results from each worker at the end of evaluation and combines them into the final result
Args:
global_vars: dictionary containing information of entire evaluation
Return:
wer: final word error rate
loss: final loss
"""
if 'EvalLoss' in global_vars:
eloss = torch.mean(torch.stack(global_vars['EvalLoss'])).item()
else:
eloss = None
hypotheses = global_vars['predictions']
references = global_vars['transcripts']
wer, scores, num_words = word_error_rate(hypotheses=hypotheses, references=references)
multi_gpu = torch.distributed.is_initialized()
if multi_gpu:
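        # Sum edit-distance scores and reference word counts across workers with all_reduce and recompute a
        # global WER; the loss is averaged by dividing by the world size before the (summing) all_reduce.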
if eloss is not None:
eloss /= torch.distributed.get_world_size()
eloss_tensor = torch.tensor(eloss).cuda()
dist.all_reduce(eloss_tensor)
eloss = eloss_tensor.item()
del eloss_tensor
scores_tensor = torch.tensor(scores).cuda()
dist.all_reduce(scores_tensor)
scores = scores_tensor.item()
del scores_tensor
num_words_tensor = torch.tensor(num_words).cuda()
dist.all_reduce(num_words_tensor)
num_words = num_words_tensor.item()
del num_words_tensor
        wer = scores * 1.0 / num_words
return wer, eloss
def norm(x):
    if not isinstance(x, (list, tuple)):
        return x
    return x[0]
def print_dict(d):
maxLen = max([len(ii) for ii in d.keys()])
fmtString = '\t%' + str(maxLen) + 's : %s'
print('Arguments:')
for keyPair in sorted(d.items()):
print(fmtString % keyPair)
def model_multi_gpu(model, multi_gpu=False):
if multi_gpu:
model = DDP(model)
print('DDP(model)')
return model

Binary file not shown.

Binary file not shown.


View file

@ -0,0 +1,214 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import itertools
from typing import List
from tqdm import tqdm
import math
import toml
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, print_dict, model_multi_gpu
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder
import torch
import apex
from apex import amp
import random
import numpy as np
import pickle
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument("--local_rank", default=None, type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--steps", default=None, help='if not specified do evaluation on full dataset. otherwise only evaluates the specified number of iterations for each worker', type=int)
parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None uses attribute from model configuration file')
parser.add_argument("--pad_to", default=None, type=int, help="default is pad to value as specified in model configurations. if -1 pad to maximum duration. If > 0 pad batch to next multiple of value")
parser.add_argument("--fp16", action='store_true', help='use half precision')
parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
parser.add_argument("--save_prediction", type=str, default=None, help="if specified saves predictions in text form at this location")
parser.add_argument("--logits_save_to", default=None, type=str, help="if specified will save logits to path")
parser.add_argument("--seed", default=42, type=int, help='seed')
return parser.parse_args()
def eval(
data_layer,
audio_processor,
encoderdecoder,
greedy_decoder,
labels,
multi_gpu,
args):
"""performs inference / evaluation
Args:
data_layer: data layer object that holds data loader
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
labels: list of labels as output vocabulary
multi_gpu: true if using multiple gpus
args: script input arguments
"""
logits_save_to=args.logits_save_to
audio_processor.eval()
encoderdecoder.eval()
with torch.no_grad():
_global_var_dict = {
'predictions': [],
'transcripts': [],
'logits' : [],
}
for it, data in enumerate(tqdm(data_layer.data_iterator)):
tensors = []
for d in data:
tensors.append(d.cuda())
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
inp = (t_audio_signal_e, t_a_sig_length_e)
t_processed_signal, p_length_e = audio_processor(x=inp)
t_log_probs_e, _ = encoderdecoder((t_processed_signal, p_length_e))
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
values_dict = dict(
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e],
output=[t_log_probs_e]
)
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
if args.steps is not None and it + 1 >= args.steps:
break
wer, _ = process_evaluation_epoch(_global_var_dict)
if (not multi_gpu or (multi_gpu and torch.distributed.get_rank() == 0)):
print("==========>>>>>>Evaluation WER: {0}\n".format(wer))
if args.save_prediction is not None:
with open(args.save_prediction, 'w') as fp:
fp.write('\n'.join(_global_var_dict['predictions']))
if logits_save_to is not None:
logits = []
for batch in _global_var_dict["logits"]:
for i in range(batch.shape[0]):
logits.append(batch[i].cpu().numpy())
with open(logits_save_to, 'wb') as f:
pickle.dump(logits, f, protocol=pickle.HIGHEST_PROTOCOL)
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.backends.cudnn.benchmark = args.cudnn_benchmark
print("CUDNN BENCHMARK ", args.cudnn_benchmark)
assert(torch.cuda.is_available())
if args.local_rank is not None:
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
multi_gpu = args.local_rank is not None
if multi_gpu:
print("DISTRIBUTED with ", torch.distributed.get_world_size())
if args.fp16:
optim_level = Optimization.mxprO3
else:
optim_level = Optimization.mxprO0
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
if args.pad_to is not None:
featurizer_config['pad_to'] = args.pad_to if args.pad_to >= 0 else "max"
print('model_config')
print_dict(jasper_model_definition)
print('feature_config')
print_dict(featurizer_config)
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=val_manifest,
labels=dataset_vocab,
batch_size=args.batch_size,
pad_to_max=featurizer_config['pad_to'] == "max",
shuffle=False,
multi_gpu=multi_gpu)
audio_preprocessor = AudioPreprocessing(**featurizer_config)
encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
if args.ckpt is not None:
print("loading model from ", args.ckpt)
checkpoint = torch.load(args.ckpt, map_location="cpu")
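        # The checkpoint stores preprocessor weights under an "audio_preprocessor." prefix; strip it so they
        # load into the standalone AudioPreprocessing module, and load the remaining keys into the
        # encoder/decoder with strict=False.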
for k in audio_preprocessor.state_dict().keys():
checkpoint['state_dict'][k] = checkpoint['state_dict'].pop("audio_preprocessor." + k)
audio_preprocessor.load_state_dict(checkpoint['state_dict'], strict=False)
encoderdecoder.load_state_dict(checkpoint['state_dict'], strict=False)
greedy_decoder = GreedyCTCDecoder()
# print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
N = len(data_layer)
step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
if args.steps is not None:
print('-----------------')
print('Have {0} examples to eval on.'.format(args.steps * args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
print('Have {0} steps / (gpu * epoch).'.format(args.steps))
print('-----------------')
else:
print('-----------------')
print('Have {0} examples to eval on.'.format(N))
print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
print('-----------------')
audio_preprocessor.cuda()
encoderdecoder.cuda()
if args.fp16:
encoderdecoder = amp.initialize(
models=encoderdecoder,
opt_level=AmpOptimizations[optim_level])
encoderdecoder = model_multi_gpu(encoderdecoder, multi_gpu)
eval(
data_layer=data_layer,
audio_processor=audio_preprocessor,
encoderdecoder=encoderdecoder,
greedy_decoder=greedy_decoder,
labels=ctc_vocab,
args=args,
multi_gpu=multi_gpu)
if __name__=="__main__":
args = parse_args()
print_dict(vars(args))
main(args)

View file

@ -0,0 +1,241 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import itertools
import os
import sys
import time
import random
import numpy as np
from heapq import nlargest
import math
from tqdm import tqdm
import toml
import torch
from apex import amp
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, print_dict
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument("--steps", default=None, help='if not specified do evaluation on full dataset. otherwise only evaluates the specified number of iterations for each worker', type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None uses attribute from model configuration file')
parser.add_argument("--pad_to", default=None, type=int, help="default is pad to value as specified in model configurations. if -1 pad to maximum duration. If > 0 pad batch to next multiple of value")
parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
parser.add_argument("--fp16", action='store_true', help='use half precision')
parser.add_argument("--seed", default=42, type=int, help='seed')
return parser.parse_args()
def eval(
data_layer,
audio_processor,
encoderdecoder,
greedy_decoder,
labels,
args):
"""performs evaluation and prints performance statistics
Args:
data_layer: data layer object that holds data loader
audio_processor: data processing module
encoderdecoder: acoustic model
greedy_decoder: greedy decoder
labels: list of labels as output vocabulary
args: script input arguments
"""
batch_size=args.batch_size
steps=args.steps
audio_processor.eval()
encoderdecoder.eval()
with torch.no_grad():
_global_var_dict = {
'predictions': [],
'transcripts': [],
}
it = 0
ep = 0
if steps is None:
steps = math.ceil(len(data_layer) / batch_size)
durations_dnn = []
durations_dnn_and_prep = []
seq_lens = []
while True:
ep += 1
for data in tqdm(data_layer.data_iterator):
it += 1
if it > steps:
break
tensors = []
dl_device = torch.device("cuda")
for d in data:
tensors.append(d.to(dl_device))
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
inp=(t_audio_signal_e, t_a_sig_length_e)
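                # Synchronize around each stage so perf_counter measures completed GPU work rather than
                # just asynchronous kernel launches.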
torch.cuda.synchronize()
t0 = time.perf_counter()
t_processed_signal, p_length_e = audio_processor(x=inp)
torch.cuda.synchronize()
t1 = time.perf_counter()
t_log_probs_e, _ = encoderdecoder((t_processed_signal, p_length_e))
torch.cuda.synchronize()
stop_time = time.perf_counter()
time_prep_and_dnn = stop_time - t0
time_dnn = stop_time - t1
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
values_dict = dict(
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e],
)
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
durations_dnn.append(time_dnn)
durations_dnn_and_prep.append(time_prep_and_dnn)
seq_lens.append(t_processed_signal.shape[-1])
if it >= steps:
wer, _ = process_evaluation_epoch(_global_var_dict)
print("==========>>>>>>Evaluation of all iterations WER: {0}\n".format(wer))
break
ratios = [0.9, 0.95,0.99, 1.]
latencies_dnn = take_durations_and_output_percentile(durations_dnn, ratios)
latencies_dnn_and_prep = take_durations_and_output_percentile(durations_dnn_and_prep, ratios)
print("\n using batch size {} and {} frames ".format(batch_size, seq_lens[-1]))
print("\n".join(["dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn.items()]))
print("\n".join(["prep + dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn_and_prep.items()]))
def take_durations_and_output_percentile(durations, ratios):
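    # Convert per-batch timings to milliseconds, drop the first few iterations as warm-up, then report the
    # mean (stored under the "0.5" key) together with the requested upper percentiles taken from the worst-case tail.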
durations = np.asarray(durations) * 1000 # in ms
latency = durations
latency = latency[5:]
mean_latency = np.mean(latency)
    latency_worst = nlargest(math.ceil((1 - min(ratios)) * len(latency)), latency)
    latency_ranges = get_percentile(ratios, latency_worst, len(latency))
latency_ranges["0.5"] = mean_latency
return latency_ranges
def get_percentile(ratios, arr, nsamples):
res = {}
for a in ratios:
idx = max(int(nsamples * (1 - a)), 0)
res[a] = arr[idx]
return res
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.backends.cudnn.benchmark = args.cudnn_benchmark
assert(args.steps is None or args.steps > 5)
print("CUDNN BENCHMARK ", args.cudnn_benchmark)
assert(torch.cuda.is_available())
if args.fp16:
optim_level = Optimization.mxprO3
else:
optim_level = Optimization.mxprO0
batch_size = args.batch_size
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
if args.max_duration is not None:
featurizer_config['max_duration'] = args.max_duration
if args.pad_to is not None:
featurizer_config['pad_to'] = args.pad_to if args.pad_to >= 0 else "max"
print('model_config')
print_dict(jasper_model_definition)
print('feature_config')
print_dict(featurizer_config)
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
manifest_filepath=val_manifest,
labels=dataset_vocab,
batch_size=batch_size,
pad_to_max=featurizer_config['pad_to'] == "max",
shuffle=False,
multi_gpu=False)
audio_preprocessor = AudioPreprocessing(**featurizer_config)
encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
if args.ckpt is not None:
print("loading model from ", args.ckpt)
checkpoint = torch.load(args.ckpt, map_location="cpu")
for k in audio_preprocessor.state_dict().keys():
checkpoint['state_dict'][k] = checkpoint['state_dict'].pop("audio_preprocessor." + k)
audio_preprocessor.load_state_dict(checkpoint['state_dict'], strict=False)
encoderdecoder.load_state_dict(checkpoint['state_dict'], strict=False)
greedy_decoder = GreedyCTCDecoder()
# print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
N = len(data_layer)
step_per_epoch = math.ceil(N / args.batch_size)
print('-----------------')
if args.steps is None:
print('Have {0} examples to eval on.'.format(N))
print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
else:
print('Have {0} examples to eval on.'.format(args.steps * args.batch_size))
print('Have {0} steps / (gpu * epoch).'.format(args.steps))
print('-----------------')
audio_preprocessor.cuda()
encoderdecoder.cuda()
if args.fp16:
encoderdecoder = amp.initialize(
models=encoderdecoder,
opt_level=AmpOptimizations[optim_level])
eval(
data_layer=data_layer,
audio_processor=audio_preprocessor,
encoderdecoder=encoderdecoder,
greedy_decoder=greedy_decoder,
labels=ctc_vocab,
args=args)
if __name__=="__main__":
args = parse_args()
print_dict(vars(args))
main(args)

View file

@ -0,0 +1,68 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Tuple
def __levenshtein(a: List, b: List) -> int:
"""Calculates the Levenshtein distance between a and b.
"""
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a, b = b, a
n, m = m, n
current = list(range(n + 1))
for i in range(1, m + 1):
previous, current = current, [i] + [0] * n
for j in range(1, n + 1):
add, delete = previous[j] + 1, current[j - 1] + 1
change = previous[j - 1]
if a[j - 1] != b[i - 1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
def word_error_rate(hypotheses: List[str], references: List[str]) -> Tuple[float, int, int]:
    """
    Computes the average word error rate (WER) between two texts represented as
    corresponding lists of strings. Hypotheses and references must have the same length.
    Args:
        hypotheses: list of hypotheses
        references: list of references
    Returns:
        wer: (float) average word error rate
        scores: (int) total edit distance
        words: (int) total number of reference words
    """
scores = 0
words = 0
if len(hypotheses) != len(references):
raise ValueError("In word error rate calculation, hypotheses and reference"
" lists must have the same number of elements. But I got:"
"{0} and {1} correspondingly".format(len(hypotheses), len(references)))
for h, r in zip(hypotheses, references):
h_list = h.split()
r_list = r.split()
words += len(r_list)
scores += __levenshtein(h_list, r_list)
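    # Corpus-level WER: total edit distance divided by the total number of reference words
    # (not a mean of per-utterance WERs).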
    if words != 0:
        wer = 1.0 * scores / words
else:
wer = float('inf')
return wer, scores, words

View file

@ -0,0 +1,409 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from apex import amp
import torch
import torch.nn as nn
from parts.features import FeatureFactory
from helpers import Optimization
import random
jasper_activations = {
"hardtanh": nn.Hardtanh,
"relu": nn.ReLU,
"selu": nn.SELU,
}
def init_weights(m, mode='xavier_uniform'):
if type(m) == nn.Conv1d or type(m) == MaskedConv1d:
if mode == 'xavier_uniform':
nn.init.xavier_uniform_(m.weight, gain=1.0)
elif mode == 'xavier_normal':
nn.init.xavier_normal_(m.weight, gain=1.0)
elif mode == 'kaiming_uniform':
nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
elif mode == 'kaiming_normal':
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
else:
raise ValueError("Unknown Initialization mode: {0}".format(mode))
elif type(m) == nn.BatchNorm1d:
if m.track_running_stats:
m.running_mean.zero_()
m.running_var.fill_(1)
m.num_batches_tracked.zero_()
if m.affine:
nn.init.ones_(m.weight)
nn.init.zeros_(m.bias)
def get_same_padding(kernel_size, stride, dilation):
if stride > 1 and dilation > 1:
raise ValueError("Only stride OR dilation may be greater than 1")
return (kernel_size // 2) * dilation
class AudioPreprocessing(nn.Module):
"""GPU accelerated audio preprocessing
"""
def __init__(self, **kwargs):
nn.Module.__init__(self) # For PyTorch API
self.optim_level = kwargs.get('optimization_level', Optimization.nothing)
self.featurizer = FeatureFactory.from_config(kwargs)
def forward(self, x):
input_signal, length = x
length.requires_grad_(False)
if self.optim_level not in [Optimization.nothing, Optimization.mxprO0, Optimization.mxprO3]:
with amp.disable_casts():
processed_signal = self.featurizer(x)
processed_length = self.featurizer.get_seq_len(length)
else:
processed_signal = self.featurizer(x)
processed_length = self.featurizer.get_seq_len(length)
return processed_signal, processed_length
class SpectrogramAugmentation(nn.Module):
"""Spectrogram augmentation
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self.spec_cutout_regions = SpecCutoutRegions(kwargs)
self.spec_augment = SpecAugment(kwargs)
@torch.no_grad()
def forward(self, input_spec):
augmented_spec = self.spec_cutout_regions(input_spec)
augmented_spec = self.spec_augment(augmented_spec)
return augmented_spec
class SpecAugment(nn.Module):
"""Spec augment. refer to https://arxiv.org/abs/1904.08779
"""
def __init__(self, cfg, rng=None):
super(SpecAugment, self).__init__()
self._rng = random.Random() if rng is None else rng
self.cutout_x_regions = cfg.get('cutout_x_regions', 0)
self.cutout_y_regions = cfg.get('cutout_y_regions', 0)
self.cutout_x_width = cfg.get('cutout_x_width', 10)
self.cutout_y_width = cfg.get('cutout_y_width', 10)
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape).byte()
for idx in range(sh[0]):
for _ in range(self.cutout_x_regions):
cutout_x_left = int(self._rng.uniform(0, sh[1] - self.cutout_x_width))
mask[idx, cutout_x_left:cutout_x_left + self.cutout_x_width, :] = 1
for _ in range(self.cutout_y_regions):
cutout_y_left = int(self._rng.uniform(0, sh[2] - self.cutout_y_width))
mask[idx, :, cutout_y_left:cutout_y_left + self.cutout_y_width] = 1
x = x.masked_fill(mask.to(device=x.device), 0)
return x
class SpecCutoutRegions(nn.Module):
"""Cutout. refer to https://arxiv.org/pdf/1708.04552.pdf
"""
def __init__(self, cfg, rng=None):
super(SpecCutoutRegions, self).__init__()
self._rng = random.Random() if rng is None else rng
self.cutout_rect_regions = cfg.get('cutout_rect_regions', 0)
self.cutout_rect_time = cfg.get('cutout_rect_time', 5)
self.cutout_rect_freq = cfg.get('cutout_rect_freq', 20)
@torch.no_grad()
def forward(self, x):
sh = x.shape
mask = torch.zeros(x.shape).byte()
for idx in range(sh[0]):
for i in range(self.cutout_rect_regions):
cutout_rect_x = int(self._rng.uniform(
0, sh[1] - self.cutout_rect_freq))
cutout_rect_y = int(self._rng.uniform(
0, sh[2] - self.cutout_rect_time))
mask[idx, cutout_rect_x:cutout_rect_x + self.cutout_rect_freq,
cutout_rect_y:cutout_rect_y + self.cutout_rect_time] = 1
x = x.masked_fill(mask.to(device=x.device), 0)
return x
class JasperEncoder(nn.Module):
"""Jasper encoder
"""
def __init__(self, **kwargs):
cfg = {}
for key, value in kwargs.items():
cfg[key] = value
nn.Module.__init__(self)
self._cfg = cfg
activation = jasper_activations[cfg['encoder']['activation']]()
use_conv_mask = cfg['encoder'].get('convmask', False)
feat_in = cfg['input']['features'] * cfg['input'].get('frame_splicing', 1)
init_mode = cfg.get('init_mode', 'xavier_uniform')
residual_panes = []
encoder_layers = []
self.dense_residual = False
for lcfg in cfg['jasper']:
dense_res = []
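            # With residual_dense, residual_panes accumulates the output width of every earlier dense block,
            # so this block receives skip connections from all of them (the dense-residual variant of Jasper).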
if lcfg.get('residual_dense', False):
residual_panes.append(feat_in)
dense_res = residual_panes
self.dense_residual = True
encoder_layers.append(
JasperBlock(feat_in, lcfg['filters'], repeat=lcfg['repeat'],
kernel_size=lcfg['kernel'], stride=lcfg['stride'],
dilation=lcfg['dilation'], dropout=lcfg['dropout'],
residual=lcfg['residual'], activation=activation,
residual_panes=dense_res, conv_mask=use_conv_mask))
feat_in = lcfg['filters']
self.encoder = nn.Sequential(*encoder_layers)
self.apply(lambda x: init_weights(x, mode=init_mode))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, x):
audio_signal, length = x
s_input, length = self.encoder(([audio_signal], length))
return s_input, length
class JasperDecoderForCTC(nn.Module):
"""Jasper decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self._feat_in = kwargs.get("feat_in")
self._num_classes = kwargs.get("num_classes")
init_mode = kwargs.get('init_mode', 'xavier_uniform')
self.decoder_layers = nn.Sequential(
nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),
nn.LogSoftmax(dim=1))
self.apply(lambda x: init_weights(x, mode=init_mode))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, encoder_output):
out = self.decoder_layers(encoder_output[-1])
return out.transpose(1, 2)
class Jasper(nn.Module):
"""Contains data preprocessing, spectrogram augmentation, jasper encoder and decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self.audio_preprocessor = AudioPreprocessing(**kwargs.get("feature_config"))
self.data_spectr_augmentation = SpectrogramAugmentation(**kwargs.get("feature_config"))
self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
num_classes=kwargs.get("num_classes"))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, x):
input_signal, length = x
t_processed_signal, p_length_t = self.audio_preprocessor(x)
if self.training:
t_processed_signal = self.data_spectr_augmentation(input_spec=t_processed_signal)
t_encoded_t, t_encoded_len_t = self.jasper_encoder((t_processed_signal, p_length_t))
return self.jasper_decoder(encoder_output=t_encoded_t), t_encoded_len_t
class JasperEncoderDecoder(nn.Module):
"""Contains jasper encoder and decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self)
self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
num_classes=kwargs.get("num_classes"))
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, x):
t_processed_signal, p_length_t = x
t_encoded_t, t_encoded_len_t = self.jasper_encoder((t_processed_signal, p_length_t))
return self.jasper_decoder(encoder_output=t_encoded_t), t_encoded_len_t
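# Illustrative wiring sketch: a toy two-block 'jasper' definition showing the config keys
# JasperEncoder reads ('input', 'encoder', 'jasper' with filters/repeat/kernel/stride/
# dilation/dropout/residual). Real model definitions live in the configs/*.toml files;
# the 'relu' entry in jasper_activations and the 29-class vocabulary are assumptions made
# only for this example.
def _toy_jasper_example():
    model_definition = {
        'input': {'features': 64, 'frame_splicing': 1},
        'encoder': {'activation': 'relu', 'convmask': True},
        'jasper': [
            {'filters': 256, 'repeat': 1, 'kernel': [11], 'stride': [2],
             'dilation': [1], 'dropout': 0.2, 'residual': False},
            {'filters': 256, 'repeat': 3, 'kernel': [11], 'stride': [1],
             'dilation': [1], 'dropout': 0.2, 'residual': True},
        ],
    }
    model = JasperEncoderDecoder(jasper_model_definition=model_definition,
                                 feat_in=256, num_classes=29)
    feats = torch.randn(2, 64, 200)                    # (batch, features, frames)
    feat_lens = torch.tensor([200, 160])
    log_probs, out_lens = model((feats, feat_lens))    # log_probs: (batch, ~frames/2, 29)
    return log_probs, out_lens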
class MaskedConv1d(nn.Conv1d):
"""1D convolution with sequence masking
"""
def __init__(self, in_channels, out_channels, kernel_size, stride=1,
padding=0, dilation=1, groups=1, bias=False, use_mask=True):
super(MaskedConv1d, self).__init__(in_channels, out_channels, kernel_size,
stride=stride,
padding=padding, dilation=dilation,
groups=groups, bias=bias)
self.use_mask = use_mask
def get_seq_len(self, lens):
return ((lens + 2 * self.padding[0] - self.dilation[0] * (
self.kernel_size[0] - 1) - 1) / self.stride[0] + 1)
def forward(self, inp):
x, lens = inp
if self.use_mask:
max_len = x.size(2)
mask = torch.arange(max_len).to(lens.dtype).to(lens.device).expand(len(lens),
max_len) >= lens.unsqueeze(
1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
del mask
lens = self.get_seq_len(lens)
out = super(MaskedConv1d, self).forward(x)
return out, lens
class JasperBlock(nn.Module):
"""Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
"""
def __init__(self, inplanes, planes, repeat=3, kernel_size=11, stride=1,
dilation=1, padding='same', dropout=0.2, activation=None,
residual=True, residual_panes=[], conv_mask=False):
super(JasperBlock, self).__init__()
if padding != "same":
raise ValueError("currently only 'same' padding is supported")
padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
self.conv_mask = conv_mask
self.conv = nn.ModuleList()
inplanes_loop = inplanes
for _ in range(repeat - 1):
self.conv.extend(
self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
stride=stride, dilation=dilation,
padding=padding_val))
self.conv.extend(
self._get_act_dropout_layer(drop_prob=dropout, activation=activation))
inplanes_loop = planes
self.conv.extend(
self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
stride=stride, dilation=dilation,
padding=padding_val))
self.res = nn.ModuleList() if residual else None
res_panes = residual_panes.copy()
self.dense_residual = residual
if residual:
if len(residual_panes) == 0:
res_panes = [inplanes]
self.dense_residual = False
for ip in res_panes:
self.res.append(nn.ModuleList(
modules=self._get_conv_bn_layer(ip, planes, kernel_size=1)))
self.out = nn.Sequential(
*self._get_act_dropout_layer(drop_prob=dropout, activation=activation))
def _get_conv_bn_layer(self, in_channels, out_channels, kernel_size=11,
stride=1, dilation=1, padding=0, bias=False):
layers = [
MaskedConv1d(in_channels, out_channels, kernel_size, stride=stride,
dilation=dilation, padding=padding, bias=bias,
use_mask=self.conv_mask),
nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)
]
return layers
def _get_act_dropout_layer(self, drop_prob=0.2, activation=None):
if activation is None:
activation = nn.Hardtanh(min_val=0.0, max_val=20.0)
layers = [
activation,
nn.Dropout(p=drop_prob)
]
return layers
def num_weights(self):
return sum(p.numel() for p in self.parameters() if p.requires_grad)
def forward(self, input_):
xs, lens_orig = input_
# compute forward convolutions
out = xs[-1]
lens = lens_orig
for i, l in enumerate(self.conv):
if isinstance(l, MaskedConv1d):
out, lens = l((out, lens))
else:
out = l(out)
# compute the residuals
if self.res is not None:
for i, layer in enumerate(self.res):
res_out = xs[i]
for j, res_layer in enumerate(layer):
if j == 0:
res_out, _ = res_layer((res_out, lens_orig))
else:
res_out = res_layer(res_out)
out += res_out
# compute the output
out = self.out(out)
if self.res is not None and self.dense_residual:
return xs + [out], lens
return [out], lens
class GreedyCTCDecoder(nn.Module):
""" Greedy CTC Decoder
"""
def __init__(self, **kwargs):
nn.Module.__init__(self) # For PyTorch API
def forward(self, log_probs):
with torch.no_grad():
argmx = log_probs.argmax(dim=-1, keepdim=False).int()
return argmx
class CTCLossNM:
""" CTC loss
"""
def __init__(self, **kwargs):
self._blank = kwargs['num_classes'] - 1
self._criterion = nn.CTCLoss(blank=self._blank, reduction='none')
def __call__(self, log_probs, targets, input_length, target_length):
input_length = input_length.long()
target_length = target_length.long()
targets = targets.long()
loss = self._criterion(log_probs.transpose(1, 0), targets, input_length,
target_length)
# note that this is different from reduction = 'mean'
# because we are not dividing by target lengths
return torch.mean(loss)
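# Illustrative sketch exercising GreedyCTCDecoder and CTCLossNM on dummy log-probabilities;
# the shapes and the 29-class vocabulary (28 characters plus the CTC blank at index 28) are
# assumptions for this example only.
def _ctc_example():
    batch, time_steps, num_classes = 2, 100, 29
    log_probs = torch.randn(batch, time_steps, num_classes).log_softmax(dim=-1)
    targets = torch.randint(0, num_classes - 1, (batch, 20))        # no blank symbols
    input_length = torch.full((batch,), time_steps, dtype=torch.long)
    target_length = torch.full((batch,), 20, dtype=torch.long)
    loss = CTCLossNM(num_classes=num_classes)(log_probs, targets,
                                              input_length, target_length)
    preds = GreedyCTCDecoder()(log_probs)                           # (batch, time_steps)
    return loss, preds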

View file

@@ -0,0 +1,223 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
from torch.optim import Optimizer
import math
class AdamW(Optimizer):
"""Implements AdamW algorithm.
It has been proposed in `Adam: A Method for Stochastic Optimization`_.
Arguments:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing
running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve
numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
amsgrad (boolean, optional): whether to use the AMSGrad variant of this
algorithm from the paper `On the Convergence of Adam and Beyond`_
Adam: A Method for Stochastic Optimization:
https://arxiv.org/abs/1412.6980
On the Convergence of Adam and Beyond:
https://openreview.net/forum?id=ryQu7f-RZ
"""
def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
weight_decay=0, amsgrad=False):
if not 0.0 <= lr:
raise ValueError("Invalid learning rate: {}".format(lr))
if not 0.0 <= eps:
raise ValueError("Invalid epsilon value: {}".format(eps))
if not 0.0 <= betas[0] < 1.0:
raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
if not 0.0 <= betas[1] < 1.0:
raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
defaults = dict(lr=lr, betas=betas, eps=eps,
weight_decay=weight_decay, amsgrad=amsgrad)
super(AdamW, self).__init__(params, defaults)
def __setstate__(self, state):
super(AdamW, self).__setstate__(state)
for group in self.param_groups:
group.setdefault('amsgrad', False)
def step(self, closure=None):
"""Performs a single optimization step.
Arguments:
closure (callable, optional): A closure that reevaluates the model
and returns the loss.
"""
loss = None
if closure is not None:
loss = closure()
for group in self.param_groups:
for p in group['params']:
if p.grad is None:
continue
grad = p.grad.data
if grad.is_sparse:
raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
amsgrad = group['amsgrad']
state = self.state[p]
# State initialization
if len(state) == 0:
state['step'] = 0
# Exponential moving average of gradient values
state['exp_avg'] = torch.zeros_like(p.data)
# Exponential moving average of squared gradient values
state['exp_avg_sq'] = torch.zeros_like(p.data)
if amsgrad:
# Maintains max of all exp. moving avg. of sq. grad. values
state['max_exp_avg_sq'] = torch.zeros_like(p.data)
exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
if amsgrad:
max_exp_avg_sq = state['max_exp_avg_sq']
beta1, beta2 = group['betas']
state['step'] += 1
# Decay the first and second moment running average coefficient
exp_avg.mul_(beta1).add_(1 - beta1, grad)
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
if amsgrad:
# Maintains the maximum of all 2nd moment running avg. till now
torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
# Use the max. for normalizing running avg. of gradient
denom = max_exp_avg_sq.sqrt().add_(group['eps'])
else:
denom = exp_avg_sq.sqrt().add_(group['eps'])
bias_correction1 = 1 - beta1 ** state['step']
bias_correction2 = 1 - beta2 ** state['step']
step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
p.data.add_(-step_size, torch.mul(p.data, group['weight_decay']).addcdiv_(1, exp_avg, denom) )
return loss
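# Minimal smoke-test sketch for AdamW. The single update above applies
#     p <- p - step_size * (weight_decay * p + exp_avg / (sqrt(exp_avg_sq) + eps)),
# with step_size = lr * sqrt(1 - beta2**t) / (1 - beta1**t), i.e. weight decay acts on the
# parameters directly instead of being folded into the gradient. Assumes a PyTorch release
# (such as the one pinned by this repository's container) that still accepts the deprecated
# add_/addcmul_ tensor overloads used in step().
def _adamw_smoke_test():
    model = torch.nn.Linear(4, 2)
    optimizer = AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-2)
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()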
class Novograd(Optimizer):
"""
Implements Novograd algorithm.
Args:
params (iterable): iterable of parameters to optimize or dicts defining
parameter groups
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing
running averages of gradient and its square (default: (0.95, 0.98))
eps (float, optional): term added to the denominator to improve
numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
grad_averaging: gradient averaging
amsgrad (boolean, optional): whether to use the AMSGrad variant of this
algorithm from the paper `On the Convergence of Adam and Beyond`_
(default: False)
"""
def __init__(self, params, lr=1e-3, betas=(0.95, 0.98), eps=1e-8,
weight_decay=0, grad_averaging=False, amsgrad=False):
if not 0.0 <= lr:
raise ValueError("Invalid learning rate: {}".format(lr))
if not 0.0 <= eps:
raise ValueError("Invalid epsilon value: {}".format(eps))
if not 0.0 <= betas[0] < 1.0:
raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
if not 0.0 <= betas[1] < 1.0:
raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
defaults = dict(lr=lr, betas=betas, eps=eps,
weight_decay=weight_decay,
grad_averaging=grad_averaging,
amsgrad=amsgrad)
super(Novograd, self).__init__(params, defaults)
def __setstate__(self, state):
super(Novograd, self).__setstate__(state)
for group in self.param_groups:
group.setdefault('amsgrad', False)
def step(self, closure=None):
"""Performs a single optimization step.
Arguments:
closure (callable, optional): A closure that reevaluates the model
and returns the loss.
"""
loss = None
if closure is not None:
loss = closure()
for group in self.param_groups:
for p in group['params']:
if p.grad is None:
continue
grad = p.grad.data
if grad.is_sparse:
raise RuntimeError('Sparse gradients are not supported.')
amsgrad = group['amsgrad']
state = self.state[p]
# State initialization
if len(state) == 0:
state['step'] = 0
# Exponential moving average of gradient values
state['exp_avg'] = torch.zeros_like(p.data)
# Exponential moving average of squared gradient values
state['exp_avg_sq'] = torch.zeros([]).to(state['exp_avg'].device)
if amsgrad:
# Maintains max of all exp. moving avg. of sq. grad. values
state['max_exp_avg_sq'] = torch.zeros([]).to(state['exp_avg'].device)
exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
if amsgrad:
max_exp_avg_sq = state['max_exp_avg_sq']
beta1, beta2 = group['betas']
state['step'] += 1
norm = torch.sum(torch.pow(grad, 2))
if exp_avg_sq == 0:
exp_avg_sq = norm
else:
exp_avg_sq.mul_(beta2).add_(1 - beta2, norm)
if amsgrad:
# Maintains the maximum of all 2nd moment running avg. till now
torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
# Use the max. for normalizing running avg. of gradient
denom = max_exp_avg_sq.sqrt().add_(group['eps'])
else:
denom = exp_avg_sq.sqrt().add_(group['eps'])
grad.div_(denom)
if group['weight_decay'] != 0:
grad.add_(group['weight_decay'], p.data)
if group['grad_averaging']:
grad.mul_(1 - beta1)
exp_avg.mul_(beta1).add_(grad)
p.data.add_(-group['lr'], exp_avg)
return loss
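# Short usage sketch for Novograd. The hyperparameters mirror the training scripts in this
# commit (--optimizer=novograd, lr 0.015, weight decay 1e-3) with the default betas of
# (0.95, 0.98); the Linear module merely stands in for the Jasper network. Like AdamW above,
# this assumes the container's PyTorch release still accepts the deprecated add_ overloads.
def _novograd_smoke_test():
    model = torch.nn.Linear(4, 2)
    optimizer = Novograd(model.parameters(), lr=0.015, betas=(0.95, 0.98), weight_decay=1e-3)
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()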

View file

@@ -0,0 +1,325 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import math
import librosa
from .perturb import AudioAugmentor
from .segment import AudioSegment
from apex import amp
class WaveformFeaturizer(object):
def __init__(self, input_cfg, augmentor=None):
self.augmentor = augmentor if augmentor is not None else AudioAugmentor()
self.cfg = input_cfg
def max_augmentation_length(self, length):
return self.augmentor.max_augmentation_length(length)
def process(self, file_path, offset=0, duration=0, trim=False):
audio = AudioSegment.from_file(file_path,
target_sr=self.cfg['sample_rate'],
int_values=self.cfg.get('int_values', False),
offset=offset, duration=duration, trim=trim)
return self.process_segment(audio)
def process_segment(self, audio_segment):
self.augmentor.perturb(audio_segment)
return torch.tensor(audio_segment.samples, dtype=torch.float)
@classmethod
def from_config(cls, input_config, perturbation_configs=None):
if perturbation_configs is not None:
aa = AudioAugmentor.from_config(perturbation_configs)
else:
aa = None
return cls(input_config, augmentor=aa)
constant = 1e-5
def normalize_batch(x, seq_len, normalize_type):
if normalize_type == "per_feature":
x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
device=x.device)
for i in range(x.shape[0]):
x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
# make sure x_std is not zero
x_std += constant
return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
elif normalize_type == "all_features":
x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
for i in range(x.shape[0]):
x_mean[i] = x[i, :, :seq_len[i].item()].mean()
x_std[i] = x[i, :, :seq_len[i].item()].std()
# make sure x_std is not zero
x_std += constant
return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
else:
return x
def splice_frames(x, frame_splicing):
""" Stacks frames together across feature dim
input is batch_size, feature_dim, num_frames
output is batch_size, feature_dim*frame_splicing, num_frames
"""
seq = [x]
for n in range(1, frame_splicing):
seq.append(torch.cat([x[:, :, :n], x[:, :, n:]], dim=2))
return torch.cat(seq, dim=1)
class SpectrogramFeatures(nn.Module):
def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
n_fft=None,
window="hamming", normalize="per_feature", log=True, center=True,
dither=constant, pad_to=8, max_duration=16.7,
frame_splicing=1):
super(SpectrogramFeatures, self).__init__()
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.win_length = int(sample_rate * window_size)
self.hop_length = int(sample_rate * window_stride)
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
self.window = window_tensor
self.normalize = normalize
self.log = log
self.center = center
self.dither = dither
self.pad_to = pad_to
self.frame_splicing = frame_splicing
max_length = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
max_pad = 16 - (max_length % 16)
self.max_length = max_length + max_pad
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
@torch.no_grad()
def forward(self, inp):
x, seq_len = inp
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if hasattr(self,'preemph') and self.preemph is not None:
x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
dim=1)
# get spectrogram
x = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length, center=self.center,
window=self.window.to(torch.float))
x = torch.sqrt(x.pow(2).sum(-1))
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
x = splice_frames(x, self.frame_splicing)
# normalize if required
if self.normalize:
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len).to(seq_len.dtype).to(seq_len.device).expand(x.size(0), max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
del mask
pad_to = self.pad_to
if pad_to == "max":
x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
elif pad_to > 0:
pad_amt = x.size(-1) % pad_to
if pad_amt != 0:
x = nn.functional.pad(x, (0, pad_to - pad_amt))
return x.to(dtype)
@classmethod
def from_config(cls, cfg, log=False):
return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
window_stride=cfg['window_stride'],
n_fft=cfg['n_fft'], window=cfg['window'],
normalize=cfg['normalize'],
max_duration=cfg.get('max_duration', 16.7),
dither=cfg.get('dither', 1e-5), pad_to=cfg.get("pad_to", 0),
frame_splicing=cfg.get("frame_splicing", 1), log=log)
class FilterbankFeatures(nn.Module):
def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
window="hamming", normalize="per_feature", n_fft=None,
preemph=0.97,
nfilt=64, lowfreq=0, highfreq=None, log=True, dither=constant,
pad_to=8,
max_duration=16.7,
frame_splicing=1):
super(FilterbankFeatures, self).__init__()
print("PADDING: {}".format(pad_to))
torch_windows = {
'hann': torch.hann_window,
'hamming': torch.hamming_window,
'blackman': torch.blackman_window,
'bartlett': torch.bartlett_window,
'none': None,
}
self.win_length = int(sample_rate * window_size) # frame size
self.hop_length = int(sample_rate * window_stride)
self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))
self.normalize = normalize
self.log = log
self.dither = dither
self.frame_splicing = frame_splicing
self.nfilt = nfilt
self.preemph = preemph
self.pad_to = pad_to
highfreq = highfreq or sample_rate / 2
window_fn = torch_windows.get(window, None)
window_tensor = window_fn(self.win_length,
periodic=False) if window_fn else None
filterbanks = torch.tensor(
librosa.filters.mel(sample_rate, self.n_fft, n_mels=nfilt, fmin=lowfreq,
fmax=highfreq), dtype=torch.float).unsqueeze(0)
# self.fb = filterbanks
# self.window = window_tensor
self.register_buffer("fb", filterbanks)
self.register_buffer("window", window_tensor)
# Calculate maximum sequence length (# frames)
max_length = 1 + math.ceil(
(max_duration * sample_rate - self.win_length) / self.hop_length
)
max_pad = 16 - (max_length % 16)
self.max_length = max_length + max_pad
def get_seq_len(self, seq_len):
return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
dtype=torch.int)
# dtype=torch.long)
@torch.no_grad()
def forward(self, inp):
x, seq_len = inp
dtype = x.dtype
seq_len = self.get_seq_len(seq_len)
# dither
if self.dither > 0:
x += self.dither * torch.randn_like(x)
# do preemphasis
if self.preemph is not None:
x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
dim=1)
# do stft
x = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
win_length=self.win_length,
center=True, window=self.window.to(dtype=torch.float))
# get power spectrum
x = x.pow(2).sum(-1)
# dot with filterbank energies
x = torch.matmul(self.fb.to(x.dtype), x)
# log features if required
if self.log:
x = torch.log(x + 1e-20)
# frame splicing if required
if self.frame_splicing > 1:
x = splice_frames(x, self.frame_splicing)
# normalize if required
if self.normalize:
x = normalize_batch(x, seq_len, normalize_type=self.normalize)
# mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
max_len = x.size(-1)
mask = torch.arange(max_len).to(seq_len.dtype).to(x.device).expand(x.size(0),
max_len) >= seq_len.unsqueeze(1)
x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
del mask
pad_to = self.pad_to
if pad_to == "max":
x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
elif pad_to > 0:
pad_amt = x.size(-1) % pad_to
if pad_amt != 0:
x = nn.functional.pad(x, (0, pad_to - pad_amt))
return x.to(dtype)
@classmethod
def from_config(cls, cfg, log=False):
return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
window_stride=cfg['window_stride'], n_fft=cfg['n_fft'],
nfilt=cfg['features'], window=cfg['window'],
normalize=cfg['normalize'],
max_duration=cfg.get('max_duration', 16.7),
dither=cfg['dither'], pad_to=cfg.get("pad_to", 0),
frame_splicing=cfg.get("frame_splicing", 1), log=log)
class FeatureFactory(object):
featurizers = {
"logfbank": FilterbankFeatures,
"fbank": FilterbankFeatures,
"stft": SpectrogramFeatures,
"logspect": SpectrogramFeatures,
"logstft": SpectrogramFeatures
}
def __init__(self):
pass
@classmethod
def from_config(cls, cfg):
feat_type = cfg.get('feat_type', "logspect")
featurizer = cls.featurizers[feat_type]
#return featurizer.from_config(cfg, log="log" in cfg['feat_type'])
return featurizer.from_config(cfg, log="log" in feat_type)
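# Minimal sketch of how FeatureFactory builds a featurizer from a config dict. The concrete
# values below are illustrative assumptions, not the shipped Jasper configuration, and the
# call assumes the torch/librosa versions pinned by this repository's container (older
# torch.stft and librosa.filters.mel signatures).
def _feature_factory_example():
    cfg = {
        'feat_type': 'logfbank',
        'sample_rate': 16000,
        'window_size': 0.02,       # 20 ms frames
        'window_stride': 0.01,     # 10 ms hop
        'window': 'hann',
        'normalize': 'per_feature',
        'n_fft': 512,
        'features': 64,            # number of mel filters
        'dither': 1e-5,
        'pad_to': 16,
        'frame_splicing': 1,
    }
    featurizer = FeatureFactory.from_config(cfg)
    waveform = torch.randn(2, 16000)                  # (batch, samples): 1 s of audio
    lengths = torch.tensor([16000, 12000])
    return featurizer((waveform, lengths))            # (batch, n_mels, padded frames)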

View file

@@ -0,0 +1,170 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import re
import string
import numpy as np
import os
from .text import _clean_text
def normalize_string(s, labels, table, **unused_kwargs):
"""
Normalizes string. For example:
'call me at 8:00 pm!' -> 'call me at eight zero pm'
Args:
s: string to normalize
labels: labels used during model training.
Returns:
Normalized string
"""
def good_token(token, labels):
s = set(labels)
for t in token:
if not t in s:
return False
return True
try:
text = _clean_text(s, ["english_cleaners"], table).strip()
return ''.join([t for t in text if good_token(t, labels=labels)])
except Exception:
print("WARNING: Normalizing {} failed".format(s))
return None
class Manifest(object):
def __init__(self, data_dir, manifest_paths, labels, blank_index, max_duration=None, pad_to_max=False,
min_duration=None, sort_by_duration=False, max_utts=0,
normalize=True, speed_perturbation=False, filter_speed=1.0):
self.labels_map = dict([(labels[i], i) for i in range(len(labels))])
self.blank_index = blank_index
self.max_duration = max_duration
ids = []
duration = 0.0
filtered_duration = 0.0
# If removing punctuation, make a list of punctuation to remove
table = None
if normalize:
# Punctuation to remove
punctuation = string.punctuation
punctuation = punctuation.replace("+", "")
punctuation = punctuation.replace("&", "")
### We might also want to consider:
### @ -> at
### # -> number, pound, hashtag
### ~ -> tilde
### _ -> underscore
### % -> percent
# If a punctuation symbol is inside our vocab, we do not remove from text
for l in labels:
punctuation = punctuation.replace(l, "")
# Turn all punctuation to whitespace
table = str.maketrans(punctuation, " " * len(punctuation))
for manifest_path in manifest_paths:
with open(manifest_path, "r", encoding="utf-8") as fh:
manifest_entries = json.load(fh)
for data in manifest_entries:
files_and_speeds = data['files']
if pad_to_max:
if not speed_perturbation:
min_speed = filter_speed
else:
min_speed = min(x['speed'] for x in files_and_speeds)
max_duration = self.max_duration * min_speed
data['duration'] = data['original_duration']
if min_duration is not None and data['duration'] < min_duration:
filtered_duration += data['duration']
continue
if max_duration is not None and data['duration'] > max_duration:
filtered_duration += data['duration']
continue
# Prune and normalize according to transcript
transcript_text = data[
'transcript'] if "transcript" in data else self.load_transcript(
data['text_filepath'])
if normalize:
transcript_text = normalize_string(transcript_text, labels=labels,
table=table)
if not isinstance(transcript_text, str):
print(
"WARNING: Got transcript: {}. It is not a string. Dropping data point".format(
transcript_text))
filtered_duration += data['duration']
continue
data["transcript"] = self.parse_transcript(transcript_text) # convert to vocab indices
if speed_perturbation:
audio_paths = [x['fname'] for x in files_and_speeds]
data['audio_duration'] = [x['duration'] for x in files_and_speeds]
else:
audio_paths = [x['fname'] for x in files_and_speeds if x['speed'] == filter_speed]
data['audio_duration'] = [x['duration'] for x in files_and_speeds if x['speed'] == filter_speed]
data['audio_filepath'] = [os.path.join(data_dir, x) for x in audio_paths]
data.pop('files')
data.pop('original_duration')
ids.append(data)
duration += data['duration']
if max_utts > 0 and len(ids) >= max_utts:
print(
'Stopping parsing %s as max_utts=%d' % (manifest_path, max_utts))
break
if sort_by_duration:
ids = sorted(ids, key=lambda x: x['duration'])
self._data = ids
self._size = len(ids)
self._duration = duration
self._filtered_duration = filtered_duration
def load_transcript(self, transcript_path):
with open(transcript_path, 'r', encoding="utf-8") as transcript_file:
transcript = transcript_file.read().replace('\n', '')
return transcript
def parse_transcript(self, transcript):
chars = [self.labels_map.get(x, self.blank_index) for x in list(transcript)]
transcript = list(filter(lambda x: x != self.blank_index, chars))
return transcript
def __getitem__(self, item):
return self._data[item]
def __len__(self):
return self._size
def __iter__(self):
return iter(self._data)
@property
def duration(self):
return self._duration
@property
def filtered_duration(self):
return self._filtered_duration
@property
def data(self):
return list(self._data)
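# Illustrative sketch of loading a manifest. The entry below mirrors the fields this loader
# reads ('files', 'original_duration', 'transcript'); the file names, label set and paths
# are hypothetical, and text normalization additionally requires the inflect/unidecode
# packages used by the cleaners.
def _manifest_example(tmp_manifest='/tmp/example-manifest.json'):
    entries = [{
        'files': [{'fname': 'wav/utt0001.wav', 'speed': 1.0, 'duration': 3.2}],
        'original_duration': 3.2,
        'transcript': 'hello world',
    }]
    with open(tmp_manifest, 'w', encoding='utf-8') as fh:
        json.dump(entries, fh)
    labels = [' '] + list('abcdefghijklmnopqrstuvwxyz') + ["'"]
    manifest = Manifest('/datasets/LibriSpeech', [tmp_manifest], labels,
                        blank_index=len(labels))
    return len(manifest), manifest.duration      # 1 utterance, 3.2 seconds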

View file

@@ -0,0 +1,111 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import librosa
from .manifest import Manifest
from .segment import AudioSegment
class Perturbation(object):
def max_augmentation_length(self, length):
return length
def perturb(self, data):
raise NotImplementedError
class SpeedPerturbation(Perturbation):
def __init__(self, min_speed_rate=0.85, max_speed_rate=1.15, rng=None):
self._min_rate = min_speed_rate
self._max_rate = max_speed_rate
self._rng = random.Random() if rng is None else rng
def max_augmentation_length(self, length):
return length * self._max_rate
def perturb(self, data):
speed_rate = self._rng.uniform(self._min_rate, self._max_rate)
if speed_rate <= 0:
raise ValueError("speed_rate should be greater than zero.")
data._samples = librosa.effects.time_stretch(data._samples, speed_rate)
class GainPerturbation(Perturbation):
def __init__(self, min_gain_dbfs=-10, max_gain_dbfs=10, rng=None):
self._min_gain_dbfs = min_gain_dbfs
self._max_gain_dbfs = max_gain_dbfs
self._rng = random.Random() if rng is None else rng
def perturb(self, data):
gain = self._rng.uniform(self._min_gain_dbfs, self._max_gain_dbfs)
data._samples = data._samples * (10. ** (gain / 20.))
class ShiftPerturbation(Perturbation):
def __init__(self, min_shift_ms=-5.0, max_shift_ms=5.0, rng=None):
self._min_shift_ms = min_shift_ms
self._max_shift_ms = max_shift_ms
self._rng = random.Random() if rng is None else rng
def perturb(self, data):
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
if abs(shift_ms) / 1000 > data.duration:
# TODO: do something smarter than just ignore this condition
return
shift_samples = int(shift_ms * data.sample_rate // 1000)
# print("DEBUG: shift:", shift_samples)
if shift_samples < 0:
data._samples[-shift_samples:] = data._samples[:shift_samples]
data._samples[:-shift_samples] = 0
elif shift_samples > 0:
data._samples[:-shift_samples] = data._samples[shift_samples:]
data._samples[-shift_samples:] = 0
perturbation_types = {
"speed": SpeedPerturbation,
"gain": GainPerturbation,
"shift": ShiftPerturbation,
}
class AudioAugmentor(object):
def __init__(self, perturbations=None, rng=None):
self._rng = random.Random() if rng is None else rng
self._pipeline = perturbations if perturbations is not None else []
def perturb(self, segment):
for (prob, p) in self._pipeline:
if self._rng.random() < prob:
p.perturb(segment)
return
def max_augmentation_length(self, length):
newlen = length
for (prob, p) in self._pipeline:
newlen = p.max_augmentation_length(newlen)
return newlen
@classmethod
def from_config(cls, config):
ptbs = []
for p in config:
if p['aug_type'] not in perturbation_types:
print(p['aug_type'], "perturbation not known. Skipping.")
continue
perturbation = perturbation_types[p['aug_type']]
ptbs.append((p['prob'], perturbation(**p['cfg'])))
return cls(perturbations=ptbs)
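# Brief sketch of the config format AudioAugmentor.from_config expects; the probabilities
# and parameter ranges below are illustrative values, not a recommended setting.
def _augmentor_example(segment):
    config = [
        {'aug_type': 'gain', 'prob': 0.5, 'cfg': {'min_gain_dbfs': -6, 'max_gain_dbfs': 6}},
        {'aug_type': 'shift', 'prob': 0.5, 'cfg': {'min_shift_ms': -5.0, 'max_shift_ms': 5.0}},
    ]
    augmentor = AudioAugmentor.from_config(config)
    augmentor.perturb(segment)      # mutates the AudioSegment samples in place
    return segment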

View file

@@ -0,0 +1,170 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import librosa
import soundfile as sf
class AudioSegment(object):
"""Monaural audio segment abstraction.
:param samples: Audio samples [num_samples x num_channels].
:type samples: ndarray.float32
:param sample_rate: Audio sample rate.
:type sample_rate: int
:raises TypeError: If the sample data type is not float or int.
"""
def __init__(self, samples, sample_rate, target_sr=None, trim=False,
trim_db=60):
"""Create audio segment from samples.
Samples are converted to float32 internally, with integer types scaled to [-1, 1].
"""
samples = self._convert_samples_to_float32(samples)
if target_sr is not None and target_sr != sample_rate:
samples = librosa.core.resample(samples, sample_rate, target_sr)
sample_rate = target_sr
if trim:
samples, _ = librosa.effects.trim(samples, trim_db)
self._samples = samples
self._sample_rate = sample_rate
if self._samples.ndim >= 2:
self._samples = np.mean(self._samples, 1)
def __eq__(self, other):
"""Return whether two objects are equal."""
if type(other) is not type(self):
return False
if self._sample_rate != other._sample_rate:
return False
if self._samples.shape != other._samples.shape:
return False
if np.any(self.samples != other._samples):
return False
return True
def __ne__(self, other):
"""Return whether two objects are unequal."""
return not self.__eq__(other)
def __str__(self):
"""Return human-readable representation of segment."""
return ("%s: num_samples=%d, sample_rate=%d, duration=%.2fsec, "
"rms=%.2fdB" % (type(self), self.num_samples, self.sample_rate,
self.duration, self.rms_db))
@staticmethod
def _convert_samples_to_float32(samples):
"""Convert sample type to float32.
Audio sample type is usually integer or floating-point.
Integers will be scaled to [-1, 1] in float32.
"""
float32_samples = samples.astype('float32')
if samples.dtype in np.sctypes['int']:
bits = np.iinfo(samples.dtype).bits
float32_samples *= (1. / 2 ** (bits - 1))
elif samples.dtype in np.sctypes['float']:
pass
else:
raise TypeError("Unsupported sample type: %s." % samples.dtype)
return float32_samples
@classmethod
def from_file(cls, filename, target_sr=None, int_values=False, offset=0,
duration=0, trim=False):
"""
Load a file supported by librosa and return as an AudioSegment.
:param filename: path of file to load
:param target_sr: the desired sample rate
:param int_values: if true, load samples as 32-bit integers
:param offset: offset in seconds when loading audio
:param duration: duration in seconds when loading audio
:return: numpy array of samples
"""
with sf.SoundFile(filename, 'r') as f:
dtype = 'int32' if int_values else 'float32'
sample_rate = f.samplerate
if offset > 0:
f.seek(int(offset * sample_rate))
if duration > 0:
samples = f.read(int(duration * sample_rate), dtype=dtype)
else:
samples = f.read(dtype=dtype)
samples = samples.transpose()
return cls(samples, sample_rate, target_sr=target_sr, trim=trim)
@property
def samples(self):
return self._samples.copy()
@property
def sample_rate(self):
return self._sample_rate
@property
def num_samples(self):
return self._samples.shape[0]
@property
def duration(self):
return self._samples.shape[0] / float(self._sample_rate)
@property
def rms_db(self):
mean_square = np.mean(self._samples ** 2)
return 10 * np.log10(mean_square)
def gain_db(self, gain):
self._samples *= 10. ** (gain / 20.)
def pad(self, pad_size, symmetric=False):
"""Add zero padding to the sample. The pad size is given in number of samples.
If symmetric=True, `pad_size` will be added to both sides. If false, `pad_size`
zeros will be added only to the end.
"""
self._samples = np.pad(self._samples,
(pad_size if symmetric else 0, pad_size),
mode='constant')
def subsegment(self, start_time=None, end_time=None):
"""Cut the AudioSegment between given boundaries.
Note that this is an in-place transformation.
:param start_time: Beginning of subsegment in seconds.
:type start_time: float
:param end_time: End of subsegment in seconds.
:type end_time: float
:raise ValueError: If start_time or end_time is incorrectly set, e.g. out
of bounds in time.
"""
start_time = 0.0 if start_time is None else start_time
end_time = self.duration if end_time is None else end_time
if start_time < 0.0:
start_time = self.duration + start_time
if end_time < 0.0:
end_time = self.duration + end_time
if start_time < 0.0:
raise ValueError("The slice start position (%f s) is out of "
"bounds." % start_time)
if end_time < 0.0:
raise ValueError("The slice end position (%f s) is out of bounds." %
end_time)
if start_time > end_time:
raise ValueError("The slice start position (%f s) is later than "
"the end position (%f s)." % (start_time, end_time))
if end_time > self.duration:
raise ValueError("The slice end position (%f s) is out of bounds "
"(> %f s)" % (end_time, self.duration))
start_sample = int(round(start_time * self._sample_rate))
end_sample = int(round(end_time * self._sample_rate))
self._samples = self._samples[start_sample:end_sample]
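# Short usage sketch for AudioSegment; the file path is a placeholder for any audio file
# readable by soundfile.
def _segment_example():
    seg = AudioSegment.from_file('/datasets/LibriSpeech/dev-clean-wav/example.wav',
                                 target_sr=16000)
    print(seg.duration, seg.sample_rate, seg.rms_db)
    seg.gain_db(-3.0)               # attenuate by 3 dB
    seg.subsegment(0.5, 1.5)        # keep the window between 0.5 s and 1.5 s (in place)
    return seg.samples              # float32 numpy copy of the trimmed samples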

View file

@@ -0,0 +1,19 @@
Copyright (c) 2017 Keith Ito
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

View file

@@ -0,0 +1,12 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """
import re
from . import cleaners
def _clean_text(text, cleaner_names, *args):
for name in cleaner_names:
cleaner = getattr(cleaners, name)
if not cleaner:
raise Exception('Unknown cleaner: %s' % name)
text = cleaner(text, *args)
return text

View file

@@ -0,0 +1,107 @@
# Copyright (c) 2017 Keith Ito
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" from https://github.com/keithito/tacotron
Modified to add punctuation removal
"""
'''
Cleaners are transformations that run over the input text at both training and eval time.
Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
1. "english_cleaners" for English text
2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
the Unidecode library (https://pypi.python.org/pypi/Unidecode)
3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
the symbols in symbols.py to match your data).
'''
import re
from unidecode import unidecode
from .numbers import normalize_numbers
# Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+')
# List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
('mrs', 'misess'),
('mr', 'mister'),
('dr', 'doctor'),
('st', 'saint'),
('co', 'company'),
('jr', 'junior'),
('maj', 'major'),
('gen', 'general'),
('drs', 'doctors'),
('rev', 'reverend'),
('lt', 'lieutenant'),
('hon', 'honorable'),
('sgt', 'sergeant'),
('capt', 'captain'),
('esq', 'esquire'),
('ltd', 'limited'),
('col', 'colonel'),
('ft', 'fort'),
]]
def expand_abbreviations(text):
for regex, replacement in _abbreviations:
text = re.sub(regex, replacement, text)
return text
def expand_numbers(text):
return normalize_numbers(text)
def lowercase(text):
return text.lower()
def collapse_whitespace(text):
return re.sub(_whitespace_re, ' ', text)
def convert_to_ascii(text):
return unidecode(text)
def remove_punctuation(text, table):
text = text.translate(table)
text = re.sub(r'&', " and ", text)
text = re.sub(r'\+', " plus ", text)
return text
def basic_cleaners(text):
'''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
text = lowercase(text)
text = collapse_whitespace(text)
return text
def transliteration_cleaners(text):
'''Pipeline for non-English text that transliterates to ASCII.'''
text = convert_to_ascii(text)
text = lowercase(text)
text = collapse_whitespace(text)
return text
def english_cleaners(text, table=None):
'''Pipeline for English text, including number and abbreviation expansion.'''
text = convert_to_ascii(text)
text = lowercase(text)
text = expand_numbers(text)
text = expand_abbreviations(text)
if table is not None:
text = remove_punctuation(text, table)
text = collapse_whitespace(text)
return text
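# Illustrative example of english_cleaners, using a punctuation-to-whitespace table like
# the one built by the Manifest loader elsewhere in this commit; the exact output may vary
# slightly with the installed inflect/unidecode versions.
def _cleaners_example():
    import string
    punctuation = string.punctuation.replace("+", "").replace("&", "")
    table = str.maketrans(punctuation, " " * len(punctuation))
    return english_cleaners("Dr. Jones bought 3 books!", table)
    # -> roughly "doctor jones bought three books"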

View file

@@ -0,0 +1,99 @@
# Copyright (c) 2017 Keith Ito
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" from https://github.com/keithito/tacotron
Modified to add support for time and slight tweaks to _expand_number
"""
import inflect
import re
_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
_number_re = re.compile(r'[0-9]+')
_time_re = re.compile(r'([0-9]{1,2}):([0-9]{2})')
def _remove_commas(m):
return m.group(1).replace(',', '')
def _expand_decimal_point(m):
return m.group(1).replace('.', ' point ')
def _expand_dollars(m):
match = m.group(1)
parts = match.split('.')
if len(parts) > 2:
return match + ' dollars' # Unexpected format
dollars = int(parts[0]) if parts[0] else 0
cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
if dollars and cents:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
elif dollars:
dollar_unit = 'dollar' if dollars == 1 else 'dollars'
return '%s %s' % (dollars, dollar_unit)
elif cents:
cent_unit = 'cent' if cents == 1 else 'cents'
return '%s %s' % (cents, cent_unit)
else:
return 'zero dollars'
def _expand_ordinal(m):
return _inflect.number_to_words(m.group(0))
def _expand_number(m):
if int(m.group(0)[0]) == 0:
return _inflect.number_to_words(m.group(0), andword='', group=1)
num = int(m.group(0))
if num > 1000 and num < 3000:
if num == 2000:
return 'two thousand'
elif num > 2000 and num < 2010:
return 'two thousand ' + _inflect.number_to_words(num % 100)
elif num % 100 == 0:
return _inflect.number_to_words(num // 100) + ' hundred'
else:
return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
# Add check for number phones and other large numbers
elif num > 1000000000 and num % 10000 != 0:
return _inflect.number_to_words(num, andword='', group=1)
else:
return _inflect.number_to_words(num, andword='')
def _expand_time(m):
mins = int(m.group(2))
if mins == 0:
return _inflect.number_to_words(m.group(1))
return " ".join([_inflect.number_to_words(m.group(1)), _inflect.number_to_words(m.group(2))])
def normalize_numbers(text):
text = re.sub(_comma_number_re, _remove_commas, text)
text = re.sub(_pounds_re, r'\1 pounds', text)
text = re.sub(_dollars_re, _expand_dollars, text)
text = re.sub(_decimal_number_re, _expand_decimal_point, text)
text = re.sub(_ordinal_re, _expand_ordinal, text)
text = re.sub(_number_re, _expand_number, text)
text = re.sub(_time_re, _expand_time, text)
return text
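# Illustrative calls to normalize_numbers; the expected outputs shown in the comments are
# approximate and depend on the installed inflect version.
def _numbers_example():
    print(normalize_numbers("The total was $5.50"))   # -> "The total was five dollars, fifty cents"
    print(normalize_numbers("He finished 3rd"))       # -> "He finished third"
    print(normalize_numbers("Founded in 1987"))       # -> "Founded in nineteen eighty-seven"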

View file

@@ -0,0 +1,19 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """
'''
Defines the set of symbols used in text input to the model.
The default is a set of ASCII characters that works well for English or text that has been run through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. '''
from . import cmudict
_pad = '_'
_punctuation = '!\'(),.:;? '
_special = '-'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ['@' + s for s in cmudict.valid_symbols]
# Export all symbols:
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet

View file

@@ -0,0 +1,9 @@
pandas==0.24.2
tqdm==4.31.1
ascii-graph==1.5.1
wrapt==1.10.11
librosa
toml
soundfile
ipdb
sox

View file

@@ -0,0 +1,3 @@
#!/bin/bash
docker build . --rm -t jasper

View file

@@ -0,0 +1,30 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/bin/bash
DATA_DIR=$1
CHECKPOINT_DIR=$2
RESULT_DIR=$3
docker run -it --rm \
--runtime=nvidia \
--shm-size=4g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-v "$DATA_DIR":/datasets \
-v "$CHECKPOINT_DIR":/checkpoints/ \
-v "$RESULT_DIR":/results/ \
jasper bash

View file

@@ -0,0 +1,28 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/usr/bin/env bash
DATA_SET="LibriSpeech"
DATA_ROOT_DIR="/datasets"
DATA_DIR="${DATA_ROOT_DIR}/${DATA_SET}"
if [ ! -d "$DATA_DIR" ]
then
mkdir $DATA_DIR
chmod go+rx $DATA_DIR
python utils/download_librispeech.py utils/librispeech.csv $DATA_DIR -e ${DATA_ROOT_DIR}/
else
echo "Directory $DATA_DIR already exists."
fi

View file

@@ -0,0 +1,92 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
DATA_DIR=${1:-"/datasets/LibriSpeech"}
DATASET=${2:-"dev-clean"}
MODEL_CONFIG=${3:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${4:-"/results"}
CHECKPOINT=$5
CREATE_LOGFILE=${6:-"true"}
CUDNN_BENCHMARK=${7:-"false"}
NUM_GPUS=${8:-1}
PRECISION=${9:-"fp32"}
NUM_STEPS=${10:-"-1"}
SEED=${11:-0}
BATCH_SIZE=${12:-64}
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_evaluation_${DATASET}_%s_gbs%d" "$PRECISION" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
PREC=""
if [ "$PRECISION" = "fp16" ] ; then
PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
PREC=""
else
echo "Unknown <precision> argument"
exit -2
fi
STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
STEPS=" --steps $NUM_STEPS"
fi
if [ "$CUDNN_BENCHMARK" = "true" ] ; then
CUDNN_BENCHMARK=" --cudnn_benchmark"
else
CUDNN_BENCHMARK=""
fi
CMD=" inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
CMD+=" --ckpt $CHECKPOINT "
CMD+=" $CUDNN_BENCHMARK"
CMD+=" $PREC "
CMD+=" $STEPS "
if [ "$NUM_GPUS" -gt 1 ] ; then
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
CMD="python3 $CMD"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
fi
set +x

View file

@@ -0,0 +1,104 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
DATA_DIR=${1:-"/datasets/LibriSpeech"}
DATASET=${2:-"dev-clean"}
MODEL_CONFIG=${3:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${4:-"/results"}
CHECKPOINT=$5
CREATE_LOGFILE=${6:-"true"}
CUDNN_BENCHMARK=${7:-"false"}
PRECISION=${8:-"fp32"}
NUM_STEPS=${9:-"-1"}
SEED=${10:-0}
BATCH_SIZE=${11:-64}
MODELOUTPUT_FILE=${12:-"none"}
PREDICTION_FILE=${13:-"$RESULT_DIR/${DATASET}.predictions"}
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE)
printf -v TAG "jasper_inference_${DATASET}_%s_gbs%d" "$PRECISION" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
PREC=""
if [ "$PRECISION" = "fp16" ] ; then
PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
PREC=""
else
echo "Unknown <precision> argument"
exit -2
fi
PRED=""
if [ "$PREDICTION_FILE" = "none" ] ; then
PRED=""
else
PRED=" --save_prediction $PREDICTION_FILE"
fi
OUTPUT=""
if [ "$MODELOUTPUT_FILE" = "none" ] ; then
OUTPUT=" "
else
OUTPUT=" --logits_save_to $MODELOUTPUT_FILE"
fi
if [ "$CUDNN_BENCHMARK" = "true" ]; then
CUDNN_BENCHMARK=" --cudnn_benchmark"
else
CUDNN_BENCHMARK=""
fi
STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
STEPS=" --steps $NUM_STEPS"
fi
CMD=" python inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
CMD+=" --ckpt $CHECKPOINT "
CMD+=" $CUDNN_BENCHMARK"
CMD+=" $PRED "
CMD+=" $OUTPUT "
CMD+=" $PREC "
CMD+=" $STEPS "
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
fi
set +x
echo "MODELOUTPUT_FILE: ${MODELOUTPUT_FILE}"
echo "PREDICTION_FILE: ${PREDICTION_FILE}"

View file

@@ -0,0 +1,89 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
DATA_DIR=${1:-"/datasets/LibriSpeech"}
DATASET=${2:-"dev-clean"}
MODEL_CONFIG=${3:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${4:-"/results"}
CHECKPOINT=$5
CREATE_LOGFILE=${6:-"true"}
CUDNN_BENCHMARK=${7:-"true"}
PRECISION=${8:-"fp32"}
NUM_STEPS=${9:-"-1"}
MAX_DURATION=${10:-"36"}
SEED=${11:-0}
BATCH_SIZE=${12:-64}
PREC=""
if [ "$PRECISION" = "fp16" ] ; then
PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
PREC=""
else
echo "Unknown <precision> argument"
exit -2
fi
STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
STEPS=" --steps $NUM_STEPS"
fi
if [ "$CUDNN_BENCHMARK" = "true" ] ; then
CUDNN_BENCHMARK=" --cudnn_benchmark"
else
CUDNN_BENCHMARK=""
fi
CMD=" python inference_benchmark.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --seed=$SEED"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt=$CHECKPOINT"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to=-1"
CMD+=" $CUDNN_BENCHMARK"
CMD+=" $PREC"
CMD+=" $STEPS"
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE )
printf -v TAG "jasper_inference_benchmark_%s_gbs%d" "$PRECISION" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee "$LOGFILE"
grep 'latency' "$LOGFILE"
fi
set +x

View file

@@ -0,0 +1,51 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/usr/bin/env bash
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-clean-100 \
--dest_dir /datasets/LibriSpeech/train-clean-100-wav \
--output_json /datasets/LibriSpeech/librispeech-train-clean-100-wav.json \
--speed 0.9 1.1
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-clean-360 \
--dest_dir /datasets/LibriSpeech/train-clean-360-wav \
--output_json /datasets/LibriSpeech/librispeech-train-clean-360-wav.json \
--speed 0.9 1.1
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/train-other-500 \
--dest_dir /datasets/LibriSpeech/train-other-500-wav \
--output_json /datasets/LibriSpeech/librispeech-train-other-500-wav.json \
--speed 0.9 1.1
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/dev-clean \
--dest_dir /datasets/LibriSpeech/dev-clean-wav \
--output_json /datasets/LibriSpeech/librispeech-dev-clean-wav.json
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/dev-other \
--dest_dir /datasets/LibriSpeech/dev-other-wav \
--output_json /datasets/LibriSpeech/librispeech-dev-other-wav.json
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/test-clean \
--dest_dir /datasets/LibriSpeech/test-clean-wav \
--output_json /datasets/LibriSpeech/librispeech-test-clean-wav.json
python ./utils/convert_librispeech.py \
--input_dir /datasets/LibriSpeech/test-other \
--dest_dir /datasets/LibriSpeech/test-other-wav \
--output_json /datasets/LibriSpeech/librispeech-test-other-wav.json

View file

@@ -0,0 +1,111 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
DATA_DIR=${1:-"/datasets/LibriSpeech"}
MODEL_CONFIG=${2:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${3:-"/results"}
CHECKPOINT=${4:-"none"}
CREATE_LOGFILE=${5:-"true"}
CUDNN_BENCHMARK=${6:-"true"}
NUM_GPUS=${7:-8}
PRECISION=${8:-"fp16"}
EPOCHS=${9:-400}
SEED=${10:-6}
BATCH_SIZE=${11:-64}
LEARNING_RATE=${12:-"0.015"}
GRADIENT_ACCUMULATION_STEPS=${13:-1}
LAUNCH_OPT=${LAUNCH_OPT:-"none"}
PREC=""
if [ "$PRECISION" = "fp16" ] ; then
PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
PREC=""
else
echo "Unknown <precision> argument"
exit -2
fi
CUDNN=""
if [ "$CUDNN_BENCHMARK" = "true" ] && [ "$PRECISION" = "fp16" ]; then
CUDNN=" --cudnn"
else
CUDNN=""
fi
if [ "$CHECKPOINT" = "none" ] ; then
CHECKPOINT=""
else
CHECKPOINT=" --ckpt=${CHECKPOINT}"
fi
CMD=" train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=$EPOCHS"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json,$DATA_DIR/librispeech-train-clean-360-wav.json,$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=10"
CMD+=" --eval_freq=100"
CMD+=" --train_freq=25"
CMD+=" --lr_decay"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS "
CMD+=" $CHECKPOINT"
CMD+=" $PREC"
CMD+=" $CUDNN"
if [ "${LAUNCH_OPT}" != "none" ]; then
CMD="python -m $LAUNCH_OPT $CMD"
elif [ "$NUM_GPUS" -gt 1 ] ; then
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
CMD="python3 $CMD"
fi
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_%s_gbs%d" "$PRECISION" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULT_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD
else
(
$CMD
) |& tee $LOGFILE
fi
set +x

View file

@@ -0,0 +1,131 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
DATA_DIR=${1:-"/datasets/LibriSpeech"}
MODEL_CONFIG=${2:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${3:-"/results"}
CREATE_LOGFILE=${4:-"true"}
CUDNN_BENCHMARK=${5:-"true"}
NUM_GPUS=${6:-8}
PRECISION=${7:-"fp16"}
NUM_STEPS=${8:-"-1"}
MAX_DURATION=${9:-16.7}
SEED=${10:-0}
BATCH_SIZE=${11:-64}
LEARNING_RATE=${12:-"0.015"}
GRADIENT_ACCUMULATION_STEPS=${13:-1}
PRINT_FREQUENCY=${14:-1}
PREC=""
if [ "$PRECISION" = "fp16" ] ; then
PREC=" --fp16"
elif [ "$PRECISION" = "fp32" ] ; then
PREC=""
else
echo "Unknown <precision> argument"
exit -2
fi
STEPS=""
if [ "$NUM_STEPS" -ne "-1" ] ; then
STEPS=" --num_steps=$NUM_STEPS"
elif [ "$NUM_STEPS" = "-1" ] ; then
STEPS=""
else
echo "Unknown <precision> argument"
exit -2
fi
CUDNN=""
if [ "$CUDNN_BENCHMARK" = "true" ] ; then
CUDNN=" --cudnn"
else
CUDNN=""
fi
CMD=" train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=400"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json,$DATA_DIR/librispeech-train-clean-360-wav.json,$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=100000"
CMD+=" --eval_freq=100000"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to_max"
CMD+=" --train_freq=$PRINT_FREQUENCY"
CMD+=" --lr_decay"
CMD+=" $CUDNN"
CMD+=" $PREC"
CMD+=" $STEPS"
if [ "$NUM_GPUS" -gt 1 ] ; then
CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
CMD="python3 $CMD"
fi
if [ "$CREATE_LOGFILE" = "true" ] ; then
export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
printf -v TAG "jasper_train_benchmark_%s_gbs%d" "$PRECISION" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
printf "Logs written to %s\n" "$LOGFILE"
fi
if [ -z "$LOGFILE" ] ; then
set -x
$CMD
set +x
else
set -x
(
$CMD
) |& tee "$LOGFILE"
set +x
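# Derive summary metrics from the captured log. This relies on the log lines
# emitted during training ("Step time: ...", "Loss@Step: ...",
# "training_batch_WER ...", "Evaluation WER/Loss ..."); if that format
# changes, the extraction below needs to be updated as well.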
mean_latency=`cat "$LOGFILE" | grep 'Step time' | awk '{print $3}' | tail -n +2 | egrep -o '[0-9.]+'| awk 'BEGIN {total=0} {total+=$1} END {printf("%.2f\n",total/NR)}'`
mean_throughput=`python -c "print($BATCH_SIZE*$NUM_GPUS/${mean_latency})"`
training_wer_per_gpu=`cat "$LOGFILE" | grep 'training_batch_WER'| awk '{print $2}' | tail -n 1 | egrep -o '[0-9.]+'`
training_loss_per_gpu=`cat "$LOGFILE" | grep 'Loss@Step'| awk '{print $4}' | tail -n 1 | egrep -o '[0-9.]+'`
final_eval_wer=`cat "$LOGFILE" | grep 'Evaluation WER'| tail -n 1 | egrep -o '[0-9.]+'`
final_eval_loss=`cat "$LOGFILE" | grep 'Evaluation Loss'| tail -n 1 | egrep -o '[0-9.]+'`
echo "max duration: $MAX_DURATION s" | tee -a "$LOGFILE"
echo "mean_latency: $mean_latency s" | tee -a "$LOGFILE"
echo "mean_throughput: $mean_throughput sequences/s" | tee -a "$LOGFILE"
echo "training_wer_per_pgu: $training_wer_per_pgu" | tee -a "$LOGFILE"
echo "training_loss_per_pgu: $training_loss_per_pgu" | tee -a "$LOGFILE"
echo "final_eval_loss: $final_eval_loss" | tee -a "$LOGFILE"
echo "final_eval_wer: $final_eval_wer" | tee -a "$LOGFILE"
fi

View file

@@ -0,0 +1,387 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import itertools
import os
import time
import toml
import torch
import apex
from apex import amp
import random
import numpy as np
import math
from dataset import AudioToTextDataLayer
from helpers import monitor_asr_train_progress, process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, model_multi_gpu, print_dict, print_once
from model import AudioPreprocessing, CTCLossNM, GreedyCTCDecoder, Jasper
from optimizers import Novograd, AdamW
def lr_policy(initial_lr, step, N):
"""
learning rate decay
Args:
initial_lr: base learning rate
step: current iteration number
N: total number of iterations over which learning rate is decayed
"""
min_lr = 0.00001
res = initial_lr * ((N - step) / N) ** 2
return max(res, min_lr)
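# Worked example of the polynomial decay above (assuming initial_lr=0.015 and
# N=100000 total steps; values chosen for illustration only):
#   step 0     -> 0.015 * (100000/100000)**2 = 0.015
#   step 50000 -> 0.015 * ( 50000/100000)**2 = 0.00375
#   step 99999 -> ~1.5e-12, clamped to the 1e-5 floor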
def save(model, optimizer, epoch, output_dir):
"""
Saves model checkpoint
Args:
model: model
optimizer: optimizer
epoch: epoch of model training
output_dir: path to save model checkpoint
"""
class_name = model.__class__.__name__
unix_time = time.time()
file_name = "{0}_{1}-epoch-{2}.pt".format(class_name, unix_time, epoch)
print_once("Saving module {0} in {1}".format(class_name, os.path.join(output_dir, file_name)))
if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
model_to_save = model.module if hasattr(model, 'module') else model # only save the model itself
save_checkpoint={
'epoch': epoch,
'state_dict': model_to_save.state_dict(),
'optimizer': optimizer.state_dict()
}
torch.save(save_checkpoint, os.path.join(output_dir, file_name))
print_once('Saved.')
def train(
data_layer,
data_layer_eval,
model,
ctc_loss,
greedy_decoder,
optimizer,
optim_level,
labels,
multi_gpu,
args,
fn_lr_policy=None):
"""Trains model
Args:
data_layer: training data layer
data_layer_eval: evaluation data layer
model: model ( encapsulates data processing, encoder, decoder)
ctc_loss: loss function
greedy_decoder: greedy ctc decoder
optimizer: optimizer
optim_level: AMP optimization level
labels: list of output labels
multi_gpu: true if multi gpu training
args: script input argument list
fn_lr_policy: learning rate adjustment function
"""
def eval():
"""Evaluates model on evaluation dataset
"""
with torch.no_grad():
_global_var_dict = {
'EvalLoss': [],
'predictions': [],
'transcripts': [],
}
eval_dataloader = data_layer_eval.data_iterator
for data in eval_dataloader:
tensors = []
for d in data:
if isinstance(d, torch.Tensor):
tensors.append(d.cuda())
else:
tensors.append(d)
t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors
model.eval()
t_log_probs_e, t_encoded_len_e = model(x=(t_audio_signal_e, t_a_sig_length_e))
t_loss_e = ctc_loss(log_probs=t_log_probs_e, targets=t_transcript_e, input_length=t_encoded_len_e, target_length=t_transcript_len_e)
t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
values_dict = dict(
loss=[t_loss_e],
predictions=[t_predictions_e],
transcript=[t_transcript_e],
transcript_length=[t_transcript_len_e]
)
process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
# final aggregation (across all workers and minibatches) and logging of results
wer, eloss = process_evaluation_epoch(_global_var_dict)
print_once("==========>>>>>>Evaluation Loss: {0}\n".format(eloss))
print_once("==========>>>>>>Evaluation WER: {0}\n".format(wer))
print_once("Starting .....")
start_time = time.time()
train_dataloader = data_layer.data_iterator
epoch = args.start_epoch
step = epoch * args.step_per_epoch
while True:
if multi_gpu:
data_layer.sampler.set_epoch(epoch)
print_once("Starting epoch {0}, step {1}".format(epoch, step))
last_epoch_start = time.time()
batch_counter = 0
average_loss = 0
for data in train_dataloader:
tensors = []
for d in data:
if isinstance(d, torch.Tensor):
tensors.append(d.cuda())
else:
tensors.append(d)
if batch_counter == 0:
if fn_lr_policy is not None:
adjusted_lr = fn_lr_policy(step)
for param_group in optimizer.param_groups:
param_group['lr'] = adjusted_lr
optimizer.zero_grad()
last_iter_start = time.time()
t_audio_signal_t, t_a_sig_length_t, t_transcript_t, t_transcript_len_t = tensors
model.train()
t_log_probs_t, t_encoded_len_t = model(x=(t_audio_signal_t, t_a_sig_length_t))
t_loss_t = ctc_loss(log_probs=t_log_probs_t, targets=t_transcript_t, input_length=t_encoded_len_t, target_length=t_transcript_len_t)
if args.gradient_accumulation_steps > 1:
t_loss_t = t_loss_t / args.gradient_accumulation_steps
if optim_level in AmpOptimizations:
with amp.scale_loss(t_loss_t, optimizer) as scaled_loss:
scaled_loss.backward()
else:
t_loss_t.backward()
batch_counter += 1
average_loss += t_loss_t.item()
if batch_counter % args.gradient_accumulation_steps == 0:
optimizer.step()
if step % args.train_frequency == 0:
t_predictions_t = greedy_decoder(log_probs=t_log_probs_t)
e_tensors = [t_predictions_t, t_transcript_t, t_transcript_len_t]
train_wer = monitor_asr_train_progress(e_tensors, labels=labels)
print_once("Loss@Step: {0} ::::::: {1}".format(step, str(average_loss)))
print_once("Step time: {0} seconds".format(time.time() - last_iter_start))
if step > 0 and step % args.eval_frequency == 0:
print_once("Doing Evaluation ....................... ...... ... .. . .")
eval()
step += 1
batch_counter = 0
average_loss = 0
if args.num_steps is not None and step >= args.num_steps:
break
if args.num_steps is not None and step >= args.num_steps:
break
print_once("Finished epoch {0} in {1}".format(epoch, time.time() - last_epoch_start))
epoch += 1
if epoch % args.save_frequency == 0 and epoch > 0:
save(model, optimizer, epoch, output_dir=args.output_dir)
if args.num_steps is None and epoch >= args.num_epochs:
break
print_once("Done in {0}".format(time.time() - start_time))
print_once("Final Evaluation ....................... ...... ... .. . .")
eval()
save(model, optimizer, epoch, output_dir=args.output_dir)
def main(args):
random.seed(args.seed)
np.random.seed(args.seed)
torch.manual_seed(args.seed)
assert(torch.cuda.is_available())
torch.backends.cudnn.benchmark = args.cudnn
# set up distributed training
if args.local_rank is not None:
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
multi_gpu = torch.distributed.is_initialized()
if multi_gpu:
print_once("DISTRIBUTED TRAINING with {} gpus".format(torch.distributed.get_world_size()))
# define AMP optimization level
if args.fp16:
optim_level = Optimization.mxprO1
else:
optim_level = Optimization.mxprO0
jasper_model_definition = toml.load(args.model_toml)
dataset_vocab = jasper_model_definition['labels']['labels']
ctc_vocab = add_ctc_labels(dataset_vocab)
train_manifest = args.train_manifest
val_manifest = args.val_manifest
featurizer_config = jasper_model_definition['input']
featurizer_config_eval = jasper_model_definition['input_eval']
featurizer_config["optimization_level"] = optim_level
featurizer_config_eval["optimization_level"] = optim_level
sampler_type = featurizer_config.get("sampler", 'default')
perturb_config = jasper_model_definition.get('perturb', None)
if args.pad_to_max:
assert(args.max_duration > 0)
featurizer_config['max_duration'] = args.max_duration
featurizer_config_eval['max_duration'] = args.max_duration
featurizer_config['pad_to'] = "max"
featurizer_config_eval['pad_to'] = "max"
print_once('model_config')
print_dict(jasper_model_definition)
if args.gradient_accumulation_steps < 1:
raise ValueError('Invalid gradient accumulation steps parameter {}'.format(args.gradient_accumulation_steps))
if args.batch_size % args.gradient_accumulation_steps != 0:
raise ValueError('batch size {} is not divisible by gradient accumulation steps {}'.format(args.batch_size, args.gradient_accumulation_steps))
data_layer = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config,
perturb_config=perturb_config,
manifest_filepath=train_manifest,
labels=dataset_vocab,
batch_size=args.batch_size // args.gradient_accumulation_steps,
multi_gpu=multi_gpu,
pad_to_max=args.pad_to_max,
sampler=sampler_type)
data_layer_eval = AudioToTextDataLayer(
dataset_dir=args.dataset_dir,
featurizer_config=featurizer_config_eval,
manifest_filepath=val_manifest,
labels=dataset_vocab,
batch_size=args.batch_size,
multi_gpu=multi_gpu,
pad_to_max=args.pad_to_max
)
model = Jasper(feature_config=featurizer_config, jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
if args.ckpt is not None:
print_once("loading model from {}".format(args.ckpt))
checkpoint = torch.load(args.ckpt, map_location="cpu")
model.load_state_dict(checkpoint['state_dict'], strict=True)
args.start_epoch = checkpoint['epoch']
else:
args.start_epoch = 0
ctc_loss = CTCLossNM( num_classes=len(ctc_vocab))
greedy_decoder = GreedyCTCDecoder()
print_once("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
print_once("Number of parameters in decode: {0}".format(model.jasper_decoder.num_weights()))
N = len(data_layer)
if sampler_type == 'default':
args.step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
elif sampler_type == 'bucket':
args.step_per_epoch = int(len(data_layer.sampler) / args.batch_size )
print_once('-----------------')
print_once('Have {0} examples to train on.'.format(N))
print_once('Have {0} steps / (gpu * epoch).'.format(args.step_per_epoch))
print_once('-----------------')
fn_lr_policy = lambda s: lr_policy(args.lr, s, args.num_epochs * args.step_per_epoch)
model.cuda()
if args.optimizer_kind == "novograd":
optimizer = Novograd(model.parameters(),
lr=args.lr,
weight_decay=args.weight_decay)
elif args.optimizer_kind == "adam":
optimizer = AdamW(model.parameters(),
lr=args.lr,
weight_decay=args.weight_decay)
else:
raise ValueError("invalid optimizer choice: {}".format(args.optimizer_kind))
if optim_level in AmpOptimizations:
model, optimizer = amp.initialize(
min_loss_scale=1.0,
models=model,
optimizers=optimizer,
opt_level=AmpOptimizations[optim_level])
if args.ckpt is not None:
optimizer.load_state_dict(checkpoint['optimizer'])
model = model_multi_gpu(model, multi_gpu)
train(
data_layer=data_layer,
data_layer_eval=data_layer_eval,
model=model,
ctc_loss=ctc_loss,
greedy_decoder=greedy_decoder,
optimizer=optimizer,
labels=ctc_vocab,
optim_level=optim_level,
multi_gpu=multi_gpu,
fn_lr_policy=fn_lr_policy if args.lr_decay else None,
args=args)
def parse_args():
parser = argparse.ArgumentParser(description='Jasper')
parser.add_argument("--local_rank", default=None, type=int)
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
parser.add_argument("--num_epochs", default=10, type=int, help='number of training epochs. if number of steps if specified will overwrite this')
parser.add_argument("--num_steps", default=None, type=int, help='if specified overwrites num_epochs and will only train for this number of iterations')
parser.add_argument("--save_freq", dest="save_frequency", default=300, type=int, help='number of epochs until saving checkpoint. will save at the end of training too.')
parser.add_argument("--eval_freq", dest="eval_frequency", default=200, type=int, help='number of iterations until doing evaluation on full dataset')
parser.add_argument("--train_freq", dest="train_frequency", default=25, type=int, help='number of iterations until printing training statistics on the past iteration')
parser.add_argument("--lr", default=1e-3, type=float, help='learning rate')
parser.add_argument("--weight_decay", default=1e-3, type=float, help='weight decay rate')
parser.add_argument("--train_manifest", type=str, required=True, help='relative path given dataset folder of training manifest file')
parser.add_argument("--model_toml", type=str, required=True, help='relative path given dataset folder of model configuration file')
parser.add_argument("--val_manifest", type=str, required=True, help='relative path given dataset folder of evaluation manifest file')
parser.add_argument("--max_duration", type=float, help='maximum duration of audio samples for training and evaluation')
parser.add_argument("--pad_to_max", action="store_true", default=False, help="pad sequence to max_duration")
parser.add_argument("--gradient_accumulation_steps", default=1, type=int, help='number of accumulation steps')
parser.add_argument("--optimizer", dest="optimizer_kind", default="novograd", type=str, help='optimizer')
parser.add_argument("--dataset_dir", dest="dataset_dir", required=True, type=str, help='root dir of dataset')
parser.add_argument("--lr_decay", action="store_true", default=False, help='use learning rate decay')
parser.add_argument("--cudnn", action="store_true", default=False, help="enable cudnn benchmark")
parser.add_argument("--fp16", action="store_true", default=False, help="use mixed precision training")
parser.add_argument("--output_dir", type=str, required=True, help='saves results in this directory')
parser.add_argument("--ckpt", default=None, type=str, help="if specified continues training from given checkpoint. Otherwise starts from beginning")
parser.add_argument("--seed", default=42, type=int, help='seed')
args=parser.parse_args()
return args
if __name__=="__main__":
args = parse_args()
print_dict(vars(args))
main(args)
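# Minimal single-GPU invocation (illustrative; the accompanying shell scripts
# show the full set of flags used for the published configuration):
#   python train.py --batch_size=16 --num_epochs=10 \
#       --dataset_dir=/datasets/LibriSpeech \
#       --train_manifest=/datasets/LibriSpeech/librispeech-train-clean-100-wav.json \
#       --val_manifest=/datasets/LibriSpeech/librispeech-dev-clean-wav.json \
#       --model_toml=configs/jasper10x5dr_sp_offline_specaugment.toml \
#       --output_dir=/results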

View file

@@ -0,0 +1,81 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/usr/bin/env python
import argparse
import os
import glob
import multiprocessing
import json
import pandas as pd
from preprocessing_utils import parallel_preprocess
parser = argparse.ArgumentParser(description='Preprocess LibriSpeech.')
parser.add_argument('--input_dir', type=str, required=True,
help='LibriSpeech collection input dir')
parser.add_argument('--dest_dir', type=str, required=True,
help='Output dir')
parser.add_argument('--output_json', type=str, default='./',
help='name of the output json file.')
parser.add_argument('-s','--speed', type=float, nargs='*',
help='Speed perturbation ratio')
parser.add_argument('--target_sr', type=int, default=None,
help='Target sample rate; '
'defaults to the input sample rate')
parser.add_argument('--overwrite', action='store_true',
help='Overwrite file if exists')
parser.add_argument('--parallel', type=int, default=multiprocessing.cpu_count(),
help='Number of parallel worker processes to use when processing audio files')
args = parser.parse_args()
args.input_dir = args.input_dir.rstrip('/')
args.dest_dir = args.dest_dir.rstrip('/')
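# Each LibriSpeech chapter directory ships a *.trans.txt file whose lines have
# the form "<utterance-id> <TRANSCRIPT>", e.g. "84-121123-0000 GO DO YOU HEAR"
# (example line illustrative); build_input_arr() splits each line on the first
# space.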
def build_input_arr(input_dir):
txt_files = glob.glob(os.path.join(input_dir, '**', '*.trans.txt'),
recursive=True)
input_data = []
for txt_file in txt_files:
rel_path = os.path.relpath(txt_file, input_dir)
with open(txt_file) as fp:
for line in fp:
fname, _, transcript = line.partition(' ')
input_data.append(dict(input_relpath=os.path.dirname(rel_path),
input_fname=fname+'.flac',
transcript=transcript))
return input_data
print("[%s] Scaning input dir..." % args.output_json)
dataset = build_input_arr(input_dir=args.input_dir)
print("[%s] Converting audio files..." % args.output_json)
dataset = parallel_preprocess(dataset=dataset,
input_dir=args.input_dir,
dest_dir=args.dest_dir,
target_sr=args.target_sr,
speed=args.speed,
overwrite=args.overwrite,
parallel=args.parallel)
print("[%s] Generating json..." % args.output_json)
df = pd.DataFrame(dataset, dtype=object)
# Save JSON with the json module; df.to_json() produces backslashes in file paths
dataset = df.to_dict(orient='records')
with open(args.output_json, 'w') as fp:
json.dump(dataset, fp, indent=2)
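# Each manifest record roughly has the shape below (field values illustrative;
# the per-file entries carry whatever sox.file_info.info() returns):
#   {"transcript": "go do you hear",
#    "files": [{"fname": "train-clean-100-wav/84/121123/84-121123-0000.wav",
#               "speed": 1, "duration": ..., "num_samples": ..., ...}, ...],
#    "original_duration": ..., "original_num_samples": ...}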

View file

@@ -0,0 +1,72 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/usr/bin/env python
import os
import argparse
import pandas as pd
from download_utils import download_file, md5_checksum, extract
parser = argparse.ArgumentParser(description='Download, verify and extract dataset files')
parser.add_argument('csv', type=str,
help='CSV file with urls and checksums to download.')
parser.add_argument('dest', type=str,
help='Download destination folder.')
parser.add_argument('-e', type=str, default=None,
help='Extraction destination folder. Defaults to the download folder if not provided')
parser.add_argument('--skip_download', action='store_true',
help='Skip downloading the files')
parser.add_argument('--skip_checksum', action='store_true',
help='Skip checksum')
parser.add_argument('--skip_extract', action='store_true',
help='Skip extracting files')
args = parser.parse_args()
args.e = args.e or args.dest
df = pd.read_csv(args.csv, delimiter=',')
if not args.skip_download:
for url in df.url:
fname = url.split('/')[-1]
print("Downloading %s:" % fname)
download_file(url=url, dest_folder=args.dest, fname=fname)
else:
print("Skipping file download")
if not args.skip_checksum:
for index, row in df.iterrows():
url = row['url']
md5 = row['md5']
fname = url.split('/')[-1]
fpath = os.path.join(args.dest, fname)
print("Verifing %s: " % fname, end='')
ret = md5_checksum(fpath=fpath, target_hash=md5)
print("Passed" if ret else "Failed")
else:
print("Skipping checksum")
if not args.skip_extract:
for url in df.url:
fname = url.split('/')[-1]
fpath = os.path.join(args.dest, fname)
print("Decompressing %s:" % fpath)
extract(fpath=fpath, dest_folder=args.e)
else:
print("Skipping file extraction")

View file

@@ -0,0 +1,68 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/usr/bin/env python
import hashlib
import requests
import os
import tarfile
import tqdm
def download_file(url, dest_folder, fname, overwrite=False):
fpath = os.path.join(dest_folder, fname)
if os.path.isfile(fpath):
if overwrite:
print("Overwriting existing file")
else:
print("File exists, skipping download.")
return
tmp_fpath = fpath + '.tmp'
r = requests.get(url, stream=True)
file_size = int(r.headers['Content-Length'])
chunk_size = 1024 * 1024 # 1MB
total_chunks = int(file_size / chunk_size)
with open(tmp_fpath, 'wb') as fp:
content_iterator = r.iter_content(chunk_size=chunk_size)
chunks = tqdm.tqdm(content_iterator, total=total_chunks,
unit='MB', desc=fpath, leave=True)
for chunk in chunks:
fp.write(chunk)
os.rename(tmp_fpath, fpath)
def md5_checksum(fpath, target_hash):
file_hash = hashlib.md5()
with open(fpath, "rb") as fp:
for chunk in iter(lambda: fp.read(1024*1024), b""):
file_hash.update(chunk)
return file_hash.hexdigest() == target_hash
def extract(fpath, dest_folder):
if fpath.endswith('.tar.gz'):
mode = 'r:gz'
elif fpath.endswith('.tar'):
mode = 'r:'
else:
raise IOError('fpath has unknown extension: %s' % fpath)
with tarfile.open(fpath, mode) as tar:
members = tar.getmembers()
for member in tqdm.tqdm(iterable=members, total=len(members), leave=True):
tar.extract(path=dest_folder, member=member)
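# Example usage (destination paths illustrative; the MD5 hash comes from the
# LibriSpeech CSV added in this commit):
#   download_file('http://www.openslr.org/resources/12/dev-clean.tar.gz',
#                 dest_folder='/datasets', fname='dev-clean.tar.gz')
#   ok = md5_checksum('/datasets/dev-clean.tar.gz',
#                     '42e2234ba48799c1f50f24a7926300a1')
#   extract('/datasets/dev-clean.tar.gz', dest_folder='/datasets')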

View file

@@ -0,0 +1,8 @@
url,md5
http://www.openslr.org/resources/12/dev-clean.tar.gz,42e2234ba48799c1f50f24a7926300a1
http://www.openslr.org/resources/12/dev-other.tar.gz,c8d0bcc9cca99d4f8b62fcc847357931
http://www.openslr.org/resources/12/test-clean.tar.gz,32fa31d27d2e1cad72775fee3f4849a9
http://www.openslr.org/resources/12/test-other.tar.gz,fb5a50374b501bb3bac4815ee91d3135
http://www.openslr.org/resources/12/train-clean-100.tar.gz,2a93770f6d5c6c964bc36631d331a522
http://www.openslr.org/resources/12/train-clean-360.tar.gz,c0e676e450a7ff2f54aeade5171606fa
http://www.openslr.org/resources/12/train-other-500.tar.gz,d1a0fd59409feb2c614ce4d30c387708

View file

@@ -0,0 +1,76 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#!/usr/bin/env python
import os
import multiprocessing
import librosa
import functools
import sox
from tqdm import tqdm
def preprocess(data, input_dir, dest_dir, target_sr=None, speed=None,
overwrite=True):
speed = speed or []
speed.append(1)
speed = list(set(speed)) # make unique
input_fname = os.path.join(input_dir,
data['input_relpath'],
data['input_fname'])
input_sr = sox.file_info.sample_rate(input_fname)
target_sr = target_sr or input_sr
os.makedirs(os.path.join(dest_dir, data['input_relpath']), exist_ok=True)
output_dict = {}
output_dict['transcript'] = data['transcript'].lower().strip()
output_dict['files'] = []
fname = os.path.splitext(data['input_fname'])[0]
for s in speed:
output_fname = fname + '{}.wav'.format('' if s==1 else '-{}'.format(s))
output_fpath = os.path.join(dest_dir,
data['input_relpath'],
output_fname)
if not os.path.exists(output_fpath) or overwrite:
cbn = sox.Transformer().speed(factor=s).convert(target_sr)
cbn.build(input_fname, output_fpath)
file_info = sox.file_info.info(output_fpath)
file_info['fname'] = os.path.join(os.path.basename(dest_dir),
data['input_relpath'],
output_fname)
file_info['speed'] = s
output_dict['files'].append(file_info)
if s == 1:
file_info = sox.file_info.info(output_fpath)
output_dict['original_duration'] = file_info['duration']
output_dict['original_num_samples'] = file_info['num_samples']
return output_dict
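# For example, with speed=[0.9, 1.1] an input 1034-121119-0001.flac produces
# three WAVs (file name illustrative): 1034-121119-0001.wav,
# 1034-121119-0001-0.9.wav and 1034-121119-0001-1.1.wav; the unperturbed copy
# is always written because 1 is appended to the speed list above.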
def parallel_preprocess(dataset, input_dir, dest_dir, target_sr, speed, overwrite, parallel):
with multiprocessing.Pool(parallel) as p:
func = functools.partial(preprocess,
input_dir=input_dir, dest_dir=dest_dir,
target_sr=target_sr, speed=speed, overwrite=overwrite)
dataset = list(tqdm(p.imap(func, dataset), total=len(dataset)))
return dataset

View file

@@ -33,6 +33,9 @@ The examples are organized first by framework, such as TensorFlow, PyTorch, etc.
### Text to Speech
- __Tacotron & WaveGlow__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)]
### Speech Recognition
- __Jasper__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper)]
## NVIDIA support
In each of the network READMEs, we indicate the level of support that will be provided. The range is from ongoing updates and improvements to a point-in-time release for thought leadership.
