Adding Jasper/PyT
parent 71fdde7200
commit fa400a7367

4 PyTorch/SpeechRecognition/Jasper/.dockerignore Executable file
@@ -0,0 +1,4 @@
results/
*__pycache__
checkpoints/
datasets/

5 PyTorch/SpeechRecognition/Jasper/.gitignore vendored Executable file
@@ -0,0 +1,5 @@
__pycache__
*.pt
results/
datasets/
checkpoints/

34 PyTorch/SpeechRecognition/Jasper/Dockerfile Executable file
@@ -0,0 +1,34 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
FROM ${FROM_IMAGE_NAME}


WORKDIR /tmp/unique_for_apex
RUN pip uninstall -y apex || :
RUN pip uninstall -y apex || :

RUN SHA=ToUcHMe git clone https://github.com/NVIDIA/apex.git
WORKDIR /tmp/unique_for_apex/apex
RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .


RUN apt-get update && apt-get install -y libsndfile1 && apt-get install -y sox && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/jasper

COPY . .
RUN pip install --disable-pip-version-check -U -r requirements.txt

203 PyTorch/SpeechRecognition/Jasper/LICENSE Normal file
@@ -0,0 +1,203 @@
Except where otherwise noted, the following license applies to all files in this repo.

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2019 NVIDIA Corporation

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

5 PyTorch/SpeechRecognition/Jasper/NOTICE Normal file
@@ -0,0 +1,5 @@
Jasper in PyTorch

This repository includes source code (in "parts/") from:
* https://github.com/keithito/tacotron and https://github.com/ryanleary/patter licensed under MIT license.

816 PyTorch/SpeechRecognition/Jasper/README.md Normal file
@@ -0,0 +1,816 @@
# Jasper For PyTorch

This repository provides a script and recipe to train the Jasper model to achieve the state-of-the-art accuracy of the paper's acoustic model, and is tested and maintained by NVIDIA.

## Table Of Contents
- [Model overview](#model-overview)
  * [Model architecture](#model-architecture)
  * [Default configuration](#default-configuration)
  * [Feature support matrix](#feature-support-matrix)
    * [Features](#features)
  * [Mixed precision training](#mixed-precision-training)
    * [Enabling mixed precision](#enabling-mixed-precision)
  * [Glossary](#glossary)
- [Setup](#setup)
  * [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
  * [Scripts and sample code](#scripts-and-sample-code)
  * [Parameters](#parameters)
  * [Command-line options](#command-line-options)
  * [Getting the data](#getting-the-data)
    * [Dataset guidelines](#dataset-guidelines)
  * [Training process](#training-process)
  * [Inference process](#inference-process)
  * [Evaluation process](#evaluation-process)
- [Performance](#performance)
  * [Benchmarking](#benchmarking)
    * [Training performance benchmark](#training-performance-benchmark)
    * [Inference performance benchmark](#inference-performance-benchmark)
  * [Results](#results)
    * [Training accuracy results](#training-accuracy-results)
      * [Training accuracy: NVIDIA DGX-1 (8x V100 32G)](#training-accuracy-nvidia-dgx-1-8x-v100-32G)
      * [Training stability test](#training-stability-test)
    * [Training performance results](#training-performance-results)
      * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16G)
      * [Training performance: NVIDIA DGX-1 (8x V100 32G)](#training-performance-nvidia-dgx-1-8x-v100-32G)
      * [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32G)
    * [Inference performance results](#inference-performance-results)
      * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16G)
      * [Inference performance: NVIDIA DGX-1 (1x V100 32G)](#inference-performance-nvidia-dgx-1-1x-v100-32G)
      * [Inference performance: NVIDIA DGX-2 (1x V100 32G)](#inference-performance-nvidia-dgx-2-1x-v100-32G)
      * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
  * [Changelog](#changelog)
  * [Known issues](#known-issues)

## Model overview

This repository provides an implementation of the Jasper model in PyTorch from the paper [Jasper: An End-to-End Convolutional Neural Acoustic Model](https://arxiv.org/pdf/1904.03288.pdf).
The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting the strict real-time requirements of deployed ASR systems.

The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.

This repository is a PyTorch implementation of Jasper and provides scripts to train the Jasper 10x5 model with dense residuals from scratch on the [LibriSpeech](http://www.openslr.org/12) dataset to achieve the greedy decoding results of the original paper.
The original reference code provides Jasper as part of a research toolkit in TensorFlow, [openseq2seq](https://github.com/NVIDIA/OpenSeq2Seq).
This repository provides a simple implementation of Jasper with scripts for training and replicating the Jasper paper results, including data preparation, training, and inference scripts.
Both the training and inference scripts offer the option to use Automatic Mixed Precision (AMP) to benefit from Tensor Cores for better performance.

In addition to providing the hyperparameters for training a model checkpoint, we publish a thorough inference analysis across different NVIDIA GPU platforms, for example, DGX-1, DGX-2 and T4.

This model is trained with mixed precision using Tensor Cores on NVIDIA Volta GPUs and evaluated on Volta and Turing GPUs. Therefore, researchers can get results 3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

The original paper takes the output of the Jasper acoustic model and shows results for 3 different decoding variations: greedy decoding, beam search with a 6-gram language model, and beam search with further rescoring of the best-ranked hypotheses with Transformer-XL, a neural language model. Beam search and the rescoring with the neural language model scores are run on CPU and result in better word error rates compared to greedy decoding.
This repository provides instructions to reproduce the greedy decoding results. To run beam search or rescoring with Transformer-XL, use the following scripts from the [openseq2seq](https://github.com/NVIDIA/OpenSeq2Seq) repository:
https://github.com/NVIDIA/OpenSeq2Seq/blob/master/scripts/decode.py
https://github.com/NVIDIA/OpenSeq2Seq/tree/master/external_lm_rescore
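
Jasper is trained with CTC loss, so greedy decoding amounts to taking the most likely symbol per frame, collapsing repeats, and dropping blanks. A minimal sketch of this idea (the symbol table and blank index are illustrative; the real values come from the model configuration):

```python
# A minimal sketch of greedy (best-path) CTC decoding; the symbol table and
# blank index are illustrative, not the repository's actual values.
import numpy as np

def greedy_ctc_decode(log_probs: np.ndarray, labels: str, blank_id: int) -> str:
    """log_probs: [time, vocab] per-frame log-probabilities from the acoustic model."""
    best_path = log_probs.argmax(axis=1)       # most likely symbol per frame
    decoded, prev = [], blank_id
    for idx in best_path:
        if idx != prev and idx != blank_id:    # collapse repeats, drop blanks
            decoded.append(labels[idx])
        prev = int(idx)
    return "".join(decoded)
```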

### Model architecture

Details on the model architecture can be found in the paper [Jasper: An End-to-End Convolutional Neural Acoustic Model](https://arxiv.org/pdf/1904.03288.pdf).

|<img src="images/jasper_model.png" width="100%" height="40%"> | <img src="images/jasper_dense_residual.png" width="100%" height="40%">|
|:---:|:---:|
|Figure 1: Jasper BxR model: B - number of blocks, R - number of sub-blocks | Figure 2: Jasper Dense Residual |

Jasper is an end-to-end neural acoustic model that is based on convolutions.
In the audio processing stage, each frame is transformed into mel-scale spectrogram features, which the acoustic model takes as input and outputs a probability distribution over the vocabulary for each frame.
The acoustic model has a modular block structure and can be parametrized accordingly:
a Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.

Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.

Each block input is connected directly to the last sub-block of all following blocks via a residual connection, which is referred to as `dense residual` in the paper.
Every block differs in kernel size and number of filters, which increase in size from the bottom to the top layers.
Irrespective of the exact block configuration parameters B and R, every Jasper model has four additional convolutional blocks:
one immediately succeeding the input layer (Prologue) and three at the end of the B blocks (Epilogue).

The Prologue decimates the audio signal in time in order to process a shorter time sequence for efficiency. The Epilogue with dilation captures a bigger context around an audio time step, which decreases the model word error rate (WER).
The paper achieves its best results with Jasper 10x5 with dense residual connections, which is also the focus of this repository and is referred to below as Jasper Large.
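
Expressed in PyTorch, one such sub-block is a short pipeline of the four operations named above. A minimal sketch with illustrative hyperparameters (the shipped model additionally uses masked convolutions, omitted here):

```python
# A minimal sketch of one Jasper sub-block; channel counts, kernel size and
# dropout rate are illustrative, not values from the shipped configs.
import torch.nn as nn

def jasper_subblock(in_ch: int, out_ch: int, kernel: int, dropout: float) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),  # 1D convolution over time
        nn.BatchNorm1d(out_ch),                                 # batch normalization
        nn.ReLU(inplace=True),                                  # ReLU activation
        nn.Dropout(dropout),                                    # dropout
    )
```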

### Default configuration

The following features were implemented in this model:

* GPU-supported feature extraction with data augmentation options [SpecAugment](https://arxiv.org/abs/1904.08779) and [Cutout](https://arxiv.org/pdf/1708.04552.pdf)
* offline and online [Speed Perturbation](https://www.danielpovey.com/files/2015_interspeech_augmentation.pdf)
* data-parallel multi-GPU training and evaluation
* AMP with dynamic loss scaling for Tensor Core training
* FP16 inference with AMP

Competitive training results and analysis are provided for the following Jasper model configuration:

| **Model** | **Number of Blocks** | **Number of Subblocks** | **Max sequence length** | **Number of Parameters** |
|--- |--- |--- |--- |--- |
| Jasper Large | 10 | 5 | 16.7s | 333M |

### Feature support matrix

The following features are supported by this model.

| **Feature** | **Jasper** |
|--- |--- |
|[Apex AMP](https://nvidia.github.io/apex/amp.html) | Yes |
|[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) | Yes |

#### Features

[Apex AMP](https://nvidia.github.io/apex/amp.html) - a tool that enables Tensor Core-accelerated training. Refer to the [Enabling mixed precision](#enabling-mixed-precision) section for more details.

[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html#apex.parallel.DistributedDataParallel) - a module wrapper that enables easy multiprocess distributed data parallel training, similar to [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel). `DistributedDataParallel` is optimized for use with [NCCL](https://github.com/NVIDIA/nccl). It achieves high performance by overlapping communication with computation during `backward()` and bucketing smaller gradient transfers to reduce the total number of transfers required.

### Mixed precision training

*Mixed precision* is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.

For information about:
* How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
* Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
* APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) blog.

#### Enabling mixed precision

For training, mixed precision can be enabled by passing the `--fp16` flag to `train.py`. To execute training in single precision instead, remove the `--fp16` flag. For example, in the bash scripts `scripts/train.sh`, `scripts/inference.sh`, etc., the precision can be specified with the variable `PRECISION` by setting it to either `PRECISION='fp16'` or `PRECISION='fp32'`.

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from [APEX](https://github.com/NVIDIA/apex), which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In PyTorch, loss scaling can be easily applied by using the `scale_loss()` method provided by AMP. The scaling value to be used can be [dynamic](https://nvidia.github.io/apex/amp.html#apex.amp.initialize) or fixed.

For an in-depth walk-through of AMP, check out sample usage [here](https://nvidia.github.io/apex/amp.html#). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage Tensor Core performance.

The following steps were needed to enable mixed precision training in Jasper:

* Import AMP from APEX (file: `train.py`):
```
from apex import amp
```

* Initialize AMP and wrap the model and the optimizer:
```
model, optimizer = amp.initialize(
    min_loss_scale=1.0,
    models=model,
    optimizers=optimizer,
    opt_level='O1')
```

* Apply the `scale_loss` context manager:
```
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
```
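
Taken together, these three steps slot into a standard training loop roughly as follows (a condensed sketch, not the literal loop from `train.py`; `model`, `optimizer`, `criterion` and `loader` are assumed to exist):

```python
# A condensed sketch of an AMP training step; names are assumed to exist.
from apex import amp

model, optimizer = amp.initialize(
    min_loss_scale=1.0, models=model, optimizers=optimizer, opt_level='O1')

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()   # backpropagate on the scaled loss
    optimizer.step()
```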

### Glossary

Acoustic model:
Assigns a probability distribution over a vocabulary of characters given an audio frame.

Language model:
Assigns a probability distribution over a sequence of words. Given a sequence of words, it assigns a probability to the whole sequence.

Pre-training:
Training a model on vast amounts of data on the same (or different) task to build general understanding.

Automatic Speech Recognition (ASR):
Uses both an acoustic model and a language model to output the transcript of an input audio signal.

## Setup

The following section lists the requirements in order to start training and evaluating the Jasper model.

### Requirements

This repository contains a `Dockerfile` which extends the PyTorch 19.06-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 19.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU

Further required Python packages are listed in `requirements.txt` and are automatically installed when the Docker container is built. To install them manually, run:
```bash
pip install -r requirements.txt
```

For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:

* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)

For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html).

## Quick Start Guide

To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Jasper model on the LibriSpeech dataset. For details concerning training and inference, see [Advanced](#advanced).

1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```
2. Build the Jasper PyTorch container.

Running the following script will build the container, which contains all the required dependencies for data download and processing as well as training and inference of the model.

```bash
bash scripts/docker/build.sh
```

3. Start an interactive session in the NGC container to run data download/training/inference.

```bash
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
```
Within the container, the contents of this repository will be copied to the `/workspace/jasper` directory. The `/datasets`, `/checkpoints`, `/results` directories are mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host.

4. Download and preprocess the dataset.

No GPU is required for data download and preprocessing. Therefore, if GPU usage is a limited resource, launch the container for this section on a CPU machine by following Steps 2 and 3.

Note: Downloading and preprocessing the dataset requires 500GB of free disk space and can take several hours to complete.

This repository provides scripts to download and extract the following dataset:

* LibriSpeech [http://www.openslr.org/12](http://www.openslr.org/12)

LibriSpeech contains 1000 hours of 16kHz read English speech derived from public domain audiobooks from the LibriVox project and has been carefully segmented and aligned. For more information, see the [LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS](http://www.danielpovey.com/files/2015_icassp_librispeech.pdf) paper.

Inside the container, download and extract the datasets into the required format for later training and inference:
```bash
bash scripts/download_librispeech.sh
```
Once the data download is complete, the following folders should exist:

* `/datasets/LibriSpeech/`
   * `train-clean-100/`
   * `train-clean-360/`
   * `train-other-500/`
   * `dev-clean/`
   * `dev-other/`
   * `test-clean/`
   * `test-other/`

Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see Step 3), once the dataset is downloaded it is accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.

Next, convert the data into WAV files and add speed-perturbed copies at rates 0.9 and 1.1 for the training files:
```bash
bash scripts/preprocess_librispeech.sh
```
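
The container installs `sox`, whose `speed` effect produces this kind of offline perturbation. A rough sketch of the idea (file names are placeholders; the actual preprocessing script may invoke it differently):

```python
# Sketch of offline speed perturbation via the sox CLI; file names are
# placeholders and the real preprocessing script may differ in detail.
import subprocess

for rate in ("0.9", "1.1"):
    subprocess.run(
        ["sox", "utt.wav", f"utt_speed{rate}.wav", "speed", rate],
        check=True,  # raise if sox reports an error
    )
```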
Once the data is converted, the following additional files and folders should exist:
* `datasets/LibriSpeech/`
   * `librispeech-train-clean-100-wav.json`
   * `librispeech-train-clean-360-wav.json`
   * `librispeech-train-other-500-wav.json`
   * `librispeech-dev-clean-wav.json`
   * `librispeech-dev-other-wav.json`
   * `librispeech-test-clean-wav.json`
   * `librispeech-test-other-wav.json`
   * `train-clean-100-wav/` contains WAV files with original speed, 0.9 and 1.1
   * `train-clean-360-wav/` contains WAV files with original speed, 0.9 and 1.1
   * `train-other-500-wav/` contains WAV files with original speed, 0.9 and 1.1
   * `dev-clean-wav/`
   * `dev-other-wav/`
   * `test-clean-wav/`
   * `test-other-wav/`

5. Start training.

Inside the container, use the following script to start training.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.

```bash
bash scripts/train.sh [OPTIONS]
```
By default, this will use automatic mixed precision, a batch size of 64 and run on a total of 8 GPUs. The hyperparameters are tuned for DGX-1 32GB 8x V100 GPUs and will require adjustment for 16GB GPUs (e.g. by using more gradient accumulation steps).

More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Training process](#training-process).

6. Start validation/evaluation.

Inside the container, use the following script to run evaluation.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/evaluation.sh [OPTIONS]
```
By default, this will use full precision, a batch size of 64 and run on a single GPU.

More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Evaluation process](#evaluation-process).

7. Start inference/predictions.

Inside the container, use the following script to run inference.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
```bash
bash scripts/inference.sh [OPTIONS]
```
By default, this will use full precision, a batch size of 64 and run on a single GPU.

More details on available [OPTIONS] can be found in [Parameters](#parameters) and [Inference process](#inference-process).

## Advanced

The following sections provide greater details of the dataset, running training and inference, and getting training and inference results.

### Scripts and sample code
In the `root` directory, the most important files are:
* `train.py` - Serves as the entry point for training
* `inference.py` - Serves as the entry point for inference and evaluation
* `model.py` - Contains the model architecture
* `dataset.py` - Contains the data loader and related functionality
* `optimizer.py` - Contains the optimizer
* `inference_benchmark.py` - Serves as the inference benchmarking script that measures the latency of pre-processing and the acoustic model
* `requirements.txt` - Contains the required dependencies that are installed when building the Docker container
* `Dockerfile` - Container with the basic set of dependencies to run Jasper

The `scripts/` folder encapsulates all the one-click scripts required for running various supported functionalities, such as:
* `train.sh` - Runs training using the `train.py` script
* `inference.sh` - Runs inference using the `inference.py` script
* `evaluation.sh` - Runs evaluation using the `inference.py` script
* `download_librispeech.sh` - Downloads the LibriSpeech dataset
* `preprocess_librispeech.sh` - Preprocesses the LibriSpeech raw data files to be ready for training and inference
* `inference_benchmark.sh` - Runs the inference benchmark using the `inference_benchmark.py` script
* `train_benchmark.sh` - Runs the training performance benchmark using the `train.py` script
* `docker/` - Contains the scripts for building and launching the container

Other folders included in the `root` directory are:
* `configs/` - Model configurations
* `utils/` - Contains the necessary files for data download and processing
* `parts/` - Contains the necessary files for data pre-processing

### Parameters

The complete list of available parameters for the `scripts/train.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: relative path to model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results, logs, and created checkpoints. (default: '/results')
CHECKPOINT: model checkpoint to continue training from. A model checkpoint is a dictionary object that contains, apart from the model weights, the optimizer state and the epoch number. If CHECKPOINT is "none", training starts from scratch. (default: "none")
CREATE_LOGFILE: boolean that indicates whether to create a training log that will be stored in `$RESULT_DIR`. (default: "true")
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'true')
NUM_GPUS: number of GPUs to use. (default: 8)
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp16')
EPOCHS: number of training epochs. (default: 400)
SEED: seed for the random number generator, used for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size. (default: 64)
LEARNING_RATE: initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until the optimizer updates weights. (default: 1)
LAUNCH_OPT: additional launch options. (default: "none")
```

The complete list of available parameters for the `scripts/inference.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: "true")
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'false')
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp32')
NUM_STEPS: number of inference steps. If -1, runs inference on the entire dataset. (default: -1)
SEED: seed for the random number generator, useful for ensuring reproducibility. (default: 6)
BATCH_SIZE: data batch size. (default: 64)
MODELOUTPUT_FILE: destination path for serialized model output with binary protocol. If 'none', does not save model output. (default: 'none')
PREDICTION_FILE: destination path for saving predictions. If 'none', does not save predictions. (default: '${RESULT_DIR}/${DATASET}.predictions')
```

The complete list of available parameters for the `scripts/evaluation.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
DATASET: name of dataset to use. (default: 'dev-clean')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CHECKPOINT: model checkpoint path. (required)
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: 'true')
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'false')
NUM_GPUS: number of GPUs to run evaluation on. (default: 1)
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp32')
NUM_STEPS: number of inference steps per GPU. If -1, runs inference on the entire dataset. (default: -1)
SEED: seed for the random number generator, useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size. (default: 64)
```

The `scripts/inference_benchmark.sh` script pads all input to the same length and computes the mean and the 90th, 95th and 99th percentile of latency for the specified number of inference steps. Latency is measured in milliseconds per batch. The `scripts/inference_benchmark.sh` script measures latency for a single GPU and extends `scripts/inference.sh` by:
```bash
MAX_DURATION: filters out input audio data that exceeds a maximum number of seconds. This ensures that when all filtered audio samples are padded to maximum length, that length will stay under this specified threshold. (default: 36)
```
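
Given the per-batch latency measurements, the reported statistics amount to the following (a sketch, not the benchmark script's literal code):

```python
# Aggregating per-batch latencies into the reported statistics (a sketch).
import numpy as np

def latency_stats(latencies_ms):
    """Mean and tail percentiles of per-batch latency in milliseconds."""
    return {
        "mean": np.mean(latencies_ms),
        "p90": np.percentile(latencies_ms, 90),
        "p95": np.percentile(latencies_ms, 95),
        "p99": np.percentile(latencies_ms, 99),
    }
```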

The `scripts/train_benchmark.sh` script pads all input to the same length according to the input argument `MAX_DURATION` and measures average training latency and throughput performance. Latency is measured in seconds per batch, throughput in sequences per second.
The complete list of available parameters for the `scripts/train_benchmark.sh` script contains:
```bash
DATA_DIR: directory of dataset. (default: '/datasets/LibriSpeech')
MODEL_CONFIG: model configuration. (default: 'configs/jasper10x5dr_sp_offline_specaugment.toml')
RESULT_DIR: directory for results and logs. (default: '/results')
CREATE_LOGFILE: boolean that indicates whether to create a log file that will be stored in `$RESULT_DIR`. (default: 'true')
CUDNN_BENCHMARK: boolean that indicates whether to enable cudnn benchmark mode for using more optimized kernels. (default: 'true')
NUM_GPUS: number of GPUs to use. (default: 8)
PRECISION: options are fp32 and fp16 with AMP. (default: 'fp16')
NUM_STEPS: number of training iterations. If -1, runs full training for 400 epochs. (default: -1)
MAX_DURATION: filters out input audio data that exceeds a maximum number of seconds. This ensures that when all filtered audio samples are padded to maximum length, that length will stay under this specified threshold. (default: 16.7)
SEED: seed for the random number generator, useful for ensuring reproducibility. (default: 0)
BATCH_SIZE: data batch size. (default: 64)
LEARNING_RATE: initial learning rate. (default: 0.015)
GRADIENT_ACCUMULATION_STEPS: number of gradient accumulation steps until the optimizer updates weights. (default: 1)
PRINT_FREQUENCY: number of iterations after which training progress is printed. (default: 1)
```

### Command-line options

To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option with the Python file, for example:

```bash
python train.py --help
python inference.py --help
```

### Getting the data

The Jasper model was trained on the LibriSpeech dataset. We use the concatenation of `train-clean-100`, `train-clean-360` and `train-other-500` for training and `dev-clean` for validation.

This repository contains the `scripts/download_librispeech.sh` and `scripts/preprocess_librispeech.sh` scripts, which automatically download and preprocess the training, test and development datasets. By default, data is downloaded to the `/datasets/LibriSpeech` directory. A minimum of 500GB of free space is required for download and preprocessing; the final preprocessed dataset is 320GB.

#### Dataset guidelines

The `scripts/preprocess_librispeech.sh` script converts the input audio files to WAV format with a sample rate of 16kHz; target transcripts are stripped of surrounding whitespace characters, then lower-cased. For `train-clean-100`, `train-clean-360` and `train-other-500`, it also creates speed-perturbed versions with rates of 0.9 and 1.1 for data augmentation.

After preprocessing, the script creates JSON files with output file paths, sample rate, target transcript and other metadata. These JSON files are used by the training script to identify training and validation datasets.
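
Conceptually, each JSON entry ties an audio file to its duration and target transcript. The entry below is purely illustrative; the field names are hypothetical, so inspect the generated JSON files for the exact schema:

```python
# A hypothetical manifest entry; field names and values are illustrative only.
example_entry = {
    "audio_filepath": "/datasets/LibriSpeech/dev-clean-wav/1272-128104-0000.wav",
    "duration": 5.86,
    "transcript": "mister quilter is the apostle of the middle classes",
}
```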

The Jasper model was tuned on audio signals with a sample rate of 16kHz. If you wish to use a different sampling rate, some hyperparameters might need to be changed; specifically, the window size and step size.

### Training process

Training is performed using the `train.py` script along with parameters defined in `scripts/train.sh`.
The `scripts/train.sh` script runs a job on a single node that trains the Jasper model from scratch using LibriSpeech as training data. To make training more efficient, we discard audio samples longer than 16.7 seconds from the training dataset; these amount to less than 1% of all samples. Such filtering does not degrade accuracy, but it allows us to decrease the number of time steps in a batch, which requires less GPU memory and increases training speed.
Apart from the arguments listed in the [Parameters](#parameters) section, by default the training script:

* Runs on 8 32GB V100 GPUs with training and evaluation batch size 64
* Uses FP16 precision with AMP optimization level O1 (default)
* Enables cudnn benchmark to make mixed precision training faster
* Trains on the concatenation of all 3 LibriSpeech training datasets and evaluates on the LibriSpeech dev-clean dataset
* Uses a seed of 6
* Runs for 400 epochs
* Uses an initial learning rate of 0.015 and polynomial (quadratic) learning rate decay
* Saves a checkpoint every 10 epochs
* Runs evaluation on the development dataset every 100 iterations and at the end of training
* Prints out training progress every 25 iterations
* Creates a log file with training progress
* Uses offline speed perturbed data
* Uses SpecAugment in data pre-processing
* Filters out audio samples longer than 16.7 seconds
* Pads each sequence in a batch to the same length (the smallest multiple of 16 that is at least the length of the longest sequence in the batch; see the sketch below)
* Uses masked convolutions and dense residuals as described in the paper
* Uses a weight decay of 0.001
* Uses 1 gradient accumulation step
* Uses [Novograd](https://arxiv.org/pdf/1905.11286.pdf) as the optimizer with betas=(0.95, 0.98)

These parameters will match the greedy WER [Results](#results) of the Jasper paper on a DGX-1 with 32GB V100 GPUs.
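
The padding rule mentioned in the list above reduces to rounding the longest sequence length in the batch up to the next multiple of 16; a minimal sketch:

```python
# Round the longest sequence length in a batch up to the nearest multiple
# of 16 (a sketch of the padding rule described above).
def padded_length(longest: int, multiple: int = 16) -> int:
    return ((longest + multiple - 1) // multiple) * multiple

assert padded_length(100) == 112  # 100 -> next multiple of 16
assert padded_length(112) == 112  # already a multiple of 16
```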

### Inference process

Inference is performed using the `inference.py` script along with parameters defined in `scripts/inference.sh`.
The `scripts/inference.sh` script runs the job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
Apart from the arguments listed in the [Parameters](#parameters) section, by default the inference script:

* Evaluates on the LibriSpeech dev-clean dataset
* Uses full precision
* Uses a batch size of 64
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which will be stored in the results folder
* Pads each sequence in a batch to the same length (the smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Does not use data augmentation
* Does greedy decoding and saves the transcription in the results folder
* Has the option to save the model output tensors for more complex decoding, for example, beam search
* Has cudnn benchmark disabled

### Evaluation process

Evaluation is performed using the `inference.py` script along with parameters defined in `scripts/evaluation.sh`.
The `scripts/evaluation.sh` script runs a job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
Apart from the arguments listed in the [Parameters](#parameters) section, by default the evaluation script:

* Uses a batch size of 64
* Evaluates the LibriSpeech dev-clean dataset
* Uses full precision
* Runs for 1 epoch and prints out the final word error rate
* Creates a log file with progress and results which is saved in the results folder
* Pads each sequence in a batch to the same length (the smallest multiple of 16 that is at least the length of the longest sequence in the batch)
* Does not use data augmentation
* Has cudnn benchmark disabled
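
The word error rate printed by the inference and evaluation scripts is the standard metric: word-level edit distance between hypothesis and reference, divided by the number of reference words. A reference sketch (not the repository's implementation):

```python
# Word error rate as word-level Levenshtein distance over reference length
# (a reference sketch, not the repository's implementation).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / len(ref)
```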
|
||||
|
||||
|
||||
|
||||
## Performance
|
||||
|
||||
### Benchmarking
|
||||
|
||||
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
|
||||
|
||||
#### Training performance benchmark
|
||||
|
||||
To benchmark the training performance on a specific batch size and audio length, run:
|
||||
|
||||
```bash
|
||||
bash scripts/train_benchmark.sh <DATA_DIR> <MODEL_CONFIG> <RESULT_DIR> <CREATE_LOGFILE> <CUDNN_BENCHMARK> <NUM_GPUS> <PRECISION> <NUM_STEPS> <MAX_DURATION> <SEED> <BATCH_SIZE>
|
||||
<LEARNING_RATE> <GRADIENT_ACCUMULATION_STEPS> <PRINT_FREQUENCY>
|
||||
```
|
||||
|
||||
By default, this script runs 400 epochs on the configuration `configs/jasper10x5dr_sp_offline_specaugment.toml` using full precision
|
||||
and batch size 64 on a single node with 8x 32GB V100 GPUs cards.
|
||||
By default, `NUM_STEPS=-1` means training is run for 400 EPOCHS. If `$NUM_STEPS > 0` is specified, training is only run for a user-defined number of iterations. Audio samples longer than `MAX_DURATION` are filtered out, the remaining ones are padded to this duration such that all batches have the same length. At the end of training the script saves the model checkpoint to the results folder, runs evaluation on LibriSpeech dev-clean dataset, and prints out information such as average training latency performance in seconds, average training throughput in sequences per second, final training loss, final training WER, evaluation loss and evaluation WER.
|
||||
|
||||
|
||||
|
||||
#### Inference performance benchmark
|
||||
|
||||
To benchmark the inference performance on a specific batch size and audio length, run:
|
||||
|
||||
```bash
|
||||
bash scripts/inference_benchmark.sh <DATA_DIR> <DATASET> <MODEL_CONFIG> <RESULT_DIR> <CHECKPOINT> <CREATE_LOGFILE> <CUDNN_BENCHMARK> <PRECISION> <NUM_GPUS> <MAX_DURATION>
|
||||
<SEED> <BATCH_SIZE>
|
||||
```
|
||||
By default, the script runs on a single GPU and evaluates on the entire dataset using the model configuration `configs/jasper10x5dr_sp_offline_specaugment.toml`, full precision, cudnn benchmark for faster fp16 inference and batch size 64.
|
||||
By default, `MAX_DURATION` is set to 36 seconds, which covers the maximum audio length. All audio samples are padded to this length. The script prints out `MAX_DURATION`, `BATCH_SIZE` and latency performance in milliseconds per batch.
|
||||
|
||||
|
||||
|
||||
### Results
|
||||
|
||||
The following sections provide details on how we achieved our performance and accuracy in training and inference.
|
||||
All results are trained on 960 hours of LibriSpeech with a maximum audio length of 16.7s. The training is evaluated
|
||||
on LibriSpeech dev-clean, dev-other, test-clean, test-other.
|
||||
The results for Jasper Large's word error rate from the original paper after greedy decoding are shown below:
|
||||
|
||||
|
||||
|
||||
| **Number of GPUs** | **dev-clean WER** | **dev-other WER**| **test-clean WER**| **test-other WER**
|
||||
|--- |--- |--- |--- |--- |
|
||||
|8 | 3.64| 11.89| 3.86 | 11.95
|
||||
|
||||
|
||||
#### Training accuracy results
|
||||
|
||||
##### Training accuracy: NVIDIA DGX-1 (8x V100 32G)
|
||||
|
||||
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.06-py3 NGC container with NVIDIA DGX-1 with (8x V100 32G) GPUs.
|
||||
The following tables report the word error rate(WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
|
||||
|
||||
FP16 (seed #6)
|
||||
|
||||
| **Number of GPUs** | **Batch size per GPU** | **dev-clean WER** | **dev-other WER**| **test-clean WER**| **test-other WER**| **Total time to train with FP16 (Hrs)** |
|
||||
|--- |--- |--- |--- |--- |--- |--- |
|
||||
|8 |64| 3.51|11.14|3.74|11.06|100
|
||||
|
||||
FP32 training matches the results of mixed precision training and takes approximately 330 hours.
|
||||
|
||||
|
||||
|
||||
##### Training stability test
|
||||
|
||||
The following table compares greedy decoding word error rates across 8 different training runs with different seeds for mixed precision training.
|
||||
|
||||
| **FP16, 8x GPUs** | **seed #1** | **seed #2** | **seed #3** | **seed #4** | **seed #5** | **seed #6** | **seed #7** | **seed #8** | **mean** | **std** |
|
||||
|:-----------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
|
||||
|dev-clean|3.74|3.75|3.77|3.68|3.75|3.51|3.71|3.58|3.69|0.09
|
||||
|dev-other|11.56|11.62|11.5|11.36|11.62|11.14|11.8|11.3|11.49|0.21
|
||||
|test-clean|3.9|3.95|3.88|3.79|3.95|3.74|4.03|3.85|3.89|0.09
|
||||
|test-other|11.47|11.54|11.51|11.29|11.54|11.06|11.68|11.29|11.42|0.20
|
||||
|
||||
|
||||
|
||||
#### Training performance results
|
||||
|
||||
Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.06-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
|
||||
|
||||
##### Training performance: NVIDIA DGX-1 (8x V100 16G)

| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|---|---|---|---|---|
| 1 | 16 | 10.00 | 29.63 | 2.96 | 1.00 | 1.00 |
| 4 | 16 | 38.79 | 106.67 | 2.75 | 3.88 | 3.60 |
| 8 | 16 | 76.64 | 209.84 | 2.74 | 7.66 | 7.08 |

| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|---|---|---|---|---|
| 1 | 32 | - | 35.16 | - | - | 1.00 |
| 4 | 32 | - | 134.74 | - | - | 3.83 |
| 8 | 32 | - | 263.92 | - | - | 7.51 |

Note: FP32 numbers for batch size 32 are not available because those runs run out of memory; batch size 32 fits in GPU memory only with FP16.

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.

##### Training performance: NVIDIA DGX-1 (8x V100 32GB)

| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|---|---|---|---|---|
| 1 | 32 | 12.26 | 34.04 | 2.78 | 1.00 | 1.00 |
| 4 | 32 | 48.67 | 131.96 | 2.71 | 3.97 | 3.88 |
| 8 | 32 | 95.88 | 253.47 | 2.64 | 7.82 | 7.45 |

| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|---|---|---|---|---|
| 1 | 64 | - | 41.03 | - | - | 1.00 |
| 4 | 64 | - | 159.01 | - | - | 3.88 |
| 8 | 64 | - | 312.20 | - | - | 7.61 |

Note: FP32 numbers for batch size 64 are not available because those runs run out of memory; batch size 64 fits in GPU memory only with FP16.

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.

##### Training performance: NVIDIA DGX-2 (16x V100 32GB)

| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|---|---|---|---|---|
| 1 | 32 | 8.12 | 24.24 | 2.98 | 1.00 | 1.00 |
| 4 | 32 | 32.16 | 92.09 | 2.86 | 3.96 | 3.80 |
| 8 | 32 | 63.68 | 181.56 | 2.85 | 7.84 | 7.49 |
| 16 | 32 | 124.88 | 275.67 | 2.20 | 15.38 | 11.35 |

| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|---|---|---|---|---|
| 1 | 64 | - | 29.22 | - | - | 1.00 |
| 4 | 64 | - | 114.29 | - | - | 3.91 |
| 8 | 64 | - | 222.61 | - | - | 7.62 |
| 16 | 64 | - | 414.57 | - | - | 14.19 |

Note: FP32 numbers for batch size 64 are not available because those runs run out of memory; batch size 64 fits in GPU memory only with FP16.

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.

#### Inference performance results

Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 19.06-py3 NGC container on NVIDIA DGX-1, DGX-2, and T4 systems, each using a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 1000 iterations.
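
The percentile columns in the tables below summarize the distribution of per-batch latencies. A sketch of how such a summary can be computed from raw timings (the timing list here is a random placeholder, not measured data):

```python
import numpy as np

latencies_ms = np.random.lognormal(4.1, 0.05, size=1000)  # placeholder timings
for q in (90, 95, 99):
    print("{}%: {:.2f} ms".format(q, np.percentile(latencies_ms, q)))
print("Avg: {:.2f} ms".format(latencies_ms.mean()))
```
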
##### Inference performance: NVIDIA DGX-1 (1x V100 16G)

| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| BS | Sequence Length (in seconds) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 62.16 | 64.71 | 67.29 | 61.31 | 69.37 | 69.75 | 75.38 | 68.95 | 1.12 |
| 2 | 2 | 60.94 | 63.60 | 68.03 | 59.57 | 82.18 | 83.12 | 84.26 | 75.33 | 1.26 |
| 4 | 2 | 68.38 | 69.55 | 75.85 | 64.82 | 85.74 | 86.85 | 93.78 | 82.55 | 1.27 |
| 8 | 2 | 68.80 | 71.54 | 73.28 | 62.83 | 104.22 | 106.58 | 109.41 | 95.77 | 1.52 |
| 16 | 2 | 72.33 | 72.85 | 74.55 | 64.69 | 127.11 | 129.34 | 131.46 | 109.80 | 1.70 |
| 1 | 7 | 59.06 | 60.51 | 62.83 | 58.10 | 75.41 | 75.72 | 78.64 | 74.70 | 1.29 |
| 2 | 7 | 61.68 | 67.73 | 68.58 | 59.53 | 97.85 | 98.59 | 98.99 | 91.60 | 1.54 |
| 4 | 7 | 60.88 | 62.13 | 65.23 | 60.38 | 119.08 | 119.80 | 121.28 | 118.67 | 1.97 |
| 8 | 7 | 70.71 | 71.82 | 74.23 | 70.16 | 181.48 | 185.00 | 186.20 | 177.98 | 2.54 |
| 16 | 7 | 93.75 | 94.70 | 100.58 | 92.96 | 219.72 | 220.25 | 221.28 | 215.09 | 2.31 |
| 1 | 16.7 | 68.87 | 69.48 | 71.75 | 63.63 | 101.03 | 101.66 | 104.00 | 100.32 | 1.58 |
| 2 | 16.7 | 73.00 | 73.76 | 75.58 | 66.44 | 145.64 | 146.64 | 152.41 | 143.69 | 2.16 |
| 4 | 16.7 | 77.71 | 78.75 | 79.90 | 77.34 | 224.62 | 225.43 | 226.43 | 223.96 | 2.90 |
| 8 | 16.7 | 96.34 | 97.07 | 104.46 | 95.94 | 318.52 | 319.13 | 320.74 | 316.14 | 3.30 |
| 16 | 16.7 | 154.63 | 156.81 | 159.25 | 151.05 | 375.67 | 377.00 | 381.79 | 371.83 | 2.46 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.

##### Inference performance: NVIDIA DGX-1 (1x V100 32G)

| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| BS | Sequence Length (in seconds) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 61.60 | 62.81 | 69.62 | 60.71 | 82.32 | 83.03 | 85.72 | 77.48 | 1.28 |
| 2 | 2 | 68.82 | 70.10 | 72.08 | 61.91 | 77.99 | 81.99 | 85.13 | 76.93 | 1.24 |
| 4 | 2 | 70.06 | 70.69 | 72.58 | 74.76 | 88.36 | 89.67 | 95.61 | 94.50 | 1.26 |
| 8 | 2 | 69.98 | 71.51 | 74.20 | 64.20 | 105.82 | 107.16 | 110.04 | 98.02 | 1.53 |
| 16 | 2 | 72.05 | 74.16 | 75.51 | 65.46 | 130.49 | 130.97 | 132.83 | 112.74 | 1.72 |
| 1 | 7 | 61.40 | 61.78 | 65.53 | 60.93 | 75.72 | 75.83 | 76.55 | 75.35 | 1.24 |
| 2 | 7 | 60.50 | 60.63 | 61.77 | 60.15 | 91.05 | 91.16 | 92.39 | 90.75 | 1.51 |
| 4 | 7 | 64.67 | 71.41 | 72.10 | 64.19 | 123.77 | 123.99 | 124.92 | 123.38 | 1.92 |
| 8 | 7 | 67.96 | 68.04 | 69.38 | 67.60 | 176.43 | 176.65 | 177.25 | 175.39 | 2.59 |
| 16 | 7 | 95.41 | 95.80 | 100.94 | 93.86 | 213.04 | 213.38 | 215.52 | 212.05 | 2.26 |
| 1 | 16.7 | 61.28 | 61.67 | 62.52 | 60.63 | 104.37 | 104.56 | 105.22 | 103.83 | 1.71 |
| 2 | 16.7 | 66.88 | 67.31 | 68.09 | 66.40 | 151.08 | 151.61 | 152.26 | 146.73 | 2.21 |
| 4 | 16.7 | 80.51 | 80.79 | 81.95 | 80.12 | 226.75 | 227.07 | 228.76 | 225.82 | 2.82 |
| 8 | 16.7 | 95.66 | 95.89 | 98.86 | 95.62 | 314.74 | 316.74 | 318.66 | 312.10 | 3.26 |
| 16 | 16.7 | 156.60 | 157.07 | 160.15 | 151.13 | 366.70 | 367.41 | 370.98 | 364.05 | 2.41 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.

##### Inference performance: NVIDIA DGX-2 (1x V100 32G)

| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| BS | Sequence Length (in seconds) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 56.11 | 56.76 | 62.18 | 51.77 | 67.75 | 68.91 | 73.80 | 64.96 | 1.25 |
| 2 | 2 | 55.56 | 56.96 | 61.72 | 50.63 | 65.84 | 69.88 | 74.05 | 63.57 | 1.26 |
| 4 | 2 | 54.84 | 57.69 | 61.16 | 60.74 | 74.00 | 76.58 | 81.62 | 81.01 | 1.33 |
| 8 | 2 | 57.15 | 57.92 | 60.80 | 52.47 | 90.56 | 91.83 | 93.79 | 84.58 | 1.61 |
| 16 | 2 | 58.27 | 58.54 | 60.24 | 53.26 | 113.25 | 113.55 | 115.41 | 98.56 | 1.85 |
| 1 | 7 | 49.16 | 49.39 | 50.82 | 48.31 | 64.53 | 64.84 | 65.79 | 63.90 | 1.32 |
| 2 | 7 | 53.54 | 54.07 | 55.28 | 49.11 | 78.64 | 79.46 | 81.25 | 78.17 | 1.59 |
| 4 | 7 | 50.87 | 51.15 | 53.36 | 50.07 | 109.33 | 110.61 | 114.00 | 108.17 | 2.16 |
| 8 | 7 | 63.57 | 64.18 | 65.55 | 60.64 | 163.95 | 164.19 | 165.75 | 163.49 | 2.70 |
| 16 | 7 | 82.15 | 83.66 | 87.01 | 81.46 | 196.15 | 197.18 | 202.09 | 195.36 | 2.40 |
| 1 | 16.7 | 49.68 | 50.00 | 51.39 | 48.76 | 89.10 | 89.42 | 90.41 | 88.57 | 1.82 |
| 2 | 16.7 | 52.47 | 52.91 | 54.27 | 51.51 | 128.58 | 129.09 | 130.34 | 127.36 | 2.47 |
| 4 | 16.7 | 66.60 | 67.52 | 68.88 | 65.88 | 220.50 | 221.50 | 223.14 | 219.42 | 3.33 |
| 8 | 16.7 | 85.42 | 86.03 | 88.37 | 85.11 | 293.80 | 294.39 | 296.21 | 290.58 | 3.41 |
| 16 | 16.7 | 140.76 | 141.74 | 147.25 | 137.31 | 345.26 | 346.29 | 351.15 | 342.64 | 2.50 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.

##### Inference performance: NVIDIA T4

| | | FP16 Latency (ms) Percentiles | | | | FP32 Latency (ms) Percentiles | | | | FP16/FP32 speedup |
|---|---|---|---|---|---|---|---|---|---|---|
| BS | Sequence Length (in seconds) | 90% | 95% | 99% | Avg | 90% | 95% | 99% | Avg | Avg |
| 1 | 2 | 57.30 | 57.50 | 74.62 | 56.74 | 73.71 | 73.98 | 88.79 | 72.95 | 1.29 |
| 2 | 2 | 53.68 | 69.69 | 76.08 | 52.63 | 82.83 | 93.38 | 97.67 | 78.23 | 1.49 |
| 4 | 2 | 72.26 | 76.49 | 83.92 | 57.60 | 116.06 | 121.25 | 125.98 | 104.17 | 1.81 |
| 8 | 2 | 70.52 | 71.85 | 76.26 | 58.16 | 159.92 | 161.22 | 164.76 | 148.34 | 2.55 |
| 16 | 2 | 78.29 | 79.04 | 82.86 | 66.97 | 251.96 | 252.67 | 253.64 | 206.41 | 3.08 |
| 1 | 7 | 54.83 | 54.94 | 55.50 | 54.58 | 85.57 | 89.11 | 89.71 | 84.08 | 1.54 |
| 2 | 7 | 55.17 | 55.38 | 67.09 | 54.87 | 134.28 | 135.76 | 138.23 | 131.01 | 2.39 |
| 4 | 7 | 74.24 | 78.09 | 79.51 | 73.75 | 214.77 | 215.65 | 217.28 | 211.66 | 2.87 |
| 8 | 7 | 99.99 | 100.34 | 104.26 | 98.84 | 379.67 | 380.96 | 382.70 | 375.12 | 3.80 |
| 16 | 7 | 167.48 | 168.07 | 177.29 | 166.53 | 623.36 | 624.11 | 625.89 | 619.34 | 3.72 |
| 1 | 16.7 | 72.23 | 72.65 | 80.13 | 67.77 | 155.76 | 157.11 | 160.05 | 151.85 | 2.24 |
| 2 | 16.7 | 75.43 | 76.04 | 80.41 | 74.65 | 259.56 | 261.23 | 266.09 | 252.80 | 3.39 |
| 4 | 16.7 | 131.71 | 132.45 | 134.92 | 129.63 | 481.40 | 484.17 | 486.88 | 469.05 | 3.62 |
| 8 | 16.7 | 197.10 | 197.94 | 200.15 | 193.88 | 806.76 | 812.73 | 822.27 | 792.85 | 4.09 |
| 16 | 16.7 | 364.22 | 365.22 | 372.17 | 358.62 | 1165.78 | 1167.11 | 1171.02 | 1150.44 | 3.21 |

To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.

## Release notes

### Changelog

July 2019
* Initial release

### Known issues

There are no known issues in this release.

@@ -0,0 +1,194 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

model = "Jasper"

[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true

cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25

cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6

[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16

[encoder]
activation = "relu"
convmask = true

[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false

[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true

[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true

[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true

[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true

[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true

[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true

[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true

[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true

[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true

[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true

[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false

[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false

[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
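
Each `[[jasper]]` table above describes one block of the encoder: its channel count, sub-block repeats, kernel size, stride, dilation, dropout, and whether a residual connection is used. As a sketch of how such a file can be consumed (the filename is a placeholder, since this hunk's path is not shown in the diff):

```python
import toml

cfg = toml.load("configs/jasper10x5dr_sp.toml")  # hypothetical filename
for i, block in enumerate(cfg["jasper"]):
    print("block {}: filters={} repeat={} kernel={} residual={}".format(
        i, block["filters"], block["repeat"], block["kernel"], block["residual"]))
```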
203
PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr.toml
Normal file
@@ -0,0 +1,203 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

model = "Jasper"

[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = false

cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25

cutout_x_regions = 0
cutout_y_regions = 0
cutout_x_width = 6
cutout_y_width = 6

[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16

[encoder]
activation = "relu"
convmask = true

[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false

[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false

[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false

[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
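
This config differs from the previous one in two ways: `speed_perturbation` is disabled, and the repeated blocks set `residual_dense = true`, i.e. the Dense Residual (DR) variant of Jasper, in which each block receives skip connections from the outputs of all preceding blocks rather than only the previous one. A minimal PyTorch sketch of that wiring (an illustration of the idea, not this repository's `model.py`):

```python
import torch
import torch.nn as nn

class DenseResidualStack(nn.Module):
    """Each block adds 1x1-projected copies of every earlier input
    to its own output before the activation (residual_dense = true)."""
    def __init__(self, channels, num_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=11, padding=5)
            for _ in range(num_blocks))
        # one 1x1 projection per (earlier input -> current block) edge
        self.proj = nn.ModuleList(
            nn.ModuleList(nn.Conv1d(channels, channels, kernel_size=1)
                          for _ in range(i + 1))
            for i in range(num_blocks))

    def forward(self, x):
        skips = [x]                        # inputs available for dense skips
        for i, block in enumerate(self.blocks):
            y = block(skips[-1])
            for p, s in zip(self.proj[i], skips):
                y = y + p(s)               # dense residual connections
            skips.append(torch.relu(y))
        return skips[-1]

out = DenseResidualStack(channels=256, num_blocks=3)(torch.randn(2, 256, 100))
```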
@@ -0,0 +1,204 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

model = "Jasper"

[input]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16
max_duration = 16.7
speed_perturbation = true

cutout_rect_regions = 0
cutout_rect_time = 60
cutout_rect_freq = 25

cutout_x_regions = 2
cutout_y_regions = 2
cutout_x_width = 6
cutout_y_width = 6

[input_eval]
normalize = "per_feature"
sample_rate = 16000
window_size = 0.02
window_stride = 0.01
window = "hann"
features = 64
n_fft = 512
frame_splicing = 1
dither = 0.00001
feat_type = "logfbank"
normalize_transcripts = true
trim_silence = true
pad_to = 16

[encoder]
activation = "relu"
convmask = true

[[jasper]]
filters = 256
repeat = 1
kernel = [11]
stride = [2]
dilation = [1]
dropout = 0.2
residual = false

[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 256
repeat = 5
kernel = [11]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 384
repeat = 5
kernel = [13]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 512
repeat = 5
kernel = [17]
stride = [1]
dilation = [1]
dropout = 0.2
residual = true
residual_dense = true

[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 640
repeat = 5
kernel = [21]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 768
repeat = 5
kernel = [25]
stride = [1]
dilation = [1]
dropout = 0.3
residual = true
residual_dense = true

[[jasper]]
filters = 896
repeat = 1
kernel = [29]
stride = [1]
dilation = [2]
dropout = 0.4
residual = false

[[jasper]]
filters = 1024
repeat = 1
kernel = [1]
stride = [1]
dilation = [1]
dropout = 0.4
residual = false

[labels]
labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
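
Relative to the previous configs, this one enables the cutout augmentation: `cutout_x_regions = 2` and `cutout_y_regions = 2` request two masked stripes of width 6 along each spectrogram axis. A NumPy illustration of that kind of masking (an assumption about the semantics, here treating x as time and y as frequency; this repository's actual augmentation code is not part of this hunk):

```python
import numpy as np

def spec_cutout(spec, x_regions=2, y_regions=2, x_width=6, y_width=6,
                rng=np.random):
    """Zero out x_regions time stripes and y_regions frequency stripes."""
    spec = spec.copy()
    n_feats, n_frames = spec.shape
    for _ in range(x_regions):
        t0 = rng.randint(0, max(1, n_frames - x_width))
        spec[:, t0:t0 + x_width] = 0.0     # time mask
    for _ in range(y_regions):
        f0 = rng.randint(0, max(1, n_feats - y_width))
        spec[f0:f0 + y_width, :] = 0.0     # frequency mask
    return spec

masked = spec_cutout(np.random.randn(64, 300))
```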
269
PyTorch/SpeechRecognition/Jasper/dataset.py
Normal file
@@ -0,0 +1,269 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This file contains classes and functions related to data loading.
"""
import math

import numpy as np
import torch
import torch.distributed as dist
from torch.utils.data import Dataset, Sampler

from parts.manifest import Manifest
from parts.features import WaveformFeaturizer


class DistributedBucketBatchSampler(Sampler):
    def __init__(self, dataset, batch_size, num_replicas=None, rank=None):
        """Distributed sampler that buckets samples of similar length to minimize padding;
        similar in concept to the pytorch-nlp BucketBatchSampler
        https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.samplers.html#torchnlp.samplers.BucketBatchSampler

        Args:
            dataset: Dataset used for sampling.
            batch_size: data batch size
            num_replicas (optional): Number of processes participating in
                distributed training.
            rank (optional): Rank of the current process within num_replicas.
        """
        if num_replicas is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            num_replicas = dist.get_world_size()
        if rank is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            rank = dist.get_rank()
        self.dataset = dataset
        self.dataset_size = len(dataset)
        self.num_replicas = num_replicas
        self.rank = rank
        self.epoch = 0
        self.batch_size = batch_size
        self.tile_size = batch_size * self.num_replicas
        self.num_buckets = 6
        self.bucket_size = self.round_up_to(math.ceil(self.dataset_size / self.num_buckets), self.tile_size)
        self.index_count = self.round_up_to(self.dataset_size, self.tile_size)
        self.num_samples = self.index_count // self.num_replicas

    def round_up_to(self, x, mod):
        return (x + mod - 1) // mod * mod

    def __iter__(self):
        g = torch.Generator()
        g.manual_seed(self.epoch)
        indices = np.arange(self.index_count) % self.dataset_size
        # shuffle within each duration bucket
        for bucket in range(self.num_buckets):
            bucket_start = self.bucket_size * bucket
            bucket_end = min(bucket_start + self.bucket_size, self.index_count)
            indices[bucket_start:bucket_end] = indices[bucket_start:bucket_end][
                torch.randperm(bucket_end - bucket_start, generator=g)]

        # shuffle the order in which tiles (one batch per replica) are visited
        tile_indices = torch.randperm(self.index_count // self.tile_size, generator=g)
        for tile_index in tile_indices:
            start_index = self.tile_size * tile_index + self.batch_size * self.rank
            end_index = start_index + self.batch_size
            yield indices[start_index:end_index]

    def __len__(self):
        return self.num_samples

    def set_epoch(self, epoch):
        self.epoch = epoch

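# Usage sketch (illustrative, not part of this file originally): the sampler
# yields per-rank index lists, so it is passed as batch_sampler, and
# set_epoch() reseeds the bucket shuffle each epoch:
#
#   sampler = DistributedBucketBatchSampler(dataset, batch_size=64)
#   loader = torch.utils.data.DataLoader(dataset, batch_sampler=sampler,
#                                        collate_fn=seq_collate_fn)
#   for epoch in range(num_epochs):
#       sampler.set_epoch(epoch)
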
class data_prefetcher():
    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self.preload()

    def preload(self):
        try:
            self.next_input = next(self.loader)
        except StopIteration:
            self.next_input = None
            return
        # stage the host-to-device copy of the next batch on a side stream
        with torch.cuda.stream(self.stream):
            self.next_input = [x.cuda(non_blocking=True) for x in self.next_input]

    def __next__(self):
        torch.cuda.current_stream().wait_stream(self.stream)
        input = self.next_input
        self.preload()
        return input

    def next(self):
        return self.__next__()

    def __iter__(self):
        return self

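# Usage sketch (illustrative): data_prefetcher overlaps the host-to-device
# copy of the next batch with computation on the current one; iteration
# ends when next() returns None:
#
#   prefetcher = data_prefetcher(train_loader)
#   batch = prefetcher.next()
#   while batch is not None:
#       train_step(*batch)      # hypothetical training step
#       batch = prefetcher.next()
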
def seq_collate_fn(batch):
    """Batches samples and returns them as tensors.
    Args:
        batch: list of samples
    Returns:
        batches of tensors
    """
    batch_size = len(batch)

    def _find_max_len(lst, ind):
        max_len = -1
        for item in lst:
            if item[ind].size(0) > max_len:
                max_len = item[ind].size(0)
        return max_len

    max_audio_len = _find_max_len(batch, 0)
    max_transcript_len = _find_max_len(batch, 2)

    batched_audio_signal = torch.zeros(batch_size, max_audio_len)
    batched_transcript = torch.zeros(batch_size, max_transcript_len)
    audio_lengths = []
    transcript_lengths = []
    for ind, sample in enumerate(batch):
        batched_audio_signal[ind].narrow(0, 0, sample[0].size(0)).copy_(sample[0])
        audio_lengths.append(sample[1])
        batched_transcript[ind].narrow(0, 0, sample[2].size(0)).copy_(sample[2])
        transcript_lengths.append(sample[3])
    return batched_audio_signal, torch.stack(audio_lengths), batched_transcript, \
        torch.stack(transcript_lengths)


class AudioToTextDataLayer:
    """Data layer with data loader."""
    def __init__(self, **kwargs):
        self._device = torch.device("cuda")

        featurizer_config = kwargs['featurizer_config']
        pad_to_max = kwargs.get('pad_to_max', False)
        perturb_config = kwargs.get('perturb_config', None)
        manifest_filepath = kwargs['manifest_filepath']
        dataset_dir = kwargs['dataset_dir']
        labels = kwargs['labels']
        batch_size = kwargs['batch_size']
        drop_last = kwargs.get('drop_last', False)
        shuffle = kwargs.get('shuffle', True)
        min_duration = featurizer_config.get('min_duration', 0.1)
        max_duration = featurizer_config.get('max_duration', None)
        normalize_transcripts = kwargs.get('normalize_transcripts', True)
        trim_silence = kwargs.get('trim_silence', False)
        multi_gpu = kwargs.get('multi_gpu', False)
        sampler_type = kwargs.get('sampler', 'default')
        speed_perturbation = featurizer_config.get('speed_perturbation', False)
        sort_by_duration = (sampler_type == 'bucket')
        self._featurizer = WaveformFeaturizer.from_config(featurizer_config, perturbation_configs=perturb_config)
        self._dataset = AudioDataset(
            dataset_dir=dataset_dir,
            manifest_filepath=manifest_filepath,
            labels=labels, blank_index=len(labels),
            sort_by_duration=sort_by_duration,
            pad_to_max=pad_to_max,
            featurizer=self._featurizer, max_duration=max_duration,
            min_duration=min_duration, normalize=normalize_transcripts,
            trim=trim_silence, speed_perturbation=speed_perturbation)

        print('sort_by_duration', sort_by_duration)

        if not multi_gpu:
            self.sampler = None
            self._dataloader = torch.utils.data.DataLoader(
                dataset=self._dataset,
                batch_size=batch_size,
                collate_fn=lambda b: seq_collate_fn(b),
                drop_last=drop_last,
                shuffle=shuffle if self.sampler is None else False,
                num_workers=4,
                pin_memory=True,
                sampler=self.sampler
            )
        elif sampler_type == 'bucket':
            self.sampler = DistributedBucketBatchSampler(self._dataset, batch_size=batch_size)
            print("DDBucketSampler")
            self._dataloader = torch.utils.data.DataLoader(
                dataset=self._dataset,
                collate_fn=lambda b: seq_collate_fn(b),
                num_workers=4,
                pin_memory=True,
                batch_sampler=self.sampler
            )
        elif sampler_type == 'default':
            self.sampler = torch.utils.data.distributed.DistributedSampler(self._dataset)
            print("DDSampler")
            self._dataloader = torch.utils.data.DataLoader(
                dataset=self._dataset,
                batch_size=batch_size,
                collate_fn=lambda b: seq_collate_fn(b),
                drop_last=drop_last,
                shuffle=shuffle if self.sampler is None else False,
                num_workers=4,
                pin_memory=True,
                sampler=self.sampler
            )
        else:
            raise RuntimeError("Sampler {} not supported".format(sampler_type))

    def __len__(self):
        return len(self._dataset)

    @property
    def data_iterator(self):
        return self._dataloader


class AudioDataset(Dataset):
    def __init__(self, dataset_dir, manifest_filepath, labels, featurizer, max_duration=None, pad_to_max=False,
                 min_duration=None, blank_index=0, max_utts=0, normalize=True, sort_by_duration=False,
                 trim=False, speed_perturbation=False):
        """Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations
        (in seconds). Each entry is a different audio sample.
        Args:
            dataset_dir: absolute path to dataset folder
            manifest_filepath: relative path from dataset folder to manifest json as described above. Can be comma-separated paths.
            labels: String containing all the possible characters to map to
            featurizer: Initialized featurizer class that converts paths of audio to feature tensors
            max_duration: If audio exceeds this length, do not include in dataset
            min_duration: If audio is less than this length, do not include in dataset
            pad_to_max: if specified, input sequences into the dnn model will be padded to max_duration
            blank_index: blank index for ctc loss / decoder
            max_utts: Limit number of utterances
            normalize: whether to normalize transcript text
            sort_by_duration: whether or not to sort sequences by increasing duration
            trim: if specified, trims leading and trailing silence from an audio signal
            speed_perturbation: specify if the data to be used contains speed perturbation
        """
        m_paths = manifest_filepath.split(',')
        self.manifest = Manifest(dataset_dir, m_paths, labels, blank_index, pad_to_max=pad_to_max,
                                 max_duration=max_duration,
                                 sort_by_duration=sort_by_duration,
                                 min_duration=min_duration, max_utts=max_utts,
                                 normalize=normalize, speed_perturbation=speed_perturbation)
        self.featurizer = featurizer
        self.blank_index = blank_index
        self.trim = trim
        print(
            "Dataset loaded with {0:.2f} hours. Filtered {1:.2f} hours.".format(
                self.manifest.duration / 3600,
                self.manifest.filtered_duration / 3600))

    def __getitem__(self, index):
        sample = self.manifest[index]
        rn_indx = np.random.randint(len(sample['audio_filepath']))
        duration = sample['audio_duration'][rn_indx] if 'audio_duration' in sample else 0
        offset = sample['offset'] if 'offset' in sample else 0
        features = self.featurizer.process(sample['audio_filepath'][rn_indx],
                                           offset=offset, duration=duration,
                                           trim=self.trim)

        return features, torch.tensor(features.shape[0]).int(), \
            torch.tensor(sample["transcript"]), torch.tensor(
                len(sample["transcript"])).int()

    def __len__(self):
        return len(self.manifest)

223
PyTorch/SpeechRecognition/Jasper/helpers.py
Normal file
@@ -0,0 +1,223 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from enum import Enum

import torch
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP

from metrics import word_error_rate


class Optimization(Enum):
    """Various levels of Optimization.
    WARNING: This might have an effect on model accuracy."""
    nothing = 0
    mxprO0 = 1
    mxprO1 = 2
    mxprO2 = 3
    mxprO3 = 4


AmpOptimizations = {Optimization.mxprO0: "O0",
                    Optimization.mxprO1: "O1",
                    Optimization.mxprO2: "O2",
                    Optimization.mxprO3: "O3"}


def print_once(msg):
    # print only from rank 0 when running distributed
    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
        print(msg)


def add_ctc_labels(labels):
    if not isinstance(labels, list):
        raise ValueError("labels must be a list of symbols")
    labels.append("<BLANK>")
    return labels


def __ctc_decoder_predictions_tensor(tensor, labels):
    """
    Takes the output of a greedy ctc decoder and performs the ctc decoding
    algorithm to remove duplicates and the special blank symbol. Returns predictions.
    Args:
        tensor: model output tensor
        labels: A list of labels
    Returns:
        prediction
    """
    blank_id = len(labels) - 1
    hypotheses = []
    labels_map = dict([(i, labels[i]) for i in range(len(labels))])
    prediction_cpu_tensor = tensor.long().cpu()
    # iterate over batch
    for ind in range(prediction_cpu_tensor.shape[0]):
        prediction = prediction_cpu_tensor[ind].numpy().tolist()
        # CTC decoding procedure
        decoded_prediction = []
        previous = len(labels) - 1  # id of a blank symbol
        for p in prediction:
            if (p != previous or previous == blank_id) and p != blank_id:
                decoded_prediction.append(p)
            previous = p
        hypothesis = ''.join([labels_map[c] for c in decoded_prediction])
        hypotheses.append(hypothesis)
    return hypotheses

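# Worked example of the collapse rule above (illustrative): with
# labels = ["a", "b", "<BLANK>"] and blank_id = 2, the frame-level argmax
# sequence [0, 0, 2, 0, 1, 1] decodes to "aab": the repeated leading "a"
# merges into one, the blank is dropped, and the "a" after the blank
# starts a fresh symbol.
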
def monitor_asr_train_progress(tensors: list, labels: list):
    """
    Takes the output of a greedy ctc decoder and performs the ctc decoding
    algorithm to remove duplicates and the special blank symbol. Prints the
    WER and prediction examples to screen.
    Args:
        tensors: A list of 3 tensors (predictions, targets, target_lengths)
        labels: A list of labels

    Returns:
        word error rate
    """
    references = []

    labels_map = dict([(i, labels[i]) for i in range(len(labels))])
    with torch.no_grad():
        targets_cpu_tensor = tensors[1].long().cpu()
        tgt_lenths_cpu_tensor = tensors[2].long().cpu()

        # iterate over batch
        for ind in range(targets_cpu_tensor.shape[0]):
            tgt_len = tgt_lenths_cpu_tensor[ind].item()
            target = targets_cpu_tensor[ind][:tgt_len].numpy().tolist()
            reference = ''.join([labels_map[c] for c in target])
            references.append(reference)
        hypotheses = __ctc_decoder_predictions_tensor(tensors[0], labels=labels)
    tag = "training_batch_WER"
    wer, _, _ = word_error_rate(hypotheses, references)
    print_once('{0}: {1}'.format(tag, wer))
    print_once('Prediction: {0}'.format(hypotheses[0]))
    print_once('Reference: {0}'.format(references[0]))
    return wer


def __gather_losses(losses_list: list) -> list:
    return [torch.mean(torch.stack(losses_list))]


def __gather_predictions(predictions_list: list, labels: list) -> list:
    results = []
    for prediction in predictions_list:
        results += __ctc_decoder_predictions_tensor(prediction, labels=labels)
    return results


def __gather_transcripts(transcript_list: list, transcript_len_list: list,
                         labels: list) -> list:
    results = []
    labels_map = dict([(i, labels[i]) for i in range(len(labels))])
    # iterate over workers
    for t, ln in zip(transcript_list, transcript_len_list):
        # iterate over batch
        t_lc = t.long().cpu()
        ln_lc = ln.long().cpu()
        for ind in range(t.shape[0]):
            tgt_len = ln_lc[ind].item()
            target = t_lc[ind][:tgt_len].numpy().tolist()
            reference = ''.join([labels_map[c] for c in target])
            results.append(reference)
    return results


def process_evaluation_batch(tensors: dict, global_vars: dict, labels: list):
    """
    Processes results of an iteration and saves them in global_vars.
    Args:
        tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcript, and output
        global_vars: dictionary where processed results of the iteration are saved
        labels: A list of labels
    """
    for kv, v in tensors.items():
        if kv.startswith('loss'):
            global_vars['EvalLoss'] += __gather_losses(v)
        elif kv.startswith('predictions'):
            global_vars['predictions'] += __gather_predictions(v, labels=labels)
        elif kv.startswith('transcript_length'):
            transcript_len_list = v
        elif kv.startswith('transcript'):
            transcript_list = v
        elif kv.startswith('output'):
            global_vars['logits'] += v

    global_vars['transcripts'] += __gather_transcripts(transcript_list,
                                                       transcript_len_list,
                                                       labels=labels)


def process_evaluation_epoch(global_vars: dict, tag=None):
    """
    Processes results from each worker at the end of evaluation and combines them into a final result.
    Args:
        global_vars: dictionary containing information of the entire evaluation
    Return:
        wer: final word error rate
        loss: final loss
    """
    if 'EvalLoss' in global_vars:
        eloss = torch.mean(torch.stack(global_vars['EvalLoss'])).item()
    else:
        eloss = None
    hypotheses = global_vars['predictions']
    references = global_vars['transcripts']

    wer, scores, num_words = word_error_rate(hypotheses=hypotheses, references=references)
    multi_gpu = torch.distributed.is_initialized()
    if multi_gpu:
        if eloss is not None:
            eloss /= torch.distributed.get_world_size()
            eloss_tensor = torch.tensor(eloss).cuda()
            dist.all_reduce(eloss_tensor)
            eloss = eloss_tensor.item()
            del eloss_tensor

        scores_tensor = torch.tensor(scores).cuda()
        dist.all_reduce(scores_tensor)
        scores = scores_tensor.item()
        del scores_tensor
        num_words_tensor = torch.tensor(num_words).cuda()
        dist.all_reduce(num_words_tensor)
        num_words = num_words_tensor.item()
        del num_words_tensor
        wer = scores * 1.0 / num_words
    return wer, eloss

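# Note on the reduction above (illustrative): the edit-distance scores and
# word counts are summed across workers *before* dividing, so the global
#     wer = sum(scores_i) / sum(num_words_i)
# Averaging the per-worker WERs instead would weight every worker equally,
# regardless of how many words it actually evaluated.
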
def norm(x):
    # return the first element of a list/tuple, otherwise pass x through
    # (the original checked `List` from typing, which was never imported)
    if not isinstance(x, (list, tuple)):
        return x
    return x[0]


def print_dict(d):
    maxLen = max([len(ii) for ii in d.keys()])
    fmtString = '\t%' + str(maxLen) + 's : %s'
    print('Arguments:')
    for keyPair in sorted(d.items()):
        print(fmtString % keyPair)


def model_multi_gpu(model, multi_gpu=False):
    if multi_gpu:
        model = DDP(model)
        print('DDP(model)')
    return model
Binary file not shown.
After Width: | Height: | Size: 246 KiB
BIN
PyTorch/SpeechRecognition/Jasper/images/jasper_model.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 295 KiB
214
PyTorch/SpeechRecognition/Jasper/inference.py
Normal file
@@ -0,0 +1,214 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import math
import pickle
import random

import numpy as np
import toml
import torch
from apex import amp
from tqdm import tqdm

from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, print_dict, model_multi_gpu
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder


def parse_args():
    parser = argparse.ArgumentParser(description='Jasper')
    parser.add_argument("--local_rank", default=None, type=int)
    parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
    parser.add_argument("--steps", default=None, type=int, help='if not specified, evaluate on the full dataset; otherwise only evaluate the specified number of iterations for each worker')
    parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
    parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
    parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
    parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
    parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None, uses the attribute from the model configuration file')
    parser.add_argument("--pad_to", default=None, type=int, help="default is the pad-to value specified in the model configuration. if -1, pad to maximum duration. If > 0, pad each batch to the next multiple of this value")
    parser.add_argument("--fp16", action='store_true', help='use half precision')
    parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
    parser.add_argument("--save_prediction", type=str, default=None, help="if specified, saves predictions in text form at this location")
    parser.add_argument("--logits_save_to", default=None, type=str, help="if specified, saves logits to this path")
    parser.add_argument("--seed", default=42, type=int, help='seed')
    return parser.parse_args()

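# Example invocation (hypothetical paths and manifest name, for illustration):
#   python inference.py \
#       --dataset_dir /datasets/LibriSpeech \
#       --val_manifest librispeech-dev-clean-wav.json \
#       --model_toml configs/jasper10x5dr_sp_offline_specaugment.toml \
#       --ckpt /checkpoints/jasper_fp16.pt \
#       --fp16 --batch_size 64
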
def eval(
        data_layer,
        audio_processor,
        encoderdecoder,
        greedy_decoder,
        labels,
        multi_gpu,
        args):
    """Performs inference / evaluation.
    Args:
        data_layer: data layer object that holds the data loader
        audio_processor: data processing module
        encoderdecoder: acoustic model
        greedy_decoder: greedy decoder
        labels: list of labels as output vocabulary
        multi_gpu: true if using multiple gpus
        args: script input arguments
    """
    logits_save_to = args.logits_save_to
    audio_processor.eval()
    encoderdecoder.eval()
    with torch.no_grad():
        _global_var_dict = {
            'predictions': [],
            'transcripts': [],
            'logits': [],
        }

        for it, data in enumerate(tqdm(data_layer.data_iterator)):
            tensors = []
            for d in data:
                tensors.append(d.cuda())

            t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors

            inp = (t_audio_signal_e, t_a_sig_length_e)

            t_processed_signal, p_length_e = audio_processor(x=inp)
            t_log_probs_e, _ = encoderdecoder((t_processed_signal, p_length_e))
            t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)

            values_dict = dict(
                predictions=[t_predictions_e],
                transcript=[t_transcript_e],
                transcript_length=[t_transcript_len_e],
                output=[t_log_probs_e]
            )
            process_evaluation_batch(values_dict, _global_var_dict, labels=labels)

            if args.steps is not None and it + 1 >= args.steps:
                break
        wer, _ = process_evaluation_epoch(_global_var_dict)
        if not multi_gpu or (multi_gpu and torch.distributed.get_rank() == 0):
            print("==========>>>>>>Evaluation WER: {0}\n".format(wer))
            if args.save_prediction is not None:
                with open(args.save_prediction, 'w') as fp:
                    fp.write('\n'.join(_global_var_dict['predictions']))
            if logits_save_to is not None:
                logits = []
                for batch in _global_var_dict["logits"]:
                    for i in range(batch.shape[0]):
                        logits.append(batch[i].cpu().numpy())
                with open(logits_save_to, 'wb') as f:
                    pickle.dump(logits, f, protocol=pickle.HIGHEST_PROTOCOL)


def main(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.backends.cudnn.benchmark = args.cudnn_benchmark
    print("CUDNN BENCHMARK ", args.cudnn_benchmark)
    assert torch.cuda.is_available()

    if args.local_rank is not None:
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')
    multi_gpu = args.local_rank is not None
    if multi_gpu:
        print("DISTRIBUTED with ", torch.distributed.get_world_size())

    if args.fp16:
        optim_level = Optimization.mxprO3
    else:
        optim_level = Optimization.mxprO0

    jasper_model_definition = toml.load(args.model_toml)
    dataset_vocab = jasper_model_definition['labels']['labels']
    ctc_vocab = add_ctc_labels(dataset_vocab)

    val_manifest = args.val_manifest
    featurizer_config = jasper_model_definition['input_eval']
    featurizer_config["optimization_level"] = optim_level

    if args.max_duration is not None:
        featurizer_config['max_duration'] = args.max_duration
    if args.pad_to is not None:
        featurizer_config['pad_to'] = args.pad_to if args.pad_to >= 0 else "max"

    print('model_config')
    print_dict(jasper_model_definition)
    print('feature_config')
    print_dict(featurizer_config)

    data_layer = AudioToTextDataLayer(
        dataset_dir=args.dataset_dir,
        featurizer_config=featurizer_config,
        manifest_filepath=val_manifest,
        labels=dataset_vocab,
        batch_size=args.batch_size,
        pad_to_max=featurizer_config['pad_to'] == "max",
        shuffle=False,
        multi_gpu=multi_gpu)
    audio_preprocessor = AudioPreprocessing(**featurizer_config)

    encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))

    if args.ckpt is not None:
        print("loading model from ", args.ckpt)
        checkpoint = torch.load(args.ckpt, map_location="cpu")
        for k in audio_preprocessor.state_dict().keys():
            checkpoint['state_dict'][k] = checkpoint['state_dict'].pop("audio_preprocessor." + k)
        audio_preprocessor.load_state_dict(checkpoint['state_dict'], strict=False)
        encoderdecoder.load_state_dict(checkpoint['state_dict'], strict=False)

    greedy_decoder = GreedyCTCDecoder()

    # print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))

    N = len(data_layer)
    step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))

    if args.steps is not None:
        print('-----------------')
        print('Have {0} examples to eval on.'.format(args.steps * args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
        print('Have {0} steps / (gpu * epoch).'.format(args.steps))
        print('-----------------')
    else:
        print('-----------------')
        print('Have {0} examples to eval on.'.format(N))
        print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
        print('-----------------')

    audio_preprocessor.cuda()
    encoderdecoder.cuda()
    if args.fp16:
        encoderdecoder = amp.initialize(
            models=encoderdecoder,
            opt_level=AmpOptimizations[optim_level])

    encoderdecoder = model_multi_gpu(encoderdecoder, multi_gpu)

    eval(
        data_layer=data_layer,
        audio_processor=audio_preprocessor,
        encoderdecoder=encoderdecoder,
        greedy_decoder=greedy_decoder,
        labels=ctc_vocab,
        args=args,
        multi_gpu=multi_gpu)


if __name__ == "__main__":
    args = parse_args()

    print_dict(vars(args))

    main(args)
241
PyTorch/SpeechRecognition/Jasper/inference_benchmark.py
Normal file
@@ -0,0 +1,241 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import itertools
import os
import sys
import time
import random
import numpy as np
from heapq import nlargest
import math
from tqdm import tqdm
import toml
import torch
from apex import amp
from dataset import AudioToTextDataLayer
from helpers import process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, print_dict
from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder


def parse_args():
    parser = argparse.ArgumentParser(description='Jasper')
    parser.add_argument("--steps", default=None, help='if not specified do evaluation on full dataset. otherwise only evaluates the specified number of iterations for each worker', type=int)
    parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
    parser.add_argument("--max_duration", default=None, type=float, help='maximum duration of sequences. if None uses attribute from model configuration file')
    parser.add_argument("--pad_to", default=None, type=int, help="default is pad to value as specified in model configurations. if -1 pad to maximum duration. If > 0 pad batch to next multiple of value")
    parser.add_argument("--model_toml", type=str, help='relative model configuration path given dataset folder')
    parser.add_argument("--dataset_dir", type=str, help='absolute path to dataset folder')
    parser.add_argument("--val_manifest", type=str, help='relative path to evaluation dataset manifest file')
    parser.add_argument("--cudnn_benchmark", action='store_true', help="enable cudnn benchmark")
    parser.add_argument("--ckpt", default=None, type=str, required=True, help='path to model checkpoint')
    parser.add_argument("--fp16", action='store_true', help='use half precision')
    parser.add_argument("--seed", default=42, type=int, help='seed')
    return parser.parse_args()


def eval(
        data_layer,
        audio_processor,
        encoderdecoder,
        greedy_decoder,
        labels,
        args):
    """Performs evaluation and prints performance statistics.

    Args:
        data_layer: data layer object that holds data loader
        audio_processor: data processing module
        encoderdecoder: acoustic model
        greedy_decoder: greedy decoder
        labels: list of labels as output vocabulary
        args: script input arguments
    """
    batch_size = args.batch_size
    steps = args.steps
    audio_processor.eval()
    encoderdecoder.eval()
    with torch.no_grad():
        _global_var_dict = {
            'predictions': [],
            'transcripts': [],
        }

        it = 0
        ep = 0

        if steps is None:
            steps = math.ceil(len(data_layer) / batch_size)
        durations_dnn = []
        durations_dnn_and_prep = []
        seq_lens = []
        while True:
            ep += 1
            for data in tqdm(data_layer.data_iterator):
                it += 1
                if it > steps:
                    break
                tensors = []
                dl_device = torch.device("cuda")
                for d in data:
                    tensors.append(d.to(dl_device))

                t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors

                inp = (t_audio_signal_e, t_a_sig_length_e)
                torch.cuda.synchronize()
                t0 = time.perf_counter()
                t_processed_signal, p_length_e = audio_processor(x=inp)
                torch.cuda.synchronize()
                t1 = time.perf_counter()
                t_log_probs_e, _ = encoderdecoder((t_processed_signal, p_length_e))
                torch.cuda.synchronize()
                stop_time = time.perf_counter()

                time_prep_and_dnn = stop_time - t0
                time_dnn = stop_time - t1
                t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)

                values_dict = dict(
                    predictions=[t_predictions_e],
                    transcript=[t_transcript_e],
                    transcript_length=[t_transcript_len_e],
                )
                process_evaluation_batch(values_dict, _global_var_dict, labels=labels)
                durations_dnn.append(time_dnn)
                durations_dnn_and_prep.append(time_prep_and_dnn)
                seq_lens.append(t_processed_signal.shape[-1])

            if it >= steps:
                wer, _ = process_evaluation_epoch(_global_var_dict)
                print("==========>>>>>>Evaluation of all iterations WER: {0}\n".format(wer))
                break

        ratios = [0.9, 0.95, 0.99, 1.]
        latencies_dnn = take_durations_and_output_percentile(durations_dnn, ratios)
        latencies_dnn_and_prep = take_durations_and_output_percentile(durations_dnn_and_prep, ratios)
        print("\n using batch size {} and {} frames ".format(batch_size, seq_lens[-1]))
        print("\n".join(["dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn.items()]))
        print("\n".join(["prep + dnn latency {} : {} ".format(k, v) for k, v in latencies_dnn_and_prep.items()]))


def take_durations_and_output_percentile(durations, ratios):
    durations = np.asarray(durations) * 1000  # in ms
    latency = durations

    # drop the first few iterations to exclude warm-up
    latency = latency[5:]
    mean_latency = np.mean(latency)

    latency_worst = nlargest(math.ceil((1 - min(ratios)) * len(latency)), latency)
    latency_ranges = get_percentile(ratios, latency_worst, len(latency))
    latency_ranges["0.5"] = mean_latency
    return latency_ranges


def get_percentile(ratios, arr, nsamples):
    res = {}
    for a in ratios:
        idx = max(int(nsamples * (1 - a)), 0)
        res[a] = arr[idx]
    return res
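
For reference, a quick sketch of how the two helpers above compose; the latency values below are invented purely for illustration:

# Hypothetical run: 20 measured iteration times in seconds.
fake_durations = [0.012] * 15 + [0.014, 0.015, 0.018, 0.025, 0.040]
stats = take_durations_and_output_percentile(fake_durations, [0.9, 0.95, 0.99, 1.])
# After dropping 5 warm-up entries, 15 samples remain; nlargest keeps the
# ceil(0.1 * 15) = 2 slowest, and ratio 1.0 indexes the single worst one.
# Keys are the ratios plus "0.5" for the mean; values are latencies in ms.
print(stats)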

def main(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.backends.cudnn.benchmark = args.cudnn_benchmark
    # at least the 5 warm-up iterations dropped by the latency stats are needed
    assert args.steps is None or args.steps > 5
    print("CUDNN BENCHMARK ", args.cudnn_benchmark)
    assert torch.cuda.is_available()

    if args.fp16:
        optim_level = Optimization.mxprO3
    else:
        optim_level = Optimization.mxprO0
    batch_size = args.batch_size

    jasper_model_definition = toml.load(args.model_toml)
    dataset_vocab = jasper_model_definition['labels']['labels']
    ctc_vocab = add_ctc_labels(dataset_vocab)

    val_manifest = args.val_manifest
    featurizer_config = jasper_model_definition['input_eval']
    featurizer_config["optimization_level"] = optim_level
    if args.max_duration is not None:
        featurizer_config['max_duration'] = args.max_duration
    if args.pad_to is not None:
        featurizer_config['pad_to'] = args.pad_to if args.pad_to >= 0 else "max"

    print('model_config')
    print_dict(jasper_model_definition)
    print('feature_config')
    print_dict(featurizer_config)

    data_layer = AudioToTextDataLayer(
        dataset_dir=args.dataset_dir,
        featurizer_config=featurizer_config,
        manifest_filepath=val_manifest,
        labels=dataset_vocab,
        batch_size=batch_size,
        pad_to_max=featurizer_config['pad_to'] == "max",
        shuffle=False,
        multi_gpu=False)

    audio_preprocessor = AudioPreprocessing(**featurizer_config)

    encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))

    if args.ckpt is not None:
        print("loading model from ", args.ckpt)
        checkpoint = torch.load(args.ckpt, map_location="cpu")
        for k in audio_preprocessor.state_dict().keys():
            checkpoint['state_dict'][k] = checkpoint['state_dict'].pop("audio_preprocessor." + k)
        audio_preprocessor.load_state_dict(checkpoint['state_dict'], strict=False)
        encoderdecoder.load_state_dict(checkpoint['state_dict'], strict=False)

    greedy_decoder = GreedyCTCDecoder()

    # print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))

    N = len(data_layer)
    step_per_epoch = math.ceil(N / args.batch_size)

    print('-----------------')
    if args.steps is None:
        print('Have {0} examples to eval on.'.format(N))
        print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
    else:
        print('Have {0} examples to eval on.'.format(args.steps * args.batch_size))
        print('Have {0} steps / (gpu * epoch).'.format(args.steps))
    print('-----------------')

    audio_preprocessor.cuda()
    encoderdecoder.cuda()
    if args.fp16:
        encoderdecoder = amp.initialize(
            models=encoderdecoder,
            opt_level=AmpOptimizations[optim_level])

    eval(
        data_layer=data_layer,
        audio_processor=audio_preprocessor,
        encoderdecoder=encoderdecoder,
        greedy_decoder=greedy_decoder,
        labels=ctc_vocab,
        args=args)


if __name__ == "__main__":
    args = parse_args()

    print_dict(vars(args))

    main(args)
68
PyTorch/SpeechRecognition/Jasper/metrics.py
Normal file
@@ -0,0 +1,68 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List, Tuple


def __levenshtein(a: List, b: List) -> int:
    """Calculates the Levenshtein distance between a and b.
    """
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n, m)) space
        a, b = b, a
        n, m = m, n

    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)

    return current[n]
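
For instance, the distance counts the minimal number of single-token edits between the two lists:

# One substitution ("a" -> "the") gives distance 1; classic char-level case:
assert __levenshtein("a cat sat".split(), "the cat sat".split()) == 1
assert __levenshtein(list("kitten"), list("sitting")) == 3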

def word_error_rate(hypotheses: List[str], references: List[str]) -> Tuple[float, int, int]:
    """
    Computes average Word Error Rate between two texts represented as
    corresponding lists of strings. Hypotheses and references must have
    the same length.

    Args:
        hypotheses: list of hypotheses
        references: list of references

    Returns:
        (wer, scores, words): average word error rate, total edit distance,
        and total number of reference words
    """
    scores = 0
    words = 0
    if len(hypotheses) != len(references):
        raise ValueError("In word error rate calculation, hypotheses and reference"
                         " lists must have the same number of elements. But I got:"
                         "{0} and {1} correspondingly".format(len(hypotheses), len(references)))
    for h, r in zip(hypotheses, references):
        h_list = h.split()
        r_list = r.split()
        words += len(r_list)
        scores += __levenshtein(h_list, r_list)
    if words != 0:
        wer = 1.0 * scores / words
    else:
        wer = float('inf')
    return wer, scores, words
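
A minimal usage sketch of the metric above (the hypothesis/reference pair is invented):

# WER = total edit distance / total number of reference words.
wer, scores, words = word_error_rate(hypotheses=["the cat sat"],
                                     references=["the cat sat down"])
# one deletion over 4 reference words -> wer == 0.25, scores == 1, words == 4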
409
PyTorch/SpeechRecognition/Jasper/model.py
Normal file
@@ -0,0 +1,409 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from apex import amp
import torch
import torch.nn as nn
from parts.features import FeatureFactory
from helpers import Optimization
import random


jasper_activations = {
    "hardtanh": nn.Hardtanh,
    "relu": nn.ReLU,
    "selu": nn.SELU,
}


def init_weights(m, mode='xavier_uniform'):
    if type(m) == nn.Conv1d or type(m) == MaskedConv1d:
        if mode == 'xavier_uniform':
            nn.init.xavier_uniform_(m.weight, gain=1.0)
        elif mode == 'xavier_normal':
            nn.init.xavier_normal_(m.weight, gain=1.0)
        elif mode == 'kaiming_uniform':
            nn.init.kaiming_uniform_(m.weight, nonlinearity="relu")
        elif mode == 'kaiming_normal':
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        else:
            raise ValueError("Unknown Initialization mode: {0}".format(mode))
    elif type(m) == nn.BatchNorm1d:
        if m.track_running_stats:
            m.running_mean.zero_()
            m.running_var.fill_(1)
            m.num_batches_tracked.zero_()
        if m.affine:
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)


def get_same_padding(kernel_size, stride, dilation):
    if stride > 1 and dilation > 1:
        raise ValueError("Only stride OR dilation may be greater than 1")
    return (kernel_size // 2) * dilation
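
With stride 1 this padding preserves the sequence length exactly; two quick checks of the formula:

# kernel_size=11, stride=1, dilation=1 -> (11 // 2) * 1 = 5 pad on each side,
# so L_out = L_in + 2*5 - (11 - 1) = L_in.
assert get_same_padding(11, 1, 1) == 5
assert get_same_padding(11, 1, 2) == 10  # a dilated 11-tap kernel spans 21 samples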

class AudioPreprocessing(nn.Module):
    """GPU accelerated audio preprocessing
    """
    def __init__(self, **kwargs):
        nn.Module.__init__(self)    # For PyTorch API
        self.optim_level = kwargs.get('optimization_level', Optimization.nothing)
        self.featurizer = FeatureFactory.from_config(kwargs)

    def forward(self, x):
        input_signal, length = x
        length.requires_grad_(False)
        if self.optim_level not in [Optimization.nothing, Optimization.mxprO0, Optimization.mxprO3]:
            with amp.disable_casts():
                processed_signal = self.featurizer(x)
                processed_length = self.featurizer.get_seq_len(length)
        else:
            processed_signal = self.featurizer(x)
            processed_length = self.featurizer.get_seq_len(length)
        return processed_signal, processed_length


class SpectrogramAugmentation(nn.Module):
    """Spectrogram augmentation
    """
    def __init__(self, **kwargs):
        nn.Module.__init__(self)
        self.spec_cutout_regions = SpecCutoutRegions(kwargs)
        self.spec_augment = SpecAugment(kwargs)

    @torch.no_grad()
    def forward(self, input_spec):
        augmented_spec = self.spec_cutout_regions(input_spec)
        augmented_spec = self.spec_augment(augmented_spec)
        return augmented_spec


class SpecAugment(nn.Module):
    """Spec augment. refer to https://arxiv.org/abs/1904.08779
    """
    def __init__(self, cfg, rng=None):
        super(SpecAugment, self).__init__()

        self._rng = random.Random() if rng is None else rng

        self.cutout_x_regions = cfg.get('cutout_x_regions', 0)
        self.cutout_y_regions = cfg.get('cutout_y_regions', 0)

        self.cutout_x_width = cfg.get('cutout_x_width', 10)
        self.cutout_y_width = cfg.get('cutout_y_width', 10)

    @torch.no_grad()
    def forward(self, x):
        sh = x.shape

        mask = torch.zeros(x.shape).byte()
        for idx in range(sh[0]):
            for _ in range(self.cutout_x_regions):
                cutout_x_left = int(self._rng.uniform(0, sh[1] - self.cutout_x_width))

                mask[idx, cutout_x_left:cutout_x_left + self.cutout_x_width, :] = 1

            for _ in range(self.cutout_y_regions):
                cutout_y_left = int(self._rng.uniform(0, sh[2] - self.cutout_y_width))

                mask[idx, :, cutout_y_left:cutout_y_left + self.cutout_y_width] = 1

        x = x.masked_fill(mask.to(device=x.device), 0)

        return x


class SpecCutoutRegions(nn.Module):
    """Cutout. refer to https://arxiv.org/pdf/1708.04552.pdf
    """
    def __init__(self, cfg, rng=None):
        super(SpecCutoutRegions, self).__init__()

        self._rng = random.Random() if rng is None else rng

        self.cutout_rect_regions = cfg.get('cutout_rect_regions', 0)
        self.cutout_rect_time = cfg.get('cutout_rect_time', 5)
        self.cutout_rect_freq = cfg.get('cutout_rect_freq', 20)

    @torch.no_grad()
    def forward(self, x):
        sh = x.shape

        mask = torch.zeros(x.shape).byte()

        for idx in range(sh[0]):
            for i in range(self.cutout_rect_regions):
                cutout_rect_x = int(self._rng.uniform(
                    0, sh[1] - self.cutout_rect_freq))
                cutout_rect_y = int(self._rng.uniform(
                    0, sh[2] - self.cutout_rect_time))

                mask[idx, cutout_rect_x:cutout_rect_x + self.cutout_rect_freq,
                     cutout_rect_y:cutout_rect_y + self.cutout_rect_time] = 1

        x = x.masked_fill(mask.to(device=x.device), 0)

        return x


class JasperEncoder(nn.Module):
    """Jasper encoder
    """
    def __init__(self, **kwargs):
        cfg = {}
        for key, value in kwargs.items():
            cfg[key] = value

        nn.Module.__init__(self)
        self._cfg = cfg

        activation = jasper_activations[cfg['encoder']['activation']]()
        use_conv_mask = cfg['encoder'].get('convmask', False)
        feat_in = cfg['input']['features'] * cfg['input'].get('frame_splicing', 1)
        init_mode = cfg.get('init_mode', 'xavier_uniform')

        residual_panes = []
        encoder_layers = []
        self.dense_residual = False
        for lcfg in cfg['jasper']:
            dense_res = []
            if lcfg.get('residual_dense', False):
                residual_panes.append(feat_in)
                dense_res = residual_panes
                self.dense_residual = True
            encoder_layers.append(
                JasperBlock(feat_in, lcfg['filters'], repeat=lcfg['repeat'],
                            kernel_size=lcfg['kernel'], stride=lcfg['stride'],
                            dilation=lcfg['dilation'], dropout=lcfg['dropout'],
                            residual=lcfg['residual'], activation=activation,
                            residual_panes=dense_res, conv_mask=use_conv_mask))
            feat_in = lcfg['filters']

        self.encoder = nn.Sequential(*encoder_layers)
        self.apply(lambda x: init_weights(x, mode=init_mode))

    def num_weights(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def forward(self, x):
        audio_signal, length = x
        s_input, length = self.encoder(([audio_signal], length))
        return s_input, length


class JasperDecoderForCTC(nn.Module):
    """Jasper decoder
    """
    def __init__(self, **kwargs):
        nn.Module.__init__(self)
        self._feat_in = kwargs.get("feat_in")
        self._num_classes = kwargs.get("num_classes")
        init_mode = kwargs.get('init_mode', 'xavier_uniform')

        self.decoder_layers = nn.Sequential(
            nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),
            nn.LogSoftmax(dim=1))
        self.apply(lambda x: init_weights(x, mode=init_mode))

    def num_weights(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def forward(self, encoder_output):
        out = self.decoder_layers(encoder_output[-1])
        return out.transpose(1, 2)


class Jasper(nn.Module):
    """Contains data preprocessing, spectrogram augmentation, jasper encoder and decoder
    """
    def __init__(self, **kwargs):
        nn.Module.__init__(self)
        self.audio_preprocessor = AudioPreprocessing(**kwargs.get("feature_config"))
        self.data_spectr_augmentation = SpectrogramAugmentation(**kwargs.get("feature_config"))
        self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
        self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
                                                  num_classes=kwargs.get("num_classes"))

    def num_weights(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def forward(self, x):
        input_signal, length = x
        t_processed_signal, p_length_t = self.audio_preprocessor(x)
        if self.training:
            t_processed_signal = self.data_spectr_augmentation(input_spec=t_processed_signal)
        t_encoded_t, t_encoded_len_t = self.jasper_encoder((t_processed_signal, p_length_t))
        return self.jasper_decoder(encoder_output=t_encoded_t), t_encoded_len_t


class JasperEncoderDecoder(nn.Module):
    """Contains jasper encoder and decoder
    """
    def __init__(self, **kwargs):
        nn.Module.__init__(self)
        self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
        self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
                                                  num_classes=kwargs.get("num_classes"))

    def num_weights(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def forward(self, x):
        t_processed_signal, p_length_t = x
        t_encoded_t, t_encoded_len_t = self.jasper_encoder((t_processed_signal, p_length_t))
        return self.jasper_decoder(encoder_output=t_encoded_t), t_encoded_len_t


class MaskedConv1d(nn.Conv1d):
    """1D convolution with sequence masking
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride=1,
                 padding=0, dilation=1, groups=1, bias=False, use_mask=True):
        super(MaskedConv1d, self).__init__(in_channels, out_channels, kernel_size,
                                           stride=stride,
                                           padding=padding, dilation=dilation,
                                           groups=groups, bias=bias)
        self.use_mask = use_mask

    def get_seq_len(self, lens):
        return ((lens + 2 * self.padding[0] - self.dilation[0] * (
            self.kernel_size[0] - 1) - 1) / self.stride[0] + 1)

    def forward(self, inp):
        x, lens = inp
        if self.use_mask:
            max_len = x.size(2)
            mask = torch.arange(max_len).to(lens.dtype).to(lens.device).expand(
                len(lens), max_len) >= lens.unsqueeze(1)
            x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
            del mask

        lens = self.get_seq_len(lens)

        out = super(MaskedConv1d, self).forward(x)
        return out, lens
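
The length update follows the standard Conv1d output-length formula; a small hypothetical check:

conv = MaskedConv1d(64, 128, kernel_size=11, stride=2, padding=5)
lens = torch.tensor([100, 80])
# (L + 2*5 - 10 - 1) / 2 + 1 -> 50 and 40 for integer (Long) lengths; note the
# division truncates on the PyTorch 1.x this container targets.
print(conv.get_seq_len(lens))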

class JasperBlock(nn.Module):
    """Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
    """
    def __init__(self, inplanes, planes, repeat=3, kernel_size=11, stride=1,
                 dilation=1, padding='same', dropout=0.2, activation=None,
                 residual=True, residual_panes=[], conv_mask=False):
        super(JasperBlock, self).__init__()

        if padding != "same":
            raise ValueError("currently only 'same' padding is supported")

        padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
        self.conv_mask = conv_mask
        self.conv = nn.ModuleList()
        inplanes_loop = inplanes
        for _ in range(repeat - 1):
            self.conv.extend(
                self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
                                        stride=stride, dilation=dilation,
                                        padding=padding_val))
            self.conv.extend(
                self._get_act_dropout_layer(drop_prob=dropout, activation=activation))
            inplanes_loop = planes
        self.conv.extend(
            self._get_conv_bn_layer(inplanes_loop, planes, kernel_size=kernel_size,
                                    stride=stride, dilation=dilation,
                                    padding=padding_val))

        self.res = nn.ModuleList() if residual else None
        res_panes = residual_panes.copy()
        self.dense_residual = residual
        if residual:
            if len(residual_panes) == 0:
                res_panes = [inplanes]
                self.dense_residual = False
            for ip in res_panes:
                self.res.append(nn.ModuleList(
                    modules=self._get_conv_bn_layer(ip, planes, kernel_size=1)))
        self.out = nn.Sequential(
            *self._get_act_dropout_layer(drop_prob=dropout, activation=activation))

    def _get_conv_bn_layer(self, in_channels, out_channels, kernel_size=11,
                           stride=1, dilation=1, padding=0, bias=False):
        layers = [
            MaskedConv1d(in_channels, out_channels, kernel_size, stride=stride,
                         dilation=dilation, padding=padding, bias=bias,
                         use_mask=self.conv_mask),
            nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)
        ]
        return layers

    def _get_act_dropout_layer(self, drop_prob=0.2, activation=None):
        if activation is None:
            activation = nn.Hardtanh(min_val=0.0, max_val=20.0)
        layers = [
            activation,
            nn.Dropout(p=drop_prob)
        ]
        return layers

    def num_weights(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def forward(self, input_):
        xs, lens_orig = input_
        # compute forward convolutions
        out = xs[-1]
        lens = lens_orig
        for i, l in enumerate(self.conv):
            if isinstance(l, MaskedConv1d):
                out, lens = l((out, lens))
            else:
                out = l(out)
        # compute the residuals
        if self.res is not None:
            for i, layer in enumerate(self.res):
                res_out = xs[i]
                for j, res_layer in enumerate(layer):
                    if j == 0:
                        res_out, _ = res_layer((res_out, lens_orig))
                    else:
                        res_out = res_layer(res_out)
                out += res_out

        # compute the output
        out = self.out(out)
        if self.res is not None and self.dense_residual:
            return xs + [out], lens

        return [out], lens


class GreedyCTCDecoder(nn.Module):
    """ Greedy CTC Decoder
    """
    def __init__(self, **kwargs):
        nn.Module.__init__(self)    # For PyTorch API

    def forward(self, log_probs):
        with torch.no_grad():
            argmx = log_probs.argmax(dim=-1, keepdim=False).int()
            return argmx
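
The decoder returns only per-frame argmax indices; mapping them to text happens in the helpers. A minimal sketch of the usual CTC post-processing (collapse repeats, then drop the blank, which this repo keeps at the last vocabulary index):

def ctc_collapse(indices, blank):
    # e.g. [a, a, blank, b, b] -> [a, b]; a repeat after a blank is re-emitted
    out, prev = [], None
    for i in indices:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out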

class CTCLossNM:
    """ CTC loss
    """
    def __init__(self, **kwargs):
        self._blank = kwargs['num_classes'] - 1
        self._criterion = nn.CTCLoss(blank=self._blank, reduction='none')

    def __call__(self, log_probs, targets, input_length, target_length):
        input_length = input_length.long()
        target_length = target_length.long()
        targets = targets.long()
        loss = self._criterion(log_probs.transpose(1, 0), targets, input_length,
                               target_length)
        # note that this is different from reduction = 'mean'
        # because we are not dividing by target lengths
        return torch.mean(loss)
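
A hedged usage sketch of the loss wrapper; shapes and sizes below are invented, with log_probs laid out batch x time x classes before the internal transpose:

criterion = CTCLossNM(num_classes=29)  # e.g. 28 labels + blank at the last index
log_probs = torch.randn(4, 200, 29).log_softmax(dim=-1)
targets = torch.randint(0, 28, (4, 25))
input_length = torch.full((4,), 200)
target_length = torch.full((4,), 25)
loss = criterion(log_probs, targets, input_length, target_length)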
223
PyTorch/SpeechRecognition/Jasper/optimizers.py
Normal file
@@ -0,0 +1,223 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
from torch.optim import Optimizer
import math


class AdamW(Optimizer):
    """Implements the AdamW algorithm (Adam with decoupled weight decay).

    Adam was proposed in `Adam: A Method for Stochastic Optimization`_;
    the decoupled weight decay variant in `Decoupled Weight Decay
    Regularization`_.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        amsgrad (boolean, optional): whether to use the AMSGrad variant of this
            algorithm from the paper `On the Convergence of Adam and Beyond`_

    Adam: A Method for Stochastic Optimization:
        https://arxiv.org/abs/1412.6980
    Decoupled Weight Decay Regularization:
        https://arxiv.org/abs/1711.05101
    On the Convergence of Adam and Beyond:
        https://openreview.net/forum?id=ryQu7f-RZ
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, amsgrad=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, amsgrad=amsgrad)
        super(AdamW, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(AdamW, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
                amsgrad = group['amsgrad']

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros_like(p.data)
                    if amsgrad:
                        # Maintains max of all exp. moving avg. of sq. grad. values
                        state['max_exp_avg_sq'] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                if amsgrad:
                    max_exp_avg_sq = state['max_exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1
                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
                if amsgrad:
                    # Maintains the maximum of all 2nd moment running avg. till now
                    torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
                    # Use the max. for normalizing running avg. of gradient
                    denom = max_exp_avg_sq.sqrt().add_(group['eps'])
                else:
                    denom = exp_avg_sq.sqrt().add_(group['eps'])

                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
                # Decoupled weight decay: applied to the parameters directly
                # rather than folded into the gradient.
                p.data.add_(-step_size, torch.mul(p.data, group['weight_decay']).addcdiv_(1, exp_avg, denom))

        return loss
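
A minimal usage sketch (the tiny linear model exists purely to exercise the optimizer):

model = torch.nn.Linear(10, 2)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss = model(torch.randn(4, 10)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()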

class Novograd(Optimizer):
    """
    Implements the Novograd algorithm.

    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.95, 0.98))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        grad_averaging: gradient averaging
        amsgrad (boolean, optional): whether to use the AMSGrad variant of this
            algorithm from the paper `On the Convergence of Adam and Beyond`_
            (default: False)
    """

    def __init__(self, params, lr=1e-3, betas=(0.95, 0.98), eps=1e-8,
                 weight_decay=0, grad_averaging=False, amsgrad=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay,
                        grad_averaging=grad_averaging,
                        amsgrad=amsgrad)

        super(Novograd, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(Novograd, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError('Sparse gradients are not supported.')
                amsgrad = group['amsgrad']

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = torch.zeros([]).to(state['exp_avg'].device)
                    if amsgrad:
                        # Maintains max of all exp. moving avg. of sq. grad. values
                        state['max_exp_avg_sq'] = torch.zeros([]).to(state['exp_avg'].device)

                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                if amsgrad:
                    max_exp_avg_sq = state['max_exp_avg_sq']
                beta1, beta2 = group['betas']

                state['step'] += 1

                norm = torch.sum(torch.pow(grad, 2))

                if exp_avg_sq == 0:
                    exp_avg_sq = norm
                else:
                    exp_avg_sq.mul_(beta2).add_(1 - beta2, norm)

                if amsgrad:
                    # Maintains the maximum of all 2nd moment running avg. till now
                    torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
                    # Use the max. for normalizing running avg. of gradient
                    denom = max_exp_avg_sq.sqrt().add_(group['eps'])
                else:
                    denom = exp_avg_sq.sqrt().add_(group['eps'])

                grad.div_(denom)
                if group['weight_decay'] != 0:
                    grad.add_(group['weight_decay'], p.data)
                if group['grad_averaging']:
                    grad.mul_(1 - beta1)
                exp_avg.mul_(beta1).add_(grad)

                p.data.add_(-group['lr'], exp_avg)

        return loss
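
Novograd exposes the same interface; note that its second moment above is a single scalar per parameter tensor (a layer-wise statistic), not per element as in Adam, which is what keeps its state small. Hyperparameters below are placeholders:

model = torch.nn.Linear(10, 2)  # placeholder module
optimizer = Novograd(model.parameters(), lr=0.015, betas=(0.95, 0.98),
                     weight_decay=0.001)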
325
PyTorch/SpeechRecognition/Jasper/parts/features.py
Normal file
@@ -0,0 +1,325 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
import torch.nn as nn
import math
import librosa
from .perturb import AudioAugmentor
from .segment import AudioSegment
from apex import amp


class WaveformFeaturizer(object):
    def __init__(self, input_cfg, augmentor=None):
        self.augmentor = augmentor if augmentor is not None else AudioAugmentor()
        self.cfg = input_cfg

    def max_augmentation_length(self, length):
        return self.augmentor.max_augmentation_length(length)

    def process(self, file_path, offset=0, duration=0, trim=False):
        audio = AudioSegment.from_file(file_path,
                                       target_sr=self.cfg['sample_rate'],
                                       int_values=self.cfg.get('int_values', False),
                                       offset=offset, duration=duration, trim=trim)
        return self.process_segment(audio)

    def process_segment(self, audio_segment):
        self.augmentor.perturb(audio_segment)
        return torch.tensor(audio_segment.samples, dtype=torch.float)

    @classmethod
    def from_config(cls, input_config, perturbation_configs=None):
        if perturbation_configs is not None:
            aa = AudioAugmentor.from_config(perturbation_configs)
        else:
            aa = None

        return cls(input_config, augmentor=aa)
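
A small usage sketch (the path and config are placeholders; the keys match what process() reads above):

featurizer = WaveformFeaturizer.from_config({'sample_rate': 16000})
samples = featurizer.process("audio.wav")  # -> 1-D float tensor of samples
# Any configured perturbations are applied inside process_segment() first.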

constant = 1e-5


def normalize_batch(x, seq_len, normalize_type):
    if normalize_type == "per_feature":
        x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
                             device=x.device)
        x_std = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
                            device=x.device)
        for i in range(x.shape[0]):
            x_mean[i, :] = x[i, :, :seq_len[i]].mean(dim=1)
            x_std[i, :] = x[i, :, :seq_len[i]].std(dim=1)
        # make sure x_std is not zero
        x_std += constant
        return (x - x_mean.unsqueeze(2)) / x_std.unsqueeze(2)
    elif normalize_type == "all_features":
        x_mean = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
        x_std = torch.zeros(seq_len.shape, dtype=x.dtype, device=x.device)
        for i in range(x.shape[0]):
            x_mean[i] = x[i, :, :seq_len[i].item()].mean()
            x_std[i] = x[i, :, :seq_len[i].item()].std()
        # make sure x_std is not zero
        x_std += constant
        return (x - x_mean.view(-1, 1, 1)) / x_std.view(-1, 1, 1)
    else:
        return x
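
A quick sanity check of the per-feature branch (shapes invented):

x = torch.randn(2, 64, 300)  # batch x features x frames
seq_len = torch.tensor([300, 250])
y = normalize_batch(x, seq_len, "per_feature")
# Each feature channel now has ~zero mean / unit std over its valid frames;
# padding frames share the same statistics and are masked out downstream.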

def splice_frames(x, frame_splicing):
    """ Stacks frames together across feature dim

    input is batch_size, feature_dim, num_frames
    output is batch_size, feature_dim*frame_splicing, num_frames
    """
    seq = [x]
    for n in range(1, frame_splicing):
        seq.append(torch.cat([x[:, :, :n], x[:, :, n:]], dim=2))
    return torch.cat(seq, dim=1)


class SpectrogramFeatures(nn.Module):
    def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
                 n_fft=None,
                 window="hamming", normalize="per_feature", log=True, center=True,
                 dither=constant, pad_to=8, max_duration=16.7,
                 frame_splicing=1):
        super(SpectrogramFeatures, self).__init__()
        torch_windows = {
            'hann': torch.hann_window,
            'hamming': torch.hamming_window,
            'blackman': torch.blackman_window,
            'bartlett': torch.bartlett_window,
            'none': None,
        }
        self.win_length = int(sample_rate * window_size)
        self.hop_length = int(sample_rate * window_stride)
        self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))

        window_fn = torch_windows.get(window, None)
        window_tensor = window_fn(self.win_length,
                                  periodic=False) if window_fn else None
        self.window = window_tensor

        self.normalize = normalize
        self.log = log
        self.center = center
        self.dither = dither
        self.pad_to = pad_to
        self.frame_splicing = frame_splicing

        max_length = 1 + math.ceil(
            (max_duration * sample_rate - self.win_length) / self.hop_length
        )
        max_pad = 16 - (max_length % 16)
        self.max_length = max_length + max_pad

    def get_seq_len(self, seq_len):
        return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
            dtype=torch.int)

    @torch.no_grad()
    def forward(self, inp):
        x, seq_len = inp
        dtype = x.dtype

        seq_len = self.get_seq_len(seq_len)

        # dither
        if self.dither > 0:
            x += self.dither * torch.randn_like(x)

        # do preemphasis
        if hasattr(self, 'preemph') and self.preemph is not None:
            x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
                          dim=1)

        # get spectrogram
        x = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
                       win_length=self.win_length, center=self.center,
                       window=self.window.to(torch.float))
        x = torch.sqrt(x.pow(2).sum(-1))

        # log features if required
        if self.log:
            x = torch.log(x + 1e-20)

        # frame splicing if required
        if self.frame_splicing > 1:
            x = splice_frames(x, self.frame_splicing)

        # normalize if required
        if self.normalize:
            x = normalize_batch(x, seq_len, normalize_type=self.normalize)

        # mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
        max_len = x.size(-1)
        mask = torch.arange(max_len).to(seq_len.dtype).to(seq_len.device).expand(x.size(0), max_len) >= seq_len.unsqueeze(1)
        x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
        del mask
        pad_to = self.pad_to
        if pad_to == "max":
            x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
        elif pad_to > 0:
            pad_amt = x.size(-1) % pad_to
            if pad_amt != 0:
                x = nn.functional.pad(x, (0, pad_to - pad_amt))

        return x.to(dtype)

    @classmethod
    def from_config(cls, cfg, log=False):
        return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
                   window_stride=cfg['window_stride'],
                   n_fft=cfg['n_fft'], window=cfg['window'],
                   normalize=cfg['normalize'],
                   max_duration=cfg.get('max_duration', 16.7),
                   dither=cfg.get('dither', 1e-5), pad_to=cfg.get("pad_to", 0),
                   frame_splicing=cfg.get("frame_splicing", 1), log=log)


class FilterbankFeatures(nn.Module):
    def __init__(self, sample_rate=8000, window_size=0.02, window_stride=0.01,
                 window="hamming", normalize="per_feature", n_fft=None,
                 preemph=0.97,
                 nfilt=64, lowfreq=0, highfreq=None, log=True, dither=constant,
                 pad_to=8,
                 max_duration=16.7,
                 frame_splicing=1):
        super(FilterbankFeatures, self).__init__()
        print("PADDING: {}".format(pad_to))

        torch_windows = {
            'hann': torch.hann_window,
            'hamming': torch.hamming_window,
            'blackman': torch.blackman_window,
            'bartlett': torch.bartlett_window,
            'none': None,
        }

        self.win_length = int(sample_rate * window_size)  # frame size
        self.hop_length = int(sample_rate * window_stride)
        self.n_fft = n_fft or 2 ** math.ceil(math.log2(self.win_length))

        self.normalize = normalize
        self.log = log
        self.dither = dither
        self.frame_splicing = frame_splicing
        self.nfilt = nfilt
        self.preemph = preemph
        self.pad_to = pad_to
        highfreq = highfreq or sample_rate / 2
        window_fn = torch_windows.get(window, None)
        window_tensor = window_fn(self.win_length,
                                  periodic=False) if window_fn else None
        filterbanks = torch.tensor(
            librosa.filters.mel(sample_rate, self.n_fft, n_mels=nfilt, fmin=lowfreq,
                                fmax=highfreq), dtype=torch.float).unsqueeze(0)
        # self.fb = filterbanks
        # self.window = window_tensor
        self.register_buffer("fb", filterbanks)
        self.register_buffer("window", window_tensor)
        # Calculate maximum sequence length (# frames)
        max_length = 1 + math.ceil(
            (max_duration * sample_rate - self.win_length) / self.hop_length
        )
        max_pad = 16 - (max_length % 16)
        self.max_length = max_length + max_pad

    def get_seq_len(self, seq_len):
        return torch.ceil(seq_len.to(dtype=torch.float) / self.hop_length).to(
            dtype=torch.int)
        # dtype=torch.long)

    @torch.no_grad()
    def forward(self, inp):
        x, seq_len = inp
        dtype = x.dtype

        seq_len = self.get_seq_len(seq_len)

        # dither
        if self.dither > 0:
            x += self.dither * torch.randn_like(x)

        # do preemphasis
        if self.preemph is not None:
            x = torch.cat((x[:, 0].unsqueeze(1), x[:, 1:] - self.preemph * x[:, :-1]),
                          dim=1)

        # do stft
        x = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
                       win_length=self.win_length,
                       center=True, window=self.window.to(dtype=torch.float))

        # get power spectrum
        x = x.pow(2).sum(-1)

        # dot with filterbank energies
        x = torch.matmul(self.fb.to(x.dtype), x)

        # log features if required
        if self.log:
            x = torch.log(x + 1e-20)

        # frame splicing if required
        if self.frame_splicing > 1:
            x = splice_frames(x, self.frame_splicing)

        # normalize if required
        if self.normalize:
            x = normalize_batch(x, seq_len, normalize_type=self.normalize)

        # mask to zero any values beyond seq_len in batch, pad to multiple of `pad_to` (for efficiency)
        max_len = x.size(-1)
        mask = torch.arange(max_len).to(seq_len.dtype).to(x.device).expand(x.size(0),
                                                                           max_len) >= seq_len.unsqueeze(1)

        x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
        del mask
        pad_to = self.pad_to
        if pad_to == "max":
            x = nn.functional.pad(x, (0, self.max_length - x.size(-1)))
        elif pad_to > 0:
            pad_amt = x.size(-1) % pad_to
            if pad_amt != 0:
                x = nn.functional.pad(x, (0, pad_to - pad_amt))

        return x.to(dtype)

    @classmethod
    def from_config(cls, cfg, log=False):
        return cls(sample_rate=cfg['sample_rate'], window_size=cfg['window_size'],
                   window_stride=cfg['window_stride'], n_fft=cfg['n_fft'],
                   nfilt=cfg['features'], window=cfg['window'],
                   normalize=cfg['normalize'],
                   max_duration=cfg.get('max_duration', 16.7),
                   dither=cfg['dither'], pad_to=cfg.get("pad_to", 0),
                   frame_splicing=cfg.get("frame_splicing", 1), log=log)


class FeatureFactory(object):
    featurizers = {
        "logfbank": FilterbankFeatures,
        "fbank": FilterbankFeatures,
        "stft": SpectrogramFeatures,
        "logspect": SpectrogramFeatures,
        "logstft": SpectrogramFeatures
    }

    def __init__(self):
        pass

    @classmethod
    def from_config(cls, cfg):
        feat_type = cfg.get('feat_type', "logspect")
        featurizer = cls.featurizers[feat_type]
        # return featurizer.from_config(cfg, log="log" in cfg['feat_type'])
        return featurizer.from_config(cfg, log="log" in feat_type)
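
A hedged sketch of how the factory is driven by a config (values are placeholders shaped like the TOML 'input'/'input_eval' sections):

cfg = {'feat_type': 'logfbank', 'sample_rate': 16000, 'window_size': 0.02,
       'window_stride': 0.01, 'window': 'hann', 'normalize': 'per_feature',
       'n_fft': 512, 'features': 64, 'dither': 1e-5}
featurizer = FeatureFactory.from_config(cfg)  # -> FilterbankFeatures
# feats = featurizer((waveform_batch, waveform_lens))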
170
PyTorch/SpeechRecognition/Jasper/parts/manifest.py
Normal file
@@ -0,0 +1,170 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import re
import string
import numpy as np
import os

from .text import _clean_text


def normalize_string(s, labels, table, **unused_kwargs):
    """
    Normalizes string. For example:
    'call me at 8:00 pm!' -> 'call me at eight zero pm'

    Args:
        s: string to normalize
        labels: labels used during model training.

    Returns:
        Normalized string
    """

    def good_token(token, labels):
        s = set(labels)
        for t in token:
            if t not in s:
                return False
        return True

    try:
        text = _clean_text(s, ["english_cleaners"], table).strip()
        return ''.join([t for t in text if good_token(t, labels=labels)])
    except:
        print("WARNING: Normalizing {} failed".format(s))
        return None
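
For illustration, the punctuation table the Manifest class below builds can be passed in directly (the labels here are a toy vocabulary):

labels = list("abcdefghijklmnopqrstuvwxyz' ")
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(normalize_string("Call me at 8:00 pm!", labels, table))
# english_cleaners (parts/text) lower-cases and expands numbers before the
# per-character vocabulary filter above is applied.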

class Manifest(object):
    def __init__(self, data_dir, manifest_paths, labels, blank_index, max_duration=None, pad_to_max=False,
                 min_duration=None, sort_by_duration=False, max_utts=0,
                 normalize=True, speed_perturbation=False, filter_speed=1.0):
        self.labels_map = dict([(labels[i], i) for i in range(len(labels))])
        self.blank_index = blank_index
        self.max_duration = max_duration
        ids = []
        duration = 0.0
        filtered_duration = 0.0

        # If removing punctuation, make a list of punctuation to remove
        table = None
        if normalize:
            # Punctuation to remove
            punctuation = string.punctuation
            punctuation = punctuation.replace("+", "")
            punctuation = punctuation.replace("&", "")
            ### We might also want to consider:
            ### @ -> at
            ### # -> number, pound, hashtag
            ### ~ -> tilde
            ### _ -> underscore
            ### % -> percent
            # If a punctuation symbol is inside our vocab, we do not remove from text
            for l in labels:
                punctuation = punctuation.replace(l, "")
            # Turn all punctuation to whitespace
            table = str.maketrans(punctuation, " " * len(punctuation))
        for manifest_path in manifest_paths:
            with open(manifest_path, "r", encoding="utf-8") as fh:
                a = json.load(fh)
            for data in a:
                files_and_speeds = data['files']

                if pad_to_max:
                    if not speed_perturbation:
                        min_speed = filter_speed
                    else:
                        min_speed = min(x['speed'] for x in files_and_speeds)
                    max_duration = self.max_duration * min_speed

                data['duration'] = data['original_duration']
                if min_duration is not None and data['duration'] < min_duration:
                    filtered_duration += data['duration']
                    continue
                if max_duration is not None and data['duration'] > max_duration:
                    filtered_duration += data['duration']
                    continue

                # Prune and normalize according to transcript
                transcript_text = data[
                    'transcript'] if "transcript" in data else self.load_transcript(
                    data['text_filepath'])
                if normalize:
                    transcript_text = normalize_string(transcript_text, labels=labels,
                                                       table=table)
                if not isinstance(transcript_text, str):
                    print(
                        "WARNING: Got transcript: {}. It is not a string. Dropping data point".format(
                            transcript_text))
                    filtered_duration += data['duration']
                    continue
                data["transcript"] = self.parse_transcript(transcript_text)  # convert to vocab indices

                if speed_perturbation:
                    audio_paths = [x['fname'] for x in files_and_speeds]
                    data['audio_duration'] = [x['duration'] for x in files_and_speeds]
                else:
                    audio_paths = [x['fname'] for x in files_and_speeds if x['speed'] == filter_speed]
                    data['audio_duration'] = [x['duration'] for x in files_and_speeds if x['speed'] == filter_speed]
                data['audio_filepath'] = [os.path.join(data_dir, x) for x in audio_paths]
                data.pop('files')
                data.pop('original_duration')

                ids.append(data)
                duration += data['duration']

                if max_utts > 0 and len(ids) >= max_utts:
                    print(
                        'Stopping parsing %s as max_utts=%d' % (manifest_path, max_utts))
                    break

        if sort_by_duration:
            ids = sorted(ids, key=lambda x: x['duration'])
        self._data = ids
        self._size = len(ids)
        self._duration = duration
        self._filtered_duration = filtered_duration

    def load_transcript(self, transcript_path):
        with open(transcript_path, 'r', encoding="utf-8") as transcript_file:
            transcript = transcript_file.read().replace('\n', '')
        return transcript

    def parse_transcript(self, transcript):
        chars = [self.labels_map.get(x, self.blank_index) for x in list(transcript)]
        transcript = list(filter(lambda x: x != self.blank_index, chars))
        return transcript

    def __getitem__(self, item):
        return self._data[item]

    def __len__(self):
        return self._size

    def __iter__(self):
        return iter(self._data)

    @property
    def duration(self):
        return self._duration

    @property
    def filtered_duration(self):
        return self._filtered_duration

    @property
    def data(self):
        return list(self._data)
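
Based on the parsing above, a single manifest entry is shaped roughly like this (the values are invented for illustration):

entry = {
    "files": [{"fname": "wav/121-123852-0001.wav", "speed": 1.0,
               "duration": 6.16}],
    "original_duration": 6.16,
    "transcript": "also a popular contrivance",
}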
111
PyTorch/SpeechRecognition/Jasper/parts/perturb.py
Normal file
@@ -0,0 +1,111 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import random
import librosa
from .manifest import Manifest
from .segment import AudioSegment


class Perturbation(object):
    def max_augmentation_length(self, length):
        return length

    def perturb(self, data):
        raise NotImplementedError


class SpeedPerturbation(Perturbation):
    def __init__(self, min_speed_rate=0.85, max_speed_rate=1.15, rng=None):
        self._min_rate = min_speed_rate
        self._max_rate = max_speed_rate
        self._rng = random.Random() if rng is None else rng

    def max_augmentation_length(self, length):
        return length * self._max_rate

    def perturb(self, data):
        speed_rate = self._rng.uniform(self._min_rate, self._max_rate)
        if speed_rate <= 0:
            raise ValueError("speed_rate should be greater than zero.")
        data._samples = librosa.effects.time_stretch(data._samples, speed_rate)


class GainPerturbation(Perturbation):
    def __init__(self, min_gain_dbfs=-10, max_gain_dbfs=10, rng=None):
        self._min_gain_dbfs = min_gain_dbfs
        self._max_gain_dbfs = max_gain_dbfs
        self._rng = random.Random() if rng is None else rng

    def perturb(self, data):
        gain = self._rng.uniform(self._min_gain_dbfs, self._max_gain_dbfs)
        data._samples = data._samples * (10. ** (gain / 20.))
class ShiftPerturbation(Perturbation):
|
||||
def __init__(self, min_shift_ms=-5.0, max_shift_ms=5.0, rng=None):
|
||||
self._min_shift_ms = min_shift_ms
|
||||
self._max_shift_ms = max_shift_ms
|
||||
self._rng = random.Random() if rng is None else rng
|
||||
|
||||
def perturb(self, data):
|
||||
shift_ms = self._rng.uniform(self._min_shift_ms, self._max_shift_ms)
|
||||
if abs(shift_ms) / 1000 > data.duration:
|
||||
# TODO: do something smarter than just ignore this condition
|
||||
return
|
||||
shift_samples = int(shift_ms * data.sample_rate // 1000)
|
||||
# print("DEBUG: shift:", shift_samples)
|
||||
if shift_samples < 0:
|
||||
data._samples[-shift_samples:] = data._samples[:shift_samples]
|
||||
data._samples[:-shift_samples] = 0
|
||||
elif shift_samples > 0:
|
||||
data._samples[:-shift_samples] = data._samples[shift_samples:]
|
||||
data._samples[-shift_samples:] = 0
|
||||
|
||||
|
||||
perturbation_types = {
|
||||
"speed": SpeedPerturbation,
|
||||
"gain": GainPerturbation,
|
||||
"shift": ShiftPerturbation,
|
||||
}
|
||||
|
||||
|
||||
class AudioAugmentor(object):
|
||||
def __init__(self, perturbations=None, rng=None):
|
||||
self._rng = random.Random() if rng is None else rng
|
||||
self._pipeline = perturbations if perturbations is not None else []
|
||||
|
||||
def perturb(self, segment):
|
||||
for (prob, p) in self._pipeline:
|
||||
if self._rng.random() < prob:
|
||||
p.perturb(segment)
|
||||
return
|
||||
|
||||
def max_augmentation_length(self, length):
|
||||
newlen = length
|
||||
for (prob, p) in self._pipeline:
|
||||
newlen = p.max_augmentation_length(newlen)
|
||||
return newlen
|
||||
|
||||
@classmethod
|
||||
def from_config(cls, config):
|
||||
ptbs = []
|
||||
for p in config:
|
||||
if p['aug_type'] not in perturbation_types:
|
||||
print(p['aug_type'], "perturbation not known. Skipping.")
|
||||
continue
|
||||
perturbation = perturbation_types[p['aug_type']]
|
||||
ptbs.append((p['prob'], perturbation(**p['cfg'])))
|
||||
return cls(perturbations=ptbs)
|
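A short usage sketch of AudioAugmentor.from_config; the dict keys (aug_type, prob, cfg) are exactly what from_config reads above, while the concrete values are illustrative:

    # Each perturbation fires independently with its own probability.
    aug_config = [
        {"aug_type": "gain", "prob": 0.5, "cfg": {"min_gain_dbfs": -6, "max_gain_dbfs": 6}},
        {"aug_type": "shift", "prob": 0.2, "cfg": {"min_shift_ms": -5.0, "max_shift_ms": 5.0}},
    ]
    augmentor = AudioAugmentor.from_config(aug_config)
    augmentor.perturb(segment)  # segment: an AudioSegment, perturbed in place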
170
PyTorch/SpeechRecognition/Jasper/parts/segment.py
Normal file
@@ -0,0 +1,170 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
import librosa
import soundfile as sf


class AudioSegment(object):
    """Monaural audio segment abstraction.
    :param samples: Audio samples [num_samples x num_channels].
    :type samples: ndarray.float32
    :param sample_rate: Audio sample rate.
    :type sample_rate: int
    :raises TypeError: If the sample data type is not float or int.
    """

    def __init__(self, samples, sample_rate, target_sr=None, trim=False,
                 trim_db=60):
        """Create audio segment from samples.
        Samples are converted to float32 internally, with ints scaled to [-1, 1].
        """
        samples = self._convert_samples_to_float32(samples)
        if target_sr is not None and target_sr != sample_rate:
            samples = librosa.core.resample(samples, sample_rate, target_sr)
            sample_rate = target_sr
        if trim:
            samples, _ = librosa.effects.trim(samples, trim_db)
        self._samples = samples
        self._sample_rate = sample_rate
        if self._samples.ndim >= 2:
            self._samples = np.mean(self._samples, 1)

    def __eq__(self, other):
        """Return whether two objects are equal."""
        if type(other) is not type(self):
            return False
        if self._sample_rate != other._sample_rate:
            return False
        if self._samples.shape != other._samples.shape:
            return False
        if np.any(self.samples != other._samples):
            return False
        return True

    def __ne__(self, other):
        """Return whether two objects are unequal."""
        return not self.__eq__(other)

    def __str__(self):
        """Return human-readable representation of segment."""
        return ("%s: num_samples=%d, sample_rate=%d, duration=%.2fsec, "
                "rms=%.2fdB" % (type(self), self.num_samples, self.sample_rate,
                                self.duration, self.rms_db))

    @staticmethod
    def _convert_samples_to_float32(samples):
        """Convert sample type to float32.
        Audio sample type is usually integer or floating-point.
        Integers will be scaled to [-1, 1] in float32.
        """
        float32_samples = samples.astype('float32')
        if samples.dtype in np.sctypes['int']:
            bits = np.iinfo(samples.dtype).bits
            float32_samples *= (1. / 2 ** (bits - 1))
        elif samples.dtype in np.sctypes['float']:
            pass
        else:
            raise TypeError("Unsupported sample type: %s." % samples.dtype)
        return float32_samples

    @classmethod
    def from_file(cls, filename, target_sr=None, int_values=False, offset=0,
                  duration=0, trim=False):
        """
        Load a file supported by librosa and return it as an AudioSegment.
        :param filename: path of file to load
        :param target_sr: the desired sample rate
        :param int_values: if true, load samples as 32-bit integers
        :param offset: offset in seconds when loading audio
        :param duration: duration in seconds when loading audio
        :param trim: if true, trim leading/trailing silence
        :return: an AudioSegment instance
        """
        with sf.SoundFile(filename, 'r') as f:
            dtype = 'int32' if int_values else 'float32'
            sample_rate = f.samplerate
            if offset > 0:
                f.seek(int(offset * sample_rate))
            if duration > 0:
                samples = f.read(int(duration * sample_rate), dtype=dtype)
            else:
                samples = f.read(dtype=dtype)
            samples = samples.transpose()
        return cls(samples, sample_rate, target_sr=target_sr, trim=trim)

    @property
    def samples(self):
        return self._samples.copy()

    @property
    def sample_rate(self):
        return self._sample_rate

    @property
    def num_samples(self):
        return self._samples.shape[0]

    @property
    def duration(self):
        return self._samples.shape[0] / float(self._sample_rate)

    @property
    def rms_db(self):
        mean_square = np.mean(self._samples ** 2)
        return 10 * np.log10(mean_square)

    def gain_db(self, gain):
        self._samples *= 10. ** (gain / 20.)

    def pad(self, pad_size, symmetric=False):
        """Add zero padding to the sample. The pad size is given in number of samples.
        If symmetric=True, `pad_size` will be added to both sides. If false, `pad_size`
        zeros will be added only to the end.
        """
        self._samples = np.pad(self._samples,
                               (pad_size if symmetric else 0, pad_size),
                               mode='constant')

    def subsegment(self, start_time=None, end_time=None):
        """Cut the AudioSegment between given boundaries.
        Note that this is an in-place transformation.
        :param start_time: Beginning of subsegment in seconds.
        :type start_time: float
        :param end_time: End of subsegment in seconds.
        :type end_time: float
        :raise ValueError: If start_time or end_time is incorrectly set, e.g. out
        of bounds in time.
        """
        start_time = 0.0 if start_time is None else start_time
        end_time = self.duration if end_time is None else end_time
        if start_time < 0.0:
            start_time = self.duration + start_time
        if end_time < 0.0:
            end_time = self.duration + end_time
        if start_time < 0.0:
            raise ValueError("The slice start position (%f s) is out of "
                             "bounds." % start_time)
        if end_time < 0.0:
            raise ValueError("The slice end position (%f s) is out of bounds." %
                             end_time)
        if start_time > end_time:
            raise ValueError("The slice start position (%f s) is later than "
                             "the end position (%f s)." % (start_time, end_time))
        if end_time > self.duration:
            raise ValueError("The slice end position (%f s) is out of bounds "
                             "(> %f s)" % (end_time, self.duration))
        start_sample = int(round(start_time * self._sample_rate))
        end_sample = int(round(end_time * self._sample_rate))
        self._samples = self._samples[start_sample:end_sample]
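A quick usage sketch of the AudioSegment API defined above (the file path is illustrative):

    # Load, resample to 16 kHz, and trim leading/trailing silence.
    segment = AudioSegment.from_file("dev-clean-wav/1272-128104-0000.wav",
                                     target_sr=16000, trim=True)
    print(segment.num_samples, segment.sample_rate, segment.duration)
    segment.gain_db(-3.0)  # attenuate by 3 dB in place
    segment.pad(160)       # append 160 zero samples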
19
PyTorch/SpeechRecognition/Jasper/parts/text/LICENSE
Normal file
@@ -0,0 +1,19 @@
Copyright (c) 2017 Keith Ito

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
12
PyTorch/SpeechRecognition/Jasper/parts/text/__init__.py
Normal file
@@ -0,0 +1,12 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """
import re

from . import cleaners


def _clean_text(text, cleaner_names, *args):
    for name in cleaner_names:
        # default=None so a missing cleaner raises the explicit error below
        # instead of an AttributeError
        cleaner = getattr(cleaners, name, None)
        if not cleaner:
            raise Exception('Unknown cleaner: %s' % name)
        text = cleaner(text, *args)
    return text
107
PyTorch/SpeechRecognition/Jasper/parts/text/cleaners.py
Normal file
@@ -0,0 +1,107 @@
# Copyright (c) 2017 Keith Ito
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

""" from https://github.com/keithito/tacotron
Modified to add punctuation removal
"""

'''
Cleaners are transformations that run over the input text at both training and eval time.

Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
hyperparameter. Some cleaners are English-specific. You'll typically want to use:
  1. "english_cleaners" for English text
  2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
     the Unidecode library (https://pypi.python.org/pypi/Unidecode)
  3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
     the symbols in symbols.py to match your data).
'''

import re
from unidecode import unidecode
from .numbers import normalize_numbers

# Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+')

# List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
    ('mrs', 'misess'),
    ('mr', 'mister'),
    ('dr', 'doctor'),
    ('st', 'saint'),
    ('co', 'company'),
    ('jr', 'junior'),
    ('maj', 'major'),
    ('gen', 'general'),
    ('drs', 'doctors'),
    ('rev', 'reverend'),
    ('lt', 'lieutenant'),
    ('hon', 'honorable'),
    ('sgt', 'sergeant'),
    ('capt', 'captain'),
    ('esq', 'esquire'),
    ('ltd', 'limited'),
    ('col', 'colonel'),
    ('ft', 'fort'),
]]


def expand_abbreviations(text):
    for regex, replacement in _abbreviations:
        text = re.sub(regex, replacement, text)
    return text


def expand_numbers(text):
    return normalize_numbers(text)


def lowercase(text):
    return text.lower()


def collapse_whitespace(text):
    return re.sub(_whitespace_re, ' ', text)


def convert_to_ascii(text):
    return unidecode(text)


def remove_punctuation(text, table):
    text = text.translate(table)
    text = re.sub(r'&', " and ", text)
    text = re.sub(r'\+', " plus ", text)
    return text


def basic_cleaners(text):
    '''Basic pipeline that lowercases and collapses whitespace without transliteration.'''
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text


def transliteration_cleaners(text):
    '''Pipeline for non-English text that transliterates to ASCII.'''
    text = convert_to_ascii(text)
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text


def english_cleaners(text, table=None):
    '''Pipeline for English text, including number and abbreviation expansion.'''
    text = convert_to_ascii(text)
    text = lowercase(text)
    text = expand_numbers(text)
    text = expand_abbreviations(text)
    if table is not None:
        text = remove_punctuation(text, table)
    text = collapse_whitespace(text)
    return text
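A sketch of the full English pipeline. Note that english_cleaners only removes punctuation when given a translation table, which this file never constructs itself; a caller would presumably build one roughly like this (the table construction is an assumption, not shown in this commit):

    import string

    # Assumed caller-side construction of the punctuation-stripping table.
    table = str.maketrans(dict.fromkeys(string.punctuation))
    print(english_cleaners("Dr. Smith paid $2.50!", table=table))
    # -> roughly "doctor smith paid two dollars fifty cents"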
99
PyTorch/SpeechRecognition/Jasper/parts/text/numbers.py
Normal file
@@ -0,0 +1,99 @@
# Copyright (c) 2017 Keith Ito
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

""" from https://github.com/keithito/tacotron
Modified to add support for time and slight tweaks to _expand_number
"""

import inflect
import re


_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
_number_re = re.compile(r'[0-9]+')
_time_re = re.compile(r'([0-9]{1,2}):([0-9]{2})')


def _remove_commas(m):
    return m.group(1).replace(',', '')


def _expand_decimal_point(m):
    return m.group(1).replace('.', ' point ')


def _expand_dollars(m):
    match = m.group(1)
    parts = match.split('.')
    if len(parts) > 2:
        return match + ' dollars'  # Unexpected format
    dollars = int(parts[0]) if parts[0] else 0
    cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
    if dollars and cents:
        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
        cent_unit = 'cent' if cents == 1 else 'cents'
        return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
    elif dollars:
        dollar_unit = 'dollar' if dollars == 1 else 'dollars'
        return '%s %s' % (dollars, dollar_unit)
    elif cents:
        cent_unit = 'cent' if cents == 1 else 'cents'
        return '%s %s' % (cents, cent_unit)
    else:
        return 'zero dollars'


def _expand_ordinal(m):
    return _inflect.number_to_words(m.group(0))


def _expand_number(m):
    if int(m.group(0)[0]) == 0:
        return _inflect.number_to_words(m.group(0), andword='', group=1)
    num = int(m.group(0))
    if num > 1000 and num < 3000:
        if num == 2000:
            return 'two thousand'
        elif num > 2000 and num < 2010:
            return 'two thousand ' + _inflect.number_to_words(num % 100)
        elif num % 100 == 0:
            return _inflect.number_to_words(num // 100) + ' hundred'
        else:
            return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
    # Add check for number phones and other large numbers
    elif num > 1000000000 and num % 10000 != 0:
        return _inflect.number_to_words(num, andword='', group=1)
    else:
        return _inflect.number_to_words(num, andword='')


def _expand_time(m):
    mins = int(m.group(2))
    if mins == 0:
        return _inflect.number_to_words(m.group(1))
    return " ".join([_inflect.number_to_words(m.group(1)), _inflect.number_to_words(m.group(2))])


def normalize_numbers(text):
    text = re.sub(_comma_number_re, _remove_commas, text)
    text = re.sub(_pounds_re, r'\1 pounds', text)
    text = re.sub(_dollars_re, _expand_dollars, text)
    text = re.sub(_decimal_number_re, _expand_decimal_point, text)
    text = re.sub(_ordinal_re, _expand_ordinal, text)
    # Expand times before bare numbers, so patterns like "10:30" are not
    # consumed digit-by-digit by _number_re first.
    text = re.sub(_time_re, _expand_time, text)
    text = re.sub(_number_re, _expand_number, text)
    return text
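A few illustrative inputs for normalize_numbers (expected outputs assume inflect's default wording):

    print(normalize_numbers("in 1969"))   # -> "in nineteen sixty-nine"
    print(normalize_numbers("$15.50"))    # -> "fifteen dollars, fifty cents"
    print(normalize_numbers("at 10:30"))  # -> "at ten thirty" (times expand before bare numbers)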
19
PyTorch/SpeechRecognition/Jasper/parts/text/symbols.py
Normal file
@@ -0,0 +1,19 @@
# Copyright (c) 2017 Keith Ito
""" from https://github.com/keithito/tacotron """

'''
Defines the set of symbols used in text input to the model.

The default is a set of ASCII characters that works well for English or text
that has been run through Unidecode. For other data, you can modify _characters.
See TRAINING_DATA.md for details.
'''
from . import cmudict

_pad = '_'
_punctuation = '!\'(),.:;? '
_special = '-'
_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
_arpabet = ['@' + s for s in cmudict.valid_symbols]

# Export all symbols:
symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet
9
PyTorch/SpeechRecognition/Jasper/requirements.txt
Executable file
@@ -0,0 +1,9 @@
pandas==0.24.2
tqdm==4.31.1
ascii-graph==1.5.1
wrapt==1.10.11
librosa
toml
soundfile
ipdb
sox
3
PyTorch/SpeechRecognition/Jasper/scripts/docker/build.sh
Executable file
@@ -0,0 +1,3 @@
#!/bin/bash

docker build . --rm -t jasper
30
PyTorch/SpeechRecognition/Jasper/scripts/docker/launch.sh
Executable file
@@ -0,0 +1,30 @@
#!/bin/bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

DATA_DIR=$1
CHECKPOINT_DIR=$2
RESULT_DIR=$3

docker run -it --rm \
  --runtime=nvidia \
  --shm-size=4g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$DATA_DIR":/datasets \
  -v "$CHECKPOINT_DIR":/checkpoints/ \
  -v "$RESULT_DIR":/results/ \
  jasper bash
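An illustrative invocation of the launch script (host paths are examples, not repo defaults):

    bash scripts/docker/launch.sh /raid/datasets /raid/checkpoints /raid/results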
28
PyTorch/SpeechRecognition/Jasper/scripts/download_librispeech.sh
Executable file
@@ -0,0 +1,28 @@
#!/usr/bin/env bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

DATA_SET="LibriSpeech"
DATA_ROOT_DIR="/datasets"
DATA_DIR="${DATA_ROOT_DIR}/${DATA_SET}"
if [ ! -d "$DATA_DIR" ]
then
  mkdir "$DATA_DIR"
  chmod go+rx "$DATA_DIR"
  python utils/download_librispeech.py utils/librispeech.csv "$DATA_DIR" -e ${DATA_ROOT_DIR}/
else
  echo "Directory $DATA_DIR already exists."
fi
92
PyTorch/SpeechRecognition/Jasper/scripts/evaluation.sh
Executable file
@@ -0,0 +1,92 @@
#!/bin/bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

echo "Container nvidia build = " $NVIDIA_BUILD_ID

DATA_DIR=${1:-"/datasets/LibriSpeech"}
DATASET=${2:-"dev-clean"}
MODEL_CONFIG=${3:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${4:-"/results"}
CHECKPOINT=$5
CREATE_LOGFILE=${6:-"true"}
CUDNN_BENCHMARK=${7:-"false"}
NUM_GPUS=${8:-1}
PRECISION=${9:-"fp32"}
NUM_STEPS=${10:-"-1"}
SEED=${11:-0}
BATCH_SIZE=${12:-64}

if [ "$CREATE_LOGFILE" = "true" ] ; then
  export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
  printf -v TAG "jasper_evaluation_${DATASET}_%s_gbs%d" "$PRECISION" $GBS
  DATESTAMP=`date +'%y%m%d%H%M%S'`
  LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
  printf "Logs written to %s\n" "$LOGFILE"
fi

PREC=""
if [ "$PRECISION" = "fp16" ] ; then
  PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
  PREC=""
else
  echo "Unknown <precision> argument"
  exit -2
fi

STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
  STEPS=" --steps $NUM_STEPS"
fi

if [ "$CUDNN_BENCHMARK" = "true" ] ; then
  CUDNN_BENCHMARK=" --cudnn_benchmark"
else
  CUDNN_BENCHMARK=""
fi

CMD=" inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
CMD+=" --ckpt $CHECKPOINT "
CMD+=" $CUDNN_BENCHMARK"
CMD+=" $PREC "
CMD+=" $STEPS "

if [ "$NUM_GPUS" -gt 1 ] ; then
  CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
  CMD="python3 $CMD"
fi

set -x
if [ -z "$LOGFILE" ] ; then
  $CMD
else
  (
    $CMD
  ) |& tee "$LOGFILE"
fi
set +x
104
PyTorch/SpeechRecognition/Jasper/scripts/inference.sh
Executable file
@@ -0,0 +1,104 @@
#!/bin/bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

echo "Container nvidia build = " $NVIDIA_BUILD_ID

DATA_DIR=${1:-"/datasets/LibriSpeech"}
DATASET=${2:-"dev-clean"}
MODEL_CONFIG=${3:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${4:-"/results"}
CHECKPOINT=$5
CREATE_LOGFILE=${6:-"true"}
CUDNN_BENCHMARK=${7:-"false"}
PRECISION=${8:-"fp32"}
NUM_STEPS=${9:-"-1"}
SEED=${10:-0}
BATCH_SIZE=${11:-64}
MODELOUTPUT_FILE=${12:-"none"}
PREDICTION_FILE=${13:-"$RESULT_DIR/${DATASET}.predictions"}

if [ "$CREATE_LOGFILE" = "true" ] ; then
  export GBS=$(expr $BATCH_SIZE)
  printf -v TAG "jasper_inference_${DATASET}_%s_gbs%d" "$PRECISION" $GBS
  DATESTAMP=`date +'%y%m%d%H%M%S'`
  LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
  printf "Logs written to %s\n" "$LOGFILE"
fi

PREC=""
if [ "$PRECISION" = "fp16" ] ; then
  PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
  PREC=""
else
  echo "Unknown <precision> argument"
  exit -2
fi

PRED=""
if [ "$PREDICTION_FILE" = "none" ] ; then
  PRED=""
else
  PRED=" --save_prediction $PREDICTION_FILE"
fi

OUTPUT=""
if [ "$MODELOUTPUT_FILE" = "none" ] ; then
  OUTPUT=" "
else
  OUTPUT=" --logits_save_to $MODELOUTPUT_FILE"
fi

if [ "$CUDNN_BENCHMARK" = "true" ]; then
  CUDNN_BENCHMARK=" --cudnn_benchmark"
else
  CUDNN_BENCHMARK=""
fi

STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
  STEPS=" --steps $NUM_STEPS"
fi

CMD=" python inference.py "
CMD+=" --batch_size $BATCH_SIZE "
CMD+=" --dataset_dir $DATA_DIR "
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --model_toml $MODEL_CONFIG "
CMD+=" --seed $SEED "
CMD+=" --ckpt $CHECKPOINT "
CMD+=" $CUDNN_BENCHMARK"
CMD+=" $PRED "
CMD+=" $OUTPUT "
CMD+=" $PREC "
CMD+=" $STEPS "

set -x
if [ -z "$LOGFILE" ] ; then
  $CMD
else
  (
    $CMD
  ) |& tee "$LOGFILE"
fi
set +x
echo "MODELOUTPUT_FILE: ${MODELOUTPUT_FILE}"
echo "PREDICTION_FILE: ${PREDICTION_FILE}"
89
PyTorch/SpeechRecognition/Jasper/scripts/inference_benchmark.sh
Executable file
@@ -0,0 +1,89 @@
#!/bin/bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

echo "Container nvidia build = " $NVIDIA_BUILD_ID

DATA_DIR=${1:-"/datasets/LibriSpeech"}
DATASET=${2:-"dev-clean"}
MODEL_CONFIG=${3:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${4:-"/results"}
CHECKPOINT=$5
CREATE_LOGFILE=${6:-"true"}
CUDNN_BENCHMARK=${7:-"true"}
PRECISION=${8:-"fp32"}
NUM_STEPS=${9:-"-1"}
MAX_DURATION=${10:-"36"}
SEED=${11:-0}
BATCH_SIZE=${12:-64}

PREC=""
if [ "$PRECISION" = "fp16" ] ; then
  PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
  PREC=""
else
  echo "Unknown <precision> argument"
  exit -2
fi

STEPS=""
if [ "$NUM_STEPS" -gt 0 ] ; then
  STEPS=" --steps $NUM_STEPS"
fi

if [ "$CUDNN_BENCHMARK" = "true" ] ; then
  CUDNN_BENCHMARK=" --cudnn_benchmark"
else
  CUDNN_BENCHMARK=""
fi

CMD=" python inference_benchmark.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --seed=$SEED"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
CMD+=" --ckpt=$CHECKPOINT"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to=-1"
CMD+=" $CUDNN_BENCHMARK"
CMD+=" $PREC"
CMD+=" $STEPS"

if [ "$CREATE_LOGFILE" = "true" ] ; then
  export GBS=$(expr $BATCH_SIZE)
  printf -v TAG "jasper_inference_benchmark_%s_gbs%d" "$PRECISION" $GBS
  DATESTAMP=`date +'%y%m%d%H%M%S'`
  LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
  printf "Logs written to %s\n" "$LOGFILE"
fi

set -x
if [ -z "$LOGFILE" ] ; then
  $CMD
else
  (
    $CMD
  ) |& tee "$LOGFILE"
  grep 'latency' "$LOGFILE"
fi
set +x
51
PyTorch/SpeechRecognition/Jasper/scripts/preprocess_librispeech.sh
Executable file
@@ -0,0 +1,51 @@
#!/usr/bin/env bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

python ./utils/convert_librispeech.py \
   --input_dir /datasets/LibriSpeech/train-clean-100 \
   --dest_dir /datasets/LibriSpeech/train-clean-100-wav \
   --output_json /datasets/LibriSpeech/librispeech-train-clean-100-wav.json \
   --speed 0.9 1.1
python ./utils/convert_librispeech.py \
   --input_dir /datasets/LibriSpeech/train-clean-360 \
   --dest_dir /datasets/LibriSpeech/train-clean-360-wav \
   --output_json /datasets/LibriSpeech/librispeech-train-clean-360-wav.json \
   --speed 0.9 1.1
python ./utils/convert_librispeech.py \
   --input_dir /datasets/LibriSpeech/train-other-500 \
   --dest_dir /datasets/LibriSpeech/train-other-500-wav \
   --output_json /datasets/LibriSpeech/librispeech-train-other-500-wav.json \
   --speed 0.9 1.1


python ./utils/convert_librispeech.py \
   --input_dir /datasets/LibriSpeech/dev-clean \
   --dest_dir /datasets/LibriSpeech/dev-clean-wav \
   --output_json /datasets/LibriSpeech/librispeech-dev-clean-wav.json
python ./utils/convert_librispeech.py \
   --input_dir /datasets/LibriSpeech/dev-other \
   --dest_dir /datasets/LibriSpeech/dev-other-wav \
   --output_json /datasets/LibriSpeech/librispeech-dev-other-wav.json


python ./utils/convert_librispeech.py \
   --input_dir /datasets/LibriSpeech/test-clean \
   --dest_dir /datasets/LibriSpeech/test-clean-wav \
   --output_json /datasets/LibriSpeech/librispeech-test-clean-wav.json
python ./utils/convert_librispeech.py \
   --input_dir /datasets/LibriSpeech/test-other \
   --dest_dir /datasets/LibriSpeech/test-other-wav \
   --output_json /datasets/LibriSpeech/librispeech-test-other-wav.json
111
PyTorch/SpeechRecognition/Jasper/scripts/train.sh
Executable file
@@ -0,0 +1,111 @@
#!/bin/bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

echo "Container nvidia build = " $NVIDIA_BUILD_ID

DATA_DIR=${1:-"/datasets/LibriSpeech"}
MODEL_CONFIG=${2:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${3:-"/results"}
CHECKPOINT=${4:-"none"}
CREATE_LOGFILE=${5:-"true"}
CUDNN_BENCHMARK=${6:-"true"}
NUM_GPUS=${7:-8}
PRECISION=${8:-"fp16"}
EPOCHS=${9:-400}
SEED=${10:-6}
BATCH_SIZE=${11:-64}
LEARNING_RATE=${12:-"0.015"}
GRADIENT_ACCUMULATION_STEPS=${13:-1}
LAUNCH_OPT=${LAUNCH_OPT:-"none"}

PREC=""
if [ "$PRECISION" = "fp16" ] ; then
  PREC="--fp16"
elif [ "$PRECISION" = "fp32" ] ; then
  PREC=""
else
  echo "Unknown <precision> argument"
  exit -2
fi

CUDNN=""
if [ "$CUDNN_BENCHMARK" = "true" ] && [ "$PRECISION" = "fp16" ]; then
  CUDNN=" --cudnn"
else
  CUDNN=""
fi

if [ "$CHECKPOINT" = "none" ] ; then
  CHECKPOINT=""
else
  CHECKPOINT=" --ckpt=${CHECKPOINT}"
fi

CMD=" train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=$EPOCHS"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json,$DATA_DIR/librispeech-train-clean-360-wav.json,$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=10"
CMD+=" --eval_freq=100"
CMD+=" --train_freq=25"
CMD+=" --lr_decay"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS "
CMD+=" $CHECKPOINT"
CMD+=" $PREC"
CMD+=" $CUDNN"

if [ "${LAUNCH_OPT}" != "none" ]; then
  CMD="python -m $LAUNCH_OPT $CMD"
elif [ "$NUM_GPUS" -gt 1 ] ; then
  CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
  CMD="python3 $CMD"
fi

if [ "$CREATE_LOGFILE" = "true" ] ; then
  export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
  printf -v TAG "jasper_train_%s_gbs%d" "$PRECISION" $GBS
  DATESTAMP=`date +'%y%m%d%H%M%S'`
  LOGFILE=$RESULT_DIR/$TAG.$DATESTAMP.log
  printf "Logs written to %s\n" "$LOGFILE"
fi

set -x
if [ -z "$LOGFILE" ] ; then
  $CMD
else
  (
    $CMD
  ) |& tee "$LOGFILE"
fi
set +x
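An illustrative invocation spelling out the positional arguments above (the values shown simply restate the script defaults):

    bash scripts/train.sh /datasets/LibriSpeech \
        configs/jasper10x5dr_sp_offline_specaugment.toml \
        /results none true true 8 fp16 400 6 64 0.015 1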
131
PyTorch/SpeechRecognition/Jasper/scripts/train_benchmark.sh
Executable file
@@ -0,0 +1,131 @@
#!/bin/bash

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

echo "Container nvidia build = " $NVIDIA_BUILD_ID

DATA_DIR=${1:-"/datasets/LibriSpeech"}
MODEL_CONFIG=${2:-"configs/jasper10x5dr_sp_offline_specaugment.toml"}
RESULT_DIR=${3:-"/results"}
CREATE_LOGFILE=${4:-"true"}
CUDNN_BENCHMARK=${5:-"true"}
NUM_GPUS=${6:-8}
PRECISION=${7:-"fp16"}
NUM_STEPS=${8:-"-1"}
MAX_DURATION=${9:-16.7}
SEED=${10:-0}
BATCH_SIZE=${11:-64}
LEARNING_RATE=${12:-"0.015"}
GRADIENT_ACCUMULATION_STEPS=${13:-1}
PRINT_FREQUENCY=${14:-1}

PREC=""
if [ "$PRECISION" = "fp16" ] ; then
  PREC=" --fp16"
elif [ "$PRECISION" = "fp32" ] ; then
  PREC=""
else
  echo "Unknown <precision> argument"
  exit -2
fi

# -1 means "run full epochs"; any other value caps the number of steps.
STEPS=""
if [ "$NUM_STEPS" -ne "-1" ] ; then
  STEPS=" --num_steps=$NUM_STEPS"
fi

CUDNN=""
if [ "$CUDNN_BENCHMARK" = "true" ] ; then
  CUDNN=" --cudnn"
else
  CUDNN=""
fi

CMD=" train.py"
CMD+=" --batch_size=$BATCH_SIZE"
CMD+=" --num_epochs=400"
CMD+=" --output_dir=$RESULT_DIR"
CMD+=" --model_toml=$MODEL_CONFIG"
CMD+=" --lr=$LEARNING_RATE"
CMD+=" --seed=$SEED"
CMD+=" --optimizer=novograd"
CMD+=" --gradient_accumulation_steps=$GRADIENT_ACCUMULATION_STEPS"
CMD+=" --dataset_dir=$DATA_DIR"
CMD+=" --val_manifest=$DATA_DIR/librispeech-dev-clean-wav.json"
CMD+=" --train_manifest=$DATA_DIR/librispeech-train-clean-100-wav.json,$DATA_DIR/librispeech-train-clean-360-wav.json,$DATA_DIR/librispeech-train-other-500-wav.json"
CMD+=" --weight_decay=1e-3"
CMD+=" --save_freq=100000"
CMD+=" --eval_freq=100000"
CMD+=" --max_duration=$MAX_DURATION"
CMD+=" --pad_to_max"
CMD+=" --train_freq=$PRINT_FREQUENCY"
CMD+=" --lr_decay"
CMD+=" $CUDNN"
CMD+=" $PREC"
CMD+=" $STEPS"

if [ "$NUM_GPUS" -gt 1 ] ; then
  CMD="python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS $CMD"
else
  CMD="python3 $CMD"
fi

if [ "$CREATE_LOGFILE" = "true" ] ; then
  export GBS=$(expr $BATCH_SIZE \* $NUM_GPUS)
  printf -v TAG "jasper_train_benchmark_%s_gbs%d" "$PRECISION" $GBS
  DATESTAMP=`date +'%y%m%d%H%M%S'`
  LOGFILE="${RESULT_DIR}/${TAG}.${DATESTAMP}.log"
  printf "Logs written to %s\n" "$LOGFILE"
fi

if [ -z "$LOGFILE" ] ; then
  set -x
  $CMD
  set +x
else
  set -x
  (
    $CMD
  ) |& tee "$LOGFILE"
  set +x

  mean_latency=`cat "$LOGFILE" | grep 'Step time' | awk '{print $3}' | tail -n +2 | egrep -o '[0-9.]+' | awk 'BEGIN {total=0} {total+=$1} END {printf("%.2f\n",total/NR)}'`
  mean_throughput=`python -c "print($BATCH_SIZE*$NUM_GPUS/${mean_latency})"`
  training_wer_per_gpu=`cat "$LOGFILE" | grep 'training_batch_WER' | awk '{print $2}' | tail -n 1 | egrep -o '[0-9.]+'`
  training_loss_per_gpu=`cat "$LOGFILE" | grep 'Loss@Step' | awk '{print $4}' | tail -n 1 | egrep -o '[0-9.]+'`
  final_eval_wer=`cat "$LOGFILE" | grep 'Evaluation WER' | tail -n 1 | egrep -o '[0-9.]+'`
  final_eval_loss=`cat "$LOGFILE" | grep 'Evaluation Loss' | tail -n 1 | egrep -o '[0-9.]+'`

  echo "max duration: $MAX_DURATION s" | tee -a "$LOGFILE"
  echo "mean_latency: $mean_latency s" | tee -a "$LOGFILE"
  echo "mean_throughput: $mean_throughput sequences/s" | tee -a "$LOGFILE"
  echo "training_wer_per_gpu: $training_wer_per_gpu" | tee -a "$LOGFILE"
  echo "training_loss_per_gpu: $training_loss_per_gpu" | tee -a "$LOGFILE"
  echo "final_eval_loss: $final_eval_loss" | tee -a "$LOGFILE"
  echo "final_eval_wer: $final_eval_wer" | tee -a "$LOGFILE"
fi
387
PyTorch/SpeechRecognition/Jasper/train.py
Normal file
@@ -0,0 +1,387 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import itertools
import os
import time
import toml
import torch
import apex
from apex import amp
import random
import numpy as np
import math
from dataset import AudioToTextDataLayer
from helpers import monitor_asr_train_progress, process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, model_multi_gpu, print_dict, print_once
from model import AudioPreprocessing, CTCLossNM, GreedyCTCDecoder, Jasper
from optimizers import Novograd, AdamW

def lr_policy(initial_lr, step, N):
    """
    learning rate decay
    Args:
        initial_lr: base learning rate
        step: current iteration number
        N: total number of iterations over which learning rate is decayed
    """
    min_lr = 0.00001
    res = initial_lr * ((N - step) / N) ** 2
    return max(res, min_lr)

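For intuition, the quadratic decay above evaluated at a few steps, assuming initial_lr=0.015 and N=100000 (illustrative values, not repo defaults):

    # lr = max(initial_lr * ((N - step) / N) ** 2, 1e-5)
    # step=0       -> 0.015
    # step=50000   -> 0.015 * 0.25 = 0.00375
    # step=90000   -> 0.015 * 0.01 = 0.00015
    # step=100000  -> floor: 1e-5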
def save(model, optimizer, epoch, output_dir):
    """
    Saves model checkpoint
    Args:
        model: model
        optimizer: optimizer
        epoch: epoch of model training
        output_dir: path to save model checkpoint
    """
    class_name = model.__class__.__name__
    unix_time = time.time()
    file_name = "{0}_{1}-epoch-{2}.pt".format(class_name, unix_time, epoch)
    print_once("Saving module {0} in {1}".format(class_name, os.path.join(output_dir, file_name)))
    if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
        save_checkpoint = {
            'epoch': epoch,
            'state_dict': model_to_save.state_dict(),
            'optimizer': optimizer.state_dict()
        }

        torch.save(save_checkpoint, os.path.join(output_dir, file_name))
    print_once('Saved.')

def train(
        data_layer,
        data_layer_eval,
        model,
        ctc_loss,
        greedy_decoder,
        optimizer,
        optim_level,
        labels,
        multi_gpu,
        args,
        fn_lr_policy=None):
    """Trains model
    Args:
        data_layer: training data layer
        data_layer_eval: evaluation data layer
        model: model (encapsulates data processing, encoder, decoder)
        ctc_loss: loss function
        greedy_decoder: greedy ctc decoder
        optimizer: optimizer
        optim_level: AMP optimization level
        labels: list of output labels
        multi_gpu: true if multi gpu training
        args: script input argument list
        fn_lr_policy: learning rate adjustment function
    """
    def eval():
        """Evaluates model on the evaluation dataset
        """
        with torch.no_grad():
            _global_var_dict = {
                'EvalLoss': [],
                'predictions': [],
                'transcripts': [],
            }
            eval_dataloader = data_layer_eval.data_iterator
            for data in eval_dataloader:
                tensors = []
                for d in data:
                    if isinstance(d, torch.Tensor):
                        tensors.append(d.cuda())
                    else:
                        tensors.append(d)
                t_audio_signal_e, t_a_sig_length_e, t_transcript_e, t_transcript_len_e = tensors

                model.eval()
                t_log_probs_e, t_encoded_len_e = model(x=(t_audio_signal_e, t_a_sig_length_e))
                t_loss_e = ctc_loss(log_probs=t_log_probs_e, targets=t_transcript_e, input_length=t_encoded_len_e, target_length=t_transcript_len_e)
                t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)

                values_dict = dict(
                    loss=[t_loss_e],
                    predictions=[t_predictions_e],
                    transcript=[t_transcript_e],
                    transcript_length=[t_transcript_len_e]
                )
                process_evaluation_batch(values_dict, _global_var_dict, labels=labels)

            # final aggregation (across all workers and minibatches) and logging of results
            wer, eloss = process_evaluation_epoch(_global_var_dict)

            print_once("==========>>>>>>Evaluation Loss: {0}\n".format(eloss))
            print_once("==========>>>>>>Evaluation WER: {0}\n".format(wer))

    print_once("Starting .....")
    start_time = time.time()

    train_dataloader = data_layer.data_iterator
    epoch = args.start_epoch
    step = epoch * args.step_per_epoch

    while True:
        if multi_gpu:
            data_layer.sampler.set_epoch(epoch)
        print_once("Starting epoch {0}, step {1}".format(epoch, step))
        last_epoch_start = time.time()
        batch_counter = 0
        average_loss = 0
        for data in train_dataloader:
            tensors = []
            for d in data:
                if isinstance(d, torch.Tensor):
                    tensors.append(d.cuda())
                else:
                    tensors.append(d)

            if batch_counter == 0:
                if fn_lr_policy is not None:
                    adjusted_lr = fn_lr_policy(step)
                    for param_group in optimizer.param_groups:
                        param_group['lr'] = adjusted_lr
                optimizer.zero_grad()
                last_iter_start = time.time()

            t_audio_signal_t, t_a_sig_length_t, t_transcript_t, t_transcript_len_t = tensors
            model.train()
            t_log_probs_t, t_encoded_len_t = model(x=(t_audio_signal_t, t_a_sig_length_t))

            t_loss_t = ctc_loss(log_probs=t_log_probs_t, targets=t_transcript_t, input_length=t_encoded_len_t, target_length=t_transcript_len_t)
            if args.gradient_accumulation_steps > 1:
                t_loss_t = t_loss_t / args.gradient_accumulation_steps

            if optim_level in AmpOptimizations:
                with amp.scale_loss(t_loss_t, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                t_loss_t.backward()
            batch_counter += 1
            average_loss += t_loss_t.item()

            if batch_counter % args.gradient_accumulation_steps == 0:
                optimizer.step()

                if step % args.train_frequency == 0:
                    t_predictions_t = greedy_decoder(log_probs=t_log_probs_t)

                    e_tensors = [t_predictions_t, t_transcript_t, t_transcript_len_t]
                    train_wer = monitor_asr_train_progress(e_tensors, labels=labels)
                    print_once("Loss@Step: {0} ::::::: {1}".format(step, str(average_loss)))
                    print_once("Step time: {0} seconds".format(time.time() - last_iter_start))

                if step > 0 and step % args.eval_frequency == 0:
                    print_once("Doing Evaluation ....................... ...... ... .. . .")
                    eval()
                step += 1
                batch_counter = 0
                average_loss = 0
                if args.num_steps is not None and step >= args.num_steps:
                    break

        if args.num_steps is not None and step >= args.num_steps:
            break
        print_once("Finished epoch {0} in {1}".format(epoch, time.time() - last_epoch_start))
        epoch += 1
        if epoch % args.save_frequency == 0 and epoch > 0:
            save(model, optimizer, epoch, output_dir=args.output_dir)
        if args.num_steps is None and epoch >= args.num_epochs:
            break
    print_once("Done in {0}".format(time.time() - start_time))
    print_once("Final Evaluation ....................... ...... ... .. . .")
    eval()
    save(model, optimizer, epoch, output_dir=args.output_dir)

def main(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    assert(torch.cuda.is_available())
    torch.backends.cudnn.benchmark = args.cudnn

    # set up distributed training
    if args.local_rank is not None:
        torch.cuda.set_device(args.local_rank)
        torch.distributed.init_process_group(backend='nccl', init_method='env://')

    multi_gpu = torch.distributed.is_initialized()
    if multi_gpu:
        print_once("DISTRIBUTED TRAINING with {} gpus".format(torch.distributed.get_world_size()))

    # define amp optimization level
    if args.fp16:
        optim_level = Optimization.mxprO1
    else:
        optim_level = Optimization.mxprO0

    jasper_model_definition = toml.load(args.model_toml)
    dataset_vocab = jasper_model_definition['labels']['labels']
    ctc_vocab = add_ctc_labels(dataset_vocab)

    train_manifest = args.train_manifest
    val_manifest = args.val_manifest
    featurizer_config = jasper_model_definition['input']
    featurizer_config_eval = jasper_model_definition['input_eval']
    featurizer_config["optimization_level"] = optim_level
    featurizer_config_eval["optimization_level"] = optim_level

    sampler_type = featurizer_config.get("sampler", 'default')
    perturb_config = jasper_model_definition.get('perturb', None)
    if args.pad_to_max:
        assert(args.max_duration > 0)
        featurizer_config['max_duration'] = args.max_duration
        featurizer_config_eval['max_duration'] = args.max_duration
        featurizer_config['pad_to'] = "max"
        featurizer_config_eval['pad_to'] = "max"
    print_once('model_config')
    print_dict(jasper_model_definition)

    if args.gradient_accumulation_steps < 1:
        raise ValueError('Invalid gradient accumulation steps parameter {}'.format(args.gradient_accumulation_steps))
    if args.batch_size % args.gradient_accumulation_steps != 0:
        raise ValueError('batch size {} is not divisible by gradient accumulation steps {}'.format(args.batch_size, args.gradient_accumulation_steps))

    data_layer = AudioToTextDataLayer(
        dataset_dir=args.dataset_dir,
        featurizer_config=featurizer_config,
        perturb_config=perturb_config,
        manifest_filepath=train_manifest,
        labels=dataset_vocab,
        batch_size=args.batch_size // args.gradient_accumulation_steps,
        multi_gpu=multi_gpu,
        pad_to_max=args.pad_to_max,
        sampler=sampler_type)

    data_layer_eval = AudioToTextDataLayer(
        dataset_dir=args.dataset_dir,
        featurizer_config=featurizer_config_eval,
        manifest_filepath=val_manifest,
        labels=dataset_vocab,
        batch_size=args.batch_size,
        multi_gpu=multi_gpu,
        pad_to_max=args.pad_to_max
    )

    model = Jasper(feature_config=featurizer_config, jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))

    if args.ckpt is not None:
        print_once("loading model from {}".format(args.ckpt))
        checkpoint = torch.load(args.ckpt, map_location="cpu")
        model.load_state_dict(checkpoint['state_dict'], strict=True)
        args.start_epoch = checkpoint['epoch']
    else:
        args.start_epoch = 0

    ctc_loss = CTCLossNM(num_classes=len(ctc_vocab))
    greedy_decoder = GreedyCTCDecoder()

    print_once("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
    print_once("Number of parameters in decoder: {0}".format(model.jasper_decoder.num_weights()))

    N = len(data_layer)
    if sampler_type == 'default':
        args.step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
    elif sampler_type == 'bucket':
        args.step_per_epoch = int(len(data_layer.sampler) / args.batch_size)

    print_once('-----------------')
    print_once('Have {0} examples to train on.'.format(N))
    print_once('Have {0} steps / (gpu * epoch).'.format(args.step_per_epoch))
    print_once('-----------------')

    fn_lr_policy = lambda s: lr_policy(args.lr, s, args.num_epochs * args.step_per_epoch)

    model.cuda()

    if args.optimizer_kind == "novograd":
        optimizer = Novograd(model.parameters(),
                             lr=args.lr,
                             weight_decay=args.weight_decay)
    elif args.optimizer_kind == "adam":
        optimizer = AdamW(model.parameters(),
                          lr=args.lr,
                          weight_decay=args.weight_decay)
    else:
        raise ValueError("invalid optimizer choice: {}".format(args.optimizer_kind))

    if optim_level in AmpOptimizations:
        model, optimizer = amp.initialize(
            min_loss_scale=1.0,
            models=model,
            optimizers=optimizer,
            opt_level=AmpOptimizations[optim_level])

    if args.ckpt is not None:
        optimizer.load_state_dict(checkpoint['optimizer'])

    model = model_multi_gpu(model, multi_gpu)

    train(
        data_layer=data_layer,
        data_layer_eval=data_layer_eval,
        model=model,
        ctc_loss=ctc_loss,
        greedy_decoder=greedy_decoder,
        optimizer=optimizer,
        labels=ctc_vocab,
        optim_level=optim_level,
        multi_gpu=multi_gpu,
        fn_lr_policy=fn_lr_policy if args.lr_decay else None,
        args=args)

def parse_args():
|
||||
parser = argparse.ArgumentParser(description='Jasper')
|
||||
parser.add_argument("--local_rank", default=None, type=int)
|
||||
parser.add_argument("--batch_size", default=16, type=int, help='data batch size')
|
||||
parser.add_argument("--num_epochs", default=10, type=int, help='number of training epochs. if number of steps if specified will overwrite this')
|
||||
parser.add_argument("--num_steps", default=None, type=int, help='if specified overwrites num_epochs and will only train for this number of iterations')
|
||||
parser.add_argument("--save_freq", dest="save_frequency", default=300, type=int, help='number of epochs until saving checkpoint. will save at the end of training too.')
|
||||
parser.add_argument("--eval_freq", dest="eval_frequency", default=200, type=int, help='number of iterations until doing evaluation on full dataset')
|
||||
parser.add_argument("--train_freq", dest="train_frequency", default=25, type=int, help='number of iterations until printing training statistics on the past iteration')
|
||||
parser.add_argument("--lr", default=1e-3, type=float, help='learning rate')
|
||||
parser.add_argument("--weight_decay", default=1e-3, type=float, help='weight decay rate')
|
||||
parser.add_argument("--train_manifest", type=str, required=True, help='relative path given dataset folder of training manifest file')
|
||||
parser.add_argument("--model_toml", type=str, required=True, help='relative path given dataset folder of model configuration file')
|
||||
parser.add_argument("--val_manifest", type=str, required=True, help='relative path given dataset folder of evaluation manifest file')
|
||||
parser.add_argument("--max_duration", type=float, help='maximum duration of audio samples for training and evaluation')
|
||||
parser.add_argument("--pad_to_max", action="store_true", default=False, help="pad sequence to max_duration")
|
||||
parser.add_argument("--gradient_accumulation_steps", default=1, type=int, help='number of accumulation steps')
|
||||
parser.add_argument("--optimizer", dest="optimizer_kind", default="novograd", type=str, help='optimizer')
|
||||
parser.add_argument("--dataset_dir", dest="dataset_dir", required=True, type=str, help='root dir of dataset')
|
||||
parser.add_argument("--lr_decay", action="store_true", default=False, help='use learning rate decay')
|
||||
parser.add_argument("--cudnn", action="store_true", default=False, help="enable cudnn benchmark")
|
||||
parser.add_argument("--fp16", action="store_true", default=False, help="use mixed precision training")
|
||||
parser.add_argument("--output_dir", type=str, required=True, help='saves results in this directory')
|
||||
parser.add_argument("--ckpt", default=None, type=str, help="if specified continues training from given checkpoint. Otherwise starts from beginning")
|
||||
parser.add_argument("--seed", default=42, type=int, help='seed')
|
||||
args=parser.parse_args()
|
||||
return args
|
||||
|
||||
|
||||
if __name__=="__main__":
|
||||
args = parse_args()
|
||||
print_dict(vars(args))
|
||||
main(args)
|
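How this script is launched is not shown in the diff; the following is a minimal sketch, assuming the file above is saved as train.py and relying on the env:// NCCL initialization and --local_rank handling it implements. All paths, manifest names, and the config file name are illustrative:

    # single GPU:
    python train.py --batch_size=16 --num_epochs=10 \
        --dataset_dir=/datasets/LibriSpeech \
        --train_manifest=train-manifest.json \
        --val_manifest=dev-manifest.json \
        --model_toml=jasper_config.toml \
        --output_dir=results/

    # 8 GPUs with mixed precision; torch.distributed.launch supplies --local_rank:
    python -m torch.distributed.launch --nproc_per_node=8 train.py --fp16 \
        --dataset_dir=/datasets/LibriSpeech \
        --train_manifest=train-manifest.json \
        --val_manifest=dev-manifest.json \
        --model_toml=jasper_config.toml \
        --output_dir=results/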
0
PyTorch/SpeechRecognition/Jasper/utils/__init__.py
Normal file

@ -0,0 +1,81 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


#!/usr/bin/env python
import argparse
import os
import glob
import multiprocessing
import json

import pandas as pd

from preprocessing_utils import parallel_preprocess

parser = argparse.ArgumentParser(description='Preprocess LibriSpeech.')
parser.add_argument('--input_dir', type=str, required=True,
                    help='LibriSpeech collection input dir')
parser.add_argument('--dest_dir', type=str, required=True,
                    help='Output dir')
parser.add_argument('--output_json', type=str, default='./',
                    help='Name of the output json file.')
parser.add_argument('-s', '--speed', type=float, nargs='*',
                    help='Speed perturbation ratio')
parser.add_argument('--target_sr', type=int, default=None,
                    help='Target sample rate. '
                         'Defaults to the input sample rate')
parser.add_argument('--overwrite', action='store_true',
                    help='Overwrite file if it exists')
parser.add_argument('--parallel', type=int, default=multiprocessing.cpu_count(),
                    help='Number of threads to use when processing audio files')
args = parser.parse_args()

args.input_dir = args.input_dir.rstrip('/')
args.dest_dir = args.dest_dir.rstrip('/')


def build_input_arr(input_dir):
    txt_files = glob.glob(os.path.join(input_dir, '**', '*.trans.txt'),
                          recursive=True)
    input_data = []
    for txt_file in txt_files:
        rel_path = os.path.relpath(txt_file, input_dir)
        with open(txt_file) as fp:
            for line in fp:
                fname, _, transcript = line.partition(' ')
                input_data.append(dict(input_relpath=os.path.dirname(rel_path),
                                       input_fname=fname + '.flac',
                                       transcript=transcript))
    return input_data


print("[%s] Scanning input dir..." % args.output_json)
dataset = build_input_arr(input_dir=args.input_dir)

print("[%s] Converting audio files..." % args.output_json)
dataset = parallel_preprocess(dataset=dataset,
                              input_dir=args.input_dir,
                              dest_dir=args.dest_dir,
                              target_sr=args.target_sr,
                              speed=args.speed,
                              overwrite=args.overwrite,
                              parallel=args.parallel)

print("[%s] Generating json..." % args.output_json)
df = pd.DataFrame(dataset, dtype=object)

# Save json with python. df.to_json() produces backslashes in file paths
dataset = df.to_dict(orient='records')
with open(args.output_json, 'w') as fp:
    json.dump(dataset, fp, indent=2)
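For reference, one record in the resulting JSON manifest would look roughly like the sketch below, given the fields assembled by preprocess() in preprocessing_utils.py further down. Values and paths are illustrative, and the files entries also carry the remaining metadata keys returned by sox.file_info.info:

    [
      {
        "transcript": "go do you hear",
        "files": [
          {
            "fname": "LibriSpeech-wav/dev-clean/84/121123/84-121123-0000.wav",
            "speed": 1,
            "duration": 2.35,
            "num_samples": 37600
          }
        ],
        "original_duration": 2.35,
        "original_num_samples": 37600
      }
    ]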
@ -0,0 +1,72 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python

import os
import argparse
import pandas as pd

from download_utils import download_file, md5_checksum, extract

parser = argparse.ArgumentParser(description='Download, verify and extract dataset files')
parser.add_argument('csv', type=str,
                    help='CSV file with urls and checksums to download.')
parser.add_argument('dest', type=str,
                    help='Download destination folder.')
parser.add_argument('-e', type=str, default=None,
                    help='Extraction destination folder. Defaults to the download folder if not provided')
parser.add_argument('--skip_download', action='store_true',
                    help='Skip downloading the files')
parser.add_argument('--skip_checksum', action='store_true',
                    help='Skip checksum verification')
parser.add_argument('--skip_extract', action='store_true',
                    help='Skip extracting files')
args = parser.parse_args()
args.e = args.e or args.dest


df = pd.read_csv(args.csv, delimiter=',')

if not args.skip_download:
    for url in df.url:
        fname = url.split('/')[-1]
        print("Downloading %s:" % fname)
        download_file(url=url, dest_folder=args.dest, fname=fname)
else:
    print("Skipping file download")


if not args.skip_checksum:
    for index, row in df.iterrows():
        url = row['url']
        md5 = row['md5']
        fname = url.split('/')[-1]
        fpath = os.path.join(args.dest, fname)
        print("Verifying %s: " % fname, end='')
        ret = md5_checksum(fpath=fpath, target_hash=md5)
        print("Passed" if ret else "Failed")
else:
    print("Skipping checksum")


if not args.skip_extract:
    for url in df.url:
        fname = url.split('/')[-1]
        fpath = os.path.join(args.dest, fname)
        print("Decompressing %s:" % fpath)
        extract(fpath=fpath, dest_folder=args.e)
else:
    print("Skipping file extraction")
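A minimal usage sketch, assuming the script above is saved as utils/download_librispeech.py (its file name is not visible in this diff) and run against the librispeech.csv listed below; the destination paths are illustrative:

    python utils/download_librispeech.py utils/librispeech.csv /datasets -e /datasets

    # re-run later to verify checksums only, without downloading or extracting again:
    python utils/download_librispeech.py utils/librispeech.csv /datasets --skip_download --skip_extract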
68
PyTorch/SpeechRecognition/Jasper/utils/download_utils.py
Normal file

@ -0,0 +1,68 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python

import hashlib
import requests
import os
import tarfile
import tqdm


def download_file(url, dest_folder, fname, overwrite=False):
    fpath = os.path.join(dest_folder, fname)
    if os.path.isfile(fpath):
        if overwrite:
            print("Overwriting existing file")
        else:
            print("File exists, skipping download.")
            return

    tmp_fpath = fpath + '.tmp'

    r = requests.get(url, stream=True)
    file_size = int(r.headers['Content-Length'])
    chunk_size = 1024 * 1024  # 1MB
    total_chunks = int(file_size / chunk_size)

    with open(tmp_fpath, 'wb') as fp:
        content_iterator = r.iter_content(chunk_size=chunk_size)
        chunks = tqdm.tqdm(content_iterator, total=total_chunks,
                           unit='MB', desc=fpath, leave=True)
        for chunk in chunks:
            fp.write(chunk)

    os.rename(tmp_fpath, fpath)


def md5_checksum(fpath, target_hash):
    file_hash = hashlib.md5()
    with open(fpath, "rb") as fp:
        for chunk in iter(lambda: fp.read(1024 * 1024), b""):
            file_hash.update(chunk)
    return file_hash.hexdigest() == target_hash


def extract(fpath, dest_folder):
    if fpath.endswith('.tar.gz'):
        mode = 'r:gz'
    elif fpath.endswith('.tar'):
        mode = 'r:'
    else:
        raise IOError('fpath has unknown extension: %s' % fpath)

    with tarfile.open(fpath, mode) as tar:
        members = tar.getmembers()
        for member in tqdm.tqdm(iterable=members, total=len(members), leave=True):
            tar.extract(path=dest_folder, member=member)
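A minimal sketch of the three helpers used in isolation, taking the dev-clean entry from librispeech.csv below; the /datasets path is illustrative:

    download_file(url='http://www.openslr.org/resources/12/dev-clean.tar.gz',
                  dest_folder='/datasets', fname='dev-clean.tar.gz')
    ok = md5_checksum(fpath='/datasets/dev-clean.tar.gz',
                      target_hash='42e2234ba48799c1f50f24a7926300a1')
    if ok:
        extract(fpath='/datasets/dev-clean.tar.gz', dest_folder='/datasets')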
8
PyTorch/SpeechRecognition/Jasper/utils/librispeech.csv
Normal file

@ -0,0 +1,8 @@
url,md5
http://www.openslr.org/resources/12/dev-clean.tar.gz,42e2234ba48799c1f50f24a7926300a1
http://www.openslr.org/resources/12/dev-other.tar.gz,c8d0bcc9cca99d4f8b62fcc847357931
http://www.openslr.org/resources/12/test-clean.tar.gz,32fa31d27d2e1cad72775fee3f4849a9
http://www.openslr.org/resources/12/test-other.tar.gz,fb5a50374b501bb3bac4815ee91d3135
http://www.openslr.org/resources/12/train-clean-100.tar.gz,2a93770f6d5c6c964bc36631d331a522
http://www.openslr.org/resources/12/train-clean-360.tar.gz,c0e676e450a7ff2f54aeade5171606fa
http://www.openslr.org/resources/12/train-other-500.tar.gz,d1a0fd59409feb2c614ce4d30c387708

@ -0,0 +1,76 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#!/usr/bin/env python
import os
import multiprocessing
import librosa
import functools

import sox

from tqdm import tqdm


def preprocess(data, input_dir, dest_dir, target_sr=None, speed=None,
               overwrite=True):
    speed = speed or []
    speed.append(1)
    speed = list(set(speed))  # Make unique

    input_fname = os.path.join(input_dir,
                               data['input_relpath'],
                               data['input_fname'])
    input_sr = sox.file_info.sample_rate(input_fname)
    target_sr = target_sr or input_sr

    os.makedirs(os.path.join(dest_dir, data['input_relpath']), exist_ok=True)

    output_dict = {}
    output_dict['transcript'] = data['transcript'].lower().strip()
    output_dict['files'] = []

    fname = os.path.splitext(data['input_fname'])[0]
    for s in speed:
        output_fname = fname + '{}.wav'.format('' if s == 1 else '-{}'.format(s))
        output_fpath = os.path.join(dest_dir,
                                    data['input_relpath'],
                                    output_fname)

        if not os.path.exists(output_fpath) or overwrite:
            cbn = sox.Transformer().speed(factor=s).convert(target_sr)
            cbn.build(input_fname, output_fpath)

        file_info = sox.file_info.info(output_fpath)
        file_info['fname'] = os.path.join(os.path.basename(dest_dir),
                                          data['input_relpath'],
                                          output_fname)
        file_info['speed'] = s
        output_dict['files'].append(file_info)

        if s == 1:
            file_info = sox.file_info.info(output_fpath)
            output_dict['original_duration'] = file_info['duration']
            output_dict['original_num_samples'] = file_info['num_samples']

    return output_dict


def parallel_preprocess(dataset, input_dir, dest_dir, target_sr, speed, overwrite, parallel):
    with multiprocessing.Pool(parallel) as p:
        func = functools.partial(preprocess,
                                 input_dir=input_dir, dest_dir=dest_dir,
                                 target_sr=target_sr, speed=speed, overwrite=overwrite)
        dataset = list(tqdm(p.imap(func, dataset), total=len(dataset)))
    return dataset
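A minimal sketch of how parallel_preprocess consumes the records built by build_input_arr in the conversion script above; all paths and the transcript are illustrative:

    sample = dict(input_relpath='dev-clean/84/121123',
                  input_fname='84-121123-0000.flac',
                  transcript='GO DO YOU HEAR\n')
    out = parallel_preprocess(dataset=[sample],
                              input_dir='/datasets/LibriSpeech',
                              dest_dir='/datasets/LibriSpeech-wav',
                              target_sr=16000,
                              speed=[0.9, 1.1],
                              overwrite=False,
                              parallel=4)
    # out[0]['files'] then holds metadata for the 1.0x, 0.9x and 1.1x wav files.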
@ -33,6 +33,9 @@ The examples are organized first by framework, such as TensorFlow, PyTorch, etc.
### Text to Speech
- __Tacotron & WaveGlow__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)]

### Speech Recognition
- __Jasper__ [[PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper)]

## NVIDIA support
In each of the network READMEs, we indicate the level of support that will be provided. The range is from ongoing updates and improvements to a point-in-time release for thought leadership.
0
TensorFlow/LanguageModeling/BERT/data/bookcorpus/config.sh
Executable file → Normal file
0
TensorFlow/LanguageModeling/BERT/data/wikipedia_corpus/config.sh
Executable file → Normal file
0
TensorFlow/LanguageModeling/BERT/scripts/docker/build.sh
Normal file → Executable file