Merge r1.4 bugfixes to main (#2918)

* update package info

Signed-off-by: ericharper <complex451@gmail.com>

* update branch for jenkinsfile and dockerfile

Signed-off-by: ericharper <complex451@gmail.com>

* Adding conformer-transducer models. (#2717)

* added the models.

Signed-off-by: Vahid <vnoroozi@nvidia.com>

* added contextnet models.

Signed-off-by: Vahid <vnoroozi@nvidia.com>

* added german and chinese models.

Signed-off-by: Vahid <vnoroozi@nvidia.com>

* fix the abs_pos of conformer. (#2863)

Signed-off-by: Vahid <vnoroozi@nvidia.com>

* update to match sde (#2867)

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* updated german ngc model (#2871)

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* Lower bound PTL to safe version (#2876)

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update notebooks with onnxruntime (#2880)

Signed-off-by: smajumdar <titu1994@gmail.com>

* Upperbound PTL (#2881)

Signed-off-by: smajumdar <titu1994@gmail.com>

* minor typo and broken link fixes (#2883)

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* Remove numbers from TTS tutorial names (#2882)

* Remove numbers from TTS tutorial names

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Update documentation links

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Typos (#2884)

* segmentation tutorial fix

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* data fixes

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* updated the messages in eval_beamsearch_ngram.py. (#2889)

Signed-off-by: Vahid <vnoroozi@nvidia.com>

* style (#2890)

Signed-off-by: Jason <jasoli@nvidia.com>

* Fix broken link (#2891)

* fix broken link

Signed-off-by: fayejf <fayejf07@gmail.com>

* more

Signed-off-by: fayejf <fayejf07@gmail.com>

* Update sclite eval for new transcription method (#2893)

* Update sclite to use updated inference

Signed-off-by: smajumdar <titu1994@gmail.com>

* Remove WER

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update sclite script to use new inference methods

Signed-off-by: smajumdar <titu1994@gmail.com>

* Remove hub 5

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix TransformerDecoder export - r1.4 (#2875)

* export fix

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* embedding pos

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* remove bool param

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* changes

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>

* Update Finetuning notebook (#2906)

* update notebook

Signed-off-by: Jason <jasoli@nvidia.com>

* rename

Signed-off-by: Jason <jasoli@nvidia.com>

* rename

Signed-off-by: Jason <jasoli@nvidia.com>

* revert branch to main

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Vahid Noroozi <VahidooX@users.noreply.github.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Abhinav Khattar <aklife97@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Eric Harper 2021-09-28 20:13:55 -06:00 committed by GitHub
parent c88cfc42eb
commit 58bc1d2c6c
36 changed files with 1096 additions and 990 deletions

@ -44,7 +44,7 @@ Key Features
* Speech processing
* `Automatic Speech recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_: Jasper, QuartzNet, CitriNet, Conformer
* `Speech Classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (command recognition), MarbleNet (voice activity detection)
* `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: SpeakerNet, TDNN-Attention
* `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: SpeakerNet, ECAPA_TDNN
* `Speaker Diarization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html>`_: MarbleNet + SpeakerNet
* `NGC collection of pre-trained speech processing models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_asr>`_
* Natural Language Processing

@ -1,3 +1,4 @@
Model,Model Base Class,Model Card
stt_de_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_quartznet15x5"
stt_de_citrinet_1024,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_citrinet_1024"

@ -7,9 +7,17 @@ stt_en_citrinet_1024,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nv
stt_en_citrinet_256_gamma_0_25,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_256_gamma_0_25"
stt_en_citrinet_512_gamma_0_25,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_512_gamma_0_25"
stt_en_citrinet_1024_gamma_0_25,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_1024_gamma_0_25"
stt_en_contextnet_256_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_256_mls"
stt_en_contextnet_512_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_512_mls"
stt_en_contextnet_1024_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_1024_mls"
stt_en_contextnet_512,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_512"
stt_en_contextnet_1024,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_1024"
stt_en_conformer_ctc_small,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_small"
stt_en_conformer_ctc_medium,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_medium"
stt_en_conformer_ctc_large,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large"
stt_en_conformer_ctc_small_ls,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_small_ls"
stt_en_conformer_ctc_medium_ls,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_medium_ls"
stt_en_conformer_ctc_large_ls,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large_ls"
stt_en_conformer_transducer_small,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_small"
stt_en_conformer_transducer_medium,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_medium"
stt_en_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large"
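
For reference, any checkpoint listed above can be pulled by name from NGC. A minimal sketch, assuming nemo_toolkit[asr] is installed and sample.wav is a placeholder path; the tuple handling mirrors the RNNT note in the sclite script change later in this commit:

import nemo.collections.asr as nemo_asr

# ASRModel.from_pretrained resolves the catalog name to the right subclass
# (EncDecRNNTBPEModel for the conformer-transducer entries above).
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_transducer_large")

hyps = model.transcribe(["sample.wav"], batch_size=4)
if isinstance(hyps, tuple):  # RNNT models return (best_hypotheses, all_hypotheses)
    hyps = hyps[0]
print(hyps)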

@ -1,2 +1,3 @@
Model,Model Base Class,Model Card
stt_zh_citrinet_512,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_512"
stt_zh_citrinet_512,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_512"
stt_zh_citrinet_1024_gamma_0_25,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_1024_gamma_0_25"

@ -117,13 +117,19 @@ To run a tutorial:
- `Relation Extraction - BioMegatron <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb>`_
* - TTS
- Speech Synthesis
- `TTS Inference <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/1_TTS_inference.ipynb>`_
- `TTS Inference <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb>`_
* - TTS
- Speech Synthesis
- `Tacotron2 Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/2_TTS_Tacotron2_Training.ipynb>`_
- `FastPitch Duration and Pitch Control <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_DurationPitchControl.ipynb>`_
* - TTS
- Speech Synthesis
- `TalkNet Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/3_TTS_TalkNet_Training.ipynb>`_
- `Tacotron2 Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Tacotron2_Training.ipynb>`_
* - TTS
- Speech Synthesis
- `TalkNet Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/TalkNet_Training.ipynb>`_
* - TTS
- Speech Synthesis
- `FastPitch Fine-Tuning <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb>`_
* - Tools
- CTC Segmentation
- `CTC Segmentation <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tools/CTC_Segmentation_Tutorial.ipynb>`_

@ -17,6 +17,17 @@ This script is based on speech_to_text_infer.py and allows you to score the hypo
with sclite. A local installation from https://github.com/usnistgov/SCTK is required.
Hypotheses and references are first saved in trn format and are scored after applying a glm
file (if provided).
# Usage
python speech_to_text_sclite.py \
--asr_model="<Path to ASR Model>" \
--dataset="<Path to manifest file>" \
--out_dir="<Path to output dir, should be unique per model evaluated>" \
--sctk_dir="<Path to root directory where SCTK is installed>" \
--glm="<OPTIONAL: Path to glm file>" \
--batch_size=4
"""
import errno
@ -27,8 +38,7 @@ from argparse import ArgumentParser
import torch
from nemo.collections.asr.metrics.wer import WER
from nemo.collections.asr.models import EncDecCTCModel
from nemo.collections.asr.models import ASRModel
from nemo.utils import logging
try:
@ -64,6 +74,17 @@ def score_with_sctk(sctk_dir, ref_fname, hyp_fname, out_dir, glm=""):
_ = subprocess.check_output(f"{sclite_path} -h {hypglm} -r {refglm} -i wsj -o all", shell=True)
def read_manifest(manifest_path):
manifest_data = []
with open(manifest_path, 'r') as f:
for line in f:
data = json.loads(line)
manifest_data.append(data)
logging.info('Loaded manifest data')
return manifest_data
can_gpu = torch.cuda.is_available()
@ -84,12 +105,6 @@ def main():
)
parser.add_argument("--dataset", type=str, required=True, help="path to evaluation data")
parser.add_argument("--batch_size", type=int, default=4)
parser.add_argument(
"--dont_normalize_text",
default=False,
action='store_true',
help="Turn off trasnscript normalization. Recommended for non-English.",
)
parser.add_argument("--out_dir", type=str, required=True, help="Destination dir for output files")
parser.add_argument("--sctk_dir", type=str, required=False, default="", help="Path to sctk root dir")
parser.add_argument("--glm", type=str, required=False, default="", help="Path to glm file")
@ -97,48 +112,33 @@ def main():
torch.set_grad_enabled(False)
if not os.path.exists(args.out_dir):
os.makedirs(args.out_dir)
os.makedirs(args.out_dir, exist_ok=True)
use_sctk = os.path.exists(args.sctk_dir)
if args.asr_model.endswith('.nemo'):
logging.info(f"Using local ASR model from {args.asr_model}")
asr_model = EncDecCTCModel.restore_from(restore_path=args.asr_model)
asr_model = ASRModel.restore_from(restore_path=args.asr_model, map_location='cpu')
else:
logging.info(f"Using NGC cloud ASR model {args.asr_model}")
asr_model = EncDecCTCModel.from_pretrained(model_name=args.asr_model)
asr_model.setup_test_data(
test_data_config={
'sample_rate': 16000,
'manifest_filepath': args.dataset,
'labels': asr_model.decoder.vocabulary,
'batch_size': args.batch_size,
'normalize_transcripts': not args.dont_normalize_text,
}
)
asr_model = ASRModel.from_pretrained(model_name=args.asr_model, map_location='cpu')
if can_gpu:
asr_model = asr_model.cuda()
asr_model.eval()
labels_map = dict([(i, asr_model.decoder.vocabulary[i]) for i in range(len(asr_model.decoder.vocabulary))])
wer = WER(vocabulary=asr_model.decoder.vocabulary)
hypotheses = []
references = []
all_log_probs = []
for test_batch in asr_model.test_dataloader():
if can_gpu:
test_batch = [x.cuda() for x in test_batch]
with autocast():
log_probs, encoded_len, greedy_predictions = asr_model(
input_signal=test_batch[0], input_signal_length=test_batch[1]
)
for r in log_probs.cpu().numpy():
all_log_probs.append(r)
hypotheses += wer.ctc_decoder_predictions_tensor(greedy_predictions)
for batch_ind in range(greedy_predictions.shape[0]):
reference = ''.join([labels_map[c] for c in test_batch[2][batch_ind].cpu().detach().numpy()])
references.append(reference)
del test_batch
asr_model.eval()
manifest_data = read_manifest(args.dataset)
references = [data['text'] for data in manifest_data]
audio_filepaths = [data['audio_filepath'] for data in manifest_data]
with autocast():
hypotheses = asr_model.transcribe(audio_filepaths, batch_size=args.batch_size)
# if transcriptions form a tuple (from RNNT), extract just "best" hypothesis
if type(hypotheses) == tuple and len(hypotheses) == 2:
hypotheses = hypotheses[0]
info_list = get_utt_info(args.dataset)
hypfile = os.path.join(args.out_dir, "hyp.trn")
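
For context, the manifest consumed by read_manifest() above is NeMo's usual JSON-lines format; a small illustrative sketch (file name and paths are placeholders):

import json

rows = [
    {"audio_filepath": "/data/wav/utt1.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "/data/wav/utt2.wav", "duration": 1.9, "text": "good morning"},
]
with open("eval_manifest.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# The updated script transcribes the "audio_filepath" entries with ASRModel.transcribe()
# and scores the hypotheses against the "text" fields via sclite.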

@ -22,7 +22,7 @@ Diarization Error Rate (DER) table of `ecapa_tdnn.nemo` model on well known eval
* All models were tested using embedding extractor with window size 1.5s and shift length 0.75s
* The above result is based on the oracle Voice Activity Detection (VAD) result.
* This result is based on [ecapa_tdnn.nemo](https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn) model which will be soon uploaded on NGC.
* This result is based on [ecapa_tdnn.nemo](https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn) model.
<br/>

@ -98,7 +98,7 @@ class EncDecCTCModelBPE(EncDecCTCModel, ASRBPEMixin):
model = PretrainedModelInfo(
pretrained_model_name="stt_de_citrinet_1024",
description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_citrinet_1024",
location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_de_citrinet_1024/versions/1.3.0/files/stt_de_citrinet_1024.nemo",
location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_de_citrinet_1024/versions/1.3.2/files/stt_de_citrinet_1024.nemo",
)
results.append(model)

@ -86,9 +86,23 @@ class EncDecRNNTBPEModel(EncDecRNNTModel, ASRBPEMixin):
results.append(model)
model = PretrainedModelInfo(
pretrained_model_name="stt_en_conformer_transducer_large_mls",
description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large_mls",
location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_large_mls/versions/1.4.0/files/stt_en_conformer_transducer_large_mls.nemo",
pretrained_model_name="stt_en_conformer_transducer_small",
description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_small",
location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_small/versions/1.4.0/files/stt_en_conformer_transducer_small.nemo",
)
results.append(model)
model = PretrainedModelInfo(
pretrained_model_name="stt_en_conformer_transducer_medium",
description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_medium",
location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_medium/versions/1.4.0/files/stt_en_conformer_transducer_medium.nemo",
)
results.append(model)
model = PretrainedModelInfo(
pretrained_model_name="stt_en_conformer_transducer_large",
description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large",
location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_large/versions/1.4.0/files/stt_en_conformer_transducer_large.nemo",
)
results.append(model)
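
The registrations above are what list_available_models() reports; a quick sketch to confirm the new names show up:

from nemo.collections.asr.models import EncDecRNNTBPEModel

for info in EncDecRNNTBPEModel.list_available_models():
    print(info.pretrained_model_name)  # includes the new stt_en_conformer_transducer_* entries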

@ -266,7 +266,7 @@ class PositionalEncoding(torch.nn.Module):
x+pos_emb (torch.Tensor): Its shape is (batch, time, feature_size)
pos_emb (torch.Tensor): Its shape is (1, time, feature_size)
"""
self.extend_pe(x)
self.extend_pe(length=x.size(1))
if self.pe.dtype != x.dtype or self.pe.device != x.device:
self.pe = self.pe.to(device=x.device, dtype=x.dtype)
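
The fix above passes the sequence length to extend_pe rather than the tensor itself. A schematic of that pattern, as a simplified stand-in rather than NeMo's actual PositionalEncoding:

import math
import torch


class SimplePositionalEncoding(torch.nn.Module):
    """Toy absolute positional encoding whose cached table grows on demand."""

    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.register_buffer("pe", torch.zeros(1, 0, d_model), persistent=False)

    def extend_pe(self, length):
        # Only rebuild the table when the requested length exceeds the cache.
        if self.pe.size(1) >= length:
            return
        position = torch.arange(length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, self.d_model, 2, dtype=torch.float) * (-math.log(10000.0) / self.d_model)
        )
        pe = torch.zeros(length, self.d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0)

    def forward(self, x):
        self.extend_pe(length=x.size(1))  # mirrors the corrected call site above
        return x + self.pe[:, : x.size(1)].to(dtype=x.dtype, device=x.device)


print(SimplePositionalEncoding(8)(torch.randn(2, 16, 8)).shape)  # torch.Size([2, 16, 8])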

@ -16,7 +16,7 @@ from abc import ABC
from typing import Any, Dict, Optional
from nemo.core.classes import NeuralModule
from nemo.core.neural_types import ChannelType, MaskType, NeuralType
from nemo.core.neural_types import ChannelType, EncodedRepresentation, MaskType, NeuralType
__all__ = ['DecoderModule']
@ -31,6 +31,7 @@ class DecoderModule(NeuralModule, ABC):
"decoder_mask": NeuralType(('B', 'T'), MaskType(), optional=True),
"encoder_mask": NeuralType(('B', 'T'), MaskType(), optional=True),
"encoder_embeddings": NeuralType(('B', 'T', 'D'), ChannelType(), optional=True),
"decoder_mems": NeuralType(('B', 'D', 'T', 'D'), EncodedRepresentation(), optional=True),
}
@property

@ -13,7 +13,7 @@
# limitations under the License.
from dataclasses import dataclass
from typing import Optional
from typing import Dict, Optional
import torch
from omegaconf.omegaconf import MISSING
@ -25,6 +25,7 @@ from nemo.collections.nlp.modules.common.transformer.transformer_encoders import
from nemo.collections.nlp.modules.common.transformer.transformer_modules import TransformerEmbedding
from nemo.core.classes.common import typecheck
from nemo.core.classes.exportable import Exportable
from nemo.core.neural_types import ChannelType, NeuralType
# @dataclass
# class TransformerConfig:
@ -183,6 +184,10 @@ class TransformerDecoderNM(DecoderModule, Exportable):
self._vocab_size = vocab_size
self._hidden_size = hidden_size
self._max_sequence_length = max_sequence_length
self.num_states = num_layers + 1
self.return_mems = False
if pre_ln_final_layer_norm:
self.num_states += 1
self._embedding = TransformerEmbedding(
vocab_size=self.vocab_size,
@ -207,14 +212,27 @@ class TransformerDecoderNM(DecoderModule, Exportable):
)
@typecheck()
def forward(self, input_ids, decoder_mask, encoder_embeddings, encoder_mask):
decoder_embeddings = self._embedding(input_ids=input_ids)
def forward(
self, input_ids, decoder_mask, encoder_embeddings, encoder_mask, decoder_mems=None,
):
start_pos = 0
if decoder_mems is not None:
start_pos = input_ids.shape[1] - 1
input_ids = input_ids[:, -1:]
decoder_mask = decoder_mask[:, -1:]
decoder_mems = torch.transpose(decoder_mems, 0, 1)
decoder_embeddings = self._embedding(input_ids=input_ids, start_pos=start_pos)
decoder_hidden_states = self._decoder(
decoder_states=decoder_embeddings,
decoder_mask=decoder_mask,
encoder_states=encoder_embeddings,
encoder_mask=encoder_mask,
decoder_mems_list=decoder_mems,
return_mems=self.return_mems,
return_mems_as_list=False,
)
if self.return_mems:
decoder_hidden_states = torch.transpose(decoder_hidden_states, 0, 1)
return decoder_hidden_states
@property
@ -246,8 +264,18 @@ class TransformerDecoderNM(DecoderModule, Exportable):
sample = next(self.parameters())
input_ids = torch.randint(low=0, high=2048, size=(2, 16), device=sample.device)
encoder_mask = torch.randint(low=0, high=1, size=(2, 16), device=sample.device)
return tuple([input_ids, encoder_mask, self._embedding(input_ids), encoder_mask])
mem_size = [2, self.num_states, 15, self._hidden_size]
decoder_mems = torch.rand(mem_size, device=sample.device)
return tuple([input_ids, encoder_mask, self._embedding(input_ids), encoder_mask, decoder_mems])
def _prepare_for_export(self, **kwargs):
self._decoder.diagonal = None
self.return_mems = True
super()._prepare_for_export(**kwargs)
@property
def output_types(self) -> Optional[Dict[str, NeuralType]]:
if self.return_mems:
return {"last_hidden_states": NeuralType(('B', 'D', 'T', 'D'), ChannelType())}
else:
return {"last_hidden_states": NeuralType(('B', 'T', 'D'), ChannelType())}

@ -148,13 +148,23 @@ class TransformerDecoder(nn.Module):
def _get_memory_states(self, decoder_states, decoder_mems_list=None, i=0):
if decoder_mems_list is not None:
memory_states = torch.cat((decoder_mems_list[i], decoder_states), dim=1)
inp1 = torch.transpose(decoder_mems_list[i], 1, 2) # Putting seq_len to last dim to handle export cases
inp2 = torch.transpose(decoder_states, 1, 2)
memory_states = torch.cat((inp1, inp2), dim=2)
memory_states = torch.transpose(memory_states, 1, 2) # Transposing back
else:
memory_states = decoder_states
return memory_states
def forward(
self, decoder_states, decoder_mask, encoder_states, encoder_mask, decoder_mems_list=None, return_mems=False
self,
decoder_states,
decoder_mask,
encoder_states,
encoder_mask,
decoder_mems_list=None,
return_mems=False,
return_mems_as_list=True,
):
"""
Args:
@ -167,21 +177,31 @@ class TransformerDecoder(nn.Module):
of decoder_states as keys and values if not None
return_mems: bool, whether to return outputs of all decoder layers
or the last layer only
return_mems_as_list: bool, when True, mems returned are as a list; otherwise mems are Tensor
"""
decoder_attn_mask = form_attention_mask(decoder_mask, diagonal=self.diagonal)
encoder_attn_mask = form_attention_mask(encoder_mask)
memory_states = self._get_memory_states(decoder_states, decoder_mems_list, 0)
cached_mems_list = [memory_states]
if return_mems_as_list:
cached_mems_list = [memory_states]
else:
cached_mems_list = memory_states.unsqueeze(0)
for i, layer in enumerate(self.layers):
decoder_states = layer(decoder_states, decoder_attn_mask, memory_states, encoder_states, encoder_attn_mask)
memory_states = self._get_memory_states(decoder_states, decoder_mems_list, i + 1)
cached_mems_list.append(memory_states)
if return_mems_as_list:
cached_mems_list.append(memory_states)
else:
cached_mems_list = torch.cat((cached_mems_list, memory_states.unsqueeze(0)), dim=0)
if self.final_layer_norm is not None:
decoder_states = self.final_layer_norm(decoder_states)
memory_states = self._get_memory_states(decoder_states, decoder_mems_list, i + 1)
cached_mems_list.append(memory_states)
memory_states = self._get_memory_states(decoder_states, decoder_mems_list, i + 2)
if return_mems_as_list:
cached_mems_list.append(memory_states)
else:
cached_mems_list = torch.cat((cached_mems_list, memory_states.unsqueeze(0)), dim=0)
if return_mems:
return cached_mems_list
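
A quick check of why the export-friendly rewrite of _get_memory_states is behavior-preserving: transposing so the time axis is last, concatenating, and transposing back matches a direct concatenation along the time dimension (shapes are illustrative):

import torch

mems = torch.randn(2, 5, 8)    # (batch, cached_time, hidden)
states = torch.randn(2, 3, 8)  # (batch, new_time, hidden)

direct = torch.cat((mems, states), dim=1)
via_transpose = torch.transpose(
    torch.cat((torch.transpose(mems, 1, 2), torch.transpose(states, 1, 2)), dim=2), 1, 2
)
assert torch.equal(direct, via_transpose)
print(direct.shape)  # torch.Size([2, 8, 8])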

@ -14,7 +14,7 @@
MAJOR = 1
MINOR = 3
MINOR = 4
PATCH = 0
PRE_RELEASE = ''

@ -1,4 +1,4 @@
pytorch-lightning>=1.4.0
pytorch-lightning>=1.4.8,<1.5
torchmetrics>=0.4.1rc0
transformers>=4.0.1
webdataset>=0.1.48,<=0.1.62
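
A small sanity check, as a sketch, that an environment satisfies the tightened PyTorch Lightning bound:

import pytorch_lightning as pl
from packaging import version

installed = version.parse(pl.__version__)
assert version.parse("1.4.8") <= installed < version.parse("1.5"), pl.__version__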

@ -126,7 +126,7 @@ def linear_search_wer(
param_name: the name of the parameter to be used in the figure
Output:
(best coefficient found, best WER achived)
(best coefficient found, best WER achieved)
"""
scale = scores1.mean().abs().item() / scores2.mean().abs().item()
left = coef_range[0] * scale

@ -150,7 +150,9 @@ def beam_search_eval(
)
)
logging.info(
'Best WER/CER in candidates = {:.2%}/{:.2%}'.format(wer_dist_best / words_count, cer_dist_best / chars_count)
'Oracle WER/CER in candidates with perfect LM= {:.2%}/{:.2%}'.format(
wer_dist_best / words_count, cer_dist_best / chars_count
)
)
logging.info(f"=================================================================================")

@ -240,7 +240,7 @@
"* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n",
"* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n",
"* [Punctuation and Capitalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/1_TTS_inference.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n",
"\n",
"\n",
"You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples). "
@ -277,4 +277,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}

@ -278,7 +278,7 @@
"* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n",
"* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n",
"* [Punctuation and Capitalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/1_TTS_inference.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n",
"\n",
"\n",
"You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples). "
@ -327,4 +327,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}

@ -385,7 +385,7 @@
"\n",
"Now that we have an idea of what ASR is and how the audio data looks like, we can start using NeMo to do some ASR!\n",
"\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/v1.0.2/).\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n",
"\n",
"NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training."
]
@ -997,7 +997,7 @@
"id": "I4WRcmakjQnj"
},
"source": [
"!pip install --upgrade onnxruntime onnxruntime-gpu\n",
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"#!mkdir -p ort\n",
"#%cd ort\n",
"#!git clean -xfd\n",
@ -1173,4 +1173,4 @@
"outputs": []
}
]
}
}

@ -612,7 +612,7 @@
"\r\n",
"We will use a Citrinet model to demonstrate the usage of subword tokenization models for training and inference. Citrinet is a [QuartzNet-like architecture](https://arxiv.org/abs/1910.10261), but it uses subword-tokenization along with 8x subsampling and [Squeeze-and-Excitation](https://arxiv.org/abs/1709.01507) to achieve strong accuracy in transcriptions while still using non-autoregressive decoding for efficient inference.\r\n",
"\r\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/v1.0.2/).\r\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\r\n",
"\r\n",
"NeMo let us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training."
]

@ -765,12 +765,13 @@
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ort\n",
"%cd ort\n",
"!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"!pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"%cd .."
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"# !mkdir -p ort\n",
"# %cd ort\n",
"# !git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"# !./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"# !pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"# %cd .."
]
},
{
@ -834,4 +835,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}
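
Since the notebooks now install onnxruntime from PyPI instead of building it from source, here is a minimal export-and-run sketch; an assumed workflow rather than the notebooks' exact cells, with an illustrative model name and input shape:

import numpy as np
import onnxruntime
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")
model.export("quartznet.onnx")  # NeMo's Exportable interface

sess = onnxruntime.InferenceSession("quartznet.onnx", providers=["CPUExecutionProvider"])
print([(i.name, i.shape) for i in sess.get_inputs()])  # check the graph's expected inputs

# Assuming a single spectrogram input: feed dummy 64-mel features for 128 frames.
# A real run should use the model's preprocessor output (and onnxruntime-gpu on GPU).
feats = np.random.randn(1, 64, 128).astype(np.float32)
logits = sess.run(None, {sess.get_inputs()[0].name: feats})[0]
print(logits.shape)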

@ -671,12 +671,13 @@
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ort\n",
"%cd ort\n",
"!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"!pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"%cd .."
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"# !mkdir -p ort\n",
"# %cd ort\n",
"# !git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"# !./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"# !pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"# %cd .."
]
},
{
@ -740,4 +741,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}

@ -644,7 +644,7 @@
"\n",
"For multi-GPU training, take a look at [the PyTorch Lightning Multi-GPU training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/multi_gpu.html)\n",
"\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/amp.html)\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/guides/speed.html#mixed-precision-16-bit-training)\n",
"\n",
"```python\n",
"# Mixed precision:\n",

@ -657,7 +657,7 @@
"\n",
"For multi-GPU training, take a look at [the PyTorch Lightning Multi-GPU training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/multi_gpu.html)\n",
"\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/amp.html)\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/guides/speed.html#mixed-precision-16-bit-training)\n",
"\n",
"\n",
"```python\n",

File diff suppressed because it is too large.

@ -690,15 +690,16 @@
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ort\n",
"%cd ort\n",
"!git clean -xfd\n",
"!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"!pip uninstall -y onnxruntime\n",
"!pip uninstall -y onnxruntime-gpu\n",
"!pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n",
"%cd .."
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"# !mkdir -p ort\n",
"# %cd ort\n",
"# !git clean -xfd\n",
"# !git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"# !./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"# !pip uninstall -y onnxruntime\n",
"# !pip uninstall -y onnxruntime-gpu\n",
"# !pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n",
"# %cd .."
]
},
{
@ -825,4 +826,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}

@ -302,7 +302,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The following three varialbes are obtained from `get_speech_labels_list()` function.\n",
"The following three variables are obtained from `get_speech_labels_list()` function.\n",
"\n",
"- words List[str]: contains the sequence of words.\n",
"- spaces List[int]: contains frame level index of the end of the last word and the start time of the next word. \n",
@ -475,7 +475,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@ -489,7 +489,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
"version": "3.8.10"
}
},
"nbformat": 4,

@ -40,9 +40,9 @@
"\n",
"In NeMo we support both **oracle VAD** and **non-oracle VAD** diarization. \n",
"\n",
"In this tutorial, we shall first demonstrate how to perform diarization with a oracle VAD time stamps (we assume we already have speech time stamps) and pretrained speaker verification model which can be found in tutorial for [Speaker and Recognition and Verification in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Recognition_Verification.ipynb).\n",
"In this tutorial, we shall first demonstrate how to perform diarization with a oracle VAD time stamps (we assume we already have speech time stamps) and pretrained speaker verification model which can be found in tutorial for [Speaker Identification and Verification in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb).\n",
"\n",
"In [second part](#ORACLE-VAD-DIARIZATION) we show how to perform VAD and then diarization if ground truth timestamped speech were not available (non-oracle VAD). We also have tutorials for [VAD training in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Voice_Activity_Detection.ipynb) and [online offline microphone inference](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb), where you can custom your model and training/finetuning on your own data.\n",
"In ORACLE-VAD-DIARIZATION we show how to perform VAD and then diarization if ground truth timestamped speech were not available (non-oracle VAD). We also have tutorials for [VAD training in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Voice_Activity_Detection.ipynb) and [online offline microphone inference](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb), where you can custom your model and training/finetuning on your own data.\n",
"\n",
"For demonstration purposes we would be using simulated audio from [an4 dataset](http://www.speech.cs.cmu.edu/databases/an4/)"
]

@ -257,7 +257,7 @@
"source": [
"As the goal of most speaker-related systems is to get good speaker level embeddings that could help distinguish from\n",
"other speakers, we shall first train these embeddings in end-to-end\n",
"manner optimizing the [QuatzNet](https://arxiv.org/abs/1910.10261) based encoder model on cross-entropy loss.\n",
"manner optimizing the [QuartzNet](https://arxiv.org/abs/1910.10261) based encoder model on cross-entropy loss.\n",
"We modify the decoder to get these fixed-size embeddings irrespective of the length of the input audio. We employ a mean and variance\n",
"based statistics pooling method to grab these embeddings."
]
@ -620,7 +620,7 @@
"\n",
"For multi-GPU training, take a look at the [PyTorch Lightning Multi-GPU training section](https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html)\n",
"\n",
"For mixed-precision training, take a look at the [PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/stable/advanced/amp.html)\n",
"For mixed-precision training, take a look at the [PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/guides/speed.html#mixed-precision-16-bit-training)\n",
"\n",
"### Mixed precision:\n",
"<pre><code>trainer = Trainer(amp_level='O1', precision=16)\n",

@ -292,11 +292,11 @@
"\n",
"Text preprocessing:\n",
"* the text will be roughly split into sentences and stored under '$OUTPUT_DIR/processed/*.txt' where each sentence is going to start with a new line (we're going to find alignments for these sentences in the next steps)\n",
"* to change the lenghts of the final sentences/fragments, use `min_length` and `max_length` arguments, that specify min/max number of chars of the text segment for alignment.\n",
"* to change the lengths of the final sentences/fragments, use `min_length` and `max_length` arguments, that specify min/max number of chars of the text segment for alignment.\n",
"* to specify additional punctuation marks to split the text into fragments, use `--additional_split_symbols` argument. If segments produced after splitting the original text based on the end of sentence punctuation marks is longer than `--max_length`, `--additional_split_symbols` are going to be used to shorten the segments. Use `|` as a separator between symbols, for example: `--additional_split_symbols=;|:`\n",
"* out-of-vocabulary words will be removed based on pre-trained ASR model vocabulary, (optionally) text will be changed to lowercase \n",
"* sentences for alignment with the original punctuation and capitalization will be stored under `$OUTPUT_DIR/processed/*_with_punct.txt`\n",
"* numbers will be converted from written to their spoken form with `num2words` package. To use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see [the text normalization tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/tools/Text_Normalization_Tutorial.ipynb) for more details). Such normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. Note, the audio-based normalization of long audio samples is not supported due to many possible normalization options. See [https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
"* numbers will be converted from written to their spoken form with `num2words` package. To use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`tutorials/text_processing/Text_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Text_Normalization.ipynb) for more details). Such normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. Note, the audio-based normalization of long audio samples is not supported due to many possible normalization options. See [https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
"\n",
"Audio preprocessing:\n",
"* `.mp3` files will be converted to `.wav` files\n",
@ -540,7 +540,7 @@
" plot_signal(signal, sample_rate)\n",
" display(Audio(sample['audio_filepath']))\n",
" display('Reference text: ' + sample['text_no_preprocessing'])\n",
" display('ASR transcript: ' + sample['transcript'])\n",
" display('ASR transcript: ' + sample['pred_text'])\n",
" print('\\n' + '-' * 110)"
],
"execution_count": null,

@ -36,6 +36,16 @@
"We will first create filelists of audio on which we wish to finetune the FastPitch model. We will create two kinds of filelists, one which contains only the audio files of the new speaker and one which contains the mixed audio files of the new speaker and the speaker used for training the pre-trained FastPitch Checkpoint."
]
},
{
"cell_type": "markdown",
"id": "53746a7b",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
" WARNING: This notebook requires downloading the HiFiTTS dataset which is 41GB. We plan on reducing the amount the download amount.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
@ -191,7 +201,7 @@
"outputs": [],
"source": [
"make_sub_file_list(92, \"clean\", \"train\", None, 5)\n",
"mix_file_list(92, \"clean\", \"train\", None, 5, 8051, \"clean\", n_orig=5000)\n",
"mix_file_list(92, \"clean\", \"train\", None, 5, 8051, \"other\", n_orig=5000)\n",
"make_sub_file_list(92, \"clean\", \"dev\", None, None)"
]
},
@ -507,9 +517,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "nemoenv",
"display_name": "Python 3",
"language": "python",
"name": "nemoenv"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@ -521,7 +531,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
"version": "3.8.10"
}
},
"nbformat": 4,