Merge r1.4 bugfixes to main (#2918)
* update package info Signed-off-by: ericharper <complex451@gmail.com>
* update branch for jenkinsfile and dockerfile Signed-off-by: ericharper <complex451@gmail.com>
* Adding conformer-transducer models. (#2717)
* added the models. Signed-off-by: Vahid <vnoroozi@nvidia.com>
* added contextnet models. Signed-off-by: Vahid <vnoroozi@nvidia.com>
* added german and chinese models. Signed-off-by: Vahid <vnoroozi@nvidia.com>
* fix the abs_pos of conformer. (#2863) Signed-off-by: Vahid <vnoroozi@nvidia.com>
* update to match sde (#2867) Signed-off-by: ekmb <ebakhturina@nvidia.com>
* updated german ngc model (#2871) Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* Lower bound PTL to safe version (#2876) Signed-off-by: smajumdar <titu1994@gmail.com>
* Update notebooks with onnxruntime (#2880) Signed-off-by: smajumdar <titu1994@gmail.com>
* Upperbound PTL (#2881) Signed-off-by: smajumdar <titu1994@gmail.com>
* minor typo and broken link fixes (#2883) Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
* Remove numbers from TTS tutorial names (#2882)
* Remove numbers from TTS tutorial names Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
* Update documentation links Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
* Typos (#2884)
* segmentation tutorial fix Signed-off-by: ekmb <ebakhturina@nvidia.com>
* data fixes Signed-off-by: ekmb <ebakhturina@nvidia.com>
* updated the messages in eval_beamsearch_ngram.py. (#2889) Signed-off-by: Vahid <vnoroozi@nvidia.com>
* style (#2890) Signed-off-by: Jason <jasoli@nvidia.com>
* Fix broken link (#2891)
* fix broken link Signed-off-by: fayejf <fayejf07@gmail.com>
* more Signed-off-by: fayejf <fayejf07@gmail.com>
* Update sclite eval for new transcription method (#2893)
* Update sclite to use updated inference Signed-off-by: smajumdar <titu1994@gmail.com>
* Remove WER Signed-off-by: smajumdar <titu1994@gmail.com>
* Update sclite script to use new inference methods Signed-off-by: smajumdar <titu1994@gmail.com>
* Remove hub 5 Signed-off-by: smajumdar <titu1994@gmail.com>
* Fix TransformerDecoder export - r1.4 (#2875)
* export fix Signed-off-by: Abhinav Khattar <aklife97@gmail.com>
* embedding pos Signed-off-by: Abhinav Khattar <aklife97@gmail.com>
* remove bool param Signed-off-by: Abhinav Khattar <aklife97@gmail.com>
* changes Signed-off-by: Abhinav Khattar <aklife97@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
* Update Finetuning notebook (#2906)
* update notebook Signed-off-by: Jason <jasoli@nvidia.com>
* rename Signed-off-by: Jason <jasoli@nvidia.com>
* rename Signed-off-by: Jason <jasoli@nvidia.com>
* revert branch to main Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Vahid Noroozi <VahidooX@users.noreply.github.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Abhinav Khattar <aklife97@gmail.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
This commit is contained in:
parent c88cfc42eb
commit 58bc1d2c6c
@@ -44,7 +44,7 @@ Key Features
* Speech processing
  * `Automatic Speech recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_: Jasper, QuartzNet, CitriNet, Conformer
  * `Speech Classification <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (command recognition), MarbleNet (voice activity detection)
  * `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: SpeakerNet, TDNN-Attention
  * `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: SpeakerNet, ECAPA_TDNN
  * `Speaker Diarization <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html>`_: MarbleNet + SpeakerNet
  * `NGC collection of pre-trained speech processing models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_asr>`_
* Natural Language Processing

@@ -1,3 +1,4 @@
Model,Model Base Class,Model Card
stt_de_quartznet15x5,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_quartznet15x5"
stt_de_citrinet_1024,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_citrinet_1024"

@@ -7,9 +7,17 @@ stt_en_citrinet_1024,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nv
stt_en_citrinet_256_gamma_0_25,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_256_gamma_0_25"
stt_en_citrinet_512_gamma_0_25,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_512_gamma_0_25"
stt_en_citrinet_1024_gamma_0_25,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_citrinet_1024_gamma_0_25"
stt_en_contextnet_256_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_256_mls"
stt_en_contextnet_512_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_512_mls"
stt_en_contextnet_1024_mls,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_1024_mls"
stt_en_contextnet_512,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_512"
stt_en_contextnet_1024,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_1024"
stt_en_conformer_ctc_small,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_small"
stt_en_conformer_ctc_medium,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_medium"
stt_en_conformer_ctc_large,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large"
stt_en_conformer_ctc_small_ls,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_small_ls"
stt_en_conformer_ctc_medium_ls,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_medium_ls"
stt_en_conformer_ctc_large_ls,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_ctc_large_ls"
stt_en_conformer_transducer_small,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_small"
stt_en_conformer_transducer_medium,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_medium"
stt_en_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large"
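The names in these model-card CSVs are the identifiers that NGC and NeMo's `from_pretrained()` accept. As a rough sketch (assuming a working NeMo 1.4.x install, network access to NGC, and a placeholder 16 kHz mono file `sample.wav`), one of the English checkpoints listed above could be pulled and run like this:

```python
import nemo.collections.asr as nemo_asr

# Hedged example: download one of the checkpoints named in the CSV above from NGC
# (cached locally on first use) and transcribe a placeholder audio file.
asr_model = nemo_asr.models.EncDecCTCBPEModel.from_pretrained(model_name="stt_en_conformer_ctc_large")
print(asr_model.transcribe(["sample.wav"], batch_size=1))
```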
@@ -1,2 +1,3 @@
Model,Model Base Class,Model Card
stt_zh_citrinet_512,EncDecCTCBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_512"
stt_zh_citrinet_512,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_512"
stt_zh_citrinet_1024_gamma_0_25,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_zh_citrinet_1024_gamma_0_25"

@@ -117,13 +117,19 @@ To run a tutorial:
  - `Relation Extraction - BioMegatron <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Relation_Extraction-BioMegatron.ipynb>`_
* - TTS
  - Speech Synthesis
  - `TTS Inference <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/1_TTS_inference.ipynb>`_
  - `TTS Inference <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb>`_
* - TTS
  - Speech Synthesis
  - `Tacotron2 Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/2_TTS_Tacotron2_Training.ipynb>`_
  - `FastPitch Duration and Pitch Control <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_DurationPitchControl.ipynb>`_
* - TTS
  - Speech Synthesis
  - `TalkNet Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/3_TTS_TalkNet_Training.ipynb>`_
  - `Tacotron2 Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Tacotron2_Training.ipynb>`_
* - TTS
  - Speech Synthesis
  - `TalkNet Training <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/TalkNet_Training.ipynb>`_
* - TTS
  - Speech Synthesis
  - `FastPitch Fine-Tuning <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_Finetuning.ipynb>`_
* - Tools
  - CTC Segmentation
  - `CTC Segmentation <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tools/CTC_Segmentation_Tutorial.ipynb>`_

@@ -17,6 +17,17 @@ This script is based on speech_to_text_infer.py and allows you to score the hypotheses
with sclite. A local installation from https://github.com/usnistgov/SCTK is required.
Hypotheses and references are first saved in trn format and are scored after applying a glm
file (if provided).

# Usage

python speech_to_text_sclite.py \
    --asr_model="<Path to ASR Model>" \
    --dataset="<Path to manifest file>" \
    --out_dir="<Path to output dir, should be unique per model evaluated>" \
    --sctk_dir="<Path to root directory where SCTK is installed>" \
    --glm="<OPTIONAL: Path to glm file>" \
    --batch_size=4

"""

import errno

@@ -27,8 +38,7 @@ from argparse import ArgumentParser

import torch

from nemo.collections.asr.metrics.wer import WER
from nemo.collections.asr.models import EncDecCTCModel
from nemo.collections.asr.models import ASRModel
from nemo.utils import logging

try:

@@ -64,6 +74,17 @@ def score_with_sctk(sctk_dir, ref_fname, hyp_fname, out_dir, glm=""):
    _ = subprocess.check_output(f"{sclite_path} -h {hypglm} -r {refglm} -i wsj -o all", shell=True)


def read_manifest(manifest_path):
    manifest_data = []
    with open(manifest_path, 'r') as f:
        for line in f:
            data = json.loads(line)
            manifest_data.append(data)

    logging.info('Loaded manifest data')
    return manifest_data


can_gpu = torch.cuda.is_available()
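For reference, `read_manifest()` above expects the usual NeMo JSON-lines manifest, one utterance per line; the path, duration, and text below are placeholders, and only `audio_filepath` and `text` are used later in this script. A minimal sketch:

```python
import json

# Hypothetical manifest entry written and read back; every line is a standalone JSON object.
entry = {"audio_filepath": "/data/wavs/utt_0001.wav", "duration": 3.2, "text": "hello world"}
with open("eval_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")

manifest_data = [json.loads(line) for line in open("eval_manifest.json")]
print(manifest_data[0]["audio_filepath"], manifest_data[0]["text"])
```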
@@ -84,12 +105,6 @@ def main():
    )
    parser.add_argument("--dataset", type=str, required=True, help="path to evaluation data")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument(
        "--dont_normalize_text",
        default=False,
        action='store_true',
        help="Turn off trasnscript normalization. Recommended for non-English.",
    )
    parser.add_argument("--out_dir", type=str, required=True, help="Destination dir for output files")
    parser.add_argument("--sctk_dir", type=str, required=False, default="", help="Path to sctk root dir")
    parser.add_argument("--glm", type=str, required=False, default="", help="Path to glm file")

@@ -97,48 +112,33 @@ def main():
    torch.set_grad_enabled(False)

    if not os.path.exists(args.out_dir):
        os.makedirs(args.out_dir)
    os.makedirs(args.out_dir, exist_ok=True)

    use_sctk = os.path.exists(args.sctk_dir)

    if args.asr_model.endswith('.nemo'):
        logging.info(f"Using local ASR model from {args.asr_model}")
        asr_model = EncDecCTCModel.restore_from(restore_path=args.asr_model)
        asr_model = ASRModel.restore_from(restore_path=args.asr_model, map_location='cpu')
    else:
        logging.info(f"Using NGC cloud ASR model {args.asr_model}")
        asr_model = EncDecCTCModel.from_pretrained(model_name=args.asr_model)
    asr_model.setup_test_data(
        test_data_config={
            'sample_rate': 16000,
            'manifest_filepath': args.dataset,
            'labels': asr_model.decoder.vocabulary,
            'batch_size': args.batch_size,
            'normalize_transcripts': not args.dont_normalize_text,
        }
    )
        asr_model = ASRModel.from_pretrained(model_name=args.asr_model, map_location='cpu')

    if can_gpu:
        asr_model = asr_model.cuda()
    asr_model.eval()
    labels_map = dict([(i, asr_model.decoder.vocabulary[i]) for i in range(len(asr_model.decoder.vocabulary))])

    wer = WER(vocabulary=asr_model.decoder.vocabulary)
    hypotheses = []
    references = []
    all_log_probs = []
    for test_batch in asr_model.test_dataloader():
        if can_gpu:
            test_batch = [x.cuda() for x in test_batch]
        with autocast():
            log_probs, encoded_len, greedy_predictions = asr_model(
                input_signal=test_batch[0], input_signal_length=test_batch[1]
            )
        for r in log_probs.cpu().numpy():
            all_log_probs.append(r)
        hypotheses += wer.ctc_decoder_predictions_tensor(greedy_predictions)
        for batch_ind in range(greedy_predictions.shape[0]):
            reference = ''.join([labels_map[c] for c in test_batch[2][batch_ind].cpu().detach().numpy()])
            references.append(reference)
        del test_batch
    asr_model.eval()

    manifest_data = read_manifest(args.dataset)

    references = [data['text'] for data in manifest_data]
    audio_filepaths = [data['audio_filepath'] for data in manifest_data]

    with autocast():
        hypotheses = asr_model.transcribe(audio_filepaths, batch_size=args.batch_size)

    # if transcriptions form a tuple (from RNNT), extract just "best" hypothesis
    if type(hypotheses) == tuple and len(hypotheses) == 2:
        hypotheses = hypotheses[0]

    info_list = get_utt_info(args.dataset)
    hypfile = os.path.join(args.out_dir, "hyp.trn")
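Condensed, the new evaluation path replaces the manual `test_dataloader()` loop with `ASRModel.transcribe()`. A minimal sketch of the same flow (placeholder paths; assumes a NeMo 1.4.x environment):

```python
import json

import torch
import nemo.collections.asr as nemo_asr

# Restore any ASR model type (CTC or RNNT) on CPU first, then move to GPU if available.
asr_model = nemo_asr.models.ASRModel.restore_from(restore_path="model.nemo", map_location="cpu")
if torch.cuda.is_available():
    asr_model = asr_model.cuda()
asr_model.eval()

# References and audio paths come straight from the manifest instead of a dataloader.
manifest = [json.loads(line) for line in open("eval_manifest.json")]
references = [item["text"] for item in manifest]
audio_filepaths = [item["audio_filepath"] for item in manifest]

hypotheses = asr_model.transcribe(audio_filepaths, batch_size=4)
if isinstance(hypotheses, tuple) and len(hypotheses) == 2:  # RNNT models return (best, all)
    hypotheses = hypotheses[0]
```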
@@ -22,7 +22,7 @@ Diarization Error Rate (DER) table of `ecapa_tdnn.nemo` model on well known eval

* All models were tested using embedding extractor with window size 1.5s and shift length 0.75s
* The above result is based on the oracle Voice Activity Detection (VAD) result.
* This result is based on [ecapa_tdnn.nemo](https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn) model which will be soon uploaded on NGC.
* This result is based on [ecapa_tdnn.nemo](https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn) model.

<br/>

@@ -98,7 +98,7 @@ class EncDecCTCModelBPE(EncDecCTCModel, ASRBPEMixin):
        model = PretrainedModelInfo(
            pretrained_model_name="stt_de_citrinet_1024",
            description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_de_citrinet_1024",
            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_de_citrinet_1024/versions/1.3.0/files/stt_de_citrinet_1024.nemo",
            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_de_citrinet_1024/versions/1.3.2/files/stt_de_citrinet_1024.nemo",
        )

        results.append(model)

@@ -86,9 +86,23 @@ class EncDecRNNTBPEModel(EncDecRNNTModel, ASRBPEMixin):
        results.append(model)

        model = PretrainedModelInfo(
            pretrained_model_name="stt_en_conformer_transducer_large_mls",
            description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large_mls",
            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_large_mls/versions/1.4.0/files/stt_en_conformer_transducer_large_mls.nemo",
            pretrained_model_name="stt_en_conformer_transducer_small",
            description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_small",
            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_small/versions/1.4.0/files/stt_en_conformer_transducer_small.nemo",
        )
        results.append(model)

        model = PretrainedModelInfo(
            pretrained_model_name="stt_en_conformer_transducer_medium",
            description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_medium",
            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_medium/versions/1.4.0/files/stt_en_conformer_transducer_medium.nemo",
        )
        results.append(model)

        model = PretrainedModelInfo(
            pretrained_model_name="stt_en_conformer_transducer_large",
            description="For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_conformer_transducer_large",
            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_large/versions/1.4.0/files/stt_en_conformer_transducer_large.nemo",
        )
        results.append(model)

@@ -266,7 +266,7 @@ class PositionalEncoding(torch.nn.Module):
            x+pos_emb (torch.Tensor): Its shape is (batch, time, feature_size)
            pos_emb (torch.Tensor): Its shape is (1, time, feature_size)
        """
        self.extend_pe(x)
        self.extend_pe(length=x.size(1))
        if self.pe.dtype != x.dtype or self.pe.device != x.device:
            self.pe = self.pe.to(device=x.device, dtype=x.dtype)

@@ -16,7 +16,7 @@ from abc import ABC
from typing import Any, Dict, Optional

from nemo.core.classes import NeuralModule
from nemo.core.neural_types import ChannelType, MaskType, NeuralType
from nemo.core.neural_types import ChannelType, EncodedRepresentation, MaskType, NeuralType

__all__ = ['DecoderModule']

@@ -31,6 +31,7 @@ class DecoderModule(NeuralModule, ABC):
            "decoder_mask": NeuralType(('B', 'T'), MaskType(), optional=True),
            "encoder_mask": NeuralType(('B', 'T'), MaskType(), optional=True),
            "encoder_embeddings": NeuralType(('B', 'T', 'D'), ChannelType(), optional=True),
            "decoder_mems": NeuralType(('B', 'D', 'T', 'D'), EncodedRepresentation(), optional=True),
        }

    @property

@@ -13,7 +13,7 @@
# limitations under the License.

from dataclasses import dataclass
from typing import Optional
from typing import Dict, Optional

import torch
from omegaconf.omegaconf import MISSING

@@ -25,6 +25,7 @@ from nemo.collections.nlp.modules.common.transformer.transformer_encoders import
from nemo.collections.nlp.modules.common.transformer.transformer_modules import TransformerEmbedding
from nemo.core.classes.common import typecheck
from nemo.core.classes.exportable import Exportable
from nemo.core.neural_types import ChannelType, NeuralType

# @dataclass
# class TransformerConfig:

@@ -183,6 +184,10 @@ class TransformerDecoderNM(DecoderModule, Exportable):
        self._vocab_size = vocab_size
        self._hidden_size = hidden_size
        self._max_sequence_length = max_sequence_length
        self.num_states = num_layers + 1
        self.return_mems = False
        if pre_ln_final_layer_norm:
            self.num_states += 1

        self._embedding = TransformerEmbedding(
            vocab_size=self.vocab_size,

@@ -207,14 +212,27 @@ class TransformerDecoderNM(DecoderModule, Exportable):
        )

    @typecheck()
    def forward(self, input_ids, decoder_mask, encoder_embeddings, encoder_mask):
        decoder_embeddings = self._embedding(input_ids=input_ids)
    def forward(
        self, input_ids, decoder_mask, encoder_embeddings, encoder_mask, decoder_mems=None,
    ):
        start_pos = 0
        if decoder_mems is not None:
            start_pos = input_ids.shape[1] - 1
            input_ids = input_ids[:, -1:]
            decoder_mask = decoder_mask[:, -1:]
            decoder_mems = torch.transpose(decoder_mems, 0, 1)
        decoder_embeddings = self._embedding(input_ids=input_ids, start_pos=start_pos)
        decoder_hidden_states = self._decoder(
            decoder_states=decoder_embeddings,
            decoder_mask=decoder_mask,
            encoder_states=encoder_embeddings,
            encoder_mask=encoder_mask,
            decoder_mems_list=decoder_mems,
            return_mems=self.return_mems,
            return_mems_as_list=False,
        )
        if self.return_mems:
            decoder_hidden_states = torch.transpose(decoder_hidden_states, 0, 1)
        return decoder_hidden_states

    @property

@@ -246,8 +264,18 @@ class TransformerDecoderNM(DecoderModule, Exportable):
        sample = next(self.parameters())
        input_ids = torch.randint(low=0, high=2048, size=(2, 16), device=sample.device)
        encoder_mask = torch.randint(low=0, high=1, size=(2, 16), device=sample.device)
        return tuple([input_ids, encoder_mask, self._embedding(input_ids), encoder_mask])
        mem_size = [2, self.num_states, 15, self._hidden_size]
        decoder_mems = torch.rand(mem_size, device=sample.device)
        return tuple([input_ids, encoder_mask, self._embedding(input_ids), encoder_mask, decoder_mems])

    def _prepare_for_export(self, **kwargs):
        self._decoder.diagonal = None
        self.return_mems = True
        super()._prepare_for_export(**kwargs)

    @property
    def output_types(self) -> Optional[Dict[str, NeuralType]]:
        if self.return_mems:
            return {"last_hidden_states": NeuralType(('B', 'D', 'T', 'D'), ChannelType())}
        else:
            return {"last_hidden_states": NeuralType(('B', 'T', 'D'), ChannelType())}
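The `decoder_mems` path above exists so that, once states are cached, an exported decoder only processes the newest token. A toy illustration of that bookkeeping with plain tensors (not the NeMo module API):

```python
import torch

# With states cached for the first t positions, a step only embeds/decodes the last
# token; its positional offset equals the cache length (input_ids.shape[1] - 1).
input_ids = torch.randint(0, 100, (2, 7))   # full sequence generated so far
start_pos = input_ids.shape[1] - 1          # 6: positions 0..5 are already cached
last_step_ids = input_ids[:, -1:]           # only the newest token is fed this step
print(start_pos, last_step_ids.shape)       # 6 torch.Size([2, 1])
```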
@@ -148,13 +148,23 @@ class TransformerDecoder(nn.Module):

    def _get_memory_states(self, decoder_states, decoder_mems_list=None, i=0):
        if decoder_mems_list is not None:
            memory_states = torch.cat((decoder_mems_list[i], decoder_states), dim=1)
            inp1 = torch.transpose(decoder_mems_list[i], 1, 2)  # Putting seq_len to last dim to handle export cases
            inp2 = torch.transpose(decoder_states, 1, 2)
            memory_states = torch.cat((inp1, inp2), dim=2)
            memory_states = torch.transpose(memory_states, 1, 2)  # Transposing back
        else:
            memory_states = decoder_states
        return memory_states

    def forward(
        self, decoder_states, decoder_mask, encoder_states, encoder_mask, decoder_mems_list=None, return_mems=False
        self,
        decoder_states,
        decoder_mask,
        encoder_states,
        encoder_mask,
        decoder_mems_list=None,
        return_mems=False,
        return_mems_as_list=True,
    ):
        """
        Args:
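The transpose, concatenate, transpose-back sequence in `_get_memory_states` is numerically the same as concatenating along the time dimension directly; per the in-line comment it is written this way only to handle export cases. A quick standalone sanity check of the equivalence (illustrative shapes only):

```python
import torch

mems = torch.rand(2, 5, 8)    # cached states: (batch, cached_len, hidden)
states = torch.rand(2, 3, 8)  # new decoder states: (batch, new_len, hidden)

direct = torch.cat((mems, states), dim=1)
via_transpose = torch.transpose(
    torch.cat((torch.transpose(mems, 1, 2), torch.transpose(states, 1, 2)), dim=2), 1, 2
)
assert torch.equal(direct, via_transpose)
print(direct.shape)  # torch.Size([2, 8, 8])
```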
@@ -167,21 +177,31 @@ class TransformerDecoder(nn.Module):
                of decoder_states as keys and values if not None
            return_mems: bool, whether to return outputs of all decoder layers
                or the last layer only
            return_mems_as_list: bool, when True, mems returned are as a list; otherwise mems are Tensor
        """
        decoder_attn_mask = form_attention_mask(decoder_mask, diagonal=self.diagonal)
        encoder_attn_mask = form_attention_mask(encoder_mask)
        memory_states = self._get_memory_states(decoder_states, decoder_mems_list, 0)
        cached_mems_list = [memory_states]
        if return_mems_as_list:
            cached_mems_list = [memory_states]
        else:
            cached_mems_list = memory_states.unsqueeze(0)

        for i, layer in enumerate(self.layers):
            decoder_states = layer(decoder_states, decoder_attn_mask, memory_states, encoder_states, encoder_attn_mask)
            memory_states = self._get_memory_states(decoder_states, decoder_mems_list, i + 1)
            cached_mems_list.append(memory_states)
            if return_mems_as_list:
                cached_mems_list.append(memory_states)
            else:
                cached_mems_list = torch.cat((cached_mems_list, memory_states.unsqueeze(0)), dim=0)

        if self.final_layer_norm is not None:
            decoder_states = self.final_layer_norm(decoder_states)
            memory_states = self._get_memory_states(decoder_states, decoder_mems_list, i + 1)
            cached_mems_list.append(memory_states)
            memory_states = self._get_memory_states(decoder_states, decoder_mems_list, i + 2)
            if return_mems_as_list:
                cached_mems_list.append(memory_states)
            else:
                cached_mems_list = torch.cat((cached_mems_list, memory_states.unsqueeze(0)), dim=0)

        if return_mems:
            return cached_mems_list

@@ -14,7 +14,7 @@


MAJOR = 1
MINOR = 3
MINOR = 4
PATCH = 0
PRE_RELEASE = ''

@@ -1,4 +1,4 @@
pytorch-lightning>=1.4.0
pytorch-lightning>=1.4.8,<1.5
torchmetrics>=0.4.1rc0
transformers>=4.0.1
webdataset>=0.1.48,<=0.1.62
@@ -126,7 +126,7 @@ def linear_search_wer(
        param_name: the name of the parameter to be used in the figure

    Output:
        (best coefficient found, best WER achived)
        (best coefficient found, best WER achieved)
    """
    scale = scores1.mean().abs().item() / scores2.mean().abs().item()
    left = coef_range[0] * scale

@@ -150,7 +150,9 @@ def beam_search_eval(
        )
    )
    logging.info(
        'Best WER/CER in candidates = {:.2%}/{:.2%}'.format(wer_dist_best / words_count, cer_dist_best / chars_count)
        'Oracle WER/CER in candidates with perfect LM= {:.2%}/{:.2%}'.format(
            wer_dist_best / words_count, cer_dist_best / chars_count
        )
    )
    logging.info(f"=================================================================================")

@@ -240,7 +240,7 @@
"* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n",
"* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n",
"* [Punctuation and Capitalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/1_TTS_inference.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n",
"\n",
"\n",
"You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples). "

@@ -277,4 +277,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}

@@ -278,7 +278,7 @@
"* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n",
"* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n",
"* [Punctuation and Capitalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/1_TTS_inference.ipynb)\n",
"* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n",
"\n",
"\n",
"You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples). "

@@ -327,4 +327,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}

@@ -385,7 +385,7 @@
"\n",
"Now that we have an idea of what ASR is and how the audio data looks like, we can start using NeMo to do some ASR!\n",
"\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/v1.0.2/).\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n",
"\n",
"NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training."
]

@@ -997,7 +997,7 @@
"id": "I4WRcmakjQnj"
},
"source": [
"!pip install --upgrade onnxruntime onnxruntime-gpu\n",
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"#!mkdir -p ort\n",
"#%cd ort\n",
"#!git clean -xfd\n",

@@ -1173,4 +1173,4 @@
"outputs": []
}
]
}
}

@@ -612,7 +612,7 @@
"\r\n",
"We will use a Citrinet model to demonstrate the usage of subword tokenization models for training and inference. Citrinet is a [QuartzNet-like architecture](https://arxiv.org/abs/1910.10261), but it uses subword-tokenization along with 8x subsampling and [Squeeze-and-Excitation](https://arxiv.org/abs/1709.01507) to achieve strong accuracy in transcriptions while still using non-autoregressive decoding for efficient inference.\r\n",
"\r\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/v1.0.2/).\r\n",
"We'll be using the **Neural Modules (NeMo) toolkit** for this part, so if you haven't already, you should download and install NeMo and its dependencies. To do so, just follow the directions on the [GitHub page](https://github.com/NVIDIA/NeMo), or in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\r\n",
"\r\n",
"NeMo let us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules. NeMo also comes with complete models which only require your data and hyperparameters for training."
]

@@ -765,12 +765,13 @@
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ort\n",
"%cd ort\n",
"!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"!pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"%cd .."
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"# !mkdir -p ort\n",
"# %cd ort\n",
"# !git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"# !./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"# !pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"# %cd .."
]
},
{

@@ -834,4 +835,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}

@@ -671,12 +671,13 @@
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ort\n",
"%cd ort\n",
"!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"!pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"%cd .."
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"# !mkdir -p ort\n",
"# %cd ort\n",
"# !git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"# !./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"# !pip install ./build/Linux/Release/dist/onnxruntime*.whl\n",
"# %cd .."
]
},
{

@@ -740,4 +741,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}

@@ -644,7 +644,7 @@
"\n",
"For multi-GPU training, take a look at [the PyTorch Lightning Multi-GPU training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/multi_gpu.html)\n",
"\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/amp.html)\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/guides/speed.html#mixed-precision-16-bit-training)\n",
"\n",
"```python\n",
"# Mixed precision:\n",

@@ -657,7 +657,7 @@
"\n",
"For multi-GPU training, take a look at [the PyTorch Lightning Multi-GPU training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/multi_gpu.html)\n",
"\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/advanced/amp.html)\n",
"For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/guides/speed.html#mixed-precision-16-bit-training)\n",
"\n",
"\n",
"```python\n",
File diff suppressed because it is too large

@@ -690,15 +690,16 @@
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ort\n",
"%cd ort\n",
"!git clean -xfd\n",
"!git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"!./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"!pip uninstall -y onnxruntime\n",
"!pip uninstall -y onnxruntime-gpu\n",
"!pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n",
"%cd .."
"!pip install --upgrade onnxruntime # for gpu, use onnxruntime-gpu\n",
"# !mkdir -p ort\n",
"# %cd ort\n",
"# !git clean -xfd\n",
"# !git clone --depth 1 --branch v1.8.0 https://github.com/microsoft/onnxruntime.git .\n",
"# !./build.sh --skip_tests --config Release --build_shared_lib --parallel --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/x86_64-linux-gnu --build_wheel\n",
"# !pip uninstall -y onnxruntime\n",
"# !pip uninstall -y onnxruntime-gpu\n",
"# !pip install --upgrade --force-reinstall ./build/Linux/Release/dist/onnxruntime*.whl\n",
"# %cd .."
]
},
{

@@ -825,4 +826,4 @@
},
"nbformat": 4,
"nbformat_minor": 1
}
}

@@ -302,7 +302,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The following three varialbes are obtained from `get_speech_labels_list()` function.\n",
"The following three variables are obtained from `get_speech_labels_list()` function.\n",
"\n",
"- words List[str]: contains the sequence of words.\n",
"- spaces List[int]: contains frame level index of the end of the last word and the start time of the next word. \n",

@@ -475,7 +475,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},

@@ -489,7 +489,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
"version": "3.8.10"
}
},
"nbformat": 4,

@@ -40,9 +40,9 @@
"\n",
"In NeMo we support both **oracle VAD** and **non-oracle VAD** diarization. \n",
"\n",
"In this tutorial, we shall first demonstrate how to perform diarization with a oracle VAD time stamps (we assume we already have speech time stamps) and pretrained speaker verification model which can be found in tutorial for [Speaker and Recognition and Verification in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Recognition_Verification.ipynb).\n",
"In this tutorial, we shall first demonstrate how to perform diarization with a oracle VAD time stamps (we assume we already have speech time stamps) and pretrained speaker verification model which can be found in tutorial for [Speaker Identification and Verification in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb).\n",
"\n",
"In [second part](#ORACLE-VAD-DIARIZATION) we show how to perform VAD and then diarization if ground truth timestamped speech were not available (non-oracle VAD). We also have tutorials for [VAD training in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Voice_Activity_Detection.ipynb) and [online offline microphone inference](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb), where you can custom your model and training/finetuning on your own data.\n",
"In ORACLE-VAD-DIARIZATION we show how to perform VAD and then diarization if ground truth timestamped speech were not available (non-oracle VAD). We also have tutorials for [VAD training in NeMo](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Voice_Activity_Detection.ipynb) and [online offline microphone inference](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb), where you can custom your model and training/finetuning on your own data.\n",
"\n",
"For demonstration purposes we would be using simulated audio from [an4 dataset](http://www.speech.cs.cmu.edu/databases/an4/)"
]

@@ -257,7 +257,7 @@
"source": [
"As the goal of most speaker-related systems is to get good speaker level embeddings that could help distinguish from\n",
"other speakers, we shall first train these embeddings in end-to-end\n",
"manner optimizing the [QuatzNet](https://arxiv.org/abs/1910.10261) based encoder model on cross-entropy loss.\n",
"manner optimizing the [QuartzNet](https://arxiv.org/abs/1910.10261) based encoder model on cross-entropy loss.\n",
"We modify the decoder to get these fixed-size embeddings irrespective of the length of the input audio. We employ a mean and variance\n",
"based statistics pooling method to grab these embeddings."
]

@@ -620,7 +620,7 @@
"\n",
"For multi-GPU training, take a look at the [PyTorch Lightning Multi-GPU training section](https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html)\n",
"\n",
"For mixed-precision training, take a look at the [PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/stable/advanced/amp.html)\n",
"For mixed-precision training, take a look at the [PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/guides/speed.html#mixed-precision-16-bit-training)\n",
"\n",
"### Mixed precision:\n",
"<pre><code>trainer = Trainer(amp_level='O1', precision=16)\n",

@@ -292,11 +292,11 @@
"\n",
"Text preprocessing:\n",
"* the text will be roughly split into sentences and stored under '$OUTPUT_DIR/processed/*.txt' where each sentence is going to start with a new line (we're going to find alignments for these sentences in the next steps)\n",
"* to change the lenghts of the final sentences/fragments, use `min_length` and `max_length` arguments, that specify min/max number of chars of the text segment for alignment.\n",
"* to change the lengths of the final sentences/fragments, use `min_length` and `max_length` arguments, that specify min/max number of chars of the text segment for alignment.\n",
"* to specify additional punctuation marks to split the text into fragments, use `--additional_split_symbols` argument. If segments produced after splitting the original text based on the end of sentence punctuation marks is longer than `--max_length`, `--additional_split_symbols` are going to be used to shorten the segments. Use `|` as a separator between symbols, for example: `--additional_split_symbols=;|:`\n",
"* out-of-vocabulary words will be removed based on pre-trained ASR model vocabulary, (optionally) text will be changed to lowercase \n",
"* sentences for alignment with the original punctuation and capitalization will be stored under `$OUTPUT_DIR/processed/*_with_punct.txt`\n",
"* numbers will be converted from written to their spoken form with `num2words` package. To use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see [the text normalization tutorial](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/tools/Text_Normalization_Tutorial.ipynb) for more details). Such normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. Note, the audio-based normalization of long audio samples is not supported due to many possible normalization options. See [https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
"* numbers will be converted from written to their spoken form with `num2words` package. To use NeMo normalization tool use `--use_nemo_normalization` argument (not supported if running this segmentation tutorial in Colab, see the text normalization tutorial: [`tutorials/text_processing/Text_Normalization.ipynb`](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Text_Normalization.ipynb) for more details). Such normalization is usually enough for proper segmentation. However, it does not take audio into account. NeMo supports audio-based normalization for English and Russian languages that can be applied to the segmented data as a post-processing step. Audio-based normalization produces multiple normalization options. For example, `901` could be normalized as `nine zero one` or `nine hundred and one`. The audio-based normalization chooses the best match among the possible normalization options and the transcript based on the character error rate. Note, the audio-based normalization of long audio samples is not supported due to many possible normalization options. See [https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py](https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/normalize_with_audio.py) for more details.\n",
"\n",
"Audio preprocessing:\n",
"* `.mp3` files will be converted to `.wav` files\n",

@@ -540,7 +540,7 @@
" plot_signal(signal, sample_rate)\n",
" display(Audio(sample['audio_filepath']))\n",
" display('Reference text: ' + sample['text_no_preprocessing'])\n",
" display('ASR transcript: ' + sample['transcript'])\n",
" display('ASR transcript: ' + sample['pred_text'])\n",
" print('\\n' + '-' * 110)"
],
"execution_count": null,

@@ -36,6 +36,16 @@
"We will first create filelists of audio on which we wish to finetune the FastPitch model. We will create two kinds of filelists, one which contains only the audio files of the new speaker and one which contains the mixed audio files of the new speaker and the speaker used for training the pre-trained FastPitch Checkpoint."
]
},
{
"cell_type": "markdown",
"id": "53746a7b",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
" WARNING: This notebook requires downloading the HiFiTTS dataset which is 41GB. We plan on reducing the amount the download amount.\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,

@@ -191,7 +201,7 @@
"outputs": [],
"source": [
"make_sub_file_list(92, \"clean\", \"train\", None, 5)\n",
"mix_file_list(92, \"clean\", \"train\", None, 5, 8051, \"clean\", n_orig=5000)\n",
"mix_file_list(92, \"clean\", \"train\", None, 5, 8051, \"other\", n_orig=5000)\n",
"make_sub_file_list(92, \"clean\", \"dev\", None, None)"
]
},

@@ -507,9 +517,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "nemoenv",
"display_name": "Python 3",
"language": "python",
"name": "nemoenv"
"name": "python3"
},
"language_info": {
"codemirror_mode": {

@@ -521,7 +531,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
"version": "3.8.10"
}
},
"nbformat": 4,