Update Finetuning Notebook (#3023)

* update note Signed-off-by: Jason <jasoli@nvidia.com>
* update checkpoint Signed-off-by: Jason <jasoli@nvidia.com>
* minor updates Signed-off-by: Jason <jasoli@nvidia.com>
* update Signed-off-by: Jason <jasoli@nvidia.com>
* update Signed-off-by: Jason <jasoli@nvidia.com>
* meta Signed-off-by: Jason <jasoli@nvidia.com>
* update Signed-off-by: Jason <jasoli@nvidia.com>
* fix Signed-off-by: Jason <jasoli@nvidia.com>
2021-10-20 13:36:07 -04:00
parent d9b51f0553
commit 689f8c59ac
8 changed files with 393 additions and 429 deletions
--- a/nemo/collections/tts/models/fastpitch.py
+++ b/nemo/collections/tts/models/fastpitch.py
@@ -378,7 +378,7 @@ class FastPitchModel(SpectrogramGenerator, Exportable):
        list_of_models = []
        model = PretrainedModelInfo(
            pretrained_model_name="tts_en_fastpitch",
-            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.0.0/files/tts_en_fastpitch.nemo",
+            location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.4.0/files/tts_en_fastpitch_align.nemo",
            description="This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent.",
            class_=cls,
        )
--- a/nemo/collections/tts/models/tacotron2.py
+++ b/nemo/collections/tts/models/tacotron2.py
@@ -95,11 +95,11 @@ class Tacotron2Model(SpectrogramGenerator):
        if self._parser is not None:
            return self._parser
        if self._validation_dl is not None:
-            return self._validation_dl.dataset.parser
+            return self._validation_dl.dataset.manifest_processor.parser
        if self._test_dl is not None:
-            return self._test_dl.dataset.parser
+            return self._test_dl.dataset.manifest_processor.parser
        if self._train_dl is not None:
-            return self._train_dl.dataset.parser
+            return self._train_dl.dataset.manifest_processor.parser

        # Else construct a parser
        # Try to get params from validation, test, and then train
@@ -122,7 +122,7 @@ class Tacotron2Model(SpectrogramGenerator):
        name = params.get('parser', None) or 'en'
        unk_id = params.get('unk_index', None) or -1
        blank_id = params.get('blank_index', None) or -1
-        do_normalize = params.get('normalize', None) or False
+        do_normalize = params.get('normalize', True)
        self._parser = parsers.make_parser(
            labels=self._cfg.labels, name=name, unk_id=unk_id, blank_id=blank_id, do_normalize=do_normalize,
        )
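
Note on the `normalize` change above: the two expressions differ in how they treat a missing key. The following is a minimal sketch using plain dicts as stand-ins for the `params` mapping; the helper names `old_default` and `new_default` and the sample dicts are hypothetical and exist only for illustration.

```python
# Plain dicts stand in for the parser params read from the dataset config (hypothetical values).
def old_default(params):
    # Pre-change expression: a missing key, None, or False all collapse to False.
    return params.get('normalize', None) or False


def new_default(params):
    # Post-change expression: a missing key yields True; any stored value passes through unchanged.
    return params.get('normalize', True)


cases = {
    "missing": {},
    "explicit_true": {"normalize": True},
    "explicit_false": {"normalize": False},
    "explicit_none": {"normalize": None},
}

# Map each case name to (old result, new result) so the two behaviours can be compared.
results = {name: (old_default(p), new_default(p)) for name, p in cases.items()}
for name, (old, new) in results.items():
    print(f"{name}: old={old!r} new={new!r}")
```

Under these assumptions, only the missing-key case flips from `False` to `True`. One side effect worth keeping in mind: a key explicitly set to `None` is now passed through as `None` rather than being coerced to `False`.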
--- a/nemo/collections/tts/models/waveglow.py
+++ b/nemo/collections/tts/models/waveglow.py
@@ -256,5 +256,6 @@ class WaveGlowModel(GlowVocoder, Exportable):
        # and can be computed from convinv.conv.weight
        # Ideally, we should remove this during saving instead of ignoring during loading
        for i in range(self._cfg.waveglow.n_flows):
-            del state_dict[f"waveglow.convinv.{i}.inv_conv.weight"]
+            if f"waveglow.convinv.{i}.inv_conv.weight" in state_dict:
+                del state_dict[f"waveglow.convinv.{i}.inv_conv.weight"]
        super().load_state_dict(state_dict, strict=strict)
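
Note on the guarded delete above: the membership check lets the same loading path accept state dicts that do or do not carry the derived `inv_conv.weight` entries. A minimal sketch of the same idea on a plain dict follows; `drop_inv_conv_weights`, the `_MISSING` sentinel, and the sample dict are hypothetical names used for illustration, and `dict.pop` with a default is swapped in for the `in` check plus `del` used in the commit.

```python
_MISSING = object()  # sentinel, so that a stored value of None still counts as present


def drop_inv_conv_weights(state_dict, n_flows):
    """Remove derived inverse-convolution weights when present and return the removed keys."""
    removed = []
    for i in range(n_flows):
        key = f"waveglow.convinv.{i}.inv_conv.weight"
        # pop with a default never raises KeyError, unlike a bare `del state_dict[key]`
        if state_dict.pop(key, _MISSING) is not _MISSING:
            removed.append(key)
    return removed


# A toy state dict in which only flow 0 still carries the derived weight.
toy_state = {
    "waveglow.convinv.0.inv_conv.weight": "w0",
    "waveglow.convinv.0.conv.weight": "c0",
    "waveglow.convinv.1.conv.weight": "c1",
}
removed_keys = drop_inv_conv_weights(toy_state, n_flows=2)
print(removed_keys)
print(sorted(toy_state))
```

Either form tolerates a missing key. The explicit `if ... in state_dict` used in the commit arguably reads more directly, while `pop` does the lookup and removal in one call.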
--- a/tutorials/tts/FastPitch_Finetuning.ipynb
+++ b/tutorials/tts/FastPitch_Finetuning.ipynb
@@ -2,393 +2,303 @@
 "cells": [
  {
   "cell_type": "markdown",
-   "id": "8d0bbac2",
-   "metadata": {},
+   "metadata": {
+    "id": "8d0bbac2"
+   },
   "source": [
-    "# Finetuning FastPitch for a new speaker"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2d063299",
-   "metadata": {},
-   "source": [
-    "In this tutorial, we will finetune a single speaker FastPitch (with alignment) model on limited amount of new speaker's data. We cover two finetuning techniques: \n",
-    "1. We finetune the model parameters only on new speaker's text and speech pairs; \n",
-    "2. We add a learnable speaker embedding layer to the model, and finetune on a mixture of original speaker's and new speaker's data.\n",
+    "# Finetuning FastPitch for a new speaker\n",
    "\n",
-    "We will first prepare filelists containing the audiopaths and text of the samples on which we wish to finetune the model, then generate and run a training command to finetune Fastpitch on 5 mins of data, and finally synthesize the audio from the trained checkpoint."
+    "In this tutorial, we will finetune a single-speaker FastPitch (with alignment) model on 5 minutes of a new speaker's data. We will finetune the model parameters only on the new speaker's text and speech pairs.\n",
+    "\n",
+    "We will download the training data, then generate and run a training command to finetune FastPitch on 5 minutes of data, and finally synthesize audio from the trained checkpoint.\n",
+    "\n",
+    "A final section describes approaches for improving audio quality beyond what this notebook covers."
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "2502cf61",
-   "metadata": {},
+   "metadata": {
+    "id": "nGw0CBaAtmQ6"
+   },
   "source": [
-    "## Creating filelists for training"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "81fa2c02",
-   "metadata": {},
-   "source": [
-    "We will first create filelists of audio on which we wish to finetune the FastPitch model. We will create two kinds of filelists, one which contains only the audio files of the new speaker and one which contains the mixed audio files of the new speaker and the speaker used for training the pre-trained FastPitch Checkpoint."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "53746a7b",
-   "metadata": {},
-   "source": [
-    "<div class=\"alert alert-block alert-warning\">\n",
-    "    WARNING: This notebook requires downloading the HiFiTTS dataset which is 41GB. We plan on reducing the amount the download amount.\n",
-    "</div>"
+    "## License\n",
+    "\n",
+    "> Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.\n",
+    ">\n",
+    "> Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+    "> you may not use this file except in compliance with the License.\n",
+    "> You may obtain a copy of the License at\n",
+    ">\n",
+    ">     http://www.apache.org/licenses/LICENSE-2.0\n",
+    ">\n",
+    "> Unless required by applicable law or agreed to in writing, software\n",
+    "> distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+    "> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+    "> See the License for the specific language governing permissions and\n",
+    "> limitations under the License."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "b7b1563d",
-   "metadata": {},
+   "metadata": {
+    "id": "U7bOoIgLttRC"
+   },
+   "outputs": [],
+   "source": [
+    "\"\"\"\n",
+    "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
+    "Instructions for setting up Colab are as follows:\n",
+    "1. Open a new Python 3 notebook.\n",
+    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
+    "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
+    "4. Run this cell to set up dependencies.\n",
+    "\"\"\"\n",
+    "# If you're using Google Colab and not running locally, run this cell.\n",
+    "!apt-get install sox libsndfile1 ffmpeg\n",
+    "!pip install wget unidecode\n",
+    "BRANCH = 'stable'\n",
+    "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "2502cf61"
+   },
+   "source": [
+    "## Downloading Data\n",
+    "___"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "81fa2c02"
+   },
+   "source": [
+    "Download and untar the data.\n",
+    "\n",
+    "The data contains a 5 minute subset of audio from speaker 6097 from the HiFiTTS dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "VIFgqxLOpxha"
+   },
+   "outputs": [],
+   "source": [
+    "!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz  # Contains 10MB of data\n",
+    "!tar -xzf 6097_5_mins.tar.gz"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "gSQqq0fBqy8K"
+   },
+   "source": [
+    "Looking at manifest.json, we see a standard NeMo JSON manifest that contains the audio filepath, text, and duration of each sample. Note that manifest.json contains only relative paths.\n",
+    "\n",
+    "```\n",
+    "{\"audio_filepath\": \"audio/presentpictureofnsw_02_mann_0532.wav\", \"text\": \"not to stop more than ten minutes by the way\", \"duration\": 2.6, \"text_no_preprocessing\": \"not to stop more than ten minutes by the way,\", \"text_normalized\": \"not to stop more than ten minutes by the way,\"}\n",
+    "```\n",
+    "\n",
+    "Let's take 2 samples from the dataset and split them off into a validation set. All other samples go into the training set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "B8gVfp5SsuDd"
+   },
+   "outputs": [],
+   "source": [
+    "!cat ./6097_5_mins/manifest.json | tail -n 2 > ./6097_manifest_dev_ns_all_local.json\n",
+    "!cat ./6097_5_mins/manifest.json | head -n -2 > ./6097_manifest_train_dur_5_mins_local.json\n",
+    "!ln -s ./6097_5_mins/audio audio"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ef75d1d5"
+   },
+   "source": [
+    "## Finetuning FastPitch\n",
+    "___\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "lhhg2wBNtW0r"
+   },
+   "source": [
+    "Let's first download the pretrained checkpoint that we want to finetune from. NeMo saves downloaded checkpoints to ~/.cache, so let's copy the checkpoint to our current directory."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "LggELooctXCT"
+   },
   "outputs": [],
   "source": [
-    "import random\n",
    "import os\n",
-    "import json\n",
+    "\n",
    "import torch\n",
    "import IPython.display as ipd\n",
    "from matplotlib.pyplot import imshow\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
-    "data_dir = <ADD_PATH_TO_DIRECTORY_CONTAINING_HIFIGAN_DATASET> # Download dataset from https://www.openslr.org/109/. Specify path to Hi_Fi_TTS_v_0\n",
-    "filelist_dir = <ADD_PATH_TO_DIRECTORY_IN_WHICH_WE_WISH_TO_SAVE_FILELISTS> # will be created if it does not exist\n",
-    "exp_base_dir = <ADD_PATH_TO_BASE_EXPERIMENT_DIRECTORY_FOR_CHECKPOINTS_AND_LOGS> # will be created if it does not exist\n",
+    "from nemo.collections.tts.models import FastPitchModel\n",
+    "FastPitchModel.from_pretrained(\"tts_en_fastpitch\")\n",
    "\n",
+    "from pathlib import Path\n",
+    "nemo_files = [p for p in Path(\"/root/.cache/torch/NeMo/\").glob(\"**/tts_en_fastpitch_align.nemo\")]\n",
+    "print(f\"Copying {nemo_files[0]} to ./\")\n",
+    "Path(\"./tts_en_fastpitch_align.nemo\").write_bytes(nemo_files[0].read_bytes())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "6c8b13b8"
+   },
+   "source": [
+    "To finetune the FastPitch model on the manifests created above, we use the `examples/tts/fastpitch2_finetune.py` script with the `fastpitch_align.yaml` configuration.\n",
    "\n",
-    "def make_sub_file_list(speaker_id, clean_other, split, num_samples, total_duration_mins, seed=42):\n",
-    "    \"\"\"\n",
-    "    Creates a subset of training data for a HiFiTTS speaker. Specify either the num_samples or total_duration_mins\n",
-    "    Saves the filelist in the filelist_dir. split is either \"train\" or \"dev\"\n",
-    "    \n",
-    "    Arguments:\n",
-    "    speaker_id -- speaker id of the new HiFiTTS speaker\n",
-    "    clean_other -- \"clean\" or \"other\" depending on type of data of new HiFiTTS speaker\n",
-    "    split -- \"train\" or \"dev\"\n",
-    "    num_samples -- Number samples of new speaker (set None if specifying total_duration_mins)\n",
-    "    total_duration_mins -- Total duration of new speaker's data (set None if specifying num_samples)\n",
-    "    \"\"\"\n",
-    "    file_list_name = \"{}_manifest_{}_{}.json\".format(speaker_id, clean_other, split)\n",
-    "    with open(os.path.join(data_dir, file_list_name), 'r') as f:\n",
-    "        all_records = [json.loads(l) for l in f.read().split(\"\\n\") if len(l) > 0]\n",
-    "    for r in all_records:\n",
-    "        r['audio_filepath'] = r['audio_filepath'][r['audio_filepath'].find(\"wav/\"):]\n",
-    "    random.seed(seed)\n",
-    "    random.shuffle(all_records)\n",
-    "    \n",
-    "    if num_samples is not None and total_duration_mins is None:\n",
-    "        sub_records = all_records[:num_samples]\n",
-    "        fname_extension = \"ns_{}\".format(num_samples)\n",
-    "    elif num_samples is None and total_duration_mins is not None:\n",
-    "        sub_record_duration = 0.0\n",
-    "        sub_records = []\n",
-    "        for r in all_records:\n",
-    "            sub_record_duration += r['duration']\n",
-    "            if sub_record_duration > total_duration_mins*60.0:\n",
-    "                print (\"Duration reached {} mins using {} records\".format(total_duration_mins, len(sub_records)))\n",
-    "                break\n",
-    "            sub_records.append(r)\n",
-    "        fname_extension = \"dur_{}_mins\".format( int(round(total_duration_mins )))\n",
-    "    elif num_samples is None and total_duration_mins is None:\n",
-    "        sub_records = all_records\n",
-    "        fname_extension = \"ns_all\"\n",
-    "    else:\n",
-    "        raise NotImplementedError()\n",
-    "    print (\"num sub records\", len(sub_records))\n",
-    "    \n",
-    "    if not os.path.exists(filelist_dir):\n",
-    "        os.makedirs(filelist_dir)\n",
-    "    \n",
-    "    target_fp = os.path.join(filelist_dir, \"{}_mainifest_{}_{}_local.json\".format(speaker_id, split,  fname_extension))\n",
-    "    with open(target_fp, 'w') as f:\n",
-    "        for record in json.loads(json.dumps(sub_records)):\n",
-    "            record['audio_filepath'] = record['audio_filepath'][record['audio_filepath'].find(\"wav/\"):]\n",
-    "            record['audio_filepath'] = os.path.join(data_dir, record['audio_filepath']) \n",
-    "            f.write(json.dumps(record) + \"\\n\")\n",
-    "\n",
-    "def mix_file_list(speaker_id, clean_other, split, num_samples, total_duration_mins, original_speaker_id, original_clean_other, n_orig=None, seed=42):\n",
-    "    \"\"\"\n",
-    "    Creates a mixed dataset of new and original speaker. num_samples or total_duration_mins specifies the amount \n",
-    "    of new speaker data to be used and n_orig specifies the number of original speaker samples. Creates a balanced \n",
-    "    dataset with alternating new and old speaker samples. Saves the filelist in the filelist_dir. \n",
-    "    \n",
-    "    Arguments:\n",
-    "    speaker_id -- speaker id of the new HiFiTTS speaker\n",
-    "    clean_other -- \"clean\" or \"other\" depending on type of data of new HiFiTTS speaker\n",
-    "    split -- \"train\" or \"dev\"\n",
-    "    num_samples -- Number samples of new speaker (set None if specifying total_duration_mins)\n",
-    "    total_duration_mins -- Total duration of new speaker's data (set None if specifying num_samples)\n",
-    "    original_speaker_id -- speaker id of the original HiFiTTS speaker (on which FastPitch was trained)\n",
-    "    original_clean_other -- \"clean\" or \"other\" depending on type of data of new HiFiTTS speaker\n",
-    "    n_orig -- Number of samples of old speaker to be mixed with new speaker\n",
-    "    \n",
-    "    \"\"\"\n",
-    "    file_list_name = \"{}_manifest_{}_{}.json\".format(speaker_id, clean_other, split)\n",
-    "    with open(os.path.join(data_dir, file_list_name), 'r') as f:\n",
-    "        all_records = [json.loads(l) for l in f.read().split(\"\\n\") if len(l) > 0]\n",
-    "    for r in all_records:\n",
-    "        r['audio_filepath'] = r['audio_filepath'][r['audio_filepath'].find(\"wav/\"):]\n",
-    "    \n",
-    "    original_file_list_name = \"{}_manifest_{}_{}.json\".format(original_speaker_id, original_clean_other, \"train\")\n",
-    "    with open(os.path.join(data_dir, original_file_list_name), 'r') as f:\n",
-    "        original_all_records = [json.loads(l) for l in f.read().split(\"\\n\") if len(l) > 0]\n",
-    "    for r in original_all_records:\n",
-    "        r['audio_filepath'] = r['audio_filepath'][r['audio_filepath'].find(\"wav/\"):]\n",
-    "    \n",
-    "    random.seed(seed)\n",
-    "    if n_orig is not None:\n",
-    "        random.shuffle(original_all_records)\n",
-    "        original_all_records = original_all_records[:n_orig]\n",
-    "        \n",
-    "    random.seed(seed)\n",
-    "    random.shuffle(all_records)\n",
-    "    \n",
-    "    if num_samples is not None and total_duration_mins is None:\n",
-    "        sub_records = all_records[:num_samples]\n",
-    "        fname_extension = \"ns_{}\".format(num_samples)\n",
-    "    elif num_samples is None and total_duration_mins is not None:\n",
-    "        sub_record_duration = 0.0\n",
-    "        sub_records = []\n",
-    "        for r in all_records:\n",
-    "            sub_record_duration += r['duration']\n",
-    "            if sub_record_duration > total_duration_mins * 60.0:\n",
-    "                print (\"Duration reached {} mins using {} records\".format(total_duration_mins, len(sub_records)))\n",
-    "                break\n",
-    "            sub_records.append(r)\n",
-    "        fname_extension = \"dur_{}_mins\".format( int(round(total_duration_mins)))\n",
-    "    elif num_samples is None and total_duration_mins is None:\n",
-    "        sub_records = all_records\n",
-    "        fname_extension = \"ns_all\"\n",
-    "    else:\n",
-    "        raise NotImplementedError()\n",
-    "        \n",
-    "    print(len(original_all_records))\n",
-    "    \n",
-    "    if not os.path.exists(filelist_dir):\n",
-    "        os.makedirs(filelist_dir)\n",
-    "        \n",
-    "    target_fp = os.path.join(filelist_dir, \"{}_mainifest_{}_{}_local_mix_{}.json\".format(speaker_id, split,  fname_extension, original_speaker_id))\n",
-    "    with open(target_fp, 'w') as f:\n",
-    "        for ridx, original_record in enumerate(original_all_records):\n",
-    "            original_record['audio_filepath'] = original_record['audio_filepath'][original_record['audio_filepath'].find(\"wav/\"):]\n",
-    "            original_record['audio_filepath'] = os.path.join(data_dir, original_record['audio_filepath']) \n",
-    "            \n",
-    "            new_speaker_record = sub_records[ridx % len(sub_records)]\n",
-    "            new_speaker_record['audio_filepath'] = new_speaker_record['audio_filepath'][new_speaker_record['audio_filepath'].find(\"wav/\"):]\n",
-    "            new_speaker_record['audio_filepath'] = os.path.join(data_dir, new_speaker_record['audio_filepath']) \n",
-    "            \n",
-    "            new_speaker_record['speaker'] = 1\n",
-    "            original_record['speaker'] = 0\n",
-    "            f.write(json.dumps(original_record) + \"\\n\")\n",
-    "            f.write(json.dumps(new_speaker_record) + \"\\n\")"
+    "Let's grab those files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "5c635928",
-   "metadata": {},
+   "metadata": {
+    "id": "3zg2H-32dNBU"
+   },
   "outputs": [],
   "source": [
-    "make_sub_file_list(92, \"clean\", \"train\", None, 5)\n",
-    "mix_file_list(92, \"clean\", \"train\", None, 5, 8051, \"other\", n_orig=5000)\n",
-    "make_sub_file_list(92, \"clean\", \"dev\", None, None)"
+    "!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch2_finetune.py\n",
+    "!mkdir conf && cd conf && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/fastpitch_align.yaml && cd .."
   ]
  },
  {
   "cell_type": "markdown",
-   "id": "ef75d1d5",
-   "metadata": {},
+   "metadata": {
+    "id": "12b5511c"
+   },
   "source": [
-    "## Finetuning the model on filelists"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "6c8b13b8",
-   "metadata": {},
-   "source": [
-    "To finetune the FastPitch model on the above created filelists, we use `examples/tts/fastpitch2_finetune.py` script to train the models with the `fastpitch_align_44100.yaml` configuration. This configuration file has been defined for 44100Hz HiFiGan dataset audio. The function `generate_training_command` in this notebook can be used to generate a training command for a given speaker and finetuning technique."
+    "We can now train our model with the following command:\n",
+    "\n",
+    "NOTE: This will take about 50 minutes on Colab's K80 GPUs.\n",
+    "\n",
+    "`python fastpitch2_finetune.py --config-name=fastpitch_align.yaml train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json +init_from_nemo_model=./tts_en_fastpitch_align.nemo +trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25 prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512 model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "a57fcfec",
-   "metadata": {},
+   "metadata": {
+    "id": "reY1LV4lwWoq"
+   },
   "outputs": [],
   "source": [
-    "# pitch statistics of the new speakers\n",
-    "# These can be computed from the pitch contours extracted using librosa yin\n",
-    "# Finetuning can still work without these, but we get better results using speaker specific pitch stats\n",
-    "pitch_stats = {\n",
-    "    92 : {\n",
-    "        'mean' : 214.5, # female speaker\n",
-    "        'std' : 30.9,\n",
-    "        'fmin' : 80,\n",
-    "        'fmax' : 512\n",
-    "    },\n",
-    "    6097 : {\n",
-    "        'mean' : 121.9, # male speaker\n",
-    "        'std' : 23.1,\n",
-    "        'fmin' : 30,\n",
-    "        'fmax' : 512\n",
-    "    }\n",
-    "}\n",
+    "!python fastpitch2_finetune.py --config-name=fastpitch_align.yaml train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json +init_from_nemo_model=./tts_en_fastpitch_align.nemo +trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25 prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512 model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "j2svKvd1eMhf"
+   },
+   "source": [
+    "Let's take a closer look at the training command:\n",
    "\n",
+    "* `python fastpitch2_finetune.py --config-name=fastpitch_align.yaml`\n",
+    "  * --config-name tells the script what config to use.\n",
    "\n",
-    "def generate_training_command(new_speaker_id, duration_mins, mixing_enabled, original_speaker_id, ckpt, use_new_pitch_stats=False):\n",
-    "    \"\"\"\n",
-    "    Generates the training command string to be run from the NeMo/ directory. Assumes we have created the finetuning filelists\n",
-    "    using the instructions given above.\n",
-    "    \n",
-    "    Arguments:\n",
-    "    new_speaker_id -- speaker id of the new HiFiTTS speaker\n",
-    "    duration_mins -- total minutes of the new speaker data (same as that used for creating the filelists)\n",
-    "    mixing_enabled -- True or False depending on whether we want to mix the original speaker data or not\n",
-    "    original_speaker_id -- speaker id of the original HiFiTTS speaker\n",
-    "    use_new_pitch_stats -- whether to use pitch_stats dictionary given above or not\n",
-    "    ckpt: Path to pretrained FastPitch checkpoint\n",
-    "    \n",
-    "    Returns:\n",
-    "    Training command string\n",
-    "    \"\"\"\n",
-    "    def _find_epochs(duration_mins, mixing_enabled, n_orig=None):\n",
-    "        # estimated num of epochs \n",
-    "        if duration_mins == 5:\n",
-    "            epochs = 1000\n",
-    "        elif duration_mins == 30:\n",
-    "            epochs = 300\n",
-    "        elif duration_mins == 60:\n",
-    "            epochs = 150\n",
-    "        \n",
-    "        if mixing_enabled:\n",
-    "            if duration_mins == 5:\n",
-    "                epochs = epochs/50 + 1\n",
-    "            elif duration_mins == 30:\n",
-    "                epochs = epochs/10 + 1\n",
-    "            elif duration_mins == 60:\n",
-    "                epochs = epochs/5 + 1\n",
-    "        \n",
-    "        return int(epochs)\n",
-    "            \n",
-    "            \n",
-    "    if ckpt.endswith(\".nemo\"):\n",
-    "        ckpt_arg_name = \"init_from_nemo_model\"\n",
-    "    else:\n",
-    "        ckpt_arg_name = \"init_from_ptl_ckpt\"\n",
-    "    if not mixing_enabled:\n",
-    "        train_dataset = \"{}_mainifest_train_dur_{}_mins_local.json\".format(new_speaker_id, duration_mins)\n",
-    "        val_dataset = \"{}_mainifest_dev_ns_all_local.json\".format(new_speaker_id)\n",
-    "        prior_folder = os.path.join(data_dir, \"Priors{}\".format(new_speaker_id))\n",
-    "        exp_dir = \"{}_to_{}_no_mixing_{}_mins\".format(original_speaker_id, new_speaker_id, duration_mins)\n",
-    "        n_speakers = 1\n",
-    "    else:\n",
-    "        train_dataset = \"{}_mainifest_train_dur_{}_mins_local_mix_{}.json\".format(new_speaker_id, duration_mins, original_speaker_id)\n",
-    "        val_dataset = \"{}_mainifest_dev_ns_all_local.json\".format(new_speaker_id)\n",
-    "        prior_folder = os.path.join(data_dir, \"Priors_{}_mix_{}\".format(new_speaker_id, original_speaker_id))\n",
-    "        exp_dir = \"{}_to_{}_mixing_{}_mins\".format(original_speaker_id, new_speaker_id, duration_mins)\n",
-    "        n_speakers = 2\n",
-    "    train_dataset = os.path.join(filelist_dir, train_dataset)\n",
-    "    val_dataset = os.path.join(filelist_dir, val_dataset)\n",
-    "    exp_dir = os.path.join(exp_base_dir, exp_dir)\n",
-    "                                    \n",
-    "    max_epochs = _find_epochs(duration_mins, mixing_enabled, n_orig=None)\n",
-    "    config_name = \"fastpitch_align_44100.yaml\"\n",
-    "    \n",
-    "    training_command = \"python examples/tts/fastpitch2_finetune.py --config-name={} train_dataset={} validation_datasets={} +{}={} trainer.max_epochs={} trainer.check_val_every_n_epoch=1 prior_folder={} model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir={} model.n_speakers={}\".format(\n",
-    "        config_name, train_dataset, val_dataset, ckpt_arg_name, ckpt, max_epochs, prior_folder, exp_dir, n_speakers)\n",
-    "    if use_new_pitch_stats:\n",
-    "        training_command += \" model.pitch_avg={} model.pitch_std={} model.pitch_fmin={} model.pitch_fmax={}\".format(\n",
-    "            pitch_stats[new_speaker_id]['mean'], \n",
-    "            pitch_stats[new_speaker_id]['std'],\n",
-    "            pitch_stats[new_speaker_id]['fmin'],\n",
-    "            pitch_stats[new_speaker_id]['fmax']\n",
-    "        )\n",
-    "    training_command += \" model.optim.lr=2e-4 ~model.optim.sched\"\n",
-    "    return training_command\n",
-    "    "
+    "* `train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json`\n",
+    "  * We tell the script which manifest files we want to train and evaluate on.\n",
+    "\n",
+    "* `+init_from_nemo_model=./tts_en_fastpitch_align.nemo`\n",
+    "  * We tell the script what checkpoint to finetune from.\n",
+    "\n",
+    "* `+trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25`\n",
+    "  * For this experiment, we need to tell the script to train for 1000 training steps/iterations. We need to remove max_epochs using `~trainer.max_epochs`.\n",
+    "\n",
+    "* `prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24`\n",
+    "  * Some dataset parameters. The dataset does some online processing and stores the processing steps to the `prior_folder`.\n",
+    "\n",
+    "* `exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins`\n",
+    "  * Where we want to save our log files, tensorboard file, checkpoints, and more.\n",
+    "\n",
+    "* `model.n_speakers=1`\n",
+    "  * The number of speakers in the data. There is only 1 for now, but we will revisit this parameter later in the notebook.\n",
+    "\n",
+    "* `model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512`\n",
+    "  * For the new speaker, we need to define new pitch hyperparameters for better audio quality.\n",
+    "  * These parameters work for speaker 6097 from the HiFiTTS dataset.\n",
+    "  * For speaker 92, we suggest `model.pitch_avg=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512`.\n",
+    "  * fmin and fmax are hyperparameters to librosa's pyin function. We recommend tweaking these per speaker.\n",
+    "  * After fmin and fmax are defined, the pitch mean and std can be extracted from the resulting pitch contours.\n",
+    "\n",
+    "* `model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam`\n",
+    "  * For fine-tuning, we lower the learning rate.\n",
+    "  * We use a fixed learning rate of 2e-4, removing the scheduler with `~model.optim.sched`.\n",
+    "  * We switch from the lamb optimizer to the adam optimizer."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "c3bdf1ed"
+   },
+   "source": [
+    "## Synthesize Samples from Finetuned Checkpoints\n",
+    "\n",
+    "---\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "f2b46325"
+   },
+   "source": [
+    "Once we have finetuned our FastPitch model, we can synthesize the audio samples for given text using the following inference steps. We use a HiFiGAN vocoder trained on LJSpeech.\n",
+    "\n",
+    "We define some helper functions as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "f98c55af",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "new_speaker_id = 92\n",
-    "duration_mins = 5\n",
-    "mixing = False\n",
-    "original_speaker_id = 8051\n",
-    "ckpt_path = <PATH_TO_PRETRAINED_FASTPITCH_CHECKPOINT>\n",
-    "print(generate_training_command(new_speaker_id, duration_mins, mixing, original_speaker_id, ckpt_path, True))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "12b5511c",
-   "metadata": {},
-   "source": [
-    "The generated command should look something like this. We can ofcourse tweak things like epochs/learning rate if we like\n",
-    "\n",
-    "`python examples/tts/fastpitch2_finetune.py --config-name=fastpitch_align_44100 train_dataset=filelists/92_mainifest_train_dur_5_mins_local.json validation_datasets=filelists/92_mainifest_dev_ns_all_local.json +init_from_nemo_model=PreTrainedModels/FastPitch.nemo trainer.max_epochs=1000 trainer.check_val_every_n_epoch=1 prior_folder=Hi_Fi_TTS_v_0/Priors92 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=inetuningDemo/8051_to_92_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512  model.optim.lr=2e-4 ~model.optim.sched`"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "bc30d1e2",
-   "metadata": {},
-   "source": [
-    "^ Run the above command from the terminal from the `NeMo/` directory to start finetuning a model. "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c3bdf1ed",
-   "metadata": {},
-   "source": [
-    "## Synthesize samples from finetuned checkpoints"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "f2b46325",
-   "metadata": {},
-   "source": [
-    "Once we have finetuned our FastPitch model, we can synthesize the audio samples for given text using the following inference steps. We use a HiFiGAN vocoder trained on multiple speakers, get the trained checkpoint path for our trained model and synthesize audio for a given text as follows."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "886c91dc",
-   "metadata": {},
+   "metadata": {
+    "id": "886c91dc"
+   },
   "outputs": [],
   "source": [
    "from nemo.collections.tts.models import HifiGanModel\n",
    "from nemo.collections.tts.models import FastPitchModel\n",
    "\n",
-    "hifigan_ckpt_path =  <PATH_TO_PRETRAINED_HIFIGAN_CHECKPOINT>\n",
-    "vocoder = HifiGanModel.load_from_checkpoint(hifigan_ckpt_path)\n",
+    "vocoder = HifiGanModel.from_pretrained(\"tts_hifigan\")\n",
    "vocoder.eval().cuda()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "0a4c986f",
-   "metadata": {},
+   "metadata": {
+    "id": "0a4c986f"
+   },
   "outputs": [],
   "source": [
    "def infer(spec_gen_model, vocoder_model, str_input, speaker = None):\n",
@ -456,8 +366,9 @@
  },
  {
   "cell_type": "markdown",
-   "id": "0153bd5a",
-   "metadata": {},
+   "metadata": {
+    "id": "0153bd5a"
+   },
   "source": [
    "Specify the speaker id, duration mins and mixing variable to find the relevant checkpoint from the exp_base_dir and compare the synthesized audio with validation samples of the new speaker."
   ]
@ -465,17 +376,17 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "8901f88b",
-   "metadata": {},
+   "metadata": {
+    "id": "8901f88b"
+   },
   "outputs": [],
   "source": [
-    "new_speaker_id = 92\n",
+    "new_speaker_id = 6097\n",
    "duration_mins = 5\n",
    "mixing = False\n",
-    "original_speaker_id = 8051\n",
+    "original_speaker_id = \"ljspeech\"\n",
    "\n",
-    "\n",
-    "_ ,last_ckpt = get_best_ckpt(exp_base_dir, new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
+    "_ ,last_ckpt = get_best_ckpt(\"./\", new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
    "print(last_ckpt)\n",
    "\n",
    "spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)\n",
@ -486,7 +397,7 @@
    "\n",
    "num_val = 2\n",
    "\n",
-    "manifest_path = os.path.join(filelist_dir, \"{}_mainifest_dev_ns_all_local.json\".format(new_speaker_id))\n",
+    "manifest_path = os.path.join(\"./\", \"{}_manifest_dev_ns_all_local.json\".format(new_speaker_id))\n",
    "val_records = []\n",
    "with open(manifest_path, \"r\") as f:\n",
    "    for i, line in enumerate(f):\n",
@ -496,10 +407,10 @@
    "            \n",
    "for val_record in val_records:\n",
    "    print (\"Real validation audio\")\n",
-    "    ipd.display(ipd.Audio(val_record['audio_filepath'], rate=44100))\n",
+    "    ipd.display(ipd.Audio(val_record['audio_filepath'], rate=22050))\n",
    "    print (\"SYNTHESIZED FOR -- Speaker: {} | Dataset size: {} mins | Mixing:{} | Text: {}\".format(new_speaker_id, duration_mins, mixing, val_record['text']))\n",
    "    spec, audio = infer(spec_model, vocoder, val_record['text'], speaker = _speaker)\n",
-    "    ipd.display(ipd.Audio(audio, rate=44100))\n",
+    "    ipd.display(ipd.Audio(audio, rate=22050))\n",
    "    %matplotlib inline\n",
    "    #if spec is not None:\n",
    "    imshow(spec, origin=\"lower\", aspect = \"auto\")\n",
@ -507,32 +418,84 @@
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "18ce524f",
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ge2s7s9-w3py"
+   },
+   "source": [
+    "## Improving Speech Quality\n",
+    "___\n",
+    "\n",
+    "Finetuning FastPitch lets us generate audio in a male voice, but the audio quality is not as good as we would like. We recommend two steps to improve it:\n",
+    "\n",
+    "* Finetuning HiFiGAN\n",
+    "* Adding more data\n",
+    "\n",
+    "Both of these steps are outside the scope of this notebook due to the limited compute available on Colab.\n",
+    "\n",
+    "### Finetuning HiFiGAN\n",
+    "The synthesized samples may contain audible crackling. To fix this, we need to finetune HiFiGAN on the new speaker's data. HiFiGAN improves when finetuned on synthesized mel spectrograms, so the first step is to generate mel spectrograms with our finetuned FastPitch model.\n",
+    "\n",
+    "```python\n",
+    "import json\n",
+    "\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "\n",
+    "# Assumes parser_model and spec_gen_model both refer to the finetuned FastPitch model loaded above\n",
+    "\n",
+    "# Get records from the training manifest\n",
+    "manifest_path = \"./6097_manifest_train_dur_5_mins_local.json\"\n",
+    "records = []\n",
+    "with open(manifest_path, \"r\") as f:\n",
+    "    for line in f:\n",
+    "        records.append(json.loads(line))\n",
+    "\n",
+    "# Generate a spectrogram for each item\n",
+    "for i, r in enumerate(records):\n",
+    "    with torch.no_grad():\n",
+    "        parsed = parser_model.parse(r['text'])\n",
+    "        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed)\n",
+    "        if isinstance(spectrogram, torch.Tensor):\n",
+    "            spectrogram = spectrogram.to('cpu').numpy()\n",
+    "        # Drop the batch dimension if present\n",
+    "        if len(spectrogram.shape) == 3:\n",
+    "            spectrogram = spectrogram[0]\n",
+    "        # np.save appends the .npy extension to the file name\n",
+    "        np.save(f\"mel_{i}\", spectrogram)\n",
+    "        r[\"mel_filepath\"] = f\"mel_{i}.npy\"\n",
+    "\n",
+    "# Save to a new json\n",
+    "with open(\"hifigan_train_ft.json\", \"w\") as f:\n",
+    "    for r in records:\n",
+    "        f.write(json.dumps(r) + '\\n')\n",
+    "\n",
+    "# Please do the same for the validation json. Code is omitted.\n",
+    "```\n",
+    "\n",
+    "We can then finetune HiFiGAN similarly to FastPitch using NeMo's [hifigan_finetune.py](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/hifigan_finetune.py) and [hifigan.yaml](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/hifigan/hifigan.yaml):\n",
+    "\n",
+    "`python examples/tts/hifigan_finetune.py --config-name=hifigan.yaml model.train_ds.dataloader_params.batch_size=32 model.max_steps=1000 ~model.sched model.optim.lr=0.0001 train_dataset=./hifigan_train_ft.json validation_datasets=./hifigan_val_ft.json exp_manager.exp_dir=hifigan_ft +init_from_nemo_model=tts_hifigan.nemo trainer.check_val_every_n_epoch=10`\n",
+    "\n",
+    "### Improving TTS by Adding More Data\n",
+    "We can add more data in two ways, which can be combined for the best effect:\n",
+    "\n",
+    "* Add more training data from the new speaker\n",
+    "\n",
+    "The entire notebook can be repeated from the top after a new .json is defined for the additional data. Modify your finetuning commands to point to the new .json. Be sure to increase the number of steps for both FastPitch and HiFiGAN finetuning as more data is added. We recommend 1000 steps per minute of audio for FastPitch and 500 steps per minute of audio for HiFiGAN.\n",
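+    "\n",
+    "As a rough sketch of this rule of thumb, the step budget can be computed from the total minutes of new speaker audio. The helper name `finetune_steps` and the 15 minute figure are hypothetical, used only for illustration:\n",
+    "\n",
+    "```python\n",
+    "# Recommended rates from the text above (steps per minute of audio)\n",
+    "FASTPITCH_STEPS_PER_MIN = 1000\n",
+    "HIFIGAN_STEPS_PER_MIN = 500\n",
+    "\n",
+    "def finetune_steps(audio_minutes, steps_per_minute):\n",
+    "    # Scale the number of finetuning steps linearly with the amount of audio\n",
+    "    return int(round(audio_minutes * steps_per_minute))\n",
+    "\n",
+    "audio_minutes = 15  # hypothetical amount of new speaker audio\n",
+    "fastpitch_steps = finetune_steps(audio_minutes, FASTPITCH_STEPS_PER_MIN)\n",
+    "hifigan_steps = finetune_steps(audio_minutes, HIFIGAN_STEPS_PER_MIN)\n",
+    "print(f'FastPitch: {fastpitch_steps} steps, HiFiGAN: {hifigan_steps} steps')\n",
+    "```\n",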
+    "\n",
+    "* Mix new speaker data with old speaker data\n",
+    "\n",
+    "We recommend training FastPitch on both old speaker data (LJSpeech in this notebook) and the new speaker data. In this case, please modify the .json when finetuning FastPitch to include speaker information:\n",
+    "\n",
+    "`\n",
+    "{\"audio_filepath\": \"new_speaker.wav\", \"text\": \"sample\", \"duration\": 2.6, \"speaker\": 1}\n",
+    "{\"audio_filepath\": \"old_speaker.wav\", \"text\": \"LJSpeech sample\", \"duration\": 2.6, \"speaker\": 0}\n",
+    "`\n",
+    "5 hours of data from the old speaker should be sufficient. Since we will usually have less data from the new speaker, we need to ensure that the model sees similar amounts of new and old data. For each sample from the old speaker, add a sample from the new speaker to the .json; samples from the new speaker will be repeated as needed.\n",
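+    "\n",
+    "A minimal sketch of this one-to-one mixing is shown below. It assumes the old and new speaker records have already been loaded from their manifests as lists of dicts and that the new speaker list is not empty. The helper `mix_manifests`, the toy records, and the output name `mixed_train.json` are hypothetical:\n",
+    "\n",
+    "```python\n",
+    "import itertools\n",
+    "import json\n",
+    "\n",
+    "def mix_manifests(old_records, new_records):\n",
+    "    # Pair every old speaker record with one new speaker record, cycling\n",
+    "    # through the smaller new speaker set so that its samples are repeated\n",
+    "    new_cycle = itertools.cycle(new_records)\n",
+    "    mixed = []\n",
+    "    for old in old_records:\n",
+    "        mixed.append(dict(old, speaker=0))\n",
+    "        mixed.append(dict(next(new_cycle), speaker=1))\n",
+    "    return mixed\n",
+    "\n",
+    "# Toy records; real ones would be read line by line from the two manifests\n",
+    "old = [{'audio_filepath': f'old_{i}.wav', 'text': 'LJSpeech sample', 'duration': 2.6} for i in range(4)]\n",
+    "new = [{'audio_filepath': f'new_{i}.wav', 'text': 'sample', 'duration': 2.6} for i in range(2)]\n",
+    "\n",
+    "mixed = mix_manifests(old, new)\n",
+    "with open('mixed_train.json', 'w') as f:\n",
+    "    for r in mixed:\n",
+    "        f.write(json.dumps(r) + '\\n')\n",
+    "```\n",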
+    "\n",
+    "Modify the FastPitch training command to point to the new training and validation .jsons, and update `model.n_speakers=1` to `model.n_speakers=2`. Ensure the pitch statistics correspond to the new speaker.\n",
+    "\n",
+    "For HiFiGAN finetuning, the training should be done on the new speaker data."
+   ]
  }
 ],
 "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3",
-   "language": "python",
-   "name": "python3"
-  },
  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.8.10"
-  }
+   "name": "python"
+  },
+  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 5
--- a/tutorials/tts/Inference_DurationPitchControl.ipynb
+++ b/tutorials/tts/Inference_DurationPitchControl.ipynb
@ -2,15 +2,16 @@
 "cells": [
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "# TTS Inference Prosody Control\n",
    "\n",
    "This notebook is intended to teach users how to control duration and pitch with the FastPitch model."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "# License\n",
    "\n",
@ -27,15 +28,18 @@
    "> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "> See the License for the specific language governing permissions and\n",
    "> limitations under the License."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
   "source": [
    "\"\"\"\n",
-    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
+    "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
@ -47,24 +51,22 @@
    "# !pip install wget unidecode\n",
    "# BRANCH = 'main'\n",
    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]"
-   ],
-   "outputs": [],
-   "metadata": {
-    "tags": []
-   }
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Please run the below cell to import libraries used in this notebook. This cell will load the fastpitch model and hifigan models used in the rest of the notebook. Lastly, two helper functions are defined. One is used for inference while the other is used to plot the inference results."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "# Import all libraries\n",
    "import IPython.display as ipd\n",
@ -113,12 +115,11 @@
    "    librosa.display.specshow(np.log(spec+1e-12), y_axis='log')\n",
    "    ipd.display(ipd.Audio(audio, rate=sr))\n",
    "    plt.show()"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## Duration Control\n",
    "\n",
@ -145,12 +146,13 @@
    "```\n",
    "\n",
    "Let's try this out with FastPitch"
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#Define what we want the model to say\n",
    "input_string = \"Hey, I am speaking at different paces!\"  # Feel free to change it and experiment\n",
@ -169,12 +171,11 @@
    "_, audio, *_ = str_to_audio(input_string, pace=0.75)\n",
    "print(f\"This is fastpitch speaking at the slower pace of 0.75. This example is {len(audio[0])/sr:.3f} seconds long.\")\n",
    "ipd.display(ipd.Audio(audio, rate=sr))"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## Pitch Control\n",
    "\n",
@ -204,12 +205,13 @@
    "<img src=\"https://raw.githubusercontent.com/NVIDIA/NeMo/main/tutorials/tts/fastpitch-pitch.png\">\n",
    "\n",
    "Notice that the last word `pitch` has an increase in pitch to stress that it is a question."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "import librosa\n",
    "import librosa.display\n",
@ -237,12 +239,11 @@
    "ax.plot(pitch_pred.cpu().numpy()[0], color='cyan', linewidth=1)\n",
    "librosa.display.specshow(np.log(spec+1e-12), y_axis='log')\n",
    "ipd.display(ipd.Audio(audio, rate=sr))"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## Plot Control\n",
    "\n",
@ -255,20 +256,21 @@
    "3) Pitch inversion\n",
    "\n",
    "4) Pitch amplification"
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "### Pitch Shift\n",
    "First, let's handle pitch shifting. To shift the pitch up or down by some Hz, we can just add or subtract as needed. Let's shift the pitch down by 50 Hz and compare it to the previous example."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#Define what we want the model to say\n",
    "input_string = \"Hey, what is my pitch?\"  # Feel free to change it and experiment\n",
@ -295,21 +297,21 @@
    "display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
    "print(\"The second shifted sample. This sample is much deeper than the previous.\")\n",
    "display_pitch(audio_shift, pitch_shift, durs=durs_shift_pred)"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "### Pitch Flattening\n",
    "Second, let's look at pitch flattening. To flatten the pitch, we just set it to 0. Let's run it and compare the results."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#Define what we want the model to say\n",
    "input_string = \"Hey, what is my pitch?\"  # Feel free to change it and experiment\n",
@ -328,21 +330,21 @@
    "display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
    "print(\"The second flattened sample. This sample is more monotone than the previous.\")\n",
    "display_pitch(audio_flat, pitch_flat, durs=durs_flat_pred)"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "### Pitch Inversion\n",
    "Third, let's look at pitch inversion. To invert the pitch, we just multiply it by -1. Let's run it and compare the results."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#Define what we want the model to say\n",
    "input_string = \"Hey, what is my pitch?\"  # Feel free to change it and experiment\n",
@ -361,21 +363,21 @@
    "display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
    "print(\"The second inverted sample. This sample sounds less like a question and more like a statement.\")\n",
    "display_pitch(audio_inv, pitch_inv, durs=durs_inv_pred)"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "### Pitch Amplify\n",
    "Lastly, let's look at pitch amplifying. To amplify the pitch, we just multiply it by a positive constant. Let's run it and compare the results."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#Define what we want the model to say\n",
    "input_string = \"Hey, what is my pitch?\"  # Feel free to change it and experiment\n",
@ -394,22 +396,22 @@
    "display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
    "print(\"The second amplified sample.\")\n",
    "display_pitch(audio_amp, pitch_amp, durs=durs_amp_pred)"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## Putting it all together\n",
    "\n",
    "Now that we understand how to control the duration and pitch of TTS models, we can show how to adjust the voice to sound more solemn (slower speed + lower pitch), or more excited (higher speed + higher pitch)."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#Define what we want the model to say\n",
    "input_string = \"I want to pass on my condolences for your loss.\"\n",
@ -431,13 +433,13 @@
    "display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
     "print(\"The second solemn sample\")\n",
    "display_pitch(audio_sol, pitch_sol, durs=durs_sol_pred)"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "#Define what we want the model to say\n",
    "input_string = \"Congratulations on your promotion.\"\n",
@ -457,12 +459,11 @@
    "display_pitch(audio_norm, pitch_norm_pred, durs=durs_norm_pred)\n",
    "print(\"The second excited sample\")\n",
    "display_pitch(audio_excite, pitch_excite, durs=durs_excite_pred)"
-   ],
-   "outputs": [],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## Other Models\n",
    "\n",
@ -475,17 +476,16 @@
    "### Pitch Control\n",
    "\n",
    "Pitch control is more complicated. There are numerous design decisions that differ between models: 1) Whether to normalize the pitch, 2) Whether to predict pitch per spectrogram frame or per token, and more. While the basic transformations presented here (shift, flatten, invert, and amplify) can be done with all pitch predicting models, where to add this pitch transformation will differ depending on the model."
-   ],
-   "metadata": {}
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "[1] https://arxiv.org/abs/1905.09263"
-   ],
-   "metadata": {}
+   ]
  }
 ],
 "metadata": {
--- a/tutorials/tts/Inference_ModelSelect.ipynb
+++ b/tutorials/tts/Inference_ModelSelect.ipynb
@ -39,7 +39,7 @@
   "outputs": [],
   "source": [
    "\"\"\"\n",
-    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
+    "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
--- a/tutorials/tts/Tacotron2_Training.ipynb
+++ b/tutorials/tts/Tacotron2_Training.ipynb
@ -47,7 +47,7 @@
   "outputs": [],
   "source": [
    "\"\"\"\n",
-    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
+    "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
--- a/tutorials/tts/TalkNet_Training.ipynb
+++ b/tutorials/tts/TalkNet_Training.ipynb
@ -43,7 +43,7 @@
   "outputs": [],
   "source": [
    "\"\"\"\n",
-    "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
+    "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
    "Instructions for setting up Colab are as follows:\n",
    "1. Open a new Python 3 notebook.\n",
    "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",