|
|
|
@@ -2,393 +2,303 @@
|
|
|
|
|
"cells": [
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "8d0bbac2",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "8d0bbac2"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"# Finetuning FastPitch for a new speaker"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "2d063299",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"In this tutorial, we will finetune a single speaker FastPitch (with alignment) model on limited amount of new speaker's data. We cover two finetuning techniques: \n",
|
|
|
|
|
"1. We finetune the model parameters only on new speaker's text and speech pairs; \n",
|
|
|
|
|
"2. We add a learnable speaker embedding layer to the model, and finetune on a mixture of original speaker's and new speaker's data.\n",
|
|
|
|
|
"# Finetuning FastPitch for a new speaker\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"We will first prepare filelists containing the audiopaths and text of the samples on which we wish to finetune the model, then generate and run a training command to finetune Fastpitch on 5 mins of data, and finally synthesize the audio from the trained checkpoint."
|
|
|
|
|
"In this tutorial, we will finetune a single speaker FastPitch (with alignment) model on 5 mins of a new speaker's data. We will finetune the model parameters only on new speaker's text and speech pairs.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"We will download the training data, then generate and run a training command to finetune Fastpitch on 5 mins of data, and synthesize the audio from the trained checkpoint.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"A final section will describe approaches to improve audio quality past this notebook."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "2502cf61",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "nGw0CBaAtmQ6"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## Creating filelists for training"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "81fa2c02",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"We will first create filelists of audio on which we wish to finetune the FastPitch model. We will create two kinds of filelists, one which contains only the audio files of the new speaker and one which contains the mixed audio files of the new speaker and the speaker used for training the pre-trained FastPitch Checkpoint."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "53746a7b",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"<div class=\"alert alert-block alert-warning\">\n",
|
|
|
|
|
" WARNING: This notebook requires downloading the HiFiTTS dataset which is 41GB. We plan on reducing the amount the download amount.\n",
|
|
|
|
|
"</div>"
|
|
|
|
|
"## License\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"> Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
|
|
|
|
|
">\n",
|
|
|
|
|
"> Licensed under the Apache License, Version 2.0 (the \"License\");\n",
|
|
|
|
|
"> you may not use this file except in compliance with the License.\n",
|
|
|
|
|
"> You may obtain a copy of the License at\n",
|
|
|
|
|
">\n",
|
|
|
|
|
"> http://www.apache.org/licenses/LICENSE-2.0\n",
|
|
|
|
|
">\n",
|
|
|
|
|
"> Unless required by applicable law or agreed to in writing, software\n",
|
|
|
|
|
"> distributed under the License is distributed on an \"AS IS\" BASIS,\n",
|
|
|
|
|
"> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
|
|
|
|
|
"> See the License for the specific language governing permissions and\n",
|
|
|
|
|
"> limitations under the License."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "b7b1563d",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "U7bOoIgLttRC"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"\"\"\"\n",
|
|
|
|
|
"You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n",
|
|
|
|
|
"Instructions for setting up Colab are as follows:\n",
|
|
|
|
|
"1. Open a new Python 3 notebook.\n",
|
|
|
|
|
"2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n",
|
|
|
|
|
"3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
|
|
|
|
|
"4. Run this cell to set up dependencies.\n",
|
|
|
|
|
"\"\"\"\n",
|
|
|
|
|
"# If you're using Google Colab and not running locally, uncomment and run this cell.\n",
|
|
|
|
|
"!apt-get install sox libsndfile1 ffmpeg\n",
|
|
|
|
|
"!pip install wget unidecode\n",
|
|
|
|
|
"BRANCH = 'stable'\n",
|
|
|
|
|
"!python -m pip install git+https://github.com/NeMo/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "2502cf61"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## Downloading Data\n",
|
|
|
|
|
"___"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "81fa2c02"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"Download and untar the data.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"The data contains a 5 minute subset of audio from speaker 6097 from the HiFiTTS dataset."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "VIFgqxLOpxha"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz # Contains 10MB of data\n",
|
|
|
|
|
"!tar -xzf 6097_5_mins.tar.gz"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "gSQqq0fBqy8K"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"Looking at manifest.json, we see a standard NeMo json that contains the filepath, text, and duration. Please note that manifest.json only contains the relative path.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"```\n",
|
|
|
|
|
"{\"audio_filepath\": \"audio/presentpictureofnsw_02_mann_0532.wav\", \"text\": \"not to stop more than ten minutes by the way\", \"duration\": 2.6, \"text_no_preprocessing\": \"not to stop more than ten minutes by the way,\", \"text_normalized\": \"not to stop more than ten minutes by the way,\"}\n",
|
|
|
|
|
"```\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"Let's take 2 samples from the dataset and split it off into a validation set. Then, split all other samples into the training set."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "B8gVfp5SsuDd"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"!cat ./6097_5_mins/manifest.json | tail -n 2 > ./6097_manifest_dev_ns_all_local.json\n",
|
|
|
|
|
"!cat ./6097_5_mins/manifest.json | head -n -2 > ./6097_manifest_train_dur_5_mins_local.json\n",
|
|
|
|
|
"!ln -s ./6097_5_mins/audio audio"
|
|
|
|
|
]
|
|
|
|
|
},
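The split above relies on `head -n -2`, which is a GNU coreutils extension and may not be available on every system. As a rough sketch of the same split in Python, assuming the manifest has one JSON record per line (the function name `split_manifest` and its `n_dev` parameter are hypothetical, not part of NeMo):

```python
import json
from pathlib import Path


def split_manifest(manifest_path, train_path, dev_path, n_dev=2):
    """Write the last n_dev records to dev_path and all others to train_path."""
    # Keep only non-empty lines; each one should be a JSON record.
    lines = [ln for ln in Path(manifest_path).read_text().splitlines() if ln.strip()]
    if not 0 < n_dev < len(lines):
        raise ValueError("n_dev must be between 1 and len(manifest) - 1")
    for ln in lines:
        json.loads(ln)  # fail early on a malformed record
    train, dev = lines[:-n_dev], lines[-n_dev:]
    Path(train_path).write_text("\n".join(train) + "\n")
    Path(dev_path).write_text("\n".join(dev) + "\n")
    return len(train), len(dev)
```

For this notebook, the call would be `split_manifest("./6097_5_mins/manifest.json", "./6097_manifest_train_dur_5_mins_local.json", "./6097_manifest_dev_ns_all_local.json")`.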
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "ef75d1d5"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## Finetuning FastPitch\n",
|
|
|
|
|
"___\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "lhhg2wBNtW0r"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"Let's first download the pretrained checkpoint that we want to finetune from. NeMo will save checkpoints to ~/.cache, so let's move that to our current directory"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "LggELooctXCT"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"import random\n",
|
|
|
|
|
"import os\n",
|
|
|
|
|
"import json\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"import torch\n",
|
|
|
|
|
"import IPython.display as ipd\n",
|
|
|
|
|
"from matplotlib.pyplot import imshow\n",
|
|
|
|
|
"from matplotlib import pyplot as plt\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"data_dir = <ADD_PATH_TO_DIRECTORY_CONTAINING_HIFIGAN_DATASET> # Download dataset from https://www.openslr.org/109/. Specify path to Hi_Fi_TTS_v_0\n",
|
|
|
|
|
"filelist_dir = <ADD_PATH_TO_DIRECTORY_IN_WHICH_WE_WISH_TO_SAVE_FILELISTS> # will be created if it does not exist\n",
|
|
|
|
|
"exp_base_dir = <ADD_PATH_TO_BASE_EXPERIMENT_DIRECTORY_FOR_CHECKPOINTS_AND_LOGS> # will be created if it does not exist\n",
|
|
|
|
|
"from nemo.collections.tts.models import FastPitchModel\n",
|
|
|
|
|
"FastPitchModel.from_pretrained(\"tts_en_fastpitch\")\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"from pathlib import Path\n",
|
|
|
|
|
"nemo_files = [p for p in Path(\"/root/.cache/torch/NeMo/\").glob(\"**/tts_en_fastpitch_align.nemo\")]\n",
|
|
|
|
|
"print(f\"Copying {nemo_files[0]} to ./\")\n",
|
|
|
|
|
"Path(\"./tts_en_fastpitch_align.nemo\").write_bytes(nemo_files[0].read_bytes())"
|
|
|
|
|
]
|
|
|
|
|
},
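The copy step above looks under `/root/.cache`, which matches a Colab session running as root. For a local run, a minimal sketch along the same lines, assuming the checkpoint was cached somewhere under the current user's home directory (the helper `copy_cached_checkpoint` is hypothetical, and the actual cache location may differ between NeMo versions):

```python
import shutil
from pathlib import Path


def copy_cached_checkpoint(cache_root, filename, dest_dir="."):
    """Copy the first file named `filename` found under cache_root into dest_dir."""
    # expanduser() lets callers pass paths such as "~/.cache/torch/NeMo".
    matches = sorted(Path(cache_root).expanduser().glob(f"**/{filename}"))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {cache_root}")
    dest = Path(dest_dir) / filename
    shutil.copyfile(matches[0], dest)
    return dest
```

A call such as `copy_cached_checkpoint("~/.cache/torch/NeMo", "tts_en_fastpitch_align.nemo")` would then place the file in the working directory, and it raises a clear error instead of an `IndexError` when nothing is found.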
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "6c8b13b8"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"To finetune the FastPitch model on the above created filelists, we use `examples/tts/fastpitch2_finetune.py` script to train the models with the `fastpitch_align.yaml` configuration.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"def make_sub_file_list(speaker_id, clean_other, split, num_samples, total_duration_mins, seed=42):\n",
|
|
|
|
|
" \"\"\"\n",
|
|
|
|
|
" Creates a subset of training data for a HiFiTTS speaker. Specify either the num_samples or total_duration_mins\n",
|
|
|
|
|
" Saves the filelist in the filelist_dir. split is either \"train\" or \"dev\"\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" Arguments:\n",
|
|
|
|
|
" speaker_id -- speaker id of the new HiFiTTS speaker\n",
|
|
|
|
|
" clean_other -- \"clean\" or \"other\" depending on type of data of new HiFiTTS speaker\n",
|
|
|
|
|
" split -- \"train\" or \"dev\"\n",
|
|
|
|
|
" num_samples -- Number samples of new speaker (set None if specifying total_duration_mins)\n",
|
|
|
|
|
" total_duration_mins -- Total duration of new speaker's data (set None if specifying num_samples)\n",
|
|
|
|
|
" \"\"\"\n",
|
|
|
|
|
" file_list_name = \"{}_manifest_{}_{}.json\".format(speaker_id, clean_other, split)\n",
|
|
|
|
|
" with open(os.path.join(data_dir, file_list_name), 'r') as f:\n",
|
|
|
|
|
" all_records = [json.loads(l) for l in f.read().split(\"\\n\") if len(l) > 0]\n",
|
|
|
|
|
" for r in all_records:\n",
|
|
|
|
|
" r['audio_filepath'] = r['audio_filepath'][r['audio_filepath'].find(\"wav/\"):]\n",
|
|
|
|
|
" random.seed(seed)\n",
|
|
|
|
|
" random.shuffle(all_records)\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" if num_samples is not None and total_duration_mins is None:\n",
|
|
|
|
|
" sub_records = all_records[:num_samples]\n",
|
|
|
|
|
" fname_extension = \"ns_{}\".format(num_samples)\n",
|
|
|
|
|
" elif num_samples is None and total_duration_mins is not None:\n",
|
|
|
|
|
" sub_record_duration = 0.0\n",
|
|
|
|
|
" sub_records = []\n",
|
|
|
|
|
" for r in all_records:\n",
|
|
|
|
|
" sub_record_duration += r['duration']\n",
|
|
|
|
|
" if sub_record_duration > total_duration_mins*60.0:\n",
|
|
|
|
|
" print (\"Duration reached {} mins using {} records\".format(total_duration_mins, len(sub_records)))\n",
|
|
|
|
|
" break\n",
|
|
|
|
|
" sub_records.append(r)\n",
|
|
|
|
|
" fname_extension = \"dur_{}_mins\".format( int(round(total_duration_mins )))\n",
|
|
|
|
|
" elif num_samples is None and total_duration_mins is None:\n",
|
|
|
|
|
" sub_records = all_records\n",
|
|
|
|
|
" fname_extension = \"ns_all\"\n",
|
|
|
|
|
" else:\n",
|
|
|
|
|
" raise NotImplementedError()\n",
|
|
|
|
|
" print (\"num sub records\", len(sub_records))\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" if not os.path.exists(filelist_dir):\n",
|
|
|
|
|
" os.makedirs(filelist_dir)\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" target_fp = os.path.join(filelist_dir, \"{}_mainifest_{}_{}_local.json\".format(speaker_id, split, fname_extension))\n",
|
|
|
|
|
" with open(target_fp, 'w') as f:\n",
|
|
|
|
|
" for record in json.loads(json.dumps(sub_records)):\n",
|
|
|
|
|
" record['audio_filepath'] = record['audio_filepath'][record['audio_filepath'].find(\"wav/\"):]\n",
|
|
|
|
|
" record['audio_filepath'] = os.path.join(data_dir, record['audio_filepath']) \n",
|
|
|
|
|
" f.write(json.dumps(record) + \"\\n\")\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"def mix_file_list(speaker_id, clean_other, split, num_samples, total_duration_mins, original_speaker_id, original_clean_other, n_orig=None, seed=42):\n",
|
|
|
|
|
" \"\"\"\n",
|
|
|
|
|
" Creates a mixed dataset of new and original speaker. num_samples or total_duration_mins specifies the amount \n",
|
|
|
|
|
" of new speaker data to be used and n_orig specifies the number of original speaker samples. Creates a balanced \n",
|
|
|
|
|
" dataset with alternating new and old speaker samples. Saves the filelist in the filelist_dir. \n",
|
|
|
|
|
" \n",
|
|
|
|
|
" Arguments:\n",
|
|
|
|
|
" speaker_id -- speaker id of the new HiFiTTS speaker\n",
|
|
|
|
|
" clean_other -- \"clean\" or \"other\" depending on type of data of new HiFiTTS speaker\n",
|
|
|
|
|
" split -- \"train\" or \"dev\"\n",
|
|
|
|
|
" num_samples -- Number samples of new speaker (set None if specifying total_duration_mins)\n",
|
|
|
|
|
" total_duration_mins -- Total duration of new speaker's data (set None if specifying num_samples)\n",
|
|
|
|
|
" original_speaker_id -- speaker id of the original HiFiTTS speaker (on which FastPitch was trained)\n",
|
|
|
|
|
" original_clean_other -- \"clean\" or \"other\" depending on type of data of new HiFiTTS speaker\n",
|
|
|
|
|
" n_orig -- Number of samples of old speaker to be mixed with new speaker\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" \"\"\"\n",
|
|
|
|
|
" file_list_name = \"{}_manifest_{}_{}.json\".format(speaker_id, clean_other, split)\n",
|
|
|
|
|
" with open(os.path.join(data_dir, file_list_name), 'r') as f:\n",
|
|
|
|
|
" all_records = [json.loads(l) for l in f.read().split(\"\\n\") if len(l) > 0]\n",
|
|
|
|
|
" for r in all_records:\n",
|
|
|
|
|
" r['audio_filepath'] = r['audio_filepath'][r['audio_filepath'].find(\"wav/\"):]\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" original_file_list_name = \"{}_manifest_{}_{}.json\".format(original_speaker_id, original_clean_other, \"train\")\n",
|
|
|
|
|
" with open(os.path.join(data_dir, original_file_list_name), 'r') as f:\n",
|
|
|
|
|
" original_all_records = [json.loads(l) for l in f.read().split(\"\\n\") if len(l) > 0]\n",
|
|
|
|
|
" for r in original_all_records:\n",
|
|
|
|
|
" r['audio_filepath'] = r['audio_filepath'][r['audio_filepath'].find(\"wav/\"):]\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" random.seed(seed)\n",
|
|
|
|
|
" if n_orig is not None:\n",
|
|
|
|
|
" random.shuffle(original_all_records)\n",
|
|
|
|
|
" original_all_records = original_all_records[:n_orig]\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" random.seed(seed)\n",
|
|
|
|
|
" random.shuffle(all_records)\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" if num_samples is not None and total_duration_mins is None:\n",
|
|
|
|
|
" sub_records = all_records[:num_samples]\n",
|
|
|
|
|
" fname_extension = \"ns_{}\".format(num_samples)\n",
|
|
|
|
|
" elif num_samples is None and total_duration_mins is not None:\n",
|
|
|
|
|
" sub_record_duration = 0.0\n",
|
|
|
|
|
" sub_records = []\n",
|
|
|
|
|
" for r in all_records:\n",
|
|
|
|
|
" sub_record_duration += r['duration']\n",
|
|
|
|
|
" if sub_record_duration > total_duration_mins * 60.0:\n",
|
|
|
|
|
" print (\"Duration reached {} mins using {} records\".format(total_duration_mins, len(sub_records)))\n",
|
|
|
|
|
" break\n",
|
|
|
|
|
" sub_records.append(r)\n",
|
|
|
|
|
" fname_extension = \"dur_{}_mins\".format( int(round(total_duration_mins)))\n",
|
|
|
|
|
" elif num_samples is None and total_duration_mins is None:\n",
|
|
|
|
|
" sub_records = all_records\n",
|
|
|
|
|
" fname_extension = \"ns_all\"\n",
|
|
|
|
|
" else:\n",
|
|
|
|
|
" raise NotImplementedError()\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" print(len(original_all_records))\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" if not os.path.exists(filelist_dir):\n",
|
|
|
|
|
" os.makedirs(filelist_dir)\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" target_fp = os.path.join(filelist_dir, \"{}_mainifest_{}_{}_local_mix_{}.json\".format(speaker_id, split, fname_extension, original_speaker_id))\n",
|
|
|
|
|
" with open(target_fp, 'w') as f:\n",
|
|
|
|
|
" for ridx, original_record in enumerate(original_all_records):\n",
|
|
|
|
|
" original_record['audio_filepath'] = original_record['audio_filepath'][original_record['audio_filepath'].find(\"wav/\"):]\n",
|
|
|
|
|
" original_record['audio_filepath'] = os.path.join(data_dir, original_record['audio_filepath']) \n",
|
|
|
|
|
" \n",
|
|
|
|
|
" new_speaker_record = sub_records[ridx % len(sub_records)]\n",
|
|
|
|
|
" new_speaker_record['audio_filepath'] = new_speaker_record['audio_filepath'][new_speaker_record['audio_filepath'].find(\"wav/\"):]\n",
|
|
|
|
|
" new_speaker_record['audio_filepath'] = os.path.join(data_dir, new_speaker_record['audio_filepath']) \n",
|
|
|
|
|
" \n",
|
|
|
|
|
" new_speaker_record['speaker'] = 1\n",
|
|
|
|
|
" original_record['speaker'] = 0\n",
|
|
|
|
|
" f.write(json.dumps(original_record) + \"\\n\")\n",
|
|
|
|
|
" f.write(json.dumps(new_speaker_record) + \"\\n\")"
|
|
|
|
|
"Let's grab those files."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "5c635928",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "3zg2H-32dNBU"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"make_sub_file_list(92, \"clean\", \"train\", None, 5)\n",
|
|
|
|
|
"mix_file_list(92, \"clean\", \"train\", None, 5, 8051, \"other\", n_orig=5000)\n",
|
|
|
|
|
"make_sub_file_list(92, \"clean\", \"dev\", None, None)"
|
|
|
|
|
"!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch2_finetune.py\n",
|
|
|
|
|
"!mkdir conf && cd conf && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/fastpitch_align.yaml && cd .."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "ef75d1d5",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "12b5511c"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## Finetuning the model on filelists"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "6c8b13b8",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"To finetune the FastPitch model on the above created filelists, we use `examples/tts/fastpitch2_finetune.py` script to train the models with the `fastpitch_align_44100.yaml` configuration. This configuration file has been defined for 44100Hz HiFiGan dataset audio. The function `generate_training_command` in this notebook can be used to generate a training command for a given speaker and finetuning technique."
|
|
|
|
|
"We can now train our model with the following command:\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"NOTE: This will take about 50 minutes on colab's K80 GPUs.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"`python fastpitch2_finetune.py --config-name=fastpitch_align.yaml train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json +init_from_nemo_model=./tts_en_fastpitch_align.nemo +trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25 prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512 model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam`"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "a57fcfec",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "reY1LV4lwWoq"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"# pitch statistics of the new speakers\n",
|
|
|
|
|
"# These can be computed from the pitch contours extracted using librosa yin\n",
|
|
|
|
|
"# Finetuning can still work without these, but we get better results using speaker specific pitch stats\n",
|
|
|
|
|
"pitch_stats = {\n",
|
|
|
|
|
" 92 : {\n",
|
|
|
|
|
" 'mean' : 214.5, # female speaker\n",
|
|
|
|
|
" 'std' : 30.9,\n",
|
|
|
|
|
" 'fmin' : 80,\n",
|
|
|
|
|
" 'fmax' : 512\n",
|
|
|
|
|
" },\n",
|
|
|
|
|
" 6097 : {\n",
|
|
|
|
|
" 'mean' : 121.9, # male speaker\n",
|
|
|
|
|
" 'std' : 23.1,\n",
|
|
|
|
|
" 'fmin' : 30,\n",
|
|
|
|
|
" 'fmax' : 512\n",
|
|
|
|
|
" }\n",
|
|
|
|
|
"}\n",
|
|
|
|
|
"!python fastpitch2_finetune.py --config-name=fastpitch_align.yaml train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json +init_from_nemo_model=./tts_en_fastpitch_align.nemo +trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25 prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512 model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "j2svKvd1eMhf"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"Let's take a closer look at the training command:\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `python fastpitch2_finetune.py --config-name=fastpitch_align.yaml`\n",
|
|
|
|
|
" * --config-name tells the script what config to use.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"def generate_training_command(new_speaker_id, duration_mins, mixing_enabled, original_speaker_id, ckpt, use_new_pitch_stats=False):\n",
|
|
|
|
|
" \"\"\"\n",
|
|
|
|
|
" Generates the training command string to be run from the NeMo/ directory. Assumes we have created the finetuning filelists\n",
|
|
|
|
|
" using the instructions given above.\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" Arguments:\n",
|
|
|
|
|
" new_speaker_id -- speaker id of the new HiFiTTS speaker\n",
|
|
|
|
|
" duration_mins -- total minutes of the new speaker data (same as that used for creating the filelists)\n",
|
|
|
|
|
" mixing_enabled -- True or False depending on whether we want to mix the original speaker data or not\n",
|
|
|
|
|
" original_speaker_id -- speaker id of the original HiFiTTS speaker\n",
|
|
|
|
|
" use_new_pitch_stats -- whether to use pitch_stats dictionary given above or not\n",
|
|
|
|
|
" ckpt: Path to pretrained FastPitch checkpoint\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" Returns:\n",
|
|
|
|
|
" Training command string\n",
|
|
|
|
|
" \"\"\"\n",
|
|
|
|
|
" def _find_epochs(duration_mins, mixing_enabled, n_orig=None):\n",
|
|
|
|
|
" # estimated num of epochs \n",
|
|
|
|
|
" if duration_mins == 5:\n",
|
|
|
|
|
" epochs = 1000\n",
|
|
|
|
|
" elif duration_mins == 30:\n",
|
|
|
|
|
" epochs = 300\n",
|
|
|
|
|
" elif duration_mins == 60:\n",
|
|
|
|
|
" epochs = 150\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" if mixing_enabled:\n",
|
|
|
|
|
" if duration_mins == 5:\n",
|
|
|
|
|
" epochs = epochs/50 + 1\n",
|
|
|
|
|
" elif duration_mins == 30:\n",
|
|
|
|
|
" epochs = epochs/10 + 1\n",
|
|
|
|
|
" elif duration_mins == 60:\n",
|
|
|
|
|
" epochs = epochs/5 + 1\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" return int(epochs)\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" \n",
|
|
|
|
|
" if ckpt.endswith(\".nemo\"):\n",
|
|
|
|
|
" ckpt_arg_name = \"init_from_nemo_model\"\n",
|
|
|
|
|
" else:\n",
|
|
|
|
|
" ckpt_arg_name = \"init_from_ptl_ckpt\"\n",
|
|
|
|
|
" if not mixing_enabled:\n",
|
|
|
|
|
" train_dataset = \"{}_mainifest_train_dur_{}_mins_local.json\".format(new_speaker_id, duration_mins)\n",
|
|
|
|
|
" val_dataset = \"{}_mainifest_dev_ns_all_local.json\".format(new_speaker_id)\n",
|
|
|
|
|
" prior_folder = os.path.join(data_dir, \"Priors{}\".format(new_speaker_id))\n",
|
|
|
|
|
" exp_dir = \"{}_to_{}_no_mixing_{}_mins\".format(original_speaker_id, new_speaker_id, duration_mins)\n",
|
|
|
|
|
" n_speakers = 1\n",
|
|
|
|
|
" else:\n",
|
|
|
|
|
" train_dataset = \"{}_mainifest_train_dur_{}_mins_local_mix_{}.json\".format(new_speaker_id, duration_mins, original_speaker_id)\n",
|
|
|
|
|
" val_dataset = \"{}_mainifest_dev_ns_all_local.json\".format(new_speaker_id)\n",
|
|
|
|
|
" prior_folder = os.path.join(data_dir, \"Priors_{}_mix_{}\".format(new_speaker_id, original_speaker_id))\n",
|
|
|
|
|
" exp_dir = \"{}_to_{}_mixing_{}_mins\".format(original_speaker_id, new_speaker_id, duration_mins)\n",
|
|
|
|
|
" n_speakers = 2\n",
|
|
|
|
|
" train_dataset = os.path.join(filelist_dir, train_dataset)\n",
|
|
|
|
|
" val_dataset = os.path.join(filelist_dir, val_dataset)\n",
|
|
|
|
|
" exp_dir = os.path.join(exp_base_dir, exp_dir)\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" max_epochs = _find_epochs(duration_mins, mixing_enabled, n_orig=None)\n",
|
|
|
|
|
" config_name = \"fastpitch_align_44100.yaml\"\n",
|
|
|
|
|
" \n",
|
|
|
|
|
" training_command = \"python examples/tts/fastpitch2_finetune.py --config-name={} train_dataset={} validation_datasets={} +{}={} trainer.max_epochs={} trainer.check_val_every_n_epoch=1 prior_folder={} model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir={} model.n_speakers={}\".format(\n",
|
|
|
|
|
" config_name, train_dataset, val_dataset, ckpt_arg_name, ckpt, max_epochs, prior_folder, exp_dir, n_speakers)\n",
|
|
|
|
|
" if use_new_pitch_stats:\n",
|
|
|
|
|
" training_command += \" model.pitch_avg={} model.pitch_std={} model.pitch_fmin={} model.pitch_fmax={}\".format(\n",
|
|
|
|
|
" pitch_stats[new_speaker_id]['mean'], \n",
|
|
|
|
|
" pitch_stats[new_speaker_id]['std'],\n",
|
|
|
|
|
" pitch_stats[new_speaker_id]['fmin'],\n",
|
|
|
|
|
" pitch_stats[new_speaker_id]['fmax']\n",
|
|
|
|
|
" )\n",
|
|
|
|
|
" training_command += \" model.optim.lr=2e-4 ~model.optim.sched\"\n",
|
|
|
|
|
" return training_command\n",
|
|
|
|
|
" "
|
|
|
|
|
"* `train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json`\n",
|
|
|
|
|
" * We tell the model what manifest files we can to train and eval on.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `+init_from_nemo_model=./tts_en_fastpitch_align.nemo`\n",
|
|
|
|
|
" * We tell the script what checkpoint to finetune from.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `+trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25`\n",
|
|
|
|
|
" * For this experiment, we need to tell the script to train for 1000 training steps/iterations. We need to remove max_epochs using `~trainer.max_epochs`.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24`\n",
|
|
|
|
|
" * Some dataset parameters. The dataset does some online processing and stores the processing steps to the `prior_folder`.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins`\n",
|
|
|
|
|
" * Where we want to save our log files, tensorboard file, checkpoints, and more\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `model.n_speakers=1`\n",
|
|
|
|
|
" * The number of speakers in the data. There is only 1 for now, but we will revisit this parameter later in the notebook\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512`\n",
|
|
|
|
|
" * For the new speaker, we need to define new pitch hyperparameters for better audio quality.\n",
|
|
|
|
|
" * These parameters work for speaker 6097 from the HiFiTTS dataset\n",
|
|
|
|
|
" * For speaker 92, we suggest `model.pitch_avg=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512`\n",
|
|
|
|
|
" * fmin and fmax are hyperparameters to librosa's pyin function. We recommend tweaking these per speaker.\n",
|
|
|
|
|
" * After fmin and fmax are defined, pitch mean and std can be easily extracted\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* `model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam`\n",
|
|
|
|
|
" * For fine-tuning, we lower the learning rate\n",
|
|
|
|
|
" * We use a fixed learning rate of 2e-4\n",
|
|
|
|
|
" * We switch from the lamb optimizer to the adam optimizer"
|
|
|
|
|
]
|
|
|
|
|
},
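The pitch mean and std mentioned above can be sketched as follows. This assumes the pitch contour has already been extracted (for example with librosa's pyin using the chosen fmin and fmax) and that unvoiced frames are marked as NaN; the helper `pitch_stats` is hypothetical, and it uses the population standard deviation, which may not match how the values in this notebook were computed:

```python
import math


def pitch_stats(f0_values):
    """Return (mean, std) of the pitch over voiced frames; NaN marks unvoiced frames."""
    voiced = [f for f in f0_values if not math.isnan(f)]
    if not voiced:
        raise ValueError("no voiced frames")
    mean = sum(voiced) / len(voiced)
    # Population variance: divide by the number of voiced frames.
    var = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return mean, math.sqrt(var)
```

In practice the contours of all training utterances for the speaker would be concatenated before calling the helper, so that `model.pitch_avg` and `model.pitch_std` describe the whole dataset rather than a single clip.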
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "c3bdf1ed"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## Synthesize Samples from Finetuned Checkpoints\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"---\n",
|
|
|
|
|
"\n"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "f2b46325"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"Once we have finetuned our FastPitch model, we can synthesize the audio samples for given text using the following inference steps. We use a HiFiGAN vocoder trained on LJSpeech.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"We define some helper functions as well."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "f98c55af",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"new_speaker_id = 92\n",
|
|
|
|
|
"duration_mins = 5\n",
|
|
|
|
|
"mixing = False\n",
|
|
|
|
|
"original_speaker_id = 8051\n",
|
|
|
|
|
"ckpt_path = <PATH_TO_PRETRAINED_FASTPITCH_CHECKPOINT>\n",
|
|
|
|
|
"print(generate_training_command(new_speaker_id, duration_mins, mixing, original_speaker_id, ckpt_path, True))"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "12b5511c",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"The generated command should look something like this. We can ofcourse tweak things like epochs/learning rate if we like\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"`python examples/tts/fastpitch2_finetune.py --config-name=fastpitch_align_44100 train_dataset=filelists/92_mainifest_train_dur_5_mins_local.json validation_datasets=filelists/92_mainifest_dev_ns_all_local.json +init_from_nemo_model=PreTrainedModels/FastPitch.nemo trainer.max_epochs=1000 trainer.check_val_every_n_epoch=1 prior_folder=Hi_Fi_TTS_v_0/Priors92 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=inetuningDemo/8051_to_92_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512 model.optim.lr=2e-4 ~model.optim.sched`"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "bc30d1e2",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"^ Run the above command from the terminal from the `NeMo/` directory to start finetuning a model. "
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "c3bdf1ed",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"## Synthesize samples from finetuned checkpoints"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "f2b46325",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"source": [
|
|
|
|
|
"Once we have finetuned our FastPitch model, we can synthesize the audio samples for given text using the following inference steps. We use a HiFiGAN vocoder trained on multiple speakers, get the trained checkpoint path for our trained model and synthesize audio for a given text as follows."
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "886c91dc",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "886c91dc"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"from nemo.collections.tts.models import HifiGanModel\n",
|
|
|
|
|
"from nemo.collections.tts.models import FastPitchModel\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"hifigan_ckpt_path = <PATH_TO_PRETRAINED_HIFIGAN_CHECKPOINT>\n",
|
|
|
|
|
"vocoder = HifiGanModel.load_from_checkpoint(hifigan_ckpt_path)\n",
|
|
|
|
|
"vocoder = HifiGanModel.from_pretrained(\"tts_hifigan\")\n",
|
|
|
|
|
"vocoder.eval().cuda()"
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "0a4c986f",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "0a4c986f"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"def infer(spec_gen_model, vocoder_model, str_input, speaker = None):\n",
|
|
|
|
@ -456,8 +366,9 @@
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"id": "0153bd5a",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "0153bd5a"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"Specify the speaker id, duration mins and mixing variable to find the relevant checkpoint from the exp_base_dir and compare the synthesized audio with validation samples of the new speaker."
|
|
|
|
|
]
|
|
|
|
@ -465,17 +376,17 @@
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "8901f88b",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "8901f88b"
|
|
|
|
|
},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": [
|
|
|
|
|
"new_speaker_id = 92\n",
|
|
|
|
|
"new_speaker_id = 6097\n",
|
|
|
|
|
"duration_mins = 5\n",
|
|
|
|
|
"mixing = False\n",
|
|
|
|
|
"original_speaker_id = 8051\n",
|
|
|
|
|
"original_speaker_id = \"ljspeech\"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"_ ,last_ckpt = get_best_ckpt(exp_base_dir, new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
|
|
|
|
|
"_ ,last_ckpt = get_best_ckpt(\"./\", new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
|
|
|
|
|
"print(last_ckpt)\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)\n",
|
|
|
|
@ -486,7 +397,7 @@
|
|
|
|
|
"\n",
|
|
|
|
|
"num_val = 2\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"manifest_path = os.path.join(filelist_dir, \"{}_mainifest_dev_ns_all_local.json\".format(new_speaker_id))\n",
|
|
|
|
|
"manifest_path = os.path.join(\"./\", \"{}_manifest_dev_ns_all_local.json\".format(new_speaker_id))\n",
|
|
|
|
|
"val_records = []\n",
|
|
|
|
|
"with open(manifest_path, \"r\") as f:\n",
|
|
|
|
|
" for i, line in enumerate(f):\n",
|
|
|
|
@ -496,10 +407,10 @@
|
|
|
|
|
" \n",
|
|
|
|
|
"for val_record in val_records:\n",
|
|
|
|
|
" print (\"Real validation audio\")\n",
|
|
|
|
|
" ipd.display(ipd.Audio(val_record['audio_filepath'], rate=44100))\n",
|
|
|
|
|
" ipd.display(ipd.Audio(val_record['audio_filepath'], rate=22050))\n",
|
|
|
|
|
" print (\"SYNTHESIZED FOR -- Speaker: {} | Dataset size: {} mins | Mixing:{} | Text: {}\".format(new_speaker_id, duration_mins, mixing, val_record['text']))\n",
|
|
|
|
|
" spec, audio = infer(spec_model, vocoder, val_record['text'], speaker = _speaker)\n",
|
|
|
|
|
" ipd.display(ipd.Audio(audio, rate=44100))\n",
|
|
|
|
|
" ipd.display(ipd.Audio(audio, rate=22050))\n",
|
|
|
|
|
" %matplotlib inline\n",
|
|
|
|
|
" #if spec is not None:\n",
|
|
|
|
|
" imshow(spec, origin=\"lower\", aspect = \"auto\")\n",
|
|
|
|
@ -507,32 +418,84 @@
|
|
|
|
|
]
|
|
|
|
|
},
|
|
|
|
|
{
|
|
|
|
|
"cell_type": "code",
|
|
|
|
|
"execution_count": null,
|
|
|
|
|
"id": "18ce524f",
|
|
|
|
|
"metadata": {},
|
|
|
|
|
"outputs": [],
|
|
|
|
|
"source": []
|
|
|
|
|
"cell_type": "markdown",
|
|
|
|
|
"metadata": {
|
|
|
|
|
"id": "ge2s7s9-w3py"
|
|
|
|
|
},
|
|
|
|
|
"source": [
|
|
|
|
|
"## Improving Speech Quality\n",
|
|
|
|
|
"___\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"We see that from fine-tuning FastPitch, we were able to generate audio in a male voice but the audio quality is not as good as we expect. We recommend two steps to improve audio quality:\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* Finetuning HiFiGAN\n",
|
|
|
|
|
"* Adding more data\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"Both of these steps are outside the scope of the notebook due to the limited compute available on colab.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### Finetuning HiFiGAN\n",
|
|
|
|
|
"From the synthesized samples, there might be audible audio crackling. To fix this, we need to finetune HiFiGAN on the new speaker's data. HiFiGAN shows improvement using synthesized mel spectrograms, so the first step is to generate mel spectrograms with our finetuned FastPitch model.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"```python\n",
|
|
|
|
|
"# Get records from the training manifest\n",
|
|
|
|
|
"manifest_path = \"./6097_manifest_train_dur_5_mins_local.json\"\n",
|
|
|
|
|
"records = []\n",
|
|
|
|
|
"with open(manifest_path, \"r\") as f:\n",
|
|
|
|
|
" for i, line in enumerate(f):\n",
|
|
|
|
|
" records.append(json.loads(line))\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# Generate a spectrogram for each item\n",
|
|
|
|
|
"for i, r in enumerate(records):\n",
|
|
|
|
|
" with torch.no_grad():\n",
|
|
|
|
|
" parsed = parser_model.parse(r['text'])\n",
|
|
|
|
|
" spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed)\n",
|
|
|
|
|
" if isinstance(spectrogram, torch.Tensor):\n",
|
|
|
|
|
" spectrogram = spectrogram.to('cpu').numpy()\n",
|
|
|
|
|
" if len(spectrogram.shape) == 3:\n",
|
|
|
|
|
" spectrogram = spectrogram[0]\n",
|
|
|
|
|
" np.save(f\"mel_{i}\", spectrogram)\n",
|
|
|
|
|
" r[\"mel_filepath\"] = f\"mel_{i}.npy\"\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# Save to a new json\n",
|
|
|
|
|
"with open(\"hifigan_train_ft.json\", \"w\") as f:\n",
|
|
|
|
|
" for r in records:\n",
|
|
|
|
|
" f.write(json.dumps(r) + '\\n')\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"# Please do the same for the validation json. Code is omitted.\n",
|
|
|
|
|
"```\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"We can then finetune hifigan similarly to fastpitch using NeMo's [hifigan_finetune.py](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/hifigan_finetune.py) and [hifigan.yaml](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/hifigan/hifigan.yaml):\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"`python examples/tts/hifigan_finetune.py --config_name=hifigan.yaml model.train_ds.dataloader_params.batch_size=32 model.max_steps=1000 ~model.sched model.optim.lr=0.0001 train_dataset=./hifigan_train_ft.json validation_datasets=./hifigan_val_ft.json exp_manager.exp_dir=hifigan_ft +init_from_nemo_model=tts_hifigan.nemo trainer.check_val_every_n_epoch=10`\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"### Improving TTS by Adding More Data\n",
|
|
|
|
|
"We can add more data in two ways. they can be combined for the best effect:\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* Add more training data from the new speaker\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"The entire notebook can be repeated from the top after a new .json is defined for the additional data. Modify your finetuning commands to point to the new json. Be sure to increase the number of steps as more data is added to both the fastpitch and hifigan finetuning. We recommend 1000 steps per minute of audio for fastpitch and 500 steps per minute of audio for hifigan.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"* Mix new speaker data with old speaker data\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"We recommend to train fastpitch using both old speaker data (LJSpeech in this notebook) and the new speaker data. In this case, please modify the .json when finetuning fastpitch to include speaker information:\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"`\n",
|
|
|
|
|
"{\"audio_filepath\": \"new_speaker.wav\", \"text\": \"sample\", \"duration\": 2.6, \"speaker\": 1}\n",
|
|
|
|
|
"{\"audio_filepath\": \"old_speaker.wav\", \"text\": \"LJSpeech sample\", \"duration\": 2.6, \"speaker\": 0}\n",
|
|
|
|
|
"`\n",
|
|
|
|
|
"5 hours of data from the old speaker should be sufficient. Since we should have less data from the new speaker, we need to ensure that the model sees a similar amount of new data and old data. For each sample from the old speaker, please add a sample from the new speaker in the .json. The samples from the new speaker will be repeated.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"Modify the fastpitch training command to point to the new training and validation .jsons, and update `model.n_speakers=1` to `model.n_speakers=2`. Ensure the pitch statistics correspond to the new speaker.\n",
|
|
|
|
|
"\n",
|
|
|
|
|
"For HiFiGAN finetuning, the training should be done on the new speaker data."
|
|
|
|
|
]
|
|
|
|
|
}
|
|
|
|
|
],
|
|
|
|
|
"metadata": {
|
|
|
|
|
"kernelspec": {
|
|
|
|
|
"display_name": "Python 3",
|
|
|
|
|
"language": "python",
|
|
|
|
|
"name": "python3"
|
|
|
|
|
},
|
|
|
|
|
"language_info": {
|
|
|
|
|
"codemirror_mode": {
|
|
|
|
|
"name": "ipython",
|
|
|
|
|
"version": 3
|
|
|
|
|
},
|
|
|
|
|
"file_extension": ".py",
|
|
|
|
|
"mimetype": "text/x-python",
|
|
|
|
|
"name": "python",
|
|
|
|
|
"nbconvert_exporter": "python",
|
|
|
|
|
"pygments_lexer": "ipython3",
|
|
|
|
|
"version": "3.8.10"
|
|
|
|
|
}
|
|
|
|
|
"name": "python"
|
|
|
|
|
},
|
|
|
|
|
"orig_nbformat": 4
|
|
|
|
|
},
|
|
|
|
|
"nbformat": 4,
|
|
|
|
|
"nbformat_minor": 5
|
|
|
|
|