Merge branch 'r1.0.0rc1' into aug_tut_bug

This commit is contained in:
Jagadeesh Balam 2021-03-17 13:29:11 -07:00 committed by GitHub
commit 754ab33660
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
10 changed files with 360 additions and 6 deletions

View file

@ -950,3 +950,10 @@ url={https://openreview.net/forum?id=Bkg6RiCqY7},
booktitle={ICVPR},
year={2018}
}
@article{koluguri2020speakernet,
title={SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification},
author={Koluguri, Nithin Rao and Li, Jason and Lavrukhin, Vitaly and Ginsburg, Boris},
journal={arXiv preprint arXiv:2010.12653},
year={2020}
}

View file

@ -0,0 +1,114 @@
NeMo ASR Configuration Files
============================
This page covers NeMo configuration file setup that is specific to speaker recognition models.
For general information about how to set up and run experiments that is common to all NeMo models (e.g.
experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../introduction/core` page.
The model section of NeMo ASR configuration files will generally require information about the dataset(s) being
used, the preprocessor for audio files, parameters for any augmentation being performed, as well as the
model architecture specification.
The sections on this page cover each of these in more detail.
Example configuration files for all of the speaker recognition scripts can be found in the
config directory of the examples, ``<NeMo_root>/examples/speaker_recognition/conf``.
Dataset Configuration
---------------------
Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and
``test_ds`` sections of your configuration file, respectively.
Depending on the task, you may have arguments specifying the sample rate of your audio files, the maximum audio length to consider for each file, whether or not to shuffle the dataset, and so on.
You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command line
at runtime (a programmatic override sketch follows the example below).
Any initialization parameters that are accepted for the Dataset class used in your experiment
can be set in the config file.
An example SpeakerNet train and validation configuration could look like:
.. code-block:: yaml

    model:
      train_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # finds labels based on manifest file
        batch_size: 32
        trim_silence: False
        time_length: 8
        shuffle: True
      validation_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # Keep None, to match with labels extracted during training
        batch_size: 32
        shuffle: False # No need to shuffle the validation data
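If you leave fields blank in the YAML, you can also fill them in programmatically before training. Below is a minimal sketch using OmegaConf (the configuration library NeMo configs are built on); the config filename and manifest paths are placeholders for illustration:

.. code-block:: python

    from omegaconf import OmegaConf

    # Load the experiment config (hypothetical filename for illustration).
    cfg = OmegaConf.load("speakernet_config.yaml")

    # Fill in the fields that were left as ??? in the YAML above.
    cfg.model.train_ds.manifest_filepath = "/data/train.json"
    cfg.model.validation_ds.manifest_filepath = "/data/dev.json"

    # Inspect the resolved dataset section.
    print(OmegaConf.to_yaml(cfg.model.train_ds))

This mirrors what a Hydra-style command-line override of ``model.train_ds.manifest_filepath`` at launch time would accomplish.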
Preprocessor Configuration
--------------------------
The preprocessor computes MFCC or mel spectrogram features that are given as inputs to the model.
For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__.
Augmentation Configurations
---------------------------
For SpeakerNet training, we use on-the-fly augmentation with MUSAN noise and RIR impulses via the ``noise`` augmentor section.
The following example sets up MUSAN augmentation with audio files taken from the manifest path and the
minimum and maximum SNR specified with ``min_snr_db`` and ``max_snr_db`` respectively. This section can be added to the
``train_ds`` part of the model config:
.. code-block:: yaml

    model:
      ...
      train_ds:
        ...
        augmentor:
          noise:
            manifest_path: /path/to/musan/manifest_file
            prob: 0.2 # probability to augment the incoming batch audio with augmentor data
            min_snr_db: 5
            max_snr_db: 15
See the :class:`nemo.collections.asr.parts.perturb.AudioAugmentor` API section for more details.
Model Architecture Configurations
---------------------------------
Each configuration file should describe the model architecture being used for the experiment.
Models in the NeMo ASR collection need an ``encoder`` section and a ``decoder`` section, with the ``_target_`` field
specifying the module to use for each.
The following sections go into more detail about the specific configurations of each model architecture.
For more information about the SpeakerNet encoder models, see the :doc:`Models <./models>` page and the `Jasper and QuartzNet <../configs.html#jasper-and-quartznet>`__ section.
Decoder Configurations
------------------------
After features have been computed by the SpeakerNet encoder, we pass them to the decoder to compute embeddings and then log probabilities
for training the model.
.. code-block:: yaml

    model:
      ...
      decoder:
        _target_: nemo.collections.asr.modules.SpeakerDecoder
        feat_in: *enc_final
        num_classes: 7205 # total number of classes in the training manifest file
        pool_mode: xvector # mean and variance based statistics pooling
        emb_sizes: 256 # sizes of intermediate embedding layers; can be comma separated for additional layers, e.g. 512,512
        angular: true # if true, the loss is changed to angular softmax and uses scale and margin from the loss section; otherwise train with cross-entropy loss
      loss:
        scale: 30
        margin: 0.2
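Because the decoder is specified through a ``_target_`` class path, the resolved config can be instantiated directly with Hydra. The following is a minimal sketch rather than the exact training flow: the keyword values simply mirror the example above, and ``feat_in`` is filled with an assumed encoder output size because ``*enc_final`` is a YAML anchor resolved elsewhere in the full config:

.. code-block:: python

    from hydra.utils import instantiate
    from omegaconf import OmegaConf

    # Hand-written decoder config mirroring the YAML example above.
    decoder_cfg = OmegaConf.create(
        {
            "_target_": "nemo.collections.asr.modules.SpeakerDecoder",
            "feat_in": 1500,  # assumed encoder output feature size; set this to match your encoder
            "num_classes": 7205,
            "pool_mode": "xvector",
            "emb_sizes": 256,
            "angular": True,
        }
    )

    # Builds a SpeakerDecoder instance from the config, the same mechanism NeMo uses for ``_target_`` entries.
    decoder = instantiate(decoder_cfg)
    print(decoder)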

View file

@ -0,0 +1,4 @@
Model Name,Model Base Class,Model Card
speakerverification_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet"
speakerrecognition_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet"
speakerdiarization_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerdiarization_speakernet"

View file

@ -0,0 +1,49 @@
Datasets
========
.. _HI-MIA:
HI-MIA
--------
Run the script to download and process the ``hi-mia`` dataset in order to generate files in the format supported by ``nemo_asr``. Set the data folder of
hi-mia using ``--data_root``. These scripts are located in ``<nemo_root>/scripts``:
.. code-block:: bash
python get_hi-mia_data.py --data_root=<data directory>
After download and conversion, your `data` folder should contain directories with the following files:
* `data/<set>/train.json`
* `data/<set>/dev.json`
* `data/<set>/<set>_all.json`
* `data/<set>/utt2spk`
All Other Datasets
------------------
These methods can be applied to any dataset to get similar training manifest files.
First, we prepare scp file(s) containing absolute paths to all the wav files required for each of the train, dev, and test sets. These can easily be prepared using the ``find`` bash command as follows:
.. code-block:: bash
find {data_dir}/{train_dir} -iname "*.wav" > data/train_all.scp
head -n 3 data/train_all.scp
Since we created the scp file for the train set, we use `scp_to_manifest.py` to convert this scp file to a manifest file, and then optionally split the files into train and dev sets (for evaluating the model while training) by using the --split flag.
The --split option is not needed for the test folder. Accordingly, please provide the id number, which is the ``/``-separated field of the file path to be considered as the speaker label (a minimal conversion sketch follows the manifest format below).
After the download and conversion, your data folder should contain directories with manifest files as follows:
* `data/<path>/train.json`
* `data/<path>/dev.json`
* `data/<path>/train_all.json`
Each line in the manifest file describes a training sample: ``audio_filepath`` contains the path to the wav file, ``duration`` is its duration in seconds, and ``label`` is the speaker class label:
.. code-block:: json
{"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}

Binary file not shown.


View file

@ -0,0 +1,63 @@
Speaker Recognition (SR)
========================
Speaker Recognition (SR) is a broad research area which solves two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?).
We focus on far-field, text-independent speaker recognition, where the identity of the speaker is determined by how the speech is spoken, not necessarily by what is being said.
Typically such SR systems operate on unconstrained speech utterances, which are converted into vectors of fixed length, called speaker embeddings.
Speaker embeddings can also be used in automatic speech recognition (ASR) and speech synthesis.
As the goal of most speaker recognition systems is to get good speaker-level embeddings that help distinguish one speaker from another, we first train these embeddings in an end-to-end manner, optimizing the QuartzNet-based encoder model with cross-entropy loss.
We modify the decoder to get these fixed-size embeddings irrespective of the length of the input audio, and employ a mean and variance based statistics pooling method to obtain these embeddings.
In speaker identification, we typically train on a larger training set with cross-entropy loss and later fine-tune on your preferred set of labels, where one wants to classify only a known set of speakers.
In speaker verification, we train with angular softmax loss and compare embeddings extracted from one audio file coming from a single speaker with
embeddings extracted from another file of the same or a different speaker by employing backend scoring techniques such as cosine similarity.
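As a concrete illustration of the verification backend mentioned above, a minimal cosine-similarity scoring sketch between two speaker embeddings could look like this; the embeddings and decision threshold below are placeholders, and in practice the embeddings come from a trained SpeakerNet model:

.. code-block:: python

    import numpy as np

    def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
        """Cosine similarity between two speaker embeddings (higher means more likely the same speaker)."""
        return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2) + 1e-8))

    # Placeholder embeddings; in practice these are extracted from two audio files.
    emb_a = np.random.randn(512)
    emb_b = np.random.randn(512)

    THRESHOLD = 0.7  # assumed decision threshold, tuned on a development set
    same_speaker = cosine_score(emb_a, emb_b) >= THRESHOLD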
--------------------------------
``Quick Start``
Write your audio files to a ``manifest.json`` file with lines in the following format:
.. code-block:: json
{"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": "duration of file in sec", "label": "speaker_id"}
This Python call will download the best pretrained model from NGC and write an embeddings pickle file to the current working directory:
.. code-block:: bash
python examples/speaker_recognition/extract_speaker_embeddings.py --manifest=manifest.json
--------------------------------
.. toctree::
:maxdepth: 8
models
configs
datasets
results
Resource and Documentation Guide
--------------------------------
Hands-on speaker recognition tutorial notebooks can be found under
`the speaker recognition tutorials folder <https://github.com/NVIDIA/NeMo/tree/r1.0.0rc1/tutorials/speaker_recognition/>`_. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab.
If you are looking for information about a particular SpeakerNet model, or would like to find out more about the model
architectures available in the ``nemo_asr`` collection, check out the :doc:`Models <./models>` page.
Documentation on dataset preprocessing can be found on the :doc:`Datasets <./datasets>` page.
NeMo includes preprocessing and other scripts for speaker recognition in the ``<nemo_root>/scripts/speaker_recognition/`` folder, and that page contains instructions on running
those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.
Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list
of the checkpoints available on NGC are located on the :doc:`Checkpoints <./results>` page.
Documentation for configuration files specific to the ``nemo_asr`` models can be found on the
:doc:`Configuration Files <./configs>` page.
For a clear step-by-step walkthrough, we advise you to refer to the tutorials found in the `speaker recognition tutorials folder <https://github.com/NVIDIA/NeMo/tree/r1.0.0rc1/tutorials/speaker_recognition/>`_.

View file

@ -0,0 +1,51 @@
Models
======
Examples of config files for the following QuartzNet-based model can be found in the ``<NeMo_git_root>/examples/speaker_recognition/conf`` directory.
For more information about the config files and how they should be structured, see the :doc:`./configs` page.
Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found on the :doc:`./results` page.
You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets.
The Checkpoints page also contains benchmark results for the available speaker recognition models.
SpeakerNet
-----------
The model is based on the QuartzNet ASR architecture :cite:`asr-models-koluguri2020speakernet`,
comprising an encoder and a decoder. We use the encoder of the QuartzNet model as a top-level feature extractor and feed its output to the statistics pooling layer, where
we compute the mean and variance across channel dimensions to capture the time-independent utterance-level speaker features.
The QuartzNet encoder used for speaker embeddings, shown in the figure below, has the following structure: a QuartzNet BxR
model has B blocks, each with R sub-blocks. Each sub-block applies the following operations: a 1D convolution, batch norm, ReLU, and dropout. All sub-blocks in a block have the same number of output channels. These blocks are connected with residual connections. We use QuartzNet with 3 blocks, 2 sub-blocks, and 512 channels, as the Encoder for Speaker Embeddings. All conv layers have stride 1 and dilation 1.
.. image:: images/ICASPP_SpeakerNet.png
:align: center
:alt: speakernet model
:scale: 40%
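To make the sub-block structure concrete, here is a minimal PyTorch sketch of a single QuartzNet-style sub-block as described above (1D convolution, batch norm, ReLU, dropout); the kernel size, dropout rate, and channel counts are placeholder values, not the exact SpeakerNet hyperparameters:

.. code-block:: python

    import torch
    import torch.nn as nn

    class SubBlock(nn.Module):
        """One QuartzNet-style sub-block: 1D conv -> batch norm -> ReLU -> dropout."""

        def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, dropout: float = 0.5):
            super().__init__()
            self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                                  stride=1, dilation=1, padding=kernel_size // 2)
            self.bn = nn.BatchNorm1d(out_ch)
            self.act = nn.ReLU()
            self.drop = nn.Dropout(dropout)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x has shape [batch, channels, time]
            return self.drop(self.act(self.bn(self.conv(x))))

    # A block with R=2 sub-blocks and 512 output channels, as in the encoder described above.
    block = nn.Sequential(SubBlock(64, 512), SubBlock(512, 512))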
Top-level acoustic features, obtained from the output of
the encoder, are used to compute intermediate features that are
then passed to the decoder to obtain utterance-level speaker
embeddings. The intermediate time-independent features are
computed using a statistics pooling layer, where we compute the mean and standard deviation of features across
time channels, to get a time-independent feature representation S of size batch_size × 3000.
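A minimal PyTorch sketch of this statistics pooling step (mean and standard deviation over the time axis, concatenated into a time-independent vector) is shown below; the tensor shapes are assumptions for illustration:

.. code-block:: python

    import torch

    def stats_pool(features: torch.Tensor) -> torch.Tensor:
        """Pool [batch, channels, time] features into [batch, 2 * channels] statistics."""
        mean = features.mean(dim=-1)
        std = features.std(dim=-1)
        return torch.cat([mean, std], dim=-1)

    # Example: encoder output with 1500 channels over 300 time steps (shapes are illustrative).
    enc_out = torch.randn(4, 1500, 300)
    pooled = stats_pool(enc_out)  # -> [4, 3000], matching the S of size batch_size x 3000 above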
The intermediate features S are passed through the decoder, consisting of two layers each of output size 512 for a
linear transformation from S to the final number of classes
N for the larger (L) model, and a single linear layer of output size 256 to the final number of classes N for the medium
(M) model. We extract q-vectors after the final linear layer,
of fixed size 512 and 256 for the SpeakerNet-L and SpeakerNet-M
models respectively.
SpeakerNet models can be instantiated using the :class:`EncDecSpeakerLabelModel<nemo.collections.asr.models.EncDecSpeakerLabelModel>` class.
References
-----------
.. bibliography:: ../asr_all.bib
:style: plain
:labelprefix: ASR-MODELS
:keyprefix: asr-models-

View file

@ -0,0 +1,65 @@
Checkpoints
===========
There are two main ways to load pretrained checkpoints in NeMo:
* Using the :code:`restore_from()` method to load a local checkpoint file (`.nemo`), or
* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC.
See the following sections for instructions and examples for each.
Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning.
For resuming an unfinished training experiment, please use the experiment manager to do so by setting the
``resume_if_exists`` flag to True.
Loading Local Checkpoints
-------------------------
NeMo will automatically save checkpoints of a model you are training in a `.nemo` format.
You can also manually save your models at any point using :code:`model.save_to(<checkpoint_path>.nemo)`.
If you have a local ``.nemo`` checkpoint that you'd like to load, simply use the :code:`restore_from()` method:
.. code-block:: python
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
Where the model base class is the ASR model class of the original checkpoint, or the general `ASRModel` class.
NGC Pretrained Checkpoints
--------------------------
The SpeakerNet-ASR collection has checkpoints of several models trained on various datasets for a variety of tasks.
`Speaker_Recognition <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet>`_ and `Speaker_Verification <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet>`_ model cards on NGC contain more information about each of the checkpoints available.
The table below lists the SpeakerNet models available from NGC, and the models can be accessed via the
:code:`from_pretrained()` method of the ``EncDecSpeakerLabelModel`` class.
.. note:: While loading, remember to use EncDecSpeakerLabelModel for recognition tasks and ExtractSpeakerEmbeddingsModel while extracting embeddings.
In general, you can load any of these models with code in the following format.
.. code-block:: python
import nemo.collections.asr as nemo_asr
model = nemo_asr.models.<MODEL_CLASS_NAME>.from_pretrained(model_name="<MODEL_NAME>")
Where the model name is the value under the "Model Name" entry in the table below (a concrete example follows the table).
If you would like to programmatically list the models available for a particular base class, you can use the
:code:`list_available_models()` method.
.. code-block:: python
nemo_asr.models.<MODEL_BASE_CLASS>.list_available_models()
Speaker Recognition Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. csv-table::
:file: data/speaker_results.csv
:align: left
:widths: 30, 30, 40
:header-rows: 1
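For example, a minimal sketch that loads the verification checkpoint listed in the table above by its model name, and then lists every checkpoint available for the class:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Model name taken from the "Model Name" column of the table above.
    verification_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
        model_name="speakerverification_speakernet"
    )

    # List every checkpoint available for this base class.
    print(nemo_asr.models.EncDecSpeakerLabelModel.list_available_models())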

View file

@ -17,6 +17,7 @@ NVIDIA NeMo Developer Guide
:name: Automatic Speech Recognition
asr/intro
asr/speaker_recognition/intro
.. toctree::
:maxdepth: 2
@ -52,4 +53,4 @@ NVIDIA NeMo Developer Guide
:name: API
api-docs/nemo

View file

@ -310,7 +310,7 @@
"metadata": {},
"outputs": [],
"source": [
"data_layer = AudioDataLayer(sample_rate=cfg.preprocessor.params.sample_rate)\n",
"data_layer = AudioDataLayer(sample_rate=cfg.preprocessor.sample_rate)\n",
"data_loader = DataLoader(data_layer, batch_size=1, collate_fn=data_layer.collate_fn)"
]
},
@ -361,8 +361,8 @@
" self.n_frame_len = int(frame_len * self.sr)\n",
" self.frame_overlap = frame_overlap\n",
" self.n_frame_overlap = int(frame_overlap * self.sr)\n",
" timestep_duration = model_definition['AudioToMelSpectrogramPreprocessor']['params']['window_stride']\n",
" for block in model_definition['JasperEncoder']['params']['jasper']:\n",
" timestep_duration = model_definition['AudioToMelSpectrogramPreprocessor']['window_stride']\n",
" for block in model_definition['JasperEncoder']['jasper']:\n",
" timestep_duration *= block['stride'][0] ** block['repeat']\n",
" self.n_timesteps_overlap = int(frame_overlap / timestep_duration) - 2\n",
" self.buffer = np.zeros(shape=2*self.n_frame_overlap + self.n_frame_len,\n",
@ -443,7 +443,7 @@
" 'sample_rate': SAMPLE_RATE,\n",
" 'AudioToMelSpectrogramPreprocessor': cfg.preprocessor,\n",
" 'JasperEncoder': cfg.encoder,\n",
" 'labels': cfg.decoder.params.vocabulary\n",
" 'labels': cfg.decoder.vocabulary\n",
" },\n",
" frame_len=FRAME_LEN, frame_overlap=2, \n",
" offset=4)"
@ -537,4 +537,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}