Merge branch 'r1.0.0rc1' into aug_tut_bug
commit 754ab33660
@@ -950,3 +950,10 @@ url={https://openreview.net/forum?id=Bkg6RiCqY7},
booktitle={ICVPR},
year={2018}
}

@article{koluguri2020speakernet,
title={SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification},
author={Koluguri, Nithin Rao and Li, Jason and Lavrukhin, Vitaly and Ginsburg, Boris},
journal={arXiv preprint arXiv:2010.12653},
year={2020}
}
114 docs/source/asr/speaker_recognition/configs.rst Normal file
@@ -0,0 +1,114 @@
NeMo ASR Configuration Files
============================

This page covers NeMo configuration file setup that is specific to speaker recognition models.
For general information about how to set up and run experiments that is common to all NeMo models (e.g.,
the experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../introduction/core` page.

The model section of NeMo ASR configuration files generally requires information about the dataset(s) being
used, the preprocessor for audio files, parameters for any augmentation being performed, and the
model architecture specification.
The sections on this page cover each of these in more detail.

Example configuration files for all of the speaker-related scripts can be found in the
``<NEMO_ROOT>/examples/speaker_recognition/conf`` directory.


Dataset Configuration
---------------------

Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and
``test_ds`` sections of your configuration file, respectively.
Depending on the task, you may have arguments specifying the sample rate of your audio files, the maximum time length to consider for each audio file, whether or not to shuffle the dataset, and so on.
You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command line
at runtime.
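
For instance, a blank ``manifest_filepath`` can be filled in at runtime with a Hydra-style override on the
training command. A minimal sketch; the ``speaker_reco.py`` script name here is an assumption based on the
examples directory, so substitute the script you are actually running:

.. code-block:: bash

    python examples/speaker_recognition/speaker_reco.py \
        model.train_ds.manifest_filepath=/data/train_manifest.json \
        model.validation_ds.manifest_filepath=/data/dev_manifest.json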

Any initialization parameters that are accepted by the Dataset class used in your experiment
can be set in the config file.

An example SpeakerNet train and validation configuration could look like:

.. code-block:: yaml

    model:
      train_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # finds labels based on manifest file
        batch_size: 32
        trim_silence: False
        time_length: 8
        shuffle: True

      validation_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # keep None, to match with labels extracted during training
        batch_size: 32
        shuffle: False # no need to shuffle the validation data


Preprocessor Configuration
--------------------------

The preprocessor computes the MFCC or mel spectrogram features that are given as inputs to the model.
For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__.
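
For orientation, a preprocessor section typically looks like the sketch below. The ``_target_`` follows the
``AudioToMelSpectrogramPreprocessor`` module described in the linked section; treat the field values as
placeholders and consult the example configs for the exact settings:

.. code-block:: yaml

    model:
      ...
      preprocessor:
        _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
        sample_rate: 16000
        window_size: 0.02 # window length in seconds
        window_stride: 0.01 # hop between windows in seconds
        features: 64 # number of mel features computed per frame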


Augmentation Configurations
---------------------------

For SpeakerNet training, we use on-the-fly augmentation with MUSAN noise and RIR impulses via the ``noise`` augmentor section.

The following example sets up MUSAN augmentation with audio files taken from the manifest path, and the
minimum and maximum SNR specified with ``min_snr_db`` and ``max_snr_db``, respectively. This section can be added to the
``train_ds`` part of the model config:

.. code-block:: yaml

    model:
      ...
      train_ds:
        ...
        augmentor:
          noise:
            manifest_path: /path/to/musan/manifest_file
            prob: 0.2 # probability to augment the incoming batch audio with augmentor data
            min_snr_db: 5
            max_snr_db: 15


See the :class:`nemo.collections.asr.parts.perturb.AudioAugmentor` API section for more details.


Model Architecture Configurations
---------------------------------

Each configuration file should describe the model architecture being used for the experiment.
Models in the NeMo ASR collection need an ``encoder`` section and a ``decoder`` section, with the ``_target_`` field
specifying the module to use for each.

The following sections go into more detail about the specific configurations of each model architecture.

For more information about the SpeakerNet encoder models, see the :doc:`Models <./models>` page and the `Jasper and QuartzNet <../configs.html#jasper-and-quartznet>`__ section.
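
As a rough orientation, an encoder section has the following shape. This is a trimmed sketch, not a working
configuration: the ``ConvASREncoder`` target matches the module used by the example configs, but the block
parameters below are placeholders, so copy the full block list from ``<NEMO_ROOT>/examples/speaker_recognition/conf``:

.. code-block:: yaml

    model:
      ...
      encoder:
        _target_: nemo.collections.asr.modules.ConvASREncoder
        feat_in: 64 # must match the number of features from the preprocessor
        jasper: # one entry per block; a QuartzNet BxR encoder lists B such entries
          - filters: 512
            repeat: 2
            kernel: [3]
            stride: [1]
            dilation: [1]
            dropout: 0.5
            residual: true
            separable: true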

Decoder Configurations
----------------------

After features have been computed by the SpeakerNet encoder, they are passed to the decoder, which computes embeddings and then log probabilities
for training the models.

.. code-block:: yaml

    model:
      ...
      decoder:
        _target_: nemo.collections.asr.modules.SpeakerDecoder
        feat_in: *enc_final
        num_classes: 7205 # total number of classes in the training manifest file
        pool_mode: xvector # xvector for variance- and mean-based statistics pooling
        emb_sizes: 256 # sizes of intermediate emb layers; can be comma separated for additional layers, like 512,512
        angular: true # if true, the loss is changed to angular softmax, taking scale and margin from the loss section; else train with cross-entropy loss

      loss:
        scale: 30
        margin: 0.2
4 docs/source/asr/speaker_recognition/data/speaker_results.csv Normal file

@@ -0,0 +1,4 @@
Model Name,Model Base Class,Model Card
speakerverification_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet"
speakerrecognition_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet"
speakerdiarization_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerdiarization_speakernet"
49 docs/source/asr/speaker_recognition/datasets.rst Normal file
@@ -0,0 +1,49 @@
Datasets
========

.. _HI-MIA:

HI-MIA
------

Run the script to download and process the ``hi-mia`` dataset in order to generate files in the format supported by ``nemo_asr``. Set the data folder of
hi-mia using ``--data_root``. These scripts are located in ``<nemo_root>/scripts``:

.. code-block:: bash

    python get_hi-mia_data.py --data_root=<data directory>

After download and conversion, your ``data`` folder should contain directories with the following files:

* ``data/<set>/train.json``
* ``data/<set>/dev.json``
* ``data/<set>/<set>_all.json``
* ``data/<set>/utt2spk``


Other Datasets
--------------

These steps can be applied to any dataset to get similar training manifest files.

First, we prepare scp file(s) containing the absolute paths to all the wav files required for each of the train, dev, and test sets. This can easily be prepared with the bash ``find`` command, as follows:

.. code-block:: bash

    find {data_dir}/{train_dir} -iname "*.wav" > data/train_all.scp
    head -n 3 data/train_all.scp


Since we created the scp file for the train set, we use ``scp_to_manifest.py`` to convert it to a manifest file, and then optionally split the files into train and dev sets (for evaluating the model while training) by using the ``--split`` flag.
The ``--split`` option is not needed for the test folder. Also provide the id number, which is the number of the field (with ``/`` as the separator in the file path) to be treated as the speaker label.
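
A call could look like the sketch below; the flag names are assumptions based on the description above, so
check ``python scp_to_manifest.py --help`` before running. Here ``--id -2`` treats the second-to-last
``/``-separated field of each file path (the parent directory name) as the speaker label:

.. code-block:: bash

    python <nemo_root>/scripts/scp_to_manifest.py \
        --scp data/train_all.scp \
        --id -2 \
        --out data/train_all_manifest.json \
        --split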

After the conversion, your ``data`` folder should contain directories with manifest files as:

* ``data/<path>/train.json``
* ``data/<path>/dev.json``
* ``data/<path>/train_all.json``

Each line in a manifest file describes one training sample: ``audio_filepath`` contains the path to the wav file, ``duration`` is its duration in seconds, and ``label`` is the speaker class label:

.. code-block:: json

    {"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}
BIN docs/source/asr/speaker_recognition/images/ICASPP_SpeakerNet.png Normal file
Binary file not shown. (Size: 207 KiB)
63 docs/source/asr/speaker_recognition/intro.rst Normal file
@@ -0,0 +1,63 @@
Speaker Recognition (SR)
========================

Speaker Recognition (SR) is a broad research area which solves two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?).
We focus on far-field, text-independent speaker recognition, where the identity of the speaker is based on how the speech is spoken, not necessarily on what is being said.
Typically such SR systems operate on unconstrained speech utterances, which are converted into vectors of fixed length, called speaker embeddings.
Speaker embeddings can also be used in automatic speech recognition (ASR) and speech synthesis.

As the goal of most speaker recognition systems is to obtain good speaker-level embeddings that help distinguish one speaker from another, we first train these embeddings in an end-to-end manner, optimizing the QuartzNet-based encoder model with cross-entropy loss.
We modify the decoder to produce these fixed-size embeddings irrespective of the length of the input audio, and employ a mean and variance based statistics pooling method to obtain them.

In speaker identification, we typically train on a larger training set with cross-entropy loss and later fine-tune on your preferred set of labels, where one wants to classify only a known set of speakers.
In speaker verification, we train with angular softmax loss and compare embeddings extracted from one audio file coming from a single speaker with
embeddings extracted from another file of the same or a different speaker, employing backend scoring techniques like cosine similarity.
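
As an illustration of the backend scoring step, a verification trial can be scored with plain cosine
similarity between the two embeddings; choosing an accept/reject threshold on this score is left to the
application. A minimal sketch:

.. code-block:: python

    import numpy as np

    def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
        # Cosine similarity between two speaker embeddings; higher scores mean
        # the two utterances are more likely to come from the same speaker.
        return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))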

--------------------------------

``Quick Start``

Write your audio files to a ``manifest.json`` file, with one entry per line in the following format (``duration`` is in seconds):

.. code-block:: json

    {"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}

The following call will download the best pretrained model from NGC and write an embeddings pickle file to the current working directory:

.. code-block:: bash

    python examples/speaker_recognition/extract_speaker_embeddings.py --manifest=manifest.json

--------------------------------

.. toctree::
   :maxdepth: 8

   models
   configs
   datasets
   results

Resource and Documentation Guide
--------------------------------

Hands-on speaker recognition tutorial notebooks can be found under
`the speaker recognition tutorials folder <https://github.com/NVIDIA/NeMo/tree/r1.0.0rc1/tutorials/speaker_recognition/>`_. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab.

If you are looking for information about a particular SpeakerNet model, or would like to find out more about the model
architectures available in the ``nemo_asr`` collection, check out the :doc:`Models <./models>` page.

Documentation on dataset preprocessing can be found on the :doc:`Datasets <./datasets>` page.
NeMo includes preprocessing and other scripts for speaker recognition in the ``<nemo/scripts/speaker_recognition/>`` folder, and that page contains instructions on running
those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.

Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list
of the checkpoints available on NGC, can be found on the :doc:`Checkpoints <./results>` page.

Documentation for configuration files specific to the ``nemo_asr`` models can be found on the
:doc:`Configuration Files <./configs>` page.


For clear step-by-step guidance, we advise you to refer to the tutorials in the `tutorials folder <https://github.com/NVIDIA/NeMo/tree/r1.0.0rc1/tutorials/speaker_recognition/>`_.
51 docs/source/asr/speaker_recognition/models.rst Normal file
@@ -0,0 +1,51 @@
Models
======

Examples of config files for the following QuartzNet model can be found in the ``<NeMo_git_root>/examples/speaker_recognition/conf`` directory.

For more information about the config files and how they should be structured, see the :doc:`./configs` page.

Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found on the :doc:`./results` page.
You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets.
The Checkpoints page also contains benchmark results for the available speaker recognition models.


SpeakerNet
----------

The model is based on the QuartzNet ASR architecture :cite:`asr-models-koluguri2020speakernet`,
comprising an encoder and a decoder. We use the encoder of the QuartzNet model as a top-level feature extractor and feed the output to a statistics pooling layer, where
we compute the mean and variance across channel dimensions to capture the time-independent utterance-level speaker features.

The QuartzNet encoder used for speaker embeddings, shown in the figure below, has the following structure: a QuartzNet BxR
model has B blocks, each with R sub-blocks. Each sub-block applies the following operations: a 1D convolution, batch norm, ReLU, and dropout. All sub-blocks in a block have the same number of output channels. These blocks are connected with residual connections. We use QuartzNet with 3 blocks, 2 sub-blocks, and 512 channels as the encoder for speaker embeddings. All conv layers have stride 1 and dilation 1.


.. image:: images/ICASPP_SpeakerNet.png
    :align: center
    :alt: speakernet model
    :scale: 40%

Top-level acoustic features, obtained from the output of
the encoder, are used to compute intermediate features that are
then passed to the decoder to obtain utterance-level speaker
embeddings. The intermediate time-independent features are
computed using a statistics pooling layer, where we compute the mean and standard deviation of the features across
time channels, to get a time-independent feature representation S of size Batch_size × 3000.
The intermediate features S are passed through the decoder, consisting of two layers each of output size 512, for a
linear transformation from S to the final number of classes
N for the larger (L) model, and a single linear layer of output size 256 to the final number of classes N for the medium
(M) model. We extract q-vectors after the final linear layer
of fixed size 512 and 256 for the SpeakerNet-L and SpeakerNet-M
models, respectively.
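
As an illustration of the statistics pooling step described above, the sketch below shows the computation in
plain PyTorch (this is not NeMo's actual implementation). It assumes an encoder output of 1500 channels, so
that concatenating the mean and standard deviation yields the 3000-dimensional representation S:

.. code-block:: python

    import torch

    def statistics_pooling(encoder_output: torch.Tensor) -> torch.Tensor:
        # encoder_output: [batch, channels, time] features from the encoder.
        # Returns a time-independent representation of size [batch, 2 * channels].
        mean = encoder_output.mean(dim=-1)
        std = encoder_output.std(dim=-1)
        return torch.cat([mean, std], dim=-1)

    feats = torch.randn(4, 1500, 200)  # e.g. 4 utterances, 1500 channels, 200 frames
    s = statistics_pooling(feats)      # -> shape [4, 3000]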

SpeakerNet models can be instantiated using the :class:`EncDecSpeakerLabelModel <nemo.collections.asr.models.EncDecSpeakerLabelModel>` class.


References
----------

.. bibliography:: ../asr_all.bib
    :style: plain
    :labelprefix: ASR-MODELS
    :keyprefix: asr-models-
65 docs/source/asr/speaker_recognition/results.rst Normal file
@@ -0,0 +1,65 @@
Checkpoints
===========

There are two main ways to load pretrained checkpoints in NeMo:

* Using the :code:`restore_from()` method to load a local checkpoint file (``.nemo``), or
* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC.

See the following sections for instructions and examples for each.

Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning.
For resuming an unfinished training experiment, please use the experiment manager to do so by setting the
``resume_if_exists`` flag to True.
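
For example, resuming could be requested with a Hydra-style override on the training command; the script
name below is an assumption, so substitute the one you are running:

.. code-block:: bash

    python examples/speaker_recognition/speaker_reco.py \
        exp_manager.resume_if_exists=true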

Loading Local Checkpoints
-------------------------

NeMo automatically saves checkpoints of a model you are training in the ``.nemo`` format.
You can also manually save your models at any point using :code:`model.save_to(<checkpoint_path>.nemo)`.

If you have a local ``.nemo`` checkpoint that you'd like to load, simply use the :code:`restore_from()` method:

.. code-block:: python

    import nemo.collections.asr as nemo_asr
    model = nemo_asr.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")

Where the model base class is the ASR model class of the original checkpoint, or the general ``ASRModel`` class.

NGC Pretrained Checkpoints
--------------------------

The SpeakerNet-ASR collection has checkpoints of several models trained on various datasets for a variety of tasks.
The `Speaker_Recognition <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet>`_ and `Speaker_Verification <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet>`_ model cards on NGC contain more information about each of the available checkpoints.

The tables below list the SpeakerNet models available from NGC; the models can be accessed via the
:code:`from_pretrained()` method inside the ``EncDecSpeakerLabelModel`` class.

.. note:: While loading, remember to use ``EncDecSpeakerLabelModel`` for recognition tasks and ``ExtractSpeakerEmbeddingsModel`` when extracting embeddings.

In general, you can load any of these models with code in the following format:

.. code-block:: python

    import nemo.collections.asr as nemo_asr
    model = nemo_asr.models.<MODEL_CLASS_NAME>.from_pretrained(model_name="<MODEL_NAME>")

Where the model name is the value in the "Model Name" column of the tables below.
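
For example, to download and use the speaker verification checkpoint listed in the table below:

.. code-block:: python

    import nemo.collections.asr as nemo_asr
    model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="speakerverification_speakernet")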

If you would like to programmatically list the models available for a particular base class, you can use the
:code:`list_available_models()` method:

.. code-block:: python

    nemo_asr.models.<MODEL_BASE_CLASS>.list_available_models()


Speaker Recognition Models
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. csv-table::
    :file: data/speaker_results.csv
    :align: left
    :widths: 30, 30, 40
    :header-rows: 1

@@ -17,6 +17,7 @@ NVIDIA NeMo Developer Guide
:name: Automatic Speech Recognition

asr/intro
+asr/speaker_recognition/intro

.. toctree::
:maxdepth: 2

@@ -52,4 +53,4 @@ NVIDIA NeMo Developer Guide
:name: API

api-docs/nemo

@@ -310,7 +310,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"data_layer = AudioDataLayer(sample_rate=cfg.preprocessor.params.sample_rate)\n",
+"data_layer = AudioDataLayer(sample_rate=cfg.preprocessor.sample_rate)\n",
 "data_loader = DataLoader(data_layer, batch_size=1, collate_fn=data_layer.collate_fn)"
 ]
 },
@@ -361,8 +361,8 @@
 " self.n_frame_len = int(frame_len * self.sr)\n",
 " self.frame_overlap = frame_overlap\n",
 " self.n_frame_overlap = int(frame_overlap * self.sr)\n",
-" timestep_duration = model_definition['AudioToMelSpectrogramPreprocessor']['params']['window_stride']\n",
-" for block in model_definition['JasperEncoder']['params']['jasper']:\n",
+" timestep_duration = model_definition['AudioToMelSpectrogramPreprocessor']['window_stride']\n",
+" for block in model_definition['JasperEncoder']['jasper']:\n",
 " timestep_duration *= block['stride'][0] ** block['repeat']\n",
 " self.n_timesteps_overlap = int(frame_overlap / timestep_duration) - 2\n",
 " self.buffer = np.zeros(shape=2*self.n_frame_overlap + self.n_frame_len,\n",
@@ -443,7 +443,7 @@
 " 'sample_rate': SAMPLE_RATE,\n",
 " 'AudioToMelSpectrogramPreprocessor': cfg.preprocessor,\n",
 " 'JasperEncoder': cfg.encoder,\n",
-" 'labels': cfg.decoder.params.vocabulary\n",
+" 'labels': cfg.decoder.vocabulary\n",
 " },\n",
 " frame_len=FRAME_LEN, frame_overlap=2, \n",
 " offset=4)"
@@ -537,4 +537,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}