Merge branch 'r1.0.0rc1' into aug_tut_bug
commit 754ab33660
@@ -950,3 +950,10 @@ url={https://openreview.net/forum?id=Bkg6RiCqY7},
booktitle={ICVPR},
year={2018}
}

@article{koluguri2020speakernet,
title={SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification},
author={Koluguri, Nithin Rao and Li, Jason and Lavrukhin, Vitaly and Ginsburg, Boris},
journal={arXiv preprint arXiv:2010.12653},
year={2020}
}
114 docs/source/asr/speaker_recognition/configs.rst Normal file
@@ -0,0 +1,114 @@
NeMo ASR Configuration Files
============================

This page covers NeMo configuration file setup that is specific to speaker recognition models.
For general information about how to set up and run experiments that is common to all NeMo models (e.g.,
the experiment manager and PyTorch Lightning trainer parameters), see the :doc:`../../introduction/core` page.

The model section of NeMo ASR configuration files generally requires information about the dataset(s) being
used, the preprocessor for audio files, parameters for any augmentation being performed, and the
model architecture specification.
The sections on this page cover each of these in more detail.

Example configuration files for all of the speaker-related scripts can be found in the
``<NEMO_ROOT>/examples/speaker_recognition/conf`` directory.


Dataset Configuration
---------------------

Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and
``test_ds`` sections of your configuration file, respectively.
Depending on the task, you may have arguments specifying the sample rate of your audio files, the maximum time length to consider for each audio file, whether or not to shuffle the dataset, and so on.
You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command line
at runtime.
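
For instance, a blank ``manifest_filepath`` can be filled in at runtime with a Hydra-style override on the
training command. A minimal sketch; the ``speaker_reco.py`` script name here is an assumption based on the
examples directory, so substitute the script you are actually running:

.. code-block:: bash

    python examples/speaker_recognition/speaker_reco.py \
        model.train_ds.manifest_filepath=/data/train_manifest.json \
        model.validation_ds.manifest_filepath=/data/dev_manifest.json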

Any initialization parameters that are accepted by the Dataset class used in your experiment
can be set in the config file.

An example SpeakerNet train and validation configuration could look like:

.. code-block:: yaml

    model:
      train_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # finds labels based on manifest file
        batch_size: 32
        trim_silence: False
        time_length: 8
        shuffle: True

      validation_ds:
        manifest_filepath: ???
        sample_rate: 16000
        labels: None # keep None, to match with labels extracted during training
        batch_size: 32
        shuffle: False # no need to shuffle the validation data


Preprocessor Configuration
--------------------------

The preprocessor computes the MFCC or mel spectrogram features that are given as inputs to the model.
For details on how to write this section, refer to `Preprocessor Configuration <../configs.html#preprocessor-configuration>`__.
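
For orientation, a preprocessor section typically looks like the sketch below. The ``_target_`` follows the
``AudioToMelSpectrogramPreprocessor`` module described in the linked section; treat the field values as
placeholders and consult the example configs for the exact settings:

.. code-block:: yaml

    model:
      ...
      preprocessor:
        _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
        sample_rate: 16000
        window_size: 0.02 # window length in seconds
        window_stride: 0.01 # hop between windows in seconds
        features: 64 # number of mel features computed per frame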


Augmentation Configurations
---------------------------

For SpeakerNet training, we use on-the-fly augmentation with MUSAN noise and RIR impulses via the ``noise`` augmentor section.

The following example sets up MUSAN augmentation with audio files taken from the manifest path, and the
minimum and maximum SNR specified with ``min_snr_db`` and ``max_snr_db``, respectively. This section can be added to the
``train_ds`` part of the model config:

.. code-block:: yaml

    model:
      ...
      train_ds:
        ...
        augmentor:
          noise:
            manifest_path: /path/to/musan/manifest_file
            prob: 0.2 # probability to augment the incoming batch audio with augmentor data
            min_snr_db: 5
            max_snr_db: 15


See the :class:`nemo.collections.asr.parts.perturb.AudioAugmentor` API section for more details.


Model Architecture Configurations
---------------------------------

Each configuration file should describe the model architecture being used for the experiment.
Models in the NeMo ASR collection need an ``encoder`` section and a ``decoder`` section, with the ``_target_`` field
specifying the module to use for each.

The following sections go into more detail about the specific configurations of each model architecture.

For more information about the SpeakerNet encoder models, see the :doc:`Models <./models>` page and the `Jasper and QuartzNet <../configs.html#jasper-and-quartznet>`__ section.
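
As a rough orientation, an encoder section has the following shape. This is a trimmed sketch, not a working
configuration: the ``ConvASREncoder`` target matches the module used by the example configs, but the block
parameters below are placeholders, so copy the full block list from ``<NEMO_ROOT>/examples/speaker_recognition/conf``:

.. code-block:: yaml

    model:
      ...
      encoder:
        _target_: nemo.collections.asr.modules.ConvASREncoder
        feat_in: 64 # must match the number of features from the preprocessor
        jasper: # one entry per block; a QuartzNet BxR encoder lists B such entries
          - filters: 512
            repeat: 2
            kernel: [3]
            stride: [1]
            dilation: [1]
            dropout: 0.5
            residual: true
            separable: true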

Decoder Configurations
----------------------

After features have been computed by the SpeakerNet encoder, they are passed to the decoder, which computes embeddings and then log probabilities
for training the models.

.. code-block:: yaml

    model:
      ...
      decoder:
        _target_: nemo.collections.asr.modules.SpeakerDecoder
        feat_in: *enc_final
        num_classes: 7205 # total number of classes in the training manifest file
        pool_mode: xvector # xvector for variance- and mean-based statistics pooling
        emb_sizes: 256 # sizes of intermediate emb layers; can be comma separated for additional layers, like 512,512
        angular: true # if true, the loss is changed to angular softmax, taking scale and margin from the loss section; else train with cross-entropy loss

      loss:
        scale: 30
        margin: 0.2
4 docs/source/asr/speaker_recognition/data/speaker_results.csv Normal file

@@ -0,0 +1,4 @@
Model Name,Model Base Class,Model Card
speakerverification_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet"
speakerrecognition_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet"
speakerdiarization_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerdiarization_speakernet"
49 docs/source/asr/speaker_recognition/datasets.rst Normal file
@@ -0,0 +1,49 @@
Datasets
========

.. _HI-MIA:

HI-MIA
------

Run the script to download and process the ``hi-mia`` dataset in order to generate files in the format supported by ``nemo_asr``. Set the data folder of
hi-mia using ``--data_root``. These scripts are located in ``<nemo_root>/scripts``:

.. code-block:: bash

    python get_hi-mia_data.py --data_root=<data directory>

After download and conversion, your ``data`` folder should contain directories with the following files:

* ``data/<set>/train.json``
* ``data/<set>/dev.json``
* ``data/<set>/<set>_all.json``
* ``data/<set>/utt2spk``


Other Datasets
--------------

These steps can be applied to any dataset to get similar training manifest files.

First, we prepare scp file(s) containing the absolute paths to all the wav files required for each of the train, dev, and test sets. This can easily be prepared with the bash ``find`` command, as follows:

.. code-block:: bash

    find {data_dir}/{train_dir} -iname "*.wav" > data/train_all.scp
    head -n 3 data/train_all.scp


Since we created the scp file for the train set, we use ``scp_to_manifest.py`` to convert it to a manifest file, and then optionally split the files into train and dev sets (for evaluating the model while training) by using the ``--split`` flag.
The ``--split`` option is not needed for the test folder. Also provide the id number, which is the number of the field (with ``/`` as the separator in the file path) to be treated as the speaker label.
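
A call could look like the sketch below; the flag names are assumptions based on the description above, so
check ``python scp_to_manifest.py --help`` before running. Here ``--id -2`` treats the second-to-last
``/``-separated field of each file path (the parent directory name) as the speaker label:

.. code-block:: bash

    python <nemo_root>/scripts/scp_to_manifest.py \
        --scp data/train_all.scp \
        --id -2 \
        --out data/train_all_manifest.json \
        --split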

After the conversion, your ``data`` folder should contain directories with manifest files as:

* ``data/<path>/train.json``
* ``data/<path>/dev.json``
* ``data/<path>/train_all.json``

Each line in a manifest file describes one training sample: ``audio_filepath`` contains the path to the wav file, ``duration`` is its duration in seconds, and ``label`` is the speaker class label:

.. code-block:: json

    {"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}
BIN docs/source/asr/speaker_recognition/images/ICASPP_SpeakerNet.png Normal file
Binary file not shown. (Size: 207 KiB)
63 docs/source/asr/speaker_recognition/intro.rst Normal file
@@ -0,0 +1,63 @@
Speaker Recognition (SR)
========================

Speaker Recognition (SR) is a broad research area which solves two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who they claim to be?).
We focus on far-field, text-independent speaker recognition, where the identity of the speaker is based on how the speech is spoken, not necessarily on what is being said.
Typically such SR systems operate on unconstrained speech utterances, which are converted into vectors of fixed length, called speaker embeddings.
Speaker embeddings can also be used in automatic speech recognition (ASR) and speech synthesis.

As the goal of most speaker recognition systems is to obtain good speaker-level embeddings that help distinguish one speaker from another, we first train these embeddings in an end-to-end manner, optimizing the QuartzNet-based encoder model with cross-entropy loss.
We modify the decoder to produce these fixed-size embeddings irrespective of the length of the input audio, and employ a mean and variance based statistics pooling method to obtain them.

In speaker identification, we typically train on a larger training set with cross-entropy loss and later fine-tune on your preferred set of labels, where one wants to classify only a known set of speakers.
In speaker verification, we train with angular softmax loss and compare embeddings extracted from one audio file coming from a single speaker with
embeddings extracted from another file of the same or a different speaker, employing backend scoring techniques like cosine similarity.
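
As an illustration of the backend scoring step, a verification trial can be scored with plain cosine
similarity between the two embeddings; choosing an accept/reject threshold on this score is left to the
application. A minimal sketch:

.. code-block:: python

    import numpy as np

    def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
        # Cosine similarity between two speaker embeddings; higher scores mean
        # the two utterances are more likely to come from the same speaker.
        return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))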

--------------------------------

``Quick Start``

Write your audio files to a ``manifest.json`` file, with one entry per line in the following format (``duration`` is in seconds):

.. code-block:: json

    {"audio_filepath": "<absolute path to dataset>/audio_file.wav", "duration": 3.9, "label": "speaker_id"}

The following call will download the best pretrained model from NGC and write an embeddings pickle file to the current working directory:

.. code-block:: bash

    python examples/speaker_recognition/extract_speaker_embeddings.py --manifest=manifest.json

--------------------------------

.. toctree::
   :maxdepth: 8

   models
   configs
   datasets
   results

Resource and Documentation Guide
--------------------------------

Hands-on speaker recognition tutorial notebooks can be found under
`the speaker recognition tutorials folder <https://github.com/NVIDIA/NeMo/tree/r1.0.0rc1/tutorials/speaker_recognition/>`_. This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab.

If you are looking for information about a particular SpeakerNet model, or would like to find out more about the model
architectures available in the ``nemo_asr`` collection, check out the :doc:`Models <./models>` page.

Documentation on dataset preprocessing can be found on the :doc:`Datasets <./datasets>` page.
NeMo includes preprocessing and other scripts for speaker recognition in the ``<nemo/scripts/speaker_recognition/>`` folder, and that page contains instructions on running
those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data.

Information about how to load model checkpoints (either local files or pretrained ones from NGC), as well as a list
of the checkpoints available on NGC, can be found on the :doc:`Checkpoints <./results>` page.

Documentation for configuration files specific to the ``nemo_asr`` models can be found on the
:doc:`Configuration Files <./configs>` page.


For clear step-by-step guidance, we advise you to refer to the tutorials in the `tutorials folder <https://github.com/NVIDIA/NeMo/tree/r1.0.0rc1/tutorials/speaker_recognition/>`_.
51 docs/source/asr/speaker_recognition/models.rst Normal file
@@ -0,0 +1,51 @@
Models
======

Examples of config files for the following QuartzNet model can be found in the ``<NeMo_git_root>/examples/speaker_recognition/conf`` directory.

For more information about the config files and how they should be structured, see the :doc:`./configs` page.

Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found on the :doc:`./results` page.
You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets.
The Checkpoints page also contains benchmark results for the available speaker recognition models.


SpeakerNet
----------

The model is based on the QuartzNet ASR architecture :cite:`asr-models-koluguri2020speakernet`,
comprising an encoder and a decoder. We use the encoder of the QuartzNet model as a top-level feature extractor and feed the output to a statistics pooling layer, where
we compute the mean and variance across channel dimensions to capture the time-independent utterance-level speaker features.

The QuartzNet encoder used for speaker embeddings, shown in the figure below, has the following structure: a QuartzNet BxR
model has B blocks, each with R sub-blocks. Each sub-block applies the following operations: a 1D convolution, batch norm, ReLU, and dropout. All sub-blocks in a block have the same number of output channels. These blocks are connected with residual connections. We use QuartzNet with 3 blocks, 2 sub-blocks, and 512 channels as the encoder for speaker embeddings. All conv layers have stride 1 and dilation 1.


.. image:: images/ICASPP_SpeakerNet.png
    :align: center
    :alt: speakernet model
    :scale: 40%

Top-level acoustic features, obtained from the output of
the encoder, are used to compute intermediate features that are
then passed to the decoder to obtain utterance-level speaker
embeddings. The intermediate time-independent features are
computed using a statistics pooling layer, where we compute the mean and standard deviation of the features across
time channels, to get a time-independent feature representation S of size Batch_size × 3000.
The intermediate features S are passed through the decoder, consisting of two layers each of output size 512, for a
linear transformation from S to the final number of classes
N for the larger (L) model, and a single linear layer of output size 256 to the final number of classes N for the medium
(M) model. We extract q-vectors after the final linear layer
of fixed size 512 and 256 for the SpeakerNet-L and SpeakerNet-M
models, respectively.
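
As an illustration of the statistics pooling step described above, the sketch below shows the computation in
plain PyTorch (this is not NeMo's actual implementation). It assumes an encoder output of 1500 channels, so
that concatenating the mean and standard deviation yields the 3000-dimensional representation S:

.. code-block:: python

    import torch

    def statistics_pooling(encoder_output: torch.Tensor) -> torch.Tensor:
        # encoder_output: [batch, channels, time] features from the encoder.
        # Returns a time-independent representation of size [batch, 2 * channels].
        mean = encoder_output.mean(dim=-1)
        std = encoder_output.std(dim=-1)
        return torch.cat([mean, std], dim=-1)

    feats = torch.randn(4, 1500, 200)  # e.g. 4 utterances, 1500 channels, 200 frames
    s = statistics_pooling(feats)      # -> shape [4, 3000]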

SpeakerNet models can be instantiated using the :class:`EncDecSpeakerLabelModel <nemo.collections.asr.models.EncDecSpeakerLabelModel>` class.


References
----------

.. bibliography:: ../asr_all.bib
    :style: plain
    :labelprefix: ASR-MODELS
    :keyprefix: asr-models-
65 docs/source/asr/speaker_recognition/results.rst Normal file
@@ -0,0 +1,65 @@
Checkpoints
===========

There are two main ways to load pretrained checkpoints in NeMo:

* Using the :code:`restore_from()` method to load a local checkpoint file (``.nemo``), or
* Using the :code:`from_pretrained()` method to download and set up a checkpoint from NGC.

See the following sections for instructions and examples for each.

Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning.
For resuming an unfinished training experiment, please use the experiment manager to do so by setting the
``resume_if_exists`` flag to True.
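
For example, resuming could be requested with a Hydra-style override on the training command; the script
name below is an assumption, so substitute the one you are running:

.. code-block:: bash

    python examples/speaker_recognition/speaker_reco.py \
        exp_manager.resume_if_exists=true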

Loading Local Checkpoints
-------------------------

NeMo automatically saves checkpoints of a model you are training in the ``.nemo`` format.
You can also manually save your models at any point using :code:`model.save_to(<checkpoint_path>.nemo)`.

If you have a local ``.nemo`` checkpoint that you'd like to load, simply use the :code:`restore_from()` method:

.. code-block:: python

    import nemo.collections.asr as nemo_asr
    model = nemo_asr.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")

Where the model base class is the ASR model class of the original checkpoint, or the general ``ASRModel`` class.

NGC Pretrained Checkpoints
--------------------------

The SpeakerNet-ASR collection has checkpoints of several models trained on various datasets for a variety of tasks.
The `Speaker_Recognition <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet>`_ and `Speaker_Verification <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet>`_ model cards on NGC contain more information about each of the available checkpoints.

The tables below list the SpeakerNet models available from NGC; the models can be accessed via the
:code:`from_pretrained()` method inside the ``EncDecSpeakerLabelModel`` class.

.. note:: While loading, remember to use ``EncDecSpeakerLabelModel`` for recognition tasks and ``ExtractSpeakerEmbeddingsModel`` when extracting embeddings.

In general, you can load any of these models with code in the following format:

.. code-block:: python

    import nemo.collections.asr as nemo_asr
    model = nemo_asr.models.<MODEL_CLASS_NAME>.from_pretrained(model_name="<MODEL_NAME>")

Where the model name is the value in the "Model Name" column of the tables below.
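
For example, to download and use the speaker verification checkpoint listed in the table below:

.. code-block:: python

    import nemo.collections.asr as nemo_asr
    model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="speakerverification_speakernet")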

If you would like to programmatically list the models available for a particular base class, you can use the
:code:`list_available_models()` method:

.. code-block:: python

    nemo_asr.models.<MODEL_BASE_CLASS>.list_available_models()


Speaker Recognition Models
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. csv-table::
    :file: data/speaker_results.csv
    :align: left
    :widths: 30, 30, 40
    :header-rows: 1

@@ -17,6 +17,7 @@ NVIDIA NeMo Developer Guide
:name: Automatic Speech Recognition

asr/intro
+asr/speaker_recognition/intro

.. toctree::
:maxdepth: 2

@@ -52,4 +53,4 @@ NVIDIA NeMo Developer Guide
:name: API

api-docs/nemo

@@ -310,7 +310,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"data_layer = AudioDataLayer(sample_rate=cfg.preprocessor.params.sample_rate)\n",
+"data_layer = AudioDataLayer(sample_rate=cfg.preprocessor.sample_rate)\n",
 "data_loader = DataLoader(data_layer, batch_size=1, collate_fn=data_layer.collate_fn)"
 ]
 },
@@ -361,8 +361,8 @@
 " self.n_frame_len = int(frame_len * self.sr)\n",
 " self.frame_overlap = frame_overlap\n",
 " self.n_frame_overlap = int(frame_overlap * self.sr)\n",
-" timestep_duration = model_definition['AudioToMelSpectrogramPreprocessor']['params']['window_stride']\n",
-" for block in model_definition['JasperEncoder']['params']['jasper']:\n",
+" timestep_duration = model_definition['AudioToMelSpectrogramPreprocessor']['window_stride']\n",
+" for block in model_definition['JasperEncoder']['jasper']:\n",
 " timestep_duration *= block['stride'][0] ** block['repeat']\n",
 " self.n_timesteps_overlap = int(frame_overlap / timestep_duration) - 2\n",
 " self.buffer = np.zeros(shape=2*self.n_frame_overlap + self.n_frame_len,\n",
@@ -443,7 +443,7 @@
 " 'sample_rate': SAMPLE_RATE,\n",
 " 'AudioToMelSpectrogramPreprocessor': cfg.preprocessor,\n",
 " 'JasperEncoder': cfg.encoder,\n",
-" 'labels': cfg.decoder.params.vocabulary\n",
+" 'labels': cfg.decoder.vocabulary\n",
 " },\n",
 " frame_len=FRAME_LEN, frame_overlap=2, \n",
 " offset=4)"
@@ -537,4 +537,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
}
}