Update model names (#2845)

* updated speaker model names

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* update tutorial model names

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>
Nithin Rao 2021-09-19 18:58:04 -07:00 committed by GitHub
parent b1e1494688
commit 3f606194f2
17 changed files with 55 additions and 42 deletions

View file

@@ -977,4 +977,15 @@ url={https://openreview.net/forum?id=Bkg6RiCqY7},
author={Gulati, Anmol and Qin, James and Chiu, Chung-Cheng and Parmar, Niki and Zhang, Yu and Yu, Jiahui and Han, Wei and Wang, Shibo and Zhang, Zhengdong and Wu, Yonghui and others},
journal={arXiv preprint arXiv:2005.08100},
year={2020}
}
@article{Dawalatabad_2021,
title={ECAPA-TDNN Embeddings for Speaker Diarization},
url={http://dx.doi.org/10.21437/Interspeech.2021-941},
DOI={10.21437/interspeech.2021-941},
journal={Interspeech 2021},
publisher={ISCA},
author={Dawalatabad, Nauman and Ravanelli, Mirco and Grondin, François and Thienpondt, Jenthe and Desplanques, Brecht and Na, Hwidong},
year={2021},
month={Aug}
}

View file

@@ -1,4 +1,4 @@
Model Name,Model Base Class,Model Card
vad_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_marblenet"
vad_telephony_marblenet,EncDecClassificationModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:vad_telephony_marblenet"
speakerdiarization_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerdiarization_speakernet"
ecapa_tdnn,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn"


View file

@@ -52,20 +52,20 @@ Here ``paths2audio_files`` and ``path2groundtruth_rttm_files`` are files contain
AMI Meeting Corpus
------------------
The following are the suggested parameters for getting a Speaker Error Rate (SER) of 4.1% on the AMI Lapel test set:
- diarizer.oracle_num_speakers = 4 (there are exactly 4 speakers in each Lapel test set session)
- diarizer.speaker_embeddings.model_path = ``speakerverification_speakernet`` (This model is trained on the VoxCeleb dataset. ``Use this model for similar non-telephonic speech datasets``)
- diarizer.speaker_embeddings.window_length_in_sec = 3
- diarizer.speaker_embeddings.shift_length_in_sec = 1.5
The following are the suggested parameters for getting a Speaker Error Rate (SER) of 2.13% on the AMI Lapel test set:
- diarizer.oracle_num_speakers = null (performing the unknown-number-of-speakers case)
- diarizer.speaker_embeddings.model_path = ``ecapa_tdnn`` (This model is trained on the VoxCeleb dataset. ``Use this model for similar non-telephonic speech datasets``)
- diarizer.speaker_embeddings.window_length_in_sec = 1.5
- diarizer.speaker_embeddings.shift_length_in_sec = 0.75
Provide paths2audio_files, paths2rttm_files, and oracle_vad_manifest by following the steps shown above.
CallHome LDC97S42 (CH109)
-------------------------
The following are the suggested parameters for getting a Speaker Error Rate (SER) of 5.4% on the CH109 set:
The following are the suggested parameters for getting a Speaker Error Rate (SER) of 1.19% on the CH109 set:
- diarizer.oracle_num_speakers = 2 (there are exactly 2 speakers in each CH109 session)
- diarizer.speaker_embeddings.model_path = ``speakerdiarization_speakernet`` (This model is trained on VoxCeleb and the telephonic speech corpora Fisher and SWBD. ``Use this model for similar telephonic speech datasets``)
- diarizer.speaker_embeddings.model_path = ``ecapa_tdnn`` (This model is trained on VoxCeleb and the telephonic speech corpora Fisher and SWBD. ``Use this model for similar telephonic speech datasets``)
- diarizer.speaker_embeddings.window_length_in_sec = 1.5
- diarizer.speaker_embeddings.shift_length_in_sec = 0.75
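As a quick way to apply these settings before running diarization, the Hydra config can be edited in Python, as in the tutorials. A minimal sketch for the CH109 case follows; the config path here is hypothetical, so point it at the diarizer config shipped with your NeMo checkout.

.. code-block:: python

    from omegaconf import OmegaConf

    # Load the diarizer config (hypothetical path; adjust to your checkout).
    config = OmegaConf.load('conf/speaker_diarization.yaml')

    # Apply the CH109 suggested parameters from above.
    config.diarizer.oracle_num_speakers = 2
    config.diarizer.speaker_embeddings.model_path = 'ecapa_tdnn'
    config.diarizer.speaker_embeddings.window_length_in_sec = 1.5
    config.diarizer.speaker_embeddings.shift_length_in_sec = 0.75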

View file

@@ -2,4 +2,4 @@ Models
======
Currently NeMo's Speaker Diarization pipeline uses the `MarbleNet <../speech_classification/models.html#marblenet-vad>`__ model for Voice Activity Detection (VAD) and the `SpeakerNet <../speaker_recognition/models.html#speakernet>`__ model for Speaker Embedding Extraction.
Currently NeMo's Speaker Diarization pipeline uses the `MarbleNet <../speech_classification/models.html#marblenet-vad>`__ model for Voice Activity Detection (VAD) and the `SpeakerNet <../speaker_recognition/models.html#speakernet>`__ & ``ECAPA_TDNN`` models for Speaker Embedding Extraction.

View file

@@ -21,7 +21,7 @@ Load Speaker Embedding model
.. code-block:: bash
pretrained_speaker_model='/path/to/speakerdiarization_speakernet.nemo' # local .nemo or pretrained speakernet model name
pretrained_speaker_model='/path/to/ecapa_tdnn.nemo' # local .nemo file or pretrained speaker model name
...
# pass with hydra config
config.diarizer.speaker_embeddings.model_path=pretrained_speaker_model
@@ -59,7 +59,7 @@ In general, you can load models with model name in the following format,
.. code-block:: python
pretrained_vad_model='vad_telephony_marblenet'
pretrained_speaker_model='speakerdiarization_speakernet'
pretrained_speaker_model='ecapa_tdnn'
...
config.diarizer.vad.model_path=pretrained_vad_model \
config.diarizer.speaker_embeddings.model_path=pretrained_speaker_model

View file

@@ -1,4 +1,3 @@
Model Name,Model Base Class,Model Card
speakerverification_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet"
speakerrecognition_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet"
speakerdiarization_speakernet,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerdiarization_speakernet"
ecapa_tdnn,EncDecSpeakerLabelModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn"


View file

@@ -42,6 +42,12 @@ models respectively.
SpeakerNet models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecSpeakerLabelModel` class.
ECAPA_TDNN
----------
The model is based on the paper "ECAPA-TDNN Embeddings for Speaker Diarization" :cite:`Dawalatabad_2021`, comprising an encoder of time-dilation layers based on Emphasized Channel Attention, Propagation, and Aggregation. The ECAPA-TDNN model employs a channel- and context-dependent attention mechanism, Multilayer Feature Aggregation (MFA), and Squeeze-Excitation (SE) and residual blocks; for faster training and inference, we replace the residual blocks with group convolution blocks of a single dilation. These models have shown good performance across various speaker tasks.
ECAPA_TDNN models can be instantiated using the :class:`~nemo.collections.asr.models.EncDecSpeakerLabelModel` class.
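For example, the checkpoint can be fetched from NGC by name; a minimal sketch using the model name listed above:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Downloads the ecapa_tdnn checkpoint from NGC and restores the model.
    speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="ecapa_tdnn")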
References
-----------

View file

@@ -82,7 +82,7 @@ NGC Pretrained Checkpoints
--------------------------
The SpeakerNet-ASR collection has checkpoints of several models trained on various datasets for a variety of tasks.
`Speaker_Recognition <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerrecognition_speakernet>`_ and `Speaker_Verification <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet>`_ model cards on NGC contain more information about each of the checkpoints available.
`ECAPA_TDNN <https://ngc.nvidia.com/catalog/models/nvidia:nemo:ecapa_tdnn>`_ and `Speaker_Verification <https://ngc.nvidia.com/catalog/models/nvidia:nemo:speakerverification_speakernet>`_ model cards on NGC contain more information about each of the checkpoints available.
The tables below list the SpeakerNet models available from NGC, and the models can be accessed via the
:code:`from_pretrained()` method of the EncDecSpeakerLabelModel class.
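As a quick check before downloading, the registered checkpoints can be enumerated programmatically; a minimal sketch, assuming :code:`list_available_models()` behaves here as it does in other NeMo collections:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Print the names of pretrained checkpoints registered for this class.
    for model_info in nemo_asr.models.EncDecSpeakerLabelModel.list_available_models():
        print(model_info.pretrained_model_name)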

View file

@@ -32,7 +32,7 @@ Diarization Error Rate (DER) table of `ecapa_tdnn.nemo` model on well known eval
```bash
python speaker_diarize.py \
diarizer.paths2audio_files='my_wav.list' \
diarizer.speaker_embeddings.model_path='speakerdiarization_speakernet' \
diarizer.speaker_embeddings.model_path='ecapa_tdnn' \
diarizer.oracle_num_speakers=null \
diarizer.vad.model_path='vad_telephony_marblenet'
```
@@ -42,7 +42,7 @@ If you have oracle VAD files and groundtruth RTTM files for evaluation:
```bash
python speaker_diarize.py \
diarizer.paths2audio_files='my_wav.list' \
diarizer.speaker_embeddings.model_path='path/to/speakerdiarization_speakernet.nemo' \
diarizer.speaker_embeddings.model_path='path/to/ecapa_tdnn.nemo' \
diarizer.oracle_num_speakers=null \
diarizer.speaker_embeddings.oracle_vad_manifest='oracle_vad.manifest' \
diarizer.path2groundtruth_rttm_files='my_wav_rttm.list'
@@ -82,13 +82,13 @@ my_audio3 5
- **`diarizer.speaker_embeddings.model_path`: speaker embedding model name**
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Specify the name of the speaker embedding model, and the script will download the model from NGC. Currently, we have 'speakerdiarization_speakernet' and 'speakerverification_speakernet'.
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Specify the name of the speaker embedding model, and the script will download the model from NGC. Currently, we have 'ecapa_tdnn' and 'speakerverification_speakernet'.
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `diarizer.speaker_embeddings.model_path='speakerdiarization_speakernet'`
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `diarizer.speaker_embeddings.model_path='ecapa_tdnn'`
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; You could also download *.nemo files from [this link](https://ngc.nvidia.com/catalog/models?orderBy=scoreDESC&pageNumber=0&query=SpeakerNet&quickFilter=&filters=) and specify the full path name to the speaker embedding model file (`*.nemo`).
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `diarizer.speaker_embeddings.model_path='path/to/speakerdiarization_speakernet.nemo'`
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; `diarizer.speaker_embeddings.model_path='path/to/ecapa_tdnn.nemo'`
- **`diarizer.vad.model_path`: voice activity detection model name or path to the model**
@@ -132,14 +132,14 @@ Currently, asr_with_diarization only supports QuartzNet English model ([`QuartzN
```bash
python asr_with_diarization.py \
--pretrained_speaker_model='speakerdiarization_speakernet' \
--pretrained_speaker_model='ecapa_tdnn' \
--audiofile_list_path='my_wav.list' \
```
If you have reference RTTM files or oracle number-of-speakers information, you can provide those files as in the following example.
```bash
python asr_with_diarization.py \
--pretrained_speaker_model='speakerdiarization_speakernet' \
--pretrained_speaker_model='ecapa_tdnn' \
--audiofile_list_path='my_wav.list' \
--reference_rttmfile_list_path='my_wav_rttm.list'\
--oracle_num_speakers=number_of_speakers.list

View file

@@ -31,6 +31,6 @@ diarizer:
speaker_embeddings:
oracle_vad_manifest: null
model_path: ??? #.nemo local model path or pretrained model name (speakerverification_speakernet or speakerdiarization_speakernet)
model_path: ??? #.nemo local model path or pretrained model name (ecapa_tdnn or speakerverification_speakernet)
window_length_in_sec: 1.5 # window length in sec for speaker embedding extraction
shift_length_in_sec: 0.75 # shift length in sec for speaker embedding extraction

View file

@@ -21,7 +21,7 @@ python speaker_reco.py --config_path='conf' --config_name='SpeakerNet_verificati
For training the ecapa_tdnn (channel-attention) model:
```bash
python speaker_reco.py --config_path='conf' --config_name='SpeakerNet_ECAPA.yaml'
python speaker_reco.py --config_path='conf' --config_name='ecapa_tdnn.yaml'
```
For a step-by-step tutorial, see the [notebook](https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Recognition_Verification.ipynb).
@@ -48,12 +48,12 @@ python <NeMo_root>/scripts/speaker_tasks/scp_to_manifest.py --scp voxceleb1_test
### Embedding Extraction
Now, using the manifest file created, we can extract embeddings to the `data` folder using:
```bash
python extract_speaker_embeddings.py --manifest=voxceleb1_test_manifest.json --model_path='speakerverification_speakernet' --embedding_dir='./'
python extract_speaker_embeddings.py --manifest=voxceleb1_test_manifest.json --model_path='ecapa_tdnn' --embedding_dir='./'
```
If you have a single file, you may also use the following one-liner to get embeddings for the audio file:
```python
speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="speakerverification_speakernet")
speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="ecapa_tdnn")
embs = speaker_model.get_embedding('audio_path')
```
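The returned embeddings can be compared directly for speaker verification; below is a minimal cosine-similarity sketch, where the file paths and the decision threshold are illustrative only:
```python
import torch
from nemo.collections.asr.models import EncDecSpeakerLabelModel

speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="ecapa_tdnn")

# One embedding per utterance (paths are illustrative).
emb1 = speaker_model.get_embedding('utt1.wav').squeeze()
emb2 = speaker_model.get_embedding('utt2.wav').squeeze()

# Cosine similarity as a same-speaker score; 0.7 is an illustrative threshold, not a tuned value.
score = torch.nn.functional.cosine_similarity(emb1, emb2, dim=-1).item()
print(f"similarity: {score:.3f} -> {'same' if score > 0.7 else 'different'} speaker")
```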

View file

@@ -110,7 +110,7 @@ def main():
parser.add_argument(
"--model_path",
type=str,
default='speakerverification_speakernet',
default='ecapa_tdnn',
required=False,
help="path to .nemo speaker verification model file to extract embeddings, if not passed SpeakerNet-M model would be downloaded from NGC and used to extract embeddings",
)
@@ -130,7 +130,7 @@ def main():
elif args.model_path.endswith('.ckpt'):
speaker_model = EncDecSpeakerLabelModel.load_from_checkpoint(checkpoint_path=args.model_path)
else:
speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="speakerverification_speakernet")
speaker_model = EncDecSpeakerLabelModel.from_pretrained(model_name="ecapa_tdnn")
logging.info(f"using pretrained speaker verification model from NGC")
device = 'cuda'

View file

@@ -41,11 +41,7 @@ from nemo.utils.exp_manager import exp_manager
def main():
parser = ArgumentParser()
parser.add_argument(
"--pretrained_model",
type=str,
default="speakerverification_speakernet",
required=False,
help="Pass your trained .nemo model",
"--pretrained_model", type=str, default="ecapa_tdnn", required=False, help="Pass your trained .nemo model",
)
parser.add_argument(
"--finetune_config_file",

View file

@@ -54,11 +54,7 @@ can_gpu = torch.cuda.is_available()
def main():
parser = ArgumentParser()
parser.add_argument(
"--spkr_model",
type=str,
default="speakerrecognition_speakernet",
required=True,
help="Pass your trained .nemo model",
"--spkr_model", type=str, default="ecapa_tdnn", required=True, help="Pass your trained .nemo model",
)
parser.add_argument(
"--train_manifest", type=str, required=True, help="path to train manifest file to match labels"

View file

@@ -60,7 +60,12 @@ for refresh_cache in [True, False]:
testclass_downloads(
nemo_asr.models.EncDecSpeakerLabelModel,
refresh_cache,
['speakerrecognition_speakernet', 'speakerverification_speakernet', 'speakerdiarization_speakernet'],
[
'speakerrecognition_speakernet',
'speakerverification_speakernet',
'speakerdiarization_speakernet',
'ecapa_tdnn',
],
)
# Test NLP collection

View file

@@ -364,7 +364,7 @@
"outputs": [],
"source": [
"num_speakers = None # If we know the number of speakers, we can assign \"2\".\n",
"pretrained_speaker_model = 'speakerdiarization_speakernet'\n",
"pretrained_speaker_model = 'ecapa_tdnn'\n",
"\n",
"asr_diar_offline.run_diarization(audio_file_list, vad_manifest_path, num_speakers, pretrained_speaker_model)\n",
"\n",

View file

@@ -274,7 +274,7 @@
"metadata": {},
"outputs": [],
"source": [
"pretrained_speaker_model='speakerdiarization_speakernet'\n",
"pretrained_speaker_model='ecapa_tdnn'\n",
"config.diarizer.paths2audio_files = paths2audio_files\n",
"config.diarizer.path2groundtruth_rttm_files = path2groundtruth_rttm_files\n",
"config.diarizer.out_dir = output_dir #Directory to store intermediate files and prediction outputs\n",
@@ -394,7 +394,7 @@
"outputs": [],
"source": [
"pretrained_vad = 'vad_marblenet'\n",
"pretrained_speaker_model = 'speakerdiarization_speakernet'"
"pretrained_speaker_model = 'ecapa_tdnn'"
]
},
{