In [1]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================



# Kaldi TRTIS Inference Offline Demo

## Overview


This repository provides a wrapper around the online GPU-accelerated ASR pipeline from the paper [GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition](https://arxiv.org/abs/1910.10032). That work includes a high-performance implementation of a GPU HMM Decoder, a low-latency Neural Net driver, fast Feature Extraction for preprocessing, and new ASR pipelines tailored for GPUs. These different modules have been integrated into the Kaldi ASR framework.

This repository contains a TensorRT Inference Server custom backend for the Kaldi ASR framework. The custom backend calls the high-performance online GPU pipeline from the Kaldi ASR framework. The TensorRT Inference Server integration makes Kaldi ASR inference easier to deploy by providing a gRPC streaming server, dynamic sequence batching, and multi-instance support. A client connects to the gRPC server, streams audio by sending chunks to the server, and receives the inferred text in response. More information about the TensorRT Inference Server can be found [here](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/).



### Learning objectives

This notebook demonstrates how to run inference against the Kaldi TRTIS backend server with a Python gRPC client in an offline context: we stream pre-recorded .wav files to the inference server and receive the transcriptions back.

## Content
1. [Pre-requisite](#1)
1. [Setup](#2)
1. [Audio helper classes](#3)
1. [Inference](#4)



## 1. Pre-requisite

### 1.1 Docker containers
Follow the steps in [README](README.md) to build Kaldi server and client containers.

### 1.2 Hardware
This notebook can be executed on any CUDA-enabled NVIDIA GPU, although for efficient mixed precision inference, a [Tensor Core NVIDIA GPU](https://www.nvidia.com/en-us/data-center/tensorcore/) is recommended (Volta, Turing or newer architectures).

In [2]:
!nvidia-smi

Thu Mar 5 00:20:50 2020 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02 Driver Version: 440.48.02 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Quadro GV100 Off | 00000000:05:00.0 Off | Off |
| 33% 46C P2 37W / 250W | 17706MiB / 32506MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
+-----------------------------------------------------------------------------+


### 1.3 Data download and preprocessing

The script `scripts/docker/launch_download.sh` will download the LibriSpeech test dataset along with Kaldi ASR models.

In [3]:
!ls /Kaldi/data/data/LibriSpeech 

BOOKS.TXT LICENSE.TXT SPEAKERS.TXT test-other
CHAPTERS.TXT README.TXT test-clean


Within the docker container, the final data and model directory should look like:

```
/Kaldi/data
 data
 datasets 
 models
```
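
Before launching the server, it can be worth confirming that the layout matches. The following is a minimal sketch only: the helper name `missing_subdirs` is hypothetical, and the expected directory names are simply taken from the listing above.

```python
import os

def missing_subdirs(root, expected=("data", "datasets", "models")):
    """Return the expected subdirectories that are absent under `root`."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(root, name))]

# An empty list means the layout matches the listing above.
print(missing_subdirs("/Kaldi/data"))
```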


## 2. Setup
### Import libraries and parameters

In [4]:
import argparse
import numpy as np
import os
from builtins import range
from functools import partial
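import librosa  # required by AudioSegment for resampling and trimming
import soundfile
import soundfile as sf  # alias used by AudioSegment.from_file
import pyaudio as pa
import subprocess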

import grpc
from tensorrtserver.api import api_pb2
from tensorrtserver.api import grpc_service_pb2
from tensorrtserver.api import grpc_service_pb2_grpc
import tensorrtserver.api.model_config_pb2 as model_config

In [5]:
parser = argparse.ArgumentParser()
parser.add_argument('-f', '--file', help='Path for input file. First line should contain number of lines to search in')

parser.add_argument('-v', '--verbose', action="store_true", required=False, default=False,
 help='Enable verbose output')
parser.add_argument('-a', '--async', dest="async_set", action="store_true", required=False,
 default=False, help='Use asynchronous inference API')
parser.add_argument('--streaming', action="store_true", required=False, default=False,
 help='Use streaming inference API')
parser.add_argument('-m', '--model-name', type=str, required=False, default='kaldi_online' ,
 help='Name of model')
parser.add_argument('-x', '--model-version', type=int, required=False, default=1,
                    help='Version of model. Default is 1.')
parser.add_argument('-b', '--batch-size', type=int, required=False, default=1,
 help='Batch size. Default is 1.')
parser.add_argument('-u', '--url', type=str, required=False, default='localhost:8001',
 help='Inference server URL. Default is localhost:8001.')
FLAGS = parser.parse_args()

### Checking server status

We first query the status of the server. The target model is 'kaldi_online'. A successful deployment of the server should produce output similar to the following.

```
request_status {
 code: SUCCESS
 server_id: "inference:0"
 request_id: 17514
}
server_status {
 id: "inference:0"
 version: "1.9.0"
 uptime_ns: 14179155408971
 model_status {
 key: "kaldi_online"
...
```

In [6]:
# Create gRPC stub for communicating with the server
channel = grpc.insecure_channel(FLAGS.url)
grpc_stub = grpc_service_pb2_grpc.GRPCServiceStub(channel)

# Prepare request for Status gRPC
request = grpc_service_pb2.StatusRequest(model_name=FLAGS.model_name)
# Call and receive response from Status gRPC
response = grpc_stub.Status(request)

print(response)

request_status {
 code: SUCCESS
 server_id: "inference:0"
 request_id: 5864
}
server_status {
 id: "inference:0"
 version: "1.9.0"
 uptime_ns: 3596863561558
 model_status {
 key: "kaldi_online"
 value {
 config {
 name: "kaldi_online"
 platform: "custom"
 version_policy {
 latest {
 num_versions: 1
 }
 }
 max_batch_size: 2200
 input {
 name: "WAV_DATA"
 data_type: TYPE_FP32
 dims: 8160
 }
 input {
 name: "WAV_DATA_DIM"
 data_type: TYPE_INT32
 dims: 1
 }
 output {
 name: "TEXT"
 data_type: TYPE_STRING
 dims: 1
 }
 instance_group {
 name: "kaldi_online_0"
 count: 2
 gpus: 0
 kind: KIND_GPU
 }
 default_model_filename: "libkaldi-trtisbackend.so"
 sequence_batching {
 max_sequence_idle_microseconds: 5000000
 control_input {
 name: "START"
 control {
 int32_false_true: 0
 int32_false_true: 1
 }
 }
 control_input {
 name: "READY"
 control {
 kind: CONTROL_SEQUENCE_READY
 int32_false_true: 0
 int32_false_true: 1
 }
 }
 control_input {
 name: "END"
 control {
 kind: CONTROL_SEQUENCE_END
 int32_


## 3. Audio helper classes

Next, we define some helper classes for pre-processing audio from files. The `AudioSegment` class below reads audio data from .wav files and converts the sampling rate to the one required by the Kaldi ASR model, which is 16000 Hz by default.

Note: For historical reasons, Kaldi expects waveforms in the range [-(2^15-1), 2^15-1] rather than the usual DSP range [-1, 1]. Therefore, we scale the audio signal by a factor of 2^15-1 = 32767.
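
The scaling described in the note can be sketched in isolation as follows. This is an illustrative example, not the notebook's own implementation: `to_kaldi_range` is a hypothetical helper name, and it only assumes NumPy.

```python
import numpy as np

WAV_SCALE_FACTOR = 2**15 - 1  # 32767, the factor named in the note above

def to_kaldi_range(samples):
    """Map signed-integer PCM or [-1, 1] float audio to float32 in Kaldi's range."""
    samples = np.asarray(samples)
    if np.issubdtype(samples.dtype, np.signedinteger):
        # First normalize integers to [-1, 1] using their bit width.
        bits = np.iinfo(samples.dtype).bits
        unit = samples.astype(np.float32) / (2 ** (bits - 1) - 1)
    else:
        unit = samples.astype(np.float32)
    return WAV_SCALE_FACTOR * unit

pcm = np.array([0, 16384, 32767, -32767], dtype=np.int16)
print(to_kaldi_range(pcm))  # int16 values are preserved up to float32 rounding
print(to_kaldi_range(np.array([0.5], dtype=np.float32)))  # 0.5 maps to 16383.5
```

For 16-bit PCM the two steps cancel out, so the samples keep their integer magnitudes; for float input in [-1, 1] the signal is stretched to the 16-bit range.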

In [7]:
WAV_SCALE_FACTOR = 2**15 - 1  # Kaldi expects samples in [-(2^15-1), 2^15-1]


class AudioSegment(object):
    """Monaural audio segment abstraction.
    :param samples: Audio samples [num_samples x num_channels].
    :type samples: ndarray.float32
    :param sample_rate: Audio sample rate.
    :type sample_rate: int
    :raises TypeError: If the sample data type is not float or int.
    """

    def __init__(self, samples, sample_rate, target_sr=16000, trim=False,
                 trim_db=60):
        """Create audio segment from samples.
        Samples are converted to float32 internally; integers are first
        scaled to [-1, 1], then everything is multiplied by WAV_SCALE_FACTOR.
        """
        samples = self._convert_samples_to_float32(samples)
        if target_sr is not None and target_sr != sample_rate:
            # Keyword arguments work with both older and newer librosa releases
            samples = librosa.resample(samples, orig_sr=sample_rate,
                                       target_sr=target_sr)
            sample_rate = target_sr
        if trim:
            samples, _ = librosa.effects.trim(samples, top_db=trim_db)
        self._samples = samples
        self._sample_rate = sample_rate
        if self._samples.ndim >= 2:
            self._samples = np.mean(self._samples, 1)

    @staticmethod
    def _convert_samples_to_float32(samples):
        """Convert sample type to float32.
        Audio sample type is usually integer or floating-point.
        Integers are scaled to [-1, 1] in float32 before the Kaldi scale
        factor is applied.
        """
        float32_samples = samples.astype('float32')
        # np.issubdtype replaces np.sctypes, which was removed in NumPy 2.0
        if np.issubdtype(samples.dtype, np.signedinteger):
            bits = np.iinfo(samples.dtype).bits
            float32_samples *= (1. / ((2 ** (bits - 1)) - 1))
        elif np.issubdtype(samples.dtype, np.floating):
            pass
        else:
            raise TypeError("Unsupported sample type: %s." % samples.dtype)
        return WAV_SCALE_FACTOR * float32_samples

    @classmethod
    def from_file(cls, filename, target_sr=16000, offset=0, duration=0,
                  min_duration=0, trim=False):
        """
        Load a file supported by soundfile and return it as an AudioSegment.
        :param filename: path of file to load
        :param target_sr: the desired sample rate
        :param offset: offset in seconds when loading audio
        :param duration: duration in seconds when loading audio
        :param min_duration: minimum duration in seconds; shorter audio is zero-padded
        :param trim: if true, trim leading and trailing silence
        :return: AudioSegment instance
        """
        with sf.SoundFile(filename, 'r') as f:
            dtype_options = {'PCM_16': 'int16', 'PCM_32': 'int32', 'FLOAT': 'float32'}
            dtype_file = f.subtype
            if dtype_file in dtype_options:
                dtype = dtype_options[dtype_file]
            else:
                dtype = 'float32'
            sample_rate = f.samplerate
            if offset > 0:
                f.seek(int(offset * sample_rate))
            if duration > 0:
                samples = f.read(int(duration * sample_rate), dtype=dtype)
            else:
                samples = f.read(dtype=dtype)

        num_zero_pad = int(target_sr * min_duration - samples.shape[0])
        if num_zero_pad > 0:
            samples = np.pad(samples, [0, num_zero_pad], mode='constant')

        samples = samples.transpose()
        return cls(samples, sample_rate, target_sr=target_sr, trim=trim)

    @property
    def samples(self):
        return self._samples.copy()

    @property
    def sample_rate(self):
        return self._sample_rate

In [8]:
# Read one chunk of audio from an open soundfile.SoundFile
def get_audio_chunk_from_soundfile(sound_file, chunk_size):

    dtype_options = {'PCM_16': 'int16', 'PCM_32': 'int32', 'FLOAT': 'float32'}
    dtype_file = sound_file.subtype
    if dtype_file in dtype_options:
        dtype = dtype_options[dtype_file]
    else:
        dtype = 'float32'
    audio_signal = sound_file.read(chunk_size, dtype=dtype)
    end = False
    # A short read means the file is exhausted: zero-pad to chunk size
    if len(audio_signal) < chunk_size:
        end = True
        audio_signal = np.pad(audio_signal, (0, chunk_size - len(audio_signal)),
                              mode='constant')
    return audio_signal, end


# Generator that yields (samples, sample_rate, start, end) for each chunk of a file
def audio_generator_from_file(input_filename, target_sr, chunk_duration):

    # The context manager closes the file even if the generator is not exhausted
    with soundfile.SoundFile(input_filename, 'r') as sound_file:
        chunk_size = int(chunk_duration * sound_file.samplerate)
        start = True
        end = False

        while not end:

            audio_signal, end = get_audio_chunk_from_soundfile(sound_file, chunk_size)

            audio_segment = AudioSegment(audio_signal, sound_file.samplerate, target_sr)

            yield audio_segment.samples, target_sr, start, end
            start = False


### Loading data

We load and play a wave file from the LibriSpeech data set. The LibriSpeech data set is organized into directories and subdirectories containing speech segments and transcripts for different speakers. 
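
To make the directory layout concrete, the sketch below builds the audio and transcript paths for one utterance from its ID, which has the form `<speaker>-<chapter>-<utterance>`. The helper name `librispeech_paths` is hypothetical, and the `.wav` extension assumes the converted files shown in the listing that follows.

```python
import os

def librispeech_paths(root, utterance_id, ext=".wav"):
    """Split '<speaker>-<chapter>-<utterance>' and build audio/transcript paths."""
    speaker, chapter, _ = utterance_id.split("-")
    folder = os.path.join(root, speaker, chapter)
    audio = os.path.join(folder, utterance_id + ext)
    # Each chapter folder holds one transcript file covering all its utterances.
    transcript = os.path.join(folder, "%s-%s.trans.txt" % (speaker, chapter))
    return audio, transcript

audio, trans = librispeech_paths("/Kaldi/data/data/LibriSpeech/test-clean",
                                 "1089-134686-0000")
print(audio)
print(trans)
```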

In [9]:
!ls /Kaldi/data/data/LibriSpeech/test-clean/1089/134686/

1089-134686-0000.flac 1089-134686-0013.flac 1089-134686-0026.flac
1089-134686-0000.wav 1089-134686-0013.wav 1089-134686-0026.wav
1089-134686-0001.flac 1089-134686-0014.flac 1089-134686-0027.flac
1089-134686-0001.wav 1089-134686-0014.wav 1089-134686-0027.wav
1089-134686-0002.flac 1089-134686-0015.flac 1089-134686-0028.flac
1089-134686-0002.wav 1089-134686-0015.wav 1089-134686-0028.wav
1089-134686-0003.flac 1089-134686-0016.flac 1089-134686-0029.flac
1089-134686-0003.wav 1089-134686-0016.wav 1089-134686-0029.wav
1089-134686-0004.flac 1089-134686-0017.flac 1089-134686-0030.flac
1089-134686-0004.wav 1089-134686-0017.wav 1089-134686-0030.wav
1089-134686-0005.flac 1089-134686-0018.flac 1089-134686-0031.flac
1089-134686-0005.wav 1089-134686-0018.wav 1089-134686-0031.wav
1089-134686-0006.flac 1089-134686-0019.flac 1089-134686-0032.flac
1089-134686-0006.wav 1089-134686-0019.wav 1089-134686-0032.wav
1089-134686-0007.flac 1089-134686-0020.flac 1089-134686-0033.flac
1089-134686-0007

In [10]:
import IPython.display as ipd

DIR = "1089"
SUBDIR = "134686"
FILE_ID = "0000"
FILE_NAME = "/Kaldi/data/data/LibriSpeech/test-clean/%s/%s/%s-%s-%s.wav"%(DIR, SUBDIR, DIR, SUBDIR, FILE_ID)
TRANSCRIPTION_FILE = "/Kaldi/data/data/LibriSpeech/test-clean/%s/%s/%s-%s.trans.txt"%(DIR, SUBDIR, DIR, SUBDIR)

# Look up the transcript line for this utterance and drop the leading utterance ID
batcmd = "cat %s|grep %s"%(TRANSCRIPTION_FILE, FILE_ID)
res = subprocess.check_output(batcmd, shell=True)
transcript = " ".join(res.decode('utf-8').split(" ")[1:]).lower()

print(transcript)
ipd.Audio(FILE_NAME)

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fattened sauce



Next, we define a helper function that generates pairs of file path and transcript from a LibriSpeech data directory.

In [11]:
def libri_generator(DATASET_ROOT):
    for subdir in os.listdir(DATASET_ROOT):
        SUBDIR = os.path.join(DATASET_ROOT, subdir)
        if os.path.isdir(SUBDIR):
            for subsubdir in os.listdir(SUBDIR):
                SUBSUBDIR = os.path.join(SUBDIR, subsubdir)
                # SUBSUBDIR already contains DATASET_ROOT, so join only the file name
                transcription_file = os.path.join(SUBSUBDIR, "%s-%s.trans.txt"%(subdir, subsubdir))
                transcriptions = {}
                with open(transcription_file, "r") as f:
                    for line in f:
                        # Each line is "<utterance id> <transcript words...>"
                        fields = line.split(" ")
                        transcriptions[fields[0]] = " ".join(fields[1:])
                for file_key, transcript in transcriptions.items():
                    file_path = os.path.join(SUBSUBDIR, file_key+'.wav')
                    yield file_path, transcript.strip().lower()

In [12]:
datagen = libri_generator("/Kaldi/data/data/LibriSpeech/test-clean/")
filepath, transcript = next(datagen)

In [13]:
print(transcript)
import IPython.display as ipd
ipd.Audio(filepath)

natty harmon tried the kitchen pump secretly several times during the evening for the water had to run up hill all the way from the well to the kitchen sink and he believed this to be a continual miracle that might give out at any moment



## 4. Inference

We first create an inference context object that connects to the Kaldi TRTIS server via a gRPC connection.

The server expects chunks of audio, each containing up to input.WAV_DATA.dims samples (default: 8160). By default, this corresponds to 510 ms of audio per chunk at a 16000 Hz sampling rate. The last chunk may be partial, that is, smaller than this maximum.
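
The arithmetic behind these numbers can be sketched as follows. The helper `chunk_stats` is hypothetical, and the 10-second clip is an arbitrary illustration rather than a file from the dataset.

```python
import math

def chunk_stats(num_samples, chunk_size=8160, sample_rate=16000):
    """Return (chunk duration in seconds, number of chunks needed)."""
    chunk_seconds = chunk_size / sample_rate
    # The final chunk may be partial, hence the ceiling.
    num_chunks = math.ceil(num_samples / chunk_size)
    return chunk_seconds, num_chunks

seconds, chunks = chunk_stats(num_samples=10 * 16000)  # a hypothetical 10 s clip
print(seconds, chunks)  # 0.51 20
```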

In [14]:
from tensorrtserver.api import *
protocol = ProtocolType.from_str("grpc")

CORRELATION_ID = 1101
ctx = InferContext(FLAGS.url, protocol, FLAGS.model_name, FLAGS.model_version,
 correlation_id=CORRELATION_ID, verbose=True,
 streaming=False)

Next, we take chunks from a selected audio file (each 510ms in duration, containing 8160 samples) and stream them sequentially to the Kaldi server. The server processes each chunk as soon as it is received. The transcription result is returned upon receiving the final chunk.
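
The chunking with start and end markers can be illustrated on an in-memory array, without a server or an audio file. This is a simplified sketch: `iter_chunks` is a hypothetical helper, and it zero-pads the final chunk in the same spirit as the file reader defined earlier.

```python
import numpy as np

def iter_chunks(signal, chunk_size):
    """Yield (chunk, is_first, is_last); the final chunk is zero-padded."""
    num_chunks = max(1, -(-len(signal) // chunk_size))  # ceiling division
    for i in range(num_chunks):
        chunk = signal[i * chunk_size:(i + 1) * chunk_size]
        if len(chunk) < chunk_size:
            chunk = np.pad(chunk, (0, chunk_size - len(chunk)), mode="constant")
        yield chunk, i == 0, i == num_chunks - 1

signal = np.arange(10, dtype=np.float32)  # stand-in for audio samples
for chunk, is_first, is_last in iter_chunks(signal, chunk_size=4):
    print(chunk, is_first, is_last)
```

Only the first chunk carries the start marker and only the last carries the end marker, which is what lets the server tie all chunks of one utterance to a single sequence.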

In the following cell, you can either choose a specific .wav file or take one from the LibriSpeech data generator defined above.

In [15]:
## Take a specific file
DIR = "1089"
SUBDIR = "134686"
FILE_ID = "0000"
FILE_NAME = "/Kaldi/data/data/LibriSpeech/test-clean/%s/%s/%s-%s-%s.wav"%(DIR, SUBDIR, DIR, SUBDIR, FILE_ID)
TRANSCRIPTION_FILE = "/Kaldi/data/data/LibriSpeech/test-clean/%s/%s/%s-%s.trans.txt"%(DIR, SUBDIR, DIR, SUBDIR)
batcmd = "cat %s|grep %s"%(TRANSCRIPTION_FILE, FILE_ID)
res = subprocess.check_output(batcmd, shell=True)
transcript = " ".join(res.decode('utf-8').split(" ")[1:]).lower()


## Alternatively, take a file from the data generator
#FILE_NAME, transcript = next(datagen)

cnt = 0
# Each item is (samples, sample_rate, start, end)
for audio_chunk in audio_generator_from_file(FILE_NAME, 16000, 0.51):
    print("Chunk ", cnt, audio_chunk[0].shape, audio_chunk[1], audio_chunk[2], audio_chunk[3])
    cnt += 1
    # Mark the first and last chunk of the sequence
    flags = InferRequestHeader.FLAG_NONE
    if audio_chunk[2]:
        flags = flags | InferRequestHeader.FLAG_SEQUENCE_START
    if audio_chunk[3]:
        flags = flags | InferRequestHeader.FLAG_SEQUENCE_END

    if not audio_chunk[3]: # if not end of sequence
        # Intermediate chunks request no outputs
        ctx.run({'WAV_DATA' : (audio_chunk[0],),
                 'WAV_DATA_DIM' : (np.full(shape=1, fill_value=len(audio_chunk[0]), dtype=np.int32),)
                },
                {},
                batch_size=1,
                flags=flags,
                corr_id=CORRELATION_ID)
    else:
        # The final chunk requests the transcription
        result = ctx.run({'WAV_DATA' : (audio_chunk[0],),
                          'WAV_DATA_DIM' : (np.full(shape=1, fill_value=len(audio_chunk[0]), dtype=np.int32),)
                         },
                         { 'TEXT' : InferContext.ResultFormat.RAW},
                         batch_size=1,
                         flags=flags,
                         corr_id=CORRELATION_ID)
print("ASR output: %s" % "".join([c.decode('utf-8') for c in result['TEXT'][0]]).lower())
if transcript:
    print("Ground truth: %s"%transcript.lower())

Chunk 0 (8160,) 16000 True False
Chunk 1 (8160,) 16000 False False
Chunk 2 (8160,) 16000 False False
Chunk 3 (8160,) 16000 False False
Chunk 4 (8160,) 16000 False False
Chunk 5 (8160,) 16000 False False
Chunk 6 (8160,) 16000 False False
Chunk 7 (8160,) 16000 False False
Chunk 8 (8160,) 16000 False False
Chunk 9 (8160,) 16000 False False
Chunk 10 (8160,) 16000 False False
Chunk 11 (8160,) 16000 False False
Chunk 12 (8160,) 16000 False False
Chunk 13 (8160,) 16000 False False
Chunk 14 (8160,) 16000 False False
Chunk 15 (8160,) 16000 False False
Chunk 16 (8160,) 16000 False False
Chunk 17 (8160,) 16000 False False
Chunk 18 (8160,) 16000 False False
Chunk 19 (8160,) 16000 False False
Chunk 20 (8160,) 16000 False True
ASR output: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flower fatten sauce 
Ground truth: he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mut
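
One way to quantify how closely the ASR output matches the ground truth is to count word-level edits (substitutions, insertions, deletions) and divide by the number of reference words. The sketch below is illustrative and not part of the original notebook: `word_errors` is a hypothetical helper, and the two strings are the final four words of the transcripts shown above.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

reference = "peppered flour fattened sauce"
hypothesis = "peppered flower fatten sauce"
errors = word_errors(reference, hypothesis)
print(errors, errors / len(reference.split()))  # 2 0.5
```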

# Conclusion

In this notebook, we have walked through the complete process of preparing the audio data and carrying out inference with the Kaldi ASR model.

## What's next
Now it's time to try the Kaldi ASR model on your own data.
