# Deploying the Jasper Inference model using Triton Inference Server

This subfolder of the Jasper for PyTorch repository contains scripts for deployment of high-performance inference on NVIDIA Triton Inference Server, as well as detailed performance analysis. It offers different options for the inference model pipeline.

## Table Of Contents

- [Solution overview](#solution-overview)
- [Inference Pipeline in Triton Inference Server](#inference-pipeline-in-triton-inference-server)
- [Setup](#setup)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
   * [Scripts and sample code](#scripts-and-sample-code)
- [Performance](#performance)
   * [Inference Benchmarking in Triton Inference Server](#inference-benchmarking-in-triton-inference-server)
   * [Results](#results)
   * [Performance Analysis for Triton Inference Server: NVIDIA T4](#performance-analysis-for-triton-inference-server-nvidia-t4)
   * [Maximum batch size](#maximum-batch-size)
   * [Batching techniques: Static versus Dynamic Batching](#batching-techniques-static-versus-dynamic)
   * [TensorRT, ONNXRT-CUDA, and PyTorch JIT comparisons](#tensorrt-onnxrt-cuda-and-pytorch-jit-comparisons)
- [Release Notes](#release-notes)
   * [Changelog](#change-log)
   * [Known issues](#known-issues)

## Solution Overview

The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.

This folder contains detailed performance analysis as well as scripts to run Jasper inference using Triton Inference Server.

A typical Triton Inference Server pipeline can be broken down into the following steps:

1. The client serializes the inference request into a message and sends it to the server (Client Send).
2. The message travels over the network from the client to the server (Network).
3. The message arrives at the server and is deserialized (Server Receive).
4. The request is placed on the queue (Server Queue).
5. The request is removed from the queue and computed (Server Compute).
6. The completed request is serialized in a message and sent back to the client (Server Send).
7. The completed message travels over the network from the server to the client (Network).
8. The completed message is deserialized by the client and processed as a completed inference request (Client Receive).

Generally, for local clients, steps 1-4 and 6-8 occupy only a small fraction of the total time compared to step 5. Since backend deep learning systems like Jasper are rarely exposed directly to end users and instead only interface with local front-end servers, we can treat all clients as local for the purposes of this analysis.

In this section, we go over how to launch both the Triton Inference Server and the client, and how to find the deployment option that best fits your specific application needs. More information on how to perform inference using NVIDIA Triton Inference Server can be found in the [Triton Inference Server README](https://github.com/triton-inference-server/server/blob/master/README.md).

## Inference Pipeline in Triton Inference Server

The Jasper model pipeline consists of three components, each of which can be configured to use a different backend:

**Data preprocessor**

The data preprocessor transforms a raw input audio file into a spectrogram. By default, the pipeline uses mel filter banks as spectrogram features. This component does not have any learnable weights.
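For illustration, a minimal sketch of mel filter bank feature extraction with `torchaudio` is shown below. The window, hop, and filter bank sizes are assumptions and do not necessarily match the exact parameters used by the Jasper data preprocessor; the input file name is a placeholder.

```python
# Minimal mel filter bank feature extraction sketch (illustrative only).
# The STFT/filter-bank parameters below are assumptions, not the exact
# values used by the Jasper data preprocessor.
import torch
import torchaudio

# Load a mono 16 kHz recording; "sample.wav" is a placeholder file name.
waveform, sample_rate = torchaudio.load("sample.wav")  # shape: (channels, samples)

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=512,        # FFT size
    win_length=320,   # 20 ms window at 16 kHz
    hop_length=160,   # 10 ms hop at 16 kHz
    n_mels=64,        # number of mel filter banks
)

# Log-mel spectrogram passed on to the acoustic model: (channels, n_mels, frames)
features = torch.log(mel_transform(waveform) + 1e-20)
print(features.shape)
```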
**Acoustic model**

The acoustic model takes the spectrogram as input and outputs a probability distribution over a list of characters. This component is the most compute-intensive part, accounting for more than 90% of the end-to-end pipeline's execution time. The acoustic model is the only component with learnable parameters and is what differentiates Jasper from other end-to-end neural speech recognition models. In the original paper, the acoustic model contains a masking operation used during training (more details in the [Jasper PyTorch README](../README.md)); we do not use masking for inference.

**Greedy decoder**

The decoder takes the character probabilities and outputs the final transcription. Greedy decoding is a fast and simple way of doing this: it always chooses the character with the highest probability.

To run a model with TensorRT, we first construct the model in PyTorch, which is then exported into an ONNX static graph. Finally, a TensorRT engine is built from the ONNX file and can be launched to perform inference. The following table shows which backends are supported for each component of the model pipeline.

|Backend\Pipeline component|Data preprocessor|Acoustic Model|Decoder|
|---|---|---|---|
|PyTorch JIT|x|x|x|
|ONNX|-|x|-|
|TensorRT|-|x|-|

In order to run inference with TensorRT outside of the inference server, refer to the [Jasper TensorRT README](../tensorrt/README.md).

## Setup

The repository contains a folder `./triton` with a `Dockerfile` which extends the PyTorch 20.10-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:

- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
- [Triton Inference Server 20.10 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
- Access to the [NVIDIA machine learning repository](https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb) and [NVIDIA CUDA repository](https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb) for NVIDIA TensorRT 6
- Supported GPUs:
   - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
   - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
   - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
- [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp)

Required Python packages are listed in `requirements.txt`. These packages are automatically installed when the Docker container is built.

## Quick Start Guide

Running the following scripts will build and launch a container with all required dependencies for native PyTorch as well as Triton. The container is required for running inference and can also be used for data download, processing, and training of the model. For more information on the scripts and arguments, refer to the [Advanced](#advanced) section.

1. Clone the repository.

```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
```

2. Build the Jasper PyTorch container.

Running the following script builds the container, which contains all required dependencies for data download and processing as well as model conversion.
```bash
bash scripts/docker/build.sh
```

3. Start an interactive session in the Docker container:

```bash
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
```

Where `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` can be either empty or absolute directory paths to the dataset, existing checkpoints, or potential output files. When left empty, they default to `datasets/`, `checkpoints/`, and `results/`, respectively. The `/datasets`, `/checkpoints`, and `/results` directories will be mounted as volumes inside the container and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` on the host.

Note that `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` directly correspond to the same arguments of `scripts/docker/launch.sh` and `trt/scripts/docker/launch.sh` mentioned in the [Jasper PyTorch README](../README.md) and the [Jasper TensorRT README](../tensorrt/README.md).

Briefly, `<DATA_DIR>` should contain, or be prepared to contain, a `LibriSpeech` sub-directory (created in [Acquiring Dataset](../trt/README.md)), `<CHECKPOINT_DIR>` should contain a PyTorch model checkpoint (`*.pt`) file obtained through training as described in the [Jasper PyTorch README](../README.md), and `<RESULT_DIR>` should be prepared to hold the converted models and logs.

4. Downloading the `test-clean` part of `LibriSpeech` is required for model conversion, but it is not required for inference on Triton Inference Server, which can use a single `.wav` audio file. To download and preprocess LibriSpeech, run the following inside the container:

```bash
bash triton/scripts/download_triton_librispeech.sh
bash triton/scripts/preprocess_triton_librispeech.sh
```

5. (Option 1) Convert the pretrained PyTorch model checkpoint into Triton Inference Server compatible model backends.

Inside the container, run:

```bash
export CHECKPOINT_PATH=<CHECKPOINT_PATH>
export CONVERT_PRECISIONS=<CONVERT_PRECISIONS>
export CONVERTS=<CONVERTS>
bash triton/scripts/export_model.sh
```

Where `<CHECKPOINT_PATH>` (`"/checkpoints/jasper_fp16.pt"`) is the absolute file path of the pretrained checkpoint, `<CONVERT_PRECISIONS>` (`"fp16" "fp32"`) is the list of precisions used for conversion, and `<CONVERTS>` (`"feature-extractor" "decoder" "ts-trace" "onnx" "tensorrt"`) is the list of conversions to be applied. The feature extractor can be converted only to a TorchScript trace module (`feature-extractor`), the decoder only to a TorchScript script module (`decoder`), and the Jasper acoustic model to a TorchScript trace module (`ts-trace`), ONNX (`onnx`), or TensorRT (`tensorrt`). A simplified export sketch is shown below, after step 6.

A pretrained PyTorch model checkpoint for model conversion can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_ckpt_amp).

More details can be found in the [Advanced](#advanced) section under [Scripts and sample code](#scripts-and-sample-code).

6. (Option 2) Download pre-exported inference checkpoints from NGC.

Alternatively, you can skip the manual model export and download already generated model backends for every version of the model pipeline:

* [Jasper_ONNX](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_onnx_fp16_amp/version),
* [Jasper_TorchScript](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_torchscript_fp16_amp/version),
* [Jasper_TensorRT_Turing](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_turing/version),
* [Jasper_TensorRT_Volta](https://ngc.nvidia.com/catalog/models/nvidia:jasper_pyt_trt_fp16_amp_volta/version).

If you wish to use the TensorRT pipeline, make sure to download the correct version for your hardware. The extracted model folder should contain 3 subfolders `feature-extractor-ts-trace`, `decoder-ts-script` and `jasper-x`, where `x` can be `ts-trace`, `onnx`, or `tensorrt` depending on the model backend. Copy the 3 model folders to the directory `./triton/model_repo/fp16` in your Jasper project.
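As background for the conversions in steps 5 and 6, the sketch below shows, under simplified assumptions, what the TorchScript trace (`ts-trace`) and ONNX (`onnx`) conversions roughly involve. The module, input shape, and tensor names are illustrative placeholders; the actual conversion for Jasper is handled by `triton/converter.py` via `triton/scripts/export_model.sh`.

```python
# Simplified sketch of TorchScript tracing and ONNX export (illustrative only).
# The module, input shape, and tensor names below are placeholders; the real
# conversion for Jasper is handled by triton/converter.py.
import torch

model = torch.nn.Sequential(           # stand-in for the Jasper acoustic model
    torch.nn.Conv1d(64, 128, kernel_size=11, padding=5),
    torch.nn.ReLU(),
).eval()

dummy_input = torch.randn(1, 64, 300)  # (batch, features, time)

# TorchScript trace module ("ts-trace"): records the ops executed on the example input.
traced = torch.jit.trace(model, dummy_input)
traced.save("jasper-ts-trace.pt")

# ONNX export ("onnx"): a static graph that TensorRT can consume to build an engine.
torch.onnx.export(
    model,
    dummy_input,
    "jasper.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch", 2: "time"}, "output": {0: "batch", 2: "time"}},
    opset_version=11,
)
```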
7. Build a container that extends the Triton Inference Client.

From outside the container, run:

```bash
bash triton/scripts/docker/build_triton_client.sh
```

Once the above steps are completed, you can either run inference benchmarks or perform inference on real data.

8. (Option 1) Run all inference benchmarks.

From outside the container, run:

```bash
export RESULT_DIR=<RESULT_DIR>
export PRECISION_TESTS=<PRECISION_TESTS>
export BATCH_SIZES=<BATCH_SIZES>
export SEQ_LENS=<SEQ_LENS>
bash triton/scripts/execute_all_perf_runs.sh
```

Where `<RESULT_DIR>` is the absolute path to potential output files (`./results`), `<PRECISION_TESTS>` is a list of precisions to be tested (`"fp16" "fp32"`), `<BATCH_SIZES>` is a list of tested batch sizes (`"1" "2" "4" "8"`), and `<SEQ_LENS>` is a list of tested sequence lengths (`"32000" "112000" "267200"`).

Note: This can take several hours to complete due to the extensiveness of the benchmark. More details about the benchmark are found in the [Advanced](#advanced) section under [Performance](#performance).

9. (Option 2) Run inference on real data using the Client and Triton Inference Server.

9.1 From outside the container, restart the server:

```bash
bash triton/scripts/run_server.sh
```

9.2 From outside the container, submit the client request using:

```bash
bash triton/scripts/run_client.sh
```

The script takes a model type, a precision, a data directory, and an audio file as arguments. The model type can be either `ts-trace`, `tensorrt`, or `onnx`, and the precision is either `fp32` or `fp16`. The data directory is an absolute local path to the directory of files, and the audio file is the path, relative to the data directory, to either an audio file in `.wav` format or a manifest file in `.json` format.

Note: If a `.json` manifest is given, the data directory should be the path to the LibriSpeech dataset. In this case, the script performs both inference and evaluation on the corresponding LibriSpeech dataset.

## Advanced

The following sections provide greater detail about the Triton Inference Server pipeline, inference analysis, and benchmarking results.

### Scripts and sample code

The `triton/` directory contains the following files:
* `jasper-client.py`: Python client script that takes an audio file and a specific model pipeline type and submits a client request to the server to run inference with the model on the given audio file.
* `speech_utils.py`: helper functions for `jasper-client.py`.
* `converter.py`: Python script for model conversion to different backends.
* `jasper_module.py`: helper functions for `converter.py`.
* `model_repo_configs/`: directory with Triton model config files for different backend and precision configurations.

The `triton/scripts/` directory contains easy-to-use scripts for the supported functionality:
* `./docker/build_triton_client.sh`: builds the client container
* `execute_all_perf_runs.sh`: runs all benchmarks using the Triton Inference Server performance client; calls `generate_perf_results.sh`
* `export_model.sh`: generates backends from a pretrained PyTorch checkpoint for every version of the model inference pipeline
* `prepare_model_repository.sh`: copies model config files from `./model_repo_configs/` to `./deploy/model_repo` and creates links to the generated model backends, setting up the model repository for Triton Inference Server
* `generate_perf_results.sh`: runs a benchmark with `perf-client` for a specific configuration and calls `run_perf_client.sh`
* `run_server.sh`: launches the Triton Inference Server
* `run_client.sh`: launches the client via `jasper-client.py` to submit inference requests to the server
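For orientation, a client request to a running Triton Inference Server looks roughly like the sketch below. It uses the generic `tritonclient` Python package rather than the exact code in `jasper-client.py`, and the model, input, and output names are assumptions; refer to `jasper-client.py` and the config files in `model_repo_configs/` for the names actually used by this repository.

```python
# Rough sketch of a Triton client request (illustrative only).
# Uses the generic tritonclient package; jasper-client.py differs in detail,
# and the model/tensor names below are assumptions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical preprocessed audio features, e.g. (batch, n_mels, frames).
features = np.random.randn(1, 64, 300).astype(np.float32)

infer_input = httpclient.InferInput("input__0", list(features.shape), "FP32")
infer_input.set_data_from_numpy(features)
requested_output = httpclient.InferRequestedOutput("output__0")

response = client.infer(
    model_name="jasper-ts-trace",  # assumed model name in the repository
    inputs=[infer_input],
    outputs=[requested_output],
)
logits = response.as_numpy("output__0")
print(logits.shape)
```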
### Running the Triton Inference Server

Launch the Triton Inference Server in detached mode (the default), so that it runs in the background:

```bash
bash triton/scripts/run_server.sh
```

To run it interactively in the foreground, for debugging purposes, run:

```bash
DAEMON="--detach=false" bash triton/scripts/run_server.sh
```

The script mounts the model repository at `$PWD/triton/deploy/model_repo` and loads the models into the server with all visible GPUs. To select devices explicitly, set `NVIDIA_VISIBLE_DEVICES`.

### Running the Triton Inference Client

*Real data*

In order to run the client with real data, run:

```bash
bash triton/scripts/run_client.sh