updated onnx runtime info

This commit is contained in:
gkarch 2021-03-02 12:37:12 +01:00
parent 8d9f74b9bf
commit 1398d39508
5 changed files with 8 additions and 6 deletions

@@ -808,6 +808,9 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
## Release notes
### Changelog
March 2021
* Updated ONNX runtime information
February 2021
* Added DALI pipeline for on-the-fly data processing and augmentation on CPU or GPU
* Revised training recipe: ~10% relative improvement in Word Error Rate (WER)

Three binary image files changed (previews not shown): 71 KiB → 81 KiB, 68 KiB → 77 KiB, 68 KiB → 77 KiB.

@@ -13,11 +13,10 @@ This subfolder of the Jasper for PyTorch repository contains scripts for deploy
- [Performance](#performance)
* [Inference Benchmarking in Triton Inference Server](#inference-benchmarking-in-triton-inference-server)
* [Results](#results)
* [Performance Analysis for Triton Inference Server: NVIDIA T4
](#performance-analysis-for-triton-inference-server-nvidia-t4)
* [Performance Analysis for Triton Inference Server: NVIDIA T4](#performance-analysis-for-triton-inference-server-nvidia-t4)
* [Maximum batch size](#maximum-batch-size)
* [Batching techniques: Static versus Dynamic Batching](#batching-techniques-static-versus-dynamic)
* [TensorRT, ONNX, and PyTorch JIT comparisons](#tensorrt-onnx-and-pytorch-jit-comparisons)
* [TensorRT, ONNXRT-CUDA, and PyTorch JIT comparisons](#tensorrt-onnxrt-cuda-and-pytorch-jit-comparisons)
- [Release Notes](#release-notes)
* [Changelog](#change-log)
* [Known issues](#known-issues)
@@ -327,7 +326,7 @@ Figure 5: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Ba
![](../images/tensorrt_16.7s.png)
Figure 6: Triton pipeline - Latency & Throughput vs Concurrency using dynamic Batching at maximum server batch size = 8, max_queue_delay_microseconds = 5000, input audio length = 16.7 seconds, TensorRT backend.
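For orientation, below is a minimal Python sketch of issuing a single request to the Triton pipeline with the generic `tritonclient` HTTP API. It is not the repository's own client script, and the model name, tensor names, and feature shape are placeholders rather than the actual Jasper deployment configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a local Triton Inference Server on its default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder feature tensor standing in for preprocessed audio
# (batch, features, time); the real Jasper pipeline uses different shapes.
features = np.random.rand(1, 64, 256).astype(np.float32)

# "input__0" / "output__0" and "jasper-trt" are hypothetical names,
# not the names used by this repository's model configs.
inp = httpclient.InferInput("input__0", list(features.shape), "FP32")
inp.set_data_from_numpy(features)
out = httpclient.InferRequestedOutput("output__0")

result = client.infer(model_name="jasper-trt", inputs=[inp], outputs=[out])
logits = result.as_numpy("output__0")
print(logits.shape)
```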
##### TensorRT, ONNX, and PyTorch JIT comparisons
##### TensorRT, ONNXRT-CUDA, and PyTorch JIT comparisons
The following tables show inference throughput and latency comparisons across all 3 backends for mixed precision and static batching. The main observations are:
* Increasing the batch size leads to higher inference throughput and latency up to a certain batch size, after which throughput slowly saturates.
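As a sanity check on the speedup columns in the tables below, the short Python sketch that follows reproduces them from the raw throughput numbers (using the first row of the throughput table: audio length 2.0 seconds, batch size 1). The assumption here is that speedup is simply the TensorRT throughput divided by the other backend's throughput.

```python
# Throughput values (inf/s) from the first row of the throughput table below
# (audio length 2.0 s, batch size 1).
tensorrt_inf_s = 49.67
pytorch_inf_s = 55.67
onnxrt_cuda_inf_s = 41.67

def speedup(reference: float, baseline: float) -> float:
    """Relative speedup of `reference` over `baseline` as a ratio of throughputs."""
    return reference / baseline

print(f"TensorRT/PyTorch speedup:     {speedup(tensorrt_inf_s, pytorch_inf_s):.2f}")      # ~0.89
print(f"TensorRT/ONNXRT-CUDA speedup: {speedup(tensorrt_inf_s, onnxrt_cuda_inf_s):.2f}")  # ~1.19
```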
@@ -337,7 +336,7 @@ The longer the audio length, the lower the throughput and the higher the latency
The following table shows the throughput benchmark results for all 3 model backends in Triton Inference Server using static batching under optimal concurrency.
|Audio length in seconds|Batch Size|TensorRT (inf/s)|PyTorch (inf/s)|ONNX (inf/s)|TensorRT/PyTorch Speedup|TensorRT/Onnx Speedup|
|Audio length in seconds|Batch Size|TensorRT (inf/s)|PyTorch (inf/s)|ONNXRT-CUDA (inf/s)|TensorRT/PyTorch Speedup|TensorRT/ONNXRT-CUDA Speedup|
|--- |--- |--- |--- |--- |--- |--- |
| 2.0| 1| 49.67| 55.67| 41.67| 0.89| 1.19|
| 2.0| 2| 98.67| 96.00| 77.33| 1.03| 1.28|
@@ -356,7 +355,7 @@ The following table shows the throughput benchmark results for all 3 model backe
The following table shows the latency benchmark results for all 3 model backends in Triton Inference Server using static batching and a single concurrent request.
|Audio length in seconds|Batch Size|TensorRT (ms)|PyTorch (ms)|ONNX (ms)|TensorRT/PyTorch Speedup|TensorRT/Onnx Speedup|
|Audio length in seconds|Batch Size|TensorRT (ms)|PyTorch (ms)|ONNXRT-CUDA (ms)|TensorRT/PyTorch Speedup|TensorRT/ONNXRT-CUDA Speedup|
|--- |--- |--- |--- |--- |--- |--- |
| 2.0| 1| 23.61| 25.06| 31.84| 1.06| 1.35|
| 2.0| 2| 24.56| 25.11| 37.54| 1.02| 1.53|