Merge pull request #920 from NVIDIA/gh/release

Gh/release
nv-kkudrynski 2021-04-20 13:54:27 +02:00 committed by GitHub
commit bd257e1494
173 changed files with 124585 additions and 1565 deletions


@@ -30,7 +30,7 @@ The following table provides links to where you can find additional information
## Validation accuracy results
Our results were obtained by running the applicable
-training scripts in the [framework-container-name] NGC container
+training scripts in the 20.12 PyTorch NGC container
on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
The specific training script that was run is documented
in the corresponding model's README.
@@ -56,49 +56,48 @@ three classification models side-by-side.
Our results were obtained by running the applicable
-training scripts in the pytorch-20.12 NGC container
+training scripts in the 21.03 PyTorch NGC container
on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
-The following table shows the training accuracy results of the
-three classification models side-by-side.
+The following table shows the training performance results of
+all the classification models side-by-side.
| **Model** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** |
|:----------------------:|:-------------------:|:----------:|:---------------------------:|
| efficientnet-b0 | 14391 img/s | 8225 img/s | 1.74 x |
| efficientnet-b4 | 2341 img/s | 1204 img/s | 1.94 x |
| efficientnet-widese-b0 | 15053 img/s | 8233 img/s | 1.82 x |
| efficientnet-widese-b4 | 2339 img/s | 1202 img/s | 1.94 x |
| resnet50 | 15977 img/s | 7365 img/s | 2.16 x |
| resnext101-32x4d | 7399 img/s | 3193 img/s | 2.31 x |
| se-resnext101-32x4d | 5248 img/s | 2665 img/s | 1.96 x |
| efficientnet-b0 | 16652 img/s | 8193 img/s | 2.03 x |
| efficientnet-b4 | 2570 img/s | 1223 img/s | 2.1 x |
| efficientnet-widese-b0 | 16368 img/s | 8244 img/s | 1.98 x |
| efficientnet-widese-b4 | 2585 img/s | 1223 img/s | 2.11 x |
| resnet50 | 16621 img/s | 7248 img/s | 2.29 x |
| resnext101-32x4d | 7925 img/s | 3471 img/s | 2.28 x |
| se-resnext101-32x4d | 5779 img/s | 2991 img/s | 1.93 x |
### Training performance: NVIDIA DGX-1 16G (8x V100 16GB)
Our results were obtained by running the applicable
-training scripts in the pytorch-20.12 NGC container
+training scripts in the 21.03 PyTorch NGC container
on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
-The following table shows the training accuracy results of the
-three classification models side-by-side.
+The following table shows the training performance results of all the
+classification models side-by-side.
| **Model** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** |
|:----------------------:|:-------------------:|:----------:|:---------------------------:|
| efficientnet-b0 | 7664 img/s | 4571 img/s | 1.67 x |
| efficientnet-b4 | 1330 img/s | 598 img/s | 2.22 x |
| efficientnet-widese-b0 | 7694 img/s | 4489 img/s | 1.71 x |
| efficientnet-widese-b4 | 1323 img/s | 590 img/s | 2.24 x |
| resnet50 | 7608 img/s | 2851 img/s | 2.66 x |
| resnext101-32x4d | 3742 img/s | 1117 img/s | 3.34 x |
| se-resnext101-32x4d | 2716 img/s | 994 img/s | 2.73 x |
| efficientnet-b0 | 7789 img/s | 4672 img/s | 1.66 x |
| efficientnet-b4 | 1366 img/s | 616 img/s | 2.21 x |
| efficientnet-widese-b0 | 7875 img/s | 4592 img/s | 1.71 x |
| efficientnet-widese-b4 | 1356 img/s | 612 img/s | 2.21 x |
| resnet50 | 8322 img/s | 2855 img/s | 2.91 x |
| resnext101-32x4d | 4065 img/s | 1133 img/s | 3.58 x |
| se-resnext101-32x4d | 2971 img/s | 1004 img/s | 2.95 x |
## Model Comparison


@@ -520,7 +520,7 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
-Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -562,226 +562,234 @@ The following images show an A100 run.
##### Training performance: NVIDIA A100 (8x A100 80GB)
-Our results were obtained by running the applicable `efficientnet/training/<AMP|TF32>/*.sh` training script in the PyTorch 20.12 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/training/<AMP|TF32>/*.sh` training script in the PyTorch 21.03 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
| **Model** | **GPUs** | **TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 1082 img/s | 2364 img/s | 2.18 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 8225 img/s | 14391 img/s | 1.74 x | 7.59 x | 6.08 x |
| efficientnet-b4 | 1 | 154 img/s | 300 img/s | 1.94 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 1204 img/s | 2341 img/s | 1.94 x | 7.8 x | 7.8 x |
| efficientnet-widese-b0 | 1 | 1081 img/s | 2368 img/s | 2.19 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 8233 img/s | 15053 img/s | 1.82 x | 7.61 x | 6.35 x |
| efficientnet-widese-b4 | 1 | 154 img/s | 299 img/s | 1.94 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 1202 img/s | 2339 img/s | 1.94 x | 7.8 x | 7.81 x |
| **Model** | **GPUs** | **TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:-----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 1078 img/s | 2489 img/s | 2.3 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 8193 img/s | 16652 img/s | 2.03 x | 7.59 x | 6.68 x |
| efficientnet-b0 | 16 | 16137 img/s | 29332 img/s | 1.81 x | 14.96 x | 11.78 x |
| efficientnet-b4 | 1 | 157 img/s | 331 img/s | 2.1 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 1223 img/s | 2570 img/s | 2.1 x | 7.76 x | 7.75 x |
| efficientnet-b4 | 16 | 2417 img/s | 4813 img/s | 1.99 x | 15.34 x | 14.51 x |
| efficientnet-b4 | 32 | 4813 img/s | 9425 img/s | 1.95 x | 30.55 x | 28.42 x |
| efficientnet-b4 | 64 | 9146 img/s | 18900 img/s | 2.06 x | 58.05 x | 57.0 x |
| efficientnet-widese-b0 | 1 | 1078 img/s | 2512 img/s | 2.32 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 8244 img/s | 16368 img/s | 1.98 x | 7.64 x | 6.51 x |
| efficientnet-widese-b0 | 16 | 16062 img/s | 29798 img/s | 1.85 x | 14.89 x | 11.86 x |
| efficientnet-widese-b4 | 1 | 157 img/s | 331 img/s | 2.1 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 1223 img/s | 2585 img/s | 2.11 x | 7.77 x | 7.8 x |
| efficientnet-widese-b4 | 16 | 2399 img/s | 5041 img/s | 2.1 x | 15.24 x | 15.21 x |
| efficientnet-widese-b4 | 32 | 4616 img/s | 9379 img/s | 2.03 x | 29.32 x | 28.3 x |
| efficientnet-widese-b4 | 64 | 9140 img/s | 18516 img/s | 2.02 x | 58.07 x | 55.88 x |
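The "Strong Scaling" columns follow the same pattern: throughput at N GPUs divided by single-GPU throughput at the same precision. A small sketch, again with an illustrative helper name of our own and values copied from the table above:

```python
# Strong scaling = throughput with N GPUs / throughput with 1 GPU (same precision).
def strong_scaling(throughput_n_gpus: float, throughput_1_gpu: float) -> float:
    return round(throughput_n_gpus / throughput_1_gpu, 2)

# Example: efficientnet-b0, mixed precision, 8x A100 vs 1x A100 (values from the table above).
print(strong_scaling(16652, 2489))  # ~6.69; the table reports 6.68, presumably from unrounded measurements
```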
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
| **Model** | **GPUs** | **FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 652 img/s | 1254 img/s | 1.92 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4571 img/s | 7664 img/s | 1.67 x | 7.0 x | 6.1 x |
| efficientnet-b4 | 1 | 80 img/s | 199 img/s | 2.47 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 598 img/s | 1330 img/s | 2.22 x | 7.42 x | 6.67 x |
| efficientnet-widese-b0 | 1 | 654 img/s | 1255 img/s | 1.91 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4489 img/s | 7694 img/s | 1.71 x | 6.85 x | 6.12 x |
| efficientnet-widese-b4 | 1 | 79 img/s | 198 img/s | 2.51 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 590 img/s | 1323 img/s | 2.24 x | 7.46 x | 6.65 x |
| efficientnet-b0 | 1 | 655 img/s | 1301 img/s | 1.98 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4672 img/s | 7789 img/s | 1.66 x | 7.12 x | 5.98 x |
| efficientnet-b4 | 1 | 83 img/s | 204 img/s | 2.46 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 616 img/s | 1366 img/s | 2.21 x | 7.41 x | 6.67 x |
| efficientnet-widese-b0 | 1 | 655 img/s | 1299 img/s | 1.98 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4592 img/s | 7875 img/s | 1.71 x | 7.0 x | 6.05 x |
| efficientnet-widese-b4 | 1 | 83 img/s | 204 img/s | 2.45 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 612 img/s | 1356 img/s | 2.21 x | 7.34 x | 6.63 x |
##### Training performance: NVIDIA DGX-1 (8x V100 32GB)
-Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
| **Model** | **GPUs** | **FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 637 img/s | 1352 img/s | 2.12 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4834 img/s | 8645 img/s | 1.78 x | 7.58 x | 6.39 x |
| efficientnet-b4 | 1 | 84 img/s | 200 img/s | 2.38 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 632 img/s | 1519 img/s | 2.4 x | 7.53 x | 7.58 x |
| efficientnet-widese-b0 | 1 | 637 img/s | 1349 img/s | 2.11 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4841 img/s | 8693 img/s | 1.79 x | 7.59 x | 6.43 x |
| efficientnet-widese-b4 | 1 | 83 img/s | 200 img/s | 2.38 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 627 img/s | 1508 img/s | 2.4 x | 7.47 x | 7.53 x |
| efficientnet-b0 | 1 | 646 img/s | 1401 img/s | 2.16 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4937 img/s | 8615 img/s | 1.74 x | 7.63 x | 6.14 x |
| efficientnet-b4 | 1 | 36 img/s | 89 img/s | 2.44 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 641 img/s | 1565 img/s | 2.44 x | 17.6 x | 17.57 x |
| efficientnet-widese-b0 | 1 | 281 img/s | 603 img/s | 2.14 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4924 img/s | 8870 img/s | 1.8 x | 17.49 x | 14.7 x |
| efficientnet-widese-b4 | 1 | 36 img/s | 89 img/s | 2.45 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 639 img/s | 1556 img/s | 2.43 x | 17.61 x | 17.44 x |
#### Inference performance results
##### Inference performance: NVIDIA A100 (1x A100 80GB)
-Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
###### TF32 Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 122 img/s | 10.04 ms | 8.59 ms | 10.2 ms |
| efficientnet-b0 | 2 | 249 img/s | 9.91 ms | 9.08 ms | 10.84 ms |
| efficientnet-b0 | 4 | 472 img/s | 10.31 ms | 9.67 ms | 11.25 ms |
| efficientnet-b0 | 8 | 922 img/s | 10.67 ms | 10.76 ms | 12.13 ms |
| efficientnet-b0 | 16 | 1796 img/s | 10.86 ms | 11.1 ms | 13.01 ms |
| efficientnet-b0 | 32 | 3235 img/s | 12.05 ms | 13.28 ms | 15.07 ms |
| efficientnet-b0 | 64 | 4658 img/s | 16.27 ms | 14.56 ms | 16.18 ms |
| efficientnet-b0 | 128 | 4911 img/s | 31.51 ms | 26.24 ms | 27.29 ms |
| efficientnet-b0 | 256 | 5015 img/s | 62.64 ms | 50.81 ms | 55.6 ms |
| efficientnet-b4 | 1 | 63 img/s | 17.64 ms | 16.29 ms | 17.92 ms |
| efficientnet-b4 | 2 | 122 img/s | 18.27 ms | 18.12 ms | 22.32 ms |
| efficientnet-b4 | 4 | 247 img/s | 18.25 ms | 17.79 ms | 21.02 ms |
| efficientnet-b4 | 8 | 469 img/s | 19.03 ms | 18.94 ms | 22.49 ms |
| efficientnet-b4 | 16 | 572 img/s | 29.95 ms | 28.14 ms | 28.99 ms |
| efficientnet-b4 | 32 | 638 img/s | 52.25 ms | 50.24 ms | 50.5 ms |
| efficientnet-b4 | 64 | 680 img/s | 96.93 ms | 94.1 ms | 94.3 ms |
| efficientnet-b4 | 128 | 672 img/s | 197.49 ms | 189.69 ms | 189.91 ms |
| efficientnet-b4 | 256 | 679 img/s | 392.15 ms | 374.18 ms | 386.85 ms |
| efficientnet-widese-b0 | 1 | 120 img/s | 10.21 ms | 8.61 ms | 11.37 ms |
| efficientnet-widese-b0 | 2 | 242 img/s | 10.16 ms | 9.98 ms | 11.36 ms |
| efficientnet-widese-b0 | 4 | 493 img/s | 9.97 ms | 8.92 ms | 10.23 ms |
| efficientnet-widese-b0 | 8 | 913 img/s | 10.77 ms | 10.58 ms | 12.11 ms |
| efficientnet-widese-b0 | 16 | 1864 img/s | 10.54 ms | 10.34 ms | 11.69 ms |
| efficientnet-widese-b0 | 32 | 3218 img/s | 12.06 ms | 13.17 ms | 15.69 ms |
| efficientnet-widese-b0 | 64 | 4625 img/s | 16.4 ms | 15.35 ms | 17.86 ms |
| efficientnet-widese-b0 | 128 | 4904 img/s | 31.84 ms | 26.22 ms | 28.69 ms |
| efficientnet-widese-b0 | 256 | 5013 img/s | 63.1 ms | 50.95 ms | 52.44 ms |
| efficientnet-widese-b4 | 1 | 64 img/s | 17.51 ms | 16.5 ms | 20.03 ms |
| efficientnet-widese-b4 | 2 | 125 img/s | 17.86 ms | 17.24 ms | 19.27 ms |
| efficientnet-widese-b4 | 4 | 248 img/s | 18.09 ms | 17.36 ms | 21.34 ms |
| efficientnet-widese-b4 | 8 | 472 img/s | 18.92 ms | 18.33 ms | 20.68 ms |
| efficientnet-widese-b4 | 16 | 569 img/s | 30.11 ms | 28.18 ms | 28.45 ms |
| efficientnet-widese-b4 | 32 | 628 img/s | 53.05 ms | 51.11 ms | 51.29 ms |
| efficientnet-widese-b4 | 64 | 679 img/s | 97.17 ms | 94.22 ms | 94.43 ms |
| efficientnet-widese-b4 | 128 | 672 img/s | 197.74 ms | 189.93 ms | 190.95 ms |
| efficientnet-widese-b4 | 256 | 679 img/s | 392.7 ms | 373.84 ms | 378.35 ms |
| efficientnet-b0 | 1 | 130 img/s | 9.33 ms | 7.95 ms | 9.0 ms |
| efficientnet-b0 | 2 | 262 img/s | 9.39 ms | 8.51 ms | 9.5 ms |
| efficientnet-b0 | 4 | 503 img/s | 9.68 ms | 9.53 ms | 10.78 ms |
| efficientnet-b0 | 8 | 1004 img/s | 9.85 ms | 9.89 ms | 11.49 ms |
| efficientnet-b0 | 16 | 1880 img/s | 10.27 ms | 10.34 ms | 11.19 ms |
| efficientnet-b0 | 32 | 3401 img/s | 11.46 ms | 12.51 ms | 14.39 ms |
| efficientnet-b0 | 64 | 4656 img/s | 19.58 ms | 14.52 ms | 16.63 ms |
| efficientnet-b0 | 128 | 5001 img/s | 31.03 ms | 25.72 ms | 28.34 ms |
| efficientnet-b0 | 256 | 5154 img/s | 60.71 ms | 49.44 ms | 54.99 ms |
| efficientnet-b4 | 1 | 69 img/s | 16.22 ms | 14.87 ms | 15.34 ms |
| efficientnet-b4 | 2 | 133 img/s | 16.84 ms | 16.49 ms | 17.72 ms |
| efficientnet-b4 | 4 | 259 img/s | 17.33 ms | 16.39 ms | 19.67 ms |
| efficientnet-b4 | 8 | 491 img/s | 18.22 ms | 18.09 ms | 19.51 ms |
| efficientnet-b4 | 16 | 606 img/s | 28.28 ms | 26.55 ms | 26.84 ms |
| efficientnet-b4 | 32 | 651 img/s | 51.08 ms | 49.39 ms | 49.61 ms |
| efficientnet-b4 | 64 | 684 img/s | 96.23 ms | 93.54 ms | 93.78 ms |
| efficientnet-b4 | 128 | 700 img/s | 195.22 ms | 182.17 ms | 182.42 ms |
| efficientnet-b4 | 256 | 702 img/s | 380.01 ms | 361.81 ms | 371.64 ms |
| efficientnet-widese-b0 | 1 | 130 img/s | 9.49 ms | 8.76 ms | 9.68 ms |
| efficientnet-widese-b0 | 2 | 265 img/s | 9.25 ms | 8.51 ms | 9.75 ms |
| efficientnet-widese-b0 | 4 | 520 img/s | 9.42 ms | 8.67 ms | 9.97 ms |
| efficientnet-widese-b0 | 8 | 996 img/s | 12.27 ms | 9.69 ms | 11.31 ms |
| efficientnet-widese-b0 | 16 | 1916 img/s | 10.2 ms | 10.29 ms | 11.3 ms |
| efficientnet-widese-b0 | 32 | 3293 img/s | 11.71 ms | 13.0 ms | 14.57 ms |
| efficientnet-widese-b0 | 64 | 4639 img/s | 16.21 ms | 14.61 ms | 16.29 ms |
| efficientnet-widese-b0 | 128 | 4997 img/s | 30.81 ms | 25.76 ms | 26.02 ms |
| efficientnet-widese-b0 | 256 | 5166 img/s | 73.68 ms | 49.39 ms | 55.74 ms |
| efficientnet-widese-b4 | 1 | 68 img/s | 16.41 ms | 15.14 ms | 16.59 ms |
| efficientnet-widese-b4 | 2 | 135 img/s | 16.65 ms | 15.52 ms | 17.93 ms |
| efficientnet-widese-b4 | 4 | 251 img/s | 17.74 ms | 17.29 ms | 20.47 ms |
| efficientnet-widese-b4 | 8 | 501 img/s | 17.75 ms | 17.12 ms | 18.01 ms |
| efficientnet-widese-b4 | 16 | 590 img/s | 28.94 ms | 27.29 ms | 27.81 ms |
| efficientnet-widese-b4 | 32 | 651 img/s | 50.96 ms | 49.34 ms | 49.55 ms |
| efficientnet-widese-b4 | 64 | 683 img/s | 99.28 ms | 93.65 ms | 93.88 ms |
| efficientnet-widese-b4 | 128 | 700 img/s | 189.81 ms | 182.3 ms | 182.58 ms |
| efficientnet-widese-b4 | 256 | 702 img/s | 379.36 ms | 361.84 ms | 366.05 ms |
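A rough way to cross-check rows in these latency tables is that throughput is approximately batch size divided by average latency; the two columns are measured independently, so they only agree approximately. A sketch, with values taken from the efficientnet-b4, batch-size-64 row above (the helper name is ours):

```python
# Rough cross-check: throughput ~= batch_size / average latency.
def approx_throughput(batch_size: int, latency_avg_ms: float) -> float:
    return batch_size / (latency_avg_ms / 1000.0)

# Example: efficientnet-b4, batch size 64, TF32 row from the table above.
print(round(approx_throughput(64, 96.23)))  # ~665 img/s vs. the 684 img/s reported in the table
```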
###### Mixed Precision Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 99 img/s | 11.89 ms | 10.83 ms | 13.04 ms |
| efficientnet-b0 | 2 | 208 img/s | 11.43 ms | 10.15 ms | 10.87 ms |
| efficientnet-b0 | 4 | 395 img/s | 12.0 ms | 11.01 ms | 12.8 ms |
| efficientnet-b0 | 8 | 763 img/s | 12.33 ms | 11.62 ms | 13.94 ms |
| efficientnet-b0 | 16 | 1499 img/s | 12.58 ms | 12.57 ms | 14.4 ms |
| efficientnet-b0 | 32 | 2875 img/s | 13.19 ms | 13.76 ms | 15.29 ms |
| efficientnet-b0 | 64 | 5841 img/s | 13.7 ms | 14.91 ms | 18.73 ms |
| efficientnet-b0 | 128 | 7850 img/s | 21.53 ms | 16.58 ms | 18.94 ms |
| efficientnet-b0 | 256 | 8285 img/s | 42.07 ms | 30.87 ms | 38.03 ms |
| efficientnet-b4 | 1 | 51 img/s | 21.2 ms | 19.73 ms | 21.47 ms |
| efficientnet-b4 | 2 | 103 img/s | 21.17 ms | 20.91 ms | 24.17 ms |
| efficientnet-b4 | 4 | 205 img/s | 21.34 ms | 20.32 ms | 23.46 ms |
| efficientnet-b4 | 8 | 376 img/s | 23.11 ms | 22.64 ms | 24.77 ms |
| efficientnet-b4 | 16 | 781 img/s | 22.42 ms | 23.03 ms | 25.37 ms |
| efficientnet-b4 | 32 | 1048 img/s | 32.52 ms | 30.76 ms | 31.65 ms |
| efficientnet-b4 | 64 | 1156 img/s | 58.31 ms | 55.45 ms | 56.89 ms |
| efficientnet-b4 | 128 | 1197 img/s | 112.92 ms | 106.69 ms | 107.84 ms |
| efficientnet-b4 | 256 | 1229 img/s | 220.5 ms | 206.68 ms | 223.16 ms |
| efficientnet-widese-b0 | 1 | 100 img/s | 11.75 ms | 10.62 ms | 13.67 ms |
| efficientnet-widese-b0 | 2 | 200 img/s | 11.86 ms | 11.38 ms | 14.32 ms |
| efficientnet-widese-b0 | 4 | 400 img/s | 11.81 ms | 10.8 ms | 13.8 ms |
| efficientnet-widese-b0 | 8 | 770 img/s | 12.17 ms | 11.2 ms | 12.38 ms |
| efficientnet-widese-b0 | 16 | 1501 img/s | 12.62 ms | 12.12 ms | 14.94 ms |
| efficientnet-widese-b0 | 32 | 2901 img/s | 13.06 ms | 13.28 ms | 15.23 ms |
| efficientnet-widese-b0 | 64 | 5853 img/s | 13.69 ms | 14.38 ms | 16.91 ms |
| efficientnet-widese-b0 | 128 | 7807 img/s | 21.43 ms | 16.63 ms | 21.8 ms |
| efficientnet-widese-b0 | 256 | 8270 img/s | 42.01 ms | 30.97 ms | 34.55 ms |
| efficientnet-widese-b4 | 1 | 52 img/s | 21.03 ms | 19.9 ms | 22.23 ms |
| efficientnet-widese-b4 | 2 | 102 img/s | 21.34 ms | 21.6 ms | 24.23 ms |
| efficientnet-widese-b4 | 4 | 200 img/s | 21.76 ms | 21.19 ms | 23.69 ms |
| efficientnet-widese-b4 | 8 | 373 img/s | 23.31 ms | 22.99 ms | 28.33 ms |
| efficientnet-widese-b4 | 16 | 763 img/s | 22.93 ms | 23.75 ms | 26.6 ms |
| efficientnet-widese-b4 | 32 | 1043 img/s | 32.7 ms | 31.03 ms | 33.52 ms |
| efficientnet-widese-b4 | 64 | 1152 img/s | 58.27 ms | 55.64 ms | 55.86 ms |
| efficientnet-widese-b4 | 128 | 1197 img/s | 112.86 ms | 106.72 ms | 108.65 ms |
| efficientnet-widese-b4 | 256 | 1229 img/s | 221.11 ms | 206.5 ms | 221.37 ms |
| efficientnet-b0 | 1 | 105 img/s | 11.21 ms | 9.9 ms | 12.55 ms |
| efficientnet-b0 | 2 | 214 img/s | 11.01 ms | 10.06 ms | 11.89 ms |
| efficientnet-b0 | 4 | 412 img/s | 11.45 ms | 11.73 ms | 13.0 ms |
| efficientnet-b0 | 8 | 803 img/s | 11.78 ms | 11.59 ms | 14.2 ms |
| efficientnet-b0 | 16 | 1584 img/s | 11.89 ms | 11.9 ms | 13.63 ms |
| efficientnet-b0 | 32 | 2915 img/s | 13.03 ms | 14.79 ms | 17.35 ms |
| efficientnet-b0 | 64 | 6315 img/s | 12.71 ms | 13.59 ms | 15.27 ms |
| efficientnet-b0 | 128 | 9311 img/s | 18.78 ms | 15.34 ms | 17.99 ms |
| efficientnet-b0 | 256 | 10239 img/s | 39.05 ms | 24.97 ms | 29.24 ms |
| efficientnet-b4 | 1 | 53 img/s | 20.45 ms | 19.06 ms | 20.36 ms |
| efficientnet-b4 | 2 | 109 img/s | 20.01 ms | 19.74 ms | 21.5 ms |
| efficientnet-b4 | 4 | 212 img/s | 20.6 ms | 19.88 ms | 22.37 ms |
| efficientnet-b4 | 8 | 416 img/s | 21.02 ms | 21.46 ms | 24.82 ms |
| efficientnet-b4 | 16 | 816 img/s | 21.53 ms | 22.91 ms | 26.06 ms |
| efficientnet-b4 | 32 | 1208 img/s | 28.4 ms | 26.77 ms | 28.3 ms |
| efficientnet-b4 | 64 | 1332 img/s | 50.55 ms | 48.23 ms | 48.49 ms |
| efficientnet-b4 | 128 | 1418 img/s | 95.84 ms | 90.12 ms | 95.76 ms |
| efficientnet-b4 | 256 | 1442 img/s | 191.48 ms | 176.19 ms | 189.04 ms |
| efficientnet-widese-b0 | 1 | 104 img/s | 11.28 ms | 10.0 ms | 12.72 ms |
| efficientnet-widese-b0 | 2 | 206 img/s | 11.41 ms | 10.65 ms | 12.72 ms |
| efficientnet-widese-b0 | 4 | 426 img/s | 11.15 ms | 10.23 ms | 11.03 ms |
| efficientnet-widese-b0 | 8 | 794 img/s | 11.9 ms | 12.68 ms | 14.17 ms |
| efficientnet-widese-b0 | 16 | 1536 img/s | 12.32 ms | 13.22 ms | 14.57 ms |
| efficientnet-widese-b0 | 32 | 2876 img/s | 14.12 ms | 14.45 ms | 16.23 ms |
| efficientnet-widese-b0 | 64 | 6183 img/s | 13.02 ms | 14.19 ms | 16.68 ms |
| efficientnet-widese-b0 | 128 | 9310 img/s | 20.06 ms | 15.24 ms | 17.84 ms |
| efficientnet-widese-b0 | 256 | 10193 img/s | 36.07 ms | 25.13 ms | 34.22 ms |
| efficientnet-widese-b4 | 1 | 53 img/s | 20.24 ms | 19.05 ms | 19.91 ms |
| efficientnet-widese-b4 | 2 | 109 img/s | 20.98 ms | 19.24 ms | 22.58 ms |
| efficientnet-widese-b4 | 4 | 213 img/s | 20.48 ms | 20.48 ms | 23.64 ms |
| efficientnet-widese-b4 | 8 | 425 img/s | 20.57 ms | 20.26 ms | 22.44 ms |
| efficientnet-widese-b4 | 16 | 800 img/s | 21.93 ms | 23.15 ms | 26.51 ms |
| efficientnet-widese-b4 | 32 | 1201 img/s | 28.51 ms | 26.89 ms | 28.13 ms |
| efficientnet-widese-b4 | 64 | 1322 img/s | 50.96 ms | 48.58 ms | 48.77 ms |
| efficientnet-widese-b4 | 128 | 1417 img/s | 96.45 ms | 90.17 ms | 90.43 ms |
| efficientnet-widese-b4 | 256 | 1439 img/s | 190.06 ms | 176.59 ms | 188.51 ms |
##### Inference performance: NVIDIA V100 (1x V100 16GB)
-Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
###### FP32 Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 77 img/s | 14.23 ms | 13.31 ms | 14.68 ms |
| efficientnet-b0 | 2 | 153 img/s | 14.46 ms | 13.67 ms | 14.69 ms |
| efficientnet-b0 | 4 | 317 img/s | 14.06 ms | 15.77 ms | 17.28 ms |
| efficientnet-b0 | 8 | 646 img/s | 13.88 ms | 14.32 ms | 15.05 ms |
| efficientnet-b0 | 16 | 1217 img/s | 14.74 ms | 15.89 ms | 18.03 ms |
| efficientnet-b0 | 32 | 2162 img/s | 16.51 ms | 17.9 ms | 20.06 ms |
| efficientnet-b0 | 64 | 2716 img/s | 25.74 ms | 23.64 ms | 24.08 ms |
| efficientnet-b0 | 128 | 2816 img/s | 50.21 ms | 45.43 ms | 46.3 ms |
| efficientnet-b0 | 256 | 2955 img/s | 96.46 ms | 85.96 ms | 92.74 ms |
| efficientnet-b4 | 1 | 38 img/s | 27.73 ms | 27.98 ms | 29.45 ms |
| efficientnet-b4 | 2 | 84 img/s | 25.1 ms | 24.6 ms | 26.29 ms |
| efficientnet-b4 | 4 | 170 img/s | 25.01 ms | 24.84 ms | 26.52 ms |
| efficientnet-b4 | 8 | 304 img/s | 27.75 ms | 26.28 ms | 27.71 ms |
| efficientnet-b4 | 16 | 334 img/s | 49.51 ms | 47.98 ms | 48.46 ms |
| efficientnet-b4 | 32 | 353 img/s | 92.42 ms | 90.81 ms | 91.0 ms |
| efficientnet-b4 | 64 | 380 img/s | 170.58 ms | 168.32 ms | 168.8 ms |
| efficientnet-b4 | 128 | 381 img/s | 343.03 ms | 334.58 ms | 334.94 ms |
| efficientnet-widese-b0 | 1 | 83 img/s | 13.38 ms | 13.14 ms | 13.58 ms |
| efficientnet-widese-b0 | 2 | 149 img/s | 14.82 ms | 15.09 ms | 16.03 ms |
| efficientnet-widese-b0 | 4 | 319 img/s | 13.91 ms | 13.06 ms | 13.96 ms |
| efficientnet-widese-b0 | 8 | 566 img/s | 15.62 ms | 16.3 ms | 17.5 ms |
| efficientnet-widese-b0 | 16 | 1211 img/s | 14.85 ms | 15.97 ms | 18.8 ms |
| efficientnet-widese-b0 | 32 | 2055 img/s | 17.33 ms | 19.54 ms | 21.59 ms |
| efficientnet-widese-b0 | 64 | 2707 img/s | 25.66 ms | 23.72 ms | 23.93 ms |
| efficientnet-widese-b0 | 128 | 2811 img/s | 49.93 ms | 45.46 ms | 45.51 ms |
| efficientnet-widese-b0 | 256 | 2953 img/s | 96.43 ms | 86.11 ms | 87.33 ms |
| efficientnet-widese-b4 | 1 | 44 img/s | 24.16 ms | 23.16 ms | 25.41 ms |
| efficientnet-widese-b4 | 2 | 89 img/s | 23.95 ms | 23.39 ms | 25.93 ms |
| efficientnet-widese-b4 | 4 | 169 img/s | 25.35 ms | 25.15 ms | 30.58 ms |
| efficientnet-widese-b4 | 8 | 279 img/s | 30.27 ms | 31.76 ms | 33.37 ms |
| efficientnet-widese-b4 | 16 | 331 img/s | 49.84 ms | 48.32 ms | 48.75 ms |
| efficientnet-widese-b4 | 32 | 353 img/s | 92.31 ms | 90.81 ms | 90.95 ms |
| efficientnet-widese-b4 | 64 | 375 img/s | 172.79 ms | 170.49 ms | 170.69 ms |
| efficientnet-widese-b4 | 128 | 381 img/s | 342.33 ms | 334.91 ms | 335.23 ms |
| efficientnet-b0 | 1 | 83 img/s | 13.15 ms | 13.23 ms | 14.11 ms |
| efficientnet-b0 | 2 | 167 img/s | 13.17 ms | 13.46 ms | 14.39 ms |
| efficientnet-b0 | 4 | 332 img/s | 13.25 ms | 13.29 ms | 14.85 ms |
| efficientnet-b0 | 8 | 657 img/s | 13.42 ms | 13.86 ms | 15.77 ms |
| efficientnet-b0 | 16 | 1289 img/s | 13.78 ms | 15.02 ms | 16.99 ms |
| efficientnet-b0 | 32 | 2140 img/s | 16.46 ms | 18.92 ms | 22.2 ms |
| efficientnet-b0 | 64 | 2743 img/s | 25.14 ms | 23.44 ms | 23.79 ms |
| efficientnet-b0 | 128 | 2908 img/s | 48.03 ms | 43.98 ms | 45.36 ms |
| efficientnet-b0 | 256 | 2968 img/s | 94.86 ms | 85.62 ms | 91.01 ms |
| efficientnet-b4 | 1 | 45 img/s | 23.31 ms | 23.3 ms | 24.9 ms |
| efficientnet-b4 | 2 | 87 img/s | 24.07 ms | 23.81 ms | 25.14 ms |
| efficientnet-b4 | 4 | 160 img/s | 26.29 ms | 26.78 ms | 30.85 ms |
| efficientnet-b4 | 8 | 316 img/s | 26.65 ms | 26.44 ms | 28.61 ms |
| efficientnet-b4 | 16 | 341 img/s | 48.18 ms | 46.9 ms | 47.13 ms |
| efficientnet-b4 | 32 | 365 img/s | 89.07 ms | 87.83 ms | 88.02 ms |
| efficientnet-b4 | 64 | 374 img/s | 173.2 ms | 171.61 ms | 172.27 ms |
| efficientnet-b4 | 128 | 376 img/s | 346.32 ms | 339.74 ms | 340.37 ms |
| efficientnet-widese-b0 | 1 | 82 img/s | 13.37 ms | 12.95 ms | 13.89 ms |
| efficientnet-widese-b0 | 2 | 168 img/s | 13.11 ms | 12.45 ms | 13.94 ms |
| efficientnet-widese-b0 | 4 | 346 img/s | 12.73 ms | 12.22 ms | 12.95 ms |
| efficientnet-widese-b0 | 8 | 674 img/s | 13.07 ms | 12.75 ms | 14.93 ms |
| efficientnet-widese-b0 | 16 | 1235 img/s | 14.3 ms | 15.05 ms | 16.53 ms |
| efficientnet-widese-b0 | 32 | 2194 img/s | 15.99 ms | 17.37 ms | 19.01 ms |
| efficientnet-widese-b0 | 64 | 2747 img/s | 25.05 ms | 23.38 ms | 23.71 ms |
| efficientnet-widese-b0 | 128 | 2906 img/s | 48.05 ms | 44.0 ms | 44.59 ms |
| efficientnet-widese-b0 | 256 | 2962 img/s | 95.14 ms | 85.86 ms | 86.25 ms |
| efficientnet-widese-b4 | 1 | 43 img/s | 24.28 ms | 25.24 ms | 27.36 ms |
| efficientnet-widese-b4 | 2 | 87 img/s | 24.04 ms | 24.38 ms | 26.01 ms |
| efficientnet-widese-b4 | 4 | 169 img/s | 24.96 ms | 25.8 ms | 27.14 ms |
| efficientnet-widese-b4 | 8 | 307 img/s | 27.39 ms | 28.4 ms | 30.7 ms |
| efficientnet-widese-b4 | 16 | 342 img/s | 48.05 ms | 46.74 ms | 46.9 ms |
| efficientnet-widese-b4 | 32 | 363 img/s | 89.44 ms | 88.23 ms | 88.39 ms |
| efficientnet-widese-b4 | 64 | 373 img/s | 173.47 ms | 172.01 ms | 172.36 ms |
| efficientnet-widese-b4 | 128 | 376 img/s | 347.18 ms | 340.09 ms | 340.45 ms |
###### Mixed Precision Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 66 img/s | 16.38 ms | 15.63 ms | 17.01 ms |
| efficientnet-b0 | 2 | 120 img/s | 18.0 ms | 18.39 ms | 19.35 ms |
| efficientnet-b0 | 4 | 244 img/s | 17.77 ms | 18.98 ms | 21.4 ms |
| efficientnet-b0 | 8 | 506 img/s | 17.26 ms | 18.23 ms | 20.24 ms |
| efficientnet-b0 | 16 | 912 img/s | 19.07 ms | 20.33 ms | 22.59 ms |
| efficientnet-b0 | 32 | 1758 img/s | 20.3 ms | 22.2 ms | 24.7 ms |
| efficientnet-b0 | 64 | 3720 img/s | 19.18 ms | 20.09 ms | 21.48 ms |
| efficientnet-b0 | 128 | 4942 img/s | 30.53 ms | 26.0 ms | 27.54 ms |
| efficientnet-b0 | 256 | 5339 img/s | 57.82 ms | 47.63 ms | 51.61 ms |
| efficientnet-b4 | 1 | 32 img/s | 31.83 ms | 32.51 ms | 34.09 ms |
| efficientnet-b4 | 2 | 65 img/s | 31.82 ms | 34.53 ms | 36.95 ms |
| efficientnet-b4 | 4 | 127 img/s | 32.77 ms | 32.87 ms | 35.95 ms |
| efficientnet-b4 | 8 | 255 img/s | 32.9 ms | 34.56 ms | 37.01 ms |
| efficientnet-b4 | 16 | 486 img/s | 34.46 ms | 36.56 ms | 39.1 ms |
| efficientnet-b4 | 32 | 681 img/s | 48.48 ms | 46.98 ms | 48.55 ms |
| efficientnet-b4 | 64 | 738 img/s | 88.55 ms | 86.55 ms | 87.31 ms |
| efficientnet-b4 | 128 | 757 img/s | 174.13 ms | 168.73 ms | 168.92 ms |
| efficientnet-b4 | 256 | 770 img/s | 343.04 ms | 329.95 ms | 330.66 ms |
| efficientnet-widese-b0 | 1 | 63 img/s | 17.08 ms | 16.36 ms | 17.8 ms |
| efficientnet-widese-b0 | 2 | 123 img/s | 17.48 ms | 16.74 ms | 18.17 ms |
| efficientnet-widese-b0 | 4 | 241 img/s | 17.95 ms | 17.29 ms | 18.76 ms |
| efficientnet-widese-b0 | 8 | 486 img/s | 17.92 ms | 19.42 ms | 22.3 ms |
| efficientnet-widese-b0 | 16 | 898 img/s | 19.3 ms | 20.57 ms | 22.41 ms |
| efficientnet-widese-b0 | 32 | 1649 img/s | 21.06 ms | 23.14 ms | 24.83 ms |
| efficientnet-widese-b0 | 64 | 3360 img/s | 21.22 ms | 22.89 ms | 25.07 ms |
| efficientnet-widese-b0 | 128 | 4934 img/s | 30.35 ms | 26.48 ms | 30.3 ms |
| efficientnet-widese-b0 | 256 | 5340 img/s | 57.83 ms | 47.59 ms | 54.7 ms |
| efficientnet-widese-b4 | 1 | 31 img/s | 33.37 ms | 34.12 ms | 35.95 ms |
| efficientnet-widese-b4 | 2 | 63 img/s | 33.0 ms | 33.73 ms | 35.15 ms |
| efficientnet-widese-b4 | 4 | 133 img/s | 31.43 ms | 31.72 ms | 33.93 ms |
| efficientnet-widese-b4 | 8 | 244 img/s | 34.35 ms | 36.98 ms | 39.72 ms |
| efficientnet-widese-b4 | 16 | 454 img/s | 36.8 ms | 39.8 ms | 42.41 ms |
| efficientnet-widese-b4 | 32 | 680 img/s | 48.63 ms | 48.1 ms | 50.57 ms |
| efficientnet-widese-b4 | 64 | 738 img/s | 88.64 ms | 86.56 ms | 86.7 ms |
| efficientnet-widese-b4 | 128 | 756 img/s | 174.52 ms | 168.98 ms | 169.13 ms |
| efficientnet-widese-b4 | 256 | 771 img/s | 344.05 ms | 329.69 ms | 330.7 ms |
| efficientnet-b0 | 1 | 62 img/s | 17.19 ms | 18.01 ms | 18.63 ms |
| efficientnet-b0 | 2 | 119 img/s | 17.96 ms | 18.3 ms | 19.95 ms |
| efficientnet-b0 | 4 | 238 img/s | 17.9 ms | 17.8 ms | 19.13 ms |
| efficientnet-b0 | 8 | 495 img/s | 17.38 ms | 18.34 ms | 19.29 ms |
| efficientnet-b0 | 16 | 945 img/s | 18.23 ms | 19.42 ms | 21.58 ms |
| efficientnet-b0 | 32 | 1784 img/s | 19.29 ms | 20.71 ms | 22.51 ms |
| efficientnet-b0 | 64 | 3480 img/s | 20.34 ms | 22.22 ms | 24.62 ms |
| efficientnet-b0 | 128 | 5759 img/s | 26.11 ms | 22.61 ms | 24.06 ms |
| efficientnet-b0 | 256 | 6176 img/s | 49.36 ms | 41.18 ms | 43.5 ms |
| efficientnet-b4 | 1 | 34 img/s | 30.28 ms | 30.2 ms | 32.24 ms |
| efficientnet-b4 | 2 | 69 img/s | 30.12 ms | 30.02 ms | 31.92 ms |
| efficientnet-b4 | 4 | 129 img/s | 32.08 ms | 33.29 ms | 34.74 ms |
| efficientnet-b4 | 8 | 242 img/s | 34.43 ms | 37.34 ms | 41.08 ms |
| efficientnet-b4 | 16 | 488 img/s | 34.12 ms | 36.13 ms | 39.39 ms |
| efficientnet-b4 | 32 | 738 img/s | 44.67 ms | 44.85 ms | 47.86 ms |
| efficientnet-b4 | 64 | 809 img/s | 80.93 ms | 79.19 ms | 79.42 ms |
| efficientnet-b4 | 128 | 843 img/s | 156.42 ms | 152.17 ms | 152.76 ms |
| efficientnet-b4 | 256 | 847 img/s | 311.03 ms | 301.44 ms | 302.48 ms |
| efficientnet-widese-b0 | 1 | 64 img/s | 16.71 ms | 17.59 ms | 19.23 ms |
| efficientnet-widese-b0 | 2 | 129 img/s | 16.63 ms | 16.1 ms | 17.34 ms |
| efficientnet-widese-b0 | 4 | 238 img/s | 17.92 ms | 17.52 ms | 18.82 ms |
| efficientnet-widese-b0 | 8 | 445 img/s | 19.24 ms | 19.53 ms | 20.4 ms |
| efficientnet-widese-b0 | 16 | 936 img/s | 18.64 ms | 19.55 ms | 21.1 ms |
| efficientnet-widese-b0 | 32 | 1818 img/s | 18.97 ms | 20.62 ms | 23.06 ms |
| efficientnet-widese-b0 | 64 | 3572 img/s | 19.81 ms | 21.14 ms | 23.29 ms |
| efficientnet-widese-b0 | 128 | 5748 img/s | 26.18 ms | 23.72 ms | 26.1 ms |
| efficientnet-widese-b0 | 256 | 6187 img/s | 49.11 ms | 41.11 ms | 41.59 ms |
| efficientnet-widese-b4 | 1 | 32 img/s | 32.1 ms | 31.6 ms | 34.69 ms |
| efficientnet-widese-b4 | 2 | 68 img/s | 30.4 ms | 30.9 ms | 32.67 ms |
| efficientnet-widese-b4 | 4 | 123 img/s | 33.81 ms | 39.0 ms | 40.76 ms |
| efficientnet-widese-b4 | 8 | 257 img/s | 32.34 ms | 33.39 ms | 34.93 ms |
| efficientnet-widese-b4 | 16 | 497 img/s | 33.51 ms | 34.92 ms | 37.24 ms |
| efficientnet-widese-b4 | 32 | 739 img/s | 44.63 ms | 43.62 ms | 46.39 ms |
| efficientnet-widese-b4 | 64 | 808 img/s | 81.08 ms | 79.43 ms | 79.59 ms |
| efficientnet-widese-b4 | 128 | 840 img/s | 157.11 ms | 152.87 ms | 153.26 ms |
| efficientnet-widese-b4 | 256 | 846 img/s | 310.73 ms | 301.68 ms | 302.9 ms |


@@ -206,7 +206,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* [PyTorch 21.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@@ -533,7 +533,7 @@ To benchmark inference, run:
* TF32 (A100 GPUs only)
-`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
+`python ./launch.py --model resnet50 --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
@@ -543,11 +543,12 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
-Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
+#### Training accuracy results
+Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
-#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
@@ -573,8 +574,6 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
| 90 | 77.10 +/- 0.06 | 77.23 +/- 0.04 |
| 250 | 78.59 +/- 0.13 | 78.46 +/- 0.03 |
##### Example plots
The following images show a 250-epoch configuration on a DGX-1V.
@@ -587,64 +586,70 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 2461 img/s | 945 img/s | 2.6 x | 1.0 x | ~14 hours | 1.0 x | ~36 hours |
| 8 | 15977 img/s | 7365 img/s | 2.16 x | 6.49 x | ~3 hours | 7.78 x | ~5 hours |
| **GPUs** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 938 img/s | 2470 img/s | 2.63 x | 1.0 x | 1.0 x | ~14 hours | ~36 hours |
| 8 | 7248 img/s | 16621 img/s | 2.29 x | 7.72 x | 6.72 x | ~3 hours | ~5 hours |
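The approximate training-time columns can be sanity-checked against the throughput columns: pure training compute takes at least epochs × dataset size ÷ throughput, and the reported wall-clock figures are higher because they presumably also include validation passes, checkpointing, and data-loading overhead. A back-of-the-envelope sketch (the ImageNet count is the standard ILSVRC2012 training-set size; the helper name is ours):

```python
# Lower bound on training time: epochs * dataset size / measured throughput.
IMAGENET_TRAIN_IMAGES = 1_281_167  # standard ILSVRC2012 training-set size

def min_training_hours(epochs: int, images_per_second: float) -> float:
    return epochs * IMAGENET_TRAIN_IMAGES / images_per_second / 3600.0

# Example: resnet50, mixed precision, 8x A100 (16621 img/s from the table above).
print(round(min_training_hours(90, 16621), 1))  # ~1.9 h of pure training compute; the table's ~3 hours includes overhead
```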
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1180 img/s | 371 img/s | 3.17 x | 1.0 x | ~29 hours | 1.0 x | ~91 hours |
| 8 | 7608 img/s | 2851 img/s | 2.66 x | 6.44 x | ~5 hours | 7.66 x | ~12 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 367 img/s | 1200 img/s | 3.26 x | 1.0 x | 1.0 x | ~29 hours | ~92 hours |
| 8 | 2855 img/s | 8322 img/s | 2.91 x | 7.76 x | 6.93 x | ~5 hours | ~12 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1115 img/s | 365 img/s | 3.04 x | 1.0 x | ~31 hours | 1.0 x | ~92 hours |
| 8 | 7375 img/s | 2811 img/s | 2.62 x | 6.61 x | ~5 hours | 7.68 x | ~12 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 356 img/s | 1156 img/s | 3.24 x | 1.0 x | 1.0 x | ~30 hours | ~95 hours |
| 8 | 2766 img/s | 8056 img/s | 2.91 x | 7.75 x | 6.96 x | ~5 hours | ~13 hours |
#### Inference performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
###### FP32 Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 99 img/s | 10.38 ms | 11.24 ms | 12.32 ms |
| 2 | 190 img/s | 10.87 ms | 12.18 ms | 14.27 ms |
| 4 | 403 img/s | 10.26 ms | 11.02 ms | 13.28 ms |
| 8 | 754 img/s | 10.96 ms | 11.99 ms | 13.89 ms |
| 16 | 960 img/s | 17.16 ms | 16.74 ms | 18.18 ms |
| 32 | 1057 img/s | 31.39 ms | 30.4 ms | 30.55 ms |
| 64 | 1168 img/s | 57.1 ms | 55.01 ms | 56.19 ms |
| 112 | 1166 img/s | 100.78 ms | 95.98 ms | 97.43 ms |
| 128 | 1215 img/s | 111.11 ms | 105.52 ms | 106.38 ms |
| 256 | 1253 img/s | 217.03 ms | 203.78 ms | 208.68 ms |
| 1 | 96 img/s | 10.37 ms | 10.81 ms | 11.73 ms |
| 2 | 196 img/s | 10.24 ms | 11.18 ms | 12.89 ms |
| 4 | 386 img/s | 10.46 ms | 11.01 ms | 11.75 ms |
| 8 | 709 img/s | 11.5 ms | 12.36 ms | 13.12 ms |
| 16 | 1023 img/s | 16.07 ms | 15.69 ms | 15.97 ms |
| 32 | 1127 img/s | 29.37 ms | 28.53 ms | 28.67 ms |
| 64 | 1200 img/s | 55.4 ms | 53.5 ms | 53.71 ms |
| 128 | 1229 img/s | 109.26 ms | 104.04 ms | 104.34 ms |
| 256 | 1261 img/s | 214.48 ms | 202.51 ms | 202.88 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 82 img/s | 12.43 ms | 13.29 ms | 14.89 ms |
| 2 | 157 img/s | 13.04 ms | 13.84 ms | 16.79 ms |
| 4 | 310 img/s | 13.26 ms | 14.42 ms | 15.63 ms |
| 8 | 646 img/s | 12.69 ms | 13.65 ms | 15.48 ms |
| 16 | 1188 img/s | 14.01 ms | 15.56 ms | 18.34 ms |
| 32 | 2093 img/s | 16.41 ms | 18.25 ms | 19.9 ms |
| 64 | 2899 img/s | 24.12 ms | 22.14 ms | 22.55 ms |
| 128 | 3142 img/s | 45.28 ms | 40.77 ms | 42.89 ms |
| 256 | 3276 img/s | 88.44 ms | 77.8 ms | 79.01 ms |
| 256 | 3276 img/s | 88.6 ms | 77.74 ms | 79.11 ms |
| 1 | 78 img/s | 12.78 ms | 13.27 ms | 14.36 ms |
| 2 | 154 img/s | 13.01 ms | 13.74 ms | 15.19 ms |
| 4 | 300 img/s | 13.41 ms | 14.25 ms | 15.68 ms |
| 8 | 595 img/s | 13.65 ms | 14.51 ms | 15.6 ms |
| 16 | 1178 img/s | 14.0 ms | 15.07 ms | 16.26 ms |
| 32 | 2146 img/s | 15.84 ms | 17.25 ms | 18.53 ms |
| 64 | 2984 img/s | 23.18 ms | 21.51 ms | 21.93 ms |
| 128 | 3249 img/s | 43.55 ms | 39.36 ms | 40.1 ms |
| 256 | 3382 img/s | 84.14 ms | 75.3 ms | 80.08 ms |
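The "Latency 95%" and "Latency 99%" columns are latency percentiles over the benchmark iterations. A small sketch of how such percentiles can be computed from raw per-iteration timings; this is an illustration only, not the repository's own benchmark harness, which may aggregate differently:

```python
# Illustration of deriving avg / 95th / 99th percentile latency from raw timings.
import numpy as np

def latency_stats(latencies_ms):
    arr = np.asarray(latencies_ms, dtype=float)
    return {"avg": arr.mean(), "p95": np.percentile(arr, 95), "p99": np.percentile(arr, 99)}

# Synthetic per-iteration timings, for illustration only.
print(latency_stats([12.4, 12.9, 13.1, 13.6, 14.2, 18.5]))
```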
##### Inference performance: NVIDIA T4
@@ -653,30 +658,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 147 img/s | 7.28 ms | 8.48 ms | 9.79 ms |
| 2 | 251 img/s | 8.48 ms | 10.23 ms | 14.01 ms |
| 4 | 303 img/s | 13.57 ms | 13.61 ms | 15.42 ms |
| 8 | 329 img/s | 24.7 ms | 24.74 ms | 25.0 ms |
| 16 | 371 img/s | 43.73 ms | 43.74 ms | 44.03 ms |
| 32 | 395 img/s | 82.36 ms | 82.13 ms | 82.58 ms |
| 64 | 421 img/s | 155.37 ms | 153.07 ms | 153.55 ms |
| 128 | 426 img/s | 309.06 ms | 303.0 ms | 307.42 ms |
| 256 | 419 img/s | 631.43 ms | 612.42 ms | 614.82 ms |
| 1 | 98 img/s | 10.7 ms | 12.82 ms | 16.71 ms |
| 2 | 186 img/s | 11.26 ms | 13.79 ms | 16.99 ms |
| 4 | 325 img/s | 12.73 ms | 13.89 ms | 18.03 ms |
| 8 | 363 img/s | 22.41 ms | 22.57 ms | 22.9 ms |
| 16 | 409 img/s | 39.77 ms | 39.8 ms | 40.23 ms |
| 32 | 420 img/s | 77.62 ms | 76.92 ms | 77.28 ms |
| 64 | 428 img/s | 152.73 ms | 152.03 ms | 153.02 ms |
| 128 | 426 img/s | 309.26 ms | 303.38 ms | 305.13 ms |
| 256 | 415 img/s | 635.98 ms | 620.16 ms | 625.21 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 112 img/s | 9.25 ms | 9.87 ms | 10.62 ms |
| 2 | 223 img/s | 9.4 ms | 10.62 ms | 13.9 ms |
| 4 | 468 img/s | 9.06 ms | 11.15 ms | 15.5 ms |
| 8 | 844 img/s | 10.05 ms | 12.67 ms | 17.86 ms |
| 16 | 1037 img/s | 16.01 ms | 15.66 ms | 15.86 ms |
| 32 | 1103 img/s | 30.27 ms | 29.45 ms | 29.74 ms |
| 64 | 1154 img/s | 57.96 ms | 56.33 ms | 56.96 ms |
| 128 | 1177 img/s | 114.95 ms | 110.4 ms | 111.1 ms |
| 256 | 1184 img/s | 229.61 ms | 217.84 ms | 224.75 ms |
| 1 | 79 img/s | 12.96 ms | 15.47 ms | 20.0 ms |
| 2 | 156 img/s | 13.18 ms | 14.9 ms | 18.73 ms |
| 4 | 317 img/s | 12.99 ms | 14.69 ms | 19.05 ms |
| 8 | 652 img/s | 12.82 ms | 16.04 ms | 19.43 ms |
| 16 | 1050 img/s | 15.8 ms | 16.57 ms | 20.62 ms |
| 32 | 1128 img/s | 29.54 ms | 28.79 ms | 28.97 ms |
| 64 | 1165 img/s | 57.41 ms | 55.67 ms | 56.11 ms |
| 128 | 1190 img/s | 114.24 ms | 109.17 ms | 110.41 ms |
| 256 | 1198 img/s | 225.95 ms | 215.28 ms | 222.94 ms |
## Release notes
@@ -701,6 +706,7 @@ The following images show a 250 epochs configuration on a DGX-1V.
* Updated README
6. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.


@@ -190,7 +190,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* [PyTorch 21.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@@ -516,7 +516,7 @@ To benchmark inference, run:
* TF32 (A100 GPUs only)
-`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
+`python ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
@@ -526,12 +526,12 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
-Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
+#### Training accuracy results
+Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
-#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
@@ -560,62 +560,70 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1169 img/s | 420 img/s | 2.77 x | 1.0 x | ~29 hours | 1.0 x | ~80 hours |
| 8 | 7399 img/s | 3193 img/s | 2.31 x | 6.32 x | ~5 hours | 7.58 x | ~11 hours |
| **GPUs** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 456 img/s | 1211 img/s | 2.65 x | 1.0 x | 1.0 x | ~28 hours | ~74 hours |
| 8 | 3471 img/s | 7925 img/s | 2.28 x | 7.6 x | 6.54 x | ~5 hours | ~10 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 578 img/s | 149 img/s | 3.86 x | 1.0 x | ~59 hours | 1.0 x | ~225 hours |
| 8 | 3742 img/s | 1117 img/s | 3.34 x | 6.46 x | ~9 hours | 7.45 x | ~31 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 147 img/s | 587 img/s | 3.97 x | 1.0 x | 1.0 x | ~58 hours | ~228 hours |
| 8 | 1133 img/s | 4065 img/s | 3.58 x | 7.65 x | 6.91 x | ~9 hours | ~30 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 556 img/s | 151 img/s | 3.68 x | 1.0 x | ~61 hours | 1.0 x | ~223 hours |
| 8 | 3595 img/s | 1102 img/s | 3.26 x | 6.45 x | ~10 hours | 7.28 x | ~31 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 144 img/s | 565 img/s | 3.9 x | 1.0 x | 1.0 x | ~60 hours | ~233 hours |
| 8 | 1108 img/s | 3863 img/s | 3.48 x | 7.66 x | 6.83 x | ~9 hours | ~31 hours |
#### Inference performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
###### FP32 Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 55 img/s | 18.48 ms | 18.88 ms | 20.74 ms |
| 2 | 116 img/s | 17.54 ms | 18.15 ms | 21.32 ms |
| 4 | 214 img/s | 19.07 ms | 20.44 ms | 22.69 ms |
| 8 | 291 img/s | 27.8 ms | 27.99 ms | 28.47 ms |
| 16 | 354 img/s | 45.78 ms | 45.4 ms | 45.73 ms |
| 32 | 423 img/s | 77.13 ms | 75.96 ms | 76.21 ms |
| 64 | 486 img/s | 134.92 ms | 132.17 ms | 132.51 ms |
| 128 | 523 img/s | 252.11 ms | 244.5 ms | 244.99 ms |
| 256 | 530 img/s | 499.64 ms | 479.83 ms | 481.41 ms |
| 1 | 55 img/s | 17.95 ms | 20.61 ms | 22.0 ms |
| 2 | 105 img/s | 19.2 ms | 20.74 ms | 22.77 ms |
| 4 | 170 img/s | 23.65 ms | 24.66 ms | 28.0 ms |
| 8 | 336 img/s | 24.05 ms | 24.92 ms | 27.75 ms |
| 16 | 397 img/s | 40.77 ms | 40.44 ms | 40.65 ms |
| 32 | 452 img/s | 72.12 ms | 71.1 ms | 71.35 ms |
| 64 | 500 img/s | 130.9 ms | 128.19 ms | 128.64 ms |
| 128 | 527 img/s | 249.57 ms | 242.77 ms | 243.63 ms |
| 256 | 533 img/s | 496.76 ms | 478.04 ms | 480.42 ms |
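Tables like the one above are also useful for picking a serving batch size: beyond a certain point throughput saturates while latency keeps growing roughly linearly. A small sketch that picks the smallest batch size within a few percent of peak throughput; the helper name and threshold are ours, and the rows are copied from the FP32 table above:

```python
# Pick the smallest batch size whose throughput is within `frac` of the peak.
def smallest_near_peak(rows, frac=0.95):
    # rows: (batch_size, throughput_img_per_s) pairs copied from a latency table
    peak = max(t for _, t in rows)
    return min(bs for bs, t in rows if t >= frac * peak)

# resnext101-32x4d FP32 rows (batch size, img/s) from the table above.
rows = [(1, 55), (2, 105), (4, 170), (8, 336), (16, 397), (32, 452), (64, 500), (128, 527), (256, 533)]
print(smallest_near_peak(rows))  # 128: near-peak throughput at roughly half the batch-256 latency
```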
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 40 img/s | 25.17 ms | 28.4 ms | 30.66 ms |
| 2 | 89 img/s | 22.64 ms | 24.29 ms | 25.99 ms |
| 4 | 165 img/s | 24.54 ms | 26.23 ms | 28.61 ms |
| 8 | 334 img/s | 24.31 ms | 28.46 ms | 29.91 ms |
| 16 | 632 img/s | 25.8 ms | 27.76 ms | 29.53 ms |
| 32 | 1219 img/s | 27.35 ms | 29.86 ms | 31.6 ms |
| 64 | 1525 img/s | 43.97 ms | 42.01 ms | 42.96 ms |
| 128 | 1647 img/s | 82.22 ms | 77.65 ms | 79.56 ms |
| 256 | 1689 img/s | 161.53 ms | 151.25 ms | 152.01 ms |
| 1 | 43 img/s | 23.08 ms | 24.18 ms | 27.82 ms |
| 2 | 84 img/s | 23.65 ms | 24.64 ms | 27.87 ms |
| 4 | 164 img/s | 24.38 ms | 27.33 ms | 27.95 ms |
| 8 | 333 img/s | 24.18 ms | 25.92 ms | 28.3 ms |
| 16 | 640 img/s | 25.4 ms | 26.53 ms | 29.47 ms |
| 32 | 1195 img/s | 27.72 ms | 29.9 ms | 32.19 ms |
| 64 | 1595 img/s | 41.89 ms | 40.15 ms | 41.08 ms |
| 128 | 1699 img/s | 79.45 ms | 75.65 ms | 76.08 ms |
| 256 | 1746 img/s | 154.68 ms | 145.76 ms | 146.52 ms |
##### Inference performance: NVIDIA T4
@@ -624,30 +632,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 79 img/s | 13.07 ms | 14.66 ms | 15.59 ms |
| 2 | 119 img/s | 17.21 ms | 18.07 ms | 19.78 ms |
| 4 | 141 img/s | 28.65 ms | 28.62 ms | 28.77 ms |
| 8 | 139 img/s | 57.84 ms | 58.29 ms | 58.62 ms |
| 16 | 153 img/s | 104.8 ms | 105.65 ms | 106.2 ms |
| 32 | 178 img/s | 181.24 ms | 180.96 ms | 181.57 ms |
| 64 | 179 img/s | 360.93 ms | 358.22 ms | 359.11 ms |
| 128 | 177 img/s | 735.99 ms | 726.15 ms | 727.81 ms |
| 256 | 167 img/s | 1561.91 ms | 1523.52 ms | 1525.96 ms |
| 1 | 56 img/s | 18.18 ms | 20.45 ms | 24.58 ms |
| 2 | 109 img/s | 18.77 ms | 21.53 ms | 26.21 ms |
| 4 | 151 img/s | 26.89 ms | 27.81 ms | 30.94 ms |
| 8 | 164 img/s | 48.99 ms | 49.44 ms | 49.91 ms |
| 16 | 172 img/s | 93.51 ms | 93.73 ms | 94.16 ms |
| 32 | 180 img/s | 178.83 ms | 178.41 ms | 179.07 ms |
| 64 | 178 img/s | 361.95 ms | 360.7 ms | 362.32 ms |
| 128 | 172 img/s | 756.93 ms | 750.21 ms | 752.45 ms |
| 256 | 161 img/s | 1615.79 ms | 1580.61 ms | 1583.43 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 65 img/s | 15.69 ms | 16.95 ms | 17.97 ms |
| 2 | 126 img/s | 16.2 ms | 16.78 ms | 18.6 ms |
| 4 | 245 img/s | 16.77 ms | 18.35 ms | 25.88 ms |
| 8 | 488 img/s | 16.82 ms | 17.86 ms | 25.45 ms |
| 16 | 541 img/s | 30.16 ms | 29.95 ms | 30.18 ms |
| 32 | 566 img/s | 57.79 ms | 57.11 ms | 57.29 ms |
| 64 | 580 img/s | 112.84 ms | 111.07 ms | 111.56 ms |
| 128 | 586 img/s | 224.75 ms | 219.12 ms | 219.64 ms |
| 256 | 589 img/s | 447.25 ms | 434.18 ms | 439.22 ms |
| 1 | 44 img/s | 23.0 ms | 25.77 ms | 29.41 ms |
| 2 | 87 img/s | 23.14 ms | 26.55 ms | 30.97 ms |
| 4 | 178 img/s | 22.8 ms | 24.2 ms | 29.38 ms |
| 8 | 371 img/s | 21.98 ms | 25.34 ms | 29.61 ms |
| 16 | 553 img/s | 29.47 ms | 29.52 ms | 31.14 ms |
| 32 | 578 img/s | 56.56 ms | 56.04 ms | 56.37 ms |
| 64 | 591 img/s | 110.82 ms | 109.37 ms | 109.83 ms |
| 128 | 597 img/s | 220.44 ms | 215.33 ms | 216.3 ms |
| 256 | 598 img/s | 439.3 ms | 428.2 ms | 431.46 ms |
## Release notes

View file

@ -191,7 +191,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 21.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -518,7 +518,7 @@ To benchmark inference, run:
* TF32 (A100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
`python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
@ -528,12 +528,12 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
#### Training accuracy results
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
@ -562,63 +562,70 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 804 img/s | 360 img/s | 2.22 x | 1.0 x | ~42 hours | 1.0 x | ~94 hours |
| 8 | 5248 img/s | 2665 img/s | 1.96 x | 6.52 x | ~7 hours | 7.38 x | ~13 hours |
| **GPUs** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 395 img/s | 855 img/s | 2.16 x | 1.0 x | 1.0 x | ~40 hours | ~86 hours |
| 8 | 2991 img/s | 5779 img/s | 1.93 x | 7.56 x | 6.75 x | ~6 hours | ~12 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:---------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 430 img/s | 133 img/s | 3.21 x | 1.0 x | ~79 hours | 1.0 x | ~252 hours |
| 8 | 2716 img/s | 994 img/s | 2.73 x | 6.31 x | ~13 hours | 7.42 x | ~34 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 132 img/s | 443 img/s | 3.34 x | 1.0 x | 1.0 x | ~76 hours | ~254 hours |
| 8 | 1004 img/s | 2971 img/s | 2.95 x | 7.57 x | 6.7 x | ~12 hours | ~34 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 413 img/s | 134 img/s | 3.08 x | 1.0 x | ~82 hours | 1.0 x | ~251 hours |
| 8 | 2572 img/s | 1011 img/s | 2.54 x | 6.22 x | ~14 hours | 7.54 x | ~34 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 130 img/s | 427 img/s | 3.26 x | 1.0 x | 1.0 x | ~79 hours | ~257 hours |
| 8 | 992 img/s | 2925 img/s | 2.94 x | 7.58 x | 6.84 x | ~12 hours | ~34 hours |
#### Inference performance results
Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
###### FP32 Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 37 img/s | 26.81 ms | 27.89 ms | 31.44 ms |
| 2 | 75 img/s | 27.01 ms | 28.89 ms | 31.17 ms |
| 4 | 144 img/s | 28.09 ms | 30.14 ms | 32.47 ms |
| 8 | 259 img/s | 31.23 ms | 33.65 ms | 38.4 ms |
| 16 | 332 img/s | 48.7 ms | 48.35 ms | 48.8 ms |
| 32 | 394 img/s | 83.02 ms | 81.55 ms | 81.9 ms |
| 64 | 471 img/s | 138.88 ms | 136.24 ms | 136.54 ms |
| 128 | 505 img/s | 261.4 ms | 253.07 ms | 254.29 ms |
| 256 | 513 img/s | 516.66 ms | 496.06 ms | 497.05 ms |
| 1 | 40 img/s | 24.92 ms | 26.78 ms | 31.12 ms |
| 2 | 80 img/s | 24.89 ms | 27.63 ms | 30.81 ms |
| 4 | 127 img/s | 31.58 ms | 35.92 ms | 39.64 ms |
| 8 | 250 img/s | 32.29 ms | 34.5 ms | 38.14 ms |
| 16 | 363 img/s | 44.5 ms | 44.16 ms | 44.37 ms |
| 32 | 423 img/s | 76.86 ms | 75.89 ms | 76.17 ms |
| 64 | 472 img/s | 138.36 ms | 135.85 ms | 136.52 ms |
| 128 | 501 img/s | 262.64 ms | 255.48 ms | 256.02 ms |
| 256 | 508 img/s | 519.84 ms | 500.71 ms | 501.5 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 29 img/s | 34.24 ms | 36.67 ms | 39.4 ms |
| 2 | 53 img/s | 37.81 ms | 43.03 ms | 45.1 ms |
| 4 | 103 img/s | 39.1 ms | 43.05 ms | 46.16 ms |
| 8 | 226 img/s | 35.66 ms | 38.39 ms | 41.13 ms |
| 16 | 458 img/s | 35.4 ms | 37.38 ms | 39.97 ms |
| 32 | 882 img/s | 37.37 ms | 40.12 ms | 42.64 ms |
| 64 | 1356 img/s | 49.31 ms | 47.21 ms | 49.87 ms |
| 112 | 1448 img/s | 81.27 ms | 77.35 ms | 78.28 ms |
| 128 | 1486 img/s | 90.59 ms | 86.15 ms | 87.04 ms |
| 256 | 1534 img/s | 176.72 ms | 166.2 ms | 167.53 ms |
| 1 | 29 img/s | 33.83 ms | 39.1 ms | 41.57 ms |
| 2 | 58 img/s | 34.35 ms | 36.92 ms | 41.66 ms |
| 4 | 117 img/s | 34.33 ms | 38.67 ms | 41.05 ms |
| 8 | 232 img/s | 34.66 ms | 39.51 ms | 42.16 ms |
| 16 | 459 img/s | 35.23 ms | 36.77 ms | 38.11 ms |
| 32 | 871 img/s | 37.62 ms | 39.36 ms | 41.26 ms |
| 64 | 1416 img/s | 46.95 ms | 45.26 ms | 47.48 ms |
| 128 | 1533 img/s | 87.49 ms | 83.54 ms | 83.75 ms |
| 256 | 1576 img/s | 170.79 ms | 161.97 ms | 162.93 ms |
##### Inference performance: NVIDIA T4
@ -627,30 +634,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 52 img/s | 19.39 ms | 20.39 ms | 21.18 ms |
| 2 | 102 img/s | 19.98 ms | 21.4 ms | 23.75 ms |
| 4 | 134 img/s | 30.12 ms | 30.14 ms | 30.54 ms |
| 8 | 136 img/s | 59.07 ms | 60.63 ms | 61.49 ms |
| 16 | 154 img/s | 104.38 ms | 105.21 ms | 105.81 ms |
| 32 | 169 img/s | 190.12 ms | 189.64 ms | 190.24 ms |
| 64 | 171 img/s | 376.19 ms | 374.16 ms | 375.6 ms |
| 128 | 168 img/s | 771.4 ms | 761.64 ms | 764.7 ms |
| 256 | 159 img/s | 1639.15 ms | 1603.45 ms | 1605.47 ms |
| 1 | 40 img/s | 25.12 ms | 28.83 ms | 31.59 ms |
| 2 | 75 img/s | 26.82 ms | 30.54 ms | 33.13 ms |
| 4 | 136 img/s | 29.79 ms | 33.33 ms | 37.65 ms |
| 8 | 155 img/s | 51.74 ms | 52.57 ms | 53.12 ms |
| 16 | 164 img/s | 97.99 ms | 98.76 ms | 99.21 ms |
| 32 | 173 img/s | 186.31 ms | 186.43 ms | 187.4 ms |
| 64 | 171 img/s | 378.1 ms | 377.19 ms | 378.82 ms |
| 128 | 165 img/s | 785.83 ms | 778.23 ms | 782.64 ms |
| 256 | 158 img/s | 1641.96 ms | 1601.74 ms | 1614.52 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 42 img/s | 24.17 ms | 27.26 ms | 29.98 ms |
| 2 | 87 img/s | 23.24 ms | 24.66 ms | 26.77 ms |
| 4 | 170 img/s | 23.87 ms | 24.89 ms | 29.59 ms |
| 8 | 334 img/s | 24.49 ms | 27.92 ms | 35.66 ms |
| 16 | 472 img/s | 34.45 ms | 34.29 ms | 35.72 ms |
| 32 | 502 img/s | 64.93 ms | 64.47 ms | 65.16 ms |
| 64 | 517 img/s | 126.24 ms | 125.03 ms | 125.86 ms |
| 128 | 522 img/s | 250.99 ms | 245.87 ms | 247.1 ms |
| 256 | 523 img/s | 502.41 ms | 487.58 ms | 489.69 ms |
| 1 | 31 img/s | 32.51 ms | 37.26 ms | 39.53 ms |
| 2 | 61 img/s | 32.76 ms | 37.61 ms | 39.62 ms |
| 4 | 123 img/s | 32.98 ms | 38.97 ms | 42.66 ms |
| 8 | 262 img/s | 31.01 ms | 36.3 ms | 39.11 ms |
| 16 | 482 img/s | 33.76 ms | 34.54 ms | 38.5 ms |
| 32 | 512 img/s | 63.68 ms | 63.29 ms | 63.73 ms |
| 64 | 527 img/s | 123.57 ms | 122.69 ms | 123.56 ms |
| 128 | 525 img/s | 248.97 ms | 245.39 ms | 246.66 ms |
| 256 | 527 img/s | 496.23 ms | 485.68 ms | 488.3 ms |
## Release notes

View file

@ -0,0 +1,133 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
Using the `calculate_metrics.py` script, you can obtain model accuracy/error metrics using a user-defined `MetricsCalculator` class.
Data provided to `MetricsCalculator` are obtained from npz dump files
stored in the directory pointed to by the `--dump-dir` argument.
These files are prepared by the `run_inference_on_fw.py` and `run_inference_on_triton.py` scripts.
Output data is stored in the CSV file pointed to by the `--csv` argument.
Example call:
```shell script
python ./triton/calculate_metrics.py \
--dump-dir /results/dump_triton \
--csv /results/accuracy_results.csv \
--metrics metrics.py \
--metric-class-param1 value
```
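A minimal `metrics.py` for a classification model could look like the sketch below.
This is an illustrative assumption rather than code from the repository; the import path of
`BaseMetricsCalculator` and the output name `OUTPUT__0` depend on your deployment layout,
and constructor arguments are exposed as CLI flags by `ArgParserGenerator`:
```python
import numpy as np

from deployment_toolkit.core import BaseMetricsCalculator  # assumed import path


class MetricsCalculator(BaseMetricsCalculator):
    def __init__(self, output_used_for_metrics: str = "OUTPUT__0"):
        # exposed on the CLI as --output-used-for-metrics
        self._output = output_used_for_metrics

    def calc(self, *, ids, x, y_pred, y_real) -> dict:
        # labels are one-hot encoded, predictions are raw model outputs
        y_true = np.argmax(y_real[self._output], axis=-1)
        y_top1 = np.argmax(y_pred[self._output], axis=-1)
        return {"top1_accuracy": float((y_true == y_top1).mean())}
```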
"""
import argparse
import csv
import logging
import string
from pathlib import Path
import numpy as np
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import BaseMetricsCalculator, load_from_file
from .deployment_toolkit.dump import pad_except_batch_axis
LOGGER = logging.getLogger("calculate_metrics")
TOTAL_COLUMN_NAME = "_total_"
def get_data(dump_dir, prefix):
"""Loads and concatenates dump files for given prefix (ex. inputs, outputs, labels, ids)"""
dump_dir = Path(dump_dir)
npz_files = sorted(dump_dir.glob(f"{prefix}*.npz"))
data = None
if npz_files:
# assume that all npz files with given prefix contain same set of names
names = list(np.load(npz_files[0].as_posix()).keys())
# calculate target shape
target_shape = {
name: tuple(np.max([np.load(npz_file.as_posix())[name].shape for npz_file in npz_files], axis=0))
for name in names
}
# pad and concatenate data
data = {
name: np.concatenate(
[pad_except_batch_axis(np.load(npz_file.as_posix())[name], target_shape[name]) for npz_file in npz_files]
)
for name in names
}
return data
def main():
logging.basicConfig(level=logging.INFO)
parser = argparse.ArgumentParser(description="Run models with given dataloader", allow_abbrev=False)
parser.add_argument("--metrics", help=f"Path to python module containing metrics calculator", required=True)
parser.add_argument("--csv", help="Path to csv file", required=True)
parser.add_argument("--dump-dir", help="Path to directory with dumped outputs (and labels)", required=True)
args, *_ = parser.parse_known_args()
MetricsCalculator = load_from_file(args.metrics, "metrics", "MetricsCalculator")
ArgParserGenerator(MetricsCalculator).update_argparser(parser)
args = parser.parse_args()
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
MetricsCalculator = load_from_file(args.metrics, "metrics", "MetricsCalculator")
metrics_calculator: BaseMetricsCalculator = ArgParserGenerator(MetricsCalculator).from_args(args)
ids = get_data(args.dump_dir, "ids")["ids"]
x = get_data(args.dump_dir, "inputs")
y_true = get_data(args.dump_dir, "labels")
y_pred = get_data(args.dump_dir, "outputs")
common_keys = list({k for k in (y_true or [])} & {k for k in (y_pred or [])})
for key in common_keys:
if y_true[key].shape != y_pred[key].shape:
LOGGER.warning(
f"Model predictions and labels shall have equal shapes. "
f"y_pred[{key}].shape={y_pred[key].shape} != "
f"y_true[{key}].shape={y_true[key].shape}"
)
metrics = metrics_calculator.calc(ids=ids, x=x, y_pred=y_pred, y_real=y_true)
metrics = {TOTAL_COLUMN_NAME: len(ids), **metrics}
metric_names_with_space = [name for name in metrics if any([c in string.whitespace for c in name])]
if metric_names_with_space:
raise ValueError(f"Metric names shall have no spaces; Incorrect names: {', '.join(metric_names_with_space)}")
csv_path = Path(args.csv)
csv_path.parent.mkdir(parents=True, exist_ok=True)
with csv_path.open("w") as csv_file:
writer = csv.DictWriter(csv_file, fieldnames=list(metrics.keys()))
writer.writeheader()
writer.writerow(metrics)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,202 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
To configure a model on Triton, you can use the `config_model_on_triton.py` script.
This will prepare the layout of the Model Repository, including the Model Configuration.
```shell script
python ./triton/config_model_on_triton.py \
--model-repository /model_repository \
--model-path /models/exported/model.onnx \
--model-format onnx \
--model-name ResNet50 \
--model-version 1 \
--max-batch-size 32 \
--precision fp16 \
--backend-accelerator trt \
--load-model explicit \
--timeout 120 \
--verbose
```
If the Triton server for which the model repository is prepared is running in **explicit model control mode**,
use the `--load-model` argument to send a load_model request to the Triton Inference Server.
If the server is listening on a non-default address or port, use the `--server-url` argument to point to the server control endpoint.
If the HTTP protocol is required to communicate with the Triton server, use the `--http` argument.
To improve inference throughput, you can enable
[dynamic batching](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md#dynamic-batcher)
for your model by providing the `--preferred-batch-sizes` and `--max-queue-delay-us` parameters, as in the example below.
For models which don't support batching, set `--max-batch-size` to 0.
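For example, a call that enables dynamic batching and loads the model explicitly could look like
the sketch below (paths, the model name, and the server URL are placeholders for your setup):
```shell script
python ./triton/config_model_on_triton.py \
    --model-repository /model_repository \
    --model-path /models/exported/model.onnx \
    --model-format onnx \
    --model-name ResNet50 \
    --max-batch-size 32 \
    --preferred-batch-sizes 16 32 \
    --max-queue-delay-us 100 \
    --load-model explicit \
    --server-url grpc://localhost:8001
```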
By default, Triton will [automatically obtain input and output definitions](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md#auto-generated-model-configuration),
but for TorchScript and TF GraphDef models the script uses a file with I/O specs. This file is automatically generated
when the model is converted to ScriptModule (either traced or scripted).
If you need to pass a non-default path to the I/O spec file, use the `--io-spec` CLI argument.
The I/O spec file is a YAML file with the following structure:
```yaml
- inputs:
- name: input
dtype: float32 # np.dtype name
shape: [None, 224, 224, 3]
- outputs:
- name: probabilities
dtype: float32
shape: [None, 1001]
- name: classes
dtype: int32
shape: [None, 1]
```
"""
import argparse
import logging
import time
from model_navigator import Accelerator, Format, Precision
from model_navigator.args import str2bool
from model_navigator.log import set_logger, log_dict
from model_navigator.triton import ModelConfig, TritonClient, TritonModelStore
LOGGER = logging.getLogger("config_model")
def _available_enum_values(my_enum):
return [item.value for item in my_enum]
def main():
parser = argparse.ArgumentParser(
description="Create Triton model repository and model configuration", allow_abbrev=False
)
parser.add_argument("--model-repository", required=True, help="Path to Triton model repository.")
parser.add_argument("--model-path", required=True, help="Path to model to configure")
# TODO: automation
parser.add_argument(
"--model-format",
required=True,
choices=_available_enum_values(Format),
help="Format of model to deploy",
)
parser.add_argument("--model-name", required=True, help="Model name")
parser.add_argument("--model-version", default="1", help="Version of model (default 1)")
parser.add_argument(
"--max-batch-size",
type=int,
default=32,
help="Maximum batch size allowed for inference. "
"A max_batch_size value of 0 indicates that batching is not allowed for the model",
)
# TODO: automation
parser.add_argument(
"--precision",
type=str,
default=Precision.FP16.value,
choices=_available_enum_values(Precision),
help="Model precision (parameter used only by Tensorflow backend with TensorRT optimization)",
)
# Triton Inference Server endpoint
parser.add_argument(
"--server-url",
type=str,
default="grpc://localhost:8001",
help="Inference server URL in format protocol://host[:port] (default grpc://localhost:8001)",
)
parser.add_argument(
"--load-model",
choices=["none", "poll", "explicit"],
help="Loading model while Triton Server is in given model control mode",
)
parser.add_argument(
"--timeout", default=120, help="Timeout in seconds to wait till model load (default=120)", type=int
)
# optimization related
parser.add_argument(
"--backend-accelerator",
type=str,
choices=_available_enum_values(Accelerator),
default=Accelerator.TRT.value,
help="Select Backend Accelerator used to serve model",
)
parser.add_argument("--number-of-model-instances", type=int, default=1, help="Number of model instances per GPU")
parser.add_argument(
"--preferred-batch-sizes",
type=int,
nargs="*",
help="Batch sizes that the dynamic batcher should attempt to create. "
"In case --max-queue-delay-us is set and this parameter is not, default value will be --max-batch-size",
)
parser.add_argument(
"--max-queue-delay-us",
type=int,
default=0,
help="Max delay time which dynamic batcher shall wait to form a batch (default 0)",
)
parser.add_argument(
"--capture-cuda-graph",
type=int,
default=0,
help="Use cuda capture graph (used only by TensorRT platform)",
)
parser.add_argument("-v", "--verbose", help="Provide verbose logs", type=str2bool, default=False)
args = parser.parse_args()
set_logger(verbose=args.verbose)
log_dict("args", vars(args))
config = ModelConfig.create(
model_path=args.model_path,
# model definition
model_name=args.model_name,
model_version=args.model_version,
model_format=args.model_format,
precision=args.precision,
max_batch_size=args.max_batch_size,
# optimization
accelerator=args.backend_accelerator,
gpu_engine_count=args.number_of_model_instances,
preferred_batch_sizes=args.preferred_batch_sizes or [],
max_queue_delay_us=args.max_queue_delay_us,
capture_cuda_graph=args.capture_cuda_graph,
)
model_store = TritonModelStore(args.model_repository)
model_store.deploy_model(model_config=config, model_path=args.model_path)
if args.load_model != "none":
client = TritonClient(server_url=args.server_url, verbose=args.verbose)
client.wait_for_server_ready(timeout=args.timeout)
if args.load_model == "explicit":
client.load_model(model_name=args.model_name)
if args.load_model == "poll":
time.sleep(15)
client.wait_for_model(model_name=args.model_name, model_version=args.model_version, timeout_s=args.timeout)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,166 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
The `convert_model.py` script converts between model formats, with additional model optimizations
for faster inference.
For Python module inputs, it converts the model returned by the `get_model` function.
Currently supported input and output formats are:
- inputs
- `tf-estimator` - `get_model` function returning Tensorflow Estimator
- `tf-keras` - `get_model` function returning Tensorflow Keras Model
- `tf-savedmodel` - Tensorflow SavedModel binary
- `pyt` - `get_model` function returning PyTorch Module
- output
- `tf-savedmodel` - Tensorflow saved model
- `tf-trt` - TF-TRT saved model
- `ts-trace` - PyTorch traced ScriptModule
- `ts-script` - PyTorch scripted ScriptModule
- `onnx` - ONNX
- `trt` - TensorRT plan file
For tf-keras input you can use:
- the `--large-model` flag, which helps load models that exceed the maximum protobuf size of 2GB
- the `--tf-allow-growth` flag, which controls the GPU memory growth limiting feature
(https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth). By default it is disabled.
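An example conversion of a PyTorch model defined in a Python module to ONNX could look like the
sketch below (paths are placeholders; additional model- and dataloader-specific flags, for example
`--precision`, are added automatically from the `get_model` and dataloader function signatures):
```shell script
python ./triton/convert_model.py \
    --input-path ./triton/model.py \
    --input-type pyt \
    --output-path ./model.onnx \
    --output-type onnx \
    --dataloader ./triton/dataloader.py
```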
"""
import argparse
import logging
import os
from pathlib import Path
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["TF_ENABLE_DEPRECATION_WARNINGS"] = "1"
# method from PEP-366 to support relative import in executed modules
if __name__ == "__main__" and __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import (
DATALOADER_FN_NAME,
BaseConverter,
BaseLoader,
BaseSaver,
Format,
Precision,
load_from_file,
)
from .deployment_toolkit.extensions import converters, loaders, savers
LOGGER = logging.getLogger("convert_model")
INPUT_MODEL_TYPES = [Format.TF_ESTIMATOR, Format.TF_KERAS, Format.TF_SAVEDMODEL, Format.PYT]
OUTPUT_MODEL_TYPES = [Format.TF_SAVEDMODEL, Format.TF_TRT, Format.ONNX, Format.TRT, Format.TS_TRACE, Format.TS_SCRIPT]
def _get_args():
parser = argparse.ArgumentParser(description="Script for conversion between model formats.", allow_abbrev=False)
parser.add_argument("--input-path", help="Path to input model file (python module or binary file)", required=True)
parser.add_argument(
"--input-type", help="Input model type", choices=[f.value for f in INPUT_MODEL_TYPES], required=True
)
parser.add_argument("--output-path", help="Path to output model file", required=True)
parser.add_argument(
"--output-type", help="Output model type", choices=[f.value for f in OUTPUT_MODEL_TYPES], required=True
)
parser.add_argument("--dataloader", help="Path to python module containing data loader")
parser.add_argument("-v", "--verbose", help="Verbose logs", action="store_true", default=False)
parser.add_argument(
"--ignore-unknown-parameters",
help="Ignore unknown parameters (argument often used in CI where set of arguments is constant)",
action="store_true",
default=False,
)
args, unparsed_args = parser.parse_known_args()
Loader: BaseLoader = loaders.get(args.input_type)
ArgParserGenerator(Loader, module_path=args.input_path).update_argparser(parser)
converter_name = f"{args.input_type}--{args.output_type}"
Converter: BaseConverter = converters.get(converter_name)
if Converter is not None:
ArgParserGenerator(Converter).update_argparser(parser)
Saver: BaseSaver = savers.get(args.output_type)
ArgParserGenerator(Saver).update_argparser(parser)
if args.dataloader is not None:
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
ArgParserGenerator(get_dataloader_fn).update_argparser(parser)
if args.ignore_unknown_parameters:
args, unknown_args = parser.parse_known_args()
LOGGER.warning(f"Got additional args {unknown_args}")
else:
args = parser.parse_args()
return args
def main():
args = _get_args()
log_level = logging.INFO if not args.verbose else logging.DEBUG
log_format = "%(asctime)s %(levelname)s %(name)s %(message)s"
logging.basicConfig(level=log_level, format=log_format)
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
requested_model_precision = Precision(args.precision)
dataloader_fn = None
# if conversion is required, temporary change model load precision to that required by converter
# it is for TensorRT converters which require fp32 models for all requested precisions
converter_name = f"{args.input_type}--{args.output_type}"
Converter: BaseConverter = converters.get(converter_name)
if Converter:
args.precision = Converter.required_source_model_precision(requested_model_precision).value
Loader: BaseLoader = loaders.get(args.input_type)
loader = ArgParserGenerator(Loader, module_path=args.input_path).from_args(args)
model = loader.load(args.input_path)
LOGGER.info("inputs: %s", model.inputs)
LOGGER.info("outputs: %s", model.outputs)
if Converter: # if conversion is needed
# dataloader must match source model precision - so not recovering it yet
if args.dataloader is not None:
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
dataloader_fn = ArgParserGenerator(get_dataloader_fn).from_args(args)
# recover precision to that requested by user
args.precision = requested_model_precision.value
if Converter:
converter = ArgParserGenerator(Converter).from_args(args)
model = converter.convert(model, dataloader_fn=dataloader_fn)
Saver: BaseSaver = savers.get(args.output_type)
saver = ArgParserGenerator(Saver).from_args(args)
saver.save(model, args.output_path)
return 0
if __name__ == "__main__":
main()

View file

@ -0,0 +1,49 @@
import logging
from pathlib import Path
import numpy as np
from PIL import Image
LOGGER = logging.getLogger(__name__)
def get_dataloader_fn(
*, data_dir: str, batch_size: int = 1, width: int = 224, height: int = 224, images_num: int = None,
precision: str = "fp32", classes: int = 1000
):
def _dataloader():
image_extensions = [".gif", ".png", ".jpeg", ".jpg"]
image_paths = sorted([p for p in Path(data_dir).rglob("*") if p.suffix.lower() in image_extensions])
if images_num is not None:
image_paths = image_paths[:images_num]
LOGGER.info(
f"Creating PIL dataloader on data_dir={data_dir} #images={len(image_paths)} "
f"image_size=({width}, {height}) batch_size={batch_size}"
)
onehot = np.eye(classes)
batch = []
for image_path in image_paths:
img = Image.open(image_path.as_posix()).convert("RGB")
img = img.resize((width, height))
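# scale to [0, 1] and normalize with the standard ImageNet mean/std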
img = (np.array(img).astype(np.float32) / 255) - np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3)
img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
true_class = np.array([int(image_path.parent.name)])
assert tuple(img.shape) == (height, width, 3)
img = img[np.newaxis, ...]
batch.append((img, image_path.as_posix(), true_class))
if len(batch) >= batch_size:
ids = [image_path for _, image_path, *_ in batch]
x = {"INPUT__0": np.ascontiguousarray(
np.transpose(np.concatenate([img for img, *_ in batch]),
(0, 3, 1, 2)).astype(np.float32 if precision == "fp32" else np.float16))}
y_real = {"OUTPUT__0": onehot[np.concatenate([class_ for *_, class_ in batch])].astype(
np.float32 if precision == "fp32" else np.float16
)}
batch = []
yield ids, x, y_real
return _dataloader

View file

@ -0,0 +1 @@
0.5.0-2-gd556907

View file

@ -0,0 +1,13 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View file

@ -0,0 +1,124 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import inspect
import logging
from typing import Any, Callable, Dict, Optional, Union
from .core import GET_ARGPARSER_FN_NAME, load_from_file
LOGGER = logging.getLogger(__name__)
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ("yes", "true", "t", "y", "1"):
return True
elif v.lower() in ("no", "false", "f", "n", "0"):
return False
else:
raise argparse.ArgumentTypeError("Boolean value expected.")
def filter_fn_args(args: Union[dict, argparse.Namespace], fn: Callable) -> dict:
signature = inspect.signature(fn)
parameters_names = list(signature.parameters)
if isinstance(args, argparse.Namespace):
args = vars(args)
args = {k: v for k, v in args.items() if k in parameters_names}
return args
def add_args_for_fn_signature(parser, fn) -> argparse.ArgumentParser:
parser.conflict_handler = "resolve"
signature = inspect.signature(fn)
for parameter in signature.parameters.values():
if parameter.name in ["self", "args", "kwargs"]:
continue
argument_kwargs = {}
if parameter.annotation != inspect.Parameter.empty:
if parameter.annotation == bool:
argument_kwargs["type"] = str2bool
argument_kwargs["choices"] = [0, 1]
elif isinstance(parameter.annotation, type(Optional[Any])):
types = [type_ for type_ in parameter.annotation.__args__ if not isinstance(None, type_)]
if len(types) != 1:
raise RuntimeError(
f"Could not prepare argument parser for {parameter.name}: {parameter.annotation} in {fn}"
)
argument_kwargs["type"] = types[0]
else:
argument_kwargs["type"] = parameter.annotation
if parameter.default != inspect.Parameter.empty:
if parameter.annotation == bool:
argument_kwargs["default"] = str2bool(parameter.default)
else:
argument_kwargs["default"] = parameter.default
else:
argument_kwargs["required"] = True
name = parameter.name.replace("_", "-")
LOGGER.debug(f"Adding argument {name} with {argument_kwargs}")
parser.add_argument(f"--{name}", **argument_kwargs)
return parser
class ArgParserGenerator:
def __init__(self, cls_or_fn, module_path: Optional[str] = None):
self._cls_or_fn = cls_or_fn
self._handle = cls_or_fn if inspect.isfunction(cls_or_fn) else getattr(cls_or_fn, "__init__")
input_is_python_file = module_path and module_path.endswith(".py")
self._input_path = module_path if input_is_python_file else None
self._required_fn_name_for_signature_parsing = getattr(
cls_or_fn, "required_fn_name_for_signature_parsing", None
)
def update_argparser(self, parser):
name = self._handle.__name__
group_parser = parser.add_argument_group(name)
add_args_for_fn_signature(group_parser, fn=self._handle)
self._update_argparser(group_parser)
def get_args(self, args: argparse.Namespace):
filtered_args = filter_fn_args(args, fn=self._handle)
tmp_parser = argparse.ArgumentParser(allow_abbrev=False)
self._update_argparser(tmp_parser)
custom_names = [
p.dest.replace("-", "_") for p in tmp_parser._actions if not isinstance(p, argparse._HelpAction)
]
custom_params = {n: getattr(args, n) for n in custom_names}
filtered_args = {**filtered_args, **custom_params}
return filtered_args
def from_args(self, args: Union[argparse.Namespace, Dict]):
args = self.get_args(args)
LOGGER.info(f"Initializing {self._cls_or_fn.__name__}({args})")
return self._cls_or_fn(**args)
def _update_argparser(self, parser):
label = "argparser_update"
if self._input_path:
update_argparser_handle = load_from_file(self._input_path, label=label, target=GET_ARGPARSER_FN_NAME)
if update_argparser_handle:
update_argparser_handle(parser)
elif self._required_fn_name_for_signature_parsing:
fn_handle = load_from_file(
self._input_path, label=label, target=self._required_fn_name_for_signature_parsing
)
if fn_handle:
add_args_for_fn_signature(parser, fn_handle)

View file

@ -0,0 +1,13 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View file

@ -0,0 +1,237 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from pathlib import Path
from typing import Dict, Optional, Union
import numpy as np
# pytype: disable=import-error
import onnx
import onnx.optimizer
import onnx.shape_inference
import onnxruntime
from google.protobuf import text_format
from onnx.mapping import TENSOR_TYPE_TO_NP_TYPE
# pytype: enable=import-error
from ..core import BaseLoader, BaseRunner, BaseRunnerSession, BaseSaver, Format, Model, Precision, TensorSpec
from ..extensions import loaders, runners, savers
from .utils import infer_precision
LOGGER = logging.getLogger(__name__)
def _value_info2tensor_spec(value_info: onnx.ValueInfoProto):
onnx_data_type_map = {"float": "float32", "double": "float64"}
elem_type_name = onnx.TensorProto.DataType.Name(value_info.type.tensor_type.elem_type).lower()
dtype = onnx_data_type_map.get(elem_type_name, elem_type_name)
def _get_dim(dim):
which = dim.WhichOneof("value")
if which is not None: # which is None when dim is None
dim = getattr(dim, which)
return None if isinstance(dim, (str, bytes)) else dim
shape = value_info.type.tensor_type.shape
shape = tuple([_get_dim(d) for d in shape.dim])
return TensorSpec(value_info.name, dtype=dtype, shape=shape)
def _infer_graph_precision(onnx_graph: onnx.GraphProto) -> Optional[Precision]:
import networkx as nx
# build directed graph
nx_graph = nx.DiGraph()
def _get_dtype(vi):
t = vi.type
if hasattr(t, "tensor_type"):
type_id = t.tensor_type.elem_type
else:
raise NotImplementedError("Not implemented yet")
return TENSOR_TYPE_TO_NP_TYPE[type_id]
node_output2type = {vi.name: _get_dtype(vi) for vi in onnx_graph.value_info}
node_outputs2node = {output_name: node for node in onnx_graph.node for output_name in node.output}
node_inputs2node = {input_name: node for node in onnx_graph.node for input_name in node.input}
for node in onnx_graph.node:
node_dtype = node_output2type.get("+".join(node.output), None)
nx_graph.add_node(
node.name,
op=node.op_type,
attr={a.name: a for a in node.attribute},
dtype=node_dtype,
)
for input_name in node.input:
prev_node = node_outputs2node.get(input_name, None)
if prev_node:
nx_graph.add_edge(prev_node.name, node.name)
for input_node in onnx_graph.input:
input_name = input_node.name
nx_graph.add_node(input_name, op="input", dtype=_get_dtype(input_node))
next_node = node_inputs2node.get(input_name, None)
if next_node:
nx_graph.add_edge(input_name, next_node.name)
for output in onnx_graph.output:
output_name = output.name
nx_graph.add_node(output_name, op="output", dtype=_get_dtype(output))
prev_node = node_outputs2node.get(output_name, None)
if prev_node:
nx_graph.add_edge(prev_node.name, output_name)
else:
LOGGER.warning(f"Could not find previous node for {output_name}")
input_names = [n.name for n in onnx_graph.input]
output_names = [n.name for n in onnx_graph.output]
most_common_dtype = infer_precision(nx_graph, input_names, output_names, lambda node: node.get("dtype", None))
if most_common_dtype is not None:
precision = {np.dtype("float32"): Precision.FP32, np.dtype("float16"): Precision.FP16}[most_common_dtype]
else:
precision = None
return precision
class OnnxLoader(BaseLoader):
def load(self, model_path: Union[str, Path], **_) -> Model:
if isinstance(model_path, Path):
model_path = model_path.as_posix()
model = onnx.load(model_path)
onnx.checker.check_model(model)
onnx.helper.strip_doc_string(model)
model = onnx.shape_inference.infer_shapes(model)
# TODO: probably modification of onnx model ios causes error on optimize
# from onnx.utils import polish_model
# model = polish_model(model) # run checker, docs strip, optimizer and shape inference
inputs = {vi.name: _value_info2tensor_spec(vi) for vi in model.graph.input}
outputs = {vi.name: _value_info2tensor_spec(vi) for vi in model.graph.output}
precision = _infer_graph_precision(model.graph)
return Model(model, precision, inputs, outputs)
class OnnxSaver(BaseSaver):
def __init__(self, as_text: bool = False):
self._as_text = as_text
def save(self, model: Model, model_path: Union[str, Path]) -> None:
model_path = Path(model_path)
LOGGER.debug(f"Saving ONNX model to {model_path.as_posix()}")
model_path.parent.mkdir(parents=True, exist_ok=True)
onnx_model: onnx.ModelProto = model.handle
if self._as_text:
with model_path.open("w") as f:
f.write(text_format.MessageToString(onnx_model))
else:
with model_path.open("wb") as f:
f.write(onnx_model.SerializeToString())
"""
ExecutionProviders on onnxruntime 1.4.0
['TensorrtExecutionProvider',
'CUDAExecutionProvider',
'MIGraphXExecutionProvider',
'NGRAPHExecutionProvider',
'OpenVINOExecutionProvider',
'DnnlExecutionProvider',
'NupharExecutionProvider',
'VitisAIExecutionProvider',
'ArmNNExecutionProvider',
'ACLExecutionProvider',
'CPUExecutionProvider']
"""
def _check_providers(providers):
providers = providers or []
if not isinstance(providers, (list, tuple)):
providers = [providers]
available_providers = onnxruntime.get_available_providers()
unavailable = set(providers) - set(available_providers)
if unavailable:
raise RuntimeError(f"Unavailable providers {unavailable}")
return providers
class OnnxRunner(BaseRunner):
def __init__(self, verbose_runtime_logs: bool = False):
self._providers = None
self._verbose_runtime_logs = verbose_runtime_logs
def init_inference(self, model: Model):
assert isinstance(model.handle, onnx.ModelProto)
return OnnxRunnerSession(
model=model, providers=self._providers, verbose_runtime_logs=self._verbose_runtime_logs
)
class OnnxRunnerSession(BaseRunnerSession):
def __init__(self, model: Model, providers, verbose_runtime_logs: bool = False):
super().__init__(model)
self._input_names = None
self._output_names = None
self._session = None
self._providers = providers
self._verbose_runtime_logs = verbose_runtime_logs
self._old_env_values = {}
def __enter__(self):
self._old_env_values = self._set_env_variables()
sess_options = onnxruntime.SessionOptions() # default session options
if self._verbose_runtime_logs:
sess_options.log_severity_level = 0
sess_options.log_verbosity_level = 1
LOGGER.info(
f"Starting inference session for onnx model providers={self._providers} sess_options={sess_options}"
)
self._input_names = list(self._model.inputs)
self._output_names = list(self._model.outputs)
model_payload = self._model.handle.SerializeToString()
self._session = onnxruntime.InferenceSession(
model_payload, providers=self._providers, sess_options=sess_options
)
return self
def __exit__(self, exc_type, exc_value, traceback):
self._input_names = None
self._output_names = None
self._session = None
self._recover_env_variables(self._old_env_values)
def __call__(self, x: Dict[str, object]):
feed_dict = {k: x[k] for k in self._input_names}
y_pred = self._session.run(self._output_names, feed_dict)
y_pred = dict(zip(self._output_names, y_pred))
return y_pred
loaders.register_extension(Format.ONNX.value, OnnxLoader)
runners.register_extension(Format.ONNX.value, OnnxRunner)
savers.register_extension(Format.ONNX.value, OnnxSaver)

View file

@ -0,0 +1,114 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import Dict, Iterable, Optional
# pytype: disable=import-error
import onnx
import tensorrt as trt
from ..core import BaseConverter, Format, Model, Precision, ShapeSpec
from ..extensions import converters
from .utils import get_input_shapes
# pytype: enable=import-error
LOGGER = logging.getLogger(__name__)
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
class Onnx2TRTConverter(BaseConverter):
def __init__(self, *, max_batch_size: int, max_workspace_size: int, precision: str):
self._max_batch_size = max_batch_size
self._max_workspace_size = max_workspace_size
self._precision = Precision(precision)
def convert(self, model: Model, dataloader_fn) -> Model:
input_shapes = get_input_shapes(dataloader_fn(), self._max_batch_size)
cuda_engine = onnx2trt(
model.handle,
shapes=input_shapes,
max_workspace_size=self._max_workspace_size,
max_batch_size=self._max_batch_size,
model_precision=self._precision.value,
)
return model._replace(handle=cuda_engine)
@staticmethod
def required_source_model_precision(requested_model_precision: Precision) -> Precision:
# TensorRT requires source models to be in FP32 precision
return Precision.FP32
def onnx2trt(
onnx_model: onnx.ModelProto,
*,
shapes: Dict[str, ShapeSpec],
max_workspace_size: int,
max_batch_size: int,
model_precision: str,
) -> "trt.ICudaEngine":
"""
Converts onnx model to TensorRT ICudaEngine
Args:
onnx_model: onnx.Model to convert
shapes: dictionary containing min shape, max shape, opt shape for each input name
max_workspace_size: The maximum GPU temporary memory which the CudaEngine can use at execution time.
max_batch_size: The maximum batch size which can be used at execution time,
and also the batch size for which the CudaEngine will be optimized.
model_precision: precision of kernels (possible values: fp16, fp32)
Returns: TensorRT ICudaEngine
"""
# Whether or not 16-bit kernels are permitted.
# During :class:`ICudaEngine` build fp16 kernels will also be tried when this mode is enabled.
fp16_mode = "16" in model_precision
builder = trt.Builder(TRT_LOGGER)
builder.fp16_mode = fp16_mode
builder.max_batch_size = max_batch_size
builder.max_workspace_size = max_workspace_size
# In TensorRT 7.0, the ONNX parser only supports full-dimensions mode,
# meaning that your network definition must be created with the explicitBatch flag set.
# For more information, see
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes
flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flags)
with trt.OnnxParser(network, TRT_LOGGER) as parser:
# onnx model parsing
if not parser.parse(onnx_model.SerializeToString()):
for i in range(parser.num_errors):
LOGGER.error(f"OnnxParser error {i}/{parser.num_errors}: {parser.get_error(i)}")
raise RuntimeError("Error during parsing ONNX model (see logs for details)")
# optimization
config = builder.create_builder_config()
config.flags |= bool(fp16_mode) << int(trt.BuilderFlag.FP16)
config.max_workspace_size = max_workspace_size
profile = builder.create_optimization_profile()
for name, spec in shapes.items():
profile.set_shape(name, **spec._asdict())
config.add_optimization_profile(profile)
engine = builder.build_engine(network, config=config)
return engine
converters.register_extension(f"{Format.ONNX.value}--{Format.TRT.value}", Onnx2TRTConverter)

View file

@ -0,0 +1,358 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, NamedTuple, Optional, Union
import torch # pytype: disable=import-error
import yaml
from ..core import (
GET_MODEL_FN_NAME,
BaseConverter,
BaseLoader,
BaseRunner,
BaseRunnerSession,
BaseSaver,
Format,
Model,
Precision,
TensorSpec,
load_from_file,
)
from ..extensions import converters, loaders, runners, savers
from .utils import get_dynamic_axes, get_input_shapes, get_shapes_with_dynamic_axes
LOGGER = logging.getLogger(__name__)
class InputOutputSpec(NamedTuple):
inputs: Dict[str, TensorSpec]
outputs: Dict[str, TensorSpec]
def get_sample_input(dataloader, device):
for batch in dataloader:
_, x, _ = batch
break
if isinstance(x, dict):
sample_input = list(x.values())
elif isinstance(x, list):
sample_input = x
else:
raise TypeError("The first element (x) of batch returned by dataloader must be a list or a dict")
for idx, s in enumerate(sample_input):
sample_input[idx] = torch.from_numpy(s).to(device)
return tuple(sample_input)
def get_model_device(torch_model):
if next(torch_model.parameters()).is_cuda:
return "cuda"
else:
return "cpu"
def infer_model_precision(model):
counter = Counter()
for param in model.parameters():
counter[param.dtype] += 1
if counter[torch.float16] > 0:
return Precision.FP16
else:
return Precision.FP32
def _get_tensor_dtypes(dataloader, precision):
def _get_dtypes(t):
dtypes = {}
for k, v in t.items():
dtype = str(v.dtype)
if dtype == "float64":
dtype = "float32"
if precision == Precision.FP16 and dtype == "float32":
dtype = "float16"
dtypes[k] = dtype
return dtypes
input_dtypes = {}
output_dtypes = {}
for batch in dataloader:
_, x, y = batch
input_dtypes = _get_dtypes(x)
output_dtypes = _get_dtypes(y)
break
return input_dtypes, output_dtypes
### TODO assumption: floating point input
### type has same precision as the model
def _get_io_spec(model, dataloader_fn):
precision = model.precision
dataloader = dataloader_fn()
input_dtypes, output_dtypes = _get_tensor_dtypes(dataloader, precision)
input_shapes, output_shapes = get_shapes_with_dynamic_axes(dataloader)
inputs = {
name: TensorSpec(name=name, dtype=input_dtypes[name], shape=tuple(input_shapes[name])) for name in model.inputs
}
outputs = {
name: TensorSpec(name=name, dtype=output_dtypes[name], shape=tuple(output_shapes[name]))
for name in model.outputs
}
return InputOutputSpec(inputs, outputs)
class PyTorchModelLoader(BaseLoader):
required_fn_name_for_signature_parsing: Optional[str] = GET_MODEL_FN_NAME
def __init__(self, **kwargs):
self._model_args = kwargs
def load(self, model_path: Union[str, Path], **_) -> Model:
if isinstance(model_path, Path):
model_path = model_path.as_posix()
get_model = load_from_file(model_path, "model", GET_MODEL_FN_NAME)
model, tensor_infos = get_model(**self._model_args)
io_spec = InputOutputSpec(tensor_infos["inputs"], tensor_infos["outputs"])
precision = infer_model_precision(model)
return Model(handle=model, precision=precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class TorchScriptLoader(BaseLoader):
def __init__(self, tensor_names_path: str = None, **kwargs):
self._model_args = kwargs
self._io_spec = None
if tensor_names_path is not None:
with Path(tensor_names_path).open("r") as fh:
tensor_infos = yaml.load(fh, Loader=yaml.SafeLoader)
self._io_spec = InputOutputSpec(tensor_infos["inputs"], tensor_infos["outputs"])
def load(self, model_path: Union[str, Path], **_) -> Model:
if not isinstance(model_path, Path):
model_path = Path(model_path)
model = torch.jit.load(model_path.as_posix())
precision = infer_model_precision(model)
io_spec = self._io_spec
if not io_spec:
yaml_path = model_path.parent / f"{model_path.stem}.yaml"
if not yaml_path.is_file():
raise ValueError(
f"If `--tensor-names-path is not provided, "
f"TorchScript model loader expects file {yaml_path} with tensor information."
)
with yaml_path.open("r") as fh:
tensor_info = yaml.load(fh, Loader=yaml.SafeLoader)
io_spec = InputOutputSpec(tensor_info["inputs"], tensor_info["outputs"])
return Model(handle=model, precision=precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class TorchScriptTraceConverter(BaseConverter):
def __init__(self):
pass
def convert(self, model: Model, dataloader_fn) -> Model:
device = get_model_device(model.handle)
dummy_input = get_sample_input(dataloader_fn(), device)
converted_model = torch.jit.trace_module(model.handle, {"forward": dummy_input})
io_spec = _get_io_spec(model, dataloader_fn)
return Model(converted_model, precision=model.precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class TorchScriptScriptConverter(BaseConverter):
def __init__(self):
pass
def convert(self, model: Model, dataloader_fn) -> Model:
converted_model = torch.jit.script(model.handle)
io_spec = _get_io_spec(model, dataloader_fn)
return Model(converted_model, precision=model.precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class PYT2ONNXConverter(BaseConverter):
def __init__(self, onnx_opset: int = None):
self._onnx_opset = onnx_opset
def convert(self, model: Model, dataloader_fn) -> Model:
import tempfile
import onnx # pytype: disable=import-error
assert isinstance(model.handle, torch.jit.ScriptModule) or isinstance(
model.handle, torch.nn.Module
), "The model must be of type 'torch.jit.ScriptModule' or 'torch.nn.Module'. Converter aborted."
dynamic_axes = get_dynamic_axes(dataloader_fn())
device = get_model_device(model.handle)
dummy_input = get_sample_input(dataloader_fn(), device)
with tempfile.TemporaryDirectory() as tmpdirname:
export_path = os.path.join(tmpdirname, "model.onnx")
with torch.no_grad():
torch.onnx.export(
model.handle,
dummy_input,
export_path,
do_constant_folding=True,
input_names=list(model.inputs),
output_names=list(model.outputs),
dynamic_axes=dynamic_axes,
opset_version=self._onnx_opset,
enable_onnx_checker=True,
)
onnx_model = onnx.load(export_path)
onnx.checker.check_model(onnx_model)
onnx.helper.strip_doc_string(onnx_model)
onnx_model = onnx.shape_inference.infer_shapes(onnx_model)
return Model(
handle=onnx_model,
precision=model.precision,
inputs=model.inputs,
outputs=model.outputs,
)
class PYT2TensorRTConverter(BaseConverter):
def __init__(self, max_batch_size: int, max_workspace_size: int, onnx_opset: int, precision: str):
self._max_batch_size = max_batch_size
self._max_workspace_size = max_workspace_size
self._onnx_opset = onnx_opset
self._precision = Precision(precision)
def convert(self, model: Model, dataloader_fn) -> Model:
from .onnx import _infer_graph_precision
from .onnx2trt_conv import onnx2trt
pyt2onnx_converter = PYT2ONNXConverter(self._onnx_opset)
onnx_model = pyt2onnx_converter.convert(model, dataloader_fn).handle
precision = _infer_graph_precision(onnx_model.graph)
input_shapes = get_input_shapes(dataloader_fn(), self._max_batch_size)
cuda_engine = onnx2trt(
onnx_model,
shapes=input_shapes,
max_workspace_size=self._max_workspace_size,
max_batch_size=self._max_batch_size,
model_precision=self._precision.value,
)
return Model(
handle=cuda_engine,
precision=model.precision,
inputs=model.inputs,
outputs=model.outputs,
)
@staticmethod
def required_source_model_precision(requested_model_precision: Precision) -> Precision:
# TensorRT requires source models to be in FP32 precision
return Precision.FP32
class TorchScriptSaver(BaseSaver):
def save(self, model: Model, model_path: Union[str, Path]) -> None:
if not isinstance(model_path, Path):
model_path = Path(model_path)
if isinstance(model.handle, torch.jit.ScriptModule):
torch.jit.save(model.handle, model_path.as_posix())
else:
print("The model must be of type 'torch.jit.ScriptModule'. Saving aborted.")
assert False # temporary error handling
def _format_tensor_spec(tensor_spec):
# wrapping shape with list and whole tensor_spec with dict() is required for correct yaml dump
tensor_spec = tensor_spec._replace(shape=list(tensor_spec.shape))
tensor_spec = dict(tensor_spec._asdict())
return tensor_spec
# store TensorSpecs from inputs and outputs in a yaml file
tensor_specs = {
"inputs": {k: _format_tensor_spec(v) for k, v in model.inputs.items()},
"outputs": {k: _format_tensor_spec(v) for k, v in model.outputs.items()},
}
yaml_path = model_path.parent / f"{model_path.stem}.yaml"
with Path(yaml_path).open("w") as fh:
yaml.dump(tensor_specs, fh, indent=4)
class PyTorchRunner(BaseRunner):
def __init__(self):
pass
def init_inference(self, model: Model):
return PyTorchRunnerSession(model=model)
class PyTorchRunnerSession(BaseRunnerSession):
def __init__(self, model: Model):
super().__init__(model)
assert isinstance(model.handle, torch.jit.ScriptModule) or isinstance(
model.handle, torch.nn.Module
), "The model must be of type 'torch.jit.ScriptModule' or 'torch.nn.Module'. Runner aborted."
self._model = model
self._output_names = None
def __enter__(self):
self._output_names = list(self._model.outputs)
return self
def __exit__(self, exc_type, exc_value, traceback):
self._output_names = None
self._model = None
def __call__(self, x: Dict[str, object]):
with torch.no_grad():
feed_list = [torch.from_numpy(v).cuda() for k, v in x.items()]
y_pred = self._model.handle(*feed_list)
if isinstance(y_pred, torch.Tensor):
y_pred = (y_pred,)
y_pred = [t.cpu().numpy() for t in y_pred]
y_pred = dict(zip(self._output_names, y_pred))
return y_pred
loaders.register_extension(Format.PYT.value, PyTorchModelLoader)
loaders.register_extension(Format.TS_TRACE.value, TorchScriptLoader)
loaders.register_extension(Format.TS_SCRIPT.value, TorchScriptLoader)
converters.register_extension(f"{Format.PYT.value}--{Format.TS_SCRIPT.value}", TorchScriptScriptConverter)
converters.register_extension(f"{Format.PYT.value}--{Format.TS_TRACE.value}", TorchScriptTraceConverter)
converters.register_extension(f"{Format.PYT.value}--{Format.ONNX.value}", PYT2ONNXConverter)
converters.register_extension(f"{Format.PYT.value}--{Format.TRT.value}", PYT2TensorRTConverter)
savers.register_extension(Format.TS_SCRIPT.value, TorchScriptSaver)
savers.register_extension(Format.TS_TRACE.value, TorchScriptSaver)
runners.register_extension(Format.PYT.value, PyTorchRunner)
runners.register_extension(Format.TS_SCRIPT.value, PyTorchRunner)
runners.register_extension(Format.TS_TRACE.value, PyTorchRunner)

View file

@ -0,0 +1,216 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import sys
from pathlib import Path
from typing import Dict, NamedTuple, Optional, Union
import numpy as np
# pytype: disable=import-error
try:
import pycuda.autoinit
import pycuda.driver as cuda
except (ImportError, Exception) as e:
logging.getLogger(__name__).warning(f"Problems with importing pycuda package; {e}")
# pytype: enable=import-error
import tensorrt as trt # pytype: disable=import-error
from ..core import BaseLoader, BaseRunner, BaseRunnerSession, BaseSaver, Format, Model, Precision, TensorSpec
from ..extensions import loaders, runners, savers
LOGGER = logging.getLogger(__name__)
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
"""
documentation:
https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#python_samples_section
"""
class TensorRTLoader(BaseLoader):
def load(self, model_path: Union[str, Path], **_) -> Model:
model_path = Path(model_path)
LOGGER.debug(f"Loading TensorRT engine from {model_path}")
with model_path.open("rb") as fh, trt.Runtime(TRT_LOGGER) as runtime:
engine = runtime.deserialize_cuda_engine(fh.read())
if engine is None:
raise RuntimeError(f"Could not load ICudaEngine from {model_path}")
inputs = {}
outputs = {}
for binding_idx in range(engine.num_bindings):
name = engine.get_binding_name(binding_idx)
is_input = engine.binding_is_input(binding_idx)
dtype = engine.get_binding_dtype(binding_idx)
shape = engine.get_binding_shape(binding_idx)
if is_input:
inputs[name] = TensorSpec(name, dtype, shape)
else:
outputs[name] = TensorSpec(name, dtype, shape)
return Model(engine, None, inputs, outputs)
class TensorRTSaver(BaseSaver):
def __init__(self):
pass
def save(self, model: Model, model_path: Union[str, Path]) -> None:
model_path = Path(model_path)
LOGGER.debug(f"Saving TensorRT engine to {model_path.as_posix()}")
model_path.parent.mkdir(parents=True, exist_ok=True)
engine: "trt.ICudaEngine" = model.handle
with model_path.open("wb") as fh:
fh.write(engine.serialize())
class TRTBuffers(NamedTuple):
x_host: Optional[Dict[str, object]]
x_dev: Dict[str, object]
y_pred_host: Dict[str, object]
y_pred_dev: Dict[str, object]
class TensorRTRunner(BaseRunner):
def __init__(self):
pass
def init_inference(self, model: Model):
return TensorRTRunnerSession(model=model)
class TensorRTRunnerSession(BaseRunnerSession):
def __init__(self, model: Model):
super().__init__(model)
assert isinstance(model.handle, trt.ICudaEngine)
self._model = model
self._has_dynamic_shapes = None
self._context = None
self._engine: trt.ICudaEngine = self._model.handle
self._cuda_context = pycuda.autoinit.context
self._input_names = None
self._output_names = None
self._buffers = None
def __enter__(self):
self._context = self._engine.create_execution_context()
self._context.__enter__()
self._input_names = [
self._engine[idx] for idx in range(self._engine.num_bindings) if self._engine.binding_is_input(idx)
]
self._output_names = [
self._engine[idx] for idx in range(self._engine.num_bindings) if not self._engine.binding_is_input(idx)
]
# all_binding_shapes_specified is True for models without dynamic shapes
# so initially this variable is False for models with dynamic shapes
self._has_dynamic_shapes = not self._context.all_binding_shapes_specified
return self
def __exit__(self, exc_type, exc_value, traceback):
self._context.__exit__(exc_type, exc_value, traceback)
self._input_names = None
self._output_names = None
# TODO: are cuda buffers dealloc automatically?
self._buffers = None
def __call__(self, x):
buffers = self._prepare_buffers_if_needed(x)
bindings = self._update_bindings(buffers)
for name in self._input_names:
cuda.memcpy_htod(buffers.x_dev[name], buffers.x_host[name])
self._cuda_context.push()
self._context.execute_v2(bindings=bindings)
self._cuda_context.pop()
for name in self._output_names:
cuda.memcpy_dtoh(buffers.y_pred_host[name], buffers.y_pred_dev[name])
return buffers.y_pred_host
def _update_bindings(self, buffers: TRTBuffers):
bindings = [None] * self._engine.num_bindings
for name in buffers.y_pred_dev:
binding_idx: int = self._engine[name]
bindings[binding_idx] = buffers.y_pred_dev[name]
for name in buffers.x_dev:
binding_idx: int = self._engine[name]
bindings[binding_idx] = buffers.x_dev[name]
return bindings
def _set_dynamic_input_shapes(self, x_host):
def _is_shape_dynamic(input_shape):
return any([dim is None or dim == -1 for dim in input_shape])
for name in self._input_names:
bindings_idx = self._engine[name]
data_shape = x_host[name].shape # pytype: disable=attribute-error
if self._engine.is_shape_binding(bindings_idx):
input_shape = self._context.get_shape(bindings_idx)
if _is_shape_dynamic(input_shape):
self._context.set_shape_input(bindings_idx, data_shape)
else:
input_shape = self._engine.get_binding_shape(bindings_idx)
if _is_shape_dynamic(input_shape):
self._context.set_binding_shape(bindings_idx, data_shape)
assert self._context.all_binding_shapes_specified and self._context.all_shape_inputs_specified
def _prepare_buffers_if_needed(self, x_host: Dict[str, object]):
# pytype: disable=attribute-error
new_batch_size = list(x_host.values())[0].shape[0]
current_batch_size = list(self._buffers.y_pred_host.values())[0].shape[0] if self._buffers else 0
# pytype: enable=attribute-error
if self._has_dynamic_shapes or new_batch_size != current_batch_size:
# TODO: are CUDA buffers dealloc automatically?
self._set_dynamic_input_shapes(x_host)
y_pred_host = {}
for name in self._output_names:
shape = self._context.get_binding_shape(self._engine[name])
y_pred_host[name] = np.zeros(shape, dtype=trt.nptype(self._model.outputs[name].dtype))
y_pred_dev = {name: cuda.mem_alloc(data.nbytes) for name, data in y_pred_host.items()}
x_dev = {
name: cuda.mem_alloc(host_input.nbytes)
for name, host_input in x_host.items()
if name in self._input_names # pytype: disable=attribute-error
}
self._buffers = TRTBuffers(None, x_dev, y_pred_host, y_pred_dev)
return self._buffers._replace(x_host=x_host)
if "pycuda.driver" in sys.modules:
loaders.register_extension(Format.TRT.value, TensorRTLoader)
runners.register_extension(Format.TRT.value, TensorRTRunner)
savers.register_extension(Format.TRT.value, TensorRTSaver)
else:
LOGGER.warning("Do not register TensorRT extension due problems with importing pycuda.driver package.")

View file

@ -0,0 +1,121 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import Counter
from typing import Callable, Dict, List
import networkx as nx
from ..core import ShapeSpec
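# Heuristic used by the converters: the precision of a graph is taken to be the most common
# non-integer/non-bool node dtype (counted below).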
def infer_precision(
nx_graph: nx.Graph,
input_names: List[str],
output_names: List[str],
get_node_dtype_fn: Callable,
):
node_dtypes = [nx_graph.nodes[node_name].get("dtype", None) for node_name in nx_graph.nodes]
node_dtypes = [dt for dt in node_dtypes if dt is None or dt.kind not in ["i", "b"]]
dtypes_counter = Counter(node_dtypes)
return dtypes_counter.most_common()[0][0]
def get_shapes_with_dynamic_axes(dataloader, batch_size_dim=0):
def _set_dynamic_shapes(t, shapes):
for k, v in t.items():
shape = list(v.shape)
for dim, s in enumerate(shape):
if shapes[k][dim] != -1 and shapes[k][dim] != s:
shapes[k][dim] = -1
## get all shapes from input and output tensors
input_shapes = {}
output_shapes = {}
for batch in dataloader:
_, x, y = batch
for k, v in x.items():
input_shapes[k] = list(v.shape)
for k, v in y.items():
output_shapes[k] = list(v.shape)
break
# based on max <max_num_iters> iterations, check which
# dimensions differ to determine dynamic_axes
max_num_iters = 100
for idx, batch in enumerate(dataloader):
if idx >= max_num_iters:
break
_, x, y = batch
_set_dynamic_shapes(x, input_shapes)
_set_dynamic_shapes(y, output_shapes)
return input_shapes, output_shapes
def get_dynamic_axes(dataloader, batch_size_dim=0):
input_shapes, output_shapes = get_shapes_with_dynamic_axes(dataloader, batch_size_dim)
all_shapes = {**input_shapes, **output_shapes}
dynamic_axes = {}
for k, shape in all_shapes.items():
for idx, s in enumerate(shape):
if s == -1:
dynamic_axes[k] = {idx: k + "_" + str(idx)}
for k, v in all_shapes.items():
if k in dynamic_axes:
dynamic_axes[k].update({batch_size_dim: "batch_size_" + str(batch_size_dim)})
else:
dynamic_axes[k] = {batch_size_dim: "batch_size_" + str(batch_size_dim)}
return dynamic_axes
def get_input_shapes(dataloader, max_batch_size=1) -> Dict[str, ShapeSpec]:
def init_counters_and_shapes(x, counters, min_shapes, max_shapes):
for k, v in x.items():
counters[k] = Counter()
min_shapes[k] = [float("inf")] * v.ndim
max_shapes[k] = [float("-inf")] * v.ndim
counters = {}
min_shapes: Dict[str, tuple] = {}
max_shapes: Dict[str, tuple] = {}
for idx, batch in enumerate(dataloader):
ids, x, y = batch
if idx == 0:
init_counters_and_shapes(x, counters, min_shapes, max_shapes)
for k, v in x.items():
shape = v.shape
counters[k][shape] += 1
min_shapes[k] = tuple([min(a, b) for a, b in zip(min_shapes[k], shape)])
max_shapes[k] = tuple([max(a, b) for a, b in zip(max_shapes[k], shape)])
opt_shapes: Dict[str, tuple] = {}
for k, v in counters.items():
opt_shapes[k] = v.most_common(1)[0][0]
shapes = {}
for k in opt_shapes.keys(): # same keys in min_shapes and max_shapes
shapes[k] = ShapeSpec(
min=(1,) + min_shapes[k][1:],
max=(max_batch_size,) + max_shapes[k][1:],
opt=(max_batch_size,) + opt_shapes[k][1:],
)
return shapes

View file

@ -0,0 +1,183 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
import importlib
import logging
import os
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, NamedTuple, Optional, Tuple, Union
import numpy as np
LOGGER = logging.getLogger(__name__)
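# Names of the functions that the deployment toolkit looks up in user-provided scripts
# (e.g. triton/model.py defines get_model/update_argparser, triton/dataloader.py defines get_dataloader_fn).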
DATALOADER_FN_NAME = "get_dataloader_fn"
GET_MODEL_FN_NAME = "get_model"
GET_SERVING_INPUT_RECEIVER_FN = "get_serving_input_receiver_fn"
GET_ARGPARSER_FN_NAME = "update_argparser"
class TensorSpec(NamedTuple):
name: str
dtype: str
shape: Tuple
class Parameter(Enum):
def __lt__(self, other: "Parameter") -> bool:
return self.value < other.value
class Accelerator(Parameter):
AMP = "amp"
CUDA = "cuda"
TRT = "trt"
class Precision(Parameter):
FP16 = "fp16"
FP32 = "fp32"
TF32 = "tf32" # Deprecated
class Format(Parameter):
TF_GRAPHDEF = "tf-graphdef"
TF_SAVEDMODEL = "tf-savedmodel"
TF_TRT = "tf-trt"
TF_ESTIMATOR = "tf-estimator"
TF_KERAS = "tf-keras"
ONNX = "onnx"
TRT = "trt"
TS_SCRIPT = "ts-script"
TS_TRACE = "ts-trace"
PYT = "pyt"
class Model(NamedTuple):
handle: object
precision: Optional[Precision]
inputs: Dict[str, TensorSpec]
outputs: Dict[str, TensorSpec]
def load_from_file(file_path, label, target):
spec = importlib.util.spec_from_file_location(name=label, location=file_path)
my_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(my_module) # pytype: disable=attribute-error
return getattr(my_module, target, None)
class BaseLoader(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def load(self, model_path: Union[str, Path], **kwargs) -> Model:
"""
Loads and processes a model from file based on a given set of args
"""
pass
class BaseSaver(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def save(self, model: Model, model_path: Union[str, Path]) -> None:
"""
Save model to file
"""
pass
class BaseRunner(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def init_inference(self, model: Model):
raise NotImplementedError
class BaseRunnerSession(abc.ABC):
def __init__(self, model: Model):
self._model = model
@abc.abstractmethod
def __enter__(self):
raise NotImplementedError()
@abc.abstractmethod
def __exit__(self, exc_type, exc_value, traceback):
raise NotImplementedError()
@abc.abstractmethod
def __call__(self, x: Dict[str, object]):
raise NotImplementedError()
def _set_env_variables(self) -> Dict[str, object]:
"""this method not remove values; fix it if needed"""
to_set = {}
old_values = {k: os.environ.pop(k, None) for k in to_set}
os.environ.update(to_set)
return old_values
def _recover_env_variables(self, old_envs: Dict[str, object]):
for name, value in old_envs.items():
if value is None:
del os.environ[name]
else:
os.environ[name] = str(value)
class BaseConverter(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def convert(self, model: Model, dataloader_fn) -> Model:
raise NotImplementedError()
@staticmethod
def required_source_model_precision(requested_model_precision: Precision) -> Precision:
return requested_model_precision
class BaseMetricsCalculator(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def calc(
self,
*,
ids: List[Any],
y_pred: Dict[str, np.ndarray],
x: Optional[Dict[str, np.ndarray]],
y_real: Optional[Dict[str, np.ndarray]],
) -> Dict[str, float]:
"""
Calculates error/accuracy metrics
Args:
ids: List of ids identifying each sample in the batch
y_pred: model output as dict where key is output name and value is output value
x: model input as dict where key is input name and value is input value
y_real: input ground truth as dict where key is output name and value is output value
Returns:
dictionary where key is metric name and value is its value
"""
pass
class ShapeSpec(NamedTuple):
min: Tuple
opt: Tuple
max: Tuple

View file

@ -0,0 +1,147 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Dict, Iterable
import numpy as np
MB2B = 2 ** 20
B2MB = 1 / MB2B
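# NpzWriter buffers arrays in memory and flushes them to .npz files once any prefix cache exceeds this size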
FLUSH_THRESHOLD_B = 256 * MB2B
def pad_except_batch_axis(data: np.ndarray, target_shape_with_batch_axis: Iterable[int]):
assert all(
[current_size <= target_size for target_size, current_size in zip(target_shape_with_batch_axis, data.shape)]
), "target_shape should have equal or greater all dimensions comparing to data.shape"
padding = [(0, 0)] + [ # (0, 0) - do not pad on batch_axis (with index 0)
(0, target_size - current_size)
for target_size, current_size in zip(target_shape_with_batch_axis[1:], data.shape[1:])
]
return np.pad(data, padding, "constant", constant_values=np.nan)
class NpzWriter:
"""
Dumps dicts of numpy arrays into npz files
It can/should be used as a context manager:
```
with NpzWriter('mydir') as writer:
writer.write(outputs={'classes': np.zeros(8), 'probs': np.zeros((8, 4))},
labels={'classes': np.zeros(8)},
inputs={'input': np.zeros((8, 240, 240, 3))})
```
## Variable size data
Only the dynamic size of the last axis is handled. Data is padded with the np.nan value.
Also, each generated file may have a different size of the dynamic axis.
"""
def __init__(self, output_dir, compress=False):
self._output_dir = Path(output_dir)
self._items_cache: Dict[str, Dict[str, np.ndarray]] = {}
self._items_counters: Dict[str, int] = {}
self._flush_threshold_b = FLUSH_THRESHOLD_B
self._compress = compress
@property
def cache_size(self):
return {name: sum([a.nbytes for a in data.values()]) for name, data in self._items_cache.items()}
def _append_to_cache(self, prefix, data):
if data is None:
return
if not isinstance(data, dict):
raise ValueError(f"{prefix} data to store shall be dict")
cached_data = self._items_cache.get(prefix, {})
for name, value in data.items():
assert isinstance(
value, (list, np.ndarray)
), f"Values shall be lists or np.ndarrays; current type {type(value)}"
if not isinstance(value, np.ndarray):
value = np.array(value)
assert value.dtype.kind in ["S", "U"] or not np.any(
np.isnan(value)
), f"Values with np.nan is not supported; {name}={value}"
cached_value = cached_data.get(name, None)
if cached_value is not None:
target_shape = np.max([cached_value.shape, value.shape], axis=0)
cached_value = pad_except_batch_axis(cached_value, target_shape)
value = pad_except_batch_axis(value, target_shape)
value = np.concatenate((cached_value, value))
cached_data[name] = value
self._items_cache[prefix] = cached_data
def write(self, **kwargs):
"""
Writes named dictionaries of np.ndarrays.
The keyword names become the prefixes of the npz files in which those dictionaries are stored.
ex. writer.write(inputs={'input': np.zeros((2, 10))},
outputs={'classes': np.zeros((2,)), 'probabilities': np.zeros((2, 32))},
labels={'classes': np.zeros((2,))})
Args:
**kwargs: named list of dictionaries of np.ndarrays to store
"""
for prefix, data in kwargs.items():
self._append_to_cache(prefix, data)
biggest_item_size = max(self.cache_size.values())
if biggest_item_size > self._flush_threshold_b:
self.flush()
def flush(self):
for prefix, data in self._items_cache.items():
self._dump(prefix, data)
self._items_cache = {}
def _dump(self, prefix, data):
idx = self._items_counters.setdefault(prefix, 0)
filename = f"{prefix}-{idx:012d}.npz"
output_path = self._output_dir / filename
if self._compress:
np.savez_compressed(output_path, **data)
else:
np.savez(output_path, **data)
nitems = len(list(data.values())[0])
msg_for_labels = (
"If these are correct shapes - consider moving loading of them into metrics.py."
if prefix == "labels"
else ""
)
shapes = {name: value.shape if isinstance(value, np.ndarray) else (len(value),) for name, value in data.items()}
assert all(len(v) == nitems for v in data.values()), (
f'All items in "{prefix}" shall have the same size on axis 0, equal to the batch size. {msg_for_labels}'
f'{", ".join(f"{name}: {shape}" for name, shape in shapes.items())}'
)
self._items_counters[prefix] += nitems
def __enter__(self):
if self._output_dir.exists() and len(list(self._output_dir.iterdir())):
raise ValueError(f"{self._output_dir.as_posix()} is not empty")
self._output_dir.mkdir(parents=True, exist_ok=True)
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.flush()

View file

@ -0,0 +1,83 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib
import logging
import os
import re
from pathlib import Path
from typing import List
LOGGER = logging.getLogger(__name__)
class ExtensionManager:
def __init__(self, name: str):
self._name = name
self._registry = {}
def register_extension(self, extension: str, clazz):
already_registered_class = self._registry.get(extension, None)
if already_registered_class and already_registered_class.__module__ != clazz.__module__:
raise RuntimeError(
f"Conflicting extension {self._name}/{extension}; "
f"{already_registered_class.__module__}.{already_registered_class.__name} "
f"and "
f"{clazz.__module__}.{clazz.__name__}"
)
elif already_registered_class is None:
clazz_full_name = f"{clazz.__module__}.{clazz.__name__}" if clazz is not None else "None"
LOGGER.debug(f"Registering extension {self._name}/{extension}: {clazz_full_name}")
self._registry[extension] = clazz
def get(self, extension):
if extension not in self._registry:
raise RuntimeError(f"Missing extension {self._name}/{extension}")
return self._registry[extension]
@property
def supported_extensions(self):
return list(self._registry)
@staticmethod
def scan_for_extensions(extension_dirs: List[Path]):
register_pattern = r".*\.register_extension\(.*"
for extension_dir in extension_dirs:
for python_path in extension_dir.rglob("*.py"):
if not python_path.is_file():
continue
payload = python_path.read_text()
if re.findall(register_pattern, payload):
import_path = python_path.relative_to(toolkit_root_dir.parent)
package = import_path.parent.as_posix().replace(os.sep, ".")
package_with_module = f"{package}.{import_path.stem}"
spec = importlib.util.spec_from_file_location(name=package_with_module, location=python_path)
my_module = importlib.util.module_from_spec(spec)
my_module.__package__ = package
try:
spec.loader.exec_module(my_module) # pytype: disable=attribute-error
except ModuleNotFoundError as e:
LOGGER.error(
f"Could not load extensions from {import_path} due to missing python packages; {e}"
)
runners = ExtensionManager("runners")
loaders = ExtensionManager("loaders")
savers = ExtensionManager("savers")
converters = ExtensionManager("converters")
toolkit_root_dir = (Path(__file__).parent / "..").resolve()
ExtensionManager.scan_for_extensions([toolkit_root_dir])

View file

@ -0,0 +1,61 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv
import re
from typing import Dict, List
from natsort import natsorted
from tabulate import tabulate
def sort_results(results: List):
results = natsorted(results, key=lambda item: [item[key] for key in item.keys()])
return results
def save_results(filename: str, data: List, formatted: bool = False):
data = format_data(data=data) if formatted else data
with open(filename, "a") as csvfile:
fieldnames = data[0].keys()
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for row in data:
writer.writerow(row)
def format_data(data: List[Dict]) -> List[Dict]:
formatted_data = list()
for item in data:
formatted_item = format_keys(data=item)
formatted_data.append(formatted_item)
return formatted_data
def format_keys(data: Dict) -> Dict:
keys = {format_key(key=key): value for key, value in data.items()}
return keys
def format_key(key: str) -> str:
key = " ".join([k.capitalize() for k in re.split("_| ", key)])
return key
def show_results(results: List[Dict]):
headers = list(results[0].keys())
summary = map(lambda x: list(map(lambda item: item[1], x.items())), results)
print(tabulate(summary, headers=headers))

View file

@ -0,0 +1,67 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
from typing import List, Optional
def warmup(
model_name: str,
batch_sizes: List[int],
triton_gpu_engine_count: int = 1,
triton_instances: int = 1,
profiling_data: str = "random",
input_shapes: Optional[List[str]] = None,
server_url: str = "localhost",
measurement_window: int = 10000,
shared_memory: bool = False
):
print("\n")
print(f"==== Warmup start ====")
print("\n")
input_shapes = " ".join(map(lambda shape: f" --shape {shape}", input_shapes)) if input_shapes else ""
measurement_window = 6 * measurement_window
max_batch_size = max(batch_sizes)
max_total_requests = 2 * max_batch_size * triton_instances * triton_gpu_engine_count
max_concurrency = min(256, max_total_requests)
batch_size = max(1, max_total_requests // 256)
step = max(1, max_concurrency // 2)
min_concurrency = step
exec_args = f"""-m {model_name} \
-x 1 \
-p {measurement_window} \
-v \
-i http \
-u {server_url}:8000 \
-b {batch_size} \
--concurrency-range {min_concurrency}:{max_concurrency}:{step} \
--input-data {profiling_data} {input_shapes}"""
if shared_memory:
exec_args += " --shared-memory=cuda"
result = os.system(f"perf_client {exec_args}")
if result != 0:
print(f"Failed running performance tests. Perf client failed with exit code {result}")
sys.exit(1)
print("\n")
print(f"==== Warmup done ====")
print("\n")

View file

@ -0,0 +1,26 @@
from typing import Any, Dict, List, NamedTuple, Optional
import numpy as np
from deployment_toolkit.core import BaseMetricsCalculator
class MetricsCalculator(BaseMetricsCalculator):
def __init__(self):
pass
def calc(
self,
*,
ids: List[Any],
y_pred: Dict[str, np.ndarray],
x: Optional[Dict[str, np.ndarray]],
y_real: Optional[Dict[str, np.ndarray]],
) -> Dict[str, float]:
categories = np.argmax(y_pred["OUTPUT__0"], axis=-1)
print(categories.shape)
print(categories[:128], y_pred["OUTPUT__0"] )
print(y_real["OUTPUT__0"][:128])
return {
"accuracy": np.mean(np.argmax(y_pred["OUTPUT__0"], axis=-1) ==
np.argmax(y_real["OUTPUT__0"], axis=-1))
}

View file

@ -0,0 +1,32 @@
import torch
def update_argparser(parser):
parser.add_argument(
"--config", default="resnet50", type=str, required=True, help="Network to deploy")
parser.add_argument(
"--checkpoint", default=None, type=str, help="The checkpoint of the model. ")
parser.add_argument("--classes", type=int, default=1000, help="Number of classes")
parser.add_argument("--precision", type=str, default="fp32",
choices=["fp32", "fp16"], help="Inference precision")
def get_model(**model_args):
from image_classification import models
model = models.resnet50()
if "checkpoint" in model_args:
print(f"loading checkpoint {model_args['checkpoint']}")
state_dict = torch.load(model_args["checkpoint"], map_location="cpu")
model.load_state_dict({k.replace("module.", ""): v
for k, v in state_dict.items()})
if model_args["precision"] == "fp16":
model = model.half()
model = model.cuda()
model.eval()
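# INPUT__<n> / OUTPUT__<n> is the tensor naming convention expected by Triton's PyTorch (LibTorch) backend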
tensor_names = {"inputs": ["INPUT__0"],
"outputs": ["OUTPUT__0"]}
return model, tensor_names

View file

@ -0,0 +1,127 @@
#!/usr/bin/env python3
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import tarfile
from pathlib import Path
from typing import Tuple, Dict, List
from PIL import Image
from tqdm import tqdm
DATASETS_DIR = os.environ.get("DATASETS_DIR", None)
IMAGENET_DIRNAME = "imagenet"
IMAGE_ARCHIVE_FILENAME = "ILSVRC2012_img_val.tar"
DEVKIT_ARCHIVE_FILENAME = "ILSVRC2012_devkit_t12.tar.gz"
LABELS_REL_PATH = "ILSVRC2012_devkit_t12/data/ILSVRC2012_validation_ground_truth.txt"
META_REL_PATH = "ILSVRC2012_devkit_t12/data/meta.mat"
TARGET_SIZE = (224, 224) # (width, height)
_RESIZE_MIN = 256 # resize preserving aspect ratio to where this is minimal size
def parse_meta_mat(metafile) -> Dict[int, str]:
import scipy.io
meta = scipy.io.loadmat(metafile, squeeze_me=True)["synsets"]
nums_children = list(zip(*meta))[4]
meta = [meta[idx] for idx, num_children in enumerate(nums_children) if num_children == 0]
idcs, wnids = list(zip(*meta))[:2]
idx_to_wnid = {idx: wnid for idx, wnid in zip(idcs, wnids)}
return idx_to_wnid
def _process_image(image_file, target_size):
image = Image.open(image_file)
original_size = image.size
# scale image to size where minimal size is _RESIZE_MIN
scale_factor = max(_RESIZE_MIN / original_size[0], _RESIZE_MIN / original_size[1])
resize_to = int(original_size[0] * scale_factor), int(original_size[1] * scale_factor)
resized_image = image.resize(resize_to)
# central crop of image to target_size
left, upper = (resize_to[0] - target_size[0]) // 2, (resize_to[1] - target_size[1]) // 2
cropped_image = resized_image.crop((left, upper, left + target_size[0], upper + target_size[1]))
return cropped_image
def main():
import argparse
parser = argparse.ArgumentParser(description="Preprocess the ImageNet validation archive for inference benchmarks")
parser.add_argument(
"--dataset-dir",
help="Path to dataset directory where imagenet archives are stored and processed files will be saved.",
required=False,
default=DATASETS_DIR,
)
parser.add_argument(
"--target-size",
help="Size of target image. Format it as <width>,<height>.",
required=False,
default=",".join(map(str, TARGET_SIZE)),
)
args = parser.parse_args()
if args.dataset_dir is None:
raise ValueError(
"Please set $DATASETS_DIR env variable to point dataset dir with original dataset archives "
"and where processed files should be stored. Alternatively provide --dataset-dir CLI argument"
)
datasets_dir = Path(args.dataset_dir)
target_size = tuple(map(int, args.target_size.split(",")))
image_archive_path = datasets_dir / IMAGE_ARCHIVE_FILENAME
if not image_archive_path.exists():
raise RuntimeError(
f"There should be {IMAGE_ARCHIVE_FILENAME} file in {datasets_dir}."
f"You need to download the dataset from http://www.image-net.org/download."
)
devkit_archive_path = datasets_dir / DEVKIT_ARCHIVE_FILENAME
if not devkit_archive_path.exists():
raise RuntimeError(
f"There should be {DEVKIT_ARCHIVE_FILENAME} file in {datasets_dir}."
f"You need to download the dataset from http://www.image-net.org/download."
)
with tarfile.open(devkit_archive_path, mode="r") as devkit_archive_file:
labels_file = devkit_archive_file.extractfile(LABELS_REL_PATH)
labels = list(map(int, labels_file.readlines()))
# map validation labels (idxes from LABELS_REL_PATH) into WNID compatible with training set
meta_file = devkit_archive_file.extractfile(META_REL_PATH)
idx_to_wnid = parse_meta_mat(meta_file)
labels_wnid = [idx_to_wnid[idx] for idx in labels]
# remap WNID into index in sorted list of all WNIDs - this is how network outputs class
available_wnids = sorted(set(labels_wnid))
wnid_to_newidx = {wnid: new_cls for new_cls, wnid in enumerate(available_wnids)}
labels = [wnid_to_newidx[wnid] for wnid in labels_wnid]
output_dir = datasets_dir / IMAGENET_DIRNAME
with tarfile.open(image_archive_path, mode="r") as image_archive_file:
image_rel_paths = sorted(image_archive_file.getnames())
for cls, image_rel_path in tqdm(zip(labels, image_rel_paths), total=len(image_rel_paths)):
output_path = output_dir / str(cls) / image_rel_path
original_image_file = image_archive_file.extractfile(image_rel_path)
processed_image = _process_image(original_image_file, target_size)
output_path.parent.mkdir(parents=True, exist_ok=True)
processed_image.save(output_path.as_posix())
if __name__ == "__main__":
main()

View file

@ -0,0 +1,24 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
networkx==2.5
numpy<1.20.0,>=1.19.1  # numpy 1.20+ requires py37
onnx==1.8.0
onnxruntime==1.5.2
pycuda>=2019.1.2
PyYAML>=5.2
tqdm>=4.44.1
tabulate>=0.8.7
natsort>=7.0.0
# use tags instead of branch names - a Docker cache hit could otherwise prevent fetching the most recent changes from a branch
model_navigator @ git+https://github.com/triton-inference-server/model_navigator.git@v0.1.0#egg=model_navigator

View file

@ -0,0 +1,28 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.02-py3
ARG TRITON_CLIENT_IMAGE_NAME=nvcr.io/nvidia/tritonserver:21.02-py3-sdk
FROM ${TRITON_CLIENT_IMAGE_NAME} as triton-client
FROM ${FROM_IMAGE_NAME}
# Install Perf Client required library
RUN apt-get update && apt-get install -y libb64-dev libb64-0d
# Install Triton Client PythonAPI and copy Perf Client
COPY --from=triton-client /workspace/install/ /workspace/install/
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}
RUN find /workspace/install/python/ -iname triton*manylinux*.whl -exec pip install {}[all] \;
# Setup environment variables to access Triton Client binaries and libs
ENV PATH /workspace/install/bin:${PATH}
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}
ENV PYTHONPATH /workspace
WORKDIR /workspace
RUN pip install nvidia-pyindex
ADD requirements.txt /workspace/requirements.txt
ADD triton/requirements.txt /workspace/triton/requirements.txt
RUN pip install -r /workspace/requirements.txt
RUN pip install -r /workspace/triton/requirements.txt
ADD . /workspace

Binary file not shown.

Binary file not shown.

Binary file not shown.


View file

@ -1,248 +1,700 @@
# Deploying the ResNet-50 v1.5 model using Triton Inference Server
# Deploying the ResNet50 v1.5 model on Triton Inference Server
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/trtis-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains instructions for deployment to run inference
on Triton Inference Server as well as a detailed performance analysis.
The purpose of this document is to help you with achieving
the best inference performance.
This folder contains instructions on how to deploy and run inference on
Triton Inference Server as well as gather detailed performance analysis.
## Table of contents
## Table Of Contents
- [Solution overview](#solution-overview)
- [Introduction](#introduction)
- [Deployment process](#deployment-process)
- [Setup](#setup)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
- [Prepare configuration](#prepare-configuration)
- [Latency explanation](#latency-explanation)
- [Performance](#performance)
- [Offline scenario](#offline-scenario)
- [Offline: NVIDIA A40, ONNX Runtime TensorRT with FP16](#offline-nvidia-a40-onnx-runtime-tensorrt-with-fp16)
- [Offline: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16](#offline-nvidia-dgx-a100-1x-a100-80gb-onnx-runtime-tensorrt-with-fp16)
- [Offline: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16](#offline-nvidia-dgx-1-1x-v100-32gb-onnx-runtime-tensorrt-with-fp16)
- [Offline: NVIDIA T4, ONNX Runtime TensorRT with FP16](#offline-nvidia-t4-onnx-runtime-tensorrt-with-fp16)
- [Online scenario](#online-scenario)
- [Online: NVIDIA A40, ONNX Runtime TensorRT with FP16](#online-nvidia-a40-onnx-runtime-tensorrt-with-fp16)
- [Online: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16](#online-nvidia-dgx-a100-1x-a100-80gb-onnx-runtime-tensorrt-with-fp16)
- [Online: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16](#online-nvidia-dgx-1-1x-v100-32gb-onnx-runtime-tensorrt-with-fp16)
- [Online: NVIDIA T4, ONNX Runtime TensorRT with FP16](#online-nvidia-t4-onnx-runtime-tensorrt-with-fp16)
- [Release Notes](#release-notes)
- [Changelog](#changelog)
- [Known issues](#known-issues)
* [Model overview](#model-overview)
* [Setup](#setup)
* [Inference container](#inference-container)
* [Deploying the model](#deploying-the-model)
* [Running the Triton Inference Server](#running-the-triton-inference-server)
* [Quick Start Guide](#quick-start-guide)
* [Running the client](#running-the-client)
* [Gathering performance data](#gathering-performance-data)
* [Advanced](#advanced)
* [Automated benchmark script](#automated-benchmark-script)
* [Performance](#performance)
* [Dynamic batching performance](#dynamic-batching-performance)
* [TensorRT backend inference performance (1x V100 16GB)](#tensorrt-backend-inference-performance-1x-v100-16gb)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
The difference between v1 and v1.5 is that, in the bottleneck blocks which require
downsampling, v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.
This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
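The sketch below is a minimal, illustrative PyTorch snippet (not the repository implementation; batch norm and activations are omitted) showing where the stride difference sits in a downsampling bottleneck block:
```python
import torch.nn as nn

def downsampling_bottleneck_convs(in_ch, mid_ch, out_ch, v1_5=True):
    # v1 places stride=2 on the first 1x1 convolution; v1.5 moves it to the 3x3 convolution
    stride_1x1, stride_3x3 = (1, 2) if v1_5 else (2, 1)
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=stride_1x1, bias=False),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride_3x3, padding=1, bias=False),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
    )
```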
## Solution overview
### Introduction
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server)
provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs.
The server provides an inference service via an HTTP or gRPC endpoint,
allowing remote clients to request inferencing for any number of GPU
or CPU models being managed by the server.
This README provides step-by-step deployment instructions for models generated
during training (as described in the [model README](../../resnet50v1.5/README.md)).
Additionally, this README provides the corresponding deployment scripts that
ensure optimal GPU utilization during inferencing on Triton Inference Server.
### Deployment process
The deployment process consists of two steps:
1. Conversion. The purpose of conversion is to find the best performing model
format supported by Triton Inference Server.
Triton Inference Server uses a number of runtime backends such as
[TensorRT](https://developer.nvidia.com/tensorrt),
[LibTorch](https://github.com/triton-inference-server/pytorch_backend) and
[ONNX Runtime](https://github.com/triton-inference-server/onnxruntime_backend)
to support various model types. Refer to the
[Triton documentation](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
for a list of available backends.
2. Configuration. Model configuration on Triton Inference Server, which generates
necessary [configuration files](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md).
To run benchmarks measuring the model performance in inference,
perform the following steps:
1. Start the Triton Inference Server.
The Triton Inference Server is started in a separate
(possibly remote) container, and its gRPC and REST API ports are exposed.
2. Run accuracy tests.
Produce results which are tested against the given accuracy thresholds.
Refer to step 9 in the [Quick Start Guide](#quick-start-guide).
3. Run performance tests.
Produce latency and throughput results for offline (static batching)
and online (dynamic batching) scenarios.
Refer to step 11 in the [Quick Start Guide](#quick-start-guide).
The ResNet50 v1.5 model can be deployed for inference on the [NVIDIA Triton Inference Server](https://github.com/NVIDIA/trtis-inference-server) using
TorchScript, ONNX Runtime or TensorRT as an execution backend.
## Setup
This script requires a trained ResNet50 v1.5 model checkpoint that can be used for deployment.
### Inference container
For easy deployment, a build script for a special inference container is provided. To build that container, go to the main repository folder and run:
Ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch NGC container 20.11](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [Triton Inference Server NGC container 20.11](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
* [NVIDIA CUDA repository](https://docs.nvidia.com/cuda/archive/11.1.1/index.html)
* [NVIDIA Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/), [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
`docker build -t rn50_inference . -f triton/Dockerfile`
This command will download the dependencies and build the inference container. Then, run a shell inside the container:
`docker run -it --rm --gpus device=0 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v <PATH_TO_MODEL_REPOSITORY>:/repository rn50_inference bash`
Here `device=0` exposes only the GPU with ordinal `0` to the container; you can pass, for example, `device=0,1,2,3` to select the GPUs indexed by ordinals `0,1,2` and `3`, or `device=all` to make all available GPUs visible to the server. `PATH_TO_MODEL_REPOSITORY` indicates the location where the
deployed models are stored.
### Deploying the model
To deploy the ResNet-50 v1.5 model into the Triton Inference Server, you must run the `deployer.py` script from inside the deployment Docker container to achieve a compatible format.
```
usage: deployer.py [-h] (--ts-script | --ts-trace | --onnx | --trt)
[--triton-no-cuda] [--triton-model-name TRITON_MODEL_NAME]
[--triton-model-version TRITON_MODEL_VERSION]
[--triton-server-url TRITON_SERVER_URL]
[--triton-max-batch-size TRITON_MAX_BATCH_SIZE]
[--triton-dyn-batching-delay TRITON_DYN_BATCHING_DELAY]
[--triton-engine-count TRITON_ENGINE_COUNT]
[--save-dir SAVE_DIR]
[--max_workspace_size MAX_WORKSPACE_SIZE] [--trt-fp16]
[--capture-cuda-graph CAPTURE_CUDA_GRAPH]
...
optional arguments:
-h, --help show this help message and exit
--ts-script convert to torchscript using torch.jit.script
--ts-trace convert to torchscript using torch.jit.trace
--onnx convert to onnx using torch.onnx.export
--trt convert to trt using tensorrt
triton related flags:
--triton-no-cuda Use the CPU for tracing.
--triton-model-name TRITON_MODEL_NAME
exports to appropriate directory structure for TRITON
--triton-model-version TRITON_MODEL_VERSION
exports to appropriate directory structure for TRITON
--triton-server-url TRITON_SERVER_URL
exports to appropriate directory structure for TRITON
--triton-max-batch-size TRITON_MAX_BATCH_SIZE
Specifies the 'max_batch_size' in the TRITON model
config. See the TRITON documentation for more info.
--triton-dyn-batching-delay TRITON_DYN_BATCHING_DELAY
Determines the dynamic_batching queue delay in
milliseconds(ms) for the TRITON model config. Use '0'
or '-1' to specify static batching. See the TRITON
documentation for more info.
--triton-engine-count TRITON_ENGINE_COUNT
Specifies the 'instance_group' count value in the
TRITON model config. See the TRITON documentation for
more info.
--save-dir SAVE_DIR Saved model directory
optimization flags:
--max_workspace_size MAX_WORKSPACE_SIZE
set the size of the workspace for trt export
--trt-fp16 trt flag ---- export model in mixed precision mode
--capture-cuda-graph CAPTURE_CUDA_GRAPH
capture cuda graph for obtaining speedup. possible
values: 0, 1. default: 1.
model_arguments arguments that will be ignored by deployer lib and
will be forwarded to your deployer script
```
The following model-specific arguments have to be specified for model deployment:
```
--config CONFIG Network architecture to use for deployment (eg. resnet50,
resnext101-32x4d or se-resnext101-32x4d)
--checkpoint CHECKPOINT
Path to stored model weight. If not specified, model will be
randomly initialized
--batch_size BATCH_SIZE
Batch size used for dummy dataloader
--fp16 Use model with half-precision calculations
```
For example, to deploy the model in TensorRT format, using half precision and a max batch size of 64, under the name
`rn-trt-16`, execute:
`python -m triton.deployer --trt --trt-fp16 --triton-model-name rn-trt-16 --triton-max-batch-size 64 --save-dir /repository -- --config resnet50 --checkpoint model_checkpoint --batch_size 64 --fp16`
Where `model_checkpoint` is a checkpoint for a trained model with the same architecture (resnet50) as used during export.
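For reference, the `--ts-trace` path corresponds to tracing the model with `torch.jit.trace` (as stated in the help above). A minimal sketch of that idea, assuming the checkpoint is a plain `state_dict` possibly saved with `module.` prefixes (the file names and batch size below are only examples), is:
```python
import torch
from image_classification import models

model = models.resnet50().cuda().eval()
state_dict = torch.load("model_checkpoint", map_location="cpu")
model.load_state_dict({k.replace("module.", ""): v for k, v in state_dict.items()})

dummy_input = torch.randn(64, 3, 224, 224, device="cuda")  # matches --batch_size 64
with torch.no_grad():
    traced = torch.jit.trace(model, dummy_input)
torch.jit.save(traced, "model.pt")
```
The deployer additionally writes the Triton model directory structure and configuration; the snippet only illustrates the conversion step.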
### Running the Triton Inference Server
**NOTE: This step is executed outside the inference container.**
Pull the Triton Inference Server container from our repository:
`docker pull nvcr.io/nvidia/tritonserver:20.07-py3`
Run the command to start the Triton Inference Server:
`docker run -d --rm --gpus device=0 --ipc=host --network=host -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <PATH_TO_MODEL_REPOSITORY>:/models nvcr.io/nvidia/tritonserver:20.07-py3 trtserver --model-store=/models --log-verbose=1 --model-control-mode=poll --repository-poll-secs=5`
Here `device=0` exposes only the GPU with ordinal `0` to the container; pass, for example, `device=0,1,2,3` to select the GPUs indexed by ordinals `0,1,2` and `3`, or `device=all` to make all available GPUs visible to the server. `PATH_TO_MODEL_REPOSITORY` indicates the location where the
deployed models are stored. An additional `--model-control-mode` option allows the model to be reloaded when it changes in the filesystem. It is required by benchmark scripts that work with multiple model versions on a single Triton Inference Server instance.
## Quick Start Guide
Running the following scripts will build and launch the container with all required dependencies for native PyTorch as well as Triton Inference Server. This is necessary for running inference and can also be used for data download, processing, and training of the model.
1. Clone the repository.
IMPORTANT: This step is executed on the host computer.
```
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/Classification/ConvNets
```
2. Setup the environment in the host computer and start Triton Inference Server.
```
source triton/scripts/setup_environment.sh
bash triton/scripts/docker/triton_inference_server.sh
```
### Running the client
The client `client.py` checks the model accuracy against synthetic or real validation
data. The client connects to Triton Inference Server and performs inference.
```
usage: client.py [-h] --triton-server-url TRITON_SERVER_URL
--triton-model-name TRITON_MODEL_NAME [-v]
[--inference_data INFERENCE_DATA] [--batch_size BATCH_SIZE]
[--fp16]
optional arguments:
-h, --help show this help message and exit
--triton-server-url TRITON_SERVER_URL
URL address of Triton server (with port)
--triton-model-name TRITON_MODEL_NAME
Triton deployed model name
-v, --verbose Verbose mode.
--inference_data INFERENCE_DATA
Path to file with inference data.
--batch_size BATCH_SIZE
Inference request batch size
--fp16 Use fp16 precision for input data
```
To run inference on the model exported in the previous steps, using the data located under
`/dataset`, run:
`python -m triton.client --triton-server-url localhost:8001 --triton-model-name rn-trt-16 --inference_data /data/test_data.bin --batch_size 16 --fp16`
3. Build and run a container that extends the NGC PyTorch container with the Triton Inference Server client libraries and dependencies.
```
bash triton/scripts/docker/build.sh
bash triton/scripts/docker/interactive.sh
```
### Gathering performance data
Performance data can be gathered using the `perf_client` tool. To use this tool to measure performance for batch_size=32, the following command can be used:
4. Prepare the deployment configuration and create folders in Docker.
IMPORTANT: These and the following commands must be executed in the PyTorch NGC container.
```
source triton/scripts/setup_environment.sh
```
`/workspace/bin/perf_client --max-threads 10 -m rn-trt-16 -x 1 -p 10000 -v -i gRPC -u localhost:8001 -b 32 -l 5000 --concurrency-range 1 -f result.csv`
5. Download and pre-process the dataset.
```
bash triton/scripts/download_data.sh
bash triton/scripts/process_dataset.sh
```
6. Setup the parameters for deployment.
```
source triton/scripts/setup_parameters.sh
```
7. Convert the model from training to inference format (e.g. TensorRT).
```
python3 triton/convert_model.py \
--input-path triton/model.py \
--input-type pyt \
--output-path ${SHARED_DIR}/model \
--output-type ${FORMAT} \
--onnx-opset 11 \
--onnx-optimized 1 \
--max-batch-size ${MAX_BATCH_SIZE} \
--max-workspace-size 1073741824 \
--ignore-unknown-parameters \
\
--checkpoint ${CHECKPOINT_DIR}/nvidia_resnet50_200821.pth.tar \
--precision ${PRECISION} \
--config resnet50 \
--classes 1000 \
\
--dataloader triton/dataloader.py \
--data-dir ${DATASETS_DIR}/imagenet \
--batch-size ${MAX_BATCH_SIZE}
```
8. Configure the model on Triton Inference Server.
Generate the configuration from your model repository.
```
python3 triton/config_model_on_triton.py \
--model-repository ${MODEL_REPOSITORY_PATH} \
--model-path ${SHARED_DIR}/model \
--model-format ${FORMAT} \
--model-name ${MODEL_NAME} \
--model-version 1 \
--max-batch-size ${MAX_BATCH_SIZE} \
--precision ${PRECISION} \
--number-of-model-instances ${NUMBER_OF_MODEL_INSTANCES} \
--max-queue-delay-us 0 \
--preferred-batch-sizes ${MAX_BATCH_SIZE} \
--capture-cuda-graph 0 \
--backend-accelerator ${BACKEND_ACCELERATOR} \
--load-model ${TRITON_LOAD_MODEL_METHOD}
```
9. Run the Triton Inference Server accuracy tests.
```
python3 triton/run_inference_on_triton.py \
--server-url localhost:8001 \
--model-name ${MODEL_NAME} \
--model-version 1 \
--output-dir ${SHARED_DIR}/accuracy_dump \
\
--precision ${PRECISION} \
--dataloader triton/dataloader.py \
--data-dir ${DATASETS_DIR}/imagenet \
--batch-size ${MAX_BATCH_SIZE} \
--dump-labels
python3 triton/calculate_metrics.py \
--metrics triton/metric.py \
--dump-dir ${SHARED_DIR}/accuracy_dump \
--csv ${SHARED_DIR}/accuracy_metrics.csv
cat ${SHARED_DIR}/accuracy_metrics.csv
```
10. Run the Triton Inference Server performance online tests.
We want to maximize throughput within latency budget constraints.
Dynamic batching is a feature of Triton Inference Server that allows
inference requests to be combined by the server, so that a batch is
created dynamically, resulting in a reduced average latency.
You can set the Dynamic Batcher parameter `max_queue_delay_microseconds` to
indicate the maximum amount of time you are willing to wait and
`preferred_batch_size` to indicate your maximum server batch size
in the Triton Inference Server model configuration. The measurements
presented below set the maximum latency to zero to achieve the best latency
possible with good performance.
```
python triton/run_online_performance_test_on_triton.py \
--model-name ${MODEL_NAME} \
--input-data random \
--batch-sizes ${BATCH_SIZE} \
--triton-instances ${TRITON_INSTANCES} \
--number-of-model-instances ${NUMBER_OF_MODEL_INSTANCES} \
--result-path ${SHARED_DIR}/triton_performance_online.csv
```
11. Run the Triton Inference Server performance offline tests.
We want to maximize throughput. This scenario assumes that your data is already
available for inference, or that requests arrive quickly enough to fill the maximum batch size.
Triton Inference Server supports offline scenarios with static batching.
With static batching, inference requests are served
as they are received. The largest throughput improvements come
from increasing the batch size, due to efficiency gains in the GPU with larger
batches.
```
python triton/run_offline_performance_test_on_triton.py \
--model-name ${MODEL_NAME} \
--input-data random \
--batch-sizes ${BATCH_SIZE} \
--triton-instances ${TRITON_INSTANCES} \
--result-path ${SHARED_DIR}/triton_performance_offline.csv
```
For more information about `perf_client`, refer to the [documentation](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/optimization.html#perf-client).
## Advanced
### Automated benchmark script
To automate benchmarks of different model configurations, a special benchmark script is located in `triton/scripts/benchmark.sh`. To use this script,
run Triton Inference Server and then execute the script as follows:
`bash triton/scripts/benchmark.sh <MODEL_REPOSITORY> <LOG_DIRECTORY> <ARCHITECTURE> (<CHECKPOINT_PATH>)`
### Prepare configuration
You can use the environment variables to set the parameters of your inference
configuration.
Triton deployment scripts support several inference runtimes listed in the table below:
| Inference runtime | Mnemonic used in scripts |
|-------------------|--------------------------|
| [TorchScript Tracing](https://pytorch.org/docs/stable/jit.html) | `ts-trace` |
| [TorchScript Tracing](https://pytorch.org/docs/stable/jit.html) | `ts-script` |
| [ONNX](https://onnx.ai) | `onnx` |
| [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) | `trt` |
The name of the inference runtime should be put into the `FORMAT` variable.
Example values of some key variables in one configuration:
```
PRECISION="fp16"
FORMAT="trt"
BATCH_SIZE="1, 2, 4, 8, 16, 32, 64, 128"
BACKEND_ACCELERATOR="cuda"
MAX_BATCH_SIZE="128"
NUMBER_OF_MODEL_INSTANCES="1"
TRITON_MAX_QUEUE_DELAY="1"
TRITON_PREFERRED_BATCH_SIZES="64 128"
```
### Latency explanation
A typical Triton Inference Server pipeline can be broken down into the following steps:
1. The client serializes the inference request into a message and sends it to
the server (Client Send).
2. The message travels over the network from the client to the server (Network).
3. The message arrives at the server and is deserialized (Server Receive).
4. The request is placed on the queue (Server Queue).
5. The request is removed from the queue and computed (Server Compute).
6. The completed request is serialized in a message and sent back to
the client (Server Send).
7. The completed message then travels over the network from the server
to the client (Network).
8. The completed message is deserialized by the client and processed as
a completed inference request (Client Receive).
Generally, for local clients, steps 1-4 and 6-8 occupy only
a small fraction of time compared to step 5. Because backend deep learning
systems such as these classification models are rarely exposed directly to
end users, and instead only interface with local front-end servers, we can,
for the purposes of this analysis, consider all clients to be local.
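As a concrete example, in the NVIDIA A40 online results reported later in this document, at 16 concurrent client requests the per-stage average times sum to the reported average latency: 0.078 ms (Client Send) + 1.912 ms (Network+server Send/recv) + 1.286 ms (Server Queue) + 0.288 ms (Compute Input) + 2.697 ms (Compute Infer) + 0.024 ms (Compute Output) + 0 ms (Client Recv) = 6.285 ms.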
The benchmark script tests all supported backends with different batch sizes and server configurations. Logs from the execution are stored in `<LOG_DIRECTORY>`.
To process the static configuration logs, the `triton/scripts/process_output.sh` script can be used.
## Performance
### Dynamic batching performance
The Triton Inference Server has a built-in dynamic batching mechanism that can be enabled. When it is enabled, the server creates inference batches from multiple received requests, which yields better performance than running inference on each request individually. A single request is assumed to contain a single image on which inference needs to be performed. With dynamic batching enabled, the server concatenates single-image requests into an inference batch. The upper bound on the size of the inference batch is set to 64. All of these parameters are configurable.
Our results were obtained by running the automated benchmark script.
Throughput is measured in images/second, and latency in milliseconds.
### TensorRT backend inference performance (1x V100 16GB)
**FP32 Inference Performance**
|**Concurrent requests**|**Throughput (img/s)**|**Avg. Latency (ms)**|**90% Latency (ms)**|**95% Latency (ms)**|**99% Latency (ms)**|
|-----|--------|-------|--------|-------|-------|
| 1 | 133.6 | 7.48 | 7.56 | 7.59 | 7.68 |
| 2 | 156.6 | 12.77 | 12.84 | 12.86 | 12.93 |
| 4 | 193.3 | 20.70 | 20.82 | 20.85 | 20.92 |
| 8 | 357.4 | 22.38 | 22.53 | 22.57 | 22.67 |
| 16 | 627.3 | 25.49 | 25.64 | 25.69 | 25.80 |
| 32 | 1003 | 31.87 | 32.43 | 32.61 | 32.91 |
| 64 | 1394.7 | 45.85 | 46.13 | 46.22 | 46.86 |
| 128 | 1604.4 | 79.70 | 80.50 | 80.96 | 83.09 |
| 256 | 1670.7 | 152.21 | 186.78 | 188.36 | 190.52 |
**FP16 Inference Performance**
|**Concurrent requests**|**Throughput (img/s)**|**Avg. Latency (ms)**|**90% Latency (ms)**|**95% Latency (ms)**|**99% Latency (ms)**|
|-----|--------|-------|--------|-------|-------|
| 1 | 250.1 | 3.99 | 4.08 | 4.11 | 4.16 |
| 2 | 314.8 | 6.35 | 6.42 | 6.44 | 6.49 |
| 4 | 384.8 | 10.39 | 10.51 | 10.54 | 10.60 |
| 8 | 693.8 | 11.52 | 11.78 | 11.88 | 12.09 |
| 16 | 1132.9 | 14.13 | 14.31 | 14.41 | 14.65 |
| 32 | 1689.7 | 18.93 | 19.11 | 19.20 | 19.44 |
| 64 | 2226.3 | 28.74 | 29.53 | 29.74 | 31.09 |
| 128 | 2521.5 | 50.74 | 51.97 | 52.30 | 53.61 |
| 256 | 2738 | 93.76 | 97.14 | 115.19 | 117.21 |
### Offline scenario
This table lists the common parameters used in all performance measurements:
| Parameter Name               | Parameter Value   |
|:-----------------------------|:------------------|
| Max Batch Size               | 128               |
| Number of model instances    | 1                 |
| Triton Max Queue Delay       | 1                 |
| Triton Preferred Batch Sizes | 64 128            |
#### Offline: NVIDIA A40, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA A40
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_1l.svg)
</td><td>
![](plots/graph_performance_offline_1r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 491.5 | 2.046 | 2.111 | 2.126 | 2.031 |
| FP16 | TensorRT | 2 | 811.8 | 2.509 | 2.568 | 2.594 | 2.459 |
| FP16 | TensorRT | 4 | 1094 | 3.814 | 3.833 | 3.877 | 3.652 |
| FP16 | TensorRT | 8 | 1573.6 | 5.45 | 5.517 | 5.636 | 5.078 |
| FP16 | TensorRT | 16 | 1651.2 | 9.896 | 9.978 | 10.074 | 9.678 |
| FP16 | TensorRT | 32 | 2070.4 | 17.49 | 17.837 | 19.228 | 15.451 |
| FP16 | TensorRT | 64 | 1766.4 | 37.123 | 37.353 | 37.85 | 36.147 |
| FP16 | TensorRT | 128 | 1894.4 | 69.027 | 69.15 | 69.789 | 67.889 |
</details>
#### Offline: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX A100 (1x A100 80GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_5l.svg)
</td><td>
![](plots/graph_performance_offline_5r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 469.1 | 2.195 | 2.245 | 2.272 | 2.128 |
| FP16 | TensorRT | 2 | 910 | 2.222 | 2.229 | 2.357 | 2.194 |
| FP16 | TensorRT | 4 | 1447.6 | 3.055 | 3.093 | 3.354 | 2.759 |
| FP16 | TensorRT | 8 | 2051.2 | 4.035 | 4.195 | 4.287 | 3.895 |
| FP16 | TensorRT | 16 | 2760 | 6.033 | 6.121 | 6.348 | 5.793 |
| FP16 | TensorRT | 32 | 2857.6 | 11.47 | 11.573 | 11.962 | 11.193 |
| FP16 | TensorRT | 64 | 2534.4 | 26.345 | 26.899 | 29.744 | 25.244 |
| FP16 | TensorRT | 128 | 2662.4 | 49.612 | 51.713 | 53.666 | 48.086 |
</details>
#### Offline: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX-1 (1x V100 32GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_9l.svg)
</td><td>
![](plots/graph_performance_offline_9r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 351.8 | 2.996 | 3.051 | 3.143 | 2.838 |
| FP16 | TensorRT | 2 | 596.2 | 3.481 | 3.532 | 3.627 | 3.35 |
| FP16 | TensorRT | 4 | 953.6 | 4.314 | 4.351 | 4.45 | 4.191 |
| FP16 | TensorRT | 8 | 1337.6 | 6.185 | 6.347 | 6.581 | 5.979 |
| FP16 | TensorRT | 16 | 1726.4 | 9.736 | 9.87 | 10.904 | 9.266 |
| FP16 | TensorRT | 32 | 2044.8 | 15.833 | 15.977 | 16.438 | 15.664 |
| FP16 | TensorRT | 64 | 1670.4 | 38.667 | 38.842 | 40.773 | 38.412 |
| FP16 | TensorRT | 128 | 1548.8 | 84.454 | 85.308 | 88.363 | 82.159 |
</details>
#### Offline: NVIDIA T4, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA T4
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_13l.svg)
</td><td>
![](plots/graph_performance_offline_13r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 332.4 | 3.065 | 3.093 | 3.189 | 3.003 |
| FP16 | TensorRT | 2 | 499.4 | 4.069 | 4.086 | 4.143 | 3.998 |
| FP16 | TensorRT | 4 | 695.2 | 5.779 | 5.786 | 5.802 | 5.747 |
| FP16 | TensorRT | 8 | 888 | 9.039 | 9.05 | 9.065 | 8.998 |
| FP16 | TensorRT | 16 | 1057.6 | 15.319 | 15.337 | 15.389 | 15.113 |
| FP16 | TensorRT | 32 | 1129.6 | 28.77 | 28.878 | 29.082 | 28.353 |
| FP16 | TensorRT | 64 | 1203.2 | 54.194 | 54.417 | 55.331 | 53.187 |
| FP16 | TensorRT | 128 | 1280 | 102.466 | 102.825 | 103.177 | 100.155 |
</details>
### Online scenario
This table lists the common parameters used in all performance measurements:
| Parameter Name               | Parameter Value   |
|:-----------------------------|:------------------|
| Max Batch Size               | 128               |
| Number of model instances    | 1                 |
| Triton Max Queue Delay       | 1                 |
| Triton Preferred Batch Sizes | 64 128            |
#### Online: NVIDIA A40, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA A40
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_2.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 2543.7 | 0.078 | 1.912 | 1.286 | 0.288 | 2.697 | 0.024 | 0 | 6.624 | 7.039 | 7.414 | 9.188 | 6.285 |
| 32 | 3166.7 | 0.085 | 3.478 | 1.81 | 0.582 | 4.098 | 0.047 | 0 | 9.924 | 11.001 | 12.217 | 14.717 | 10.1 |
| 48 | 3563.9 | 0.085 | 5.169 | 1.935 | 0.99 | 5.204 | 0.08 | 0 | 13.199 | 14.813 | 16.421 | 19.793 | 13.463 |
| 64 | 3514.9 | 0.091 | 5.729 | 3.847 | 1.553 | 6.842 | 0.138 | 0 | 17.986 | 18.85 | 19.916 | 25.825 | 18.2 |
| 80 | 3703.5 | 0.097 | 7.244 | 4.414 | 2 | 7.675 | 0.169 | 0 | 21.313 | 23.838 | 28.664 | 32.631 | 21.599 |
| 96 | 3636.9 | 0.101 | 8.459 | 5.679 | 3.157 | 8.771 | 0.215 | 0 | 26.131 | 27.751 | 31.269 | 38.695 | 26.382 |
| 112 | 3701.7 | 0.099 | 9.332 | 6.711 | 3.588 | 10.276 | 0.241 | 0 | 30.319 | 31.282 | 31.554 | 32.151 | 30.247 |
| 128 | 3795.8 | 0.106 | 10.581 | 7.309 | 4.067 | 11.386 | 0.268 | 0 | 33.893 | 34.793 | 35.448 | 43.182 | 33.717 |
| 144 | 3892.4 | 0.106 | 11.488 | 8.144 | 4.713 | 12.212 | 0.32 | 0 | 37.184 | 38.277 | 38.597 | 39.393 | 36.983 |
| 160 | 3950 | 0.106 | 13.5 | 7.999 | 5.083 | 13.481 | 0.343 | 0 | 40.656 | 42.157 | 44.756 | 53.426 | 40.512 |
| 176 | 3992.5 | 0.118 | 13.6 | 9.809 | 5.596 | 14.611 | 0.379 | 0 | 44.324 | 45.689 | 46.331 | 52.155 | 44.113 |
| 192 | 4058.3 | 0.116 | 14.902 | 10.223 | 6.054 | 15.564 | 0.416 | 0 | 47.537 | 48.91 | 49.752 | 55.973 | 47.275 |
| 208 | 4121.8 | 0.117 | 16.777 | 9.991 | 6.347 | 16.827 | 0.441 | 0 | 50.652 | 52.241 | 53.4 | 62.688 | 50.5 |
| 224 | 4116.1 | 0.124 | 17.048 | 11.743 | 7.065 | 17.91 | 0.504 | 0 | 54.571 | 56.204 | 56.877 | 62.169 | 54.394 |
| 240 | 4100 | 0.157 | 17.54 | 13.611 | 7.532 | 19.185 | 0.538 | 0 | 58.683 | 60.034 | 60.64 | 64.791 | 58.563 |
| 256 | 4310.5 | 0.277 | 18.282 | 13.5 | 7.714 | 19.136 | 0.539 | 0 | 59.244 | 60.686 | 61.349 | 66.84 | 59.448 |
</details>
#### Online: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX A100 (1x A100 80GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_10.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 2571.2 | 0.067 | 1.201 | 1.894 | 0.351 | 2.678 | 0.027 | 0 | 6.205 | 6.279 | 6.31 | 6.418 | 6.218 |
| 32 | 3600.2 | 0.058 | 2.641 | 2.004 | 0.716 | 3.41 | 0.057 | 0 | 8.852 | 9.233 | 9.353 | 12.253 | 8.886 |
| 48 | 4274.2 | 0.062 | 3.102 | 2.738 | 1.121 | 4.113 | 0.089 | 0 | 11.03 | 11.989 | 12.1 | 15.115 | 11.225 |
| 64 | 4387.7 | 0.07 | 3.767 | 3.438 | 2.016 | 5.164 | 0.122 | 0 | 14.628 | 15.067 | 15.211 | 15.504 | 14.577 |
| 80 | 4630.1 | 0.064 | 4.23 | 5.049 | 2.316 | 5.463 | 0.151 | 0 | 17.205 | 17.726 | 17.9 | 18.31 | 17.273 |
| 96 | 4893.9 | 0.068 | 4.811 | 5.764 | 2.741 | 6.044 | 0.179 | 0 | 19.44 | 20.23 | 20.411 | 22.781 | 19.607 |
| 112 | 4887.6 | 0.069 | 6.232 | 5.202 | 3.597 | 7.586 | 0.236 | 0 | 23.099 | 23.665 | 23.902 | 24.192 | 22.922 |
| 128 | 5411.5 | 0.081 | 5.921 | 7 | 3.387 | 7.016 | 0.255 | 0 | 23.852 | 24.349 | 24.557 | 26.433 | 23.66 |
| 144 | 5322.9 | 0.08 | 7.066 | 7.55 | 3.996 | 8.059 | 0.299 | 0 | 27.024 | 28.487 | 29.725 | 33.7 | 27.05 |
| 160 | 5310.5 | 0.079 | 6.98 | 9.157 | 4.61 | 8.98 | 0.331 | 0 | 30.446 | 31.497 | 31.91 | 34.269 | 30.137 |
| 176 | 5458.7 | 0.081 | 7.857 | 9.272 | 5.047 | 9.634 | 0.345 | 0 | 32.588 | 33.271 | 33.478 | 35.47 | 32.236 |
| 192 | 5654.1 | 0.081 | 9.355 | 8.898 | 5.294 | 9.923 | 0.388 | 0 | 34.35 | 35.895 | 36.302 | 39.288 | 33.939 |
| 208 | 5643.7 | 0.093 | 9.407 | 10.488 | 5.953 | 10.54 | 0.383 | 0 | 36.994 | 38.14 | 38.766 | 41.616 | 36.864 |
| 224 | 5795.5 | 0.101 | 9.862 | 10.852 | 6.331 | 11.081 | 0.415 | 0 | 38.536 | 39.741 | 40.563 | 43.227 | 38.642 |
| 240 | 5855.8 | 0.131 | 9.994 | 12.391 | 6.589 | 11.419 | 0.447 | 0 | 40.721 | 43.344 | 44.449 | 46.902 | 40.971 |
| 256 | 6127.3 | 0.131 | 10.495 | 12.342 | 6.979 | 11.344 | 0.473 | 0 | 41.606 | 43.106 | 43.694 | 46.457 | 41.764 |
</details>
#### Online: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX-1 (1x V100 32GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_18.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 1679.6 | 0.096 | 3.312 | 1.854 | 0.523 | 3.713 | 0.026 | 0 | 8.072 | 12.416 | 12.541 | 12.729 | 9.524 |
| 32 | 2760.1 | 0.095 | 3.933 | 1.978 | 0.949 | 4.597 | 0.035 | 0 | 11.569 | 11.728 | 11.785 | 12.39 | 11.587 |
| 48 | 3127.1 | 0.099 | 4.919 | 3.105 | 1.358 | 5.816 | 0.051 | 0 | 15.471 | 15.86 | 18.206 | 20.198 | 15.348 |
| 64 | 3287.4 | 0.101 | 5.874 | 4.346 | 1.789 | 7.293 | 0.069 | 0 | 19.44 | 19.727 | 19.838 | 20.584 | 19.472 |
| 80 | 3209 | 0.131 | 7.032 | 6.014 | 3.227 | 8.418 | 0.111 | 0 | 25.362 | 25.889 | 26.095 | 29.005 | 24.933 |
| 96 | 3273.6 | 0.14 | 8.539 | 6.74 | 4.371 | 9.369 | 0.153 | 0 | 29.217 | 29.641 | 29.895 | 31.002 | 29.312 |
| 112 | 3343.3 | 0.149 | 9.683 | 7.802 | 4.214 | 11.484 | 0.159 | 0 | 30.933 | 37.027 | 37.121 | 37.358 | 33.491 |
| 128 | 3335.1 | 0.152 | 9.865 | 10.127 | 5.519 | 12.534 | 0.195 | 0 | 38.762 | 40.022 | 40.336 | 42.943 | 38.392 |
| 144 | 3304.2 | 0.185 | 11.017 | 11.901 | 6.877 | 13.35 | 0.209 | 0 | 43.372 | 43.812 | 44.042 | 46.708 | 43.539 |
| 160 | 3319.9 | 0.206 | 12.701 | 12.625 | 7.49 | 14.907 | 0.238 | 0 | 48.31 | 49.135 | 49.343 | 50.441 | 48.167 |
| 176 | 3335 | 0.271 | 13.013 | 14.788 | 8.564 | 15.789 | 0.263 | 0 | 52.352 | 53.653 | 54.385 | 57.332 | 52.688 |
| 192 | 3380 | 0.243 | 13.894 | 15.719 | 9.865 | 16.841 | 0.283 | 0 | 56.872 | 58.64 | 58.944 | 62.097 | 56.845 |
| 208 | 3387.6 | 0.273 | 16.221 | 15.73 | 10.334 | 18.448 | 0.326 | 0 | 61.402 | 63.099 | 63.948 | 68.63 | 61.332 |
| 224 | 3477.2 | 0.613 | 14.167 | 18.902 | 10.896 | 19.605 | 0.34 | 0 | 64.495 | 65.69 | 66.101 | 67.522 | 64.523 |
| 240 | 3528 | 0.878 | 14.713 | 20.894 | 10.259 | 20.859 | 0.436 | 0 | 66.404 | 71.807 | 72.857 | 75.076 | 68.039 |
| 256 | 3558.4 | 1.035 | 15.534 | 22.837 | 11 | 21.062 | 0.435 | 0 | 71.657 | 77.271 | 78.269 | 80.804 | 71.903 |
</details>
#### Online: NVIDIA T4, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA T4
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_26.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 1078.4 | 0.169 | 6.163 | 2.009 | 0.495 | 5.963 | 0.022 | 0 | 15.75 | 16.219 | 16.376 | 16.597 | 14.821 |
| 32 | 2049.6 | 0.195 | 4.342 | 3.384 | 0.849 | 6.804 | 0.032 | 0 | 15.606 | 15.792 | 15.853 | 15.975 | 15.606 |
| 48 | 2133.1 | 0.189 | 6.365 | 4.926 | 1.379 | 9.573 | 0.063 | 0 | 22.304 | 23.432 | 23.73 | 27.241 | 22.495 |
| 64 | 2114.3 | 0.206 | 9.038 | 6.258 | 1.863 | 12.812 | 0.086 | 0 | 30.074 | 31.063 | 31.535 | 42.845 | 30.263 |
| 80 | 2089.3 | 0.204 | 11.943 | 7.841 | 2.676 | 15.556 | 0.108 | 0 | 38.289 | 40.895 | 52.977 | 58.393 | 38.328 |
| 96 | 2145.3 | 0.23 | 12.987 | 9.63 | 3.597 | 18.132 | 0.134 | 0 | 44.511 | 47.352 | 47.809 | 48.429 | 44.71 |
| 112 | 2062.3 | 0.28 | 13.253 | 14.112 | 5.088 | 21.398 | 0.154 | 0 | 54.289 | 55.441 | 55.69 | 56.205 | 54.285 |
| 128 | 2042.6 | 0.485 | 14.377 | 16.957 | 6.279 | 24.487 | 0.169 | 0 | 62.718 | 63.902 | 64.178 | 64.671 | 62.754 |
| 144 | 2066.6 | 0.726 | 16.363 | 18.601 | 7.085 | 26.801 | 0.193 | 0.001 | 69.67 | 71.418 | 71.765 | 73.255 | 69.77 |
| 160 | 2073.1 | 0.557 | 17.787 | 20.809 | 7.378 | 30.43 | 0.215 | 0 | 77.212 | 79.089 | 79.815 | 83.434 | 77.176 |
| 176 | 2076.8 | 1.209 | 18.446 | 23.075 | 8.689 | 32.894 | 0.253 | 0 | 84.13 | 86.732 | 87.404 | 95.286 | 84.566 |
| 192 | 2073.9 | 1.462 | 19.845 | 25.653 | 9.088 | 36.153 | 0.272 | 0 | 92.32 | 94.276 | 94.805 | 96.765 | 92.473 |
| 208 | 2053.2 | 1.071 | 22.995 | 26.411 | 10.123 | 40.415 | 0.322 | 0 | 101.178 | 103.725 | 105.498 | 110.695 | 101.337 |
| 224 | 1994.1 | 0.968 | 24.931 | 31.14 | 14.276 | 40.804 | 0.389 | 0 | 114.177 | 116.977 | 118.248 | 121.879 | 112.508 |
| 240 | 1952.6 | 1.028 | 27.957 | 34.546 | 16.535 | 42.685 | 0.38 | 0 | 122.846 | 126.022 | 128.074 | 136.541 | 123.131 |
| 256 | 2017.8 | 0.85 | 27.437 | 38.553 | 15.224 | 44.637 | 0.401 | 0 | 129.052 | 132.762 | 134.337 | 138.108 | 127.102 |
</details>
## Release Notes
We're constantly refining and improving our performance on AI
and HPC workloads, even on the same hardware, with frequent updates
to our software stack. For our latest performance data, refer
to these pages for
[AI](https://developer.nvidia.com/deep-learning-performance-training-inference)
and [HPC](https://developer.nvidia.com/hpc-application-performance) benchmarks.
### Changelog
April 2021
- NVIDIA Ampere results added
September 2020
- Initial release
### Known issues
- There are no known issues with this model.

[Plot file (matplotlib SVG): "Performance offline" — bar chart of Avg Latency vs Client Batch Size (1, 2, 4, 8, 16, 32, 64, 128). Full SVG markup omitted.]

[Plot file (matplotlib SVG): performance chart with Client Batch Size (1, 2, 4, 8, 16, 32, 64, 128) on the x-axis; the remainder of the SVG source is truncated in this excerpt.]
L 35.015625 72.90625
Q 46.296875 72.90625 52.390625 68.21875
Q 58.5 63.53125 58.5 54.890625
Q 58.5 48.1875 55.375 44.234375
Q 52.25 40.28125 46.1875 39.3125
Q 53.46875 37.75 57.5 32.78125
Q 61.53125 27.828125 61.53125 20.40625
Q 61.53125 10.640625 54.890625 5.3125
Q 48.25 0 35.984375 0
L 9.8125 0
z
" id="DejaVuSans-66"/>
<path d="M 34.28125 27.484375
Q 23.390625 27.484375 19.1875 25
Q 14.984375 22.515625 14.984375 16.5
Q 14.984375 11.71875 18.140625 8.90625
Q 21.296875 6.109375 26.703125 6.109375
Q 34.1875 6.109375 38.703125 11.40625
Q 43.21875 16.703125 43.21875 25.484375
L 43.21875 27.484375
z
M 52.203125 31.203125
L 52.203125 0
L 43.21875 0
L 43.21875 8.296875
Q 40.140625 3.328125 35.546875 0.953125
Q 30.953125 -1.421875 24.3125 -1.421875
Q 15.921875 -1.421875 10.953125 3.296875
Q 6 8.015625 6 15.921875
Q 6 25.140625 12.171875 29.828125
Q 18.359375 34.515625 30.609375 34.515625
L 43.21875 34.515625
L 43.21875 35.40625
Q 43.21875 41.609375 39.140625 45
Q 35.0625 48.390625 27.6875 48.390625
Q 23 48.390625 18.546875 47.265625
Q 14.109375 46.140625 10.015625 43.890625
L 10.015625 52.203125
Q 14.9375 54.109375 19.578125 55.046875
Q 24.21875 56 28.609375 56
Q 40.484375 56 46.34375 49.84375
Q 52.203125 43.703125 52.203125 31.203125
z
" id="DejaVuSans-97"/>
<path d="M 48.78125 52.59375
L 48.78125 44.1875
Q 44.96875 46.296875 41.140625 47.34375
Q 37.3125 48.390625 33.40625 48.390625
Q 24.65625 48.390625 19.8125 42.84375
Q 14.984375 37.3125 14.984375 27.296875
Q 14.984375 17.28125 19.8125 11.734375
Q 24.65625 6.203125 33.40625 6.203125
Q 37.3125 6.203125 41.140625 7.25
Q 44.96875 8.296875 48.78125 10.40625
L 48.78125 2.09375
Q 45.015625 0.34375 40.984375 -0.53125
Q 36.96875 -1.421875 32.421875 -1.421875
Q 20.0625 -1.421875 12.78125 6.34375
Q 5.515625 14.109375 5.515625 27.296875
Q 5.515625 40.671875 12.859375 48.328125
Q 20.21875 56 33.015625 56
Q 37.15625 56 41.109375 55.140625
Q 45.0625 54.296875 48.78125 52.59375
z
" id="DejaVuSans-99"/>
<path d="M 54.890625 33.015625
L 54.890625 0
L 45.90625 0
L 45.90625 32.71875
Q 45.90625 40.484375 42.875 44.328125
Q 39.84375 48.1875 33.796875 48.1875
Q 26.515625 48.1875 22.3125 43.546875
Q 18.109375 38.921875 18.109375 30.90625
L 18.109375 0
L 9.078125 0
L 9.078125 75.984375
L 18.109375 75.984375
L 18.109375 46.1875
Q 21.34375 51.125 25.703125 53.5625
Q 30.078125 56 35.796875 56
Q 45.21875 56 50.046875 50.171875
Q 54.890625 44.34375 54.890625 33.015625
z
" id="DejaVuSans-104"/>
<path d="M 53.515625 70.515625
L 53.515625 60.890625
Q 47.90625 63.578125 42.921875 64.890625
Q 37.9375 66.21875 33.296875 66.21875
Q 25.25 66.21875 20.875 63.09375
Q 16.5 59.96875 16.5 54.203125
Q 16.5 49.359375 19.40625 46.890625
Q 22.3125 44.4375 30.421875 42.921875
L 36.375 41.703125
Q 47.40625 39.59375 52.65625 34.296875
Q 57.90625 29 57.90625 20.125
Q 57.90625 9.515625 50.796875 4.046875
Q 43.703125 -1.421875 29.984375 -1.421875
Q 24.8125 -1.421875 18.96875 -0.25
Q 13.140625 0.921875 6.890625 3.21875
L 6.890625 13.375
Q 12.890625 10.015625 18.65625 8.296875
Q 24.421875 6.59375 29.984375 6.59375
Q 38.421875 6.59375 43.015625 9.90625
Q 47.609375 13.234375 47.609375 19.390625
Q 47.609375 24.75 44.3125 27.78125
Q 41.015625 30.8125 33.5 32.328125
L 27.484375 33.5
Q 16.453125 35.6875 11.515625 40.375
Q 6.59375 45.0625 6.59375 53.421875
Q 6.59375 63.09375 13.40625 68.65625
Q 20.21875 74.21875 32.171875 74.21875
Q 37.3125 74.21875 42.625 73.28125
Q 47.953125 72.359375 53.515625 70.515625
z
" id="DejaVuSans-83"/>
<path d="M 5.515625 54.6875
L 48.1875 54.6875
L 48.1875 46.484375
L 14.40625 7.171875
L 48.1875 7.171875
L 48.1875 0
L 4.296875 0
L 4.296875 8.203125
L 38.09375 47.515625
L 5.515625 47.515625
z
" id="DejaVuSans-122"/>
</defs>
<use xlink:href="#DejaVuSans-67"/>
<use x="69.824219" xlink:href="#DejaVuSans-108"/>
<use x="97.607422" xlink:href="#DejaVuSans-105"/>
<use x="125.390625" xlink:href="#DejaVuSans-101"/>
<use x="186.914062" xlink:href="#DejaVuSans-110"/>
<use x="250.292969" xlink:href="#DejaVuSans-116"/>
<use x="289.501953" xlink:href="#DejaVuSans-32"/>
<use x="321.289062" xlink:href="#DejaVuSans-66"/>
<use x="389.892578" xlink:href="#DejaVuSans-97"/>
<use x="451.171875" xlink:href="#DejaVuSans-116"/>
<use x="490.380859" xlink:href="#DejaVuSans-99"/>
<use x="545.361328" xlink:href="#DejaVuSans-104"/>
<use x="608.740234" xlink:href="#DejaVuSans-32"/>
<use x="640.527344" xlink:href="#DejaVuSans-83"/>
<use x="704.003906" xlink:href="#DejaVuSans-105"/>
<use x="731.787109" xlink:href="#DejaVuSans-122"/>
<use x="784.277344" xlink:href="#DejaVuSans-101"/>
</g>
</g>
</g>
<g id="matplotlib.axis_2">
<g id="ytick_1">
<g id="line2d_1">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 288.430125
L 403.43125 288.430125
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_10">
<!-- 0 -->
<g style="fill:#262626;" transform="translate(29.8125 292.609266)scale(0.11 -0.11)">
<defs>
<path d="M 31.78125 66.40625
Q 24.171875 66.40625 20.328125 58.90625
Q 16.5 51.421875 16.5 36.375
Q 16.5 21.390625 20.328125 13.890625
Q 24.171875 6.390625 31.78125 6.390625
Q 39.453125 6.390625 43.28125 13.890625
Q 47.125 21.390625 47.125 36.375
Q 47.125 51.421875 43.28125 58.90625
Q 39.453125 66.40625 31.78125 66.40625
z
M 31.78125 74.21875
Q 44.046875 74.21875 50.515625 64.515625
Q 56.984375 54.828125 56.984375 36.375
Q 56.984375 17.96875 50.515625 8.265625
Q 44.046875 -1.421875 31.78125 -1.421875
Q 19.53125 -1.421875 13.0625 8.265625
Q 6.59375 17.96875 6.59375 36.375
Q 6.59375 54.828125 13.0625 64.515625
Q 19.53125 74.21875 31.78125 74.21875
z
" id="DejaVuSans-48"/>
</defs>
<use xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_2">
<g id="line2d_2">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 234.446995
L 403.43125 234.446995
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_11">
<!-- 20 -->
<g style="fill:#262626;" transform="translate(22.81375 238.626135)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-50"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_3">
<g id="line2d_3">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 180.463864
L 403.43125 180.463864
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_12">
<!-- 40 -->
<g style="fill:#262626;" transform="translate(22.81375 184.643005)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-52"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_4">
<g id="line2d_4">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 126.480734
L 403.43125 126.480734
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_13">
<!-- 60 -->
<g style="fill:#262626;" transform="translate(22.81375 130.659875)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-54"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_5">
<g id="line2d_5">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 72.497604
L 403.43125 72.497604
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_14">
<!-- 80 -->
<g style="fill:#262626;" transform="translate(22.81375 76.676745)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-56"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="text_15">
<!-- Avg Latency -->
<g style="fill:#262626;" transform="translate(16.318125 192.110062)rotate(-90)scale(0.12 -0.12)">
<defs>
<path d="M 34.1875 63.1875
L 20.796875 26.90625
L 47.609375 26.90625
z
M 28.609375 72.90625
L 39.796875 72.90625
L 67.578125 0
L 57.328125 0
L 50.6875 18.703125
L 17.828125 18.703125
L 11.1875 0
L 0.78125 0
z
" id="DejaVuSans-65"/>
<path d="M 2.984375 54.6875
L 12.5 54.6875
L 29.59375 8.796875
L 46.6875 54.6875
L 56.203125 54.6875
L 35.6875 0
L 23.484375 0
z
" id="DejaVuSans-118"/>
<path d="M 45.40625 27.984375
Q 45.40625 37.75 41.375 43.109375
Q 37.359375 48.484375 30.078125 48.484375
Q 22.859375 48.484375 18.828125 43.109375
Q 14.796875 37.75 14.796875 27.984375
Q 14.796875 18.265625 18.828125 12.890625
Q 22.859375 7.515625 30.078125 7.515625
Q 37.359375 7.515625 41.375 12.890625
Q 45.40625 18.265625 45.40625 27.984375
z
M 54.390625 6.78125
Q 54.390625 -7.171875 48.1875 -13.984375
Q 42 -20.796875 29.203125 -20.796875
Q 24.46875 -20.796875 20.265625 -20.09375
Q 16.0625 -19.390625 12.109375 -17.921875
L 12.109375 -9.1875
Q 16.0625 -11.328125 19.921875 -12.34375
Q 23.78125 -13.375 27.78125 -13.375
Q 36.625 -13.375 41.015625 -8.765625
Q 45.40625 -4.15625 45.40625 5.171875
L 45.40625 9.625
Q 42.625 4.78125 38.28125 2.390625
Q 33.9375 0 27.875 0
Q 17.828125 0 11.671875 7.65625
Q 5.515625 15.328125 5.515625 27.984375
Q 5.515625 40.671875 11.671875 48.328125
Q 17.828125 56 27.875 56
Q 33.9375 56 38.28125 53.609375
Q 42.625 51.21875 45.40625 46.390625
L 45.40625 54.6875
L 54.390625 54.6875
z
" id="DejaVuSans-103"/>
<path d="M 9.8125 72.90625
L 19.671875 72.90625
L 19.671875 8.296875
L 55.171875 8.296875
L 55.171875 0
L 9.8125 0
z
" id="DejaVuSans-76"/>
<path d="M 32.171875 -5.078125
Q 28.375 -14.84375 24.75 -17.8125
Q 21.140625 -20.796875 15.09375 -20.796875
L 7.90625 -20.796875
L 7.90625 -13.28125
L 13.1875 -13.28125
Q 16.890625 -13.28125 18.9375 -11.515625
Q 21 -9.765625 23.484375 -3.21875
L 25.09375 0.875
L 2.984375 54.6875
L 12.5 54.6875
L 29.59375 11.921875
L 46.6875 54.6875
L 56.203125 54.6875
z
" id="DejaVuSans-121"/>
</defs>
<use xlink:href="#DejaVuSans-65"/>
<use x="62.533203" xlink:href="#DejaVuSans-118"/>
<use x="121.712891" xlink:href="#DejaVuSans-103"/>
<use x="185.189453" xlink:href="#DejaVuSans-32"/>
<use x="216.976562" xlink:href="#DejaVuSans-76"/>
<use x="272.689453" xlink:href="#DejaVuSans-97"/>
<use x="333.96875" xlink:href="#DejaVuSans-116"/>
<use x="373.177734" xlink:href="#DejaVuSans-101"/>
<use x="434.701172" xlink:href="#DejaVuSans-110"/>
<use x="498.080078" xlink:href="#DejaVuSans-99"/>
<use x="553.060547" xlink:href="#DejaVuSans-121"/>
</g>
</g>
</g>
<g id="patch_3">
<path clip-path="url(#p8f4ea3f47d)" d="M 50.77525 288.430125
L 86.48725 288.430125
L 86.48725 280.769919
L 50.77525 280.769919
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_4">
<path clip-path="url(#p8f4ea3f47d)" d="M 95.41525 288.430125
L 131.12725 288.430125
L 131.12725 279.387951
L 95.41525 279.387951
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_5">
<path clip-path="url(#p8f4ea3f47d)" d="M 140.05525 288.430125
L 175.76725 288.430125
L 175.76725 277.11796
L 140.05525 277.11796
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_6">
<path clip-path="url(#p8f4ea3f47d)" d="M 184.69525 288.430125
L 220.40725 288.430125
L 220.40725 272.291868
L 184.69525 272.291868
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_7">
<path clip-path="url(#p8f4ea3f47d)" d="M 229.33525 288.430125
L 265.04725 288.430125
L 265.04725 263.419741
L 229.33525 263.419741
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_8">
<path clip-path="url(#p8f4ea3f47d)" d="M 273.97525 288.430125
L 309.68725 288.430125
L 309.68725 246.150537
L 273.97525 246.150537
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_9">
<path clip-path="url(#p8f4ea3f47d)" d="M 318.61525 288.430125
L 354.32725 288.430125
L 354.32725 184.750125
L 318.61525 184.750125
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_10">
<path clip-path="url(#p8f4ea3f47d)" d="M 363.25525 288.430125
L 398.96725 288.430125
L 398.96725 66.670125
L 363.25525 66.670125
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_11">
<path d="M 46.31125 288.430125
L 46.31125 22.318125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="patch_12">
<path d="M 403.43125 288.430125
L 403.43125 22.318125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="patch_13">
<path d="M 46.31125 288.430125
L 403.43125 288.430125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="patch_14">
<path d="M 46.31125 22.318125
L 403.43125 22.318125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="text_16">
<!-- Performance offline -->
<g style="fill:#262626;" transform="translate(166.222188 16.318125)scale(0.12 -0.12)">
<defs>
<path d="M 19.671875 64.796875
L 19.671875 37.40625
L 32.078125 37.40625
Q 38.96875 37.40625 42.71875 40.96875
Q 46.484375 44.53125 46.484375 51.125
Q 46.484375 57.671875 42.71875 61.234375
Q 38.96875 64.796875 32.078125 64.796875
z
M 9.8125 72.90625
L 32.078125 72.90625
Q 44.34375 72.90625 50.609375 67.359375
Q 56.890625 61.8125 56.890625 51.125
Q 56.890625 40.328125 50.609375 34.8125
Q 44.34375 29.296875 32.078125 29.296875
L 19.671875 29.296875
L 19.671875 0
L 9.8125 0
z
" id="DejaVuSans-80"/>
<path d="M 41.109375 46.296875
Q 39.59375 47.171875 37.8125 47.578125
Q 36.03125 48 33.890625 48
Q 26.265625 48 22.1875 43.046875
Q 18.109375 38.09375 18.109375 28.8125
L 18.109375 0
L 9.078125 0
L 9.078125 54.6875
L 18.109375 54.6875
L 18.109375 46.1875
Q 20.953125 51.171875 25.484375 53.578125
Q 30.03125 56 36.53125 56
Q 37.453125 56 38.578125 55.875
Q 39.703125 55.765625 41.0625 55.515625
z
" id="DejaVuSans-114"/>
<path d="M 37.109375 75.984375
L 37.109375 68.5
L 28.515625 68.5
Q 23.6875 68.5 21.796875 66.546875
Q 19.921875 64.59375 19.921875 59.515625
L 19.921875 54.6875
L 34.71875 54.6875
L 34.71875 47.703125
L 19.921875 47.703125
L 19.921875 0
L 10.890625 0
L 10.890625 47.703125
L 2.296875 47.703125
L 2.296875 54.6875
L 10.890625 54.6875
L 10.890625 58.5
Q 10.890625 67.625 15.140625 71.796875
Q 19.390625 75.984375 28.609375 75.984375
z
" id="DejaVuSans-102"/>
<path d="M 30.609375 48.390625
Q 23.390625 48.390625 19.1875 42.75
Q 14.984375 37.109375 14.984375 27.296875
Q 14.984375 17.484375 19.15625 11.84375
Q 23.34375 6.203125 30.609375 6.203125
Q 37.796875 6.203125 41.984375 11.859375
Q 46.1875 17.53125 46.1875 27.296875
Q 46.1875 37.015625 41.984375 42.703125
Q 37.796875 48.390625 30.609375 48.390625
z
M 30.609375 56
Q 42.328125 56 49.015625 48.375
Q 55.71875 40.765625 55.71875 27.296875
Q 55.71875 13.875 49.015625 6.21875
Q 42.328125 -1.421875 30.609375 -1.421875
Q 18.84375 -1.421875 12.171875 6.21875
Q 5.515625 13.875 5.515625 27.296875
Q 5.515625 40.765625 12.171875 48.375
Q 18.84375 56 30.609375 56
z
" id="DejaVuSans-111"/>
<path d="M 52 44.1875
Q 55.375 50.25 60.0625 53.125
Q 64.75 56 71.09375 56
Q 79.640625 56 84.28125 50.015625
Q 88.921875 44.046875 88.921875 33.015625
L 88.921875 0
L 79.890625 0
L 79.890625 32.71875
Q 79.890625 40.578125 77.09375 44.375
Q 74.3125 48.1875 68.609375 48.1875
Q 61.625 48.1875 57.5625 43.546875
Q 53.515625 38.921875 53.515625 30.90625
L 53.515625 0
L 44.484375 0
L 44.484375 32.71875
Q 44.484375 40.625 41.703125 44.40625
Q 38.921875 48.1875 33.109375 48.1875
Q 26.21875 48.1875 22.15625 43.53125
Q 18.109375 38.875 18.109375 30.90625
L 18.109375 0
L 9.078125 0
L 9.078125 54.6875
L 18.109375 54.6875
L 18.109375 46.1875
Q 21.1875 51.21875 25.484375 53.609375
Q 29.78125 56 35.6875 56
Q 41.65625 56 45.828125 52.96875
Q 50 49.953125 52 44.1875
z
" id="DejaVuSans-109"/>
</defs>
<use xlink:href="#DejaVuSans-80"/>
<use x="56.677734" xlink:href="#DejaVuSans-101"/>
<use x="118.201172" xlink:href="#DejaVuSans-114"/>
<use x="159.314453" xlink:href="#DejaVuSans-102"/>
<use x="194.519531" xlink:href="#DejaVuSans-111"/>
<use x="255.701172" xlink:href="#DejaVuSans-114"/>
<use x="295.064453" xlink:href="#DejaVuSans-109"/>
<use x="392.476562" xlink:href="#DejaVuSans-97"/>
<use x="453.755859" xlink:href="#DejaVuSans-110"/>
<use x="517.134766" xlink:href="#DejaVuSans-99"/>
<use x="572.115234" xlink:href="#DejaVuSans-101"/>
<use x="633.638672" xlink:href="#DejaVuSans-32"/>
<use x="665.425781" xlink:href="#DejaVuSans-111"/>
<use x="726.607422" xlink:href="#DejaVuSans-102"/>
<use x="761.8125" xlink:href="#DejaVuSans-102"/>
<use x="797.017578" xlink:href="#DejaVuSans-108"/>
<use x="824.800781" xlink:href="#DejaVuSans-105"/>
<use x="852.583984" xlink:href="#DejaVuSans-110"/>
<use x="915.962891" xlink:href="#DejaVuSans-101"/>
</g>
</g>
<g id="legend_1"/>
</g>
</g>
<defs>
<clipPath id="p8f4ea3f47d">
<rect height="266.112" width="357.12" x="46.31125" y="22.318125"/>
</clipPath>
</defs>
</svg>

After

Width:  |  Height:  |  Size: 29 KiB

[Four additional SVG performance charts added in this commit (96, 95, 94, and 92 KiB); their diffs are suppressed as too large.]
View file

@@ -0,0 +1,134 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
To run inference on the framework runtime, use the `run_inference_on_fw.py` script.
It runs inference locally on data obtained from the given data loader and saves the outputs into npz files.
Those files are stored in the directory specified by the `--output-dir` argument.
Example call:
```shell script
python ./triton/run_inference_on_fw.py \
--input-path /models/exported/model.onnx \
--input-type onnx \
--dataloader triton/dataloader.py \
--data-dir /data/imagenet \
--batch-size 32 \
--output-dir /results/dump_local \
--dump-labels
```
"""
import argparse
import logging
import os
from pathlib import Path
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["TF_ENABLE_DEPRECATION_WARNINGS"] = "0"
from tqdm import tqdm
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import DATALOADER_FN_NAME, BaseLoader, BaseRunner, Format, load_from_file
from .deployment_toolkit.dump import NpzWriter
from .deployment_toolkit.extensions import loaders, runners
LOGGER = logging.getLogger("run_inference_on_fw")
def _verify_and_format_dump(args, ids, x, y_pred, y_real):
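    # Assemble one batch for NpzWriter: predictions under "outputs", sample ids under "ids",
    # plus inputs/labels when the corresponding --dump-inputs/--dump-labels flags are set.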
data = {"outputs": y_pred, "ids": {"ids": ids}}
if args.dump_inputs:
data["inputs"] = x
if args.dump_labels:
if not y_real:
raise ValueError(
"Found empty label values. Please provide labels in dataloader_fn or do not use --dump-labels argument"
)
data["labels"] = y_real
return data
def _parse_and_validate_args():
supported_inputs = set(runners.supported_extensions) & set(loaders.supported_extensions)
parser = argparse.ArgumentParser(description="Dump local inference output of given model", allow_abbrev=False)
parser.add_argument("--input-path", help="Path to input model", required=True)
parser.add_argument("--input-type", help="Input model type", choices=supported_inputs, required=True)
parser.add_argument("--dataloader", help="Path to python file containing dataloader.", required=True)
parser.add_argument("--output-dir", help="Path to dir where output files will be stored", required=True)
parser.add_argument("--dump-labels", help="Dump labels to output dir", action="store_true", default=False)
parser.add_argument("--dump-inputs", help="Dump inputs to output dir", action="store_true", default=False)
parser.add_argument("-v", "--verbose", help="Verbose logs", action="store_true", default=False)
args, *_ = parser.parse_known_args()
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
ArgParserGenerator(get_dataloader_fn).update_argparser(parser)
Loader: BaseLoader = loaders.get(args.input_type)
ArgParserGenerator(Loader, module_path=args.input_path).update_argparser(parser)
Runner: BaseRunner = runners.get(args.input_type)
ArgParserGenerator(Runner).update_argparser(parser)
args = parser.parse_args()
types_requiring_io_params = []
if args.input_type in types_requiring_io_params and not all(p for p in [args.inputs, args.outputs]):
parser.error(f"For {args.input_type} input provide --inputs and --outputs parameters")
return args
def main():
args = _parse_and_validate_args()
log_level = logging.INFO if not args.verbose else logging.DEBUG
log_format = "%(asctime)s %(levelname)s %(name)s %(message)s"
logging.basicConfig(level=log_level, format=log_format)
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
Loader: BaseLoader = loaders.get(args.input_type)
Runner: BaseRunner = runners.get(args.input_type)
loader = ArgParserGenerator(Loader, module_path=args.input_path).from_args(args)
runner = ArgParserGenerator(Runner).from_args(args)
LOGGER.info(f"Loading {args.input_path}")
model = loader.load(args.input_path)
with runner.init_inference(model=model) as runner_session, NpzWriter(args.output_dir) as writer:
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
dataloader_fn = ArgParserGenerator(get_dataloader_fn).from_args(args)
LOGGER.info(f"Data loader initialized; Running inference")
for ids, x, y_real in tqdm(dataloader_fn(), unit="batch", mininterval=10):
y_pred = runner_session(x)
data = _verify_and_format_dump(args, ids=ids, x=x, y_pred=y_pred, y_real=y_real)
writer.write(**data)
LOGGER.info(f"Inference finished")
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,287 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
To run inference on a model deployed on Triton, use the `run_inference_on_triton.py` script.
It sends requests with data obtained from the given data loader and dumps the received outputs into npz files.
Those files are stored in the directory specified by the `--output-dir` argument.
Currently, the client communicates with the Triton server asynchronously over the gRPC protocol.
Example call:
```shell script
python ./triton/run_inference_on_triton.py \
--server-url localhost:8001 \
--model-name ResNet50 \
--model-version 1 \
--dump-labels \
--output-dir /results/dump_triton
```
"""
import argparse
import functools
import logging
import queue
import threading
import time
from pathlib import Path
from typing import Optional
from tqdm import tqdm
# pytype: disable=import-error
try:
from tritonclient import utils as client_utils # noqa: F401
from tritonclient.grpc import (
InferenceServerClient,
InferInput,
InferRequestedOutput,
)
except ImportError:
import tritongrpcclient as grpc_client
from tritongrpcclient import (
InferenceServerClient,
InferInput,
InferRequestedOutput,
)
# pytype: enable=import-error
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import DATALOADER_FN_NAME, load_from_file
from .deployment_toolkit.dump import NpzWriter
LOGGER = logging.getLogger("run_inference_on_triton")
class AsyncGRPCTritonRunner:
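    """Asynchronous gRPC client wrapper for Triton inference.

    A background thread issues requests from the dataloader while a condition
    variable caps the number of in-flight (unresponded) requests; completed
    responses are queued and yielded back to the caller batch by batch.
    """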
DEFAULT_MAX_RESP_WAIT_S = 120
DEFAULT_MAX_UNRESP_REQS = 128
DEFAULT_MAX_FINISH_WAIT_S = 900 # 15min
def __init__(
self,
server_url: str,
model_name: str,
model_version: str,
*,
dataloader,
verbose=False,
resp_wait_s: Optional[float] = None,
max_unresponded_reqs: Optional[int] = None,
):
self._server_url = server_url
self._model_name = model_name
self._model_version = model_version
self._dataloader = dataloader
self._verbose = verbose
self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s
self._max_unresp_reqs = self.DEFAULT_MAX_UNRESP_REQS if max_unresponded_reqs is None else max_unresponded_reqs
self._results = queue.Queue()
self._processed_all = False
self._errors = []
self._num_waiting_for = 0
self._sync = threading.Condition()
self._req_thread = threading.Thread(target=self.req_loop, daemon=True)
def __iter__(self):
self._req_thread.start()
timeout_s = 0.050 # check flags processed_all and error flags every 50ms
while True:
try:
ids, x, y_pred, y_real = self._results.get(timeout=timeout_s)
yield ids, x, y_pred, y_real
except queue.Empty:
shall_stop = self._processed_all or self._errors
if shall_stop:
break
LOGGER.debug("Waiting for request thread to stop")
self._req_thread.join()
if self._errors:
error_msg = "\n".join(map(str, self._errors))
raise RuntimeError(error_msg)
def _on_result(self, ids, x, y_real, output_names, result, error):
with self._sync:
if error:
self._errors.append(error)
else:
y_pred = {name: result.as_numpy(name) for name in output_names}
self._results.put((ids, x, y_pred, y_real))
self._num_waiting_for -= 1
self._sync.notify_all()
def req_loop(self):
client = InferenceServerClient(self._server_url, verbose=self._verbose)
self._errors = self._verify_triton_state(client)
if self._errors:
return
LOGGER.debug(
f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " f"are up and ready!"
)
model_config = client.get_model_config(self._model_name, self._model_version)
model_metadata = client.get_model_metadata(self._model_name, self._model_version)
LOGGER.info(f"Model config {model_config}")
LOGGER.info(f"Model metadata {model_metadata}")
inputs = {tm.name: tm for tm in model_metadata.inputs}
outputs = {tm.name: tm for tm in model_metadata.outputs}
output_names = list(outputs)
outputs_req = [InferRequestedOutput(name) for name in outputs]
self._num_waiting_for = 0
for ids, x, y_real in self._dataloader:
infer_inputs = []
for name in inputs:
data = x[name]
infer_input = InferInput(name, data.shape, inputs[name].datatype)
target_np_dtype = client_utils.triton_to_np_dtype(inputs[name].datatype)
data = data.astype(target_np_dtype)
infer_input.set_data_from_numpy(data)
infer_inputs.append(infer_input)
with self._sync:
def _check_can_send():
return self._num_waiting_for < self._max_unresp_reqs
can_send = self._sync.wait_for(_check_can_send, timeout=self._response_wait_t)
if not can_send:
error_msg = f"Runner could not send new requests for {self._response_wait_t}s"
self._errors.append(error_msg)
break
callback = functools.partial(AsyncGRPCTritonRunner._on_result, self, ids, x, y_real, output_names)
client.async_infer(
model_name=self._model_name,
model_version=self._model_version,
inputs=infer_inputs,
outputs=outputs_req,
callback=callback,
)
self._num_waiting_for += 1
# wait till receive all requested data
with self._sync:
def _all_processed():
LOGGER.debug(f"wait for {self._num_waiting_for} unprocessed jobs")
return self._num_waiting_for == 0
self._processed_all = self._sync.wait_for(_all_processed, self.DEFAULT_MAX_FINISH_WAIT_S)
if not self._processed_all:
error_msg = f"Runner {self._response_wait_t}s timeout received while waiting for results from server"
self._errors.append(error_msg)
LOGGER.debug("Finished request thread")
def _verify_triton_state(self, triton_client):
errors = []
if not triton_client.is_server_live():
errors.append(f"Triton server {self._server_url} is not live")
elif not triton_client.is_server_ready():
errors.append(f"Triton server {self._server_url} is not ready")
elif not triton_client.is_model_ready(self._model_name, self._model_version):
errors.append(f"Model {self._model_name}:{self._model_version} is not ready")
return errors
def _parse_args():
parser = argparse.ArgumentParser(description="Infer model on Triton server", allow_abbrev=False)
parser.add_argument(
"--server-url", type=str, default="localhost:8001", help="Inference server URL (default localhost:8001)"
)
parser.add_argument("--model-name", help="The name of the model used for inference.", required=True)
parser.add_argument("--model-version", help="The version of the model used for inference.", required=True)
parser.add_argument("--dataloader", help="Path to python file containing dataloader.", required=True)
parser.add_argument("--dump-labels", help="Dump labels to output dir", action="store_true", default=False)
parser.add_argument("--dump-inputs", help="Dump inputs to output dir", action="store_true", default=False)
parser.add_argument("-v", "--verbose", help="Verbose logs", action="store_true", default=False)
parser.add_argument("--output-dir", required=True, help="Path to directory where outputs will be saved")
parser.add_argument("--response-wait-time", required=False, help="Maximal time to wait for response", default=120)
parser.add_argument(
"--max-unresponded-requests", required=False, help="Maximal number of unresponded requests", default=128
)
args, *_ = parser.parse_known_args()
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
ArgParserGenerator(get_dataloader_fn).update_argparser(parser)
args = parser.parse_args()
return args
def main():
args = _parse_args()
log_format = "%(asctime)s %(levelname)s %(name)s %(message)s"
log_level = logging.INFO if not args.verbose else logging.DEBUG
logging.basicConfig(level=log_level, format=log_format)
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
dataloader_fn = ArgParserGenerator(get_dataloader_fn).from_args(args)
runner = AsyncGRPCTritonRunner(
args.server_url,
args.model_name,
args.model_version,
dataloader=dataloader_fn(),
verbose=False,
resp_wait_s=args.response_wait_time,
max_unresponded_reqs=args.max_unresponded_requests,
)
with NpzWriter(output_dir=args.output_dir) as writer:
start = time.time()
for ids, x, y_pred, y_real in tqdm(runner, unit="batch", mininterval=10):
data = _verify_and_format_dump(args, ids, x, y_pred, y_real)
writer.write(**data)
stop = time.time()
LOGGER.info(f"\nThe inference took {stop - start:0.3f}s")
def _verify_and_format_dump(args, ids, x, y_pred, y_real):
data = {"outputs": y_pred, "ids": {"ids": ids}}
if args.dump_inputs:
data["inputs"] = x
if args.dump_labels:
if not y_real:
raise ValueError(
"Found empty label values. Please provide labels in dataloader_fn or do not use --dump-labels argument"
)
data["labels"] = y_real
return data
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,178 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
For models with variable-sized inputs, you must provide the --input-shape argument so that perf_analyzer knows
what shape tensors to use. For example, for a model that has an input called IMAGE with shape [ 3, N, M ],
where N and M are variable-size dimensions, to tell perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ],
use `--shape IMAGE:3,224,224`.
"""
import argparse
import csv
import os
import sys
from pathlib import Path
from typing import Dict, List, Optional
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.report import save_results, show_results, sort_results
from .deployment_toolkit.warmup import warmup
def calculate_average_latency(r):
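    # Approximate end-to-end latency by summing the per-stage latency columns
    # reported by perf_client for this row (missing columns count as 0).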
avg_sum_fields = [
"Client Send",
"Network+Server Send/Recv",
"Server Queue",
"Server Compute",
"Server Compute Input",
"Server Compute Infer",
"Server Compute Output",
"Client Recv",
]
avg_latency = sum([int(r.get(f, 0)) for f in avg_sum_fields])
return avg_latency
def update_performance_data(results: List, batch_size: int, performance_partial_file: str):
row: Dict = {"batch_size": batch_size}
with open(performance_partial_file, "r") as csvfile:
reader = csv.DictReader(csvfile)
for r in reader:
avg_latency = calculate_average_latency(r)
row = {**row, **r, "avg latency": avg_latency}
results.append(row)
def _parse_batch_sizes(batch_sizes: str):
batches = batch_sizes.split(sep=",")
return list(map(lambda x: int(x.strip()), batches))
def offline_performance(
model_name: str,
batch_sizes: List[int],
result_path: str,
input_shapes: Optional[List[str]] = None,
profiling_data: str = "random",
triton_instances: int = 1,
server_url: str = "localhost",
measurement_window: int = 10000,
shared_memory: bool = False
):
print("\n")
print(f"==== Static batching analysis start ====")
print("\n")
input_shapes = " ".join(map(lambda shape: f" --shape {shape}", input_shapes)) if input_shapes else ""
results: List[Dict] = list()
for batch_size in batch_sizes:
print(f"Running performance tests for batch size: {batch_size}")
performance_partial_file = f"triton_performance_partial_{batch_size}.csv"
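        # perf_client invocation: -m selects the model, -b the static batch size,
        # -p the measurement window (ms), -f the CSV output parsed by update_performance_data().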
exec_args = f"""-max-threads {triton_instances} \
-m {model_name} \
-x 1 \
-c {triton_instances} \
-t {triton_instances} \
-p {measurement_window} \
-v \
-i http \
-u {server_url}:8000 \
-b {batch_size} \
-f {performance_partial_file} \
--input-data {profiling_data} {input_shapes}"""
if shared_memory:
exec_args += " --shared-memory=cuda"
result = os.system(f"perf_client {exec_args}")
if result != 0:
print(f"Failed running performance tests. Perf client failed with exit code {result}")
sys.exit(1)
update_performance_data(results, batch_size, performance_partial_file)
os.remove(performance_partial_file)
results = sort_results(results=results)
save_results(filename=result_path, data=results)
show_results(results=results)
print("Performance results for static batching stored in: {0}".format(result_path))
print("\n")
print(f"==== Analysis done ====")
print("\n")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--model-name", type=str, required=True, help="Name of the model to test")
parser.add_argument(
"--input-data", type=str, required=False, default="random", help="Input data to perform profiling."
)
parser.add_argument(
"--input-shape",
action="append",
required=False,
help="Input data shape in form INPUT_NAME:<full_shape_without_batch_axis>.",
)
parser.add_argument("--batch-sizes", type=str, required=True, help="List of batch sizes to tests. Comma separated.")
parser.add_argument("--result-path", type=str, required=True, help="Path where result file is going to be stored.")
parser.add_argument("--triton-instances", type=int, default=1, help="Number of Triton Server instances")
parser.add_argument("--server-url", type=str, required=False, default="localhost", help="Url to Triton server")
parser.add_argument(
"--measurement-window", required=False, help="Time which perf_analyzer will wait for results", default=10000
)
parser.add_argument("--shared-memory", help="Use shared memory for communication with Triton", action="store_true",
default=False)
args = parser.parse_args()
warmup(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
offline_performance(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
result_path=args.result_path,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,188 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
For models with variable-sized inputs, you must provide the --input-shape argument so that perf_analyzer knows
what shape tensors to use. For example, for a model that has an input called IMAGE with shape [ 3, N, M ],
where N and M are variable-size dimensions, to tell perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ],
use `--shape IMAGE:3,224,224`.
"""
import argparse
import csv
import os
import sys
from pathlib import Path
from typing import List, Optional
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.report import save_results, show_results, sort_results
from .deployment_toolkit.warmup import warmup
def calculate_average_latency(r):
avg_sum_fields = [
"Client Send",
"Network+Server Send/Recv",
"Server Queue",
"Server Compute",
"Server Compute Input",
"Server Compute Infer",
"Server Compute Output",
"Client Recv",
]
avg_latency = sum([int(r.get(f, 0)) for f in avg_sum_fields])
return avg_latency
def update_performance_data(results: List, performance_file: str):
with open(performance_file, "r") as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
row["avg latency"] = calculate_average_latency(row)
results.append(row)
def _parse_batch_sizes(batch_sizes: str):
batches = batch_sizes.split(sep=",")
return list(map(lambda x: int(x.strip()), batches))
def online_performance(
model_name: str,
batch_sizes: List[int],
result_path: str,
input_shapes: Optional[List[str]] = None,
profiling_data: str = "random",
triton_instances: int = 1,
triton_gpu_engine_count: int = 1,
server_url: str = "localhost",
measurement_window: int = 10000,
shared_memory: bool = False
):
print("\n")
print(f"==== Dynamic batching analysis start ====")
print("\n")
input_shapes = " ".join(map(lambda shape: f" --shape {shape}", input_shapes)) if input_shapes else ""
print(f"Running performance tests for dynamic batching")
performance_file = f"triton_performance_dynamic_partial.csv"
max_batch_size = max(batch_sizes)
max_total_requests = 2 * max_batch_size * triton_instances * triton_gpu_engine_count
max_concurrency = min(256, max_total_requests)
batch_size = max(1, max_total_requests // 256)
step = max(1, max_concurrency // 32)
min_concurrency = step
exec_args = f"""-m {model_name} \
-x 1 \
-p {measurement_window} \
-v \
-i http \
-u {server_url}:8000 \
-b {batch_size} \
-f {performance_file} \
--concurrency-range {min_concurrency}:{max_concurrency}:{step} \
--input-data {profiling_data} {input_shapes}"""
if shared_memory:
exec_args += " --shared-memory=cuda"
result = os.system(f"perf_client {exec_args}")
if result != 0:
print(f"Failed running performance tests. Perf client failed with exit code {result}")
sys.exit(1)
results = list()
update_performance_data(results=results, performance_file=performance_file)
results = sort_results(results=results)
save_results(filename=result_path, data=results)
show_results(results=results)
os.remove(performance_file)
print("Performance results for dynamic batching stored in: {0}".format(result_path))
print("\n")
print(f"==== Analysis done ====")
print("\n")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--model-name", type=str, required=True, help="Name of the model to test")
parser.add_argument(
"--input-data", type=str, required=False, default="random", help="Input data to perform profiling."
)
parser.add_argument(
"--input-shape",
action="append",
required=False,
help="Input data shape in form INPUT_NAME:<full_shape_without_batch_axis>.",
)
parser.add_argument("--batch-sizes", type=str, required=True, help="List of batch sizes to tests. Comma separated.")
parser.add_argument("--triton-instances", type=int, default=1, help="Number of Triton Server instances")
parser.add_argument(
"--number-of-model-instances", type=int, default=1, help="Number of models instances on Triton Server"
)
parser.add_argument("--result-path", type=str, required=True, help="Path where result file is going to be stored.")
parser.add_argument("--server-url", type=str, required=False, default="localhost", help="Url to Triton server")
parser.add_argument(
"--measurement-window", required=False, help="Time which perf_analyzer will wait for results", default=10000
)
parser.add_argument("--shared-memory", help="Use shared memory for communication with Triton", action="store_true",
default=False)
args = parser.parse_args()
warmup(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
triton_gpu_engine_count=args.number_of_model_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
online_performance(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
triton_gpu_engine_count=args.number_of_model_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
result_path=args.result_path,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,16 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
docker build -t resnet50 . -f triton/resnet50/Dockerfile

View file

@@ -0,0 +1,26 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
docker run -it --rm \
--gpus "device=all" \
--net=host \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e WORKDIR=$(pwd) \
-e PYTHONPATH=$(pwd) \
-v $(pwd):$(pwd) \
-w $(pwd) \
resnet50:latest bash

View file

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:=all}
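# Start Triton Inference Server in the background, publishing its HTTP (8000),
# gRPC (8001), and metrics (8002) endpoints.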
docker run --rm -d \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES} \
-v ${MODEL_REPOSITORY_PATH}:${MODEL_REPOSITORY_PATH} \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
nvcr.io/nvidia/tritonserver:21.02-py3 tritonserver \
--model-store=${MODEL_REPOSITORY_PATH} \
--strict-model-config=false \
--exit-on-error=true \
--model-control-mode=explicit

View file

@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Download checkpoint
if [ -f "${CHECKPOINT_DIR}/nvidia_resnet50_200821.pth.tar" ]; then
echo "Checkpoint already downloaded."
else
echo "Downloading checkpoint ..."
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnet50_pyt_amp/versions/20.06.0/zip -O \
resnet50_pyt_amp_20.06.0.zip || {
echo "ERROR: Failed to download checkpoint from NGC"
exit 1
}
unzip resnet50_pyt_amp_20.06.0.zip -d ${CHECKPOINT_DIR}
rm resnet50_pyt_amp_20.06.0.zip
echo "ok"
fi

View file

@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [ -d "${DATASETS_DIR}/imagenet" ]; then
echo "Dataset already downloaded and processed."
else
python triton/process_dataset.py
fi

View file

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
WORKDIR="${WORKDIR:=$(pwd)}"
export WORKSPACE_DIR=${WORKDIR}/workspace
export DATASETS_DIR=${WORKSPACE_DIR}/datasets_dir
export CHECKPOINT_DIR=${WORKSPACE_DIR}/checkpoint_dir
export MODEL_REPOSITORY_PATH=${WORKSPACE_DIR}/model_store
export SHARED_DIR=${WORKSPACE_DIR}/shared_dir
echo "Preparing directories"
mkdir -p ${WORKSPACE_DIR}
mkdir -p ${DATASETS_DIR}
mkdir -p ${CHECKPOINT_DIR}
mkdir -p ${MODEL_REPOSITORY_PATH}
mkdir -p ${SHARED_DIR}
echo "Setting up environment"
export MODEL_NAME=resnet50
export TRITON_LOAD_MODEL_METHOD=explicit
export TRITON_INSTANCES=1

View file

@@ -0,0 +1,23 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
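# Default deployment parameters: TensorRT FP16 engine, CUDA backend accelerator,
# batch sizes up to 128, and dynamic batching with preferred batch sizes 64 and 128.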
export PRECISION="fp16"
export FORMAT="trt"
export BATCH_SIZE="1,2,4,8,16,32,64,128"
export BACKEND_ACCELERATOR="cuda"
export MAX_BATCH_SIZE="128"
export NUMBER_OF_MODEL_INSTANCES="1"
export TRITON_MAX_QUEUE_DELAY="1"
export TRITON_PREFERRED_BATCH_SIZES="64 128"

View file

@@ -32,7 +32,7 @@ allow_multiline_lambdas = True
# # <------ this blank line
# def method():
# pass
blank_line_before_nested_class_or_def = True
blank_line_before_nested_class_or_def = False
# Insert a blank line before a module docstring.
blank_line_before_module_docstring = True
@@ -83,7 +83,7 @@ continuation_indent_width = 4
# start_ts=now()-timedelta(days=3),
# end_ts=now(),
# ) # <--- this bracket is dedented and on a separate line
dedent_closing_brackets = True
dedent_closing_brackets = False
# Disable the heuristic which places each list element on a separate line if the list is comma-terminated.
disable_ending_comma_heuristic = false

View file

@@ -1,8 +1,30 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf1-py3
ARG TRITON_CLIENT_IMAGE_NAME=nvcr.io/nvidia/tritonserver:20.12-py3-sdk
FROM ${TRITON_CLIENT_IMAGE_NAME} as triton-client
FROM ${FROM_IMAGE_NAME}
ADD requirements.txt .
RUN pip install -r requirements.txt
# Install libraries required by perf_client
RUN apt-get update && \
apt-get install -y libb64-dev libb64-0d && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
ADD . /workspace/rn50v15_tf
# Install Triton Client PythonAPI and copy Perf Client
COPY --from=triton-client /workspace/install/ /workspace/install/
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}
RUN find /workspace/install/python/ -iname triton*manylinux*.whl -exec pip install {}[all] \;
# Set up environment variables to access Triton Client lib and bin
ENV PATH /workspace/install/bin:${PATH}
ENV PYTHONPATH /workspace/rn50v15_tf
WORKDIR /workspace/rn50v15_tf
RUN pip uninstall -y typing
ADD requirements.txt .
ADD triton/requirements.txt triton/requirements.txt
RUN pip install -r requirements.txt
RUN pip install -r triton/requirements.txt
ADD . .

View file

@@ -51,7 +51,7 @@ were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
The following table shows the training accuracy results of the
The following table shows the training performance results of the
three classification models side-by-side.
@@ -71,7 +71,7 @@ were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
The following table shows the training accuracy results of the
The following table shows the training performance results of the
three classification models side-by-side.

View file

@@ -0,0 +1,436 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts image data to TFRecords file format with Example protos.
The image data set is expected to reside in JPEG files located in the
following directory structure.
data_dir/label_0/image0.jpeg
data_dir/label_0/image1.jpg
...
data_dir/label_1/weird-image.jpeg
data_dir/label_1/my-image.jpeg
...
where the sub-directory is the unique label associated with these images.
This TensorFlow script converts the training and evaluation data into
a sharded data set consisting of TFRecord files
train_directory/train-00000-of-01024
train_directory/train-00001-of-01024
...
train_directory/train-01023-of-01024
and
validation_directory/validation-00000-of-00128
validation_directory/validation-00001-of-00128
...
validation_directory/validation-00127-of-00128
where we have selected 1024 and 128 shards for each data set. Each record
within the TFRecord file is a serialized Example proto. The Example proto
contains the following fields:
image/encoded: string containing JPEG encoded image in RGB colorspace
image/height: integer, image height in pixels
image/width: integer, image width in pixels
image/colorspace: string, specifying the colorspace, always 'RGB'
image/channels: integer, specifying the number of channels, always 3
image/format: string, specifying the format, always 'JPEG'
image/filename: string containing the basename of the image file
e.g. 'n01440764_10026.JPEG' or 'ILSVRC2012_val_00000293.JPEG'
image/class/label: integer specifying the index in a classification layer.
The label ranges from [0, num_labels] where 0 is unused and left as
the background class.
image/class/text: string specifying the human-readable version of the label
e.g. 'dog'
If your data set involves bounding boxes, please look at build_imagenet_data.py.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os
import random
import sys
import threading
import numpy as np
import tensorflow as tf
tf.app.flags.DEFINE_string('train_directory', '/tmp/',
'Training data directory')
tf.app.flags.DEFINE_string('validation_directory', '/tmp/',
'Validation data directory')
tf.app.flags.DEFINE_string('output_directory', '/tmp/',
'Output data directory')
tf.app.flags.DEFINE_integer('train_shards', 2,
'Number of shards in training TFRecord files.')
tf.app.flags.DEFINE_integer('validation_shards', 2,
'Number of shards in validation TFRecord files.')
tf.app.flags.DEFINE_integer('num_threads', 2,
'Number of threads to preprocess the images.')
# The labels file contains the list of valid labels.
# Assumes that the file contains entries as such:
# dog
# cat
# flower
# where each line corresponds to a label. We map each label contained in
# the file to an integer corresponding to the line number starting from 1
# (label 0 is reserved as an unused background class).
tf.app.flags.DEFINE_string('labels_file', '', 'Labels file')
FLAGS = tf.app.flags.FLAGS
def _int64_feature(value):
"""Wrapper for inserting int64 features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _bytes_feature(value):
"""Wrapper for inserting bytes features into Example proto."""
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _convert_to_example(filename, image_buffer, label, text, height, width):
"""Build an Example proto for an example.
Args:
filename: string, path to an image file, e.g., '/path/to/example.JPG'
image_buffer: string, JPEG encoding of RGB image
label: integer, identifier for the ground truth for the network
text: string, unique human-readable, e.g. 'dog'
height: integer, image height in pixels
width: integer, image width in pixels
Returns:
Example proto
"""
colorspace = 'RGB'
channels = 3
image_format = 'JPEG'
example = tf.train.Example(features=tf.train.Features(feature={
'image/height': _int64_feature(height),
'image/width': _int64_feature(width),
'image/colorspace': _bytes_feature(tf.compat.as_bytes(colorspace)),
'image/channels': _int64_feature(channels),
'image/class/label': _int64_feature(label),
'image/class/text': _bytes_feature(tf.compat.as_bytes(text)),
'image/format': _bytes_feature(tf.compat.as_bytes(image_format)),
'image/filename': _bytes_feature(tf.compat.as_bytes(os.path.basename(filename))),
'image/encoded': _bytes_feature(tf.compat.as_bytes(image_buffer))}))
return example
class ImageCoder(object):
"""Helper class that provides TensorFlow image coding utilities."""
def __init__(self):
# Create a single Session to run all image coding calls.
self._sess = tf.Session()
# Initializes function that converts PNG to JPEG data.
self._png_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_png(self._png_data, channels=3)
self._png_to_jpeg = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that decodes RGB JPEG data.
self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)
def png_to_jpeg(self, image_data):
return self._sess.run(self._png_to_jpeg,
feed_dict={self._png_data: image_data})
def decode_jpeg(self, image_data):
image = self._sess.run(self._decode_jpeg,
feed_dict={self._decode_jpeg_data: image_data})
assert len(image.shape) == 3
assert image.shape[2] == 3
return image
def _is_png(filename):
"""Determine if a file contains a PNG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a PNG.
"""
return filename.endswith('.png')
def _process_image(filename, coder):
"""Process a single image file.
Args:
filename: string, path to an image file e.g., '/path/to/example.JPG'.
coder: instance of ImageCoder to provide TensorFlow image coding utils.
Returns:
image_buffer: string, JPEG encoding of RGB image.
height: integer, image height in pixels.
width: integer, image width in pixels.
"""
# Read the image file.
with tf.gfile.FastGFile(filename, 'rb') as f:
image_data = f.read()
# Convert any PNG to JPEG for consistency.
if _is_png(filename):
print('Converting PNG to JPEG for %s' % filename)
image_data = coder.png_to_jpeg(image_data)
# Decode the RGB JPEG.
image = coder.decode_jpeg(image_data)
# Check that image converted to RGB
assert len(image.shape) == 3
height = image.shape[0]
width = image.shape[1]
assert image.shape[2] == 3
return image_data, height, width
def _process_image_files_batch(coder, thread_index, ranges, name, filenames,
texts, labels, num_shards):
"""Processes and saves list of images as TFRecord in 1 thread.
Args:
coder: instance of ImageCoder to provide TensorFlow image coding utils.
thread_index: integer, unique batch index to run; lies within [0, len(ranges)).
ranges: list of pairs of integers specifying the range of each batch to
analyze in parallel.
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
texts: list of strings; each string is human readable, e.g. 'dog'
labels: list of integers; each integer identifies the ground truth
num_shards: integer number of shards for this data set.
"""
# Each thread produces N shards where N = int(num_shards / num_threads).
# For instance, if num_shards = 128, and the num_threads = 2, then the first
# thread would produce shards [0, 64).
num_threads = len(ranges)
assert not num_shards % num_threads
num_shards_per_batch = int(num_shards / num_threads)
shard_ranges = np.linspace(ranges[thread_index][0],
ranges[thread_index][1],
num_shards_per_batch + 1).astype(int)
num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]
counter = 0
for s in range(num_shards_per_batch):
# Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
shard = thread_index * num_shards_per_batch + s
output_filename = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
output_file = os.path.join(FLAGS.output_directory, output_filename)
writer = tf.python_io.TFRecordWriter(output_file)
shard_counter = 0
files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
for i in files_in_shard:
filename = filenames[i]
label = labels[i]
text = texts[i]
try:
image_buffer, height, width = _process_image(filename, coder)
except Exception as e:
print(e)
print('SKIPPED: Unexpected error while decoding %s.' % filename)
continue
example = _convert_to_example(filename, image_buffer, label,
text, height, width)
writer.write(example.SerializeToString())
shard_counter += 1
counter += 1
if not counter % 1000:
print('%s [thread %d]: Processed %d of %d images in thread batch.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
writer.close()
print('%s [thread %d]: Wrote %d images to %s' %
(datetime.now(), thread_index, shard_counter, output_file))
sys.stdout.flush()
shard_counter = 0
print('%s [thread %d]: Wrote %d images to %d shards.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
def _process_image_files(name, filenames, texts, labels, num_shards):
"""Process and save list of images as TFRecord of Example protos.
Args:
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
texts: list of strings; each string is human readable, e.g. 'dog'
labels: list of integers; each integer identifies the ground truth
num_shards: integer number of shards for this data set.
"""
assert len(filenames) == len(texts)
assert len(filenames) == len(labels)
# Break all images into batches given by index ranges [ranges[i][0], ranges[i][1]].
spacing = np.linspace(0, len(filenames), FLAGS.num_threads + 1).astype(np.int)
ranges = []
for i in range(len(spacing) - 1):
ranges.append([spacing[i], spacing[i + 1]])
# Launch a thread for each batch.
print('Launching %d threads for spacings: %s' % (FLAGS.num_threads, ranges))
sys.stdout.flush()
# Create a mechanism for monitoring when all threads are finished.
coord = tf.train.Coordinator()
# Create a generic TensorFlow-based utility for converting all image codings.
coder = ImageCoder()
threads = []
for thread_index in range(len(ranges)):
args = (coder, thread_index, ranges, name, filenames,
texts, labels, num_shards)
t = threading.Thread(target=_process_image_files_batch, args=args)
t.start()
threads.append(t)
# Wait for all the threads to terminate.
coord.join(threads)
print('%s: Finished writing all %d images in data set.' %
(datetime.now(), len(filenames)))
sys.stdout.flush()
def _find_image_files(data_dir, labels_file):
"""Build a list of all images files and labels in the data set.
Args:
data_dir: string, path to the root directory of images.
Assumes that the image data set resides in JPEG files located in
the following directory structure.
data_dir/dog/another-image.JPEG
data_dir/dog/my-image.jpg
where 'dog' is the label associated with these images.
labels_file: string, path to the labels file.
The list of valid labels is held in this file. Assumes that the file
contains entries as such:
dog
cat
flower
where each line corresponds to a label. We map each label contained in
the file to an integer starting with the integer 1 corresponding to the
label contained in the first line (label 0 is reserved as an unused
background class).
Returns:
filenames: list of strings; each string is a path to an image file.
texts: list of strings; each string is the class, e.g. 'dog'
labels: list of integers; each integer identifies the ground truth.
"""
print('Determining list of input files and labels from %s.' % data_dir)
unique_labels = [l.strip() for l in tf.gfile.FastGFile(
labels_file, 'r').readlines()]
labels = []
filenames = []
texts = []
# Leave label index 0 empty as a background class.
label_index = 1
# Construct the list of JPEG files and labels.
for text in unique_labels:
jpeg_file_path = '%s/%s/*' % (data_dir, text)
matching_files = tf.gfile.Glob(jpeg_file_path)
labels.extend([label_index] * len(matching_files))
texts.extend([text] * len(matching_files))
filenames.extend(matching_files)
if not label_index % 100:
print('Finished finding files in %d of %d classes.' % (
label_index, len(labels)))
label_index += 1
# Shuffle the ordering of all image files in order to guarantee
# random ordering of the images with respect to label in the
# saved TFRecord files. Make the randomization repeatable.
shuffled_index = list(range(len(filenames)))
random.seed(12345)
random.shuffle(shuffled_index)
filenames = [filenames[i] for i in shuffled_index]
texts = [texts[i] for i in shuffled_index]
labels = [labels[i] for i in shuffled_index]
print('Found %d JPEG files across %d labels inside %s.' %
(len(filenames), len(unique_labels), data_dir))
return filenames, texts, labels
def _process_dataset(name, directory, num_shards, labels_file):
"""Process a complete data set and save it as a TFRecord.
Args:
name: string, unique identifier specifying the data set.
directory: string, root path to the data set.
num_shards: integer number of shards for this data set.
labels_file: string, path to the labels file.
"""
filenames, texts, labels = _find_image_files(directory, labels_file)
_process_image_files(name, filenames, texts, labels, num_shards)
def main(unused_argv):
assert not FLAGS.train_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with FLAGS.train_shards')
assert not FLAGS.validation_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with '
'FLAGS.validation_shards')
print('Saving results to %s' % FLAGS.output_directory)
# Run it!
_process_dataset('validation', FLAGS.validation_directory,
FLAGS.validation_shards, FLAGS.labels_file)
_process_dataset('train', FLAGS.train_directory,
FLAGS.train_shards, FLAGS.labels_file)
if __name__ == '__main__':
tf.app.run()
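
The script above writes its Example protos into sharded TFRecord files. A minimal sketch for inspecting one of those shards after a run, assuming the default flags (output under /tmp, two training shards); the path is an assumption derived from those defaults:

    import tensorflow as tf

    path = '/tmp/train-00000-of-00002'  # assumed shard name from the default flags
    for record in tf.python_io.tf_record_iterator(path):
        example = tf.train.Example()
        example.ParseFromString(record)
        feats = example.features.feature
        label = feats['image/class/label'].int64_list.value[0]
        text = feats['image/class/text'].bytes_list.value[0].decode()
        jpeg = feats['image/encoded'].bytes_list.value[0]
        print(label, text, len(jpeg), 'bytes of JPEG data')
        break  # only look at the first record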

View file

@ -0,0 +1,707 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts ImageNet data to TFRecords file format with Example protos.
The raw ImageNet data set is expected to reside in JPEG files located in the
following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
...
where 'n01440764' is the unique synset label associated with
these images.
The training data set consists of 1000 sub-directories (i.e. labels)
each containing 1200 JPEG images for a total of 1.2M JPEG images.
The evaluation data set consists of 1000 sub-directories (i.e. labels)
each containing 50 JPEG images for a total of 50K JPEG images.
This TensorFlow script converts the training and evaluation data into
a sharded data set consisting of 1024 and 128 TFRecord files, respectively.
train_directory/train-00000-of-01024
train_directory/train-00001-of-01024
...
train_directory/train-01023-of-01024
and
validation_directory/validation-00000-of-00128
validation_directory/validation-00001-of-00128
...
validation_directory/validation-00127-of-00128
Each validation TFRecord file contains ~390 records. Each training TFRecord
file contains ~1250 records. Each record within the TFRecord file is a
serialized Example proto. The Example proto contains the following fields:
image/encoded: string containing JPEG encoded image in RGB colorspace
image/height: integer, image height in pixels
image/width: integer, image width in pixels
image/colorspace: string, specifying the colorspace, always 'RGB'
image/channels: integer, specifying the number of channels, always 3
image/format: string, specifying the format, always 'JPEG'
image/filename: string containing the basename of the image file
e.g. 'n01440764_10026.JPEG' or 'ILSVRC2012_val_00000293.JPEG'
image/class/label: integer specifying the index in a classification layer.
The label ranges from [1, 1000] where 0 is not used.
image/class/synset: string specifying the unique ID of the label,
e.g. 'n01440764'
image/class/text: string specifying the human-readable version of the label
e.g. 'red fox, Vulpes vulpes'
image/object/bbox/xmin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/xmax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/label: integer specifying the index in a classification
layer. The label ranges from [1, 1000] where 0 is not used. Note this is
always identical to the image label.
Note that the length of xmin is identical to the length of xmax, ymin and ymax
for each example.
Running this script using 16 threads may take around 2.5 hours on an HP Z420.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os
import random
import sys
import threading
import numpy as np
import six
import tensorflow as tf
tf.app.flags.DEFINE_string('train_directory', '/tmp/',
'Training data directory')
tf.app.flags.DEFINE_string('validation_directory', '/tmp/',
'Validation data directory')
tf.app.flags.DEFINE_string('output_directory', '/tmp/',
'Output data directory')
tf.app.flags.DEFINE_integer('train_shards', 1024,
'Number of shards in training TFRecord files.')
tf.app.flags.DEFINE_integer('validation_shards', 128,
'Number of shards in validation TFRecord files.')
tf.app.flags.DEFINE_integer('num_threads', 8,
'Number of threads to preprocess the images.')
# The labels file contains the list of valid labels.
# Assumes that the file contains entries as such:
# n01440764
# n01443537
# n01484850
# where each line corresponds to a label expressed as a synset. We map
# each synset contained in the file to an integer (based on the alphabetical
# ordering). See below for details.
tf.app.flags.DEFINE_string('labels_file',
'imagenet_lsvrc_2015_synsets.txt',
'Labels file')
# This file contains the mapping from synset to human-readable label.
# Assumes each line of the file looks like:
#
# n02119247 black fox
# n02119359 silver fox
# n02119477 red fox, Vulpes fulva
#
# where each line corresponds to a unique mapping. Note that each line is
# formatted as <synset>\t<human readable label>.
tf.app.flags.DEFINE_string('imagenet_metadata_file',
'imagenet_metadata.txt',
'ImageNet metadata file')
# This file is the output of process_bounding_boxes.py
# Assumes each line of the file looks like:
#
# n00007846_64193.JPEG,0.0060,0.2620,0.7545,0.9940
#
# where each line corresponds to one bounding box annotation associated
# with an image. Each line can be parsed as:
#
# <JPEG file name>, <xmin>, <ymin>, <xmax>, <ymax>
#
# Note that there might exist multiple bounding box annotations associated
# with an image file.
tf.app.flags.DEFINE_string('bounding_box_file',
'./imagenet_2012_bounding_boxes.csv',
'Bounding box file')
FLAGS = tf.app.flags.FLAGS
def _int64_feature(value):
"""Wrapper for inserting int64 features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _float_feature(value):
"""Wrapper for inserting float features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(float_list=tf.train.FloatList(value=value))
def _bytes_feature(value):
"""Wrapper for inserting bytes features into Example proto."""
if six.PY3 and isinstance(value, six.text_type):
value = six.binary_type(value, encoding='utf-8')
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _convert_to_example(filename, image_buffer, label, synset, human, bbox,
height, width):
"""Build an Example proto for an example.
Args:
filename: string, path to an image file, e.g., '/path/to/example.JPG'
image_buffer: string, JPEG encoding of RGB image
label: integer, identifier for the ground truth for the network
synset: string, unique WordNet ID specifying the label, e.g., 'n02323233'
human: string, human-readable label, e.g., 'red fox, Vulpes vulpes'
bbox: list of bounding boxes; each box is a list of integers
specifying [xmin, ymin, xmax, ymax]. All boxes are assumed to belong to
the same label as the image label.
height: integer, image height in pixels
width: integer, image width in pixels
Returns:
Example proto
"""
xmin = []
ymin = []
xmax = []
ymax = []
for b in bbox:
assert len(b) == 4
# pylint: disable=expression-not-assigned
[l.append(point) for l, point in zip([xmin, ymin, xmax, ymax], b)]
# pylint: enable=expression-not-assigned
colorspace = 'RGB'
channels = 3
image_format = 'JPEG'
example = tf.train.Example(features=tf.train.Features(feature={
'image/height': _int64_feature(height),
'image/width': _int64_feature(width),
'image/colorspace': _bytes_feature(colorspace),
'image/channels': _int64_feature(channels),
'image/class/label': _int64_feature(label),
'image/class/synset': _bytes_feature(synset),
'image/class/text': _bytes_feature(human),
'image/object/bbox/xmin': _float_feature(xmin),
'image/object/bbox/xmax': _float_feature(xmax),
'image/object/bbox/ymin': _float_feature(ymin),
'image/object/bbox/ymax': _float_feature(ymax),
'image/object/bbox/label': _int64_feature([label] * len(xmin)),
'image/format': _bytes_feature(image_format),
'image/filename': _bytes_feature(os.path.basename(filename)),
'image/encoded': _bytes_feature(image_buffer)}))
return example
class ImageCoder(object):
"""Helper class that provides TensorFlow image coding utilities."""
def __init__(self):
# Create a single Session to run all image coding calls.
self._sess = tf.Session()
# Initializes function that converts PNG to JPEG data.
self._png_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_png(self._png_data, channels=3)
self._png_to_jpeg = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that converts CMYK JPEG data to RGB JPEG data.
self._cmyk_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_jpeg(self._cmyk_data, channels=0)
self._cmyk_to_rgb = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that decodes RGB JPEG data.
self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)
def png_to_jpeg(self, image_data):
return self._sess.run(self._png_to_jpeg,
feed_dict={self._png_data: image_data})
def cmyk_to_rgb(self, image_data):
return self._sess.run(self._cmyk_to_rgb,
feed_dict={self._cmyk_data: image_data})
def decode_jpeg(self, image_data):
image = self._sess.run(self._decode_jpeg,
feed_dict={self._decode_jpeg_data: image_data})
assert len(image.shape) == 3
assert image.shape[2] == 3
return image
def _is_png(filename):
"""Determine if a file contains a PNG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a PNG.
"""
# File list from:
# https://groups.google.com/forum/embed/?place=forum/torch7#!topic/torch7/fOSTXHIESSU
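# Only a single ILSVRC2012 training image (n02105855_2933.JPEG) is known to be a
# PNG saved with a .JPEG extension, so a plain filename check is sufficient here.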
return 'n02105855_2933.JPEG' in filename
def _is_cmyk(filename):
"""Determine if file contains a CMYK JPEG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a JPEG encoded with CMYK color space.
"""
# File list from:
# https://github.com/cytsai/ilsvrc-cmyk-image-list
blacklist = ['n01739381_1309.JPEG', 'n02077923_14822.JPEG',
'n02447366_23489.JPEG', 'n02492035_15739.JPEG',
'n02747177_10752.JPEG', 'n03018349_4028.JPEG',
'n03062245_4620.JPEG', 'n03347037_9675.JPEG',
'n03467068_12171.JPEG', 'n03529860_11437.JPEG',
'n03544143_17228.JPEG', 'n03633091_5218.JPEG',
'n03710637_5125.JPEG', 'n03961711_5286.JPEG',
'n04033995_2932.JPEG', 'n04258138_17003.JPEG',
'n04264628_27969.JPEG', 'n04336792_7448.JPEG',
'n04371774_5854.JPEG', 'n04596742_4225.JPEG',
'n07583066_647.JPEG', 'n13037406_4650.JPEG']
return filename.split('/')[-1] in blacklist
def _process_image(filename, coder):
"""Process a single image file.
Args:
filename: string, path to an image file e.g., '/path/to/example.JPG'.
coder: instance of ImageCoder to provide TensorFlow image coding utils.
Returns:
image_buffer: string, JPEG encoding of RGB image.
height: integer, image height in pixels.
width: integer, image width in pixels.
"""
# Read the image file.
with tf.gfile.FastGFile(filename, 'rb') as f:
image_data = f.read()
# Clean the dirty data.
if _is_png(filename):
# 1 image is a PNG.
print('Converting PNG to JPEG for %s' % filename)
image_data = coder.png_to_jpeg(image_data)
elif _is_cmyk(filename):
# 22 JPEG images are in CMYK colorspace.
print('Converting CMYK to RGB for %s' % filename)
image_data = coder.cmyk_to_rgb(image_data)
# Decode the RGB JPEG.
image = coder.decode_jpeg(image_data)
# Check that image converted to RGB
assert len(image.shape) == 3
height = image.shape[0]
width = image.shape[1]
assert image.shape[2] == 3
return image_data, height, width
def _process_image_files_batch(coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards):
"""Processes and saves list of images as TFRecord in 1 thread.
Args:
coder: instance of ImageCoder to provide TensorFlow image coding utils.
thread_index: integer, unique batch index to run; lies within [0, len(ranges)).
ranges: list of pairs of integers specifying the range of each batch to
analyze in parallel.
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
# Each thread produces N shards where N = int(num_shards / num_threads).
# For instance, if num_shards = 128, and the num_threads = 2, then the first
# thread would produce shards [0, 64).
num_threads = len(ranges)
assert not num_shards % num_threads
num_shards_per_batch = int(num_shards / num_threads)
shard_ranges = np.linspace(ranges[thread_index][0],
ranges[thread_index][1],
num_shards_per_batch + 1).astype(int)
num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]
counter = 0
for s in range(num_shards_per_batch):
# Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
shard = thread_index * num_shards_per_batch + s
output_filename = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
output_file = os.path.join(FLAGS.output_directory, output_filename)
writer = tf.python_io.TFRecordWriter(output_file)
shard_counter = 0
files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
for i in files_in_shard:
filename = filenames[i]
label = labels[i]
synset = synsets[i]
human = humans[i]
bbox = bboxes[i]
image_buffer, height, width = _process_image(filename, coder)
example = _convert_to_example(filename, image_buffer, label,
synset, human, bbox,
height, width)
writer.write(example.SerializeToString())
shard_counter += 1
counter += 1
if not counter % 1000:
print('%s [thread %d]: Processed %d of %d images in thread batch.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
writer.close()
print('%s [thread %d]: Wrote %d images to %s' %
(datetime.now(), thread_index, shard_counter, output_file))
sys.stdout.flush()
shard_counter = 0
print('%s [thread %d]: Wrote %d images to %d shards.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
def _process_image_files(name, filenames, synsets, labels, humans,
bboxes, num_shards):
"""Process and save list of images as TFRecord of Example protos.
Args:
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
assert len(filenames) == len(synsets)
assert len(filenames) == len(labels)
assert len(filenames) == len(humans)
assert len(filenames) == len(bboxes)
# Break all images into batches given by index ranges [ranges[i][0], ranges[i][1]].
spacing = np.linspace(0, len(filenames), FLAGS.num_threads + 1).astype(np.int)
ranges = []
threads = []
for i in range(len(spacing) - 1):
ranges.append([spacing[i], spacing[i + 1]])
# Launch a thread for each batch.
print('Launching %d threads for spacings: %s' % (FLAGS.num_threads, ranges))
sys.stdout.flush()
# Create a mechanism for monitoring when all threads are finished.
coord = tf.train.Coordinator()
# Create a generic TensorFlow-based utility for converting all image codings.
coder = ImageCoder()
threads = []
for thread_index in range(len(ranges)):
args = (coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards)
t = threading.Thread(target=_process_image_files_batch, args=args)
t.start()
threads.append(t)
# Wait for all the threads to terminate.
coord.join(threads)
print('%s: Finished writing all %d images in data set.' %
(datetime.now(), len(filenames)))
sys.stdout.flush()
def _find_image_files(data_dir, labels_file):
"""Build a list of all images files and labels in the data set.
Args:
data_dir: string, path to the root directory of images.
Assumes that the ImageNet data set resides in JPEG files located in
the following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
where 'n01440764' is the unique synset label associated with these images.
labels_file: string, path to the labels file.
The list of valid labels is held in this file. Assumes that the file
contains entries as such:
n01440764
n01443537
n01484850
where each line corresponds to a label expressed as a synset. We map
each synset contained in the file to an integer (based on the alphabetical
ordering) starting with the integer 1 corresponding to the synset
contained in the first line.
The reason we start the integer labels at 1 is to reserve label 0 as an
unused background class.
Returns:
filenames: list of strings; each string is a path to an image file.
synsets: list of strings; each string is a unique WordNet ID.
labels: list of integers; each integer identifies the ground truth.
"""
print('Determining list of input files and labels from %s.' % data_dir)
challenge_synsets = [l.strip() for l in
tf.gfile.FastGFile(labels_file, 'r').readlines()]
labels = []
filenames = []
synsets = []
# Leave label index 0 empty as a background class.
label_index = 1
# Construct the list of JPEG files and labels.
for synset in challenge_synsets:
jpeg_file_path = '%s/%s/*.JPEG' % (data_dir, synset)
matching_files = tf.gfile.Glob(jpeg_file_path)
labels.extend([label_index] * len(matching_files))
synsets.extend([synset] * len(matching_files))
filenames.extend(matching_files)
if not label_index % 100:
print('Finished finding files in %d of %d classes.' % (
label_index, len(challenge_synsets)))
label_index += 1
# Shuffle the ordering of all image files in order to guarantee
# random ordering of the images with respect to label in the
# saved TFRecord files. Make the randomization repeatable.
shuffled_index = list(range(len(filenames)))
random.seed(12345)
random.shuffle(shuffled_index)
filenames = [filenames[i] for i in shuffled_index]
synsets = [synsets[i] for i in shuffled_index]
labels = [labels[i] for i in shuffled_index]
print('Found %d JPEG files across %d labels inside %s.' %
(len(filenames), len(challenge_synsets), data_dir))
return filenames, synsets, labels
def _find_human_readable_labels(synsets, synset_to_human):
"""Build a list of human-readable labels.
Args:
synsets: list of strings; each string is a unique WordNet ID.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
Returns:
List of human-readable strings corresponding to each synset.
"""
humans = []
for s in synsets:
assert s in synset_to_human, ('Failed to find: %s' % s)
humans.append(synset_to_human[s])
return humans
def _find_image_bounding_boxes(filenames, image_to_bboxes):
"""Find the bounding boxes for a given image file.
Args:
filenames: list of strings; each string is a path to an image file.
image_to_bboxes: dictionary mapping image file names to a list of
bounding boxes. This list contains 0+ bounding boxes.
Returns:
List of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
"""
num_image_bbox = 0
bboxes = []
for f in filenames:
basename = os.path.basename(f)
if basename in image_to_bboxes:
bboxes.append(image_to_bboxes[basename])
num_image_bbox += 1
else:
bboxes.append([])
print('Found %d images with bboxes out of %d images' % (
num_image_bbox, len(filenames)))
return bboxes
def _process_dataset(name, directory, num_shards, synset_to_human,
image_to_bboxes):
"""Process a complete data set and save it as a TFRecord.
Args:
name: string, unique identifier specifying the data set.
directory: string, root path to the data set.
num_shards: integer number of shards for this data set.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
image_to_bboxes: dictionary mapping image file names to a list of
bounding boxes. This list contains 0+ bounding boxes.
"""
filenames, synsets, labels = _find_image_files(directory, FLAGS.labels_file)
humans = _find_human_readable_labels(synsets, synset_to_human)
bboxes = _find_image_bounding_boxes(filenames, image_to_bboxes)
_process_image_files(name, filenames, synsets, labels,
humans, bboxes, num_shards)
def _build_synset_lookup(imagenet_metadata_file):
"""Build lookup for synset to human-readable label.
Args:
imagenet_metadata_file: string, path to file containing mapping from
synset to human-readable label.
Assumes each line of the file looks like:
n02119247 black fox
n02119359 silver fox
n02119477 red fox, Vulpes fulva
where each line corresponds to a unique mapping. Note that each line is
formatted as <synset>\t<human readable label>.
Returns:
Dictionary of synset to human labels, such as:
'n02119022' --> 'red fox, Vulpes vulpes'
"""
lines = tf.gfile.FastGFile(imagenet_metadata_file, 'r').readlines()
synset_to_human = {}
for l in lines:
if l:
parts = l.strip().split('\t')
assert len(parts) == 2
synset = parts[0]
human = parts[1]
synset_to_human[synset] = human
return synset_to_human
def _build_bounding_box_lookup(bounding_box_file):
"""Build a lookup from image file to bounding boxes.
Args:
bounding_box_file: string, path to file with bounding boxes annotations.
Assumes each line of the file looks like:
n00007846_64193.JPEG,0.0060,0.2620,0.7545,0.9940
where each line corresponds to one bounding box annotation associated
with an image. Each line can be parsed as:
<JPEG file name>, <xmin>, <ymin>, <xmax>, <ymax>
Note that there might exist multiple bounding box annotations associated
with an image file. This file is the output of process_bounding_boxes.py.
Returns:
Dictionary mapping image file names to a list of bounding boxes. This list
contains 0+ bounding boxes.
"""
lines = tf.gfile.FastGFile(bounding_box_file, 'r').readlines()
images_to_bboxes = {}
num_bbox = 0
num_image = 0
for l in lines:
if l:
parts = l.split(',')
assert len(parts) == 5, ('Failed to parse: %s' % l)
filename = parts[0]
xmin = float(parts[1])
ymin = float(parts[2])
xmax = float(parts[3])
ymax = float(parts[4])
box = [xmin, ymin, xmax, ymax]
if filename not in images_to_bboxes:
images_to_bboxes[filename] = []
num_image += 1
images_to_bboxes[filename].append(box)
num_bbox += 1
print('Successfully read %d bounding boxes '
'across %d images.' % (num_bbox, num_image))
return images_to_bboxes
def main(unused_argv):
assert not FLAGS.train_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with FLAGS.train_shards')
assert not FLAGS.validation_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with '
'FLAGS.validation_shards')
print('Saving results to %s' % FLAGS.output_directory)
# Build a map from synset to human-readable label.
synset_to_human = _build_synset_lookup(FLAGS.imagenet_metadata_file)
image_to_bboxes = _build_bounding_box_lookup(FLAGS.bounding_box_file)
# Run it!
_process_dataset('validation', FLAGS.validation_directory,
FLAGS.validation_shards, synset_to_human, image_to_bboxes)
_process_dataset('train', FLAGS.train_directory, FLAGS.train_shards,
synset_to_human, image_to_bboxes)
if __name__ == '__main__':
tf.app.run()
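
Because this variant also serializes the human-annotated boxes, each Example carries parallel float lists under image/object/bbox/*. A small sketch for spot-checking those fields in a finished shard; the path is a placeholder for whatever --output_directory was used:

    import tensorflow as tf

    path = '/data/tfrecords/validation-00000-of-00128'  # placeholder shard path
    for record in tf.python_io.tf_record_iterator(path):
        ex = tf.train.Example()
        ex.ParseFromString(record)
        f = ex.features.feature
        synset = f['image/class/synset'].bytes_list.value[0].decode()
        xmin = list(f['image/object/bbox/xmin'].float_list.value)
        ymin = list(f['image/object/bbox/ymin'].float_list.value)
        print(synset, len(xmin), 'boxes; first corner:', (xmin[:1], ymin[:1]))
        break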

View file

@ -0,0 +1,618 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts ImageNet data to TFRecords file format with Example protos.
The raw ImageNet data set is expected to reside in JPEG files located in the
following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
...
where 'n01440764' is the unique synset label associated with
these images.
The training data set consists of 1000 sub-directories (i.e. labels)
each containing 1200 JPEG images for a total of 1.2M JPEG images.
The evaluation data set consists of 1000 sub-directories (i.e. labels)
each containing 50 JPEG images for a total of 50K JPEG images.
This TensorFlow script converts the training and evaluation data into
a sharded data set consisting of 1024 and 128 TFRecord files, respectively.
train_directory/train-00000-of-01024
train_directory/train-00001-of-01024
...
train_directory/train-01023-of-01024
and
validation_directory/validation-00000-of-00128
validation_directory/validation-00001-of-00128
...
validation_directory/validation-00127-of-00128
Each validation TFRecord file contains ~390 records. Each training TFRecord
file contains ~1250 records. Each record within the TFRecord file is a
serialized Example proto. The Example proto contains the following fields:
image/encoded: string containing JPEG encoded image in RGB colorspace
image/height: integer, image height in pixels
image/width: integer, image width in pixels
image/colorspace: string, specifying the colorspace, always 'RGB'
image/channels: integer, specifying the number of channels, always 3
image/format: string, specifying the format, always 'JPEG'
image/filename: string containing the basename of the image file
e.g. 'n01440764_10026.JPEG' or 'ILSVRC2012_val_00000293.JPEG'
image/class/label: integer specifying the index in a classification layer.
The label ranges from [1, 1000] where 0 is not used.
image/class/synset: string specifying the unique ID of the label,
e.g. 'n01440764'
image/class/text: string specifying the human-readable version of the label
e.g. 'red fox, Vulpes vulpes'
image/object/bbox/xmin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/xmax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/label: integer specifying the index in a classification
layer. The label ranges from [1, 1000] where 0 is not used. Note this is
always identical to the image label.
Note that the length of xmin is identical to the length of xmax, ymin and ymax
for each example.
Running this script using 16 threads may take around 2.5 hours on an HP Z420.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os
import random
import sys
import threading
import numpy as np
import six
import tensorflow as tf
tf.app.flags.DEFINE_string('train_directory', '/tmp/',
'Training data directory')
tf.app.flags.DEFINE_string('validation_directory', '/tmp/',
'Validation data directory')
tf.app.flags.DEFINE_string('output_directory', '/tmp/',
'Output data directory')
tf.app.flags.DEFINE_integer('train_shards', 1024,
'Number of shards in training TFRecord files.')
tf.app.flags.DEFINE_integer('validation_shards', 128,
'Number of shards in validation TFRecord files.')
tf.app.flags.DEFINE_integer('num_threads', 8,
'Number of threads to preprocess the images.')
# The labels file contains the list of valid labels.
# Assumes that the file contains entries as such:
# n01440764
# n01443537
# n01484850
# where each line corresponds to a label expressed as a synset. We map
# each synset contained in the file to an integer (based on the alphabetical
# ordering). See below for details.
tf.app.flags.DEFINE_string('labels_file',
'imagenet_lsvrc_2015_synsets.txt',
'Labels file')
# This file contains the mapping from synset to human-readable label.
# Assumes each line of the file looks like:
#
# n02119247 black fox
# n02119359 silver fox
# n02119477 red fox, Vulpes fulva
#
# where each line corresponds to a unique mapping. Note that each line is
# formatted as <synset>\t<human readable label>.
tf.app.flags.DEFINE_string('imagenet_metadata_file',
'imagenet_metadata.txt',
'ImageNet metadata file')
FLAGS = tf.app.flags.FLAGS
def _int64_feature(value):
"""Wrapper for inserting int64 features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _float_feature(value):
"""Wrapper for inserting float features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(float_list=tf.train.FloatList(value=value))
def _bytes_feature(value):
"""Wrapper for inserting bytes features into Example proto."""
if six.PY3 and isinstance(value, six.text_type):
value = six.binary_type(value, encoding='utf-8')
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _convert_to_example(filename, image_buffer, label, synset, human, bbox,
height, width):
"""Build an Example proto for an example.
Args:
filename: string, path to an image file, e.g., '/path/to/example.JPG'
image_buffer: string, JPEG encoding of RGB image
label: integer, identifier for the ground truth for the network
synset: string, unique WordNet ID specifying the label, e.g., 'n02323233'
human: string, human-readable label, e.g., 'red fox, Vulpes vulpes'
bbox: list of bounding boxes; each box is a list of integers
specifying [xmin, ymin, xmax, ymax]. All boxes are assumed to belong to
the same label as the image label.
height: integer, image height in pixels
width: integer, image width in pixels
Returns:
Example proto
"""
xmin = []
ymin = []
xmax = []
ymax = []
for b in bbox:
assert len(b) == 4
# pylint: disable=expression-not-assigned
[l.append(point) for l, point in zip([xmin, ymin, xmax, ymax], b)]
# pylint: enable=expression-not-assigned
colorspace = 'RGB'
channels = 3
image_format = 'JPEG'
example = tf.train.Example(features=tf.train.Features(feature={
'image/height': _int64_feature(height),
'image/width': _int64_feature(width),
'image/colorspace': _bytes_feature(colorspace),
'image/channels': _int64_feature(channels),
'image/class/label': _int64_feature(label),
'image/class/synset': _bytes_feature(synset),
'image/class/text': _bytes_feature(human),
'image/object/bbox/xmin': _float_feature(xmin),
'image/object/bbox/xmax': _float_feature(xmax),
'image/object/bbox/ymin': _float_feature(ymin),
'image/object/bbox/ymax': _float_feature(ymax),
'image/object/bbox/label': _int64_feature([label] * len(xmin)),
'image/format': _bytes_feature(image_format),
'image/filename': _bytes_feature(os.path.basename(filename)),
'image/encoded': _bytes_feature(image_buffer)}))
return example
class ImageCoder(object):
"""Helper class that provides TensorFlow image coding utilities."""
def __init__(self):
# Create a single Session to run all image coding calls.
self._sess = tf.Session()
# Initializes function that converts PNG to JPEG data.
self._png_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_png(self._png_data, channels=3)
self._png_to_jpeg = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that converts CMYK JPEG data to RGB JPEG data.
self._cmyk_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_jpeg(self._cmyk_data, channels=0)
self._cmyk_to_rgb = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that decodes RGB JPEG data.
self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)
def png_to_jpeg(self, image_data):
return self._sess.run(self._png_to_jpeg,
feed_dict={self._png_data: image_data})
def cmyk_to_rgb(self, image_data):
return self._sess.run(self._cmyk_to_rgb,
feed_dict={self._cmyk_data: image_data})
def decode_jpeg(self, image_data):
image = self._sess.run(self._decode_jpeg,
feed_dict={self._decode_jpeg_data: image_data})
assert len(image.shape) == 3
assert image.shape[2] == 3
return image
def _is_png(filename):
"""Determine if a file contains a PNG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a PNG.
"""
# File list from:
# https://groups.google.com/forum/embed/?place=forum/torch7#!topic/torch7/fOSTXHIESSU
return 'n02105855_2933.JPEG' in filename
def _is_cmyk(filename):
"""Determine if file contains a CMYK JPEG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a JPEG encoded with CMYK color space.
"""
# File list from:
# https://github.com/cytsai/ilsvrc-cmyk-image-list
blacklist = ['n01739381_1309.JPEG', 'n02077923_14822.JPEG',
'n02447366_23489.JPEG', 'n02492035_15739.JPEG',
'n02747177_10752.JPEG', 'n03018349_4028.JPEG',
'n03062245_4620.JPEG', 'n03347037_9675.JPEG',
'n03467068_12171.JPEG', 'n03529860_11437.JPEG',
'n03544143_17228.JPEG', 'n03633091_5218.JPEG',
'n03710637_5125.JPEG', 'n03961711_5286.JPEG',
'n04033995_2932.JPEG', 'n04258138_17003.JPEG',
'n04264628_27969.JPEG', 'n04336792_7448.JPEG',
'n04371774_5854.JPEG', 'n04596742_4225.JPEG',
'n07583066_647.JPEG', 'n13037406_4650.JPEG']
return filename.split('/')[-1] in blacklist
def _process_image(filename, coder):
"""Process a single image file.
Args:
filename: string, path to an image file e.g., '/path/to/example.JPG'.
coder: instance of ImageCoder to provide TensorFlow image coding utils.
Returns:
image_buffer: string, JPEG encoding of RGB image.
height: integer, image height in pixels.
width: integer, image width in pixels.
"""
# Read the image file.
with tf.gfile.FastGFile(filename, 'rb') as f:
image_data = f.read()
# Clean the dirty data.
if _is_png(filename):
# 1 image is a PNG.
print('Converting PNG to JPEG for %s' % filename)
image_data = coder.png_to_jpeg(image_data)
elif _is_cmyk(filename):
# 22 JPEG images are in CMYK colorspace.
print('Converting CMYK to RGB for %s' % filename)
image_data = coder.cmyk_to_rgb(image_data)
# Decode the RGB JPEG.
image = coder.decode_jpeg(image_data)
# Check that image converted to RGB
assert len(image.shape) == 3
height = image.shape[0]
width = image.shape[1]
assert image.shape[2] == 3
return image_data, height, width
def _process_image_files_batch(coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards):
"""Processes and saves list of images as TFRecord in 1 thread.
Args:
coder: instance of ImageCoder to provide TensorFlow image coding utils.
thread_index: integer, unique batch index to run; lies within [0, len(ranges)).
ranges: list of pairs of integers specifying the range of each batch to
analyze in parallel.
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
# Each thread produces N shards where N = int(num_shards / num_threads).
# For instance, if num_shards = 128, and the num_threads = 2, then the first
# thread would produce shards [0, 64).
num_threads = len(ranges)
assert not num_shards % num_threads
num_shards_per_batch = int(num_shards / num_threads)
shard_ranges = np.linspace(ranges[thread_index][0],
ranges[thread_index][1],
num_shards_per_batch + 1).astype(int)
num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]
counter = 0
for s in range(num_shards_per_batch):
# Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
shard = thread_index * num_shards_per_batch + s
output_filename = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
output_file = os.path.join(FLAGS.output_directory, output_filename)
writer = tf.python_io.TFRecordWriter(output_file)
shard_counter = 0
files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
for i in files_in_shard:
filename = filenames[i]
label = labels[i]
synset = synsets[i]
human = humans[i]
#bbox = bboxes[i]
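# In this variant the per-image annotations are not looked up; a single
# full-image box [0, 0, 1, 1] is written below so the Example schema stays
# identical to the bounding-box-aware script.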
image_buffer, height, width = _process_image(filename, coder)
example = _convert_to_example(filename, image_buffer, label,
synset, human, [[0, 0, 1, 1]],
height, width)
writer.write(example.SerializeToString())
shard_counter += 1
counter += 1
if not counter % 1000:
print('%s [thread %d]: Processed %d of %d images in thread batch.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
writer.close()
print('%s [thread %d]: Wrote %d images to %s' %
(datetime.now(), thread_index, shard_counter, output_file))
sys.stdout.flush()
shard_counter = 0
print('%s [thread %d]: Wrote %d images to %d shards.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
def _process_image_files(name, filenames, synsets, labels, humans,
bboxes, num_shards):
"""Process and save list of images as TFRecord of Example protos.
Args:
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
assert len(filenames) == len(synsets)
assert len(filenames) == len(labels)
assert len(filenames) == len(humans)
#assert len(filenames) == len(bboxes)
# Break all images into batches given by index ranges [ranges[i][0], ranges[i][1]].
spacing = np.linspace(0, len(filenames), FLAGS.num_threads + 1).astype(np.int)
ranges = []
threads = []
for i in range(len(spacing) - 1):
ranges.append([spacing[i], spacing[i + 1]])
# Launch a thread for each batch.
print('Launching %d threads for spacings: %s' % (FLAGS.num_threads, ranges))
sys.stdout.flush()
# Create a mechanism for monitoring when all threads are finished.
coord = tf.train.Coordinator()
# Create a generic TensorFlow-based utility for converting all image codings.
coder = ImageCoder()
threads = []
for thread_index in range(len(ranges)):
args = (coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards)
t = threading.Thread(target=_process_image_files_batch, args=args)
t.start()
threads.append(t)
# Wait for all the threads to terminate.
coord.join(threads)
print('%s: Finished writing all %d images in data set.' %
(datetime.now(), len(filenames)))
sys.stdout.flush()
def _find_image_files(data_dir, labels_file):
"""Build a list of all images files and labels in the data set.
Args:
data_dir: string, path to the root directory of images.
Assumes that the ImageNet data set resides in JPEG files located in
the following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
where 'n01440764' is the unique synset label associated with these images.
labels_file: string, path to the labels file.
The list of valid labels is held in this file. Assumes that the file
contains entries as such:
n01440764
n01443537
n01484850
where each line corresponds to a label expressed as a synset. We map
each synset contained in the file to an integer (based on the alphabetical
ordering) starting with the integer 1 corresponding to the synset
contained in the first line.
The reason we start the integer labels at 1 is to reserve label 0 as an
unused background class.
Returns:
filenames: list of strings; each string is a path to an image file.
synsets: list of strings; each string is a unique WordNet ID.
labels: list of integers; each integer identifies the ground truth.
"""
print('Determining list of input files and labels from %s.' % data_dir)
challenge_synsets = [l.strip() for l in
tf.gfile.FastGFile(labels_file, 'r').readlines()]
labels = []
filenames = []
synsets = []
# Leave label index 0 empty as a background class.
label_index = 1
# Construct the list of JPEG files and labels.
for synset in challenge_synsets:
jpeg_file_path = '%s/%s/*.JPEG' % (data_dir, synset)
matching_files = tf.gfile.Glob(jpeg_file_path)
labels.extend([label_index] * len(matching_files))
synsets.extend([synset] * len(matching_files))
filenames.extend(matching_files)
if not label_index % 100:
print('Finished finding files in %d of %d classes.' % (
label_index, len(challenge_synsets)))
label_index += 1
# Shuffle the ordering of all image files in order to guarantee
# random ordering of the images with respect to label in the
# saved TFRecord files. Make the randomization repeatable.
shuffled_index = list(range(len(filenames)))
random.seed(12345)
random.shuffle(shuffled_index)
filenames = [filenames[i] for i in shuffled_index]
synsets = [synsets[i] for i in shuffled_index]
labels = [labels[i] for i in shuffled_index]
print('Found %d JPEG files across %d labels inside %s.' %
(len(filenames), len(challenge_synsets), data_dir))
return filenames, synsets, labels
def _find_human_readable_labels(synsets, synset_to_human):
"""Build a list of human-readable labels.
Args:
synsets: list of strings; each string is a unique WordNet ID.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
Returns:
List of human-readable strings corresponding to each synset.
"""
humans = []
for s in synsets:
assert s in synset_to_human, ('Failed to find: %s' % s)
humans.append(synset_to_human[s])
return humans
def _process_dataset(name, directory, num_shards, synset_to_human,
image_to_bboxes):
"""Process a complete data set and save it as a TFRecord.
Args:
name: string, unique identifier specifying the data set.
directory: string, root path to the data set.
num_shards: integer number of shards for this data set.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
image_to_bboxes: dictionary mapping image file names to a list of
bounding boxes. This list contains 0+ bounding boxes.
"""
filenames, synsets, labels = _find_image_files(directory, FLAGS.labels_file)
humans = _find_human_readable_labels(synsets, synset_to_human)
#bboxes = _find_image_bounding_boxes(filenames, image_to_bboxes)
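# Bounding box lookup is skipped in this variant; the empty list is safe because
# _process_image_files_batch never indexes it and writes a full-image box instead.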
bboxes = []
_process_image_files(name, filenames, synsets, labels,
humans, bboxes, num_shards)
def _build_synset_lookup(imagenet_metadata_file):
"""Build lookup for synset to human-readable label.
Args:
imagenet_metadata_file: string, path to file containing mapping from
synset to human-readable label.
Assumes each line of the file looks like:
n02119247 black fox
n02119359 silver fox
n02119477 red fox, Vulpes fulva
where each line corresponds to a unique mapping. Note that each line is
formatted as <synset>\t<human readable label>.
Returns:
Dictionary of synset to human labels, such as:
'n02119022' --> 'red fox, Vulpes vulpes'
"""
lines = tf.gfile.FastGFile(imagenet_metadata_file, 'r').readlines()
synset_to_human = {}
for l in lines:
if l:
parts = l.strip().split('\t')
assert len(parts) == 2
synset = parts[0]
human = parts[1]
synset_to_human[synset] = human
return synset_to_human
def main(unused_argv):
assert not FLAGS.train_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with FLAGS.train_shards')
assert not FLAGS.validation_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with '
'FLAGS.validation_shards')
print('Saving results to %s' % FLAGS.output_directory)
# Build a map from synset to human-readable label.
synset_to_human = _build_synset_lookup(FLAGS.imagenet_metadata_file)
# Run it!
_process_dataset('validation', FLAGS.validation_directory,
FLAGS.validation_shards, synset_to_human, None)
_process_dataset('train', FLAGS.train_directory, FLAGS.train_shards,
synset_to_human, None)
if __name__ == '__main__':
tf.app.run()

File diff suppressed because it is too large

File diff suppressed because it is too large

View file

@ -0,0 +1,10 @@
n02086240
n02087394
n02088364
n02089973
n02093754
n02096294
n02099601
n02105641
n02111889
n02115641

View file

@ -0,0 +1,82 @@
#!/bin/bash
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# Script to download and preprocess ImageNet Challenge 2012
# training and validation data set.
#
# The final output of this script are sharded TFRecord files containing
# serialized Example protocol buffers. See build_imagenet_data.py for
# details of how the Example protocol buffers contain the ImageNet data.
#
# The final output of this script appears as such:
#
# data_dir/train-00000-of-01024
# data_dir/train-00001-of-01024
# ...
# data_dir/train-01023-of-01024
#
# and
#
# data_dir/validation-00000-of-00128
# data_dir/validation-00001-of-00128
# ...
# data_dir/validation-00127-of-00128
#
# Note that this script may take several hours to run to completion. The
# conversion of the ImageNet data to TFRecords alone takes 2-3 hours depending
# on the speed of your machine. Please be patient.
#
# **IMPORTANT**
# To download the raw images, the user must create an account with image-net.org
# and generate a username and access_key; both are required for downloading
# the raw images.
#
# usage:
# ./preprocess_imagenet.sh [data-dir]
set -e
if [ -z "$1" ]; then
echo "Usage: preprocess_imagenet.sh [data dir]"
  exit 1
fi
DATA_DIR="${1%/}"
SCRATCH_DIR="${DATA_DIR}/raw-data/"
mkdir -p ${SCRATCH_DIR}
# Convert the XML files for bounding box annotations into a single CSV.
echo "Extracting bounding box information from XML."
BOUNDING_BOX_SCRIPT="./dataprep/process_bounding_boxes.py"
BOUNDING_BOX_FILE="${DATA_DIR}/imagenet_2012_bounding_boxes.csv"
BOUNDING_BOX_DIR="${DATA_DIR}/bounding_boxes/"
LABELS_FILE="./dataprep/imagenet_lsvrc_2015_synsets.txt"
"${BOUNDING_BOX_SCRIPT}" "${BOUNDING_BOX_DIR}" "${LABELS_FILE}" \
| sort > "${BOUNDING_BOX_FILE}"
echo "preprocessing the ImageNet data."
# Build the TFRecords version of the ImageNet data.
OUTPUT_DIRECTORY="${DATA_DIR}"
IMAGENET_METADATA_FILE="./dataprep/imagenet_metadata.txt"
python ./dataprep/build_imagenet_data.py \
--train_directory="${DATA_DIR}/train" \
--validation_directory="${DATA_DIR}/val" \
--output_directory="${DATA_DIR}/result" \
--imagenet_metadata_file="${IMAGENET_METADATA_FILE}" \
--labels_file="${LABELS_FILE}" \
--bounding_box_file="${BOUNDING_BOX_FILE}"

View file

@ -0,0 +1,89 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Process the ImageNet Challenge bounding boxes for TensorFlow model training.
Associate the ImageNet 2012 Challenge validation data set with labels.
The raw ImageNet validation data set is expected to reside in JPEG files
located in the following directory structure.
data_dir/ILSVRC2012_val_00000001.JPEG
data_dir/ILSVRC2012_val_00000002.JPEG
...
data_dir/ILSVRC2012_val_00050000.JPEG
This script moves the files into a directory structure like such:
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
...
where 'n01440764' is the unique synset label associated with
these images.
This directory reorganization requires a mapping from validation image
number (i.e. suffix of the original file) to the associated label. This
is provided in the ImageNet development kit via a Matlab file.
In order to make life easier and divorce ourselves from Matlab, we instead
supply a custom text file that provides this mapping for us.
Sample usage:
./preprocess_imagenet_validation_data.py ILSVRC2012_img_val \
imagenet_2012_validation_synset_labels.txt
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import errno
import os.path
import sys
if __name__ == '__main__':
if len(sys.argv) < 3:
print('Invalid usage\n'
'usage: preprocess_imagenet_validation_data.py '
'<validation data dir> <validation labels file>')
sys.exit(-1)
data_dir = sys.argv[1]
validation_labels_file = sys.argv[2]
# Read in the 50000 synsets associated with the validation data set.
labels = [l.strip() for l in open(validation_labels_file).readlines()]
unique_labels = set(labels)
# Make all sub-directories in the validation data dir.
for label in unique_labels:
labeled_data_dir = os.path.join(data_dir, label)
# Catch error if sub-directory exists
try:
os.makedirs(labeled_data_dir)
except OSError as e:
# Raise all errors but 'EEXIST'
if e.errno != errno.EEXIST:
raise
# Move all of the images to the appropriate sub-directory.
for i in range(len(labels)):
basename = 'ILSVRC2012_val_000%.5d.JPEG' % (i + 1)
original_filename = os.path.join(data_dir, basename)
if not os.path.exists(original_filename):
print('Failed to find: %s' % original_filename)
sys.exit(-1)
new_filename = os.path.join(data_dir, labels[i], basename)
os.rename(original_filename, new_filename)

View file

@ -0,0 +1,254 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Process the ImageNet Challenge bounding boxes for TensorFlow model training.
This script is called as
process_bounding_boxes.py <dir> [synsets-file]
Where <dir> is a directory containing the downloaded and unpacked bounding box
data. If [synsets-file] is supplied, then only the bounding boxes whose
synsets are contained within this file are returned. Note that the
[synsets-file] file contains synset ids, one per line.
The script dumps out a CSV text file in which each line contains an entry.
n00007846_64193.JPEG,0.0060,0.2620,0.7545,0.9940
The entry can be read as:
<JPEG file name>, <xmin>, <ymin>, <xmax>, <ymax>
The bounding box for <JPEG file name> contains two points (xmin, ymin) and
(xmax, ymax) specifying the lower-left corner and upper-right corner of a
bounding box in *relative* coordinates.
The user supplies a directory where the XML files reside. The directory
structure in the directory <dir> is assumed to look like this:
<dir>/nXXXXXXXX/nXXXXXXXX_YYYY.xml
Each XML file contains a bounding box annotation. The script:
(1) Parses the XML file and extracts the filename, label and bounding box info.
(2) The bounding box is specified in the XML files as integer (xmin, ymin) and
(xmax, ymax) *relative* to image size displayed to the human annotator. The
size of the image displayed to the human annotator is stored in the XML file
as integer (height, width).
Note that the displayed size will differ from the actual size of the image
downloaded from image-net.org. To make the bounding box annotation useable,
we convert bounding box to floating point numbers relative to displayed
height and width of the image.
Note that each XML file might contain N bounding box annotations.
Note that the points are all clamped at a range of [0.0, 1.0] because some
human annotations extend outside the range of the supplied image.
See details here: http://image-net.org/download-bboxes
(3) By default, the script outputs all valid bounding boxes. If a
[synsets-file] is supplied, only the subset of bounding boxes associated
with those synsets is output. Importantly, one can supply a list of
synsets in the ImageNet Challenge and output the list of bounding boxes
associated with the training images of the ILSVRC.
We use these bounding boxes to inform the random distortion of images
supplied to the network.
If you run this script successfully, you will see the following output
to stderr:
> Finished processing 544546 XML files.
> Skipped 0 XML files not in ImageNet Challenge.
> Skipped 0 bounding boxes not in ImageNet Challenge.
> Wrote 615299 bounding boxes from 544546 annotated images.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import glob
import os.path
import sys
import xml.etree.ElementTree as ET
class BoundingBox(object):
pass
def GetItem(name, root, index=0):
count = 0
for item in root.iter(name):
if count == index:
return item.text
count += 1
# Failed to find "index" occurrence of item.
return -1
def GetInt(name, root, index=0):
# In some XML annotation files, the point values are not integers, but floats.
# So we add a float function to avoid ValueError.
return int(float(GetItem(name, root, index)))
def FindNumberBoundingBoxes(root):
index = 0
while True:
if GetInt('xmin', root, index) == -1:
break
index += 1
return index
def ProcessXMLAnnotation(xml_file):
"""Process a single XML file containing a bounding box."""
# pylint: disable=broad-except
try:
tree = ET.parse(xml_file)
except Exception:
print('Failed to parse: ' + xml_file, file=sys.stderr)
return None
# pylint: enable=broad-except
root = tree.getroot()
num_boxes = FindNumberBoundingBoxes(root)
boxes = []
for index in range(num_boxes):
box = BoundingBox()
# Grab the 'index' annotation.
box.xmin = GetInt('xmin', root, index)
box.ymin = GetInt('ymin', root, index)
box.xmax = GetInt('xmax', root, index)
box.ymax = GetInt('ymax', root, index)
box.width = GetInt('width', root)
box.height = GetInt('height', root)
box.filename = GetItem('filename', root) + '.JPEG'
box.label = GetItem('name', root)
xmin = float(box.xmin) / float(box.width)
xmax = float(box.xmax) / float(box.width)
ymin = float(box.ymin) / float(box.height)
ymax = float(box.ymax) / float(box.height)
# Some images contain bounding box annotations that
# extend outside of the supplied image. See, e.g.
# n03127925/n03127925_147.xml
# Additionally, for some bounding boxes, the min > max
# or the box is entirely outside of the image.
min_x = min(xmin, xmax)
max_x = max(xmin, xmax)
box.xmin_scaled = min(max(min_x, 0.0), 1.0)
box.xmax_scaled = min(max(max_x, 0.0), 1.0)
min_y = min(ymin, ymax)
max_y = max(ymin, ymax)
box.ymin_scaled = min(max(min_y, 0.0), 1.0)
box.ymax_scaled = min(max(max_y, 0.0), 1.0)
boxes.append(box)
return boxes
if __name__ == '__main__':
if len(sys.argv) < 2 or len(sys.argv) > 3:
print('Invalid usage\n'
'usage: process_bounding_boxes.py <dir> [synsets-file]',
file=sys.stderr)
sys.exit(-1)
xml_files = glob.glob(sys.argv[1] + '/*/*.xml')
print('Identified %d XML files in %s' % (len(xml_files), sys.argv[1]),
file=sys.stderr)
if len(sys.argv) == 3:
labels = set([l.strip() for l in open(sys.argv[2]).readlines()])
print('Identified %d synset IDs in %s' % (len(labels), sys.argv[2]),
file=sys.stderr)
else:
labels = None
skipped_boxes = 0
skipped_files = 0
saved_boxes = 0
saved_files = 0
for file_index, one_file in enumerate(xml_files):
# Example: <...>/n06470073/n00141669_6790.xml
label = os.path.basename(os.path.dirname(one_file))
# Determine if the annotation is from an ImageNet Challenge label.
if labels is not None and label not in labels:
skipped_files += 1
continue
bboxes = ProcessXMLAnnotation(one_file)
assert bboxes is not None, 'No bounding boxes found in ' + one_file
found_box = False
for bbox in bboxes:
if labels is not None:
if bbox.label != label:
# Note: There is a slight bug in the bounding box annotation data.
# Many of the dog labels have the human label 'Scottish_deerhound'
# instead of the synset ID 'n02092002' in the bbox.label field. As a
# simple hack to overcome this issue, we only exclude bbox labels
# *which are synset ID's* that do not match original synset label for
# the XML file.
if bbox.label in labels:
skipped_boxes += 1
continue
# Guard against improperly specified boxes.
if (bbox.xmin_scaled >= bbox.xmax_scaled or
bbox.ymin_scaled >= bbox.ymax_scaled):
skipped_boxes += 1
continue
# Note bbox.filename occasionally contains '%s' in the name. This is
# data set noise that is fixed by just using the basename of the XML file.
image_filename = os.path.splitext(os.path.basename(one_file))[0]
print('%s.JPEG,%.4f,%.4f,%.4f,%.4f' %
(image_filename,
bbox.xmin_scaled, bbox.ymin_scaled,
bbox.xmax_scaled, bbox.ymax_scaled))
saved_boxes += 1
found_box = True
if found_box:
saved_files += 1
else:
skipped_files += 1
if not file_index % 5000:
print('--> processed %d of %d XML files.' %
(file_index + 1, len(xml_files)),
file=sys.stderr)
print('--> skipped %d boxes and %d XML files.' %
(skipped_boxes, skipped_files), file=sys.stderr)
print('Finished processing %d XML files.' % len(xml_files), file=sys.stderr)
print('Skipped %d XML files not in ImageNet Challenge.' % skipped_files,
file=sys.stderr)
print('Skipped %d bounding boxes not in ImageNet Challenge.' % skipped_boxes,
file=sys.stderr)
print('Wrote %d bounding boxes from %d annotated images.' %
(saved_boxes, saved_files),
file=sys.stderr)
print('Finished.', file=sys.stderr)

View file

@ -42,12 +42,10 @@ if __name__ == "__main__":
log_path = os.path.join(FLAGS.results_dir, FLAGS.log_filename)
os.makedirs(FLAGS.results_dir, exist_ok=True)
dllogger.init(
backends=[
dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE, filename=log_path),
dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE)
]
)
dllogger.init(backends=[
dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE, filename=log_path),
dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE)
])
else:
dllogger.init(backends=[])
dllogger.log(data=vars(FLAGS), step='PARAMETER')
@ -58,49 +56,46 @@ if __name__ == "__main__":
architecture=FLAGS.arch,
input_format='NHWC',
compute_format=FLAGS.data_format,
dtype=tf.float32 if FLAGS.precision == 'fp32' else tf.float16,
dtype=tf.float32,
n_channels=3,
height=224,
width=224,
height=224 if FLAGS.data_dir else FLAGS.synthetic_data_size,
width=224 if FLAGS.data_dir else FLAGS.synthetic_data_size,
distort_colors=False,
log_dir=FLAGS.results_dir,
model_dir=FLAGS.model_dir if FLAGS.model_dir is not None else FLAGS.results_dir,
data_dir=FLAGS.data_dir,
data_idx_dir=FLAGS.data_idx_dir,
weight_init=FLAGS.weight_init,
use_xla=FLAGS.use_xla,
use_tf_amp=FLAGS.use_tf_amp,
use_dali=FLAGS.use_dali,
use_xla=FLAGS.xla,
use_tf_amp=FLAGS.amp,
use_dali=FLAGS.dali,
gpu_memory_fraction=FLAGS.gpu_memory_fraction,
gpu_id=FLAGS.gpu_id,
seed=FLAGS.seed
)
seed=FLAGS.seed)
if FLAGS.mode in ["train", "train_and_evaluate", "training_benchmark"]:
runner.train(
iter_unit=FLAGS.iter_unit,
num_iter=FLAGS.num_iter,
run_iter=FLAGS.run_iter,
batch_size=FLAGS.batch_size,
warmup_steps=FLAGS.warmup_steps,
log_every_n_steps=FLAGS.display_every,
weight_decay=FLAGS.weight_decay,
lr_init=FLAGS.lr_init,
lr_warmup_epochs=FLAGS.lr_warmup_epochs,
momentum=FLAGS.momentum,
loss_scale=FLAGS.loss_scale,
label_smoothing=FLAGS.label_smoothing,
mixup=FLAGS.mixup,
use_static_loss_scaling=FLAGS.use_static_loss_scaling,
use_cosine_lr=FLAGS.use_cosine_lr,
is_benchmark=FLAGS.mode == 'training_benchmark',
use_final_conv=FLAGS.use_final_conv,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
quant_delay = FLAGS.quant_delay,
use_qdq = FLAGS.use_qdq,
finetune_checkpoint=FLAGS.finetune_checkpoint,
)
runner.train(iter_unit=FLAGS.iter_unit,
num_iter=FLAGS.num_iter,
run_iter=FLAGS.run_iter,
batch_size=FLAGS.batch_size,
warmup_steps=FLAGS.warmup_steps,
log_every_n_steps=FLAGS.display_every,
weight_decay=FLAGS.weight_decay,
lr_init=FLAGS.lr_init,
lr_warmup_epochs=FLAGS.lr_warmup_epochs,
momentum=FLAGS.momentum,
loss_scale=FLAGS.static_loss_scale,
label_smoothing=FLAGS.label_smoothing,
mixup=FLAGS.mixup,
use_static_loss_scaling=(FLAGS.static_loss_scale != -1),
use_cosine_lr=FLAGS.cosine_lr,
is_benchmark=FLAGS.mode == 'training_benchmark',
use_final_conv=FLAGS.use_final_conv,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
quant_delay=FLAGS.quant_delay,
use_qdq=FLAGS.use_qdq,
finetune_checkpoint=FLAGS.finetune_checkpoint)
if FLAGS.mode in ["train_and_evaluate", 'evaluate', 'inference_benchmark']:
@ -109,19 +104,17 @@ if __name__ == "__main__":
elif not hvd_utils.is_using_hvd() or hvd.rank() == 0:
runner.evaluate(
iter_unit=FLAGS.iter_unit if FLAGS.mode != "train_and_evaluate" else "epoch",
num_iter=FLAGS.num_iter if FLAGS.mode != "train_and_evaluate" else 1,
warmup_steps=FLAGS.warmup_steps,
batch_size=FLAGS.batch_size,
log_every_n_steps=FLAGS.display_every,
is_benchmark=FLAGS.mode == 'inference_benchmark',
export_dir=FLAGS.export_dir,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
use_final_conv=FLAGS.use_final_conv,
use_qdq=FLAGS.use_qdq
)
runner.evaluate(iter_unit=FLAGS.iter_unit if FLAGS.mode != "train_and_evaluate" else "epoch",
num_iter=FLAGS.num_iter if FLAGS.mode != "train_and_evaluate" else 1,
warmup_steps=FLAGS.warmup_steps,
batch_size=FLAGS.batch_size,
log_every_n_steps=FLAGS.display_every,
is_benchmark=FLAGS.mode == 'inference_benchmark',
export_dir=FLAGS.export_dir,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
use_final_conv=FLAGS.use_final_conv,
use_qdq=FLAGS.use_qdq)
if FLAGS.mode == 'predict':
if FLAGS.to_predict is None:
@ -134,4 +127,8 @@ if __name__ == "__main__":
raise NotImplementedError("Only single GPU inference is implemented.")
elif not hvd_utils.is_using_hvd() or hvd.rank() == 0:
runner.predict(FLAGS.to_predict, quantize=FLAGS.quantize, symmetric=FLAGS.symmetric, use_qdq=FLAGS.use_qdq, use_final_conv=FLAGS.use_final_conv)
runner.predict(FLAGS.to_predict,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
use_qdq=FLAGS.use_qdq,
use_final_conv=FLAGS.use_final_conv)

View file

@ -29,7 +29,7 @@ def conv2d(
data_format='NHWC',
dilation_rate=(1, 1),
use_bias=True,
kernel_initializer=tf.variance_scaling_initializer(),
kernel_initializer=tf.compat.v1.variance_scaling_initializer(),
bias_initializer=tf.zeros_initializer(),
trainable=True,
name=None
@ -56,6 +56,5 @@ def conv2d(
activation=None,
name=name
)
return net
return net

View file

@ -22,7 +22,7 @@ def dense(
units,
use_bias=True,
trainable=True,
kernel_initializer=tf.variance_scaling_initializer(),
kernel_initializer=tf.compat.v1.variance_scaling_initializer(),
bias_initializer=tf.zeros_initializer()
):

View file

@ -29,7 +29,7 @@ def squeeze_excitation_layer(
ratio,
training=True,
data_format='NCHW',
kernel_initializer=tf.variance_scaling_initializer(),
kernel_initializer=tf.compat.v1.variance_scaling_initializer(),
bias_initializer=tf.zeros_initializer(),
name="squeeze_excitation_layer"
):

View file

@ -15,7 +15,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import tensorflow as tf
@ -34,7 +33,6 @@ from utils.data_utils import normalized_inputs
from utils.learning_rate import learning_rate_scheduler
from utils.optimizers import FixedLossScalerOptimizer
__all__ = [
'ResnetModel',
]
@ -89,14 +87,14 @@ class ResnetModel(object):
)
self.conv2d_hparams = tf.contrib.training.HParams(
kernel_initializer=tf.variance_scaling_initializer(
kernel_initializer=tf.compat.v1.variance_scaling_initializer(
scale=2.0, distribution='truncated_normal', mode=weight_init
),
bias_initializer=tf.constant_initializer(0.0)
)
self.dense_hparams = tf.contrib.training.HParams(
kernel_initializer=tf.variance_scaling_initializer(
kernel_initializer=tf.compat.v1.variance_scaling_initializer(
scale=2.0, distribution='truncated_normal', mode=weight_init
),
bias_initializer=tf.constant_initializer(0.0)
@ -109,12 +107,13 @@ class ResnetModel(object):
print("Input_format", input_format)
print("dtype", str(dtype))
def __call__(self, features, labels, mode, params):
if mode == tf.estimator.ModeKeys.TRAIN:
mandatory_params = ["batch_size", "lr_init", "num_gpus", "steps_per_epoch",
"momentum", "weight_decay", "loss_scale", "label_smoothing"]
mandatory_params = [
"batch_size", "lr_init", "num_gpus", "steps_per_epoch", "momentum", "weight_decay", "loss_scale",
"label_smoothing"
]
for p in mandatory_params:
if p not in params:
raise RuntimeError("Parameter {} is missing.".format(p))
@ -141,43 +140,46 @@ class ResnetModel(object):
mixup = 0
eta = 0
if mode == tf.estimator.ModeKeys.TRAIN:
if mode == tf.estimator.ModeKeys.TRAIN:
eta = params['label_smoothing']
mixup = params['mixup']
if mode != tf.estimator.ModeKeys.PREDICT:
one_hot_smoothed_labels = tf.one_hot(labels, 1001,
on_value = 1 - eta + eta/1001,
off_value = eta/1001)
if mode != tf.estimator.ModeKeys.PREDICT:
n_cls = self.model_hparams.n_classes
one_hot_smoothed_labels = tf.one_hot(labels, n_cls,
on_value=1 - eta + eta / n_cls, off_value=eta / n_cls)
if mixup != 0:
print("Using mixup training with beta=", params['mixup'])
beta_distribution = tf.distributions.Beta(params['mixup'], params['mixup'])
feature_coefficients = beta_distribution.sample(sample_shape=[params['batch_size'], 1, 1, 1])
feature_coefficients = beta_distribution.sample(sample_shape=[params['batch_size'], 1, 1, 1])
reversed_feature_coefficients = tf.subtract(tf.ones(shape=feature_coefficients.shape), feature_coefficients)
reversed_feature_coefficients = tf.subtract(
tf.ones(shape=feature_coefficients.shape), feature_coefficients
)
rotated_features = tf.reverse(features, axis=[0])
rotated_features = tf.reverse(features, axis=[0])
features = feature_coefficients * features + reversed_feature_coefficients * rotated_features
label_coefficients = tf.squeeze(feature_coefficients, axis=[2, 3])
rotated_labels = tf.reverse(one_hot_smoothed_labels, axis=[0])
rotated_labels = tf.reverse(one_hot_smoothed_labels, axis=[0])
reversed_label_coefficients = tf.subtract(tf.ones(shape=label_coefficients.shape), label_coefficients)
reversed_label_coefficients = tf.subtract(
tf.ones(shape=label_coefficients.shape), label_coefficients
)
one_hot_smoothed_labels = label_coefficients * one_hot_smoothed_labels + reversed_label_coefficients * rotated_labels
# Update Global Step
global_step = tf.train.get_or_create_global_step()
tf.identity(global_step, name="global_step_ref")
tf.identity(features, name="features_ref")
if mode == tf.estimator.ModeKeys.TRAIN:
tf.identity(labels, name="labels_ref")
@ -202,16 +204,31 @@ class ResnetModel(object):
tf.identity(probs, name="probs_ref")
tf.identity(y_preds, name="y_preds_ref")
#if mode == tf.estimator.ModeKeys.TRAIN:
#
# assert (len(tf.trainable_variables()) == 161)
#
#else:
#
# assert (len(tf.trainable_variables()) == 0)
if mode == tf.estimator.ModeKeys.TRAIN and params['quantize']:
dllogger.log(data={"QUANTIZATION AWARE TRAINING ENABLED": True}, step=tuple())
if params['symmetric']:
dllogger.log(data={"MODE":"USING SYMMETRIC MODE"}, step=tuple())
tf.contrib.quantize.experimental_create_training_graph(tf.get_default_graph(), symmetric=True, use_qdq=params['use_qdq'] ,quant_delay=params['quant_delay'])
dllogger.log(data={"MODE": "USING SYMMETRIC MODE"}, step=tuple())
tf.contrib.quantize.experimental_create_training_graph(
tf.get_default_graph(),
symmetric=True,
use_qdq=params['use_qdq'],
quant_delay=params['quant_delay']
)
else:
dllogger.log(data={"MODE":"USING ASSYMETRIC MODE"}, step=tuple())
tf.contrib.quantize.create_training_graph(tf.get_default_graph(), quant_delay=params['quant_delay'], use_qdq=params['use_qdq'])
# Fix for restoring variables during fine-tuning of Resnet-50
dllogger.log(data={"MODE": "USING ASSYMETRIC MODE"}, step=tuple())
tf.contrib.quantize.create_training_graph(
tf.get_default_graph(), quant_delay=params['quant_delay'], use_qdq=params['use_qdq']
)
# Fix for restoring variables during fine-tuning of Resnet
if 'finetune_checkpoint' in params.keys():
train_vars = tf.trainable_variables()
train_var_dict = {}
@ -220,6 +237,13 @@ class ResnetModel(object):
dllogger.log(data={"Restoring variables from checkpoint": params['finetune_checkpoint']}, step=tuple())
tf.train.init_from_checkpoint(params['finetune_checkpoint'], train_var_dict)
with tf.device("/cpu:0"):
if hvd_utils.is_using_hvd():
sync_var = tf.Variable(initial_value=[0], dtype=tf.int32, name="signal_handler_var")
sync_var_assing = sync_var.assign([1], name="signal_handler_var_set")
sync_var_reset = sync_var.assign([0], name="signal_handler_var_reset")
sync_op = hvd.allreduce(sync_var, op=hvd.Sum, name="signal_handler_all_reduce")
if mode == tf.estimator.ModeKeys.PREDICT:
predictions = {'classes': y_preds, 'probabilities': probs}
@ -239,8 +263,12 @@ class ResnetModel(object):
acc_top5 = tf.nn.in_top_k(predictions=logits, targets=labels, k=5)
else:
acc_top1, acc_top1_update_op = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, targets=labels, k=1))
acc_top5, acc_top5_update_op = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, targets=labels, k=5))
acc_top1, acc_top1_update_op = tf.metrics.mean(
tf.nn.in_top_k(predictions=logits, targets=labels, k=1)
)
acc_top5, acc_top5_update_op = tf.metrics.mean(
tf.nn.in_top_k(predictions=logits, targets=labels, k=5)
)
tf.identity(acc_top1, name="acc_top1_ref")
tf.identity(acc_top5, name="acc_top5_ref")
@ -251,20 +279,21 @@ class ResnetModel(object):
'accuracy_top1': acc_top1,
'accuracy_top5': acc_top5
}
cross_entropy = tf.losses.softmax_cross_entropy(
logits=logits, onehot_labels=one_hot_smoothed_labels)
cross_entropy = tf.losses.softmax_cross_entropy(logits=logits, onehot_labels=one_hot_smoothed_labels)
assert (cross_entropy.dtype == tf.float32)
tf.identity(cross_entropy, name='cross_entropy_loss_ref')
def loss_filter_fn(name):
"""we don't need to compute L2 loss for BN and bias (eq. to add a cste)"""
return all([
tensor_name not in name.lower()
# for tensor_name in ["batchnorm", "batch_norm", "batch_normalization", "bias"]
for tensor_name in ["batchnorm", "batch_norm", "batch_normalization"]
])
return all(
[
tensor_name not in name.lower()
# for tensor_name in ["batchnorm", "batch_norm", "batch_normalization", "bias"]
for tensor_name in ["batchnorm", "batch_norm", "batch_normalization"]
]
)
filtered_params = [tf.cast(v, tf.float32) for v in tf.trainable_variables() if loss_filter_fn(v.name)]
@ -287,7 +316,7 @@ class ResnetModel(object):
tf.summary.scalar('cross_entropy', cross_entropy)
tf.summary.scalar('l2_loss', l2_loss)
tf.summary.scalar('total_loss', total_loss)
if mode == tf.estimator.ModeKeys.TRAIN:
with tf.device("/cpu:0"):
@ -317,17 +346,18 @@ class ResnetModel(object):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if mode != tf.estimator.ModeKeys.TRAIN:
update_ops += [acc_top1_update_op, acc_top5_update_op]
deterministic = True
gate_gradients = (tf.train.Optimizer.GATE_OP if deterministic else tf.train.Optimizer.GATE_NONE)
gate_gradients = (tf.compat.v1.train.Optimizer.GATE_OP if deterministic else tf.compat.v1.train.Optimizer.GATE_NONE)
backprop_op = optimizer.minimize(total_loss, gate_gradients=gate_gradients, global_step=global_step)
if self.model_hparams.use_dali:
train_ops = tf.group(backprop_op, update_ops, name='train_ops')
else:
train_ops = tf.group(backprop_op, cpu_prefetch_op, gpu_prefetch_op, update_ops, name='train_ops')
train_ops = tf.group(
backprop_op, cpu_prefetch_op, gpu_prefetch_op, update_ops, name='train_ops'
)
return tf.estimator.EstimatorSpec(mode=mode, loss=total_loss, train_op=train_ops)
@ -338,23 +368,18 @@ class ResnetModel(object):
}
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=predictions,
loss=total_loss,
eval_metric_ops=eval_metrics
mode=mode, predictions=predictions, loss=total_loss, eval_metric_ops=eval_metrics
)
else:
raise NotImplementedError('Unknown mode {}'.format(mode))
@staticmethod
def _stage(tensors):
"""Stages the given tensors in a StagingArea for asynchronous put/get.
"""
stage_area = tf.contrib.staging.StagingArea(
dtypes=[tensor.dtype for tensor in tensors],
shapes=[tensor.get_shape() for tensor in tensors]
dtypes=[tensor.dtype for tensor in tensors], shapes=[tensor.get_shape() for tensor in tensors]
)
put_op = stage_area.put(tensors)
@ -364,14 +389,11 @@ class ResnetModel(object):
return put_op, get_tensors
def build_model(self, inputs, training=True, reuse=False, use_final_conv=False):
with var_storage.model_variable_scope(
self.model_hparams.model_name,
reuse=reuse,
dtype=self.model_hparams.dtype):
self.model_hparams.model_name, reuse=reuse, dtype=self.model_hparams.dtype
):
with tf.variable_scope("input_reshape"):
if self.model_hparams.input_format == 'NHWC' and self.model_hparams.compute_format == 'NCHW':
@ -426,27 +448,29 @@ class ResnetModel(object):
batch_norm_hparams=self.batch_norm_hparams,
block_name="btlnck_block_%d_%d" % (block_id, layer_id),
use_se=self.model_hparams.use_se,
ratio=self.model_hparams.se_ratio)
ratio=self.model_hparams.se_ratio
)
with tf.variable_scope("output"):
net = layers.reduce_mean(
net, keepdims=use_final_conv, data_format=self.model_hparams.compute_format, name='spatial_mean')
net, keepdims=False, data_format=self.model_hparams.compute_format, name='spatial_mean'
)
if use_final_conv:
logits = layers.conv2d(
net,
n_channels=self.model_hparams.n_classes,
kernel_size=(1, 1),
strides=(1, 1),
padding='SAME',
data_format=self.model_hparams.compute_format,
dilation_rate=(1, 1),
use_bias=True,
kernel_initializer=self.dense_hparams.kernel_initializer,
bias_initializer=self.dense_hparams.bias_initializer,
trainable=training,
name='dense'
)
net,
n_channels=self.model_hparams.n_classes,
kernel_size=(1, 1),
strides=(1, 1),
padding='SAME',
data_format=self.model_hparams.compute_format,
dilation_rate=(1, 1),
use_bias=True,
kernel_initializer=self.dense_hparams.kernel_initializer,
bias_initializer=self.dense_hparams.bias_initializer,
trainable=training,
name='dense'
)
else:
logits = layers.dense(
inputs=net,
@ -454,7 +478,8 @@ class ResnetModel(object):
use_bias=True,
trainable=training,
kernel_initializer=self.dense_hparams.kernel_initializer,
bias_initializer=self.dense_hparams.bias_initializer)
bias_initializer=self.dense_hparams.bias_initializer
)
if logits.dtype != tf.float32:
logits = tf.cast(logits, tf.float32)
@ -464,27 +489,25 @@ class ResnetModel(object):
return probs, logits
model_architectures = {
'resnet50': {
'layers': [3, 4, 6, 3],
'widths': [64, 128, 256, 512],
'expansions': 4,
},
'resnext101-32x4d': {
'layers': [3, 4, 23, 3],
'widths': [128, 256, 512, 1024],
'expansions': 2,
'cardinality': 32,
},
'se-resnext101-32x4d' : {
'cardinality' : 32,
'layers' : [3, 4, 23, 3],
'widths' : [128, 256, 512, 1024],
'expansions' : 2,
'se-resnext101-32x4d': {
'cardinality': 32,
'layers': [3, 4, 23, 3],
'widths': [128, 256, 512, 1024],
'expansions': 2,
'use_se': True,
'se_ratio': 16,
},
}

View file

@ -71,4 +71,4 @@ if __name__=='__main__':
file.write("model_checkpoint_path: "+ "\"" + new_ckpt + "\"")
# Process the input checkpoint, apply transforms and generate a new checkpoint.
process_checkpoint(input_ckpt, new_ckpt_path, args.dense_layer)
process_checkpoint(input_ckpt, new_ckpt_path, args.dense_layer)

View file

@ -244,16 +244,16 @@ For example, to train on DGX-1 for 90 epochs using AMP, run:
Additionally, features like DALI data preprocessing or TensorFlow XLA can be enabled with the
following arguments when running those scripts:
`bash ./resnet50v1.5/training/DGX1_RN50_AMP_90E.sh /path/to/result /data --use_xla --use_dali`
`bash ./resnet50v1.5/training/DGX1_RN50_AMP_90E.sh /path/to/result /data --xla --dali`
7. Start validation/evaluation.
To evaluate the validation dataset located in `/data/tfrecords`, run `main.py` with
`--mode=evaluate`. For example:
`python main.py --mode=evaluate --data_dir=/data/tfrecords --batch_size <batch size> --model_dir
<model location> --results_dir <output location> [--use_xla] [--use_tf_amp]`
<model location> --results_dir <output location> [--xla] [--amp]`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during evaluation.
The optional `--xla` and `--amp` flags control XLA and AMP during evaluation.
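For instance, an evaluation run with both flags enabled could look like the following sketch; the checkpoint and output paths are placeholders and should be replaced with your own locations:

```
python main.py --mode=evaluate --data_dir=/data/tfrecords --batch_size 256 \
    --model_dir /workspace/checkpoints --results_dir /workspace/results \
    --xla --amp
```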
## Advanced
@ -292,99 +292,116 @@ The `runtime/` directory contains the following module that define the mechanics
The script for training and evaluating the ResNet-50 v1.5 model has a variety of parameters that control these processes.
```
usage: main.py [-h]
[--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
usage: main.py [-h] [--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
[--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}]
[--data_dir DATA_DIR] [--data_idx_dir DATA_IDX_DIR]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
[--batch_size BATCH_SIZE] [--num_iter NUM_ITER]
[--iter_unit {epoch,batch}] [--warmup_steps WARMUP_STEPS]
[--model_dir MODEL_DIR] [--results_dir RESULTS_DIR]
[--log_filename LOG_FILENAME] [--display_every DISPLAY_EVERY]
[--lr_init LR_INIT] [--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--loss_scale LOSS_SCALE]
[--label_smoothing LABEL_SMOOTHING] [--mixup MIXUP]
[--use_static_loss_scaling | --nouse_static_loss_scaling]
[--use_xla | --nouse_xla] [--use_dali | --nouse_dali]
[--use_tf_amp | --nouse_tf_amp]
[--use_cosine_lr | --nouse_cosine_lr] [--seed SEED]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
--batch_size BATCH_SIZE [--num_iter NUM_ITER]
[--run_iter RUN_ITER] [--iter_unit {epoch,batch}]
[--warmup_steps WARMUP_STEPS] [--model_dir MODEL_DIR]
[--results_dir RESULTS_DIR] [--log_filename LOG_FILENAME]
[--display_every DISPLAY_EVERY] [--seed SEED]
[--gpu_memory_fraction GPU_MEMORY_FRACTION] [--gpu_id GPU_ID]
JoC-RN50v1.5-TF
optional arguments:
-h, --help Show this help message and exit
[--finetune_checkpoint FINETUNE_CHECKPOINT] [--use_final_conv]
[--quant_delay QUANT_DELAY] [--quantize] [--use_qdq]
[--symmetric] [--data_dir DATA_DIR]
[--data_idx_dir DATA_IDX_DIR] [--dali]
[--synthetic_data_size SYNTHETIC_DATA_SIZE] [--lr_init LR_INIT]
[--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--label_smoothing LABEL_SMOOTHING]
[--mixup MIXUP] [--cosine_lr] [--xla]
[--data_format {NHWC,NCHW}] [--amp]
[--static_loss_scale STATIC_LOSS_SCALE]
JoC-RN50v1.5-TF
optional arguments:
-h, --help show this help message and exit.
--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}
Architecture of model to run (default is resnet50)
Architecture of model to run.
--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}
The execution mode of the script.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--run_iter RUN_ITER Number of training iterations to run on single run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write model. If undefined,
results dir will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log.
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by training script for DALI.
--gpu_id GPU_ID Specify ID of the target GPU on multi-device platform.
Effective only for single-GPU mode.
--finetune_checkpoint FINETUNE_CHECKPOINT
Path to pre-trained checkpoint which will be used for
fine-tuning.
--use_final_conv Use convolution operator instead of MLP as last layer.
--quant_delay QUANT_DELAY
Number of steps to be run before quantization starts
to happen.
--quantize Quantize weights and activations during training.
(Defaults to asymmetric quantization)
--use_qdq Use QDQV3 op instead of FakeQuantWithMinMaxVars op for
quantization. QDQv3 does only scaling.
--symmetric Quantize weights and activations during training using
symmetric quantization.
Dataset arguments:
--data_dir DATA_DIR Path to dataset in TFRecord format. Files should be
named 'train-*' and 'validation-*'.
--data_idx_dir DATA_IDX_DIR
Path to index files for DALI. Files should be named
'train-*' and 'validation-*'.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write the model. If undefined,
results directory will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--dali Enable DALI data input.
--synthetic_data_size SYNTHETIC_DATA_SIZE
Dimension of image for synthetic dataset.
Training arguments:
--lr_init LR_INIT Initial value for the learning rate.
--lr_warmup_epochs LR_WARMUP_EPOCHS
Number of warmup epochs for the learning rate schedule.
Number of warmup epochs for learning rate schedule.
--weight_decay WEIGHT_DECAY
Weight Decay scale factor.
--weight_init {fan_in,fan_out}
Model weight initialization method.
--momentum MOMENTUM SGD momentum value for the momentum optimizer.
--loss_scale LOSS_SCALE
Loss scale for FP16 training and fast math FP32.
--momentum MOMENTUM SGD momentum value for the Momentum optimizer.
--label_smoothing LABEL_SMOOTHING
The value of label smoothing.
--mixup MIXUP The alpha parameter for mixup (if 0 then mixup is not
applied).
--use_static_loss_scaling
Use static loss scaling in FP16 or FP32 AMP.
--nouse_static_loss_scaling
--use_xla Enable XLA (Accelerated Linear Algebra) computation
--cosine_lr Use cosine learning rate schedule.
Generic optimization arguments:
--xla Enable XLA (Accelerated Linear Algebra) computation
for improved performance.
--nouse_xla
--use_dali Enable DALI data input.
--nouse_dali
--use_tf_amp Enable AMP to speedup FP32
computation using Tensor Cores.
--nouse_tf_amp
--use_cosine_lr Use cosine learning rate schedule.
--nouse_cosine_lr
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by the training script for DALI
--gpu_id GPU_ID Specify the ID of the target GPU on a multi-device platform.
Effective only for single-GPU mode.
--quantize Used to add quantization nodes in the graph (Default: Asymmetric quantization)
--symmetric If --quantize mode is used, this option enables symmetric quantization
--use_qdq Use quantize_and_dequantize (QDQ) op instead of FakeQuantWithMinMaxVars op for quantization. QDQ does only scaling.
--finetune_checkpoint Path to pre-trained checkpoint which can be used for fine-tuning
--quant_delay Number of steps to be run before quantization starts to happen
--data_format {NHWC,NCHW}
Data format used to do calculations.
--amp Enable Automatic Mixed Precision to speedup
computation using tensor cores.
Automatic Mixed Precision arguments:
--static_loss_scale STATIC_LOSS_SCALE
Use static loss scaling in FP32 AMP.
```
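To show how the flags above fit together, the following is an illustrative 8-GPU AMP training command modeled on the provided `DGX1_RN50_AMP_90E.sh` script; the dataset and result paths are placeholders, and `--xla`/`--dali` are optional:

```
mpiexec --allow-run-as-root --bind-to socket -np 8 python3 main.py --arch=resnet50 \
    --mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
    --batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
    --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
    --amp --static_loss_scale 128 --xla --dali \
    --data_dir=/data/tfrecords --data_idx_dir=/data/dali_idx \
    --results_dir=/workspace/results --weight_init=fan_in
```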
### Quantization Aware Training
@ -424,12 +441,13 @@ Arguments:
* `--input_format` : Data format of input tensor (Default: NCHW). Use NCHW format to optimize the graph with TensorRT.
* `--compute_format` : Data format of the operations in the network (Default: NCHW). Use NCHW format to optimize the graph with TensorRT.
### Inference process
To run inference on a single example with a checkpoint and a model script, use:
`python main.py --mode predict --model_dir <path to model> --to_predict <path to image> --results_dir <path to results>`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during inference.
The optional `--xla` and `--amp` flags control XLA and AMP during inference.
## Performance
@ -448,7 +466,7 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`python ./main.py --mode=training_benchmark --use_tf_amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --mode=training_benchmark --amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
* For multiple GPUs
* FP32 / TF32
@ -457,16 +475,18 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --use_tf_amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
Each of these scripts runs 200 warm-up iterations and measures the first epoch.
To control warmup and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags. Features like XLA or DALI can be controlled
with `--use_xla` and `--use_dali` flags. If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
For proper throughput reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
with the `--xla` and `--dali` flags. For proper throughput reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
Suggested batch sizes for training are 256 for mixed precision training and 128 for single precision training per single V100 16 GB.
If no `--data_dir=<path to imagenet>` flag is specified, then the benchmarks will use a synthetic dataset. The resolution of the synthetic images can be controlled with the `--synthetic_data_size` flag.
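For example, a synthetic-data training benchmark (no `--data_dir`) could be launched roughly as follows; the batch size, image size, and iteration counts are illustrative only:

```
python ./main.py --mode=training_benchmark --amp --xla \
    --warmup_steps 200 --num_iter 500 --iter_unit batch \
    --batch_size 256 --synthetic_data_size 224 \
    --results_dir=/workspace/results
```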
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size, run:
@ -477,11 +497,10 @@ To benchmark the inference performance on a specific batch size, run:
* AMP
`python ./main.py --mode=inference_benchmark --use_tf_amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
By default, each of these scripts runs 20 warm-up iterations and measures the next 80 iterations.
To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
For proper throughput and latency reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
The benchmark can be automated with the `inference_benchmark.sh` script provided in `resnet50v1.5`, by simply running:
@ -490,6 +509,9 @@ The benchmark can be automated with the `inference_benchmark.sh` script provided
The `<data dir>` parameter refers to the input data directory (by default `/data/tfrecords` inside the container).
By default, the benchmark tests the following configurations: **FP32**, **AMP**, **AMP + XLA** with different batch sizes.
When the optional directory with the DALI index files `<data idx dir>` is specified, the benchmark executes an additional **DALI + AMP + XLA** configuration.
For proper throughput reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
For performance benchmarking of the raw model, a synthetic dataset can be used. To use the synthetic dataset, pass the `--synthetic_data_size` flag instead of `--data_dir` to specify the input image size.
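As a sketch, an inference benchmark on synthetic data could be invoked as follows; the batch size, image size, and iteration counts are placeholders:

```
python ./main.py --mode=inference_benchmark --amp --xla \
    --warmup_steps 20 --num_iter 100 --iter_unit batch \
    --batch_size 128 --synthetic_data_size 224 \
    --results_dir=/workspace/results
```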
### Results
@ -568,17 +590,6 @@ on NVIDIA DGX A100 with (8x A100 40G) GPUs.
| 8 | ~2h | ~5h |
##### Training time: NVIDIA DGX A100 (8x A100 40GB)
Our results were estimated based on the [training performance results](#training-performance-nvidia-dgx-a100-8x-a100-40g)
on NVIDIA DGX A100 with (8x A100 40G) GPUs.
| GPUs | Time to train - mixed precision + XLA | Time to train - mixed precision | Time to train - TF32 + XLA | Time to train - TF32 |
|---|--------|---------|---------|-------|
| 1 | ~18h | ~19.5h | ~40h | ~47h |
| 8 | ~2h | ~2.5h | ~5h | ~6h |
##### Training time: NVIDIA DGX-1 (8x V100 16G)
Our results were estimated based on the [training performance results](#training-performance-nvidia-dgx-1-8x-v100-16g)
@ -821,22 +832,25 @@ on NVIDIA T4 with (1x T4 16G) GPU.
* Added benchmark results for DGX-2 and XLA-enabled DGX-1 and DGX-2.
3. July, 2019
* Added Cosine learning rate schedule
3. August, 2019
4. August, 2019
* Added mixup regularization
* Added T4 benchmarks
* Improved inference capabilities
* Added SavedModel export
4. January, 2020
5. January, 2020
* Removed manual checks for dataset paths to facilitate cloud storage solutions
* Move to a new logging solution
* Bump base docker image version
5. March, 2020
6. March, 2020
* Code cleanup and refactor
* Improved training process
6. June, 2020
7. June, 2020
* Added benchmark results for DGX-A100
* Updated benchmark results for DGX-1, DGX-2 and T4
* Updated base docker image version
8. August 2020
* Updated command line argument names
* Added support for synthetic dataset with different image sizes
### Known issues
Performance without XLA enabled is low. We recommend using XLA.
Performance without XLA enabled is low due to a BN + ReLU fusion bug.

View file

@ -22,12 +22,12 @@ function test_configuration() {
}
test_configuration "FP32 nodali noxla"
test_configuration "FP32 nodali xla" "--use_xla"
test_configuration "FP16 nodali noxla" "--use_tf_amp"
test_configuration "FP16 nodali xla" "--use_tf_amp --use_xla"
test_configuration "FP32 nodali xla" "--xla"
test_configuration "FP16 nodali noxla" "--amp"
test_configuration "FP16 nodali xla" "--amp --xla"
if [ ! -z $DALI_DIR ]; then
test_configuration "FP16 dali xla" "--use_tf_amp --use_xla --use_dali --data_idx_dir ${DALI_DIR}"
test_configuration "FP16 dali xla" "--amp --xla --dali --data_idx_dir ${DALI_DIR}"
fi
cat $INFERENCE_BENCHMARK

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -1,20 +0,0 @@
#!/bin/bash
# Copyright (c) 2020 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script does Quantization aware training of Resnet-50 by finetuning on the pre-trained model using 1 GPU and a batch size of 32.
# Usage ./GPU1_RN50_QAT.sh <path to the pre-trained model> <path to dataset> <path to results directory>
python main.py --mode=train_and_evaluate --batch_size=32 --lr_warmup_epochs=1 --quantize --symmetric --use_qdq --label_smoothing 0.1 --lr_init=0.00005 --momentum=0.875 --weight_decay=3.0517578125e-05 --finetune_checkpoint=$1 --data_dir=$2 --results_dir=$3 --num_iter 10 --data_format NHWC

View file

@ -26,13 +26,13 @@ function run_benchmark() {
MODE_SIZE=$2
if [[ $4 -eq "1" ]]; then
XLA="--use_xla"
XLA="--xla"
else
XLA=""
fi
case $2 in
"amp") MODE_FLAGS="--use_tf_amp --use_static_loss_scaling --loss_scale=128";;
"amp") MODE_FLAGS="--amp --static_loss_scale 128";;
"fp32"|"tf32") MODE_FLAGS="";;
*) echo "Unsupported configuration, use amp, tf32 or fp32";;
esac

View file

@ -251,16 +251,16 @@ For example, to train on DGX-1 for 90 epochs using AMP, run:
Additionally, features like DALI data preprocessing or TensorFlow XLA can be enabled with the
following arguments when running those scripts:
`bash ./resnext101-32x4d/training/DGX1_RNxt101-32x4d_AMP_90E.sh /path/to/result /data --use_xla --use_dali`
`bash ./resnext101-32x4d/training/DGX1_RNxt101-32x4d_AMP_90E.sh /path/to/result /data --xla --dali`
7. Start validation/evaluation.
To evaluate the validation dataset located in `/data/tfrecords`, run `main.py` with
`--mode=evaluate`. For example:
`python main.py --arch=resnext101-32x4d --mode=evaluate --data_dir=/data/tfrecords --batch_size <batch size> --model_dir
<model location> --results_dir <output location> [--use_xla] [--use_tf_amp]`
<model location> --results_dir <output location> [--xla] [--amp]`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during evaluation.
The optional `--xla` and `--amp` flags control XLA and AMP during evaluation.
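For instance, a hypothetical evaluation run with both flags enabled (checkpoint and output paths are placeholders):

```
python main.py --arch=resnext101-32x4d --mode=evaluate --data_dir=/data/tfrecords \
    --batch_size 128 --model_dir /workspace/checkpoints --results_dir /workspace/results \
    --xla --amp
```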
## Advanced
@ -299,95 +299,116 @@ The `runtime/` directory contains the following module that define the mechanics
The script for training and evaluating the ResNext101-32x4d model has a variety of parameters that control these processes.
```
usage: main.py [-h]
[--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
usage: main.py [-h] [--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
[--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}]
[--data_dir DATA_DIR] [--data_idx_dir DATA_IDX_DIR]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
[--batch_size BATCH_SIZE] [--num_iter NUM_ITER]
[--iter_unit {epoch,batch}] [--warmup_steps WARMUP_STEPS]
[--model_dir MODEL_DIR] [--results_dir RESULTS_DIR]
[--log_filename LOG_FILENAME] [--display_every DISPLAY_EVERY]
[--lr_init LR_INIT] [--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--loss_scale LOSS_SCALE]
[--label_smoothing LABEL_SMOOTHING] [--mixup MIXUP]
[--use_static_loss_scaling | --nouse_static_loss_scaling]
[--use_xla | --nouse_xla] [--use_dali | --nouse_dali]
[--use_tf_amp | --nouse_tf_amp]
[--use_cosine_lr | --nouse_cosine_lr] [--seed SEED]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
--batch_size BATCH_SIZE [--num_iter NUM_ITER]
[--run_iter RUN_ITER] [--iter_unit {epoch,batch}]
[--warmup_steps WARMUP_STEPS] [--model_dir MODEL_DIR]
[--results_dir RESULTS_DIR] [--log_filename LOG_FILENAME]
[--display_every DISPLAY_EVERY] [--seed SEED]
[--gpu_memory_fraction GPU_MEMORY_FRACTION] [--gpu_id GPU_ID]
JoC-RN50v1.5-TF
optional arguments:
-h, --help Show this help message and exit
[--finetune_checkpoint FINETUNE_CHECKPOINT] [--use_final_conv]
[--quant_delay QUANT_DELAY] [--quantize] [--use_qdq]
[--symmetric] [--data_dir DATA_DIR]
[--data_idx_dir DATA_IDX_DIR] [--dali]
[--synthetic_data_size SYNTHETIC_DATA_SIZE] [--lr_init LR_INIT]
[--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--label_smoothing LABEL_SMOOTHING]
[--mixup MIXUP] [--cosine_lr] [--xla]
[--data_format {NHWC,NCHW}] [--amp]
[--static_loss_scale STATIC_LOSS_SCALE]
JoC-RN50v1.5-TF
optional arguments:
-h, --help show this help message and exit.
--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}
Architecture of model to run (to run Resnext-32x4d set
--arch=rensext101-32x4d)
Architecture of model to run.
--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}
The execution mode of the script.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--run_iter RUN_ITER Number of training iterations to run on single run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write model. If undefined,
results dir will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log.
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by training script for DALI.
--gpu_id GPU_ID Specify ID of the target GPU on multi-device platform.
Effective only for single-GPU mode.
--finetune_checkpoint FINETUNE_CHECKPOINT
Path to pre-trained checkpoint which will be used for
fine-tuning.
--use_final_conv Use convolution operator instead of MLP as last layer.
--quant_delay QUANT_DELAY
Number of steps to be run before quantization starts
to happen.
--quantize Quantize weights and activations during training.
(Defaults to asymmetric quantization)
--use_qdq Use QDQV3 op instead of FakeQuantWithMinMaxVars op for
quantization. QDQv3 does only scaling.
--symmetric Quantize weights and activations during training using
symmetric quantization.
Dataset arguments:
--data_dir DATA_DIR Path to dataset in TFRecord format. Files should be
named 'train-*' and 'validation-*'.
--data_idx_dir DATA_IDX_DIR
Path to index files for DALI. Files should be named
'train-*' and 'validation-*'.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write the model. If undefined,
results directory will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--dali Enable DALI data input.
--synthetic_data_size SYNTHETIC_DATA_SIZE
Dimension of image for synthetic dataset.
Training arguments:
--lr_init LR_INIT Initial value for the learning rate.
--lr_warmup_epochs LR_WARMUP_EPOCHS
Number of warmup epochs for the learning rate schedule.
Number of warmup epochs for learning rate schedule.
--weight_decay WEIGHT_DECAY
Weight Decay scale factor.
--weight_init {fan_in,fan_out}
Model weight initialization method.
--momentum MOMENTUM SGD momentum value for the momentum optimizer.
--loss_scale LOSS_SCALE
Loss scale for FP16 training and fast math FP32.
--momentum MOMENTUM SGD momentum value for the Momentum optimizer.
--label_smoothing LABEL_SMOOTHING
The value of label smoothing.
--mixup MIXUP The alpha parameter for mixup (if 0 then mixup is not
applied).
--use_static_loss_scaling
Use static loss scaling in FP16 or FP32 AMP.
--nouse_static_loss_scaling
--use_xla Enable XLA (Accelerated Linear Algebra) computation
--cosine_lr Use cosine learning rate schedule.
Generic optimization arguments:
--xla Enable XLA (Accelerated Linear Algebra) computation
for improved performance.
--nouse_xla
--use_dali Enable DALI data input.
--nouse_dali
--use_tf_amp Enable AMP to speedup FP32
computation using Tensor Cores.
--nouse_tf_amp
--use_cosine_lr Use cosine learning rate schedule.
--nouse_cosine_lr
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by the training script for DALI
--gpu_id GPU_ID Specify the ID of the target GPU on a multi-device platform.
Effective only for single-GPU mode.
--data_format {NHWC,NCHW}
Data format used to do calculations.
--amp Enable Automatic Mixed Precision to speedup
computation using tensor cores.
Automatic Mixed Precision arguments:
--static_loss_scale STATIC_LOSS_SCALE
Use static loss scaling in FP32 AMP.
```
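As an illustration of how these arguments combine, a multi-GPU training command assembled from the flags above could look like the following; the dataset and results paths are placeholders, and the hyperparameter values mirror the DGX-1 AMP training scripts in this repository:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python3 main.py \
    --arch=resnext101-32x4d --mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
    --batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
    --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
    --amp --static_loss_scale 128 --xla --dali \
    --data_dir=/data/tfrecords --data_idx_dir=/data/dali_idx --results_dir=/results
```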
### Inference process
@ -395,7 +416,7 @@ To run inference on a single example with a checkpoint and a model script, use:
`python main.py --arch=resnext101-32x4d --mode predict --model_dir <path to model> --to_predict <path to image> --results_dir <path to results>`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during inference.
The optional `--xla` and `--amp` flags control XLA and AMP during inference.
## Performance
@ -414,7 +435,7 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --use_tf_amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
* For multiple GPUs
* FP32 / TF32
@ -423,16 +444,16 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --use_tf_amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
Each of these scripts runs 200 warm-up iterations and measures the first epoch.
To control warmup and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags. Features like XLA or DALI can be controlled
with `--use_xla` and `--use_dali` flags. If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
For proper throughput reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
with the `--xla` and `--dali` flags. For proper throughput reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
Suggested batch sizes for training are 128 for mixed precision training and 64 for single precision training per single V100 16 GB.
If no `--data_dir=<path to imagenet>` flag is specified, the benchmarks will use a synthetic dataset. The resolution of the synthetic images can be controlled with the `--synthetic_data_size` flag.
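For instance, a training benchmark on synthetic data only (no `--data_dir`) might be run as follows; the batch size, iteration counts, and image resolution are placeholders:
```
python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --amp \
    --warmup_steps 200 --num_iter 500 --iter_unit batch --batch_size 128 \
    --synthetic_data_size 224 --results_dir=/results/benchmark
```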
#### Inference performance benchmark
@ -444,11 +465,10 @@ To benchmark the inference performance on a specific batch size, run:
* AMP
`python ./main.py --arch=resnext101-32x4d --mode=inference_benchmark --use_tf_amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --arch=resnext101-32x4d --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
By default, each of these scripts runs 20 warm-up iterations and measures the next 80 iterations.
To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
For proper throughput and latency reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
The benchmark can be automated with the `inference_benchmark.sh` script provided in `resnext101-32x4d`, by simply running:
@ -457,6 +477,9 @@ The benchmark can be automated with the `inference_benchmark.sh` script provided
The `<data dir>` parameter refers to the input data directory (by default `/data/tfrecords` inside the container).
By default, the benchmark tests the following configurations: **FP32**, **AMP**, **AMP + XLA** with different batch sizes.
When the optional directory with the DALI index files `<data idx dir>` is specified, the benchmark executes an additional **DALI + AMP + XLA** configuration.
For proper throughput reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
For a performance benchmark of the raw model, a synthetic dataset can be used. To use a synthetic dataset, pass the `--synthetic_data_size` flag instead of `--data_dir` to specify the input image size.
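Assuming the script takes the data directory and, optionally, the DALI index directory as positional arguments (as described above), the automated benchmark could be launched as follows; the paths are placeholders:
```
# The second argument is optional and enables the additional DALI + AMP + XLA configuration.
bash ./resnext101-32x4d/inference_benchmark.sh /data/tfrecords /data/dali_idx
```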
### Results
@ -769,6 +792,9 @@ on NVIDIA T4 with (1x T4 16G) GPU.
June 2020
- Initial release
August 2020
- Updated command line argument names
- Added support for synthetic datasets with different image sizes
### Known issues
Performance without XLA enabled is low. We recommend using XLA.
Performance without XLA enabled is low due to a BN + ReLU fusion bug.

View file

@ -22,12 +22,12 @@ function test_configuration() {
}
test_configuration "FP32 nodali noxla"
test_configuration "FP32 nodali xla" "--use_xla"
test_configuration "FP16 nodali noxla" "--use_tf_amp"
test_configuration "FP16 nodali xla" "--use_tf_amp --use_xla"
test_configuration "FP32 nodali xla" "--xla"
test_configuration "FP16 nodali noxla" "--amp"
test_configuration "FP16 nodali xla" "--amp --xla"
if [ ! -z $DALI_DIR ]; then
test_configuration "FP16 dali xla" "--use_tf_amp --use_xla --use_dali --data_idx_dir ${DALI_DIR}"
test_configuration "FP16 dali xla" "--amp --xla --dali --data_idx_dir ${DALI_DIR}"
fi
cat $INFERENCE_BENCHMARK

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

Some files were not shown because too many files have changed in this diff