Merge pull request #920 from NVIDIA/gh/release

Gh/release
nv-kkudrynski 2021-04-20 13:54:27 +02:00 committed by GitHub
commit bd257e1494
173 changed files with 124585 additions and 1565 deletions


@@ -30,7 +30,7 @@ The following table provides links to where you can find additional information
## Validation accuracy results
Our results were obtained by running the applicable
-training scripts in the [framework-container-name] NGC container
+training scripts in the 20.12 PyTorch NGC container
on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
The specific training script that was run is documented
in the corresponding model's README.
@@ -56,49 +56,48 @@ three classification models side-by-side.
Our results were obtained by running the applicable
-training scripts in the pytorch-20.12 NGC container
+training scripts in the 21.03 PyTorch NGC container
on NVIDIA DGX A100 with (8x A100 80GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
-The following table shows the training accuracy results of the
-three classification models side-by-side.
+The following table shows the training performance results of
+all the classification models side-by-side.
| **Model** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** |
|:----------------------:|:-------------------:|:----------:|:---------------------------:|
| efficientnet-b0 | 14391 img/s | 8225 img/s | 1.74 x |
| efficientnet-b4 | 2341 img/s | 1204 img/s | 1.94 x |
| efficientnet-widese-b0 | 15053 img/s | 8233 img/s | 1.82 x |
| efficientnet-widese-b4 | 2339 img/s | 1202 img/s | 1.94 x |
| resnet50 | 15977 img/s | 7365 img/s | 2.16 x |
| resnext101-32x4d | 7399 img/s | 3193 img/s | 2.31 x |
| se-resnext101-32x4d | 5248 img/s | 2665 img/s | 1.96 x |
| efficientnet-b0 | 16652 img/s | 8193 img/s | 2.03 x |
| efficientnet-b4 | 2570 img/s | 1223 img/s | 2.1 x |
| efficientnet-widese-b0 | 16368 img/s | 8244 img/s | 1.98 x |
| efficientnet-widese-b4 | 2585 img/s | 1223 img/s | 2.11 x |
| resnet50 | 16621 img/s | 7248 img/s | 2.29 x |
| resnext101-32x4d | 7925 img/s | 3471 img/s | 2.28 x |
| se-resnext101-32x4d | 5779 img/s | 2991 img/s | 1.93 x |
### Training performance: NVIDIA DGX-1 16G (8x V100 16GB)
Our results were obtained by running the applicable
-training scripts in the pytorch-20.12 NGC container
+training scripts in the 21.03 PyTorch NGC container
on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
Performance numbers (in images per second)
were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
-The following table shows the training accuracy results of the
-three classification models side-by-side.
+The following table shows the training performance results of all the
+classification models side-by-side.
| **Model** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** |
|:----------------------:|:-------------------:|:----------:|:---------------------------:|
| efficientnet-b0 | 7664 img/s | 4571 img/s | 1.67 x |
| efficientnet-b4 | 1330 img/s | 598 img/s | 2.22 x |
| efficientnet-widese-b0 | 7694 img/s | 4489 img/s | 1.71 x |
| efficientnet-widese-b4 | 1323 img/s | 590 img/s | 2.24 x |
| resnet50 | 7608 img/s | 2851 img/s | 2.66 x |
| resnext101-32x4d | 3742 img/s | 1117 img/s | 3.34 x |
| se-resnext101-32x4d | 2716 img/s | 994 img/s | 2.73 x |
| efficientnet-b0 | 7789 img/s | 4672 img/s | 1.66 x |
| efficientnet-b4 | 1366 img/s | 616 img/s | 2.21 x |
| efficientnet-widese-b0 | 7875 img/s | 4592 img/s | 1.71 x |
| efficientnet-widese-b4 | 1356 img/s | 612 img/s | 2.21 x |
| resnet50 | 8322 img/s | 2855 img/s | 2.91 x |
| resnext101-32x4d | 4065 img/s | 1133 img/s | 3.58 x |
| se-resnext101-32x4d | 2971 img/s | 1004 img/s | 2.95 x |
## Model Comparison


@@ -520,7 +520,7 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
-Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -562,226 +562,234 @@ The following images show an A100 run.
##### Training performance: NVIDIA A100 (8x A100 80GB)
-Our results were obtained by running the applicable `efficientnet/training/<AMP|TF32>/*.sh` training script in the PyTorch 20.12 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/training/<AMP|TF32>/*.sh` training script in the PyTorch 21.03 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
| **Model** | **GPUs** | **TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 1082 img/s | 2364 img/s | 2.18 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 8225 img/s | 14391 img/s | 1.74 x | 7.59 x | 6.08 x |
| efficientnet-b4 | 1 | 154 img/s | 300 img/s | 1.94 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 1204 img/s | 2341 img/s | 1.94 x | 7.8 x | 7.8 x |
| efficientnet-widese-b0 | 1 | 1081 img/s | 2368 img/s | 2.19 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 8233 img/s | 15053 img/s | 1.82 x | 7.61 x | 6.35 x |
| efficientnet-widese-b4 | 1 | 154 img/s | 299 img/s | 1.94 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 1202 img/s | 2339 img/s | 1.94 x | 7.8 x | 7.81 x |
| **Model** | **GPUs** | **TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:-----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 1078 img/s | 2489 img/s | 2.3 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 8193 img/s | 16652 img/s | 2.03 x | 7.59 x | 6.68 x |
| efficientnet-b0 | 16 | 16137 img/s | 29332 img/s | 1.81 x | 14.96 x | 11.78 x |
| efficientnet-b4 | 1 | 157 img/s | 331 img/s | 2.1 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 1223 img/s | 2570 img/s | 2.1 x | 7.76 x | 7.75 x |
| efficientnet-b4 | 16 | 2417 img/s | 4813 img/s | 1.99 x | 15.34 x | 14.51 x |
| efficientnet-b4 | 32 | 4813 img/s | 9425 img/s | 1.95 x | 30.55 x | 28.42 x |
| efficientnet-b4 | 64 | 9146 img/s | 18900 img/s | 2.06 x | 58.05 x | 57.0 x |
| efficientnet-widese-b0 | 1 | 1078 img/s | 2512 img/s | 2.32 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 8244 img/s | 16368 img/s | 1.98 x | 7.64 x | 6.51 x |
| efficientnet-widese-b0 | 16 | 16062 img/s | 29798 img/s | 1.85 x | 14.89 x | 11.86 x |
| efficientnet-widese-b4 | 1 | 157 img/s | 331 img/s | 2.1 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 1223 img/s | 2585 img/s | 2.11 x | 7.77 x | 7.8 x |
| efficientnet-widese-b4 | 16 | 2399 img/s | 5041 img/s | 2.1 x | 15.24 x | 15.21 x |
| efficientnet-widese-b4 | 32 | 4616 img/s | 9379 img/s | 2.03 x | 29.32 x | 28.3 x |
| efficientnet-widese-b4 | 64 | 9140 img/s | 18516 img/s | 2.02 x | 58.07 x | 55.88 x |
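The "Strong Scaling" columns follow the same pattern: throughput at N GPUs divided by single-GPU throughput at the same precision. A small sketch, again with an illustrative helper name of our own and values copied from the table above:

```python
# Strong scaling = throughput with N GPUs / throughput with 1 GPU (same precision).
def strong_scaling(throughput_n_gpus: float, throughput_1_gpu: float) -> float:
    return round(throughput_n_gpus / throughput_1_gpu, 2)

# Example: efficientnet-b0, mixed precision, 8x A100 vs 1x A100 (values from the table above).
print(strong_scaling(16652, 2489))  # ~6.69; the table reports 6.68, presumably from unrounded measurements
```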
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
| **Model** | **GPUs** | **FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 652 img/s | 1254 img/s | 1.92 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4571 img/s | 7664 img/s | 1.67 x | 7.0 x | 6.1 x |
| efficientnet-b4 | 1 | 80 img/s | 199 img/s | 2.47 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 598 img/s | 1330 img/s | 2.22 x | 7.42 x | 6.67 x |
| efficientnet-widese-b0 | 1 | 654 img/s | 1255 img/s | 1.91 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4489 img/s | 7694 img/s | 1.71 x | 6.85 x | 6.12 x |
| efficientnet-widese-b4 | 1 | 79 img/s | 198 img/s | 2.51 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 590 img/s | 1323 img/s | 2.24 x | 7.46 x | 6.65 x |
| efficientnet-b0 | 1 | 655 img/s | 1301 img/s | 1.98 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4672 img/s | 7789 img/s | 1.66 x | 7.12 x | 5.98 x |
| efficientnet-b4 | 1 | 83 img/s | 204 img/s | 2.46 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 616 img/s | 1366 img/s | 2.21 x | 7.41 x | 6.67 x |
| efficientnet-widese-b0 | 1 | 655 img/s | 1299 img/s | 1.98 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4592 img/s | 7875 img/s | 1.71 x | 7.0 x | 6.05 x |
| efficientnet-widese-b4 | 1 | 83 img/s | 204 img/s | 2.45 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 612 img/s | 1356 img/s | 2.21 x | 7.34 x | 6.63 x |
##### Training performance: NVIDIA DGX-1 (8x V100 32GB)
-Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/training/<AMP|FP32>/*.sh` training script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
| **Model** | **GPUs** | **FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** |
|:----------------------:|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|
| efficientnet-b0 | 1 | 637 img/s | 1352 img/s | 2.12 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4834 img/s | 8645 img/s | 1.78 x | 7.58 x | 6.39 x |
| efficientnet-b4 | 1 | 84 img/s | 200 img/s | 2.38 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 632 img/s | 1519 img/s | 2.4 x | 7.53 x | 7.58 x |
| efficientnet-widese-b0 | 1 | 637 img/s | 1349 img/s | 2.11 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4841 img/s | 8693 img/s | 1.79 x | 7.59 x | 6.43 x |
| efficientnet-widese-b4 | 1 | 83 img/s | 200 img/s | 2.38 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 627 img/s | 1508 img/s | 2.4 x | 7.47 x | 7.53 x |
| efficientnet-b0 | 1 | 646 img/s | 1401 img/s | 2.16 x | 1.0 x | 1.0 x |
| efficientnet-b0 | 8 | 4937 img/s | 8615 img/s | 1.74 x | 7.63 x | 6.14 x |
| efficientnet-b4 | 1 | 36 img/s | 89 img/s | 2.44 x | 1.0 x | 1.0 x |
| efficientnet-b4 | 8 | 641 img/s | 1565 img/s | 2.44 x | 17.6 x | 17.57 x |
| efficientnet-widese-b0 | 1 | 281 img/s | 603 img/s | 2.14 x | 1.0 x | 1.0 x |
| efficientnet-widese-b0 | 8 | 4924 img/s | 8870 img/s | 1.8 x | 17.49 x | 14.7 x |
| efficientnet-widese-b4 | 1 | 36 img/s | 89 img/s | 2.45 x | 1.0 x | 1.0 x |
| efficientnet-widese-b4 | 8 | 639 img/s | 1556 img/s | 2.43 x | 17.61 x | 17.44 x |
#### Inference performance results
##### Inference performance: NVIDIA A100 (1x A100 80GB)
-Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
###### TF32 Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 122 img/s | 10.04 ms | 8.59 ms | 10.2 ms |
| efficientnet-b0 | 2 | 249 img/s | 9.91 ms | 9.08 ms | 10.84 ms |
| efficientnet-b0 | 4 | 472 img/s | 10.31 ms | 9.67 ms | 11.25 ms |
| efficientnet-b0 | 8 | 922 img/s | 10.67 ms | 10.76 ms | 12.13 ms |
| efficientnet-b0 | 16 | 1796 img/s | 10.86 ms | 11.1 ms | 13.01 ms |
| efficientnet-b0 | 32 | 3235 img/s | 12.05 ms | 13.28 ms | 15.07 ms |
| efficientnet-b0 | 64 | 4658 img/s | 16.27 ms | 14.56 ms | 16.18 ms |
| efficientnet-b0 | 128 | 4911 img/s | 31.51 ms | 26.24 ms | 27.29 ms |
| efficientnet-b0 | 256 | 5015 img/s | 62.64 ms | 50.81 ms | 55.6 ms |
| efficientnet-b4 | 1 | 63 img/s | 17.64 ms | 16.29 ms | 17.92 ms |
| efficientnet-b4 | 2 | 122 img/s | 18.27 ms | 18.12 ms | 22.32 ms |
| efficientnet-b4 | 4 | 247 img/s | 18.25 ms | 17.79 ms | 21.02 ms |
| efficientnet-b4 | 8 | 469 img/s | 19.03 ms | 18.94 ms | 22.49 ms |
| efficientnet-b4 | 16 | 572 img/s | 29.95 ms | 28.14 ms | 28.99 ms |
| efficientnet-b4 | 32 | 638 img/s | 52.25 ms | 50.24 ms | 50.5 ms |
| efficientnet-b4 | 64 | 680 img/s | 96.93 ms | 94.1 ms | 94.3 ms |
| efficientnet-b4 | 128 | 672 img/s | 197.49 ms | 189.69 ms | 189.91 ms |
| efficientnet-b4 | 256 | 679 img/s | 392.15 ms | 374.18 ms | 386.85 ms |
| efficientnet-widese-b0 | 1 | 120 img/s | 10.21 ms | 8.61 ms | 11.37 ms |
| efficientnet-widese-b0 | 2 | 242 img/s | 10.16 ms | 9.98 ms | 11.36 ms |
| efficientnet-widese-b0 | 4 | 493 img/s | 9.97 ms | 8.92 ms | 10.23 ms |
| efficientnet-widese-b0 | 8 | 913 img/s | 10.77 ms | 10.58 ms | 12.11 ms |
| efficientnet-widese-b0 | 16 | 1864 img/s | 10.54 ms | 10.34 ms | 11.69 ms |
| efficientnet-widese-b0 | 32 | 3218 img/s | 12.06 ms | 13.17 ms | 15.69 ms |
| efficientnet-widese-b0 | 64 | 4625 img/s | 16.4 ms | 15.35 ms | 17.86 ms |
| efficientnet-widese-b0 | 128 | 4904 img/s | 31.84 ms | 26.22 ms | 28.69 ms |
| efficientnet-widese-b0 | 256 | 5013 img/s | 63.1 ms | 50.95 ms | 52.44 ms |
| efficientnet-widese-b4 | 1 | 64 img/s | 17.51 ms | 16.5 ms | 20.03 ms |
| efficientnet-widese-b4 | 2 | 125 img/s | 17.86 ms | 17.24 ms | 19.27 ms |
| efficientnet-widese-b4 | 4 | 248 img/s | 18.09 ms | 17.36 ms | 21.34 ms |
| efficientnet-widese-b4 | 8 | 472 img/s | 18.92 ms | 18.33 ms | 20.68 ms |
| efficientnet-widese-b4 | 16 | 569 img/s | 30.11 ms | 28.18 ms | 28.45 ms |
| efficientnet-widese-b4 | 32 | 628 img/s | 53.05 ms | 51.11 ms | 51.29 ms |
| efficientnet-widese-b4 | 64 | 679 img/s | 97.17 ms | 94.22 ms | 94.43 ms |
| efficientnet-widese-b4 | 128 | 672 img/s | 197.74 ms | 189.93 ms | 190.95 ms |
| efficientnet-widese-b4 | 256 | 679 img/s | 392.7 ms | 373.84 ms | 378.35 ms |
| efficientnet-b0 | 1 | 130 img/s | 9.33 ms | 7.95 ms | 9.0 ms |
| efficientnet-b0 | 2 | 262 img/s | 9.39 ms | 8.51 ms | 9.5 ms |
| efficientnet-b0 | 4 | 503 img/s | 9.68 ms | 9.53 ms | 10.78 ms |
| efficientnet-b0 | 8 | 1004 img/s | 9.85 ms | 9.89 ms | 11.49 ms |
| efficientnet-b0 | 16 | 1880 img/s | 10.27 ms | 10.34 ms | 11.19 ms |
| efficientnet-b0 | 32 | 3401 img/s | 11.46 ms | 12.51 ms | 14.39 ms |
| efficientnet-b0 | 64 | 4656 img/s | 19.58 ms | 14.52 ms | 16.63 ms |
| efficientnet-b0 | 128 | 5001 img/s | 31.03 ms | 25.72 ms | 28.34 ms |
| efficientnet-b0 | 256 | 5154 img/s | 60.71 ms | 49.44 ms | 54.99 ms |
| efficientnet-b4 | 1 | 69 img/s | 16.22 ms | 14.87 ms | 15.34 ms |
| efficientnet-b4 | 2 | 133 img/s | 16.84 ms | 16.49 ms | 17.72 ms |
| efficientnet-b4 | 4 | 259 img/s | 17.33 ms | 16.39 ms | 19.67 ms |
| efficientnet-b4 | 8 | 491 img/s | 18.22 ms | 18.09 ms | 19.51 ms |
| efficientnet-b4 | 16 | 606 img/s | 28.28 ms | 26.55 ms | 26.84 ms |
| efficientnet-b4 | 32 | 651 img/s | 51.08 ms | 49.39 ms | 49.61 ms |
| efficientnet-b4 | 64 | 684 img/s | 96.23 ms | 93.54 ms | 93.78 ms |
| efficientnet-b4 | 128 | 700 img/s | 195.22 ms | 182.17 ms | 182.42 ms |
| efficientnet-b4 | 256 | 702 img/s | 380.01 ms | 361.81 ms | 371.64 ms |
| efficientnet-widese-b0 | 1 | 130 img/s | 9.49 ms | 8.76 ms | 9.68 ms |
| efficientnet-widese-b0 | 2 | 265 img/s | 9.25 ms | 8.51 ms | 9.75 ms |
| efficientnet-widese-b0 | 4 | 520 img/s | 9.42 ms | 8.67 ms | 9.97 ms |
| efficientnet-widese-b0 | 8 | 996 img/s | 12.27 ms | 9.69 ms | 11.31 ms |
| efficientnet-widese-b0 | 16 | 1916 img/s | 10.2 ms | 10.29 ms | 11.3 ms |
| efficientnet-widese-b0 | 32 | 3293 img/s | 11.71 ms | 13.0 ms | 14.57 ms |
| efficientnet-widese-b0 | 64 | 4639 img/s | 16.21 ms | 14.61 ms | 16.29 ms |
| efficientnet-widese-b0 | 128 | 4997 img/s | 30.81 ms | 25.76 ms | 26.02 ms |
| efficientnet-widese-b0 | 256 | 5166 img/s | 73.68 ms | 49.39 ms | 55.74 ms |
| efficientnet-widese-b4 | 1 | 68 img/s | 16.41 ms | 15.14 ms | 16.59 ms |
| efficientnet-widese-b4 | 2 | 135 img/s | 16.65 ms | 15.52 ms | 17.93 ms |
| efficientnet-widese-b4 | 4 | 251 img/s | 17.74 ms | 17.29 ms | 20.47 ms |
| efficientnet-widese-b4 | 8 | 501 img/s | 17.75 ms | 17.12 ms | 18.01 ms |
| efficientnet-widese-b4 | 16 | 590 img/s | 28.94 ms | 27.29 ms | 27.81 ms |
| efficientnet-widese-b4 | 32 | 651 img/s | 50.96 ms | 49.34 ms | 49.55 ms |
| efficientnet-widese-b4 | 64 | 683 img/s | 99.28 ms | 93.65 ms | 93.88 ms |
| efficientnet-widese-b4 | 128 | 700 img/s | 189.81 ms | 182.3 ms | 182.58 ms |
| efficientnet-widese-b4 | 256 | 702 img/s | 379.36 ms | 361.84 ms | 366.05 ms |
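A rough way to cross-check rows in these latency tables is that throughput is approximately batch size divided by average latency; the two columns are measured independently, so they only agree approximately. A sketch, with values taken from the efficientnet-b4, batch-size-64 row above (the helper name is ours):

```python
# Rough cross-check: throughput ~= batch_size / average latency.
def approx_throughput(batch_size: int, latency_avg_ms: float) -> float:
    return batch_size / (latency_avg_ms / 1000.0)

# Example: efficientnet-b4, batch size 64, TF32 row from the table above.
print(round(approx_throughput(64, 96.23)))  # ~665 img/s vs. the 684 img/s reported in the table
```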
###### Mixed Precision Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 99 img/s | 11.89 ms | 10.83 ms | 13.04 ms |
| efficientnet-b0 | 2 | 208 img/s | 11.43 ms | 10.15 ms | 10.87 ms |
| efficientnet-b0 | 4 | 395 img/s | 12.0 ms | 11.01 ms | 12.8 ms |
| efficientnet-b0 | 8 | 763 img/s | 12.33 ms | 11.62 ms | 13.94 ms |
| efficientnet-b0 | 16 | 1499 img/s | 12.58 ms | 12.57 ms | 14.4 ms |
| efficientnet-b0 | 32 | 2875 img/s | 13.19 ms | 13.76 ms | 15.29 ms |
| efficientnet-b0 | 64 | 5841 img/s | 13.7 ms | 14.91 ms | 18.73 ms |
| efficientnet-b0 | 128 | 7850 img/s | 21.53 ms | 16.58 ms | 18.94 ms |
| efficientnet-b0 | 256 | 8285 img/s | 42.07 ms | 30.87 ms | 38.03 ms |
| efficientnet-b4 | 1 | 51 img/s | 21.2 ms | 19.73 ms | 21.47 ms |
| efficientnet-b4 | 2 | 103 img/s | 21.17 ms | 20.91 ms | 24.17 ms |
| efficientnet-b4 | 4 | 205 img/s | 21.34 ms | 20.32 ms | 23.46 ms |
| efficientnet-b4 | 8 | 376 img/s | 23.11 ms | 22.64 ms | 24.77 ms |
| efficientnet-b4 | 16 | 781 img/s | 22.42 ms | 23.03 ms | 25.37 ms |
| efficientnet-b4 | 32 | 1048 img/s | 32.52 ms | 30.76 ms | 31.65 ms |
| efficientnet-b4 | 64 | 1156 img/s | 58.31 ms | 55.45 ms | 56.89 ms |
| efficientnet-b4 | 128 | 1197 img/s | 112.92 ms | 106.69 ms | 107.84 ms |
| efficientnet-b4 | 256 | 1229 img/s | 220.5 ms | 206.68 ms | 223.16 ms |
| efficientnet-widese-b0 | 1 | 100 img/s | 11.75 ms | 10.62 ms | 13.67 ms |
| efficientnet-widese-b0 | 2 | 200 img/s | 11.86 ms | 11.38 ms | 14.32 ms |
| efficientnet-widese-b0 | 4 | 400 img/s | 11.81 ms | 10.8 ms | 13.8 ms |
| efficientnet-widese-b0 | 8 | 770 img/s | 12.17 ms | 11.2 ms | 12.38 ms |
| efficientnet-widese-b0 | 16 | 1501 img/s | 12.62 ms | 12.12 ms | 14.94 ms |
| efficientnet-widese-b0 | 32 | 2901 img/s | 13.06 ms | 13.28 ms | 15.23 ms |
| efficientnet-widese-b0 | 64 | 5853 img/s | 13.69 ms | 14.38 ms | 16.91 ms |
| efficientnet-widese-b0 | 128 | 7807 img/s | 21.43 ms | 16.63 ms | 21.8 ms |
| efficientnet-widese-b0 | 256 | 8270 img/s | 42.01 ms | 30.97 ms | 34.55 ms |
| efficientnet-widese-b4 | 1 | 52 img/s | 21.03 ms | 19.9 ms | 22.23 ms |
| efficientnet-widese-b4 | 2 | 102 img/s | 21.34 ms | 21.6 ms | 24.23 ms |
| efficientnet-widese-b4 | 4 | 200 img/s | 21.76 ms | 21.19 ms | 23.69 ms |
| efficientnet-widese-b4 | 8 | 373 img/s | 23.31 ms | 22.99 ms | 28.33 ms |
| efficientnet-widese-b4 | 16 | 763 img/s | 22.93 ms | 23.75 ms | 26.6 ms |
| efficientnet-widese-b4 | 32 | 1043 img/s | 32.7 ms | 31.03 ms | 33.52 ms |
| efficientnet-widese-b4 | 64 | 1152 img/s | 58.27 ms | 55.64 ms | 55.86 ms |
| efficientnet-widese-b4 | 128 | 1197 img/s | 112.86 ms | 106.72 ms | 108.65 ms |
| efficientnet-widese-b4 | 256 | 1229 img/s | 221.11 ms | 206.5 ms | 221.37 ms |
| efficientnet-b0 | 1 | 105 img/s | 11.21 ms | 9.9 ms | 12.55 ms |
| efficientnet-b0 | 2 | 214 img/s | 11.01 ms | 10.06 ms | 11.89 ms |
| efficientnet-b0 | 4 | 412 img/s | 11.45 ms | 11.73 ms | 13.0 ms |
| efficientnet-b0 | 8 | 803 img/s | 11.78 ms | 11.59 ms | 14.2 ms |
| efficientnet-b0 | 16 | 1584 img/s | 11.89 ms | 11.9 ms | 13.63 ms |
| efficientnet-b0 | 32 | 2915 img/s | 13.03 ms | 14.79 ms | 17.35 ms |
| efficientnet-b0 | 64 | 6315 img/s | 12.71 ms | 13.59 ms | 15.27 ms |
| efficientnet-b0 | 128 | 9311 img/s | 18.78 ms | 15.34 ms | 17.99 ms |
| efficientnet-b0 | 256 | 10239 img/s | 39.05 ms | 24.97 ms | 29.24 ms |
| efficientnet-b4 | 1 | 53 img/s | 20.45 ms | 19.06 ms | 20.36 ms |
| efficientnet-b4 | 2 | 109 img/s | 20.01 ms | 19.74 ms | 21.5 ms |
| efficientnet-b4 | 4 | 212 img/s | 20.6 ms | 19.88 ms | 22.37 ms |
| efficientnet-b4 | 8 | 416 img/s | 21.02 ms | 21.46 ms | 24.82 ms |
| efficientnet-b4 | 16 | 816 img/s | 21.53 ms | 22.91 ms | 26.06 ms |
| efficientnet-b4 | 32 | 1208 img/s | 28.4 ms | 26.77 ms | 28.3 ms |
| efficientnet-b4 | 64 | 1332 img/s | 50.55 ms | 48.23 ms | 48.49 ms |
| efficientnet-b4 | 128 | 1418 img/s | 95.84 ms | 90.12 ms | 95.76 ms |
| efficientnet-b4 | 256 | 1442 img/s | 191.48 ms | 176.19 ms | 189.04 ms |
| efficientnet-widese-b0 | 1 | 104 img/s | 11.28 ms | 10.0 ms | 12.72 ms |
| efficientnet-widese-b0 | 2 | 206 img/s | 11.41 ms | 10.65 ms | 12.72 ms |
| efficientnet-widese-b0 | 4 | 426 img/s | 11.15 ms | 10.23 ms | 11.03 ms |
| efficientnet-widese-b0 | 8 | 794 img/s | 11.9 ms | 12.68 ms | 14.17 ms |
| efficientnet-widese-b0 | 16 | 1536 img/s | 12.32 ms | 13.22 ms | 14.57 ms |
| efficientnet-widese-b0 | 32 | 2876 img/s | 14.12 ms | 14.45 ms | 16.23 ms |
| efficientnet-widese-b0 | 64 | 6183 img/s | 13.02 ms | 14.19 ms | 16.68 ms |
| efficientnet-widese-b0 | 128 | 9310 img/s | 20.06 ms | 15.24 ms | 17.84 ms |
| efficientnet-widese-b0 | 256 | 10193 img/s | 36.07 ms | 25.13 ms | 34.22 ms |
| efficientnet-widese-b4 | 1 | 53 img/s | 20.24 ms | 19.05 ms | 19.91 ms |
| efficientnet-widese-b4 | 2 | 109 img/s | 20.98 ms | 19.24 ms | 22.58 ms |
| efficientnet-widese-b4 | 4 | 213 img/s | 20.48 ms | 20.48 ms | 23.64 ms |
| efficientnet-widese-b4 | 8 | 425 img/s | 20.57 ms | 20.26 ms | 22.44 ms |
| efficientnet-widese-b4 | 16 | 800 img/s | 21.93 ms | 23.15 ms | 26.51 ms |
| efficientnet-widese-b4 | 32 | 1201 img/s | 28.51 ms | 26.89 ms | 28.13 ms |
| efficientnet-widese-b4 | 64 | 1322 img/s | 50.96 ms | 48.58 ms | 48.77 ms |
| efficientnet-widese-b4 | 128 | 1417 img/s | 96.45 ms | 90.17 ms | 90.43 ms |
| efficientnet-widese-b4 | 256 | 1439 img/s | 190.06 ms | 176.59 ms | 188.51 ms |
##### Inference performance: NVIDIA V100 (1x V100 16GB)
-Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 20.12 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the applicable `efficientnet/inference/<AMP|FP32>/*.sh` inference script in the PyTorch 21.03 NGC container on NVIDIA DGX-1 (8x V100 16GB) GPUs.
###### FP32 Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 77 img/s | 14.23 ms | 13.31 ms | 14.68 ms |
| efficientnet-b0 | 2 | 153 img/s | 14.46 ms | 13.67 ms | 14.69 ms |
| efficientnet-b0 | 4 | 317 img/s | 14.06 ms | 15.77 ms | 17.28 ms |
| efficientnet-b0 | 8 | 646 img/s | 13.88 ms | 14.32 ms | 15.05 ms |
| efficientnet-b0 | 16 | 1217 img/s | 14.74 ms | 15.89 ms | 18.03 ms |
| efficientnet-b0 | 32 | 2162 img/s | 16.51 ms | 17.9 ms | 20.06 ms |
| efficientnet-b0 | 64 | 2716 img/s | 25.74 ms | 23.64 ms | 24.08 ms |
| efficientnet-b0 | 128 | 2816 img/s | 50.21 ms | 45.43 ms | 46.3 ms |
| efficientnet-b0 | 256 | 2955 img/s | 96.46 ms | 85.96 ms | 92.74 ms |
| efficientnet-b4 | 1 | 38 img/s | 27.73 ms | 27.98 ms | 29.45 ms |
| efficientnet-b4 | 2 | 84 img/s | 25.1 ms | 24.6 ms | 26.29 ms |
| efficientnet-b4 | 4 | 170 img/s | 25.01 ms | 24.84 ms | 26.52 ms |
| efficientnet-b4 | 8 | 304 img/s | 27.75 ms | 26.28 ms | 27.71 ms |
| efficientnet-b4 | 16 | 334 img/s | 49.51 ms | 47.98 ms | 48.46 ms |
| efficientnet-b4 | 32 | 353 img/s | 92.42 ms | 90.81 ms | 91.0 ms |
| efficientnet-b4 | 64 | 380 img/s | 170.58 ms | 168.32 ms | 168.8 ms |
| efficientnet-b4 | 128 | 381 img/s | 343.03 ms | 334.58 ms | 334.94 ms |
| efficientnet-widese-b0 | 1 | 83 img/s | 13.38 ms | 13.14 ms | 13.58 ms |
| efficientnet-widese-b0 | 2 | 149 img/s | 14.82 ms | 15.09 ms | 16.03 ms |
| efficientnet-widese-b0 | 4 | 319 img/s | 13.91 ms | 13.06 ms | 13.96 ms |
| efficientnet-widese-b0 | 8 | 566 img/s | 15.62 ms | 16.3 ms | 17.5 ms |
| efficientnet-widese-b0 | 16 | 1211 img/s | 14.85 ms | 15.97 ms | 18.8 ms |
| efficientnet-widese-b0 | 32 | 2055 img/s | 17.33 ms | 19.54 ms | 21.59 ms |
| efficientnet-widese-b0 | 64 | 2707 img/s | 25.66 ms | 23.72 ms | 23.93 ms |
| efficientnet-widese-b0 | 128 | 2811 img/s | 49.93 ms | 45.46 ms | 45.51 ms |
| efficientnet-widese-b0 | 256 | 2953 img/s | 96.43 ms | 86.11 ms | 87.33 ms |
| efficientnet-widese-b4 | 1 | 44 img/s | 24.16 ms | 23.16 ms | 25.41 ms |
| efficientnet-widese-b4 | 2 | 89 img/s | 23.95 ms | 23.39 ms | 25.93 ms |
| efficientnet-widese-b4 | 4 | 169 img/s | 25.35 ms | 25.15 ms | 30.58 ms |
| efficientnet-widese-b4 | 8 | 279 img/s | 30.27 ms | 31.76 ms | 33.37 ms |
| efficientnet-widese-b4 | 16 | 331 img/s | 49.84 ms | 48.32 ms | 48.75 ms |
| efficientnet-widese-b4 | 32 | 353 img/s | 92.31 ms | 90.81 ms | 90.95 ms |
| efficientnet-widese-b4 | 64 | 375 img/s | 172.79 ms | 170.49 ms | 170.69 ms |
| efficientnet-widese-b4 | 128 | 381 img/s | 342.33 ms | 334.91 ms | 335.23 ms |
| efficientnet-b0 | 1 | 83 img/s | 13.15 ms | 13.23 ms | 14.11 ms |
| efficientnet-b0 | 2 | 167 img/s | 13.17 ms | 13.46 ms | 14.39 ms |
| efficientnet-b0 | 4 | 332 img/s | 13.25 ms | 13.29 ms | 14.85 ms |
| efficientnet-b0 | 8 | 657 img/s | 13.42 ms | 13.86 ms | 15.77 ms |
| efficientnet-b0 | 16 | 1289 img/s | 13.78 ms | 15.02 ms | 16.99 ms |
| efficientnet-b0 | 32 | 2140 img/s | 16.46 ms | 18.92 ms | 22.2 ms |
| efficientnet-b0 | 64 | 2743 img/s | 25.14 ms | 23.44 ms | 23.79 ms |
| efficientnet-b0 | 128 | 2908 img/s | 48.03 ms | 43.98 ms | 45.36 ms |
| efficientnet-b0 | 256 | 2968 img/s | 94.86 ms | 85.62 ms | 91.01 ms |
| efficientnet-b4 | 1 | 45 img/s | 23.31 ms | 23.3 ms | 24.9 ms |
| efficientnet-b4 | 2 | 87 img/s | 24.07 ms | 23.81 ms | 25.14 ms |
| efficientnet-b4 | 4 | 160 img/s | 26.29 ms | 26.78 ms | 30.85 ms |
| efficientnet-b4 | 8 | 316 img/s | 26.65 ms | 26.44 ms | 28.61 ms |
| efficientnet-b4 | 16 | 341 img/s | 48.18 ms | 46.9 ms | 47.13 ms |
| efficientnet-b4 | 32 | 365 img/s | 89.07 ms | 87.83 ms | 88.02 ms |
| efficientnet-b4 | 64 | 374 img/s | 173.2 ms | 171.61 ms | 172.27 ms |
| efficientnet-b4 | 128 | 376 img/s | 346.32 ms | 339.74 ms | 340.37 ms |
| efficientnet-widese-b0 | 1 | 82 img/s | 13.37 ms | 12.95 ms | 13.89 ms |
| efficientnet-widese-b0 | 2 | 168 img/s | 13.11 ms | 12.45 ms | 13.94 ms |
| efficientnet-widese-b0 | 4 | 346 img/s | 12.73 ms | 12.22 ms | 12.95 ms |
| efficientnet-widese-b0 | 8 | 674 img/s | 13.07 ms | 12.75 ms | 14.93 ms |
| efficientnet-widese-b0 | 16 | 1235 img/s | 14.3 ms | 15.05 ms | 16.53 ms |
| efficientnet-widese-b0 | 32 | 2194 img/s | 15.99 ms | 17.37 ms | 19.01 ms |
| efficientnet-widese-b0 | 64 | 2747 img/s | 25.05 ms | 23.38 ms | 23.71 ms |
| efficientnet-widese-b0 | 128 | 2906 img/s | 48.05 ms | 44.0 ms | 44.59 ms |
| efficientnet-widese-b0 | 256 | 2962 img/s | 95.14 ms | 85.86 ms | 86.25 ms |
| efficientnet-widese-b4 | 1 | 43 img/s | 24.28 ms | 25.24 ms | 27.36 ms |
| efficientnet-widese-b4 | 2 | 87 img/s | 24.04 ms | 24.38 ms | 26.01 ms |
| efficientnet-widese-b4 | 4 | 169 img/s | 24.96 ms | 25.8 ms | 27.14 ms |
| efficientnet-widese-b4 | 8 | 307 img/s | 27.39 ms | 28.4 ms | 30.7 ms |
| efficientnet-widese-b4 | 16 | 342 img/s | 48.05 ms | 46.74 ms | 46.9 ms |
| efficientnet-widese-b4 | 32 | 363 img/s | 89.44 ms | 88.23 ms | 88.39 ms |
| efficientnet-widese-b4 | 64 | 373 img/s | 173.47 ms | 172.01 ms | 172.36 ms |
| efficientnet-widese-b4 | 128 | 376 img/s | 347.18 ms | 340.09 ms | 340.45 ms |
###### Mixed Precision Inference Latency
| **Model** | **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:----------------------:|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| efficientnet-b0 | 1 | 66 img/s | 16.38 ms | 15.63 ms | 17.01 ms |
| efficientnet-b0 | 2 | 120 img/s | 18.0 ms | 18.39 ms | 19.35 ms |
| efficientnet-b0 | 4 | 244 img/s | 17.77 ms | 18.98 ms | 21.4 ms |
| efficientnet-b0 | 8 | 506 img/s | 17.26 ms | 18.23 ms | 20.24 ms |
| efficientnet-b0 | 16 | 912 img/s | 19.07 ms | 20.33 ms | 22.59 ms |
| efficientnet-b0 | 32 | 1758 img/s | 20.3 ms | 22.2 ms | 24.7 ms |
| efficientnet-b0 | 64 | 3720 img/s | 19.18 ms | 20.09 ms | 21.48 ms |
| efficientnet-b0 | 128 | 4942 img/s | 30.53 ms | 26.0 ms | 27.54 ms |
| efficientnet-b0 | 256 | 5339 img/s | 57.82 ms | 47.63 ms | 51.61 ms |
| efficientnet-b4 | 1 | 32 img/s | 31.83 ms | 32.51 ms | 34.09 ms |
| efficientnet-b4 | 2 | 65 img/s | 31.82 ms | 34.53 ms | 36.95 ms |
| efficientnet-b4 | 4 | 127 img/s | 32.77 ms | 32.87 ms | 35.95 ms |
| efficientnet-b4 | 8 | 255 img/s | 32.9 ms | 34.56 ms | 37.01 ms |
| efficientnet-b4 | 16 | 486 img/s | 34.46 ms | 36.56 ms | 39.1 ms |
| efficientnet-b4 | 32 | 681 img/s | 48.48 ms | 46.98 ms | 48.55 ms |
| efficientnet-b4 | 64 | 738 img/s | 88.55 ms | 86.55 ms | 87.31 ms |
| efficientnet-b4 | 128 | 757 img/s | 174.13 ms | 168.73 ms | 168.92 ms |
| efficientnet-b4 | 256 | 770 img/s | 343.04 ms | 329.95 ms | 330.66 ms |
| efficientnet-widese-b0 | 1 | 63 img/s | 17.08 ms | 16.36 ms | 17.8 ms |
| efficientnet-widese-b0 | 2 | 123 img/s | 17.48 ms | 16.74 ms | 18.17 ms |
| efficientnet-widese-b0 | 4 | 241 img/s | 17.95 ms | 17.29 ms | 18.76 ms |
| efficientnet-widese-b0 | 8 | 486 img/s | 17.92 ms | 19.42 ms | 22.3 ms |
| efficientnet-widese-b0 | 16 | 898 img/s | 19.3 ms | 20.57 ms | 22.41 ms |
| efficientnet-widese-b0 | 32 | 1649 img/s | 21.06 ms | 23.14 ms | 24.83 ms |
| efficientnet-widese-b0 | 64 | 3360 img/s | 21.22 ms | 22.89 ms | 25.07 ms |
| efficientnet-widese-b0 | 128 | 4934 img/s | 30.35 ms | 26.48 ms | 30.3 ms |
| efficientnet-widese-b0 | 256 | 5340 img/s | 57.83 ms | 47.59 ms | 54.7 ms |
| efficientnet-widese-b4 | 1 | 31 img/s | 33.37 ms | 34.12 ms | 35.95 ms |
| efficientnet-widese-b4 | 2 | 63 img/s | 33.0 ms | 33.73 ms | 35.15 ms |
| efficientnet-widese-b4 | 4 | 133 img/s | 31.43 ms | 31.72 ms | 33.93 ms |
| efficientnet-widese-b4 | 8 | 244 img/s | 34.35 ms | 36.98 ms | 39.72 ms |
| efficientnet-widese-b4 | 16 | 454 img/s | 36.8 ms | 39.8 ms | 42.41 ms |
| efficientnet-widese-b4 | 32 | 680 img/s | 48.63 ms | 48.1 ms | 50.57 ms |
| efficientnet-widese-b4 | 64 | 738 img/s | 88.64 ms | 86.56 ms | 86.7 ms |
| efficientnet-widese-b4 | 128 | 756 img/s | 174.52 ms | 168.98 ms | 169.13 ms |
| efficientnet-widese-b4 | 256 | 771 img/s | 344.05 ms | 329.69 ms | 330.7 ms |
| efficientnet-b0 | 1 | 62 img/s | 17.19 ms | 18.01 ms | 18.63 ms |
| efficientnet-b0 | 2 | 119 img/s | 17.96 ms | 18.3 ms | 19.95 ms |
| efficientnet-b0 | 4 | 238 img/s | 17.9 ms | 17.8 ms | 19.13 ms |
| efficientnet-b0 | 8 | 495 img/s | 17.38 ms | 18.34 ms | 19.29 ms |
| efficientnet-b0 | 16 | 945 img/s | 18.23 ms | 19.42 ms | 21.58 ms |
| efficientnet-b0 | 32 | 1784 img/s | 19.29 ms | 20.71 ms | 22.51 ms |
| efficientnet-b0 | 64 | 3480 img/s | 20.34 ms | 22.22 ms | 24.62 ms |
| efficientnet-b0 | 128 | 5759 img/s | 26.11 ms | 22.61 ms | 24.06 ms |
| efficientnet-b0 | 256 | 6176 img/s | 49.36 ms | 41.18 ms | 43.5 ms |
| efficientnet-b4 | 1 | 34 img/s | 30.28 ms | 30.2 ms | 32.24 ms |
| efficientnet-b4 | 2 | 69 img/s | 30.12 ms | 30.02 ms | 31.92 ms |
| efficientnet-b4 | 4 | 129 img/s | 32.08 ms | 33.29 ms | 34.74 ms |
| efficientnet-b4 | 8 | 242 img/s | 34.43 ms | 37.34 ms | 41.08 ms |
| efficientnet-b4 | 16 | 488 img/s | 34.12 ms | 36.13 ms | 39.39 ms |
| efficientnet-b4 | 32 | 738 img/s | 44.67 ms | 44.85 ms | 47.86 ms |
| efficientnet-b4 | 64 | 809 img/s | 80.93 ms | 79.19 ms | 79.42 ms |
| efficientnet-b4 | 128 | 843 img/s | 156.42 ms | 152.17 ms | 152.76 ms |
| efficientnet-b4 | 256 | 847 img/s | 311.03 ms | 301.44 ms | 302.48 ms |
| efficientnet-widese-b0 | 1 | 64 img/s | 16.71 ms | 17.59 ms | 19.23 ms |
| efficientnet-widese-b0 | 2 | 129 img/s | 16.63 ms | 16.1 ms | 17.34 ms |
| efficientnet-widese-b0 | 4 | 238 img/s | 17.92 ms | 17.52 ms | 18.82 ms |
| efficientnet-widese-b0 | 8 | 445 img/s | 19.24 ms | 19.53 ms | 20.4 ms |
| efficientnet-widese-b0 | 16 | 936 img/s | 18.64 ms | 19.55 ms | 21.1 ms |
| efficientnet-widese-b0 | 32 | 1818 img/s | 18.97 ms | 20.62 ms | 23.06 ms |
| efficientnet-widese-b0 | 64 | 3572 img/s | 19.81 ms | 21.14 ms | 23.29 ms |
| efficientnet-widese-b0 | 128 | 5748 img/s | 26.18 ms | 23.72 ms | 26.1 ms |
| efficientnet-widese-b0 | 256 | 6187 img/s | 49.11 ms | 41.11 ms | 41.59 ms |
| efficientnet-widese-b4 | 1 | 32 img/s | 32.1 ms | 31.6 ms | 34.69 ms |
| efficientnet-widese-b4 | 2 | 68 img/s | 30.4 ms | 30.9 ms | 32.67 ms |
| efficientnet-widese-b4 | 4 | 123 img/s | 33.81 ms | 39.0 ms | 40.76 ms |
| efficientnet-widese-b4 | 8 | 257 img/s | 32.34 ms | 33.39 ms | 34.93 ms |
| efficientnet-widese-b4 | 16 | 497 img/s | 33.51 ms | 34.92 ms | 37.24 ms |
| efficientnet-widese-b4 | 32 | 739 img/s | 44.63 ms | 43.62 ms | 46.39 ms |
| efficientnet-widese-b4 | 64 | 808 img/s | 81.08 ms | 79.43 ms | 79.59 ms |
| efficientnet-widese-b4 | 128 | 840 img/s | 157.11 ms | 152.87 ms | 153.26 ms |
| efficientnet-widese-b4 | 256 | 846 img/s | 310.73 ms | 301.68 ms | 302.9 ms |


@@ -206,7 +206,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* [PyTorch 21.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@@ -533,7 +533,7 @@ To benchmark inference, run:
* TF32 (A100 GPUs only)
-`python ./launch.py --model resnet50 --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
+`python ./launch.py --model resnet50 --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
@@ -543,11 +543,12 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
-Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
+#### Training accuracy results
+Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
-#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
@@ -573,8 +574,6 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
| 90 | 77.10 +/- 0.06 | 77.23 +/- 0.04 |
| 250 | 78.59 +/- 0.13 | 78.46 +/- 0.03 |
##### Example plots
The following images show a 250-epoch configuration on a DGX-1V.
@@ -587,64 +586,70 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 2461 img/s | 945 img/s | 2.6 x | 1.0 x | ~14 hours | 1.0 x | ~36 hours |
| 8 | 15977 img/s | 7365 img/s | 2.16 x | 6.49 x | ~3 hours | 7.78 x | ~5 hours |
| **GPUs** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 938 img/s | 2470 img/s | 2.63 x | 1.0 x | 1.0 x | ~14 hours | ~36 hours |
| 8 | 7248 img/s | 16621 img/s | 2.29 x | 7.72 x | 6.72 x | ~3 hours | ~5 hours |
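The approximate training-time columns can be sanity-checked against the throughput columns: pure training compute takes at least epochs × dataset size ÷ throughput, and the reported wall-clock figures are higher because they presumably also include validation passes, checkpointing, and data-loading overhead. A back-of-the-envelope sketch (the ImageNet count is the standard ILSVRC2012 training-set size; the helper name is ours):

```python
# Lower bound on training time: epochs * dataset size / measured throughput.
IMAGENET_TRAIN_IMAGES = 1_281_167  # standard ILSVRC2012 training-set size

def min_training_hours(epochs: int, images_per_second: float) -> float:
    return epochs * IMAGENET_TRAIN_IMAGES / images_per_second / 3600.0

# Example: resnet50, mixed precision, 8x A100 (16621 img/s from the table above).
print(round(min_training_hours(90, 16621), 1))  # ~1.9 h of pure training compute; the table's ~3 hours includes overhead
```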
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1180 img/s | 371 img/s | 3.17 x | 1.0 x | ~29 hours | 1.0 x | ~91 hours |
| 8 | 7608 img/s | 2851 img/s | 2.66 x | 6.44 x | ~5 hours | 7.66 x | ~12 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 367 img/s | 1200 img/s | 3.26 x | 1.0 x | 1.0 x | ~29 hours | ~92 hours |
| 8 | 2855 img/s | 8322 img/s | 2.91 x | 7.76 x | 6.93 x | ~5 hours | ~12 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1115 img/s | 365 img/s | 3.04 x | 1.0 x | ~31 hours | 1.0 x | ~92 hours |
| 8 | 7375 img/s | 2811 img/s | 2.62 x | 6.61 x | ~5 hours | 7.68 x | ~12 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:----------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 356 img/s | 1156 img/s | 3.24 x | 1.0 x | 1.0 x | ~30 hours | ~95 hours |
| 8 | 2766 img/s | 8056 img/s | 2.91 x | 7.75 x | 6.96 x | ~5 hours | ~13 hours |
#### Inference performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
###### FP32 Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 99 img/s | 10.38 ms | 11.24 ms | 12.32 ms |
| 2 | 190 img/s | 10.87 ms | 12.18 ms | 14.27 ms |
| 4 | 403 img/s | 10.26 ms | 11.02 ms | 13.28 ms |
| 8 | 754 img/s | 10.96 ms | 11.99 ms | 13.89 ms |
| 16 | 960 img/s | 17.16 ms | 16.74 ms | 18.18 ms |
| 32 | 1057 img/s | 31.39 ms | 30.4 ms | 30.55 ms |
| 64 | 1168 img/s | 57.1 ms | 55.01 ms | 56.19 ms |
| 112 | 1166 img/s | 100.78 ms | 95.98 ms | 97.43 ms |
| 128 | 1215 img/s | 111.11 ms | 105.52 ms | 106.38 ms |
| 256 | 1253 img/s | 217.03 ms | 203.78 ms | 208.68 ms |
| 1 | 96 img/s | 10.37 ms | 10.81 ms | 11.73 ms |
| 2 | 196 img/s | 10.24 ms | 11.18 ms | 12.89 ms |
| 4 | 386 img/s | 10.46 ms | 11.01 ms | 11.75 ms |
| 8 | 709 img/s | 11.5 ms | 12.36 ms | 13.12 ms |
| 16 | 1023 img/s | 16.07 ms | 15.69 ms | 15.97 ms |
| 32 | 1127 img/s | 29.37 ms | 28.53 ms | 28.67 ms |
| 64 | 1200 img/s | 55.4 ms | 53.5 ms | 53.71 ms |
| 128 | 1229 img/s | 109.26 ms | 104.04 ms | 104.34 ms |
| 256 | 1261 img/s | 214.48 ms | 202.51 ms | 202.88 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 82 img/s | 12.43 ms | 13.29 ms | 14.89 ms |
| 2 | 157 img/s | 13.04 ms | 13.84 ms | 16.79 ms |
| 4 | 310 img/s | 13.26 ms | 14.42 ms | 15.63 ms |
| 8 | 646 img/s | 12.69 ms | 13.65 ms | 15.48 ms |
| 16 | 1188 img/s | 14.01 ms | 15.56 ms | 18.34 ms |
| 32 | 2093 img/s | 16.41 ms | 18.25 ms | 19.9 ms |
| 64 | 2899 img/s | 24.12 ms | 22.14 ms | 22.55 ms |
| 128 | 3142 img/s | 45.28 ms | 40.77 ms | 42.89 ms |
| 256 | 3276 img/s | 88.44 ms | 77.8 ms | 79.01 ms |
| 256 | 3276 img/s | 88.6 ms | 77.74 ms | 79.11 ms |
| 1 | 78 img/s | 12.78 ms | 13.27 ms | 14.36 ms |
| 2 | 154 img/s | 13.01 ms | 13.74 ms | 15.19 ms |
| 4 | 300 img/s | 13.41 ms | 14.25 ms | 15.68 ms |
| 8 | 595 img/s | 13.65 ms | 14.51 ms | 15.6 ms |
| 16 | 1178 img/s | 14.0 ms | 15.07 ms | 16.26 ms |
| 32 | 2146 img/s | 15.84 ms | 17.25 ms | 18.53 ms |
| 64 | 2984 img/s | 23.18 ms | 21.51 ms | 21.93 ms |
| 128 | 3249 img/s | 43.55 ms | 39.36 ms | 40.1 ms |
| 256 | 3382 img/s | 84.14 ms | 75.3 ms | 80.08 ms |
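The "Latency 95%" and "Latency 99%" columns are latency percentiles over the benchmark iterations. A small sketch of how such percentiles can be computed from raw per-iteration timings; this is an illustration only, not the repository's own benchmark harness, which may aggregate differently:

```python
# Illustration of deriving avg / 95th / 99th percentile latency from raw timings.
import numpy as np

def latency_stats(latencies_ms):
    arr = np.asarray(latencies_ms, dtype=float)
    return {"avg": arr.mean(), "p95": np.percentile(arr, 95), "p99": np.percentile(arr, 99)}

# Synthetic per-iteration timings, for illustration only.
print(latency_stats([12.4, 12.9, 13.1, 13.6, 14.2, 18.5]))
```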
##### Inference performance: NVIDIA T4
@@ -653,30 +658,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 147 img/s | 7.28 ms | 8.48 ms | 9.79 ms |
| 2 | 251 img/s | 8.48 ms | 10.23 ms | 14.01 ms |
| 4 | 303 img/s | 13.57 ms | 13.61 ms | 15.42 ms |
| 8 | 329 img/s | 24.7 ms | 24.74 ms | 25.0 ms |
| 16 | 371 img/s | 43.73 ms | 43.74 ms | 44.03 ms |
| 32 | 395 img/s | 82.36 ms | 82.13 ms | 82.58 ms |
| 64 | 421 img/s | 155.37 ms | 153.07 ms | 153.55 ms |
| 128 | 426 img/s | 309.06 ms | 303.0 ms | 307.42 ms |
| 256 | 419 img/s | 631.43 ms | 612.42 ms | 614.82 ms |
| 1 | 98 img/s | 10.7 ms | 12.82 ms | 16.71 ms |
| 2 | 186 img/s | 11.26 ms | 13.79 ms | 16.99 ms |
| 4 | 325 img/s | 12.73 ms | 13.89 ms | 18.03 ms |
| 8 | 363 img/s | 22.41 ms | 22.57 ms | 22.9 ms |
| 16 | 409 img/s | 39.77 ms | 39.8 ms | 40.23 ms |
| 32 | 420 img/s | 77.62 ms | 76.92 ms | 77.28 ms |
| 64 | 428 img/s | 152.73 ms | 152.03 ms | 153.02 ms |
| 128 | 426 img/s | 309.26 ms | 303.38 ms | 305.13 ms |
| 256 | 415 img/s | 635.98 ms | 620.16 ms | 625.21 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 112 img/s | 9.25 ms | 9.87 ms | 10.62 ms |
| 2 | 223 img/s | 9.4 ms | 10.62 ms | 13.9 ms |
| 4 | 468 img/s | 9.06 ms | 11.15 ms | 15.5 ms |
| 8 | 844 img/s | 10.05 ms | 12.67 ms | 17.86 ms |
| 16 | 1037 img/s | 16.01 ms | 15.66 ms | 15.86 ms |
| 32 | 1103 img/s | 30.27 ms | 29.45 ms | 29.74 ms |
| 64 | 1154 img/s | 57.96 ms | 56.33 ms | 56.96 ms |
| 128 | 1177 img/s | 114.95 ms | 110.4 ms | 111.1 ms |
| 256 | 1184 img/s | 229.61 ms | 217.84 ms | 224.75 ms |
| 1 | 79 img/s | 12.96 ms | 15.47 ms | 20.0 ms |
| 2 | 156 img/s | 13.18 ms | 14.9 ms | 18.73 ms |
| 4 | 317 img/s | 12.99 ms | 14.69 ms | 19.05 ms |
| 8 | 652 img/s | 12.82 ms | 16.04 ms | 19.43 ms |
| 16 | 1050 img/s | 15.8 ms | 16.57 ms | 20.62 ms |
| 32 | 1128 img/s | 29.54 ms | 28.79 ms | 28.97 ms |
| 64 | 1165 img/s | 57.41 ms | 55.67 ms | 56.11 ms |
| 128 | 1190 img/s | 114.24 ms | 109.17 ms | 110.41 ms |
| 256 | 1198 img/s | 225.95 ms | 215.28 ms | 222.94 ms |
## Release notes
@@ -701,6 +706,7 @@ The following images show a 250 epochs configuration on a DGX-1V.
* Updated README
6. February 2021
* Moved from APEX AMP to Native AMP
### Known issues
There are no known issues with this model.


@@ -190,7 +190,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* [PyTorch 21.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@@ -516,7 +516,7 @@ To benchmark inference, run:
* TF32 (A100 GPUs only)
-`python ./launch.py --model resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
+`python ./launch.py --model resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
@@ -526,12 +526,12 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
-Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
+#### Training accuracy results
+Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
-#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
@@ -560,62 +560,70 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 1169 img/s | 420 img/s | 2.77 x | 1.0 x | ~29 hours | 1.0 x | ~80 hours |
| 8 | 7399 img/s | 3193 img/s | 2.31 x | 6.32 x | ~5 hours | 7.58 x | ~11 hours |
| **GPUs** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 456 img/s | 1211 img/s | 2.65 x | 1.0 x | 1.0 x | ~28 hours | ~74 hours |
| 8 | 3471 img/s | 7925 img/s | 2.28 x | 7.6 x | 6.54 x | ~5 hours | ~10 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 578 img/s | 149 img/s | 3.86 x | 1.0 x | ~59 hours | 1.0 x | ~225 hours |
| 8 | 3742 img/s | 1117 img/s | 3.34 x | 6.46 x | ~9 hours | 7.45 x | ~31 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 147 img/s | 587 img/s | 3.97 x | 1.0 x | 1.0 x | ~58 hours | ~228 hours |
| 8 | 1133 img/s | 4065 img/s | 3.58 x | 7.65 x | 6.91 x | ~9 hours | ~30 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 556 img/s | 151 img/s | 3.68 x | 1.0 x | ~61 hours | 1.0 x | ~223 hours |
| 8 | 3595 img/s | 1102 img/s | 3.26 x | 6.45 x | ~10 hours | 7.28 x | ~31 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 144 img/s | 565 img/s | 3.9 x | 1.0 x | 1.0 x | ~60 hours | ~233 hours |
| 8 | 1108 img/s | 3863 img/s | 3.48 x | 7.66 x | 6.83 x | ~9 hours | ~31 hours |
#### Inference performance results
+Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
###### FP32 Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 55 img/s | 18.48 ms | 18.88 ms | 20.74 ms |
| 2 | 116 img/s | 17.54 ms | 18.15 ms | 21.32 ms |
| 4 | 214 img/s | 19.07 ms | 20.44 ms | 22.69 ms |
| 8 | 291 img/s | 27.8 ms | 27.99 ms | 28.47 ms |
| 16 | 354 img/s | 45.78 ms | 45.4 ms | 45.73 ms |
| 32 | 423 img/s | 77.13 ms | 75.96 ms | 76.21 ms |
| 64 | 486 img/s | 134.92 ms | 132.17 ms | 132.51 ms |
| 128 | 523 img/s | 252.11 ms | 244.5 ms | 244.99 ms |
| 256 | 530 img/s | 499.64 ms | 479.83 ms | 481.41 ms |
| 1 | 55 img/s | 17.95 ms | 20.61 ms | 22.0 ms |
| 2 | 105 img/s | 19.2 ms | 20.74 ms | 22.77 ms |
| 4 | 170 img/s | 23.65 ms | 24.66 ms | 28.0 ms |
| 8 | 336 img/s | 24.05 ms | 24.92 ms | 27.75 ms |
| 16 | 397 img/s | 40.77 ms | 40.44 ms | 40.65 ms |
| 32 | 452 img/s | 72.12 ms | 71.1 ms | 71.35 ms |
| 64 | 500 img/s | 130.9 ms | 128.19 ms | 128.64 ms |
| 128 | 527 img/s | 249.57 ms | 242.77 ms | 243.63 ms |
| 256 | 533 img/s | 496.76 ms | 478.04 ms | 480.42 ms |
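Tables like the one above are also useful for picking a serving batch size: beyond a certain point throughput saturates while latency keeps growing roughly linearly. A small sketch that picks the smallest batch size within a few percent of peak throughput; the helper name and threshold are ours, and the rows are copied from the FP32 table above:

```python
# Pick the smallest batch size whose throughput is within `frac` of the peak.
def smallest_near_peak(rows, frac=0.95):
    # rows: (batch_size, throughput_img_per_s) pairs copied from a latency table
    peak = max(t for _, t in rows)
    return min(bs for bs, t in rows if t >= frac * peak)

# resnext101-32x4d FP32 rows (batch size, img/s) from the table above.
rows = [(1, 55), (2, 105), (4, 170), (8, 336), (16, 397), (32, 452), (64, 500), (128, 527), (256, 533)]
print(smallest_near_peak(rows))  # 128: near-peak throughput at roughly half the batch-256 latency
```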
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 40 img/s | 25.17 ms | 28.4 ms | 30.66 ms |
| 2 | 89 img/s | 22.64 ms | 24.29 ms | 25.99 ms |
| 4 | 165 img/s | 24.54 ms | 26.23 ms | 28.61 ms |
| 8 | 334 img/s | 24.31 ms | 28.46 ms | 29.91 ms |
| 16 | 632 img/s | 25.8 ms | 27.76 ms | 29.53 ms |
| 32 | 1219 img/s | 27.35 ms | 29.86 ms | 31.6 ms |
| 64 | 1525 img/s | 43.97 ms | 42.01 ms | 42.96 ms |
| 128 | 1647 img/s | 82.22 ms | 77.65 ms | 79.56 ms |
| 256 | 1689 img/s | 161.53 ms | 151.25 ms | 152.01 ms |
| 1 | 43 img/s | 23.08 ms | 24.18 ms | 27.82 ms |
| 2 | 84 img/s | 23.65 ms | 24.64 ms | 27.87 ms |
| 4 | 164 img/s | 24.38 ms | 27.33 ms | 27.95 ms |
| 8 | 333 img/s | 24.18 ms | 25.92 ms | 28.3 ms |
| 16 | 640 img/s | 25.4 ms | 26.53 ms | 29.47 ms |
| 32 | 1195 img/s | 27.72 ms | 29.9 ms | 32.19 ms |
| 64 | 1595 img/s | 41.89 ms | 40.15 ms | 41.08 ms |
| 128 | 1699 img/s | 79.45 ms | 75.65 ms | 76.08 ms |
| 256 | 1746 img/s | 154.68 ms | 145.76 ms | 146.52 ms |
##### Inference performance: NVIDIA T4
@@ -624,30 +632,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 79 img/s | 13.07 ms | 14.66 ms | 15.59 ms |
| 2 | 119 img/s | 17.21 ms | 18.07 ms | 19.78 ms |
| 4 | 141 img/s | 28.65 ms | 28.62 ms | 28.77 ms |
| 8 | 139 img/s | 57.84 ms | 58.29 ms | 58.62 ms |
| 16 | 153 img/s | 104.8 ms | 105.65 ms | 106.2 ms |
| 32 | 178 img/s | 181.24 ms | 180.96 ms | 181.57 ms |
| 64 | 179 img/s | 360.93 ms | 358.22 ms | 359.11 ms |
| 128 | 177 img/s | 735.99 ms | 726.15 ms | 727.81 ms |
| 256 | 167 img/s | 1561.91 ms | 1523.52 ms | 1525.96 ms |
| 1 | 56 img/s | 18.18 ms | 20.45 ms | 24.58 ms |
| 2 | 109 img/s | 18.77 ms | 21.53 ms | 26.21 ms |
| 4 | 151 img/s | 26.89 ms | 27.81 ms | 30.94 ms |
| 8 | 164 img/s | 48.99 ms | 49.44 ms | 49.91 ms |
| 16 | 172 img/s | 93.51 ms | 93.73 ms | 94.16 ms |
| 32 | 180 img/s | 178.83 ms | 178.41 ms | 179.07 ms |
| 64 | 178 img/s | 361.95 ms | 360.7 ms | 362.32 ms |
| 128 | 172 img/s | 756.93 ms | 750.21 ms | 752.45 ms |
| 256 | 161 img/s | 1615.79 ms | 1580.61 ms | 1583.43 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 65 img/s | 15.69 ms | 16.95 ms | 17.97 ms |
| 2 | 126 img/s | 16.2 ms | 16.78 ms | 18.6 ms |
| 4 | 245 img/s | 16.77 ms | 18.35 ms | 25.88 ms |
| 8 | 488 img/s | 16.82 ms | 17.86 ms | 25.45 ms |
| 16 | 541 img/s | 30.16 ms | 29.95 ms | 30.18 ms |
| 32 | 566 img/s | 57.79 ms | 57.11 ms | 57.29 ms |
| 64 | 580 img/s | 112.84 ms | 111.07 ms | 111.56 ms |
| 128 | 586 img/s | 224.75 ms | 219.12 ms | 219.64 ms |
| 256 | 589 img/s | 447.25 ms | 434.18 ms | 439.22 ms |
| 1 | 44 img/s | 23.0 ms | 25.77 ms | 29.41 ms |
| 2 | 87 img/s | 23.14 ms | 26.55 ms | 30.97 ms |
| 4 | 178 img/s | 22.8 ms | 24.2 ms | 29.38 ms |
| 8 | 371 img/s | 21.98 ms | 25.34 ms | 29.61 ms |
| 16 | 553 img/s | 29.47 ms | 29.52 ms | 31.14 ms |
| 32 | 578 img/s | 56.56 ms | 56.04 ms | 56.37 ms |
| 64 | 591 img/s | 110.82 ms | 109.37 ms | 109.83 ms |
| 128 | 597 img/s | 220.44 ms | 215.33 ms | 216.3 ms |
| 256 | 598 img/s | 439.3 ms | 428.2 ms | 431.46 ms |
## Release notes

View file

@ -191,7 +191,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [PyTorch 21.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@ -518,7 +518,7 @@ To benchmark inference, run:
* TF32 (A100 GPUs only)
`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
`python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
@ -528,12 +528,12 @@ Each of these scripts will run 100 iterations and save results in the `benchmark
### Results
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
#### Training accuracy results
Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
#### Training accuracy results
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
@ -562,63 +562,70 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results
Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 804 img/s | 360 img/s | 2.22 x | 1.0 x | ~42 hours | 1.0 x | ~94 hours |
| 8 | 5248 img/s | 2665 img/s | 1.96 x | 6.52 x | ~7 hours | 7.38 x | ~13 hours |
| **GPUs** | **Throughput - TF32** | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 395 img/s | 855 img/s | 2.16 x | 1.0 x | 1.0 x | ~40 hours | ~86 hours |
| 8 | 2991 img/s | 5779 img/s | 1.93 x | 7.56 x | 6.75 x | ~6 hours | ~12 hours |
##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:---------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 430 img/s | 133 img/s | 3.21 x | 1.0 x | ~79 hours | 1.0 x | ~252 hours |
| 8 | 2716 img/s | 994 img/s | 2.73 x | 6.31 x | ~13 hours | 7.42 x | ~34 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 132 img/s | 443 img/s | 3.34 x | 1.0 x | 1.0 x | ~76 hours | ~254 hours |
| 8 | 1004 img/s | 2971 img/s | 2.95 x | 7.57 x | 6.7 x | ~12 hours | ~34 hours |
##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
| 1 | 413 img/s | 134 img/s | 3.08 x | 1.0 x | ~82 hours | 1.0 x | ~251 hours |
| 8 | 2572 img/s | 1011 img/s | 2.54 x | 6.22 x | ~14 hours | 7.54 x | ~34 hours |
| **GPUs** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **FP32 Strong Scaling** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Training Time (90E)** |
|:--------:|:---------------------:|:--------------------------------:|:------------------------------------------------:|:-----------------------:|:----------------------------------:|:---------------------------------------:|:----------------------------:|
| 1 | 130 img/s | 427 img/s | 3.26 x | 1.0 x | 1.0 x | ~79 hours | ~257 hours |
| 8 | 992 img/s | 2925 img/s | 2.94 x | 7.58 x | 6.84 x | ~12 hours | ~34 hours |
#### Inference performance results
Our results were obtained by running the applicable training script in the pytorch-21.03 NGC container.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
###### FP32 Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 37 img/s | 26.81 ms | 27.89 ms | 31.44 ms |
| 2 | 75 img/s | 27.01 ms | 28.89 ms | 31.17 ms |
| 4 | 144 img/s | 28.09 ms | 30.14 ms | 32.47 ms |
| 8 | 259 img/s | 31.23 ms | 33.65 ms | 38.4 ms |
| 16 | 332 img/s | 48.7 ms | 48.35 ms | 48.8 ms |
| 32 | 394 img/s | 83.02 ms | 81.55 ms | 81.9 ms |
| 64 | 471 img/s | 138.88 ms | 136.24 ms | 136.54 ms |
| 128 | 505 img/s | 261.4 ms | 253.07 ms | 254.29 ms |
| 256 | 513 img/s | 516.66 ms | 496.06 ms | 497.05 ms |
| 1 | 40 img/s | 24.92 ms | 26.78 ms | 31.12 ms |
| 2 | 80 img/s | 24.89 ms | 27.63 ms | 30.81 ms |
| 4 | 127 img/s | 31.58 ms | 35.92 ms | 39.64 ms |
| 8 | 250 img/s | 32.29 ms | 34.5 ms | 38.14 ms |
| 16 | 363 img/s | 44.5 ms | 44.16 ms | 44.37 ms |
| 32 | 423 img/s | 76.86 ms | 75.89 ms | 76.17 ms |
| 64 | 472 img/s | 138.36 ms | 135.85 ms | 136.52 ms |
| 128 | 501 img/s | 262.64 ms | 255.48 ms | 256.02 ms |
| 256 | 508 img/s | 519.84 ms | 500.71 ms | 501.5 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 29 img/s | 34.24 ms | 36.67 ms | 39.4 ms |
| 2 | 53 img/s | 37.81 ms | 43.03 ms | 45.1 ms |
| 4 | 103 img/s | 39.1 ms | 43.05 ms | 46.16 ms |
| 8 | 226 img/s | 35.66 ms | 38.39 ms | 41.13 ms |
| 16 | 458 img/s | 35.4 ms | 37.38 ms | 39.97 ms |
| 32 | 882 img/s | 37.37 ms | 40.12 ms | 42.64 ms |
| 64 | 1356 img/s | 49.31 ms | 47.21 ms | 49.87 ms |
| 112 | 1448 img/s | 81.27 ms | 77.35 ms | 78.28 ms |
| 128 | 1486 img/s | 90.59 ms | 86.15 ms | 87.04 ms |
| 256 | 1534 img/s | 176.72 ms | 166.2 ms | 167.53 ms |
| 1 | 29 img/s | 33.83 ms | 39.1 ms | 41.57 ms |
| 2 | 58 img/s | 34.35 ms | 36.92 ms | 41.66 ms |
| 4 | 117 img/s | 34.33 ms | 38.67 ms | 41.05 ms |
| 8 | 232 img/s | 34.66 ms | 39.51 ms | 42.16 ms |
| 16 | 459 img/s | 35.23 ms | 36.77 ms | 38.11 ms |
| 32 | 871 img/s | 37.62 ms | 39.36 ms | 41.26 ms |
| 64 | 1416 img/s | 46.95 ms | 45.26 ms | 47.48 ms |
| 128 | 1533 img/s | 87.49 ms | 83.54 ms | 83.75 ms |
| 256 | 1576 img/s | 170.79 ms | 161.97 ms | 162.93 ms |
##### Inference performance: NVIDIA T4
@ -627,30 +634,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 52 img/s | 19.39 ms | 20.39 ms | 21.18 ms |
| 2 | 102 img/s | 19.98 ms | 21.4 ms | 23.75 ms |
| 4 | 134 img/s | 30.12 ms | 30.14 ms | 30.54 ms |
| 8 | 136 img/s | 59.07 ms | 60.63 ms | 61.49 ms |
| 16 | 154 img/s | 104.38 ms | 105.21 ms | 105.81 ms |
| 32 | 169 img/s | 190.12 ms | 189.64 ms | 190.24 ms |
| 64 | 171 img/s | 376.19 ms | 374.16 ms | 375.6 ms |
| 128 | 168 img/s | 771.4 ms | 761.64 ms | 764.7 ms |
| 256 | 159 img/s | 1639.15 ms | 1603.45 ms | 1605.47 ms |
| 1 | 40 img/s | 25.12 ms | 28.83 ms | 31.59 ms |
| 2 | 75 img/s | 26.82 ms | 30.54 ms | 33.13 ms |
| 4 | 136 img/s | 29.79 ms | 33.33 ms | 37.65 ms |
| 8 | 155 img/s | 51.74 ms | 52.57 ms | 53.12 ms |
| 16 | 164 img/s | 97.99 ms | 98.76 ms | 99.21 ms |
| 32 | 173 img/s | 186.31 ms | 186.43 ms | 187.4 ms |
| 64 | 171 img/s | 378.1 ms | 377.19 ms | 378.82 ms |
| 128 | 165 img/s | 785.83 ms | 778.23 ms | 782.64 ms |
| 256 | 158 img/s | 1641.96 ms | 1601.74 ms | 1614.52 ms |
###### Mixed Precision Inference Latency
| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
| 1 | 42 img/s | 24.17 ms | 27.26 ms | 29.98 ms |
| 2 | 87 img/s | 23.24 ms | 24.66 ms | 26.77 ms |
| 4 | 170 img/s | 23.87 ms | 24.89 ms | 29.59 ms |
| 8 | 334 img/s | 24.49 ms | 27.92 ms | 35.66 ms |
| 16 | 472 img/s | 34.45 ms | 34.29 ms | 35.72 ms |
| 32 | 502 img/s | 64.93 ms | 64.47 ms | 65.16 ms |
| 64 | 517 img/s | 126.24 ms | 125.03 ms | 125.86 ms |
| 128 | 522 img/s | 250.99 ms | 245.87 ms | 247.1 ms |
| 256 | 523 img/s | 502.41 ms | 487.58 ms | 489.69 ms |
| 1 | 31 img/s | 32.51 ms | 37.26 ms | 39.53 ms |
| 2 | 61 img/s | 32.76 ms | 37.61 ms | 39.62 ms |
| 4 | 123 img/s | 32.98 ms | 38.97 ms | 42.66 ms |
| 8 | 262 img/s | 31.01 ms | 36.3 ms | 39.11 ms |
| 16 | 482 img/s | 33.76 ms | 34.54 ms | 38.5 ms |
| 32 | 512 img/s | 63.68 ms | 63.29 ms | 63.73 ms |
| 64 | 527 img/s | 123.57 ms | 122.69 ms | 123.56 ms |
| 128 | 525 img/s | 248.97 ms | 245.39 ms | 246.66 ms |
| 256 | 527 img/s | 496.23 ms | 485.68 ms | 488.3 ms |
## Release notes

View file

@ -0,0 +1,133 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
Using the `calculate_metrics.py` script, you can obtain model accuracy/error metrics using a user-defined `MetricsCalculator` class.
Data provided to `MetricsCalculator` are obtained from npz dump files
stored in the directory pointed to by the `--dump-dir` argument.
These files are prepared by the `run_inference_on_fw.py` and `run_inference_on_triton.py` scripts.
Output data is stored in the CSV file pointed to by the `--csv` argument.
Example call:
```shell script
python ./triton/calculate_metrics.py \
--dump-dir /results/dump_triton \
--csv /results/accuracy_results.csv \
--metrics metrics.py \
--metric-class-param1 value
```
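A minimal `metrics.py` for a classification model could look like the sketch below.
This is an illustrative assumption rather than code from the repository; the import path of
`BaseMetricsCalculator` and the output name `OUTPUT__0` depend on your deployment layout,
and constructor arguments are exposed as CLI flags by `ArgParserGenerator`:
```python
import numpy as np

from deployment_toolkit.core import BaseMetricsCalculator  # assumed import path


class MetricsCalculator(BaseMetricsCalculator):
    def __init__(self, output_used_for_metrics: str = "OUTPUT__0"):
        # exposed on the CLI as --output-used-for-metrics
        self._output = output_used_for_metrics

    def calc(self, *, ids, x, y_pred, y_real) -> dict:
        # labels are one-hot encoded, predictions are raw model outputs
        y_true = np.argmax(y_real[self._output], axis=-1)
        y_top1 = np.argmax(y_pred[self._output], axis=-1)
        return {"top1_accuracy": float((y_true == y_top1).mean())}
```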
"""
import argparse
import csv
import logging
import string
from pathlib import Path
import numpy as np
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import BaseMetricsCalculator, load_from_file
from .deployment_toolkit.dump import pad_except_batch_axis
LOGGER = logging.getLogger("calculate_metrics")
TOTAL_COLUMN_NAME = "_total_"
def get_data(dump_dir, prefix):
"""Loads and concatenates dump files for given prefix (ex. inputs, outputs, labels, ids)"""
dump_dir = Path(dump_dir)
npz_files = sorted(dump_dir.glob(f"{prefix}*.npz"))
data = None
if npz_files:
# assume that all npz files with given prefix contain same set of names
names = list(np.load(npz_files[0].as_posix()).keys())
# calculate target shape
target_shape = {
name: tuple(np.max([np.load(npz_file.as_posix())[name].shape for npz_file in npz_files], axis=0))
for name in names
}
# pad and concatenate data
data = {
name: np.concatenate(
[pad_except_batch_axis(np.load(npz_file.as_posix())[name], target_shape[name]) for npz_file in npz_files]
)
for name in names
}
return data
def main():
logging.basicConfig(level=logging.INFO)
parser = argparse.ArgumentParser(description="Run models with given dataloader", allow_abbrev=False)
parser.add_argument("--metrics", help=f"Path to python module containing metrics calculator", required=True)
parser.add_argument("--csv", help="Path to csv file", required=True)
parser.add_argument("--dump-dir", help="Path to directory with dumped outputs (and labels)", required=True)
args, *_ = parser.parse_known_args()
MetricsCalculator = load_from_file(args.metrics, "metrics", "MetricsCalculator")
ArgParserGenerator(MetricsCalculator).update_argparser(parser)
args = parser.parse_args()
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
MetricsCalculator = load_from_file(args.metrics, "metrics", "MetricsCalculator")
metrics_calculator: BaseMetricsCalculator = ArgParserGenerator(MetricsCalculator).from_args(args)
ids = get_data(args.dump_dir, "ids")["ids"]
x = get_data(args.dump_dir, "inputs")
y_true = get_data(args.dump_dir, "labels")
y_pred = get_data(args.dump_dir, "outputs")
common_keys = list({k for k in (y_true or [])} & {k for k in (y_pred or [])})
for key in common_keys:
if y_true[key].shape != y_pred[key].shape:
LOGGER.warning(
f"Model predictions and labels shall have equal shapes. "
f"y_pred[{key}].shape={y_pred[key].shape} != "
f"y_true[{key}].shape={y_true[key].shape}"
)
metrics = metrics_calculator.calc(ids=ids, x=x, y_pred=y_pred, y_real=y_true)
metrics = {TOTAL_COLUMN_NAME: len(ids), **metrics}
metric_names_with_space = [name for name in metrics if any([c in string.whitespace for c in name])]
if metric_names_with_space:
raise ValueError(f"Metric names shall have no spaces; Incorrect names: {', '.join(metric_names_with_space)}")
csv_path = Path(args.csv)
csv_path.parent.mkdir(parents=True, exist_ok=True)
with csv_path.open("w") as csv_file:
writer = csv.DictWriter(csv_file, fieldnames=list(metrics.keys()))
writer.writeheader()
writer.writerow(metrics)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,202 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
To configure a model on Triton, you can use the `config_model_on_triton.py` script.
This will prepare the layout of the Model Repository, including the Model Configuration.
```shell script
python ./triton/config_model_on_triton.py \
--model-repository /model_repository \
--model-path /models/exported/model.onnx \
--model-format onnx \
--model-name ResNet50 \
--model-version 1 \
--max-batch-size 32 \
--precision fp16 \
--backend-accelerator trt \
--load-model explicit \
--timeout 120 \
--verbose
```
If the Triton server for which the model repository is prepared is running in **explicit model control mode**,
use the `--load-model` argument to send a load_model request to the Triton Inference Server.
If the server is listening on a non-default address or port, use the `--server-url` argument to point to the server control endpoint.
If the HTTP protocol is required to communicate with the Triton server, use the `--http` argument.
To improve inference throughput, you can enable
[dynamic batching](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md#dynamic-batcher)
for your model by providing the `--preferred-batch-sizes` and `--max-queue-delay-us` parameters, as in the example below.
For models which don't support batching, set `--max-batch-size` to 0.
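For example, a call that enables dynamic batching and loads the model explicitly could look like
the sketch below (paths, the model name, and the server URL are placeholders for your setup):
```shell script
python ./triton/config_model_on_triton.py \
    --model-repository /model_repository \
    --model-path /models/exported/model.onnx \
    --model-format onnx \
    --model-name ResNet50 \
    --max-batch-size 32 \
    --preferred-batch-sizes 16 32 \
    --max-queue-delay-us 100 \
    --load-model explicit \
    --server-url grpc://localhost:8001
```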
By default, Triton will [automatically obtain input and output definitions](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md#auto-generated-model-configuration),
but for TorchScript and TF GraphDef models the script uses a file with I/O specs. This file is automatically generated
when the model is converted to ScriptModule (either traced or scripted).
If you need to pass a non-default path to the I/O spec file, use the `--io-spec` CLI argument.
The I/O spec file is a YAML file with the following structure:
```yaml
- inputs:
- name: input
dtype: float32 # np.dtype name
shape: [None, 224, 224, 3]
- outputs:
- name: probabilities
dtype: float32
shape: [None, 1001]
- name: classes
dtype: int32
shape: [None, 1]
```
"""
import argparse
import logging
import time
from model_navigator import Accelerator, Format, Precision
from model_navigator.args import str2bool
from model_navigator.log import set_logger, log_dict
from model_navigator.triton import ModelConfig, TritonClient, TritonModelStore
LOGGER = logging.getLogger("config_model")
def _available_enum_values(my_enum):
return [item.value for item in my_enum]
def main():
parser = argparse.ArgumentParser(
description="Create Triton model repository and model configuration", allow_abbrev=False
)
parser.add_argument("--model-repository", required=True, help="Path to Triton model repository.")
parser.add_argument("--model-path", required=True, help="Path to model to configure")
# TODO: automation
parser.add_argument(
"--model-format",
required=True,
choices=_available_enum_values(Format),
help="Format of model to deploy",
)
parser.add_argument("--model-name", required=True, help="Model name")
parser.add_argument("--model-version", default="1", help="Version of model (default 1)")
parser.add_argument(
"--max-batch-size",
type=int,
default=32,
help="Maximum batch size allowed for inference. "
"A max_batch_size value of 0 indicates that batching is not allowed for the model",
)
# TODO: automation
parser.add_argument(
"--precision",
type=str,
default=Precision.FP16.value,
choices=_available_enum_values(Precision),
help="Model precision (parameter used only by Tensorflow backend with TensorRT optimization)",
)
# Triton Inference Server endpoint
parser.add_argument(
"--server-url",
type=str,
default="grpc://localhost:8001",
help="Inference server URL in format protocol://host[:port] (default grpc://localhost:8001)",
)
parser.add_argument(
"--load-model",
choices=["none", "poll", "explicit"],
help="Loading model while Triton Server is in given model control mode",
)
parser.add_argument(
"--timeout", default=120, help="Timeout in seconds to wait till model load (default=120)", type=int
)
# optimization related
parser.add_argument(
"--backend-accelerator",
type=str,
choices=_available_enum_values(Accelerator),
default=Accelerator.TRT.value,
help="Select Backend Accelerator used to serve model",
)
parser.add_argument("--number-of-model-instances", type=int, default=1, help="Number of model instances per GPU")
parser.add_argument(
"--preferred-batch-sizes",
type=int,
nargs="*",
help="Batch sizes that the dynamic batcher should attempt to create. "
"In case --max-queue-delay-us is set and this parameter is not, default value will be --max-batch-size",
)
parser.add_argument(
"--max-queue-delay-us",
type=int,
default=0,
help="Max delay time which dynamic batcher shall wait to form a batch (default 0)",
)
parser.add_argument(
"--capture-cuda-graph",
type=int,
default=0,
help="Use cuda capture graph (used only by TensorRT platform)",
)
parser.add_argument("-v", "--verbose", help="Provide verbose logs", type=str2bool, default=False)
args = parser.parse_args()
set_logger(verbose=args.verbose)
log_dict("args", vars(args))
config = ModelConfig.create(
model_path=args.model_path,
# model definition
model_name=args.model_name,
model_version=args.model_version,
model_format=args.model_format,
precision=args.precision,
max_batch_size=args.max_batch_size,
# optimization
accelerator=args.backend_accelerator,
gpu_engine_count=args.number_of_model_instances,
preferred_batch_sizes=args.preferred_batch_sizes or [],
max_queue_delay_us=args.max_queue_delay_us,
capture_cuda_graph=args.capture_cuda_graph,
)
model_store = TritonModelStore(args.model_repository)
model_store.deploy_model(model_config=config, model_path=args.model_path)
if args.load_model != "none":
client = TritonClient(server_url=args.server_url, verbose=args.verbose)
client.wait_for_server_ready(timeout=args.timeout)
if args.load_model == "explicit":
client.load_model(model_name=args.model_name)
if args.load_model == "poll":
time.sleep(15)
client.wait_for_model(model_name=args.model_name, model_version=args.model_version, timeout_s=args.timeout)
if __name__ == "__main__":
main()

View file

@ -0,0 +1,166 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
The `convert_model.py` script converts between model formats, with additional model optimizations
for faster inference.
For Python module inputs, it converts the model returned by the `get_model` function.
Currently supported input and output formats are:
- inputs
- `tf-estimator` - `get_model` function returning Tensorflow Estimator
- `tf-keras` - `get_model` function returning Tensorflow Keras Model
- `tf-savedmodel` - Tensorflow SavedModel binary
- `pyt` - `get_model` function returning PyTorch Module
- output
- `tf-savedmodel` - Tensorflow saved model
- `tf-trt` - TF-TRT saved model
- `ts-trace` - PyTorch traced ScriptModule
- `ts-script` - PyTorch scripted ScriptModule
- `onnx` - ONNX
- `trt` - TensorRT plan file
For tf-keras input you can use:
- the `--large-model` flag, which helps load models that exceed the maximum protobuf size of 2GB
- the `--tf-allow-growth` flag, which controls the GPU memory growth limiting feature
(https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth). By default it is disabled.
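An example conversion of a PyTorch model defined in a Python module to ONNX could look like the
sketch below (paths are placeholders; additional model- and dataloader-specific flags, for example
`--precision`, are added automatically from the `get_model` and dataloader function signatures):
```shell script
python ./triton/convert_model.py \
    --input-path ./triton/model.py \
    --input-type pyt \
    --output-path ./model.onnx \
    --output-type onnx \
    --dataloader ./triton/dataloader.py
```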
"""
import argparse
import logging
import os
from pathlib import Path
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["TF_ENABLE_DEPRECATION_WARNINGS"] = "1"
# method from PEP-366 to support relative import in executed modules
if __name__ == "__main__" and __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import (
DATALOADER_FN_NAME,
BaseConverter,
BaseLoader,
BaseSaver,
Format,
Precision,
load_from_file,
)
from .deployment_toolkit.extensions import converters, loaders, savers
LOGGER = logging.getLogger("convert_model")
INPUT_MODEL_TYPES = [Format.TF_ESTIMATOR, Format.TF_KERAS, Format.TF_SAVEDMODEL, Format.PYT]
OUTPUT_MODEL_TYPES = [Format.TF_SAVEDMODEL, Format.TF_TRT, Format.ONNX, Format.TRT, Format.TS_TRACE, Format.TS_SCRIPT]
def _get_args():
parser = argparse.ArgumentParser(description="Script for conversion between model formats.", allow_abbrev=False)
parser.add_argument("--input-path", help="Path to input model file (python module or binary file)", required=True)
parser.add_argument(
"--input-type", help="Input model type", choices=[f.value for f in INPUT_MODEL_TYPES], required=True
)
parser.add_argument("--output-path", help="Path to output model file", required=True)
parser.add_argument(
"--output-type", help="Output model type", choices=[f.value for f in OUTPUT_MODEL_TYPES], required=True
)
parser.add_argument("--dataloader", help="Path to python module containing data loader")
parser.add_argument("-v", "--verbose", help="Verbose logs", action="store_true", default=False)
parser.add_argument(
"--ignore-unknown-parameters",
help="Ignore unknown parameters (argument often used in CI where set of arguments is constant)",
action="store_true",
default=False,
)
args, unparsed_args = parser.parse_known_args()
Loader: BaseLoader = loaders.get(args.input_type)
ArgParserGenerator(Loader, module_path=args.input_path).update_argparser(parser)
converter_name = f"{args.input_type}--{args.output_type}"
Converter: BaseConverter = converters.get(converter_name)
if Converter is not None:
ArgParserGenerator(Converter).update_argparser(parser)
Saver: BaseSaver = savers.get(args.output_type)
ArgParserGenerator(Saver).update_argparser(parser)
if args.dataloader is not None:
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
ArgParserGenerator(get_dataloader_fn).update_argparser(parser)
if args.ignore_unknown_parameters:
args, unknown_args = parser.parse_known_args()
LOGGER.warning(f"Got additional args {unknown_args}")
else:
args = parser.parse_args()
return args
def main():
args = _get_args()
log_level = logging.INFO if not args.verbose else logging.DEBUG
log_format = "%(asctime)s %(levelname)s %(name)s %(message)s"
logging.basicConfig(level=log_level, format=log_format)
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
requested_model_precision = Precision(args.precision)
dataloader_fn = None
# if conversion is required, temporary change model load precision to that required by converter
# it is for TensorRT converters which require fp32 models for all requested precisions
converter_name = f"{args.input_type}--{args.output_type}"
Converter: BaseConverter = converters.get(converter_name)
if Converter:
args.precision = Converter.required_source_model_precision(requested_model_precision).value
Loader: BaseLoader = loaders.get(args.input_type)
loader = ArgParserGenerator(Loader, module_path=args.input_path).from_args(args)
model = loader.load(args.input_path)
LOGGER.info("inputs: %s", model.inputs)
LOGGER.info("outputs: %s", model.outputs)
if Converter: # if conversion is needed
# dataloader must match source model precision - so not recovering it yet
if args.dataloader is not None:
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
dataloader_fn = ArgParserGenerator(get_dataloader_fn).from_args(args)
# recover precision to that requested by user
args.precision = requested_model_precision.value
if Converter:
converter = ArgParserGenerator(Converter).from_args(args)
model = converter.convert(model, dataloader_fn=dataloader_fn)
Saver: BaseSaver = savers.get(args.output_type)
saver = ArgParserGenerator(Saver).from_args(args)
saver.save(model, args.output_path)
return 0
if __name__ == "__main__":
main()

View file

@ -0,0 +1,49 @@
import logging
from pathlib import Path
import numpy as np
from PIL import Image
LOGGER = logging.getLogger(__name__)
def get_dataloader_fn(
*, data_dir: str, batch_size: int = 1, width: int = 224, height: int = 224, images_num: int = None,
precision: str = "fp32", classes: int = 1000
):
def _dataloader():
image_extensions = [".gif", ".png", ".jpeg", ".jpg"]
image_paths = sorted([p for p in Path(data_dir).rglob("*") if p.suffix.lower() in image_extensions])
if images_num is not None:
image_paths = image_paths[:images_num]
LOGGER.info(
f"Creating PIL dataloader on data_dir={data_dir} #images={len(image_paths)} "
f"image_size=({width}, {height}) batch_size={batch_size}"
)
onehot = np.eye(classes)
batch = []
for image_path in image_paths:
img = Image.open(image_path.as_posix()).convert("RGB")
img = img.resize((width, height))
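# scale to [0, 1] and normalize with the standard ImageNet mean/std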
img = (np.array(img).astype(np.float32) / 255) - np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(1, 1, 3)
img = img / np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(1, 1, 3)
true_class = np.array([int(image_path.parent.name)])
assert tuple(img.shape) == (height, width, 3)
img = img[np.newaxis, ...]
batch.append((img, image_path.as_posix(), true_class))
if len(batch) >= batch_size:
ids = [image_path for _, image_path, *_ in batch]
x = {"INPUT__0": np.ascontiguousarray(
np.transpose(np.concatenate([img for img, *_ in batch]),
(0, 3, 1, 2)).astype(np.float32 if precision == "fp32" else np.float16))}
y_real = {"OUTPUT__0": onehot[np.concatenate([class_ for *_, class_ in batch])].astype(
np.float32 if precision == "fp32" else np.float16
)}
batch = []
yield ids, x, y_real
return _dataloader

View file

@ -0,0 +1 @@
0.5.0-2-gd556907

View file

@ -0,0 +1,13 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View file

@ -0,0 +1,124 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import inspect
import logging
from typing import Any, Callable, Dict, Optional, Union
from .core import GET_ARGPARSER_FN_NAME, load_from_file
LOGGER = logging.getLogger(__name__)
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ("yes", "true", "t", "y", "1"):
return True
elif v.lower() in ("no", "false", "f", "n", "0"):
return False
else:
raise argparse.ArgumentTypeError("Boolean value expected.")
def filter_fn_args(args: Union[dict, argparse.Namespace], fn: Callable) -> dict:
signature = inspect.signature(fn)
parameters_names = list(signature.parameters)
if isinstance(args, argparse.Namespace):
args = vars(args)
args = {k: v for k, v in args.items() if k in parameters_names}
return args
def add_args_for_fn_signature(parser, fn) -> argparse.ArgumentParser:
parser.conflict_handler = "resolve"
signature = inspect.signature(fn)
for parameter in signature.parameters.values():
if parameter.name in ["self", "args", "kwargs"]:
continue
argument_kwargs = {}
if parameter.annotation != inspect.Parameter.empty:
if parameter.annotation == bool:
argument_kwargs["type"] = str2bool
argument_kwargs["choices"] = [0, 1]
elif isinstance(parameter.annotation, type(Optional[Any])):
types = [type_ for type_ in parameter.annotation.__args__ if not isinstance(None, type_)]
if len(types) != 1:
raise RuntimeError(
f"Could not prepare argument parser for {parameter.name}: {parameter.annotation} in {fn}"
)
argument_kwargs["type"] = types[0]
else:
argument_kwargs["type"] = parameter.annotation
if parameter.default != inspect.Parameter.empty:
if parameter.annotation == bool:
argument_kwargs["default"] = str2bool(parameter.default)
else:
argument_kwargs["default"] = parameter.default
else:
argument_kwargs["required"] = True
name = parameter.name.replace("_", "-")
LOGGER.debug(f"Adding argument {name} with {argument_kwargs}")
parser.add_argument(f"--{name}", **argument_kwargs)
return parser
class ArgParserGenerator:
def __init__(self, cls_or_fn, module_path: Optional[str] = None):
self._cls_or_fn = cls_or_fn
self._handle = cls_or_fn if inspect.isfunction(cls_or_fn) else getattr(cls_or_fn, "__init__")
input_is_python_file = module_path and module_path.endswith(".py")
self._input_path = module_path if input_is_python_file else None
self._required_fn_name_for_signature_parsing = getattr(
cls_or_fn, "required_fn_name_for_signature_parsing", None
)
def update_argparser(self, parser):
name = self._handle.__name__
group_parser = parser.add_argument_group(name)
add_args_for_fn_signature(group_parser, fn=self._handle)
self._update_argparser(group_parser)
def get_args(self, args: argparse.Namespace):
filtered_args = filter_fn_args(args, fn=self._handle)
tmp_parser = argparse.ArgumentParser(allow_abbrev=False)
self._update_argparser(tmp_parser)
custom_names = [
p.dest.replace("-", "_") for p in tmp_parser._actions if not isinstance(p, argparse._HelpAction)
]
custom_params = {n: getattr(args, n) for n in custom_names}
filtered_args = {**filtered_args, **custom_params}
return filtered_args
def from_args(self, args: Union[argparse.Namespace, Dict]):
args = self.get_args(args)
LOGGER.info(f"Initializing {self._cls_or_fn.__name__}({args})")
return self._cls_or_fn(**args)
def _update_argparser(self, parser):
label = "argparser_update"
if self._input_path:
update_argparser_handle = load_from_file(self._input_path, label=label, target=GET_ARGPARSER_FN_NAME)
if update_argparser_handle:
update_argparser_handle(parser)
elif self._required_fn_name_for_signature_parsing:
fn_handle = load_from_file(
self._input_path, label=label, target=self._required_fn_name_for_signature_parsing
)
if fn_handle:
add_args_for_fn_signature(parser, fn_handle)

View file

@ -0,0 +1,13 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View file

@ -0,0 +1,237 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from pathlib import Path
from typing import Dict, Optional, Union
import numpy as np
# pytype: disable=import-error
import onnx
import onnx.optimizer
import onnx.shape_inference
import onnxruntime
from google.protobuf import text_format
from onnx.mapping import TENSOR_TYPE_TO_NP_TYPE
# pytype: enable=import-error
from ..core import BaseLoader, BaseRunner, BaseRunnerSession, BaseSaver, Format, Model, Precision, TensorSpec
from ..extensions import loaders, runners, savers
from .utils import infer_precision
LOGGER = logging.getLogger(__name__)
def _value_info2tensor_spec(value_info: onnx.ValueInfoProto):
onnx_data_type_map = {"float": "float32", "double": "float64"}
elem_type_name = onnx.TensorProto.DataType.Name(value_info.type.tensor_type.elem_type).lower()
dtype = onnx_data_type_map.get(elem_type_name, elem_type_name)
def _get_dim(dim):
which = dim.WhichOneof("value")
if which is not None: # which is None when dim is None
dim = getattr(dim, which)
return None if isinstance(dim, (str, bytes)) else dim
shape = value_info.type.tensor_type.shape
shape = tuple([_get_dim(d) for d in shape.dim])
return TensorSpec(value_info.name, dtype=dtype, shape=shape)
def _infer_graph_precision(onnx_graph: onnx.GraphProto) -> Optional[Precision]:
import networkx as nx
# build directed graph
nx_graph = nx.DiGraph()
def _get_dtype(vi):
t = vi.type
if hasattr(t, "tensor_type"):
type_id = t.tensor_type.elem_type
else:
raise NotImplementedError("Not implemented yet")
return TENSOR_TYPE_TO_NP_TYPE[type_id]
node_output2type = {vi.name: _get_dtype(vi) for vi in onnx_graph.value_info}
node_outputs2node = {output_name: node for node in onnx_graph.node for output_name in node.output}
node_inputs2node = {input_name: node for node in onnx_graph.node for input_name in node.input}
for node in onnx_graph.node:
node_dtype = node_output2type.get("+".join(node.output), None)
nx_graph.add_node(
node.name,
op=node.op_type,
attr={a.name: a for a in node.attribute},
dtype=node_dtype,
)
for input_name in node.input:
prev_node = node_outputs2node.get(input_name, None)
if prev_node:
nx_graph.add_edge(prev_node.name, node.name)
for input_node in onnx_graph.input:
input_name = input_node.name
nx_graph.add_node(input_name, op="input", dtype=_get_dtype(input_node))
next_node = node_inputs2node.get(input_name, None)
if next_node:
nx_graph.add_edge(input_name, next_node.name)
for output in onnx_graph.output:
output_name = output.name
nx_graph.add_node(output_name, op="output", dtype=_get_dtype(output))
prev_node = node_outputs2node.get(output_name, None)
if prev_node:
nx_graph.add_edge(prev_node.name, output_name)
else:
LOGGER.warning(f"Could not find previous node for {output_name}")
input_names = [n.name for n in onnx_graph.input]
output_names = [n.name for n in onnx_graph.output]
most_common_dtype = infer_precision(nx_graph, input_names, output_names, lambda node: node.get("dtype", None))
if most_common_dtype is not None:
precision = {np.dtype("float32"): Precision.FP32, np.dtype("float16"): Precision.FP16}[most_common_dtype]
else:
precision = None
return precision
class OnnxLoader(BaseLoader):
def load(self, model_path: Union[str, Path], **_) -> Model:
if isinstance(model_path, Path):
model_path = model_path.as_posix()
model = onnx.load(model_path)
onnx.checker.check_model(model)
onnx.helper.strip_doc_string(model)
model = onnx.shape_inference.infer_shapes(model)
# TODO: probably modification of onnx model ios causes error on optimize
# from onnx.utils import polish_model
# model = polish_model(model) # run checker, docs strip, optimizer and shape inference
inputs = {vi.name: _value_info2tensor_spec(vi) for vi in model.graph.input}
outputs = {vi.name: _value_info2tensor_spec(vi) for vi in model.graph.output}
precision = _infer_graph_precision(model.graph)
return Model(model, precision, inputs, outputs)
class OnnxSaver(BaseSaver):
def __init__(self, as_text: bool = False):
self._as_text = as_text
def save(self, model: Model, model_path: Union[str, Path]) -> None:
model_path = Path(model_path)
LOGGER.debug(f"Saving ONNX model to {model_path.as_posix()}")
model_path.parent.mkdir(parents=True, exist_ok=True)
onnx_model: onnx.ModelProto = model.handle
if self._as_text:
with model_path.open("w") as f:
f.write(text_format.MessageToString(onnx_model))
else:
with model_path.open("wb") as f:
f.write(onnx_model.SerializeToString())
"""
ExecutionProviders on onnxruntime 1.4.0
['TensorrtExecutionProvider',
'CUDAExecutionProvider',
'MIGraphXExecutionProvider',
'NGRAPHExecutionProvider',
'OpenVINOExecutionProvider',
'DnnlExecutionProvider',
'NupharExecutionProvider',
'VitisAIExecutionProvider',
'ArmNNExecutionProvider',
'ACLExecutionProvider',
'CPUExecutionProvider']
"""
def _check_providers(providers):
providers = providers or []
if not isinstance(providers, (list, tuple)):
providers = [providers]
available_providers = onnxruntime.get_available_providers()
unavailable = set(providers) - set(available_providers)
if unavailable:
raise RuntimeError(f"Unavailable providers {unavailable}")
return providers
class OnnxRunner(BaseRunner):
def __init__(self, verbose_runtime_logs: bool = False):
self._providers = None
self._verbose_runtime_logs = verbose_runtime_logs
def init_inference(self, model: Model):
assert isinstance(model.handle, onnx.ModelProto)
return OnnxRunnerSession(
model=model, providers=self._providers, verbose_runtime_logs=self._verbose_runtime_logs
)
class OnnxRunnerSession(BaseRunnerSession):
def __init__(self, model: Model, providers, verbose_runtime_logs: bool = False):
super().__init__(model)
self._input_names = None
self._output_names = None
self._session = None
self._providers = providers
self._verbose_runtime_logs = verbose_runtime_logs
self._old_env_values = {}
def __enter__(self):
self._old_env_values = self._set_env_variables()
sess_options = onnxruntime.SessionOptions() # default session options
if self._verbose_runtime_logs:
sess_options.log_severity_level = 0
sess_options.log_verbosity_level = 1
LOGGER.info(
f"Starting inference session for onnx model providers={self._providers} sess_options={sess_options}"
)
self._input_names = list(self._model.inputs)
self._output_names = list(self._model.outputs)
model_payload = self._model.handle.SerializeToString()
self._session = onnxruntime.InferenceSession(
model_payload, providers=self._providers, sess_options=sess_options
)
return self
def __exit__(self, exc_type, exc_value, traceback):
self._input_names = None
self._output_names = None
self._session = None
self._recover_env_variables(self._old_env_values)
def __call__(self, x: Dict[str, object]):
feed_dict = {k: x[k] for k in self._input_names}
y_pred = self._session.run(self._output_names, feed_dict)
y_pred = dict(zip(self._output_names, y_pred))
return y_pred
loaders.register_extension(Format.ONNX.value, OnnxLoader)
runners.register_extension(Format.ONNX.value, OnnxRunner)
savers.register_extension(Format.ONNX.value, OnnxSaver)

View file

@ -0,0 +1,114 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import Dict, Iterable, Optional
# pytype: disable=import-error
import onnx
import tensorrt as trt
from ..core import BaseConverter, Format, Model, Precision, ShapeSpec
from ..extensions import converters
from .utils import get_input_shapes
# pytype: enable=import-error
LOGGER = logging.getLogger(__name__)
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
class Onnx2TRTConverter(BaseConverter):
def __init__(self, *, max_batch_size: int, max_workspace_size: int, precision: str):
self._max_batch_size = max_batch_size
self._max_workspace_size = max_workspace_size
self._precision = Precision(precision)
def convert(self, model: Model, dataloader_fn) -> Model:
input_shapes = get_input_shapes(dataloader_fn(), self._max_batch_size)
cuda_engine = onnx2trt(
model.handle,
shapes=input_shapes,
max_workspace_size=self._max_workspace_size,
max_batch_size=self._max_batch_size,
model_precision=self._precision.value,
)
return model._replace(handle=cuda_engine)
@staticmethod
def required_source_model_precision(requested_model_precision: Precision) -> Precision:
# TensorRT requires source models to be in FP32 precision
return Precision.FP32
def onnx2trt(
onnx_model: onnx.ModelProto,
*,
shapes: Dict[str, ShapeSpec],
max_workspace_size: int,
max_batch_size: int,
model_precision: str,
) -> "trt.ICudaEngine":
"""
Converts onnx model to TensorRT ICudaEngine
Args:
onnx_model: onnx.Model to convert
shapes: dictionary containing min shape, max shape, opt shape for each input name
max_workspace_size: The maximum GPU temporary memory which the CudaEngine can use at execution time.
max_batch_size: The maximum batch size which can be used at execution time,
and also the batch size for which the CudaEngine will be optimized.
model_precision: precision of kernels (possible values: fp16, fp32)
Returns: TensorRT ICudaEngine
"""
# Whether or not 16-bit kernels are permitted.
# During :class:`ICudaEngine` build fp16 kernels will also be tried when this mode is enabled.
fp16_mode = "16" in model_precision
builder = trt.Builder(TRT_LOGGER)
builder.fp16_mode = fp16_mode
builder.max_batch_size = max_batch_size
builder.max_workspace_size = max_workspace_size
# In TensorRT 7.0, the ONNX parser only supports full-dimensions mode,
# meaning that your network definition must be created with the explicitBatch flag set.
# For more information, see
# https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes
flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flags)
with trt.OnnxParser(network, TRT_LOGGER) as parser:
# onnx model parsing
if not parser.parse(onnx_model.SerializeToString()):
for i in range(parser.num_errors):
LOGGER.error(f"OnnxParser error {i}/{parser.num_errors}: {parser.get_error(i)}")
raise RuntimeError("Error during parsing ONNX model (see logs for details)")
# optimization
config = builder.create_builder_config()
config.flags |= bool(fp16_mode) << int(trt.BuilderFlag.FP16)
config.max_workspace_size = max_workspace_size
profile = builder.create_optimization_profile()
for name, spec in shapes.items():
profile.set_shape(name, **spec._asdict())
config.add_optimization_profile(profile)
engine = builder.build_engine(network, config=config)
return engine
converters.register_extension(f"{Format.ONNX.value}--{Format.TRT.value}", Onnx2TRTConverter)

View file

@ -0,0 +1,358 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, NamedTuple, Optional, Union
import torch # pytype: disable=import-error
import yaml
from ..core import (
GET_MODEL_FN_NAME,
BaseConverter,
BaseLoader,
BaseRunner,
BaseRunnerSession,
BaseSaver,
Format,
Model,
Precision,
TensorSpec,
load_from_file,
)
from ..extensions import converters, loaders, runners, savers
from .utils import get_dynamic_axes, get_input_shapes, get_shapes_with_dynamic_axes
LOGGER = logging.getLogger(__name__)
class InputOutputSpec(NamedTuple):
inputs: Dict[str, TensorSpec]
outputs: Dict[str, TensorSpec]
def get_sample_input(dataloader, device):
for batch in dataloader:
_, x, _ = batch
break
if isinstance(x, dict):
sample_input = list(x.values())
elif isinstance(x, list):
sample_input = x
else:
raise TypeError("The first element (x) of batch returned by dataloader must be a list or a dict")
for idx, s in enumerate(sample_input):
sample_input[idx] = torch.from_numpy(s).to(device)
return tuple(sample_input)
def get_model_device(torch_model):
if next(torch_model.parameters()).is_cuda:
return "cuda"
else:
return "cpu"
def infer_model_precision(model):
counter = Counter()
for param in model.parameters():
counter[param.dtype] += 1
if counter[torch.float16] > 0:
return Precision.FP16
else:
return Precision.FP32
def _get_tensor_dtypes(dataloader, precision):
def _get_dtypes(t):
dtypes = {}
for k, v in t.items():
dtype = str(v.dtype)
if dtype == "float64":
dtype = "float32"
if precision == Precision.FP16 and dtype == "float32":
dtype = "float16"
dtypes[k] = dtype
return dtypes
input_dtypes = {}
output_dtypes = {}
for batch in dataloader:
_, x, y = batch
input_dtypes = _get_dtypes(x)
output_dtypes = _get_dtypes(y)
break
return input_dtypes, output_dtypes
### TODO assumption: floating point input
### type has same precision as the model
def _get_io_spec(model, dataloader_fn):
precision = model.precision
dataloader = dataloader_fn()
input_dtypes, output_dtypes = _get_tensor_dtypes(dataloader, precision)
input_shapes, output_shapes = get_shapes_with_dynamic_axes(dataloader)
inputs = {
name: TensorSpec(name=name, dtype=input_dtypes[name], shape=tuple(input_shapes[name])) for name in model.inputs
}
outputs = {
name: TensorSpec(name=name, dtype=output_dtypes[name], shape=tuple(output_shapes[name]))
for name in model.outputs
}
return InputOutputSpec(inputs, outputs)
class PyTorchModelLoader(BaseLoader):
required_fn_name_for_signature_parsing: Optional[str] = GET_MODEL_FN_NAME
def __init__(self, **kwargs):
self._model_args = kwargs
def load(self, model_path: Union[str, Path], **_) -> Model:
if isinstance(model_path, Path):
model_path = model_path.as_posix()
get_model = load_from_file(model_path, "model", GET_MODEL_FN_NAME)
model, tensor_infos = get_model(**self._model_args)
io_spec = InputOutputSpec(tensor_infos["inputs"], tensor_infos["outputs"])
precision = infer_model_precision(model)
return Model(handle=model, precision=precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class TorchScriptLoader(BaseLoader):
def __init__(self, tensor_names_path: str = None, **kwargs):
self._model_args = kwargs
self._io_spec = None
if tensor_names_path is not None:
with Path(tensor_names_path).open("r") as fh:
tensor_infos = yaml.load(fh, Loader=yaml.SafeLoader)
self._io_spec = InputOutputSpec(tensor_infos["inputs"], tensor_infos["outputs"])
def load(self, model_path: Union[str, Path], **_) -> Model:
if not isinstance(model_path, Path):
model_path = Path(model_path)
model = torch.jit.load(model_path.as_posix())
precision = infer_model_precision(model)
io_spec = self._io_spec
if not io_spec:
yaml_path = model_path.parent / f"{model_path.stem}.yaml"
if not yaml_path.is_file():
raise ValueError(
f"If `--tensor-names-path is not provided, "
f"TorchScript model loader expects file {yaml_path} with tensor information."
)
with yaml_path.open("r") as fh:
tensor_info = yaml.load(fh, Loader=yaml.SafeLoader)
io_spec = InputOutputSpec(tensor_info["inputs"], tensor_info["outputs"])
return Model(handle=model, precision=precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class TorchScriptTraceConverter(BaseConverter):
def __init__(self):
pass
def convert(self, model: Model, dataloader_fn) -> Model:
device = get_model_device(model.handle)
dummy_input = get_sample_input(dataloader_fn(), device)
converted_model = torch.jit.trace_module(model.handle, {"forward": dummy_input})
io_spec = _get_io_spec(model, dataloader_fn)
return Model(converted_model, precision=model.precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class TorchScriptScriptConverter(BaseConverter):
def __init__(self):
pass
def convert(self, model: Model, dataloader_fn) -> Model:
converted_model = torch.jit.script(model.handle)
io_spec = _get_io_spec(model, dataloader_fn)
return Model(converted_model, precision=model.precision, inputs=io_spec.inputs, outputs=io_spec.outputs)
class PYT2ONNXConverter(BaseConverter):
def __init__(self, onnx_opset: int = None):
self._onnx_opset = onnx_opset
def convert(self, model: Model, dataloader_fn) -> Model:
import tempfile
import onnx # pytype: disable=import-error
assert isinstance(model.handle, torch.jit.ScriptModule) or isinstance(
model.handle, torch.nn.Module
), "The model must be of type 'torch.jit.ScriptModule' or 'torch.nn.Module'. Converter aborted."
dynamic_axes = get_dynamic_axes(dataloader_fn())
device = get_model_device(model.handle)
dummy_input = get_sample_input(dataloader_fn(), device)
with tempfile.TemporaryDirectory() as tmpdirname:
export_path = os.path.join(tmpdirname, "model.onnx")
with torch.no_grad():
torch.onnx.export(
model.handle,
dummy_input,
export_path,
do_constant_folding=True,
input_names=list(model.inputs),
output_names=list(model.outputs),
dynamic_axes=dynamic_axes,
opset_version=self._onnx_opset,
enable_onnx_checker=True,
)
onnx_model = onnx.load(export_path)
onnx.checker.check_model(onnx_model)
onnx.helper.strip_doc_string(onnx_model)
onnx_model = onnx.shape_inference.infer_shapes(onnx_model)
return Model(
handle=onnx_model,
precision=model.precision,
inputs=model.inputs,
outputs=model.outputs,
)
class PYT2TensorRTConverter(BaseConverter):
def __init__(self, max_batch_size: int, max_workspace_size: int, onnx_opset: int, precision: str):
self._max_batch_size = max_batch_size
self._max_workspace_size = max_workspace_size
self._onnx_opset = onnx_opset
self._precision = Precision(precision)
def convert(self, model: Model, dataloader_fn) -> Model:
from .onnx import _infer_graph_precision
from .onnx2trt_conv import onnx2trt
pyt2onnx_converter = PYT2ONNXConverter(self._onnx_opset)
onnx_model = pyt2onnx_converter.convert(model, dataloader_fn).handle
precision = _infer_graph_precision(onnx_model.graph)
input_shapes = get_input_shapes(dataloader_fn(), self._max_batch_size)
cuda_engine = onnx2trt(
onnx_model,
shapes=input_shapes,
max_workspace_size=self._max_workspace_size,
max_batch_size=self._max_batch_size,
model_precision=self._precision.value,
)
return Model(
handle=cuda_engine,
precision=model.precision,
inputs=model.inputs,
outputs=model.outputs,
)
@staticmethod
def required_source_model_precision(requested_model_precision: Precision) -> Precision:
# TensorRT requires source models to be in FP32 precision
return Precision.FP32
class TorchScriptSaver(BaseSaver):
def save(self, model: Model, model_path: Union[str, Path]) -> None:
if not isinstance(model_path, Path):
model_path = Path(model_path)
if isinstance(model.handle, torch.jit.ScriptModule):
torch.jit.save(model.handle, model_path.as_posix())
else:
print("The model must be of type 'torch.jit.ScriptModule'. Saving aborted.")
assert False # temporary error handling
def _format_tensor_spec(tensor_spec):
# wrapping shape with list and whole tensor_spec with dict() is required for correct yaml dump
tensor_spec = tensor_spec._replace(shape=list(tensor_spec.shape))
tensor_spec = dict(tensor_spec._asdict())
return tensor_spec
# store TensorSpecs from inputs and outputs in a yaml file
tensor_specs = {
"inputs": {k: _format_tensor_spec(v) for k, v in model.inputs.items()},
"outputs": {k: _format_tensor_spec(v) for k, v in model.outputs.items()},
}
yaml_path = model_path.parent / f"{model_path.stem}.yaml"
with Path(yaml_path).open("w") as fh:
yaml.dump(tensor_specs, fh, indent=4)
class PyTorchRunner(BaseRunner):
def __init__(self):
pass
def init_inference(self, model: Model):
return PyTorchRunnerSession(model=model)
class PyTorchRunnerSession(BaseRunnerSession):
def __init__(self, model: Model):
super().__init__(model)
assert isinstance(model.handle, torch.jit.ScriptModule) or isinstance(
model.handle, torch.nn.Module
), "The model must be of type 'torch.jit.ScriptModule' or 'torch.nn.Module'. Runner aborted."
self._model = model
self._output_names = None
def __enter__(self):
self._output_names = list(self._model.outputs)
return self
def __exit__(self, exc_type, exc_value, traceback):
self._output_names = None
self._model = None
def __call__(self, x: Dict[str, object]):
with torch.no_grad():
feed_list = [torch.from_numpy(v).cuda() for k, v in x.items()]
y_pred = self._model.handle(*feed_list)
if isinstance(y_pred, torch.Tensor):
y_pred = (y_pred,)
y_pred = [t.cpu().numpy() for t in y_pred]
y_pred = dict(zip(self._output_names, y_pred))
return y_pred
loaders.register_extension(Format.PYT.value, PyTorchModelLoader)
loaders.register_extension(Format.TS_TRACE.value, TorchScriptLoader)
loaders.register_extension(Format.TS_SCRIPT.value, TorchScriptLoader)
converters.register_extension(f"{Format.PYT.value}--{Format.TS_SCRIPT.value}", TorchScriptScriptConverter)
converters.register_extension(f"{Format.PYT.value}--{Format.TS_TRACE.value}", TorchScriptTraceConverter)
converters.register_extension(f"{Format.PYT.value}--{Format.ONNX.value}", PYT2ONNXConverter)
converters.register_extension(f"{Format.PYT.value}--{Format.TRT.value}", PYT2TensorRTConverter)
savers.register_extension(Format.TS_SCRIPT.value, TorchScriptSaver)
savers.register_extension(Format.TS_TRACE.value, TorchScriptSaver)
runners.register_extension(Format.PYT.value, PyTorchRunner)
runners.register_extension(Format.TS_SCRIPT.value, PyTorchRunner)
runners.register_extension(Format.TS_TRACE.value, PyTorchRunner)

View file

@ -0,0 +1,216 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import sys
from pathlib import Path
from typing import Dict, NamedTuple, Optional, Union
import numpy as np
# pytype: disable=import-error
try:
import pycuda.autoinit
import pycuda.driver as cuda
except (ImportError, Exception) as e:
logging.getLogger(__name__).warning(f"Problems with importing pycuda package; {e}")
# pytype: enable=import-error
import tensorrt as trt # pytype: disable=import-error
from ..core import BaseLoader, BaseRunner, BaseRunnerSession, BaseSaver, Format, Model, Precision, TensorSpec
from ..extensions import loaders, runners, savers
LOGGER = logging.getLogger(__name__)
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
"""
documentation:
https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#python_samples_section
"""
class TensorRTLoader(BaseLoader):
def load(self, model_path: Union[str, Path], **_) -> Model:
model_path = Path(model_path)
LOGGER.debug(f"Loading TensorRT engine from {model_path}")
with model_path.open("rb") as fh, trt.Runtime(TRT_LOGGER) as runtime:
engine = runtime.deserialize_cuda_engine(fh.read())
if engine is None:
raise RuntimeError(f"Could not load ICudaEngine from {model_path}")
inputs = {}
outputs = {}
for binding_idx in range(engine.num_bindings):
name = engine.get_binding_name(binding_idx)
is_input = engine.binding_is_input(binding_idx)
dtype = engine.get_binding_dtype(binding_idx)
shape = engine.get_binding_shape(binding_idx)
if is_input:
inputs[name] = TensorSpec(name, dtype, shape)
else:
outputs[name] = TensorSpec(name, dtype, shape)
return Model(engine, None, inputs, outputs)
class TensorRTSaver(BaseSaver):
def __init__(self):
pass
def save(self, model: Model, model_path: Union[str, Path]) -> None:
model_path = Path(model_path)
LOGGER.debug(f"Saving TensorRT engine to {model_path.as_posix()}")
model_path.parent.mkdir(parents=True, exist_ok=True)
engine: "trt.ICudaEngine" = model.handle
with model_path.open("wb") as fh:
fh.write(engine.serialize())
class TRTBuffers(NamedTuple):
x_host: Optional[Dict[str, object]]
x_dev: Dict[str, object]
y_pred_host: Dict[str, object]
y_pred_dev: Dict[str, object]
class TensorRTRunner(BaseRunner):
def __init__(self):
pass
def init_inference(self, model: Model):
return TensorRTRunnerSession(model=model)
class TensorRTRunnerSession(BaseRunnerSession):
def __init__(self, model: Model):
super().__init__(model)
assert isinstance(model.handle, trt.ICudaEngine)
self._model = model
self._has_dynamic_shapes = None
self._context = None
self._engine: trt.ICudaEngine = self._model.handle
self._cuda_context = pycuda.autoinit.context
self._input_names = None
self._output_names = None
self._buffers = None
def __enter__(self):
self._context = self._engine.create_execution_context()
self._context.__enter__()
self._input_names = [
self._engine[idx] for idx in range(self._engine.num_bindings) if self._engine.binding_is_input(idx)
]
self._output_names = [
self._engine[idx] for idx in range(self._engine.num_bindings) if not self._engine.binding_is_input(idx)
]
# all_binding_shapes_specified is True for models without dynamic shapes
# so initially this variable is False for models with dynamic shapes
self._has_dynamic_shapes = not self._context.all_binding_shapes_specified
return self
def __exit__(self, exc_type, exc_value, traceback):
self._context.__exit__(exc_type, exc_value, traceback)
self._input_names = None
self._output_names = None
# TODO: are cuda buffers dealloc automatically?
self._buffers = None
def __call__(self, x):
buffers = self._prepare_buffers_if_needed(x)
bindings = self._update_bindings(buffers)
for name in self._input_names:
cuda.memcpy_htod(buffers.x_dev[name], buffers.x_host[name])
self._cuda_context.push()
self._context.execute_v2(bindings=bindings)
self._cuda_context.pop()
for name in self._output_names:
cuda.memcpy_dtoh(buffers.y_pred_host[name], buffers.y_pred_dev[name])
return buffers.y_pred_host
def _update_bindings(self, buffers: TRTBuffers):
bindings = [None] * self._engine.num_bindings
for name in buffers.y_pred_dev:
binding_idx: int = self._engine[name]
bindings[binding_idx] = buffers.y_pred_dev[name]
for name in buffers.x_dev:
binding_idx: int = self._engine[name]
bindings[binding_idx] = buffers.x_dev[name]
return bindings
def _set_dynamic_input_shapes(self, x_host):
def _is_shape_dynamic(input_shape):
return any([dim is None or dim == -1 for dim in input_shape])
for name in self._input_names:
bindings_idx = self._engine[name]
data_shape = x_host[name].shape # pytype: disable=attribute-error
if self._engine.is_shape_binding(bindings_idx):
input_shape = self._context.get_shape(bindings_idx)
if _is_shape_dynamic(input_shape):
self._context.set_shape_input(bindings_idx, data_shape)
else:
input_shape = self._engine.get_binding_shape(bindings_idx)
if _is_shape_dynamic(input_shape):
self._context.set_binding_shape(bindings_idx, data_shape)
assert self._context.all_binding_shapes_specified and self._context.all_shape_inputs_specified
def _prepare_buffers_if_needed(self, x_host: Dict[str, object]):
# pytype: disable=attribute-error
new_batch_size = list(x_host.values())[0].shape[0]
current_batch_size = list(self._buffers.y_pred_host.values())[0].shape[0] if self._buffers else 0
# pytype: enable=attribute-error
if self._has_dynamic_shapes or new_batch_size != current_batch_size:
# TODO: are CUDA buffers dealloc automatically?
self._set_dynamic_input_shapes(x_host)
y_pred_host = {}
for name in self._output_names:
shape = self._context.get_binding_shape(self._engine[name])
y_pred_host[name] = np.zeros(shape, dtype=trt.nptype(self._model.outputs[name].dtype))
y_pred_dev = {name: cuda.mem_alloc(data.nbytes) for name, data in y_pred_host.items()}
x_dev = {
name: cuda.mem_alloc(host_input.nbytes)
for name, host_input in x_host.items()
if name in self._input_names # pytype: disable=attribute-error
}
self._buffers = TRTBuffers(None, x_dev, y_pred_host, y_pred_dev)
return self._buffers._replace(x_host=x_host)
if "pycuda.driver" in sys.modules:
loaders.register_extension(Format.TRT.value, TensorRTLoader)
runners.register_extension(Format.TRT.value, TensorRTRunner)
savers.register_extension(Format.TRT.value, TensorRTSaver)
else:
LOGGER.warning("Do not register TensorRT extension due problems with importing pycuda.driver package.")

View file

@ -0,0 +1,121 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import Counter
from typing import Callable, Dict, List
import networkx as nx
from ..core import ShapeSpec
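# Heuristic used by the converters: the precision of a graph is taken to be the most common
# non-integer/non-bool node dtype (counted below).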
def infer_precision(
nx_graph: nx.Graph,
input_names: List[str],
output_names: List[str],
get_node_dtype_fn: Callable,
):
node_dtypes = [nx_graph.nodes[node_name].get("dtype", None) for node_name in nx_graph.nodes]
node_dtypes = [dt for dt in node_dtypes if dt is None or dt.kind not in ["i", "b"]]
dtypes_counter = Counter(node_dtypes)
return dtypes_counter.most_common()[0][0]
def get_shapes_with_dynamic_axes(dataloader, batch_size_dim=0):
def _set_dynamic_shapes(t, shapes):
for k, v in t.items():
shape = list(v.shape)
for dim, s in enumerate(shape):
if shapes[k][dim] != -1 and shapes[k][dim] != s:
shapes[k][dim] = -1
## get all shapes from input and output tensors
input_shapes = {}
output_shapes = {}
for batch in dataloader:
_, x, y = batch
for k, v in x.items():
input_shapes[k] = list(v.shape)
for k, v in y.items():
output_shapes[k] = list(v.shape)
break
# based on max <max_num_iters> iterations, check which
# dimensions differ to determine dynamic_axes
max_num_iters = 100
for idx, batch in enumerate(dataloader):
if idx >= max_num_iters:
break
_, x, y = batch
_set_dynamic_shapes(x, input_shapes)
_set_dynamic_shapes(y, output_shapes)
return input_shapes, output_shapes
def get_dynamic_axes(dataloader, batch_size_dim=0):
input_shapes, output_shapes = get_shapes_with_dynamic_axes(dataloader, batch_size_dim)
all_shapes = {**input_shapes, **output_shapes}
dynamic_axes = {}
for k, shape in all_shapes.items():
for idx, s in enumerate(shape):
if s == -1:
dynamic_axes[k] = {idx: k + "_" + str(idx)}
for k, v in all_shapes.items():
if k in dynamic_axes:
dynamic_axes[k].update({batch_size_dim: "batch_size_" + str(batch_size_dim)})
else:
dynamic_axes[k] = {batch_size_dim: "batch_size_" + str(batch_size_dim)}
return dynamic_axes
def get_input_shapes(dataloader, max_batch_size=1) -> Dict[str, ShapeSpec]:
def init_counters_and_shapes(x, counters, min_shapes, max_shapes):
for k, v in x.items():
counters[k] = Counter()
min_shapes[k] = [float("inf")] * v.ndim
max_shapes[k] = [float("-inf")] * v.ndim
counters = {}
min_shapes: Dict[str, tuple] = {}
max_shapes: Dict[str, tuple] = {}
for idx, batch in enumerate(dataloader):
ids, x, y = batch
if idx == 0:
init_counters_and_shapes(x, counters, min_shapes, max_shapes)
for k, v in x.items():
shape = v.shape
counters[k][shape] += 1
min_shapes[k] = tuple([min(a, b) for a, b in zip(min_shapes[k], shape)])
max_shapes[k] = tuple([max(a, b) for a, b in zip(max_shapes[k], shape)])
opt_shapes: Dict[str, tuple] = {}
for k, v in counters.items():
opt_shapes[k] = v.most_common(1)[0][0]
shapes = {}
for k in opt_shapes.keys(): # same keys in min_shapes and max_shapes
shapes[k] = ShapeSpec(
min=(1,) + min_shapes[k][1:],
max=(max_batch_size,) + max_shapes[k][1:],
opt=(max_batch_size,) + opt_shapes[k][1:],
)
return shapes

View file

@ -0,0 +1,183 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import abc
import importlib
import logging
import os
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, NamedTuple, Optional, Tuple, Union
import numpy as np
LOGGER = logging.getLogger(__name__)
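# Names of the functions that the deployment toolkit looks up in user-provided scripts
# (e.g. triton/model.py defines get_model/update_argparser, triton/dataloader.py defines get_dataloader_fn).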
DATALOADER_FN_NAME = "get_dataloader_fn"
GET_MODEL_FN_NAME = "get_model"
GET_SERVING_INPUT_RECEIVER_FN = "get_serving_input_receiver_fn"
GET_ARGPARSER_FN_NAME = "update_argparser"
class TensorSpec(NamedTuple):
name: str
dtype: str
shape: Tuple
class Parameter(Enum):
def __lt__(self, other: "Parameter") -> bool:
return self.value < other.value
class Accelerator(Parameter):
AMP = "amp"
CUDA = "cuda"
TRT = "trt"
class Precision(Parameter):
FP16 = "fp16"
FP32 = "fp32"
TF32 = "tf32" # Deprecated
class Format(Parameter):
TF_GRAPHDEF = "tf-graphdef"
TF_SAVEDMODEL = "tf-savedmodel"
TF_TRT = "tf-trt"
TF_ESTIMATOR = "tf-estimator"
TF_KERAS = "tf-keras"
ONNX = "onnx"
TRT = "trt"
TS_SCRIPT = "ts-script"
TS_TRACE = "ts-trace"
PYT = "pyt"
class Model(NamedTuple):
handle: object
precision: Optional[Precision]
inputs: Dict[str, TensorSpec]
outputs: Dict[str, TensorSpec]
def load_from_file(file_path, label, target):
spec = importlib.util.spec_from_file_location(name=label, location=file_path)
my_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(my_module) # pytype: disable=attribute-error
return getattr(my_module, target, None)
class BaseLoader(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def load(self, model_path: Union[str, Path], **kwargs) -> Model:
"""
Loads and processes a model from file based on a given set of args
"""
pass
class BaseSaver(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def save(self, model: Model, model_path: Union[str, Path]) -> None:
"""
Save model to file
"""
pass
class BaseRunner(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def init_inference(self, model: Model):
raise NotImplementedError
class BaseRunnerSession(abc.ABC):
def __init__(self, model: Model):
self._model = model
@abc.abstractmethod
def __enter__(self):
raise NotImplementedError()
@abc.abstractmethod
def __exit__(self, exc_type, exc_value, traceback):
raise NotImplementedError()
@abc.abstractmethod
def __call__(self, x: Dict[str, object]):
raise NotImplementedError()
def _set_env_variables(self) -> Dict[str, object]:
"""this method not remove values; fix it if needed"""
to_set = {}
old_values = {k: os.environ.pop(k, None) for k in to_set}
os.environ.update(to_set)
return old_values
def _recover_env_variables(self, old_envs: Dict[str, object]):
for name, value in old_envs.items():
if value is None:
del os.environ[name]
else:
os.environ[name] = str(value)
class BaseConverter(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def convert(self, model: Model, dataloader_fn) -> Model:
raise NotImplementedError()
@staticmethod
def required_source_model_precision(requested_model_precision: Precision) -> Precision:
return requested_model_precision
class BaseMetricsCalculator(abc.ABC):
required_fn_name_for_signature_parsing: Optional[str] = None
@abc.abstractmethod
def calc(
self,
*,
ids: List[Any],
y_pred: Dict[str, np.ndarray],
x: Optional[Dict[str, np.ndarray]],
y_real: Optional[Dict[str, np.ndarray]],
) -> Dict[str, float]:
"""
Calculates error/accuracy metrics
Args:
ids: List of ids identifying each sample in the batch
y_pred: model output as dict where key is output name and value is output value
x: model input as dict where key is input name and value is input value
y_real: input ground truth as dict where key is output name and value is output value
Returns:
dictionary where key is metric name and value is its value
"""
pass
class ShapeSpec(NamedTuple):
min: Tuple
opt: Tuple
max: Tuple

View file

@ -0,0 +1,147 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
from typing import Dict, Iterable
import numpy as np
MB2B = 2 ** 20
B2MB = 1 / MB2B
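# NpzWriter buffers arrays in memory and flushes them to .npz files once any prefix cache exceeds this size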
FLUSH_THRESHOLD_B = 256 * MB2B
def pad_except_batch_axis(data: np.ndarray, target_shape_with_batch_axis: Iterable[int]):
assert all(
[current_size <= target_size for target_size, current_size in zip(target_shape_with_batch_axis, data.shape)]
), "target_shape should have equal or greater all dimensions comparing to data.shape"
padding = [(0, 0)] + [ # (0, 0) - do not pad on batch_axis (with index 0)
(0, target_size - current_size)
for target_size, current_size in zip(target_shape_with_batch_axis[1:], data.shape[1:])
]
return np.pad(data, padding, "constant", constant_values=np.nan)
class NpzWriter:
"""
Dumps dicts of numpy arrays into npz files
It can/should be used as a context manager:
```
with NpzWriter('mydir') as writer:
writer.write(outputs={'classes': np.zeros(8), 'probs': np.zeros((8, 4))},
labels={'classes': np.zeros(8)},
inputs={'input': np.zeros((8, 240, 240, 3))})
```
## Variable size data
Only the dynamic size of the last axis is handled. Data is padded with the np.nan value.
Also, each generated file may have a different size of the dynamic axis.
"""
def __init__(self, output_dir, compress=False):
self._output_dir = Path(output_dir)
self._items_cache: Dict[str, Dict[str, np.ndarray]] = {}
self._items_counters: Dict[str, int] = {}
self._flush_threshold_b = FLUSH_THRESHOLD_B
self._compress = compress
@property
def cache_size(self):
return {name: sum([a.nbytes for a in data.values()]) for name, data in self._items_cache.items()}
def _append_to_cache(self, prefix, data):
if data is None:
return
if not isinstance(data, dict):
raise ValueError(f"{prefix} data to store shall be dict")
cached_data = self._items_cache.get(prefix, {})
for name, value in data.items():
assert isinstance(
value, (list, np.ndarray)
), f"Values shall be lists or np.ndarrays; current type {type(value)}"
if not isinstance(value, np.ndarray):
value = np.array(value)
assert value.dtype.kind in ["S", "U"] or not np.any(
np.isnan(value)
), f"Values with np.nan is not supported; {name}={value}"
cached_value = cached_data.get(name, None)
if cached_value is not None:
target_shape = np.max([cached_value.shape, value.shape], axis=0)
cached_value = pad_except_batch_axis(cached_value, target_shape)
value = pad_except_batch_axis(value, target_shape)
value = np.concatenate((cached_value, value))
cached_data[name] = value
self._items_cache[prefix] = cached_data
def write(self, **kwargs):
"""
Writes named dictionaries of np.ndarrays.
The keyword names become the prefixes of the npz files in which those dictionaries are stored.
ex. writer.write(inputs={'input': np.zeros((2, 10))},
outputs={'classes': np.zeros((2,)), 'probabilities': np.zeros((2, 32))},
labels={'classes': np.zeros((2,))})
Args:
**kwargs: named list of dictionaries of np.ndarrays to store
"""
for prefix, data in kwargs.items():
self._append_to_cache(prefix, data)
biggest_item_size = max(self.cache_size.values())
if biggest_item_size > self._flush_threshold_b:
self.flush()
def flush(self):
for prefix, data in self._items_cache.items():
self._dump(prefix, data)
self._items_cache = {}
def _dump(self, prefix, data):
idx = self._items_counters.setdefault(prefix, 0)
filename = f"{prefix}-{idx:012d}.npz"
output_path = self._output_dir / filename
if self._compress:
np.savez_compressed(output_path, **data)
else:
np.savez(output_path, **data)
nitems = len(list(data.values())[0])
msg_for_labels = (
"If these are correct shapes - consider moving loading of them into metrics.py."
if prefix == "labels"
else ""
)
shapes = {name: value.shape if isinstance(value, np.ndarray) else (len(value),) for name, value in data.items()}
assert all(len(v) == nitems for v in data.values()), (
f'All items in "{prefix}" shall have the same size on axis 0, equal to the batch size. {msg_for_labels}'
f'{", ".join(f"{name}: {shape}" for name, shape in shapes.items())}'
)
self._items_counters[prefix] += nitems
def __enter__(self):
if self._output_dir.exists() and len(list(self._output_dir.iterdir())):
raise ValueError(f"{self._output_dir.as_posix()} is not empty")
self._output_dir.mkdir(parents=True, exist_ok=True)
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.flush()

View file

@ -0,0 +1,83 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib
import logging
import os
import re
from pathlib import Path
from typing import List
LOGGER = logging.getLogger(__name__)
class ExtensionManager:
def __init__(self, name: str):
self._name = name
self._registry = {}
def register_extension(self, extension: str, clazz):
already_registered_class = self._registry.get(extension, None)
if already_registered_class and already_registered_class.__module__ != clazz.__module__:
raise RuntimeError(
f"Conflicting extension {self._name}/{extension}; "
f"{already_registered_class.__module__}.{already_registered_class.__name} "
f"and "
f"{clazz.__module__}.{clazz.__name__}"
)
elif already_registered_class is None:
clazz_full_name = f"{clazz.__module__}.{clazz.__name__}" if clazz is not None else "None"
LOGGER.debug(f"Registering extension {self._name}/{extension}: {clazz_full_name}")
self._registry[extension] = clazz
def get(self, extension):
if extension not in self._registry:
raise RuntimeError(f"Missing extension {self._name}/{extension}")
return self._registry[extension]
@property
def supported_extensions(self):
return list(self._registry)
@staticmethod
def scan_for_extensions(extension_dirs: List[Path]):
register_pattern = r".*\.register_extension\(.*"
for extension_dir in extension_dirs:
for python_path in extension_dir.rglob("*.py"):
if not python_path.is_file():
continue
payload = python_path.read_text()
if re.findall(register_pattern, payload):
import_path = python_path.relative_to(toolkit_root_dir.parent)
package = import_path.parent.as_posix().replace(os.sep, ".")
package_with_module = f"{package}.{import_path.stem}"
spec = importlib.util.spec_from_file_location(name=package_with_module, location=python_path)
my_module = importlib.util.module_from_spec(spec)
my_module.__package__ = package
try:
spec.loader.exec_module(my_module) # pytype: disable=attribute-error
except ModuleNotFoundError as e:
LOGGER.error(
f"Could not load extensions from {import_path} due to missing python packages; {e}"
)
runners = ExtensionManager("runners")
loaders = ExtensionManager("loaders")
savers = ExtensionManager("savers")
converters = ExtensionManager("converters")
toolkit_root_dir = (Path(__file__).parent / "..").resolve()
ExtensionManager.scan_for_extensions([toolkit_root_dir])

View file

@ -0,0 +1,61 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv
import re
from typing import Dict, List
from natsort import natsorted
from tabulate import tabulate
def sort_results(results: List):
results = natsorted(results, key=lambda item: [item[key] for key in item.keys()])
return results
def save_results(filename: str, data: List, formatted: bool = False):
data = format_data(data=data) if formatted else data
with open(filename, "a") as csvfile:
fieldnames = data[0].keys()
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for row in data:
writer.writerow(row)
def format_data(data: List[Dict]) -> List[Dict]:
formatted_data = list()
for item in data:
formatted_item = format_keys(data=item)
formatted_data.append(formatted_item)
return formatted_data
def format_keys(data: Dict) -> Dict:
keys = {format_key(key=key): value for key, value in data.items()}
return keys
def format_key(key: str) -> str:
key = " ".join([k.capitalize() for k in re.split("_| ", key)])
return key
def show_results(results: List[Dict]):
headers = list(results[0].keys())
summary = map(lambda x: list(map(lambda item: item[1], x.items())), results)
print(tabulate(summary, headers=headers))

View file

@ -0,0 +1,67 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
from typing import List, Optional
def warmup(
model_name: str,
batch_sizes: List[int],
triton_gpu_engine_count: int = 1,
triton_instances: int = 1,
profiling_data: str = "random",
input_shapes: Optional[List[str]] = None,
server_url: str = "localhost",
measurement_window: int = 10000,
shared_memory: bool = False
):
print("\n")
print(f"==== Warmup start ====")
print("\n")
input_shapes = " ".join(map(lambda shape: f" --shape {shape}", input_shapes)) if input_shapes else ""
measurement_window = 6 * measurement_window
max_batch_size = max(batch_sizes)
max_total_requests = 2 * max_batch_size * triton_instances * triton_gpu_engine_count
max_concurrency = min(256, max_total_requests)
batch_size = max(1, max_total_requests // 256)
step = max(1, max_concurrency // 2)
min_concurrency = step
exec_args = f"""-m {model_name} \
-x 1 \
-p {measurement_window} \
-v \
-i http \
-u {server_url}:8000 \
-b {batch_size} \
--concurrency-range {min_concurrency}:{max_concurrency}:{step} \
--input-data {profiling_data} {input_shapes}"""
if shared_memory:
exec_args += " --shared-memory=cuda"
result = os.system(f"perf_client {exec_args}")
if result != 0:
print(f"Failed running performance tests. Perf client failed with exit code {result}")
sys.exit(1)
print("\n")
print(f"==== Warmup done ====")
print("\n")

View file

@ -0,0 +1,26 @@
from typing import Any, Dict, List, NamedTuple, Optional
import numpy as np
from deployment_toolkit.core import BaseMetricsCalculator
class MetricsCalculator(BaseMetricsCalculator):
def __init__(self):
pass
def calc(
self,
*,
ids: List[Any],
y_pred: Dict[str, np.ndarray],
x: Optional[Dict[str, np.ndarray]],
y_real: Optional[Dict[str, np.ndarray]],
) -> Dict[str, float]:
categories = np.argmax(y_pred["OUTPUT__0"], axis=-1)
print(categories.shape)
print(categories[:128], y_pred["OUTPUT__0"] )
print(y_real["OUTPUT__0"][:128])
return {
"accuracy": np.mean(np.argmax(y_pred["OUTPUT__0"], axis=-1) ==
np.argmax(y_real["OUTPUT__0"], axis=-1))
}

View file

@ -0,0 +1,32 @@
import torch
def update_argparser(parser):
parser.add_argument(
"--config", default="resnet50", type=str, required=True, help="Network to deploy")
parser.add_argument(
"--checkpoint", default=None, type=str, help="The checkpoint of the model. ")
parser.add_argument("--classes", type=int, default=1000, help="Number of classes")
parser.add_argument("--precision", type=str, default="fp32",
choices=["fp32", "fp16"], help="Inference precision")
def get_model(**model_args):
from image_classification import models
model = models.resnet50()
if "checkpoint" in model_args:
print(f"loading checkpoint {model_args['checkpoint']}")
state_dict = torch.load(model_args["checkpoint"], map_location="cpu")
model.load_state_dict({k.replace("module.", ""): v
for k, v in state_dict.items()})
if model_args["precision"] == "fp16":
model = model.half()
model = model.cuda()
model.eval()
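# INPUT__<n> / OUTPUT__<n> is the tensor naming convention expected by Triton's PyTorch (LibTorch) backend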
tensor_names = {"inputs": ["INPUT__0"],
"outputs": ["OUTPUT__0"]}
return model, tensor_names

View file

@ -0,0 +1,127 @@
#!/usr/bin/env python3
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import tarfile
from pathlib import Path
from typing import Tuple, Dict, List
from PIL import Image
from tqdm import tqdm
DATASETS_DIR = os.environ.get("DATASETS_DIR", None)
IMAGENET_DIRNAME = "imagenet"
IMAGE_ARCHIVE_FILENAME = "ILSVRC2012_img_val.tar"
DEVKIT_ARCHIVE_FILENAME = "ILSVRC2012_devkit_t12.tar.gz"
LABELS_REL_PATH = "ILSVRC2012_devkit_t12/data/ILSVRC2012_validation_ground_truth.txt"
META_REL_PATH = "ILSVRC2012_devkit_t12/data/meta.mat"
TARGET_SIZE = (224, 224) # (width, height)
_RESIZE_MIN = 256 # resize preserving aspect ratio to where this is minimal size
def parse_meta_mat(metafile) -> Dict[int, str]:
import scipy.io
meta = scipy.io.loadmat(metafile, squeeze_me=True)["synsets"]
nums_children = list(zip(*meta))[4]
meta = [meta[idx] for idx, num_children in enumerate(nums_children) if num_children == 0]
idcs, wnids = list(zip(*meta))[:2]
idx_to_wnid = {idx: wnid for idx, wnid in zip(idcs, wnids)}
return idx_to_wnid
def _process_image(image_file, target_size):
image = Image.open(image_file)
original_size = image.size
# scale image to size where minimal size is _RESIZE_MIN
scale_factor = max(_RESIZE_MIN / original_size[0], _RESIZE_MIN / original_size[1])
resize_to = int(original_size[0] * scale_factor), int(original_size[1] * scale_factor)
resized_image = image.resize(resize_to)
# central crop of image to target_size
left, upper = (resize_to[0] - target_size[0]) // 2, (resize_to[1] - target_size[1]) // 2
cropped_image = resized_image.crop((left, upper, left + target_size[0], upper + target_size[1]))
return cropped_image
def main():
import argparse
parser = argparse.ArgumentParser(description="Preprocess the ImageNet validation archive for inference benchmarks")
parser.add_argument(
"--dataset-dir",
help="Path to dataset directory where imagenet archives are stored and processed files will be saved.",
required=False,
default=DATASETS_DIR,
)
parser.add_argument(
"--target-size",
help="Size of target image. Format it as <width>,<height>.",
required=False,
default=",".join(map(str, TARGET_SIZE)),
)
args = parser.parse_args()
if args.dataset_dir is None:
raise ValueError(
"Please set $DATASETS_DIR env variable to point dataset dir with original dataset archives "
"and where processed files should be stored. Alternatively provide --dataset-dir CLI argument"
)
datasets_dir = Path(args.dataset_dir)
target_size = tuple(map(int, args.target_size.split(",")))
image_archive_path = datasets_dir / IMAGE_ARCHIVE_FILENAME
if not image_archive_path.exists():
raise RuntimeError(
f"There should be {IMAGE_ARCHIVE_FILENAME} file in {datasets_dir}."
f"You need to download the dataset from http://www.image-net.org/download."
)
devkit_archive_path = datasets_dir / DEVKIT_ARCHIVE_FILENAME
if not devkit_archive_path.exists():
raise RuntimeError(
f"There should be {DEVKIT_ARCHIVE_FILENAME} file in {datasets_dir}."
f"You need to download the dataset from http://www.image-net.org/download."
)
with tarfile.open(devkit_archive_path, mode="r") as devkit_archive_file:
labels_file = devkit_archive_file.extractfile(LABELS_REL_PATH)
labels = list(map(int, labels_file.readlines()))
# map validation labels (idxes from LABELS_REL_PATH) into WNID compatible with training set
meta_file = devkit_archive_file.extractfile(META_REL_PATH)
idx_to_wnid = parse_meta_mat(meta_file)
labels_wnid = [idx_to_wnid[idx] for idx in labels]
# remap WNID into index in sorted list of all WNIDs - this is how network outputs class
available_wnids = sorted(set(labels_wnid))
wnid_to_newidx = {wnid: new_cls for new_cls, wnid in enumerate(available_wnids)}
labels = [wnid_to_newidx[wnid] for wnid in labels_wnid]
output_dir = datasets_dir / IMAGENET_DIRNAME
with tarfile.open(image_archive_path, mode="r") as image_archive_file:
image_rel_paths = sorted(image_archive_file.getnames())
for cls, image_rel_path in tqdm(zip(labels, image_rel_paths), total=len(image_rel_paths)):
output_path = output_dir / str(cls) / image_rel_path
original_image_file = image_archive_file.extractfile(image_rel_path)
processed_image = _process_image(original_image_file, target_size)
output_path.parent.mkdir(parents=True, exist_ok=True)
processed_image.save(output_path.as_posix())
if __name__ == "__main__":
main()

View file

@ -0,0 +1,24 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
networkx==2.5
numpy<1.20.0,>=1.19.1  # numpy 1.20+ requires py37
onnx==1.8.0
onnxruntime==1.5.2
pycuda>=2019.1.2
PyYAML>=5.2
tqdm>=4.44.1
tabulate>=0.8.7
natsort>=7.0.0
# use tags instead of branch names - a Docker cache hit could otherwise prevent fetching the most recent changes from a branch
model_navigator @ git+https://github.com/triton-inference-server/model_navigator.git@v0.1.0#egg=model_navigator

View file

@ -0,0 +1,28 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.02-py3
ARG TRITON_CLIENT_IMAGE_NAME=nvcr.io/nvidia/tritonserver:21.02-py3-sdk
FROM ${TRITON_CLIENT_IMAGE_NAME} as triton-client
FROM ${FROM_IMAGE_NAME}
# Install Perf Client required library
RUN apt-get update && apt-get install -y libb64-dev libb64-0d
# Install Triton Client PythonAPI and copy Perf Client
COPY --from=triton-client /workspace/install/ /workspace/install/
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}
RUN find /workspace/install/python/ -iname triton*manylinux*.whl -exec pip install {}[all] \;
# Setup environment variables to access Triton Client binaries and libs
ENV PATH /workspace/install/bin:${PATH}
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}
ENV PYTHONPATH /workspace
WORKDIR /workspace
RUN pip install nvidia-pyindex
ADD requirements.txt /workspace/requirements.txt
ADD triton/requirements.txt /workspace/triton/requirements.txt
RUN pip install -r /workspace/requirements.txt
RUN pip install -r /workspace/triton/requirements.txt
ADD . /workspace

Binary file not shown.

Binary file not shown.

Binary file not shown.


View file

@ -1,248 +1,700 @@
# Deploying the ResNet-50 v1.5 model using Triton Inference Server
# Deploying the ResNet50 v1.5 model on Triton Inference Server
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/trtis-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains instructions for deployment to run inference
on Triton Inference Server as well as a detailed performance analysis.
The purpose of this document is to help you with achieving
the best inference performance.
This folder contains instructions on how to deploy and run inference on
Triton Inference Server as well as gather detailed performance analysis.
## Table of contents
## Table Of Contents
- [Solution overview](#solution-overview)
- [Introduction](#introduction)
- [Deployment process](#deployment-process)
- [Setup](#setup)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
- [Prepare configuration](#prepare-configuration)
- [Latency explanation](#latency-explanation)
- [Performance](#performance)
- [Offline scenario](#offline-scenario)
- [Offline: NVIDIA A40, ONNX Runtime TensorRT with FP16](#offline-nvidia-a40-onnx-runtime-tensorrt-with-fp16)
- [Offline: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16](#offline-nvidia-dgx-a100-1x-a100-80gb-onnx-runtime-tensorrt-with-fp16)
- [Offline: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16](#offline-nvidia-dgx-1-1x-v100-32gb-onnx-runtime-tensorrt-with-fp16)
- [Offline: NVIDIA T4, ONNX Runtime TensorRT with FP16](#offline-nvidia-t4-onnx-runtime-tensorrt-with-fp16)
- [Online scenario](#online-scenario)
- [Online: NVIDIA A40, ONNX Runtime TensorRT with FP16](#online-nvidia-a40-onnx-runtime-tensorrt-with-fp16)
- [Online: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16](#online-nvidia-dgx-a100-1x-a100-80gb-onnx-runtime-tensorrt-with-fp16)
- [Online: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16](#online-nvidia-dgx-1-1x-v100-32gb-onnx-runtime-tensorrt-with-fp16)
- [Online: NVIDIA T4, ONNX Runtime TensorRT with FP16](#online-nvidia-t4-onnx-runtime-tensorrt-with-fp16)
- [Release Notes](#release-notes)
- [Changelog](#changelog)
- [Known issues](#known-issues)
* [Model overview](#model-overview)
* [Setup](#setup)
* [Inference container](#inference-container)
* [Deploying the model](#deploying-the-model)
* [Running the Triton Inference Server](#running-the-triton-inference-server)
* [Quick Start Guide](#quick-start-guide)
* [Running the client](#running-the-client)
* [Gathering performance data](#gathering-performance-data)
* [Advanced](#advanced)
* [Automated benchmark script](#automated-benchmark-script)
* [Performance](#performance)
* [Dynamic batching performance](#dynamic-batching-performance)
* [TensorRT backend inference performance (1x V100 16GB)](#tensorrt-backend-inference-performance-1x-v100-16gb)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
The difference between v1 and v1.5 is that, in the bottleneck blocks which require
downsampling, v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.
This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1, but comes with a small performance drawback (~5% imgs/sec).
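The sketch below is a minimal, illustrative PyTorch snippet (not the repository implementation; batch norm and activations are omitted) showing where the stride difference sits in a downsampling bottleneck block:
```python
import torch.nn as nn

def downsampling_bottleneck_convs(in_ch, mid_ch, out_ch, v1_5=True):
    # v1 places stride=2 on the first 1x1 convolution; v1.5 moves it to the 3x3 convolution
    stride_1x1, stride_3x3 = (1, 2) if v1_5 else (2, 1)
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=stride_1x1, bias=False),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride_3x3, padding=1, bias=False),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
    )
```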
## Solution overview
### Introduction
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server)
provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs.
The server provides an inference service via an HTTP or gRPC endpoint,
allowing remote clients to request inferencing for any number of GPU
or CPU models being managed by the server.
This README provides step-by-step deployment instructions for models generated
during training (as described in the [model README](../../resnet50v1.5/README.md)).
Additionally, this README provides the corresponding deployment scripts that
ensure optimal GPU utilization during inferencing on Triton Inference Server.
### Deployment process
The deployment process consists of two steps:
1. Conversion. The purpose of conversion is to find the best performing model
format supported by Triton Inference Server.
Triton Inference Server uses a number of runtime backends such as
[TensorRT](https://developer.nvidia.com/tensorrt),
[LibTorch](https://github.com/triton-inference-server/pytorch_backend) and
[ONNX Runtime](https://github.com/triton-inference-server/onnxruntime_backend)
to support various model types. Refer to the
[Triton documentation](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
for a list of available backends.
2. Configuration. Model configuration on Triton Inference Server, which generates
necessary [configuration files](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md).
To run benchmarks measuring the model performance in inference,
perform the following steps:
1. Start the Triton Inference Server.
The Triton Inference Server is started in a separate
(possibly remote) container, and its gRPC and REST API ports are exposed.
2. Run accuracy tests.
Produce results which are tested against the given accuracy thresholds.
Refer to step 9 in the [Quick Start Guide](#quick-start-guide).
3. Run performance tests.
Produce latency and throughput results for offline (static batching)
and online (dynamic batching) scenarios.
Refer to step 11 in the [Quick Start Guide](#quick-start-guide).
The ResNet50 v1.5 model can be deployed for inference on the [NVIDIA Triton Inference Server](https://github.com/NVIDIA/trtis-inference-server) using
TorchScript, ONNX Runtime or TensorRT as an execution backend.
## Setup
This script requires a trained ResNet50 v1.5 model checkpoint that can be used for deployment.
### Inference container
For easy deployment, a build script for a special inference container is provided. To build that container, go to the main repository folder and run:
Ensure you have the following components:
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
* [PyTorch NGC container 20.11](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
* [Triton Inference Server NGC container 20.11](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver)
* [NVIDIA CUDA repository](https://docs.nvidia.com/cuda/archive/11.1.1/index.html)
* [NVIDIA Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/), [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
`docker build -t rn50_inference . -f triton/Dockerfile`
This command will download the dependencies and build the inference container. Then, run a shell inside the container:
`docker run -it --rm --gpus device=0 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v <PATH_TO_MODEL_REPOSITORY>:/repository rn50_inference bash`
Here `device=0` exposes only the GPU with ordinal `0` to the container; you can pass, for example, `device=0,1,2,3` to select the GPUs indexed by ordinals `0,1,2` and `3`, or `device=all` to make all available GPUs visible to the server. `PATH_TO_MODEL_REPOSITORY` indicates the location where the
deployed models are stored.
### Deploying the model
To deploy the ResNet-50 v1.5 model into the Triton Inference Server, you must run the `deployer.py` script from inside the deployment Docker container to achieve a compatible format.
```
usage: deployer.py [-h] (--ts-script | --ts-trace | --onnx | --trt)
[--triton-no-cuda] [--triton-model-name TRITON_MODEL_NAME]
[--triton-model-version TRITON_MODEL_VERSION]
[--triton-server-url TRITON_SERVER_URL]
[--triton-max-batch-size TRITON_MAX_BATCH_SIZE]
[--triton-dyn-batching-delay TRITON_DYN_BATCHING_DELAY]
[--triton-engine-count TRITON_ENGINE_COUNT]
[--save-dir SAVE_DIR]
[--max_workspace_size MAX_WORKSPACE_SIZE] [--trt-fp16]
[--capture-cuda-graph CAPTURE_CUDA_GRAPH]
...
optional arguments:
-h, --help show this help message and exit
--ts-script convert to torchscript using torch.jit.script
--ts-trace convert to torchscript using torch.jit.trace
--onnx convert to onnx using torch.onnx.export
--trt convert to trt using tensorrt
triton related flags:
--triton-no-cuda Use the CPU for tracing.
--triton-model-name TRITON_MODEL_NAME
exports to appropriate directory structure for TRITON
--triton-model-version TRITON_MODEL_VERSION
exports to appropriate directory structure for TRITON
--triton-server-url TRITON_SERVER_URL
exports to appropriate directory structure for TRITON
--triton-max-batch-size TRITON_MAX_BATCH_SIZE
Specifies the 'max_batch_size' in the TRITON model
config. See the TRITON documentation for more info.
--triton-dyn-batching-delay TRITON_DYN_BATCHING_DELAY
Determines the dynamic_batching queue delay in
milliseconds(ms) for the TRITON model config. Use '0'
or '-1' to specify static batching. See the TRITON
documentation for more info.
--triton-engine-count TRITON_ENGINE_COUNT
Specifies the 'instance_group' count value in the
TRITON model config. See the TRITON documentation for
more info.
--save-dir SAVE_DIR Saved model directory
optimization flags:
--max_workspace_size MAX_WORKSPACE_SIZE
set the size of the workspace for trt export
--trt-fp16 trt flag ---- export model in mixed precision mode
--capture-cuda-graph CAPTURE_CUDA_GRAPH
capture cuda graph for obtaining speedup. possible
values: 0, 1. default: 1.
model_arguments arguments that will be ignored by deployer lib and
will be forwarded to your deployer script
```
The following model-specific arguments have to be specified for model deployment:
```
--config CONFIG Network architecture to use for deployment (eg. resnet50,
resnext101-32x4d or se-resnext101-32x4d)
--checkpoint CHECKPOINT
Path to stored model weight. If not specified, model will be
randomly initialized
--batch_size BATCH_SIZE
Batch size used for dummy dataloader
--fp16 Use model with half-precision calculations
```
For example, to deploy the model in TensorRT format, using half precision and a max batch size of 64, under the name
`rn-trt-16`, execute:
`python -m triton.deployer --trt --trt-fp16 --triton-model-name rn-trt-16 --triton-max-batch-size 64 --save-dir /repository -- --config resnet50 --checkpoint model_checkpoint --batch_size 64 --fp16`
Where `model_checkpoint` is a checkpoint for a trained model with the same architecture (resnet50) as used during export.
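For reference, the `--ts-trace` path corresponds to tracing the model with `torch.jit.trace` (as stated in the help above). A minimal sketch of that idea, assuming the checkpoint is a plain `state_dict` possibly saved with `module.` prefixes (the file names and batch size below are only examples), is:
```python
import torch
from image_classification import models

model = models.resnet50().cuda().eval()
state_dict = torch.load("model_checkpoint", map_location="cpu")
model.load_state_dict({k.replace("module.", ""): v for k, v in state_dict.items()})

dummy_input = torch.randn(64, 3, 224, 224, device="cuda")  # matches --batch_size 64
with torch.no_grad():
    traced = torch.jit.trace(model, dummy_input)
torch.jit.save(traced, "model.pt")
```
The deployer additionally writes the Triton model directory structure and configuration; the snippet only illustrates the conversion step.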
### Running the Triton Inference Server
**NOTE: This step is executed outside the inference container.**
Pull the Triton Inference Server container from our repository:
`docker pull nvcr.io/nvidia/tritonserver:20.07-py3`
Run the command to start the Triton Inference Server:
`docker run -d --rm --gpus device=0 --ipc=host --network=host -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <PATH_TO_MODEL_REPOSITORY>:/models nvcr.io/nvidia/tritonserver:20.07-py3 trtserver --model-store=/models --log-verbose=1 --model-control-mode=poll --repository-poll-secs=5`
Here `device=0` exposes only the GPU with ordinal `0` to the container; pass, for example, `device=0,1,2,3` to select the GPUs indexed by ordinals `0,1,2` and `3`, or `device=all` to make all available GPUs visible to the server. `PATH_TO_MODEL_REPOSITORY` indicates the location where the
deployed models are stored. An additional `--model-control-mode` option allows the model to be reloaded when it changes in the filesystem. It is required by benchmark scripts that work with multiple model versions on a single Triton Inference Server instance.
## Quick Start Guide
Running the following scripts will build and launch the container with all required dependencies for native PyTorch as well as Triton Inference Server. This is necessary for running inference and can also be used for data download, processing, and training of the model.
1. Clone the repository.
IMPORTANT: This step is executed on the host computer.
```
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/PyTorch/Classification/ConvNets
```
2. Setup the environment in the host computer and start Triton Inference Server.
```
source triton/scripts/setup_environment.sh
bash triton/scripts/docker/triton_inference_server.sh
```
### Running the client
The client `client.py` checks the model accuracy against synthetic or real validation
data. The client connects to Triton Inference Server and performs inference.
```
usage: client.py [-h] --triton-server-url TRITON_SERVER_URL
--triton-model-name TRITON_MODEL_NAME [-v]
[--inference_data INFERENCE_DATA] [--batch_size BATCH_SIZE]
[--fp16]
optional arguments:
-h, --help show this help message and exit
--triton-server-url TRITON_SERVER_URL
URL address of Triton server (with port)
--triton-model-name TRITON_MODEL_NAME
Triton deployed model name
-v, --verbose Verbose mode.
--inference_data INFERENCE_DATA
Path to file with inference data.
--batch_size BATCH_SIZE
Inference request batch size
--fp16 Use fp16 precision for input data
```
To run inference on the model exported in the previous steps, using the data located under
`/dataset`, run:
`python -m triton.client --triton-server-url localhost:8001 --triton-model-name rn-trt-16 --inference_data /data/test_data.bin --batch_size 16 --fp16`
3. Build and run a container that extends the NGC PyTorch container with the Triton Inference Server client libraries and dependencies.
```
bash triton/scripts/docker/build.sh
bash triton/scripts/docker/interactive.sh
```
### Gathering performance data
Performance data can be gathered using the `perf_client` tool. To use this tool to measure performance for batch_size=32, the following command can be used:
4. Prepare the deployment configuration and create folders in Docker.
IMPORTANT: These and the following commands must be executed in the PyTorch NGC container.
```
source triton/scripts/setup_environment.sh
```
`/workspace/bin/perf_client --max-threads 10 -m rn-trt-16 -x 1 -p 10000 -v -i gRPC -u localhost:8001 -b 32 -l 5000 --concurrency-range 1 -f result.csv`
5. Download and pre-process the dataset.
```
bash triton/scripts/download_data.sh
bash triton/scripts/process_dataset.sh
```
6. Setup the parameters for deployment.
```
source triton/scripts/setup_parameters.sh
```
7. Convert the model from training to inference format (e.g. TensorRT).
```
python3 triton/convert_model.py \
--input-path triton/model.py \
--input-type pyt \
--output-path ${SHARED_DIR}/model \
--output-type ${FORMAT} \
--onnx-opset 11 \
--onnx-optimized 1 \
--max-batch-size ${MAX_BATCH_SIZE} \
--max-workspace-size 1073741824 \
--ignore-unknown-parameters \
\
--checkpoint ${CHECKPOINT_DIR}/nvidia_resnet50_200821.pth.tar \
--precision ${PRECISION} \
--config resnet50 \
--classes 1000 \
\
--dataloader triton/dataloader.py \
--data-dir ${DATASETS_DIR}/imagenet \
--batch-size ${MAX_BATCH_SIZE}
```
8. Configure the model on Triton Inference Server.
Generate the configuration from your model repository.
```
python3 triton/config_model_on_triton.py \
--model-repository ${MODEL_REPOSITORY_PATH} \
--model-path ${SHARED_DIR}/model \
--model-format ${FORMAT} \
--model-name ${MODEL_NAME} \
--model-version 1 \
--max-batch-size ${MAX_BATCH_SIZE} \
--precision ${PRECISION} \
--number-of-model-instances ${NUMBER_OF_MODEL_INSTANCES} \
--max-queue-delay-us 0 \
--preferred-batch-sizes ${MAX_BATCH_SIZE} \
--capture-cuda-graph 0 \
--backend-accelerator ${BACKEND_ACCELERATOR} \
--load-model ${TRITON_LOAD_MODEL_METHOD}
```
9. Run the Triton Inference Server accuracy tests.
```
python3 triton/run_inference_on_triton.py \
--server-url localhost:8001 \
--model-name ${MODEL_NAME} \
--model-version 1 \
--output-dir ${SHARED_DIR}/accuracy_dump \
\
--precision ${PRECISION} \
--dataloader triton/dataloader.py \
--data-dir ${DATASETS_DIR}/imagenet \
--batch-size ${MAX_BATCH_SIZE} \
--dump-labels
python3 triton/calculate_metrics.py \
--metrics triton/metric.py \
--dump-dir ${SHARED_DIR}/accuracy_dump \
--csv ${SHARED_DIR}/accuracy_metrics.csv
cat ${SHARED_DIR}/accuracy_metrics.csv
```
10. Run the Triton Inference Server performance online tests.
We want to maximize throughput within latency budget constraints.
Dynamic batching is a feature of Triton Inference Server that allows
inference requests to be combined by the server, so that a batch is
created dynamically, resulting in a reduced average latency.
You can set the Dynamic Batcher parameter `max_queue_delay_microseconds` to
indicate the maximum amount of time you are willing to wait and
`preferred_batch_size` to indicate your maximum server batch size
in the Triton Inference Server model configuration. The measurements
presented below set the maximum latency to zero to achieve the best latency
possible with good performance.
```
python triton/run_online_performance_test_on_triton.py \
--model-name ${MODEL_NAME} \
--input-data random \
--batch-sizes ${BATCH_SIZE} \
--triton-instances ${TRITON_INSTANCES} \
--number-of-model-instances ${NUMBER_OF_MODEL_INSTANCES} \
--result-path ${SHARED_DIR}/triton_performance_online.csv
```
11. Run the Triton Inference Server performance offline tests.
We want to maximize throughput. This scenario assumes that your data is already
available for inference, or that requests arrive quickly enough to fill the maximum batch size.
Triton Inference Server supports offline scenarios with static batching.
With static batching, inference requests are served
as they are received. The largest throughput improvements come
from increasing the batch size, due to efficiency gains in the GPU with larger
batches.
```
python triton/run_offline_performance_test_on_triton.py \
--model-name ${MODEL_NAME} \
--input-data random \
--batch-sizes ${BATCH_SIZE} \
--triton-instances ${TRITON_INSTANCES} \
--result-path ${SHARED_DIR}/triton_performance_offline.csv
```
For more information about `perf_client`, refer to the [documentation](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/optimization.html#perf-client).
## Advanced
### Automated benchmark script
To automate benchmarks of different model configurations, a special benchmark script is located in `triton/scripts/benchmark.sh`. To use this script,
run Triton Inference Server and then execute the script as follows:
`bash triton/scripts/benchmark.sh <MODEL_REPOSITORY> <LOG_DIRECTORY> <ARCHITECTURE> (<CHECKPOINT_PATH>)`
### Prepare configuration
You can use the environment variables to set the parameters of your inference
configuration.
Triton deployment scripts support several inference runtimes listed in the table below:
| Inference runtime | Mnemonic used in scripts |
|-------------------|--------------------------|
| [TorchScript Tracing](https://pytorch.org/docs/stable/jit.html) | `ts-trace` |
| [TorchScript Tracing](https://pytorch.org/docs/stable/jit.html) | `ts-script` |
| [ONNX](https://onnx.ai) | `onnx` |
| [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) | `trt` |
The name of the inference runtime should be put into the `FORMAT` variable.
Example values of some key variables in one configuration:
```
PRECISION="fp16"
FORMAT="trt"
BATCH_SIZE="1, 2, 4, 8, 16, 32, 64, 128"
BACKEND_ACCELERATOR="cuda"
MAX_BATCH_SIZE="128"
NUMBER_OF_MODEL_INSTANCES="1"
TRITON_MAX_QUEUE_DELAY="1"
TRITON_PREFERRED_BATCH_SIZES="64 128"
```
### Latency explanation
A typical Triton Inference Server pipeline can be broken down into the following steps:
1. The client serializes the inference request into a message and sends it to
the server (Client Send).
2. The message travels over the network from the client to the server (Network).
3. The message arrives at the server and is deserialized (Server Receive).
4. The request is placed on the queue (Server Queue).
5. The request is removed from the queue and computed (Server Compute).
6. The completed request is serialized in a message and sent back to
the client (Server Send).
7. The completed message then travels over the network from the server
to the client (Network).
8. The completed message is deserialized by the client and processed as
a completed inference request (Client Receive).
Generally, for local clients, steps 1-4 and 6-8 occupy only
a small fraction of time compared to step 5. Because backend deep learning
systems such as these classification models are rarely exposed directly to
end users, and instead only interface with local front-end servers, we can,
for the purposes of this analysis, consider all clients to be local.
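As a concrete example, in the NVIDIA A40 online results reported later in this document, at 16 concurrent client requests the per-stage average times sum to the reported average latency: 0.078 ms (Client Send) + 1.912 ms (Network+server Send/recv) + 1.286 ms (Server Queue) + 0.288 ms (Compute Input) + 2.697 ms (Compute Infer) + 0.024 ms (Compute Output) + 0 ms (Client Recv) = 6.285 ms.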
The benchmark script tests all supported backends with different batch sizes and server configurations. Logs from the execution are stored in `<LOG_DIRECTORY>`.
To process the static configuration logs, the `triton/scripts/process_output.sh` script can be used.
## Performance
### Dynamic batching performance
The Triton Inference Server has a built-in dynamic batching mechanism that can be enabled. When it is enabled, the server creates inference batches from multiple received requests, which yields better performance than running inference on each request individually. A single request is assumed to contain a single image on which inference needs to be performed. With dynamic batching enabled, the server concatenates single-image requests into an inference batch. The upper bound on the size of the inference batch is set to 64. All of these parameters are configurable.
Our results were obtained by running the automated benchmark script.
Throughput is measured in images/second, and latency in milliseconds.
### TensorRT backend inference performance (1x V100 16GB)
**FP32 Inference Performance**
|**Concurrent requests**|**Throughput (img/s)**|**Avg. Latency (ms)**|**90% Latency (ms)**|**95% Latency (ms)**|**99% Latency (ms)**|
|-----|--------|-------|--------|-------|-------|
| 1 | 133.6 | 7.48 | 7.56 | 7.59 | 7.68 |
| 2 | 156.6 | 12.77 | 12.84 | 12.86 | 12.93 |
| 4 | 193.3 | 20.70 | 20.82 | 20.85 | 20.92 |
| 8 | 357.4 | 22.38 | 22.53 | 22.57 | 22.67 |
| 16 | 627.3 | 25.49 | 25.64 | 25.69 | 25.80 |
| 32 | 1003 | 31.87 | 32.43 | 32.61 | 32.91 |
| 64 | 1394.7 | 45.85 | 46.13 | 46.22 | 46.86 |
| 128 | 1604.4 | 79.70 | 80.50 | 80.96 | 83.09 |
| 256 | 1670.7 | 152.21 | 186.78 | 188.36 | 190.52 |
**FP16 Inference Performance**
|**Concurrent requests**|**Throughput (img/s)**|**Avg. Latency (ms)**|**90% Latency (ms)**|**95% Latency (ms)**|**99% Latency (ms)**|
|-----|--------|-------|--------|-------|-------|
| 1 | 250.1 | 3.99 | 4.08 | 4.11 | 4.16 |
| 2 | 314.8 | 6.35 | 6.42 | 6.44 | 6.49 |
| 4 | 384.8 | 10.39 | 10.51 | 10.54 | 10.60 |
| 8 | 693.8 | 11.52 | 11.78 | 11.88 | 12.09 |
| 16 | 1132.9 | 14.13 | 14.31 | 14.41 | 14.65 |
| 32 | 1689.7 | 18.93 | 19.11 | 19.20 | 19.44 |
| 64 | 2226.3 | 28.74 | 29.53 | 29.74 | 31.09 |
| 128 | 2521.5 | 50.74 | 51.97 | 52.30 | 53.61 |
| 256 | 2738 | 93.76 | 97.14 | 115.19 | 117.21 |
### Offline scenario
This table lists the common parameters used in all performance measurements:
| Parameter Name               | Parameter Value   |
|:-----------------------------|:------------------|
| Max Batch Size               | 128               |
| Number of model instances    | 1                 |
| Triton Max Queue Delay       | 1                 |
| Triton Preferred Batch Sizes | 64 128            |
#### Offline: NVIDIA A40, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA A40
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_1l.svg)
</td><td>
![](plots/graph_performance_offline_1r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 491.5 | 2.046 | 2.111 | 2.126 | 2.031 |
| FP16 | TensorRT | 2 | 811.8 | 2.509 | 2.568 | 2.594 | 2.459 |
| FP16 | TensorRT | 4 | 1094 | 3.814 | 3.833 | 3.877 | 3.652 |
| FP16 | TensorRT | 8 | 1573.6 | 5.45 | 5.517 | 5.636 | 5.078 |
| FP16 | TensorRT | 16 | 1651.2 | 9.896 | 9.978 | 10.074 | 9.678 |
| FP16 | TensorRT | 32 | 2070.4 | 17.49 | 17.837 | 19.228 | 15.451 |
| FP16 | TensorRT | 64 | 1766.4 | 37.123 | 37.353 | 37.85 | 36.147 |
| FP16 | TensorRT | 128 | 1894.4 | 69.027 | 69.15 | 69.789 | 67.889 |
</details>
#### Offline: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX A100 (1x A100 80GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_5l.svg)
</td><td>
![](plots/graph_performance_offline_5r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 469.1 | 2.195 | 2.245 | 2.272 | 2.128 |
| FP16 | TensorRT | 2 | 910 | 2.222 | 2.229 | 2.357 | 2.194 |
| FP16 | TensorRT | 4 | 1447.6 | 3.055 | 3.093 | 3.354 | 2.759 |
| FP16 | TensorRT | 8 | 2051.2 | 4.035 | 4.195 | 4.287 | 3.895 |
| FP16 | TensorRT | 16 | 2760 | 6.033 | 6.121 | 6.348 | 5.793 |
| FP16 | TensorRT | 32 | 2857.6 | 11.47 | 11.573 | 11.962 | 11.193 |
| FP16 | TensorRT | 64 | 2534.4 | 26.345 | 26.899 | 29.744 | 25.244 |
| FP16 | TensorRT | 128 | 2662.4 | 49.612 | 51.713 | 53.666 | 48.086 |
</details>
#### Offline: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX-1 (1x V100 32GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_9l.svg)
</td><td>
![](plots/graph_performance_offline_9r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 351.8 | 2.996 | 3.051 | 3.143 | 2.838 |
| FP16 | TensorRT | 2 | 596.2 | 3.481 | 3.532 | 3.627 | 3.35 |
| FP16 | TensorRT | 4 | 953.6 | 4.314 | 4.351 | 4.45 | 4.191 |
| FP16 | TensorRT | 8 | 1337.6 | 6.185 | 6.347 | 6.581 | 5.979 |
| FP16 | TensorRT | 16 | 1726.4 | 9.736 | 9.87 | 10.904 | 9.266 |
| FP16 | TensorRT | 32 | 2044.8 | 15.833 | 15.977 | 16.438 | 15.664 |
| FP16 | TensorRT | 64 | 1670.4 | 38.667 | 38.842 | 40.773 | 38.412 |
| FP16 | TensorRT | 128 | 1548.8 | 84.454 | 85.308 | 88.363 | 82.159 |
</details>
#### Offline: NVIDIA T4, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA T4
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
<table><tr><td>
![](plots/graph_performance_offline_13l.svg)
</td><td>
![](plots/graph_performance_offline_13r.svg)
</td></tr></table>
<details>
<summary>
Full tabular data
</summary>
| Precision | Backend Accelerator | Client Batch Size | Inferences/second | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|:------------|:---------------------|--------------------:|--------------------:|--------------:|--------------:|--------------:|--------------:|
| FP16 | TensorRT | 1 | 332.4 | 3.065 | 3.093 | 3.189 | 3.003 |
| FP16 | TensorRT | 2 | 499.4 | 4.069 | 4.086 | 4.143 | 3.998 |
| FP16 | TensorRT | 4 | 695.2 | 5.779 | 5.786 | 5.802 | 5.747 |
| FP16 | TensorRT | 8 | 888 | 9.039 | 9.05 | 9.065 | 8.998 |
| FP16 | TensorRT | 16 | 1057.6 | 15.319 | 15.337 | 15.389 | 15.113 |
| FP16 | TensorRT | 32 | 1129.6 | 28.77 | 28.878 | 29.082 | 28.353 |
| FP16 | TensorRT | 64 | 1203.2 | 54.194 | 54.417 | 55.331 | 53.187 |
| FP16 | TensorRT | 128 | 1280 | 102.466 | 102.825 | 103.177 | 100.155 |
</details>
### Online scenario
This table lists the common parameters used in all performance measurements:
| Parameter Name               | Parameter Value   |
|:-----------------------------|:------------------|
| Max Batch Size               | 128               |
| Number of model instances    | 1                 |
| Triton Max Queue Delay       | 1                 |
| Triton Preferred Batch Sizes | 64 128            |
#### Online: NVIDIA A40, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA A40
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_2.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 2543.7 | 0.078 | 1.912 | 1.286 | 0.288 | 2.697 | 0.024 | 0 | 6.624 | 7.039 | 7.414 | 9.188 | 6.285 |
| 32 | 3166.7 | 0.085 | 3.478 | 1.81 | 0.582 | 4.098 | 0.047 | 0 | 9.924 | 11.001 | 12.217 | 14.717 | 10.1 |
| 48 | 3563.9 | 0.085 | 5.169 | 1.935 | 0.99 | 5.204 | 0.08 | 0 | 13.199 | 14.813 | 16.421 | 19.793 | 13.463 |
| 64 | 3514.9 | 0.091 | 5.729 | 3.847 | 1.553 | 6.842 | 0.138 | 0 | 17.986 | 18.85 | 19.916 | 25.825 | 18.2 |
| 80 | 3703.5 | 0.097 | 7.244 | 4.414 | 2 | 7.675 | 0.169 | 0 | 21.313 | 23.838 | 28.664 | 32.631 | 21.599 |
| 96 | 3636.9 | 0.101 | 8.459 | 5.679 | 3.157 | 8.771 | 0.215 | 0 | 26.131 | 27.751 | 31.269 | 38.695 | 26.382 |
| 112 | 3701.7 | 0.099 | 9.332 | 6.711 | 3.588 | 10.276 | 0.241 | 0 | 30.319 | 31.282 | 31.554 | 32.151 | 30.247 |
| 128 | 3795.8 | 0.106 | 10.581 | 7.309 | 4.067 | 11.386 | 0.268 | 0 | 33.893 | 34.793 | 35.448 | 43.182 | 33.717 |
| 144 | 3892.4 | 0.106 | 11.488 | 8.144 | 4.713 | 12.212 | 0.32 | 0 | 37.184 | 38.277 | 38.597 | 39.393 | 36.983 |
| 160 | 3950 | 0.106 | 13.5 | 7.999 | 5.083 | 13.481 | 0.343 | 0 | 40.656 | 42.157 | 44.756 | 53.426 | 40.512 |
| 176 | 3992.5 | 0.118 | 13.6 | 9.809 | 5.596 | 14.611 | 0.379 | 0 | 44.324 | 45.689 | 46.331 | 52.155 | 44.113 |
| 192 | 4058.3 | 0.116 | 14.902 | 10.223 | 6.054 | 15.564 | 0.416 | 0 | 47.537 | 48.91 | 49.752 | 55.973 | 47.275 |
| 208 | 4121.8 | 0.117 | 16.777 | 9.991 | 6.347 | 16.827 | 0.441 | 0 | 50.652 | 52.241 | 53.4 | 62.688 | 50.5 |
| 224 | 4116.1 | 0.124 | 17.048 | 11.743 | 7.065 | 17.91 | 0.504 | 0 | 54.571 | 56.204 | 56.877 | 62.169 | 54.394 |
| 240 | 4100 | 0.157 | 17.54 | 13.611 | 7.532 | 19.185 | 0.538 | 0 | 58.683 | 60.034 | 60.64 | 64.791 | 58.563 |
| 256 | 4310.5 | 0.277 | 18.282 | 13.5 | 7.714 | 19.136 | 0.539 | 0 | 59.244 | 60.686 | 61.349 | 66.84 | 59.448 |
</details>
#### Online: NVIDIA DGX A100 (1x A100 80GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX A100 (1x A100 80GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_10.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 2571.2 | 0.067 | 1.201 | 1.894 | 0.351 | 2.678 | 0.027 | 0 | 6.205 | 6.279 | 6.31 | 6.418 | 6.218 |
| 32 | 3600.2 | 0.058 | 2.641 | 2.004 | 0.716 | 3.41 | 0.057 | 0 | 8.852 | 9.233 | 9.353 | 12.253 | 8.886 |
| 48 | 4274.2 | 0.062 | 3.102 | 2.738 | 1.121 | 4.113 | 0.089 | 0 | 11.03 | 11.989 | 12.1 | 15.115 | 11.225 |
| 64 | 4387.7 | 0.07 | 3.767 | 3.438 | 2.016 | 5.164 | 0.122 | 0 | 14.628 | 15.067 | 15.211 | 15.504 | 14.577 |
| 80 | 4630.1 | 0.064 | 4.23 | 5.049 | 2.316 | 5.463 | 0.151 | 0 | 17.205 | 17.726 | 17.9 | 18.31 | 17.273 |
| 96 | 4893.9 | 0.068 | 4.811 | 5.764 | 2.741 | 6.044 | 0.179 | 0 | 19.44 | 20.23 | 20.411 | 22.781 | 19.607 |
| 112 | 4887.6 | 0.069 | 6.232 | 5.202 | 3.597 | 7.586 | 0.236 | 0 | 23.099 | 23.665 | 23.902 | 24.192 | 22.922 |
| 128 | 5411.5 | 0.081 | 5.921 | 7 | 3.387 | 7.016 | 0.255 | 0 | 23.852 | 24.349 | 24.557 | 26.433 | 23.66 |
| 144 | 5322.9 | 0.08 | 7.066 | 7.55 | 3.996 | 8.059 | 0.299 | 0 | 27.024 | 28.487 | 29.725 | 33.7 | 27.05 |
| 160 | 5310.5 | 0.079 | 6.98 | 9.157 | 4.61 | 8.98 | 0.331 | 0 | 30.446 | 31.497 | 31.91 | 34.269 | 30.137 |
| 176 | 5458.7 | 0.081 | 7.857 | 9.272 | 5.047 | 9.634 | 0.345 | 0 | 32.588 | 33.271 | 33.478 | 35.47 | 32.236 |
| 192 | 5654.1 | 0.081 | 9.355 | 8.898 | 5.294 | 9.923 | 0.388 | 0 | 34.35 | 35.895 | 36.302 | 39.288 | 33.939 |
| 208 | 5643.7 | 0.093 | 9.407 | 10.488 | 5.953 | 10.54 | 0.383 | 0 | 36.994 | 38.14 | 38.766 | 41.616 | 36.864 |
| 224 | 5795.5 | 0.101 | 9.862 | 10.852 | 6.331 | 11.081 | 0.415 | 0 | 38.536 | 39.741 | 40.563 | 43.227 | 38.642 |
| 240 | 5855.8 | 0.131 | 9.994 | 12.391 | 6.589 | 11.419 | 0.447 | 0 | 40.721 | 43.344 | 44.449 | 46.902 | 40.971 |
| 256 | 6127.3 | 0.131 | 10.495 | 12.342 | 6.979 | 11.344 | 0.473 | 0 | 41.606 | 43.106 | 43.694 | 46.457 | 41.764 |
</details>
#### Online: NVIDIA DGX-1 (1x V100 32GB), ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA DGX-1 (1x V100 32GB)
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_18.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 1679.6 | 0.096 | 3.312 | 1.854 | 0.523 | 3.713 | 0.026 | 0 | 8.072 | 12.416 | 12.541 | 12.729 | 9.524 |
| 32 | 2760.1 | 0.095 | 3.933 | 1.978 | 0.949 | 4.597 | 0.035 | 0 | 11.569 | 11.728 | 11.785 | 12.39 | 11.587 |
| 48 | 3127.1 | 0.099 | 4.919 | 3.105 | 1.358 | 5.816 | 0.051 | 0 | 15.471 | 15.86 | 18.206 | 20.198 | 15.348 |
| 64 | 3287.4 | 0.101 | 5.874 | 4.346 | 1.789 | 7.293 | 0.069 | 0 | 19.44 | 19.727 | 19.838 | 20.584 | 19.472 |
| 80 | 3209 | 0.131 | 7.032 | 6.014 | 3.227 | 8.418 | 0.111 | 0 | 25.362 | 25.889 | 26.095 | 29.005 | 24.933 |
| 96 | 3273.6 | 0.14 | 8.539 | 6.74 | 4.371 | 9.369 | 0.153 | 0 | 29.217 | 29.641 | 29.895 | 31.002 | 29.312 |
| 112 | 3343.3 | 0.149 | 9.683 | 7.802 | 4.214 | 11.484 | 0.159 | 0 | 30.933 | 37.027 | 37.121 | 37.358 | 33.491 |
| 128 | 3335.1 | 0.152 | 9.865 | 10.127 | 5.519 | 12.534 | 0.195 | 0 | 38.762 | 40.022 | 40.336 | 42.943 | 38.392 |
| 144 | 3304.2 | 0.185 | 11.017 | 11.901 | 6.877 | 13.35 | 0.209 | 0 | 43.372 | 43.812 | 44.042 | 46.708 | 43.539 |
| 160 | 3319.9 | 0.206 | 12.701 | 12.625 | 7.49 | 14.907 | 0.238 | 0 | 48.31 | 49.135 | 49.343 | 50.441 | 48.167 |
| 176 | 3335 | 0.271 | 13.013 | 14.788 | 8.564 | 15.789 | 0.263 | 0 | 52.352 | 53.653 | 54.385 | 57.332 | 52.688 |
| 192 | 3380 | 0.243 | 13.894 | 15.719 | 9.865 | 16.841 | 0.283 | 0 | 56.872 | 58.64 | 58.944 | 62.097 | 56.845 |
| 208 | 3387.6 | 0.273 | 16.221 | 15.73 | 10.334 | 18.448 | 0.326 | 0 | 61.402 | 63.099 | 63.948 | 68.63 | 61.332 |
| 224 | 3477.2 | 0.613 | 14.167 | 18.902 | 10.896 | 19.605 | 0.34 | 0 | 64.495 | 65.69 | 66.101 | 67.522 | 64.523 |
| 240 | 3528 | 0.878 | 14.713 | 20.894 | 10.259 | 20.859 | 0.436 | 0 | 66.404 | 71.807 | 72.857 | 75.076 | 68.039 |
| 256 | 3558.4 | 1.035 | 15.534 | 22.837 | 11 | 21.062 | 0.435 | 0 | 71.657 | 77.271 | 78.269 | 80.804 | 71.903 |
</details>
#### Online: NVIDIA T4, ONNX Runtime TensorRT with FP16
Our results were obtained using the following configuration:
* **GPU:** NVIDIA T4
* **Backend:** ONNX Runtime
* **Backend accelerator:** TensorRT
* **Precision:** FP16
* **Model format:** ONNX
![](plots/graph_performance_online_26.svg)
<details>
<summary>
Full tabular data
</summary>
| Concurrent client requests | Inferences/second | Client Send | Network+server Send/recv | Server Queue | Server Compute Input | Server Compute Infer | Server Compute Output | Client Recv | P50 Latency | P90 Latency | P95 Latency | P99 Latency | Avg Latency |
|-----------------------------:|--------------------:|--------------:|---------------------------:|---------------:|-----------------------:|-----------------------:|------------------------:|--------------:|--------------:|--------------:|--------------:|--------------:|--------------:|
| 16 | 1078.4 | 0.169 | 6.163 | 2.009 | 0.495 | 5.963 | 0.022 | 0 | 15.75 | 16.219 | 16.376 | 16.597 | 14.821 |
| 32 | 2049.6 | 0.195 | 4.342 | 3.384 | 0.849 | 6.804 | 0.032 | 0 | 15.606 | 15.792 | 15.853 | 15.975 | 15.606 |
| 48 | 2133.1 | 0.189 | 6.365 | 4.926 | 1.379 | 9.573 | 0.063 | 0 | 22.304 | 23.432 | 23.73 | 27.241 | 22.495 |
| 64 | 2114.3 | 0.206 | 9.038 | 6.258 | 1.863 | 12.812 | 0.086 | 0 | 30.074 | 31.063 | 31.535 | 42.845 | 30.263 |
| 80 | 2089.3 | 0.204 | 11.943 | 7.841 | 2.676 | 15.556 | 0.108 | 0 | 38.289 | 40.895 | 52.977 | 58.393 | 38.328 |
| 96 | 2145.3 | 0.23 | 12.987 | 9.63 | 3.597 | 18.132 | 0.134 | 0 | 44.511 | 47.352 | 47.809 | 48.429 | 44.71 |
| 112 | 2062.3 | 0.28 | 13.253 | 14.112 | 5.088 | 21.398 | 0.154 | 0 | 54.289 | 55.441 | 55.69 | 56.205 | 54.285 |
| 128 | 2042.6 | 0.485 | 14.377 | 16.957 | 6.279 | 24.487 | 0.169 | 0 | 62.718 | 63.902 | 64.178 | 64.671 | 62.754 |
| 144 | 2066.6 | 0.726 | 16.363 | 18.601 | 7.085 | 26.801 | 0.193 | 0.001 | 69.67 | 71.418 | 71.765 | 73.255 | 69.77 |
| 160 | 2073.1 | 0.557 | 17.787 | 20.809 | 7.378 | 30.43 | 0.215 | 0 | 77.212 | 79.089 | 79.815 | 83.434 | 77.176 |
| 176 | 2076.8 | 1.209 | 18.446 | 23.075 | 8.689 | 32.894 | 0.253 | 0 | 84.13 | 86.732 | 87.404 | 95.286 | 84.566 |
| 192 | 2073.9 | 1.462 | 19.845 | 25.653 | 9.088 | 36.153 | 0.272 | 0 | 92.32 | 94.276 | 94.805 | 96.765 | 92.473 |
| 208 | 2053.2 | 1.071 | 22.995 | 26.411 | 10.123 | 40.415 | 0.322 | 0 | 101.178 | 103.725 | 105.498 | 110.695 | 101.337 |
| 224 | 1994.1 | 0.968 | 24.931 | 31.14 | 14.276 | 40.804 | 0.389 | 0 | 114.177 | 116.977 | 118.248 | 121.879 | 112.508 |
| 240 | 1952.6 | 1.028 | 27.957 | 34.546 | 16.535 | 42.685 | 0.38 | 0 | 122.846 | 126.022 | 128.074 | 136.541 | 123.131 |
| 256 | 2017.8 | 0.85 | 27.437 | 38.553 | 15.224 | 44.637 | 0.401 | 0 | 129.052 | 132.762 | 134.337 | 138.108 | 127.102 |
</details>
## Release Notes
We're constantly refining and improving our performance on AI
and HPC workloads, even on the same hardware, with frequent updates
to our software stack. For our latest performance data, refer
to these pages for
[AI](https://developer.nvidia.com/deep-learning-performance-training-inference)
and [HPC](https://developer.nvidia.com/hpc-application-performance) benchmarks.
### Changelog
April 2021
- NVIDIA Ampere results added
September 2020
- Initial release
### Known issues
- There are no known issues with this model.

[Plot file (matplotlib SVG): "Performance offline" — bar chart of Avg Latency vs Client Batch Size (1, 2, 4, 8, 16, 32, 64, 128). Full SVG markup omitted.]

[Plot file (matplotlib SVG): performance chart with Client Batch Size (1, 2, 4, 8, 16, 32, 64, 128) on the x-axis; the remainder of the SVG source is truncated in this excerpt.]
L 35.015625 72.90625
Q 46.296875 72.90625 52.390625 68.21875
Q 58.5 63.53125 58.5 54.890625
Q 58.5 48.1875 55.375 44.234375
Q 52.25 40.28125 46.1875 39.3125
Q 53.46875 37.75 57.5 32.78125
Q 61.53125 27.828125 61.53125 20.40625
Q 61.53125 10.640625 54.890625 5.3125
Q 48.25 0 35.984375 0
L 9.8125 0
z
" id="DejaVuSans-66"/>
<path d="M 34.28125 27.484375
Q 23.390625 27.484375 19.1875 25
Q 14.984375 22.515625 14.984375 16.5
Q 14.984375 11.71875 18.140625 8.90625
Q 21.296875 6.109375 26.703125 6.109375
Q 34.1875 6.109375 38.703125 11.40625
Q 43.21875 16.703125 43.21875 25.484375
L 43.21875 27.484375
z
M 52.203125 31.203125
L 52.203125 0
L 43.21875 0
L 43.21875 8.296875
Q 40.140625 3.328125 35.546875 0.953125
Q 30.953125 -1.421875 24.3125 -1.421875
Q 15.921875 -1.421875 10.953125 3.296875
Q 6 8.015625 6 15.921875
Q 6 25.140625 12.171875 29.828125
Q 18.359375 34.515625 30.609375 34.515625
L 43.21875 34.515625
L 43.21875 35.40625
Q 43.21875 41.609375 39.140625 45
Q 35.0625 48.390625 27.6875 48.390625
Q 23 48.390625 18.546875 47.265625
Q 14.109375 46.140625 10.015625 43.890625
L 10.015625 52.203125
Q 14.9375 54.109375 19.578125 55.046875
Q 24.21875 56 28.609375 56
Q 40.484375 56 46.34375 49.84375
Q 52.203125 43.703125 52.203125 31.203125
z
" id="DejaVuSans-97"/>
<path d="M 48.78125 52.59375
L 48.78125 44.1875
Q 44.96875 46.296875 41.140625 47.34375
Q 37.3125 48.390625 33.40625 48.390625
Q 24.65625 48.390625 19.8125 42.84375
Q 14.984375 37.3125 14.984375 27.296875
Q 14.984375 17.28125 19.8125 11.734375
Q 24.65625 6.203125 33.40625 6.203125
Q 37.3125 6.203125 41.140625 7.25
Q 44.96875 8.296875 48.78125 10.40625
L 48.78125 2.09375
Q 45.015625 0.34375 40.984375 -0.53125
Q 36.96875 -1.421875 32.421875 -1.421875
Q 20.0625 -1.421875 12.78125 6.34375
Q 5.515625 14.109375 5.515625 27.296875
Q 5.515625 40.671875 12.859375 48.328125
Q 20.21875 56 33.015625 56
Q 37.15625 56 41.109375 55.140625
Q 45.0625 54.296875 48.78125 52.59375
z
" id="DejaVuSans-99"/>
<path d="M 54.890625 33.015625
L 54.890625 0
L 45.90625 0
L 45.90625 32.71875
Q 45.90625 40.484375 42.875 44.328125
Q 39.84375 48.1875 33.796875 48.1875
Q 26.515625 48.1875 22.3125 43.546875
Q 18.109375 38.921875 18.109375 30.90625
L 18.109375 0
L 9.078125 0
L 9.078125 75.984375
L 18.109375 75.984375
L 18.109375 46.1875
Q 21.34375 51.125 25.703125 53.5625
Q 30.078125 56 35.796875 56
Q 45.21875 56 50.046875 50.171875
Q 54.890625 44.34375 54.890625 33.015625
z
" id="DejaVuSans-104"/>
<path d="M 53.515625 70.515625
L 53.515625 60.890625
Q 47.90625 63.578125 42.921875 64.890625
Q 37.9375 66.21875 33.296875 66.21875
Q 25.25 66.21875 20.875 63.09375
Q 16.5 59.96875 16.5 54.203125
Q 16.5 49.359375 19.40625 46.890625
Q 22.3125 44.4375 30.421875 42.921875
L 36.375 41.703125
Q 47.40625 39.59375 52.65625 34.296875
Q 57.90625 29 57.90625 20.125
Q 57.90625 9.515625 50.796875 4.046875
Q 43.703125 -1.421875 29.984375 -1.421875
Q 24.8125 -1.421875 18.96875 -0.25
Q 13.140625 0.921875 6.890625 3.21875
L 6.890625 13.375
Q 12.890625 10.015625 18.65625 8.296875
Q 24.421875 6.59375 29.984375 6.59375
Q 38.421875 6.59375 43.015625 9.90625
Q 47.609375 13.234375 47.609375 19.390625
Q 47.609375 24.75 44.3125 27.78125
Q 41.015625 30.8125 33.5 32.328125
L 27.484375 33.5
Q 16.453125 35.6875 11.515625 40.375
Q 6.59375 45.0625 6.59375 53.421875
Q 6.59375 63.09375 13.40625 68.65625
Q 20.21875 74.21875 32.171875 74.21875
Q 37.3125 74.21875 42.625 73.28125
Q 47.953125 72.359375 53.515625 70.515625
z
" id="DejaVuSans-83"/>
<path d="M 5.515625 54.6875
L 48.1875 54.6875
L 48.1875 46.484375
L 14.40625 7.171875
L 48.1875 7.171875
L 48.1875 0
L 4.296875 0
L 4.296875 8.203125
L 38.09375 47.515625
L 5.515625 47.515625
z
" id="DejaVuSans-122"/>
</defs>
<use xlink:href="#DejaVuSans-67"/>
<use x="69.824219" xlink:href="#DejaVuSans-108"/>
<use x="97.607422" xlink:href="#DejaVuSans-105"/>
<use x="125.390625" xlink:href="#DejaVuSans-101"/>
<use x="186.914062" xlink:href="#DejaVuSans-110"/>
<use x="250.292969" xlink:href="#DejaVuSans-116"/>
<use x="289.501953" xlink:href="#DejaVuSans-32"/>
<use x="321.289062" xlink:href="#DejaVuSans-66"/>
<use x="389.892578" xlink:href="#DejaVuSans-97"/>
<use x="451.171875" xlink:href="#DejaVuSans-116"/>
<use x="490.380859" xlink:href="#DejaVuSans-99"/>
<use x="545.361328" xlink:href="#DejaVuSans-104"/>
<use x="608.740234" xlink:href="#DejaVuSans-32"/>
<use x="640.527344" xlink:href="#DejaVuSans-83"/>
<use x="704.003906" xlink:href="#DejaVuSans-105"/>
<use x="731.787109" xlink:href="#DejaVuSans-122"/>
<use x="784.277344" xlink:href="#DejaVuSans-101"/>
</g>
</g>
</g>
<g id="matplotlib.axis_2">
<g id="ytick_1">
<g id="line2d_1">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 288.430125
L 403.43125 288.430125
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_10">
<!-- 0 -->
<g style="fill:#262626;" transform="translate(29.8125 292.609266)scale(0.11 -0.11)">
<defs>
<path d="M 31.78125 66.40625
Q 24.171875 66.40625 20.328125 58.90625
Q 16.5 51.421875 16.5 36.375
Q 16.5 21.390625 20.328125 13.890625
Q 24.171875 6.390625 31.78125 6.390625
Q 39.453125 6.390625 43.28125 13.890625
Q 47.125 21.390625 47.125 36.375
Q 47.125 51.421875 43.28125 58.90625
Q 39.453125 66.40625 31.78125 66.40625
z
M 31.78125 74.21875
Q 44.046875 74.21875 50.515625 64.515625
Q 56.984375 54.828125 56.984375 36.375
Q 56.984375 17.96875 50.515625 8.265625
Q 44.046875 -1.421875 31.78125 -1.421875
Q 19.53125 -1.421875 13.0625 8.265625
Q 6.59375 17.96875 6.59375 36.375
Q 6.59375 54.828125 13.0625 64.515625
Q 19.53125 74.21875 31.78125 74.21875
z
" id="DejaVuSans-48"/>
</defs>
<use xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_2">
<g id="line2d_2">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 234.446995
L 403.43125 234.446995
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_11">
<!-- 20 -->
<g style="fill:#262626;" transform="translate(22.81375 238.626135)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-50"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_3">
<g id="line2d_3">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 180.463864
L 403.43125 180.463864
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_12">
<!-- 40 -->
<g style="fill:#262626;" transform="translate(22.81375 184.643005)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-52"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_4">
<g id="line2d_4">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 126.480734
L 403.43125 126.480734
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_13">
<!-- 60 -->
<g style="fill:#262626;" transform="translate(22.81375 130.659875)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-54"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="ytick_5">
<g id="line2d_5">
<path clip-path="url(#p8f4ea3f47d)" d="M 46.31125 72.497604
L 403.43125 72.497604
" style="fill:none;stroke:#c0c0c0;stroke-linecap:round;stroke-width:0.5;"/>
</g>
<g id="text_14">
<!-- 80 -->
<g style="fill:#262626;" transform="translate(22.81375 76.676745)scale(0.11 -0.11)">
<use xlink:href="#DejaVuSans-56"/>
<use x="63.623047" xlink:href="#DejaVuSans-48"/>
</g>
</g>
</g>
<g id="text_15">
<!-- Avg Latency -->
<g style="fill:#262626;" transform="translate(16.318125 192.110062)rotate(-90)scale(0.12 -0.12)">
<defs>
<path d="M 34.1875 63.1875
L 20.796875 26.90625
L 47.609375 26.90625
z
M 28.609375 72.90625
L 39.796875 72.90625
L 67.578125 0
L 57.328125 0
L 50.6875 18.703125
L 17.828125 18.703125
L 11.1875 0
L 0.78125 0
z
" id="DejaVuSans-65"/>
<path d="M 2.984375 54.6875
L 12.5 54.6875
L 29.59375 8.796875
L 46.6875 54.6875
L 56.203125 54.6875
L 35.6875 0
L 23.484375 0
z
" id="DejaVuSans-118"/>
<path d="M 45.40625 27.984375
Q 45.40625 37.75 41.375 43.109375
Q 37.359375 48.484375 30.078125 48.484375
Q 22.859375 48.484375 18.828125 43.109375
Q 14.796875 37.75 14.796875 27.984375
Q 14.796875 18.265625 18.828125 12.890625
Q 22.859375 7.515625 30.078125 7.515625
Q 37.359375 7.515625 41.375 12.890625
Q 45.40625 18.265625 45.40625 27.984375
z
M 54.390625 6.78125
Q 54.390625 -7.171875 48.1875 -13.984375
Q 42 -20.796875 29.203125 -20.796875
Q 24.46875 -20.796875 20.265625 -20.09375
Q 16.0625 -19.390625 12.109375 -17.921875
L 12.109375 -9.1875
Q 16.0625 -11.328125 19.921875 -12.34375
Q 23.78125 -13.375 27.78125 -13.375
Q 36.625 -13.375 41.015625 -8.765625
Q 45.40625 -4.15625 45.40625 5.171875
L 45.40625 9.625
Q 42.625 4.78125 38.28125 2.390625
Q 33.9375 0 27.875 0
Q 17.828125 0 11.671875 7.65625
Q 5.515625 15.328125 5.515625 27.984375
Q 5.515625 40.671875 11.671875 48.328125
Q 17.828125 56 27.875 56
Q 33.9375 56 38.28125 53.609375
Q 42.625 51.21875 45.40625 46.390625
L 45.40625 54.6875
L 54.390625 54.6875
z
" id="DejaVuSans-103"/>
<path d="M 9.8125 72.90625
L 19.671875 72.90625
L 19.671875 8.296875
L 55.171875 8.296875
L 55.171875 0
L 9.8125 0
z
" id="DejaVuSans-76"/>
<path d="M 32.171875 -5.078125
Q 28.375 -14.84375 24.75 -17.8125
Q 21.140625 -20.796875 15.09375 -20.796875
L 7.90625 -20.796875
L 7.90625 -13.28125
L 13.1875 -13.28125
Q 16.890625 -13.28125 18.9375 -11.515625
Q 21 -9.765625 23.484375 -3.21875
L 25.09375 0.875
L 2.984375 54.6875
L 12.5 54.6875
L 29.59375 11.921875
L 46.6875 54.6875
L 56.203125 54.6875
z
" id="DejaVuSans-121"/>
</defs>
<use xlink:href="#DejaVuSans-65"/>
<use x="62.533203" xlink:href="#DejaVuSans-118"/>
<use x="121.712891" xlink:href="#DejaVuSans-103"/>
<use x="185.189453" xlink:href="#DejaVuSans-32"/>
<use x="216.976562" xlink:href="#DejaVuSans-76"/>
<use x="272.689453" xlink:href="#DejaVuSans-97"/>
<use x="333.96875" xlink:href="#DejaVuSans-116"/>
<use x="373.177734" xlink:href="#DejaVuSans-101"/>
<use x="434.701172" xlink:href="#DejaVuSans-110"/>
<use x="498.080078" xlink:href="#DejaVuSans-99"/>
<use x="553.060547" xlink:href="#DejaVuSans-121"/>
</g>
</g>
</g>
<g id="patch_3">
<path clip-path="url(#p8f4ea3f47d)" d="M 50.77525 288.430125
L 86.48725 288.430125
L 86.48725 280.769919
L 50.77525 280.769919
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_4">
<path clip-path="url(#p8f4ea3f47d)" d="M 95.41525 288.430125
L 131.12725 288.430125
L 131.12725 279.387951
L 95.41525 279.387951
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_5">
<path clip-path="url(#p8f4ea3f47d)" d="M 140.05525 288.430125
L 175.76725 288.430125
L 175.76725 277.11796
L 140.05525 277.11796
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_6">
<path clip-path="url(#p8f4ea3f47d)" d="M 184.69525 288.430125
L 220.40725 288.430125
L 220.40725 272.291868
L 184.69525 272.291868
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_7">
<path clip-path="url(#p8f4ea3f47d)" d="M 229.33525 288.430125
L 265.04725 288.430125
L 265.04725 263.419741
L 229.33525 263.419741
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_8">
<path clip-path="url(#p8f4ea3f47d)" d="M 273.97525 288.430125
L 309.68725 288.430125
L 309.68725 246.150537
L 273.97525 246.150537
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_9">
<path clip-path="url(#p8f4ea3f47d)" d="M 318.61525 288.430125
L 354.32725 288.430125
L 354.32725 184.750125
L 318.61525 184.750125
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_10">
<path clip-path="url(#p8f4ea3f47d)" d="M 363.25525 288.430125
L 398.96725 288.430125
L 398.96725 66.670125
L 363.25525 66.670125
z
" style="fill:#5875a4;stroke:#ffffff;stroke-linejoin:miter;"/>
</g>
<g id="patch_11">
<path d="M 46.31125 288.430125
L 46.31125 22.318125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="patch_12">
<path d="M 403.43125 288.430125
L 403.43125 22.318125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="patch_13">
<path d="M 46.31125 288.430125
L 403.43125 288.430125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="patch_14">
<path d="M 46.31125 22.318125
L 403.43125 22.318125
" style="fill:none;stroke:#000000;stroke-linecap:square;stroke-linejoin:miter;stroke-width:2;"/>
</g>
<g id="text_16">
<!-- Performance offline -->
<g style="fill:#262626;" transform="translate(166.222188 16.318125)scale(0.12 -0.12)">
<defs>
<path d="M 19.671875 64.796875
L 19.671875 37.40625
L 32.078125 37.40625
Q 38.96875 37.40625 42.71875 40.96875
Q 46.484375 44.53125 46.484375 51.125
Q 46.484375 57.671875 42.71875 61.234375
Q 38.96875 64.796875 32.078125 64.796875
z
M 9.8125 72.90625
L 32.078125 72.90625
Q 44.34375 72.90625 50.609375 67.359375
Q 56.890625 61.8125 56.890625 51.125
Q 56.890625 40.328125 50.609375 34.8125
Q 44.34375 29.296875 32.078125 29.296875
L 19.671875 29.296875
L 19.671875 0
L 9.8125 0
z
" id="DejaVuSans-80"/>
<path d="M 41.109375 46.296875
Q 39.59375 47.171875 37.8125 47.578125
Q 36.03125 48 33.890625 48
Q 26.265625 48 22.1875 43.046875
Q 18.109375 38.09375 18.109375 28.8125
L 18.109375 0
L 9.078125 0
L 9.078125 54.6875
L 18.109375 54.6875
L 18.109375 46.1875
Q 20.953125 51.171875 25.484375 53.578125
Q 30.03125 56 36.53125 56
Q 37.453125 56 38.578125 55.875
Q 39.703125 55.765625 41.0625 55.515625
z
" id="DejaVuSans-114"/>
<path d="M 37.109375 75.984375
L 37.109375 68.5
L 28.515625 68.5
Q 23.6875 68.5 21.796875 66.546875
Q 19.921875 64.59375 19.921875 59.515625
L 19.921875 54.6875
L 34.71875 54.6875
L 34.71875 47.703125
L 19.921875 47.703125
L 19.921875 0
L 10.890625 0
L 10.890625 47.703125
L 2.296875 47.703125
L 2.296875 54.6875
L 10.890625 54.6875
L 10.890625 58.5
Q 10.890625 67.625 15.140625 71.796875
Q 19.390625 75.984375 28.609375 75.984375
z
" id="DejaVuSans-102"/>
<path d="M 30.609375 48.390625
Q 23.390625 48.390625 19.1875 42.75
Q 14.984375 37.109375 14.984375 27.296875
Q 14.984375 17.484375 19.15625 11.84375
Q 23.34375 6.203125 30.609375 6.203125
Q 37.796875 6.203125 41.984375 11.859375
Q 46.1875 17.53125 46.1875 27.296875
Q 46.1875 37.015625 41.984375 42.703125
Q 37.796875 48.390625 30.609375 48.390625
z
M 30.609375 56
Q 42.328125 56 49.015625 48.375
Q 55.71875 40.765625 55.71875 27.296875
Q 55.71875 13.875 49.015625 6.21875
Q 42.328125 -1.421875 30.609375 -1.421875
Q 18.84375 -1.421875 12.171875 6.21875
Q 5.515625 13.875 5.515625 27.296875
Q 5.515625 40.765625 12.171875 48.375
Q 18.84375 56 30.609375 56
z
" id="DejaVuSans-111"/>
<path d="M 52 44.1875
Q 55.375 50.25 60.0625 53.125
Q 64.75 56 71.09375 56
Q 79.640625 56 84.28125 50.015625
Q 88.921875 44.046875 88.921875 33.015625
L 88.921875 0
L 79.890625 0
L 79.890625 32.71875
Q 79.890625 40.578125 77.09375 44.375
Q 74.3125 48.1875 68.609375 48.1875
Q 61.625 48.1875 57.5625 43.546875
Q 53.515625 38.921875 53.515625 30.90625
L 53.515625 0
L 44.484375 0
L 44.484375 32.71875
Q 44.484375 40.625 41.703125 44.40625
Q 38.921875 48.1875 33.109375 48.1875
Q 26.21875 48.1875 22.15625 43.53125
Q 18.109375 38.875 18.109375 30.90625
L 18.109375 0
L 9.078125 0
L 9.078125 54.6875
L 18.109375 54.6875
L 18.109375 46.1875
Q 21.1875 51.21875 25.484375 53.609375
Q 29.78125 56 35.6875 56
Q 41.65625 56 45.828125 52.96875
Q 50 49.953125 52 44.1875
z
" id="DejaVuSans-109"/>
</defs>
<use xlink:href="#DejaVuSans-80"/>
<use x="56.677734" xlink:href="#DejaVuSans-101"/>
<use x="118.201172" xlink:href="#DejaVuSans-114"/>
<use x="159.314453" xlink:href="#DejaVuSans-102"/>
<use x="194.519531" xlink:href="#DejaVuSans-111"/>
<use x="255.701172" xlink:href="#DejaVuSans-114"/>
<use x="295.064453" xlink:href="#DejaVuSans-109"/>
<use x="392.476562" xlink:href="#DejaVuSans-97"/>
<use x="453.755859" xlink:href="#DejaVuSans-110"/>
<use x="517.134766" xlink:href="#DejaVuSans-99"/>
<use x="572.115234" xlink:href="#DejaVuSans-101"/>
<use x="633.638672" xlink:href="#DejaVuSans-32"/>
<use x="665.425781" xlink:href="#DejaVuSans-111"/>
<use x="726.607422" xlink:href="#DejaVuSans-102"/>
<use x="761.8125" xlink:href="#DejaVuSans-102"/>
<use x="797.017578" xlink:href="#DejaVuSans-108"/>
<use x="824.800781" xlink:href="#DejaVuSans-105"/>
<use x="852.583984" xlink:href="#DejaVuSans-110"/>
<use x="915.962891" xlink:href="#DejaVuSans-101"/>
</g>
</g>
<g id="legend_1"/>
</g>
</g>
<defs>
<clipPath id="p8f4ea3f47d">
<rect height="266.112" width="357.12" x="46.31125" y="22.318125"/>
</clipPath>
</defs>
</svg>

After

Width:  |  Height:  |  Size: 29 KiB

[Four additional SVG performance charts added in this commit (96, 95, 94, and 92 KiB); their diffs are suppressed as too large.]
View file

@@ -0,0 +1,134 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
To run inference on the framework runtime, use the `run_inference_on_fw.py` script.
It runs inference locally on data obtained from the given data loader and saves the outputs into npz files.
Those files are stored in the directory specified by the `--output-dir` argument.
Example call:
```shell script
python ./triton/run_inference_on_fw.py \
--input-path /models/exported/model.onnx \
--input-type onnx \
--dataloader triton/dataloader.py \
--data-dir /data/imagenet \
--batch-size 32 \
--output-dir /results/dump_local \
--dump-labels
```
"""
import argparse
import logging
import os
from pathlib import Path
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["TF_ENABLE_DEPRECATION_WARNINGS"] = "0"
from tqdm import tqdm
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import DATALOADER_FN_NAME, BaseLoader, BaseRunner, Format, load_from_file
from .deployment_toolkit.dump import NpzWriter
from .deployment_toolkit.extensions import loaders, runners
LOGGER = logging.getLogger("run_inference_on_fw")
def _verify_and_format_dump(args, ids, x, y_pred, y_real):
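    # Assemble one batch for NpzWriter: predictions under "outputs", sample ids under "ids",
    # plus inputs/labels when the corresponding --dump-inputs/--dump-labels flags are set.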
data = {"outputs": y_pred, "ids": {"ids": ids}}
if args.dump_inputs:
data["inputs"] = x
if args.dump_labels:
if not y_real:
raise ValueError(
"Found empty label values. Please provide labels in dataloader_fn or do not use --dump-labels argument"
)
data["labels"] = y_real
return data
def _parse_and_validate_args():
supported_inputs = set(runners.supported_extensions) & set(loaders.supported_extensions)
parser = argparse.ArgumentParser(description="Dump local inference output of given model", allow_abbrev=False)
parser.add_argument("--input-path", help="Path to input model", required=True)
parser.add_argument("--input-type", help="Input model type", choices=supported_inputs, required=True)
parser.add_argument("--dataloader", help="Path to python file containing dataloader.", required=True)
parser.add_argument("--output-dir", help="Path to dir where output files will be stored", required=True)
parser.add_argument("--dump-labels", help="Dump labels to output dir", action="store_true", default=False)
parser.add_argument("--dump-inputs", help="Dump inputs to output dir", action="store_true", default=False)
parser.add_argument("-v", "--verbose", help="Verbose logs", action="store_true", default=False)
args, *_ = parser.parse_known_args()
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
ArgParserGenerator(get_dataloader_fn).update_argparser(parser)
Loader: BaseLoader = loaders.get(args.input_type)
ArgParserGenerator(Loader, module_path=args.input_path).update_argparser(parser)
Runner: BaseRunner = runners.get(args.input_type)
ArgParserGenerator(Runner).update_argparser(parser)
args = parser.parse_args()
types_requiring_io_params = []
if args.input_type in types_requiring_io_params and not all(p for p in [args.inputs, args.outputs]):
parser.error(f"For {args.input_type} input provide --inputs and --outputs parameters")
return args
def main():
args = _parse_and_validate_args()
log_level = logging.INFO if not args.verbose else logging.DEBUG
log_format = "%(asctime)s %(levelname)s %(name)s %(message)s"
logging.basicConfig(level=log_level, format=log_format)
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
Loader: BaseLoader = loaders.get(args.input_type)
Runner: BaseRunner = runners.get(args.input_type)
loader = ArgParserGenerator(Loader, module_path=args.input_path).from_args(args)
runner = ArgParserGenerator(Runner).from_args(args)
LOGGER.info(f"Loading {args.input_path}")
model = loader.load(args.input_path)
with runner.init_inference(model=model) as runner_session, NpzWriter(args.output_dir) as writer:
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
dataloader_fn = ArgParserGenerator(get_dataloader_fn).from_args(args)
LOGGER.info(f"Data loader initialized; Running inference")
for ids, x, y_real in tqdm(dataloader_fn(), unit="batch", mininterval=10):
y_pred = runner_session(x)
data = _verify_and_format_dump(args, ids=ids, x=x, y_pred=y_pred, y_real=y_real)
writer.write(**data)
LOGGER.info(f"Inference finished")
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,287 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
To run inference on a model deployed on Triton, use the `run_inference_on_triton.py` script.
It sends requests with data obtained from the given data loader and dumps the received outputs into npz files.
Those files are stored in the directory specified by the `--output-dir` argument.
Currently, the client communicates with the Triton server asynchronously over the gRPC protocol.
Example call:
```shell script
python ./triton/run_inference_on_triton.py \
--server-url localhost:8001 \
--model-name ResNet50 \
--model-version 1 \
--dump-labels \
--output-dir /results/dump_triton
```
"""
import argparse
import functools
import logging
import queue
import threading
import time
from pathlib import Path
from typing import Optional
from tqdm import tqdm
# pytype: disable=import-error
try:
from tritonclient import utils as client_utils # noqa: F401
from tritonclient.grpc import (
InferenceServerClient,
InferInput,
InferRequestedOutput,
)
except ImportError:
import tritongrpcclient as grpc_client
from tritongrpcclient import (
InferenceServerClient,
InferInput,
InferRequestedOutput,
)
# pytype: enable=import-error
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.args import ArgParserGenerator
from .deployment_toolkit.core import DATALOADER_FN_NAME, load_from_file
from .deployment_toolkit.dump import NpzWriter
LOGGER = logging.getLogger("run_inference_on_triton")
class AsyncGRPCTritonRunner:
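    """Asynchronous gRPC client wrapper for Triton inference.

    A background thread issues requests from the dataloader while a condition
    variable caps the number of in-flight (unresponded) requests; completed
    responses are queued and yielded back to the caller batch by batch.
    """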
DEFAULT_MAX_RESP_WAIT_S = 120
DEFAULT_MAX_UNRESP_REQS = 128
DEFAULT_MAX_FINISH_WAIT_S = 900 # 15min
def __init__(
self,
server_url: str,
model_name: str,
model_version: str,
*,
dataloader,
verbose=False,
resp_wait_s: Optional[float] = None,
max_unresponded_reqs: Optional[int] = None,
):
self._server_url = server_url
self._model_name = model_name
self._model_version = model_version
self._dataloader = dataloader
self._verbose = verbose
self._response_wait_t = self.DEFAULT_MAX_RESP_WAIT_S if resp_wait_s is None else resp_wait_s
self._max_unresp_reqs = self.DEFAULT_MAX_UNRESP_REQS if max_unresponded_reqs is None else max_unresponded_reqs
self._results = queue.Queue()
self._processed_all = False
self._errors = []
self._num_waiting_for = 0
self._sync = threading.Condition()
self._req_thread = threading.Thread(target=self.req_loop, daemon=True)
def __iter__(self):
self._req_thread.start()
timeout_s = 0.050 # check flags processed_all and error flags every 50ms
while True:
try:
ids, x, y_pred, y_real = self._results.get(timeout=timeout_s)
yield ids, x, y_pred, y_real
except queue.Empty:
shall_stop = self._processed_all or self._errors
if shall_stop:
break
LOGGER.debug("Waiting for request thread to stop")
self._req_thread.join()
if self._errors:
error_msg = "\n".join(map(str, self._errors))
raise RuntimeError(error_msg)
def _on_result(self, ids, x, y_real, output_names, result, error):
with self._sync:
if error:
self._errors.append(error)
else:
y_pred = {name: result.as_numpy(name) for name in output_names}
self._results.put((ids, x, y_pred, y_real))
self._num_waiting_for -= 1
self._sync.notify_all()
def req_loop(self):
client = InferenceServerClient(self._server_url, verbose=self._verbose)
self._errors = self._verify_triton_state(client)
if self._errors:
return
LOGGER.debug(
f"Triton server {self._server_url} and model {self._model_name}:{self._model_version} " f"are up and ready!"
)
model_config = client.get_model_config(self._model_name, self._model_version)
model_metadata = client.get_model_metadata(self._model_name, self._model_version)
LOGGER.info(f"Model config {model_config}")
LOGGER.info(f"Model metadata {model_metadata}")
inputs = {tm.name: tm for tm in model_metadata.inputs}
outputs = {tm.name: tm for tm in model_metadata.outputs}
output_names = list(outputs)
outputs_req = [InferRequestedOutput(name) for name in outputs]
self._num_waiting_for = 0
for ids, x, y_real in self._dataloader:
infer_inputs = []
for name in inputs:
data = x[name]
infer_input = InferInput(name, data.shape, inputs[name].datatype)
target_np_dtype = client_utils.triton_to_np_dtype(inputs[name].datatype)
data = data.astype(target_np_dtype)
infer_input.set_data_from_numpy(data)
infer_inputs.append(infer_input)
with self._sync:
def _check_can_send():
return self._num_waiting_for < self._max_unresp_reqs
can_send = self._sync.wait_for(_check_can_send, timeout=self._response_wait_t)
if not can_send:
error_msg = f"Runner could not send new requests for {self._response_wait_t}s"
self._errors.append(error_msg)
break
callback = functools.partial(AsyncGRPCTritonRunner._on_result, self, ids, x, y_real, output_names)
client.async_infer(
model_name=self._model_name,
model_version=self._model_version,
inputs=infer_inputs,
outputs=outputs_req,
callback=callback,
)
self._num_waiting_for += 1
# wait till receive all requested data
with self._sync:
def _all_processed():
LOGGER.debug(f"wait for {self._num_waiting_for} unprocessed jobs")
return self._num_waiting_for == 0
self._processed_all = self._sync.wait_for(_all_processed, self.DEFAULT_MAX_FINISH_WAIT_S)
if not self._processed_all:
error_msg = f"Runner {self._response_wait_t}s timeout received while waiting for results from server"
self._errors.append(error_msg)
LOGGER.debug("Finished request thread")
def _verify_triton_state(self, triton_client):
errors = []
if not triton_client.is_server_live():
errors.append(f"Triton server {self._server_url} is not live")
elif not triton_client.is_server_ready():
errors.append(f"Triton server {self._server_url} is not ready")
elif not triton_client.is_model_ready(self._model_name, self._model_version):
errors.append(f"Model {self._model_name}:{self._model_version} is not ready")
return errors
def _parse_args():
parser = argparse.ArgumentParser(description="Infer model on Triton server", allow_abbrev=False)
parser.add_argument(
"--server-url", type=str, default="localhost:8001", help="Inference server URL (default localhost:8001)"
)
parser.add_argument("--model-name", help="The name of the model used for inference.", required=True)
parser.add_argument("--model-version", help="The version of the model used for inference.", required=True)
parser.add_argument("--dataloader", help="Path to python file containing dataloader.", required=True)
parser.add_argument("--dump-labels", help="Dump labels to output dir", action="store_true", default=False)
parser.add_argument("--dump-inputs", help="Dump inputs to output dir", action="store_true", default=False)
parser.add_argument("-v", "--verbose", help="Verbose logs", action="store_true", default=False)
parser.add_argument("--output-dir", required=True, help="Path to directory where outputs will be saved")
parser.add_argument("--response-wait-time", required=False, help="Maximal time to wait for response", default=120)
parser.add_argument(
"--max-unresponded-requests", required=False, help="Maximal number of unresponded requests", default=128
)
args, *_ = parser.parse_known_args()
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
ArgParserGenerator(get_dataloader_fn).update_argparser(parser)
args = parser.parse_args()
return args
def main():
args = _parse_args()
log_format = "%(asctime)s %(levelname)s %(name)s %(message)s"
log_level = logging.INFO if not args.verbose else logging.DEBUG
logging.basicConfig(level=log_level, format=log_format)
LOGGER.info(f"args:")
for key, value in vars(args).items():
LOGGER.info(f" {key} = {value}")
get_dataloader_fn = load_from_file(args.dataloader, label="dataloader", target=DATALOADER_FN_NAME)
dataloader_fn = ArgParserGenerator(get_dataloader_fn).from_args(args)
runner = AsyncGRPCTritonRunner(
args.server_url,
args.model_name,
args.model_version,
dataloader=dataloader_fn(),
verbose=False,
resp_wait_s=args.response_wait_time,
max_unresponded_reqs=args.max_unresponded_requests,
)
with NpzWriter(output_dir=args.output_dir) as writer:
start = time.time()
for ids, x, y_pred, y_real in tqdm(runner, unit="batch", mininterval=10):
data = _verify_and_format_dump(args, ids, x, y_pred, y_real)
writer.write(**data)
stop = time.time()
LOGGER.info(f"\nThe inference took {stop - start:0.3f}s")
def _verify_and_format_dump(args, ids, x, y_pred, y_real):
data = {"outputs": y_pred, "ids": {"ids": ids}}
if args.dump_inputs:
data["inputs"] = x
if args.dump_labels:
if not y_real:
raise ValueError(
"Found empty label values. Please provide labels in dataloader_fn or do not use --dump-labels argument"
)
data["labels"] = y_real
return data
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,178 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
For models with variable-sized inputs, you must provide the --input-shape argument so that perf_analyzer knows
what shape tensors to use. For example, for a model that has an input called IMAGE with shape [ 3, N, M ],
where N and M are variable-size dimensions, to tell perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ],
use `--shape IMAGE:3,224,224`.
"""
import argparse
import csv
import os
import sys
from pathlib import Path
from typing import Dict, List, Optional
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.report import save_results, show_results, sort_results
from .deployment_toolkit.warmup import warmup
def calculate_average_latency(r):
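    # Approximate end-to-end latency by summing the per-stage latency columns
    # reported by perf_client for this row (missing columns count as 0).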
avg_sum_fields = [
"Client Send",
"Network+Server Send/Recv",
"Server Queue",
"Server Compute",
"Server Compute Input",
"Server Compute Infer",
"Server Compute Output",
"Client Recv",
]
avg_latency = sum([int(r.get(f, 0)) for f in avg_sum_fields])
return avg_latency
def update_performance_data(results: List, batch_size: int, performance_partial_file: str):
row: Dict = {"batch_size": batch_size}
with open(performance_partial_file, "r") as csvfile:
reader = csv.DictReader(csvfile)
for r in reader:
avg_latency = calculate_average_latency(r)
row = {**row, **r, "avg latency": avg_latency}
results.append(row)
def _parse_batch_sizes(batch_sizes: str):
batches = batch_sizes.split(sep=",")
return list(map(lambda x: int(x.strip()), batches))
def offline_performance(
model_name: str,
batch_sizes: List[int],
result_path: str,
input_shapes: Optional[List[str]] = None,
profiling_data: str = "random",
triton_instances: int = 1,
server_url: str = "localhost",
measurement_window: int = 10000,
shared_memory: bool = False
):
print("\n")
print(f"==== Static batching analysis start ====")
print("\n")
input_shapes = " ".join(map(lambda shape: f" --shape {shape}", input_shapes)) if input_shapes else ""
results: List[Dict] = list()
for batch_size in batch_sizes:
print(f"Running performance tests for batch size: {batch_size}")
performance_partial_file = f"triton_performance_partial_{batch_size}.csv"
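        # perf_client invocation: -m selects the model, -b the static batch size,
        # -p the measurement window (ms), -f the CSV output parsed by update_performance_data().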
exec_args = f"""-max-threads {triton_instances} \
-m {model_name} \
-x 1 \
-c {triton_instances} \
-t {triton_instances} \
-p {measurement_window} \
-v \
-i http \
-u {server_url}:8000 \
-b {batch_size} \
-f {performance_partial_file} \
--input-data {profiling_data} {input_shapes}"""
if shared_memory:
exec_args += " --shared-memory=cuda"
result = os.system(f"perf_client {exec_args}")
if result != 0:
print(f"Failed running performance tests. Perf client failed with exit code {result}")
sys.exit(1)
update_performance_data(results, batch_size, performance_partial_file)
os.remove(performance_partial_file)
results = sort_results(results=results)
save_results(filename=result_path, data=results)
show_results(results=results)
print("Performance results for static batching stored in: {0}".format(result_path))
print("\n")
print(f"==== Analysis done ====")
print("\n")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--model-name", type=str, required=True, help="Name of the model to test")
parser.add_argument(
"--input-data", type=str, required=False, default="random", help="Input data to perform profiling."
)
parser.add_argument(
"--input-shape",
action="append",
required=False,
help="Input data shape in form INPUT_NAME:<full_shape_without_batch_axis>.",
)
parser.add_argument("--batch-sizes", type=str, required=True, help="List of batch sizes to tests. Comma separated.")
parser.add_argument("--result-path", type=str, required=True, help="Path where result file is going to be stored.")
parser.add_argument("--triton-instances", type=int, default=1, help="Number of Triton Server instances")
parser.add_argument("--server-url", type=str, required=False, default="localhost", help="Url to Triton server")
parser.add_argument(
"--measurement-window", required=False, help="Time which perf_analyzer will wait for results", default=10000
)
parser.add_argument("--shared-memory", help="Use shared memory for communication with Triton", action="store_true",
default=False)
args = parser.parse_args()
warmup(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
offline_performance(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
result_path=args.result_path,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,188 @@
#!/usr/bin/env python3
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""
For models with variable-sized inputs, you must provide the --input-shape argument so that perf_analyzer knows
what shape tensors to use. For example, for a model that has an input called IMAGE with shape [ 3, N, M ],
where N and M are variable-size dimensions, to tell perf_analyzer to send batch-size 4 requests of shape [ 3, 224, 224 ],
use `--shape IMAGE:3,224,224`.
"""
import argparse
import csv
import os
import sys
from pathlib import Path
from typing import List, Optional
# method from PEP-366 to support relative import in executed modules
if __package__ is None:
__package__ = Path(__file__).parent.name
from .deployment_toolkit.report import save_results, show_results, sort_results
from .deployment_toolkit.warmup import warmup
def calculate_average_latency(r):
avg_sum_fields = [
"Client Send",
"Network+Server Send/Recv",
"Server Queue",
"Server Compute",
"Server Compute Input",
"Server Compute Infer",
"Server Compute Output",
"Client Recv",
]
avg_latency = sum([int(r.get(f, 0)) for f in avg_sum_fields])
return avg_latency
def update_performance_data(results: List, performance_file: str):
with open(performance_file, "r") as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
row["avg latency"] = calculate_average_latency(row)
results.append(row)
def _parse_batch_sizes(batch_sizes: str):
batches = batch_sizes.split(sep=",")
return list(map(lambda x: int(x.strip()), batches))
def online_performance(
model_name: str,
batch_sizes: List[int],
result_path: str,
input_shapes: Optional[List[str]] = None,
profiling_data: str = "random",
triton_instances: int = 1,
triton_gpu_engine_count: int = 1,
server_url: str = "localhost",
measurement_window: int = 10000,
shared_memory: bool = False
):
print("\n")
print(f"==== Dynamic batching analysis start ====")
print("\n")
input_shapes = " ".join(map(lambda shape: f" --shape {shape}", input_shapes)) if input_shapes else ""
print(f"Running performance tests for dynamic batching")
performance_file = f"triton_performance_dynamic_partial.csv"
max_batch_size = max(batch_sizes)
max_total_requests = 2 * max_batch_size * triton_instances * triton_gpu_engine_count
max_concurrency = min(256, max_total_requests)
batch_size = max(1, max_total_requests // 256)
step = max(1, max_concurrency // 32)
min_concurrency = step
exec_args = f"""-m {model_name} \
-x 1 \
-p {measurement_window} \
-v \
-i http \
-u {server_url}:8000 \
-b {batch_size} \
-f {performance_file} \
--concurrency-range {min_concurrency}:{max_concurrency}:{step} \
--input-data {profiling_data} {input_shapes}"""
if shared_memory:
exec_args += " --shared-memory=cuda"
result = os.system(f"perf_client {exec_args}")
if result != 0:
print(f"Failed running performance tests. Perf client failed with exit code {result}")
sys.exit(1)
results = list()
update_performance_data(results=results, performance_file=performance_file)
results = sort_results(results=results)
save_results(filename=result_path, data=results)
show_results(results=results)
os.remove(performance_file)
print("Performance results for dynamic batching stored in: {0}".format(result_path))
print("\n")
print(f"==== Analysis done ====")
print("\n")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--model-name", type=str, required=True, help="Name of the model to test")
parser.add_argument(
"--input-data", type=str, required=False, default="random", help="Input data to perform profiling."
)
parser.add_argument(
"--input-shape",
action="append",
required=False,
help="Input data shape in form INPUT_NAME:<full_shape_without_batch_axis>.",
)
parser.add_argument("--batch-sizes", type=str, required=True, help="List of batch sizes to tests. Comma separated.")
parser.add_argument("--triton-instances", type=int, default=1, help="Number of Triton Server instances")
parser.add_argument(
"--number-of-model-instances", type=int, default=1, help="Number of models instances on Triton Server"
)
parser.add_argument("--result-path", type=str, required=True, help="Path where result file is going to be stored.")
parser.add_argument("--server-url", type=str, required=False, default="localhost", help="Url to Triton server")
parser.add_argument(
"--measurement-window", required=False, help="Time which perf_analyzer will wait for results", default=10000
)
parser.add_argument("--shared-memory", help="Use shared memory for communication with Triton", action="store_true",
default=False)
args = parser.parse_args()
warmup(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
triton_gpu_engine_count=args.number_of_model_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
online_performance(
server_url=args.server_url,
model_name=args.model_name,
batch_sizes=_parse_batch_sizes(args.batch_sizes),
triton_instances=args.triton_instances,
triton_gpu_engine_count=args.number_of_model_instances,
profiling_data=args.input_data,
input_shapes=args.input_shape,
result_path=args.result_path,
measurement_window=args.measurement_window,
shared_memory=args.shared_memory
)
if __name__ == "__main__":
main()

View file

@@ -0,0 +1,16 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
docker build -t resnet50 . -f triton/resnet50/Dockerfile

View file

@@ -0,0 +1,26 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
docker run -it --rm \
--gpus "device=all" \
--net=host \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-e WORKDIR=$(pwd) \
-e PYTHONPATH=$(pwd) \
-v $(pwd):$(pwd) \
-w $(pwd) \
resnet50:latest bash

View file

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:=all}
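# Start Triton Inference Server in the background, publishing its HTTP (8000),
# gRPC (8001), and metrics (8002) endpoints.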
docker run --rm -d \
-p 8000:8000 \
-p 8001:8001 \
-p 8002:8002 \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES} \
-v ${MODEL_REPOSITORY_PATH}:${MODEL_REPOSITORY_PATH} \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
nvcr.io/nvidia/tritonserver:21.02-py3 tritonserver \
--model-store=${MODEL_REPOSITORY_PATH} \
--strict-model-config=false \
--exit-on-error=true \
--model-control-mode=explicit

View file

@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Download checkpoint
if [ -f "${CHECKPOINT_DIR}/nvidia_resnet50_200821.pth.tar" ]; then
echo "Checkpoint already downloaded."
else
echo "Downloading checkpoint ..."
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/resnet50_pyt_amp/versions/20.06.0/zip -O \
resnet50_pyt_amp_20.06.0.zip || {
echo "ERROR: Failed to download checkpoint from NGC"
exit 1
}
unzip resnet50_pyt_amp_20.06.0.zip -d ${CHECKPOINT_DIR}
rm resnet50_pyt_amp_20.06.0.zip
echo "ok"
fi

View file

@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [ -d "${DATASETS_DIR}/imagenet" ]; then
echo "Dataset already downloaded and processed."
else
python triton/process_dataset.py
fi

View file

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
WORKDIR="${WORKDIR:=$(pwd)}"
export WORKSPACE_DIR=${WORKDIR}/workspace
export DATASETS_DIR=${WORKSPACE_DIR}/datasets_dir
export CHECKPOINT_DIR=${WORKSPACE_DIR}/checkpoint_dir
export MODEL_REPOSITORY_PATH=${WORKSPACE_DIR}/model_store
export SHARED_DIR=${WORKSPACE_DIR}/shared_dir
echo "Preparing directories"
mkdir -p ${WORKSPACE_DIR}
mkdir -p ${DATASETS_DIR}
mkdir -p ${CHECKPOINT_DIR}
mkdir -p ${MODEL_REPOSITORY_PATH}
mkdir -p ${SHARED_DIR}
echo "Setting up environment"
export MODEL_NAME=resnet50
export TRITON_LOAD_MODEL_METHOD=explicit
export TRITON_INSTANCES=1

View file

@@ -0,0 +1,23 @@
#!/usr/bin/env bash
# Copyright (c) 2021 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
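# Default deployment parameters: TensorRT FP16 engine, CUDA backend accelerator,
# batch sizes up to 128, and dynamic batching with preferred batch sizes 64 and 128.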
export PRECISION="fp16"
export FORMAT="trt"
export BATCH_SIZE="1,2,4,8,16,32,64,128"
export BACKEND_ACCELERATOR="cuda"
export MAX_BATCH_SIZE="128"
export NUMBER_OF_MODEL_INSTANCES="1"
export TRITON_MAX_QUEUE_DELAY="1"
export TRITON_PREFERRED_BATCH_SIZES="64 128"

View file

@@ -32,7 +32,7 @@ allow_multiline_lambdas = True
# # <------ this blank line
# def method():
# pass
blank_line_before_nested_class_or_def = True
blank_line_before_nested_class_or_def = False
# Insert a blank line before a module docstring.
blank_line_before_module_docstring = True
@@ -83,7 +83,7 @@ continuation_indent_width = 4
# start_ts=now()-timedelta(days=3),
# end_ts=now(),
# ) # <--- this bracket is dedented and on a separate line
dedent_closing_brackets = True
dedent_closing_brackets = False
# Disable the heuristic which places each list element on a separate line if the list is comma-terminated.
disable_ending_comma_heuristic = false

View file

@@ -1,8 +1,30 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.12-tf1-py3
ARG TRITON_CLIENT_IMAGE_NAME=nvcr.io/nvidia/tritonserver:20.12-py3-sdk
FROM ${TRITON_CLIENT_IMAGE_NAME} as triton-client
FROM ${FROM_IMAGE_NAME}
ADD requirements.txt .
RUN pip install -r requirements.txt
# Install libraries required by perf_client
RUN apt-get update && \
apt-get install -y libb64-dev libb64-0d && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
ADD . /workspace/rn50v15_tf
# Install Triton Client PythonAPI and copy Perf Client
COPY --from=triton-client /workspace/install/ /workspace/install/
ENV LD_LIBRARY_PATH /workspace/install/lib:${LD_LIBRARY_PATH}
RUN find /workspace/install/python/ -iname triton*manylinux*.whl -exec pip install {}[all] \;
# Set up environment variables to access Triton Client lib and bin
ENV PATH /workspace/install/bin:${PATH}
ENV PYTHONPATH /workspace/rn50v15_tf
WORKDIR /workspace/rn50v15_tf
RUN pip uninstall -y typing
ADD requirements.txt .
ADD triton/requirements.txt triton/requirements.txt
RUN pip install -r requirements.txt
RUN pip install -r triton/requirements.txt
ADD . .

View file

@@ -51,7 +51,7 @@ were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
The following table shows the training accuracy results of the
The following table shows the training performance results of the
three classification models side-by-side.
@@ -71,7 +71,7 @@ were averaged over an entire training epoch.
The specific training script that was run is documented
in the corresponding model's README.
The following table shows the training accuracy results of the
The following table shows the training performance results of the
three classification models side-by-side.

View file

@@ -0,0 +1,436 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts image data to TFRecords file format with Example protos.
The image data set is expected to reside in JPEG files located in the
following directory structure.
data_dir/label_0/image0.jpeg
data_dir/label_0/image1.jpg
...
data_dir/label_1/weird-image.jpeg
data_dir/label_1/my-image.jpeg
...
where the sub-directory is the unique label associated with these images.
This TensorFlow script converts the training and evaluation data into
a sharded data set consisting of TFRecord files
train_directory/train-00000-of-01024
train_directory/train-00001-of-01024
...
train_directory/train-01023-of-01024
and
validation_directory/validation-00000-of-00128
validation_directory/validation-00001-of-00128
...
validation_directory/validation-00127-of-00128
where we have selected 1024 and 128 shards for each data set. Each record
within the TFRecord file is a serialized Example proto. The Example proto
contains the following fields:
image/encoded: string containing JPEG encoded image in RGB colorspace
image/height: integer, image height in pixels
image/width: integer, image width in pixels
image/colorspace: string, specifying the colorspace, always 'RGB'
image/channels: integer, specifying the number of channels, always 3
image/format: string, specifying the format, always 'JPEG'
image/filename: string containing the basename of the image file
e.g. 'n01440764_10026.JPEG' or 'ILSVRC2012_val_00000293.JPEG'
image/class/label: integer specifying the index in a classification layer.
The label ranges from [0, num_labels] where 0 is unused and left as
the background class.
image/class/text: string specifying the human-readable version of the label
e.g. 'dog'
If your data set involves bounding boxes, please look at build_imagenet_data.py.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os
import random
import sys
import threading
import numpy as np
import tensorflow as tf
tf.app.flags.DEFINE_string('train_directory', '/tmp/',
'Training data directory')
tf.app.flags.DEFINE_string('validation_directory', '/tmp/',
'Validation data directory')
tf.app.flags.DEFINE_string('output_directory', '/tmp/',
'Output data directory')
tf.app.flags.DEFINE_integer('train_shards', 2,
'Number of shards in training TFRecord files.')
tf.app.flags.DEFINE_integer('validation_shards', 2,
'Number of shards in validation TFRecord files.')
tf.app.flags.DEFINE_integer('num_threads', 2,
'Number of threads to preprocess the images.')
# The labels file contains the list of valid labels.
# Assumes that the file contains entries as such:
# dog
# cat
# flower
# where each line corresponds to a label. We map each label contained in
# the file to an integer corresponding to the line number starting from 1
# (label 0 is reserved as an unused background class).
tf.app.flags.DEFINE_string('labels_file', '', 'Labels file')
FLAGS = tf.app.flags.FLAGS
def _int64_feature(value):
"""Wrapper for inserting int64 features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _bytes_feature(value):
"""Wrapper for inserting bytes features into Example proto."""
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _convert_to_example(filename, image_buffer, label, text, height, width):
"""Build an Example proto for an example.
Args:
filename: string, path to an image file, e.g., '/path/to/example.JPG'
image_buffer: string, JPEG encoding of RGB image
label: integer, identifier for the ground truth for the network
text: string, unique human-readable, e.g. 'dog'
height: integer, image height in pixels
width: integer, image width in pixels
Returns:
Example proto
"""
colorspace = 'RGB'
channels = 3
image_format = 'JPEG'
example = tf.train.Example(features=tf.train.Features(feature={
'image/height': _int64_feature(height),
'image/width': _int64_feature(width),
'image/colorspace': _bytes_feature(tf.compat.as_bytes(colorspace)),
'image/channels': _int64_feature(channels),
'image/class/label': _int64_feature(label),
'image/class/text': _bytes_feature(tf.compat.as_bytes(text)),
'image/format': _bytes_feature(tf.compat.as_bytes(image_format)),
'image/filename': _bytes_feature(tf.compat.as_bytes(os.path.basename(filename))),
'image/encoded': _bytes_feature(tf.compat.as_bytes(image_buffer))}))
return example
class ImageCoder(object):
"""Helper class that provides TensorFlow image coding utilities."""
def __init__(self):
# Create a single Session to run all image coding calls.
self._sess = tf.Session()
# Initializes function that converts PNG to JPEG data.
self._png_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_png(self._png_data, channels=3)
self._png_to_jpeg = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that decodes RGB JPEG data.
self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)
def png_to_jpeg(self, image_data):
return self._sess.run(self._png_to_jpeg,
feed_dict={self._png_data: image_data})
def decode_jpeg(self, image_data):
image = self._sess.run(self._decode_jpeg,
feed_dict={self._decode_jpeg_data: image_data})
assert len(image.shape) == 3
assert image.shape[2] == 3
return image
def _is_png(filename):
"""Determine if a file contains a PNG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a PNG.
"""
return filename.endswith('.png')
def _process_image(filename, coder):
"""Process a single image file.
Args:
filename: string, path to an image file e.g., '/path/to/example.JPG'.
coder: instance of ImageCoder to provide TensorFlow image coding utils.
Returns:
image_buffer: string, JPEG encoding of RGB image.
height: integer, image height in pixels.
width: integer, image width in pixels.
"""
# Read the image file.
with tf.gfile.FastGFile(filename, 'rb') as f:
image_data = f.read()
# Convert any PNG to JPEG for consistency.
if _is_png(filename):
print('Converting PNG to JPEG for %s' % filename)
image_data = coder.png_to_jpeg(image_data)
# Decode the RGB JPEG.
image = coder.decode_jpeg(image_data)
# Check that image converted to RGB
assert len(image.shape) == 3
height = image.shape[0]
width = image.shape[1]
assert image.shape[2] == 3
return image_data, height, width
def _process_image_files_batch(coder, thread_index, ranges, name, filenames,
texts, labels, num_shards):
"""Processes and saves list of images as TFRecord in 1 thread.
Args:
coder: instance of ImageCoder to provide TensorFlow image coding utils.
thread_index: integer, unique batch index to run; lies within [0, len(ranges)).
ranges: list of pairs of integers specifying the range of each batch to
analyze in parallel.
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
texts: list of strings; each string is human readable, e.g. 'dog'
labels: list of integers; each integer identifies the ground truth
num_shards: integer number of shards for this data set.
"""
# Each thread produces N shards where N = int(num_shards / num_threads).
# For instance, if num_shards = 128, and the num_threads = 2, then the first
# thread would produce shards [0, 64).
num_threads = len(ranges)
assert not num_shards % num_threads
num_shards_per_batch = int(num_shards / num_threads)
shard_ranges = np.linspace(ranges[thread_index][0],
ranges[thread_index][1],
num_shards_per_batch + 1).astype(int)
num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]
counter = 0
for s in range(num_shards_per_batch):
# Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
shard = thread_index * num_shards_per_batch + s
output_filename = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
output_file = os.path.join(FLAGS.output_directory, output_filename)
writer = tf.python_io.TFRecordWriter(output_file)
shard_counter = 0
files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
for i in files_in_shard:
filename = filenames[i]
label = labels[i]
text = texts[i]
try:
image_buffer, height, width = _process_image(filename, coder)
except Exception as e:
print(e)
print('SKIPPED: Unexpected error while decoding %s.' % filename)
continue
example = _convert_to_example(filename, image_buffer, label,
text, height, width)
writer.write(example.SerializeToString())
shard_counter += 1
counter += 1
if not counter % 1000:
print('%s [thread %d]: Processed %d of %d images in thread batch.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
writer.close()
print('%s [thread %d]: Wrote %d images to %s' %
(datetime.now(), thread_index, shard_counter, output_file))
sys.stdout.flush()
shard_counter = 0
print('%s [thread %d]: Wrote %d images to %d shards.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
def _process_image_files(name, filenames, texts, labels, num_shards):
"""Process and save list of images as TFRecord of Example protos.
Args:
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
texts: list of strings; each string is human readable, e.g. 'dog'
labels: list of integers; each integer identifies the ground truth
num_shards: integer number of shards for this data set.
"""
assert len(filenames) == len(texts)
assert len(filenames) == len(labels)
# Break all images into batches given by index ranges [ranges[i][0], ranges[i][1]].
spacing = np.linspace(0, len(filenames), FLAGS.num_threads + 1).astype(np.int)
ranges = []
for i in range(len(spacing) - 1):
ranges.append([spacing[i], spacing[i + 1]])
# Launch a thread for each batch.
print('Launching %d threads for spacings: %s' % (FLAGS.num_threads, ranges))
sys.stdout.flush()
# Create a mechanism for monitoring when all threads are finished.
coord = tf.train.Coordinator()
# Create a generic TensorFlow-based utility for converting all image codings.
coder = ImageCoder()
threads = []
for thread_index in range(len(ranges)):
args = (coder, thread_index, ranges, name, filenames,
texts, labels, num_shards)
t = threading.Thread(target=_process_image_files_batch, args=args)
t.start()
threads.append(t)
# Wait for all the threads to terminate.
coord.join(threads)
print('%s: Finished writing all %d images in data set.' %
(datetime.now(), len(filenames)))
sys.stdout.flush()
def _find_image_files(data_dir, labels_file):
"""Build a list of all images files and labels in the data set.
Args:
data_dir: string, path to the root directory of images.
Assumes that the image data set resides in JPEG files located in
the following directory structure.
data_dir/dog/another-image.JPEG
data_dir/dog/my-image.jpg
where 'dog' is the label associated with these images.
labels_file: string, path to the labels file.
The list of valid labels is held in this file. Assumes that the file
contains entries as such:
dog
cat
flower
where each line corresponds to a label. We map each label contained in
the file to an integer starting with the integer 1 corresponding to the
label contained in the first line (label 0 is reserved as an unused
background class).
Returns:
filenames: list of strings; each string is a path to an image file.
texts: list of strings; each string is the class, e.g. 'dog'
labels: list of integers; each integer identifies the ground truth.
"""
print('Determining list of input files and labels from %s.' % data_dir)
unique_labels = [l.strip() for l in tf.gfile.FastGFile(
labels_file, 'r').readlines()]
labels = []
filenames = []
texts = []
# Leave label index 0 empty as a background class.
label_index = 1
# Construct the list of JPEG files and labels.
for text in unique_labels:
jpeg_file_path = '%s/%s/*' % (data_dir, text)
matching_files = tf.gfile.Glob(jpeg_file_path)
labels.extend([label_index] * len(matching_files))
texts.extend([text] * len(matching_files))
filenames.extend(matching_files)
if not label_index % 100:
print('Finished finding files in %d of %d classes.' % (
label_index, len(labels)))
label_index += 1
# Shuffle the ordering of all image files in order to guarantee
# random ordering of the images with respect to label in the
# saved TFRecord files. Make the randomization repeatable.
shuffled_index = list(range(len(filenames)))
random.seed(12345)
random.shuffle(shuffled_index)
filenames = [filenames[i] for i in shuffled_index]
texts = [texts[i] for i in shuffled_index]
labels = [labels[i] for i in shuffled_index]
print('Found %d JPEG files across %d labels inside %s.' %
(len(filenames), len(unique_labels), data_dir))
return filenames, texts, labels
def _process_dataset(name, directory, num_shards, labels_file):
"""Process a complete data set and save it as a TFRecord.
Args:
name: string, unique identifier specifying the data set.
directory: string, root path to the data set.
num_shards: integer number of shards for this data set.
labels_file: string, path to the labels file.
"""
filenames, texts, labels = _find_image_files(directory, labels_file)
_process_image_files(name, filenames, texts, labels, num_shards)
def main(unused_argv):
assert not FLAGS.train_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with FLAGS.train_shards')
assert not FLAGS.validation_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with '
'FLAGS.validation_shards')
print('Saving results to %s' % FLAGS.output_directory)
# Run it!
_process_dataset('validation', FLAGS.validation_directory,
FLAGS.validation_shards, FLAGS.labels_file)
_process_dataset('train', FLAGS.train_directory,
FLAGS.train_shards, FLAGS.labels_file)
if __name__ == '__main__':
tf.app.run()
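
The script above writes its Example protos into sharded TFRecord files. A minimal sketch for inspecting one of those shards after a run, assuming the default flags (output under /tmp, two training shards); the path is an assumption derived from those defaults:

    import tensorflow as tf

    path = '/tmp/train-00000-of-00002'  # assumed shard name from the default flags
    for record in tf.python_io.tf_record_iterator(path):
        example = tf.train.Example()
        example.ParseFromString(record)
        feats = example.features.feature
        label = feats['image/class/label'].int64_list.value[0]
        text = feats['image/class/text'].bytes_list.value[0].decode()
        jpeg = feats['image/encoded'].bytes_list.value[0]
        print(label, text, len(jpeg), 'bytes of JPEG data')
        break  # only look at the first record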

View file

@ -0,0 +1,707 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts ImageNet data to TFRecords file format with Example protos.
The raw ImageNet data set is expected to reside in JPEG files located in the
following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
...
where 'n01440764' is the unique synset label associated with
these images.
The training data set consists of 1000 sub-directories (i.e. labels)
each containing 1200 JPEG images for a total of 1.2M JPEG images.
The evaluation data set consists of 1000 sub-directories (i.e. labels)
each containing 50 JPEG images for a total of 50K JPEG images.
This TensorFlow script converts the training and evaluation data into
a sharded data set consisting of 1024 and 128 TFRecord files, respectively.
train_directory/train-00000-of-01024
train_directory/train-00001-of-01024
...
train_directory/train-01023-of-01024
and
validation_directory/validation-00000-of-00128
validation_directory/validation-00001-of-00128
...
validation_directory/validation-00127-of-00128
Each validation TFRecord file contains ~390 records. Each training TFRecord
file contains ~1250 records. Each record within the TFRecord file is a
serialized Example proto. The Example proto contains the following fields:
image/encoded: string containing JPEG encoded image in RGB colorspace
image/height: integer, image height in pixels
image/width: integer, image width in pixels
image/colorspace: string, specifying the colorspace, always 'RGB'
image/channels: integer, specifying the number of channels, always 3
image/format: string, specifying the format, always 'JPEG'
image/filename: string containing the basename of the image file
e.g. 'n01440764_10026.JPEG' or 'ILSVRC2012_val_00000293.JPEG'
image/class/label: integer specifying the index in a classification layer.
The label ranges from [1, 1000] where 0 is not used.
image/class/synset: string specifying the unique ID of the label,
e.g. 'n01440764'
image/class/text: string specifying the human-readable version of the label
e.g. 'red fox, Vulpes vulpes'
image/object/bbox/xmin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/xmax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/label: integer specifying the index in a classification
layer. The label ranges from [1, 1000] where 0 is not used. Note this is
always identical to the image label.
Note that the length of xmin is identical to the length of xmax, ymin and ymax
for each example.
Running this script using 16 threads may take around 2.5 hours on an HP Z420.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os
import random
import sys
import threading
import numpy as np
import six
import tensorflow as tf
tf.app.flags.DEFINE_string('train_directory', '/tmp/',
'Training data directory')
tf.app.flags.DEFINE_string('validation_directory', '/tmp/',
'Validation data directory')
tf.app.flags.DEFINE_string('output_directory', '/tmp/',
'Output data directory')
tf.app.flags.DEFINE_integer('train_shards', 1024,
'Number of shards in training TFRecord files.')
tf.app.flags.DEFINE_integer('validation_shards', 128,
'Number of shards in validation TFRecord files.')
tf.app.flags.DEFINE_integer('num_threads', 8,
'Number of threads to preprocess the images.')
# The labels file contains the list of valid labels.
# Assumes that the file contains entries as such:
# n01440764
# n01443537
# n01484850
# where each line corresponds to a label expressed as a synset. We map
# each synset contained in the file to an integer (based on the alphabetical
# ordering). See below for details.
tf.app.flags.DEFINE_string('labels_file',
'imagenet_lsvrc_2015_synsets.txt',
'Labels file')
# This file contains the mapping from synset to human-readable label.
# Assumes each line of the file looks like:
#
# n02119247 black fox
# n02119359 silver fox
# n02119477 red fox, Vulpes fulva
#
# where each line corresponds to a unique mapping. Note that each line is
# formatted as <synset>\t<human readable label>.
tf.app.flags.DEFINE_string('imagenet_metadata_file',
'imagenet_metadata.txt',
'ImageNet metadata file')
# This file is the output of process_bounding_boxes.py
# Assumes each line of the file looks like:
#
# n00007846_64193.JPEG,0.0060,0.2620,0.7545,0.9940
#
# where each line corresponds to one bounding box annotation associated
# with an image. Each line can be parsed as:
#
# <JPEG file name>, <xmin>, <ymin>, <xmax>, <ymax>
#
# Note that there might exist multiple bounding box annotations associated
# with an image file.
tf.app.flags.DEFINE_string('bounding_box_file',
'./imagenet_2012_bounding_boxes.csv',
'Bounding box file')
FLAGS = tf.app.flags.FLAGS
def _int64_feature(value):
"""Wrapper for inserting int64 features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _float_feature(value):
"""Wrapper for inserting float features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(float_list=tf.train.FloatList(value=value))
def _bytes_feature(value):
"""Wrapper for inserting bytes features into Example proto."""
if six.PY3 and isinstance(value, six.text_type):
value = six.binary_type(value, encoding='utf-8')
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _convert_to_example(filename, image_buffer, label, synset, human, bbox,
height, width):
"""Build an Example proto for an example.
Args:
filename: string, path to an image file, e.g., '/path/to/example.JPG'
image_buffer: string, JPEG encoding of RGB image
label: integer, identifier for the ground truth for the network
synset: string, unique WordNet ID specifying the label, e.g., 'n02323233'
human: string, human-readable label, e.g., 'red fox, Vulpes vulpes'
bbox: list of bounding boxes; each box is a list of integers
specifying [xmin, ymin, xmax, ymax]. All boxes are assumed to belong to
the same label as the image label.
height: integer, image height in pixels
width: integer, image width in pixels
Returns:
Example proto
"""
xmin = []
ymin = []
xmax = []
ymax = []
for b in bbox:
assert len(b) == 4
# pylint: disable=expression-not-assigned
[l.append(point) for l, point in zip([xmin, ymin, xmax, ymax], b)]
# pylint: enable=expression-not-assigned
colorspace = 'RGB'
channels = 3
image_format = 'JPEG'
example = tf.train.Example(features=tf.train.Features(feature={
'image/height': _int64_feature(height),
'image/width': _int64_feature(width),
'image/colorspace': _bytes_feature(colorspace),
'image/channels': _int64_feature(channels),
'image/class/label': _int64_feature(label),
'image/class/synset': _bytes_feature(synset),
'image/class/text': _bytes_feature(human),
'image/object/bbox/xmin': _float_feature(xmin),
'image/object/bbox/xmax': _float_feature(xmax),
'image/object/bbox/ymin': _float_feature(ymin),
'image/object/bbox/ymax': _float_feature(ymax),
'image/object/bbox/label': _int64_feature([label] * len(xmin)),
'image/format': _bytes_feature(image_format),
'image/filename': _bytes_feature(os.path.basename(filename)),
'image/encoded': _bytes_feature(image_buffer)}))
return example
class ImageCoder(object):
"""Helper class that provides TensorFlow image coding utilities."""
def __init__(self):
# Create a single Session to run all image coding calls.
self._sess = tf.Session()
# Initializes function that converts PNG to JPEG data.
self._png_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_png(self._png_data, channels=3)
self._png_to_jpeg = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that converts CMYK JPEG data to RGB JPEG data.
self._cmyk_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_jpeg(self._cmyk_data, channels=0)
self._cmyk_to_rgb = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that decodes RGB JPEG data.
self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)
def png_to_jpeg(self, image_data):
return self._sess.run(self._png_to_jpeg,
feed_dict={self._png_data: image_data})
def cmyk_to_rgb(self, image_data):
return self._sess.run(self._cmyk_to_rgb,
feed_dict={self._cmyk_data: image_data})
def decode_jpeg(self, image_data):
image = self._sess.run(self._decode_jpeg,
feed_dict={self._decode_jpeg_data: image_data})
assert len(image.shape) == 3
assert image.shape[2] == 3
return image
def _is_png(filename):
"""Determine if a file contains a PNG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a PNG.
"""
# File list from:
# https://groups.google.com/forum/embed/?place=forum/torch7#!topic/torch7/fOSTXHIESSU
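# Only a single ILSVRC2012 training image (n02105855_2933.JPEG) is known to be a
# PNG saved with a .JPEG extension, so a plain filename check is sufficient here.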
return 'n02105855_2933.JPEG' in filename
def _is_cmyk(filename):
"""Determine if file contains a CMYK JPEG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a JPEG encoded with CMYK color space.
"""
# File list from:
# https://github.com/cytsai/ilsvrc-cmyk-image-list
blacklist = ['n01739381_1309.JPEG', 'n02077923_14822.JPEG',
'n02447366_23489.JPEG', 'n02492035_15739.JPEG',
'n02747177_10752.JPEG', 'n03018349_4028.JPEG',
'n03062245_4620.JPEG', 'n03347037_9675.JPEG',
'n03467068_12171.JPEG', 'n03529860_11437.JPEG',
'n03544143_17228.JPEG', 'n03633091_5218.JPEG',
'n03710637_5125.JPEG', 'n03961711_5286.JPEG',
'n04033995_2932.JPEG', 'n04258138_17003.JPEG',
'n04264628_27969.JPEG', 'n04336792_7448.JPEG',
'n04371774_5854.JPEG', 'n04596742_4225.JPEG',
'n07583066_647.JPEG', 'n13037406_4650.JPEG']
return filename.split('/')[-1] in blacklist
def _process_image(filename, coder):
"""Process a single image file.
Args:
filename: string, path to an image file e.g., '/path/to/example.JPG'.
coder: instance of ImageCoder to provide TensorFlow image coding utils.
Returns:
image_buffer: string, JPEG encoding of RGB image.
height: integer, image height in pixels.
width: integer, image width in pixels.
"""
# Read the image file.
with tf.gfile.FastGFile(filename, 'rb') as f:
image_data = f.read()
# Clean the dirty data.
if _is_png(filename):
# 1 image is a PNG.
print('Converting PNG to JPEG for %s' % filename)
image_data = coder.png_to_jpeg(image_data)
elif _is_cmyk(filename):
# 22 JPEG images are in CMYK colorspace.
print('Converting CMYK to RGB for %s' % filename)
image_data = coder.cmyk_to_rgb(image_data)
# Decode the RGB JPEG.
image = coder.decode_jpeg(image_data)
# Check that image converted to RGB
assert len(image.shape) == 3
height = image.shape[0]
width = image.shape[1]
assert image.shape[2] == 3
return image_data, height, width
def _process_image_files_batch(coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards):
"""Processes and saves list of images as TFRecord in 1 thread.
Args:
coder: instance of ImageCoder to provide TensorFlow image coding utils.
thread_index: integer, unique batch index to run; lies within [0, len(ranges)).
ranges: list of pairs of integers specifying the range of each batch to
analyze in parallel.
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
# Each thread produces N shards where N = int(num_shards / num_threads).
# For instance, if num_shards = 128, and the num_threads = 2, then the first
# thread would produce shards [0, 64).
num_threads = len(ranges)
assert not num_shards % num_threads
num_shards_per_batch = int(num_shards / num_threads)
shard_ranges = np.linspace(ranges[thread_index][0],
ranges[thread_index][1],
num_shards_per_batch + 1).astype(int)
num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]
counter = 0
for s in range(num_shards_per_batch):
# Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
shard = thread_index * num_shards_per_batch + s
output_filename = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
output_file = os.path.join(FLAGS.output_directory, output_filename)
writer = tf.python_io.TFRecordWriter(output_file)
shard_counter = 0
files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
for i in files_in_shard:
filename = filenames[i]
label = labels[i]
synset = synsets[i]
human = humans[i]
bbox = bboxes[i]
image_buffer, height, width = _process_image(filename, coder)
example = _convert_to_example(filename, image_buffer, label,
synset, human, bbox,
height, width)
writer.write(example.SerializeToString())
shard_counter += 1
counter += 1
if not counter % 1000:
print('%s [thread %d]: Processed %d of %d images in thread batch.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
writer.close()
print('%s [thread %d]: Wrote %d images to %s' %
(datetime.now(), thread_index, shard_counter, output_file))
sys.stdout.flush()
shard_counter = 0
print('%s [thread %d]: Wrote %d images to %d shards.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
def _process_image_files(name, filenames, synsets, labels, humans,
bboxes, num_shards):
"""Process and save list of images as TFRecord of Example protos.
Args:
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
assert len(filenames) == len(synsets)
assert len(filenames) == len(labels)
assert len(filenames) == len(humans)
assert len(filenames) == len(bboxes)
# Break all images into batches given by index ranges [ranges[i][0], ranges[i][1]].
spacing = np.linspace(0, len(filenames), FLAGS.num_threads + 1).astype(np.int)
ranges = []
threads = []
for i in range(len(spacing) - 1):
ranges.append([spacing[i], spacing[i + 1]])
# Launch a thread for each batch.
print('Launching %d threads for spacings: %s' % (FLAGS.num_threads, ranges))
sys.stdout.flush()
# Create a mechanism for monitoring when all threads are finished.
coord = tf.train.Coordinator()
# Create a generic TensorFlow-based utility for converting all image codings.
coder = ImageCoder()
threads = []
for thread_index in range(len(ranges)):
args = (coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards)
t = threading.Thread(target=_process_image_files_batch, args=args)
t.start()
threads.append(t)
# Wait for all the threads to terminate.
coord.join(threads)
print('%s: Finished writing all %d images in data set.' %
(datetime.now(), len(filenames)))
sys.stdout.flush()
def _find_image_files(data_dir, labels_file):
"""Build a list of all images files and labels in the data set.
Args:
data_dir: string, path to the root directory of images.
Assumes that the ImageNet data set resides in JPEG files located in
the following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
where 'n01440764' is the unique synset label associated with these images.
labels_file: string, path to the labels file.
The list of valid labels is held in this file. Assumes that the file
contains entries as such:
n01440764
n01443537
n01484850
where each line corresponds to a label expressed as a synset. We map
each synset contained in the file to an integer (based on the alphabetical
ordering) starting with the integer 1 corresponding to the synset
contained in the first line.
The reason we start the integer labels at 1 is to reserve label 0 as an
unused background class.
Returns:
filenames: list of strings; each string is a path to an image file.
synsets: list of strings; each string is a unique WordNet ID.
labels: list of integers; each integer identifies the ground truth.
"""
print('Determining list of input files and labels from %s.' % data_dir)
challenge_synsets = [l.strip() for l in
tf.gfile.FastGFile(labels_file, 'r').readlines()]
labels = []
filenames = []
synsets = []
# Leave label index 0 empty as a background class.
label_index = 1
# Construct the list of JPEG files and labels.
for synset in challenge_synsets:
jpeg_file_path = '%s/%s/*.JPEG' % (data_dir, synset)
matching_files = tf.gfile.Glob(jpeg_file_path)
labels.extend([label_index] * len(matching_files))
synsets.extend([synset] * len(matching_files))
filenames.extend(matching_files)
if not label_index % 100:
print('Finished finding files in %d of %d classes.' % (
label_index, len(challenge_synsets)))
label_index += 1
# Shuffle the ordering of all image files in order to guarantee
# random ordering of the images with respect to label in the
# saved TFRecord files. Make the randomization repeatable.
shuffled_index = list(range(len(filenames)))
random.seed(12345)
random.shuffle(shuffled_index)
filenames = [filenames[i] for i in shuffled_index]
synsets = [synsets[i] for i in shuffled_index]
labels = [labels[i] for i in shuffled_index]
print('Found %d JPEG files across %d labels inside %s.' %
(len(filenames), len(challenge_synsets), data_dir))
return filenames, synsets, labels
def _find_human_readable_labels(synsets, synset_to_human):
"""Build a list of human-readable labels.
Args:
synsets: list of strings; each string is a unique WordNet ID.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
Returns:
List of human-readable strings corresponding to each synset.
"""
humans = []
for s in synsets:
assert s in synset_to_human, ('Failed to find: %s' % s)
humans.append(synset_to_human[s])
return humans
def _find_image_bounding_boxes(filenames, image_to_bboxes):
"""Find the bounding boxes for a given image file.
Args:
filenames: list of strings; each string is a path to an image file.
image_to_bboxes: dictionary mapping image file names to a list of
bounding boxes. This list contains 0+ bounding boxes.
Returns:
List of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
"""
num_image_bbox = 0
bboxes = []
for f in filenames:
basename = os.path.basename(f)
if basename in image_to_bboxes:
bboxes.append(image_to_bboxes[basename])
num_image_bbox += 1
else:
bboxes.append([])
print('Found %d images with bboxes out of %d images' % (
num_image_bbox, len(filenames)))
return bboxes
def _process_dataset(name, directory, num_shards, synset_to_human,
image_to_bboxes):
"""Process a complete data set and save it as a TFRecord.
Args:
name: string, unique identifier specifying the data set.
directory: string, root path to the data set.
num_shards: integer number of shards for this data set.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
image_to_bboxes: dictionary mapping image file names to a list of
bounding boxes. This list contains 0+ bounding boxes.
"""
filenames, synsets, labels = _find_image_files(directory, FLAGS.labels_file)
humans = _find_human_readable_labels(synsets, synset_to_human)
bboxes = _find_image_bounding_boxes(filenames, image_to_bboxes)
_process_image_files(name, filenames, synsets, labels,
humans, bboxes, num_shards)
def _build_synset_lookup(imagenet_metadata_file):
"""Build lookup for synset to human-readable label.
Args:
imagenet_metadata_file: string, path to file containing mapping from
synset to human-readable label.
Assumes each line of the file looks like:
n02119247 black fox
n02119359 silver fox
n02119477 red fox, Vulpes fulva
where each line corresponds to a unique mapping. Note that each line is
formatted as <synset>\t<human readable label>.
Returns:
Dictionary of synset to human labels, such as:
'n02119022' --> 'red fox, Vulpes vulpes'
"""
lines = tf.gfile.FastGFile(imagenet_metadata_file, 'r').readlines()
synset_to_human = {}
for l in lines:
if l:
parts = l.strip().split('\t')
assert len(parts) == 2
synset = parts[0]
human = parts[1]
synset_to_human[synset] = human
return synset_to_human
def _build_bounding_box_lookup(bounding_box_file):
"""Build a lookup from image file to bounding boxes.
Args:
bounding_box_file: string, path to file with bounding boxes annotations.
Assumes each line of the file looks like:
n00007846_64193.JPEG,0.0060,0.2620,0.7545,0.9940
where each line corresponds to one bounding box annotation associated
with an image. Each line can be parsed as:
<JPEG file name>, <xmin>, <ymin>, <xmax>, <ymax>
Note that there might exist multiple bounding box annotations associated
with an image file. This file is the output of process_bounding_boxes.py.
Returns:
Dictionary mapping image file names to a list of bounding boxes. This list
contains 0+ bounding boxes.
"""
lines = tf.gfile.FastGFile(bounding_box_file, 'r').readlines()
images_to_bboxes = {}
num_bbox = 0
num_image = 0
for l in lines:
if l:
parts = l.split(',')
assert len(parts) == 5, ('Failed to parse: %s' % l)
filename = parts[0]
xmin = float(parts[1])
ymin = float(parts[2])
xmax = float(parts[3])
ymax = float(parts[4])
box = [xmin, ymin, xmax, ymax]
if filename not in images_to_bboxes:
images_to_bboxes[filename] = []
num_image += 1
images_to_bboxes[filename].append(box)
num_bbox += 1
print('Successfully read %d bounding boxes '
'across %d images.' % (num_bbox, num_image))
return images_to_bboxes
def main(unused_argv):
assert not FLAGS.train_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with FLAGS.train_shards')
assert not FLAGS.validation_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with '
'FLAGS.validation_shards')
print('Saving results to %s' % FLAGS.output_directory)
# Build a map from synset to human-readable label.
synset_to_human = _build_synset_lookup(FLAGS.imagenet_metadata_file)
image_to_bboxes = _build_bounding_box_lookup(FLAGS.bounding_box_file)
# Run it!
_process_dataset('validation', FLAGS.validation_directory,
FLAGS.validation_shards, synset_to_human, image_to_bboxes)
_process_dataset('train', FLAGS.train_directory, FLAGS.train_shards,
synset_to_human, image_to_bboxes)
if __name__ == '__main__':
tf.app.run()
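
Because this variant also serializes the human-annotated boxes, each Example carries parallel float lists under image/object/bbox/*. A small sketch for spot-checking those fields in a finished shard; the path is a placeholder for whatever --output_directory was used:

    import tensorflow as tf

    path = '/data/tfrecords/validation-00000-of-00128'  # placeholder shard path
    for record in tf.python_io.tf_record_iterator(path):
        ex = tf.train.Example()
        ex.ParseFromString(record)
        f = ex.features.feature
        synset = f['image/class/synset'].bytes_list.value[0].decode()
        xmin = list(f['image/object/bbox/xmin'].float_list.value)
        ymin = list(f['image/object/bbox/ymin'].float_list.value)
        print(synset, len(xmin), 'boxes; first corner:', (xmin[:1], ymin[:1]))
        break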

View file

@ -0,0 +1,618 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Converts ImageNet data to TFRecords file format with Example protos.
The raw ImageNet data set is expected to reside in JPEG files located in the
following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
...
where 'n01440764' is the unique synset label associated with
these images.
The training data set consists of 1000 sub-directories (i.e. labels)
each containing 1200 JPEG images for a total of 1.2M JPEG images.
The evaluation data set consists of 1000 sub-directories (i.e. labels)
each containing 50 JPEG images for a total of 50K JPEG images.
This TensorFlow script converts the training and evaluation data into
a sharded data set consisting of 1024 and 128 TFRecord files, respectively.
train_directory/train-00000-of-01024
train_directory/train-00001-of-01024
...
train_directory/train-01023-of-01024
and
validation_directory/validation-00000-of-00128
validation_directory/validation-00001-of-00128
...
validation_directory/validation-00127-of-00128
Each validation TFRecord file contains ~390 records. Each training TFRecord
file contains ~1250 records. Each record within the TFRecord file is a
serialized Example proto. The Example proto contains the following fields:
image/encoded: string containing JPEG encoded image in RGB colorspace
image/height: integer, image height in pixels
image/width: integer, image width in pixels
image/colorspace: string, specifying the colorspace, always 'RGB'
image/channels: integer, specifying the number of channels, always 3
image/format: string, specifying the format, always 'JPEG'
image/filename: string containing the basename of the image file
e.g. 'n01440764_10026.JPEG' or 'ILSVRC2012_val_00000293.JPEG'
image/class/label: integer specifying the index in a classification layer.
The label ranges from [1, 1000] where 0 is not used.
image/class/synset: string specifying the unique ID of the label,
e.g. 'n01440764'
image/class/text: string specifying the human-readable version of the label
e.g. 'red fox, Vulpes vulpes'
image/object/bbox/xmin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/xmax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymin: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/ymax: list of floats specifying the 0+ human annotated
bounding boxes
image/object/bbox/label: integer specifying the index in a classification
layer. The label ranges from [1, 1000] where 0 is not used. Note this is
always identical to the image label.
Note that the length of xmin is identical to the length of xmax, ymin and ymax
for each example.
Running this script using 16 threads may take around 2.5 hours on an HP Z420.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from datetime import datetime
import os
import random
import sys
import threading
import numpy as np
import six
import tensorflow as tf
tf.app.flags.DEFINE_string('train_directory', '/tmp/',
'Training data directory')
tf.app.flags.DEFINE_string('validation_directory', '/tmp/',
'Validation data directory')
tf.app.flags.DEFINE_string('output_directory', '/tmp/',
'Output data directory')
tf.app.flags.DEFINE_integer('train_shards', 1024,
'Number of shards in training TFRecord files.')
tf.app.flags.DEFINE_integer('validation_shards', 128,
'Number of shards in validation TFRecord files.')
tf.app.flags.DEFINE_integer('num_threads', 8,
'Number of threads to preprocess the images.')
# The labels file contains the list of valid labels.
# Assumes that the file contains entries as such:
# n01440764
# n01443537
# n01484850
# where each line corresponds to a label expressed as a synset. We map
# each synset contained in the file to an integer (based on the alphabetical
# ordering). See below for details.
tf.app.flags.DEFINE_string('labels_file',
'imagenet_lsvrc_2015_synsets.txt',
'Labels file')
# This file contains the mapping from synset to human-readable label.
# Assumes each line of the file looks like:
#
# n02119247 black fox
# n02119359 silver fox
# n02119477 red fox, Vulpes fulva
#
# where each line corresponds to a unique mapping. Note that each line is
# formatted as <synset>\t<human readable label>.
tf.app.flags.DEFINE_string('imagenet_metadata_file',
'imagenet_metadata.txt',
'ImageNet metadata file')
FLAGS = tf.app.flags.FLAGS
def _int64_feature(value):
"""Wrapper for inserting int64 features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(int64_list=tf.train.Int64List(value=value))
def _float_feature(value):
"""Wrapper for inserting float features into Example proto."""
if not isinstance(value, list):
value = [value]
return tf.train.Feature(float_list=tf.train.FloatList(value=value))
def _bytes_feature(value):
"""Wrapper for inserting bytes features into Example proto."""
if six.PY3 and isinstance(value, six.text_type):
value = six.binary_type(value, encoding='utf-8')
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _convert_to_example(filename, image_buffer, label, synset, human, bbox,
height, width):
"""Build an Example proto for an example.
Args:
filename: string, path to an image file, e.g., '/path/to/example.JPG'
image_buffer: string, JPEG encoding of RGB image
label: integer, identifier for the ground truth for the network
synset: string, unique WordNet ID specifying the label, e.g., 'n02323233'
human: string, human-readable label, e.g., 'red fox, Vulpes vulpes'
bbox: list of bounding boxes; each box is a list of integers
specifying [xmin, ymin, xmax, ymax]. All boxes are assumed to belong to
the same label as the image label.
height: integer, image height in pixels
width: integer, image width in pixels
Returns:
Example proto
"""
xmin = []
ymin = []
xmax = []
ymax = []
for b in bbox:
assert len(b) == 4
# pylint: disable=expression-not-assigned
[l.append(point) for l, point in zip([xmin, ymin, xmax, ymax], b)]
# pylint: enable=expression-not-assigned
colorspace = 'RGB'
channels = 3
image_format = 'JPEG'
example = tf.train.Example(features=tf.train.Features(feature={
'image/height': _int64_feature(height),
'image/width': _int64_feature(width),
'image/colorspace': _bytes_feature(colorspace),
'image/channels': _int64_feature(channels),
'image/class/label': _int64_feature(label),
'image/class/synset': _bytes_feature(synset),
'image/class/text': _bytes_feature(human),
'image/object/bbox/xmin': _float_feature(xmin),
'image/object/bbox/xmax': _float_feature(xmax),
'image/object/bbox/ymin': _float_feature(ymin),
'image/object/bbox/ymax': _float_feature(ymax),
'image/object/bbox/label': _int64_feature([label] * len(xmin)),
'image/format': _bytes_feature(image_format),
'image/filename': _bytes_feature(os.path.basename(filename)),
'image/encoded': _bytes_feature(image_buffer)}))
return example
class ImageCoder(object):
"""Helper class that provides TensorFlow image coding utilities."""
def __init__(self):
# Create a single Session to run all image coding calls.
self._sess = tf.Session()
# Initializes function that converts PNG to JPEG data.
self._png_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_png(self._png_data, channels=3)
self._png_to_jpeg = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that converts CMYK JPEG data to RGB JPEG data.
self._cmyk_data = tf.placeholder(dtype=tf.string)
image = tf.image.decode_jpeg(self._cmyk_data, channels=0)
self._cmyk_to_rgb = tf.image.encode_jpeg(image, format='rgb', quality=100)
# Initializes function that decodes RGB JPEG data.
self._decode_jpeg_data = tf.placeholder(dtype=tf.string)
self._decode_jpeg = tf.image.decode_jpeg(self._decode_jpeg_data, channels=3)
def png_to_jpeg(self, image_data):
return self._sess.run(self._png_to_jpeg,
feed_dict={self._png_data: image_data})
def cmyk_to_rgb(self, image_data):
return self._sess.run(self._cmyk_to_rgb,
feed_dict={self._cmyk_data: image_data})
def decode_jpeg(self, image_data):
image = self._sess.run(self._decode_jpeg,
feed_dict={self._decode_jpeg_data: image_data})
assert len(image.shape) == 3
assert image.shape[2] == 3
return image
def _is_png(filename):
"""Determine if a file contains a PNG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a PNG.
"""
# File list from:
# https://groups.google.com/forum/embed/?place=forum/torch7#!topic/torch7/fOSTXHIESSU
return 'n02105855_2933.JPEG' in filename
def _is_cmyk(filename):
"""Determine if file contains a CMYK JPEG format image.
Args:
filename: string, path of the image file.
Returns:
boolean indicating if the image is a JPEG encoded with CMYK color space.
"""
# File list from:
# https://github.com/cytsai/ilsvrc-cmyk-image-list
blacklist = ['n01739381_1309.JPEG', 'n02077923_14822.JPEG',
'n02447366_23489.JPEG', 'n02492035_15739.JPEG',
'n02747177_10752.JPEG', 'n03018349_4028.JPEG',
'n03062245_4620.JPEG', 'n03347037_9675.JPEG',
'n03467068_12171.JPEG', 'n03529860_11437.JPEG',
'n03544143_17228.JPEG', 'n03633091_5218.JPEG',
'n03710637_5125.JPEG', 'n03961711_5286.JPEG',
'n04033995_2932.JPEG', 'n04258138_17003.JPEG',
'n04264628_27969.JPEG', 'n04336792_7448.JPEG',
'n04371774_5854.JPEG', 'n04596742_4225.JPEG',
'n07583066_647.JPEG', 'n13037406_4650.JPEG']
return filename.split('/')[-1] in blacklist
def _process_image(filename, coder):
"""Process a single image file.
Args:
filename: string, path to an image file e.g., '/path/to/example.JPG'.
coder: instance of ImageCoder to provide TensorFlow image coding utils.
Returns:
image_buffer: string, JPEG encoding of RGB image.
height: integer, image height in pixels.
width: integer, image width in pixels.
"""
# Read the image file.
with tf.gfile.FastGFile(filename, 'rb') as f:
image_data = f.read()
# Clean the dirty data.
if _is_png(filename):
# 1 image is a PNG.
print('Converting PNG to JPEG for %s' % filename)
image_data = coder.png_to_jpeg(image_data)
elif _is_cmyk(filename):
# 22 JPEG images are in CMYK colorspace.
print('Converting CMYK to RGB for %s' % filename)
image_data = coder.cmyk_to_rgb(image_data)
# Decode the RGB JPEG.
image = coder.decode_jpeg(image_data)
# Check that image converted to RGB
assert len(image.shape) == 3
height = image.shape[0]
width = image.shape[1]
assert image.shape[2] == 3
return image_data, height, width
def _process_image_files_batch(coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards):
"""Processes and saves list of images as TFRecord in 1 thread.
Args:
coder: instance of ImageCoder to provide TensorFlow image coding utils.
thread_index: integer, unique batch index to run; lies within [0, len(ranges)).
ranges: list of pairs of integers specifying the range of each batch to
analyze in parallel.
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
# Each thread produces N shards where N = int(num_shards / num_threads).
# For instance, if num_shards = 128, and the num_threads = 2, then the first
# thread would produce shards [0, 64).
num_threads = len(ranges)
assert not num_shards % num_threads
num_shards_per_batch = int(num_shards / num_threads)
shard_ranges = np.linspace(ranges[thread_index][0],
ranges[thread_index][1],
num_shards_per_batch + 1).astype(int)
num_files_in_thread = ranges[thread_index][1] - ranges[thread_index][0]
counter = 0
for s in range(num_shards_per_batch):
# Generate a sharded version of the file name, e.g. 'train-00002-of-00010'
shard = thread_index * num_shards_per_batch + s
output_filename = '%s-%.5d-of-%.5d' % (name, shard, num_shards)
output_file = os.path.join(FLAGS.output_directory, output_filename)
writer = tf.python_io.TFRecordWriter(output_file)
shard_counter = 0
files_in_shard = np.arange(shard_ranges[s], shard_ranges[s + 1], dtype=int)
for i in files_in_shard:
filename = filenames[i]
label = labels[i]
synset = synsets[i]
human = humans[i]
#bbox = bboxes[i]
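# In this variant the per-image annotations are not looked up; a single
# full-image box [0, 0, 1, 1] is written below so the Example schema stays
# identical to the bounding-box-aware script.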
image_buffer, height, width = _process_image(filename, coder)
example = _convert_to_example(filename, image_buffer, label,
synset, human, [[0, 0, 1, 1]],
height, width)
writer.write(example.SerializeToString())
shard_counter += 1
counter += 1
if not counter % 1000:
print('%s [thread %d]: Processed %d of %d images in thread batch.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
writer.close()
print('%s [thread %d]: Wrote %d images to %s' %
(datetime.now(), thread_index, shard_counter, output_file))
sys.stdout.flush()
shard_counter = 0
print('%s [thread %d]: Wrote %d images to %d shards.' %
(datetime.now(), thread_index, counter, num_files_in_thread))
sys.stdout.flush()
def _process_image_files(name, filenames, synsets, labels, humans,
bboxes, num_shards):
"""Process and save list of images as TFRecord of Example protos.
Args:
name: string, unique identifier specifying the data set
filenames: list of strings; each string is a path to an image file
synsets: list of strings; each string is a unique WordNet ID
labels: list of integers; each integer identifies the ground truth
humans: list of strings; each string is a human-readable label
bboxes: list of bounding boxes for each image. Note that each entry in this
list might contain from 0+ entries corresponding to the number of bounding
box annotations for the image.
num_shards: integer number of shards for this data set.
"""
assert len(filenames) == len(synsets)
assert len(filenames) == len(labels)
assert len(filenames) == len(humans)
#assert len(filenames) == len(bboxes)
# Break all images into batches given by index ranges [ranges[i][0], ranges[i][1]].
spacing = np.linspace(0, len(filenames), FLAGS.num_threads + 1).astype(np.int)
ranges = []
threads = []
for i in range(len(spacing) - 1):
ranges.append([spacing[i], spacing[i + 1]])
# Launch a thread for each batch.
print('Launching %d threads for spacings: %s' % (FLAGS.num_threads, ranges))
sys.stdout.flush()
# Create a mechanism for monitoring when all threads are finished.
coord = tf.train.Coordinator()
# Create a generic TensorFlow-based utility for converting all image codings.
coder = ImageCoder()
threads = []
for thread_index in range(len(ranges)):
args = (coder, thread_index, ranges, name, filenames,
synsets, labels, humans, bboxes, num_shards)
t = threading.Thread(target=_process_image_files_batch, args=args)
t.start()
threads.append(t)
# Wait for all the threads to terminate.
coord.join(threads)
print('%s: Finished writing all %d images in data set.' %
(datetime.now(), len(filenames)))
sys.stdout.flush()
def _find_image_files(data_dir, labels_file):
"""Build a list of all images files and labels in the data set.
Args:
data_dir: string, path to the root directory of images.
Assumes that the ImageNet data set resides in JPEG files located in
the following directory structure.
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
where 'n01440764' is the unique synset label associated with these images.
labels_file: string, path to the labels file.
The list of valid labels is held in this file. Assumes that the file
contains entries as such:
n01440764
n01443537
n01484850
where each line corresponds to a label expressed as a synset. We map
each synset contained in the file to an integer (based on the alphabetical
ordering) starting with the integer 1 corresponding to the synset
contained in the first line.
The reason we start the integer labels at 1 is to reserve label 0 as an
unused background class.
Returns:
filenames: list of strings; each string is a path to an image file.
synsets: list of strings; each string is a unique WordNet ID.
labels: list of integers; each integer identifies the ground truth.
"""
print('Determining list of input files and labels from %s.' % data_dir)
challenge_synsets = [l.strip() for l in
tf.gfile.FastGFile(labels_file, 'r').readlines()]
labels = []
filenames = []
synsets = []
# Leave label index 0 empty as a background class.
label_index = 1
# Construct the list of JPEG files and labels.
for synset in challenge_synsets:
jpeg_file_path = '%s/%s/*.JPEG' % (data_dir, synset)
matching_files = tf.gfile.Glob(jpeg_file_path)
labels.extend([label_index] * len(matching_files))
synsets.extend([synset] * len(matching_files))
filenames.extend(matching_files)
if not label_index % 100:
print('Finished finding files in %d of %d classes.' % (
label_index, len(challenge_synsets)))
label_index += 1
# Shuffle the ordering of all image files in order to guarantee
# random ordering of the images with respect to label in the
# saved TFRecord files. Make the randomization repeatable.
shuffled_index = list(range(len(filenames)))
random.seed(12345)
random.shuffle(shuffled_index)
filenames = [filenames[i] for i in shuffled_index]
synsets = [synsets[i] for i in shuffled_index]
labels = [labels[i] for i in shuffled_index]
print('Found %d JPEG files across %d labels inside %s.' %
(len(filenames), len(challenge_synsets), data_dir))
return filenames, synsets, labels
def _find_human_readable_labels(synsets, synset_to_human):
"""Build a list of human-readable labels.
Args:
synsets: list of strings; each string is a unique WordNet ID.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
Returns:
List of human-readable strings corresponding to each synset.
"""
humans = []
for s in synsets:
assert s in synset_to_human, ('Failed to find: %s' % s)
humans.append(synset_to_human[s])
return humans
def _process_dataset(name, directory, num_shards, synset_to_human,
image_to_bboxes):
"""Process a complete data set and save it as a TFRecord.
Args:
name: string, unique identifier specifying the data set.
directory: string, root path to the data set.
num_shards: integer number of shards for this data set.
synset_to_human: dict of synset to human labels, e.g.,
'n02119022' --> 'red fox, Vulpes vulpes'
image_to_bboxes: dictionary mapping image file names to a list of
bounding boxes. This list contains 0+ bounding boxes.
"""
filenames, synsets, labels = _find_image_files(directory, FLAGS.labels_file)
humans = _find_human_readable_labels(synsets, synset_to_human)
#bboxes = _find_image_bounding_boxes(filenames, image_to_bboxes)
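# Bounding box lookup is skipped in this variant; the empty list is safe because
# _process_image_files_batch never indexes it and writes a full-image box instead.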
bboxes = []
_process_image_files(name, filenames, synsets, labels,
humans, bboxes, num_shards)
def _build_synset_lookup(imagenet_metadata_file):
"""Build lookup for synset to human-readable label.
Args:
imagenet_metadata_file: string, path to file containing mapping from
synset to human-readable label.
Assumes each line of the file looks like:
n02119247 black fox
n02119359 silver fox
n02119477 red fox, Vulpes fulva
where each line corresponds to a unique mapping. Note that each line is
formatted as <synset>\t<human readable label>.
Returns:
Dictionary of synset to human labels, such as:
'n02119022' --> 'red fox, Vulpes vulpes'
"""
lines = tf.gfile.FastGFile(imagenet_metadata_file, 'r').readlines()
synset_to_human = {}
for l in lines:
if l:
parts = l.strip().split('\t')
assert len(parts) == 2
synset = parts[0]
human = parts[1]
synset_to_human[synset] = human
return synset_to_human
def main(unused_argv):
assert not FLAGS.train_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with FLAGS.train_shards')
assert not FLAGS.validation_shards % FLAGS.num_threads, (
'Please make the FLAGS.num_threads commensurate with '
'FLAGS.validation_shards')
print('Saving results to %s' % FLAGS.output_directory)
# Build a map from synset to human-readable label.
synset_to_human = _build_synset_lookup(FLAGS.imagenet_metadata_file)
# Run it!
_process_dataset('validation', FLAGS.validation_directory,
FLAGS.validation_shards, synset_to_human, None)
_process_dataset('train', FLAGS.train_directory, FLAGS.train_shards,
synset_to_human, None)
if __name__ == '__main__':
tf.app.run()

File diff suppressed because it is too large

File diff suppressed because it is too large

View file

@ -0,0 +1,10 @@
n02086240
n02087394
n02088364
n02089973
n02093754
n02096294
n02099601
n02105641
n02111889
n02115641

View file

@ -0,0 +1,82 @@
#!/bin/bash
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# Script to download and preprocess ImageNet Challenge 2012
# training and validation data set.
#
# The final output of this script are sharded TFRecord files containing
# serialized Example protocol buffers. See build_imagenet_data.py for
# details of how the Example protocol buffers contain the ImageNet data.
#
# The final output of this script appears as such:
#
# data_dir/train-00000-of-01024
# data_dir/train-00001-of-01024
# ...
# data_dir/train-01023-of-01024
#
# and
#
# data_dir/validation-00000-of-00128
# data_dir/validation-00001-of-00128
# ...
# data_dir/validation-00127-of-00128
#
# Note that this script may take several hours to run to completion. The
# conversion of the ImageNet data to TFRecords alone takes 2-3 hours depending
# on the speed of your machine. Please be patient.
#
# **IMPORTANT**
# To download the raw images, the user must create an account with image-net.org
# and generate a username and access_key; both are required for downloading
# the raw images.
#
# usage:
# ./preprocess_imagenet.sh [data-dir]
set -e
if [ -z "$1" ]; then
echo "Usage: preprocess_imagenet.sh [data dir]"
  exit 1
fi
DATA_DIR="${1%/}"
SCRATCH_DIR="${DATA_DIR}/raw-data/"
mkdir -p ${SCRATCH_DIR}
# Convert the XML files for bounding box annotations into a single CSV.
echo "Extracting bounding box information from XML."
BOUNDING_BOX_SCRIPT="./dataprep/process_bounding_boxes.py"
BOUNDING_BOX_FILE="${DATA_DIR}/imagenet_2012_bounding_boxes.csv"
BOUNDING_BOX_DIR="${DATA_DIR}/bounding_boxes/"
LABELS_FILE="./dataprep/imagenet_lsvrc_2015_synsets.txt"
"${BOUNDING_BOX_SCRIPT}" "${BOUNDING_BOX_DIR}" "${LABELS_FILE}" \
| sort > "${BOUNDING_BOX_FILE}"
echo "preprocessing the ImageNet data."
# Build the TFRecords version of the ImageNet data.
OUTPUT_DIRECTORY="${DATA_DIR}"
IMAGENET_METADATA_FILE="./dataprep/imagenet_metadata.txt"
python ./dataprep/build_imagenet_data.py \
--train_directory="${DATA_DIR}/train" \
--validation_directory="${DATA_DIR}/val" \
--output_directory="${DATA_DIR}/result" \
--imagenet_metadata_file="${IMAGENET_METADATA_FILE}" \
--labels_file="${LABELS_FILE}" \
--bounding_box_file="${BOUNDING_BOX_FILE}"

View file

@ -0,0 +1,89 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Process the ImageNet Challenge bounding boxes for TensorFlow model training.
Associate the ImageNet 2012 Challenge validation data set with labels.
The raw ImageNet validation data set is expected to reside in JPEG files
located in the following directory structure.
data_dir/ILSVRC2012_val_00000001.JPEG
data_dir/ILSVRC2012_val_00000002.JPEG
...
data_dir/ILSVRC2012_val_00050000.JPEG
This script moves the files into a directory structure like such:
data_dir/n01440764/ILSVRC2012_val_00000293.JPEG
data_dir/n01440764/ILSVRC2012_val_00000543.JPEG
...
where 'n01440764' is the unique synset label associated with
these images.
This directory reorganization requires a mapping from validation image
number (i.e. suffix of the original file) to the associated label. This
is provided in the ImageNet development kit via a Matlab file.
In order to make life easier and divorce ourselves from Matlab, we instead
supply a custom text file that provides this mapping for us.
Sample usage:
./preprocess_imagenet_validation_data.py ILSVRC2012_img_val \
imagenet_2012_validation_synset_labels.txt
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import errno
import os.path
import sys
if __name__ == '__main__':
if len(sys.argv) < 3:
print('Invalid usage\n'
'usage: preprocess_imagenet_validation_data.py '
'<validation data dir> <validation labels file>')
sys.exit(-1)
data_dir = sys.argv[1]
validation_labels_file = sys.argv[2]
# Read in the 50000 synsets associated with the validation data set.
labels = [l.strip() for l in open(validation_labels_file).readlines()]
unique_labels = set(labels)
# Make all sub-directories in the validation data dir.
for label in unique_labels:
labeled_data_dir = os.path.join(data_dir, label)
# Catch error if sub-directory exists
try:
os.makedirs(labeled_data_dir)
except OSError as e:
# Raise all errors but 'EEXIST'
if e.errno != errno.EEXIST:
raise
# Move all of the images to the appropriate sub-directory.
for i in range(len(labels)):
basename = 'ILSVRC2012_val_000%.5d.JPEG' % (i + 1)
original_filename = os.path.join(data_dir, basename)
if not os.path.exists(original_filename):
print('Failed to find: %s' % original_filename)
sys.exit(-1)
new_filename = os.path.join(data_dir, labels[i], basename)
os.rename(original_filename, new_filename)

View file

@ -0,0 +1,254 @@
#!/usr/bin/python
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Process the ImageNet Challenge bounding boxes for TensorFlow model training.
This script is called as
process_bounding_boxes.py <dir> [synsets-file]
Where <dir> is a directory containing the downloaded and unpacked bounding box
data. If [synsets-file] is supplied, then only the bounding boxes whose
synsets are contained within this file are returned. Note that the
[synsets-file] file contains synset ids, one per line.
The script dumps out a CSV text file in which each line contains an entry.
n00007846_64193.JPEG,0.0060,0.2620,0.7545,0.9940
The entry can be read as:
<JPEG file name>, <xmin>, <ymin>, <xmax>, <ymax>
The bounding box for <JPEG file name> contains two points (xmin, ymin) and
(xmax, ymax) specifying the lower-left corner and upper-right corner of a
bounding box in *relative* coordinates.
The user supplies a directory where the XML files reside. The directory
structure in the directory <dir> is assumed to look like this:
<dir>/nXXXXXXXX/nXXXXXXXX_YYYY.xml
Each XML file contains a bounding box annotation. The script:
(1) Parses the XML file and extracts the filename, label and bounding box info.
(2) The bounding box is specified in the XML files as integer (xmin, ymin) and
(xmax, ymax) *relative* to image size displayed to the human annotator. The
size of the image displayed to the human annotator is stored in the XML file
as integer (height, width).
Note that the displayed size will differ from the actual size of the image
downloaded from image-net.org. To make the bounding box annotation useable,
we convert bounding box to floating point numbers relative to displayed
height and width of the image.
Note that each XML file might contain N bounding box annotations.
Note that the points are all clamped at a range of [0.0, 1.0] because some
human annotations extend outside the range of the supplied image.
See details here: http://image-net.org/download-bboxes
(3) By default, the script outputs all valid bounding boxes. If a
[synsets-file] is supplied, only the subset of bounding boxes associated
with those synsets is output. Importantly, one can supply a list of
synsets in the ImageNet Challenge and output the list of bounding boxes
associated with the training images of the ILSVRC.
We use these bounding boxes to inform the random distortion of images
supplied to the network.
If you run this script successfully, you will see the following output
to stderr:
> Finished processing 544546 XML files.
> Skipped 0 XML files not in ImageNet Challenge.
> Skipped 0 bounding boxes not in ImageNet Challenge.
> Wrote 615299 bounding boxes from 544546 annotated images.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import glob
import os.path
import sys
import xml.etree.ElementTree as ET
class BoundingBox(object):
pass
def GetItem(name, root, index=0):
count = 0
for item in root.iter(name):
if count == index:
return item.text
count += 1
# Failed to find "index" occurrence of item.
return -1
def GetInt(name, root, index=0):
# In some XML annotation files, the point values are not integers, but floats.
# So we add a float function to avoid ValueError.
return int(float(GetItem(name, root, index)))
def FindNumberBoundingBoxes(root):
index = 0
while True:
if GetInt('xmin', root, index) == -1:
break
index += 1
return index
def ProcessXMLAnnotation(xml_file):
"""Process a single XML file containing a bounding box."""
# pylint: disable=broad-except
try:
tree = ET.parse(xml_file)
except Exception:
print('Failed to parse: ' + xml_file, file=sys.stderr)
return None
# pylint: enable=broad-except
root = tree.getroot()
num_boxes = FindNumberBoundingBoxes(root)
boxes = []
for index in range(num_boxes):
box = BoundingBox()
# Grab the 'index' annotation.
box.xmin = GetInt('xmin', root, index)
box.ymin = GetInt('ymin', root, index)
box.xmax = GetInt('xmax', root, index)
box.ymax = GetInt('ymax', root, index)
box.width = GetInt('width', root)
box.height = GetInt('height', root)
box.filename = GetItem('filename', root) + '.JPEG'
box.label = GetItem('name', root)
xmin = float(box.xmin) / float(box.width)
xmax = float(box.xmax) / float(box.width)
ymin = float(box.ymin) / float(box.height)
ymax = float(box.ymax) / float(box.height)
# Some images contain bounding box annotations that
# extend outside of the supplied image. See, e.g.
# n03127925/n03127925_147.xml
# Additionally, for some bounding boxes, the min > max
# or the box is entirely outside of the image.
min_x = min(xmin, xmax)
max_x = max(xmin, xmax)
box.xmin_scaled = min(max(min_x, 0.0), 1.0)
box.xmax_scaled = min(max(max_x, 0.0), 1.0)
min_y = min(ymin, ymax)
max_y = max(ymin, ymax)
box.ymin_scaled = min(max(min_y, 0.0), 1.0)
box.ymax_scaled = min(max(max_y, 0.0), 1.0)
boxes.append(box)
return boxes
if __name__ == '__main__':
if len(sys.argv) < 2 or len(sys.argv) > 3:
print('Invalid usage\n'
'usage: process_bounding_boxes.py <dir> [synsets-file]',
file=sys.stderr)
sys.exit(-1)
xml_files = glob.glob(sys.argv[1] + '/*/*.xml')
print('Identified %d XML files in %s' % (len(xml_files), sys.argv[1]),
file=sys.stderr)
if len(sys.argv) == 3:
labels = set([l.strip() for l in open(sys.argv[2]).readlines()])
print('Identified %d synset IDs in %s' % (len(labels), sys.argv[2]),
file=sys.stderr)
else:
labels = None
skipped_boxes = 0
skipped_files = 0
saved_boxes = 0
saved_files = 0
for file_index, one_file in enumerate(xml_files):
# Example: <...>/n06470073/n00141669_6790.xml
label = os.path.basename(os.path.dirname(one_file))
# Determine if the annotation is from an ImageNet Challenge label.
if labels is not None and label not in labels:
skipped_files += 1
continue
bboxes = ProcessXMLAnnotation(one_file)
assert bboxes is not None, 'No bounding boxes found in ' + one_file
found_box = False
for bbox in bboxes:
if labels is not None:
if bbox.label != label:
# Note: There is a slight bug in the bounding box annotation data.
# Many of the dog labels have the human label 'Scottish_deerhound'
# instead of the synset ID 'n02092002' in the bbox.label field. As a
# simple hack to overcome this issue, we only exclude bbox labels
# *which are synset ID's* that do not match original synset label for
# the XML file.
if bbox.label in labels:
skipped_boxes += 1
continue
# Guard against improperly specified boxes.
if (bbox.xmin_scaled >= bbox.xmax_scaled or
bbox.ymin_scaled >= bbox.ymax_scaled):
skipped_boxes += 1
continue
# Note bbox.filename occasionally contains '%s' in the name. This is
# data set noise that is fixed by just using the basename of the XML file.
image_filename = os.path.splitext(os.path.basename(one_file))[0]
print('%s.JPEG,%.4f,%.4f,%.4f,%.4f' %
(image_filename,
bbox.xmin_scaled, bbox.ymin_scaled,
bbox.xmax_scaled, bbox.ymax_scaled))
saved_boxes += 1
found_box = True
if found_box:
saved_files += 1
else:
skipped_files += 1
if not file_index % 5000:
print('--> processed %d of %d XML files.' %
(file_index + 1, len(xml_files)),
file=sys.stderr)
print('--> skipped %d boxes and %d XML files.' %
(skipped_boxes, skipped_files), file=sys.stderr)
print('Finished processing %d XML files.' % len(xml_files), file=sys.stderr)
print('Skipped %d XML files not in ImageNet Challenge.' % skipped_files,
file=sys.stderr)
print('Skipped %d bounding boxes not in ImageNet Challenge.' % skipped_boxes,
file=sys.stderr)
print('Wrote %d bounding boxes from %d annotated images.' %
(saved_boxes, saved_files),
file=sys.stderr)
print('Finished.', file=sys.stderr)

View file

@ -42,12 +42,10 @@ if __name__ == "__main__":
log_path = os.path.join(FLAGS.results_dir, FLAGS.log_filename)
os.makedirs(FLAGS.results_dir, exist_ok=True)
dllogger.init(
backends=[
dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE, filename=log_path),
dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE)
]
)
dllogger.init(backends=[
dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE, filename=log_path),
dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE)
])
else:
dllogger.init(backends=[])
dllogger.log(data=vars(FLAGS), step='PARAMETER')
@ -58,49 +56,46 @@ if __name__ == "__main__":
architecture=FLAGS.arch,
input_format='NHWC',
compute_format=FLAGS.data_format,
dtype=tf.float32 if FLAGS.precision == 'fp32' else tf.float16,
dtype=tf.float32,
n_channels=3,
height=224,
width=224,
height=224 if FLAGS.data_dir else FLAGS.synthetic_data_size,
width=224 if FLAGS.data_dir else FLAGS.synthetic_data_size,
distort_colors=False,
log_dir=FLAGS.results_dir,
model_dir=FLAGS.model_dir if FLAGS.model_dir is not None else FLAGS.results_dir,
data_dir=FLAGS.data_dir,
data_idx_dir=FLAGS.data_idx_dir,
weight_init=FLAGS.weight_init,
use_xla=FLAGS.use_xla,
use_tf_amp=FLAGS.use_tf_amp,
use_dali=FLAGS.use_dali,
use_xla=FLAGS.xla,
use_tf_amp=FLAGS.amp,
use_dali=FLAGS.dali,
gpu_memory_fraction=FLAGS.gpu_memory_fraction,
gpu_id=FLAGS.gpu_id,
seed=FLAGS.seed
)
seed=FLAGS.seed)
if FLAGS.mode in ["train", "train_and_evaluate", "training_benchmark"]:
runner.train(
iter_unit=FLAGS.iter_unit,
num_iter=FLAGS.num_iter,
run_iter=FLAGS.run_iter,
batch_size=FLAGS.batch_size,
warmup_steps=FLAGS.warmup_steps,
log_every_n_steps=FLAGS.display_every,
weight_decay=FLAGS.weight_decay,
lr_init=FLAGS.lr_init,
lr_warmup_epochs=FLAGS.lr_warmup_epochs,
momentum=FLAGS.momentum,
loss_scale=FLAGS.loss_scale,
label_smoothing=FLAGS.label_smoothing,
mixup=FLAGS.mixup,
use_static_loss_scaling=FLAGS.use_static_loss_scaling,
use_cosine_lr=FLAGS.use_cosine_lr,
is_benchmark=FLAGS.mode == 'training_benchmark',
use_final_conv=FLAGS.use_final_conv,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
quant_delay = FLAGS.quant_delay,
use_qdq = FLAGS.use_qdq,
finetune_checkpoint=FLAGS.finetune_checkpoint,
)
runner.train(iter_unit=FLAGS.iter_unit,
num_iter=FLAGS.num_iter,
run_iter=FLAGS.run_iter,
batch_size=FLAGS.batch_size,
warmup_steps=FLAGS.warmup_steps,
log_every_n_steps=FLAGS.display_every,
weight_decay=FLAGS.weight_decay,
lr_init=FLAGS.lr_init,
lr_warmup_epochs=FLAGS.lr_warmup_epochs,
momentum=FLAGS.momentum,
loss_scale=FLAGS.static_loss_scale,
label_smoothing=FLAGS.label_smoothing,
mixup=FLAGS.mixup,
use_static_loss_scaling=(FLAGS.static_loss_scale != -1),
use_cosine_lr=FLAGS.cosine_lr,
is_benchmark=FLAGS.mode == 'training_benchmark',
use_final_conv=FLAGS.use_final_conv,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
quant_delay=FLAGS.quant_delay,
use_qdq=FLAGS.use_qdq,
finetune_checkpoint=FLAGS.finetune_checkpoint)
if FLAGS.mode in ["train_and_evaluate", 'evaluate', 'inference_benchmark']:
@ -109,19 +104,17 @@ if __name__ == "__main__":
elif not hvd_utils.is_using_hvd() or hvd.rank() == 0:
runner.evaluate(
iter_unit=FLAGS.iter_unit if FLAGS.mode != "train_and_evaluate" else "epoch",
num_iter=FLAGS.num_iter if FLAGS.mode != "train_and_evaluate" else 1,
warmup_steps=FLAGS.warmup_steps,
batch_size=FLAGS.batch_size,
log_every_n_steps=FLAGS.display_every,
is_benchmark=FLAGS.mode == 'inference_benchmark',
export_dir=FLAGS.export_dir,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
use_final_conv=FLAGS.use_final_conv,
use_qdq=FLAGS.use_qdq
)
runner.evaluate(iter_unit=FLAGS.iter_unit if FLAGS.mode != "train_and_evaluate" else "epoch",
num_iter=FLAGS.num_iter if FLAGS.mode != "train_and_evaluate" else 1,
warmup_steps=FLAGS.warmup_steps,
batch_size=FLAGS.batch_size,
log_every_n_steps=FLAGS.display_every,
is_benchmark=FLAGS.mode == 'inference_benchmark',
export_dir=FLAGS.export_dir,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
use_final_conv=FLAGS.use_final_conv,
use_qdq=FLAGS.use_qdq)
if FLAGS.mode == 'predict':
if FLAGS.to_predict is None:
@ -134,4 +127,8 @@ if __name__ == "__main__":
raise NotImplementedError("Only single GPU inference is implemented.")
elif not hvd_utils.is_using_hvd() or hvd.rank() == 0:
runner.predict(FLAGS.to_predict, quantize=FLAGS.quantize, symmetric=FLAGS.symmetric, use_qdq=FLAGS.use_qdq, use_final_conv=FLAGS.use_final_conv)
runner.predict(FLAGS.to_predict,
quantize=FLAGS.quantize,
symmetric=FLAGS.symmetric,
use_qdq=FLAGS.use_qdq,
use_final_conv=FLAGS.use_final_conv)

View file

@ -29,7 +29,7 @@ def conv2d(
data_format='NHWC',
dilation_rate=(1, 1),
use_bias=True,
kernel_initializer=tf.variance_scaling_initializer(),
kernel_initializer=tf.compat.v1.variance_scaling_initializer(),
bias_initializer=tf.zeros_initializer(),
trainable=True,
name=None
@ -56,6 +56,5 @@ def conv2d(
activation=None,
name=name
)
return net
return net

View file

@ -22,7 +22,7 @@ def dense(
units,
use_bias=True,
trainable=True,
kernel_initializer=tf.variance_scaling_initializer(),
kernel_initializer=tf.compat.v1.variance_scaling_initializer(),
bias_initializer=tf.zeros_initializer()
):

View file

@ -29,7 +29,7 @@ def squeeze_excitation_layer(
ratio,
training=True,
data_format='NCHW',
kernel_initializer=tf.variance_scaling_initializer(),
kernel_initializer=tf.compat.v1.variance_scaling_initializer(),
bias_initializer=tf.zeros_initializer(),
name="squeeze_excitation_layer"
):

View file

@ -15,7 +15,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import tensorflow as tf
@ -34,7 +33,6 @@ from utils.data_utils import normalized_inputs
from utils.learning_rate import learning_rate_scheduler
from utils.optimizers import FixedLossScalerOptimizer
__all__ = [
'ResnetModel',
]
@ -89,14 +87,14 @@ class ResnetModel(object):
)
self.conv2d_hparams = tf.contrib.training.HParams(
kernel_initializer=tf.variance_scaling_initializer(
kernel_initializer=tf.compat.v1.variance_scaling_initializer(
scale=2.0, distribution='truncated_normal', mode=weight_init
),
bias_initializer=tf.constant_initializer(0.0)
)
self.dense_hparams = tf.contrib.training.HParams(
kernel_initializer=tf.variance_scaling_initializer(
kernel_initializer=tf.compat.v1.variance_scaling_initializer(
scale=2.0, distribution='truncated_normal', mode=weight_init
),
bias_initializer=tf.constant_initializer(0.0)
@ -109,12 +107,13 @@ class ResnetModel(object):
print("Input_format", input_format)
print("dtype", str(dtype))
def __call__(self, features, labels, mode, params):
if mode == tf.estimator.ModeKeys.TRAIN:
mandatory_params = ["batch_size", "lr_init", "num_gpus", "steps_per_epoch",
"momentum", "weight_decay", "loss_scale", "label_smoothing"]
mandatory_params = [
"batch_size", "lr_init", "num_gpus", "steps_per_epoch", "momentum", "weight_decay", "loss_scale",
"label_smoothing"
]
for p in mandatory_params:
if p not in params:
raise RuntimeError("Parameter {} is missing.".format(p))
@ -141,43 +140,46 @@ class ResnetModel(object):
mixup = 0
eta = 0
if mode == tf.estimator.ModeKeys.TRAIN:
if mode == tf.estimator.ModeKeys.TRAIN:
eta = params['label_smoothing']
mixup = params['mixup']
if mode != tf.estimator.ModeKeys.PREDICT:
one_hot_smoothed_labels = tf.one_hot(labels, 1001,
on_value = 1 - eta + eta/1001,
off_value = eta/1001)
if mode != tf.estimator.ModeKeys.PREDICT:
n_cls = self.model_hparams.n_classes
one_hot_smoothed_labels = tf.one_hot(labels, n_cls,
on_value=1 - eta + eta / n_cls, off_value=eta / n_cls)
if mixup != 0:
print("Using mixup training with beta=", params['mixup'])
beta_distribution = tf.distributions.Beta(params['mixup'], params['mixup'])
feature_coefficients = beta_distribution.sample(sample_shape=[params['batch_size'], 1, 1, 1])
feature_coefficients = beta_distribution.sample(sample_shape=[params['batch_size'], 1, 1, 1])
reversed_feature_coefficients = tf.subtract(tf.ones(shape=feature_coefficients.shape), feature_coefficients)
reversed_feature_coefficients = tf.subtract(
tf.ones(shape=feature_coefficients.shape), feature_coefficients
)
rotated_features = tf.reverse(features, axis=[0])
rotated_features = tf.reverse(features, axis=[0])
features = feature_coefficients * features + reversed_feature_coefficients * rotated_features
label_coefficients = tf.squeeze(feature_coefficients, axis=[2, 3])
rotated_labels = tf.reverse(one_hot_smoothed_labels, axis=[0])
rotated_labels = tf.reverse(one_hot_smoothed_labels, axis=[0])
reversed_label_coefficients = tf.subtract(tf.ones(shape=label_coefficients.shape), label_coefficients)
reversed_label_coefficients = tf.subtract(
tf.ones(shape=label_coefficients.shape), label_coefficients
)
one_hot_smoothed_labels = label_coefficients * one_hot_smoothed_labels + reversed_label_coefficients * rotated_labels
# Update Global Step
global_step = tf.train.get_or_create_global_step()
tf.identity(global_step, name="global_step_ref")
tf.identity(features, name="features_ref")
if mode == tf.estimator.ModeKeys.TRAIN:
tf.identity(labels, name="labels_ref")
@ -202,16 +204,31 @@ class ResnetModel(object):
tf.identity(probs, name="probs_ref")
tf.identity(y_preds, name="y_preds_ref")
#if mode == tf.estimator.ModeKeys.TRAIN:
#
# assert (len(tf.trainable_variables()) == 161)
#
#else:
#
# assert (len(tf.trainable_variables()) == 0)
if mode == tf.estimator.ModeKeys.TRAIN and params['quantize']:
dllogger.log(data={"QUANTIZATION AWARE TRAINING ENABLED": True}, step=tuple())
if params['symmetric']:
dllogger.log(data={"MODE":"USING SYMMETRIC MODE"}, step=tuple())
tf.contrib.quantize.experimental_create_training_graph(tf.get_default_graph(), symmetric=True, use_qdq=params['use_qdq'] ,quant_delay=params['quant_delay'])
dllogger.log(data={"MODE": "USING SYMMETRIC MODE"}, step=tuple())
tf.contrib.quantize.experimental_create_training_graph(
tf.get_default_graph(),
symmetric=True,
use_qdq=params['use_qdq'],
quant_delay=params['quant_delay']
)
else:
dllogger.log(data={"MODE":"USING ASSYMETRIC MODE"}, step=tuple())
tf.contrib.quantize.create_training_graph(tf.get_default_graph(), quant_delay=params['quant_delay'], use_qdq=params['use_qdq'])
# Fix for restoring variables during fine-tuning of Resnet-50
dllogger.log(data={"MODE": "USING ASSYMETRIC MODE"}, step=tuple())
tf.contrib.quantize.create_training_graph(
tf.get_default_graph(), quant_delay=params['quant_delay'], use_qdq=params['use_qdq']
)
# Fix for restoring variables during fine-tuning of Resnet
if 'finetune_checkpoint' in params.keys():
train_vars = tf.trainable_variables()
train_var_dict = {}
@ -220,6 +237,13 @@ class ResnetModel(object):
dllogger.log(data={"Restoring variables from checkpoint": params['finetune_checkpoint']}, step=tuple())
tf.train.init_from_checkpoint(params['finetune_checkpoint'], train_var_dict)
with tf.device("/cpu:0"):
if hvd_utils.is_using_hvd():
sync_var = tf.Variable(initial_value=[0], dtype=tf.int32, name="signal_handler_var")
sync_var_assing = sync_var.assign([1], name="signal_handler_var_set")
sync_var_reset = sync_var.assign([0], name="signal_handler_var_reset")
sync_op = hvd.allreduce(sync_var, op=hvd.Sum, name="signal_handler_all_reduce")
if mode == tf.estimator.ModeKeys.PREDICT:
predictions = {'classes': y_preds, 'probabilities': probs}
@ -239,8 +263,12 @@ class ResnetModel(object):
acc_top5 = tf.nn.in_top_k(predictions=logits, targets=labels, k=5)
else:
acc_top1, acc_top1_update_op = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, targets=labels, k=1))
acc_top5, acc_top5_update_op = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, targets=labels, k=5))
acc_top1, acc_top1_update_op = tf.metrics.mean(
tf.nn.in_top_k(predictions=logits, targets=labels, k=1)
)
acc_top5, acc_top5_update_op = tf.metrics.mean(
tf.nn.in_top_k(predictions=logits, targets=labels, k=5)
)
tf.identity(acc_top1, name="acc_top1_ref")
tf.identity(acc_top5, name="acc_top5_ref")
@ -251,20 +279,21 @@ class ResnetModel(object):
'accuracy_top1': acc_top1,
'accuracy_top5': acc_top5
}
cross_entropy = tf.losses.softmax_cross_entropy(
logits=logits, onehot_labels=one_hot_smoothed_labels)
cross_entropy = tf.losses.softmax_cross_entropy(logits=logits, onehot_labels=one_hot_smoothed_labels)
assert (cross_entropy.dtype == tf.float32)
tf.identity(cross_entropy, name='cross_entropy_loss_ref')
def loss_filter_fn(name):
"""we don't need to compute L2 loss for BN and bias (eq. to add a cste)"""
return all([
tensor_name not in name.lower()
# for tensor_name in ["batchnorm", "batch_norm", "batch_normalization", "bias"]
for tensor_name in ["batchnorm", "batch_norm", "batch_normalization"]
])
return all(
[
tensor_name not in name.lower()
# for tensor_name in ["batchnorm", "batch_norm", "batch_normalization", "bias"]
for tensor_name in ["batchnorm", "batch_norm", "batch_normalization"]
]
)
filtered_params = [tf.cast(v, tf.float32) for v in tf.trainable_variables() if loss_filter_fn(v.name)]
@ -287,7 +316,7 @@ class ResnetModel(object):
tf.summary.scalar('cross_entropy', cross_entropy)
tf.summary.scalar('l2_loss', l2_loss)
tf.summary.scalar('total_loss', total_loss)
if mode == tf.estimator.ModeKeys.TRAIN:
with tf.device("/cpu:0"):
@ -317,17 +346,18 @@ class ResnetModel(object):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
if mode != tf.estimator.ModeKeys.TRAIN:
update_ops += [acc_top1_update_op, acc_top5_update_op]
deterministic = True
gate_gradients = (tf.train.Optimizer.GATE_OP if deterministic else tf.train.Optimizer.GATE_NONE)
gate_gradients = (tf.compat.v1.train.Optimizer.GATE_OP if deterministic else tf.compat.v1.train.Optimizer.GATE_NONE)
backprop_op = optimizer.minimize(total_loss, gate_gradients=gate_gradients, global_step=global_step)
if self.model_hparams.use_dali:
train_ops = tf.group(backprop_op, update_ops, name='train_ops')
else:
train_ops = tf.group(backprop_op, cpu_prefetch_op, gpu_prefetch_op, update_ops, name='train_ops')
train_ops = tf.group(
backprop_op, cpu_prefetch_op, gpu_prefetch_op, update_ops, name='train_ops'
)
return tf.estimator.EstimatorSpec(mode=mode, loss=total_loss, train_op=train_ops)
@ -338,23 +368,18 @@ class ResnetModel(object):
}
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=predictions,
loss=total_loss,
eval_metric_ops=eval_metrics
mode=mode, predictions=predictions, loss=total_loss, eval_metric_ops=eval_metrics
)
else:
raise NotImplementedError('Unknown mode {}'.format(mode))
@staticmethod
def _stage(tensors):
"""Stages the given tensors in a StagingArea for asynchronous put/get.
"""
stage_area = tf.contrib.staging.StagingArea(
dtypes=[tensor.dtype for tensor in tensors],
shapes=[tensor.get_shape() for tensor in tensors]
dtypes=[tensor.dtype for tensor in tensors], shapes=[tensor.get_shape() for tensor in tensors]
)
put_op = stage_area.put(tensors)
@ -364,14 +389,11 @@ class ResnetModel(object):
return put_op, get_tensors
def build_model(self, inputs, training=True, reuse=False, use_final_conv=False):
with var_storage.model_variable_scope(
self.model_hparams.model_name,
reuse=reuse,
dtype=self.model_hparams.dtype):
self.model_hparams.model_name, reuse=reuse, dtype=self.model_hparams.dtype
):
with tf.variable_scope("input_reshape"):
if self.model_hparams.input_format == 'NHWC' and self.model_hparams.compute_format == 'NCHW':
@ -426,27 +448,29 @@ class ResnetModel(object):
batch_norm_hparams=self.batch_norm_hparams,
block_name="btlnck_block_%d_%d" % (block_id, layer_id),
use_se=self.model_hparams.use_se,
ratio=self.model_hparams.se_ratio)
ratio=self.model_hparams.se_ratio
)
with tf.variable_scope("output"):
net = layers.reduce_mean(
net, keepdims=use_final_conv, data_format=self.model_hparams.compute_format, name='spatial_mean')
net, keepdims=False, data_format=self.model_hparams.compute_format, name='spatial_mean'
)
if use_final_conv:
logits = layers.conv2d(
net,
n_channels=self.model_hparams.n_classes,
kernel_size=(1, 1),
strides=(1, 1),
padding='SAME',
data_format=self.model_hparams.compute_format,
dilation_rate=(1, 1),
use_bias=True,
kernel_initializer=self.dense_hparams.kernel_initializer,
bias_initializer=self.dense_hparams.bias_initializer,
trainable=training,
name='dense'
)
net,
n_channels=self.model_hparams.n_classes,
kernel_size=(1, 1),
strides=(1, 1),
padding='SAME',
data_format=self.model_hparams.compute_format,
dilation_rate=(1, 1),
use_bias=True,
kernel_initializer=self.dense_hparams.kernel_initializer,
bias_initializer=self.dense_hparams.bias_initializer,
trainable=training,
name='dense'
)
else:
logits = layers.dense(
inputs=net,
@ -454,7 +478,8 @@ class ResnetModel(object):
use_bias=True,
trainable=training,
kernel_initializer=self.dense_hparams.kernel_initializer,
bias_initializer=self.dense_hparams.bias_initializer)
bias_initializer=self.dense_hparams.bias_initializer
)
if logits.dtype != tf.float32:
logits = tf.cast(logits, tf.float32)
@ -464,27 +489,25 @@ class ResnetModel(object):
return probs, logits
model_architectures = {
'resnet50': {
'layers': [3, 4, 6, 3],
'widths': [64, 128, 256, 512],
'expansions': 4,
},
'resnext101-32x4d': {
'layers': [3, 4, 23, 3],
'widths': [128, 256, 512, 1024],
'expansions': 2,
'cardinality': 32,
},
'se-resnext101-32x4d' : {
'cardinality' : 32,
'layers' : [3, 4, 23, 3],
'widths' : [128, 256, 512, 1024],
'expansions' : 2,
'se-resnext101-32x4d': {
'cardinality': 32,
'layers': [3, 4, 23, 3],
'widths': [128, 256, 512, 1024],
'expansions': 2,
'use_se': True,
'se_ratio': 16,
},
}

View file

@ -71,4 +71,4 @@ if __name__=='__main__':
file.write("model_checkpoint_path: "+ "\"" + new_ckpt + "\"")
# Process the input checkpoint, apply transforms and generate a new checkpoint.
process_checkpoint(input_ckpt, new_ckpt_path, args.dense_layer)
process_checkpoint(input_ckpt, new_ckpt_path, args.dense_layer)

View file

@ -244,16 +244,16 @@ For example, to train on DGX-1 for 90 epochs using AMP, run:
Additionally, features like DALI data preprocessing or TensorFlow XLA can be enabled with the
following arguments when running those scripts:
`bash ./resnet50v1.5/training/DGX1_RN50_AMP_90E.sh /path/to/result /data --use_xla --use_dali`
`bash ./resnet50v1.5/training/DGX1_RN50_AMP_90E.sh /path/to/result /data --xla --dali`
7. Start validation/evaluation.
To evaluate the validation dataset located in `/data/tfrecords`, run `main.py` with
`--mode=evaluate`. For example:
`python main.py --mode=evaluate --data_dir=/data/tfrecords --batch_size <batch size> --model_dir
<model location> --results_dir <output location> [--use_xla] [--use_tf_amp]`
<model location> --results_dir <output location> [--xla] [--amp]`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during evaluation.
The optional `--xla` and `--amp` flags control XLA and AMP during evaluation.
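For instance, an evaluation run with both flags enabled could look like the following sketch; the checkpoint and output paths are placeholders and should be replaced with your own locations:

```
python main.py --mode=evaluate --data_dir=/data/tfrecords --batch_size 256 \
    --model_dir /workspace/checkpoints --results_dir /workspace/results \
    --xla --amp
```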
## Advanced
@ -292,99 +292,116 @@ The `runtime/` directory contains the following module that define the mechanics
The script for training and evaluating the ResNet-50 v1.5 model has a variety of parameters that control these processes.
```
usage: main.py [-h]
[--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
usage: main.py [-h] [--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
[--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}]
[--data_dir DATA_DIR] [--data_idx_dir DATA_IDX_DIR]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
[--batch_size BATCH_SIZE] [--num_iter NUM_ITER]
[--iter_unit {epoch,batch}] [--warmup_steps WARMUP_STEPS]
[--model_dir MODEL_DIR] [--results_dir RESULTS_DIR]
[--log_filename LOG_FILENAME] [--display_every DISPLAY_EVERY]
[--lr_init LR_INIT] [--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--loss_scale LOSS_SCALE]
[--label_smoothing LABEL_SMOOTHING] [--mixup MIXUP]
[--use_static_loss_scaling | --nouse_static_loss_scaling]
[--use_xla | --nouse_xla] [--use_dali | --nouse_dali]
[--use_tf_amp | --nouse_tf_amp]
[--use_cosine_lr | --nouse_cosine_lr] [--seed SEED]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
--batch_size BATCH_SIZE [--num_iter NUM_ITER]
[--run_iter RUN_ITER] [--iter_unit {epoch,batch}]
[--warmup_steps WARMUP_STEPS] [--model_dir MODEL_DIR]
[--results_dir RESULTS_DIR] [--log_filename LOG_FILENAME]
[--display_every DISPLAY_EVERY] [--seed SEED]
[--gpu_memory_fraction GPU_MEMORY_FRACTION] [--gpu_id GPU_ID]
JoC-RN50v1.5-TF
optional arguments:
-h, --help Show this help message and exit
[--finetune_checkpoint FINETUNE_CHECKPOINT] [--use_final_conv]
[--quant_delay QUANT_DELAY] [--quantize] [--use_qdq]
[--symmetric] [--data_dir DATA_DIR]
[--data_idx_dir DATA_IDX_DIR] [--dali]
[--synthetic_data_size SYNTHETIC_DATA_SIZE] [--lr_init LR_INIT]
[--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--label_smoothing LABEL_SMOOTHING]
[--mixup MIXUP] [--cosine_lr] [--xla]
[--data_format {NHWC,NCHW}] [--amp]
[--static_loss_scale STATIC_LOSS_SCALE]
JoC-RN50v1.5-TF
optional arguments:
-h, --help show this help message and exit.
--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}
Architecture of model to run (default is resnet50)
Architecture of model to run.
--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}
The execution mode of the script.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--run_iter RUN_ITER Number of training iterations to run on single run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write model. If undefined,
results dir will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log.
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by training script for DALI.
--gpu_id GPU_ID Specify ID of the target GPU on multi-device platform.
Effective only for single-GPU mode.
--finetune_checkpoint FINETUNE_CHECKPOINT
Path to pre-trained checkpoint which will be used for
fine-tuning.
--use_final_conv Use convolution operator instead of MLP as last layer.
--quant_delay QUANT_DELAY
Number of steps to be run before quantization starts
to happen.
--quantize Quantize weights and activations during training.
(Defaults to asymmetric quantization)
--use_qdq Use QDQV3 op instead of FakeQuantWithMinMaxVars op for
quantization. QDQv3 does only scaling.
--symmetric Quantize weights and activations during training using
symmetric quantization.
Dataset arguments:
--data_dir DATA_DIR Path to dataset in TFRecord format. Files should be
named 'train-*' and 'validation-*'.
--data_idx_dir DATA_IDX_DIR
Path to index files for DALI. Files should be named
'train-*' and 'validation-*'.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write the model. If undefined,
results directory will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--dali Enable DALI data input.
--synthetic_data_size SYNTHETIC_DATA_SIZE
Dimension of image for synthetic dataset.
Training arguments:
--lr_init LR_INIT Initial value for the learning rate.
--lr_warmup_epochs LR_WARMUP_EPOCHS
Number of warmup epochs for the learning rate schedule.
Number of warmup epochs for learning rate schedule.
--weight_decay WEIGHT_DECAY
Weight Decay scale factor.
--weight_init {fan_in,fan_out}
Model weight initialization method.
--momentum MOMENTUM SGD momentum value for the momentum optimizer.
--loss_scale LOSS_SCALE
Loss scale for FP16 training and fast math FP32.
--momentum MOMENTUM SGD momentum value for the Momentum optimizer.
--label_smoothing LABEL_SMOOTHING
The value of label smoothing.
--mixup MIXUP The alpha parameter for mixup (if 0 then mixup is not
applied).
--use_static_loss_scaling
Use static loss scaling in FP16 or FP32 AMP.
--nouse_static_loss_scaling
--use_xla Enable XLA (Accelerated Linear Algebra) computation
--cosine_lr Use cosine learning rate schedule.
Generic optimization arguments:
--xla Enable XLA (Accelerated Linear Algebra) computation
for improved performance.
--nouse_xla
--use_dali Enable DALI data input.
--nouse_dali
--use_tf_amp Enable AMP to speedup FP32
computation using Tensor Cores.
--nouse_tf_amp
--use_cosine_lr Use cosine learning rate schedule.
--nouse_cosine_lr
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by the training script for DALI
--gpu_id GPU_ID Specify the ID of the target GPU on a multi-device platform.
Effective only for single-GPU mode.
--quantize Used to add quantization nodes in the graph (Default: Asymmetric quantization)
--symmetric If --quantize mode is used, this option enables symmetric quantization
--use_qdq Use quantize_and_dequantize (QDQ) op instead of FakeQuantWithMinMaxVars op for quantization. QDQ does only scaling.
--finetune_checkpoint Path to pre-trained checkpoint which can be used for fine-tuning
--quant_delay Number of steps to be run before quantization starts to happen
--data_format {NHWC,NCHW}
Data format used to do calculations.
--amp Enable Automatic Mixed Precision to speedup
computation using tensor cores.
Automatic Mixed Precision arguments:
--static_loss_scale STATIC_LOSS_SCALE
Use static loss scaling in FP32 AMP.
```
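To show how the flags above fit together, the following is an illustrative 8-GPU AMP training command modeled on the provided `DGX1_RN50_AMP_90E.sh` script; the dataset and result paths are placeholders, and `--xla`/`--dali` are optional:

```
mpiexec --allow-run-as-root --bind-to socket -np 8 python3 main.py --arch=resnet50 \
    --mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
    --batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
    --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
    --amp --static_loss_scale 128 --xla --dali \
    --data_dir=/data/tfrecords --data_idx_dir=/data/dali_idx \
    --results_dir=/workspace/results --weight_init=fan_in
```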
### Quantization Aware Training
@ -424,12 +441,13 @@ Arguments:
* `--input_format` : Data format of input tensor (Default: NCHW). Use NCHW format to optimize the graph with TensorRT.
* `--compute_format` : Data format of the operations in the network (Default: NCHW). Use NCHW format to optimize the graph with TensorRT.
### Inference process
To run inference on a single example with a checkpoint and a model script, use:
`python main.py --mode predict --model_dir <path to model> --to_predict <path to image> --results_dir <path to results>`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during inference.
The optional `--xla` and `--amp` flags control XLA and AMP during inference.
## Performance
@ -448,7 +466,7 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`python ./main.py --mode=training_benchmark --use_tf_amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --mode=training_benchmark --amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
* For multiple GPUs
* FP32 / TF32
@ -457,16 +475,18 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --use_tf_amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --mode=training_benchmark --amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
Each of these scripts runs 200 warm-up iterations and measures the first epoch.
To control warmup and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags. Features like XLA or DALI can be controlled
with `--use_xla` and `--use_dali` flags. If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
For proper throughput reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
with the `--xla` and `--dali` flags. For proper throughput reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
Suggested batch sizes for training are 256 for mixed precision training and 128 for single precision training per single V100 16 GB.
If no `--data_dir=<path to imagenet>` flag is specified, then the benchmarks will use a synthetic dataset. The resolution of the synthetic images can be controlled with the `--synthetic_data_size` flag.
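For example, a synthetic-data training benchmark (no `--data_dir`) could be launched roughly as follows; the batch size, image size, and iteration counts are illustrative only:

```
python ./main.py --mode=training_benchmark --amp --xla \
    --warmup_steps 200 --num_iter 500 --iter_unit batch \
    --batch_size 256 --synthetic_data_size 224 \
    --results_dir=/workspace/results
```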
#### Inference performance benchmark
To benchmark the inference performance on a specific batch size, run:
@ -477,11 +497,10 @@ To benchmark the inference performance on a specific batch size, run:
* AMP
`python ./main.py --mode=inference_benchmark --use_tf_amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
By default, each of these scripts runs 20 warm-up iterations and measures the next 80 iterations.
To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
For proper throughput and latency reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
The benchmark can be automated with the `inference_benchmark.sh` script provided in `resnet50v1.5`, by simply running:
@ -490,6 +509,9 @@ The benchmark can be automated with the `inference_benchmark.sh` script provided
The `<data dir>` parameter refers to the input data directory (by default `/data/tfrecords` inside the container).
By default, the benchmark tests the following configurations: **FP32**, **AMP**, **AMP + XLA** with different batch sizes.
When the optional directory with the DALI index files `<data idx dir>` is specified, the benchmark executes an additional **DALI + AMP + XLA** configuration.
For proper throughput reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
For performance benchmarking of the raw model, a synthetic dataset can be used. To use the synthetic dataset, pass the `--synthetic_data_size` flag instead of `--data_dir` to specify the input image size.
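As a sketch, an inference benchmark on synthetic data could be invoked as follows; the batch size, image size, and iteration counts are placeholders:

```
python ./main.py --mode=inference_benchmark --amp --xla \
    --warmup_steps 20 --num_iter 100 --iter_unit batch \
    --batch_size 128 --synthetic_data_size 224 \
    --results_dir=/workspace/results
```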
### Results
@ -568,17 +590,6 @@ on NVIDIA DGX A100 with (8x A100 40G) GPUs.
| 8 | ~2h | ~5h |
##### Training time: NVIDIA DGX A100 (8x A100 40GB)
Our results were estimated based on the [training performance results](#training-performance-nvidia-dgx-a100-8x-a100-40g)
on NVIDIA DGX A100 with (8x A100 40G) GPUs.
| GPUs | Time to train - mixed precision + XLA | Time to train - mixed precision | Time to train - TF32 + XLA | Time to train - TF32 |
|---|--------|---------|---------|-------|
| 1 | ~18h | ~19.5h | ~40h | ~47h |
| 8 | ~2h | ~2.5h | ~5h | ~6h |
##### Training time: NVIDIA DGX-1 (8x V100 16G)
Our results were estimated based on the [training performance results](#training-performance-nvidia-dgx-1-8x-v100-16g)
@ -821,22 +832,25 @@ on NVIDIA T4 with (1x T4 16G) GPU.
* Added benchmark results for DGX-2 and XLA-enabled DGX-1 and DGX-2.
3. July, 2019
* Added Cosine learning rate schedule
3. August, 2019
4. August, 2019
* Added mixup regularization
* Added T4 benchmarks
* Improved inference capabilities
* Added SavedModel export
4. January, 2020
5. January, 2020
* Removed manual checks for dataset paths to facilitate cloud storage solutions
* Move to a new logging solution
* Bump base docker image version
5. March, 2020
6. March, 2020
* Code cleanup and refactor
* Improved training process
6. June, 2020
7. June, 2020
* Added benchmark results for DGX-A100
* Updated benchmark results for DGX-1, DGX-2 and T4
* Updated base docker image version
8. August 2020
* Updated command line argument names
* Added support for synthetic dataset with different image sizes
### Known issues
Performance without XLA enabled is low. We recommend using XLA.
Performance without XLA enabled is low due to a BN + ReLU fusion bug.

View file

@ -22,12 +22,12 @@ function test_configuration() {
}
test_configuration "FP32 nodali noxla"
test_configuration "FP32 nodali xla" "--use_xla"
test_configuration "FP16 nodali noxla" "--use_tf_amp"
test_configuration "FP16 nodali xla" "--use_tf_amp --use_xla"
test_configuration "FP32 nodali xla" "--xla"
test_configuration "FP16 nodali noxla" "--amp"
test_configuration "FP16 nodali xla" "--amp --xla"
if [ ! -z $DALI_DIR ]; then
test_configuration "FP16 dali xla" "--use_tf_amp --use_xla --use_dali --data_idx_dir ${DALI_DIR}"
test_configuration "FP16 dali xla" "--amp --xla --dali --data_idx_dir ${DALI_DIR}"
fi
cat $INFERENCE_BENCHMARK

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnet50 \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=3.0517578125e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -1,20 +0,0 @@
#!/bin/bash
# Copyright (c) 2020 NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This script does Quantization aware training of Resnet-50 by finetuning on the pre-trained model using 1 GPU and a batch size of 32.
# Usage ./GPU1_RN50_QAT.sh <path to the pre-trained model> <path to dataset> <path to results directory>
python main.py --mode=train_and_evaluate --batch_size=32 --lr_warmup_epochs=1 --quantize --symmetric --use_qdq --label_smoothing 0.1 --lr_init=0.00005 --momentum=0.875 --weight_decay=3.0517578125e-05 --finetune_checkpoint=$1 --data_dir=$2 --results_dir=$3 --num_iter 10 --data_format NHWC

View file

@ -26,13 +26,13 @@ function run_benchmark() {
MODE_SIZE=$2
if [[ $4 -eq "1" ]]; then
XLA="--use_xla"
XLA="--xla"
else
XLA=""
fi
case $2 in
"amp") MODE_FLAGS="--use_tf_amp --use_static_loss_scaling --loss_scale=128";;
"amp") MODE_FLAGS="--amp --static_loss_scale 128";;
"fp32"|"tf32") MODE_FLAGS="";;
*) echo "Unsupported configuration, use amp, tf32 or fp32";;
esac

View file

@ -251,16 +251,16 @@ For example, to train on DGX-1 for 90 epochs using AMP, run:
Additionally, features like DALI data preprocessing or TensorFlow XLA can be enabled with the
following arguments when running those scripts:
`bash ./resnext101-32x4d/training/DGX1_RNxt101-32x4d_AMP_90E.sh /path/to/result /data --use_xla --use_dali`
`bash ./resnext101-32x4d/training/DGX1_RNxt101-32x4d_AMP_90E.sh /path/to/result /data --xla --dali`
7. Start validation/evaluation.
To evaluate the validation dataset located in `/data/tfrecords`, run `main.py` with
`--mode=evaluate`. For example:
`python main.py --arch=resnext101-32x4d --mode=evaluate --data_dir=/data/tfrecords --batch_size <batch size> --model_dir
<model location> --results_dir <output location> [--use_xla] [--use_tf_amp]`
<model location> --results_dir <output location> [--xla] [--amp]`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during evaluation.
The optional `--xla` and `--amp` flags control XLA and AMP during evaluation.
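For instance, a hypothetical evaluation run with both flags enabled (checkpoint and output paths are placeholders):

```
python main.py --arch=resnext101-32x4d --mode=evaluate --data_dir=/data/tfrecords \
    --batch_size 128 --model_dir /workspace/checkpoints --results_dir /workspace/results \
    --xla --amp
```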
## Advanced
@ -299,95 +299,116 @@ The `runtime/` directory contains the following module that define the mechanics
The script for training and evaluating the ResNext101-32x4d model has a variety of parameters that control these processes.
```
usage: main.py [-h]
[--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
usage: main.py [-h] [--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}]
[--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}]
[--data_dir DATA_DIR] [--data_idx_dir DATA_IDX_DIR]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
[--batch_size BATCH_SIZE] [--num_iter NUM_ITER]
[--iter_unit {epoch,batch}] [--warmup_steps WARMUP_STEPS]
[--model_dir MODEL_DIR] [--results_dir RESULTS_DIR]
[--log_filename LOG_FILENAME] [--display_every DISPLAY_EVERY]
[--lr_init LR_INIT] [--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--loss_scale LOSS_SCALE]
[--label_smoothing LABEL_SMOOTHING] [--mixup MIXUP]
[--use_static_loss_scaling | --nouse_static_loss_scaling]
[--use_xla | --nouse_xla] [--use_dali | --nouse_dali]
[--use_tf_amp | --nouse_tf_amp]
[--use_cosine_lr | --nouse_cosine_lr] [--seed SEED]
[--export_dir EXPORT_DIR] [--to_predict TO_PREDICT]
--batch_size BATCH_SIZE [--num_iter NUM_ITER]
[--run_iter RUN_ITER] [--iter_unit {epoch,batch}]
[--warmup_steps WARMUP_STEPS] [--model_dir MODEL_DIR]
[--results_dir RESULTS_DIR] [--log_filename LOG_FILENAME]
[--display_every DISPLAY_EVERY] [--seed SEED]
[--gpu_memory_fraction GPU_MEMORY_FRACTION] [--gpu_id GPU_ID]
JoC-RN50v1.5-TF
optional arguments:
-h, --help Show this help message and exit
[--finetune_checkpoint FINETUNE_CHECKPOINT] [--use_final_conv]
[--quant_delay QUANT_DELAY] [--quantize] [--use_qdq]
[--symmetric] [--data_dir DATA_DIR]
[--data_idx_dir DATA_IDX_DIR] [--dali]
[--synthetic_data_size SYNTHETIC_DATA_SIZE] [--lr_init LR_INIT]
[--lr_warmup_epochs LR_WARMUP_EPOCHS]
[--weight_decay WEIGHT_DECAY] [--weight_init {fan_in,fan_out}]
[--momentum MOMENTUM] [--label_smoothing LABEL_SMOOTHING]
[--mixup MIXUP] [--cosine_lr] [--xla]
[--data_format {NHWC,NCHW}] [--amp]
[--static_loss_scale STATIC_LOSS_SCALE]
JoC-RN50v1.5-TF
optional arguments:
-h, --help show this help message and exit.
--arch {resnet50,resnext101-32x4d,se-resnext101-32x4d}
Architecture of model to run (to run Resnext-32x4d set
--arch=rensext101-32x4d)
Architecture of model to run.
--mode {train,train_and_evaluate,evaluate,predict,training_benchmark,inference_benchmark}
The execution mode of the script.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--run_iter RUN_ITER Number of training iterations to run on single run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write model. If undefined,
results dir will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log.
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by training script for DALI.
--gpu_id GPU_ID Specify ID of the target GPU on multi-device platform.
Effective only for single-GPU mode.
--finetune_checkpoint FINETUNE_CHECKPOINT
Path to pre-trained checkpoint which will be used for
fine-tuning.
--use_final_conv Use convolution operator instead of MLP as last layer.
--quant_delay QUANT_DELAY
Number of steps to be run before quantization starts
to happen.
--quantize Quantize weights and activations during training.
(Defaults to asymmetric quantization)
--use_qdq Use QDQV3 op instead of FakeQuantWithMinMaxVars op for
quantization. QDQv3 does only scaling.
--symmetric Quantize weights and activations during training using
symmetric quantization.
Dataset arguments:
--data_dir DATA_DIR Path to dataset in TFRecord format. Files should be
named 'train-*' and 'validation-*'.
--data_idx_dir DATA_IDX_DIR
Path to index files for DALI. Files should be named
'train-*' and 'validation-*'.
--export_dir EXPORT_DIR
Directory in which to write exported SavedModel.
--to_predict TO_PREDICT
Path to file or directory of files to run prediction
on.
--batch_size BATCH_SIZE
Size of each minibatch per GPU.
--num_iter NUM_ITER Number of iterations to run.
--iter_unit {epoch,batch}
Unit of iterations.
--warmup_steps WARMUP_STEPS
Number of steps considered as warmup and not taken
into account for performance measurements.
--model_dir MODEL_DIR
Directory in which to write the model. If undefined,
results directory will be used.
--results_dir RESULTS_DIR
Directory in which to write training logs, summaries
and checkpoints.
--log_filename LOG_FILENAME
Name of the JSON file to which write the training log
--display_every DISPLAY_EVERY
How often (in batches) to print out running
information.
--dali Enable DALI data input.
--synthetic_data_size SYNTHETIC_DATA_SIZE
Dimension of image for synthetic dataset.
Training arguments:
--lr_init LR_INIT Initial value for the learning rate.
--lr_warmup_epochs LR_WARMUP_EPOCHS
Number of warmup epochs for the learning rate schedule.
Number of warmup epochs for learning rate schedule.
--weight_decay WEIGHT_DECAY
Weight Decay scale factor.
--weight_init {fan_in,fan_out}
Model weight initialization method.
--momentum MOMENTUM SGD momentum value for the momentum optimizer.
--loss_scale LOSS_SCALE
Loss scale for FP16 training and fast math FP32.
--momentum MOMENTUM SGD momentum value for the Momentum optimizer.
--label_smoothing LABEL_SMOOTHING
The value of label smoothing.
--mixup MIXUP The alpha parameter for mixup (if 0 then mixup is not
applied).
--use_static_loss_scaling
Use static loss scaling in FP16 or FP32 AMP.
--nouse_static_loss_scaling
--use_xla Enable XLA (Accelerated Linear Algebra) computation
--cosine_lr Use cosine learning rate schedule.
Generic optimization arguments:
--xla Enable XLA (Accelerated Linear Algebra) computation
for improved performance.
--nouse_xla
--use_dali Enable DALI data input.
--nouse_dali
--use_tf_amp Enable AMP to speedup FP32
computation using Tensor Cores.
--nouse_tf_amp
--use_cosine_lr Use cosine learning rate schedule.
--nouse_cosine_lr
--seed SEED Random seed.
--gpu_memory_fraction GPU_MEMORY_FRACTION
Limit memory fraction used by the training script for DALI
--gpu_id GPU_ID Specify the ID of the target GPU on a multi-device platform.
Effective only for single-GPU mode.
--data_format {NHWC,NCHW}
Data format used to do calculations.
--amp Enable Automatic Mixed Precision to speedup
computation using tensor cores.
Automatic Mixed Precision arguments:
--static_loss_scale STATIC_LOSS_SCALE
Use static loss scaling in FP32 AMP.
```
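As an illustration of how these arguments combine, a multi-GPU training command assembled from the flags above could look like the following; the dataset and results paths are placeholders, and the hyperparameter values mirror the DGX-1 AMP training scripts in this repository:
```
mpiexec --allow-run-as-root --bind-to socket -np 8 python3 main.py \
    --arch=resnext101-32x4d --mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
    --batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
    --lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
    --amp --static_loss_scale 128 --xla --dali \
    --data_dir=/data/tfrecords --data_idx_dir=/data/dali_idx --results_dir=/results
```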
### Inference process
@ -395,7 +416,7 @@ To run inference on a single example with a checkpoint and a model script, use:
`python main.py --arch=resnext101-32x4d --mode predict --model_dir <path to model> --to_predict <path to image> --results_dir <path to results>`
The optional `--use_xla` and `--use_tf_amp` flags control XLA and AMP during inference.
The optional `--xla` and `--amp` flags control XLA and AMP during inference.
## Performance
@ -414,7 +435,7 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --use_tf_amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --amp --warmup_steps 200 --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
* For multiple GPUs
* FP32 / TF32
@ -423,16 +444,16 @@ To benchmark the training performance on a specific batch size, run:
* AMP
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --use_tf_amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`mpiexec --allow-run-as-root --bind-to socket -np <num_gpus> python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --amp --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
Each of these scripts runs 200 warm-up iterations and measures the first epoch.
To control warmup and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags. Features like XLA or DALI can be controlled
with `--use_xla` and `--use_dali` flags. If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
For proper throughput reporting the value of `--num_iter` must be greater than `--warmup_steps` value.
with the `--xla` and `--dali` flags. For proper throughput reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
Suggested batch sizes for training are 128 for mixed precision training and 64 for single precision training per single V100 16 GB.
If no `--data_dir=<path to imagenet>` flag is specified, the benchmarks will use a synthetic dataset. The resolution of the synthetic images can be controlled with the `--synthetic_data_size` flag.
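For instance, a training benchmark on synthetic data only (no `--data_dir`) might be run as follows; the batch size, iteration counts, and image resolution are placeholders:
```
python ./main.py --arch=resnext101-32x4d --mode=training_benchmark --amp \
    --warmup_steps 200 --num_iter 500 --iter_unit batch --batch_size 128 \
    --synthetic_data_size 224 --results_dir=/results/benchmark
```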
#### Inference performance benchmark
@ -444,11 +465,10 @@ To benchmark the inference performance on a specific batch size, run:
* AMP
`python ./main.py --arch=resnext101-32x4d --mode=inference_benchmark --use_tf_amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
`python ./main.py --arch=resnext101-32x4d --mode=inference_benchmark --amp --warmup_steps 20 --num_iter 100 --iter_unit batch --batch_size <batch size> --data_dir=<path to imagenet> --results_dir=<path to results directory>`
By default, each of these scripts runs 20 warm-up iterations and measures the next 80 iterations.
To control warm-up and benchmark length, use the `--warmup_steps`, `--num_iter` and `--iter_unit` flags.
For proper throughput and latency reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
If no `--data_dir=<path to imagenet>` flag is specified then the benchmarks will use a synthetic dataset.
The benchmark can be automated with the `inference_benchmark.sh` script provided in `resnext101-32x4d`, by simply running:
@ -457,6 +477,9 @@ The benchmark can be automated with the `inference_benchmark.sh` script provided
The `<data dir>` parameter refers to the input data directory (by default `/data/tfrecords` inside the container).
By default, the benchmark tests the following configurations: **FP32**, **AMP**, **AMP + XLA** with different batch sizes.
When the optional directory with the DALI index files `<data idx dir>` is specified, the benchmark executes an additional **DALI + AMP + XLA** configuration.
For proper throughput reporting, the value of `--num_iter` must be greater than the `--warmup_steps` value.
For a performance benchmark of the raw model, a synthetic dataset can be used. To use a synthetic dataset, pass the `--synthetic_data_size` flag instead of `--data_dir` to specify the input image size.
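Assuming the script takes the data directory and, optionally, the DALI index directory as positional arguments (as described above), the automated benchmark could be launched as follows; the paths are placeholders:
```
# The second argument is optional and enables the additional DALI + AMP + XLA configuration.
bash ./resnext101-32x4d/inference_benchmark.sh /data/tfrecords /data/dali_idx
```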
### Results
@ -769,6 +792,9 @@ on NVIDIA T4 with (1x T4 16G) GPU.
June 2020
- Initial release
August 2020
- Updated command line argument names
- Added support for synthetic datasets with different image sizes
### Known issues
Performance without XLA enabled is low. We recommend using XLA.
Performance without XLA enabled is low due to a BN + ReLU fusion bug.

View file

@ -22,12 +22,12 @@ function test_configuration() {
}
test_configuration "FP32 nodali noxla"
test_configuration "FP32 nodali xla" "--use_xla"
test_configuration "FP16 nodali noxla" "--use_tf_amp"
test_configuration "FP16 nodali xla" "--use_tf_amp --use_xla"
test_configuration "FP32 nodali xla" "--xla"
test_configuration "FP16 nodali noxla" "--amp"
test_configuration "FP16 nodali xla" "--amp --xla"
if [ ! -z $DALI_DIR ]; then
test_configuration "FP16 dali xla" "--use_tf_amp --use_xla --use_dali --data_idx_dir ${DALI_DIR}"
test_configuration "FP16 dali xla" "--amp --xla --dali --data_idx_dir ${DALI_DIR}"
fi
cat $INFERENCE_BENCHMARK

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=250 --mixup=0.2 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 16 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=64 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=64 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,9 +25,9 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=256 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=256 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--use_tf_amp --use_static_loss_scaling --loss_scale 128 \
--amp --static_loss_scale 128 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

View file

@ -25,7 +25,7 @@ fi
mpiexec --allow-run-as-root ${BIND_TO_SOCKET} -np 8 python3 main.py --arch=resnext101-32x4d \
--mode=train_and_evaluate --iter_unit=epoch --num_iter=90 \
--batch_size=128 --warmup_steps=100 --use_cosine --label_smoothing 0.1 \
--batch_size=128 --warmup_steps=100 --cosine_lr --label_smoothing 0.1 \
--lr_init=0.256 --lr_warmup_epochs=8 --momentum=0.875 --weight_decay=6.103515625e-05 \
--data_dir=${DATA_DIR}/tfrecords --data_idx_dir=${DATA_DIR}/dali_idx \
--results_dir=${WORKSPACE}/results --weight_init=fan_in ${OTHER}

Some files were not shown because too many files have changed in this diff