Merge pull request #502 from swethmandava/master

BERT module update
This commit is contained in:
Swetha Mandava 2020-05-18 10:27:19 -07:00 committed by GitHub
commit f3786969a3
30 changed files with 985 additions and 727 deletions

View file

@ -1,4 +1,4 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:19.10-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.03-tf1-py3
FROM ${FROM_IMAGE_NAME}
@ -13,10 +13,11 @@ RUN git clone https://github.com/attardi/wikiextractor.git
RUN git clone https://github.com/soskek/bookcorpus.git
RUN git clone https://github.com/titipata/pubmed_parser
RUN pip3 install /workspace/pubmed_parser
#Copy the perf_client over
ARG TRTIS_CLIENTS_URL=https://github.com/NVIDIA/tensorrt-inference-server/releases/download/v1.5.0/v1.5.0_ubuntu1804.clients.tar.gz
ARG TRTIS_CLIENTS_URL=https://github.com/NVIDIA/triton-inference-server/releases/download/v1.12.0/v1.12.0_ubuntu1804.clients.tar.gz
RUN mkdir -p /workspace/install \
&& curl -L ${TRTIS_CLIENTS_URL} | tar xvz -C /workspace/install

View file

@ -28,7 +28,7 @@ This repository provides a script and recipe to train the BERT model for TensorF
* [Multi-node](#multi-node)
* [Inference process](#inference-process)
* [Inference Process With TensorRT](#inference-process-with-tensorrt)
* [Deploying the BERT model using TensorRT Inference Server](#deploying-the-bert-model-using-tensorrt-inference-server)
* [Deploying the BERT model using Triton Inference Server](#deploying-the-bert-model-using-triton-inference-server)
* [BioBERT](#biobert)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
@ -619,9 +619,9 @@ I0312 23:14:00.550973 140287431493376 run_squad.py:1397] 0 Inference Performance
### Inference Process With TensorRT
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. More information on how to perform inference using TensorRT can be found in the subfolder [./trt/README.md](trt/README.md).
### Deploying the BERT model using TensorRT Inference Server
### Deploying the BERT model using Triton Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `TensorRT Inference Server` can be found in the subfolder `./trtis/README.md`.
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `Triton Inference Server` can be found in the subfolder `./triton/README.md`.
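A quick way to confirm that a locally launched server is up before sending requests is a readiness probe like the sketch below. It assumes the v1 HTTP endpoint layout used by the client release pinned in this commit (v1.12.0), i.e. port `8000` and `/api/health/ready`; newer Triton releases expose `/v2/health/ready` instead, so adjust the URL accordingly.

```python
# Minimal readiness probe, assuming the v1 HTTP API on the default port 8000.
import requests  # third-party: pip install requests

resp = requests.get("http://localhost:8000/api/health/ready", timeout=5)
if resp.status_code == 200:
    print("Inference server is ready to accept requests")
else:
    print("Server not ready yet (HTTP %d)" % resp.status_code)
```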
### BioBERT
@ -1153,7 +1153,7 @@ September 2019
July 2019
- Results obtained using 19.06
- Inference Studies using TensorRT Inference Server
- Inference Studies using Triton Inference Server
March 2019
- Initial release

View file

@ -41,24 +41,10 @@ Here is a short description of each relevant file:
### 2.a Build the BERT TensorFlow NGC container:
To run the notebook you first need to build the BERT TensorFlow container using the following command from the main directory of this repository:
``` bash
```bash
docker build . --rm -t bert
```
### 2.b Dataset
We need to download the vocabulary and the bert_config files:
``` python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
```
This is only needed during fine-tuning in order to download the Squad dataset:
``` python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
```
### 2.c Start of the NGC container to run inference:
### 2.b Start of the NGC container to run inference:
Once the image is built, you need to run the container with the `--publish
0.0.0.0:8888:8888` option to publish Jupyter's port `8888` to the host machine
at port `8888` over all network interfaces (`0.0.0.0`):
@ -74,7 +60,23 @@ nvidia-docker run \
-it bert:latest bash
```
Then you can use the following command within the BERT Tensorflow container under
### 2.c Dataset
We need to download the vocabulary and the bert_config files:
```python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
```
This is only needed during fine-tuning in order to download the SQuAD dataset:
```python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
```
### 2.d Starting Jupyter Notebook
Now you can use the following command within the BERT Tensorflow container under
`/workspace/bert`:
```bash
@ -123,7 +125,7 @@ Here is a short description of the relevant file:
### 2.a Build the BERT TensorFlow NGC container:
To run the notebook you first need to build the BERT TensorFlow container using the following command from the main directory of this repository:
``` bash
```bash
docker build . --rm -t bert
```
### 2.b Start of the NGC container to run inference:

View file

@ -84,7 +84,7 @@
"import os\n",
"import sys\n",
"\n",
"data_dir = '/workspace/bert/data/download'\n",
"data_dir = '../data/download'\n",
"\n",
"# SQuAD json for training\n",
"train_file = os.path.join(data_dir, 'squad/v1.1/train-v1.1.json')\n",
@ -141,7 +141,7 @@
"|BERTBASE |12 encoder| 768| 12|4 x 768|512|110M|\n",
"|BERTLARGE|24 encoder|1024| 16|4 x 1024|512|330M|\n",
"\n",
"We will large use pre-trained models avaialble on NGC (NVIDIA GPU Cluster, https://ngc.nvidia.com).\n",
"We will use large pre-trained models avaialble on NGC (NVIDIA GPU Cluster, https://ngc.nvidia.com).\n",
"There are many configuration available, in particular we will download and use the following:\n",
"\n",
"**bert_tf_large_fp16_384**\n",
@ -164,7 +164,7 @@
"outputs": [],
"source": [
"# bert_tf_large_fp16_384\n",
"DATA_DIR_FP16 = '/workspace/bert/data/download/pretrained_model_fp16'\n",
"DATA_DIR_FP16 = data_dir + '/pretrained_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/bert_for_tensorflow.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_for_tensorflow/versions/1/zip\n",
@ -184,9 +184,9 @@
"metadata": {},
"outputs": [],
"source": [
"notebooks_dir = '/workspace/bert/notebooks'\n",
"notebooks_dir = '../notebooks'\n",
"\n",
"working_dir = '/workspace/bert'\n",
"working_dir = '..'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)\n",
"\n",
@ -257,7 +257,10 @@
"if 'f' not in tf.flags.FLAGS: \n",
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"# FLAGS.verbose_logging = True\n",
"\n",
"verbose_logging = True\n",
"# Set to True if the dataset has samples with no answers. For SQuAD 1.1, this is set to False\n",
"version_2_with_negative = False\n",
"\n",
"# The total number of n-best predictions to generate in the nbest_predictions.json output file.\n",
"n_best_size = 20\n",
@ -536,7 +539,10 @@
" end_logits=end_logits))\n",
"\n",
"eval_time_elapsed = time.time() - eval_start_time\n",
"eval_time_wo_startup = eval_hooks[-1].total_time\n",
"\n",
"time_list = eval_hooks[-1].time_list\n",
"time_list.sort()\n",
"eval_time_wo_startup = sum(time_list[:int(len(time_list) * 0.99)])\n",
"num_sentences = eval_hooks[-1].count * predict_batch_size\n",
"avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup\n",
"\n",
@ -554,7 +560,8 @@
"run_squad.write_predictions(eval_examples, eval_features, all_results,\n",
" n_best_size, max_answer_length,\n",
" do_lower_case, output_prediction_file,\n",
" output_nbest_file, output_null_log_odds_file)\n",
" output_nbest_file, output_null_log_odds_file,\n",
" version_2_with_negative, verbose_logging)\n",
"\n",
"tf.logging.info(\"Inference Results:\")\n",
"\n",
@ -585,7 +592,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python /workspace/bert/data/download/squad/v1.1/evaluate-v1.1.py \\\n",
"!python ../data/download/squad/v1.1/evaluate-v1.1.py \\\n",
" $predict_file \\\n",
" $output_dir/predictions.json"
]
@ -596,7 +603,7 @@
"source": [
"## 6. What's next\n",
"\n",
"Now that you have fine-tuned a BERT model you may want to take a look ad the run_squad script which containd more options for fine-tuning."
"Now that you have fine-tuned a BERT model you may want to take a look at the run_squad script which containd more options for fine-tuning."
]
}
],
@ -616,7 +623,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -162,10 +162,10 @@
"import os\n",
"import sys\n",
"\n",
"notebooks_dir = '/workspace/bert/notebooks'\n",
"data_dir = '/workspace/bert/data/download'\n",
"notebooks_dir = '../notebooks'\n",
"data_dir = '../data/download'\n",
"\n",
"working_dir = '/workspace/bert'\n",
"working_dir = '../'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)"
]
@ -239,7 +239,7 @@
"metadata": {},
"outputs": [],
"source": [
"#input_file = '/workspace/bert/data/download/squad/v2.0/dev-v2.0.json'"
"#input_file = '../data/download/squad/v2.0/dev-v2.0.json'"
]
},
{
@ -272,14 +272,14 @@
"outputs": [],
"source": [
"# bert_tf_v2_large_fp32_384\n",
"DATA_DIR_FP32='/workspace/bert/data/download/finetuned_model_fp32'\n",
"DATA_DIR_FP32 = data_dir + '/finetuned_model_fp32'\n",
"!mkdir -p $DATA_DIR_FP32\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP32/bert_tf_v2_large_fp32_384.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v2_large_fp32_384/versions/1/zip\n",
"!unzip -n -d $DATA_DIR_FP32/ $DATA_DIR_FP32/bert_tf_v2_large_fp32_384.zip \n",
" \n",
"# bert_tf_v2_large_fp16_384\n",
"DATA_DIR_FP16='/workspace/bert/data/download/finetuned_model_fp16'\n",
"DATA_DIR_FP16 = data_dir + '/finetuned_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/bert_tf_v2_large_fp16_384.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v2_large_fp16_384/versions/1/zip\n",
@ -363,6 +363,10 @@
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"\n",
"verbose_logging = True\n",
"# Set to True if the dataset has samples with no answers. For SQuAD 1.1, this is set to False\n",
"version_2_with_negative = False\n",
"\n",
"# The total number of n-best predictions to generate in the nbest_predictions.json output file.\n",
"n_best_size = 20\n",
"\n",
@ -501,7 +505,9 @@
"\n",
"eval_time_elapsed = time.time() - eval_start_time\n",
"\n",
"eval_time_wo_startup = eval_hooks[-1].total_time\n",
"time_list = eval_hooks[-1].time_list\n",
"time_list.sort()\n",
"eval_time_wo_startup = sum(time_list[:int(len(time_list) * 0.99)])\n",
"num_sentences = eval_hooks[-1].count * predict_batch_size\n",
"avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup\n",
"\n",
@ -519,7 +525,8 @@
"run_squad.write_predictions(eval_examples, eval_features, all_results,\n",
" n_best_size, max_answer_length,\n",
" do_lower_case, output_prediction_file,\n",
" output_nbest_file, output_null_log_odds_file)\n",
" output_nbest_file, output_null_log_odds_file,\n",
" version_2_with_negative, verbose_logging)\n",
"\n",
"tf.logging.info(\"Inference Results:\")\n",
"\n",
@ -569,7 +576,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -527,6 +527,10 @@
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"\n",
"verbose_logging = True\n",
"# Set to True if the dataset has samples with no answers. For SQuAD 1.1, this is set to False\n",
"version_2_with_negative = False\n",
"\n",
"# The total number of n-best predictions to generate in the nbest_predictions.json output file.\n",
"n_best_size = 20\n",
"\n",
@ -678,7 +682,9 @@
"\n",
"eval_time_elapsed = time.time() - eval_start_time\n",
"\n",
"eval_time_wo_startup = eval_hooks[-1].total_time\n",
"time_list = eval_hooks[-1].time_list\n",
"time_list.sort()\n",
"eval_time_wo_startup = sum(time_list[:int(len(time_list) * 0.99)])\n",
"num_sentences = eval_hooks[-1].count * predict_batch_size\n",
"avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup\n",
"\n",
@ -757,7 +763,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -169,8 +169,8 @@
"import os\n",
"import sys\n",
"\n",
"notebooks_dir = '/workspace/bert/notebooks'\n",
"working_dir = '/workspace/bert'\n",
"notebooks_dir = '../notebooks'\n",
"working_dir = '../'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)"
]
@ -267,7 +267,7 @@
"outputs": [],
"source": [
"# biobert_uncased_base_ner_disease\n",
"DATA_DIR_FP16='/workspace/bert/data/download/finetuned_model_fp16'\n",
"DATA_DIR_FP16 = '../data/download/finetuned_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/biobert_uncased_base_ner_disease.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/biobert_uncased_base_ner_disease/versions/1/zip\n",
@ -602,7 +602,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -95,8 +95,15 @@ def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
if hvd is not None and (num_accumulation_steps == 1 or (not allreduce_post_accumulation)):
optimizer = hvd.DistributedOptimizer(optimizer, sparse_as_dense=True, compression=Compression.fp16 if use_fp16 or manual_fp16 else Compression.none)
if manual_fp16 or use_fp16:
loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(init_loss_scale=2**32, incr_every_n_steps=1000, decr_every_n_nan_or_inf=2, decr_ratio=0.5)
if use_fp16:
loss_scaler = tf.train.experimental.DynamicLossScale(initial_loss_scale=2**32, increment_period=1000, multiplier=2.0)
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer, loss_scaler)
loss_scale_value = tf.identity(loss_scaler(), name="loss_scale")
if manual_fp16:
loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(init_loss_scale=2 ** 32,
incr_every_n_steps=1000,
decr_every_n_nan_or_inf=2,
decr_ratio=0.5)
optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(optimizer, loss_scale_manager)
tvars = tf.trainable_variables()
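Condensed, the loss-scaling change above does the following: with `--use_fp16`, automatic mixed precision is enabled through TensorFlow's graph rewrite together with a dynamic loss scale, while `--manual_fp16` keeps the older `tf.contrib` loss-scale manager. The sketch below restates just that selection with the same hyperparameters; it is a simplified rendering of the diff, targeting the TF 1.15-era API shipped in the 20.03-tf1 container.

```python
# Sketch of the fp16 loss-scaling selection (TF 1.15-era APIs, mirroring the diff above).
import tensorflow as tf

def wrap_with_loss_scaling(optimizer, use_fp16, manual_fp16):
    if use_fp16:
        # Dynamic loss scale: start huge, halve on overflow, double every 1000 good steps.
        loss_scaler = tf.train.experimental.DynamicLossScale(
            initial_loss_scale=2**32, increment_period=1000, multiplier=2.0)
        # Auto mixed precision: rewrites eligible ops to fp16 and applies the scaler.
        optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(
            optimizer, loss_scaler)
        tf.identity(loss_scaler(), name="loss_scale")  # expose the current scale for logging hooks
    elif manual_fp16:
        loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32, incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2, decr_ratio=0.5)
        optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(optimizer, loss_scale_manager)
    return optimizer
```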
@ -149,7 +156,10 @@ def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
update_op = tf.cond(update_step,
lambda: update(accum_vars), lambda: tf.no_op())
new_global_step = tf.cond(tf.math.logical_and(update_step, tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)), lambda: global_step+1, lambda: global_step)
new_global_step = tf.cond(tf.math.logical_and(update_step,
tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)) if hvd is not None else batch_finite,
lambda: global_step+1,
lambda: global_step)
new_global_step = tf.identity(new_global_step, name='step_update')
train_op = tf.group(update_op, [global_step.assign(new_global_step)])
else:
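The other change in this file gates the global-step increment during gradient accumulation: the step only advances on an update micro-step whose accumulated gradients look finite, checked across ranks via a Horovod allreduce when multi-GPU training is on. A simplified sketch of that guard, assuming `update_step` and `batch_finite` are the scalar boolean tensors produced earlier in `create_optimizer`, is:

```python
# Simplified guard for advancing the global step (TF1 graph mode; hvd is Horovod or None).
import tensorflow as tf

def next_global_step(global_step, update_step, batch_finite, hvd=None):
    if hvd is not None:
        # Allreduce the per-rank finiteness flags and cast back to bool (any non-zero
        # value counts as True), mirroring the check in the diff above.
        all_finite = tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)
        should_step = tf.math.logical_and(update_step, all_finite)
    else:
        should_step = tf.math.logical_and(update_step, batch_finite)
    return tf.cond(should_step, lambda: global_step + 1, lambda: global_step)
```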

View file

@ -61,7 +61,7 @@ flags.DEFINE_string(
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -301,6 +301,7 @@ def model_fn_builder(task_name, bert_config, num_labels, init_checkpoint, learni
"eval_loss": loss,
}
tf.compat.v1.logging.info("*** Features ***")
tf.compat.v1.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.compat.v1.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
@ -428,13 +429,14 @@ def input_fn_builder(features, batch_size, seq_length, is_training, drop_remaind
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"cola": ColaProcessor,
"mnli": MnliProcessor,
@ -597,11 +599,12 @@ def main(_):
result = estimator.evaluate(input_fn=eval_input_fn, hooks=eval_hooks)
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.eval_batch_size
avg = np.mean(time_list)
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
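The same outlier-trimming pattern recurs in run_ner.py, run_pretraining.py, the BioBERT classifier, and the notebooks: the per-batch latencies collected by the eval hook are sorted and the slowest ~1% are dropped before computing throughput, so init and warmup spikes no longer skew the reported sentences/sec. A self-contained toy version of the computation, with made-up timings, is:

```python
# Toy version of the outlier-trimmed throughput above (timings are made up).
import numpy as np

time_list = [0.012, 0.011, 0.013, 0.011, 0.012, 0.250]  # seconds per batch; last one is a warmup spike
batch_size = 8

time_list.sort()                               # ascending, so the spikes end up at the tail
kept = time_list[:int(len(time_list) * 0.99)]  # drop the slowest ~1% (here, the single spike)
eval_time_wo_overhead = sum(kept)
num_sentences = len(kept) * batch_size
print("throughput: %.1f sentences/sec" % (num_sentences / eval_time_wo_overhead))

# Latency percentiles are read off the same sorted list, e.g. the median:
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
avg = np.mean(time_list)
```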
@ -615,7 +618,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics on EVAL set")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -57,7 +57,7 @@ flags.DEFINE_string(
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -174,7 +174,7 @@ class DataProcessor(object):
@classmethod
def _read_data(cls, input_file):
"""Reads a BIO data."""
with tf.io.gfile.Open(input_file, "r") as f:
with open(input_file, "r") as f:
lines = []
words = []
labels = []
@ -321,8 +321,8 @@ def convert_single_example(ex_index, example, label_list, max_seq_length, tokeni
for (i, label) in enumerate(label_list, 1):
label_map[label] = i
label2id_file = os.path.join(FLAGS.output_dir, 'label2id.pkl')
if not tf.io.gfile.Exists(label2id_file):
with tf.io.gfile.Open(label2id_file, 'wb') as w:
if not os.path.exists(label2id_file):
with open(label2id_file, 'wb') as w:
pickle.dump(label_map, w)
textlist = example.text.split(' ')
labellist = example.label.split(' ')
@ -613,13 +613,13 @@ def result_to_pair(predict_line, pred_ids, id2label, writer, err_writer):
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"bc5cdr": BC5CDRProcessor,
@ -828,11 +828,12 @@ def main(_):
i = i + 1
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.predict_batch_size
avg = np.mean(time_list)
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
@ -846,7 +847,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -56,7 +56,7 @@ flags.DEFINE_string(
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -125,18 +125,24 @@ flags.DEFINE_bool("use_fp16", False, "Whether to enable AMP ops.")
# report samples/sec, total loss and learning rate during training
class _LogSessionRunHook(tf.estimator.SessionRunHook):
def __init__(self, global_batch_size, num_accumulation_steps, dllogging, display_every=10, hvd_rank=-1):
def __init__(self, global_batch_size, num_accumulation_steps, dllogging, display_every=10,
save_ckpt_steps=1000, report_loss=True, hvd_rank=-1):
self.global_batch_size = global_batch_size
self.display_every = display_every
self.save_ckpt_steps = save_ckpt_steps
self.hvd_rank = hvd_rank
self.num_accumulation_steps = num_accumulation_steps
self.dllogging = dllogging
self.report_loss = report_loss
def after_create_session(self, session, coord):
self.elapsed_secs = 0.
self.count = 0
self.all_count = 0
self.avg_loss = 0.0
self.elapsed_secs = 0.0 #elapsed seconds between every print
self.count = 0 # number of global steps between every print
self.all_count = 0 #number of steps (including accumulation) between every print
self.loss = 0.0 # accumulation of loss in each step between every print
self.total_time = 0.0 # total time taken to train (excluding warmup + ckpt saving steps)
self.init_global_step = session.run(tf.train.get_global_step()) # training starts at init_global_step
def before_run(self, run_context):
self.t0 = time.time()
@ -153,40 +159,49 @@ class _LogSessionRunHook(tf.estimator.SessionRunHook):
'mlm_loss:0'])
else:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
return tf.estimator.SessionRunArgs(
fetches=['step_update:0', 'update_step:0', 'total_loss:0',
'learning_rate:0', 'nsp_loss:0',
'mlm_loss:0', 'loss_scale:0'])
return tf.estimator.SessionRunArgs(
fetches=['step_update:0', 'update_step:0', 'total_loss:0',
'learning_rate:0', 'nsp_loss:0',
'mlm_loss:0', 'loss_scale:0'])
else:
return tf.estimator.SessionRunArgs(
fetches=['step_update:0', 'update_step:0', 'total_loss:0',
'learning_rate:0', 'nsp_loss:0',
'mlm_loss:0'])
def after_run(self, run_context, run_values):
self.elapsed_secs += time.time() - self.t0
run_time = time.time() - self.t0
if self.num_accumulation_steps <=1:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
global_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
self.global_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
else:
global_step, total_loss, lr, nsp_loss, mlm_loss = run_values. \
self.global_step, total_loss, lr, nsp_loss, mlm_loss = run_values. \
results
update_step = True
else:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
global_step, update_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
self.global_step, update_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
else:
global_step, update_step, total_loss, lr, nsp_loss, mlm_loss = run_values.\
self.global_step, update_step, total_loss, lr, nsp_loss, mlm_loss = run_values.\
results
print_step = global_step + 1 # One-based index for printing.
self.avg_loss += total_loss
# Removing the first few steps after every checkpoint save from timing
if (self.global_step - self.init_global_step) % self.save_ckpt_steps <= 5:
print("Skipping time record for ", self.global_step, " due to checkpoint-saving/warmup overhead")
else:
self.total_time += run_time
self.elapsed_secs += run_time
print_step = self.global_step + 1 # One-based index for printing.
self.loss += total_loss
self.all_count += 1
if update_step:
self.count += 1
if (print_step == 1 or print_step % self.display_every == 0):
dt = self.elapsed_secs / self.count
sent_per_sec = self.global_batch_size / dt
avg_loss_step = self.avg_loss / self.all_count
if self.hvd_rank >= 0:
avg_loss_step = self.loss / self.all_count
if self.hvd_rank >= 0 and FLAGS.report_loss:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
self.dllogging.logger.log(step=(print_step),
data={"Rank": int(rank), "throughput_train": float(sent_per_sec),
@ -216,10 +231,15 @@ class _LogSessionRunHook(tf.estimator.SessionRunHook):
"total_loss":float(total_loss), "avg_loss_step":float(avg_loss_step),
"learning_rate": str(lr)},
verbosity=Verbosity.DEFAULT)
self.elapsed_secs = 0.
self.elapsed_secs = 0.0
self.count = 0
self.avg_loss = 0.0
self.loss = 0.0
self.all_count = 0
def end(self, session):
num_global_steps = self.global_step - self.init_global_step
self.skipped = (num_global_steps // self.save_ckpt_steps) * 5 + \
min(5, num_global_steps % self.save_ckpt_steps)
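The reworked `_LogSessionRunHook` now also excludes checkpoint-saving and warmup overhead from `total_time`: steps that fall within the first few global steps after each checkpoint save are not timed, and `end()` records how many were skipped so the pretraining summary can report throughput without that overhead. A toy check of the bookkeeping, with hypothetical numbers, is:

```python
# Toy check of the skipped-step bookkeeping above (numbers are hypothetical).
save_ckpt_steps = 1000
num_global_steps = 2500  # global steps run during this session

# Roughly 5 steps are excluded per completed checkpoint interval, plus the first
# min(5, remainder) steps of the final partial interval.
skipped = (num_global_steps // save_ckpt_steps) * 5 + min(5, num_global_steps % save_ckpt_steps)
print(skipped)  # 2 * 5 + 5 = 15 steps excluded from total_time
```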
def model_fn_builder(bert_config, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps,
@ -519,15 +539,14 @@ def _decode_record(record, name_to_features):
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if not FLAGS.do_train and not FLAGS.do_eval:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
if FLAGS.horovod:
import horovod.tensorflow as hvd
hvd.init()
@ -564,6 +583,7 @@ def main(_):
model_dir=FLAGS.output_dir,
session_config=config,
save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
save_summary_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
# This variable controls how often estimator reports examples/sec.
# Default value is every 100 steps.
# When --report_loss is True, we set to very large value to prevent
@ -580,18 +600,19 @@ def main(_):
use_one_hot_embeddings=False,
hvd=None if not FLAGS.horovod else hvd)
training_hooks = []
if FLAGS.report_loss and (not FLAGS.horovod or hvd.rank() == 0):
global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps if not FLAGS.horovod else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * hvd.size()
training_hooks.append(_LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps, dllogging, FLAGS.display_loss_steps))
if FLAGS.horovod and hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
estimator = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config)
if FLAGS.do_train:
training_hooks = []
if FLAGS.horovod and hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
if (not FLAGS.horovod or hvd.rank() == 0):
global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps if not FLAGS.horovod else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * hvd.size()
training_hooks.append(_LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps, dllogging, FLAGS.display_loss_steps, FLAGS.save_checkpoints_steps, FLAGS.report_loss))
tf.compat.v1.logging.info("***** Running training *****")
tf.compat.v1.logging.info(" Batch size = %d", FLAGS.train_batch_size)
train_input_fn = input_fn_builder(
@ -602,7 +623,24 @@ def main(_):
is_training=True,
hvd=None if not FLAGS.horovod else hvd)
train_start_time = time.time()
estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
train_time_elapsed = time.time() - train_start_time
if (not FLAGS.horovod or hvd.rank() == 0):
train_time_wo_overhead = training_hooks[-1].total_time
avg_sentences_per_second = FLAGS.num_train_steps * global_batch_size * 1.0 / train_time_elapsed
ss_sentences_per_second = (FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
tf.compat.v1.logging.info("-----------------------------")
tf.compat.v1.logging.info("Total Training Time = %0.2f for Sentences = %d", train_time_elapsed,
FLAGS.num_train_steps * global_batch_size)
tf.compat.v1.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d", train_time_wo_overhead,
(FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size)
tf.compat.v1.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
tf.compat.v1.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
dllogging.logger.log(step=(), data={"throughput_train": ss_sentences_per_second}, verbosity=Verbosity.DEFAULT)
tf.compat.v1.logging.info("-----------------------------")
if FLAGS.do_eval and (not FLAGS.horovod or hvd.rank() == 0):
tf.compat.v1.logging.info("***** Running evaluation *****")
@ -626,9 +664,11 @@ def main(_):
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps, hooks=eval_hooks)
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
time_list = eval_hooks[-1].time_list
time_list.sort()
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.eval_batch_size
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
@ -636,7 +676,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics on EVAL set")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -65,7 +65,7 @@ flags.DEFINE_string(
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -719,13 +719,13 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"chemprot": BioBERTChemprotProcessor,
@ -946,11 +946,12 @@ def main(_):
assert num_written_lines == num_actual_predict_examples
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.predict_batch_size
avg = np.mean(time_list)
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
@ -964,7 +965,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -41,134 +41,139 @@ import utils.dllogger_class
from dllogger import Verbosity
flags = tf.flags
FLAGS = None
FLAGS = flags.FLAGS
def extract_run_squad_flags():
## Required parameters
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
## Required parameters
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
## Other parameters
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string("train_file", None,
"SQuAD json for training. E.g., train-v1.1.json")
flags.DEFINE_string("train_file", None,
"SQuAD json for training. E.g., train-v1.1.json")
flags.DEFINE_string(
"predict_file", None,
"SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
flags.DEFINE_string(
"eval_script", None,
"SQuAD evaluate.py file to compute f1 and exact_match E.g., evaluate-v1.1.py")
flags.DEFINE_string(
"predict_file", None,
"SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
flags.DEFINE_string(
"eval_script", None,
"SQuAD evaluate.py file to compute f1 and exact_match E.g., evaluate-v1.1.py")
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 384,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_integer(
"max_seq_length", 384,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_integer(
"doc_stride", 128,
"When splitting up a long document into chunks, how much stride to "
"take between chunks.")
flags.DEFINE_integer(
"doc_stride", 128,
"When splitting up a long document into chunks, how much stride to "
"take between chunks.")
flags.DEFINE_integer(
"max_query_length", 64,
"The maximum number of tokens for the question. Questions longer than "
"this will be truncated to this length.")
flags.DEFINE_integer(
"max_query_length", 64,
"The maximum number of tokens for the question. Questions longer than "
"this will be truncated to this length.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_predict", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool("do_predict", False, "Whether to run eval on the dev set.")
flags.DEFINE_integer("train_batch_size", 8, "Total batch size for training.")
flags.DEFINE_integer("train_batch_size", 8, "Total batch size for training.")
flags.DEFINE_integer("predict_batch_size", 8,
"Total batch size for predictions.")
flags.DEFINE_integer("predict_batch_size", 8,
"Total batch size for predictions.")
flags.DEFINE_float("learning_rate", 5e-6, "The initial learning rate for Adam.")
flags.DEFINE_float("learning_rate", 5e-6, "The initial learning rate for Adam.")
flags.DEFINE_bool("use_trt", False, "Whether to use TF-TRT")
flags.DEFINE_bool("use_trt", False, "Whether to use TF-TRT")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_integer("num_accumulation_steps", 1,
"Number of accumulation steps before gradient update"
"Global batch size = num_accumulation_steps * train_batch_size")
flags.DEFINE_integer("num_accumulation_steps", 1,
"Number of accumulation steps before gradient update"
"Global batch size = num_accumulation_steps * train_batch_size")
flags.DEFINE_integer(
"n_best_size", 20,
"The total number of n-best predictions to generate in the "
"nbest_predictions.json output file.")
flags.DEFINE_integer(
"n_best_size", 20,
"The total number of n-best predictions to generate in the "
"nbest_predictions.json output file.")
flags.DEFINE_integer(
"max_answer_length", 30,
"The maximum length of an answer that can be generated. This is needed "
"because the start and end predictions are not conditioned on one another.")
flags.DEFINE_integer(
"max_answer_length", 30,
"The maximum length of an answer that can be generated. This is needed "
"because the start and end predictions are not conditioned on one another.")
flags.DEFINE_bool(
"verbose_logging", False,
"If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
flags.DEFINE_bool(
"verbose_logging", False,
"If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
flags.DEFINE_bool(
"version_2_with_negative", False,
"If true, the SQuAD examples contain some that do not have an answer.")
flags.DEFINE_bool(
"version_2_with_negative", False,
"If true, the SQuAD examples contain some that do not have an answer.")
flags.DEFINE_float(
"null_score_diff_threshold", 0.0,
"If null_score - best_non_null is greater than the threshold predict null.")
flags.DEFINE_float(
"null_score_diff_threshold", 0.0,
"If null_score - best_non_null is greater than the threshold predict null.")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
flags.DEFINE_integer("num_eval_iterations", None,
"How many eval iterations to run - performs inference on subset")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
flags.DEFINE_integer("num_eval_iterations", None,
"How many eval iterations to run - performs inference on subset")
# TRTIS Specific flags
flags.DEFINE_bool("export_trtis", False, "Whether to export saved model or run inference with TRTIS")
flags.DEFINE_string("trtis_model_name", "bert", "exports to appropriate directory for TRTIS")
flags.DEFINE_integer("trtis_model_version", 1, "exports to appropriate directory for TRTIS")
flags.DEFINE_string("trtis_server_url", "localhost:8001", "exports to appropriate directory for TRTIS")
flags.DEFINE_bool("trtis_model_overwrite", False, "If True, will overwrite an existing directory with the specified 'model_name' and 'version_name'")
flags.DEFINE_integer("trtis_max_batch_size", 8, "Specifies the 'max_batch_size' in the TRTIS model config. See the TRTIS documentation for more info.")
flags.DEFINE_float("trtis_dyn_batching_delay", 0, "Determines the dynamic_batching queue delay in milliseconds(ms) for the TRTIS model config. Use '0' or '-1' to specify static batching. See the TRTIS documentation for more info.")
flags.DEFINE_integer("trtis_engine_count", 1, "Specifies the 'instance_group' count value in the TRTIS model config. See the TRTIS documentation for more info.")
# Triton Specific flags
flags.DEFINE_bool("export_triton", False, "Whether to export saved model or run inference with Triton")
flags.DEFINE_string("triton_model_name", "bert", "exports to appropriate directory for Triton")
flags.DEFINE_integer("triton_model_version", 1, "exports to appropriate directory for Triton")
flags.DEFINE_string("triton_server_url", "localhost:8001", "exports to appropriate directory for Triton")
flags.DEFINE_bool("triton_model_overwrite", False, "If True, will overwrite an existing directory with the specified 'model_name' and 'version_name'")
flags.DEFINE_integer("triton_max_batch_size", 8, "Specifies the 'max_batch_size' in the Triton model config. See the Triton documentation for more info.")
flags.DEFINE_float("triton_dyn_batching_delay", 0, "Determines the dynamic_batching queue delay in milliseconds(ms) for the Triton model config. Use '0' or '-1' to specify static batching. See the Triton documentation for more info.")
flags.DEFINE_integer("triton_engine_count", 1, "Specifies the 'instance_group' count value in the Triton model config. See the Triton documentation for more info.")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
return flags.FLAGS
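The net effect of the block above is that `run_squad.py` no longer registers its flags at import time: they are defined inside `extract_run_squad_flags()`, which the `__main__` block calls before `tf.app.run()`. One likely motivation is to let other entry points, such as the Triton export and client scripts, import pieces of `run_squad` without pulling in its whole flag set. A stripped-down sketch of the pattern, with a hypothetical `output_dir` flag standing in for the real list, is:

```python
# Stripped-down sketch of the "flags in a function" pattern (hypothetical flag name).
import tensorflow as tf

flags = tf.flags
FLAGS = None  # populated only when this file is run as a script

def extract_my_flags():
    flags.DEFINE_string("output_dir", None, "Where outputs are written.")
    flags.mark_flag_as_required("output_dir")
    return flags.FLAGS

def main(_):
    tf.compat.v1.logging.info("output_dir = %s", FLAGS.output_dir)

if __name__ == "__main__":
    FLAGS = extract_my_flags()
    tf.app.run()  # e.g. python this_file.py --output_dir=/tmp/out
```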
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
use_one_hot_embeddings):
@ -434,12 +439,9 @@ RawResult = collections.namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
def write_predictions(all_examples, all_features, all_results, n_best_size,
max_answer_length, do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file):
"""Write final predictions to the json file and log-odds of null if needed."""
tf.compat.v1.logging.info("Writing predictions to: %s" % (output_prediction_file))
tf.compat.v1.logging.info("Writing nbest to: %s" % (output_nbest_file))
def get_predictions(all_examples, all_features, all_results, n_best_size, max_answer_length,
do_lower_case, version_2_with_negative, verbose_logging):
"""Get final predictions"""
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
@ -471,7 +473,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
start_indexes = _get_best_indexes(result.start_logits, n_best_size)
end_indexes = _get_best_indexes(result.end_logits, n_best_size)
# if we could have irrelevant answers, get the min score of irrelevant
if FLAGS.version_2_with_negative:
if version_2_with_negative:
feature_null_score = result.start_logits[0] + result.end_logits[0]
if feature_null_score < score_null:
score_null = feature_null_score
@ -506,7 +508,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
start_logit=result.start_logits[start_index],
end_logit=result.end_logits[end_index]))
if FLAGS.version_2_with_negative:
if version_2_with_negative:
prelim_predictions.append(
_PrelimPrediction(
feature_index=min_null_feature_index,
@ -544,7 +546,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
tok_text = " ".join(tok_text.split())
orig_text = " ".join(orig_tokens)
final_text = get_final_text(tok_text, orig_text, do_lower_case)
final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)
if final_text in seen_predictions:
continue
@ -560,7 +562,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
end_logit=pred.end_logit))
# if we didn't include the empty option in the n-best, include it
if FLAGS.version_2_with_negative:
if version_2_with_negative:
if "" not in seen_predictions:
nbest.append(
_NbestPrediction(
@ -595,7 +597,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
assert len(nbest_json) >= 1
if not FLAGS.version_2_with_negative:
if not version_2_with_negative:
all_predictions[example.qas_id] = nbest_json[0]["text"]
else:
# predict "" iff the null score - the score of best non-null > threshold
@ -608,6 +610,19 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
all_predictions[example.qas_id] = best_non_null_entry.text
all_nbest_json[example.qas_id] = nbest_json
return all_predictions, all_nbest_json, scores_diff_json
def write_predictions(all_examples, all_features, all_results, n_best_size,
max_answer_length, do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file,
version_2_with_negative, verbose_logging):
"""Write final predictions to the json file and log-odds of null if needed."""
tf.compat.v1.logging.info("Writing predictions to: %s" % (output_prediction_file))
tf.compat.v1.logging.info("Writing nbest to: %s" % (output_nbest_file))
all_predictions, all_nbest_json, scores_diff_json = get_predictions(all_examples, all_features,
all_results, n_best_size, max_answer_length, do_lower_case, version_2_with_negative, verbose_logging)
with tf.io.gfile.GFile(output_prediction_file, "w") as writer:
writer.write(json.dumps(all_predictions, indent=4) + "\n")
@ -615,12 +630,12 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
with tf.io.gfile.GFile(output_nbest_file, "w") as writer:
writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
if FLAGS.version_2_with_negative:
if version_2_with_negative:
with tf.io.gfile.GFile(output_null_log_odds_file, "w") as writer:
writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
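In other words, the prediction logic is now split in two: `get_predictions()` returns the prediction dictionaries so callers (the notebooks, or a Triton client) can post-process them in memory, and `write_predictions()` is a thin wrapper that also serializes them to JSON, with `version_2_with_negative` and `verbose_logging` passed explicitly instead of being read from `FLAGS`. A sketch of calling the refactored functions from another module (argument values are illustrative) looks like:

```python
# Sketch of using the refactored helpers from outside run_squad.py (illustrative values).
import run_squad

def postprocess_squad(eval_examples, eval_features, all_results, output_dir):
    # In-memory predictions, no files written:
    all_predictions, all_nbest_json, scores_diff_json = run_squad.get_predictions(
        eval_examples, eval_features, all_results,
        n_best_size=20, max_answer_length=30, do_lower_case=True,
        version_2_with_negative=False, verbose_logging=False)

    # Or write everything to disk in one call, passing the two new arguments
    # explicitly rather than relying on FLAGS:
    run_squad.write_predictions(
        eval_examples, eval_features, all_results,
        20, 30, True,
        output_dir + "/predictions.json",
        output_dir + "/nbest_predictions.json",
        output_dir + "/null_odds.json",
        False, False)
    return all_predictions
```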
def get_final_text(pred_text, orig_text, do_lower_case):
def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging):
"""Project the tokenized prediction back to the original text."""
# When we created the data, we kept track of the alignment between original
@ -669,7 +684,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
start_position = tok_text.find(pred_text)
if start_position == -1:
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info(
"Unable to find text: '%s' in '%s'" % (pred_text, orig_text))
return orig_text
@ -679,7 +694,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
(tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)
if len(orig_ns_text) != len(tok_ns_text):
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info("Length not equal after stripping spaces: '%s' vs '%s'",
orig_ns_text, tok_ns_text)
return orig_text
@ -697,7 +712,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
orig_start_position = orig_ns_to_s_map[ns_start_position]
if orig_start_position is None:
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info("Couldn't map start position")
return orig_text
@ -708,7 +723,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
orig_end_position = orig_ns_to_s_map[ns_end_position]
if orig_end_position is None:
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info("Couldn't map end position")
return orig_text
@ -757,7 +772,7 @@ def validate_flags_or_throw(bert_config):
tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
FLAGS.init_checkpoint)
if not FLAGS.do_train and not FLAGS.do_predict and not FLAGS.export_trtis:
if not FLAGS.do_train and not FLAGS.do_predict and not FLAGS.export_triton:
raise ValueError("At least one of `do_train` or `do_predict` or `export_SavedModel` must be True.")
if FLAGS.do_train:
@ -782,7 +797,7 @@ def validate_flags_or_throw(bert_config):
def export_model(estimator, export_dir, init_checkpoint):
"""Exports a checkpoint in SavedModel format in a directory structure compatible with TRTIS."""
"""Exports a checkpoint in SavedModel format in a directory structure compatible with Triton."""
def serving_input_fn():
@ -806,10 +821,10 @@ def export_model(estimator, export_dir, init_checkpoint):
checkpoint_path=init_checkpoint,
strip_default_attrs=False)
model_name = FLAGS.trtis_model_name
model_name = FLAGS.triton_model_name
model_folder = export_dir + "/trtis_models/" + model_name
version_folder = model_folder + "/" + str(FLAGS.trtis_model_version)
model_folder = export_dir + "/triton_models/" + model_name
version_folder = model_folder + "/" + str(FLAGS.triton_model_version)
final_model_folder = version_folder + "/model.savedmodel"
if not os.path.exists(version_folder):
@ -819,19 +834,19 @@ def export_model(estimator, export_dir, init_checkpoint):
os.rename(saved_dir, final_model_folder)
print("Model saved to dir", final_model_folder)
else:
if (FLAGS.trtis_model_overwrite):
if (FLAGS.triton_model_overwrite):
shutil.rmtree(final_model_folder)
os.rename(saved_dir, final_model_folder)
print("WARNING: Existing model was overwritten. Model dir: {}".format(final_model_folder))
else:
print("ERROR: Could not save TRTIS model. Folder already exists. Use '--trtis_model_overwrite=True' if you would like to overwrite an existing model. Model dir: {}".format(final_model_folder))
print("ERROR: Could not save Triton model. Folder already exists. Use '--triton_model_overwrite=True' if you would like to overwrite an existing model. Model dir: {}".format(final_model_folder))
return
# Now build the config for TRTIS. Check to make sure we can overwrite it, if it exists
# Now build the config for Triton. Check to make sure we can overwrite it, if it exists
config_filename = os.path.join(model_folder, "config.pbtxt")
if (os.path.exists(config_filename) and not FLAGS.trtis_model_overwrite):
print("ERROR: Could not save TRTIS model config. Config file already exists. Use '--trtis_model_overwrite=True' if you would like to overwrite an existing model config. Model config: {}".format(config_filename))
if (os.path.exists(config_filename) and not FLAGS.triton_model_overwrite):
print("ERROR: Could not save Triton model config. Config file already exists. Use '--triton_model_overwrite=True' if you would like to overwrite an existing model config. Model config: {}".format(config_filename))
return
config_template = r"""
@ -883,9 +898,9 @@ instance_group [
]"""
batching_str = ""
max_batch_size = FLAGS.trtis_max_batch_size
max_batch_size = FLAGS.triton_max_batch_size
if (FLAGS.trtis_dyn_batching_delay > 0):
if (FLAGS.triton_dyn_batching_delay > 0):
# Use only full and half full batches
pref_batch_size = [int(max_batch_size / 2.0), max_batch_size]
@ -894,7 +909,7 @@ instance_group [
dynamic_batching {{
preferred_batch_size: [{0}]
max_queue_delay_microseconds: {1}
}}""".format(", ".join([str(x) for x in pref_batch_size]), int(FLAGS.trtis_dyn_batching_delay * 1000.0))
}}""".format(", ".join([str(x) for x in pref_batch_size]), int(FLAGS.triton_dyn_batching_delay * 1000.0))
config_values = {
"model_name": model_name,
@ -902,7 +917,7 @@ dynamic_batching {{
"seq_length": FLAGS.max_seq_length,
"dynamic_batching": batching_str,
"gpu_list": ", ".join([x.name.split(":")[-1] for x in device_lib.list_local_devices() if x.device_type == "GPU"]),
"engine_count": FLAGS.trtis_engine_count
"engine_count": FLAGS.triton_engine_count
}
with open(model_folder + "/config.pbtxt", "w") as file:
@ -911,13 +926,13 @@ dynamic_batching {{
file.write(final_config_str)
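For reference, the dynamic-batching fragment that `export_model()` substitutes into `config.pbtxt` renders as shown below; the sketch re-runs the diff's formula with illustrative values (`--triton_max_batch_size=8`, `--triton_dyn_batching_delay=10` ms), so the preferred batch sizes are the half-full and full batches and the delay is converted to microseconds.

```python
# Toy rendering of the dynamic_batching block written into config.pbtxt (illustrative values).
triton_max_batch_size = 8
triton_dyn_batching_delay = 10.0  # milliseconds, as passed on the command line

pref_batch_size = [int(triton_max_batch_size / 2.0), triton_max_batch_size]  # [4, 8]
batching_str = """
dynamic_batching {{
    preferred_batch_size: [{0}]
    max_queue_delay_microseconds: {1}
}}""".format(", ".join(str(x) for x in pref_batch_size),
             int(triton_dyn_batching_delay * 1000.0))

print(batching_str)
# dynamic_batching {
#     preferred_batch_size: [4, 8]
#     max_queue_delay_microseconds: 10000
# }
```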
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
@ -1061,7 +1076,7 @@ def main(_):
tf.compat.v1.logging.info("-----------------------------")
if FLAGS.export_trtis and master_process:
if FLAGS.export_triton and master_process:
export_model(estimator, FLAGS.output_dir, FLAGS.init_checkpoint)
if FLAGS.do_predict and master_process:
@ -1163,7 +1178,8 @@ def main(_):
write_predictions(eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file)
output_nbest_file, output_null_log_odds_file,
FLAGS.version_2_with_negative, FLAGS.verbose_logging)
if FLAGS.eval_script:
import sys
@ -1179,7 +1195,5 @@ def main(_):
if __name__ == "__main__":
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.compat.v1.app.run()
FLAGS = extract_run_squad_flags()
tf.app.run()

View file

@ -68,7 +68,14 @@ printf "Logs written to %s\n" "$LOGFILE"
INPUT_FILES="$DATA_DIR/training"
EVAL_FILES="$DATA_DIR/test"
CMD="python3 /workspace/bert/run_pretraining.py"
horovod_str=""
mpi=""
if [ $num_gpus -gt 1 ] ; then
mpi="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket"
horovod_str="--horovod"
fi
CMD="$mpi python3 /workspace/bert/run_pretraining.py"
CMD+=" --input_files_dir=$INPUT_FILES"
CMD+=" --eval_files_dir=$EVAL_FILES"
CMD+=" --output_dir=$RESULTS_DIR"
@ -85,7 +92,7 @@ CMD+=" --num_accumulation_steps=$num_accumulation_steps"
CMD+=" --save_checkpoints_steps=$save_checkpoints_steps"
CMD+=" --learning_rate=$learning_rate"
CMD+=" --optimizer_type=adam"
CMD+=" --horovod $PREC"
CMD+=" $horovod_str $PREC"
CMD+=" --allreduce_post_accumulation=True"
#Check if all necessary files are available before training
@ -96,10 +103,6 @@ for DIR_or_file in $DATA_DIR $BERT_CONFIG $RESULTS_DIR; do
fi
done
if [ $num_gpus -gt 1 ] ; then
CMD="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket $CMD"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD

View file

@ -59,8 +59,10 @@ if [ "$use_xla" = "true" ] ; then
fi
mpi=""
horovod_str=""
if [ $num_gpus -gt 1 ] ; then
mpi="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket"
horovod_str="--horovod"
fi
#PHASE 1
@ -99,5 +101,5 @@ done
--num_warmup_steps=$warmup_steps_phase1 \
--save_checkpoints_steps=$save_checkpoints_steps \
--learning_rate=$learning_rate_phase1 \
--horovod $PREC \
$horovod_str $PREC \
--allreduce_post_accumulation=True

View file

@ -61,8 +61,10 @@ if [ "$use_xla" = "true" ] ; then
fi
mpi=""
horovod_str=""
if [ $num_gpus -gt 1 ] ; then
mpi="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket"
horovod_str="--horovod"
fi
#PHASE 1 Config
@ -110,6 +112,6 @@ $mpi python /workspace/bert/run_pretraining.py \
--num_warmup_steps=$warmup_steps_phase2 \
--save_checkpoints_steps=$save_checkpoints_steps \
--learning_rate=$learning_rate_phase2 \
--horovod $PREC \
$horovod_str $PREC \
--allreduce_post_accumulation=True

View file

@ -76,6 +76,7 @@ python run_squad.py \
--init_checkpoint=$init_checkpoint \
--do_predict=True \
--predict_file=$SQUAD_DIR/dev-v${squad_version}.json \
--eval_script=$SQUAD_DIR/evaluate-v${squad_version}.py \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
--doc_stride=$doc_stride \
@ -83,5 +84,3 @@ python run_squad.py \
--output_dir=$RESULTS_DIR \
"$use_fp16" \
$use_xla_tag --version_2_with_negative=${version_2_with_negative}
python $SQUAD_DIR/evaluate-v${squad_version}.py $SQUAD_DIR/dev-v${squad_version}.json $RESULTS_DIR/predictions.json

View file

@ -1,18 +1,18 @@
# Deploying the BERT model using TensorRT Inference Server
# Deploying the BERT model using Triton Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains detailed performance analysis as well as scripts to run SQuAD fine-tuning on BERT model using TensorRT Inference Server.
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains detailed performance analysis as well as scripts to run SQuAD fine-tuning on a BERT model using Triton Inference Server.
## Table Of Contents
- [TensorRT Inference Server Overview](#tensorrt-inference-server-overview)
- [Performance analysis for TensorRT Inference Server](#performance-analysis-for-tensorrt-inference-server)
- [Triton Inference Server Overview](#triton-inference-server-overview)
- [Running the Triton Inference Server and client](#running-the-triton-inference-server-and-client)
- [Performance analysis for Triton Inference Server](#performance-analysis-for-triton-inference-server)
* [Advanced Details](#advanced-details)
- [Running the TensorRT Inference Server and client](#running-the-tensorrt-inference-server-and-client)
## TensorRT Inference Server Overview
## Triton Inference Server Overview
A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
A typical Triton Inference Server pipeline can be broken down into the following 8 steps:
1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
@ -23,30 +23,40 @@ A typical TensorRT Inference Server pipeline can be broken down into the followi
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
Generally, for local clients, steps 1-4 and 6-8 occupy only a small fraction of the total time compared to step 5. Since backend deep learning systems like BERT are rarely exposed directly to end users and instead interface only with local front-end servers, for the purposes of BERT we can consider all clients to be local.
In this section, we will go over how to launch TensorRT Inference Server and client and get the best performant solution that fits your specific application needs.
In this section, we will go over how to launch the Triton Inference Server and client, and how to arrive at the most performant solution that fits your specific application needs.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
## Performance analysis for TensorRT Inference Server
## Running the Triton Inference Server and client
The `run_triton.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that Triton Inference Server accepts, builds a matching [Triton Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state, runs the client on the SQuAD v1.1 dataset, and then evaluates the validity of the predictions based on exact match and F1 score, all in one step.
```bash
bash triton/scripts/run_triton.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <triton_version_name> <triton_model_name> <triton_export_model> <triton_dyn_batching_delay> <triton_engine_count> <triton_model_overwrite>
```
You can also run inference on a single sample by passing `--question` and `--context` arguments to the client, as shown in the sketch below.
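For example, here is a minimal sketch of a single-sample request issued from inside the container (the model name `bert`, version `1`, and the question/context strings are illustrative; the model is assumed to have already been exported and the server to be running):

```bash
# Assumes $BERT_DIR points at the pretrained weights (for vocab.txt and bert_config.json)
# and that a model named "bert" (version 1) is already loaded by the running server.
python triton/run_squad_triton_client.py \
  --triton_model_name=bert \
  --triton_model_version=1 \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --question="Who deployed the model?" \
  --context="The BERT model was exported and deployed with the Triton Inference Server."
```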
## Performance analysis for Triton Inference Server
Based on Figures 1 and 2 below, we recommend using the Dynamic Batcher with `max_batch_size = 8`, `max_queue_delay_microseconds` as large as possible while still fitting within your latency window (the values used below are extremely large to exaggerate their effect), and only 1 instance of the engine. The largest improvements to both throughput and latency come from increasing the batch size, due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of simultaneous requests while also keeping latency down for infrequent requests. We recommend only 1 instance of the engine because additional instances bring a negligible improvement in throughput at the cost of a significant increase in latency. Many models can benefit from multiple engine instances, but as the figures below show, that is not the case for this model. A config sketch matching this recommendation follows Figure 2.
![](../data/images/trtis_base_summary.png?raw=true)
![](../data/images/triton_base_summary.png?raw=true)
Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in TensorRT Inference Server
Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in Triton Inference Server
![](../data/images/trtis_large_summary.png?raw=true)
![](../data/images/triton_large_summary.png?raw=true)
Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in TensorRT Inference Server
Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in Triton Inference Server
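The configuration below is only a sketch of what a `config.pbtxt` matching this recommendation might look like; `export_model.sh` generates the real one, the model name and repository path are assumptions, and because `launch_server.sh` starts the server with `--strict-model-config=false`, the input/output sections can be left out and inferred from the SavedModel:

```bash
# Hypothetical hand-written config for a model named "bert" in results/triton_models;
# the 5000-microsecond delay mirrors the exaggerated value used in the figures below.
mkdir -p results/triton_models/bert
cat > results/triton_models/bert/config.pbtxt <<'EOF'
name: "bert"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
EOF
```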
### Advanced Details
This section digs deeper into the performance numbers and configurations corresponding to running TensorRT Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
This section digs deeper into the performance numbers and configurations for running Triton Inference Server with BERT fine-tuned for Question Answering. It explains the tradeoffs in selecting the maximum batch size, the batching technique, and the number of inference engines on the same GPU, and shows how we arrived at the optimal configuration specified previously.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that Triton Inference Server accepts, builds a matching [Triton Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
```bash
bash trtis/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
bash triton/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
```
All results below are obtained on a single DGX-1 V100 32GB GPU for BERT Base, Sequence Length = 128 and FP16 precision, running against a local server. Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. The X-axis indicates the concurrency: the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput; the sketch below shows how such a sweep is driven.
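A single point of this sweep can be reproduced through the `run_perf_client.sh` wrapper; the positional arguments below mirror the calls made by `generate_figures.sh` (the seventh argument is the maximum concurrency and the eighth the server hostname; the meaning of the other numeric arguments is defined inside the wrapper, so treat the values as illustrative):

```bash
# Sweep concurrency up to 64 for a model named "bert" (version 1, fp16, client batch size 1)
# against a server on localhost; the 1000/10 values follow the defaults in generate_figures.sh.
bash triton/scripts/run_perf_client.sh bert 1 fp16 1 1000 10 64 localhost
```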
@ -59,11 +69,11 @@ Note: We compare BS=1, Client Concurrent Requests = 64 to BS=8, Client Concurren
Increasing the batch size from 1 to 8 increases the compute time by only 1.8x (8.38 ms to 15.46 ms), showing that computation is more efficient at higher batch sizes: the compute time per sequence drops from 8.38 ms to roughly 1.9 ms. Hence, an optimal batch size would be the largest batch size that both fits in memory and stays within the preferred latency threshold.
![](../data/images/trtis_bs_1.png?raw=true)
![](../data/images/triton_bs_1.png?raw=true)
Figure 3: Latency & Throughput vs Concurrency at Batch size = 1
![](../data/images/trtis_bs_8.png?raw=true)
![](../data/images/triton_bs_8.png?raw=true)
Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
@ -71,38 +81,30 @@ Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
Static batching is a feature of the inference server that allows inference requests to be served as they are received. It is preferred in scenarios where low latency is desired, at the cost of throughput when the GPU is underutilized.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in an increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and preferred_batchsize to indicate your optimal batch sizes in the TensorRT Inference Server model config.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server so that a batch is created dynamically, resulting in increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. In the Triton Inference Server model config, you can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds`, the maximum amount of time you are willing to wait for a batch to fill, and `preferred_batchsize`, your optimal batch sizes.
Figures 5 and 6 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increasing latency as the requests are queued up to `max_queue_delay_microseconds`. The effect of `preferred_batchsize` for dynamic batching is visually depicted by the dip in Server Queue time at integer multiples of the preferred batch sizes. At higher numbers of concurrent requests, the throughput approaches a maximum limit as GPU utilization saturates.
![](../data/images/trtis_static.png?raw=true)
![](../data/images/triton_static.png?raw=true)
Figure 5: Latency & Throughput vs Concurrency using Static Batching at `Batch size` = 1
![](../data/images/trtis_dynamic.png?raw=true)
![](../data/images/triton_dynamic.png?raw=true)
Figure 6: Latency & Throughput vs Concurrency using Dynamic Batching at `Batch size` = 1, `preferred_batchsize` = [4, 8] and `max_queue_delay_microseconds` = 5000
#### Model execution instance count
TensorRT Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesnt saturate the GPU allowing the GPU to run multiple instances of the model in parallel.
Triton Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesn't saturate the GPU, allowing the GPU to run multiple instances of the model in parallel.
Figures 7 and 8 show a drop in queue time as more models are available to serve an inference request. However, this is countered by an increase in compute time as multiple models compete for resources. Since BERT is a large model that utilizes the majority of the GPU, the benefit of running multiple engines is not seen; an export example with a larger engine count follows Figure 8.
![](../data/images/trtis_ec_1.png?raw=true)
![](../data/images/triton_ec_1.png?raw=true)
Figure 7: Latency & Throughput vs Concurrency at Batch size = 1, Engine Count = 1
(One copy of the model loaded in GPU memory)
![](../data/images/trtis_ec_4.png?raw=true)
![](../data/images/triton_ec_4.png?raw=true)
Figure 8: Latency & Throughput vs Concurrency at Batch size = 1, Engine count = 4
(Four copies of the model loaded in GPU memory)
## Running the TensorRT Inference Server and client
The `run_trtis.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on local host in a detached state, runs client and then evaluates the validity of predictions on the basis of exact match and F1 score all in one step.
```bash
bash trtis/scripts/run_trtis.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <trtis_version_name> <trtis_model_name> <trtis_export_model> <trtis_dyn_batching_delay> <trtis_engine_count> <trtis_model_overwrite>
```
(Four copies of the model loaded in GPU memory)
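For reference, a larger engine count is requested at export time; the sketch below mirrors the engine-count comparison in `generate_figures.sh` (the checkpoint path and most values are illustrative, and the engine count is the 11th positional argument of `export_model.sh`):

```bash
# Re-export the "bert" model with 4 engine instances, then restart the server
# so the updated model config is picked up.
bash triton/scripts/export_model.sh /results/model.ckpt-5474 1 fp16 true 384 128 \
  data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16 1 bert 0 4 True
docker kill triton_server_cont
bash triton/scripts/launch_server.sh fp16
```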

View file

@ -0,0 +1,325 @@
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Explicit imports for names used directly below (os, time, tf) rather than relying on wildcard imports
import os
import time

import tensorflow as tf

import modeling
import tokenization
from tensorrtserver.api import ProtocolType, InferContext, ServerStatusContext, grpc_service_pb2_grpc, grpc_service_pb2, model_config_pb2
from utils.create_squad_data import *
import grpc
from run_squad import write_predictions, get_predictions, RawResult
import numpy as np
import tqdm
from functools import partial
import sys
if sys.version_info >= (3, 0):
import queue
else:
import Queue as queue
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 384,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_integer(
"doc_stride", 128,
"When splitting up a long document into chunks, how much stride to "
"take between chunks.")
flags.DEFINE_integer(
"max_query_length", 64,
"The maximum number of tokens for the question. Questions longer than "
"this will be truncated to this length.")
flags.DEFINE_integer("predict_batch_size", 8,
"Total batch size for predictions.")
flags.DEFINE_integer(
"n_best_size", 20,
"The total number of n-best predictions to generate in the "
"nbest_predictions.json output file.")
flags.DEFINE_integer(
"max_answer_length", 30,
"The maximum length of an answer that can be generated. This is needed "
"because the start and end predictions are not conditioned on one another.")
flags.DEFINE_bool(
"version_2_with_negative", False,
"If true, the SQuAD examples contain some that do not have an answer.")
flags.DEFINE_bool(
"verbose_logging", False,
"If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
# Triton Specific flags
flags.DEFINE_string("triton_model_name", "bert", "exports to appropriate directory for Triton")
flags.DEFINE_integer("triton_model_version", 1, "exports to appropriate directory for Triton")
flags.DEFINE_string("triton_server_url", "localhost:8001", "exports to appropriate directory for Triton")
# Input Text for Inference
flags.DEFINE_string("question", None, "Question for Inference")
flags.DEFINE_string("context", None, "Context for Inference")
flags.DEFINE_string(
"predict_file", None,
"SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
# Set this to either 'label_ids' for Google bert or 'unique_ids' for JoC
label_id_key = "unique_ids"
# User defined class to store infer_ctx and request id
# from callback function and let main thread to handle them
class UserData:
def __init__(self):
self._completed_requests = queue.Queue()
# Callback function used for async_run(), it can capture
# additional information using functools.partial as long as the last
# two arguments are reserved for InferContext and request id
def completion_callback(user_data, idx, start_time, inputs, infer_ctx, request_id):
user_data._completed_requests.put((infer_ctx, request_id, idx, start_time, inputs))
# Group eval features into batches of up to n, yielding per-field tuples of numpy arrays
def batch(iterable, n=1):
l = len(iterable)
for ndx in range(0, l, n):
label_ids_data = ()
input_ids_data = ()
input_mask_data = ()
segment_ids_data = ()
for i in range(0, min(n, l-ndx)):
label_ids_data = label_ids_data + (np.array([iterable[ndx + i].unique_id], dtype=np.int32),)
input_ids_data = input_ids_data+ (np.array(iterable[ndx + i].input_ids, dtype=np.int32),)
input_mask_data = input_mask_data+ (np.array(iterable[ndx + i].input_mask, dtype=np.int32),)
segment_ids_data = segment_ids_data+ (np.array(iterable[ndx + i].segment_ids, dtype=np.int32),)
inputs_dict = {label_id_key: label_ids_data,
'input_ids': input_ids_data,
'input_mask': input_mask_data,
'segment_ids': segment_ids_data}
yield inputs_dict
def main(_):
"""
Ask a question of context on Triton.
:param context: str
:param question: str
:param question_id: int
:return:
"""
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tokenizer = tokenization.FullTokenizer(vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
# Get the Data
if FLAGS.predict_file:
eval_examples = read_squad_examples(
input_file=FLAGS.predict_file, is_training=False,
version_2_with_negative=FLAGS.version_2_with_negative)
elif FLAGS.question and FLAGS.context:  # single question/context pair passed on the command line
input_data = [{"paragraphs":[{"context":FLAGS.context,
"qas":[{"id":0, "question":FLAGS.question}]}]}]
eval_examples = read_squad_examples(input_file=None, is_training=False,
version_2_with_negative=FLAGS.version_2_with_negative, input_data=input_data)
else:
raise ValueError("Either predict_file or question+context needs to be defined")
# Get Eval Features = Preprocessing
eval_features = []
def append_feature(feature):
eval_features.append(feature)
convert_examples_to_features(
examples=eval_examples[0:],
tokenizer=tokenizer,
max_seq_length=FLAGS.max_seq_length,
doc_stride=FLAGS.doc_stride,
max_query_length=FLAGS.max_query_length,
is_training=False,
output_fn=append_feature)
protocol_str = 'grpc' # http or grpc
url = FLAGS.triton_server_url
verbose = True
model_name = FLAGS.triton_model_name
model_version = FLAGS.triton_model_version
batch_size = FLAGS.predict_batch_size
protocol = ProtocolType.from_str(protocol_str) # or 'grpc'
ctx = InferContext(url, protocol, model_name, model_version, verbose)
status_ctx = ServerStatusContext(url, protocol, model_name=model_name, verbose=verbose)
model_config_pb2.ModelConfig()
status_result = status_ctx.get_server_status()
user_data = UserData()
max_outstanding = 20
# Number of outstanding requests
outstanding = 0
sent_prog = tqdm.tqdm(desc="Send Requests", total=len(eval_features))
recv_prog = tqdm.tqdm(desc="Recv Requests", total=len(eval_features))
def process_outstanding(do_wait, outstanding):
if (outstanding == 0 or do_wait is False):
return outstanding
# Wait for deferred items from callback functions
(infer_ctx, ready_id, idx, start_time, inputs) = user_data._completed_requests.get()
if (ready_id is None):
return outstanding
# If we are here, we got an id
result = ctx.get_async_run_results(ready_id)
stop = time.time()
if (result is None):
raise ValueError("Context returned null for async id marked as done")
outstanding -= 1
time_list.append(stop - start_time)
batch_count = len(inputs[label_id_key])
for i in range(batch_count):
unique_id = int(inputs[label_id_key][i][0])
start_logits = [float(x) for x in result["start_logits"][i].flat]
end_logits = [float(x) for x in result["end_logits"][i].flat]
all_results.append(
RawResult(
unique_id=unique_id,
start_logits=start_logits,
end_logits=end_logits))
recv_prog.update(n=batch_count)
return outstanding
all_results = []
time_list = []
print("Starting Sending Requests....\n")
all_results_start = time.time()
idx = 0
for inputs_dict in batch(eval_features, batch_size):
present_batch_size = len(inputs_dict[label_id_key])
outputs_dict = {'start_logits': InferContext.ResultFormat.RAW,
'end_logits': InferContext.ResultFormat.RAW}
start_time = time.time()
ctx.async_run(partial(completion_callback, user_data, idx, start_time, inputs_dict),
inputs_dict, outputs_dict, batch_size=present_batch_size)
outstanding += 1
idx += 1
sent_prog.update(n=present_batch_size)
# Try to process at least one response per request
outstanding = process_outstanding(outstanding >= max_outstanding, outstanding)
tqdm.tqdm.write("All Requests Sent! Waiting for responses. Outstanding: {}.\n".format(outstanding))
# Now process all outstanding requests
while (outstanding > 0):
outstanding = process_outstanding(True, outstanding)
all_results_end = time.time()
all_results_total = (all_results_end - all_results_start) * 1000.0
print("-----------------------------")
print("Total Time: {} ms".format(all_results_total))
print("-----------------------------")
print("-----------------------------")
print("Total Inference Time = %0.2f for"
"Sentences processed = %d" % (sum(time_list), len(eval_features)))
print("Throughput Average (sentences/sec) = %0.2f" % (len(eval_features) / all_results_total * 1000.0))
print("-----------------------------")
if FLAGS.output_dir and FLAGS.predict_file:
# When inferencing on a dataset, get inference statistics and write results to json file
time_list.sort()
avg = np.mean(time_list)
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
print("-----------------------------")
print("Summary Statistics")
print("Batch size =", FLAGS.predict_batch_size)
print("Sequence Length =", FLAGS.max_seq_length)
print("Latency Confidence Level 95 (ms) =", cf_95 * 1000)
print("Latency Confidence Level 99 (ms) =", cf_99 * 1000)
print("Latency Confidence Level 100 (ms) =", cf_100 * 1000)
print("Latency Average (ms) =", avg * 1000)
print("-----------------------------")
output_prediction_file = os.path.join(FLAGS.output_dir, "predictions.json")
output_nbest_file = os.path.join(FLAGS.output_dir, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(FLAGS.output_dir, "null_odds.json")
write_predictions(eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file,
FLAGS.version_2_with_negative, FLAGS.verbose_logging)
else:
# When inferencing on a single example, write best answer to stdout
all_predictions, all_nbest_json, scores_diff_json = get_predictions(
eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, FLAGS.version_2_with_negative,
FLAGS.verbose_logging)
print("Context is: %s \n\nQuestion is: %s \n\nPredicted Answer is: %s" %(FLAGS.context, FLAGS.question, all_predictions[0]))
if __name__ == "__main__":
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
tf.compat.v1.app.run()

View file

@ -20,15 +20,15 @@ use_xla=${4:-"true"}
seq_length=${5:-"384"}
doc_stride=${6:-"128"}
BERT_DIR=${7:-"data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16"}
trtis_model_version=${8:-1}
trtis_model_name=${9:-"bert"}
trtis_dyn_batching_delay=${10:-0}
trtis_engine_count=${11:-1}
trtis_model_overwrite=${12:-"False"}
triton_model_version=${8:-1}
triton_model_name=${9:-"bert"}
triton_dyn_batching_delay=${10:-0}
triton_engine_count=${11:-1}
triton_model_overwrite=${12:-"False"}
additional_args="--trtis_model_version=$trtis_model_version --trtis_model_name=$trtis_model_name --trtis_max_batch_size=$batch_size \
--trtis_model_overwrite=$trtis_model_overwrite --trtis_dyn_batching_delay=$trtis_dyn_batching_delay \
--trtis_engine_count=$trtis_engine_count"
additional_args="--triton_model_version=$triton_model_version --triton_model_name=$triton_model_name --triton_max_batch_size=$batch_size \
--triton_model_overwrite=$triton_model_overwrite --triton_dyn_batching_delay=$triton_dyn_batching_delay \
--triton_engine_count=$triton_engine_count"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
@ -51,7 +51,7 @@ bash scripts/docker/launch.sh \
--doc_stride=${doc_stride} \
--predict_batch_size=${batch_size} \
--output_dir=/results \
--export_trtis=True \
--export_triton=True \
${additional_args}

View file

@ -0,0 +1,146 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set the number of devices to use
export NVIDIA_VISIBLE_DEVICES=0
# Always need to be overwriting models to keep memory use low
export TRITON_MODEL_OVERWRITE=True
bert_model=${1:-small}
seq_length=${2:-128}
precision=${3:-fp16}
init_checkpoint=${4:-"/results/models/bert_${bert_model}_${precision}_${seq_length}_v1/model.ckpt-5474"}
MODEL_NAME="bert_${bert_model}_${seq_length}_${precision}"
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
else
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
fi
doc_stride=128
use_xla=true
EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DIR} 1 ${MODEL_NAME}"
PERF_CLIENT_ARGS="1000 10 20 localhost"
# Start Server
bash triton/scripts/launch_server.sh $precision
# Restart Server
restart_server() {
docker kill triton_server_cont
bash triton/scripts/launch_server.sh $precision
}
############## Dynamic Batching Comparison ##############
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=1
TRITON_ENGINE_COUNT=1
# Dynamic batching 10 ms
TRITON_DYN_BATCHING_DELAY=10
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 5 ms
TRITON_DYN_BATCHING_DELAY=5
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 2 ms
TRITON_DYN_BATCHING_DELAY=2
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Static Batching (i.e. Dynamic batching 0 ms)
TRITON_DYN_BATCHING_DELAY=0
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# ############## Engine Count Comparison ##############
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRITON_DYN_BATCHING_DELAY=0
# Engine Count = 4
TRITON_ENGINE_COUNT=4
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 2
TRITON_ENGINE_COUNT=2
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 1
TRITON_ENGINE_COUNT=1
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
############## Batch Size Comparison ##############
# BATCH=1 Generate model and perf
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRITON_ENGINE_COUNT=1
TRITON_DYN_BATCHING_DELAY=0
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
# BATCH=2 Generate model and perf
SERVER_BATCH_SIZE=2
CLIENT_BATCH_SIZE=2
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
# BATCH=4 Generate model and perf
SERVER_BATCH_SIZE=4
CLIENT_BATCH_SIZE=4
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
# BATCH=8 Generate model and perf
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=8
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost

View file

@ -9,16 +9,16 @@ else
export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=0
fi
# Start TRTIS server in detached state
nvidia-docker run -d --rm \
# Start TRITON server in detached state
docker run --gpus all -d --rm \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p8000:8000 \
-p8001:8001 \
-p8002:8002 \
--name trt_server_cont \
--name triton_server_cont \
-e NVIDIA_VISIBLE_DEVICES=$NV_VISIBLE_DEVICES \
-e TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE \
-v $PWD/results/trtis_models:/models \
nvcr.io/nvidia/tensorrtserver:20.02-py3 trtserver --model-store=/models --strict-model-config=false
-v $PWD/results/triton_models:/models \
nvcr.io/nvidia/tritonserver:20.03-py3 trtserver --model-store=/models --strict-model-config=false

View file

@ -16,36 +16,18 @@
batch_size=${1:-"8"}
seq_length=${2:-"384"}
doc_stride=${3:-"128"}
trtis_version_name=${4:-"1"}
trtis_model_name=${5:-"bert"}
triton_version_name=${4:-"1"}
triton_model_name=${5:-"bert"}
BERT_DIR=${6:-"data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16"}
squad_version=${7:-"1.1"}
export SQUAD_DIR=data/download/squad/v${squad_version}
if [ "$squad_version" = "1.1" ] ; then
version_2_with_negative="False"
else
version_2_with_negative="True"
fi
echo "Squad directory set as " $SQUAD_DIR
if [ ! -d "$SQUAD_DIR" ] ; then
echo "Error! $SQUAD_DIR directory missing. Please mount SQuAD dataset."
exit -1
fi
bash scripts/docker/launch.sh \
"python trtis/run_squad_trtis_client.py \
--trtis_model_name=$trtis_model_name \
--trtis_model_version=$trtis_version_name \
"python triton/run_squad_triton_client.py \
--triton_model_name=$triton_model_name \
--triton_model_version=$triton_version_name \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--predict_file=$SQUAD_DIR/dev-v${squad_version}.json \
--predict_batch_size=$batch_size \
--max_seq_length=${seq_length} \
--doc_stride=${doc_stride} \
--output_dir=/results \
--version_2_with_negative=${version_2_with_negative}"
bash scripts/docker/launch.sh "python $SQUAD_DIR/evaluate-v${squad_version}.py \
$SQUAD_DIR/dev-v${squad_version}.json /results/predictions.json"
${@:7}"

View file

@ -23,21 +23,21 @@ MAX_CONCURRENCY=${7:-50}
SERVER_HOSTNAME=${8:-"localhost"}
if [[ $SERVER_HOSTNAME == *":"* ]]; then
echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the TRTIS HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the TRITON HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
exit 1
fi
if [ "$SERVER_HOSTNAME" = "localhost" ]
then
if [ ! "$(docker inspect -f "{{.State.Running}}" trt_server_cont)" = "true" ] ; then
if [ ! "$(docker inspect -f "{{.State.Running}}" triton_server_cont)" = "true" ] ; then
echo "Launching TRTIS server"
bash trtis/scripts/launch_server.sh $precision
echo "Launching TRITON server"
bash triton/scripts/launch_server.sh $precision
SERVER_LAUNCHED=true
function cleanup_server {
echo "Killing TRTIS server"
docker kill trt_server_cont
echo "Killing TRITON server"
docker kill triton_server_cont
}
# Ensure we cleanup the server on exit
@ -47,7 +47,7 @@ then
fi
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash trtis/scripts/wait_for_trtis_server.sh $SERVER_HOSTNAME
bash triton/scripts/wait_for_triton_server.sh $SERVER_HOSTNAME
TIMESTAMP=$(date "+%y%m%d_%H%M")

View file

@ -21,12 +21,13 @@ seq_length=${5:-"384"}
doc_stride=${6:-"128"}
bert_model=${7:-"large"}
squad_version=${8:-"1.1"}
trtis_version_name=${9:-1}
trtis_model_name=${10:-"bert"}
trtis_export_model=${11:-"false"}
trtis_dyn_batching_delay=${12:-0}
trtis_engine_count=${13:-1}
trtis_model_overwrite=${14:-"False"}
triton_version_name=${9:-1}
triton_model_name=${10:-"bert"}
triton_export_model=${11:-"true"}
triton_dyn_batching_delay=${12:-0}
triton_engine_count=${13:-1}
triton_model_overwrite=${14:-"False"}
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
@ -39,8 +40,21 @@ if [ ! -d "$BERT_DIR" ] ; then
exit -1
fi
export SQUAD_DIR=data/download/squad/v${squad_version}
if [ "$squad_version" = "1.1" ] ; then
version_2_with_negative="False"
else
version_2_with_negative="True"
fi
echo "Squad directory set as " $SQUAD_DIR
if [ ! -d "$SQUAD_DIR" ] ; then
echo "Error! $SQUAD_DIR directory missing. Please mount SQuAD dataset."
exit -1
fi
# Need to ignore case on some variables
trtis_export_model=$(echo "$trtis_export_model" | tr '[:upper:]' '[:lower:]')
triton_export_model=$(echo "$triton_export_model" | tr '[:upper:]' '[:lower:]')
# Explicitly save this variable to pass down to new containers
NV_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-"all"}
@ -56,33 +70,36 @@ echo " seq_length = $seq_length"
echo " doc_stride = $doc_stride"
echo " bert_model = $bert_model"
echo " squad_version = $squad_version"
echo " version_name = $trtis_version_name"
echo " model_name = $trtis_model_name"
echo " export_model = $trtis_export_model"
echo " version_name = $triton_version_name"
echo " model_name = $triton_model_name"
echo " export_model = $triton_export_model"
echo
echo "Env: "
echo " NVIDIA_VISIBLE_DEVICES = $NV_VISIBLE_DEVICES"
echo
# Export Model in SavedModel format if enabled
if [ "$trtis_export_model" = "true" ] ; then
echo "Exporting model as: Name - $trtis_model_name Version - $trtis_version_name"
if [ "$triton_export_model" = "true" ] ; then
echo "Exporting model as: Name - $triton_model_name Version - $triton_version_name"
bash trtis/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
$doc_stride $BERT_DIR $RESULTS_DIR $trtis_version_name $trtis_model_name \
$trtis_dyn_batching_delay $trtis_engine_count $trtis_model_overwrite
bash triton/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
$doc_stride $BERT_DIR $triton_version_name $triton_model_name \
$triton_dyn_batching_delay $triton_engine_count $triton_model_overwrite
fi
# Start TRTIS server in detached state
bash trtis/scripts/launch_server.sh $precision
bash triton/scripts/launch_server.sh $precision
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash trtis/scripts/wait_for_trtis_server.sh localhost
bash triton/scripts/wait_for_triton_server.sh localhost
# Start TRTIS client for inference and evaluate results
bash trtis/scripts/run_client.sh $batch_size $seq_length $doc_stride $trtis_version_name $trtis_model_name \
$BERT_DIR $squad_version
# Start Triton client for inference on SQuAD dataset
bash triton/scripts/run_client.sh $batch_size $seq_length $doc_stride $triton_version_name $triton_model_name \
$BERT_DIR --predict_file=$SQUAD_DIR/dev-v${squad_version}.json --version_2_with_negative=${version_2_with_negative}
# Evaluate SQuAD results
bash scripts/docker/launch.sh "python $SQUAD_DIR/evaluate-v${squad_version}.py \
$SQUAD_DIR/dev-v${squad_version}.json /results/predictions.json"
#Kill the TRTIS Server
docker kill trt_server_cont
docker kill triton_server_cont

View file

@ -15,7 +15,7 @@
SERVER_URI=${1:-"localhost"}
echo "Waiting for TRTIS Server to be ready at http://$SERVER_URI:8000..."
echo "Waiting for TRITON Server to be ready at http://$SERVER_URI:8000..."
live_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/live"
ready_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/ready"
@ -30,4 +30,4 @@ while [[ ${current_status} != "200" ]] || [[ $($ready_command) != "200" ]]; do
current_status=$($live_command)
done
echo "TRTIS Server is ready!"
echo "TRITON Server is ready!"

View file

@ -1,222 +0,0 @@
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import modeling
import tokenization
from tensorrtserver.api import ProtocolType, InferContext, ServerStatusContext, grpc_service_pb2_grpc, grpc_service_pb2, model_config_pb2
from utils.create_squad_data import *
import grpc
from run_squad import *
import numpy as np
import tqdm
# Set this to either 'label_ids' for Google bert or 'unique_ids' for JoC
label_id_key = "unique_ids"
PendingResult = collections.namedtuple("PendingResult",
["async_id", "start_time", "inputs"])
def batch(iterable, n=1):
l = len(iterable)
for ndx in range(0, l, n):
label_ids_data = ()
input_ids_data = ()
input_mask_data = ()
segment_ids_data = ()
for i in range(0, min(n, l-ndx)):
label_ids_data = label_ids_data + (np.array([iterable[ndx + i].unique_id], dtype=np.int32),)
input_ids_data = input_ids_data+ (np.array(iterable[ndx + i].input_ids, dtype=np.int32),)
input_mask_data = input_mask_data+ (np.array(iterable[ndx + i].input_mask, dtype=np.int32),)
segment_ids_data = segment_ids_data+ (np.array(iterable[ndx + i].segment_ids, dtype=np.int32),)
inputs_dict = {label_id_key: label_ids_data,
'input_ids': input_ids_data,
'input_mask': input_mask_data,
'segment_ids': segment_ids_data}
yield inputs_dict
def run_client():
"""
Ask a question of context on TRTIS.
:param context: str
:param question: str
:param question_id: int
:return:
"""
tokenizer = tokenization.FullTokenizer(vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
eval_examples = read_squad_examples(
input_file=FLAGS.predict_file, is_training=False,
version_2_with_negative=FLAGS.version_2_with_negative)
eval_features = []
def append_feature(feature):
eval_features.append(feature)
convert_examples_to_features(
examples=eval_examples[0:],
tokenizer=tokenizer,
max_seq_length=FLAGS.max_seq_length,
doc_stride=FLAGS.doc_stride,
max_query_length=FLAGS.max_query_length,
is_training=False,
output_fn=append_feature)
protocol_str = 'grpc' # http or grpc
url = FLAGS.trtis_server_url
verbose = True
model_name = FLAGS.trtis_model_name
model_version = FLAGS.trtis_model_version
batch_size = FLAGS.predict_batch_size
protocol = ProtocolType.from_str(protocol_str) # or 'grpc'
ctx = InferContext(url, protocol, model_name, model_version, verbose)
channel = grpc.insecure_channel(url)
stub = grpc_service_pb2_grpc.GRPCServiceStub(channel)
prof_request = grpc_service_pb2.server__status__pb2.model__config__pb2.ModelConfig()
prof_response = stub.Profile(prof_request)
status_ctx = ServerStatusContext(url, protocol, model_name=model_name, verbose=verbose)
model_config_pb2.ModelConfig()
status_result = status_ctx.get_server_status()
outstanding = {}
max_outstanding = 20
sent_prog = tqdm.tqdm(desc="Send Requests", total=len(eval_features))
recv_prog = tqdm.tqdm(desc="Recv Requests", total=len(eval_features))
def process_outstanding(do_wait):
if (len(outstanding) == 0):
return
ready_id = ctx.get_ready_async_request(do_wait)
if (ready_id is None):
return
# If we are here, we got an id
result = ctx.get_async_run_results(ready_id, False)
stop = time.time()
if (result is None):
raise ValueError("Context returned null for async id marked as done")
outResult = outstanding.pop(ready_id)
time_list.append(stop - outResult.start_time)
batch_count = len(outResult.inputs[label_id_key])
for i in range(batch_count):
unique_id = int(outResult.inputs[label_id_key][i][0])
start_logits = [float(x) for x in result["start_logits"][i].flat]
end_logits = [float(x) for x in result["end_logits"][i].flat]
all_results.append(
RawResult(
unique_id=unique_id,
start_logits=start_logits,
end_logits=end_logits))
recv_prog.update(n=batch_count)
all_results = []
time_list = []
print("Starting Sending Requests....\n")
all_results_start = time.time()
for inputs_dict in batch(eval_features, batch_size):
present_batch_size = len(inputs_dict[label_id_key])
outputs_dict = {'start_logits': InferContext.ResultFormat.RAW,
'end_logits': InferContext.ResultFormat.RAW}
start = time.time()
async_id = ctx.async_run(inputs_dict, outputs_dict, batch_size=present_batch_size)
outstanding[async_id] = PendingResult(async_id=async_id, start_time=start, inputs=inputs_dict)
sent_prog.update(n=present_batch_size)
# Try to process at least one response per request
process_outstanding(len(outstanding) >= max_outstanding)
tqdm.tqdm.write("All Requests Sent! Waiting for responses. Outstanding: {}.\n".format(len(outstanding)))
# Now process all outstanding requests
while (len(outstanding) > 0):
process_outstanding(True)
all_results_end = time.time()
all_results_total = (all_results_end - all_results_start) * 1000.0
print("-----------------------------")
print("Individual Time Runs - Ignoring first two iterations")
print("Total Time: {} ms".format(all_results_total))
print("-----------------------------")
print("-----------------------------")
print("Total Inference Time = %0.2f for"
"Sentences processed = %d" % (sum(time_list), len(eval_features)))
print("Throughput Average (sentences/sec) = %0.2f" % (len(eval_features) / all_results_total * 1000.0))
print("-----------------------------")
time_list.sort()
avg = np.mean(time_list)
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
print("-----------------------------")
print("Summary Statistics")
print("Batch size =", FLAGS.predict_batch_size)
print("Sequence Length =", FLAGS.max_seq_length)
print("Latency Confidence Level 95 (ms) =", cf_95 * 1000)
print("Latency Confidence Level 99 (ms) =", cf_99 * 1000)
print("Latency Confidence Level 100 (ms) =", cf_100 * 1000)
print("Latency Average (ms) =", avg * 1000)
print("-----------------------------")
output_prediction_file = os.path.join(FLAGS.output_dir, "predictions.json")
output_nbest_file = os.path.join(FLAGS.output_dir, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(FLAGS.output_dir, "null_odds.json")
write_predictions(eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file)
if __name__ == "__main__":
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
run_client()

View file

@ -1,146 +0,0 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set the number of devices to use
export NVIDIA_VISIBLE_DEVICES=0
# Always need to be overwriting models to keep memory use low
export TRTIS_MODEL_OVERWRITE=True
bert_model=${1:-small}
seq_length=${2:-128}
precision=${3:-fp16}
init_checkpoint=${4:-"/results/models/bert_tf_${bert_model}_${precision}_${seq_length}_v1/model.ckpt-5474"}
MODEL_NAME="bert_${bert_model}_${seq_length}_${precision}"
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
else
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
fi
doc_stride=128
use_xla=true
EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DIR} 1 ${MODEL_NAME}"
PERF_CLIENT_ARGS="1000 10 20 localhost"
# Start Server
bash trtis/scripts/launch_server.sh $precision
# Restart Server
restart_server() {
docker kill trt_server_cont
bash trtis/scripts/launch_server.sh $precision
}
############## Dynamic Batching Comparison ##############
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=1
TRTIS_ENGINE_COUNT=1
# Dynamic batching 10 ms
TRTIS_DYN_BATCHING_DELAY=10
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 5 ms
TRTIS_DYN_BATCHING_DELAY=5
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 2 ms
TRTIS_DYN_BATCHING_DELAY=2
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Static Batching (i.e. Dynamic batching 0 ms)
TRTIS_DYN_BATCHING_DELAY=0
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# ############## Engine Count Comparison ##############
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRTIS_DYN_BATCHING_DELAY=0
# Engine Count = 4
TRTIS_ENGINE_COUNT=4
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 2
TRTIS_ENGINE_COUNT=2
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 1
TRTIS_ENGINE_COUNT=1
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
############## Batch Size Comparison ##############
# BATCH=1 Generate model and perf
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRTIS_ENGINE_COUNT=1
TRTIS_DYN_BATCHING_DELAY=0
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
# BATCH=2 Generate model and perf
SERVER_BATCH_SIZE=2
CLIENT_BATCH_SIZE=2
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
# BATCH=4 Generate model and perf
SERVER_BATCH_SIZE=4
CLIENT_BATCH_SIZE=4
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
# BATCH=8 Generate model and perf
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=8
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost

View file

@ -149,10 +149,11 @@ class InputFeatures(object):
self.end_position = end_position
self.is_impossible = is_impossible
def read_squad_examples(input_file, is_training, version_2_with_negative=False):
"""Read a SQuAD json file into a list of SquadExample."""
with tf.gfile.Open(input_file, "r") as reader:
input_data = json.load(reader)["data"]
def read_squad_examples(input_file, is_training, version_2_with_negative=False, input_data=None):
"""Return list of SquadExample from input_data or input_file (SQuAD json file)"""
if input_data is None:
with tf.gfile.Open(input_file, "r") as reader:
input_data = json.load(reader)["data"]
def is_whitespace(c):
if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:

View file

@ -0,0 +1,55 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dllogger import Logger, StdOutBackend, JSONStreamBackend, Verbosity
import numpy
class dllogger_class():
def format_step(self, step):
if isinstance(step, str):
return step
elif isinstance(step, int):
return "Iteration: {} ".format(step)
elif len(step) > 0:
return "Iteration: {} ".format(step[0])
else:
return ""
def __init__(self, log_path="bert_dllog.json"):
self.logger = Logger([
StdOutBackend(Verbosity.DEFAULT, step_format=self.format_step),
JSONStreamBackend(Verbosity.VERBOSE, log_path),
])
self.logger.metadata("mlm_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("nsp_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("avg_loss_step", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("total_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("f1", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("precision", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("recall", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("mcc", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("exact_match", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata(
"throughput_train",
{"unit": "seq/s", "format": ":.3f", "GOAL": "MAXIMIZE", "STAGE": "TRAIN"},
)
self.logger.metadata(
"throughput_inf",
{"unit": "seq/s", "format": ":.3f", "GOAL": "MAXIMIZE", "STAGE": "VAL"},
)