Merge pull request #277 from NVIDIA/nvpstr/release

Updating BERT/TF
This commit is contained in:
nvpstr 2019-11-04 23:33:38 +01:00 committed by GitHub
commit 4760c03a04
42 changed files with 5219 additions and 446 deletions


@ -15,6 +15,7 @@
.git/
__pycache__/
results/
data/binary
data/download
data/extracted
data/formatted_one_article_per_line


@ -23,6 +23,7 @@ COPY --from=trt /workspace/build/perf_client /workspace/build/perf_client
COPY --from=trt /workspace/build/dist/dist/tensorrtserver*.whl /tmp/
RUN pip install /tmp/tensorrtserver*.whl && rm /tmp/tensorrtserver*.whl
WORKDIR /workspace/bert
COPY . .


@ -28,9 +28,7 @@ This repository provides a script and recipe to train the BERT model for TensorF
* [Multi-node](#multi-node)
* [Inference process](#inference-process)
* [Deploying the BERT model using TensorRT Inference Server](#deploying-the-bert-model-using-tensorrt-inference-server)
* [Performance analysis for TensorRT Inference Server](#performance-analysis-for-tensorrt-inference-server)
* [Advanced Details](#advanced-details)
* [Running the TensorRT Inference Server and client](#running-the-tensorrt-inference-server-and-client)
* [BioBERT](#biobert)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
@ -138,6 +136,8 @@ Multi-GPU training with Horovod - Our model uses Horovod to implement efficient
[LAMB](https://arxiv.org/pdf/1904.00962.pdf) stands for Layerwise Adaptive Moments based optimizer; it is a large-batch optimization technique that helps accelerate training of deep neural networks using large minibatches. It allows using a global batch size of 65536 and 32768 for sequence lengths 128 and 512 respectively, compared to a batch size of 256 for Adam. The optimized implementation accumulates gradients from 1024 batches in phase 1 and 4096 batches in phase 2 before updating the weights once. This results in a 27% training speedup on a single DGX2 node. On multi-node systems, LAMB allows scaling up to 1024 GPUs, resulting in training speedups of up to 17x in comparison to [Adam](https://arxiv.org/pdf/1412.6980.pdf). Adam has limitations on the learning rate that can be used since it is applied globally to all parameters, whereas LAMB follows a layerwise learning rate strategy.
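As a quick sanity check, the effective global batch size is the per-GPU batch size times the number of gradient accumulation steps times the number of GPUs. The sketch below uses the phase 1 values for a single DGX2 node from the pre-training table further down; the values are purely illustrative:
```bash
# Effective global batch size = per-GPU batch * accumulation steps * number of GPUs.
# Example values: phase 1 settings for a single DGX2 node (16 GPUs).
BATCH_PER_GPU=64
ACCUM_STEPS=64
NUM_GPUS=16
echo $(( BATCH_PER_GPU * ACCUM_STEPS * NUM_GPUS ))   # prints 65536
```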
NVLAMB adds the necessary tweaks to [LAMB version 1](https://arxiv.org/abs/1904.00962v1) to ensure correct convergence. The algorithm is as follows:
![NVLAMB](data/images/images_nvlamb.png)
### Mixed precision training
@ -183,7 +183,7 @@ The following section lists the requirements in order to start training the BERT
This repository contains a `Dockerfile` that extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [TensorFlow 19.06-py3+](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
- [TensorFlow 19.08-py3+](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
@ -227,7 +227,7 @@ bash scripts/data_download.sh
The script launches a Docker container with the current directory mounted and downloads the datasets to a `data/` folder on the host.
Note: The dataset is 170GB+ and takes 15+ hours to download. Expired dataset links are ignored during data download.
Note: The dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server can sometimes be overloaded or contain broken links, resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading at a later time. Expired dataset links are ignored during data download.
4. Download the pretrained models from NGC.
@ -617,102 +617,13 @@ I0312 23:14:00.550973 140287431493376 run_squad.py:1397] 0 Inference Performance
### Deploying the BERT model using TensorRT Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `TensorRT Inference Server` can be found in the subfolder `./trtis/README.md`.
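As a quick way to confirm the server is up before sending requests, you can query its HTTP endpoint. The port and the v1 status API path below are assumptions based on the default TRTIS configuration of that era; adjust them to your deployment:
```bash
# Assumes the server was started with the default HTTP port (8000) exposed.
curl localhost:8000/api/status   # prints server and model status (TRTIS v1 API)
```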
A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
4. Request is placed on the queue (Server Queue)
5. Request is removed from the queue and computed (Server Compute)
6. Completed request is serialized in a message and sent back to the client (Server Send)
7. Completed message travels over network from the server to the client (Network)
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
### BioBERT
Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time compared to steps 5-6. Since backend deep learning systems like BERT are rarely exposed directly to end users and instead only interface with local front-end servers, for the sake of BERT we can consider all clients to be local.
In this section, we will go over how to launch the TensorRT Inference Server and client, and how to arrive at the best-performing solution that fits your specific application needs.
Many works, including [BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [SciBERT](https://arxiv.org/pdf/1903.10676.pdf), [NCBI-BERT](https://arxiv.org/pdf/1906.05474.pdf), [ClinicalBERT (MIT)](https://arxiv.org/pdf/1904.03323.pdf), [ClinicalBERT (NYU, Princeton)](https://arxiv.org/pdf/1904.05342.pdf), and others at the [BioNLP19 workshop](https://aclweb.org/aclwiki/BioNLP_Workshop), show that pre-training BERT on a large biomedical text corpus such as [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) results in better performance on biomedical text-mining tasks.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
#### Performance analysis for TensorRT Inference Server
Based on Figures 2 and 3 below, we recommend using the Dynamic Batcher with `max_batch_size = 8`, `max_queue_delay_microseconds` as large as possible to fit within your latency window (the values used below are extremely large to exaggerate their effect), and only 1 instance of the engine. The largest improvements to both throughput and latency come from increasing the batch size due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of simultaneous requests, while also keeping latency down for infrequent requests. We recommend only 1 instance of the engine because of the negligible improvement to throughput at the cost of significant increases in latency. Many models can benefit from multiple engine instances, but as the figures below show, that is not the case for this model.
![](data/images/trtis_base_summary.png?raw=true)
Figure 2: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in TensorRT Inference Server
![](data/images/trtis_large_summary.png?raw=true)
Figure 3: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in TensorRT Inference Server
##### Advanced Details
This section digs deeper into the performance numbers and configurations corresponding to running TensorRT Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
```bash
bash scripts/trtis/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
```
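For example, a hypothetical invocation for BERT Large at sequence length 128 in FP16 might look like the following; the argument values (`large`, `fp16`, the checkpoint path) are illustrative assumptions, so check the script itself for the exact values it accepts:
```bash
bash scripts/trtis/generate_figures.sh large 128 fp16 /results/model.ckpt
```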
All results below are obtained on a single DGX-1 V100 32GB GPU for BERT Base, Sequence Length = 128 and FP16 precision running on a local server. Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. X-axis indicates the concurrency - the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput.
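The same kind of sweep can also be run by hand with `perf_client`. The sketch below assumes the server is reachable on localhost and that the model was exported under the name `bert`; the path follows the Dockerfile above, and the flag names are assumptions that should be verified against `perf_client --help` for your TRTIS version:
```bash
# Sweep client concurrency from 1 to 8 at batch size 1 against a local server
# (model name, port, and flags are assumptions for this sketch).
/workspace/build/perf_client \
    -m bert \
    -b 1 \
    --concurrency-range 1:8 \
    -u localhost:8000
```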
###### Maximum batch size
As we can see in Figure 4, the throughput at BS=1, Client Concurrent Requests = 64 is 119, and in Figure 5, the throughput at BS=8, Client Concurrent Requests = 8 is 517, giving a speedup of ~4.3x.
Note: We compare BS=1, Client Concurrent Requests = 64 to BS=8, Client Concurrent Requests = 8 to keep the Total Number of Outstanding Requests equal between the two different modes, where Total Number of Outstanding Requests = Batch Size * Client Concurrent Requests. This is also why there are 8 times as many bars on the BS=1 chart as on the BS=8 chart.
Increasing the batch size from 1 to 8 results in an increase in compute time by 1.8x (8.38ms to 15.46ms) showing that computation is more efficient at higher batch sizes. Hence, an optimal batch size would be the maximum batch size that can both fit in memory and is within the preferred latency threshold.
![](data/images/trtis_bs_1.png?raw=true)
Figure 4: Latency & Throughput vs Concurrency at Batch size = 1
![](data/images/trtis_bs_8.png?raw=true)
Figure 5: Latency & Throughput vs Concurrency at Batch size = 8
###### Batching techniques
Static batching is a feature of the inference server that allows inference requests to be served as they are received. It is preferred in scenarios where low latency is desired at the cost of throughput when the GPU is underutilized.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and `preferred_batch_size` to indicate your optimal batch sizes in the TensorRT Inference Server model config.
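A minimal sketch of what these settings look like in a model `config.pbtxt` is shown below; the model name, platform, and specific values are illustrative assumptions rather than the exact config produced by the scripts in this repository:
```bash
# Write an example TensorRT Inference Server model config with dynamic batching enabled.
cat > model_repo/bert/config.pbtxt <<'EOF'
name: "bert"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
EOF
```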
Figures 6 and 7 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency as the requests are queued up to `max_queue_delay_microseconds`. The effect of `preferred_batch_size` for dynamic batching is visually depicted by the dip in Server Queue time at integer multiples of the preferred batch sizes. At higher numbers of concurrent requests, observe that the throughput approaches a maximum limit as GPU utilization saturates.
![](data/images/trtis_static.png?raw=true)
Figure 6: Latency & Throughput vs Concurrency using Static Batching at `Batch size` = 1
![](data/images/trtis_dynamic.png?raw=true)
Figure 7: Latency & Throughput vs Concurrency using Dynamic Batching at `Batch size` = 1, `preferred_batch_size` = [4, 8] and `max_queue_delay_microseconds` = 5000
###### Model execution instance count
TensorRT Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesn't saturate the GPU, allowing the GPU to run multiple instances of the model in parallel.
Figures 8 and 9 show a drop in queue time as more models are available to serve an inference request. However, this is countered by an increase in compute time as multiple models compete for resources. Since BERT is a large model that utilizes the majority of the GPU, the benefit of running multiple engines is not seen.
![](data/images/trtis_ec_1.png?raw=true)
Figure 8: Latency & Throughput vs Concurrency at Batch size = 1, Engine Count = 1
(One copy of the model loaded in GPU memory)
![](data/images/trtis_ec_4.png?raw=true)
Figure 9: Latency & Throughput vs Concurrency at Batch size = 1, Engine count = 4
(Four copies of the model loaded in GPU memory)
#### Running the TensorRT Inference Server and client
The `run_trtis.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state, runs the client, and then evaluates the validity of predictions on the basis of exact match and F1 score, all in one step.
```bash
bash scripts/trtis/run_trtis.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <trtis_version_name> <trtis_model_name> <trtis_export_model> <trtis_dyn_batching_delay> <trtis_engine_count> <trtis_model_overwrite>
```
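For reference, a hypothetical invocation for BERT Large on SQuAD v1.1 in FP16 could look like the following; every value here is an assumption used for illustration, so consult the script for the exact arguments and defaults it expects:
```bash
bash scripts/trtis/run_trtis.sh /results/model.ckpt 8 fp16 true 384 128 large 1.1 1 bert 1 0 1 false
```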
More information on how to download a biomedical corpus and pre-train as well as fine-tune for biomedical tasks can be found in the subfolder `./biobert/README.md`.
## Performance
@ -736,10 +647,10 @@ This script runs 2 epochs by default on the SQuAD v1.1 dataset and extracts perf
Inference benchmarking can be performed by running the script:
``` bash
scripts/finetune_inference_benchmark.sh <bert_model> <use_xla> squad
scripts/finetune_inference_benchmark.sh <bert_model> squad
```
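For example, to benchmark BERT Large fine-tuning inference on SQuAD you might run the following; the `large` value is an assumption about the model names the script accepts:
```bash
bash scripts/finetune_inference_benchmark.sh large squad
```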
This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved at `/results/squad_train_benchmark_bert_<bert_model>.log`.
This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 with XLA and FP32 without XLA. These numbers are saved at `/results/squad_train_benchmark_bert_<bert_model>.log`.
### Results
@ -747,7 +658,6 @@ The following sections provide details on how we achieved our performance and ac
#### Training accuracy results
##### Training accuracy
###### Pre-training accuracy: single-node
@ -759,32 +669,32 @@ Our results were obtained by running the `scripts/run_pretraining_lamb.sh` train
| DGX1 | 8 | 16, 2 | 512, 2048 | 247.51 | 1.43 |
| DGX2 | 16 | 64, 8 | 64, 256 | 108.16 | 1.58 |
###### Pre-training accuracy: multi-node
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container.
| **DGX System** | **Nodes** | **Precision** | **Batch Size/GPU: Phase1, Phase2** | **Accumulation Steps: Phase1, Phase2** | **Time to Train (Hrs)** | **Final Loss**|
| **DGX System** | **Nodes** | **Precision** | **Batch Size/GPU: Phase1, Phase2** | **Accumulation Steps: Phase1, Phase2** | **Time to Train (Hrs)** | **Final Loss** |
|----------------|-----------|---------------|------------------------------------|----------------------------------------|----------------|-------------------------|
| DGX1 | 4 | FP16 | 32, 2 | 32, 128 | 48.66 | 1.48 |
| DGX1 | 16 | FP16 | 32, 2 | 32, 128 | 24.35 | 1.53 |
| DGX1 | 32 | FP16 | 32, 2 | 32, 128 | 12.98 | 1.61 |
| DGX1 | 32 | FP32 | 32, 2 | 32, 128 | 30.92 | 1.49 |
| DGX2H | 4 | FP16 | 64, 8 | 16, 64 | 25.85 | 1.56 |
| DGX2H | 16 | FP16 | 64, 8 | 8, 32 | 7.9 | 1.57 |
| DGX2H | 32 | FP16 | 64, 8 | 4, 16 | 4.77 | 1.61 |
| DGX2H | 32 | FP32 | 32, 4 | 8, 32 | 12.72 | 1.53 |
| DGX1 | 4 | FP16 | 16, 4 |128, 256| 62.49 | 1.72 |
| DGX1 | 16 | FP16 | 16, 4 | 32, 64 | 16.58 | 1.76 |
| DGX1 | 32 | FP16 | 16, 2 | 16, 64 | 9.85 | 1.71 |
| DGX2H | 1 | FP16 | 32, 8 |128, 256| 69.27 | 1.59 |
| DGX2H | 4 | FP16 | 32, 8 | 32, 64 | 22.17 | 1.60 |
| DGX2H | 16 | FP16 | 32, 8 | 8, 16 | 6.25 | 1.56 |
| DGX2H | 32 | FP16 | 32, 8 | 4, 8 | 3.73 | 1.58 |
| DGX2H | 64 | FP16 | 32, 8 | 2, 4 | 2.44 | 1.64 |
| DGX2H | 64 | FP32 | 32, 4 | 2, 8 | 5.76 | 1.66 |
Note: Time to train includes up to 16 minutes of start-up time for every restart. Experiments were run on clusters with a maximum wall clock time of 8 hours and 2 hours for DGX1 and DGX2H systems respectively.
Note: Time to train includes up to 16 minutes of start-up time for every restart. Experiments were run on clusters with a maximum wall clock time of 8 hours.
###### Fine-tuning accuracy for SQuAD: NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to Train - FP32 (Hrs)** | **Time to Train - mixed precision (Hrs)** |
|:---:|:----:|:----:|:---:|:----:|:----:|
| 16 | 4 |90.94|90.84|0.38|0.27|
| 16 | 4 |90.94|90.84|0.44|0.16|
##### Training stability test
@ -818,17 +728,17 @@ The following tables compare `F1` scores across 5 different training runs with d
###### Pre-training training performance: single-node on 16G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.
| **GPUs** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 16, 8 | 80.1 | 23.1 | 3.47 | 1 | 1 |
| 4 | 128 | 16, 8 | 282.1 | 85 | 3.32 | 3.52 | 3.68 |
| 8 | 128 | 16, 8 | 540.4 | 166.1 | 3.25 | 6.75 | 7.19 |
| 1 | 512 | 4, 2 | 10.9 | 5.3 | 2.06 | 1 | 1 |
| 4 | 512 | 4, 2 | 35.6 | 19.5 | 1.83 | 3.27 | 3.68 |
| 8 | 512 | 4, 2 | 61.1 | 37.9 | 1.61 | 5.61 | 7.15 |
| 1 | 128 | 16,8 | 91.30 | 23.90 | 3.82 | 1.00 | 1.00 |
| 4 | 128 | 16,8 | 297.70 | 86.90 | 3.43 | 3.26 | 3.64 |
| 8 | 128 | 16,8 | 578.60 | 167.80 | 3.45 | 6.34 | 7.02 |
| 1 | 512 | 4,1 | 20.00 | 4.00 | 5.00 | 1.00 | 1.00 |
| 4 | 512 | 4,1 | 66.80 | 13.50 | 4.95 | 3.34 | 3.38 |
| 8 | 512 | 4,1 | 129.50 | 26.30 | 4.92 | 6.48 | 6.58 |
Note: The respective values for FP32 runs that use a batch size of 16, 4 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
@ -838,29 +748,26 @@ Our results were obtained by running the `run.sub` training script in the Tensor
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 16,8 | 440.3 | 167.9 | 2.62 | 1.00 | 1.00 |
| 4 | 128 | 16,8 | 1712.3 | 600.7 | 2.85 | 3.89 | 3.58 |
| 16 | 128 | 16,8 | 4833.5 | 2186.2 | 2.21 | 10.98 | 13.02 |
| 32 | 128 | 16,8 | 9742.9 | 4020.9 | 2.42 | 22.13 | 23.95 |
| 1 | 512 | 2,1 | 74.9 | 26 | 2.88 | 0.00 | 0.00 |
| 4 | 512 | 2,1 | 257.5 | 91.2 | 2.82 | 1.00 | 1.00 |
| 16 | 512 | 2,1 | 899.7 | 313 | 2.87 | 3.44 | 3.51 |
| 32 | 512 | 2,1 | 1737.1 | 579.4 | 3.0 | 23.19 | 22.28 |
| 1 | 128 | 16,4 | 571.877 | 109.366 | 5.229019988 | 1.00 | 1.00 |
| 4 | 128 | 16,4 | 2028.85 | 386.23 | 5.252958082 | 3.55 | 3.53 |
| 16 | 128 | 16,4 | 7299.88 | 1350.49 | 5.405356574 | 12.76 | 12.35 |
| 32 | 128 | 16,4 | 13917.37 | 2555.25 | 5.446578613 | 24.34 | 23.36 |
| 1 | 512 | 4,1 | 128.94 | 25.65 | 5.026900585 | 1.00 | 1.00 |
| 4 | 512 | 4,1 | 466 | 92.36 | 5.045474231 | 3.61 | 3.60 |
| 16 | 512 | 4,1 | 1632 | 325.22 | 5.018141566 | 12.66 | 12.68 |
| 32 | 512 | 4,1 | 3076 | 614.18 | 5.008303755 | 23.86 | 23.94 |
Note: The respective values for FP32 runs that use a batch size of 16, 2 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
###### Fine-tuning training performance for SQuAD on 16G
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
| **GPUs** | **Batch size / GPU: mixed precision, FP32** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 2 | 7.19 |14.37|2.0 |1.0 |1.0 |
| 4 | 2 |25.61 |40.44|1.58|3.56|2.81|
| 8 | 2 |49.79 |74.61|1.5 |6.92|5.19|
| 1 | 3 | - |17.2 | - | - |1.0 |
| 4 | 3 | - |50.71| - | - |2.95|
| 8 | 3 | - |91.88| - | - |5.34|
| 1 | 3,2 | 7.35 | 17.17 | 2.336054422 | 1.00 | 1.00 |
| 4 | 3,2 | 26.38 | 50.68 | 1.921152388 | 3.59 | 2.95 |
| 8 | 3,2 | 50.17 | 89.98 | 1.793502093 | 6.83 | 5.24 |
Note: The respective values for FP32 runs that use a batch size of 3 are not available due to out-of-memory errors. A batch size of 3 is only available when using FP16.
@ -870,32 +777,29 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training training performance: single-node on 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **GPUs** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 48,32 | 130.2 | 33.5 | 3.89 | 1 | 1 |
| 4 | 128 | 48,32 | 462.1 | 127.7 | 3.62 | 3.55 | 3.81 |
| 8 | 128 | 48,32 | 874.8 | 255.4 | 3.43 | 6.72 | 7.62 |
| 1 | 512 | 8, 4 | 22.1 | 6.3 | 3.51 | 1 | 1 |
| 4 | 512 | 8, 4 | 80.4 | 24 | 3.35 | 3.64 | 3.81 |
| 8 | 512 | 8, 4 | 155 | 47.1 | 3.29 | 7.01 | 7.48 |
| 1 | 128 | 48,32 | 140.30 | 34.30 | 4.09 | 1.00 | 1.00 |
| 4 | 128 | 48,32 | 504.40 | 131.70 | 3.83 | 3.60 | 3.84 |
| 8 | 128 | 48,32 | 986.80 | 260.10 | 3.79 | 7.03 | 7.58 |
| 1 | 512 | 8,4 | 25.60 | 6.50 | 3.94 | 1.00 | 1.00 |
| 4 | 512 | 8,4 | 89.90 | 24.70 | 3.64 | 3.51 | 3.80 |
| 8 | 512 | 8,4 | 176.70 | 48.60 | 3.64 | 6.90 | 7.48 |
Note: The respective values for FP32 runs that use a batch size of 48, 8 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
###### Fine-tuning training performance for SQuAD on 32G
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
| **GPUs** | **Batch size / GPU: mixed precision, FP32** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 4 | 8.74|20.55 |2.35|1.0 |1.0 |
| 4 | 4 |32.22|57.58 |1.79|3.69|2.81|
| 8 | 4 |62.69|100.22|1.60|7.17|4.88|
| 1 | 10| - |31.33 | - | - |1.0 |
| 4 | 10| - |94.19 | - | - |3.0|
| 8 | 10| - |155.53| - | - |4.96|
| 1 | 10,4 | 9 | 33.79 | 3.754444444 | 1.00 | 1.00 |
| 4 | 10,4 | 32.5 | 103.38 | 3.180923077 | 3.61 | 3.06 |
| 8 | 10,4 | 63.54 | 172.46 | 2.714195782 | 7.06 | 5.10 |
Note: The respective values for FP32 runs that use a batch size of 10 are not available due to out-of-memory errors. A batch size of 10 is only available when using FP16.
@ -905,52 +809,50 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training training performance: single-node on DGX-2 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **GPUs** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 48,32 | 141.3 | 35.8 | 3.946927374 | 1 | 1 |
| 4 | 128 | 48,32 | 520.4 | 138.8 | 3.749279539 | 3.68 | 3.88 |
| 8 | 128 | 48,32 | 1024 | 275.1 | 3.722282806 | 7.25 | 7.68 |
| 16| 128 | 48,32 | 1907 | 533 | 3.577861163 | 13.5 | 14.89 |
| 1 | 512 | 8, 4 | 23.9 | 6.8 | 3.514705882 | 1 | 1 |
| 4 | 512 | 8, 4 | 89.8 | 25.8 | 3.480620155 | 3.76 | 3.79 |
| 8 | 512 | 8, 4 | 177.2 | 51 | 3.474509804 | 7.41 | 7.5 |
| 16| 512 | 8, 4 | 332.2 | 94.2 | 3.526539278 | 13.9 | 13.85 |
| 1 | 128 | 48,32 | 143.20 | 36.30 | 3.94 | 1.00 | 1.00 |
| 4 | 128 | 48,32 | 538.30 | 141.50 | 3.80 | 3.76 | 3.90 |
| 8 | 128 | 48,32 | 1057.30 | 281.30 | 3.76 | 7.38 | 7.75 |
| 16 | 128 | 48,32 | 1990.70 | 516.80 | 3.85 | 13.90 | 14.24 |
| 1 | 512 | 8,4 | 26.90 | 6.90 | 3.90 | 1.00 | 1.00 |
| 4 | 512 | 8,4 | 96.30 | 26.40 | 3.65 | 3.58 | 3.83 |
| 8 | 512 | 8,4 | 189.00 | 52.40 | 3.61 | 7.03 | 7.59 |
| 16 | 512 | 8,4 | 354.30 | 96.50 | 3.67 | 13.17 | 13.99 |
Note: The respective values for FP32 runs that use a batch size of 48, 8 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
###### Pre-training training performance: multi-node on DGX-2 32G
###### Pre-training training performance: multi-node on DGX-2H 32G
Our results were obtained by running the `run.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 32, 32 | 1806.7 | 599.3 | 3.01 | 1 | 1 |
| 4 | 128 | 32, 32 | 4088.7 | 1762.3 | 2.32 | 2.26 | 2.94 |
| 16 | 128 | 32, 32 | 14719.6 | 6400.2 | 2.30 | 8.15 | 10.68|
| 32 | 128 | 32, 32 | 27303.6 | 12203.6| 2.24 | 15.11| 20.36|
| 1 | 512 | 8, 4 | 269.7 | 109.6 | 2.46 | 1 | 1 |
| 4 | 512 | 8, 4 | 960.9 | 268.5 | 3.58 | 3.56 | 2.45 |
| 16 | 512 | 8, 4 | 3726.3 | 965 | 3.86 | 13.82| 8.8 |
| 32 | 512 | 8, 4 | 6192.7 | 1800.3 | 3.44 | 22.96| 16.43|
| 1 | 128 | 32,32 | 1758.32 | 602.22 | 2.92 | 1.00 | 1.00 |
| 4 | 128 | 32,32 | 6379.94 | 2261.10 | 2.82 | 3.63 | 3.75 |
| 16 | 128 | 32,32 | 23846.92 | 8875.42 | 2.69 | 13.56 | 14.74 |
| 32 | 128 | 32,32 | 46191.78 | 17445.53 | 2.65 | 26.27 | 28.97 |
| 64 | 128 | 32,32 | 89195.63 | 34263.71 | 2.60 | 50.73 | 56.90 |
| 1 | 512 | 8,4 | 383.35 | 109.97 | 3.49 | 1.00 | 1.00 |
| 4 | 512 | 8,4 | 1408.75 | 400.93 | 3.51 | 3.67 | 3.65 |
| 16 | 512 | 8,4 | 5344.10 | 1559.96 | 3.43 | 13.94 | 14.19 |
| 32 | 512 | 8,4 | 10323.75 | 3061.39 | 3.37 | 26.93 | 27.84 |
| 64 | 512 | 8,4 | 19766.57 | 6029.48 | 3.28 | 51.56 | 54.83 |
###### Fine-tuning training performance for SQuAD on DGX-2 32G
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
| **GPUs** | **Batch size / GPU: mixed precision, FP32** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|------|------|----|-----|-----|
| 1| 4 | 9.39 | 20.69 |2.20| 1.0 | 1.0 |
| 4| 4 | 34.63| 62.79|1.81| 3.69 | 3.03 |
| 8| 4 | 66.95|111.47|1.66| 7.13 | 5.39 |
| 16| 4 |126.09|179.09|1.42| 13.43 |8.66 |
| 1| 10| - | 32.72| - | - | 1.0 |
| 4| 10| - |100.73| - | - | 3.07 |
| 8| 10| - |168.92| - | - | 5.16 |
| 16| 10| - |249.54| - | - | 7.63 |
| 1 | 10,4 | 9.59 | 36.30 | 3.785192909 | 1.00 | 1.00 |
| 4 | 10,4 | 35.46 | 115.67 | 3.261985336 | 3.70 | 3.19 |
| 8 | 10,4 | 68.00 | 197.16 | 2.899411765 | 7.09 | 5.43 |
| 16 | 10,4 | 111.62 | 304.72 | 2.729976707 | 11.64 | 8.39 |
Note: The respective values for FP32 runs that use a batch size of 10 are not available due to out-of-memory errors. A batch size of 10 is only available when using FP16.
@ -967,63 +869,63 @@ Our results were obtained by running the `scripts/run_pretraining_lamb.sh` scrip
| **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** |
|:-----:|:-------:|:-------:|:-------:|:-------------:|
|128 |8, 8 |349.49 | 104.03 | 3.36 |
|128 |8, 8 |349.51 | 104.31 | 3.35 |
###### Fine-tuning inference performance for SQuAD on 16G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 89.4 | 1.19 | 11.19 | 11.29 | 11.44 | 11.71 |
| 128 | 2 | 162.29 | 1.56 | 12.32 | 12.5 | 12.57 | 12.74 |
| 128 | 4 | 263.44 | 2.24 | 15.18 | 15.32 | 15.54 | 17 |
| 128 | 8 | 374.33 | 2.98 | 21.37 | 21.56 | 21.72 | 23.23 |
| 384 | 1 | 64.57 | 1.87 | 15.49 | 15.61 | 15.73 | 16.18 |
| 384 | 2 | 94.04 | 2.47 | 21.27 | 21.34 | 21.4 | 21.9 |
| 384 | 4 | 118.81 | 2.96 | 33.67 | 33.89 | 34.37 | 36.18 |
| 384 | 8 | 137.65 | 3.26 | 58.12 | 58.53 | 59.34 | 61.32 |
| 128 | 1 | 95.87 | 1.433462919 | 10.43 | 10.61 | 10.71 | 11.27 |
| 128 | 2 | 168.02 | 1.871046771 | 11.9 | 12.08 | 12.18 | 12.32 |
| 128 | 4 | 263.08 | 2.617451 | 15.2 | 14.86 | 14.95 | 15.55 |
| 128 | 8 | 379.78 | 3.414366628 | 21.07 | 20.94 | 21.03 | 21.49 |
| 384 | 1 | 67.52 | 2.274932615 | 14.81 | 14.93 | 15.05 | 15.38 |
| 384 | 2 | 93.8 | 2.929419113 | 21.32 | 20.75 | 20.83 | 21.43 |
| 384 | 4 | 118.97 | 3.397201599 | 33.62 | 33.17 | 33.37 | 33.85 |
| 384 | 8 | 138.43 | 3.838879645 | 57.79 | 57 | 57.38 | 58.19 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 75.28 | 13.28 | 13.4 | 13.49 | 13.66 |
| 128 | 2 | 104.16 | 19.2 | 19.51 | 19.69 | 20.83 |
| 128 | 4 | 117.4 | 34.07 | 34.4 | 34.76 | 36.99 |
| 128 | 8 | 125.63 | 63.68 | 64.58 | 65.1 | 67.54 |
| 384 | 1 | 34.53 | 28.96 | 29.32 | 29.61 | 31.08 |
| 384 | 2 | 38.03 | 52.59 | 53.16 | 53.75 | 55.5 |
| 384 | 4 | 40.16 | 99.6 | 100.76 | 101.62 | 103.4 |
| 384 | 8 | 42.2 | 189.57 | 190.82 | 191.47 | 193.27 |
| 128 | 1 | 66.88 | 14.95 | 14.96 | 15.41 | 18.02 |
| 128 | 2 | 89.8 | 22.27 | 22.46 | 22.53 | 22.84 |
| 128 | 4 | 100.51 | 39.8 | 39.91 | 40.06 | 41.04 |
| 128 | 8 | 111.23 | 71.92 | 72.42 | 72.58 | 73.63 |
| 384 | 1 | 29.68 | 33.7 | 33.85 | 33.91 | 34.62 |
| 384 | 2 | 32.02 | 62.47 | 63.06 | 63.28 | 63.66 |
| 384 | 4 | 35.02 | 114.21 | 114.69 | 114.82 | 115.85 |
| 384 | 8 | 36.06 | 221.86 | 222.7 | 223.03 | 223.53 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 196.58 | 1.19 | 5.09 | 5.18 | 5.23 | 5.42 |
| 128 | 2 | 361.92 | 1.41 | 5.53 | 5.62 | 5.67 | 5.85 |
| 128 | 4 | 605.43 | 1.79 | 6.61 | 6.71 | 6.8 | 7.04 |
| 128 | 8 | 916 | 2.18 | 8.73 | 8.83 | 8.95 | 9.19 |
| 384 | 1 | 154.05 | 1.58 | 6.49 | 6.6 | 6.72 | 7.05 |
| 384 | 2 | 238.89 | 1.99 | 8.37 | 8.42 | 8.47 | 9.1 |
| 384 | 4 | 327.18 | 2.47 | 12.23 | 12.3 | 12.36 | 13.08 |
| 384 | 8 | 390.95 | 2.82 | 20.46 | 20.5 | 20.8 | 21.89 |
| 128 | 1 | 204.33 | 1.459187317 | 4.89 | 5.14 | 5.32 | 5.54 |
| 128 | 2 | 375.19 | 1.779501043 | 5.33 | 5.47 | 5.58 | 5.87 |
| 128 | 4 | 606.98 | 2.198645271 | 6.59 | 6.49 | 6.55 | 6.83 |
| 128 | 8 | 902.6 | 2.69023278 | 8.86 | 8.62 | 8.72 | 9.22 |
| 384 | 1 | 154.33 | 1.990070922 | 6.48 | 6.59 | 6.65 | 7.04 |
| 384 | 2 | 225.7 | 2.386087324 | 8.86 | 8.45 | 8.53 | 9.16 |
| 384 | 4 | 317.93 | 3.044431677 | 12.58 | 12.34 | 12.39 | 13.01 |
| 384 | 8 | 393.44 | 3.672547372 | 20.33 | 20.06 | 20.38 | 21.38 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 165.51 | 6.04 | 6.19 | 6.3 | 6.62 |
| 128 | 2 | 257.54 | 7.77 | 7.86 | 7.92 | 8.28 |
| 128 | 4 | 338.52 | 11.82 | 11.98 | 12.05 | 12.27 |
| 128 | 8 | 419.94 | 19.05 | 19.25 | 19.35 | 20.12 |
| 384 | 1 | 97.4 | 10.27 | 10.39 | 10.44 | 10.56 |
| 384 | 2 | 119.84 | 16.69 | 16.78 | 16.85 | 17.66 |
| 384 | 4 | 132.5 | 30.19 | 30.41 | 30.5 | 31.13 |
| 384 | 8 | 138.63 | 57.71 | 58.15 | 58.37 | 59.33 |
| 128 | 1 | 140.03 | 7.14 | 7.6 | 7.78 | 7.97 |
| 128 | 2 | 210.84 | 9.49 | 9.59 | 9.65 | 10.57 |
| 128 | 4 | 276.07 | 14.49 | 14.61 | 14.71 | 15.16 |
| 128 | 8 | 335.51 | 23.84 | 23.79 | 23.89 | 24.94 |
| 384 | 1 | 77.55 | 12.89 | 13.01 | 13.05 | 14.26 |
| 384 | 2 | 94.59 | 21.14 | 21.14 | 21.23 | 21.86 |
| 384 | 4 | 104.43 | 38.3 | 38.38 | 38.45 | 39.15 |
| 384 | 8 | 107.13 | 74.68 | 75.05 | 75.19 | 76.2 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
@ -1032,67 +934,67 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training inference performance on 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs.
| **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** |
|:-----:|:-------:|:-------:|:-------:|:-------------:|
|128 |8, 8 |304.88 | 100.88 | 3.02 |
|128 |8, 8 |345.50 | 101.84 | 3.39 |
###### Fine-tuning inference performance for SQuAD on 32G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 86.4 | 1.18 | 11.57 | 11.74 | 11.86 | 12.04 |
| 128 | 2 | 155.32 | 1.52 | 12.88 | 12.98 | 13.05 | 13.31 |
| 128 | 4 | 252.18 | 2.18 | 15.86 | 15.78 | 15.89 | 17.01 |
| 128 | 8 | 359.19 | 2.88 | 22.27 | 22.44 | 22.58 | 23.94 |
| 384 | 1 | 62.45 | 1.84 | 16.01 | 16.16 | 16.23 | 16.42 |
| 384 | 2 | 89.34 | 2.37 | 22.39 | 22.45 | 22.53 | 23.13 |
| 384 | 4 | 113.77 | 2.84 | 35.16 | 35.24 | 35.33 | 35.9 |
| 384 | 8 | 131.9 | 3.13 | 60.65 | 61 | 61.49 | 65.3 |
| 128 | 1 | 87.75 | 1.352913969 | 11.4 | 11.46 | 18.77 | 19.06 |
| 128 | 2 | 159.87 | 1.833161335 | 12.51 | 12.69 | 12.79 | 12.98 |
| 128 | 4 | 254.65 | 2.622014003 | 15.71 | 15.49 | 15.59 | 16.03 |
| 128 | 8 | 365.51 | 3.377783939 | 21.89 | 21.72 | 21.94 | 23.79 |
| 384 | 1 | 63.11 | 2.153924915 | 15.84 | 17.3 | 19.22 | 19.37 |
| 384 | 2 | 89.61 | 2.884132604 | 22.32 | 21.83 | 21.96 | 23.8 |
| 384 | 4 | 114.9 | 3.395390071 | 34.81 | 34.33 | 34.47 | 35.15 |
| 384 | 8 | 132.79 | 3.814708417 | 60.25 | 59.4 | 59.77 | 60.7 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 73.42 | 13.62 | 13.78 | 13.85 | 14.13 |
| 128 | 2 | 102.47 | 19.52 | 19.66 | 19.73 | 19.98 |
| 128 | 4 | 115.76 | 34.55 | 34.86 | 35.34 | 37.87 |
| 128 | 8 | 124.84 | 64.08 | 64.78 | 65.78 | 69.55 |
| 384 | 1 | 33.93 | 29.47 | 29.7 | 29.8 | 29.98 |
| 384 | 2 | 37.62 | 53.16 | 53.52 | 53.73 | 55.03 |
| 384 | 4 | 39.99 | 100.02 | 100.91 | 101.69 | 106.63 |
| 384 | 8 | 42.09 | 190.08 | 191.35 | 192.29 | 196.47 |
| 128 | 1 | 64.86 | 15.42 | 16.32 | 17.55 | 20.89 |
| 128 | 2 | 87.21 | 22.93 | 23.06 | 24.17 | 31.93 |
| 128 | 4 | 97.12 | 41.19 | 41.38 | 41.5 | 44.13 |
| 128 | 8 | 108.21 | 73.93 | 74.34 | 74.48 | 74.77 |
| 384 | 1 | 29.3 | 34.13 | 34.21 | 34.25 | 34.76 |
| 384 | 2 | 31.07 | 64.38 | 64.83 | 64.95 | 65.42 |
| 384 | 4 | 33.84 | 118.22 | 119.01 | 119.57 | 120.06 |
| 384 | 8 | 34.81 | 229.84 | 230.72 | 231.22 | 232.96 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 192.89 | 1.19 | 5.18 | 5.29 | 5.35 | 5.55 |
| 128 | 2 | 348.23 | 1.37 | 5.74 | 5.91 | 6.02 | 6.26 |
| 128 | 4 | 592.54 | 1.79 | 6.75 | 6.96 | 7.08 | 7.34 |
| 128 | 8 | 888.58 | 2.15 | 9 | 9.11 | 9.22 | 9.5 |
| 384 | 1 | 148.64 | 1.57 | 6.73 | 6.82 | 6.87 | 7.06 |
| 384 | 2 | 230.74 | 1.96 | 8.67 | 8.75 | 8.87 | 9.44 |
| 384 | 4 | 318.45 | 2.42 | 12.56 | 12.65 | 12.76 | 13.36 |
| 384 | 8 | 380.14 | 2.72 | 21.05 | 21.1 | 21.25 | 21.83 |
| 128 | 1 | 198.72 | 1.393352966 | 5.03 | 5.3 | 5.47 | 5.69 |
| 128 | 2 | 338.44 | 1.611158717 | 5.91 | 6.04 | 9.77 | 9.94 |
| 128 | 4 | 599.62 | 2.24804109 | 6.67 | 6.6 | 6.66 | 6.83 |
| 128 | 8 | 858.56 | 2.63370042 | 9.32 | 10.01 | 10.04 | 10.39 |
| 384 | 1 | 150.28 | 1.948146228 | 6.65 | 6.76 | 6.82 | 7.21 |
| 384 | 2 | 200.68 | 2.200197347 | 9.97 | 9.88 | 9.94 | 10.08 |
| 384 | 4 | 305.72 | 3.01707293 | 13.08 | 12.86 | 12.97 | 13.71 |
| 384 | 8 | 373.64 | 3.61249154 | 21.41 | 21.98 | 22.03 | 22.61 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 161.69 | 6.18 | 6.26 | 6.31 | 6.51 |
| 128 | 2 | 254.84 | 7.85 | 8 | 8.09 | 8.29 |
| 128 | 4 | 331.72 | 12.06 | 12.17 | 12.26 | 12.51 |
| 128 | 8 | 412.85 | 19.38 | 19.6 | 19.72 | 20.13 |
| 384 | 1 | 94.42 | 10.59 | 10.71 | 10.8 | 11.36 |
| 384 | 2 | 117.64 | 17 | 17.07 | 17.1 | 17.83 |
| 384 | 4 | 131.72 | 30.37 | 30.64 | 30.77 | 31.26 |
| 384 | 8 | 139.75 | 57.25 | 57.74 | 58.08 | 59.53 |
| 128 | 1 | 142.62 | 7.01 | 7.07 | 7.44 | 9.23 |
| 128 | 2 | 210.06 | 9.52 | 9.63 | 9.69 | 10.22 |
| 128 | 4 | 266.73 | 15 | 15.77 | 15.91 | 16.79 |
| 128 | 8 | 325.99 | 24.54 | 24.52 | 24.6 | 25 |
| 384 | 1 | 77.14 | 12.96 | 13.01 | 13.03 | 13.67 |
| 384 | 2 | 91.21 | 21.93 | 21.93 | 21.99 | 22.31 |
| 384 | 4 | 101.33 | 39.47 | 39.69 | 39.82 | 40.88 |
| 384 | 8 | 103.43 | 77.34 | 77.76 | 77.9 | 78.45 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
@ -1101,126 +1003,126 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training inference performance on DGX-2 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs.
| **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** |
|:-----:|:-------:|:-------:|:-------:|:-------------:|
|128 |8, 8 |350.63 | 106.36 | 3.30 |
|128 |8, 8 |366.24 | 107.88 | 3.39 |
###### Fine-tuning inference performance for SQuAD on DGX-2 32G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 79 | 1.18 | 12.66 | 13.13 | 13.36 | 14.49 |
| 128 | 2 | 151.28 | 1.52 | 13.22 | 13.66 | 13.89 | 14.84 |
| 128 | 4 | 250.41 | 2.18 | 15.97 | 16.13 | 16.3 | 17.81 |
| 128 | 8 | 369.76 | 2.88 | 21.64 | 21.88 | 22.08 | 26.35 |
| 384 | 1 | 61.66 | 1.84 | 16.22 | 16.46 | 16.62 | 17.26 |
| 384 | 2 | 91.54 | 2.37 | 21.85 | 22.11 | 22.3 | 23.44 |
| 384 | 4 | 121.04 | 2.84 | 33.05 | 33.08 | 33.31 | 34.97 |
| 384 | 8 | 142.03 | 3.13 | 56.33 | 56.46 | 57.49 | 59.85 |
| 128 | 1 | 96.22 | 1.371045882 | 10.39 | 10.78 | 10.9 | 11.43 |
| 128 | 2 | 171.66 | 1.835935829 | 11.65 | 11.86 | 12.04 | 12.45 |
| 128 | 4 | 262.89 | 2.566032211 | 15.22 | 15.13 | 15.24 | 15.91 |
| 128 | 8 | 394.23 | 3.441253492 | 20.29 | 20.22 | 20.6 | 22.19 |
| 384 | 1 | 69.69 | 2.278195489 | 14.35 | 14.39 | 14.58 | 15.68 |
| 384 | 2 | 96.35 | 2.909118357 | 20.76 | 20.25 | 20.32 | 21.54 |
| 384 | 4 | 124.06 | 3.42612538 | 32.24 | 31.87 | 32.14 | 33.02 |
| 384 | 8 | 144.28 | 3.876410532 | 55.45 | 54.77 | 55.16 | 55.93 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 70.1 | 14.27 | 14.6 | 14.84 | 15.38 |
| 128 | 2 | 101.3 | 19.74 | 20.09 | 20.27 | 20.77 |
| 128 | 4 | 122.19 | 32.74 | 32.99 | 33.39 | 36.76 |
| 128 | 8 | 134.09 | 59.66 | 60.36 | 61.79 | 69.33 |
| 384 | 1 | 34.52 | 28.97 | 29.28 | 29.46 | 31.78 |
| 384 | 2 | 39.84 | 50.21 | 50.61 | 51.53 | 54 |
| 384 | 4 | 42.79 | 93.48 | 94.73 | 96.52 | 104.37 |
| 384 | 8 | 45.91 | 174.24 | 175.34 | 176.59 | 183.76 |
| 128 | 1 | 70.18 | 14.25 | 14.7 | 14.88 | 15.35 |
| 128 | 2 | 93.5 | 21.39 | 21.83 | 22.04 | 22.85 |
| 128 | 4 | 102.45 | 39.04 | 39.28 | 39.42 | 40.5 |
| 128 | 8 | 114.56 | 69.83 | 70.5 | 70.74 | 72.78 |
| 384 | 1 | 30.59 | 32.69 | 33.14 | 33.32 | 33.86 |
| 384 | 2 | 33.12 | 60.38 | 60.91 | 61.12 | 61.67 |
| 384 | 4 | 36.21 | 110.46 | 111.1 | 111.26 | 112.15 |
| 384 | 8 | 37.22 | 214.95 | 215.69 | 216.13 | 217.96 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 172.33 | 1.19 | 5.8 | 5.94 | 6 | 6.27 |
| 128 | 2 | 315.17 | 1.37 | 6.35 | 6.64 | 6.78 | 7.07 |
| 128 | 4 | 549.36 | 1.79 | 7.28 | 7.47 | 7.6 | 8.05 |
| 128 | 8 | 872.67 | 2.15 | 9.17 | 9.33 | 9.5 | 9.92 |
| 384 | 1 | 138.52 | 1.57 | 7.22 | 7.45 | 7.52 | 7.84 |
| 384 | 2 | 222.05 | 1.96 | 9.01 | 9.11 | 9.24 | 10.94 |
| 384 | 4 | 314.47 | 2.42 | 12.72 | 12.87 | 13.01 | 14.42 |
| 384 | 8 | 392.32 | 2.72 | 20.39 | 20.44 | 20.67 | 22.16 |
| 128 | 1 | 207.01 | 1.455050257 | 4.83 | 5.23 | 5.38 | 5.59 |
| 128 | 2 | 405.92 | 1.808429119 | 4.93 | 4.99 | 5.04 | 5.2 |
| 128 | 4 | 646.8 | 2.258695349 | 6.18 | 6.06 | 6.14 | 6.55 |
| 128 | 8 | 909.41 | 2.616781285 | 8.8 | 8.86 | 8.96 | 9.52 |
| 384 | 1 | 153.97 | 1.959653812 | 6.49 | 6.88 | 7.01 | 7.2 |
| 384 | 2 | 229.46 | 2.366298855 | 8.72 | 8.57 | 8.67 | 8.97 |
| 384 | 4 | 333.2 | 3.078913325 | 12 | 11.74 | 11.85 | 12.86 |
| 384 | 8 | 403.02 | 3.646579805 | 19.85 | 19.83 | 20 | 21.11 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 161.69 | 6.18 | 6.26 | 6.31 | 6.51 |
| 128 | 2 | 254.84 | 7.85 | 8 | 8.09 | 8.29 |
| 128 | 4 | 331.72 | 12.06 | 12.17 | 12.26 | 12.51 |
| 128 | 8 | 412.85 | 19.38 | 19.6 | 19.72 | 20.13 |
| 384 | 1 | 94.42 | 10.59 | 10.71 | 10.8 | 11.36 |
| 384 | 2 | 117.64 | 17 | 17.07 | 17.1 | 17.83 |
| 384 | 4 | 131.72 | 30.37 | 30.64 | 30.77 | 31.26 |
| 384 | 8 | 139.75 | 57.25 | 57.74 | 58.08 | 59.53 |
| 128 | 1 | 142.27 | 7.03 | 7.39 | 7.45 | 11.7 |
| 128 | 2 | 224.46 | 8.91 | 9 | 9.08 | 9.66 |
| 128 | 4 | 286.36 | 13.97 | 14.46 | 14.52 | 14.82 |
| 128 | 8 | 347.53 | 23.02 | 23.23 | 23.4 | 24.12 |
| 384 | 1 | 78.57 | 12.73 | 13.01 | 13.1 | 14.06 |
| 384 | 2 | 96.97 | 20.62 | 21 | 21.15 | 21.82 |
| 384 | 4 | 108.22 | 36.96 | 37.05 | 37.18 | 38.12 |
| 384 | 8 | 110.52 | 72.38 | 73.06 | 73.32 | 74.64 |
##### Inference performance: NVIDIA Tesla T4 (1x T4 16G)
###### Fine-tuning inference performance for SQuAD on Tesla T4 16G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 53.56 | 1.18 | 18.67 | 20.22 | 20.31 | 20.49 |
| 128 | 2 | 95.39 | 1.52 | 20.97 | 22.86 | 23.15 | 23.73 |
| 128 | 4 | 137.44 | 2.18 | 29.1 | 30.34 | 30.62 | 31.5 |
| 128 | 8 | 166.19 | 2.88 | 48.14 | 49.38 | 49.73 | 50.86 |
| 384 | 1 | 34.28 | 1.84 | 29.17 | 30.58 | 30.77 | 31.28 |
| 384 | 2 | 41.89 | 2.37 | 47.74 | 49.05 | 49.34 | 50 |
| 384 | 4 | 47.15 | 2.84 | 84.83 | 86.79 | 87.41 | 88.73 |
| 384 | 8 | 50.28 | 3.13 | 159.11 | 161.75 | 162.85 | 165.72 |
| 128 | 1 | 54.53 | 1.552234557 | 18.34 | 19.09 | 19.28 | 21.74 |
| 128 | 2 | 95.59 | 2.521498285 | 20.92 | 21.86 | 22.61 | 23.33 |
| 128 | 4 | 133.2 | 3.434760186 | 30.03 | 30.32 | 30.43 | 31.06 |
| 128 | 8 | 168.85 | 4.352926012 | 47.38 | 48.21 | 48.56 | 49.25 |
| 384 | 1 | 33.58 | 2.87008547 | 29.78 | 30.3 | 30.46 | 31.69 |
| 384 | 2 | 41.31 | 3.576623377 | 48.41 | 49.03 | 49.26 | 50.04 |
| 384 | 4 | 47.08 | 3.94635373 | 84.96 | 86.88 | 87.38 | 88.3 |
| 384 | 8 | 50.08 | 4.254885302 | 159.76 | 162.37 | 163.23 | 165.79 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 40.34 | 24.79 | 26.97 | 27.38 | 28.21 |
| 128 | 2 | 45.17 | 44.27 | 46.01 | 46.6 | 47.68 |
| 128 | 4 | 47.39 | 84.41 | 86.31 | 86.92 | 88.14 |
| 128 | 8 | 46.98 | 170.29 | 173.35 | 174.15 | 175.48 |
| 384 | 1 | 14.07 | 71.06 | 73 | 73.42 | 73.99 |
| 384 | 2 | 14.91 | 134.17 | 136.72 | 137.51 | 138.66 |
| 384 | 4 | 14.44 | 277.03 | 281.89 | 282.63 | 284.41 |
| 384 | 8 | 14.95 | 534.94 | 540.45 | 542.32 | 544.75 |
| 128 | 1 | 35.13 | 28.46 | 29.89 | 30.12 | 30.6 |
| 128 | 2 | 37.91 | 52.76 | 54.01 | 54.29 | 54.84 |
| 128 | 4 | 38.78 | 103.14 | 105.39 | 106.05 | 107.4 |
| 128 | 8 | 38.79 | 206.22 | 209.63 | 210.2 | 211.5 |
| 384 | 1 | 11.7 | 85.5 | 87.18 | 87.43 | 88 |
| 384 | 2 | 11.55 | 173.19 | 176.13 | 177.02 | 178.4 |
| 384 | 4 | 11.93 | 335.41 | 340.26 | 341.76 | 343.54 |
| 384 | 8 | 11.77 | 679.77 | 686.01 | 686.79 | 689.24 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 107.3 | 1.19 | 9.32 | 10.18 | 10.32 | 11.48 |
| 128 | 2 | 185.18 | 1.37 | 10.8 | 11.71 | 12.11 | 12.35 |
| 128 | 4 | 335.47 | 1.79 | 11.92 | 12.58 | 12.72 | 13.36 |
| 128 | 8 | 454.12 | 2.15 | 17.62 | 18.45 | 18.68 | 19.25 |
| 384 | 1 | 83.5 | 1.57 | 11.98 | 12.71 | 12.93 | 13.29 |
| 384 | 2 | 117.75 | 1.96 | 16.99 | 17.62 | 17.83 | 19.48 |
| 384 | 4 | 139.08 | 2.42 | 28.76 | 29.59 | 29.85 | 30.74 |
| 384 | 8 | 149.93 | 2.72 | 53.36 | 54.83 | 55.48 | 56.93 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 92.82 | 10.77 | 11.06 | 11.11 | 11.24 |
| 128 | 2 | 127.87 | 15.64 | 16.2 | 16.4 | 16.86 |
| 128 | 4 | 151.68 | 26.37 | 27.26 | 27.48 | 27.98 |
| 128 | 8 | 164.51 | 48.63 | 50.36 | 50.72 | 51.52 |
| 384 | 1 | 45.64 | 21.91 | 23.39 | 23.66 | 24.14 |
| 384 | 2 | 48.11 | 41.57 | 42.99 | 43.47 | 44.44 |
| 384 | 4 | 48.64 | 82.24 | 84.35 | 84.97 | 86.2 |
| 384 | 8 | 48.04 | 166.51 | 169.9 | 170.84 | 172.6 |
| 128 | 1 | 64.15 | 15.59 | 19.77 | 21.03 | 21.82 |
| 128 | 2 | 110.69 | 18.07 | 18.92 | 20.77 | 21.6 |
| 128 | 4 | 125.8 | 31.8 | 32.82 | 33.11 | 33.93 |
| 128 | 8 | 127.55 | 62.72 | 63.9 | 64.28 | 65.25 |
| 384 | 1 | 35.46 | 28.2 | 28.83 | 28.95 | 29.43 |
| 384 | 2 | 37.15 | 53.83 | 54.75 | 55.08 | 56.01 |
| 384 | 4 | 36.86 | 108.53 | 110.57 | 111.16 | 112.48 |
| 384 | 8 | 36.1 | 221.61 | 225.94 | 226.94 | 228.58 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
@ -1229,11 +1131,16 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
### Changelog
November 2019
- Pre-training and fine-tuning on biomedical tasks and corpora
October 2019
- Disabling Grappler Optimizations for improved performance
September 2019
- Pre-training using LAMB
- Multi Node support
- Fine Tuning support for GLUE (CoLA, MNLI, MRPC)
- Jupyter Notebooks
July 2019
- Results obtained using 19.06
@ -1245,4 +1152,4 @@ March 2019
### Known issues
- There is a known performance regression with the 19.08 release on Tesla V100 boards with 16 GB memory, smaller batch sizes may be a better choice for this model on these GPUs with the 19.08 release. 32 GB GPUs are not affected.

View file

@ -0,0 +1,570 @@
# BioBERT For TensorFlow
This folder provides a script and recipe to train BERT for TensorFlow to achieve state-of-the-art accuracy on *biomedical text-mining* and is tested and maintained by NVIDIA.
## Table Of Contents
* [Model overview](#model-overview)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Pre-training](#pre-training)
* [Fine tuning](#fine-tuning)
* [Multi-node](#multi-node)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy](#training-accuracy)
* [Pre-training accuracy](#pre-training-accuracy)
* [Fine-tuning accuracy](#fine-tuning-accuracy)
* [Fine-tuning accuracy for NER Chem](#fine-tuning-accuracy-for-ner-chem)
* [Training stability test](#training-stability-test)
* [Fine-tuning stability test](#fine-tuning-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
* [Pre-training training performance: multi-node on 16G](#pre-training-training-performance-multi-node-on-16g)
* [Fine-tuning training performance for NER on 16G](#fine-tuning-training-performance-for-ner-on-16g)
* [Training performance: NVIDIA DGX-1 (8x V100 32G)](#training-performance-nvidia-dgx-1-8x-v100-32g)
* [Fine-tuning training performance for NER on 32G](#fine-tuning-training-performance-for-ner-on-32g)
* [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32g)
* [Pre-training training performance: multi-node on DGX-2 32G](#pre-training-training-performance-multi-node-on-dgx-2-32g)
* [Fine-tuning training performance for NER on DGX-2 32G](#fine-tuning-training-performance-for-ner-on-dgx-2-32g)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
In the original [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) paper, pre-training is done on [Wikipedia](https://dumps.wikimedia.org/) and [Books Corpus](http://yknzhu.wixsite.com/mbweb), with state-of-the-art results demonstrated on [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (Stanford Question Answering Dataset) benchmark.
Meanwhile, many works, including [BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [SciBERT](https://arxiv.org/pdf/1903.10676.pdf), [NCBI-BERT](https://arxiv.org/pdf/1906.05474.pdf), [ClinicalBERT (MIT)](https://arxiv.org/pdf/1904.03323.pdf), [ClinicalBERT (NYU, Princeton)](https://arxiv.org/pdf/1904.05342.pdf), and others at [BioNLP19 workshop](https://aclweb.org/aclwiki/BioNLP_Workshop), show that additional pre-training of BERT on large biomedical text corpus such as [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) results in better performance in biomedical text-mining tasks.
This repository provides scripts and recipe to adopt the [NVIDIA BERT code-base](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT) to achieve state-of-the-art results in the following biomedical text-mining benchmark tasks:
- [BC5CDR-disease](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/) A Named-Entity-Recognition task to recognize diseases mentioned in a collection of 1500 PubMed titles and abstracts ([Li et al., 2016](https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414))
- [BC5CDR-chemical](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/) A Named-Entity-Recognition task to recognize chemicals mentioned in a collection of 1500 PubMed titles and abstracts ([Li et al., 2016](https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414))
- [ChemProt](https://biocreative.bioinformatics.udel.edu/news/corpora/) A Relation-Extraction task to determine chemical-protein interactions in a collection of 1820 PubMed abstracts ([Krallinger et al., 2017](https://biocreative.bioinformatics.udel.edu/media/store/files/2017/ProceedingsBCVI_v2.pdf?page=141))
## Quick Start Guide
To pretrain or fine tune your model for BioMedical tasks using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the BERT model.
1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/LanguageModeling/BERT
```
2. Build the BERT TensorFlow NGC container.
```bash
bash scripts/docker/build.sh
```
3. Download and preprocess the PubMed dataset.
To download and preprocess pre-training data as well as the required vocab files, run the following script:
```bash
bash biobert/scripts/biobert_data_download.sh
```
Datasets for fine-tuning can be obtained from this [repository](https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1).
Place them in `/workspace/bert/data/biobert/` to be automatically picked up by our scripts.
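For example, assuming the downloaded archives were extracted into `BC5CDR` and `ChemProt` folders in the current directory, they can be copied to the locations the fine-tuning scripts read from:
```bash
mkdir -p /workspace/bert/data/biobert
cp -r BC5CDR ChemProt /workspace/bert/data/biobert/
# The fine-tuning scripts read from:
#   /workspace/bert/data/biobert/BC5CDR/chem
#   /workspace/bert/data/biobert/BC5CDR/disease
#   /workspace/bert/data/biobert/ChemProt
```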
4. Start an interactive session in the NGC container to run training/inference.
After you build the container image and download the data, you can start an interactive CLI session as follows:
```bash
bash scripts/docker/launch.sh
```
5. Download the pre-trained checkpoint, vocabulary, and configuration files.
We have uploaded checkpoints for fine-tuning and pre-training on biomedical corpora to the NGC Model Registry. You can download them directly from the [NGC model catalog](https://ngc.nvidia.com/catalog/models).
Place our `BioBERT checkpoints` in the `results/` directory to easily access them in your scripts.
6. Start pre-training.
From within the container, you can use the following script to run the 1st phase of the pre-training using cased vocabulary:
```bash
bash biobert/scripts/run_pretraining-pubmed_base_phase_1.sh <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>
```
For the 2nd phase of the pre-training, issue:
```bash
bash biobert/scripts/run_pretraining-pubmed_base_phase_2.sh <path_to_phase_1_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>
```
Refer to the [Multi-node](#multi-node) section for details on utilizing multiple nodes for faster pre-training.
7. Start fine-tuning.
The above pretrained BERT representations can be fine tuned with just one additional output layer for a state-of-the-art biomedical text-mining system.
From within the container, you can use the following script to run fine-tuning for NER.
Note: The scripts assume you are running on 16 V100 32GB GPUs. If you are running on GPUs with less than 32GB of memory, or on fewer GPUs, the batch size, learning rate, and number of GPUs need to be adjusted.
For NER on disease entities:
```bash
bash biobert/scripts/ner_bc5cdr-disease.sh <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>
```
For NER on chemical entities:
```bash
bash biobert/scripts/ner_bc5cdr-chem.sh <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>
```
For relation extraction, issue:
```bash
bash biobert/scripts/rel_chemprot.sh <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>
```
8. Start validation/evaluation.
The `biobert/scripts/run_biobert_finetuning_inference.sh` script runs inference on a checkpoint fine tuned for a specific task and evaluates the validity of predictions on the basis of F1, precision and recall scores.
```bash
bash biobert/scripts/run_biobert_finetuning_inference.sh <task> <init_checkpoint> <bert_model> <cased> <precision> <use_xla> <batch_size>
```
For FP16 inference for NER on the BC5CDR Chemical task with XLA using a DGX-2 V100 32G, run:
```bash
bash biobert/scripts/run_biobert_finetuning_inference.sh ner_bc5cdr-chem /results/model.ckpt base false fp16 true 16
```
Tasks `ner_bc5cdr-chem`, `ner_bc5cdr-disease` and `rel_chemprot` are currently supported.
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In addition to BERT TensorFlow files, the most important files added for NER and RE fine tuning tasks are:
* `run_ner.py` - Serves as an entry point for NER training.
* `run_re.py` - Serves as an entry point for RE training.
The `biobert/scripts/` folder encapsulates all the one-click scripts required for running various functionalities supported such as:
* `ner_bc5cdr-chem.sh` - Runs NER training and inference on the BC5CDR Chemical dataset using the `run_ner.py` file.
* `ner_bc5cdr-disease.sh` - Runs NER training and inference on the BC5CDR Disease dataset using the `run_ner.py` file.
* `rel_chemprot.sh` - Runs RE training and inference on the ChemProt dataset using the `run_re.py` file.
* `run_pretraining_pubmed_base_phase_*.sh` - Runs pre-training with the LAMB optimizer using the `run_pretraining.py` file in two phases. Phase 1 does most of the training with sequence length = 128. In phase 2, the remaining training is done with sequence length = 512.
* `biobert_data_download.sh` - Downloads the PubMed dataset and Vocab files using files in the `data/` folder.
* `run_biobert_finetuning_inference.sh` - Runs task specific inference using a fine tuned checkpoint.
### Parameters
Aside from the options to set hyperparameters, some relevant options to control the behaviour of the `run_ner.py` and `run_re.py` scripts are:
```
--bert_config_file: The config json file corresponding to the pre-trained BERT model. This specifies the model architecture.
--vocab_file: The vocabulary file that the BERT model was trained on.
--output_dir: The output directory where the model checkpoints will be written.
--[no]do_eval: Whether to run evaluation on the dev set. (default: 'false')
--[no]do_predict: Whether to run evaluation on the test set. (default: 'false')
--[no]do_train: Whether to run training. (default: 'false')
--learning_rate: The initial learning rate for Adam.(default: '5e-06')(a number)
--max_seq_length: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded.(default: '384')(an integer)
--predict_batch_size: Total batch size for predictions.(default: '8')(an integer)
--train_batch_size: Total batch size for training (default: '8')(an integer)
--[no]use_fp16: Whether to enable AMP ops.(default: 'false')
--[no]use_xla: Whether to enable XLA JIT compilation.(default: 'false')
--init_checkpoint: Initial checkpoint (usually from a pre-trained BERT model).
--num_train_epochs: Total number of training epochs to perform.(default: '3.0')(a number)
```
Note: When initializing from a checkpoint using `--init_checkpoint` and a corpus of your choice, keep in mind that `bert_config_file` and `vocab_file` should remain unchanged.
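As an illustrative, not prescriptive, single-GPU example, these flags could be combined roughly as follows for the BC5CDR Chemical data; the paths and hyperparameter values are placeholders to adapt to your setup:
```bash
BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
DATA_DIR=data/biobert/BC5CDR/chem

python run_ner.py \
  --do_train --do_eval --do_predict \
  --task_name=bc5cdr \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$BERT_DIR/bert_model.ckpt \
  --data_dir=$DATA_DIR \
  --output_dir=/results \
  --train_batch_size=8 \
  --learning_rate=5e-6 \
  --num_train_epochs=10 \
  --max_seq_length=128 \
  --use_fp16 --use_xla
```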
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option with the Python file, for example:
```bash
python run_ner.py --help
python run_re.py --help
```
### Getting the data
For pre-training BERT, we use the PubMed dataset. We extract text from the PubMed XML files into a document-level corpus, rather than a shuffled sentence-level corpus, because it is critical to extract long contiguous sentences.
The next step is to run `create_pretraining_data.py` with the document level corpus as input, which generates input data and labels for the masked language modeling and next sentence prediction tasks. Pre-training can also be performed on any corpus of your choice. The collection of data generation scripts are intended to be modular to allow modifications for additional preprocessing steps or to use additional data. They can hence easily be modified for an arbitrary corpus.
The preparation of an individual pre-training dataset is described in the `create_biobert_datasets_from_start.sh` script found in the `data/` folder. The component steps to prepare the datasets are as follows:
1. Data download and extract - the dataset is downloaded and extracted.
2. Clean and format - document tags, etc. are removed from the dataset. The end result of this step is a `{dataset_name_one_article_per_line}.txt` file that contains the entire corpus. Each line in the text file contains an entire document from the corpus. One file per dataset is created in the `formatted_one_article_per_line` folder.
3. Sharding - the sentence segmented corpus file is split into a number of smaller text documents. The sharding is configured so that a document will not be split between two shards. Sentence segmentation is performed at this time using NLTK.
4. TFRecord file creation - each text file shard is processed by the `create_pretraining_data.py` script to produce a corresponding TFRecord file. The script generates input data and labels for masked language modeling and sentence prediction tasks for the input text shard.
For fine-tuning BioBERT on Named Entity Recognition and Relation Extraction tasks, we use the BC5CDR and ChemProt datasets. The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.
The ChemProt corpus consists of text exhaustively annotated by hand with mentions of chemical compounds/drugs and genes/proteins, as well as 22 different types of compound-protein relations focusing on 5 important relation classes. It was preprocessed following the [Lim and Kang](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6014134/) guidelines.
#### Dataset guidelines
The procedure to prepare a text corpus for pre-training is described in the previous section. This section provides additional insight into how exactly raw text is processed so that it is ready for pre-training.
First, raw text is tokenized using [WordPiece tokenization](https://arxiv.org/pdf/1609.08144.pdf). A [CLS] token is inserted at the start of every sequence, and the two sentences in the sequence are separated by a [SEP] token.
Note: BERT pre-training looks at pairs of sentences at a time. A sentence embedding token [A] is added to the first sentence and token [B] to the next.
BERT pre-training optimizes for two unsupervised classification tasks. The first is Masked Language Modelling (Masked LM). One training instance of Masked LM is a single modified sentence. Each token in the sentence has a 15% chance of being replaced by a [MASK] token. The chosen token is replaced with [MASK] 80% of the time, 10% with another random token and the remaining 10% with the same token. The task is then to predict the original token.
The second task is next sentence prediction. One training instance of BERT pre-training is two sentences (a sentence pair). A sentence pair may be constructed by simply taking two adjacent sentences from a single document, or by pairing up two random sentences with equal probability. The goal of this task is to predict whether or not the second sentence followed the first in the original document.
The `create_pretraining_data.py` script takes in raw text and creates training instances for both pre-training tasks.
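A typical invocation on a single text shard might look as follows; the input and output paths are placeholders, the flag names follow the upstream BERT data-generation script, and the values shown are illustrative rather than required:
```bash
python create_pretraining_data.py \
  --input_file=data/sharded/pubmed/shard_0000.txt \
  --output_file=data/tfrecord/pubmed/shard_0000.tfrecord \
  --vocab_file=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```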
#### Multi-dataset
We are able to combine multiple datasets into a single dataset for pre-training on a diverse text corpus. Once TFRecords have been created for each component dataset, you can create a combined dataset by adding the directory to `*FILES_DIR` in `run_pretraining_*.sh`. This will feed all matching files to the input pipeline in `run_pretraining.py`. However, in the training process, only one TFRecord file is consumed at a time, therefore, the training instances of any given training batch will all belong to the same source dataset.
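As a minimal sketch (the directory variables below are assumptions, not the scripts' exact names), the simplest way to have a single `*FILES_DIR` pick up a second corpus is to place or link its TFRecord shards next to the existing ones:
```bash
# Assumed locations; adjust to the directories referenced by your run_pretraining_*.sh
PUBMED_TFRECORD_DIR=/workspace/bert/data/tfrecord/pubmed
EXTRA_TFRECORD_DIR=/workspace/bert/data/tfrecord/my_corpus

# Link the extra shards so the existing *FILES_DIR glob matches both datasets
ln -s $EXTRA_TFRECORD_DIR/*.tfrecord $PUBMED_TFRECORD_DIR/
```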
### Training process
The training process consists of two steps: pre-training and fine tuning.
#### Pre-training
BERT is designed to pre-train deep bidirectional language representations. The following scripts pre-train BERT on the PubMed dataset. These scripts are general and can be used for pre-training language representations on additional corpora of biomedical text.
Pre-training is performed using the `run_pretraining.py` script along with parameters defined in the `biobert/scripts/run_pretraining_pubmed_base_phase_1.sh` and `biobert/scripts/run_pretraining_pubmed_base_phase_2.sh` scripts.
The `biobert/scripts/run_pretraining_pubmed_base_phase*.sh` scripts run a job on a single node that trains the BERT-base model from scratch using the PubMed Corpus dataset as training data. By default, the training script:
- Runs on 16 GPUs
- Has FP16 precision enabled
- Is XLA enabled
- Creates a log file containing all the output
- Saves a checkpoint every 5000 iterations (keeps only the latest checkpoint) and at the end of training. All checkpoints, evaluation results, and training logs are saved to the `/results` directory (in the container which can be mounted to a local directory).
- Evaluates the model at the end of each phase
- Phase 1
- Runs 19531 steps with 1953 warmup steps
- Sets Maximum sequence length as 128
- Sets Global Batch size as 64K
- Phase 2
- Runs 4340 steps with 434 warm-up steps
- Sets Maximum sequence length as 512
- Sets Global Batch size as 32K
- Should start from Phase1's final checkpoint
These parameters train on the PubMed corpus with reasonable accuracy on a DGX-2 with 32GB V100 cards.
For example:
```bash
biobert/scripts/run_pretraining-pubmed_base_phase_1.sh <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>
```
Where:
- `<train_batch_size>` is the per-GPU batch size used for training. The feasible batch size varies with precision; larger batch sizes run more efficiently but require more memory.
- `<learning_rate>` is the learning rate; the default of 3.2e-5 works well for a global batch size of 64K.
- `<cased>` is set to `true` or `false` depending on whether the model should be trained on cased or uncased data.
- `<precision>` is the type of math in your model, can be either `fp32` or `fp16`. Specifically:
- `fp32` is 32-bit IEEE single precision floats.
- `fp16` enables automatic rewriting of the TensorFlow compute graph to take advantage of 16-bit arithmetic whenever it is safe.
- `<num_gpus>` is the number of GPUs to use for training. Must be equal to or smaller than the number of GPUs attached to your node.
- `<warmup_steps>` is the number of warm-up steps at the start of training.
- `<training_steps>` is the total number of training steps.
- `<save_checkpoint_steps>` controls how often checkpoints are saved. Default is 5000 steps.
- `<num_accumulation_steps>` is used to mimic higher batch sizes in the respective phase by accumulating gradients N times before weight update.
- `<bert_model>` is used to indicate whether to pretrain BERT Large or BERT Base model.
- `<eval_batch_size>` is per-GPU batch size used for evaluation after training.
The following sample code trains phase 1 of BERT-base from scratch on a single DGX-2 using FP16 arithmetic and uncased data.
```bash
biobert/scripts/run_pretraining-pubmed_base_phase_1.sh 128 3.2e-5 false fp16 true 16 1953 19531 32 5000 80
```
#### Fine tuning
Fine tuning is performed using the `run_ner.py` script along with parameters defined in `biobert/scripts/ner_bc5cdr*.sh`.
For example, `biobert/scripts/ner_bc5cdr-chem.sh` script trains a model and performs evaluation on the BC5CDR Chemical dataset. By default, the training script:
- Trains on BERT Base Uncased Model
- Uses 16 GPUs and batch size of 8 on each GPU
- Has FP16 precision enabled
- Is XLA enabled
- Runs for 10 epochs
- Evaluation is done at the end of training. To skip evaluation, modify `--do_eval` and `--do_predict` to `False`.
This script outputs checkpoints to the `/results` directory, by default, inside the container. Mount point of `/results` can be changed in the `scripts/docker/launch.sh` file. The training log contains information about:
- Loss for the final step
- Training and evaluation performance
- F1, Precision and Recall on the Test Set of BC5CDR Chemical after evaluation.
The summary after training is printed in the following format:
```bash
0: /results/biobert_finetune_ner_chem_191028154209/test_labels.txt
0: /results/biobert_finetune_ner_chem_191028154209/test_labels_errs.txt
0: processed 124669 tokens with 5433 phrases; found: 5484 phrases; correct: 5102.
0: accuracy: 99.26%; precision: 93.03%; recall: 93.91%; FB1: 93.47
0: : precision: 93.03%; recall: 93.91%; FB1: 93.47 5484
```
Multi-GPU training is enabled with the Horovod TensorFlow module. The following example runs training on 16 GPUs:
```bash
BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
DATA_DIR=data/biobert/BC5CDR/chem
mpi_command="mpirun -np 16 -H localhost:16 \
    --allow-run-as-root -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO \
    -x LD_LIBRARY_PATH \
    -x PATH -mca pml ob1 -mca btl ^openib"

$mpi_command python run_ner.py --horovod --use_fp16 --use_xla \
    --vocab_file=$BERT_DIR/vocab.txt \
    --bert_config_file=$BERT_DIR/bert_config.json \
    --output_dir=/results --data_dir=$DATA_DIR
```
#### Multi-node
Multi-node runs can be launched on a pyxis/enroot Slurm cluster (see [Requirements](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT#requirements)) with the `biobert/scripts/run_biobert.sub` script with the following command for a 4-node DGX2 example for both phase 1 and phase 2:
```bash
BATCHSIZE=128 LEARNING_RATE='8e-6' NUM_ACCUMULATION_STEPS=8 PHASE=1 sbatch -N4 --ntasks-per-node=16 biobert/scripts/run_biobert.sub
BATCHSIZE=16 LEARNING_RATE='3.2e-5' NUM_ACCUMULATION_STEPS=32 PHASE=2 sbatch -N4 --ntasks-per-node=16 biobert/scripts/run_biobert.sub
```
Checkpoint after phase 1 will be saved in `checkpointdir` specified in `biobert/scripts/run_biobert.sub`. The checkpoint will be automatically picked up to resume training on phase 2. Note that phase 2 should be run after phase 1.
Variables to re-run the [Training performance results](#training-performance-results) are available in the `configurations.yml` file.
The batch variables `BATCHSIZE`, `LEARNING_RATE`, `NUM_ACCUMULATION_STEPS` refer to the Python arguments `train_batch_size`, `learning_rate`, `num_accumulation_steps` respectively.
The variable `PHASE` refers to phase specific arguments available in `biobert/scripts/run_biobert.sub`.
Note that the `biobert/scripts/run_biobert.sub` script is a starting point that has to be adapted depending on the environment. In particular, variables such as `datadir` handle the location of the files for each phase.
Refer to the file contents to see the full list of variables to adjust for your system.
### Inference process
Inference on a fine-tuned model for biomedical tasks is performed using the `run_ner.py` or `run_re.py` script along with parameters defined in `biobert/scripts/run_biobert_finetuning_inference.sh`. Inference is supported on a single GPU.
The `biobert/scripts/run_biobert_finetuning_inference.sh` script performs evaluation on ChemProt or BC5CDR datasets depending on the task specified. By default, the inferencing script:
- Uses BC5CDR Chemical dataset
- Has FP16 precision enabled
- Is XLA enabled
- Evaluates the latest checkpoint present in `/results` with a batch size of 16.
This script computes F1, Precision and Recall scores. Mount point of `/results` can be changed in the `scripts/docker/launch.sh` file.
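For example, to evaluate the latest NER disease checkpoint in `/results` with FP16 and XLA (arguments follow the order shown in the Quick Start Guide):
```bash
bash biobert/scripts/run_biobert_finetuning_inference.sh ner_bc5cdr-disease /results/model.ckpt base false fp16 true 16
```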
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
Both of these benchmarking scripts enable you to run a number of epochs, extract performance numbers, and run the BERT model for fine tuning.
#### Training performance benchmark
Training benchmarking can be performed by running the script:
``` bash
biobert/scripts/biobert_finetune_training_benchmark.sh <task> <num_gpu> <bert_model> <cased>
```
This script runs 2 epochs by default on the NER BC5CDR dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved at `/results/tf_bert_biobert_<task>_training_benchmark_<bert_model>_<cased/uncased>_num_gpu_<num_gpu>_<DATESTAMP>`.
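For example, to benchmark BERT-base fine-tuning on the NER chemical task with 16 GPUs and uncased data:
```bash
bash biobert/scripts/biobert_finetune_training_benchmark.sh ner_bc5cdr-chem 16 base false
```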
#### Inference performance benchmark
Inference benchmarking can be performed by running the script:
``` bash
biobert/scripts/biobert_finetune_inference_benchmark.sh <task> <bert_model> <cased>
```
This script runs inference on the test and dev sets and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 with XLA and FP32 without XLA. These numbers are saved at `/results/tf_bert_biobert_<task>_inference_benchmark_<bert_model>_<cased/uncased>_<DATESTAMP>`.
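For example, to benchmark inference for BERT-base on the NER chemical task with uncased data:
```bash
bash biobert/scripts/biobert_finetune_inference_benchmark.sh ner_bc5cdr-chem base false
```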
### Results
The following sections provide detailed results of downstream fine-tuning task on NER and RE benchmark tasks.
#### Training accuracy results
##### Training accuracy
###### Pre-training accuracy
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container.
| **DGX System** | **Nodes** | **Precision** | **Batch Size/GPU: Phase1, Phase2** | **Accumulation Steps: Phase1, Phase2** | **Time to Train (Hrs)** | **Final Loss** |
|----------------|-----------|---------------|------------------------------------|----------------------------------------|----------------|-------------------------|
| DGX2H | 4 | FP16 | 128, 16 | 8, 32 | 19.14 | 0.88 |
| DGX2H | 16 | FP16 | 128, 16 | 2, 8 | 4.81 | 0.86 |
| DGX2H | 32 | FP16 | 128, 16 | 1, 4 | 2.65 | 0.87 |
###### Fine-tuning accuracy
| **Task** | **F1** | **Precision** | **Recall** |
|:-------:|:----:|:----:|:----:|
| NER BC5CDR-chemical | 93.47 | 93.03 | 93.91 |
| NER BC5CDR-disease | 86.22 | 85.05 | 87.43 |
| RE Chemprot | 76.27 | 77.62 | 74.98 |
###### Fine-tuning accuracy for NER Chem
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container.
| **DGX System** | **Batch size / GPU** | **F1 - FP32** | **F1- mixed precision** | **Time to Train - FP32 (Minutes)** | **Time to Train - mixed precision (Minutes)** |
|:---:|:----:|:----:|:---:|:----:|:----:|
| DGX-1 16G | 64 |93.33|93.40|23.95|14.13|
| DGX-1 32G | 64 |93.31|93.36|24.35|12.63|
| DGX-2 32G | 64 |93.66|93.47|12.26|8.16|
##### Training stability test
###### Fine-tuning stability test
The following table compares F1 scores across 5 different training runs on the NER Chemical task with different seeds, for both FP16 and FP32. The runs showcase consistent convergence on all 5 seeds with very little deviation.
| **16 x V100 GPUs** | **seed 1** | **seed 2** | **seed 3** | **seed 4** | **seed 5** | **mean** | **std** |
|:-----------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| F1 Score (FP16) | 93.13 | 92.92 | 93.34 | 93.66 | 93.47 | 93.3 | 0.29 |
| F1 Score (FP32) | 93.1 | 93.28 | 93.33 | 93.45 | 93.17 | 93.27 | 0.14 |
#### Training performance results
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
###### Pre-training training performance: multi-node on DGX-1 16G
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 64,32 | 2762.06 | 744.48 | 3.71 | 1.00 | 1.00 |
| 4 | 128 | 64,32 | 10283.08 | 2762.88 | 3.72 | 3.72 | 3.71 |
| 16 | 128 | 64,32 | 39051.69 | 10715.14 | 3.64 | 14.14 | 14.39 |
| 32 | 128 | 64,32 | 76077.39 | 21104.87 | 3.60 | 27.54 | 28.35 |
| 1 | 512 | 8,8 | 432.33 | 160.38 | 2.70 | 1.00 | 1.00 |
| 4 | 512 | 8,8 | 1593.00 | 604.36 | 2.64 | 3.68 | 3.77 |
| 16 | 512 | 8,8 | 5941.82 | 2356.44 | 2.52 | 13.74 | 14.69 |
| 32 | 512 | 8,8 | 11483.73 | 4631.29 | 2.48 | 26.56 | 28.88 |
Note: The corresponding values for FP32 runs that use batch sizes of 16 and 2 for sequence lengths 128 and 512, respectively, are not available due to out-of-memory errors.
###### Fine-tuning training performance for NER on DGX-1 16G
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 64 | 147.71 | 348.84 | 2.36 | 1.00 | 1.00 |
| 4 | 64 | 583.78 | 1145.46 | 1.96 | 3.95 | 3.28 |
| 8 | 64 | 981.22 | 1964.85 | 2.00 | 6.64 | 5.63 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 32G)
###### Fine-tuning training performance for NER on DGX-1 32G
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 64 | 144.1 | 417.39 | 2.89 | 1.00 | 1.00 |
| 4 | 64 | 525.15 | 1354.14 | 2.57 | 3.64 | 3.24 |
| 8 | 64 | 969.4 | 2341.39 | 2.41 | 6.73 | 5.61 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-2 (16x V100 32G)
###### Pre-training training performance: multi-node on DGX-2H 32G
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2H with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 128,128 | 7772.18 | 2165.04 | 3.59 | 1.00 | 1.00 |
| 4 | 128 | 128,128 | 29785.31 | 8516.90 | 3.50 | 3.83 | 3.93 |
| 16 | 128 | 128,128 | 115581.29 | 33699.15 | 3.43 | 14.87 | 15.57 |
| 32 | 128 | 128,128 | 226156.53 | 66996.73 | 3.38 | 29.10 | 30.94 |
| 64 | 128 | 128,128 | 444955.74 | 133424.95 | 3.33 | 57.25 | 61.63 |
| 1 | 512 | 16,16 | 1260.06 | 416.92 | 3.02 | 1.00 | 1.00 |
| 4 | 512 | 16,16 | 4781.19 | 1626.76 | 2.94 | 3.79 | 3.90 |
| 16 | 512 | 16,16 | 18405.65 | 6418.09 | 2.87 | 14.61 | 15.39 |
| 32 | 512 | 16,16 | 36071.06 | 12713.67 | 2.84 | 28.63 | 30.49 |
| 64 | 512 | 16,16 | 69950.86 | 25245.96 | 2.77 | 55.51 | 60.55 |
###### Fine-tuning training performance for NER on DGX-2 32G
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 64 | 139.59 | 475.54 | 3.4 | 1.00 | 1.00 |
| 4 | 64 | 517.08 | 1544.01 | 2.98 | 3.70 | 3.25 |
| 8 | 64 | 1009.84 | 2695.34 | 2.66 | 7.23 | 5.67 |
| 16 | 64 | 1997.73 | 4268.81 | 2.13 | 14.31 | 8.98 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
November 2019
- Initial release
### Known issues
- There are no known issues with the model.

View file

@ -0,0 +1,302 @@
# Python version of the evaluation script from CoNLL'00-
# Originates from: https://github.com/spyysalo/conlleval.py
# Intentional differences:
# - accept any space as delimiter by default
# - optional file argument (default STDIN)
# - option to set boundary (-b argument)
# - LaTeX output (-l argument) not supported
# - raw tags (-r argument) not supported
# add function :evaluate(predicted_label, ori_label): which will not read from file
import sys
import re
import codecs
from collections import defaultdict, namedtuple
ANY_SPACE = '<SPACE>'
class FormatError(Exception):
pass
Metrics = namedtuple('Metrics', 'tp fp fn prec rec fscore')
class EvalCounts(object):
def __init__(self):
self.correct_chunk = 0 # number of correctly identified chunks
self.correct_tags = 0 # number of correct chunk tags
self.found_correct = 0 # number of chunks in corpus
self.found_guessed = 0 # number of identified chunks
self.token_counter = 0 # token counter (ignores sentence breaks)
# counts by type
self.t_correct_chunk = defaultdict(int)
self.t_found_correct = defaultdict(int)
self.t_found_guessed = defaultdict(int)
def parse_args(argv):
import argparse
parser = argparse.ArgumentParser(
description='evaluate tagging results using CoNLL criteria',
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
arg = parser.add_argument
arg('-b', '--boundary', metavar='STR', default='-X-',
help='sentence boundary')
arg('-d', '--delimiter', metavar='CHAR', default=ANY_SPACE,
help='character delimiting items in input')
arg('-o', '--otag', metavar='CHAR', default='O',
help='alternative outside tag')
arg('file', nargs='?', default=None)
return parser.parse_args(argv)
def parse_tag(t):
m = re.match(r'^([^-]*)-(.*)$', t)
return m.groups() if m else (t, '')
def evaluate(iterable, options=None):
if options is None:
options = parse_args([]) # use defaults
counts = EvalCounts()
num_features = None # number of features per line
in_correct = False # currently processed chunks is correct until now
last_correct = 'O' # previous chunk tag in corpus
last_correct_type = '' # type of previously identified chunk tag
last_guessed = 'O' # previously identified chunk tag
last_guessed_type = '' # type of previous chunk tag in corpus
for i, line in enumerate(iterable):
line = line.rstrip('\r\n')
# print(line)
if options.delimiter == ANY_SPACE:
features = line.split()
else:
features = line.split(options.delimiter)
if num_features is None:
num_features = len(features)
elif num_features != len(features) and len(features) != 0:
raise FormatError('unexpected number of features: %d (%d) at line %d\n%s' %
(len(features), num_features, i, line))
if len(features) == 0 or features[0] == options.boundary:
features = [options.boundary, 'O', 'O']
if len(features) < 3:
raise FormatError('unexpected number of features in line %s' % line)
guessed, guessed_type = parse_tag(features.pop())
correct, correct_type = parse_tag(features.pop())
first_item = features.pop(0)
if first_item == options.boundary:
guessed = 'O'
end_correct = end_of_chunk(last_correct, correct,
last_correct_type, correct_type)
end_guessed = end_of_chunk(last_guessed, guessed,
last_guessed_type, guessed_type)
start_correct = start_of_chunk(last_correct, correct,
last_correct_type, correct_type)
start_guessed = start_of_chunk(last_guessed, guessed,
last_guessed_type, guessed_type)
if in_correct:
if (end_correct and end_guessed and
last_guessed_type == last_correct_type):
in_correct = False
counts.correct_chunk += 1
counts.t_correct_chunk[last_correct_type] += 1
elif (end_correct != end_guessed or guessed_type != correct_type):
in_correct = False
if start_correct and start_guessed and guessed_type == correct_type:
in_correct = True
if start_correct:
counts.found_correct += 1
counts.t_found_correct[correct_type] += 1
if start_guessed:
counts.found_guessed += 1
counts.t_found_guessed[guessed_type] += 1
if first_item != options.boundary:
if correct == guessed and guessed_type == correct_type:
counts.correct_tags += 1
counts.token_counter += 1
last_guessed = guessed
last_correct = correct
last_guessed_type = guessed_type
last_correct_type = correct_type
if in_correct:
counts.correct_chunk += 1
counts.t_correct_chunk[last_correct_type] += 1
return counts
def uniq(iterable):
seen = set()
return [i for i in iterable if not (i in seen or seen.add(i))]
def calculate_metrics(correct, guessed, total):
tp, fp, fn = correct, guessed-correct, total-correct
p = 0 if tp + fp == 0 else 1.*tp / (tp + fp)
r = 0 if tp + fn == 0 else 1.*tp / (tp + fn)
f = 0 if p + r == 0 else 2 * p * r / (p + r)
return Metrics(tp, fp, fn, p, r, f)
def metrics(counts):
c = counts
overall = calculate_metrics(
c.correct_chunk, c.found_guessed, c.found_correct
)
by_type = {}
for t in uniq(list(c.t_found_correct) + list(c.t_found_guessed)):
by_type[t] = calculate_metrics(
c.t_correct_chunk[t], c.t_found_guessed[t], c.t_found_correct[t]
)
return overall, by_type
def report(counts, out=None):
if out is None:
out = sys.stdout
overall, by_type = metrics(counts)
c = counts
out.write('processed %d tokens with %d phrases; ' %
(c.token_counter, c.found_correct))
out.write('found: %d phrases; correct: %d.\n' %
(c.found_guessed, c.correct_chunk))
if c.token_counter > 0:
out.write('accuracy: %6.2f%%; ' %
(100.*c.correct_tags/c.token_counter))
out.write('precision: %6.2f%%; ' % (100.*overall.prec))
out.write('recall: %6.2f%%; ' % (100.*overall.rec))
out.write('FB1: %6.2f\n' % (100.*overall.fscore))
for i, m in sorted(by_type.items()):
out.write('%17s: ' % i)
out.write('precision: %6.2f%%; ' % (100.*m.prec))
out.write('recall: %6.2f%%; ' % (100.*m.rec))
out.write('FB1: %6.2f %d\n' % (100.*m.fscore, c.t_found_guessed[i]))
def report_notprint(counts, out=None):
if out is None:
out = sys.stdout
overall, by_type = metrics(counts)
c = counts
final_report = []
line = []
line.append('processed %d tokens with %d phrases; ' %
(c.token_counter, c.found_correct))
line.append('found: %d phrases; correct: %d.\n' %
(c.found_guessed, c.correct_chunk))
final_report.append("".join(line))
if c.token_counter > 0:
line = []
line.append('accuracy: %6.2f%%; ' %
(100.*c.correct_tags/c.token_counter))
line.append('precision: %6.2f%%; ' % (100.*overall.prec))
line.append('recall: %6.2f%%; ' % (100.*overall.rec))
line.append('FB1: %6.2f\n' % (100.*overall.fscore))
final_report.append("".join(line))
for i, m in sorted(by_type.items()):
line = []
line.append('%17s: ' % i)
line.append('precision: %6.2f%%; ' % (100.*m.prec))
line.append('recall: %6.2f%%; ' % (100.*m.rec))
line.append('FB1: %6.2f %d\n' % (100.*m.fscore, c.t_found_guessed[i]))
final_report.append("".join(line))
return final_report
def end_of_chunk(prev_tag, tag, prev_type, type_):
# check if a chunk ended between the previous and current word
# arguments: previous and current chunk tags, previous and current types
chunk_end = False
if prev_tag == 'E': chunk_end = True
if prev_tag == 'S': chunk_end = True
if prev_tag == 'B' and tag == 'B': chunk_end = True
if prev_tag == 'B' and tag == 'S': chunk_end = True
if prev_tag == 'B' and tag == 'O': chunk_end = True
if prev_tag == 'I' and tag == 'B': chunk_end = True
if prev_tag == 'I' and tag == 'S': chunk_end = True
if prev_tag == 'I' and tag == 'O': chunk_end = True
if prev_tag != 'O' and prev_tag != '.' and prev_type != type_:
chunk_end = True
# these chunks are assumed to have length 1
if prev_tag == ']': chunk_end = True
if prev_tag == '[': chunk_end = True
return chunk_end
def start_of_chunk(prev_tag, tag, prev_type, type_):
# check if a chunk started between the previous and current word
# arguments: previous and current chunk tags, previous and current types
chunk_start = False
if tag == 'B': chunk_start = True
if tag == 'S': chunk_start = True
if prev_tag == 'E' and tag == 'E': chunk_start = True
if prev_tag == 'E' and tag == 'I': chunk_start = True
if prev_tag == 'S' and tag == 'E': chunk_start = True
if prev_tag == 'S' and tag == 'I': chunk_start = True
if prev_tag == 'O' and tag == 'E': chunk_start = True
if prev_tag == 'O' and tag == 'I': chunk_start = True
if tag != 'O' and tag != '.' and prev_type != type_:
chunk_start = True
# these chunks are assumed to have length 1
if tag == '[': chunk_start = True
if tag == ']': chunk_start = True
return chunk_start
def main(argv):
args = parse_args(argv[1:])
if args.file is None:
counts = evaluate(sys.stdin, args)
else:
with open(args.file) as f:
counts = evaluate(f, args)
report(counts)
def return_report(input_file):
with open(input_file, "r") as f:
counts = evaluate(f)
return report_notprint(counts)
if __name__ == '__main__':
    sys.exit(main(sys.argv))

View file

@ -0,0 +1,51 @@
import os
import numpy as np
import pandas as pd
import sklearn.metrics
import argparse
parser = argparse.ArgumentParser(description='')
parser.add_argument('--output_path', type=str, help='')
parser.add_argument('--answer_path', type=str, help='')
parser.add_argument('--task', type=str, default="binary", help='default:binary, possible other options:{chemprot}')
args = parser.parse_args()
testdf = pd.read_csv(args.answer_path, sep="\t", index_col=0)
preddf = pd.read_csv(args.output_path, sep="\t", header=None)
# binary
if args.task == "binary":
pred = [preddf.iloc[i].tolist() for i in preddf.index]
pred_class = [np.argmax(v) for v in pred]
pred_prob_one = [v[1] for v in pred]
p,r,f,s = sklearn.metrics.precision_recall_fscore_support(y_pred=pred_class, y_true=testdf["label"])
results = dict()
results["f1 score"] = f[1]
results["recall"] = r[1]
results["precision"] = p[1]
results["specificity"] = r[0]
# chemprot
# micro-average of 5 target classes
# see "Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction (Mehryary, 2018)" for details
if args.task == "chemprot":
pred = [preddf.iloc[i].tolist() for i in preddf.index]
pred_class = [np.argmax(v) for v in pred]
str_to_int_mapper = dict()
for i,v in enumerate(sorted(testdf["label"].unique())):
str_to_int_mapper[v] = i
test_answer = [str_to_int_mapper[v] for v in testdf["label"]]
p,r,f,s = sklearn.metrics.precision_recall_fscore_support(y_pred=pred_class, y_true=test_answer, labels=[0,1,2,3,4], average="micro")
results = dict()
results["f1 score"] = f
results["recall"] = r
results["precision"] = p
for k,v in results.items():
print("{:11s} : {:.2%}".format(k,v))

View file

@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
docker run --runtime=nvidia -v $PWD:/workspace/bert \
--rm --shm-size=1g --ulimit memlock=-1 \
--ulimit stack=67108864 --ipc=host -t -i \
bert bash -c "bash data/create_biobert_datasets_from_start.sh"

View file

@ -0,0 +1,187 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task=${1:-"ner_bc5cdr-chem"}
bert_model=${2:-"base"}
cased=${3:-"false"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
DATESTAMP=`date +'%y%m%d%H%M%S'`
printf -v TAG "tf_bert_biobert_%s_inference_benchmark_%s_%s" "$task" "$bert_model" "$CASING_DIR_PREFIX"
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
if [ "$task" = "ner_bc5cdr-chem" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_sl_${seq_length}_prec_${precision}_bs_${batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
$use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "ner_bc5cdr-disease" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_sl_${seq_length}_prec_${precision}_bs_${batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
$use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "rel_chemprot" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_sl_${seq_length}_prec_${precision}_bs_${batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
$use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $batch_size $perf" >> $LOGFILE
done
done
done
else
echo "Benchmarking for $task is currently not supported. Sorry!"
fi

View file

@ -0,0 +1,203 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task=${1:-"ner_bc5cdr-chem"}
num_gpu=${2:-"2"}
bert_model=${3:-"base"}
cased=${4:-"false"}
epochs=2.0
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
DATESTAMP=`date +'%y%m%d%H%M%S'`
printf -v TAG "tf_bert_biobert_%s_training_benchmark_%s_%s_num_gpu_%d" "$task" "$bert_model" "$CASING_DIR_PREFIX" "$num_gpu"
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
if [ "$task" = "ner_bc5cdr-chem" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}_gpu_${num_gpu}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for train_batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_gpu_${num_gpu}_sl_${seq_length}_prec_${precision}_bs_${train_batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
$mpi_command python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name=bc5cdr \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--train_batch_size=$train_batch_size \
--max_seq_length=$seq_length \
$use_hvd $use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $train_batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "ner_bc5cdr-disease" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}_gpu_${num_gpu}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for train_batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_gpu_${num_gpu}_sl_${seq_length}_prec_${precision}_bs_${train_batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
$mpi_command python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--train_batch_size=$train_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $train_batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "rel_chemprot" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}_gpu_${num_gpu}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for train_batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_gpu_${num_gpu}_sl_${seq_length}_prec_${precision}_bs_${train_batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
$mpi_command python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--train_batch_size=$train_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $train_batch_size $perf" >> $LOGFILE
done
done
done
else
echo "Benchmarking for " $task "currently not supported. Sorry!"
fi

View file

@ -0,0 +1,86 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
train_batch_size=${2:-8}
learning_rate=${3:-3.125e-6}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
num_gpu=${7:-"16"}
seq_length=${8:-128}
bert_model=${9:-"base"}
eval_batch_size=${10:-8} # Eval and Predict batch sizes are assumed to be the same
epochs=${11:-"10.0"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
export GBS=$(expr $train_batch_size \* $num_gpu)
printf -v TAG "tf_bert_biobert_ner_bc5cdr_chem_%s_%s_gbs%d" "$bert_model" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
$mpi_command python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name=bc5cdr \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--learning_rate=$learning_rate \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--predict_batch_size=$eval_batch_size \
--max_seq_length=$seq_length \
$use_hvd $use_fp16 $use_xla_tag $case_flag

View file

@ -0,0 +1,85 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
train_batch_size=${2:-8}
learning_rate=${3:-3.125e-6}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
num_gpu=${7:-"16"}
seq_length=${8:-128}
bert_model=${9:-"base"}
eval_batch_size=${10:-8} # Eval and Predict batch sizes are assumed to be the same
epochs=${11:-"100.0"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
export GBS=$(expr $train_batch_size \* $num_gpu)
printf -v TAG "tf_bert_biobert_ner_bc5cdr_disease_%s_%s_gbs%d" "$bert_model" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
$mpi_command python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--learning_rate=$learning_rate \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--predict_batch_size=$eval_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag

View file

@ -0,0 +1,87 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
train_batch_size=${2:-64}
learning_rate=${3:-1.5e-6}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
num_gpu=${7:-"16"}
seq_length=${8:-512}
bert_model=${9:-"base"}
eval_batch_size=${10:-16} # Eval and Predict batch sizes are assumed to be the same
epochs=${11:-"3.0"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
export GBS=$(expr $train_batch_size \* $num_gpu)
printf -v TAG "tf_bert_biobert_rel_chemprot_%s_%s_gbs%d" "$bert_model" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
$mpi_command python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--learning_rate=$learning_rate \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--predict_batch_size=$eval_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag
python3 /workspace/bert/biobert/re_eval.py --task=chemprot --output_path=$OUTPUT_DIR/test_results.tsv \
--answer_path=$DATASET_DIR/test.tsv |& tee $OUTPUT_DIR/test_results.txt

View file

@ -0,0 +1,87 @@
#!/bin/bash
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --overcommit
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -eux
readonly docker_image="nvcr.io/nvidia/tensorflow:19.08-py3"
readonly datadir="/raid/data/bert"
readonly checkpointdir="$PWD/checkpoints"
readonly mounts=".:/workspace/bert,${datadir}:/workspace/bert/data,${checkpointdir}:/results"
DO_LOWER_CASE=${DO_LOWER_CASE:-1}
if [ "$DO_LOWER_CASE" == "1" ]; then
CASING_DIR_PREFIX="uncased"
else
CASING_DIR_PREFIX="cased"
fi
DO_BERT_BASE=${DO_BERT_BASE:-1}
if [ "$DO_BERT_BASE" == "1" ]; then
CASING_DIR_SUFFIX="L-12_H-768_A-12"
else
CASING_DIR_SUFFIX="L-24_H-1024_A-16"
fi
srun --ntasks="${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 mkdir -p "${checkpointdir}/biobert_phase_1"
srun --ntasks="${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 mkdir -p "${checkpointdir}/biobert_phase_2"
PHASE1="\
--train_batch_size=${BATCHSIZE:-128} \
--learning_rate=${LEARNING_RATE:-3.2e-5} \
--num_accumulation_steps=${NUM_ACCUMULATION_STEPS:-128} \
--input_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training \
--eval_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=19531 \
--num_warmup_steps=1953 \
--output_dir=/results/biobert_phase_1 \
"
PHASE2="\
--train_batch_size=${BATCHSIZE:-16} \
--learning_rate=${LEARNING_RATE:-6.4e-5} \
--num_accumulation_steps=${NUM_ACCUMULATION_STEPS:-512} \
--input_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training \
--eval_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--num_train_steps=4340 \
--num_warmup_steps=434 \
--output_dir=/results/biobert_phase_2 \
--init_checkpoint=/results/biobert_phase_1/model.ckpt-19531 \
"
PHASES=( "$PHASE1" "$PHASE2" )
PHASE=${PHASE:-1}
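# Select pretraining phase 1 or 2 via the PHASE environment variable (defaults to phase 1);
# the corresponding flag block from PHASES above is spliced into the command below.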
BERT_CMD="\
python /workspace/bert/run_pretraining.py \
${PHASES[$((PHASE-1))]} \
--bert_config_file=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_${CASING_DIR_SUFFIX}/bert_config.json \
--vocab_file=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_${CASING_DIR_SUFFIX}/vocab.txt \
--do_train=True \
--do_eval=True \
--save_checkpoints_steps=5000 \
--horovod --use_fp16 --use_xla \
--allreduce_post_accumulation=True \
--eval_batch_size=8"
srun --mpi=pmi2 -l --container-image="${docker_image}" --container-mounts="${mounts}" bash -c "${BERT_CMD}"

View file

@ -0,0 +1,122 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task=${1:-"ner_bc5cdr-chem"}
init_checkpoint=${2:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
bert_model=${3:-"base"}
cased=${4:-"false"}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
batch_size=${7:-"16"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
DATESTAMP=`date +'%y%m%d%H%M%S'`
if [ "$task" = "ner_bc5cdr-chem" ] ; then
printf -v TAG "tf_bert_biobert_ner_bc5cdr_chem_inference_%s_%s" "$bert_model" "$precision"
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=128 \
$use_fp16 $use_xla_tag $case_flag
elif [ "$task" = "ner_bc5cdr-disease" ] ; then
printf -v TAG "tf_bert_biobert_ner_bc5cdr_disease_inference_%s_%s" "$bert_model" "$precision"
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=128 \
"$use_fp16" $use_xla_tag $case_flag
elif [ "$task" = "rel_chemprot" ] ; then
printf -v TAG "tf_bert_biobert_rel_chemprot_inference_%s_%s_" "$bert_model" "$precision"
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=512 \
"$use_fp16" $use_xla_tag $case_flag
python3 /workspace/bert/biobert/re_eval.py --task=chemprot --output_path=$OUTPUT_DIR/test_results.tsv \
--answer_path=$DATASET_DIR/test.tsv |& tee $OUTPUT_DIR/test_results.txt
else
echo "Benchmarking for " $task "currently not supported. Sorry!"
fi

View file

@ -0,0 +1,87 @@
#! /bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
train_batch_size=${1:-128}
learning_rate=${2:-"9.625e-5"}
cased=${3:-false}
precision=${4:-"fp16"}
use_xla=${5:-"true"}
num_gpus=${6:-16}
warmup_steps=${7:-"1953"}
train_steps=${8:-19531}
num_accumulation_steps=${9:-32}
save_checkpoint_steps=${10:-5000}
eval_batch_size=${11:-80}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
fi
BERT_CONFIG=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12/bert_config.json
RESULTS_DIR=/results
CHECKPOINTS_DIR=${RESULTS_DIR}/biobert_phase_1
mkdir -p ${CHECKPOINTS_DIR}
INIT_CHECKPOINT=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12/bert_model.ckpt
INPUT_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training"
EVAL_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test"
if [ $num_gpus -gt 1 ] ; then
mpi_command="mpirun -np $num_gpus -H localhost:$num_gpus \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
export GBS=$(expr $train_batch_size \* $num_gpus \* $num_accumulation_steps)
printf -v TAG "tf_bert_bio_1n_phase1_cased_%s_%s_gbs%d" "$cased" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
$mpi_command python3 /workspace/bert/run_pretraining.py \
--input_files_dir=$INPUT_FILES_DIR \
--eval_files_dir=$EVAL_FILES_DIR \
--output_dir=$CHECKPOINTS_DIR \
--bert_config_file=$BERT_CONFIG \
--do_train=True \
--do_eval=True \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=$train_steps \
--num_warmup_steps=$warmup_steps \
--save_checkpoints_steps=$save_checkpoint_steps \
--num_accumulation_steps=$num_accumulation_steps \
--learning_rate=$learning_rate \
--report_loss \
$use_hvd $use_fp16 $use_xla_tag \
--init_checkpoint=$INIT_CHECKPOINT |& tee $LOGFILE

View file

@ -0,0 +1,85 @@
#! /bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1}
train_batch_size=${2:-16}
learning_rate=${3:-"2.9e-4"}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-true}
num_gpus=${7:-16}
warmup_steps=${8:-"434"}
train_steps=${9:-4340}
num_accumulation_steps=${10:-128}
save_checkpoint_steps=${11:-5000}
eval_batch_size=${12:-26}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
fi
BERT_CONFIG=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12/bert_config.json
RESULTS_DIR=/results
CHECKPOINTS_DIR=${RESULTS_DIR}/biobert_phase_2
mkdir -p ${CHECKPOINTS_DIR}
INPUT_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training"
EVAL_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test"
if [ $num_gpus -gt 1 ] ; then
mpi_command="mpirun -np $num_gpus -H localhost:$num_gpus \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
export GBS=$(expr $train_batch_size \* $num_gpus \* $num_accumulation_steps)
printf -v TAG "tf_bert_bio_1n_phase2_cased_%s_%s_gbs%d" "$cased" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
$mpi_command python3 /workspace/bert/run_pretraining.py \
--input_files_dir=$INPUT_FILES_DIR \
--eval_files_dir=$EVAL_FILES_DIR \
--output_dir=$CHECKPOINTS_DIR \
--bert_config_file=$BERT_CONFIG \
--do_train=True \
--do_eval=True \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--num_train_steps=$train_steps \
--num_warmup_steps=$warmup_steps \
--save_checkpoints_steps=$save_checkpoint_steps \
--num_accumulation_steps=$num_accumulation_steps \
--learning_rate=$learning_rate \
--report_loss \
$use_hvd $use_xla_tag $use_fp16 \
--init_checkpoint=$init_checkpoint |& tee $LOGFILE

View file

@ -27,7 +27,7 @@ class PubMedTextFormatting:
print('PubMed path:', self.pubmed_path)
with open(self.output_filename, mode='w', newline='\n') as ofile:
for filename in glob.glob(self.pubmed_path + '/*.xml', recursive=self.recursive):
for filename in glob.glob(self.pubmed_path + '/*.xml*', recursive=self.recursive):
print('file:', filename)
dicts_out = pmp.parse_medline_xml(filename)
for dict_out in dicts_out:

View file

@ -302,7 +302,7 @@ if __name__ == "__main__":
'--fraction_test_set',
type=float,
help='Specify the fraction (0..1) of the data to withhold for the test data split (based on number of sequences)',
default=0.2
default=0.1
)
parser.add_argument(

View file

@ -0,0 +1,55 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export BERT_PREP_WORKING_DIR="${BERT_PREP_WORKING_DIR}"
# Download
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset pubmed_baseline
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
# Properly format the text files
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action text_formatting --dataset pubmed_baseline
# Shard the text files
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action sharding --dataset pubmed_baseline
### BERT BASE
## UNCASED
# Create TFRecord files Phase 1
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 128 \
--max_predictions_per_seq 20 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt
# Create TFRecord files Phase 2
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 512 \
--max_predictions_per_seq 80 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt
## CASED
# Create TFRecord files Phase 1
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 128 \
--max_predictions_per_seq 20 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/cased_L-12_H-768_A-12/vocab.txt \
--do_lower_case=0
# Create TFRecord files Phase 2
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 512 \
--max_predictions_per_seq 80 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/cased_L-12_H-768_A-12/vocab.txt \
--do_lower_case=0

Binary file not shown.


View file

@ -8,6 +8,12 @@
```
<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">
# Table Of Contents
- [BERT Question Answering Inference/Fine-Tuning with Mixed Precision](#bert-question-answering-inferencefine-tuning-with-mixed-precision)
- [BioBERT Named-Entity Recognition Inference with Mixed Precision](#biobert-named-entity-recognition-inference-with-mixed-precision)
# BERT Question Answering Inference/Fine-Tuning with Mixed Precision
## 1. Overview
@ -88,3 +94,80 @@ in, for example:
```
http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b
```
# BioBERT Named-Entity Recognition Inference with Mixed Precision
## 1. Overview
Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
BioBERT is a domain specific version of BERT that has been trained on PubMed abstracts.
The original BioBERT paper can be found here: https://arxiv.org/abs/1901.08746
NVIDIA's BioBERT is an optimized version of the implementation presented in the paper, leveraging mixed precision arithmetic and tensor cores on V100 GPUs for faster training times while maintaining target accuracy.
### 1.a Learning objectives
This repository contains an example notebook that demonstrates:
- Inference on NER task with BioBERT model
- The use/download of fine-tuned NVIDIA BioBERT models
- Use of Mixed Precision for Inference
Here is a short description of the relevant file:
- _biobert_ner_tf_inference.ipynb_ : BioBERT Inference with TF Checkpoint model
## 2. Quick Start Guide
### 2.a Build the BERT TensorFlow NGC container:
To run the notebook you first need to build the Bert TensorFlow container using the following command from the main directory of this repository:
``` bash
docker build . --rm -t bert
```
### 2.b Start the NGC container to run inference:
Once the image is built, you need to run the container with the `--publish
0.0.0.0:8888:8888` option to publish Jupyter's port `8888` to the host machine
at port `8888` over all network interfaces (`0.0.0.0`):
```bash
nvidia-docker run \
-v $PWD:/workspace/bert \
-v $PWD/results:/results \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--publish 0.0.0.0:8888:8888 \
-it bert:latest bash
```
Then you can use the following commands within the BERT Tensorflow container under
`/workspace/bert`:
Install spaCy. You'll use this to pre-process text and to visualize the results using displaCy.
```
pip install spacy
python -m spacy download en_core_web_sm
```
Launch Jupyter.
```bash
jupyter notebook --ip=0.0.0.0 --allow-root
```
And navigate a web browser to the IP address or hostname of the host machine
at port `8888`:
```
http://[host machine]:8888
```
Use the token listed in the output from running the `jupyter` command to log
in, for example:
```
http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b
```
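If you prefer to run the same NER inference outside of Jupyter, `run_ner.py` can also be invoked directly inside the container. The sketch below mirrors the flags used by the BioBERT inference helper script in this repository; the checkpoint, dataset and output paths are illustrative and need to point at your own downloads:
```bash
# Minimal sketch: BC5CDR-disease NER inference with run_ner.py (paths are illustrative)
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
python3 /workspace/bert/run_ner.py \
  --do_prepare=true \
  --do_eval=true \
  --do_predict=true \
  --task_name="bc5cdr" \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=/results/biobert_tf_uncased_base/model.ckpt-4340 \
  --data_dir=/workspace/bert/data/biobert/BC5CDR/disease \
  --output_dir=/results/biobert_ner_inference \
  --eval_batch_size=16 \
  --predict_batch_size=16 \
  --max_seq_length=128 \
  --use_fp16 --use_xla --do_lower_case=True
```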

View file

@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -26,6 +26,13 @@
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/notebooks/bert_squad_tf_inference_colab.ipynb#scrollTo=5hRb96NKE3X0\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -78,14 +85,35 @@
"source": [
"## 2. Requirements\n",
"\n",
"Please enable the GPU runtime (Runtime->Change Runtime Type)\n",
"### 2.a GPU\n",
"\n",
"Download the required files from NVIDIA-Github:"
"Before running this notebook, please set the Colab runtime environment to GPU via the menu *Runtime => Change runtime type => GPU*.\n",
"\n",
"This demo will work on any NVIDIA GPU with CUDA cores, though for improved FP16 inference, a Volta, Turing or newer generation GPU with Tensor cores is desired. On Google Colab, this normally means a T4 GPU. If you are assigned an older K80 GPU, another trial at another time might give you a T4 GPU."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hxNJ8HByw60o"
},
"source": [
"### 2.b Download the required files from NVIDIA-Github:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -100,7 +128,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -115,19 +143,6 @@
"print (os.getcwd())"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "p560UwaE6lAf"
},
"outputs": [],
"source": [
"ls"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -184,7 +199,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -228,7 +243,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -246,7 +261,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -286,7 +301,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -310,7 +325,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -321,29 +336,6 @@
"use_mixed_precision_model = True"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tUQ1jWFHw61h"
},
"source": [
"To effectively evaluate the speedup of mixed precision try a bigger workload by uncommenting the following line:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "VpkeBiyPw61j"
},
"outputs": [],
"source": [
"#input_file = os.path.join(working_dir, 'data/download/squad/v2.0/dev-v2.0.json')"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -372,7 +364,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -417,7 +409,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -444,7 +436,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -474,7 +466,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -555,7 +547,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -624,7 +616,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",

View file

@ -0,0 +1,610 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# BioBERT Named-Entity Recognition Inference with Mixed Precision\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Overview\n",
"\n",
"Bidirectional Embedding Representations from Transformers (BERT), is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. \n",
"\n",
"BioBERT is a domain specific version of BERT that has been trained on PubMed abstracts.\n",
"\n",
"The original BioBERT paper can be found here: https://arxiv.org/abs/1901.08746\n",
"\n",
"NVIDIA's BioBERT is an optimized version of the implementation presented in the paper, leveraging mixed precision arithmetic and tensor cores on V100 GPUS for faster training times while maintaining target accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.a Learning objectives\n",
"\n",
"This notebook demonstrates:\n",
"- Inference on NER task with BioBERT model\n",
"- The use/download of fine-tuned NVIDIA BioBERT models\n",
"- Use of Mixed Precision for Inference"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Requirements\n",
"\n",
"Please refer to the ReadMe file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. BioBERT Inference: Named-Entity Recognition\n",
"\n",
"We can run inference on a fine-tuned BioBERT model for tasks like Named-Entity Recognition.\n",
"\n",
"Here we use a BioBERT model fine-tuned on a [BC5CDR-disease Dataset](https://www.ncbi.nlm.nih.gov/research/bionlp/Data/) which consists of 1500 PubMed articles with 5818 annotated diseases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.a Extract Disease Information from Text\n",
"\n",
"In this example we will use Named-Entity Recognition model created using BioBERT to extract disease information from the following paragraph:\n",
"\n",
"**Input Text**\n",
"\n",
"_\"The authors describe the case of a 56 - year - old woman with chronic, severe heart failure \n",
"secondary to dilated cardiomyopathy and absence of significant ventricular arrhythmias \n",
"who developed QT prolongation and torsade de pointes ventricular tachycardia during one cycle \n",
"of intermittent low dose (2.5 mcg/kg per min) dobutamine. \n",
"This report of torsade de pointes ventricular tachycardia during intermittent dobutamine \n",
"supports the hypothesis that unpredictable fatal arrhythmias may occur even with low doses \n",
"and in patients with no history of significant rhythm disturbances.\n",
"The mechanisms of proarrhythmic effects of Dubutamine are discussed.\"_\n",
"\n",
"**Output visualized using displaCy**\n",
"\n",
"<div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">The authors describe the case of a 56 year old woman with chronic , severe \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" heart failure \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"secondary to \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" dilated cardiomyopathy \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"and absence of significant \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" ventricular arrhythmias \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"who developed QT \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" prolongation \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"and torsade de pointes \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" ventricular tachycardia \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"during one cycle of intermittent low dose ( 2.5 mcg / kg per min ) dobutamine . This report of torsade de pointes \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" ventricular tachycardia \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"during intermittent dobutamine supports the hypothesis that unpredictable fatal \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" arrhythmias \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"may occur even with low doses and in patients with no history of significant \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" rhythm disturbances \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
". The mechanisms of proarrhythmic effects of Dubutamine are discussed . </div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text= \"\"\"\n",
"The authors describe the case of a 56 year old woman with chronic, severe heart failure\n",
"secondary to dilated cardiomyopathy and absence of significant ventricular arrhythmias\n",
"who developed QT prolongation and torsade de pointes ventricular tachycardia during one cycle\n",
"of intermittent low dose (2.5 mcg/kg per min) dobutamine.\n",
"This report of torsade de pointes ventricular tachycardia during intermittent dobutamine\n",
"supports the hypothesis that unpredictable fatal arrhythmias may occur even with low doses\n",
"and in patients with no history of significant rhythm disturbances.\n",
"The mechanisms of proarrhythmic effects of Dubutamine are discussed.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"notebooks_dir = '/workspace/bert/notebooks'\n",
"working_dir = '/workspace/bert'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert the text into the IOB tags format seen during training, using dummy placeholder labels\n",
"import spacy\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"\n",
"text = text.strip()\n",
"doc = nlp(text)\n",
"input_file = os.path.join(notebooks_dir, 'input.tsv')\n",
"with open(os.path.join(input_file), 'w') as wf: \n",
" for word in doc:\n",
" if word.text is '\\n':\n",
" continue\n",
" wf.write(word.text + '\\tO\\n')\n",
" wf.write('\\n') # Indicate end of text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.b Mixed Precision\n",
"\n",
"Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of tensor cores in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures.\n",
"\n",
"For information about:\n",
"- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.\n",
"- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.\n",
"- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we control mixed precision execution with the environmental variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"TF_ENABLE_AUTO_MIXED_PRECISION\"] = \"1\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model we'll use was trained with mixed precision model, which takes much less time to train than the fp32 version, without losing accuracy. So we'll need to set with the following flag: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"use_mixed_precision_model = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Fine-Tuned NVIDIA BioBERT TF Models\n",
"\n",
"We have the following Named Entity Reconition models fine-tuned from BioBERT available on NGC (NVIDIA GPU Cluster, https://ngc.nvidia.com).\n",
"\n",
"| **Model** | **Description** |\n",
"|:---------:|:----------:|\n",
"|BioBERT NER BC5CDR Disease | NER model to extract disease information from text, trained on the BC5CDR-Disease dataset |\n",
"|BioBERT NER BC5CDR Chemical | NER model to extract chemical information from text, trained on the BC5CDR-Chemical dataset. |\n",
"\n",
"\n",
"For this exampple, we will download the Diease NER model trained from the BC5CDR-disease Dataset.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# biobert_uncased_base_ner_disease\n",
"DATA_DIR_FP16='/workspace/bert/data/download/finetuned_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/biobert_uncased_base_ner_disease.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/biobert_uncased_base_ner_disease/versions/1/zip\n",
"!unzip -n -d $DATA_DIR_FP16/ $DATA_DIR_FP16/biobert_uncased_base_ner_disease.zip "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the code that follows we will refer to these models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Running NER task inference\n",
"\n",
"In order to run NER inference we will follow step-by-step the flow implemented in run_ner.py."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.a Configure Things"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import run_ner\n",
"from run_ner import BC5CDRProcessor, model_fn_builder, file_based_input_fn_builder, filed_based_convert_examples_to_features, result_to_pair\n",
"\n",
"import os, sys\n",
"import time\n",
"\n",
"import tensorflow as tf\n",
"import modeling\n",
"import tokenization\n",
"\n",
"tf.logging.set_verbosity(tf.logging.ERROR)\n",
"\n",
"# Create the output directory where all the results are saved.\n",
"output_dir = os.path.join(working_dir, 'output')\n",
"tf.gfile.MakeDirs(output_dir)\n",
"\n",
"# The config json file corresponding to the pre-trained BERT model.\n",
"# This specifies the model architecture.\n",
"bert_config_file = os.path.join(DATA_DIR_FP16, 'bert_config.json')\n",
"\n",
"# The vocabulary file that the BERT model was trained on.\n",
"vocab_file = os.path.join(DATA_DIR_FP16, 'vocab.txt')\n",
"\n",
"init_checkpoint = os.path.join(DATA_DIR_FP16, 'model.ckpt-10251')\n",
"\n",
"# Whether to lower case the input text. \n",
"# Should be True for uncased models and False for cased models.\n",
"# The BioBERT available in NGC is uncased\n",
"do_lower_case = True\n",
" \n",
"# Total batch size for predictions\n",
"predict_batch_size = 1\n",
"params = dict([('batch_size', predict_batch_size)])\n",
"\n",
"# The maximum total input sequence length after WordPiece tokenization. \n",
"# Sequences longer than this will be truncated, and sequences shorter than this will be padded.\n",
"max_seq_length = 128\n",
"\n",
"# This is a WA to use flags from here:\n",
"flags = tf.flags\n",
"\n",
"if 'f' not in tf.flags.FLAGS: \n",
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"\n",
"FLAGS.output_dir = output_dir"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.b Define Tokenizer & Create Estimator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Validate the casing config consistency with the checkpoint name.\n",
"tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)\n",
"\n",
"# Create the tokenizer.\n",
"tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)\n",
"\n",
"# Load the configuration from file\n",
"bert_config = modeling.BertConfig.from_json_file(bert_config_file)\n",
"\n",
"\n",
"# Use the data processor for BC5CDR\n",
"processor = BC5CDRProcessor()\n",
"# Get labels in the index order that was used during training\n",
"label_list = processor.get_labels()\n",
"\n",
"# Reverse index the labels. This will be used later when evaluating predictions.\n",
"id2label = {}\n",
"for (i, label) in enumerate(label_list, 1):\n",
" id2label[i] = label\n",
"\n",
"\n",
"config = tf.ConfigProto(log_device_placement=True) \n",
"run_config = tf.estimator.RunConfig(\n",
" model_dir=None,\n",
" session_config=config,\n",
" save_checkpoints_steps=1000,\n",
" keep_checkpoint_max=1)\n",
"\n",
"\n",
"# Use model function builder to create the model function\n",
"model_fn = model_fn_builder(\n",
" bert_config=bert_config,\n",
" num_labels=len(label_list) + 1,\n",
" init_checkpoint=init_checkpoint,\n",
" use_fp16=use_mixed_precision_model)\n",
"\n",
"estimator = tf.estimator.Estimator(\n",
" model_fn=model_fn,\n",
" config=run_config,\n",
" params=params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.c Run Inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the input data using the BC5CDR processor\n",
"predict_examples = processor.get_test_examples(notebooks_dir, file_name='input.tsv')\n",
"\n",
"\n",
"# Convert to tf_records and save it\n",
"predict_file = os.path.join(output_dir, \"predict.tf_record\")\n",
"filed_based_convert_examples_to_features(predict_examples, label_list,\n",
" max_seq_length, tokenizer,\n",
" predict_file)\n",
"\n",
"\n",
"tf.logging.info(\"***** Running predictions *****\")\n",
"tf.logging.info(\" Num orig examples = %d\", len(predict_examples))\n",
"tf.logging.info(\" Batch size = %d\", predict_batch_size)\n",
"\n",
"# Run prediction on this tf_record file\n",
"predict_input_fn = file_based_input_fn_builder(\n",
" input_file=predict_file,\n",
" batch_size=predict_batch_size,\n",
" seq_length=max_seq_length,\n",
" is_training=False,\n",
" drop_remainder=False)\n",
"\n",
"\n",
"pred_start_time = time.time()\n",
"\n",
"predictions = estimator.predict(input_fn=predict_input_fn)\n",
"predictions = list(predictions)\n",
"\n",
"pred_time_elapsed = time.time() - pred_start_time\n",
"\n",
"tf.logging.info(\"-----------------------------\")\n",
"tf.logging.info(\"Total Inference Time = %0.2f\", pred_time_elapsed)\n",
"# tf.logging.info(\"Inference Performance = %0.4f sentences/sec\", avg_sentences_per_second)\n",
"tf.logging.info(\"-----------------------------\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.d Save Predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's now process the predictions and save them to file(s)\n",
"tf.logging.info(\"Save Predictions:\")\n",
"\n",
"# File containing the list of predictions as IOB tags\n",
"output_predict_file = os.path.join(FLAGS.output_dir, \"label_test.txt\")\n",
"# File containing the list of words, the dummy token and the predicted IOB tag\n",
"test_labels_file = os.path.join(FLAGS.output_dir, \"test_labels.txt\")\n",
"test_labels_err_file = os.path.join(FLAGS.output_dir, \"test_labels_errs.txt\")\n",
"\n",
"with tf.gfile.Open(output_predict_file, 'w') as writer, \\\n",
" tf.gfile.Open(test_labels_file, 'w') as tl, \\\n",
" tf.gfile.Open(test_labels_err_file, 'w') as tle:\n",
" i=0\n",
" for prediction in estimator.predict(input_fn=predict_input_fn, yield_single_examples=True):\n",
" output_line = \"\\n\".join(id2label[id] for id in prediction if id != 0) + \"\\n\"\n",
" writer.write(output_line)\n",
" result_to_pair(predict_examples[i], prediction, id2label, tl, tle)\n",
" i = i + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.e Visualize Predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's create a function that can formats the predictions for display using displaCy\n",
"def predictions_for_displacy(predict_examples, predictions, id2label):\n",
" processed_text = ''\n",
" entities = []\n",
" current_pos = 0\n",
" start_pos = 0\n",
" end_pos = 0\n",
" end_detected = False\n",
" prev_label = ''\n",
"\n",
" for predict_line, pred_ids in zip(predict_examples, predictions):\n",
" words = str(predict_line.text).split(' ')\n",
" labels = str(predict_line.label).split(' ')\n",
"\n",
" # get from CLS to SEP\n",
" pred_labels = []\n",
" for id in pred_ids:\n",
" if id == 0:\n",
" continue\n",
" curr_label = id2label[id]\n",
" if curr_label == '[CLS]':\n",
" continue\n",
" elif curr_label == '[SEP]':\n",
" break\n",
" elif curr_label == 'X':\n",
" continue\n",
" pred_labels.append(curr_label)\n",
"\n",
" for tok, label, pred_label in zip(words, labels, pred_labels):\n",
" if pred_label is 'B':\n",
" start_pos = current_pos\n",
" elif pred_label is 'I' and prev_label is not 'B' and prev_label is not 'I':\n",
" start_pos = current_pos\n",
" elif pred_label is 'O' and (prev_label is 'B' or prev_label is 'I'):\n",
" end_pos = current_pos\n",
" end_detected = True\n",
"\n",
" if end_detected:\n",
" entities.append({'start':start_pos, 'end': end_pos, 'label': 'DISEASE'})\n",
" start_pos = 0\n",
" end_pos = 0\n",
" end_detected = False\n",
"\n",
" processed_text = processed_text + tok + ' '\n",
" current_pos = current_pos + len(tok) + 1\n",
" prev_label = pred_label\n",
"\n",
" #Handle entity at the very end\n",
" if start_pos > 0 and end_detected is False:\n",
" entities.append({'start':start_pos, 'end': current_pos, 'label': 'DISEASE'})\n",
" \n",
" displacy_input = [{\"text\": processed_text,\n",
" \"ents\": entities,\n",
" \"title\": None}]\n",
" \n",
" return displacy_input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert the predictions to the Named Entities format required by displaCy and visualize\n",
"displacy_input = predictions_for_displacy(predict_examples, predictions, id2label)\n",
"html = spacy.displacy.render(displacy_input, style=\"ent\", manual=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. What's next"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you are familiar with running NER Inference on BioBERT, using mixed precision, you may want to try extracting disease information from other biomedical text. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View file

@ -1,2 +1,9 @@
tensorflow >= 1.11.0 # CPU Version of TensorFlow.
# tensorflow-gpu >= 1.11.0 # GPU version of TensorFlow.
toposort
networkx
pytest
nltk
tqdm
html2text
progressbar

View file

@ -0,0 +1,872 @@
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
"""
Copyright 2018 The Google AI Language Team Authors.
BASED ON Google_BERT.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import os, sys
import pickle
import tensorflow as tf
import numpy as np
sys.path.append("/workspace/bert")
from biobert.conlleval import evaluate, report_notprint
import modeling
import optimization
import tokenization
import tf_metrics
import time
import horovod.tensorflow as hvd
from utils.utils import LogEvalRunHook, LogTrainRunHook
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string(
"task_name", "NER", "The name of the task to train."
)
flags.DEFINE_string(
"data_dir", None,
"The input datadir.",
)
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written."
)
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model."
)
flags.DEFINE_string(
"vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model)."
)
flags.DEFINE_bool(
"do_lower_case", False,
"Whether to lower case the input text."
)
flags.DEFINE_integer(
"max_seq_length", 128,
"The maximum total input sequence length after WordPiece tokenization."
)
flags.DEFINE_bool(
"do_train", False,
"Whether to run training."
)
flags.DEFINE_bool(
"do_eval", False,
"Whether to run eval on the dev set.")
flags.DEFINE_bool(
"do_predict", False,
"Whether to run the model in inference mode on the test set.")
flags.DEFINE_integer(
"train_batch_size", 64,
"Total batch size for training.")
flags.DEFINE_integer(
"eval_batch_size", 16,
"Total batch size for eval.")
flags.DEFINE_integer(
"predict_batch_size", 16,
"Total batch size for predict.")
flags.DEFINE_float(
"learning_rate", 5e-6,
"The initial learning rate for Adam.")
flags.DEFINE_float(
"num_train_epochs", 10.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer(
"save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer(
"iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text = text
self.label = label
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_ids, ):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_ids = label_ids
# self.label_mask = label_mask
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_data(cls, input_file):
"""Reads a BIO data."""
with tf.gfile.Open(input_file, "r") as f:
lines = []
words = []
labels = []
for line in f:
contends = line.strip()
if len(contends) == 0:
assert len(words) == len(labels)
if len(words) > 30:
# split if the sentence is longer than 30
while len(words) > 30:
tmplabel = labels[:30]
for iidx in range(len(tmplabel)):
if tmplabel.pop() == 'O':
break
l = ' '.join(
[label for label in labels[:len(tmplabel) + 1] if len(label) > 0])
w = ' '.join(
[word for word in words[:len(tmplabel) + 1] if len(word) > 0])
lines.append([l, w])
words = words[len(tmplabel) + 1:]
labels = labels[len(tmplabel) + 1:]
if len(words) == 0:
continue
l = ' '.join([label for label in labels if len(label) > 0])
w = ' '.join([word for word in words if len(word) > 0])
lines.append([l, w])
words = []
labels = []
continue
word = line.strip().split()[0]
label = line.strip().split()[-1]
words.append(word)
labels.append(label)
return lines
class BC5CDRProcessor(DataProcessor):
def get_train_examples(self, data_dir):
l1 = self._read_data(os.path.join(data_dir, "train.tsv"))
l2 = self._read_data(os.path.join(data_dir, "devel.tsv"))
return self._create_example(l1 + l2, "train")
def get_dev_examples(self, data_dir, file_name="devel.tsv"):
return self._create_example(
self._read_data(os.path.join(data_dir, file_name)), "dev"
)
def get_test_examples(self, data_dir, file_name="test.tsv"):
return self._create_example(
self._read_data(os.path.join(data_dir, file_name)), "test")
def get_labels(self):
return ["B", "I", "O", "X", "[CLS]", "[SEP]"]
def _create_example(self, lines, set_type):
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text = tokenization.convert_to_unicode(line[1])
label = tokenization.convert_to_unicode(line[0])
examples.append(InputExample(guid=guid, text=text, label=label))
return examples
class CLEFEProcessor(DataProcessor):
def get_train_examples(self, data_dir):
lines1 = self._read_data2(os.path.join(data_dir, "Training.tsv"))
lines2 = self._read_data2(os.path.join(data_dir, "Development.tsv"))
return self._create_example(
lines1 + lines2, "train"
)
def get_dev_examples(self, data_dir, file_name="Development.tsv"):
return self._create_example(
self._read_data2(os.path.join(data_dir, file_name)), "dev"
)
def get_test_examples(self, data_dir, file_name="Test.tsv"):
return self._create_example(
self._read_data2(os.path.join(data_dir, file_name)), "test")
def get_labels(self):
return ["B", "I", "O", "X", "[CLS]", "[SEP]"]
def _create_example(self, lines, set_type):
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text = tokenization.convert_to_unicode(line[1])
label = tokenization.convert_to_unicode(line[0])
examples.append(InputExample(guid=guid, text=text, label=label))
return examples
@classmethod
def _read_data2(cls, input_file):
with tf.gfile.Open(input_file, "r") as f:
lines = []
words = []
labels = []
for line in f:
contends = line.strip()
if len(contends) == 0:
assert len(words) == len(labels)
if len(words) == 0:
continue
l = ' '.join([label for label in labels if len(label) > 0])
w = ' '.join([word for word in words if len(word) > 0])
lines.append([l, w])
words = []
labels = []
continue
elif contends.startswith('###'):
continue
word = line.strip().split()[0]
label = line.strip().split()[-1]
words.append(word)
labels.append(label)
return lines
class I2b22012Processor(CLEFEProcessor):
def get_labels(self):
return ['B-CLINICAL_DEPT', 'B-EVIDENTIAL', 'B-OCCURRENCE', 'B-PROBLEM', 'B-TEST', 'B-TREATMENT', 'I-CLINICAL_DEPT', 'I-EVIDENTIAL', 'I-OCCURRENCE', 'I-PROBLEM', 'I-TEST', 'I-TREATMENT', "O", "X", "[CLS]", "[SEP]"]
def write_tokens(tokens, labels, mode):
if mode == "test":
path = os.path.join(FLAGS.output_dir, "token_" + mode + ".txt")
if tf.gfile.Exists(path):
wf = tf.gfile.Open(path, 'a')
else:
wf = tf.gfile.Open(path, 'w')
for token, label in zip(tokens, labels):
if token != "**NULL**":
wf.write(token + ' ' + str(label) + '\n')
wf.close()
def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode):
label_map = {}
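# Label ids start at 1; id 0 is reserved for padding positions.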
for (i, label) in enumerate(label_list, 1):
label_map[label] = i
label2id_file = os.path.join(FLAGS.output_dir, 'label2id.pkl')
if not tf.gfile.Exists(label2id_file):
with tf.gfile.Open(label2id_file, 'wb') as w:
pickle.dump(label_map, w)
textlist = example.text.split(' ')
labellist = example.label.split(' ')
tokens = []
labels = []
for i, word in enumerate(textlist):
token = tokenizer.tokenize(word)
tokens.extend(token)
label_1 = labellist[i]
for m in range(len(token)):
if m == 0:
labels.append(label_1)
else:
labels.append("X")
# tokens = tokenizer.tokenize(example.text)
if len(tokens) >= max_seq_length - 1:
tokens = tokens[0:(max_seq_length - 2)]
labels = labels[0:(max_seq_length - 2)]
ntokens = []
segment_ids = []
label_ids = []
ntokens.append("[CLS]")
segment_ids.append(0)
# append("O") or append("[CLS]") not sure!
label_ids.append(label_map["[CLS]"])
for i, token in enumerate(tokens):
ntokens.append(token)
segment_ids.append(0)
label_ids.append(label_map[labels[i]])
ntokens.append("[SEP]")
segment_ids.append(0)
# append("O") or append("[SEP]") not sure!
label_ids.append(label_map["[SEP]"])
input_ids = tokenizer.convert_tokens_to_ids(ntokens)
input_mask = [1] * len(input_ids)
# label_mask = [1] * len(input_ids)
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
# we are not concerned about padding labels
label_ids.append(0)
ntokens.append("**NULL**")
# label_mask.append(0)
# print(len(input_ids))
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(label_ids) == max_seq_length
# assert len(label_mask) == max_seq_length
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label_ids: %s" % " ".join([str(x) for x in label_ids]))
# tf.logging.info("label_mask: %s" % " ".join([str(x) for x in label_mask]))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_ids=label_ids,
# label_mask = label_mask
)
# write_tokens(ntokens, label_ids, mode)
return feature
def filed_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file, mode=None):
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
if ex_index % 5000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer,
mode)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature(feature.label_ids)
# features["label_mask"] = create_int_feature(feature.label_mask)
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
def file_based_input_fn_builder(input_file, batch_size, seq_length, is_training, drop_remainder, hvd=None):
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([seq_length], tf.int64),
# "label_ids":tf.VarLenFeature(tf.int64),
# "label_mask": tf.FixedLenFeature([seq_length], tf.int64),
}
def _decode_record(record, name_to_features):
example = tf.parse_single_example(record, name_to_features)
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
#batch_size = params["batch_size"]
d = tf.data.TFRecordDataset(input_file)
if is_training:
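# Shard the dataset across Horovod ranks so each worker trains on a distinct subset.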
if hvd is not None: d = d.shard(hvd.size(), hvd.rank())
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder
))
return d
return input_fn
def create_model(bert_config, is_training, input_ids, input_mask,
segment_ids, labels, num_labels, use_one_hot_embeddings):
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings
)
output_layer = model.get_sequence_output()
hidden_size = output_layer.shape[-1].value
output_weight = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02)
)
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer()
)
with tf.variable_scope("loss"):
if is_training:
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
output_layer = tf.reshape(output_layer, [-1, hidden_size])
logits = tf.matmul(output_layer, output_weight, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
logits = tf.reshape(logits, [-1, FLAGS.max_seq_length, num_labels])
# mask = tf.cast(input_mask,tf.float32)
# loss = tf.contrib.seq2seq.sequence_loss(logits,labels,mask)
# return (loss, logits, predict)
##########################################################################
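# Token-level cross-entropy: compare one-hot labels with log-softmax logits and average over all positions.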
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
probabilities = tf.nn.softmax(logits, axis=-1)
predict = tf.argmax(probabilities, axis=-1)
return (loss, per_example_loss, logits, predict)
##########################################################################
def model_fn_builder(bert_config, num_labels, init_checkpoint=None, learning_rate=None,
num_train_steps=None, num_warmup_steps=None,
use_one_hot_embeddings=False, hvd=None, use_fp16=False):
def model_fn(features, labels, mode, params):
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
# label_mask = features["label_mask"]
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, logits, predicts) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint and (hvd is None or hvd.rank() == 0):
(assignment_map,
initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars,
init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, hvd, False, use_fp16)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(per_example_loss, label_ids, logits):
# def metric_fn(label_ids, logits):
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
precision = tf_metrics.precision(label_ids, predictions, num_labels, [1, 2], average="macro")
recall = tf_metrics.recall(label_ids, predictions, num_labels, [1, 2], average="macro")
f = tf_metrics.f1(label_ids, predictions, num_labels, [1, 2], average="macro")
#
return {
"eval_precision": precision,
"eval_recall": recall,
"eval_f": f,
# "eval_loss": loss,
}
eval_metric_ops = metric_fn(per_example_loss, label_ids, logits)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
eval_metric_ops=eval_metric_ops)
else:
output_spec = tf.estimator.EstimatorSpec(
mode=mode, predictions=predicts)#probabilities)
return output_spec
return model_fn
def result_to_pair(predict_line, pred_ids, id2label, writer, err_writer):
words = str(predict_line.text).split(' ')
labels = str(predict_line.label).split(' ')
if len(words) != len(labels):
tf.logging.error('Text and label not equal')
tf.logging.error(predict_line.text)
tf.logging.error(predict_line.label)
exit(1)
# get from CLS to SEP
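# Drop padding (id 0), [CLS], and subword labels ('X'); stop when [SEP] is reached.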
pred_labels = []
for id in pred_ids:
if id == 0:
continue
curr_label = id2label[id]
if curr_label == '[CLS]':
continue
elif curr_label == '[SEP]':
break
elif curr_label == 'X':
continue
pred_labels.append(curr_label)
if len(pred_labels) > len(words):
err_writer.write(predict_line.guid + '\n')
err_writer.write(predict_line.text + '\n')
err_writer.write(predict_line.label + '\n')
err_writer.write(' '.join([str(i) for i in pred_ids]) + '\n')
err_writer.write(' '.join([id2label.get(i, '**NULL**') for i in pred_ids]) + '\n\n')
pred_labels = pred_labels[:len(words)]
elif len(pred_labels) < len(words):
err_writer.write(predict_line.guid + '\n')
err_writer.write(predict_line.text + '\n')
err_writer.write(predict_line.label + '\n')
err_writer.write(' '.join([str(i) for i in pred_ids]) + '\n')
err_writer.write(' '.join([id2label.get(i, '**NULL**') for i in pred_ids]) + '\n\n')
pred_labels += ['O'] * (len(words) - len(pred_labels))
for tok, label, pred_label in zip(words, labels, pred_labels):
writer.write(tok + ' ' + label + ' ' + pred_label + '\n')
writer.write('\n')
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
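# Enable TensorFlow's automatic mixed precision graph rewrite for fp16 execution.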
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"bc5cdr": BC5CDRProcessor,
"clefe": CLEFEProcessor,
'i2b2': I2b22012Processor
}
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
if FLAGS.max_seq_length > bert_config.max_position_embeddings:
raise ValueError(
"Cannot use sequence length %d because the BERT model "
"was only trained up to sequence length %d" %
(FLAGS.max_seq_length, bert_config.max_position_embeddings))
task_name = FLAGS.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
tf.gfile.MakeDirs(FLAGS.output_dir)
processor = processors[task_name]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
master_process = True
training_hooks = []
global_batch_size = FLAGS.train_batch_size
hvd_rank = 0
config = tf.ConfigProto()
if FLAGS.horovod:
global_batch_size = FLAGS.train_batch_size * hvd.size()
master_process = (hvd.rank() == 0)
hvd_rank = hvd.rank()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
if hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
if FLAGS.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
run_config = tf.estimator.RunConfig(
model_dir=FLAGS.output_dir if master_process else None,
session_config=config,
save_checkpoints_steps=FLAGS.save_checkpoints_steps if master_process else None,
keep_checkpoint_max=1)
if master_process:
tf.logging.info("***** Configuaration *****")
for key in FLAGS.__flags.keys():
tf.logging.info(' {}: {}'.format(key, getattr(FLAGS, key)))
tf.logging.info("**************************")
train_examples = None
num_train_steps = None
num_warmup_steps = None
training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank))
if FLAGS.do_train:
train_examples = processor.get_train_examples(FLAGS.data_dir)
num_train_steps = int(
len(train_examples) / global_batch_size * FLAGS.num_train_epochs)
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
start_index = 0
end_index = len(train_examples)
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record")]
if FLAGS.horovod:
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.size())]
num_examples_per_rank = len(train_examples) // hvd.size()
remainder = len(train_examples) % hvd.size()
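# Spread the remainder over the first ranks so every example is assigned to exactly one worker.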
if hvd.rank() < remainder:
start_index = hvd.rank() * (num_examples_per_rank+1)
end_index = start_index + num_examples_per_rank + 1
else:
start_index = hvd.rank() * num_examples_per_rank + remainder
end_index = start_index + (num_examples_per_rank)
model_fn = model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list) + 1,
init_checkpoint=FLAGS.init_checkpoint,
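# Scale the learning rate linearly with the number of Horovod workers.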
learning_rate=FLAGS.learning_rate if not FLAGS.horovod else FLAGS.learning_rate * hvd.size(),
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_one_hot_embeddings=False,
hvd=None if not FLAGS.horovod else hvd,
use_fp16=FLAGS.use_fp16)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config)
if FLAGS.do_train:
#train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
#filed_based_convert_examples_to_features(
# train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
filed_based_convert_examples_to_features(
train_examples[start_index:end_index], label_list, FLAGS.max_seq_length, tokenizer, tmp_filenames[hvd_rank])
tf.logging.info("***** Running training *****")
tf.logging.info(" Num examples = %d", len(train_examples))
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = file_based_input_fn_builder(
input_file=tmp_filenames, #train_file,
batch_size=FLAGS.train_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True,
hvd=None if not FLAGS.horovod else hvd)
#estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
train_start_time = time.time()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=training_hooks)
train_time_elapsed = time.time() - train_start_time
train_time_wo_overhead = training_hooks[-1].total_time
avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
ss_sentences_per_second = (num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
if master_process:
tf.logging.info("-----------------------------")
tf.logging.info("Total Training Time = %0.2f for Sentences = %d", train_time_elapsed,
num_train_steps * global_batch_size)
tf.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d", train_time_wo_overhead,
(num_train_steps - training_hooks[-1].skipped) * global_batch_size)
tf.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
if FLAGS.do_eval and master_process:
eval_examples = processor.get_dev_examples(FLAGS.data_dir)
eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
filed_based_convert_examples_to_features(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Num examples = %d", len(eval_examples))
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
eval_steps = None
eval_drop_remainder = False
eval_input_fn = file_based_input_fn_builder(
input_file=eval_file,
batch_size=FLAGS.eval_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=eval_drop_remainder)
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
with tf.gfile.Open(output_eval_file, "w") as writer:
tf.logging.info("***** Eval results *****")
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if FLAGS.do_predict and master_process:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
filed_based_convert_examples_to_features(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file, mode="test")
with tf.gfile.Open(os.path.join(FLAGS.output_dir, 'label2id.pkl'), 'rb') as rf:
label2id = pickle.load(rf)
id2label = {value: key for key, value in label2id.items()}
token_path = os.path.join(FLAGS.output_dir, "token_test.txt")
if tf.gfile.Exists(token_path):
tf.gfile.Remove(token_path)
tf.logging.info("***** Running prediction*****")
tf.logging.info(" Num examples = %d", len(predict_examples))
tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
predict_drop_remainder = False
predict_input_fn = file_based_input_fn_builder(
input_file=predict_file,
batch_size=FLAGS.predict_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=predict_drop_remainder)
eval_hooks = [LogEvalRunHook(FLAGS.predict_batch_size)]
eval_start_time = time.time()
output_predict_file = os.path.join(FLAGS.output_dir, "label_test.txt")
test_labels_file = os.path.join(FLAGS.output_dir, "test_labels.txt")
test_labels_err_file = os.path.join(FLAGS.output_dir, "test_labels_errs.txt")
with tf.gfile.Open(output_predict_file, 'w') as writer, \
tf.gfile.Open(test_labels_file, 'w') as tl, \
tf.gfile.Open(test_labels_err_file, 'w') as tle:
print(id2label)
i=0
for prediction in estimator.predict(input_fn=predict_input_fn, hooks=eval_hooks,
yield_single_examples=True):
output_line = "\n".join(id2label[id] for id in prediction if id != 0) + "\n"
writer.write(output_line)
result_to_pair(predict_examples[i], prediction, id2label, tl, tle)
i = i + 1
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
avg = np.mean(time_list)
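# time_list is sorted ascending, so the max of the first p fraction is the p-th percentile batch latency.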
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
cf_90 = max(time_list[:int(len(time_list) * 0.90)])
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
tf.logging.info("-----------------------------")
tf.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
tf.logging.info("Summary Inference Statistics")
tf.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
tf.logging.info("Latency Confidence Level 50 (ms) = %0.2f", cf_50 * 1000)
tf.logging.info("Latency Confidence Level 90 (ms) = %0.2f", cf_90 * 1000)
tf.logging.info("Latency Confidence Level 95 (ms) = %0.2f", cf_95 * 1000)
tf.logging.info("Latency Confidence Level 99 (ms) = %0.2f", cf_99 * 1000)
tf.logging.info("Latency Confidence Level 100 (ms) = %0.2f", cf_100 * 1000)
tf.logging.info("Latency Average (ms) = %0.2f", avg * 1000)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
tf.logging.info('Reading: %s', test_labels_file)
with tf.gfile.Open(test_labels_file, "r") as f:
counts = evaluate(f)
eval_result = report_notprint(counts)
print(''.join(eval_result))
with tf.gfile.Open(os.path.join(FLAGS.output_dir, 'test_results_conlleval.txt'), 'w') as fd:
fd.write(''.join(eval_result))
if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("task_name")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()

View file

@ -26,6 +26,8 @@ import modeling
import optimization
import tensorflow as tf
import glob
from utils.utils import LogEvalRunHook
from tensorflow.core.protobuf import rewriter_config_pb2
flags = tf.flags
@ -244,6 +246,7 @@ def model_fn_builder(bert_config, init_checkpoint, learning_rate,
initialized_variable_names = {}
if init_checkpoint and (hvd is None or hvd.rank() == 0):
print("Loading checkpoint", init_checkpoint)
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
@ -528,7 +531,9 @@ def main(_):
tf.logging.info("**************************")
# config.gpu_options.per_process_gpu_memory_fraction = 0.7
if FLAGS.use_xla: config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
if FLAGS.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
config.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.NO_MEM_OPT
run_config = tf.estimator.RunConfig(
model_dir=FLAGS.output_dir,
@ -590,8 +595,29 @@ def main(_):
is_training=False,
hvd=None if not FLAGS.horovod else hvd)
eval_hooks = [LogEvalRunHook(FLAGS.eval_batch_size)]
eval_start_time = time.time()
result = estimator.evaluate(
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps, hooks=eval_hooks)
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
tf.logging.info("-----------------------------")
tf.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.eval_batch_size)
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
tf.logging.info("Summary Inference Statistics on EVAL set")
tf.logging.info("Batch size = %d", FLAGS.eval_batch_size)
tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:

View file

@ -0,0 +1,940 @@
# coding=utf-8
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import csv
import logging
import os, sys
import numpy as np
import tensorflow as tf
sys.path.append("/workspace/bert")
import modeling
import optimization
import tokenization
import time
import horovod.tensorflow as hvd
from utils.utils import LogEvalRunHook, LogTrainRunHook
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"data_dir", None,
"The input data dir. Should contain the .tsv files (or other data files) "
"for the task.")
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("task_name", None, "The name of the task to train.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
## Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 128,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool(
"do_predict", False,
"Whether to run the model in inference mode on the test set.")
flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
flags.DEFINE_float("learning_rate", 5e-6, "The initial learning rate for Adam.")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
batches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
class _ChemProtProcessor(DataProcessor):
"""Processor for the ChemProt data set."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir, file_name="dev.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "dev")
def get_test_examples(self, data_dir, file_name="test.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "test")
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
# skip header
if i == 0:
continue
guid = line[0]
text_a = tokenization.convert_to_unicode(line[1])
if set_type == "test":
label = self.get_labels()[-1]
else:
try:
label = tokenization.convert_to_unicode(line[2])
except IndexError:
logging.exception(line)
exit(1)
examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
class ChemProtProcessor(_ChemProtProcessor):
def get_labels(self):
"""See base class."""
return ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9", "false"]
class MedNLIProcessor(DataProcessor):
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir, file_name="dev.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "dev")
def get_test_examples(self, data_dir, file_name="test.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "test")
def get_labels(self):
"""See base class."""
return ['contradiction', 'entailment', 'neutral']
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = line[1]
text_a = tokenization.convert_to_unicode(line[2])
text_b = tokenization.convert_to_unicode(line[3])
if set_type == "test":
label = self.get_labels()[-1]
else:
label = tokenization.convert_to_unicode(line[0])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[0] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id,
is_real_example=True)
return feature
def file_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file):
"""Convert a set of `InputExample`s to a TFRecord file."""
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
[int(feature.is_real_example)])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()
def file_based_input_fn_builder(input_file, batch_size, seq_length, is_training,
drop_remainder, hvd=None):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([], tf.int64),
"is_real_example": tf.FixedLenFeature([], tf.int64),
}
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
"""The actual input function."""
#batch_size = params["batch_size"]
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
d = tf.data.TFRecordDataset(input_file)
if is_training:
if hvd is not None: d = d.shard(hvd.size(), hvd.rank())
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
return d
return input_fn
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
labels, num_labels, use_one_hot_embeddings):
"""Creates a classification model."""
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
# In the demo, we are doing a simple classification task on the entire
# segment.
#
# If you want to use the token-level output, use model.get_sequence_output()
# instead.
output_layer = model.get_pooled_output()
hidden_size = output_layer.shape[-1].value
output_weights = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer())
with tf.variable_scope("loss"):
if is_training:
# I.e., 0.1 dropout
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, logits, probabilities)
def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate=None,
num_train_steps=None, num_warmup_steps=None,
use_one_hot_embeddings=False, hvd=None, use_fp16=False):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_real_example = None
if "is_real_example" in features:
is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
else:
is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, logits, probabilities) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint and (hvd is None or hvd.rank() == 0):
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, hvd, False, use_fp16)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(per_example_loss, label_ids, logits, is_real_example):
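# Padding examples have is_real_example == 0 and therefore do not contribute to the metrics.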
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
accuracy = tf.metrics.accuracy(
labels=label_ids, predictions=predictions, weights=is_real_example)
loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
return {
"eval_accuracy": accuracy,
"eval_loss": loss,
}
eval_metric_ops = metric_fn(per_example_loss, label_ids, logits, is_real_example)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
eval_metric_ops=eval_metric_ops)
else:
output_spec = tf.estimator.EstimatorSpec(
mode=mode, predictions={"probabilities": probabilities})#predicts)#probabilities)
return output_spec
return model_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def input_fn_builder(features, seq_length, is_training, drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
all_input_ids = []
all_input_mask = []
all_segment_ids = []
all_label_ids = []
for feature in features:
all_input_ids.append(feature.input_ids)
all_input_mask.append(feature.input_mask)
all_segment_ids.append(feature.segment_ids)
all_label_ids.append(feature.label_id)
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
num_examples = len(features)
# This is for demo purposes and does NOT scale to large data sets. We do
# not use Dataset.from_generator() because that uses tf.py_func which is
# not TPU compatible. The right way to load data is with TFRecordReader.
d = tf.data.Dataset.from_tensor_slices({
"input_ids":
tf.constant(
all_input_ids, shape=[num_examples, seq_length],
dtype=tf.int32),
"input_mask":
tf.constant(
all_input_mask,
shape=[num_examples, seq_length],
dtype=tf.int32),
"segment_ids":
tf.constant(
all_segment_ids,
shape=[num_examples, seq_length],
dtype=tf.int32),
"label_ids":
tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
})
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
return d
return input_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"chemprot": ChemProtProcessor,
'mednli': MedNLIProcessor,
}
tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
FLAGS.init_checkpoint)
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
raise ValueError(
"At least one of `do_train`, `do_eval` or `do_predict' must be True.")
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
if FLAGS.max_seq_length > bert_config.max_position_embeddings:
raise ValueError(
"Cannot use sequence length %d because the BERT model "
"was only trained up to sequence length %d" %
(FLAGS.max_seq_length, bert_config.max_position_embeddings))
tf.gfile.MakeDirs(FLAGS.output_dir)
task_name = FLAGS.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
processor = processors[task_name]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
master_process = True
training_hooks = []
global_batch_size = FLAGS.train_batch_size
hvd_rank = 0
config = tf.ConfigProto()
if FLAGS.horovod:
global_batch_size = FLAGS.train_batch_size * hvd.size()
master_process = (hvd.rank() == 0)
hvd_rank = hvd.rank()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
if hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
if FLAGS.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
run_config = tf.estimator.RunConfig(
model_dir=FLAGS.output_dir if master_process else None,
session_config=config,
save_checkpoints_steps=FLAGS.save_checkpoints_steps if master_process else None,
keep_checkpoint_max=1)
if master_process:
tf.logging.info("***** Configuaration *****")
for key in FLAGS.__flags.keys():
tf.logging.info(' {}: {}'.format(key, getattr(FLAGS, key)))
tf.logging.info("**************************")
train_examples = None
num_train_steps = None
num_warmup_steps = None
training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank))
if FLAGS.do_train:
train_examples = processor.get_train_examples(FLAGS.data_dir)
num_train_steps = int(
len(train_examples) / global_batch_size * FLAGS.num_train_epochs)
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
start_index = 0
end_index = len(train_examples)
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record")]
if FLAGS.horovod:
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.size())]
num_examples_per_rank = len(train_examples) // hvd.size()
remainder = len(train_examples) % hvd.size()
if hvd.rank() < remainder:
start_index = hvd.rank() * (num_examples_per_rank+1)
end_index = start_index + num_examples_per_rank + 1
else:
start_index = hvd.rank() * num_examples_per_rank + remainder
end_index = start_index + (num_examples_per_rank)
model_fn = model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list),
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate if not FLAGS.horovod else FLAGS.learning_rate * hvd.size(),
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_one_hot_embeddings=False,
hvd=None if not FLAGS.horovod else hvd,
use_fp16=FLAGS.use_fp16)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config)
if FLAGS.do_train:
file_based_convert_examples_to_features(
train_examples[start_index:end_index], label_list, FLAGS.max_seq_length, tokenizer, tmp_filenames[hvd_rank])
tf.logging.info("***** Running training *****")
tf.logging.info(" Num examples = %d", len(train_examples))
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = file_based_input_fn_builder(
input_file=tmp_filenames,
batch_size=FLAGS.train_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True,
hvd=None if not FLAGS.horovod else hvd)
train_start_time = time.time()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=training_hooks)
train_time_elapsed = time.time() - train_start_time
train_time_wo_overhead = training_hooks[-1].total_time
avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
ss_sentences_per_second = (num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
if master_process:
tf.logging.info("-----------------------------")
tf.logging.info("Total Training Time = %0.2f for Sentences = %d", train_time_elapsed,
num_train_steps * global_batch_size)
tf.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d", train_time_wo_overhead,
(num_train_steps - training_hooks[-1].skipped) * global_batch_size)
tf.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
if FLAGS.do_eval and master_process:
eval_examples = processor.get_dev_examples(FLAGS.data_dir)
num_actual_eval_examples = len(eval_examples)
eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
file_based_convert_examples_to_features(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(eval_examples), num_actual_eval_examples,
len(eval_examples) - num_actual_eval_examples)
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
# This tells the estimator to run through the entire set.
eval_steps = None
eval_drop_remainder = False
eval_input_fn = file_based_input_fn_builder(
input_file=eval_file,
batch_size=FLAGS.eval_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=eval_drop_remainder)
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:
tf.logging.info("***** Eval results *****")
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if FLAGS.do_predict and master_process:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
num_actual_predict_examples = len(predict_examples)
predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
file_based_convert_examples_to_features(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file)
tf.logging.info("***** Running prediction*****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(predict_examples), num_actual_predict_examples,
len(predict_examples) - num_actual_predict_examples)
tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
predict_drop_remainder = False
predict_input_fn = file_based_input_fn_builder(
input_file=predict_file,
batch_size=FLAGS.predict_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=predict_drop_remainder)
eval_hooks = [LogEvalRunHook(FLAGS.predict_batch_size)]
eval_start_time = time.time()
output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
with tf.gfile.GFile(output_predict_file, "w") as writer:
num_written_lines = 0
tf.logging.info("***** Predict results *****")
for prediction in estimator.predict(input_fn=predict_input_fn, hooks=eval_hooks,
yield_single_examples=True):
probabilities = prediction["probabilities"]
output_line = "\t".join(
str(class_probability)
for class_probability in probabilities) + "\n"
writer.write(output_line)
num_written_lines += 1
assert num_written_lines == num_actual_predict_examples
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
avg = np.mean(time_list)
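# With latencies sorted in ascending order, the N% confidence level below is approximated
# as the largest value among the first N% of measurements.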
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
cf_90 = max(time_list[:int(len(time_list) * 0.90)])
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
tf.logging.info("-----------------------------")
tf.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
tf.logging.info("Summary Inference Statistics")
tf.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
tf.logging.info("Latency Confidence Level 50 (ms) = %0.2f", cf_50 * 1000)
tf.logging.info("Latency Confidence Level 90 (ms) = %0.2f", cf_90 * 1000)
tf.logging.info("Latency Confidence Level 95 (ms) = %0.2f", cf_95 * 1000)
tf.logging.info("Latency Confidence Level 99 (ms) = %0.2f", cf_99 * 1000)
tf.logging.info("Latency Confidence Level 100 (ms) = %0.2f", cf_100 * 1000)
tf.logging.info("Latency Average (ms) = %0.2f", avg * 1000)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("task_name")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()

View file

@ -921,7 +921,6 @@ def main(_):
training_hooks = []
global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps
hvd_rank = 0
hvd_local_rank = 0
config = tf.ConfigProto()
learning_rate = FLAGS.learning_rate
@ -933,7 +932,6 @@ def main(_):
learning_rate = learning_rate * hvd.size()
master_process = (hvd.rank() == 0)
hvd_rank = hvd.rank()
hvd_local_rank = hvd.local_rank()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
if hvd.size() > 1:
@ -976,15 +974,15 @@ def main(_):
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record")]
if FLAGS.horovod:
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.local_size())]
num_examples_per_local_rank = len(train_examples) // hvd.local_size()
remainder = len(train_examples) % hvd.local_size()
if hvd.local_rank() < remainder:
start_index = hvd.local_rank() * (num_examples_per_local_rank+1)
end_index = start_index + num_examples_per_local_rank + 1
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.size())]
num_examples_per_rank = len(train_examples) // hvd.size()
remainder = len(train_examples) % hvd.size()
if hvd.rank() < remainder:
start_index = hvd.rank() * (num_examples_per_rank+1)
end_index = start_index + num_examples_per_rank + 1
else:
start_index = hvd.local_rank() * num_examples_per_local_rank + remainder
end_index = start_index + (num_examples_per_local_rank)
start_index = hvd.rank() * num_examples_per_rank + remainder
end_index = start_index + (num_examples_per_rank)
model_fn = model_fn_builder(
@ -1005,7 +1003,7 @@ def main(_):
# We write to a temporary file to avoid storing very large constant tensors
# in memory.
train_writer = FeatureWriter(
filename=tmp_filenames[hvd_local_rank],
filename=tmp_filenames[hvd_rank],
is_training=True)
convert_examples_to_features(
examples=train_examples[start_index:end_index],
@ -1025,10 +1023,6 @@ def main(_):
tf.logging.info(" Num steps = %d", num_train_steps)
tf.logging.info(" LR = %f", learning_rate)
del train_examples
if FLAGS.horovod:
barrier = hvd.allreduce(tf.constant(0))
with tf.Session(config=config) as sess:
sess.run(barrier)
train_input_fn = input_fn_builder(
input_file=tmp_filenames,

View file

@ -1,10 +1,9 @@
#!/bin/bash
docker pull nvcr.io/nvidia/tensorrtserver:19.06-py3
docker pull nvcr.io/nvidia/tensorrtserver:19.08-py3
#The follow has been commented out since we need fixes for the perf_client from Guan
#Uncomment to enable building.
#For now, the tensorrt_client can be downloaded from https://drive.google.com/drive/u/1/folders/1CeOMZbnFT1VUIlIMoDEZJb3kOKbXBDbZ
git submodule update --init --recursive && cd tensorrt-inference-server && docker build -t tensorrtserver_client -f Dockerfile.client . && cd -
#Will have to update submodule from root
git submodule update --init --recursive
cd tensorrt-inference-server && docker build -t tensorrtserver_client -f Dockerfile.client . && cd -
docker build . --rm -t bert

View file

@ -14,8 +14,7 @@
# limitations under the License.
bert_model=${1:-"large"}
use_xla=${2:-"true"}
task=${3:-"squad"}
task=${2:-"squad"}
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
@ -26,13 +25,6 @@ echo "BERT directory set as " $BERT_DIR
init_checkpoint="$BERT_DIR/bert_model.ckpt"
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
#Edit to save logs & checkpoints in a different directory
RESULTS_DIR=/results
if [ ! -d "$RESULTS_DIR" ] ; then
@ -49,8 +41,7 @@ if [ "$task" = "squad" ] ; then
echo "Squad directory set as " $SQUAD_DIR
echo "Inference performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision $precision" >> $LOGFILE
echo "Sequence-Length Batch-size Precision Throughput-Average(sent/sec) Latency-Average(ms) Latency-50%(ms) Latency-90%(ms) Latency-95%(ms) Latency-99%(ms) Latency-100%(ms)" >> $LOGFILE
echo "Precision Sequence-Length Batch-size Precision Throughput-Average(sent/sec) Latency-Average(ms) Latency-50%(ms) Latency-90%(ms) Latency-95%(ms) Latency-99%(ms) Latency-100%(ms)" >> $LOGFILE
for seq_len in 128 384; do
@ -60,11 +51,13 @@ if [ "$task" = "squad" ] ; then
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
echo "fp16 and XLA activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python run_squad.py \
@ -80,7 +73,7 @@ if [ "$task" = "squad" ] ; then
"$use_fp16" \
$use_xla_tag --num_eval_iterations=1024 |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec)' | awk -F'= ' '{print $2}'`
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}'`
la=`cat $tmp_file | grep -F 'Latency Average (ms)' | awk -F'= ' '{print $2}'`
l50=`cat $tmp_file | grep -F 'Latency Confidence Level 50 (ms)' | awk -F'= ' '{print $2}'`
l90=`cat $tmp_file | grep -F 'Latency Confidence Level 90 (ms)' | awk -F'= ' '{print $2}'`
@ -88,7 +81,7 @@ if [ "$task" = "squad" ] ; then
l99=`cat $tmp_file | grep -F 'Latency Confidence Level 99 (ms)' | awk -F'= ' '{print $2}'`
l100=`cat $tmp_file | grep -F 'Latency Confidence Level 100 (ms)' | awk -F'= ' '{print $2}'`
echo "$seq_len $bs $precision $perf $la $l50 $l90 $l95 $l99 $l100" >> $LOGFILE
echo "$precision $seq_len $bs $precision $perf $la $l50 $l90 $l95 $l99 $l100" >> $LOGFILE
done
done

View file

@ -64,8 +64,7 @@ if [ "$task" = "squad" ] ; then
echo "Squad directory set as " $SQUAD_DIR
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision $precision" >> $LOGFILE
echo "Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_len in 128 384; do
@ -104,8 +103,8 @@ if [ "$task" = "squad" ] ; then
"$use_fp16" \
$use_xla_tag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Training Performance' | awk -F'= ' '{print $2}'`
echo "$seq_len $batch_size $perf"
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_len $batch_size $perf" >> $LOGFILE
done
done

View file

@ -0,0 +1,215 @@
"""
Multiclass
from:
https://github.com/guillaumegenthial/tf_metrics/blob/master/tf_metrics/__init__.py
"""
__author__ = "Guillaume Genthial"
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix
def precision(labels, predictions, num_classes, pos_indices=None,
weights=None, average='micro'):
"""Multi-class precision metric for Tensorflow
Parameters
----------
labels : Tensor of tf.int32 or tf.int64
The true labels
predictions : Tensor of tf.int32 or tf.int64
The predictions, same shape as labels
num_classes : int
The number of classes
pos_indices : list of int, optional
The indices of the positive classes, default is all
weights : Tensor of tf.int32, optional
Mask, must be of compatible shape with labels
average : str, optional
'micro': counts the total number of true positives, false
positives, and false negatives for the classes in
`pos_indices` and infer the metric from it.
'macro': will compute the metric separately for each class in
`pos_indices` and average. Will not account for class
imbalance.
'weighted': will compute the metric separately for each class in
`pos_indices` and perform a weighted average by the total
number of true labels for each class.
Returns
-------
tuple of (scalar float Tensor, update_op)
"""
cm, op = _streaming_confusion_matrix(
labels, predictions, num_classes, weights)
pr, _, _ = metrics_from_confusion_matrix(
cm, pos_indices, average=average)
op, _, _ = metrics_from_confusion_matrix(
op, pos_indices, average=average)
return (pr, op)
def recall(labels, predictions, num_classes, pos_indices=None, weights=None,
average='micro'):
"""Multi-class recall metric for Tensorflow
Parameters
----------
labels : Tensor of tf.int32 or tf.int64
The true labels
predictions : Tensor of tf.int32 or tf.int64
The predictions, same shape as labels
num_classes : int
The number of classes
pos_indices : list of int, optional
The indices of the positive classes, default is all
weights : Tensor of tf.int32, optional
Mask, must be of compatible shape with labels
average : str, optional
'micro': counts the total number of true positives, false
positives, and false negatives for the classes in
`pos_indices` and infer the metric from it.
'macro': will compute the metric separately for each class in
`pos_indices` and average. Will not account for class
imbalance.
'weighted': will compute the metric separately for each class in
`pos_indices` and perform a weighted average by the total
number of true labels for each class.
Returns
-------
tuple of (scalar float Tensor, update_op)
"""
cm, op = _streaming_confusion_matrix(
labels, predictions, num_classes, weights)
_, re, _ = metrics_from_confusion_matrix(
cm, pos_indices, average=average)
_, op, _ = metrics_from_confusion_matrix(
op, pos_indices, average=average)
return (re, op)
def f1(labels, predictions, num_classes, pos_indices=None, weights=None,
average='micro'):
return fbeta(labels, predictions, num_classes, pos_indices, weights,
average)
def fbeta(labels, predictions, num_classes, pos_indices=None, weights=None,
average='micro', beta=1):
"""Multi-class fbeta metric for Tensorflow
Parameters
----------
labels : Tensor of tf.int32 or tf.int64
The true labels
predictions : Tensor of tf.int32 or tf.int64
The predictions, same shape as labels
num_classes : int
The number of classes
pos_indices : list of int, optional
The indices of the positive classes, default is all
weights : Tensor of tf.int32, optional
Mask, must be of compatible shape with labels
average : str, optional
'micro': counts the total number of true positives, false
positives, and false negatives for the classes in
`pos_indices` and infer the metric from it.
'macro': will compute the metric separately for each class in
`pos_indices` and average. Will not account for class
imbalance.
'weighted': will compute the metric separately for each class in
`pos_indices` and perform a weighted average by the total
number of true labels for each class.
beta : int, optional
Weight of precision in harmonic mean
Returns
-------
tuple of (scalar float Tensor, update_op)
"""
cm, op = _streaming_confusion_matrix(
labels, predictions, num_classes, weights)
_, _, fbeta = metrics_from_confusion_matrix(
cm, pos_indices, average=average, beta=beta)
_, _, op = metrics_from_confusion_matrix(
op, pos_indices, average=average, beta=beta)
return (fbeta, op)
def safe_div(numerator, denominator):
"""Safe division, return 0 if denominator is 0"""
numerator, denominator = tf.to_float(numerator), tf.to_float(denominator)
zeros = tf.zeros_like(numerator, dtype=numerator.dtype)
denominator_is_zero = tf.equal(denominator, zeros)
return tf.where(denominator_is_zero, zeros, numerator / denominator)
def pr_re_fbeta(cm, pos_indices, beta=1):
"""Uses a confusion matrix to compute precision, recall and fbeta"""
num_classes = cm.shape[0]
neg_indices = [i for i in range(num_classes) if i not in pos_indices]
cm_mask = np.ones([num_classes, num_classes])
cm_mask[neg_indices, neg_indices] = 0
diag_sum = tf.reduce_sum(tf.diag_part(cm * cm_mask))
cm_mask = np.ones([num_classes, num_classes])
cm_mask[:, neg_indices] = 0
tot_pred = tf.reduce_sum(cm * cm_mask)
cm_mask = np.ones([num_classes, num_classes])
cm_mask[neg_indices, :] = 0
tot_gold = tf.reduce_sum(cm * cm_mask)
pr = safe_div(diag_sum, tot_pred)
re = safe_div(diag_sum, tot_gold)
fbeta = safe_div((1. + beta**2) * pr * re, beta**2 * pr + re)
return pr, re, fbeta
def metrics_from_confusion_matrix(cm, pos_indices=None, average='micro',
beta=1):
"""Precision, Recall and F1 from the confusion matrix
Parameters
----------
cm : tf.Tensor of type tf.int32, of shape (num_classes, num_classes)
The streaming confusion matrix.
pos_indices : list of int, optional
The indices of the positive classes
beta : int, optional
Weight of precision in harmonic mean
average : str, optional
'micro', 'macro' or 'weighted'
"""
num_classes = cm.shape[0]
if pos_indices is None:
pos_indices = [i for i in range(num_classes)]
if average == 'micro':
return pr_re_fbeta(cm, pos_indices, beta)
elif average in {'macro', 'weighted'}:
precisions, recalls, fbetas, n_golds = [], [], [], []
for idx in pos_indices:
pr, re, fbeta = pr_re_fbeta(cm, [idx], beta)
precisions.append(pr)
recalls.append(re)
fbetas.append(fbeta)
cm_mask = np.zeros([num_classes, num_classes])
cm_mask[idx, :] = 1
n_golds.append(tf.to_float(tf.reduce_sum(cm * cm_mask)))
if average == 'macro':
pr = tf.reduce_mean(precisions)
re = tf.reduce_mean(recalls)
fbeta = tf.reduce_mean(fbetas)
return pr, re, fbeta
if average == 'weighted':
n_gold = tf.reduce_sum(n_golds)
pr_sum = sum(p * n for p, n in zip(precisions, n_golds))
pr = safe_div(pr_sum, n_gold)
re_sum = sum(r * n for r, n in zip(recalls, n_golds))
re = safe_div(re_sum, n_gold)
fbeta_sum = sum(f * n for f, n in zip(fbetas, n_golds))
fbeta = safe_div(fbeta_sum, n_gold)
return pr, re, fbeta
else:
raise NotImplementedError()
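# Illustrative usage inside an Estimator's eval metrics (names such as `label_ids`,
# `pred_ids` and `num_labels` are placeholders, not part of this module). Each function
# returns a (value, update_op) pair, which is exactly what `eval_metric_ops` expects:
#
#   eval_metric_ops = {
#       "eval_precision": precision(label_ids, pred_ids, num_labels, average='macro'),
#       "eval_recall": recall(label_ids, pred_ids, num_labels, average='macro'),
#       "eval_f1": f1(label_ids, pred_ids, num_labels, average='macro'),
#   }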

View file

@ -0,0 +1,108 @@
# Deploying the BERT model using TensorRT Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains a detailed performance analysis as well as scripts to run SQuAD fine-tuning on the BERT model using the TensorRT Inference Server.
## Table Of Contents
- [TensorRT Inference Server Overview](#tensorrt-inference-server-overview)
- [Performance analysis for TensorRT Inference Server](#performance-analysis-for-tensorrt-inference-server)
* [Advanced Details](#advanced-details)
- [Running the TensorRT Inference Server and client](#running-the-tensorrt-inference-server-and-client)
## TensorRT Inference Server Overview
A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
4. Request is placed on the queue (Server Queue)
5. Request is removed from the queue and computed (Server Compute)
6. Completed request is serialized in a message and sent back to the client (Server Send)
7. Completed message travels over network from the server to the client (Network)
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
Generally, for local clients, steps 1-4 and 6-8 occupy only a small fraction of the total time compared to step 5 (Server Compute). Because backend deep learning systems like BERT are rarely exposed directly to end users and instead interface only with local front-end servers, we can treat all clients as local for the purposes of BERT.
In this section, we will go over how to launch the TensorRT Inference Server and client, and how to arrive at the best-performing configuration for your specific application needs.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
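Before running any of the clients below, you can quickly check that a launched server is ready to accept requests. The snippet below is a convenience sketch, assuming the server was started with the default HTTP port 8000:

```bash
# Prints 200 once the server is ready to serve requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/api/health/ready
```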
## Performance analysis for TensorRT Inference Server
Based on Figures 1 and 2 below, we recommend using the Dynamic Batcher with `max_batch_size = 8`, `max_queue_delay_microseconds` as large as possible while still fitting within your latency window (the values used below are extremely large to exaggerate their effect), and only 1 instance of the engine. The largest improvements to both throughput and latency come from increasing the batch size, due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of simultaneous requests, while also keeping latency down for infrequent requests. We recommend only 1 instance of the engine because the throughput improvement from additional instances is negligible while latency increases significantly. Many models can benefit from multiple engine instances, but as the figures below show, that is not the case for this model.
![](../data/images/trtis_base_summary.png?raw=true)
Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in TensorRT Inference Server
![](../data/images/trtis_large_summary.png?raw=true)
Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in TensorRT Inference Server
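For reference, the recommended settings above translate into a model configuration along the following lines. This is a minimal sketch only: the input/output sections are omitted, and the model name and repository path are placeholders rather than the values used by the export scripts, which generate the actual file.

```bash
# Sketch of a TensorRT Inference Server model config with the recommended settings
# (placeholders: model name "bert", repository path /results/trtis_models)
mkdir -p /results/trtis_models/bert/1
cat > /results/trtis_models/bert/config.pbtxt <<'EOF'
name: "bert"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
instance_group [
  { count: 1, kind: KIND_GPU }
]
EOF
```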
### Advanced Details
This section digs deeper into the performance numbers and configurations corresponding to running TensorRT Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
```bash
bash trtis/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
```
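For example, to generate the BERT Base, sequence length 128, FP16 results discussed below (the checkpoint path is a placeholder for your fine-tuned model):

```bash
bash trtis/scripts/generate_figures.sh base 128 fp16 /results/model.ckpt
```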
All results below are obtained on a single DGX-1 V100 32GB GPU for BERT Base, Sequence Length = 128 and FP16 precision running on a local server. Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. X-axis indicates the concurrency - the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput.
#### Maximum batch size
As we can see in Figure 3, the throughput at BS=1 with 64 concurrent client requests is 119, while in Figure 4 the throughput at BS=8 with 8 concurrent client requests is 517, a speedup of ~4.3x.
Note: We compare BS=1 with 64 concurrent client requests to BS=8 with 8 concurrent client requests to keep the total number of outstanding requests (Batch Size * Client Concurrent Requests) equal between the two modes. This is also why there are 8 times as many bars on the BS=1 chart as on the BS=8 chart.
Increasing the batch size from 1 to 8 increases the compute time by only 1.8x (8.38 ms to 15.46 ms), showing that computation is more efficient at higher batch sizes. Hence, an optimal batch size is the largest one that both fits in memory and stays within the preferred latency threshold.
![](../data/images/trtis_bs_1.png?raw=true)
Figure 3: Latency & Throughput vs Concurrency at Batch size = 1
![](../data/images/trtis_bs_8.png?raw=true)
Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
#### Batching techniques
Static batching is a feature of the inference server that allows inference requests to be served as they are received. It is preferred in scenarios where low latency is desired at the cost of throughput when the GPU is underutilized.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds`, to indicate the maximum amount of time you are willing to wait, and `preferred_batch_size`, to indicate your preferred batch sizes, in the TensorRT Inference Server model config.
Figures 5 and 6 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency as requests are queued for up to `max_queue_delay_microseconds`. The effect of `preferred_batch_size` for dynamic batching is visually depicted by the dip in Server Queue time at integer multiples of the preferred batch sizes. At higher numbers of concurrent requests, the throughput approaches a maximum limit as GPU utilization saturates.
![](../data/images/trtis_static.png?raw=true)
Figure 5: Latency & Throughput vs Concurrency using Static Batching at `Batch size` = 1
![](../data/images/trtis_dynamic.png?raw=true)
Figure 6: Latency & Throughput vs Concurrency using Dynamic Batching at `Batch size` = 1, `preferred_batchsize` = [4, 8] and `max_queue_delay_microseconds` = 5000
#### Model execution instance count
TensorRT Inference Server enables us to launch multiple engines in separate CUDA streams by setting the engine count (the `count` field of `instance_group` in the model config), which can improve both latency and throughput. Multiple engines are useful when the model does not saturate the GPU on its own, allowing the GPU to run several instances of the model in parallel.
Figures 7 and 8 show a drop in queue time as more model instances are available to serve an inference request. However, this is countered by an increase in compute time as the instances compete for GPU resources. Since BERT is a large model that utilizes the majority of the GPU, running multiple engines shows no benefit here.
![](../data/images/trtis_ec_1.png?raw=true)
Figure 7: Latency & Throughput vs Concurrency at Batch size = 1, Engine Count = 1
(One copy of the model loaded in GPU memory)
![](../data/images/trtis_ec_4.png?raw=true)
Figure 8: Latency & Throughput vs Concurrency at Batch size = 1, Engine count = 4
(Four copies of the model loaded in GPU memory)
## Running the TensorRT Inference Server and client
The `run_trtis.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state, runs the client, and then evaluates the validity of predictions on the basis of exact match and F1 score, all in one step.
```bash
bash trtis/scripts/run_trtis.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <trtis_version_name> <trtis_model_name> <trtis_export_model> <trtis_dyn_batching_delay> <trtis_engine_count> <trtis_model_overwrite>
```
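For example, an illustrative invocation with explicit values for every argument (the checkpoint path is a placeholder; trailing arguments can be dropped to fall back to the script defaults):

```bash
bash trtis/scripts/run_trtis.sh /results/model.ckpt 8 fp16 true 384 128 large 1.1 1 bert true 0 1 False
```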

View file

@ -38,12 +38,12 @@ EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DI
PERF_CLIENT_ARGS="1000 10 20 localhost"
# Start Server
./scripts/docker/launch_server.sh $precision
./trtis/scripts/launch_server.sh $precision
# Restart Server
restart_server() {
docker kill trt_server_cont
./scripts/docker/launch_server.sh $precision
./trtis/scripts/launch_server.sh $precision
}
############## Dynamic Batching Comparison ##############
@ -53,32 +53,32 @@ TRTIS_ENGINE_COUNT=1
# Dynamic batching 10 ms
TRTIS_DYN_BATCHING_DELAY=10
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 5 ms
TRTIS_DYN_BATCHING_DELAY=5
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 2 ms
TRTIS_DYN_BATCHING_DELAY=2
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Static Batching (i.e. Dynamic batching 0 ms)
TRTIS_DYN_BATCHING_DELAY=0
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# ############## Engine Count Comparison ##############
@ -88,24 +88,24 @@ TRTIS_DYN_BATCHING_DELAY=0
# Engine Count = 4
TRTIS_ENGINE_COUNT=4
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 2
TRTIS_ENGINE_COUNT=2
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 1
TRTIS_ENGINE_COUNT=1
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
############## Batch Size Comparison ##############
@ -115,32 +115,32 @@ CLIENT_BATCH_SIZE=1
TRTIS_ENGINE_COUNT=1
TRTIS_DYN_BATCHING_DELAY=0
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
# BATCH=2 Generate model and perf
SERVER_BATCH_SIZE=2
CLIENT_BATCH_SIZE=2
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
# BATCH=4 Generate model and perf
SERVER_BATCH_SIZE=4
CLIENT_BATCH_SIZE=4
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
# BATCH=8 Generate model and perf
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=8
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost

View file

@ -35,7 +35,7 @@ if [ ! -d "$SQUAD_DIR" ] ; then
fi
bash scripts/docker/launch.sh \
"python run_squad_trtis_client.py \
"python trtis/run_squad_trtis_client.py \
--trtis_model_name=$trtis_model_name \
--trtis_model_version=$trtis_version_name \
--vocab_file=$BERT_DIR/vocab.txt \

View file

@ -32,7 +32,7 @@ then
if [ ! "$(docker inspect -f "{{.State.Running}}" trt_server_cont)" = "true" ] ; then
echo "Launching TRTIS server"
bash scripts/docker/launch_server.sh $precision
bash trtis/scripts/launch_server.sh $precision
SERVER_LAUNCHED=true
function cleanup_server {
@ -47,7 +47,7 @@ then
fi
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash scripts/trtis/wait_for_trtis_server.sh $SERVER_HOSTNAME
bash trtis/scripts/wait_for_trtis_server.sh $SERVER_HOSTNAME
TIMESTAMP=$(date "+%y%m%d_%H%M")

View file

@ -22,8 +22,8 @@ doc_stride=${6:-"128"}
bert_model=${7:-"large"}
squad_version=${8:-"1.1"}
trtis_version_name=${9:-1}
trtis_model_name=${10:-"bert"}
trtis_export_model=${11:-"true"}
trtis_model_name=${10:-"bert_onnx"}
trtis_export_model=${11:-"false"}
trtis_dyn_batching_delay=${12:-0}
trtis_engine_count=${13:-1}
trtis_model_overwrite=${14:-"False"}
@ -68,19 +68,19 @@ echo
if [ "$trtis_export_model" = "true" ] ; then
echo "Exporting model as: Name - $trtis_model_name Version - $trtis_version_name"
bash scripts/trtis/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
bash trtis/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
$doc_stride $BERT_DIR $RESULTS_DIR $trtis_version_name $trtis_model_name \
$trtis_dyn_batching_delay $trtis_engine_count $trtis_model_overwrite
fi
# Start TRTIS server in detached state
bash scripts/docker/launch_server.sh $precision
bash trtis/scripts/launch_server.sh $precision
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash scripts/trtis/wait_for_trtis_server.sh localhost
bash trtis/scripts/wait_for_trtis_server.sh localhost
# Start TRTIS client for inference and evaluate results
bash scripts/trtis/run_client.sh $batch_size $seq_length $doc_stride $trtis_version_name $trtis_model_name \
bash trtis/scripts/run_client.sh $batch_size $seq_length $doc_stride $trtis_version_name $trtis_model_name \
$BERT_DIR $squad_version