Merge pull request #277 from NVIDIA/nvpstr/release

Updating BERT/TF
This commit is contained in:
nvpstr 2019-11-04 23:33:38 +01:00 committed by GitHub
commit 4760c03a04
42 changed files with 5219 additions and 446 deletions


@ -15,6 +15,7 @@
.git/
__pycache__/
results/
data/binary
data/download
data/extracted
data/formatted_one_article_per_line


@ -23,6 +23,7 @@ COPY --from=trt /workspace/build/perf_client /workspace/build/perf_client
COPY --from=trt /workspace/build/dist/dist/tensorrtserver*.whl /tmp/
RUN pip install /tmp/tensorrtserver*.whl && rm /tmp/tensorrtserver*.whl
WORKDIR /workspace/bert
COPY . .


@ -28,9 +28,7 @@ This repository provides a script and recipe to train the BERT model for TensorF
* [Multi-node](#multi-node)
* [Inference process](#inference-process)
* [Deploying the BERT model using TensorRT Inference Server](#deploying-the-bert-model-using-tensorrt-inference-server)
* [Performance analysis for TensorRT Inference Server](#performance-analysis-for-tensorrt-inference-server)
* [Advanced Details](#advanced-details)
* [Running the TensorRT Inference Server and client](#running-the-tensorrt-inference-server-and-client)
* [BioBERT](#biobert)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
@ -138,6 +136,8 @@ Multi-GPU training with Horovod - Our model uses Horovod to implement efficient
[LAMB](https://arxiv.org/pdf/1904.00962.pdf) stands for Layerwise Adaptive Moments based optimizer; it is a large-batch optimization technique that helps accelerate training of deep neural networks using large minibatches. It allows using a global batch size of 65536 and 32768 for sequence lengths 128 and 512 respectively, compared to a batch size of 256 for Adam. The optimized implementation accumulates gradients from 1024 batches in phase 1 and 4096 batches in phase 2 before updating the weights once. This results in a 27% training speedup on a single DGX2 node. On multi-node systems, LAMB allows scaling up to 1024 GPUs, resulting in training speedups of up to 17x in comparison to [Adam](https://arxiv.org/pdf/1412.6980.pdf). Adam has limitations on the learning rate that can be used since it is applied globally to all parameters, whereas LAMB follows a layerwise learning rate strategy.
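As a quick sanity check, the effective global batch size is the per-GPU batch size times the number of gradient accumulation steps times the number of GPUs. The sketch below uses the phase 1 values for a single DGX2 node from the pre-training table further down; the values are purely illustrative:
```bash
# Effective global batch size = per-GPU batch * accumulation steps * number of GPUs.
# Example values: phase 1 settings for a single DGX2 node (16 GPUs).
BATCH_PER_GPU=64
ACCUM_STEPS=64
NUM_GPUS=16
echo $(( BATCH_PER_GPU * ACCUM_STEPS * NUM_GPUS ))   # prints 65536
```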
NVLAMB adds the necessary tweaks to [LAMB version 1](https://arxiv.org/abs/1904.00962v1) to ensure correct convergence. The algorithm is as follows:
![NVLAMB](data/images/images_nvlamb.png)
### Mixed precision training
@ -183,7 +183,7 @@ The following section lists the requirements in order to start training the BERT
This repository contains a `Dockerfile` that extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
- [TensorFlow 19.06-py3+](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
- [TensorFlow 19.08-py3+](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
@ -227,7 +227,7 @@ bash scripts/data_download.sh
The script launches a Docker container with the current directory mounted and downloads the datasets to a `data/` folder on the host.
Note: The dataset is 170GB+ and takes 15+ hours to download. Expired dataset links are ignored during data download.
Note: The dataset is 170GB+ and takes 15+ hours to download. The BookCorpus server can sometimes be overloaded or contain broken links, resulting in HTTP 403 and 503 errors. You can either skip the missing files or retry downloading at a later time. Expired dataset links are ignored during data download.
4. Download the pretrained models from NGC.
@ -617,102 +617,13 @@ I0312 23:14:00.550973 140287431493376 run_squad.py:1397] 0 Inference Performance
### Deploying the BERT model using TensorRT Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `TensorRT Inference Server` can be found in the subfolder `./trtis/README.md`.
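As a quick way to confirm the server is up before sending requests, you can query its HTTP endpoint. The port and the v1 status API path below are assumptions based on the default TRTIS configuration of that era; adjust them to your deployment:
```bash
# Assumes the server was started with the default HTTP port (8000) exposed.
curl localhost:8000/api/status   # prints server and model status (TRTIS v1 API)
```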
A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
4. Request is placed on the queue (Server Queue)
5. Request is removed from the queue and computed (Server Compute)
6. Completed request is serialized in a message and sent back to the client (Server Send)
7. Completed message travels over network from the server to the client (Network)
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
### BioBERT
Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time compared to steps 5-6. Since backend deep learning systems like BERT are rarely exposed directly to end users and instead only interface with local front-end servers, for the sake of BERT we can consider all clients to be local.
In this section, we will go over how to launch the TensorRT Inference Server and client, and how to arrive at the best-performing solution that fits your specific application needs.
Many works, including [BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [SciBERT](https://arxiv.org/pdf/1903.10676.pdf), [NCBI-BERT](https://arxiv.org/pdf/1906.05474.pdf), [ClinicalBERT (MIT)](https://arxiv.org/pdf/1904.03323.pdf), [ClinicalBERT (NYU, Princeton)](https://arxiv.org/pdf/1904.05342.pdf), and others at the [BioNLP19 workshop](https://aclweb.org/aclwiki/BioNLP_Workshop), show that pre-training BERT on a large biomedical text corpus such as [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) results in better performance on biomedical text-mining tasks.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
#### Performance analysis for TensorRT Inference Server
Based on Figures 2 and 3 below, we recommend using the Dynamic Batcher with `max_batch_size = 8`, `max_queue_delay_microseconds` as large as possible to fit within your latency window (the values used below are extremely large to exaggerate their effect), and only 1 instance of the engine. The largest improvements to both throughput and latency come from increasing the batch size due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of simultaneous requests, while also keeping latency down for infrequent requests. We recommend only 1 instance of the engine because of the negligible improvement to throughput at the cost of significant increases in latency. Many models can benefit from multiple engine instances, but as the figures below show, that is not the case for this model.
![](data/images/trtis_base_summary.png?raw=true)
Figure 2: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in TensorRT Inference Server
![](data/images/trtis_large_summary.png?raw=true)
Figure 3: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in TensorRT Inference Server
##### Advanced Details
This section digs deeper into the performance numbers and configurations corresponding to running TensorRT Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
```bash
bash scripts/trtis/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
```
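For example, a hypothetical invocation for BERT Large at sequence length 128 in FP16 might look like the following; the argument values (`large`, `fp16`, the checkpoint path) are illustrative assumptions, so check the script itself for the exact values it accepts:
```bash
bash scripts/trtis/generate_figures.sh large 128 fp16 /results/model.ckpt
```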
All results below are obtained on a single DGX-1 V100 32GB GPU for BERT Base, Sequence Length = 128 and FP16 precision running on a local server. Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. X-axis indicates the concurrency - the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput.
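The same kind of sweep can also be run by hand with `perf_client`. The sketch below assumes the server is reachable on localhost and that the model was exported under the name `bert`; the path follows the Dockerfile above, and the flag names are assumptions that should be verified against `perf_client --help` for your TRTIS version:
```bash
# Sweep client concurrency from 1 to 8 at batch size 1 against a local server
# (model name, port, and flags are assumptions for this sketch).
/workspace/build/perf_client \
    -m bert \
    -b 1 \
    --concurrency-range 1:8 \
    -u localhost:8000
```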
###### Maximum batch size
As we can see in Figure 4, the throughput at BS=1, Client Concurrent Requests = 64 is 119, and in Figure 5, the throughput at BS=8, Client Concurrent Requests = 8 is 517, giving a speedup of ~4.3x.
Note: We compare BS=1, Client Concurrent Requests = 64 to BS=8, Client Concurrent Requests = 8 to keep the Total Number of Outstanding Requests equal between the two different modes, where Total Number of Outstanding Requests = Batch Size * Client Concurrent Requests. This is also why there are 8 times as many bars on the BS=1 chart as on the BS=8 chart.
Increasing the batch size from 1 to 8 results in an increase in compute time by 1.8x (8.38ms to 15.46ms) showing that computation is more efficient at higher batch sizes. Hence, an optimal batch size would be the maximum batch size that can both fit in memory and is within the preferred latency threshold.
![](data/images/trtis_bs_1.png?raw=true)
Figure 4: Latency & Throughput vs Concurrency at Batch size = 1
![](data/images/trtis_bs_8.png?raw=true)
Figure 5: Latency & Throughput vs Concurrency at Batch size = 8
###### Batching techniques
Static batching is a feature of the inference server that allows inference requests to be served as they are received. It is preferred in scenarios where low latency is desired at the cost of throughput when the GPU is underutilized.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and `preferred_batch_size` to indicate your optimal batch sizes in the TensorRT Inference Server model config.
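A minimal sketch of what these settings look like in a model `config.pbtxt` is shown below; the model name, platform, and specific values are illustrative assumptions rather than the exact config produced by the scripts in this repository:
```bash
# Write an example TensorRT Inference Server model config with dynamic batching enabled.
cat > model_repo/bert/config.pbtxt <<'EOF'
name: "bert"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
EOF
```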
Figures 6 and 7 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency as the requests are queued up to `max_queue_delay_microseconds`. The effect of `preferred_batch_size` for dynamic batching is visually depicted by the dip in Server Queue time at integer multiples of the preferred batch sizes. At higher numbers of concurrent requests, observe that the throughput approaches a maximum limit as GPU utilization saturates.
![](data/images/trtis_static.png?raw=true)
Figure 6: Latency & Throughput vs Concurrency using Static Batching at `Batch size` = 1
![](data/images/trtis_dynamic.png?raw=true)
Figure 7: Latency & Throughput vs Concurrency using Dynamic Batching at `Batch size` = 1, `preferred_batch_size` = [4, 8] and `max_queue_delay_microseconds` = 5000
###### Model execution instance count
TensorRT Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesn't saturate the GPU, allowing the GPU to run multiple instances of the model in parallel.
Figures 8 and 9 show a drop in queue time as more models are available to serve an inference request. However, this is countered by an increase in compute time as multiple models compete for resources. Since BERT is a large model that utilizes the majority of the GPU, the benefit of running multiple engines is not seen.
![](data/images/trtis_ec_1.png?raw=true)
Figure 8: Latency & Throughput vs Concurrency at Batch size = 1, Engine Count = 1
(One copy of the model loaded in GPU memory)
![](data/images/trtis_ec_4.png?raw=true)
Figure 9: Latency & Throughput vs Concurrency at Batch size = 1, Engine count = 4
(Four copies of the model loaded in GPU memory)
#### Running the TensorRT Inference Server and client
The `run_trtis.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state, runs the client, and then evaluates the validity of predictions on the basis of exact match and F1 score, all in one step.
```bash
bash scripts/trtis/run_trtis.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <trtis_version_name> <trtis_model_name> <trtis_export_model> <trtis_dyn_batching_delay> <trtis_engine_count> <trtis_model_overwrite>
```
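For reference, a hypothetical invocation for BERT Large on SQuAD v1.1 in FP16 could look like the following; every value here is an assumption used for illustration, so consult the script for the exact arguments and defaults it expects:
```bash
bash scripts/trtis/run_trtis.sh /results/model.ckpt 8 fp16 true 384 128 large 1.1 1 bert 1 0 1 false
```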
More information on how to download a biomedical corpus and pre-train as well as fine-tune for biomedical tasks can be found in the subfolder `./biobert/README.md`.
## Performance
@ -736,10 +647,10 @@ This script runs 2 epochs by default on the SQuAD v1.1 dataset and extracts perf
Inference benchmarking can be performed by running the script:
``` bash
scripts/finetune_inference_benchmark.sh <bert_model> <use_xla> squad
scripts/finetune_inference_benchmark.sh <bert_model> squad
```
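For example, to benchmark BERT Large fine-tuning inference on SQuAD you might run the following; the `large` value is an assumption about the model names the script accepts:
```bash
bash scripts/finetune_inference_benchmark.sh large squad
```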
This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved at `/results/squad_train_benchmark_bert_<bert_model>.log`.
This script runs 1024 eval iterations by default on the SQuAD v1.1 dataset and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 with XLA and FP32 without XLA. These numbers are saved at `/results/squad_train_benchmark_bert_<bert_model>.log`.
### Results
@ -747,7 +658,6 @@ The following sections provide details on how we achieved our performance and ac
#### Training accuracy results
##### Training accuracy
###### Pre-training accuracy: single-node
@ -759,32 +669,32 @@ Our results were obtained by running the `scripts/run_pretraining_lamb.sh` train
| DGX1 | 8 | 16, 2 | 512, 2048 | 247.51 | 1.43 |
| DGX2 | 16 | 64, 8 | 64, 256 | 108.16 | 1.58 |
###### Pre-training accuracy: multi-node
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container.
| **DGX System** | **Nodes** | **Precision** | **Batch Size/GPU: Phase1, Phase2** | **Accumulation Steps: Phase1, Phase2** | **Time to Train (Hrs)** | **Final Loss**|
| **DGX System** | **Nodes** | **Precision** | **Batch Size/GPU: Phase1, Phase2** | **Accumulation Steps: Phase1, Phase2** | **Time to Train (Hrs)** | **Final Loss** |
|----------------|-----------|---------------|------------------------------------|----------------------------------------|----------------|-------------------------|
| DGX1 | 4 | FP16 | 32, 2 | 32, 128 | 48.66 | 1.48 |
| DGX1 | 16 | FP16 | 32, 2 | 32, 128 | 24.35 | 1.53 |
| DGX1 | 32 | FP16 | 32, 2 | 32, 128 | 12.98 | 1.61 |
| DGX1 | 32 | FP32 | 32, 2 | 32, 128 | 30.92 | 1.49 |
| DGX2H | 4 | FP16 | 64, 8 | 16, 64 | 25.85 | 1.56 |
| DGX2H | 16 | FP16 | 64, 8 | 8, 32 | 7.9 | 1.57 |
| DGX2H | 32 | FP16 | 64, 8 | 4, 16 | 4.77 | 1.61 |
| DGX2H | 32 | FP32 | 32, 4 | 8, 32 | 12.72 | 1.53 |
| DGX1 | 4 | FP16 | 16, 4 |128, 256| 62.49 | 1.72 |
| DGX1 | 16 | FP16 | 16, 4 | 32, 64 | 16.58 | 1.76 |
| DGX1 | 32 | FP16 | 16, 2 | 16, 64 | 9.85 | 1.71 |
| DGX2H | 1 | FP16 | 32, 8 |128, 256| 69.27 | 1.59 |
| DGX2H | 4 | FP16 | 32, 8 | 32, 64 | 22.17 | 1.60 |
| DGX2H | 16 | FP16 | 32, 8 | 8, 16 | 6.25 | 1.56 |
| DGX2H | 32 | FP16 | 32, 8 | 4, 8 | 3.73 | 1.58 |
| DGX2H | 64 | FP16 | 32, 8 | 2, 4 | 2.44 | 1.64 |
| DGX2H | 64 | FP32 | 32, 4 | 2, 8 | 5.76 | 1.66 |
Note: Time to train includes up to 16 minutes of start-up time for every restart. Experiments were run on clusters with a maximum wall clock time of 8 hours and 2 hours for DGX1 and DGX2H systems respectively.
Note: Time to train includes up to 16 minutes of start-up time for every restart. Experiments were run on clusters with a maximum wall clock time of 8 hours.
###### Fine-tuning accuracy for SQuAD: NVIDIA DGX-2 (16x V100 32G)
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs.
| **GPUs** | **Batch size / GPU** | **Accuracy - FP32** | **Accuracy - mixed precision** | **Time to Train - FP32 (Hrs)** | **Time to Train - mixed precision (Hrs)** |
|:---:|:----:|:----:|:---:|:----:|:----:|
| 16 | 4 |90.94|90.84|0.38|0.27|
| 16 | 4 |90.94|90.84|0.44|0.16|
##### Training stability test
@ -818,17 +728,17 @@ The following tables compare `F1` scores across 5 different training runs with d
###### Pre-training training performance: single-node on 16G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.
| **GPUs** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 16, 8 | 80.1 | 23.1 | 3.47 | 1 | 1 |
| 4 | 128 | 16, 8 | 282.1 | 85 | 3.32 | 3.52 | 3.68 |
| 8 | 128 | 16, 8 | 540.4 | 166.1 | 3.25 | 6.75 | 7.19 |
| 1 | 512 | 4, 2 | 10.9 | 5.3 | 2.06 | 1 | 1 |
| 4 | 512 | 4, 2 | 35.6 | 19.5 | 1.83 | 3.27 | 3.68 |
| 8 | 512 | 4, 2 | 61.1 | 37.9 | 1.61 | 5.61 | 7.15 |
| 1 | 128 | 16,8 | 91.30 | 23.90 | 3.82 | 1.00 | 1.00 |
| 4 | 128 | 16,8 | 297.70 | 86.90 | 3.43 | 3.26 | 3.64 |
| 8 | 128 | 16,8 | 578.60 | 167.80 | 3.45 | 6.34 | 7.02 |
| 1 | 512 | 4,1 | 20.00 | 4.00 | 5.00 | 1.00 | 1.00 |
| 4 | 512 | 4,1 | 66.80 | 13.50 | 4.95 | 3.34 | 3.38 |
| 8 | 512 | 4,1 | 129.50 | 26.30 | 4.92 | 6.48 | 6.58 |
Note: The respective values for FP32 runs that use a batch size of 16, 4 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
@ -838,29 +748,26 @@ Our results were obtained by running the `run.sub` training script in the Tensor
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 16,8 | 440.3 | 167.9 | 2.62 | 1.00 | 1.00 |
| 4 | 128 | 16,8 | 1712.3 | 600.7 | 2.85 | 3.89 | 3.58 |
| 16 | 128 | 16,8 | 4833.5 | 2186.2 | 2.21 | 10.98 | 13.02 |
| 32 | 128 | 16,8 | 9742.9 | 4020.9 | 2.42 | 22.13 | 23.95 |
| 1 | 512 | 2,1 | 74.9 | 26 | 2.88 | 0.00 | 0.00 |
| 4 | 512 | 2,1 | 257.5 | 91.2 | 2.82 | 1.00 | 1.00 |
| 16 | 512 | 2,1 | 899.7 | 313 | 2.87 | 3.44 | 3.51 |
| 32 | 512 | 2,1 | 1737.1 | 579.4 | 3.0 | 23.19 | 22.28 |
| 1 | 128 | 16,4 | 571.877 | 109.366 | 5.229019988 | 1.00 | 1.00 |
| 4 | 128 | 16,4 | 2028.85 | 386.23 | 5.252958082 | 3.55 | 3.53 |
| 16 | 128 | 16,4 | 7299.88 | 1350.49 | 5.405356574 | 12.76 | 12.35 |
| 32 | 128 | 16,4 | 13917.37 | 2555.25 | 5.446578613 | 24.34 | 23.36 |
| 1 | 512 | 4,1 | 128.94 | 25.65 | 5.026900585 | 1.00 | 1.00 |
| 4 | 512 | 4,1 | 466 | 92.36 | 5.045474231 | 3.61 | 3.60 |
| 16 | 512 | 4,1 | 1632 | 325.22 | 5.018141566 | 12.66 | 12.68 |
| 32 | 512 | 4,1 | 3076 | 614.18 | 5.008303755 | 23.86 | 23.94 |
Note: The respective values for FP32 runs that use a batch size of 16, 2 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
###### Fine-tuning training performance for SQuAD on 16G
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
| **GPUs** | **Batch size / GPU: mixed precision, FP32** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 2 | 7.19 |14.37|2.0 |1.0 |1.0 |
| 4 | 2 |25.61 |40.44|1.58|3.56|2.81|
| 8 | 2 |49.79 |74.61|1.5 |6.92|5.19|
| 1 | 3 | - |17.2 | - | - |1.0 |
| 4 | 3 | - |50.71| - | - |2.95|
| 8 | 3 | - |91.88| - | - |5.34|
| 1 | 3,2 | 7.35 | 17.17 | 2.336054422 | 1.00 | 1.00 |
| 4 | 3,2 | 26.38 | 50.68 | 1.921152388 | 3.59 | 2.95 |
| 8 | 3,2 | 50.17 | 89.98 | 1.793502093 | 6.83 | 5.24 |
Note: The respective values for FP32 runs that use a batch size of 3 are not available due to out-of-memory errors. A batch size of 3 is only available when using FP16.
@ -870,32 +777,29 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training training performance: single-node on 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **GPUs** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 48,32 | 130.2 | 33.5 | 3.89 | 1 | 1 |
| 4 | 128 | 48,32 | 462.1 | 127.7 | 3.62 | 3.55 | 3.81 |
| 8 | 128 | 48,32 | 874.8 | 255.4 | 3.43 | 6.72 | 7.62 |
| 1 | 512 | 8, 4 | 22.1 | 6.3 | 3.51 | 1 | 1 |
| 4 | 512 | 8, 4 | 80.4 | 24 | 3.35 | 3.64 | 3.81 |
| 8 | 512 | 8, 4 | 155 | 47.1 | 3.29 | 7.01 | 7.48 |
| 1 | 128 | 48,32 | 140.30 | 34.30 | 4.09 | 1.00 | 1.00 |
| 4 | 128 | 48,32 | 504.40 | 131.70 | 3.83 | 3.60 | 3.84 |
| 8 | 128 | 48,32 | 986.80 | 260.10 | 3.79 | 7.03 | 7.58 |
| 1 | 512 | 8,4 | 25.60 | 6.50 | 3.94 | 1.00 | 1.00 |
| 4 | 512 | 8,4 | 89.90 | 24.70 | 3.64 | 3.51 | 3.80 |
| 8 | 512 | 8,4 | 176.70 | 48.60 | 3.64 | 6.90 | 7.48 |
Note: The respective values for FP32 runs that use a batch size of 48, 8 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
###### Fine-tuning training performance for SQuAD on 32G
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
| **GPUs** | **Batch size / GPU: mixed precision, FP32** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|-----|------|----|----|----|
| 1 | 4 | 8.74|20.55 |2.35|1.0 |1.0 |
| 4 | 4 |32.22|57.58 |1.79|3.69|2.81|
| 8 | 4 |62.69|100.22|1.60|7.17|4.88|
| 1 | 10| - |31.33 | - | - |1.0 |
| 4 | 10| - |94.19 | - | - |3.0|
| 8 | 10| - |155.53| - | - |4.96|
| 1 | 10,4 | 9 | 33.79 | 3.754444444 | 1.00 | 1.00 |
| 4 | 10,4 | 32.5 | 103.38 | 3.180923077 | 3.61 | 3.06 |
| 8 | 10,4 | 63.54 | 172.46 | 2.714195782 | 7.06 | 5.10 |
Note: The respective values for FP32 runs that use a batch size of 10 are not available due to out-of-memory errors. A batch size of 10 is only available when using FP16.
@ -905,52 +809,50 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training training performance: single-node on DGX-2 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **GPUs** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 48,32 | 141.3 | 35.8 | 3.946927374 | 1 | 1 |
| 4 | 128 | 48,32 | 520.4 | 138.8 | 3.749279539 | 3.68 | 3.88 |
| 8 | 128 | 48,32 | 1024 | 275.1 | 3.722282806 | 7.25 | 7.68 |
| 16| 128 | 48,32 | 1907 | 533 | 3.577861163 | 13.5 | 14.89 |
| 1 | 512 | 8, 4 | 23.9 | 6.8 | 3.514705882 | 1 | 1 |
| 4 | 512 | 8, 4 | 89.8 | 25.8 | 3.480620155 | 3.76 | 3.79 |
| 8 | 512 | 8, 4 | 177.2 | 51 | 3.474509804 | 7.41 | 7.5 |
| 16| 512 | 8, 4 | 332.2 | 94.2 | 3.526539278 | 13.9 | 13.85 |
| 1 | 128 | 48,32 | 143.20 | 36.30 | 3.94 | 1.00 | 1.00 |
| 4 | 128 | 48,32 | 538.30 | 141.50 | 3.80 | 3.76 | 3.90 |
| 8 | 128 | 48,32 | 1057.30 | 281.30 | 3.76 | 7.38 | 7.75 |
| 16 | 128 | 48,32 | 1990.70 | 516.80 | 3.85 | 13.90 | 14.24 |
| 1 | 512 | 8,4 | 26.90 | 6.90 | 3.90 | 1.00 | 1.00 |
| 4 | 512 | 8,4 | 96.30 | 26.40 | 3.65 | 3.58 | 3.83 |
| 8 | 512 | 8,4 | 189.00 | 52.40 | 3.61 | 7.03 | 7.59 |
| 16 | 512 | 8,4 | 354.30 | 96.50 | 3.67 | 13.17 | 13.99 |
Note: The respective values for FP32 runs that use a batch size of 48, 8 for sequence lengths 128 and 512 respectively are not available due to out-of-memory errors.
###### Pre-training training performance: multi-node on DGX-2 32G
###### Pre-training training performance: multi-node on DGX-2H 32G
Our results were obtained by running the `run.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 32, 32 | 1806.7 | 599.3 | 3.01 | 1 | 1 |
| 4 | 128 | 32, 32 | 4088.7 | 1762.3 | 2.32 | 2.26 | 2.94 |
| 16 | 128 | 32, 32 | 14719.6 | 6400.2 | 2.30 | 8.15 | 10.68|
| 32 | 128 | 32, 32 | 27303.6 | 12203.6| 2.24 | 15.11| 20.36|
| 1 | 512 | 8, 4 | 269.7 | 109.6 | 2.46 | 1 | 1 |
| 4 | 512 | 8, 4 | 960.9 | 268.5 | 3.58 | 3.56 | 2.45 |
| 16 | 512 | 8, 4 | 3726.3 | 965 | 3.86 | 13.82| 8.8 |
| 32 | 512 | 8, 4 | 6192.7 | 1800.3 | 3.44 | 22.96| 16.43|
| 1 | 128 | 32,32 | 1758.32 | 602.22 | 2.92 | 1.00 | 1.00 |
| 4 | 128 | 32,32 | 6379.94 | 2261.10 | 2.82 | 3.63 | 3.75 |
| 16 | 128 | 32,32 | 23846.92 | 8875.42 | 2.69 | 13.56 | 14.74 |
| 32 | 128 | 32,32 | 46191.78 | 17445.53 | 2.65 | 26.27 | 28.97 |
| 64 | 128 | 32,32 | 89195.63 | 34263.71 | 2.60 | 50.73 | 56.90 |
| 1 | 512 | 8,4 | 383.35 | 109.97 | 3.49 | 1.00 | 1.00 |
| 4 | 512 | 8,4 | 1408.75 | 400.93 | 3.51 | 3.67 | 3.65 |
| 16 | 512 | 8,4 | 5344.10 | 1559.96 | 3.43 | 13.94 | 14.19 |
| 32 | 512 | 8,4 | 10323.75 | 3061.39 | 3.37 | 26.93 | 27.84 |
| 64 | 512 | 8,4 | 19766.57 | 6029.48 | 3.28 | 51.56 | 54.83 |
###### Fine-tuning training performance for SQuAD on DGX-2 32G
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
Our results were obtained by running the `scripts/run_squad.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
| **GPUs** | **Batch size / GPU: mixed precision, FP32** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|---|---|------|------|----|-----|-----|
| 1| 4 | 9.39 | 20.69 |2.20| 1.0 | 1.0 |
| 4| 4 | 34.63| 62.79|1.81| 3.69 | 3.03 |
| 8| 4 | 66.95|111.47|1.66| 7.13 | 5.39 |
| 16| 4 |126.09|179.09|1.42| 13.43 |8.66 |
| 1| 10| - | 32.72| - | - | 1.0 |
| 4| 10| - |100.73| - | - | 3.07 |
| 8| 10| - |168.92| - | - | 5.16 |
| 16| 10| - |249.54| - | - | 7.63 |
| 1 | 10,4 | 9.59 | 36.30 | 3.785192909 | 1.00 | 1.00 |
| 4 | 10,4 | 35.46 | 115.67 | 3.261985336 | 3.70 | 3.19 |
| 8 | 10,4 | 68.00 | 197.16 | 2.899411765 | 7.09 | 5.43 |
| 16 | 10,4 | 111.62 | 304.72 | 2.729976707 | 11.64 | 8.39 |
Note: The respective values for FP32 runs that use a batch size of 10 are not available due to out-of-memory errors. A batch size of 10 is only available when using FP16.
@ -967,63 +869,63 @@ Our results were obtained by running the `scripts/run_pretraining_lamb.sh` scrip
| **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** |
|:-----:|:-------:|:-------:|:-------:|:-------------:|
|128 |8, 8 |349.49 | 104.03 | 3.36 |
|128 |8, 8 |349.51 | 104.31 | 3.35 |
###### Fine-tuning inference performance for SQuAD on 16G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 89.4 | 1.19 | 11.19 | 11.29 | 11.44 | 11.71 |
| 128 | 2 | 162.29 | 1.56 | 12.32 | 12.5 | 12.57 | 12.74 |
| 128 | 4 | 263.44 | 2.24 | 15.18 | 15.32 | 15.54 | 17 |
| 128 | 8 | 374.33 | 2.98 | 21.37 | 21.56 | 21.72 | 23.23 |
| 384 | 1 | 64.57 | 1.87 | 15.49 | 15.61 | 15.73 | 16.18 |
| 384 | 2 | 94.04 | 2.47 | 21.27 | 21.34 | 21.4 | 21.9 |
| 384 | 4 | 118.81 | 2.96 | 33.67 | 33.89 | 34.37 | 36.18 |
| 384 | 8 | 137.65 | 3.26 | 58.12 | 58.53 | 59.34 | 61.32 |
| 128 | 1 | 95.87 | 1.433462919 | 10.43 | 10.61 | 10.71 | 11.27 |
| 128 | 2 | 168.02 | 1.871046771 | 11.9 | 12.08 | 12.18 | 12.32 |
| 128 | 4 | 263.08 | 2.617451 | 15.2 | 14.86 | 14.95 | 15.55 |
| 128 | 8 | 379.78 | 3.414366628 | 21.07 | 20.94 | 21.03 | 21.49 |
| 384 | 1 | 67.52 | 2.274932615 | 14.81 | 14.93 | 15.05 | 15.38 |
| 384 | 2 | 93.8 | 2.929419113 | 21.32 | 20.75 | 20.83 | 21.43 |
| 384 | 4 | 118.97 | 3.397201599 | 33.62 | 33.17 | 33.37 | 33.85 |
| 384 | 8 | 138.43 | 3.838879645 | 57.79 | 57 | 57.38 | 58.19 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 75.28 | 13.28 | 13.4 | 13.49 | 13.66 |
| 128 | 2 | 104.16 | 19.2 | 19.51 | 19.69 | 20.83 |
| 128 | 4 | 117.4 | 34.07 | 34.4 | 34.76 | 36.99 |
| 128 | 8 | 125.63 | 63.68 | 64.58 | 65.1 | 67.54 |
| 384 | 1 | 34.53 | 28.96 | 29.32 | 29.61 | 31.08 |
| 384 | 2 | 38.03 | 52.59 | 53.16 | 53.75 | 55.5 |
| 384 | 4 | 40.16 | 99.6 | 100.76 | 101.62 | 103.4 |
| 384 | 8 | 42.2 | 189.57 | 190.82 | 191.47 | 193.27 |
| 128 | 1 | 66.88 | 14.95 | 14.96 | 15.41 | 18.02 |
| 128 | 2 | 89.8 | 22.27 | 22.46 | 22.53 | 22.84 |
| 128 | 4 | 100.51 | 39.8 | 39.91 | 40.06 | 41.04 |
| 128 | 8 | 111.23 | 71.92 | 72.42 | 72.58 | 73.63 |
| 384 | 1 | 29.68 | 33.7 | 33.85 | 33.91 | 34.62 |
| 384 | 2 | 32.02 | 62.47 | 63.06 | 63.28 | 63.66 |
| 384 | 4 | 35.02 | 114.21 | 114.69 | 114.82 | 115.85 |
| 384 | 8 | 36.06 | 221.86 | 222.7 | 223.03 | 223.53 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 196.58 | 1.19 | 5.09 | 5.18 | 5.23 | 5.42 |
| 128 | 2 | 361.92 | 1.41 | 5.53 | 5.62 | 5.67 | 5.85 |
| 128 | 4 | 605.43 | 1.79 | 6.61 | 6.71 | 6.8 | 7.04 |
| 128 | 8 | 916 | 2.18 | 8.73 | 8.83 | 8.95 | 9.19 |
| 384 | 1 | 154.05 | 1.58 | 6.49 | 6.6 | 6.72 | 7.05 |
| 384 | 2 | 238.89 | 1.99 | 8.37 | 8.42 | 8.47 | 9.1 |
| 384 | 4 | 327.18 | 2.47 | 12.23 | 12.3 | 12.36 | 13.08 |
| 384 | 8 | 390.95 | 2.82 | 20.46 | 20.5 | 20.8 | 21.89 |
| 128 | 1 | 204.33 | 1.459187317 | 4.89 | 5.14 | 5.32 | 5.54 |
| 128 | 2 | 375.19 | 1.779501043 | 5.33 | 5.47 | 5.58 | 5.87 |
| 128 | 4 | 606.98 | 2.198645271 | 6.59 | 6.49 | 6.55 | 6.83 |
| 128 | 8 | 902.6 | 2.69023278 | 8.86 | 8.62 | 8.72 | 9.22 |
| 384 | 1 | 154.33 | 1.990070922 | 6.48 | 6.59 | 6.65 | 7.04 |
| 384 | 2 | 225.7 | 2.386087324 | 8.86 | 8.45 | 8.53 | 9.16 |
| 384 | 4 | 317.93 | 3.044431677 | 12.58 | 12.34 | 12.39 | 13.01 |
| 384 | 8 | 393.44 | 3.672547372 | 20.33 | 20.06 | 20.38 | 21.38 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 165.51 | 6.04 | 6.19 | 6.3 | 6.62 |
| 128 | 2 | 257.54 | 7.77 | 7.86 | 7.92 | 8.28 |
| 128 | 4 | 338.52 | 11.82 | 11.98 | 12.05 | 12.27 |
| 128 | 8 | 419.94 | 19.05 | 19.25 | 19.35 | 20.12 |
| 384 | 1 | 97.4 | 10.27 | 10.39 | 10.44 | 10.56 |
| 384 | 2 | 119.84 | 16.69 | 16.78 | 16.85 | 17.66 |
| 384 | 4 | 132.5 | 30.19 | 30.41 | 30.5 | 31.13 |
| 384 | 8 | 138.63 | 57.71 | 58.15 | 58.37 | 59.33 |
| 128 | 1 | 140.03 | 7.14 | 7.6 | 7.78 | 7.97 |
| 128 | 2 | 210.84 | 9.49 | 9.59 | 9.65 | 10.57 |
| 128 | 4 | 276.07 | 14.49 | 14.61 | 14.71 | 15.16 |
| 128 | 8 | 335.51 | 23.84 | 23.79 | 23.89 | 24.94 |
| 384 | 1 | 77.55 | 12.89 | 13.01 | 13.05 | 14.26 |
| 384 | 2 | 94.59 | 21.14 | 21.14 | 21.23 | 21.86 |
| 384 | 4 | 104.43 | 38.3 | 38.38 | 38.45 | 39.15 |
| 384 | 8 | 107.13 | 74.68 | 75.05 | 75.19 | 76.2 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
@ -1032,67 +934,67 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training inference performance on 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs.
| **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** |
|:-----:|:-------:|:-------:|:-------:|:-------------:|
|128 |8, 8 |304.88 | 100.88 | 3.02 |
|128 |8, 8 |345.50 | 101.84 | 3.39 |
###### Fine-tuning inference performance for SQuAD on 32G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 86.4 | 1.18 | 11.57 | 11.74 | 11.86 | 12.04 |
| 128 | 2 | 155.32 | 1.52 | 12.88 | 12.98 | 13.05 | 13.31 |
| 128 | 4 | 252.18 | 2.18 | 15.86 | 15.78 | 15.89 | 17.01 |
| 128 | 8 | 359.19 | 2.88 | 22.27 | 22.44 | 22.58 | 23.94 |
| 384 | 1 | 62.45 | 1.84 | 16.01 | 16.16 | 16.23 | 16.42 |
| 384 | 2 | 89.34 | 2.37 | 22.39 | 22.45 | 22.53 | 23.13 |
| 384 | 4 | 113.77 | 2.84 | 35.16 | 35.24 | 35.33 | 35.9 |
| 384 | 8 | 131.9 | 3.13 | 60.65 | 61 | 61.49 | 65.3 |
| 128 | 1 | 87.75 | 1.352913969 | 11.4 | 11.46 | 18.77 | 19.06 |
| 128 | 2 | 159.87 | 1.833161335 | 12.51 | 12.69 | 12.79 | 12.98 |
| 128 | 4 | 254.65 | 2.622014003 | 15.71 | 15.49 | 15.59 | 16.03 |
| 128 | 8 | 365.51 | 3.377783939 | 21.89 | 21.72 | 21.94 | 23.79 |
| 384 | 1 | 63.11 | 2.153924915 | 15.84 | 17.3 | 19.22 | 19.37 |
| 384 | 2 | 89.61 | 2.884132604 | 22.32 | 21.83 | 21.96 | 23.8 |
| 384 | 4 | 114.9 | 3.395390071 | 34.81 | 34.33 | 34.47 | 35.15 |
| 384 | 8 | 132.79 | 3.814708417 | 60.25 | 59.4 | 59.77 | 60.7 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 73.42 | 13.62 | 13.78 | 13.85 | 14.13 |
| 128 | 2 | 102.47 | 19.52 | 19.66 | 19.73 | 19.98 |
| 128 | 4 | 115.76 | 34.55 | 34.86 | 35.34 | 37.87 |
| 128 | 8 | 124.84 | 64.08 | 64.78 | 65.78 | 69.55 |
| 384 | 1 | 33.93 | 29.47 | 29.7 | 29.8 | 29.98 |
| 384 | 2 | 37.62 | 53.16 | 53.52 | 53.73 | 55.03 |
| 384 | 4 | 39.99 | 100.02 | 100.91 | 101.69 | 106.63 |
| 384 | 8 | 42.09 | 190.08 | 191.35 | 192.29 | 196.47 |
| 128 | 1 | 64.86 | 15.42 | 16.32 | 17.55 | 20.89 |
| 128 | 2 | 87.21 | 22.93 | 23.06 | 24.17 | 31.93 |
| 128 | 4 | 97.12 | 41.19 | 41.38 | 41.5 | 44.13 |
| 128 | 8 | 108.21 | 73.93 | 74.34 | 74.48 | 74.77 |
| 384 | 1 | 29.3 | 34.13 | 34.21 | 34.25 | 34.76 |
| 384 | 2 | 31.07 | 64.38 | 64.83 | 64.95 | 65.42 |
| 384 | 4 | 33.84 | 118.22 | 119.01 | 119.57 | 120.06 |
| 384 | 8 | 34.81 | 229.84 | 230.72 | 231.22 | 232.96 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 192.89 | 1.19 | 5.18 | 5.29 | 5.35 | 5.55 |
| 128 | 2 | 348.23 | 1.37 | 5.74 | 5.91 | 6.02 | 6.26 |
| 128 | 4 | 592.54 | 1.79 | 6.75 | 6.96 | 7.08 | 7.34 |
| 128 | 8 | 888.58 | 2.15 | 9 | 9.11 | 9.22 | 9.5 |
| 384 | 1 | 148.64 | 1.57 | 6.73 | 6.82 | 6.87 | 7.06 |
| 384 | 2 | 230.74 | 1.96 | 8.67 | 8.75 | 8.87 | 9.44 |
| 384 | 4 | 318.45 | 2.42 | 12.56 | 12.65 | 12.76 | 13.36 |
| 384 | 8 | 380.14 | 2.72 | 21.05 | 21.1 | 21.25 | 21.83 |
| 128 | 1 | 198.72 | 1.393352966 | 5.03 | 5.3 | 5.47 | 5.69 |
| 128 | 2 | 338.44 | 1.611158717 | 5.91 | 6.04 | 9.77 | 9.94 |
| 128 | 4 | 599.62 | 2.24804109 | 6.67 | 6.6 | 6.66 | 6.83 |
| 128 | 8 | 858.56 | 2.63370042 | 9.32 | 10.01 | 10.04 | 10.39 |
| 384 | 1 | 150.28 | 1.948146228 | 6.65 | 6.76 | 6.82 | 7.21 |
| 384 | 2 | 200.68 | 2.200197347 | 9.97 | 9.88 | 9.94 | 10.08 |
| 384 | 4 | 305.72 | 3.01707293 | 13.08 | 12.86 | 12.97 | 13.71 |
| 384 | 8 | 373.64 | 3.61249154 | 21.41 | 21.98 | 22.03 | 22.61 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 161.69 | 6.18 | 6.26 | 6.31 | 6.51 |
| 128 | 2 | 254.84 | 7.85 | 8 | 8.09 | 8.29 |
| 128 | 4 | 331.72 | 12.06 | 12.17 | 12.26 | 12.51 |
| 128 | 8 | 412.85 | 19.38 | 19.6 | 19.72 | 20.13 |
| 384 | 1 | 94.42 | 10.59 | 10.71 | 10.8 | 11.36 |
| 384 | 2 | 117.64 | 17 | 17.07 | 17.1 | 17.83 |
| 384 | 4 | 131.72 | 30.37 | 30.64 | 30.77 | 31.26 |
| 384 | 8 | 139.75 | 57.25 | 57.74 | 58.08 | 59.53 |
| 128 | 1 | 142.62 | 7.01 | 7.07 | 7.44 | 9.23 |
| 128 | 2 | 210.06 | 9.52 | 9.63 | 9.69 | 10.22 |
| 128 | 4 | 266.73 | 15 | 15.77 | 15.91 | 16.79 |
| 128 | 8 | 325.99 | 24.54 | 24.52 | 24.6 | 25 |
| 384 | 1 | 77.14 | 12.96 | 13.01 | 13.03 | 13.67 |
| 384 | 2 | 91.21 | 21.93 | 21.93 | 21.99 | 22.31 |
| 384 | 4 | 101.33 | 39.47 | 39.69 | 39.82 | 40.88 |
| 384 | 8 | 103.43 | 77.34 | 77.76 | 77.9 | 78.45 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
@ -1101,126 +1003,126 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
###### Pre-training inference performance on DGX-2 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs.
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs.
| **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** |
|:-----:|:-------:|:-------:|:-------:|:-------------:|
|128 |8, 8 |350.63 | 106.36 | 3.30 |
|128 |8, 8 |366.24 | 107.88 | 3.39 |
###### Fine-tuning inference performance for SQuAD on DGX-2 32G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 1x V100 32G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 79 | 1.18 | 12.66 | 13.13 | 13.36 | 14.49 |
| 128 | 2 | 151.28 | 1.52 | 13.22 | 13.66 | 13.89 | 14.84 |
| 128 | 4 | 250.41 | 2.18 | 15.97 | 16.13 | 16.3 | 17.81 |
| 128 | 8 | 369.76 | 2.88 | 21.64 | 21.88 | 22.08 | 26.35 |
| 384 | 1 | 61.66 | 1.84 | 16.22 | 16.46 | 16.62 | 17.26 |
| 384 | 2 | 91.54 | 2.37 | 21.85 | 22.11 | 22.3 | 23.44 |
| 384 | 4 | 121.04 | 2.84 | 33.05 | 33.08 | 33.31 | 34.97 |
| 384 | 8 | 142.03 | 3.13 | 56.33 | 56.46 | 57.49 | 59.85 |
| 128 | 1 | 96.22 | 1.371045882 | 10.39 | 10.78 | 10.9 | 11.43 |
| 128 | 2 | 171.66 | 1.835935829 | 11.65 | 11.86 | 12.04 | 12.45 |
| 128 | 4 | 262.89 | 2.566032211 | 15.22 | 15.13 | 15.24 | 15.91 |
| 128 | 8 | 394.23 | 3.441253492 | 20.29 | 20.22 | 20.6 | 22.19 |
| 384 | 1 | 69.69 | 2.278195489 | 14.35 | 14.39 | 14.58 | 15.68 |
| 384 | 2 | 96.35 | 2.909118357 | 20.76 | 20.25 | 20.32 | 21.54 |
| 384 | 4 | 124.06 | 3.42612538 | 32.24 | 31.87 | 32.14 | 33.02 |
| 384 | 8 | 144.28 | 3.876410532 | 55.45 | 54.77 | 55.16 | 55.93 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 70.1 | 14.27 | 14.6 | 14.84 | 15.38 |
| 128 | 2 | 101.3 | 19.74 | 20.09 | 20.27 | 20.77 |
| 128 | 4 | 122.19 | 32.74 | 32.99 | 33.39 | 36.76 |
| 128 | 8 | 134.09 | 59.66 | 60.36 | 61.79 | 69.33 |
| 384 | 1 | 34.52 | 28.97 | 29.28 | 29.46 | 31.78 |
| 384 | 2 | 39.84 | 50.21 | 50.61 | 51.53 | 54 |
| 384 | 4 | 42.79 | 93.48 | 94.73 | 96.52 | 104.37 |
| 384 | 8 | 45.91 | 174.24 | 175.34 | 176.59 | 183.76 |
| 128 | 1 | 70.18 | 14.25 | 14.7 | 14.88 | 15.35 |
| 128 | 2 | 93.5 | 21.39 | 21.83 | 22.04 | 22.85 |
| 128 | 4 | 102.45 | 39.04 | 39.28 | 39.42 | 40.5 |
| 128 | 8 | 114.56 | 69.83 | 70.5 | 70.74 | 72.78 |
| 384 | 1 | 30.59 | 32.69 | 33.14 | 33.32 | 33.86 |
| 384 | 2 | 33.12 | 60.38 | 60.91 | 61.12 | 61.67 |
| 384 | 4 | 36.21 | 110.46 | 111.1 | 111.26 | 112.15 |
| 384 | 8 | 37.22 | 214.95 | 215.69 | 216.13 | 217.96 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 172.33 | 1.19 | 5.8 | 5.94 | 6 | 6.27 |
| 128 | 2 | 315.17 | 1.37 | 6.35 | 6.64 | 6.78 | 7.07 |
| 128 | 4 | 549.36 | 1.79 | 7.28 | 7.47 | 7.6 | 8.05 |
| 128 | 8 | 872.67 | 2.15 | 9.17 | 9.33 | 9.5 | 9.92 |
| 384 | 1 | 138.52 | 1.57 | 7.22 | 7.45 | 7.52 | 7.84 |
| 384 | 2 | 222.05 | 1.96 | 9.01 | 9.11 | 9.24 | 10.94 |
| 384 | 4 | 314.47 | 2.42 | 12.72 | 12.87 | 13.01 | 14.42 |
| 384 | 8 | 392.32 | 2.72 | 20.39 | 20.44 | 20.67 | 22.16 |
| 128 | 1 | 207.01 | 1.455050257 | 4.83 | 5.23 | 5.38 | 5.59 |
| 128 | 2 | 405.92 | 1.808429119 | 4.93 | 4.99 | 5.04 | 5.2 |
| 128 | 4 | 646.8 | 2.258695349 | 6.18 | 6.06 | 6.14 | 6.55 |
| 128 | 8 | 909.41 | 2.616781285 | 8.8 | 8.86 | 8.96 | 9.52 |
| 384 | 1 | 153.97 | 1.959653812 | 6.49 | 6.88 | 7.01 | 7.2 |
| 384 | 2 | 229.46 | 2.366298855 | 8.72 | 8.57 | 8.67 | 8.97 |
| 384 | 4 | 333.2 | 3.078913325 | 12 | 11.74 | 11.85 | 12.86 |
| 384 | 8 | 403.02 | 3.646579805 | 19.85 | 19.83 | 20 | 21.11 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 161.69 | 6.18 | 6.26 | 6.31 | 6.51 |
| 128 | 2 | 254.84 | 7.85 | 8 | 8.09 | 8.29 |
| 128 | 4 | 331.72 | 12.06 | 12.17 | 12.26 | 12.51 |
| 128 | 8 | 412.85 | 19.38 | 19.6 | 19.72 | 20.13 |
| 384 | 1 | 94.42 | 10.59 | 10.71 | 10.8 | 11.36 |
| 384 | 2 | 117.64 | 17 | 17.07 | 17.1 | 17.83 |
| 384 | 4 | 131.72 | 30.37 | 30.64 | 30.77 | 31.26 |
| 384 | 8 | 139.75 | 57.25 | 57.74 | 58.08 | 59.53 |
| 128 | 1 | 142.27 | 7.03 | 7.39 | 7.45 | 11.7 |
| 128 | 2 | 224.46 | 8.91 | 9 | 9.08 | 9.66 |
| 128 | 4 | 286.36 | 13.97 | 14.46 | 14.52 | 14.82 |
| 128 | 8 | 347.53 | 23.02 | 23.23 | 23.4 | 24.12 |
| 384 | 1 | 78.57 | 12.73 | 13.01 | 13.1 | 14.06 |
| 384 | 2 | 96.97 | 20.62 | 21 | 21.15 | 21.82 |
| 384 | 4 | 108.22 | 36.96 | 37.05 | 37.18 | 38.12 |
| 384 | 8 | 110.52 | 72.38 | 73.06 | 73.32 | 74.64 |
##### Inference performance: NVIDIA Tesla T4 (1x T4 16G)
###### Fine-tuning inference performance for SQuAD on Tesla T4 16G
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.06-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
Our results were obtained by running the `scripts/finetune_inference_benchmark.sh` script in the TensorFlow 19.08-py3 NGC container on NVIDIA Tesla T4 with 1x T4 16G GPUs. Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged from 1024 iterations. Latency is computed as the time taken to process a batch when batches are fed in one after another to the model, i.e., no pipelining.
BERT LARGE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 53.56 | 1.18 | 18.67 | 20.22 | 20.31 | 20.49 |
| 128 | 2 | 95.39 | 1.52 | 20.97 | 22.86 | 23.15 | 23.73 |
| 128 | 4 | 137.44 | 2.18 | 29.1 | 30.34 | 30.62 | 31.5 |
| 128 | 8 | 166.19 | 2.88 | 48.14 | 49.38 | 49.73 | 50.86 |
| 384 | 1 | 34.28 | 1.84 | 29.17 | 30.58 | 30.77 | 31.28 |
| 384 | 2 | 41.89 | 2.37 | 47.74 | 49.05 | 49.34 | 50 |
| 384 | 4 | 47.15 | 2.84 | 84.83 | 86.79 | 87.41 | 88.73 |
| 384 | 8 | 50.28 | 3.13 | 159.11 | 161.75 | 162.85 | 165.72 |
| 128 | 1 | 54.53 | 1.552234557 | 18.34 | 19.09 | 19.28 | 21.74 |
| 128 | 2 | 95.59 | 2.521498285 | 20.92 | 21.86 | 22.61 | 23.33 |
| 128 | 4 | 133.2 | 3.434760186 | 30.03 | 30.32 | 30.43 | 31.06 |
| 128 | 8 | 168.85 | 4.352926012 | 47.38 | 48.21 | 48.56 | 49.25 |
| 384 | 1 | 33.58 | 2.87008547 | 29.78 | 30.3 | 30.46 | 31.69 |
| 384 | 2 | 41.31 | 3.576623377 | 48.41 | 49.03 | 49.26 | 50.04 |
| 384 | 4 | 47.08 | 3.94635373 | 84.96 | 86.88 | 87.38 | 88.3 |
| 384 | 8 | 50.08 | 4.254885302 | 159.76 | 162.37 | 163.23 | 165.79 |
BERT LARGE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 40.34 | 24.79 | 26.97 | 27.38 | 28.21 |
| 128 | 2 | 45.17 | 44.27 | 46.01 | 46.6 | 47.68 |
| 128 | 4 | 47.39 | 84.41 | 86.31 | 86.92 | 88.14 |
| 128 | 8 | 46.98 | 170.29 | 173.35 | 174.15 | 175.48 |
| 384 | 1 | 14.07 | 71.06 | 73 | 73.42 | 73.99 |
| 384 | 2 | 14.91 | 134.17 | 136.72 | 137.51 | 138.66 |
| 384 | 4 | 14.44 | 277.03 | 281.89 | 282.63 | 284.41 |
| 384 | 8 | 14.95 | 534.94 | 540.45 | 542.32 | 544.75 |
| 128 | 1 | 35.13 | 28.46 | 29.89 | 30.12 | 30.6 |
| 128 | 2 | 37.91 | 52.76 | 54.01 | 54.29 | 54.84 |
| 128 | 4 | 38.78 | 103.14 | 105.39 | 106.05 | 107.4 |
| 128 | 8 | 38.79 | 206.22 | 209.63 | 210.2 | 211.5 |
| 384 | 1 | 11.7 | 85.5 | 87.18 | 87.43 | 88 |
| 384 | 2 | 11.55 | 173.19 | 176.13 | 177.02 | 178.4 |
| 384 | 4 | 11.93 | 335.41 | 340.26 | 341.76 | 343.54 |
| 384 | 8 | 11.77 | 679.77 | 686.01 | 686.79 | 689.24 |
BERT BASE FP16
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Throughput speedup (FP32 to mixed precision) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|----------------------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 107.3 | 1.19 | 9.32 | 10.18 | 10.32 | 11.48 |
| 128 | 2 | 185.18 | 1.37 | 10.8 | 11.71 | 12.11 | 12.35 |
| 128 | 4 | 335.47 | 1.79 | 11.92 | 12.58 | 12.72 | 13.36 |
| 128 | 8 | 454.12 | 2.15 | 17.62 | 18.45 | 18.68 | 19.25 |
| 384 | 1 | 83.5 | 1.57 | 11.98 | 12.71 | 12.93 | 13.29 |
| 384 | 2 | 117.75 | 1.96 | 16.99 | 17.62 | 17.83 | 19.48 |
| 384 | 4 | 139.08 | 2.42 | 28.76 | 29.59 | 29.85 | 30.74 |
| 384 | 8 | 149.93 | 2.72 | 53.36 | 54.83 | 55.48 | 56.93 |
BERT BASE FP32
| Sequence Length | Batch Size | Throughput-Average(sent/sec) | Latency-Average(ms) | Latency-90%(ms) | Latency-95%(ms) | Latency-99%(ms) |
|-----------------|------------|------------------------------|---------------------|-----------------|-----------------|-----------------|
| 128 | 1 | 92.82 | 10.77 | 11.06 | 11.11 | 11.24 |
| 128 | 2 | 127.87 | 15.64 | 16.2 | 16.4 | 16.86 |
| 128 | 4 | 151.68 | 26.37 | 27.26 | 27.48 | 27.98 |
| 128 | 8 | 164.51 | 48.63 | 50.36 | 50.72 | 51.52 |
| 384 | 1 | 45.64 | 21.91 | 23.39 | 23.66 | 24.14 |
| 384 | 2 | 48.11 | 41.57 | 42.99 | 43.47 | 44.44 |
| 384 | 4 | 48.64 | 82.24 | 84.35 | 84.97 | 86.2 |
| 384 | 8 | 48.04 | 166.51 | 169.9 | 170.84 | 172.6 |
| 128 | 1 | 64.15 | 15.59 | 19.77 | 21.03 | 21.82 |
| 128 | 2 | 110.69 | 18.07 | 18.92 | 20.77 | 21.6 |
| 128 | 4 | 125.8 | 31.8 | 32.82 | 33.11 | 33.93 |
| 128 | 8 | 127.55 | 62.72 | 63.9 | 64.28 | 65.25 |
| 384 | 1 | 35.46 | 28.2 | 28.83 | 28.95 | 29.43 |
| 384 | 2 | 37.15 | 53.83 | 54.75 | 55.08 | 56.01 |
| 384 | 4 | 36.86 | 108.53 | 110.57 | 111.16 | 112.48 |
| 384 | 8 | 36.1 | 221.61 | 225.94 | 226.94 | 228.58 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
@ -1229,11 +1131,16 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
### Changelog
November 2019
- Pre-training and fine-tuning on biomedical tasks and corpora
October 2019
- Disabling Grappler Optimizations for improved performance
September 2019
- Pre-training using LAMB
- Multi Node support
- Fine Tuning support for GLUE (CoLA, MNLI, MRPC)
- Jupyter Notebooks
July 2019
- Results obtained using 19.06
@ -1245,4 +1152,4 @@ March 2019
### Known issues
- There is a known performance regression with the 19.08 release on Tesla V100 boards with 16 GB memory, smaller batch sizes may be a better choice for this model on these GPUs with the 19.08 release. 32 GB GPUs are not affected.

View file

@ -0,0 +1,570 @@
# BioBERT For TensorFlow
This folder provides a script and recipe to train BERT for TensorFlow to achieve state-of-the-art accuracy on *biomedical text-mining* and is tested and maintained by NVIDIA.
## Table Of Contents
* [Model overview](#model-overview)
* [Quick Start Guide](#quick-start-guide)
* [Advanced](#advanced)
* [Scripts and sample code](#scripts-and-sample-code)
* [Parameters](#parameters)
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Multi-dataset](#multi-dataset)
* [Training process](#training-process)
* [Pre-training](#pre-training)
* [Fine tuning](#fine-tuning)
* [Multi-node](#multi-node)
* [Inference process](#inference-process)
* [Performance](#performance)
* [Benchmarking](#benchmarking)
* [Training performance benchmark](#training-performance-benchmark)
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
* [Training accuracy](#training-accuracy)
* [Pre-training accuracy](#pre-training-accuracy)
* [Fine-tuning accuracy](#fine-tuning-accuracy)
* [Fine-tuning accuracy for NER Chem](#fine-tuning-accuracy-for-ner-chem)
* [Training stability test](#training-stability-test)
* [Fine-tuning stability test](#fine-tuning-stability-test)
* [Training performance results](#training-performance-results)
* [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
* [Pre-training training performance: multi-node on 16G](#pre-training-training-performance-multi-node-on-16g)
* [Fine-tuning training performance for NER on 16G](#fine-tuning-training-performance-for-ner-on-16g)
* [Training performance: NVIDIA DGX-1 (8x V100 32G)](#training-performance-nvidia-dgx-1-8x-v100-32g)
* [Fine-tuning training performance for NER on 32G](#fine-tuning-training-performance-for-ner-on-32g)
* [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32g)
* [Pre-training training performance: multi-node on DGX-2 32G](#pre-training-training-performance-multi-node-on-dgx-2-32g)
* [Fine-tuning training performance for NER on DGX-2 32G](#fine-tuning-training-performance-for-ner-on-dgx-2-32g)
* [Release notes](#release-notes)
* [Changelog](#changelog)
* [Known issues](#known-issues)
## Model overview
In the original [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) paper, pre-training is done on [Wikipedia](https://dumps.wikimedia.org/) and [Books Corpus](http://yknzhu.wixsite.com/mbweb), with state-of-the-art results demonstrated on [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) (Stanford Question Answering Dataset) benchmark.
Meanwhile, many works, including [BioBERT](https://arxiv.org/pdf/1901.08746.pdf), [SciBERT](https://arxiv.org/pdf/1903.10676.pdf), [NCBI-BERT](https://arxiv.org/pdf/1906.05474.pdf), [ClinicalBERT (MIT)](https://arxiv.org/pdf/1904.03323.pdf), [ClinicalBERT (NYU, Princeton)](https://arxiv.org/pdf/1904.05342.pdf), and others at [BioNLP19 workshop](https://aclweb.org/aclwiki/BioNLP_Workshop), show that additional pre-training of BERT on large biomedical text corpus such as [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/) results in better performance in biomedical text-mining tasks.
This repository provides scripts and recipe to adopt the [NVIDIA BERT code-base](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT) to achieve state-of-the-art results in the following biomedical text-mining benchmark tasks:
- [BC5CDR-disease](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/) A Named-Entity-Recognition task to recognize diseases mentioned in a collection of 1500 PubMed titles and abstracts ([Li et al., 2016](https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414))
- [BC5CDR-chemical](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/) A Named-Entity-Recognition task to recognize chemicals mentioned in a collection of 1500 PubMed titles and abstracts ([Li et al., 2016](https://academic.oup.com/database/article/doi/10.1093/database/baw068/2630414))
- [ChemProt](https://biocreative.bioinformatics.udel.edu/news/corpora/) A Relation-Extraction task to determine chemical-protein interactions in a collection of 1820 PubMed abstracts ([Krallinger et al., 2017](https://biocreative.bioinformatics.udel.edu/media/store/files/2017/ProceedingsBCVI_v2.pdf?page=141))
## Quick Start Guide
To pretrain or fine tune your model for BioMedical tasks using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the BERT model.
1. Clone the repository.
```bash
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/LanguageModeling/BERT
```
2. Build the BERT TensorFlow NGC container.
```bash
bash scripts/docker/build.sh
```
3. Download and preprocess the PubMed dataset.
To download and preprocess pre-training data as well as the required vocab files, run the following script:
```bash
bash biobert/scripts/biobert_data_download.sh
```
Datasets for fine-tuning can be obtained from this [repository](https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1).
Place them in `/workspace/bert/data/biobert/` to be automatically picked up by our scripts.
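For example, assuming the downloaded archives were extracted into `BC5CDR` and `ChemProt` folders in the current directory, they can be copied to the locations the fine-tuning scripts read from:
```bash
mkdir -p /workspace/bert/data/biobert
cp -r BC5CDR ChemProt /workspace/bert/data/biobert/
# The fine-tuning scripts read from:
#   /workspace/bert/data/biobert/BC5CDR/chem
#   /workspace/bert/data/biobert/BC5CDR/disease
#   /workspace/bert/data/biobert/ChemProt
```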
4. Start an interactive session in the NGC container to run training/inference.
After you build the container image and download the data, you can start an interactive CLI session as follows:
```bash
bash scripts/docker/launch.sh
```
5. Download the pre-trained checkpoint, vocabulary, and configuration files.
We have uploaded checkpoints for fine-tuning and pre-training on biomedical corpora to the NGC Model Registry. You can download them directly from the [NGC model catalog](https://ngc.nvidia.com/catalog/models).
Place our `BioBERT checkpoints` in the `results/` directory to easily access them in your scripts.
6. Start pre-training.
From within the container, you can use the following script to run the 1st phase of the pre-training using cased vocabulary:
```bash
bash biobert/scripts/run_pretraining-pubmed_base_phase_1.sh <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>
```
For the 2nd phase of the pre-training, issue:
```bash
bash biobert/scripts/run_pretraining-pubmed_base_phase_2.sh <path_to_phase_1_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>
```
Refer to the [Multi-node](#multi-node) section for details on utilizing multiple nodes for faster pre-training.
7. Start fine-tuning.
The above pretrained BERT representations can be fine tuned with just one additional output layer for a state-of-the-art biomedical text-mining system.
From within the container, you can use the following script to run fine-tuning for NER.
Note: The scripts assume you are running on 16 V100 32GB GPUs. If you are running on GPUs with less than 32GB of memory, or on fewer GPUs, the batch size, learning rate, and number of GPUs need to be adjusted.
For NER on disease entities:
```bash
bash biobert/scripts/ner_bc5cdr-disease.sh <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>
```
For NER on chemical entities:
```bash
bash biobert/scripts/ner_bc5cdr-chem.sh <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>
```
For relation extraction, issue:
```bash
bash biobert/scripts/rel_chemprot.sh <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>
```
8. Start validation/evaluation.
The `biobert/scripts/run_biobert_finetuning_inference.sh` script runs inference on a checkpoint fine tuned for a specific task and evaluates the validity of predictions on the basis of F1, precision and recall scores.
```bash
bash biobert/scripts/run_biobert_finetuning_inference.sh <task> <init_checkpoint> <bert_model> <cased> <precision> <use_xla> <batch_size>
```
For FP16 inference for NER on the BC5CDR Chemical task with XLA using a DGX-2 V100 32G, run:
```bash
bash biobert/scripts/run_biobert_finetuning_inference.sh ner_bc5cdr-chem /results/model.ckpt base false fp16 true 16
```
Tasks `ner_bc5cdr-chem`, `ner_bc5cdr-disease` and `rel_chemprot` are currently supported.
## Advanced
The following sections provide greater details of the dataset, running training and inference, and the training results.
### Scripts and sample code
In addition to BERT TensorFlow files, the most important files added for NER and RE fine tuning tasks are:
* `run_ner.py` - Serves as an entry point for NER training.
* `run_re.py` - Serves as an entry point for RE training.
The `biobert/scripts/` folder encapsulates all the one-click scripts required for running various functionalities supported such as:
* `ner_bc5cdr-chem.sh` - Runs NER training and inference on the BC5CDR Chemical dataset using the `run_ner.py` file.
* `ner_bc5cdr-disease.sh` - Runs NER training and inference on the BC5CDR Disease dataset using the `run_ner.py` file.
* `rel_chemprot.sh` - Runs RE training and inference on the ChemProt dataset using the `run_re.py` file.
* `run_pretraining_pubmed_base_phase_*.sh` - Runs pre-training with the LAMB optimizer using the `run_pretraining.py` file in two phases. Phase 1 does most of the training with sequence length = 128. In phase 2, the remaining training is done with sequence length = 512.
* `biobert_data_download.sh` - Downloads the PubMed dataset and Vocab files using files in the `data/` folder.
* `run_biobert_finetuning_inference.sh` - Runs task specific inference using a fine tuned checkpoint.
### Parameters
Aside from the options to set hyperparameters, some relevant options to control the behaviour of the `run_ner.py` and `run_re.py` scripts are:
```
--bert_config_file: The config json file corresponding to the pre-trained BERT model. This specifies the model architecture.
--vocab_file: The vocabulary file that the BERT model was trained on.
--output_dir: The output directory where the model checkpoints will be written.
--[no]do_eval: Whether to run evaluation on the dev set. (default: 'false')
--[no]do_predict: Whether to run evaluation on the test set. (default: 'false')
--[no]do_train: Whether to run training. (default: 'false')
--learning_rate: The initial learning rate for Adam.(default: '5e-06')(a number)
--max_seq_length: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded.(default: '384')(an integer)
--predict_batch_size: Total batch size for predictions.(default: '8')(an integer)
--train_batch_size: Total batch size for training (default: '8')(an integer)
--[no]use_fp16: Whether to enable AMP ops.(default: 'false')
--[no]use_xla: Whether to enable XLA JIT compilation.(default: 'false')
--init_checkpoint: Initial checkpoint (usually from a pre-trained BERT model).
--num_train_epochs: Total number of training epochs to perform.(default: '3.0')(a number)
```
Note: When initializing from a checkpoint using `--init_checkpoint` and a corpus of your choice, keep in mind that `bert_config_file` and `vocab_file` should remain unchanged.
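As an illustrative, not prescriptive, single-GPU example, these flags could be combined roughly as follows for the BC5CDR Chemical data; the paths and hyperparameter values are placeholders to adapt to your setup:
```bash
BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
DATA_DIR=data/biobert/BC5CDR/chem

python run_ner.py \
  --do_train --do_eval --do_predict \
  --task_name=bc5cdr \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=$BERT_DIR/bert_model.ckpt \
  --data_dir=$DATA_DIR \
  --output_dir=/results \
  --train_batch_size=8 \
  --learning_rate=5e-6 \
  --num_train_epochs=10 \
  --max_seq_length=128 \
  --use_fp16 --use_xla
```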
### Command-line options
To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option with the Python file, for example:
```bash
python run_ner.py --help
python run_re.py --help
```
### Getting the data
For pre-training BERT, we use the PubMed dataset. We extract text from the PubMed XML files into a document-level corpus, rather than a shuffled sentence-level corpus, because it is critical to extract long contiguous sentences.
The next step is to run `create_pretraining_data.py` with the document level corpus as input, which generates input data and labels for the masked language modeling and next sentence prediction tasks. Pre-training can also be performed on any corpus of your choice. The collection of data generation scripts are intended to be modular to allow modifications for additional preprocessing steps or to use additional data. They can hence easily be modified for an arbitrary corpus.
The preparation of an individual pre-training dataset is described in the `create_biobert_datasets_from_start.sh` script found in the `data/` folder. The component steps to prepare the datasets are as follows:
1. Data download and extract - the dataset is downloaded and extracted.
2. Clean and format - document tags, etc. are removed from the dataset. The end result of this step is a `{dataset_name_one_article_per_line}.txt` file that contains the entire corpus. Each line in the text file contains an entire document from the corpus. One file per dataset is created in the `formatted_one_article_per_line` folder.
3. Sharding - the sentence segmented corpus file is split into a number of smaller text documents. The sharding is configured so that a document will not be split between two shards. Sentence segmentation is performed at this time using NLTK.
4. TFRecord file creation - each text file shard is processed by the `create_pretraining_data.py` script to produce a corresponding TFRecord file. The script generates input data and labels for masked language modeling and sentence prediction tasks for the input text shard.
For fine-tuning BioBERT on Named Entity Recognition and Relation Extraction tasks, we use the BC5CDR and ChemProt datasets. The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.
The ChemProt corpus consists of text exhaustively annotated by hand with mentions of chemical compounds/drugs and genes/proteins, as well as 22 different types of compound-protein relations focusing on 5 important relation classes. It was preprocessed following the [Lim and Kang](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6014134/) guidelines.
#### Dataset guidelines
The procedure to prepare a text corpus for pre-training is described in the previous section. This section provides additional insight into how exactly raw text is processed so that it is ready for pre-training.
First, raw text is tokenized using [WordPiece tokenization](https://arxiv.org/pdf/1609.08144.pdf). A [CLS] token is inserted at the start of every sequence, and the two sentences in the sequence are separated by a [SEP] token.
Note: BERT pre-training looks at pairs of sentences at a time. A sentence embedding token [A] is added to the first sentence and token [B] to the next.
BERT pre-training optimizes for two unsupervised classification tasks. The first is Masked Language Modelling (Masked LM). One training instance of Masked LM is a single modified sentence. Each token in the sentence has a 15% chance of being replaced by a [MASK] token. The chosen token is replaced with [MASK] 80% of the time, 10% with another random token and the remaining 10% with the same token. The task is then to predict the original token.
The second task is next sentence prediction. One training instance of BERT pre-training is two sentences (a sentence pair). A sentence pair may be constructed by simply taking two adjacent sentences from a single document, or by pairing up two random sentences with equal probability. The goal of this task is to predict whether or not the second sentence followed the first in the original document.
The `create_pretraining_data.py` script takes in raw text and creates training instances for both pre-training tasks.
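A typical invocation on a single text shard might look as follows; the input and output paths are placeholders, the flag names follow the upstream BERT data-generation script, and the values shown are illustrative rather than required:
```bash
python create_pretraining_data.py \
  --input_file=data/sharded/pubmed/shard_0000.txt \
  --output_file=data/tfrecord/pubmed/shard_0000.tfrecord \
  --vocab_file=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```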
#### Multi-dataset
We are able to combine multiple datasets into a single dataset for pre-training on a diverse text corpus. Once TFRecords have been created for each component dataset, you can create a combined dataset by adding the directory to `*FILES_DIR` in `run_pretraining_*.sh`. This will feed all matching files to the input pipeline in `run_pretraining.py`. However, in the training process, only one TFRecord file is consumed at a time, therefore, the training instances of any given training batch will all belong to the same source dataset.
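As a minimal sketch (the directory variables below are assumptions, not the scripts' exact names), the simplest way to have a single `*FILES_DIR` pick up a second corpus is to place or link its TFRecord shards next to the existing ones:
```bash
# Assumed locations; adjust to the directories referenced by your run_pretraining_*.sh
PUBMED_TFRECORD_DIR=/workspace/bert/data/tfrecord/pubmed
EXTRA_TFRECORD_DIR=/workspace/bert/data/tfrecord/my_corpus

# Link the extra shards so the existing *FILES_DIR glob matches both datasets
ln -s $EXTRA_TFRECORD_DIR/*.tfrecord $PUBMED_TFRECORD_DIR/
```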
### Training process
The training process consists of two steps: pre-training and fine tuning.
#### Pre-training
BERT is designed to pre-train deep bidirectional language representations. The following scripts pre-train BERT on the PubMed dataset. These scripts are general and can be used for pre-training language representations on additional corpora of biomedical text.
Pre-training is performed using the `run_pretraining.py` script along with parameters defined in the `biobert/scripts/run_pretraining_pubmed_base_phase_1.sh` and `biobert/scripts/run_pretraining_pubmed_base_phase_2.sh` scripts.
The `biobert/scripts/run_pretraining_pubmed_base_phase*.sh` scripts run a job on a single node that trains the BERT-base model from scratch using the PubMed Corpus dataset as training data. By default, the training script:
- Runs on 16 GPUs
- Has FP16 precision enabled
- Is XLA enabled
- Creates a log file containing all the output
- Saves a checkpoint every 5000 iterations (keeps only the latest checkpoint) and at the end of training. All checkpoints, evaluation results, and training logs are saved to the `/results` directory (in the container which can be mounted to a local directory).
- Evaluates the model at the end of each phase
- Phase 1
- Runs 19531 steps with 1953 warmup steps
- Sets Maximum sequence length as 128
- Sets Global Batch size as 64K
- Phase 2
- Runs 4340 steps with 434 warm-up steps
- Sets Maximum sequence length as 512
- Sets Global Batch size as 32K
- Should start from Phase1's final checkpoint
These parameters train on the PubMed corpus with reasonable accuracy on a DGX-2 with 32GB V100 cards.
For example:
```bash
biobert/scripts/run_pretraining-pubmed_base_phase_1.sh <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>
```
Where:
- `<train_batch_size>` is the per-GPU batch size used for training. The feasible batch size varies with precision; larger batch sizes run more efficiently but require more memory.
- `<learning_rate>` is the learning rate; the default of 3.2e-5 works well for a global batch size of 64K.
- `<cased>` is set to `true` or `false` depending on whether the model should be trained on cased or uncased data.
- `<precision>` is the type of math in your model, can be either `fp32` or `fp16`. Specifically:
- `fp32` is 32-bit IEEE single precision floats.
- `fp16` enables automatic rewriting of the TensorFlow compute graph to take advantage of 16-bit arithmetic whenever it is safe.
- `<num_gpus>` is the number of GPUs to use for training. Must be equal to or smaller than the number of GPUs attached to your node.
- `<warmup_steps>` is the number of warm-up steps at the start of training.
- `<training_steps>` is the total number of training steps.
- `<save_checkpoint_steps>` controls how often checkpoints are saved. Default is 5000 steps.
- `<num_accumulation_steps>` is used to mimic higher batch sizes in the respective phase by accumulating gradients N times before weight update.
- `<bert_model>` is used to indicate whether to pretrain BERT Large or BERT Base model.
- `<eval_batch_size>` is per-GPU batch size used for evaluation after training.
The following sample code trains phase 1 of BERT-base from scratch on a single DGX-2 using FP16 arithmetic and uncased data.
```bash
biobert/scripts/run_pretraining-pubmed_base_phase_1.sh 128 3.2e-5 false fp16 true 16 1953 19531 32 5000 80
```
#### Fine tuning
Fine tuning is performed using the `run_ner.py` script along with parameters defined in `biobert/scripts/ner_bc5cdr*.sh`.
For example, `biobert/scripts/ner_bc5cdr-chem.sh` script trains a model and performs evaluation on the BC5CDR Chemical dataset. By default, the training script:
- Trains on BERT Base Uncased Model
- Uses 16 GPUs and batch size of 8 on each GPU
- Has FP16 precision enabled
- Is XLA enabled
- Runs for 10 epochs
- Evaluation is done at the end of training. To skip evaluation, modify `--do_eval` and `--do_predict` to `False`.
This script outputs checkpoints to the `/results` directory, by default, inside the container. Mount point of `/results` can be changed in the `scripts/docker/launch.sh` file. The training log contains information about:
- Loss for the final step
- Training and evaluation performance
- F1, Precision and Recall on the Test Set of BC5CDR Chemical after evaluation.
The summary after training is printed in the following format:
```bash
0: /results/biobert_finetune_ner_chem_191028154209/test_labels.txt
0: /results/biobert_finetune_ner_chem_191028154209/test_labels_errs.txt
0: processed 124669 tokens with 5433 phrases; found: 5484 phrases; correct: 5102.
0: accuracy: 99.26%; precision: 93.03%; recall: 93.91%; FB1: 93.47
0: : precision: 93.03%; recall: 93.91%; FB1: 93.47 5484
```
Multi-GPU training is enabled with the Horovod TensorFlow module. The following example runs training on 16 GPUs:
```bash
BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
DATA_DIR=data/biobert/BC5CDR/chem
mpi_command="mpirun -np 16 -H localhost:16 \
    --allow-run-as-root -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO \
    -x LD_LIBRARY_PATH \
    -x PATH -mca pml ob1 -mca btl ^openib"

$mpi_command python run_ner.py --horovod --use_fp16 --use_xla \
    --vocab_file=$BERT_DIR/vocab.txt \
    --bert_config_file=$BERT_DIR/bert_config.json \
    --output_dir=/results --data_dir=$DATA_DIR
```
#### Multi-node
Multi-node runs can be launched on a pyxis/enroot Slurm cluster (see [Requirements](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT#requirements)) with the `biobert/scripts/run_biobert.sub` script with the following command for a 4-node DGX2 example for both phase 1 and phase 2:
```bash
BATCHSIZE=128 LEARNING_RATE='8e-6' NUM_ACCUMULATION_STEPS=8 PHASE=1 sbatch -N4 --ntasks-per-node=16 biobert/scripts/run_biobert.sub
BATCHSIZE=16 LEARNING_RATE='3.2e-5' NUM_ACCUMULATION_STEPS=32 PHASE=2 sbatch -N4 --ntasks-per-node=16 biobert/scripts/run_biobert.sub
```
Checkpoint after phase 1 will be saved in `checkpointdir` specified in `biobert/scripts/run_biobert.sub`. The checkpoint will be automatically picked up to resume training on phase 2. Note that phase 2 should be run after phase 1.
Variables to re-run the [Training performance results](#training-performance-results) are available in the `configurations.yml` file.
The batch variables `BATCHSIZE`, `LEARNING_RATE`, `NUM_ACCUMULATION_STEPS` refer to the Python arguments `train_batch_size`, `learning_rate`, `num_accumulation_steps` respectively.
The variable `PHASE` refers to phase specific arguments available in `biobert/scripts/run_biobert.sub`.
Note that the `biobert/scripts/run_biobert.sub` script is a starting point that has to be adapted depending on the environment. In particular, variables such as `datadir` handle the location of the files for each phase.
Refer to the file contents to see the full list of variables to adjust for your system.
### Inference process
Inference on a fine-tuned model for biomedical tasks is performed using the `run_ner.py` or `run_re.py` script along with parameters defined in `biobert/scripts/run_biobert_finetuning_inference.sh`. Inference is supported on a single GPU.
The `biobert/scripts/run_biobert_finetuning_inference.sh` script performs evaluation on ChemProt or BC5CDR datasets depending on the task specified. By default, the inferencing script:
- Uses BC5CDR Chemical dataset
- Has FP16 precision enabled
- Is XLA enabled
- Evaluates the latest checkpoint present in `/results` with a batch size of 16.
This script computes F1, Precision and Recall scores. Mount point of `/results` can be changed in the `scripts/docker/launch.sh` file.
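For example, to evaluate the latest NER disease checkpoint in `/results` with FP16 and XLA (arguments follow the order shown in the Quick Start Guide):
```bash
bash biobert/scripts/run_biobert_finetuning_inference.sh ner_bc5cdr-disease /results/model.ckpt base false fp16 true 16
```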
## Performance
### Benchmarking
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
Both of these benchmarking scripts enable you to run a number of epochs, extract performance numbers, and run the BERT model for fine tuning.
#### Training performance benchmark
Training benchmarking can be performed by running the script:
``` bash
biobert/scripts/biobert_finetune_training_benchmark.sh <task> <num_gpu> <bert_model> <cased>
```
This script runs 2 epochs by default on the NER BC5CDR dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved at `/results/tf_bert_biobert_<task>_training_benchmark_<bert_model>_<cased/uncased>_num_gpu_<num_gpu>_<DATESTAMP>`.
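For example, to benchmark BERT-base fine-tuning on the NER chemical task with 16 GPUs and uncased data:
```bash
bash biobert/scripts/biobert_finetune_training_benchmark.sh ner_bc5cdr-chem 16 base false
```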
#### Inference performance benchmark
Inference benchmarking can be performed by running the script:
``` bash
biobert/scripts/biobert_finetune_inference_benchmark.sh <task> <bert_model> <cased>
```
This script runs inference on the test and dev sets and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 with XLA and FP32 without XLA. These numbers are saved at `/results/tf_bert_biobert_<task>_inference_benchmark_<bert_model>_<cased/uncased>_<DATESTAMP>`.
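For example, to benchmark inference for BERT-base on the NER chemical task with uncased data:
```bash
bash biobert/scripts/biobert_finetune_inference_benchmark.sh ner_bc5cdr-chem base false
```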
### Results
The following sections provide detailed results of downstream fine-tuning task on NER and RE benchmark tasks.
#### Training accuracy results
##### Training accuracy
###### Pre-training accuracy
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container.
| **DGX System** | **Nodes** | **Precision** | **Batch Size/GPU: Phase1, Phase2** | **Accumulation Steps: Phase1, Phase2** | **Time to Train (Hrs)** | **Final Loss** |
|----------------|-----------|---------------|------------------------------------|----------------------------------------|----------------|-------------------------|
| DGX2H | 4 | FP16 | 128, 16 | 8, 32 | 19.14 | 0.88 |
| DGX2H | 16 | FP16 | 128, 16 | 2, 8 | 4.81 | 0.86 |
| DGX2H | 32 | FP16 | 128, 16 | 1, 4 | 2.65 | 0.87 |
###### Fine-tuning accuracy
| **Task** | **F1** | **Precision** | **Recall** |
|:-------:|:----:|:----:|:----:|
| NER BC5CDR-chemical | 93.47 | 93.03 | 93.91 |
| NER BC5CDR-disease | 86.22 | 85.05 | 87.43 |
| RE Chemprot | 76.27 | 77.62 | 74.98 |
###### Fine-tuning accuracy for NER Chem
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container.
| **DGX System** | **Batch size / GPU** | **F1 - FP32** | **F1- mixed precision** | **Time to Train - FP32 (Minutes)** | **Time to Train - mixed precision (Minutes)** |
|:---:|:----:|:----:|:---:|:----:|:----:|
| DGX-1 16G | 64 |93.33|93.40|23.95|14.13|
| DGX-1 32G | 64 |93.31|93.36|24.35|12.63|
| DGX-2 32G | 64 |93.66|93.47|12.26|8.16|
##### Training stability test
###### Fine-tuning stability test
The following table compares F1 scores across 5 different training runs on the NER Chemical task with different seeds, for both FP16 and FP32. The runs showcase consistent convergence on all 5 seeds with very little deviation.
| **16 x V100 GPUs** | **seed 1** | **seed 2** | **seed 3** | **seed 4** | **seed 5** | **mean** | **std** |
|:-----------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| F1 Score (FP16) | 93.13 | 92.92 | 93.34 | 93.66 | 93.47 | 93.3 | 0.29 |
| F1 Score (FP32) | 93.1 | 93.28 | 93.33 | 93.45 | 93.17 | 93.27 | 0.14 |
#### Training performance results
##### Training performance: NVIDIA DGX-1 (8x V100 16G)
###### Pre-training training performance: multi-node on DGX-1 16G
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 64,32 | 2762.06 | 744.48 | 3.71 | 1.00 | 1.00 |
| 4 | 128 | 64,32 | 10283.08 | 2762.88 | 3.72 | 3.72 | 3.71 |
| 16 | 128 | 64,32 | 39051.69 | 10715.14 | 3.64 | 14.14 | 14.39 |
| 32 | 128 | 64,32 | 76077.39 | 21104.87 | 3.60 | 27.54 | 28.35 |
| 1 | 512 | 8,8 | 432.33 | 160.38 | 2.70 | 1.00 | 1.00 |
| 4 | 512 | 8,8 | 1593.00 | 604.36 | 2.64 | 3.68 | 3.77 |
| 16 | 512 | 8,8 | 5941.82 | 2356.44 | 2.52 | 13.74 | 14.69 |
| 32 | 512 | 8,8 | 11483.73 | 4631.29 | 2.48 | 26.56 | 28.88 |
Note: The corresponding values for FP32 runs that use batch sizes of 16 and 2 for sequence lengths 128 and 512, respectively, are not available due to out-of-memory errors.
###### Fine-tuning training performance for NER on DGX-1 16G
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 64 | 147.71 | 348.84 | 2.36 | 1.00 | 1.00 |
| 4 | 64 | 583.78 | 1145.46 | 1.96 | 3.95 | 3.28 |
| 8 | 64 | 981.22 | 1964.85 | 2.00 | 6.64 | 5.63 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-1 (8x V100 32G)
###### Fine-tuning training performance for NER on DGX-1 32G
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 64 | 144.1 | 417.39 | 2.89 | 1.00 | 1.00 |
| 4 | 64 | 525.15 | 1354.14 | 2.57 | 3.64 | 3.24 |
| 8 | 64 | 969.4 | 2341.39 | 2.41 | 6.73 | 5.61 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
##### Training performance: NVIDIA DGX-2 (16x V100 32G)
###### Pre-training training performance: multi-node on DGX-2H 32G
Our results were obtained by running the `biobert/scripts/run_biobert.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2H with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.
| **Nodes** | **Sequence Length**| **Batch size / GPU: mixed precision, FP32** | **Throughput - mixed precision** | **Throughput - FP32** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - FP32** |
|:-------:|:-----:|:-------:|:-------:|:-------:|:-------------:|:------:|:------:|
| 1 | 128 | 128,128 | 7772.18 | 2165.04 | 3.59 | 1.00 | 1.00 |
| 4 | 128 | 128,128 | 29785.31 | 8516.90 | 3.50 | 3.83 | 3.93 |
| 16 | 128 | 128,128 | 115581.29 | 33699.15 | 3.43 | 14.87 | 15.57 |
| 32 | 128 | 128,128 | 226156.53 | 66996.73 | 3.38 | 29.10 | 30.94 |
| 64 | 128 | 128,128 | 444955.74 | 133424.95 | 3.33 | 57.25 | 61.63 |
| 1 | 512 | 16,16 | 1260.06 | 416.92 | 3.02 | 1.00 | 1.00 |
| 4 | 512 | 16,16 | 4781.19 | 1626.76 | 2.94 | 3.79 | 3.90 |
| 16 | 512 | 16,16 | 18405.65 | 6418.09 | 2.87 | 14.61 | 15.39 |
| 32 | 512 | 16,16 | 36071.06 | 12713.67 | 2.84 | 28.63 | 30.49 |
| 64 | 512 | 16,16 | 69950.86 | 25245.96 | 2.77 | 55.51 | 60.55 |
###### Fine-tuning training performance for NER on DGX-2 32G
Our results were obtained by running the `biobert/scripts/ner_bc5cdr-chem.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.
| **GPUs** | **Batch size / GPU** | **Throughput - FP32** | **Throughput - mixed precision** | **Throughput speedup (FP32 to mixed precision)** | **Weak scaling - FP32** | **Weak scaling - mixed precision** |
|:---:|:---:|:------:|:-----:|:----:|:----:|:----:|
| 1 | 64 | 139.59 | 475.54 | 3.4 | 1.00 | 1.00 |
| 4 | 64 | 517.08 | 1544.01 | 2.98 | 3.70 | 3.25 |
| 8 | 64 | 1009.84 | 2695.34 | 2.66 | 7.23 | 5.67 |
| 16 | 64 | 1997.73 | 4268.81 | 2.13 | 14.31 | 8.98 |
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
## Release notes
### Changelog
November 2019
- Initial release
### Known issues
- There are no known issues with the model.

View file

@ -0,0 +1,302 @@
# Python version of the evaluation script from CoNLL'00-
# Originates from: https://github.com/spyysalo/conlleval.py
# Intentional differences:
# - accept any space as delimiter by default
# - optional file argument (default STDIN)
# - option to set boundary (-b argument)
# - LaTeX output (-l argument) not supported
# - raw tags (-r argument) not supported
# add function :evaluate(predicted_label, ori_label): which will not read from file
import sys
import re
import codecs
from collections import defaultdict, namedtuple
ANY_SPACE = '<SPACE>'
class FormatError(Exception):
pass
Metrics = namedtuple('Metrics', 'tp fp fn prec rec fscore')
class EvalCounts(object):
def __init__(self):
self.correct_chunk = 0 # number of correctly identified chunks
self.correct_tags = 0 # number of correct chunk tags
self.found_correct = 0 # number of chunks in corpus
self.found_guessed = 0 # number of identified chunks
self.token_counter = 0 # token counter (ignores sentence breaks)
# counts by type
self.t_correct_chunk = defaultdict(int)
self.t_found_correct = defaultdict(int)
self.t_found_guessed = defaultdict(int)
def parse_args(argv):
import argparse
parser = argparse.ArgumentParser(
description='evaluate tagging results using CoNLL criteria',
formatter_class=argparse.ArgumentDefaultsHelpFormatter
)
arg = parser.add_argument
arg('-b', '--boundary', metavar='STR', default='-X-',
help='sentence boundary')
arg('-d', '--delimiter', metavar='CHAR', default=ANY_SPACE,
help='character delimiting items in input')
arg('-o', '--otag', metavar='CHAR', default='O',
help='alternative outside tag')
arg('file', nargs='?', default=None)
return parser.parse_args(argv)
def parse_tag(t):
m = re.match(r'^([^-]*)-(.*)$', t)
return m.groups() if m else (t, '')
def evaluate(iterable, options=None):
if options is None:
options = parse_args([]) # use defaults
counts = EvalCounts()
num_features = None # number of features per line
in_correct = False # currently processed chunks is correct until now
last_correct = 'O' # previous chunk tag in corpus
last_correct_type = '' # type of previously identified chunk tag
last_guessed = 'O' # previously identified chunk tag
last_guessed_type = '' # type of previous chunk tag in corpus
for i, line in enumerate(iterable):
line = line.rstrip('\r\n')
# print(line)
if options.delimiter == ANY_SPACE:
features = line.split()
else:
features = line.split(options.delimiter)
if num_features is None:
num_features = len(features)
elif num_features != len(features) and len(features) != 0:
raise FormatError('unexpected number of features: %d (%d) at line %d\n%s' %
(len(features), num_features, i, line))
if len(features) == 0 or features[0] == options.boundary:
features = [options.boundary, 'O', 'O']
if len(features) < 3:
raise FormatError('unexpected number of features in line %s' % line)
guessed, guessed_type = parse_tag(features.pop())
correct, correct_type = parse_tag(features.pop())
first_item = features.pop(0)
if first_item == options.boundary:
guessed = 'O'
end_correct = end_of_chunk(last_correct, correct,
last_correct_type, correct_type)
end_guessed = end_of_chunk(last_guessed, guessed,
last_guessed_type, guessed_type)
start_correct = start_of_chunk(last_correct, correct,
last_correct_type, correct_type)
start_guessed = start_of_chunk(last_guessed, guessed,
last_guessed_type, guessed_type)
if in_correct:
if (end_correct and end_guessed and
last_guessed_type == last_correct_type):
in_correct = False
counts.correct_chunk += 1
counts.t_correct_chunk[last_correct_type] += 1
elif (end_correct != end_guessed or guessed_type != correct_type):
in_correct = False
if start_correct and start_guessed and guessed_type == correct_type:
in_correct = True
if start_correct:
counts.found_correct += 1
counts.t_found_correct[correct_type] += 1
if start_guessed:
counts.found_guessed += 1
counts.t_found_guessed[guessed_type] += 1
if first_item != options.boundary:
if correct == guessed and guessed_type == correct_type:
counts.correct_tags += 1
counts.token_counter += 1
last_guessed = guessed
last_correct = correct
last_guessed_type = guessed_type
last_correct_type = correct_type
if in_correct:
counts.correct_chunk += 1
counts.t_correct_chunk[last_correct_type] += 1
return counts
def uniq(iterable):
seen = set()
return [i for i in iterable if not (i in seen or seen.add(i))]
def calculate_metrics(correct, guessed, total):
tp, fp, fn = correct, guessed-correct, total-correct
p = 0 if tp + fp == 0 else 1.*tp / (tp + fp)
r = 0 if tp + fn == 0 else 1.*tp / (tp + fn)
f = 0 if p + r == 0 else 2 * p * r / (p + r)
return Metrics(tp, fp, fn, p, r, f)
def metrics(counts):
c = counts
overall = calculate_metrics(
c.correct_chunk, c.found_guessed, c.found_correct
)
by_type = {}
for t in uniq(list(c.t_found_correct) + list(c.t_found_guessed)):
by_type[t] = calculate_metrics(
c.t_correct_chunk[t], c.t_found_guessed[t], c.t_found_correct[t]
)
return overall, by_type
def report(counts, out=None):
if out is None:
out = sys.stdout
overall, by_type = metrics(counts)
c = counts
out.write('processed %d tokens with %d phrases; ' %
(c.token_counter, c.found_correct))
out.write('found: %d phrases; correct: %d.\n' %
(c.found_guessed, c.correct_chunk))
if c.token_counter > 0:
out.write('accuracy: %6.2f%%; ' %
(100.*c.correct_tags/c.token_counter))
out.write('precision: %6.2f%%; ' % (100.*overall.prec))
out.write('recall: %6.2f%%; ' % (100.*overall.rec))
out.write('FB1: %6.2f\n' % (100.*overall.fscore))
for i, m in sorted(by_type.items()):
out.write('%17s: ' % i)
out.write('precision: %6.2f%%; ' % (100.*m.prec))
out.write('recall: %6.2f%%; ' % (100.*m.rec))
out.write('FB1: %6.2f %d\n' % (100.*m.fscore, c.t_found_guessed[i]))
def report_notprint(counts, out=None):
if out is None:
out = sys.stdout
overall, by_type = metrics(counts)
c = counts
final_report = []
line = []
line.append('processed %d tokens with %d phrases; ' %
(c.token_counter, c.found_correct))
line.append('found: %d phrases; correct: %d.\n' %
(c.found_guessed, c.correct_chunk))
final_report.append("".join(line))
if c.token_counter > 0:
line = []
line.append('accuracy: %6.2f%%; ' %
(100.*c.correct_tags/c.token_counter))
line.append('precision: %6.2f%%; ' % (100.*overall.prec))
line.append('recall: %6.2f%%; ' % (100.*overall.rec))
line.append('FB1: %6.2f\n' % (100.*overall.fscore))
final_report.append("".join(line))
for i, m in sorted(by_type.items()):
line = []
line.append('%17s: ' % i)
line.append('precision: %6.2f%%; ' % (100.*m.prec))
line.append('recall: %6.2f%%; ' % (100.*m.rec))
line.append('FB1: %6.2f %d\n' % (100.*m.fscore, c.t_found_guessed[i]))
final_report.append("".join(line))
return final_report
def end_of_chunk(prev_tag, tag, prev_type, type_):
# check if a chunk ended between the previous and current word
# arguments: previous and current chunk tags, previous and current types
chunk_end = False
if prev_tag == 'E': chunk_end = True
if prev_tag == 'S': chunk_end = True
if prev_tag == 'B' and tag == 'B': chunk_end = True
if prev_tag == 'B' and tag == 'S': chunk_end = True
if prev_tag == 'B' and tag == 'O': chunk_end = True
if prev_tag == 'I' and tag == 'B': chunk_end = True
if prev_tag == 'I' and tag == 'S': chunk_end = True
if prev_tag == 'I' and tag == 'O': chunk_end = True
if prev_tag != 'O' and prev_tag != '.' and prev_type != type_:
chunk_end = True
# these chunks are assumed to have length 1
if prev_tag == ']': chunk_end = True
if prev_tag == '[': chunk_end = True
return chunk_end
def start_of_chunk(prev_tag, tag, prev_type, type_):
# check if a chunk started between the previous and current word
# arguments: previous and current chunk tags, previous and current types
chunk_start = False
if tag == 'B': chunk_start = True
if tag == 'S': chunk_start = True
if prev_tag == 'E' and tag == 'E': chunk_start = True
if prev_tag == 'E' and tag == 'I': chunk_start = True
if prev_tag == 'S' and tag == 'E': chunk_start = True
if prev_tag == 'S' and tag == 'I': chunk_start = True
if prev_tag == 'O' and tag == 'E': chunk_start = True
if prev_tag == 'O' and tag == 'I': chunk_start = True
if tag != 'O' and tag != '.' and prev_type != type_:
chunk_start = True
# these chunks are assumed to have length 1
if tag == '[': chunk_start = True
if tag == ']': chunk_start = True
return chunk_start
def main(argv):
args = parse_args(argv[1:])
if args.file is None:
counts = evaluate(sys.stdin, args)
else:
with open(args.file) as f:
counts = evaluate(f, args)
report(counts)
def return_report(input_file):
with open(input_file, "r") as f:
counts = evaluate(f)
return report_notprint(counts)
if __name__ == '__main__':
    sys.exit(main(sys.argv))

View file

@ -0,0 +1,51 @@
import os
import numpy as np
import pandas as pd
import sklearn.metrics
import argparse
parser = argparse.ArgumentParser(description='')
parser.add_argument('--output_path', type=str, help='')
parser.add_argument('--answer_path', type=str, help='')
parser.add_argument('--task', type=str, default="binary", help='default:binary, possible other options:{chemprot}')
args = parser.parse_args()
testdf = pd.read_csv(args.answer_path, sep="\t", index_col=0)
preddf = pd.read_csv(args.output_path, sep="\t", header=None)
# binary
if args.task == "binary":
pred = [preddf.iloc[i].tolist() for i in preddf.index]
pred_class = [np.argmax(v) for v in pred]
pred_prob_one = [v[1] for v in pred]
p,r,f,s = sklearn.metrics.precision_recall_fscore_support(y_pred=pred_class, y_true=testdf["label"])
results = dict()
results["f1 score"] = f[1]
results["recall"] = r[1]
results["precision"] = p[1]
results["specificity"] = r[0]
# chemprot
# micro-average of 5 target classes
# see "Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction (Mehryary, 2018)" for details
if args.task == "chemprot":
pred = [preddf.iloc[i].tolist() for i in preddf.index]
pred_class = [np.argmax(v) for v in pred]
str_to_int_mapper = dict()
for i,v in enumerate(sorted(testdf["label"].unique())):
str_to_int_mapper[v] = i
test_answer = [str_to_int_mapper[v] for v in testdf["label"]]
p,r,f,s = sklearn.metrics.precision_recall_fscore_support(y_pred=pred_class, y_true=test_answer, labels=[0,1,2,3,4], average="micro")
results = dict()
results["f1 score"] = f
results["recall"] = r
results["precision"] = p
for k,v in results.items():
print("{:11s} : {:.2%}".format(k,v))

View file

@ -0,0 +1,19 @@
#!/usr/bin/env bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
docker run --runtime=nvidia -v $PWD:/workspace/bert \
--rm --shm-size=1g --ulimit memlock=-1 \
--ulimit stack=67108864 --ipc=host -t -i \
bert bash -c "bash data/create_biobert_datasets_from_start.sh"

View file

@ -0,0 +1,187 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task=${1:-"ner_bc5cdr-chem"}
bert_model=${2:-"base"}
cased=${3:-"false"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
DATESTAMP=`date +'%y%m%d%H%M%S'`
printf -v TAG "tf_bert_biobert_%s_inference_benchmark_%s_%s" "$task" "$bert_model" "$CASING_DIR_PREFIX"
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
if [ "$task" = "ner_bc5cdr-chem" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_sl_${seq_length}_prec_${precision}_bs_${batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
$use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "ner_bc5cdr-disease" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_sl_${seq_length}_prec_${precision}_bs_${batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
$use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "rel_chemprot" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_sl_${seq_length}_prec_${precision}_bs_${batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
$use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $batch_size $perf" >> $LOGFILE
done
done
done
else
echo "Benchmarking for $task is currently not supported. Sorry!"
fi

View file

@ -0,0 +1,203 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task=${1:-"ner_bc5cdr-chem"}
num_gpu=${2:-"2"}
bert_model=${3:-"base"}
cased=${4:-"false"}
epochs=2.0
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
DATESTAMP=`date +'%y%m%d%H%M%S'`
printf -v TAG "tf_bert_biobert_%s_training_benchmark_%s_%s_num_gpu_%d" "$task" "$bert_model" "$CASING_DIR_PREFIX" "$num_gpu"
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
if [ "$task" = "ner_bc5cdr-chem" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}_gpu_${num_gpu}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for train_batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_gpu_${num_gpu}_sl_${seq_length}_prec_${precision}_bs_${train_batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
$mpi_command python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name=bc5cdr \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--train_batch_size=$train_batch_size \
--max_seq_length=$seq_length \
$use_hvd $use_fp16 $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $train_batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "ner_bc5cdr-disease" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}_gpu_${num_gpu}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for train_batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_gpu_${num_gpu}_sl_${seq_length}_prec_${precision}_bs_${train_batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
$mpi_command python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--train_batch_size=$train_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $train_batch_size $perf" >> $LOGFILE
done
done
done
elif [ "$task" = "rel_chemprot" ] ; then
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
LOGFILE="${OUTPUT_DIR}/${task}_training_benchmark_bert_${bert_model}_gpu_${num_gpu}.log"
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_length in 128 512; do
for train_batch_size in 8 32 64; do
for precision in fp16 fp32; do
res_dir=${OUTPUT_DIR}/bert_${bert_model}_gpu_${num_gpu}_sl_${seq_length}_prec_${precision}_bs_${train_batch_size}
mkdir -p ${res_dir}
tmp_file="${res_dir}/${task}_training_benchmark.log"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
$mpi_command python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint="$BERT_DIR/bert_model.ckpt" \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$res_dir \
--train_batch_size=$train_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_length $train_batch_size $perf" >> $LOGFILE
done
done
done
else
echo "Benchmarking for " $task "currently not supported. Sorry!"
fi

View file

@ -0,0 +1,86 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
train_batch_size=${2:-8}
learning_rate=${3:-3.125e-6}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
num_gpu=${7:-"16"}
seq_length=${8:-128}
bert_model=${9:-"base"}
eval_batch_size=${10:-8} # Eval and Predict batch sizes are assumed to be the same
epochs=${11:-"10.0"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
export GBS=$(expr $train_batch_size \* $num_gpu)
printf -v TAG "tf_bert_biobert_ner_bc5cdr_chem_%s_%s_gbs%d" "$bert_model" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
$mpi_command python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name=bc5cdr \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--learning_rate=$learning_rate \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--predict_batch_size=$eval_batch_size \
--max_seq_length=$seq_length \
$use_hvd $use_fp16 $use_xla_tag $case_flag

View file

@ -0,0 +1,85 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
train_batch_size=${2:-8}
learning_rate=${3:-3.125e-6}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
num_gpu=${7:-"16"}
seq_length=${8:-128}
bert_model=${9:-"base"}
eval_batch_size=${10:-8} # Eval and Predict batch sizes are assumed to be the same
epochs=${11:-"100.0"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
export GBS=$(expr $train_batch_size \* $num_gpu)
printf -v TAG "tf_bert_biobert_ner_bc5cdr_disease_%s_%s_gbs%d" "$bert_model" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
$mpi_command python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--learning_rate=$learning_rate \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--predict_batch_size=$eval_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag

View file

@ -0,0 +1,87 @@
#!/bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
train_batch_size=${2:-64}
learning_rate=${3:-1.5e-6}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
num_gpu=${7:-"16"}
seq_length=${8:-512}
bert_model=${9:-"base"}
eval_batch_size=${10:-16} # Eval and Predict batch sizes are assumed to be the same
epochs=${11:-"3.0"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
export GBS=$(expr $train_batch_size \* $num_gpu)
printf -v TAG "tf_bert_biobert_rel_chemprot_%s_%s_gbs%d" "$bert_model" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
mkdir -p ${OUTPUT_DIR}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ $num_gpu -gt 1 ] ; then
mpi_command="mpirun -np $num_gpu -H localhost:$num_gpu \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
$mpi_command python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_train=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--num_train_epochs=$epochs \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--learning_rate=$learning_rate \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--predict_batch_size=$eval_batch_size \
--max_seq_length=$seq_length \
"$use_hvd" "$use_fp16" $use_xla_tag $case_flag
python3 /workspace/bert/biobert/re_eval.py --task=chemprot --output_path=$OUTPUT_DIR/test_results.tsv \
--answer_path=$DATASET_DIR/test.tsv |& tee $OUTPUT_DIR/test_results.txt

View file

@ -0,0 +1,87 @@
#!/bin/bash
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --overcommit
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
set -eux
readonly docker_image="nvcr.io/nvidia/tensorflow:19.08-py3"
readonly datadir="/raid/data/bert"
readonly checkpointdir="$PWD/checkpoints"
readonly mounts=".:/workspace/bert,${datadir}:/workspace/bert/data,${checkpointdir}:/results"
DO_LOWER_CASE=${DO_LOWER_CASE:-1}
if [ "$DO_LOWER_CASE" == "1" ]; then
CASING_DIR_PREFIX="uncased"
else
CASING_DIR_PREFIX="cased"
fi
DO_BERT_BASE=${DO_BERT_BASE:-1}
if [ "$DO_BERT_BASE" == "1" ]; then
CASING_DIR_SUFFIX="L-12_H-768_A-12"
else
CASING_DIR_SUFFIX="L-24_H-1024_A-16"
fi
srun --ntasks="${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 mkdir -p "${checkpointdir}/biobert_phase_1"
srun --ntasks="${SLURM_JOB_NUM_NODES}" --ntasks-per-node=1 mkdir -p "${checkpointdir}/biobert_phase_2"
PHASE1="\
--train_batch_size=${BATCHSIZE:-128} \
--learning_rate=${LEARNING_RATE:-3.2e-5} \
--num_accumulation_steps=${NUM_ACCUMULATION_STEPS:-128} \
--input_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training \
--eval_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=19531 \
--num_warmup_steps=1953 \
--output_dir=/results/biobert_phase_1 \
"
PHASE2="\
--train_batch_size=${BATCHSIZE:-16} \
--learning_rate=${LEARNING_RATE:-6.4e-5} \
--num_accumulation_steps=${NUM_ACCUMULATION_STEPS:-512} \
--input_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training \
--eval_files_dir=/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--num_train_steps=4340 \
--num_warmup_steps=434 \
--output_dir=/results/biobert_phase_2 \
--init_checkpoint=/results/biobert_phase_1/model.ckpt-19531 \
"
PHASES=( "$PHASE1" "$PHASE2" )
PHASE=${PHASE:-1}
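# Select pretraining phase 1 or 2 via the PHASE environment variable (defaults to phase 1);
# the corresponding flag block from PHASES above is spliced into the command below.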
BERT_CMD="\
python /workspace/bert/run_pretraining.py \
${PHASES[$((PHASE-1))]} \
--bert_config_file=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_${CASING_DIR_SUFFIX}/bert_config.json \
--vocab_file=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_${CASING_DIR_SUFFIX}/vocab.txt \
--do_train=True \
--do_eval=True \
--save_checkpoints_steps=5000 \
--horovod --use_fp16 --use_xla \
--allreduce_post_accumulation=True \
--eval_batch_size=8"
srun --mpi=pmi2 -l --container-image="${docker_image}" --container-mounts="${mounts}" bash -c "${BERT_CMD}"

View file

@ -0,0 +1,122 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
task=${1:-"ner_bc5cdr-chem"}
init_checkpoint=${2:-"/results/biobert_tf_uncased_base/model.ckpt-4340"}
bert_model=${3:-"base"}
cased=${4:-"false"}
precision=${5:-"fp16"}
use_xla=${6:-"true"}
batch_size=${7:-"16"}
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
case_flag="--do_lower_case=False"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
case_flag="--do_lower_case=True"
fi
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-24_H-1024_A-16
else
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12
fi
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
DATESTAMP=`date +'%y%m%d%H%M%S'`
if [ "$task" = "ner_bc5cdr-chem" ] ; then
printf -v TAG "tf_bert_biobert_ner_bc5cdr_chem_inference_%s_%s" "$bert_model" "$precision"
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/chem
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
python /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=128 \
$use_fp16 $use_xla_tag $case_flag
elif [ "$task" = "ner_bc5cdr-disease" ] ; then
printf -v TAG "tf_bert_biobert_ner_bc5cdr_disease_inference_%s_%s" "$bert_model" "$precision"
DATASET_DIR=/workspace/bert/data/biobert/BC5CDR/disease
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
python3 /workspace/bert/run_ner.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="bc5cdr" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=128 \
"$use_fp16" $use_xla_tag $case_flag
elif [ "$task" = "rel_chemprot" ] ; then
printf -v TAG "tf_bert_biobert_rel_chemprot_inference_%s_%s_" "$bert_model" "$precision"
DATASET_DIR=/workspace/bert/data/biobert/ChemProt
OUTPUT_DIR=/results/${TAG}_${DATESTAMP}
python3 /workspace/bert/run_re.py \
--do_prepare=true \
--do_eval=true \
--do_predict=true \
--task_name="chemprot" \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--init_checkpoint=$init_checkpoint \
--data_dir=$DATASET_DIR \
--output_dir=$OUTPUT_DIR \
--eval_batch_size=$batch_size \
--predict_batch_size=$batch_size \
--max_seq_length=512 \
"$use_fp16" $use_xla_tag $case_flag
python3 /workspace/bert/biobert/re_eval.py --task=chemprot --output_path=$OUTPUT_DIR/test_results.tsv \
--answer_path=$DATASET_DIR/test.tsv |& tee $OUTPUT_DIR/test_results.txt
else
echo "Benchmarking for " $task "currently not supported. Sorry!"
fi

View file

@ -0,0 +1,87 @@
#! /bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
train_batch_size=${1:-128}
learning_rate=${2:-"9.625e-5"}
cased=${3:-false}
precision=${4:-"fp16"}
use_xla=${5:-"true"}
num_gpus=${6:-16}
warmup_steps=${7:-"1953"}
train_steps=${8:-19531}
num_accumulation_steps=${9:-32}
save_checkpoint_steps=${10:-5000}
eval_batch_size=${11:-80}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
fi
BERT_CONFIG=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12/bert_config.json
RESULTS_DIR=/results
CHECKPOINTS_DIR=${RESULTS_DIR}/biobert_phase_1
mkdir -p ${CHECKPOINTS_DIR}
INIT_CHECKPOINT=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12/bert_model.ckpt
INPUT_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training"
EVAL_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test"
if [ $num_gpus -gt 1 ] ; then
mpi_command="mpirun -np $num_gpus -H localhost:$num_gpus \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
export GBS=$(expr $train_batch_size \* $num_gpus \* $num_accumulation_steps)
printf -v TAG "tf_bert_bio_1n_phase1_cased_%s_%s_gbs%d" "$cased" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
$mpi_command python3 /workspace/bert/run_pretraining.py \
--input_files_dir=$INPUT_FILES_DIR \
--eval_files_dir=$EVAL_FILES_DIR \
--output_dir=$CHECKPOINTS_DIR \
--bert_config_file=$BERT_CONFIG \
--do_train=True \
--do_eval=True \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=$train_steps \
--num_warmup_steps=$warmup_steps \
--save_checkpoints_steps=$save_checkpoint_steps \
--num_accumulation_steps=$num_accumulation_steps \
--learning_rate=$learning_rate \
--report_loss \
$use_hvd $use_fp16 $use_xla_tag \
--init_checkpoint=$INIT_CHECKPOINT |& tee $LOGFILE

View file

@ -0,0 +1,85 @@
#! /bin/bash
echo "Container nvidia build = " $NVIDIA_BUILD_ID
init_checkpoint=${1}
train_batch_size=${2:-16}
learning_rate=${3:-"2.9e-4"}
cased=${4:-false}
precision=${5:-"fp16"}
use_xla=${6:-true}
num_gpus=${7:-16}
warmup_steps=${8:-"434"}
train_steps=${9:-4340}
num_accumulation_steps=${10:-128}
save_checkpoint_steps=${11:-5000}
eval_batch_size=${12:-26}
use_fp16=""
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
use_fp16="--use_fp16"
fi
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
if [ "$cased" = "true" ] ; then
DO_LOWER_CASE=0
CASING_DIR_PREFIX="cased"
else
DO_LOWER_CASE=1
CASING_DIR_PREFIX="uncased"
fi
BERT_CONFIG=/workspace/bert/data/download/google_pretrained_weights/${CASING_DIR_PREFIX}_L-12_H-768_A-12/bert_config.json
RESULTS_DIR=/results
CHECKPOINTS_DIR=${RESULTS_DIR}/biobert_phase_2
mkdir -p ${CHECKPOINTS_DIR}
INPUT_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/training"
EVAL_FILES_DIR="/workspace/bert/data/tfrecord/lower_case_${DO_LOWER_CASE}_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/pubmed_baseline/test"
if [ $num_gpus -gt 1 ] ; then
mpi_command="mpirun -np $num_gpus -H localhost:$num_gpus \
--allow-run-as-root -bind-to none -map-by slot \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH \
-x PATH -mca pml ob1 -mca btl ^openib"
use_hvd="--horovod"
else
mpi_command=""
use_hvd=""
fi
export GBS=$(expr $train_batch_size \* $num_gpus \* $num_accumulation_steps)
printf -v TAG "tf_bert_bio_1n_phase2_cased_%s_%s_gbs%d" "$cased" "$precision" $GBS
DATESTAMP=`date +'%y%m%d%H%M%S'`
LOGFILE=$RESULTS_DIR/$TAG.$DATESTAMP.log
printf "Logs written to %s\n" "$LOGFILE"
$mpi_command python3 /workspace/bert/run_pretraining.py \
--input_files_dir=$INPUT_FILES_DIR \
--eval_files_dir=$EVAL_FILES_DIR \
--output_dir=$CHECKPOINTS_DIR \
--bert_config_file=$BERT_CONFIG \
--do_train=True \
--do_eval=True \
--train_batch_size=$train_batch_size \
--eval_batch_size=$eval_batch_size \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--num_train_steps=$train_steps \
--num_warmup_steps=$warmup_steps \
--save_checkpoints_steps=$save_checkpoint_steps \
--num_accumulation_steps=$num_accumulation_steps \
--learning_rate=$learning_rate \
--report_loss \
$use_hvd $use_xla_tag $use_fp16 \
--init_checkpoint=$init_checkpoint |& tee $LOGFILE

View file

@ -27,7 +27,7 @@ class PubMedTextFormatting:
print('PubMed path:', self.pubmed_path)
with open(self.output_filename, mode='w', newline='\n') as ofile:
for filename in glob.glob(self.pubmed_path + '/*.xml', recursive=self.recursive):
for filename in glob.glob(self.pubmed_path + '/*.xml*', recursive=self.recursive):
print('file:', filename)
dicts_out = pmp.parse_medline_xml(filename)
for dict_out in dicts_out:

View file

@ -302,7 +302,7 @@ if __name__ == "__main__":
'--fraction_test_set',
type=float,
help='Specify the fraction (0..1) of the data to withhold for the test data split (based on number of sequences)',
default=0.2
default=0.1
)
parser.add_argument(

View file

@ -0,0 +1,55 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
export BERT_PREP_WORKING_DIR="${BERT_PREP_WORKING_DIR}"
# Download
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset pubmed_baseline
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
# Properly format the text files
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action text_formatting --dataset pubmed_baseline
# Shard the text files
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action sharding --dataset pubmed_baseline
### BERT BASE
## UNCASED
# Create TFRecord files Phase 1
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 128 \
--max_predictions_per_seq 20 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt
# Create TFRecord files Phase 2
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 512 \
--max_predictions_per_seq 80 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/uncased_L-12_H-768_A-12/vocab.txt
## CASED
# Create TFRecord files Phase 1
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 128 \
--max_predictions_per_seq 20 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/cased_L-12_H-768_A-12/vocab.txt \
--do_lower_case=0
# Create TFRecord files Phase 2
python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action create_tfrecord_files --dataset pubmed_baseline --max_seq_length 512 \
--max_predictions_per_seq 80 --vocab_file ${BERT_PREP_WORKING_DIR}/download/google_pretrained_weights/cased_L-12_H-768_A-12/vocab.txt \
--do_lower_case=0

Binary file not shown.


View file

@ -8,6 +8,12 @@
```
<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: right;">
# Table Of Contents
- [BERT Question Answering Inference/Fine-Tuning with Mixed Precision](#bert-question-answering-inferencefine-tuning-with-mixed-precision)
- [BioBERT Named-Entity Recognition Inference with Mixed Precision](#biobert-named-entity-recognition-inference-with-mixed-precision)
# BERT Question Answering Inference/Fine-Tuning with Mixed Precision
## 1. Overview
@ -88,3 +94,80 @@ in, for example:
```
http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b
```
# BioBERT Named-Entity Recognition Inference with Mixed Precision
## 1. Overview
Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
BioBERT is a domain specific version of BERT that has been trained on PubMed abstracts.
The original BioBERT paper can be found here: https://arxiv.org/abs/1901.08746
NVIDIA's BioBERT is an optimized version of the implementation presented in the paper, leveraging mixed precision arithmetic and tensor cores on V100 GPUs for faster training times while maintaining target accuracy.
### 1.a Learning objectives
This repository contains an example notebook that demonstrates:
- Inference on NER task with BioBERT model
- The use/download of fine-tuned NVIDIA BioBERT models
- Use of Mixed Precision for Inference
Here is a short description of the relevant file:
- _biobert_ner_tf_inference.ipynb_ : BioBERT Inference with TF Checkpoint model
## 2. Quick Start Guide
### 2.a Build the BERT TensorFlow NGC container:
To run the notebook you first need to build the Bert TensorFlow container using the following command from the main directory of this repository:
``` bash
docker build . --rm -t bert
```
### 2.b Start the NGC container to run inference:
Once the image is built, you need to run the container with the `--publish
0.0.0.0:8888:8888` option to publish Jupyter's port `8888` to the host machine
at port `8888` over all network interfaces (`0.0.0.0`):
```bash
nvidia-docker run \
-v $PWD:/workspace/bert \
-v $PWD/results:/results \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--publish 0.0.0.0:8888:8888 \
-it bert:latest bash
```
Then you can use the following commands within the BERT Tensorflow container under
`/workspace/bert`:
Install spaCy. You'll use this to pre-process text and to visualize the results using displaCy.
```
pip install spacy
python -m spacy download en_core_web_sm
```
Launch Jupyter.
```bash
jupyter notebook --ip=0.0.0.0 --allow-root
```
And navigate a web browser to the IP address or hostname of the host machine
at port `8888`:
```
http://[host machine]:8888
```
Use the token listed in the output from running the `jupyter` command to log
in, for example:
```
http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b
```
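If you prefer to run the same NER inference outside of Jupyter, `run_ner.py` can also be invoked directly inside the container. The sketch below mirrors the flags used by the BioBERT inference helper script in this repository; the checkpoint, dataset and output paths are illustrative and need to point at your own downloads:
```bash
# Minimal sketch: BC5CDR-disease NER inference with run_ner.py (paths are illustrative)
export BERT_DIR=/workspace/bert/data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
python3 /workspace/bert/run_ner.py \
  --do_prepare=true \
  --do_eval=true \
  --do_predict=true \
  --task_name="bc5cdr" \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --init_checkpoint=/results/biobert_tf_uncased_base/model.ckpt-4340 \
  --data_dir=/workspace/bert/data/biobert/BC5CDR/disease \
  --output_dir=/results/biobert_ner_inference \
  --eval_batch_size=16 \
  --predict_batch_size=16 \
  --max_seq_length=128 \
  --use_fp16 --use_xla --do_lower_case=True
```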

View file

@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -26,6 +26,13 @@
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/notebooks/bert_squad_tf_inference_colab.ipynb#scrollTo=5hRb96NKE3X0\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -78,14 +85,35 @@
"source": [
"## 2. Requirements\n",
"\n",
"Please enable the GPU runtime (Runtime->Change Runtime Type)\n",
"### 2.a GPU\n",
"\n",
"Download the required files from NVIDIA-Github:"
"Before running this notebook, please set the Colab runtime environment to GPU via the menu *Runtime => Change runtime type => GPU*.\n",
"\n",
"This demo will work on any NVIDIA GPU with CUDA cores, though for improved FP16 inference, a Volta, Turing or newer generation GPU with Tensor cores is desired. On Google Colab, this normally means a T4 GPU. If you are assigned an older K80 GPU, another trial at another time might give you a T4 GPU."
]
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "hxNJ8HByw60o"
},
"source": [
"### 2.b Download the required files from NVIDIA-Github:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -100,7 +128,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -115,19 +143,6 @@
"print (os.getcwd())"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "p560UwaE6lAf"
},
"outputs": [],
"source": [
"ls"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -184,7 +199,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -228,7 +243,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -246,7 +261,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -286,7 +301,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -310,7 +325,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -321,29 +336,6 @@
"use_mixed_precision_model = True"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "tUQ1jWFHw61h"
},
"source": [
"To effectively evaluate the speedup of mixed precision try a bigger workload by uncommenting the following line:"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "VpkeBiyPw61j"
},
"outputs": [],
"source": [
"#input_file = os.path.join(working_dir, 'data/download/squad/v2.0/dev-v2.0.json')"
]
},
{
"cell_type": "markdown",
"metadata": {
@ -372,7 +364,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -417,7 +409,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -444,7 +436,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -474,7 +466,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -555,7 +547,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
@ -624,7 +616,7 @@
},
{
"cell_type": "code",
"execution_count": 0,
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",

View file

@ -0,0 +1,610 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# BioBERT Named-Entity Recognition Inference with Mixed Precision\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Overview\n",
"\n",
"Bidirectional Embedding Representations from Transformers (BERT), is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. \n",
"\n",
"BioBERT is a domain specific version of BERT that has been trained on PubMed abstracts.\n",
"\n",
"The original BioBERT paper can be found here: https://arxiv.org/abs/1901.08746\n",
"\n",
"NVIDIA's BioBERT is an optimized version of the implementation presented in the paper, leveraging mixed precision arithmetic and tensor cores on V100 GPUS for faster training times while maintaining target accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.a Learning objectives\n",
"\n",
"This notebook demonstrates:\n",
"- Inference on NER task with BioBERT model\n",
"- The use/download of fine-tuned NVIDIA BioBERT models\n",
"- Use of Mixed Precision for Inference"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Requirements\n",
"\n",
"Please refer to the ReadMe file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. BioBERT Inference: Named-Entity Recognition\n",
"\n",
"We can run inference on a fine-tuned BioBERT model for tasks like Named-Entity Recognition.\n",
"\n",
"Here we use a BioBERT model fine-tuned on a [BC5CDR-disease Dataset](https://www.ncbi.nlm.nih.gov/research/bionlp/Data/) which consists of 1500 PubMed articles with 5818 annotated diseases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.a Extract Disease Information from Text\n",
"\n",
"In this example we will use Named-Entity Recognition model created using BioBERT to extract disease information from the following paragraph:\n",
"\n",
"**Input Text**\n",
"\n",
"_\"The authors describe the case of a 56 - year - old woman with chronic, severe heart failure \n",
"secondary to dilated cardiomyopathy and absence of significant ventricular arrhythmias \n",
"who developed QT prolongation and torsade de pointes ventricular tachycardia during one cycle \n",
"of intermittent low dose (2.5 mcg/kg per min) dobutamine. \n",
"This report of torsade de pointes ventricular tachycardia during intermittent dobutamine \n",
"supports the hypothesis that unpredictable fatal arrhythmias may occur even with low doses \n",
"and in patients with no history of significant rhythm disturbances.\n",
"The mechanisms of proarrhythmic effects of Dubutamine are discussed.\"_\n",
"\n",
"**Output visualized using displaCy**\n",
"\n",
"<div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">The authors describe the case of a 56 year old woman with chronic , severe \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" heart failure \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"secondary to \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" dilated cardiomyopathy \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"and absence of significant \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" ventricular arrhythmias \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"who developed QT \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" prolongation \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"and torsade de pointes \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" ventricular tachycardia \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"during one cycle of intermittent low dose ( 2.5 mcg / kg per min ) dobutamine . This report of torsade de pointes \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" ventricular tachycardia \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"during intermittent dobutamine supports the hypothesis that unpredictable fatal \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" arrhythmias \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
"may occur even with low doses and in patients with no history of significant \n",
"<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em; box-decoration-break: clone; -webkit-box-decoration-break: clone\">\n",
" rhythm disturbances \n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">DISEASE</span>\n",
"</mark>\n",
". The mechanisms of proarrhythmic effects of Dubutamine are discussed . </div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text= \"\"\"\n",
"The authors describe the case of a 56 year old woman with chronic, severe heart failure\n",
"secondary to dilated cardiomyopathy and absence of significant ventricular arrhythmias\n",
"who developed QT prolongation and torsade de pointes ventricular tachycardia during one cycle\n",
"of intermittent low dose (2.5 mcg/kg per min) dobutamine.\n",
"This report of torsade de pointes ventricular tachycardia during intermittent dobutamine\n",
"supports the hypothesis that unpredictable fatal arrhythmias may occur even with low doses\n",
"and in patients with no history of significant rhythm disturbances.\n",
"The mechanisms of proarrhythmic effects of Dubutamine are discussed.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"notebooks_dir = '/workspace/bert/notebooks'\n",
"working_dir = '/workspace/bert'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert the text into the IOB tags format seen during training, using dummy placeholder labels\n",
"import spacy\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"\n",
"text = text.strip()\n",
"doc = nlp(text)\n",
"input_file = os.path.join(notebooks_dir, 'input.tsv')\n",
"with open(os.path.join(input_file), 'w') as wf: \n",
" for word in doc:\n",
" if word.text is '\\n':\n",
" continue\n",
" wf.write(word.text + '\\tO\\n')\n",
" wf.write('\\n') # Indicate end of text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.b Mixed Precision\n",
"\n",
"Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of tensor cores in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures.\n",
"\n",
"For information about:\n",
"- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.\n",
"- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.\n",
"- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we control mixed precision execution with the environmental variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"TF_ENABLE_AUTO_MIXED_PRECISION\"] = \"1\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model we'll use was trained with mixed precision model, which takes much less time to train than the fp32 version, without losing accuracy. So we'll need to set with the following flag: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"use_mixed_precision_model = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Fine-Tuned NVIDIA BioBERT TF Models\n",
"\n",
"We have the following Named Entity Reconition models fine-tuned from BioBERT available on NGC (NVIDIA GPU Cluster, https://ngc.nvidia.com).\n",
"\n",
"| **Model** | **Description** |\n",
"|:---------:|:----------:|\n",
"|BioBERT NER BC5CDR Disease | NER model to extract disease information from text, trained on the BC5CDR-Disease dataset |\n",
"|BioBERT NER BC5CDR Chemical | NER model to extract chemical information from text, trained on the BC5CDR-Chemical dataset. |\n",
"\n",
"\n",
"For this exampple, we will download the Diease NER model trained from the BC5CDR-disease Dataset.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# biobert_uncased_base_ner_disease\n",
"DATA_DIR_FP16='/workspace/bert/data/download/finetuned_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/biobert_uncased_base_ner_disease.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/biobert_uncased_base_ner_disease/versions/1/zip\n",
"!unzip -n -d $DATA_DIR_FP16/ $DATA_DIR_FP16/biobert_uncased_base_ner_disease.zip "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the code that follows we will refer to these models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Running NER task inference\n",
"\n",
"In order to run NER inference we will follow step-by-step the flow implemented in run_ner.py."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.a Configure Things"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import run_ner\n",
"from run_ner import BC5CDRProcessor, model_fn_builder, file_based_input_fn_builder, filed_based_convert_examples_to_features, result_to_pair\n",
"\n",
"import os, sys\n",
"import time\n",
"\n",
"import tensorflow as tf\n",
"import modeling\n",
"import tokenization\n",
"\n",
"tf.logging.set_verbosity(tf.logging.ERROR)\n",
"\n",
"# Create the output directory where all the results are saved.\n",
"output_dir = os.path.join(working_dir, 'output')\n",
"tf.gfile.MakeDirs(output_dir)\n",
"\n",
"# The config json file corresponding to the pre-trained BERT model.\n",
"# This specifies the model architecture.\n",
"bert_config_file = os.path.join(DATA_DIR_FP16, 'bert_config.json')\n",
"\n",
"# The vocabulary file that the BERT model was trained on.\n",
"vocab_file = os.path.join(DATA_DIR_FP16, 'vocab.txt')\n",
"\n",
"init_checkpoint = os.path.join(DATA_DIR_FP16, 'model.ckpt-10251')\n",
"\n",
"# Whether to lower case the input text. \n",
"# Should be True for uncased models and False for cased models.\n",
"# The BioBERT available in NGC is uncased\n",
"do_lower_case = True\n",
" \n",
"# Total batch size for predictions\n",
"predict_batch_size = 1\n",
"params = dict([('batch_size', predict_batch_size)])\n",
"\n",
"# The maximum total input sequence length after WordPiece tokenization. \n",
"# Sequences longer than this will be truncated, and sequences shorter than this will be padded.\n",
"max_seq_length = 128\n",
"\n",
"# This is a WA to use flags from here:\n",
"flags = tf.flags\n",
"\n",
"if 'f' not in tf.flags.FLAGS: \n",
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"\n",
"FLAGS.output_dir = output_dir"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.b Define Tokenizer & Create Estimator"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Validate the casing config consistency with the checkpoint name.\n",
"tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)\n",
"\n",
"# Create the tokenizer.\n",
"tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)\n",
"\n",
"# Load the configuration from file\n",
"bert_config = modeling.BertConfig.from_json_file(bert_config_file)\n",
"\n",
"\n",
"# Use the data processor for BC5CDR\n",
"processor = BC5CDRProcessor()\n",
"# Get labels in the index order that was used during training\n",
"label_list = processor.get_labels()\n",
"\n",
"# Reverse index the labels. This will be used later when evaluating predictions.\n",
"id2label = {}\n",
"for (i, label) in enumerate(label_list, 1):\n",
" id2label[i] = label\n",
"\n",
"\n",
"config = tf.ConfigProto(log_device_placement=True) \n",
"run_config = tf.estimator.RunConfig(\n",
" model_dir=None,\n",
" session_config=config,\n",
" save_checkpoints_steps=1000,\n",
" keep_checkpoint_max=1)\n",
"\n",
"\n",
"# Use model function builder to create the model function\n",
"model_fn = model_fn_builder(\n",
" bert_config=bert_config,\n",
" num_labels=len(label_list) + 1,\n",
" init_checkpoint=init_checkpoint,\n",
" use_fp16=use_mixed_precision_model)\n",
"\n",
"estimator = tf.estimator.Estimator(\n",
" model_fn=model_fn,\n",
" config=run_config,\n",
" params=params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.c Run Inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load the input data using the BC5CDR processor\n",
"predict_examples = processor.get_test_examples(notebooks_dir, file_name='input.tsv')\n",
"\n",
"\n",
"# Convert to tf_records and save it\n",
"predict_file = os.path.join(output_dir, \"predict.tf_record\")\n",
"filed_based_convert_examples_to_features(predict_examples, label_list,\n",
" max_seq_length, tokenizer,\n",
" predict_file)\n",
"\n",
"\n",
"tf.logging.info(\"***** Running predictions *****\")\n",
"tf.logging.info(\" Num orig examples = %d\", len(predict_examples))\n",
"tf.logging.info(\" Batch size = %d\", predict_batch_size)\n",
"\n",
"# Run prediction on this tf_record file\n",
"predict_input_fn = file_based_input_fn_builder(\n",
" input_file=predict_file,\n",
" batch_size=predict_batch_size,\n",
" seq_length=max_seq_length,\n",
" is_training=False,\n",
" drop_remainder=False)\n",
"\n",
"\n",
"pred_start_time = time.time()\n",
"\n",
"predictions = estimator.predict(input_fn=predict_input_fn)\n",
"predictions = list(predictions)\n",
"\n",
"pred_time_elapsed = time.time() - pred_start_time\n",
"\n",
"tf.logging.info(\"-----------------------------\")\n",
"tf.logging.info(\"Total Inference Time = %0.2f\", pred_time_elapsed)\n",
"# tf.logging.info(\"Inference Performance = %0.4f sentences/sec\", avg_sentences_per_second)\n",
"tf.logging.info(\"-----------------------------\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.d Save Predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's now process the predictions and save them to file(s)\n",
"tf.logging.info(\"Save Predictions:\")\n",
"\n",
"# File containing the list of predictions as IOB tags\n",
"output_predict_file = os.path.join(FLAGS.output_dir, \"label_test.txt\")\n",
"# File containing the list of words, the dummy token and the predicted IOB tag\n",
"test_labels_file = os.path.join(FLAGS.output_dir, \"test_labels.txt\")\n",
"test_labels_err_file = os.path.join(FLAGS.output_dir, \"test_labels_errs.txt\")\n",
"\n",
"with tf.gfile.Open(output_predict_file, 'w') as writer, \\\n",
" tf.gfile.Open(test_labels_file, 'w') as tl, \\\n",
" tf.gfile.Open(test_labels_err_file, 'w') as tle:\n",
" i=0\n",
" for prediction in estimator.predict(input_fn=predict_input_fn, yield_single_examples=True):\n",
" output_line = \"\\n\".join(id2label[id] for id in prediction if id != 0) + \"\\n\"\n",
" writer.write(output_line)\n",
" result_to_pair(predict_examples[i], prediction, id2label, tl, tle)\n",
" i = i + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.e Visualize Predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's create a function that can formats the predictions for display using displaCy\n",
"def predictions_for_displacy(predict_examples, predictions, id2label):\n",
" processed_text = ''\n",
" entities = []\n",
" current_pos = 0\n",
" start_pos = 0\n",
" end_pos = 0\n",
" end_detected = False\n",
" prev_label = ''\n",
"\n",
" for predict_line, pred_ids in zip(predict_examples, predictions):\n",
" words = str(predict_line.text).split(' ')\n",
" labels = str(predict_line.label).split(' ')\n",
"\n",
" # get from CLS to SEP\n",
" pred_labels = []\n",
" for id in pred_ids:\n",
" if id == 0:\n",
" continue\n",
" curr_label = id2label[id]\n",
" if curr_label == '[CLS]':\n",
" continue\n",
" elif curr_label == '[SEP]':\n",
" break\n",
" elif curr_label == 'X':\n",
" continue\n",
" pred_labels.append(curr_label)\n",
"\n",
" for tok, label, pred_label in zip(words, labels, pred_labels):\n",
" if pred_label is 'B':\n",
" start_pos = current_pos\n",
" elif pred_label is 'I' and prev_label is not 'B' and prev_label is not 'I':\n",
" start_pos = current_pos\n",
" elif pred_label is 'O' and (prev_label is 'B' or prev_label is 'I'):\n",
" end_pos = current_pos\n",
" end_detected = True\n",
"\n",
" if end_detected:\n",
" entities.append({'start':start_pos, 'end': end_pos, 'label': 'DISEASE'})\n",
" start_pos = 0\n",
" end_pos = 0\n",
" end_detected = False\n",
"\n",
" processed_text = processed_text + tok + ' '\n",
" current_pos = current_pos + len(tok) + 1\n",
" prev_label = pred_label\n",
"\n",
" #Handle entity at the very end\n",
" if start_pos > 0 and end_detected is False:\n",
" entities.append({'start':start_pos, 'end': current_pos, 'label': 'DISEASE'})\n",
" \n",
" displacy_input = [{\"text\": processed_text,\n",
" \"ents\": entities,\n",
" \"title\": None}]\n",
" \n",
" return displacy_input"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert the predictions to the Named Entities format required by displaCy and visualize\n",
"displacy_input = predictions_for_displacy(predict_examples, predictions, id2label)\n",
"html = spacy.displacy.render(displacy_input, style=\"ent\", manual=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. What's next"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you are familiar with running NER Inference on BioBERT, using mixed precision, you may want to try extracting disease information from other biomedical text. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View file

@ -1,2 +1,9 @@
tensorflow >= 1.11.0 # CPU Version of TensorFlow.
# tensorflow-gpu >= 1.11.0 # GPU version of TensorFlow.
toposort
networkx
pytest
nltk
tqdm
html2text
progressbar

View file

@ -0,0 +1,872 @@
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
"""
Copyright 2018 The Google AI Language Team Authors.
BASED ON Google_BERT.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import os, sys
import pickle
import tensorflow as tf
import numpy as np
sys.path.append("/workspace/bert")
from biobert.conlleval import evaluate, report_notprint
import modeling
import optimization
import tokenization
import tf_metrics
import time
import horovod.tensorflow as hvd
from utils.utils import LogEvalRunHook, LogTrainRunHook
flags = tf.flags
FLAGS = flags.FLAGS
flags.DEFINE_string(
"task_name", "NER", "The name of the task to train."
)
flags.DEFINE_string(
"data_dir", None,
"The input datadir.",
)
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written."
)
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model."
)
flags.DEFINE_string(
"vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model)."
)
flags.DEFINE_bool(
"do_lower_case", False,
"Whether to lower case the input text."
)
flags.DEFINE_integer(
"max_seq_length", 128,
"The maximum total input sequence length after WordPiece tokenization."
)
flags.DEFINE_bool(
"do_train", False,
"Whether to run training."
)
flags.DEFINE_bool(
"do_eval", False,
"Whether to run eval on the dev set.")
flags.DEFINE_bool(
"do_predict", False,
"Whether to run the model in inference mode on the test set.")
flags.DEFINE_integer(
"train_batch_size", 64,
"Total batch size for training.")
flags.DEFINE_integer(
"eval_batch_size", 16,
"Total batch size for eval.")
flags.DEFINE_integer(
"predict_batch_size", 16,
"Total batch size for predict.")
flags.DEFINE_float(
"learning_rate", 5e-6,
"The initial learning rate for Adam.")
flags.DEFINE_float(
"num_train_epochs", 10.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer(
"save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer(
"iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text = text
self.label = label
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self, input_ids, input_mask, segment_ids, label_ids, ):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_ids = label_ids
# self.label_mask = label_mask
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_data(cls, input_file):
"""Reads a BIO data."""
with tf.gfile.Open(input_file, "r") as f:
lines = []
words = []
labels = []
for line in f:
contends = line.strip()
if len(contends) == 0:
assert len(words) == len(labels)
if len(words) > 30:
# split if the sentence is longer than 30
while len(words) > 30:
tmplabel = labels[:30]
for iidx in range(len(tmplabel)):
if tmplabel.pop() == 'O':
break
l = ' '.join(
[label for label in labels[:len(tmplabel) + 1] if len(label) > 0])
w = ' '.join(
[word for word in words[:len(tmplabel) + 1] if len(word) > 0])
lines.append([l, w])
words = words[len(tmplabel) + 1:]
labels = labels[len(tmplabel) + 1:]
if len(words) == 0:
continue
l = ' '.join([label for label in labels if len(label) > 0])
w = ' '.join([word for word in words if len(word) > 0])
lines.append([l, w])
words = []
labels = []
continue
word = line.strip().split()[0]
label = line.strip().split()[-1]
words.append(word)
labels.append(label)
return lines
class BC5CDRProcessor(DataProcessor):
def get_train_examples(self, data_dir):
l1 = self._read_data(os.path.join(data_dir, "train.tsv"))
l2 = self._read_data(os.path.join(data_dir, "devel.tsv"))
return self._create_example(l1 + l2, "train")
def get_dev_examples(self, data_dir, file_name="devel.tsv"):
return self._create_example(
self._read_data(os.path.join(data_dir, file_name)), "dev"
)
def get_test_examples(self, data_dir, file_name="test.tsv"):
return self._create_example(
self._read_data(os.path.join(data_dir, file_name)), "test")
def get_labels(self):
return ["B", "I", "O", "X", "[CLS]", "[SEP]"]
def _create_example(self, lines, set_type):
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text = tokenization.convert_to_unicode(line[1])
label = tokenization.convert_to_unicode(line[0])
examples.append(InputExample(guid=guid, text=text, label=label))
return examples
class CLEFEProcessor(DataProcessor):
def get_train_examples(self, data_dir):
lines1 = self._read_data2(os.path.join(data_dir, "Training.tsv"))
lines2 = self._read_data2(os.path.join(data_dir, "Development.tsv"))
return self._create_example(
lines1 + lines2, "train"
)
def get_dev_examples(self, data_dir, file_name="Development.tsv"):
return self._create_example(
self._read_data2(os.path.join(data_dir, file_name)), "dev"
)
def get_test_examples(self, data_dir, file_name="Test.tsv"):
return self._create_example(
self._read_data2(os.path.join(data_dir, file_name)), "test")
def get_labels(self):
return ["B", "I", "O", "X", "[CLS]", "[SEP]"]
def _create_example(self, lines, set_type):
examples = []
for (i, line) in enumerate(lines):
guid = "%s-%s" % (set_type, i)
text = tokenization.convert_to_unicode(line[1])
label = tokenization.convert_to_unicode(line[0])
examples.append(InputExample(guid=guid, text=text, label=label))
return examples
@classmethod
def _read_data2(cls, input_file):
with tf.gfile.Open(input_file, "r") as f:
lines = []
words = []
labels = []
for line in f:
contends = line.strip()
if len(contends) == 0:
assert len(words) == len(labels)
if len(words) == 0:
continue
l = ' '.join([label for label in labels if len(label) > 0])
w = ' '.join([word for word in words if len(word) > 0])
lines.append([l, w])
words = []
labels = []
continue
elif contends.startswith('###'):
continue
word = line.strip().split()[0]
label = line.strip().split()[-1]
words.append(word)
labels.append(label)
return lines
class I2b22012Processor(CLEFEProcessor):
def get_labels(self):
return ['B-CLINICAL_DEPT', 'B-EVIDENTIAL', 'B-OCCURRENCE', 'B-PROBLEM', 'B-TEST', 'B-TREATMENT', 'I-CLINICAL_DEPT', 'I-EVIDENTIAL', 'I-OCCURRENCE', 'I-PROBLEM', 'I-TEST', 'I-TREATMENT', "O", "X", "[CLS]", "[SEP]"]
def write_tokens(tokens, labels, mode):
if mode == "test":
path = os.path.join(FLAGS.output_dir, "token_" + mode + ".txt")
if tf.gfile.Exists(path):
wf = tf.gfile.Open(path, 'a')
else:
wf = tf.gfile.Open(path, 'w')
for token, label in zip(tokens, labels):
if token != "**NULL**":
wf.write(token + ' ' + str(label) + '\n')
wf.close()
def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer, mode):
label_map = {}
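# Label ids start at 1; id 0 is reserved for padding positions.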
for (i, label) in enumerate(label_list, 1):
label_map[label] = i
label2id_file = os.path.join(FLAGS.output_dir, 'label2id.pkl')
if not tf.gfile.Exists(label2id_file):
with tf.gfile.Open(label2id_file, 'wb') as w:
pickle.dump(label_map, w)
textlist = example.text.split(' ')
labellist = example.label.split(' ')
tokens = []
labels = []
for i, word in enumerate(textlist):
token = tokenizer.tokenize(word)
tokens.extend(token)
label_1 = labellist[i]
for m in range(len(token)):
if m == 0:
labels.append(label_1)
else:
labels.append("X")
# tokens = tokenizer.tokenize(example.text)
if len(tokens) >= max_seq_length - 1:
tokens = tokens[0:(max_seq_length - 2)]
labels = labels[0:(max_seq_length - 2)]
ntokens = []
segment_ids = []
label_ids = []
ntokens.append("[CLS]")
segment_ids.append(0)
# append("O") or append("[CLS]") not sure!
label_ids.append(label_map["[CLS]"])
for i, token in enumerate(tokens):
ntokens.append(token)
segment_ids.append(0)
label_ids.append(label_map[labels[i]])
ntokens.append("[SEP]")
segment_ids.append(0)
# append("O") or append("[SEP]") not sure!
label_ids.append(label_map["[SEP]"])
input_ids = tokenizer.convert_tokens_to_ids(ntokens)
input_mask = [1] * len(input_ids)
# label_mask = [1] * len(input_ids)
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
# we are not concerned about padding labels
label_ids.append(0)
ntokens.append("**NULL**")
# label_mask.append(0)
# print(len(input_ids))
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
assert len(label_ids) == max_seq_length
# assert len(label_mask) == max_seq_length
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label_ids: %s" % " ".join([str(x) for x in label_ids]))
# tf.logging.info("label_mask: %s" % " ".join([str(x) for x in label_mask]))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_ids=label_ids,
# label_mask = label_mask
)
# write_tokens(ntokens, label_ids, mode)
return feature
def filed_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file, mode=None):
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
if ex_index % 5000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer,
mode)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature(feature.label_ids)
# features["label_mask"] = create_int_feature(feature.label_mask)
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
def file_based_input_fn_builder(input_file, batch_size, seq_length, is_training, drop_remainder, hvd=None):
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([seq_length], tf.int64),
# "label_ids":tf.VarLenFeature(tf.int64),
# "label_mask": tf.FixedLenFeature([seq_length], tf.int64),
}
def _decode_record(record, name_to_features):
example = tf.parse_single_example(record, name_to_features)
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
#batch_size = params["batch_size"]
d = tf.data.TFRecordDataset(input_file)
if is_training:
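# Shard the dataset across Horovod ranks so each worker trains on a distinct subset.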
if hvd is not None: d = d.shard(hvd.size(), hvd.rank())
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder
))
return d
return input_fn
def create_model(bert_config, is_training, input_ids, input_mask,
segment_ids, labels, num_labels, use_one_hot_embeddings):
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings
)
output_layer = model.get_sequence_output()
hidden_size = output_layer.shape[-1].value
output_weight = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02)
)
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer()
)
with tf.variable_scope("loss"):
if is_training:
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
output_layer = tf.reshape(output_layer, [-1, hidden_size])
logits = tf.matmul(output_layer, output_weight, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
logits = tf.reshape(logits, [-1, FLAGS.max_seq_length, num_labels])
# mask = tf.cast(input_mask,tf.float32)
# loss = tf.contrib.seq2seq.sequence_loss(logits,labels,mask)
# return (loss, logits, predict)
##########################################################################
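# Token-level cross-entropy: compare one-hot labels with log-softmax logits and average over all positions.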
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
probabilities = tf.nn.softmax(logits, axis=-1)
predict = tf.argmax(probabilities, axis=-1)
return (loss, per_example_loss, logits, predict)
##########################################################################
def model_fn_builder(bert_config, num_labels, init_checkpoint=None, learning_rate=None,
num_train_steps=None, num_warmup_steps=None,
use_one_hot_embeddings=False, hvd=None, use_fp16=False):
def model_fn(features, labels, mode, params):
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
# label_mask = features["label_mask"]
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, logits, predicts) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint and (hvd is None or hvd.rank() == 0):
(assignment_map,
initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars,
init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, hvd, False, use_fp16)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(per_example_loss, label_ids, logits):
# def metric_fn(label_ids, logits):
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
precision = tf_metrics.precision(label_ids, predictions, num_labels, [1, 2], average="macro")
recall = tf_metrics.recall(label_ids, predictions, num_labels, [1, 2], average="macro")
f = tf_metrics.f1(label_ids, predictions, num_labels, [1, 2], average="macro")
#
return {
"eval_precision": precision,
"eval_recall": recall,
"eval_f": f,
# "eval_loss": loss,
}
eval_metric_ops = metric_fn(per_example_loss, label_ids, logits)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
eval_metric_ops=eval_metric_ops)
else:
output_spec = tf.estimator.EstimatorSpec(
mode=mode, predictions=predicts)#probabilities)
return output_spec
return model_fn
def result_to_pair(predict_line, pred_ids, id2label, writer, err_writer):
words = str(predict_line.text).split(' ')
labels = str(predict_line.label).split(' ')
if len(words) != len(labels):
tf.logging.error('Text and label not equal')
tf.logging.error(predict_line.text)
tf.logging.error(predict_line.label)
exit(1)
# get from CLS to SEP
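# Drop padding (id 0), [CLS], and subword labels ('X'); stop when [SEP] is reached.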
pred_labels = []
for id in pred_ids:
if id == 0:
continue
curr_label = id2label[id]
if curr_label == '[CLS]':
continue
elif curr_label == '[SEP]':
break
elif curr_label == 'X':
continue
pred_labels.append(curr_label)
if len(pred_labels) > len(words):
err_writer.write(predict_line.guid + '\n')
err_writer.write(predict_line.text + '\n')
err_writer.write(predict_line.label + '\n')
err_writer.write(' '.join([str(i) for i in pred_ids]) + '\n')
err_writer.write(' '.join([id2label.get(i, '**NULL**') for i in pred_ids]) + '\n\n')
pred_labels = pred_labels[:len(words)]
elif len(pred_labels) < len(words):
err_writer.write(predict_line.guid + '\n')
err_writer.write(predict_line.text + '\n')
err_writer.write(predict_line.label + '\n')
err_writer.write(' '.join([str(i) for i in pred_ids]) + '\n')
err_writer.write(' '.join([id2label.get(i, '**NULL**') for i in pred_ids]) + '\n\n')
pred_labels += ['O'] * (len(words) - len(pred_labels))
for tok, label, pred_label in zip(words, labels, pred_labels):
writer.write(tok + ' ' + label + ' ' + pred_label + '\n')
writer.write('\n')
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
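# Enable TensorFlow's automatic mixed precision graph rewrite for fp16 execution.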
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"bc5cdr": BC5CDRProcessor,
"clefe": CLEFEProcessor,
'i2b2': I2b22012Processor
}
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
if FLAGS.max_seq_length > bert_config.max_position_embeddings:
raise ValueError(
"Cannot use sequence length %d because the BERT model "
"was only trained up to sequence length %d" %
(FLAGS.max_seq_length, bert_config.max_position_embeddings))
task_name = FLAGS.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
tf.gfile.MakeDirs(FLAGS.output_dir)
processor = processors[task_name]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
master_process = True
training_hooks = []
global_batch_size = FLAGS.train_batch_size
hvd_rank = 0
config = tf.ConfigProto()
if FLAGS.horovod:
global_batch_size = FLAGS.train_batch_size * hvd.size()
master_process = (hvd.rank() == 0)
hvd_rank = hvd.rank()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
if hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
if FLAGS.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
run_config = tf.estimator.RunConfig(
model_dir=FLAGS.output_dir if master_process else None,
session_config=config,
save_checkpoints_steps=FLAGS.save_checkpoints_steps if master_process else None,
keep_checkpoint_max=1)
if master_process:
tf.logging.info("***** Configuaration *****")
for key in FLAGS.__flags.keys():
tf.logging.info(' {}: {}'.format(key, getattr(FLAGS, key)))
tf.logging.info("**************************")
train_examples = None
num_train_steps = None
num_warmup_steps = None
training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank))
if FLAGS.do_train:
train_examples = processor.get_train_examples(FLAGS.data_dir)
num_train_steps = int(
len(train_examples) / global_batch_size * FLAGS.num_train_epochs)
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
start_index = 0
end_index = len(train_examples)
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record")]
if FLAGS.horovod:
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.size())]
num_examples_per_rank = len(train_examples) // hvd.size()
remainder = len(train_examples) % hvd.size()
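# Spread the remainder over the first ranks so every example is assigned to exactly one worker.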
if hvd.rank() < remainder:
start_index = hvd.rank() * (num_examples_per_rank+1)
end_index = start_index + num_examples_per_rank + 1
else:
start_index = hvd.rank() * num_examples_per_rank + remainder
end_index = start_index + (num_examples_per_rank)
model_fn = model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list) + 1,
init_checkpoint=FLAGS.init_checkpoint,
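# Scale the learning rate linearly with the number of Horovod workers.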
learning_rate=FLAGS.learning_rate if not FLAGS.horovod else FLAGS.learning_rate * hvd.size(),
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_one_hot_embeddings=False,
hvd=None if not FLAGS.horovod else hvd,
use_fp16=FLAGS.use_fp16)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config)
if FLAGS.do_train:
#train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
#filed_based_convert_examples_to_features(
# train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
filed_based_convert_examples_to_features(
train_examples[start_index:end_index], label_list, FLAGS.max_seq_length, tokenizer, tmp_filenames[hvd_rank])
tf.logging.info("***** Running training *****")
tf.logging.info(" Num examples = %d", len(train_examples))
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = file_based_input_fn_builder(
input_file=tmp_filenames, #train_file,
batch_size=FLAGS.train_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True,
hvd=None if not FLAGS.horovod else hvd)
#estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
train_start_time = time.time()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=training_hooks)
train_time_elapsed = time.time() - train_start_time
train_time_wo_overhead = training_hooks[-1].total_time
avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
ss_sentences_per_second = (num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
if master_process:
tf.logging.info("-----------------------------")
tf.logging.info("Total Training Time = %0.2f for Sentences = %d", train_time_elapsed,
num_train_steps * global_batch_size)
tf.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d", train_time_wo_overhead,
(num_train_steps - training_hooks[-1].skipped) * global_batch_size)
tf.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
if FLAGS.do_eval and master_process:
eval_examples = processor.get_dev_examples(FLAGS.data_dir)
eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
filed_based_convert_examples_to_features(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Num examples = %d", len(eval_examples))
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
eval_steps = None
eval_drop_remainder = False
eval_input_fn = file_based_input_fn_builder(
input_file=eval_file,
batch_size=FLAGS.eval_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=eval_drop_remainder)
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
with tf.gfile.Open(output_eval_file, "w") as writer:
tf.logging.info("***** Eval results *****")
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if FLAGS.do_predict and master_process:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
filed_based_convert_examples_to_features(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file, mode="test")
with tf.gfile.Open(os.path.join(FLAGS.output_dir, 'label2id.pkl'), 'rb') as rf:
label2id = pickle.load(rf)
id2label = {value: key for key, value in label2id.items()}
token_path = os.path.join(FLAGS.output_dir, "token_test.txt")
if tf.gfile.Exists(token_path):
tf.gfile.Remove(token_path)
tf.logging.info("***** Running prediction*****")
tf.logging.info(" Num examples = %d", len(predict_examples))
tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
predict_drop_remainder = False
predict_input_fn = file_based_input_fn_builder(
input_file=predict_file,
batch_size=FLAGS.predict_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=predict_drop_remainder)
eval_hooks = [LogEvalRunHook(FLAGS.predict_batch_size)]
eval_start_time = time.time()
output_predict_file = os.path.join(FLAGS.output_dir, "label_test.txt")
test_labels_file = os.path.join(FLAGS.output_dir, "test_labels.txt")
test_labels_err_file = os.path.join(FLAGS.output_dir, "test_labels_errs.txt")
with tf.gfile.Open(output_predict_file, 'w') as writer, \
tf.gfile.Open(test_labels_file, 'w') as tl, \
tf.gfile.Open(test_labels_err_file, 'w') as tle:
print(id2label)
i=0
for prediction in estimator.predict(input_fn=predict_input_fn, hooks=eval_hooks,
yield_single_examples=True):
output_line = "\n".join(id2label[id] for id in prediction if id != 0) + "\n"
writer.write(output_line)
result_to_pair(predict_examples[i], prediction, id2label, tl, tle)
i = i + 1
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
avg = np.mean(time_list)
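# time_list is sorted ascending, so the max of the first p fraction is the p-th percentile batch latency.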
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
cf_90 = max(time_list[:int(len(time_list) * 0.90)])
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
tf.logging.info("-----------------------------")
tf.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
tf.logging.info("Summary Inference Statistics")
tf.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
tf.logging.info("Latency Confidence Level 50 (ms) = %0.2f", cf_50 * 1000)
tf.logging.info("Latency Confidence Level 90 (ms) = %0.2f", cf_90 * 1000)
tf.logging.info("Latency Confidence Level 95 (ms) = %0.2f", cf_95 * 1000)
tf.logging.info("Latency Confidence Level 99 (ms) = %0.2f", cf_99 * 1000)
tf.logging.info("Latency Confidence Level 100 (ms) = %0.2f", cf_100 * 1000)
tf.logging.info("Latency Average (ms) = %0.2f", avg * 1000)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
tf.logging.info('Reading: %s', test_labels_file)
with tf.gfile.Open(test_labels_file, "r") as f:
counts = evaluate(f)
eval_result = report_notprint(counts)
print(''.join(eval_result))
with tf.gfile.Open(os.path.join(FLAGS.output_dir, 'test_results_conlleval.txt'), 'w') as fd:
fd.write(''.join(eval_result))
if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("task_name")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()

View file

@ -26,6 +26,8 @@ import modeling
import optimization
import tensorflow as tf
import glob
from utils.utils import LogEvalRunHook
from tensorflow.core.protobuf import rewriter_config_pb2
flags = tf.flags
@ -244,6 +246,7 @@ def model_fn_builder(bert_config, init_checkpoint, learning_rate,
initialized_variable_names = {}
if init_checkpoint and (hvd is None or hvd.rank() == 0):
print("Loading checkpoint", init_checkpoint)
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
@ -528,7 +531,9 @@ def main(_):
tf.logging.info("**************************")
# config.gpu_options.per_process_gpu_memory_fraction = 0.7
if FLAGS.use_xla: config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
if FLAGS.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
config.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.NO_MEM_OPT
run_config = tf.estimator.RunConfig(
model_dir=FLAGS.output_dir,
@ -590,8 +595,29 @@ def main(_):
is_training=False,
hvd=None if not FLAGS.horovod else hvd)
eval_hooks = [LogEvalRunHook(FLAGS.eval_batch_size)]
eval_start_time = time.time()
result = estimator.evaluate(
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps, hooks=eval_hooks)
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
tf.logging.info("-----------------------------")
tf.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.eval_batch_size)
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
tf.logging.info("Summary Inference Statistics on EVAL set")
tf.logging.info("Batch size = %d", FLAGS.eval_batch_size)
tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:

View file

@ -0,0 +1,940 @@
# coding=utf-8
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Copyright 2018 The Google AI Language Team Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""BERT finetuning runner."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import csv
import logging
import os, sys
import numpy as np
import tensorflow as tf
sys.path.append("/workspace/bert")
import modeling
import optimization
import tokenization
import time
import horovod.tensorflow as hvd
from utils.utils import LogEvalRunHook, LogTrainRunHook
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"data_dir", None,
"The input data dir. Should contain the .tsv files (or other data files) "
"for the task.")
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("task_name", None, "The name of the task to train.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
## Other parameters
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 128,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool(
"do_predict", False,
"Whether to run the model in inference mode on the test set.")
flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
flags.DEFINE_float("learning_rate", 5e-6, "The initial learning rate for Adam.")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
class InputExample(object):
"""A single training/test example for simple sequence classification."""
def __init__(self, guid, text_a, text_b=None, label=None):
"""Constructs a InputExample.
Args:
guid: Unique id for the example.
text_a: string. The untokenized text of the first sequence. For single
sequence tasks, only this sequence must be specified.
text_b: (Optional) string. The untokenized text of the second sequence.
Only must be specified for sequence pair tasks.
label: (Optional) string. The label of the example. This should be
specified for train and dev examples, but not for test examples.
"""
self.guid = guid
self.text_a = text_a
self.text_b = text_b
self.label = label
class PaddingInputExample(object):
"""Fake example so the num input examples is a multiple of the batch size.
When running eval/predict on the TPU, we need to pad the number of examples
to be a multiple of the batch size, because the TPU requires a fixed batch
size. The alternative is to drop the last batch, which is bad because it means
the entire output data won't be generated.
We use this class instead of `None` because treating `None` as padding
batches could cause silent errors.
"""
class InputFeatures(object):
"""A single set of features of data."""
def __init__(self,
input_ids,
input_mask,
segment_ids,
label_id,
is_real_example=True):
self.input_ids = input_ids
self.input_mask = input_mask
self.segment_ids = segment_ids
self.label_id = label_id
self.is_real_example = is_real_example
class DataProcessor(object):
"""Base class for data converters for sequence classification data sets."""
def get_train_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the train set."""
raise NotImplementedError()
def get_dev_examples(self, data_dir):
"""Gets a collection of `InputExample`s for the dev set."""
raise NotImplementedError()
def get_test_examples(self, data_dir):
"""Gets a collection of `InputExample`s for prediction."""
raise NotImplementedError()
def get_labels(self):
"""Gets the list of labels for this data set."""
raise NotImplementedError()
@classmethod
def _read_tsv(cls, input_file, quotechar=None):
"""Reads a tab separated value file."""
with tf.gfile.Open(input_file, "r") as f:
reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
lines = []
for line in reader:
lines.append(line)
return lines
class _ChemProtProcessor(DataProcessor):
"""Processor for the ChemProt data set."""
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir, file_name="dev.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "dev")
def get_test_examples(self, data_dir, file_name="test.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "test")
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
# skip header
if i == 0:
continue
guid = line[0]
text_a = tokenization.convert_to_unicode(line[1])
if set_type == "test":
label = self.get_labels()[-1]
else:
try:
label = tokenization.convert_to_unicode(line[2])
except IndexError:
logging.exception(line)
exit(1)
examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
return examples
class ChemProtProcessor(_ChemProtProcessor):
def get_labels(self):
"""See base class."""
return ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9", "false"]
class MedNLIProcessor(DataProcessor):
def get_train_examples(self, data_dir):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
def get_dev_examples(self, data_dir, file_name="dev.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "dev")
def get_test_examples(self, data_dir, file_name="test.tsv"):
"""See base class."""
return self._create_examples(
self._read_tsv(os.path.join(data_dir, file_name)), "test")
def get_labels(self):
"""See base class."""
return ['contradiction', 'entailment', 'neutral']
def _create_examples(self, lines, set_type):
"""Creates examples for the training and dev sets."""
examples = []
for (i, line) in enumerate(lines):
if i == 0:
continue
guid = line[1]
text_a = tokenization.convert_to_unicode(line[2])
text_b = tokenization.convert_to_unicode(line[3])
if set_type == "test":
label = self.get_labels()[-1]
else:
label = tokenization.convert_to_unicode(line[0])
examples.append(
InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
return examples
def convert_single_example(ex_index, example, label_list, max_seq_length,
tokenizer):
"""Converts a single `InputExample` into a single `InputFeatures`."""
if isinstance(example, PaddingInputExample):
return InputFeatures(
input_ids=[0] * max_seq_length,
input_mask=[0] * max_seq_length,
segment_ids=[0] * max_seq_length,
label_id=0,
is_real_example=False)
label_map = {}
for (i, label) in enumerate(label_list):
label_map[label] = i
tokens_a = tokenizer.tokenize(example.text_a)
tokens_b = None
if example.text_b:
tokens_b = tokenizer.tokenize(example.text_b)
if tokens_b:
# Modifies `tokens_a` and `tokens_b` in place so that the total
# length is less than the specified length.
# Account for [CLS], [SEP], [SEP] with "- 3"
_truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
else:
# Account for [CLS] and [SEP] with "- 2"
if len(tokens_a) > max_seq_length - 2:
tokens_a = tokens_a[0:(max_seq_length - 2)]
# The convention in BERT is:
# (a) For sequence pairs:
# tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
# type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
# (b) For single sequences:
# tokens: [CLS] the dog is hairy . [SEP]
# type_ids: 0 0 0 0 0 0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.
tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
tokens.append(token)
segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)
if tokens_b:
for token in tokens_b:
tokens.append(token)
segment_ids.append(1)
tokens.append("[SEP]")
segment_ids.append(1)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# The mask has 1 for real tokens and 0 for padding tokens. Only real
# tokens are attended to.
input_mask = [1] * len(input_ids)
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
input_ids.append(0)
input_mask.append(0)
segment_ids.append(0)
assert len(input_ids) == max_seq_length
assert len(input_mask) == max_seq_length
assert len(segment_ids) == max_seq_length
label_id = label_map[example.label]
if ex_index < 5:
tf.logging.info("*** Example ***")
tf.logging.info("guid: %s" % (example.guid))
tf.logging.info("tokens: %s" % " ".join(
[tokenization.printable_text(x) for x in tokens]))
tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
feature = InputFeatures(
input_ids=input_ids,
input_mask=input_mask,
segment_ids=segment_ids,
label_id=label_id,
is_real_example=True)
return feature
def file_based_convert_examples_to_features(
examples, label_list, max_seq_length, tokenizer, output_file):
"""Convert a set of `InputExample`s to a TFRecord file."""
writer = tf.python_io.TFRecordWriter(output_file)
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
def create_int_feature(values):
f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
return f
features = collections.OrderedDict()
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)
features["segment_ids"] = create_int_feature(feature.segment_ids)
features["label_ids"] = create_int_feature([feature.label_id])
features["is_real_example"] = create_int_feature(
[int(feature.is_real_example)])
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
writer.close()
def file_based_input_fn_builder(input_file, batch_size, seq_length, is_training,
drop_remainder, hvd=None):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
name_to_features = {
"input_ids": tf.FixedLenFeature([seq_length], tf.int64),
"input_mask": tf.FixedLenFeature([seq_length], tf.int64),
"segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
"label_ids": tf.FixedLenFeature([], tf.int64),
"is_real_example": tf.FixedLenFeature([], tf.int64),
}
def _decode_record(record, name_to_features):
"""Decodes a record to a TensorFlow example."""
example = tf.parse_single_example(record, name_to_features)
# tf.Example only supports tf.int64, but the TPU only supports tf.int32.
# So cast all int64 to int32.
for name in list(example.keys()):
t = example[name]
if t.dtype == tf.int64:
t = tf.to_int32(t)
example[name] = t
return example
def input_fn(params):
"""The actual input function."""
#batch_size = params["batch_size"]
# For training, we want a lot of parallel reading and shuffling.
# For eval, we want no shuffling and parallel reading doesn't matter.
d = tf.data.TFRecordDataset(input_file)
if is_training:
if hvd is not None: d = d.shard(hvd.size(), hvd.rank())
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.apply(
tf.contrib.data.map_and_batch(
lambda record: _decode_record(record, name_to_features),
batch_size=batch_size,
drop_remainder=drop_remainder))
return d
return input_fn
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
# This is a simple heuristic which will always truncate the longer sequence
# one token at a time. This makes more sense than truncating an equal percent
# of tokens from each, since if one sequence is very short then each token
# that's truncated likely contains more information than a longer sequence.
while True:
total_length = len(tokens_a) + len(tokens_b)
if total_length <= max_length:
break
if len(tokens_a) > len(tokens_b):
tokens_a.pop()
else:
tokens_b.pop()
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
labels, num_labels, use_one_hot_embeddings):
"""Creates a classification model."""
model = modeling.BertModel(
config=bert_config,
is_training=is_training,
input_ids=input_ids,
input_mask=input_mask,
token_type_ids=segment_ids,
use_one_hot_embeddings=use_one_hot_embeddings)
# In the demo, we are doing a simple classification task on the entire
# segment.
#
# If you want to use the token-level output, use model.get_sequence_output()
# instead.
output_layer = model.get_pooled_output()
hidden_size = output_layer.shape[-1].value
output_weights = tf.get_variable(
"output_weights", [num_labels, hidden_size],
initializer=tf.truncated_normal_initializer(stddev=0.02))
output_bias = tf.get_variable(
"output_bias", [num_labels], initializer=tf.zeros_initializer())
with tf.variable_scope("loss"):
if is_training:
# I.e., 0.1 dropout
output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
logits = tf.matmul(output_layer, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_mean(per_example_loss)
return (loss, per_example_loss, logits, probabilities)
def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate=None,
num_train_steps=None, num_warmup_steps=None,
use_one_hot_embeddings=False, hvd=None, use_fp16=False):
"""Returns `model_fn` closure for TPUEstimator."""
def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
"""The `model_fn` for TPUEstimator."""
tf.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
input_ids = features["input_ids"]
input_mask = features["input_mask"]
segment_ids = features["segment_ids"]
label_ids = features["label_ids"]
is_real_example = None
if "is_real_example" in features:
is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
else:
is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
(total_loss, per_example_loss, logits, probabilities) = create_model(
bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
num_labels, use_one_hot_embeddings)
tvars = tf.trainable_variables()
initialized_variable_names = {}
scaffold_fn = None
if init_checkpoint and (hvd is None or hvd.rank() == 0):
(assignment_map, initialized_variable_names
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
tf.logging.info("**** Trainable Variables ****")
for var in tvars:
init_string = ""
if var.name in initialized_variable_names:
init_string = ", *INIT_FROM_CKPT*"
tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
init_string)
output_spec = None
if mode == tf.estimator.ModeKeys.TRAIN:
train_op = optimization.create_optimizer(
total_loss, learning_rate, num_train_steps, num_warmup_steps, hvd, False, use_fp16)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
train_op=train_op)
elif mode == tf.estimator.ModeKeys.EVAL:
def metric_fn(per_example_loss, label_ids, logits, is_real_example):
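# Padding examples have is_real_example == 0 and therefore do not contribute to the metrics.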
predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
accuracy = tf.metrics.accuracy(
labels=label_ids, predictions=predictions, weights=is_real_example)
loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
return {
"eval_accuracy": accuracy,
"eval_loss": loss,
}
eval_metric_ops = metric_fn(per_example_loss, label_ids, logits, is_real_example)
output_spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=total_loss,
eval_metric_ops=eval_metric_ops)
else:
output_spec = tf.estimator.EstimatorSpec(
mode=mode, predictions={"probabilities": probabilities})#predicts)#probabilities)
return output_spec
return model_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def input_fn_builder(features, seq_length, is_training, drop_remainder):
"""Creates an `input_fn` closure to be passed to TPUEstimator."""
all_input_ids = []
all_input_mask = []
all_segment_ids = []
all_label_ids = []
for feature in features:
all_input_ids.append(feature.input_ids)
all_input_mask.append(feature.input_mask)
all_segment_ids.append(feature.segment_ids)
all_label_ids.append(feature.label_id)
def input_fn(params):
"""The actual input function."""
batch_size = params["batch_size"]
num_examples = len(features)
# This is for demo purposes and does NOT scale to large data sets. We do
# not use Dataset.from_generator() because that uses tf.py_func which is
# not TPU compatible. The right way to load data is with TFRecordReader.
d = tf.data.Dataset.from_tensor_slices({
"input_ids":
tf.constant(
all_input_ids, shape=[num_examples, seq_length],
dtype=tf.int32),
"input_mask":
tf.constant(
all_input_mask,
shape=[num_examples, seq_length],
dtype=tf.int32),
"segment_ids":
tf.constant(
all_segment_ids,
shape=[num_examples, seq_length],
dtype=tf.int32),
"label_ids":
tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
})
if is_training:
d = d.repeat()
d = d.shuffle(buffer_size=100)
d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
return d
return input_fn
# This function is not used by this file but is still used by the Colab and
# people who depend on it.
def convert_examples_to_features(examples, label_list, max_seq_length,
tokenizer):
"""Convert a set of `InputExample`s to a list of `InputFeatures`."""
features = []
for (ex_index, example) in enumerate(examples):
if ex_index % 10000 == 0:
tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
feature = convert_single_example(ex_index, example, label_list,
max_seq_length, tokenizer)
features.append(feature)
return features
def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"chemprot": ChemProtProcessor,
'mednli': MedNLIProcessor,
}
tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
FLAGS.init_checkpoint)
if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
raise ValueError(
"At least one of `do_train`, `do_eval` or `do_predict' must be True.")
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
if FLAGS.max_seq_length > bert_config.max_position_embeddings:
raise ValueError(
"Cannot use sequence length %d because the BERT model "
"was only trained up to sequence length %d" %
(FLAGS.max_seq_length, bert_config.max_position_embeddings))
tf.gfile.MakeDirs(FLAGS.output_dir)
task_name = FLAGS.task_name.lower()
if task_name not in processors:
raise ValueError("Task not found: %s" % (task_name))
processor = processors[task_name]()
label_list = processor.get_labels()
tokenizer = tokenization.FullTokenizer(
vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
master_process = True
training_hooks = []
global_batch_size = FLAGS.train_batch_size
hvd_rank = 0
config = tf.ConfigProto()
if FLAGS.horovod:
global_batch_size = FLAGS.train_batch_size * hvd.size()
master_process = (hvd.rank() == 0)
hvd_rank = hvd.rank()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
if hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
if FLAGS.use_xla:
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
run_config = tf.estimator.RunConfig(
model_dir=FLAGS.output_dir if master_process else None,
session_config=config,
save_checkpoints_steps=FLAGS.save_checkpoints_steps if master_process else None,
keep_checkpoint_max=1)
if master_process:
tf.logging.info("***** Configuaration *****")
for key in FLAGS.__flags.keys():
tf.logging.info(' {}: {}'.format(key, getattr(FLAGS, key)))
tf.logging.info("**************************")
train_examples = None
num_train_steps = None
num_warmup_steps = None
training_hooks.append(LogTrainRunHook(global_batch_size, hvd_rank))
if FLAGS.do_train:
train_examples = processor.get_train_examples(FLAGS.data_dir)
num_train_steps = int(
len(train_examples) / global_batch_size * FLAGS.num_train_epochs)
num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
start_index = 0
end_index = len(train_examples)
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record")]
if FLAGS.horovod:
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.size())]
num_examples_per_rank = len(train_examples) // hvd.size()
remainder = len(train_examples) % hvd.size()
if hvd.rank() < remainder:
start_index = hvd.rank() * (num_examples_per_rank+1)
end_index = start_index + num_examples_per_rank + 1
else:
start_index = hvd.rank() * num_examples_per_rank + remainder
end_index = start_index + (num_examples_per_rank)
model_fn = model_fn_builder(
bert_config=bert_config,
num_labels=len(label_list),
init_checkpoint=FLAGS.init_checkpoint,
learning_rate=FLAGS.learning_rate if not FLAGS.horovod else FLAGS.learning_rate * hvd.size(),
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
use_one_hot_embeddings=False,
hvd=None if not FLAGS.horovod else hvd,
use_fp16=FLAGS.use_fp16)
estimator = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config)
if FLAGS.do_train:
file_based_convert_examples_to_features(
train_examples[start_index:end_index], label_list, FLAGS.max_seq_length, tokenizer, tmp_filenames[hvd_rank])
tf.logging.info("***** Running training *****")
tf.logging.info(" Num examples = %d", len(train_examples))
tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
tf.logging.info(" Num steps = %d", num_train_steps)
train_input_fn = file_based_input_fn_builder(
input_file=tmp_filenames,
batch_size=FLAGS.train_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=True,
drop_remainder=True,
hvd=None if not FLAGS.horovod else hvd)
train_start_time = time.time()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps, hooks=training_hooks)
train_time_elapsed = time.time() - train_start_time
train_time_wo_overhead = training_hooks[-1].total_time
avg_sentences_per_second = num_train_steps * global_batch_size * 1.0 / train_time_elapsed
ss_sentences_per_second = (num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
if master_process:
tf.logging.info("-----------------------------")
tf.logging.info("Total Training Time = %0.2f for Sentences = %d", train_time_elapsed,
num_train_steps * global_batch_size)
tf.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d", train_time_wo_overhead,
(num_train_steps - training_hooks[-1].skipped) * global_batch_size)
tf.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
if FLAGS.do_eval and master_process:
eval_examples = processor.get_dev_examples(FLAGS.data_dir)
num_actual_eval_examples = len(eval_examples)
eval_file = os.path.join(FLAGS.output_dir, "eval.tf_record")
file_based_convert_examples_to_features(
eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
tf.logging.info("***** Running evaluation *****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(eval_examples), num_actual_eval_examples,
len(eval_examples) - num_actual_eval_examples)
tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
# This tells the estimator to run through the entire set.
eval_steps = None
eval_drop_remainder = False
eval_input_fn = file_based_input_fn_builder(
input_file=eval_file,
batch_size=FLAGS.eval_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=eval_drop_remainder)
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:
tf.logging.info("***** Eval results *****")
for key in sorted(result.keys()):
tf.logging.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
if FLAGS.do_predict and master_process:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
num_actual_predict_examples = len(predict_examples)
predict_file = os.path.join(FLAGS.output_dir, "predict.tf_record")
file_based_convert_examples_to_features(predict_examples, label_list,
FLAGS.max_seq_length, tokenizer,
predict_file)
tf.logging.info("***** Running prediction*****")
tf.logging.info(" Num examples = %d (%d actual, %d padding)",
len(predict_examples), num_actual_predict_examples,
len(predict_examples) - num_actual_predict_examples)
tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
predict_drop_remainder = False
predict_input_fn = file_based_input_fn_builder(
input_file=predict_file,
batch_size=FLAGS.predict_batch_size,
seq_length=FLAGS.max_seq_length,
is_training=False,
drop_remainder=predict_drop_remainder)
eval_hooks = [LogEvalRunHook(FLAGS.predict_batch_size)]
eval_start_time = time.time()
output_predict_file = os.path.join(FLAGS.output_dir, "test_results.tsv")
with tf.gfile.GFile(output_predict_file, "w") as writer:
num_written_lines = 0
tf.logging.info("***** Predict results *****")
for prediction in estimator.predict(input_fn=predict_input_fn, hooks=eval_hooks,
yield_single_examples=True):
probabilities = prediction["probabilities"]
output_line = "\t".join(
str(class_probability)
for class_probability in probabilities) + "\n"
writer.write(output_line)
num_written_lines += 1
assert num_written_lines == num_actual_predict_examples
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
avg = np.mean(time_list)
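# With latencies sorted in ascending order, the N% confidence level below is approximated
# as the largest value among the first N% of measurements.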
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
cf_90 = max(time_list[:int(len(time_list) * 0.90)])
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
tf.logging.info("-----------------------------")
tf.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
tf.logging.info("Summary Inference Statistics")
tf.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.logging.info("Sequence Length = %d", FLAGS.max_seq_length)
tf.logging.info("Precision = %s", "fp16" if FLAGS.use_fp16 else "fp32")
tf.logging.info("Latency Confidence Level 50 (ms) = %0.2f", cf_50 * 1000)
tf.logging.info("Latency Confidence Level 90 (ms) = %0.2f", cf_90 * 1000)
tf.logging.info("Latency Confidence Level 95 (ms) = %0.2f", cf_95 * 1000)
tf.logging.info("Latency Confidence Level 99 (ms) = %0.2f", cf_99 * 1000)
tf.logging.info("Latency Confidence Level 100 (ms) = %0.2f", cf_100 * 1000)
tf.logging.info("Latency Average (ms) = %0.2f", avg * 1000)
tf.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
tf.logging.info("-----------------------------")
if __name__ == "__main__":
flags.mark_flag_as_required("data_dir")
flags.mark_flag_as_required("task_name")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.app.run()

View file

@ -921,7 +921,6 @@ def main(_):
training_hooks = []
global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps
hvd_rank = 0
hvd_local_rank = 0
config = tf.ConfigProto()
learning_rate = FLAGS.learning_rate
@ -933,7 +932,6 @@ def main(_):
learning_rate = learning_rate * hvd.size()
master_process = (hvd.rank() == 0)
hvd_rank = hvd.rank()
hvd_local_rank = hvd.local_rank()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
if hvd.size() > 1:
@ -976,15 +974,15 @@ def main(_):
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record")]
if FLAGS.horovod:
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.local_size())]
num_examples_per_local_rank = len(train_examples) // hvd.local_size()
remainder = len(train_examples) % hvd.local_size()
if hvd.local_rank() < remainder:
start_index = hvd.local_rank() * (num_examples_per_local_rank+1)
end_index = start_index + num_examples_per_local_rank + 1
tmp_filenames = [os.path.join(FLAGS.output_dir, "train.tf_record{}".format(i)) for i in range(hvd.size())]
num_examples_per_rank = len(train_examples) // hvd.size()
remainder = len(train_examples) % hvd.size()
if hvd.rank() < remainder:
start_index = hvd.rank() * (num_examples_per_rank+1)
end_index = start_index + num_examples_per_rank + 1
else:
start_index = hvd.local_rank() * num_examples_per_local_rank + remainder
end_index = start_index + (num_examples_per_local_rank)
start_index = hvd.rank() * num_examples_per_rank + remainder
end_index = start_index + (num_examples_per_rank)
model_fn = model_fn_builder(
@ -1005,7 +1003,7 @@ def main(_):
# We write to a temporary file to avoid storing very large constant tensors
# in memory.
train_writer = FeatureWriter(
filename=tmp_filenames[hvd_local_rank],
filename=tmp_filenames[hvd_rank],
is_training=True)
convert_examples_to_features(
examples=train_examples[start_index:end_index],
@ -1025,10 +1023,6 @@ def main(_):
tf.logging.info(" Num steps = %d", num_train_steps)
tf.logging.info(" LR = %f", learning_rate)
del train_examples
if FLAGS.horovod:
barrier = hvd.allreduce(tf.constant(0))
with tf.Session(config=config) as sess:
sess.run(barrier)
train_input_fn = input_fn_builder(
input_file=tmp_filenames,

View file

@ -1,10 +1,9 @@
#!/bin/bash
docker pull nvcr.io/nvidia/tensorrtserver:19.06-py3
docker pull nvcr.io/nvidia/tensorrtserver:19.08-py3
#The follow has been commented out since we need fixes for the perf_client from Guan
#Uncomment to enable building.
#For now, the tensorrt_client can be downloaded from https://drive.google.com/drive/u/1/folders/1CeOMZbnFT1VUIlIMoDEZJb3kOKbXBDbZ
git submodule update --init --recursive && cd tensorrt-inference-server && docker build -t tensorrtserver_client -f Dockerfile.client . && cd -
#Will have to update submodule from root
git submodule update --init --recursive
cd tensorrt-inference-server && docker build -t tensorrtserver_client -f Dockerfile.client . && cd -
docker build . --rm -t bert

View file

@ -14,8 +14,7 @@
# limitations under the License.
bert_model=${1:-"large"}
use_xla=${2:-"true"}
task=${3:-"squad"}
task=${2:-"squad"}
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
@ -26,13 +25,6 @@ echo "BERT directory set as " $BERT_DIR
init_checkpoint="$BERT_DIR/bert_model.ckpt"
if [ "$use_xla" = "true" ] ; then
use_xla_tag="--use_xla"
echo "XLA activated"
else
use_xla_tag=""
fi
#Edit to save logs & checkpoints in a different directory
RESULTS_DIR=/results
if [ ! -d "$RESULTS_DIR" ] ; then
@ -49,8 +41,7 @@ if [ "$task" = "squad" ] ; then
echo "Squad directory set as " $SQUAD_DIR
echo "Inference performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision $precision" >> $LOGFILE
echo "Sequence-Length Batch-size Precision Throughput-Average(sent/sec) Latency-Average(ms) Latency-50%(ms) Latency-90%(ms) Latency-95%(ms) Latency-99%(ms) Latency-100%(ms)" >> $LOGFILE
echo "Precision Sequence-Length Batch-size Precision Throughput-Average(sent/sec) Latency-Average(ms) Latency-50%(ms) Latency-90%(ms) Latency-95%(ms) Latency-99%(ms) Latency-100%(ms)" >> $LOGFILE
for seq_len in 128 384; do
@ -60,11 +51,13 @@ if [ "$task" = "squad" ] ; then
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
echo "fp16 and XLA activated!"
use_fp16="--use_fp16"
use_xla_tag="--use_xla"
else
echo "fp32 activated!"
use_fp16=""
use_xla_tag=""
fi
python run_squad.py \
@ -80,7 +73,7 @@ if [ "$task" = "squad" ] ; then
"$use_fp16" \
$use_xla_tag --num_eval_iterations=1024 |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec)' | awk -F'= ' '{print $2}'`
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | tail -1 | awk -F'= ' '{print $2}'`
la=`cat $tmp_file | grep -F 'Latency Average (ms)' | awk -F'= ' '{print $2}'`
l50=`cat $tmp_file | grep -F 'Latency Confidence Level 50 (ms)' | awk -F'= ' '{print $2}'`
l90=`cat $tmp_file | grep -F 'Latency Confidence Level 90 (ms)' | awk -F'= ' '{print $2}'`
@ -88,7 +81,7 @@ if [ "$task" = "squad" ] ; then
l99=`cat $tmp_file | grep -F 'Latency Confidence Level 99 (ms)' | awk -F'= ' '{print $2}'`
l100=`cat $tmp_file | grep -F 'Latency Confidence Level 100 (ms)' | awk -F'= ' '{print $2}'`
echo "$seq_len $bs $precision $perf $la $l50 $l90 $l95 $l99 $l100" >> $LOGFILE
echo "$precision $seq_len $bs $precision $perf $la $l50 $l90 $l95 $l99 $l100" >> $LOGFILE
done
done

View file

@ -64,8 +64,7 @@ if [ "$task" = "squad" ] ; then
echo "Squad directory set as " $SQUAD_DIR
echo "Training performance benchmarking for BERT $bert_model from $BERT_DIR" >> $LOGFILE
echo "Precision $precision" >> $LOGFILE
echo "Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
echo "Precision Sequence Length Batch size Performance(sent/sec)" >> $LOGFILE
for seq_len in 128 384; do
@ -104,8 +103,8 @@ if [ "$task" = "squad" ] ; then
"$use_fp16" \
$use_xla_tag |& tee $tmp_file
perf=`cat $tmp_file | grep -F 'Training Performance' | awk -F'= ' '{print $2}'`
echo "$seq_len $batch_size $perf"
perf=`cat $tmp_file | grep -F 'Throughput Average (sentences/sec) =' | head -1 | awk -F'= ' '{print $2}' | awk -F' sen' '{print $1}'`
echo "$precision $seq_len $batch_size $perf" >> $LOGFILE
done
done

View file

@ -0,0 +1,215 @@
"""
Multiclass
from:
https://github.com/guillaumegenthial/tf_metrics/blob/master/tf_metrics/__init__.py
"""
__author__ = "Guillaume Genthial"
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix
def precision(labels, predictions, num_classes, pos_indices=None,
weights=None, average='micro'):
"""Multi-class precision metric for Tensorflow
Parameters
----------
labels : Tensor of tf.int32 or tf.int64
The true labels
predictions : Tensor of tf.int32 or tf.int64
The predictions, same shape as labels
num_classes : int
The number of classes
pos_indices : list of int, optional
The indices of the positive classes, default is all
weights : Tensor of tf.int32, optional
Mask, must be of compatible shape with labels
average : str, optional
'micro': counts the total number of true positives, false
positives, and false negatives for the classes in
`pos_indices` and infer the metric from it.
'macro': will compute the metric separately for each class in
`pos_indices` and average. Will not account for class
imbalance.
'weighted': will compute the metric separately for each class in
`pos_indices` and perform a weighted average by the total
number of true labels for each class.
Returns
-------
tuple of (scalar float Tensor, update_op)
"""
cm, op = _streaming_confusion_matrix(
labels, predictions, num_classes, weights)
pr, _, _ = metrics_from_confusion_matrix(
cm, pos_indices, average=average)
op, _, _ = metrics_from_confusion_matrix(
op, pos_indices, average=average)
return (pr, op)
def recall(labels, predictions, num_classes, pos_indices=None, weights=None,
average='micro'):
"""Multi-class recall metric for Tensorflow
Parameters
----------
labels : Tensor of tf.int32 or tf.int64
The true labels
predictions : Tensor of tf.int32 or tf.int64
The predictions, same shape as labels
num_classes : int
The number of classes
pos_indices : list of int, optional
The indices of the positive classes, default is all
weights : Tensor of tf.int32, optional
Mask, must be of compatible shape with labels
average : str, optional
'micro': counts the total number of true positives, false
positives, and false negatives for the classes in
`pos_indices` and infer the metric from it.
'macro': will compute the metric separately for each class in
`pos_indices` and average. Will not account for class
imbalance.
'weighted': will compute the metric separately for each class in
`pos_indices` and perform a weighted average by the total
number of true labels for each class.
Returns
-------
tuple of (scalar float Tensor, update_op)
"""
cm, op = _streaming_confusion_matrix(
labels, predictions, num_classes, weights)
_, re, _ = metrics_from_confusion_matrix(
cm, pos_indices, average=average)
_, op, _ = metrics_from_confusion_matrix(
op, pos_indices, average=average)
return (re, op)
def f1(labels, predictions, num_classes, pos_indices=None, weights=None,
average='micro'):
return fbeta(labels, predictions, num_classes, pos_indices, weights,
average)
def fbeta(labels, predictions, num_classes, pos_indices=None, weights=None,
average='micro', beta=1):
"""Multi-class fbeta metric for Tensorflow
Parameters
----------
labels : Tensor of tf.int32 or tf.int64
The true labels
predictions : Tensor of tf.int32 or tf.int64
The predictions, same shape as labels
num_classes : int
The number of classes
pos_indices : list of int, optional
The indices of the positive classes, default is all
weights : Tensor of tf.int32, optional
Mask, must be of compatible shape with labels
average : str, optional
'micro': counts the total number of true positives, false
positives, and false negatives for the classes in
`pos_indices` and infer the metric from it.
'macro': will compute the metric separately for each class in
`pos_indices` and average. Will not account for class
imbalance.
'weighted': will compute the metric separately for each class in
`pos_indices` and perform a weighted average by the total
number of true labels for each class.
beta : int, optional
Weight of precision in harmonic mean
Returns
-------
tuple of (scalar float Tensor, update_op)
"""
cm, op = _streaming_confusion_matrix(
labels, predictions, num_classes, weights)
_, _, fbeta = metrics_from_confusion_matrix(
cm, pos_indices, average=average, beta=beta)
_, _, op = metrics_from_confusion_matrix(
op, pos_indices, average=average, beta=beta)
return (fbeta, op)
def safe_div(numerator, denominator):
"""Safe division, return 0 if denominator is 0"""
numerator, denominator = tf.to_float(numerator), tf.to_float(denominator)
zeros = tf.zeros_like(numerator, dtype=numerator.dtype)
denominator_is_zero = tf.equal(denominator, zeros)
return tf.where(denominator_is_zero, zeros, numerator / denominator)
def pr_re_fbeta(cm, pos_indices, beta=1):
"""Uses a confusion matrix to compute precision, recall and fbeta"""
num_classes = cm.shape[0]
neg_indices = [i for i in range(num_classes) if i not in pos_indices]
cm_mask = np.ones([num_classes, num_classes])
cm_mask[neg_indices, neg_indices] = 0
diag_sum = tf.reduce_sum(tf.diag_part(cm * cm_mask))
cm_mask = np.ones([num_classes, num_classes])
cm_mask[:, neg_indices] = 0
tot_pred = tf.reduce_sum(cm * cm_mask)
cm_mask = np.ones([num_classes, num_classes])
cm_mask[neg_indices, :] = 0
tot_gold = tf.reduce_sum(cm * cm_mask)
pr = safe_div(diag_sum, tot_pred)
re = safe_div(diag_sum, tot_gold)
fbeta = safe_div((1. + beta**2) * pr * re, beta**2 * pr + re)
return pr, re, fbeta
def metrics_from_confusion_matrix(cm, pos_indices=None, average='micro',
beta=1):
"""Precision, Recall and F1 from the confusion matrix
Parameters
----------
cm : tf.Tensor of type tf.int32, of shape (num_classes, num_classes)
The streaming confusion matrix.
pos_indices : list of int, optional
The indices of the positive classes
beta : int, optional
Weight of precision in harmonic mean
average : str, optional
'micro', 'macro' or 'weighted'
"""
num_classes = cm.shape[0]
if pos_indices is None:
pos_indices = [i for i in range(num_classes)]
if average == 'micro':
return pr_re_fbeta(cm, pos_indices, beta)
elif average in {'macro', 'weighted'}:
precisions, recalls, fbetas, n_golds = [], [], [], []
for idx in pos_indices:
pr, re, fbeta = pr_re_fbeta(cm, [idx], beta)
precisions.append(pr)
recalls.append(re)
fbetas.append(fbeta)
cm_mask = np.zeros([num_classes, num_classes])
cm_mask[idx, :] = 1
n_golds.append(tf.to_float(tf.reduce_sum(cm * cm_mask)))
if average == 'macro':
pr = tf.reduce_mean(precisions)
re = tf.reduce_mean(recalls)
fbeta = tf.reduce_mean(fbetas)
return pr, re, fbeta
if average == 'weighted':
n_gold = tf.reduce_sum(n_golds)
pr_sum = sum(p * n for p, n in zip(precisions, n_golds))
pr = safe_div(pr_sum, n_gold)
re_sum = sum(r * n for r, n in zip(recalls, n_golds))
re = safe_div(re_sum, n_gold)
fbeta_sum = sum(f * n for f, n in zip(fbetas, n_golds))
fbeta = safe_div(fbeta_sum, n_gold)
return pr, re, fbeta
else:
raise NotImplementedError()
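# Illustrative usage inside an Estimator's eval metrics (names such as `label_ids`,
# `pred_ids` and `num_labels` are placeholders, not part of this module). Each function
# returns a (value, update_op) pair, which is exactly what `eval_metric_ops` expects:
#
#   eval_metric_ops = {
#       "eval_precision": precision(label_ids, pred_ids, num_labels, average='macro'),
#       "eval_recall": recall(label_ids, pred_ids, num_labels, average='macro'),
#       "eval_f1": f1(label_ids, pred_ids, num_labels, average='macro'),
#   }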

View file

@ -0,0 +1,108 @@
# Deploying the BERT model using TensorRT Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains a detailed performance analysis as well as scripts to run SQuAD fine-tuning on the BERT model using the TensorRT Inference Server.
## Table Of Contents
- [TensorRT Inference Server Overview](#tensorrt-inference-server-overview)
- [Performance analysis for TensorRT Inference Server](#performance-analysis-for-tensorrt-inference-server)
* [Advanced Details](#advanced-details)
- [Running the TensorRT Inference Server and client](#running-the-tensorrt-inference-server-and-client)
## TensorRT Inference Server Overview
A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
4. Request is placed on the queue (Server Queue)
5. Request is removed from the queue and computed (Server Compute)
6. Completed request is serialized in a message and sent back to the client (Server Send)
7. Completed message travels over network from the server to the client (Network)
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
Generally, for local clients, steps 1-4 and 6-8 occupy only a small fraction of the total time compared to step 5 (Server Compute). Because backend deep learning systems like BERT are rarely exposed directly to end users and instead interface only with local front-end servers, we can treat all clients as local for the purposes of BERT.
In this section, we will go over how to launch the TensorRT Inference Server and client, and how to arrive at the best-performing configuration for your specific application needs.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
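Before running any of the clients below, you can quickly check that a launched server is ready to accept requests. The snippet below is a convenience sketch, assuming the server was started with the default HTTP port 8000:

```bash
# Prints 200 once the server is ready to serve requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/api/health/ready
```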
## Performance analysis for TensorRT Inference Server
Based on Figures 1 and 2 below, we recommend using the Dynamic Batcher with `max_batch_size = 8`, `max_queue_delay_microseconds` as large as possible while still fitting within your latency window (the values used below are extremely large to exaggerate their effect), and only 1 instance of the engine. The largest improvements to both throughput and latency come from increasing the batch size, due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of simultaneous requests, while also keeping latency down for infrequent requests. We recommend only 1 instance of the engine because the throughput improvement from additional instances is negligible while latency increases significantly. Many models can benefit from multiple engine instances, but as the figures below show, that is not the case for this model.
![](../data/images/trtis_base_summary.png?raw=true)
Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in TensorRT Inference Server
![](../data/images/trtis_large_summary.png?raw=true)
Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in TensorRT Inference Server
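For reference, the recommended settings above translate into a model configuration along the following lines. This is a minimal sketch only: the input/output sections are omitted, and the model name and repository path are placeholders rather than the values used by the export scripts, which generate the actual file.

```bash
# Sketch of a TensorRT Inference Server model config with the recommended settings
# (placeholders: model name "bert", repository path /results/trtis_models)
mkdir -p /results/trtis_models/bert/1
cat > /results/trtis_models/bert/config.pbtxt <<'EOF'
name: "bert"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
instance_group [
  { count: 1, kind: KIND_GPU }
]
EOF
```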
### Advanced Details
This section digs deeper into the performance numbers and configurations corresponding to running TensorRT Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
```bash
bash trtis/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
```
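For example, to generate the BERT Base, sequence length 128, FP16 results discussed below (the checkpoint path is a placeholder for your fine-tuned model):

```bash
bash trtis/scripts/generate_figures.sh base 128 fp16 /results/model.ckpt
```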
All results below are obtained on a single DGX-1 V100 32GB GPU for BERT Base, Sequence Length = 128 and FP16 precision running on a local server. Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. X-axis indicates the concurrency - the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput.
#### Maximum batch size
As we can see in Figure 3, the throughput at BS=1 with 64 concurrent client requests is 119, while in Figure 4 the throughput at BS=8 with 8 concurrent client requests is 517, a speedup of ~4.3x.
Note: We compare BS=1 with 64 concurrent client requests to BS=8 with 8 concurrent client requests to keep the total number of outstanding requests (Batch Size * Client Concurrent Requests) equal between the two modes. This is also why there are 8 times as many bars on the BS=1 chart as on the BS=8 chart.
Increasing the batch size from 1 to 8 increases the compute time by only 1.8x (8.38 ms to 15.46 ms), showing that computation is more efficient at higher batch sizes. Hence, an optimal batch size is the largest one that both fits in memory and stays within the preferred latency threshold.
![](../data/images/trtis_bs_1.png?raw=true)
Figure 3: Latency & Throughput vs Concurrency at Batch size = 1
![](../data/images/trtis_bs_8.png?raw=true)
Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
#### Batching techniques
Static batching is a feature of the inference server that allows inference requests to be served as they are received. It is preferred in scenarios where low latency is desired at the cost of throughput when the GPU is underutilized.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds`, to indicate the maximum amount of time you are willing to wait, and `preferred_batch_size`, to indicate your preferred batch sizes, in the TensorRT Inference Server model config.
Figures 5 and 6 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increased latency as requests are queued for up to `max_queue_delay_microseconds`. The effect of `preferred_batch_size` for dynamic batching is visually depicted by the dip in Server Queue time at integer multiples of the preferred batch sizes. At higher numbers of concurrent requests, the throughput approaches a maximum limit as GPU utilization saturates.
![](../data/images/trtis_static.png?raw=true)
Figure 5: Latency & Throughput vs Concurrency using Static Batching at `Batch size` = 1
![](../data/images/trtis_dynamic.png?raw=true)
Figure 6: Latency & Throughput vs Concurrency using Dynamic Batching at `Batch size` = 1, `preferred_batchsize` = [4, 8] and `max_queue_delay_microseconds` = 5000
#### Model execution instance count
TensorRT Inference Server enables us to launch multiple engines in separate CUDA streams by setting the engine count (the `count` field of `instance_group` in the model config), which can improve both latency and throughput. Multiple engines are useful when the model does not saturate the GPU on its own, allowing the GPU to run several instances of the model in parallel.
Figures 7 and 8 show a drop in queue time as more model instances are available to serve an inference request. However, this is countered by an increase in compute time as the instances compete for GPU resources. Since BERT is a large model that utilizes the majority of the GPU, running multiple engines shows no benefit here.
![](../data/images/trtis_ec_1.png?raw=true)
Figure 7: Latency & Throughput vs Concurrency at Batch size = 1, Engine Count = 1
(One copy of the model loaded in GPU memory)
![](../data/images/trtis_ec_4.png?raw=true)
Figure 8: Latency & Throughput vs Concurrency at Batch size = 1, Engine count = 4
(Four copies of the model loaded in GPU memory)
## Running the TensorRT Inference Server and client
The `run_trtis.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state, runs the client, and then evaluates the validity of predictions on the basis of exact match and F1 score, all in one step.
```bash
bash trtis/scripts/run_trtis.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <trtis_version_name> <trtis_model_name> <trtis_export_model> <trtis_dyn_batching_delay> <trtis_engine_count> <trtis_model_overwrite>
```
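For example, an illustrative invocation with explicit values for every argument (the checkpoint path is a placeholder; trailing arguments can be dropped to fall back to the script defaults):

```bash
bash trtis/scripts/run_trtis.sh /results/model.ckpt 8 fp16 true 384 128 large 1.1 1 bert true 0 1 False
```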

View file

@ -38,12 +38,12 @@ EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DI
PERF_CLIENT_ARGS="1000 10 20 localhost"
# Start Server
./scripts/docker/launch_server.sh $precision
./trtis/scripts/launch_server.sh $precision
# Restart Server
restart_server() {
docker kill trt_server_cont
./scripts/docker/launch_server.sh $precision
./trtis/scripts/launch_server.sh $precision
}
############## Dynamic Batching Comparison ##############
@ -53,32 +53,32 @@ TRTIS_ENGINE_COUNT=1
# Dynamic batching 10 ms
TRTIS_DYN_BATCHING_DELAY=10
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 5 ms
TRTIS_DYN_BATCHING_DELAY=5
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 2 ms
TRTIS_DYN_BATCHING_DELAY=2
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Static Batching (i.e. Dynamic batching 0 ms)
TRTIS_DYN_BATCHING_DELAY=0
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# ############## Engine Count Comparison ##############
@ -88,24 +88,24 @@ TRTIS_DYN_BATCHING_DELAY=0
# Engine Count = 4
TRTIS_ENGINE_COUNT=4
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 2
TRTIS_ENGINE_COUNT=2
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 1
TRTIS_ENGINE_COUNT=1
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
############## Batch Size Comparison ##############
@ -115,32 +115,32 @@ CLIENT_BATCH_SIZE=1
TRTIS_ENGINE_COUNT=1
TRTIS_DYN_BATCHING_DELAY=0
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
# BATCH=2 Generate model and perf
SERVER_BATCH_SIZE=2
CLIENT_BATCH_SIZE=2
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
# BATCH=4 Generate model and perf
SERVER_BATCH_SIZE=4
CLIENT_BATCH_SIZE=4
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
# BATCH=8 Generate model and perf
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=8
./scripts/trtis/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
./trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
./scripts/trtis/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost
./trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost

View file

@ -35,7 +35,7 @@ if [ ! -d "$SQUAD_DIR" ] ; then
fi
bash scripts/docker/launch.sh \
"python run_squad_trtis_client.py \
"python trtis/run_squad_trtis_client.py \
--trtis_model_name=$trtis_model_name \
--trtis_model_version=$trtis_version_name \
--vocab_file=$BERT_DIR/vocab.txt \

View file

@ -32,7 +32,7 @@ then
if [ ! "$(docker inspect -f "{{.State.Running}}" trt_server_cont)" = "true" ] ; then
echo "Launching TRTIS server"
bash scripts/docker/launch_server.sh $precision
bash trtis/scripts/launch_server.sh $precision
SERVER_LAUNCHED=true
function cleanup_server {
@ -47,7 +47,7 @@ then
fi
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash scripts/trtis/wait_for_trtis_server.sh $SERVER_HOSTNAME
bash trtis/scripts/wait_for_trtis_server.sh $SERVER_HOSTNAME
TIMESTAMP=$(date "+%y%m%d_%H%M")

View file

@ -22,8 +22,8 @@ doc_stride=${6:-"128"}
bert_model=${7:-"large"}
squad_version=${8:-"1.1"}
trtis_version_name=${9:-1}
trtis_model_name=${10:-"bert"}
trtis_export_model=${11:-"true"}
trtis_model_name=${10:-"bert_onnx"}
trtis_export_model=${11:-"false"}
trtis_dyn_batching_delay=${12:-0}
trtis_engine_count=${13:-1}
trtis_model_overwrite=${14:-"False"}
@ -68,19 +68,19 @@ echo
if [ "$trtis_export_model" = "true" ] ; then
echo "Exporting model as: Name - $trtis_model_name Version - $trtis_version_name"
bash scripts/trtis/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
bash trtis/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
$doc_stride $BERT_DIR $RESULTS_DIR $trtis_version_name $trtis_model_name \
$trtis_dyn_batching_delay $trtis_engine_count $trtis_model_overwrite
fi
# Start TRTIS server in detached state
bash scripts/docker/launch_server.sh $precision
bash trtis/scripts/launch_server.sh $precision
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash scripts/trtis/wait_for_trtis_server.sh localhost
bash trtis/scripts/wait_for_trtis_server.sh localhost
# Start TRTIS client for inference and evaluate results
bash scripts/trtis/run_client.sh $batch_size $seq_length $doc_stride $trtis_version_name $trtis_model_name \
bash trtis/scripts/run_client.sh $batch_size $seq_length $doc_stride $trtis_version_name $trtis_model_name \
$BERT_DIR $squad_version