BioBERT For TensorFlow

This folder provides a script and recipe to train BERT for TensorFlow to achieve state-of-the-art accuracy on biomedical text-mining tasks, and is tested and maintained by NVIDIA.

Model overview

In the original BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper, pre-training is done on Wikipedia and Books Corpus, with state-of-the-art results demonstrated on the SQuAD (Stanford Question Answering Dataset) benchmark.

Meanwhile, many works, including BioBERT, SciBERT, NCBI-BERT, ClinicalBERT (MIT), ClinicalBERT (NYU, Princeton), and others at the BioNLP19 workshop, show that additional pre-training of BERT on a large biomedical text corpus such as PubMed results in better performance on biomedical text-mining tasks.

This repository provides scripts and a recipe to adapt the NVIDIA BERT code-base to achieve state-of-the-art results on the following biomedical text-mining benchmark tasks:

  • Named entity recognition (NER) on the BC5CDR chemical and disease corpora
  • Relation extraction (RE) on the ChemProt corpus

Quick Start Guide

To pretrain or fine tune your model for biomedical tasks using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the BERT model.

  1. Clone the repository.
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow/LanguageModeling/BERT
  2. Build the BERT TensorFlow NGC container.
bash scripts/docker/build.sh
  3. Download and preprocess the PubMed dataset.

To download and preprocess pre-training data as well as the required vocab files, run the following script:

bash biobert/scripts/biobert_data_download.sh

Datasets for fine tuning for NER can be obtained from this repository. Datasets for fine tuning for RE can be obtained from this repository.

Place them both in /workspace/bert/data/biobert/ to be automatically picked up by our scripts.

  4. Start an interactive session in the NGC container to run training/inference.

After you build the container image and download the data, you can start an interactive CLI session as follows:

bash scripts/docker/launch.sh
  5. Download the pre-trained checkpoint, vocabulary, and configuration files.

We have uploaded checkpoints for fine tuning and pre-training on biomedical corpora to the NGC Model Registry. You can download them directly from the NGC model catalog.

Place our BioBERT checkpoints in the results/ directory to easily access them in your scripts.

  6. Start pre-training.

From within the container, you can use the following script to run the 1st phase of the pre-training using cased vocabulary:

bash biobert/scripts/run_pretraining-pubmed_base_phase_1.sh <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>

For the 2nd phase of the pre-training, issue:

bash biobert/scripts/run_pretraining-pubmed_base_phase_2.sh <path_to_phase_1_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>

Refer to the Multi-node section below for details on utilizing multiple nodes for faster pre-training.

  7. Start fine tuning.

The above pretrained BERT representations can be fine tuned with just one additional output layer for a state-of-the-art biomedical text-mining system. From within the container, you can use the following scripts to run fine tuning for NER and relation extraction.

Note: The scripts assume you are running on 16 V100 32GB GPUs. If you are running on GPUs with less than 32GB of memory, or on fewer GPUs, the batch size, learning rate and number of GPUs need to be adjusted.

For NER on disease entities:

bash biobert/scripts/ner_bc5cdr-disease.sh  <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>

For NER on chemical entities:

bash biobert/scripts/ner_bc5cdr-chem.sh  <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>

For relation extraction, issue:

bash biobert/scripts/rel_chemprot.sh <init_checkpoint> <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpu> <seq_length> <bert_model> <eval_batch_size> <epochs>
  8. Start validation/evaluation.

The biobert/scripts/run_biobert_finetuning_inference.sh script runs inference on a checkpoint fine tuned for a specific task and evaluates the validity of predictions on the basis of F1, precision and recall scores.

bash biobert/scripts/run_biobert_finetuning_inference.sh <task> <init_checkpoint> <bert_model> <cased> <precision> <use_xla> <batch_size>

For FP16 inference for NER on the BC5CDR Chemical task with XLA on a DGX-2 with V100 32G GPUs, run:

bash biobert/scripts/run_biobert_finetuning_inference.sh ner_bc5cdr-chem /results/model.ckpt base false fp16 true 16

Tasks ner_bc5cdr-chem, ner_bc5cdr-disease and rel_chemprot are currently supported.

Advanced

The following sections provide greater details of the dataset, running training and inference, and the training results.

Scripts and sample code

In addition to BERT TensorFlow files, the most important files added for NER and RE fine tuning tasks are:

  • run_ner.py - Serves as an entry point for NER training.
  • run_re.py - Serves as an entry point for RE training.

The biobert/scripts/ folder encapsulates all the one-click scripts required for running the supported functionalities, such as:

  • ner_bc5cdr-chem.sh - Runs NER training and inference on the BC5CDR Chemical dataset using the run_ner.py file.
  • ner_bc5cdr-disease.sh - Runs NER training and inference on the BC5CDR Disease dataset using the run_ner.py file.
  • rel_chemprot.sh - Runs RE training and inference on the ChemProt dataset using the run_re.py file.
  • run_pretraining_pubmed_base_phase_*.sh - Runs pre-training with the LAMB optimizer using the run_pretraining.py file in two phases. Phase 1 does 90% of the training with sequence length = 128. In phase 2, the remaining 10% of the training is done with sequence length = 512.
  • biobert_data_download.sh - Downloads the PubMed dataset and Vocab files using files in the data/ folder.
  • run_biobert_finetuning_inference.sh - Runs task specific inference using a fine tuned checkpoint.

Parameters

Aside from the options to set hyperparameters, some relevant options to control the behaviour of the run_ner.py and run_re.py scripts are:

  --bert_config_file: The config JSON file corresponding to the pre-trained BERT model. This specifies the model architecture.
  --vocab_file: The vocabulary file that the BERT model was trained on.
  --output_dir: The output directory where the model checkpoints will be written.
  --[no]do_eval: Whether to run evaluation on the dev set. (default: 'false')
  --[no]do_predict: Whether to run evaluation on the test set. (default: 'false')
  --[no]do_train: Whether to run training. (default: 'false')
  --learning_rate: The initial learning rate for Adam. (default: '5e-06') (a number)
  --max_seq_length: The maximum total input sequence length after WordPiece tokenization. Sequences longer than this will be truncated, and sequences shorter than this will be padded. (default: '384') (an integer)
  --predict_batch_size: Total batch size for predictions. (default: '8') (an integer)
  --train_batch_size: Total batch size for training. (default: '8') (an integer)
  --[no]use_fp16: Whether to enable AMP ops. (default: 'false')
  --[no]use_xla: Whether to enable XLA JIT compilation. (default: 'false')
  --init_checkpoint: Initial checkpoint (usually from a pre-trained BERT model).
  --num_train_epochs: Total number of training epochs to perform. (default: '3.0') (a number)

Note: When initializing from a checkpoint using --init_checkpoint and a corpus of your choice, keep in mind that bert_config_file and vocab_file should remain unchanged.

Command-line options

To see the full list of available options and their descriptions, use the -h or --help command-line option with the Python file, for example:

python run_ner.py --help
python run_re.py --help

Getting the data

For pre-training BERT, we use the PubMed dataset. The PubMed XML files are extracted into a document-level corpus rather than a shuffled sentence-level corpus, because it is critical to be able to extract long contiguous sequences of sentences.

The next step is to run create_pretraining_data.py with the document-level corpus as input, which generates input data and labels for the masked language modeling and next sentence prediction tasks. Pre-training can also be performed on any corpus of your choice. The collection of data generation scripts is intended to be modular, allowing modifications for additional preprocessing steps or the use of additional data, and can hence easily be adapted for an arbitrary corpus.

The preparation of an individual pre-training dataset is described in the create_biobert_datasets_from_start.sh script found in the data/ folder. The component steps to prepare the datasets are as follows:

  1. Data download and extract - the dataset is downloaded and extracted.
  2. Clean and format - document tags, etc. are removed from the dataset. The end result of this step is a {dataset_name_one_article_per_line}.txt file that contains the entire corpus. Each line in the text file contains an entire document from the corpus. One file per dataset is created in the formatted_one_article_per_line folder.
  3. Sharding - the sentence segmented corpus file is split into a number of smaller text documents. The sharding is configured so that a document will not be split between two shards. Sentence segmentation is performed at this time using NLTK (see the sketch after this list).
  4. TFRecord file creation - each text file shard is processed by the create_pretraining_data.py script to produce a corresponding TFRecord file. The script generates input data and labels for masked language modeling and sentence prediction tasks for the input text shard.
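
As an illustration of the sentence segmentation in step 3, a minimal sketch using NLTK is shown below. This is not the repository's exact sharding code; the file name is a placeholder.

import nltk

nltk.download("punkt")  # one-time download of the sentence tokenizer model

# Placeholder file name standing in for the output of the "clean and format" step.
corpus_file = "pubmed_one_article_per_line.txt"

with open(corpus_file, encoding="utf-8") as f:
    for article in f:  # each line holds one full article
        sentences = nltk.sent_tokenize(article.strip())
        # All sentences of an article stay together, so a document is never
        # split across two shards.
        print(len(sentences), "sentences in this article")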

For fine tuning BioBERT on the Named Entity Recognition and Relation Extraction tasks, we use the BC5CDR and ChemProt datasets. The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. The ChemProt corpus consists of text exhaustively annotated by hand with mentions of chemical compounds/drugs and genes/proteins, as well as 22 different types of compound-protein relations, focusing on 5 important relation classes. It was preprocessed following the guidelines of Lim and Kang.

Dataset guidelines

The procedure to prepare a text corpus for pre-training is described in the previous section. This section provides additional insight into how exactly raw text is processed so that it is ready for pre-training.

First, raw text is tokenized using WordPiece tokenization. A [CLS] token is inserted at the start of every sequence, and the two sentences in the sequence are separated by a [SEP] token.

Note: BERT pre-training looks at pairs of sentences at a time. A sentence embedding token [A] is added to the first sentence and token [B] to the next.
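
For illustration, a minimal sketch of this tokenization and sequence packing, using the tokenization.py module that ships with this BERT code-base (the vocab path and sentences are placeholders; the full packing logic lives in create_pretraining_data.py):

import tokenization  # tokenization.py from this code-base

# Placeholder vocab path; use the vocab file downloaded in the Quick Start Guide.
tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

tokens_a = tokenizer.tokenize("Aspirin inhibits prostaglandin synthesis.")
tokens_b = tokenizer.tokenize("It is widely used as an analgesic.")

# One packed sequence: [CLS] sentence A [SEP] sentence B [SEP]
tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)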

BERT pre-training optimizes for two unsupervised classification tasks. The first is Masked Language Modelling (Masked LM). One training instance of Masked LM is a single modified sentence. Each token in the sentence has a 15% chance of being selected for masking. A selected token is replaced with [MASK] 80% of the time, with another random token 10% of the time, and kept unchanged the remaining 10% of the time. The task is then to predict the original token.
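
The 80/10/10 selection rule can be sketched as follows; this is a simplified illustration, not the exact implementation in create_pretraining_data.py:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, target_positions) following the 80/10/10 rule."""
    masked = list(tokens)
    targets = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            continue
        targets.append(i)  # the model must predict the original tokens[i] here
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, targets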

The second task is next sentence prediction. One training instance of BERT pre-training is two sentences (a sentence pair). A sentence pair may be constructed by simply taking two adjacent sentences from a single document, or by pairing up two random sentences with equal probability. The goal of this task is to predict whether or not the second sentence followed the first in the original document.
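
A simplified sketch of how a sentence pair and its next-sentence label could be formed (the actual create_pretraining_data.py logic additionally packs several sentences into each segment):

import random

def make_sentence_pair(doc_sentences, all_sentences, index):
    """Pair sentence `index` with its true successor or a random sentence (50/50)."""
    sent_a = doc_sentences[index]
    if random.random() < 0.5 and index + 1 < len(doc_sentences):
        sent_b, is_next = doc_sentences[index + 1], True       # actual next sentence
    else:
        sent_b, is_next = random.choice(all_sentences), False  # random sentence
    return sent_a, sent_b, is_next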

The create_pretraining_data.py script takes in raw text and creates training instances for both pre-training tasks.

Multi-dataset

We are able to combine multiple datasets into a single dataset for pre-training on a diverse text corpus. Once TFRecords have been created for each component dataset, you can create a combined dataset by adding the directory to *FILES_DIR in run_pretraining_*.sh. This will feed all matching files to the input pipeline in run_pretraining.py. However, only one TFRecord file is consumed at a time during training; therefore, the training instances of any given training batch will all belong to the same source dataset.
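
As a rough sketch of the file matching (the directory names below are placeholders, not paths created by the scripts): once each corpus has its own directory of TFRecord shards, a combined file list can be globbed and handed to the input pipeline, and because records are read one file at a time, every batch is drawn from a single source shard.

import tensorflow as tf

# Placeholder shard locations; in practice *FILES_DIR in run_pretraining_*.sh
# points the input pipeline at these directories.
patterns = ["data/tfrecords/pubmed/*.tfrecord",
            "data/tfrecords/my_corpus/*.tfrecord"]

input_files = []
for pattern in patterns:
    input_files.extend(tf.io.gfile.glob(pattern))

# Files are consumed sequentially, so each training batch comes from one shard.
dataset = tf.data.TFRecordDataset(input_files)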

Training process

The training process consists of two steps: pre-training and fine tuning.

Pre-training

BERT is designed to pre-train deep bidirectional language representations. The following scripts pre-train BERT on the PubMed dataset. They are general and can be used for pre-training language representations on any additional corpus of biomedical text.

Pre-training is performed using the run_pretraining.py script along with parameters defined in the biobert/scripts/run_pretraining_pubmed_base_phase_1.sh and biobert/scripts/run_pretraining_pubmed_base_phase_2.sh scripts.

The biobert/scripts/run_pretraining_pubmed_base_phase*.sh scripts run a job on a single node that trains the BERT-base model from scratch using the PubMed Corpus dataset as training data. By default, the training script:

  • Runs on 16 GPUs

  • Has FP16 precision enabled

  • Is XLA enabled

  • Creates a log file containing all the output

  • Saves a checkpoint every 5000 iterations (keeps only the latest checkpoint) and at the end of training. All checkpoints, evaluation results, and training logs are saved to the /results directory (in the container which can be mounted to a local directory).

  • Evaluates the model at the end of each phase

  • Phase 1

    • Runs 19531 steps with 1953 warmup steps
    • Sets Maximum sequence length as 128
    • Sets Global Batch size as 64K
  • Phase 2

    • Runs 4340 steps with 434 warm-up steps
    • Sets Maximum sequence length as 512
    • Sets Global Batch size as 32K
    • Should start from Phase1's final checkpoint

These parameters train BERT-base on PubMed with reasonable accuracy on a DGX-2 with 32GB V100 cards.

For example:

biobert/scripts/run_pretraining-pubmed_base_phase_1.sh <train_batch_size> <learning_rate> <cased> <precision> <use_xla> <num_gpus> <warmup_steps> <train_steps> <num_accumulation_steps> <save_checkpoint_steps> <eval_batch_size>

Where:

  • <train_batch_size> is the per-GPU batch size used for training. Batch size varies with precision; larger batch sizes run more efficiently but require more memory.

  • <learning_rate> is the base learning rate; the default of 3.2e-5 is good for a global batch size of 64K.

  • <cased> is set to true or false depending on whether the model should be trained on cased or uncased data.

  • <precision> is the type of math in your model, which can be either fp32 or fp16. Specifically:

    • fp32 is 32-bit IEEE single precision floats.
    • fp16 enables automatic rewriting of the TensorFlow compute graph to take advantage of 16-bit arithmetic whenever it is safe.
  • <num_gpus> is the number of GPUs to use for training. Must be equal to or smaller than the number of GPUs attached to your node.

  • <warmup_steps> is the number of warm-up steps at the start of training.

  • <train_steps> is the total number of training steps.

  • <save_checkpoint_steps> controls how often checkpoints are saved. Default is 5000 steps.

  • <num_accumulation_steps> is used to mimic higher batch sizes in the respective phase by accumulating gradients N times before weight update.

  • <bert_model> is used to indicate whether to pretrain BERT Large or BERT Base model.

  • <eval_batch_size> is per-GPU batch size used for evaluation after training.

The following sample code trains phase 1 of BERT-base from scratch on a single DGX-2 using FP16 arithmetic and uncased data.

biobert/scripts/run_pretraining-pubmed_base_phase_1.sh 128 3.2e-5 false fp16 true 16 1953 19531 32 5000 80
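
For reference, the global batch size implied by this example is the per-GPU batch size times the number of GPUs times the number of gradient accumulation steps: 128 × 16 × 32 = 65,536, i.e. roughly the 64K global batch size listed for Phase 1 above.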

Fine tuning

Fine tuning is performed using the run_ner.py script along with parameters defined in biobert/scripts/ner_bc5cdr*.sh.

For example, the biobert/scripts/ner_bc5cdr-chem.sh script trains a model and performs evaluation on the BC5CDR Chemical dataset. By default, the training script:

  • Trains on BERT Base Uncased Model
  • Uses 16 GPUs and batch size of 8 on each GPU
  • Has FP16 precision enabled
  • Is XLA enabled
  • Runs for 10 epochs
  • Evaluation is done at the end of training. To skip evaluation, modify --do_eval and --do_predict to False.

This script outputs checkpoints to the /results directory, by default, inside the container. Mount point of /results can be changed in the scripts/docker/launch.sh file. The training log contains information about:

  • Loss for the final step
  • Training and evaluation performance
  • F1, Precision and Recall on the Test Set of BC5CDR Chemical after evaluation.

The summary after training is printed in the following format:

 0: /results/biobert_finetune_ner_chem_191028154209/test_labels.txt
 0: /results/biobert_finetune_ner_chem_191028154209/test_labels_errs.txt
 0: processed 124669 tokens with 5433 phrases; found: 5484 phrases; correct: 5102.
 0: accuracy:  99.26%; precision:  93.03%; recall:  93.91%; FB1:  93.47
 0:                  : precision:  93.03%; recall:  93.91%; FB1:  93.47  5484

Multi-GPU training is enabled with the Horovod TensorFlow module. The following example runs training on 16 GPUs:

BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
DATA_DIR=data/biobert/BC5CDR/chem

mpi_command="mpirun -np 16 -H localhost:16 \
    --allow-run-as-root -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO \
    -x LD_LIBRARY_PATH \
    -x PATH -mca pml ob1 -mca btl ^openib" \
     python run_ner.py --horovod --amp --use_xla \
      --vocab_file=$BERT_DIR/vocab.txt \
     --bert_config_file=$BERT_DIR/bert_config.json \
     --output_dir=/results --data_dir=$DATA_DIR"

Multi-node

Multi-node runs can be launched on a Pyxis/Enroot Slurm cluster (see Requirements) using the biobert/scripts/run_biobert.sub script. The following commands launch a 4-node DGX-2 example for phase 1 and phase 2, respectively:

BATCHSIZE=128 LEARNING_RATE='8e-6' NUM_ACCUMULATION_STEPS=8 PHASE=1 sbatch -N4 --ntasks-per-node=16 biobert/scripts/run_biobert.sub
BATCHSIZE=16 LEARNING_RATE='3.2e-5' NUM_ACCUMULATION_STEPS=32 PHASE=2 sbatch -N4 --ntasks-per-node=16 biobert/scripts/run_biobert.sub

The checkpoint after phase 1 will be saved in the checkpointdir specified in biobert/scripts/run_biobert.sub. This checkpoint will be automatically picked up to resume training in phase 2. Note that phase 2 should be run after phase 1.

The variables needed to reproduce the training performance results are available in the configurations.yml file.

The batch variables BATCHSIZE, LEARNING_RATE, NUM_ACCUMULATION_STEPS refer to the Python arguments train_batch_size, learning_rate, num_accumulation_steps respectively. The variable PHASE refers to phase specific arguments available in biobert/scripts/run_biobert.sub.

Note that the biobert/scripts/run_biobert.sub script is a starting point that has to be adapted depending on the environment. In particular, variables such as datadir handle the location of the files for each phase.

Refer to the file contents to see the full list of variables to adjust for your system.

Inference process

Inference on a model fine tuned for biomedical tasks is performed using the run_ner.py or run_re.py script along with parameters defined in biobert/scripts/run_biobert_finetuning_inference.sh. Inference is supported on a single GPU.

The biobert/scripts/run_biobert_finetuning_inference.sh script performs evaluation on ChemProt or BC5CDR datasets depending on the task specified. By default, the inferencing script:

  • Uses BC5CDR Chemical dataset
  • Has FP16 precision enabled
  • Is XLA enabled
  • Evaluates the latest checkpoint present in /results with a batch size of 16.

This script computes F1, Precision and Recall scores. Mount point of /results can be changed in the scripts/docker/launch.sh file.
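
For reference, with TP, FP and FN denoting true positives, false positives and false negatives counted at the entity (or relation) level: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall).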

Performance

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA's latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Both of these benchmarking scripts enable you to run a number of epochs, extract performance numbers, and run the BERT model for fine tuning.

Training performance benchmark

Training benchmarking can be performed by running the script:

biobert/scripts/biobert_finetune_training_benchmark.sh <task> <num_gpu> <bert_model> <cased>

This script runs 2 epochs by default on the NER BC5CDR dataset and extracts performance numbers for various batch sizes and sequence lengths in both FP16 and FP32. These numbers are saved at /results/tf_bert_biobert_<task>_training_benchmark__<bert_model>_<cased/uncased>_num_gpu_<num_gpu>_<DATESTAMP>

Inference performance benchmark

Inference benchmarking can be performed by running the script:

biobert/scripts/biobert_finetune_inference_benchmark.sh <task> <bert_model> <cased>

This script runs inference on the test and dev sets and extracts performance and latency numbers for various batch sizes and sequence lengths in both FP16 with XLA and FP32 without XLA. These numbers are saved at /results/tf_bert_biobert_<task>_training_benchmark__<bert_model>_<cased/uncased>_num_gpu_<num_gpu>_<DATESTAMP>

Results

The following sections provide detailed results of downstream fine tuning on the NER and RE benchmark tasks.

Training accuracy results

Pre-training accuracy

Our results were obtained by running the scripts/run_pretraining_lamb.sh training script in the TensorFlow 19.08-py3 NGC container.

| DGX System | Nodes | Precision | Batch Size/GPU: Phase1, Phase2 | Accumulation Steps: Phase1, Phase2 | Time to Train (Hrs) | Final Loss |
|---|---|---|---|---|---|---|
| DGX2H | 4 | FP16 | 128, 16 | 8, 32 | 19.14 | 0.88 |
| DGX2H | 16 | FP16 | 128, 16 | 2, 8 | 4.81 | 0.86 |
| DGX2H | 32 | FP16 | 128, 16 | 1, 4 | 2.65 | 0.87 |
| DGX1 | 1 | FP16 | 64, 8 | 128, 512 | 174.58 | 0.87 |
| DGX1 | 4 | FP16 | 64, 8 | 32, 128 | 57.71 | 0.85 |
| DGX1 | 16 | FP16 | 64, 8 | 8, 32 | 12.62 | 0.87 |
| DGX1 | 32 | FP16 | 64, 8 | 4, 16 | 6.97 | 0.87 |

Fine-tuning accuracy

| Task | F1 | Precision | Recall |
|---|---|---|---|
| NER BC5CDR-chemical | 93.47 | 93.03 | 93.91 |
| NER BC5CDR-disease | 86.22 | 85.05 | 87.43 |
| RE ChemProt | 76.27 | 77.62 | 74.98 |

Fine-tuning accuracy for NER Chem

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container.

| DGX System | Batch size / GPU | F1 - FP32 | F1 - mixed precision | Time to Train - FP32 (Minutes) | Time to Train - mixed precision (Minutes) |
|---|---|---|---|---|---|
| DGX-1 16G | 64 | 93.33 | 93.40 | 23.95 | 14.13 |
| DGX-1 32G | 64 | 93.31 | 93.36 | 24.35 | 12.63 |
| DGX-2 32G | 64 | 93.66 | 93.47 | 12.26 | 8.16 |

Training stability test

Fine-tuning stability test:

The following table compares F1 scores across 5 different training runs on the NER Chemical task with different seeds, for both FP16 and FP32. The runs showcase consistent convergence on all 5 seeds with very little deviation.

| 16 x V100 GPUs | seed 1 | seed 2 | seed 3 | seed 4 | seed 5 | mean | std |
|---|---|---|---|---|---|---|---|
| F1 Score (FP16) | 93.13 | 92.92 | 93.34 | 93.66 | 93.47 | 93.3 | 0.29 |
| F1 Score (FP32) | 93.1 | 93.28 | 93.33 | 93.45 | 93.17 | 93.27 | 0.14 |

Training performance results

Training performance: NVIDIA DGX-1 (8x V100 16G)
Pre-training training performance: multi-node on DGX-1 16G

Our results were obtained by running the biobert/scripts/run_biobert.sub training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady state throughput.

| Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 64, 32 | 2762.06 | 744.48 | 3.71 | 1.00 | 1.00 |
| 4 | 128 | 64, 32 | 10283.08 | 2762.88 | 3.72 | 3.72 | 3.71 |
| 16 | 128 | 64, 32 | 39051.69 | 10715.14 | 3.64 | 14.14 | 14.39 |
| 32 | 128 | 64, 32 | 76077.39 | 21104.87 | 3.60 | 27.54 | 28.35 |
| 1 | 512 | 8, 8 | 432.33 | 160.38 | 2.70 | 1.00 | 1.00 |
| 4 | 512 | 8, 8 | 1593.00 | 604.36 | 2.64 | 3.68 | 3.77 |
| 16 | 512 | 8, 8 | 5941.82 | 2356.44 | 2.52 | 13.74 | 14.69 |
| 32 | 512 | 8, 8 | 11483.73 | 4631.29 | 2.48 | 26.56 | 28.88 |

Note: The respective values for FP32 runs that use a batch size of 16 and 2 for sequence lengths 128 and 512, respectively, are not available due to out-of-memory errors.

Fine-tuning training performance for NER on DGX-1 16G

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 64 | 147.71 | 348.84 | 2.36 | 1.00 | 1.00 |
| 4 | 64 | 583.78 | 1145.46 | 1.96 | 3.95 | 3.28 |
| 8 | 64 | 981.22 | 1964.85 | 2.00 | 6.64 | 5.63 |

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-1 (8x V100 32G)
Fine-tuning training performance for NER on DGX-1 32G

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 64 | 144.1 | 417.39 | 2.89 | 1.00 | 1.00 |
| 4 | 64 | 525.15 | 1354.14 | 2.57 | 3.64 | 3.24 |
| 8 | 64 | 969.4 | 2341.39 | 2.41 | 6.73 | 5.61 |

To achieve these same results, follow the Quick Start Guide outlined above.

Training performance: NVIDIA DGX-2 (16x V100 32G)
Pre-training training performance: multi-node on DGX-2H 32G

Our results were obtained by running the biobert/scripts/run_biobert.sub training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2H with 16x V100 32G GPUs. Performance (in sentences per second) is the steady state throughput.

| Nodes | Sequence Length | Batch size / GPU: mixed precision, FP32 | Throughput - mixed precision | Throughput - FP32 | Throughput speedup (FP32 to mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 128, 128 | 7772.18 | 2165.04 | 3.59 | 1.00 | 1.00 |
| 4 | 128 | 128, 128 | 29785.31 | 8516.90 | 3.50 | 3.83 | 3.93 |
| 16 | 128 | 128, 128 | 115581.29 | 33699.15 | 3.43 | 14.87 | 15.57 |
| 32 | 128 | 128, 128 | 226156.53 | 66996.73 | 3.38 | 29.10 | 30.94 |
| 64 | 128 | 128, 128 | 444955.74 | 133424.95 | 3.33 | 57.25 | 61.63 |
| 1 | 512 | 16, 16 | 1260.06 | 416.92 | 3.02 | 1.00 | 1.00 |
| 4 | 512 | 16, 16 | 4781.19 | 1626.76 | 2.94 | 3.79 | 3.90 |
| 16 | 512 | 16, 16 | 18405.65 | 6418.09 | 2.87 | 14.61 | 15.39 |
| 32 | 512 | 16, 16 | 36071.06 | 12713.67 | 2.84 | 28.63 | 30.49 |
| 64 | 512 | 16, 16 | 69950.86 | 25245.96 | 2.77 | 55.51 | 60.55 |

Fine-tuning training performance for NER on DGX-2 32G

Our results were obtained by running the biobert/scripts/ner_bc5cdr-chem.sh training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the mean throughput from 2 epochs.

| GPUs | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
|---|---|---|---|---|---|---|
| 1 | 64 | 139.59 | 475.54 | 3.4 | 1.00 | 1.00 |
| 4 | 64 | 517.08 | 1544.01 | 2.98 | 3.70 | 3.25 |
| 8 | 64 | 1009.84 | 2695.34 | 2.66 | 7.23 | 5.67 |
| 16 | 64 | 1997.73 | 4268.81 | 2.13 | 14.31 | 8.98 |

To achieve these same results, follow the Quick Start Guide outlined above.

Release notes

Changelog

November 2019

  • Initial release

Known issues

  • There are no known issues with the model.