Merge pull request #502 from swethmandava/master

BERT module update
This commit is contained in:
Swetha Mandava 2020-05-18 10:27:19 -07:00 committed by GitHub
commit f3786969a3
30 changed files with 985 additions and 727 deletions

View file

@ -1,4 +1,4 @@
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:19.10-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.03-tf1-py3
FROM ${FROM_IMAGE_NAME}
@ -13,10 +13,11 @@ RUN git clone https://github.com/attardi/wikiextractor.git
RUN git clone https://github.com/soskek/bookcorpus.git
RUN git clone https://github.com/titipata/pubmed_parser
RUN pip3 install /workspace/pubmed_parser
#Copy the perf_client over
ARG TRTIS_CLIENTS_URL=https://github.com/NVIDIA/tensorrt-inference-server/releases/download/v1.5.0/v1.5.0_ubuntu1804.clients.tar.gz
ARG TRTIS_CLIENTS_URL=https://github.com/NVIDIA/triton-inference-server/releases/download/v1.12.0/v1.12.0_ubuntu1804.clients.tar.gz
RUN mkdir -p /workspace/install \
&& curl -L ${TRTIS_CLIENTS_URL} | tar xvz -C /workspace/install

View file

@ -28,7 +28,7 @@ This repository provides a script and recipe to train the BERT model for TensorF
* [Multi-node](#multi-node)
* [Inference process](#inference-process)
* [Inference Process With TensorRT](#inference-process-with-tensorrt)
* [Deploying the BERT model using TensorRT Inference Server](#deploying-the-bert-model-using-tensorrt-inference-server)
* [Deploying the BERT model using Triton Inference Server](#deploying-the-bert-model-using-triton-inference-server)
* [BioBERT](#biobert)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
@ -619,9 +619,9 @@ I0312 23:14:00.550973 140287431493376 run_squad.py:1397] 0 Inference Performance
### Inference Process With TensorRT
NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. More information on how to perform inference using TensorRT can be found in the subfolder [./trt/README.md](trt/README.md).
### Deploying the BERT model using TensorRT Inference Server
### Deploying the BERT model using Triton Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `TensorRT Inference Server` can be found in the subfolder `./trtis/README.md`.
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `Triton Inference Server` can be found in the subfolder `./triton/README.md`.
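A quick way to confirm that a locally launched server is up before sending requests is a readiness probe like the sketch below. It assumes the v1 HTTP endpoint layout used by the client release pinned in this commit (v1.12.0), i.e. port `8000` and `/api/health/ready`; newer Triton releases expose `/v2/health/ready` instead, so adjust the URL accordingly.

```python
# Minimal readiness probe, assuming the v1 HTTP API on the default port 8000.
import requests  # third-party: pip install requests

resp = requests.get("http://localhost:8000/api/health/ready", timeout=5)
if resp.status_code == 200:
    print("Inference server is ready to accept requests")
else:
    print("Server not ready yet (HTTP %d)" % resp.status_code)
```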
### BioBERT
@ -1153,7 +1153,7 @@ September 2019
July 2019
- Results obtained using 19.06
- Inference Studies using TensorRT Inference Server
- Inference Studies using Triton Inference Server
March 2019
- Initial release

View file

@ -41,24 +41,10 @@ Here is a short description of each relevant file:
### 2.a Build the BERT TensorFlow NGC container:
To run the notebook you first need to build the BERT TensorFlow container using the following command from the main directory of this repository:
``` bash
```bash
docker build . --rm -t bert
```
### 2.b Dataset
We need to download the vocabulary and the bert_config files:
``` python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
```
This is only needed during fine-tuning in order to download the Squad dataset:
``` python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
```
### 2.c Start of the NGC container to run inference:
### 2.b Start of the NGC container to run inference:
Once the image is built, you need to run the container with the `--publish
0.0.0.0:8888:8888` option to publish Jupyter's port `8888` to the host machine
at port `8888` over all network interfaces (`0.0.0.0`):
@ -74,7 +60,23 @@ nvidia-docker run \
-it bert:latest bash
```
Then you can use the following command within the BERT Tensorflow container under
### 2.c Dataset
We need to download the vocabulary and the bert_config files:
```python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset google_pretrained_weights # Includes vocab
```
This is only needed during fine-tuning in order to download the SQuAD dataset:
```python3
python3 /workspace/bert/data/bertPrep.py --action download --dataset squad
```
### 2.d Starting Jupyter Notebook
Now you can use the following command within the BERT Tensorflow container under
`/workspace/bert`:
```bash
@ -123,7 +125,7 @@ Here is a short description of the relevant file:
### 2.a Build the BERT TensorFlow NGC container:
To run the notebook you first need to build the BERT TensorFlow container using the following command from the main directory of this repository:
``` bash
```bash
docker build . --rm -t bert
```
### 2.b Start of the NGC container to run inference:

View file

@ -84,7 +84,7 @@
"import os\n",
"import sys\n",
"\n",
"data_dir = '/workspace/bert/data/download'\n",
"data_dir = '../data/download'\n",
"\n",
"# SQuAD json for training\n",
"train_file = os.path.join(data_dir, 'squad/v1.1/train-v1.1.json')\n",
@ -141,7 +141,7 @@
"|BERTBASE |12 encoder| 768| 12|4 x 768|512|110M|\n",
"|BERTLARGE|24 encoder|1024| 16|4 x 1024|512|330M|\n",
"\n",
"We will large use pre-trained models avaialble on NGC (NVIDIA GPU Cluster, https://ngc.nvidia.com).\n",
"We will use large pre-trained models avaialble on NGC (NVIDIA GPU Cluster, https://ngc.nvidia.com).\n",
"There are many configuration available, in particular we will download and use the following:\n",
"\n",
"**bert_tf_large_fp16_384**\n",
@ -164,7 +164,7 @@
"outputs": [],
"source": [
"# bert_tf_large_fp16_384\n",
"DATA_DIR_FP16 = '/workspace/bert/data/download/pretrained_model_fp16'\n",
"DATA_DIR_FP16 = data_dir + '/pretrained_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/bert_for_tensorflow.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_for_tensorflow/versions/1/zip\n",
@ -184,9 +184,9 @@
"metadata": {},
"outputs": [],
"source": [
"notebooks_dir = '/workspace/bert/notebooks'\n",
"notebooks_dir = '../notebooks'\n",
"\n",
"working_dir = '/workspace/bert'\n",
"working_dir = '..'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)\n",
"\n",
@ -257,7 +257,10 @@
"if 'f' not in tf.flags.FLAGS: \n",
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"# FLAGS.verbose_logging = True\n",
"\n",
"verbose_logging = True\n",
"# Set to True if the dataset has samples with no answers. For SQuAD 1.1, this is set to False\n",
"version_2_with_negative = False\n",
"\n",
"# The total number of n-best predictions to generate in the nbest_predictions.json output file.\n",
"n_best_size = 20\n",
@ -536,7 +539,10 @@
" end_logits=end_logits))\n",
"\n",
"eval_time_elapsed = time.time() - eval_start_time\n",
"eval_time_wo_startup = eval_hooks[-1].total_time\n",
"\n",
"time_list = eval_hooks[-1].time_list\n",
"time_list.sort()\n",
"eval_time_wo_startup = sum(time_list[:int(len(time_list) * 0.99)])\n",
"num_sentences = eval_hooks[-1].count * predict_batch_size\n",
"avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup\n",
"\n",
@ -554,7 +560,8 @@
"run_squad.write_predictions(eval_examples, eval_features, all_results,\n",
" n_best_size, max_answer_length,\n",
" do_lower_case, output_prediction_file,\n",
" output_nbest_file, output_null_log_odds_file)\n",
" output_nbest_file, output_null_log_odds_file,\n",
" version_2_with_negative, verbose_logging)\n",
"\n",
"tf.logging.info(\"Inference Results:\")\n",
"\n",
@ -585,7 +592,7 @@
"metadata": {},
"outputs": [],
"source": [
"!python /workspace/bert/data/download/squad/v1.1/evaluate-v1.1.py \\\n",
"!python ../data/download/squad/v1.1/evaluate-v1.1.py \\\n",
" $predict_file \\\n",
" $output_dir/predictions.json"
]
@ -596,7 +603,7 @@
"source": [
"## 6. What's next\n",
"\n",
"Now that you have fine-tuned a BERT model you may want to take a look ad the run_squad script which containd more options for fine-tuning."
"Now that you have fine-tuned a BERT model you may want to take a look at the run_squad script which containd more options for fine-tuning."
]
}
],
@ -616,7 +623,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -162,10 +162,10 @@
"import os\n",
"import sys\n",
"\n",
"notebooks_dir = '/workspace/bert/notebooks'\n",
"data_dir = '/workspace/bert/data/download'\n",
"notebooks_dir = '../notebooks'\n",
"data_dir = '../data/download'\n",
"\n",
"working_dir = '/workspace/bert'\n",
"working_dir = '../'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)"
]
@ -239,7 +239,7 @@
"metadata": {},
"outputs": [],
"source": [
"#input_file = '/workspace/bert/data/download/squad/v2.0/dev-v2.0.json'"
"#input_file = '../data/download/squad/v2.0/dev-v2.0.json'"
]
},
{
@ -272,14 +272,14 @@
"outputs": [],
"source": [
"# bert_tf_v2_large_fp32_384\n",
"DATA_DIR_FP32='/workspace/bert/data/download/finetuned_model_fp32'\n",
"DATA_DIR_FP32 = data_dir + '/finetuned_model_fp32'\n",
"!mkdir -p $DATA_DIR_FP32\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP32/bert_tf_v2_large_fp32_384.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v2_large_fp32_384/versions/1/zip\n",
"!unzip -n -d $DATA_DIR_FP32/ $DATA_DIR_FP32/bert_tf_v2_large_fp32_384.zip \n",
" \n",
"# bert_tf_v2_large_fp16_384\n",
"DATA_DIR_FP16='/workspace/bert/data/download/finetuned_model_fp16'\n",
"DATA_DIR_FP16 = data_dir + '/finetuned_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/bert_tf_v2_large_fp16_384.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v2_large_fp16_384/versions/1/zip\n",
@ -363,6 +363,10 @@
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"\n",
"verbose_logging = True\n",
"# Set to True if the dataset has samples with no answers. For SQuAD 1.1, this is set to False\n",
"version_2_with_negative = False\n",
"\n",
"# The total number of n-best predictions to generate in the nbest_predictions.json output file.\n",
"n_best_size = 20\n",
"\n",
@ -501,7 +505,9 @@
"\n",
"eval_time_elapsed = time.time() - eval_start_time\n",
"\n",
"eval_time_wo_startup = eval_hooks[-1].total_time\n",
"time_list = eval_hooks[-1].time_list\n",
"time_list.sort()\n",
"eval_time_wo_startup = sum(time_list[:int(len(time_list) * 0.99)])\n",
"num_sentences = eval_hooks[-1].count * predict_batch_size\n",
"avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup\n",
"\n",
@ -519,7 +525,8 @@
"run_squad.write_predictions(eval_examples, eval_features, all_results,\n",
" n_best_size, max_answer_length,\n",
" do_lower_case, output_prediction_file,\n",
" output_nbest_file, output_null_log_odds_file)\n",
" output_nbest_file, output_null_log_odds_file,\n",
" version_2_with_negative, verbose_logging)\n",
"\n",
"tf.logging.info(\"Inference Results:\")\n",
"\n",
@ -569,7 +576,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -527,6 +527,10 @@
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"\n",
"verbose_logging = True\n",
"# Set to True if the dataset has samples with no answers. For SQuAD 1.1, this is set to False\n",
"version_2_with_negative = False\n",
"\n",
"# The total number of n-best predictions to generate in the nbest_predictions.json output file.\n",
"n_best_size = 20\n",
"\n",
@ -678,7 +682,9 @@
"\n",
"eval_time_elapsed = time.time() - eval_start_time\n",
"\n",
"eval_time_wo_startup = eval_hooks[-1].total_time\n",
"time_list = eval_hooks[-1].time_list\n",
"time_list.sort()\n",
"eval_time_wo_startup = sum(time_list[:int(len(time_list) * 0.99)])\n",
"num_sentences = eval_hooks[-1].count * predict_batch_size\n",
"avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup\n",
"\n",
@ -757,7 +763,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -169,8 +169,8 @@
"import os\n",
"import sys\n",
"\n",
"notebooks_dir = '/workspace/bert/notebooks'\n",
"working_dir = '/workspace/bert'\n",
"notebooks_dir = '../notebooks'\n",
"working_dir = '../'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)"
]
@ -267,7 +267,7 @@
"outputs": [],
"source": [
"# biobert_uncased_base_ner_disease\n",
"DATA_DIR_FP16='/workspace/bert/data/download/finetuned_model_fp16'\n",
"DATA_DIR_FP16 = '../data/download/finetuned_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/biobert_uncased_base_ner_disease.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/biobert_uncased_base_ner_disease/versions/1/zip\n",
@ -602,7 +602,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
"version": "3.6.9"
}
},
"nbformat": 4,

View file

@ -95,8 +95,15 @@ def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
if hvd is not None and (num_accumulation_steps == 1 or (not allreduce_post_accumulation)):
optimizer = hvd.DistributedOptimizer(optimizer, sparse_as_dense=True, compression=Compression.fp16 if use_fp16 or manual_fp16 else Compression.none)
if manual_fp16 or use_fp16:
loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(init_loss_scale=2**32, incr_every_n_steps=1000, decr_every_n_nan_or_inf=2, decr_ratio=0.5)
if use_fp16:
loss_scaler = tf.train.experimental.DynamicLossScale(initial_loss_scale=2**32, increment_period=1000, multiplier=2.0)
optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer, loss_scaler)
loss_scale_value = tf.identity(loss_scaler(), name="loss_scale")
if manual_fp16:
loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(init_loss_scale=2 ** 32,
incr_every_n_steps=1000,
decr_every_n_nan_or_inf=2,
decr_ratio=0.5)
optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(optimizer, loss_scale_manager)
tvars = tf.trainable_variables()
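Condensed, the loss-scaling change above does the following: with `--use_fp16`, automatic mixed precision is enabled through TensorFlow's graph rewrite together with a dynamic loss scale, while `--manual_fp16` keeps the older `tf.contrib` loss-scale manager. The sketch below restates just that selection with the same hyperparameters; it is a simplified rendering of the diff, targeting the TF 1.15-era API shipped in the 20.03-tf1 container.

```python
# Sketch of the fp16 loss-scaling selection (TF 1.15-era APIs, mirroring the diff above).
import tensorflow as tf

def wrap_with_loss_scaling(optimizer, use_fp16, manual_fp16):
    if use_fp16:
        # Dynamic loss scale: start huge, halve on overflow, double every 1000 good steps.
        loss_scaler = tf.train.experimental.DynamicLossScale(
            initial_loss_scale=2**32, increment_period=1000, multiplier=2.0)
        # Auto mixed precision: rewrites eligible ops to fp16 and applies the scaler.
        optimizer = tf.train.experimental.enable_mixed_precision_graph_rewrite(
            optimizer, loss_scaler)
        tf.identity(loss_scaler(), name="loss_scale")  # expose the current scale for logging hooks
    elif manual_fp16:
        loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32, incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2, decr_ratio=0.5)
        optimizer = tf.contrib.mixed_precision.LossScaleOptimizer(optimizer, loss_scale_manager)
    return optimizer
```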
@ -149,7 +156,10 @@ def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None,
update_op = tf.cond(update_step,
lambda: update(accum_vars), lambda: tf.no_op())
new_global_step = tf.cond(tf.math.logical_and(update_step, tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)), lambda: global_step+1, lambda: global_step)
new_global_step = tf.cond(tf.math.logical_and(update_step,
tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)) if hvd is not None else batch_finite,
lambda: global_step+1,
lambda: global_step)
new_global_step = tf.identity(new_global_step, name='step_update')
train_op = tf.group(update_op, [global_step.assign(new_global_step)])
else:
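The other change in this file gates the global-step increment during gradient accumulation: the step only advances on an update micro-step whose accumulated gradients look finite, checked across ranks via a Horovod allreduce when multi-GPU training is on. A simplified sketch of that guard, assuming `update_step` and `batch_finite` are the scalar boolean tensors produced earlier in `create_optimizer`, is:

```python
# Simplified guard for advancing the global step (TF1 graph mode; hvd is Horovod or None).
import tensorflow as tf

def next_global_step(global_step, update_step, batch_finite, hvd=None):
    if hvd is not None:
        # Allreduce the per-rank finiteness flags and cast back to bool (any non-zero
        # value counts as True), mirroring the check in the diff above.
        all_finite = tf.cast(hvd.allreduce(tf.cast(batch_finite, tf.int32)), tf.bool)
        should_step = tf.math.logical_and(update_step, all_finite)
    else:
        should_step = tf.math.logical_and(update_step, batch_finite)
    return tf.cond(should_step, lambda: global_step + 1, lambda: global_step)
```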

View file

@ -61,7 +61,7 @@ flags.DEFINE_string(
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -301,6 +301,7 @@ def model_fn_builder(task_name, bert_config, num_labels, init_checkpoint, learni
"eval_loss": loss,
}
tf.compat.v1.logging.info("*** Features ***")
tf.compat.v1.logging.info("*** Features ***")
for name in sorted(features.keys()):
tf.compat.v1.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
@ -428,13 +429,14 @@ def input_fn_builder(features, batch_size, seq_length, is_training, drop_remaind
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"cola": ColaProcessor,
"mnli": MnliProcessor,
@ -597,11 +599,12 @@ def main(_):
result = estimator.evaluate(input_fn=eval_input_fn, hooks=eval_hooks)
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.eval_batch_size
avg = np.mean(time_list)
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
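The same outlier-trimming pattern recurs in run_ner.py, run_pretraining.py, the BioBERT classifier, and the notebooks: the per-batch latencies collected by the eval hook are sorted and the slowest ~1% are dropped before computing throughput, so init and warmup spikes no longer skew the reported sentences/sec. A self-contained toy version of the computation, with made-up timings, is:

```python
# Toy version of the outlier-trimmed throughput above (timings are made up).
import numpy as np

time_list = [0.012, 0.011, 0.013, 0.011, 0.012, 0.250]  # seconds per batch; last one is a warmup spike
batch_size = 8

time_list.sort()                               # ascending, so the spikes end up at the tail
kept = time_list[:int(len(time_list) * 0.99)]  # drop the slowest ~1% (here, the single spike)
eval_time_wo_overhead = sum(kept)
num_sentences = len(kept) * batch_size
print("throughput: %.1f sentences/sec" % (num_sentences / eval_time_wo_overhead))

# Latency percentiles are read off the same sorted list, e.g. the median:
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
avg = np.mean(time_list)
```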
@ -615,7 +618,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics on EVAL set")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -57,7 +57,7 @@ flags.DEFINE_string(
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -174,7 +174,7 @@ class DataProcessor(object):
@classmethod
def _read_data(cls, input_file):
"""Reads a BIO data."""
with tf.io.gfile.Open(input_file, "r") as f:
with open(input_file, "r") as f:
lines = []
words = []
labels = []
@ -321,8 +321,8 @@ def convert_single_example(ex_index, example, label_list, max_seq_length, tokeni
for (i, label) in enumerate(label_list, 1):
label_map[label] = i
label2id_file = os.path.join(FLAGS.output_dir, 'label2id.pkl')
if not tf.io.gfile.Exists(label2id_file):
with tf.io.gfile.Open(label2id_file, 'wb') as w:
if not os.path.exists(label2id_file):
with open(label2id_file, 'wb') as w:
pickle.dump(label_map, w)
textlist = example.text.split(' ')
labellist = example.label.split(' ')
@ -613,13 +613,13 @@ def result_to_pair(predict_line, pred_ids, id2label, writer, err_writer):
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"bc5cdr": BC5CDRProcessor,
@ -828,11 +828,12 @@ def main(_):
i = i + 1
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.predict_batch_size
avg = np.mean(time_list)
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
@ -846,7 +847,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -56,7 +56,7 @@ flags.DEFINE_string(
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -125,18 +125,24 @@ flags.DEFINE_bool("use_fp16", False, "Whether to enable AMP ops.")
# report samples/sec, total loss and learning rate during training
class _LogSessionRunHook(tf.estimator.SessionRunHook):
def __init__(self, global_batch_size, num_accumulation_steps, dllogging, display_every=10, hvd_rank=-1):
def __init__(self, global_batch_size, num_accumulation_steps, dllogging, display_every=10,
save_ckpt_steps=1000, report_loss=True, hvd_rank=-1):
self.global_batch_size = global_batch_size
self.display_every = display_every
self.save_ckpt_steps = save_ckpt_steps
self.hvd_rank = hvd_rank
self.num_accumulation_steps = num_accumulation_steps
self.dllogging = dllogging
self.report_loss = report_loss
def after_create_session(self, session, coord):
self.elapsed_secs = 0.
self.count = 0
self.all_count = 0
self.avg_loss = 0.0
self.elapsed_secs = 0.0 #elapsed seconds between every print
self.count = 0 # number of global steps between every print
self.all_count = 0 #number of steps (including accumulation) between every print
self.loss = 0.0 # accumulation of loss in each step between every print
self.total_time = 0.0 # total time taken to train (excluding warmup + ckpt saving steps)
self.init_global_step = session.run(tf.train.get_global_step()) # training starts at init_global_step
def before_run(self, run_context):
self.t0 = time.time()
@ -153,40 +159,49 @@ class _LogSessionRunHook(tf.estimator.SessionRunHook):
'mlm_loss:0'])
else:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
return tf.estimator.SessionRunArgs(
fetches=['step_update:0', 'update_step:0', 'total_loss:0',
'learning_rate:0', 'nsp_loss:0',
'mlm_loss:0', 'loss_scale:0'])
return tf.estimator.SessionRunArgs(
fetches=['step_update:0', 'update_step:0', 'total_loss:0',
'learning_rate:0', 'nsp_loss:0',
'mlm_loss:0', 'loss_scale:0'])
else:
return tf.estimator.SessionRunArgs(
fetches=['step_update:0', 'update_step:0', 'total_loss:0',
'learning_rate:0', 'nsp_loss:0',
'mlm_loss:0'])
def after_run(self, run_context, run_values):
self.elapsed_secs += time.time() - self.t0
run_time = time.time() - self.t0
if self.num_accumulation_steps <=1:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
global_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
self.global_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
else:
global_step, total_loss, lr, nsp_loss, mlm_loss = run_values. \
self.global_step, total_loss, lr, nsp_loss, mlm_loss = run_values. \
results
update_step = True
else:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
global_step, update_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
self.global_step, update_step, total_loss, lr, nsp_loss, mlm_loss, loss_scaler = run_values.results
else:
global_step, update_step, total_loss, lr, nsp_loss, mlm_loss = run_values.\
self.global_step, update_step, total_loss, lr, nsp_loss, mlm_loss = run_values.\
results
print_step = global_step + 1 # One-based index for printing.
self.avg_loss += total_loss
# Removing the first few steps after every checkpoint save from timing
if (self.global_step - self.init_global_step) % self.save_ckpt_steps <= 5:
print("Skipping time record for ", self.global_step, " due to checkpoint-saving/warmup overhead")
else:
self.total_time += run_time
self.elapsed_secs += run_time
print_step = self.global_step + 1 # One-based index for printing.
self.loss += total_loss
self.all_count += 1
if update_step:
self.count += 1
if (print_step == 1 or print_step % self.display_every == 0):
dt = self.elapsed_secs / self.count
sent_per_sec = self.global_batch_size / dt
avg_loss_step = self.avg_loss / self.all_count
if self.hvd_rank >= 0:
avg_loss_step = self.loss / self.all_count
if self.hvd_rank >= 0 and FLAGS.report_loss:
if FLAGS.manual_fp16 or FLAGS.use_fp16:
self.dllogging.logger.log(step=(print_step),
data={"Rank": int(rank), "throughput_train": float(sent_per_sec),
@ -216,10 +231,15 @@ class _LogSessionRunHook(tf.estimator.SessionRunHook):
"total_loss":float(total_loss), "avg_loss_step":float(avg_loss_step),
"learning_rate": str(lr)},
verbosity=Verbosity.DEFAULT)
self.elapsed_secs = 0.
self.elapsed_secs = 0.0
self.count = 0
self.avg_loss = 0.0
self.loss = 0.0
self.all_count = 0
def end(self, session):
num_global_steps = self.global_step - self.init_global_step
self.skipped = (num_global_steps // self.save_ckpt_steps) * 5 + \
min(5, num_global_steps % self.save_ckpt_steps)
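The reworked `_LogSessionRunHook` now also excludes checkpoint-saving and warmup overhead from `total_time`: steps that fall within the first few global steps after each checkpoint save are not timed, and `end()` records how many were skipped so the pretraining summary can report throughput without that overhead. A toy check of the bookkeeping, with hypothetical numbers, is:

```python
# Toy check of the skipped-step bookkeeping above (numbers are hypothetical).
save_ckpt_steps = 1000
num_global_steps = 2500  # global steps run during this session

# Roughly 5 steps are excluded per completed checkpoint interval, plus the first
# min(5, remainder) steps of the final partial interval.
skipped = (num_global_steps // save_ckpt_steps) * 5 + min(5, num_global_steps % save_ckpt_steps)
print(skipped)  # 2 * 5 + 5 = 15 steps excluded from total_time
```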
def model_fn_builder(bert_config, init_checkpoint, learning_rate,
num_train_steps, num_warmup_steps,
@ -519,15 +539,14 @@ def _decode_record(record, name_to_features):
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if not FLAGS.do_train and not FLAGS.do_eval:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
if FLAGS.horovod:
import horovod.tensorflow as hvd
hvd.init()
@ -564,6 +583,7 @@ def main(_):
model_dir=FLAGS.output_dir,
session_config=config,
save_checkpoints_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
save_summary_steps=FLAGS.save_checkpoints_steps if not FLAGS.horovod or hvd.rank() == 0 else None,
# This variable controls how often estimator reports examples/sec.
# Default value is every 100 steps.
# When --report_loss is True, we set to very large value to prevent
@ -580,18 +600,19 @@ def main(_):
use_one_hot_embeddings=False,
hvd=None if not FLAGS.horovod else hvd)
training_hooks = []
if FLAGS.report_loss and (not FLAGS.horovod or hvd.rank() == 0):
global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps if not FLAGS.horovod else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * hvd.size()
training_hooks.append(_LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps, dllogging, FLAGS.display_loss_steps))
if FLAGS.horovod and hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
estimator = tf.estimator.Estimator(
model_fn=model_fn,
config=run_config)
if FLAGS.do_train:
training_hooks = []
if FLAGS.horovod and hvd.size() > 1:
training_hooks.append(hvd.BroadcastGlobalVariablesHook(0))
if (not FLAGS.horovod or hvd.rank() == 0):
global_batch_size = FLAGS.train_batch_size * FLAGS.num_accumulation_steps if not FLAGS.horovod else FLAGS.train_batch_size * FLAGS.num_accumulation_steps * hvd.size()
training_hooks.append(_LogSessionRunHook(global_batch_size, FLAGS.num_accumulation_steps, dllogging, FLAGS.display_loss_steps, FLAGS.save_checkpoints_steps, FLAGS.report_loss))
tf.compat.v1.logging.info("***** Running training *****")
tf.compat.v1.logging.info(" Batch size = %d", FLAGS.train_batch_size)
train_input_fn = input_fn_builder(
@ -602,7 +623,24 @@ def main(_):
is_training=True,
hvd=None if not FLAGS.horovod else hvd)
train_start_time = time.time()
estimator.train(input_fn=train_input_fn, hooks=training_hooks, max_steps=FLAGS.num_train_steps)
train_time_elapsed = time.time() - train_start_time
if (not FLAGS.horovod or hvd.rank() == 0):
train_time_wo_overhead = training_hooks[-1].total_time
avg_sentences_per_second = FLAGS.num_train_steps * global_batch_size * 1.0 / train_time_elapsed
ss_sentences_per_second = (FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size * 1.0 / train_time_wo_overhead
tf.compat.v1.logging.info("-----------------------------")
tf.compat.v1.logging.info("Total Training Time = %0.2f for Sentences = %d", train_time_elapsed,
FLAGS.num_train_steps * global_batch_size)
tf.compat.v1.logging.info("Total Training Time W/O Overhead = %0.2f for Sentences = %d", train_time_wo_overhead,
(FLAGS.num_train_steps - training_hooks[-1].skipped) * global_batch_size)
tf.compat.v1.logging.info("Throughput Average (sentences/sec) with overhead = %0.2f", avg_sentences_per_second)
tf.compat.v1.logging.info("Throughput Average (sentences/sec) = %0.2f", ss_sentences_per_second)
dllogging.logger.log(step=(), data={"throughput_train": ss_sentences_per_second}, verbosity=Verbosity.DEFAULT)
tf.compat.v1.logging.info("-----------------------------")
if FLAGS.do_eval and (not FLAGS.horovod or hvd.rank() == 0):
tf.compat.v1.logging.info("***** Running evaluation *****")
@ -626,9 +664,11 @@ def main(_):
input_fn=eval_input_fn, steps=FLAGS.max_eval_steps, hooks=eval_hooks)
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size
time_list = eval_hooks[-1].time_list
time_list.sort()
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.eval_batch_size
ss_sentences_per_second = num_sentences * 1.0 / eval_time_wo_overhead
@ -636,7 +676,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.eval_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics on EVAL set")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.eval_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -65,7 +65,7 @@ flags.DEFINE_string(
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
@ -719,13 +719,13 @@ def convert_examples_to_features(examples, label_list, max_seq_length,
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
processors = {
"chemprot": BioBERTChemprotProcessor,
@ -946,11 +946,12 @@ def main(_):
assert num_written_lines == num_actual_predict_examples
eval_time_elapsed = time.time() - eval_start_time
eval_time_wo_overhead = eval_hooks[-1].total_time
time_list = eval_hooks[-1].time_list
time_list.sort()
num_sentences = (eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size
# Removing outliers (init/warmup) in throughput computation.
eval_time_wo_overhead = sum(time_list[:int(len(time_list) * 0.99)])
num_sentences = (int(len(time_list) * 0.99)) * FLAGS.predict_batch_size
avg = np.mean(time_list)
cf_50 = max(time_list[:int(len(time_list) * 0.50)])
@ -964,7 +965,7 @@ def main(_):
tf.compat.v1.logging.info("Total Inference Time = %0.2f for Sentences = %d", eval_time_elapsed,
eval_hooks[-1].count * FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Total Inference Time W/O Overhead = %0.2f for Sentences = %d", eval_time_wo_overhead,
(eval_hooks[-1].count - eval_hooks[-1].skipped) * FLAGS.predict_batch_size)
num_sentences)
tf.compat.v1.logging.info("Summary Inference Statistics")
tf.compat.v1.logging.info("Batch size = %d", FLAGS.predict_batch_size)
tf.compat.v1.logging.info("Sequence Length = %d", FLAGS.max_seq_length)

View file

@ -41,134 +41,139 @@ import utils.dllogger_class
from dllogger import Verbosity
flags = tf.flags
FLAGS = None
FLAGS = flags.FLAGS
def extract_run_squad_flags():
## Required parameters
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
## Required parameters
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
## Other parameters
## Other parameters
flags.DEFINE_string(
"dllog_path", "bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string(
"dllog_path", "/results/bert_dllog.json",
"filename where dllogger writes to")
flags.DEFINE_string("train_file", None,
"SQuAD json for training. E.g., train-v1.1.json")
flags.DEFINE_string("train_file", None,
"SQuAD json for training. E.g., train-v1.1.json")
flags.DEFINE_string(
"predict_file", None,
"SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
flags.DEFINE_string(
"eval_script", None,
"SQuAD evaluate.py file to compute f1 and exact_match E.g., evaluate-v1.1.py")
flags.DEFINE_string(
"predict_file", None,
"SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
flags.DEFINE_string(
"eval_script", None,
"SQuAD evaluate.py file to compute f1 and exact_match E.g., evaluate-v1.1.py")
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_string(
"init_checkpoint", None,
"Initial checkpoint (usually from a pre-trained BERT model).")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 384,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_integer(
"max_seq_length", 384,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_integer(
"doc_stride", 128,
"When splitting up a long document into chunks, how much stride to "
"take between chunks.")
flags.DEFINE_integer(
"doc_stride", 128,
"When splitting up a long document into chunks, how much stride to "
"take between chunks.")
flags.DEFINE_integer(
"max_query_length", 64,
"The maximum number of tokens for the question. Questions longer than "
"this will be truncated to this length.")
flags.DEFINE_integer(
"max_query_length", 64,
"The maximum number of tokens for the question. Questions longer than "
"this will be truncated to this length.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_train", False, "Whether to run training.")
flags.DEFINE_bool("do_predict", False, "Whether to run eval on the dev set.")
flags.DEFINE_bool("do_predict", False, "Whether to run eval on the dev set.")
flags.DEFINE_integer("train_batch_size", 8, "Total batch size for training.")
flags.DEFINE_integer("train_batch_size", 8, "Total batch size for training.")
flags.DEFINE_integer("predict_batch_size", 8,
"Total batch size for predictions.")
flags.DEFINE_integer("predict_batch_size", 8,
"Total batch size for predictions.")
flags.DEFINE_float("learning_rate", 5e-6, "The initial learning rate for Adam.")
flags.DEFINE_float("learning_rate", 5e-6, "The initial learning rate for Adam.")
flags.DEFINE_bool("use_trt", False, "Whether to use TF-TRT")
flags.DEFINE_bool("use_trt", False, "Whether to use TF-TRT")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_bool("horovod", False, "Whether to use Horovod for multi-gpu runs")
flags.DEFINE_float("num_train_epochs", 3.0,
"Total number of training epochs to perform.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_float(
"warmup_proportion", 0.1,
"Proportion of training to perform linear learning rate warmup for. "
"E.g., 0.1 = 10% of training.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("save_checkpoints_steps", 1000,
"How often to save the model checkpoint.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_integer("iterations_per_loop", 1000,
"How many steps to make in each estimator call.")
flags.DEFINE_integer("num_accumulation_steps", 1,
"Number of accumulation steps before gradient update"
"Global batch size = num_accumulation_steps * train_batch_size")
flags.DEFINE_integer("num_accumulation_steps", 1,
"Number of accumulation steps before gradient update"
"Global batch size = num_accumulation_steps * train_batch_size")
flags.DEFINE_integer(
"n_best_size", 20,
"The total number of n-best predictions to generate in the "
"nbest_predictions.json output file.")
flags.DEFINE_integer(
"n_best_size", 20,
"The total number of n-best predictions to generate in the "
"nbest_predictions.json output file.")
flags.DEFINE_integer(
"max_answer_length", 30,
"The maximum length of an answer that can be generated. This is needed "
"because the start and end predictions are not conditioned on one another.")
flags.DEFINE_integer(
"max_answer_length", 30,
"The maximum length of an answer that can be generated. This is needed "
"because the start and end predictions are not conditioned on one another.")
flags.DEFINE_bool(
"verbose_logging", False,
"If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
flags.DEFINE_bool(
"verbose_logging", False,
"If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
flags.DEFINE_bool(
"version_2_with_negative", False,
"If true, the SQuAD examples contain some that do not have an answer.")
flags.DEFINE_bool(
"version_2_with_negative", False,
"If true, the SQuAD examples contain some that do not have an answer.")
flags.DEFINE_float(
"null_score_diff_threshold", 0.0,
"If null_score - best_non_null is greater than the threshold predict null.")
flags.DEFINE_float(
"null_score_diff_threshold", 0.0,
"If null_score - best_non_null is greater than the threshold predict null.")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
flags.DEFINE_integer("num_eval_iterations", None,
"How many eval iterations to run - performs inference on subset")
flags.DEFINE_bool("use_fp16", False, "Whether to use fp32 or fp16 arithmetic on GPU.")
flags.DEFINE_bool("use_xla", False, "Whether to enable XLA JIT compilation.")
flags.DEFINE_integer("num_eval_iterations", None,
"How many eval iterations to run - performs inference on subset")
# TRTIS Specific flags
flags.DEFINE_bool("export_trtis", False, "Whether to export saved model or run inference with TRTIS")
flags.DEFINE_string("trtis_model_name", "bert", "exports to appropriate directory for TRTIS")
flags.DEFINE_integer("trtis_model_version", 1, "exports to appropriate directory for TRTIS")
flags.DEFINE_string("trtis_server_url", "localhost:8001", "exports to appropriate directory for TRTIS")
flags.DEFINE_bool("trtis_model_overwrite", False, "If True, will overwrite an existing directory with the specified 'model_name' and 'version_name'")
flags.DEFINE_integer("trtis_max_batch_size", 8, "Specifies the 'max_batch_size' in the TRTIS model config. See the TRTIS documentation for more info.")
flags.DEFINE_float("trtis_dyn_batching_delay", 0, "Determines the dynamic_batching queue delay in milliseconds(ms) for the TRTIS model config. Use '0' or '-1' to specify static batching. See the TRTIS documentation for more info.")
flags.DEFINE_integer("trtis_engine_count", 1, "Specifies the 'instance_group' count value in the TRTIS model config. See the TRTIS documentation for more info.")
# Triton Specific flags
flags.DEFINE_bool("export_triton", False, "Whether to export saved model or run inference with Triton")
flags.DEFINE_string("triton_model_name", "bert", "exports to appropriate directory for Triton")
flags.DEFINE_integer("triton_model_version", 1, "exports to appropriate directory for Triton")
flags.DEFINE_string("triton_server_url", "localhost:8001", "exports to appropriate directory for Triton")
flags.DEFINE_bool("triton_model_overwrite", False, "If True, will overwrite an existing directory with the specified 'model_name' and 'version_name'")
flags.DEFINE_integer("triton_max_batch_size", 8, "Specifies the 'max_batch_size' in the Triton model config. See the Triton documentation for more info.")
flags.DEFINE_float("triton_dyn_batching_delay", 0, "Determines the dynamic_batching queue delay in milliseconds(ms) for the Triton model config. Use '0' or '-1' to specify static batching. See the Triton documentation for more info.")
flags.DEFINE_integer("triton_engine_count", 1, "Specifies the 'instance_group' count value in the Triton model config. See the Triton documentation for more info.")
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
return flags.FLAGS
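The net effect of the block above is that `run_squad.py` no longer registers its flags at import time: they are defined inside `extract_run_squad_flags()`, which the `__main__` block calls before `tf.app.run()`. One likely motivation is to let other entry points, such as the Triton export and client scripts, import pieces of `run_squad` without pulling in its whole flag set. A stripped-down sketch of the pattern, with a hypothetical `output_dir` flag standing in for the real list, is:

```python
# Stripped-down sketch of the "flags in a function" pattern (hypothetical flag name).
import tensorflow as tf

flags = tf.flags
FLAGS = None  # populated only when this file is run as a script

def extract_my_flags():
    flags.DEFINE_string("output_dir", None, "Where outputs are written.")
    flags.mark_flag_as_required("output_dir")
    return flags.FLAGS

def main(_):
    tf.compat.v1.logging.info("output_dir = %s", FLAGS.output_dir)

if __name__ == "__main__":
    FLAGS = extract_my_flags()
    tf.app.run()  # e.g. python this_file.py --output_dir=/tmp/out
```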
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
use_one_hot_embeddings):
@ -434,12 +439,9 @@ RawResult = collections.namedtuple("RawResult",
["unique_id", "start_logits", "end_logits"])
def write_predictions(all_examples, all_features, all_results, n_best_size,
max_answer_length, do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file):
"""Write final predictions to the json file and log-odds of null if needed."""
tf.compat.v1.logging.info("Writing predictions to: %s" % (output_prediction_file))
tf.compat.v1.logging.info("Writing nbest to: %s" % (output_nbest_file))
def get_predictions(all_examples, all_features, all_results, n_best_size, max_answer_length,
do_lower_case, version_2_with_negative, verbose_logging):
"""Get final predictions"""
example_index_to_features = collections.defaultdict(list)
for feature in all_features:
@ -471,7 +473,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
start_indexes = _get_best_indexes(result.start_logits, n_best_size)
end_indexes = _get_best_indexes(result.end_logits, n_best_size)
# if we could have irrelevant answers, get the min score of irrelevant
if FLAGS.version_2_with_negative:
if version_2_with_negative:
feature_null_score = result.start_logits[0] + result.end_logits[0]
if feature_null_score < score_null:
score_null = feature_null_score
@ -506,7 +508,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
start_logit=result.start_logits[start_index],
end_logit=result.end_logits[end_index]))
if FLAGS.version_2_with_negative:
if version_2_with_negative:
prelim_predictions.append(
_PrelimPrediction(
feature_index=min_null_feature_index,
@ -544,7 +546,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
tok_text = " ".join(tok_text.split())
orig_text = " ".join(orig_tokens)
final_text = get_final_text(tok_text, orig_text, do_lower_case)
final_text = get_final_text(tok_text, orig_text, do_lower_case, verbose_logging)
if final_text in seen_predictions:
continue
@ -560,7 +562,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
end_logit=pred.end_logit))
# if we didn't include the empty option in the n-best, include it
if FLAGS.version_2_with_negative:
if version_2_with_negative:
if "" not in seen_predictions:
nbest.append(
_NbestPrediction(
@ -595,7 +597,7 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
assert len(nbest_json) >= 1
if not FLAGS.version_2_with_negative:
if not version_2_with_negative:
all_predictions[example.qas_id] = nbest_json[0]["text"]
else:
# predict "" iff the null score - the score of best non-null > threshold
@ -608,6 +610,19 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
all_predictions[example.qas_id] = best_non_null_entry.text
all_nbest_json[example.qas_id] = nbest_json
return all_predictions, all_nbest_json, scores_diff_json
def write_predictions(all_examples, all_features, all_results, n_best_size,
max_answer_length, do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file,
version_2_with_negative, verbose_logging):
"""Write final predictions to the json file and log-odds of null if needed."""
tf.compat.v1.logging.info("Writing predictions to: %s" % (output_prediction_file))
tf.compat.v1.logging.info("Writing nbest to: %s" % (output_nbest_file))
all_predictions, all_nbest_json, scores_diff_json = get_predictions(all_examples, all_features,
all_results, n_best_size, max_answer_length, do_lower_case, version_2_with_negative, verbose_logging)
with tf.io.gfile.GFile(output_prediction_file, "w") as writer:
writer.write(json.dumps(all_predictions, indent=4) + "\n")
@ -615,12 +630,12 @@ def write_predictions(all_examples, all_features, all_results, n_best_size,
with tf.io.gfile.GFile(output_nbest_file, "w") as writer:
writer.write(json.dumps(all_nbest_json, indent=4) + "\n")
if FLAGS.version_2_with_negative:
if version_2_with_negative:
with tf.io.gfile.GFile(output_null_log_odds_file, "w") as writer:
writer.write(json.dumps(scores_diff_json, indent=4) + "\n")
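In other words, the prediction logic is now split in two: `get_predictions()` returns the prediction dictionaries so callers (the notebooks, or a Triton client) can post-process them in memory, and `write_predictions()` is a thin wrapper that also serializes them to JSON, with `version_2_with_negative` and `verbose_logging` passed explicitly instead of being read from `FLAGS`. A sketch of calling the refactored functions from another module (argument values are illustrative) looks like:

```python
# Sketch of using the refactored helpers from outside run_squad.py (illustrative values).
import run_squad

def postprocess_squad(eval_examples, eval_features, all_results, output_dir):
    # In-memory predictions, no files written:
    all_predictions, all_nbest_json, scores_diff_json = run_squad.get_predictions(
        eval_examples, eval_features, all_results,
        n_best_size=20, max_answer_length=30, do_lower_case=True,
        version_2_with_negative=False, verbose_logging=False)

    # Or write everything to disk in one call, passing the two new arguments
    # explicitly rather than relying on FLAGS:
    run_squad.write_predictions(
        eval_examples, eval_features, all_results,
        20, 30, True,
        output_dir + "/predictions.json",
        output_dir + "/nbest_predictions.json",
        output_dir + "/null_odds.json",
        False, False)
    return all_predictions
```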
def get_final_text(pred_text, orig_text, do_lower_case):
def get_final_text(pred_text, orig_text, do_lower_case, verbose_logging):
"""Project the tokenized prediction back to the original text."""
# When we created the data, we kept track of the alignment between original
@ -669,7 +684,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
start_position = tok_text.find(pred_text)
if start_position == -1:
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info(
"Unable to find text: '%s' in '%s'" % (pred_text, orig_text))
return orig_text
@ -679,7 +694,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
(tok_ns_text, tok_ns_to_s_map) = _strip_spaces(tok_text)
if len(orig_ns_text) != len(tok_ns_text):
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info("Length not equal after stripping spaces: '%s' vs '%s'",
orig_ns_text, tok_ns_text)
return orig_text
@ -697,7 +712,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
orig_start_position = orig_ns_to_s_map[ns_start_position]
if orig_start_position is None:
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info("Couldn't map start position")
return orig_text
@ -708,7 +723,7 @@ def get_final_text(pred_text, orig_text, do_lower_case):
orig_end_position = orig_ns_to_s_map[ns_end_position]
if orig_end_position is None:
if FLAGS.verbose_logging:
if verbose_logging:
tf.compat.v1.logging.info("Couldn't map end position")
return orig_text
@ -757,7 +772,7 @@ def validate_flags_or_throw(bert_config):
tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
FLAGS.init_checkpoint)
if not FLAGS.do_train and not FLAGS.do_predict and not FLAGS.export_trtis:
if not FLAGS.do_train and not FLAGS.do_predict and not FLAGS.export_triton:
raise ValueError("At least one of `do_train` or `do_predict` or `export_SavedModel` must be True.")
if FLAGS.do_train:
@ -782,7 +797,7 @@ def validate_flags_or_throw(bert_config):
def export_model(estimator, export_dir, init_checkpoint):
"""Exports a checkpoint in SavedModel format in a directory structure compatible with TRTIS."""
"""Exports a checkpoint in SavedModel format in a directory structure compatible with Triton."""
def serving_input_fn():
@ -806,10 +821,10 @@ def export_model(estimator, export_dir, init_checkpoint):
checkpoint_path=init_checkpoint,
strip_default_attrs=False)
model_name = FLAGS.trtis_model_name
model_name = FLAGS.triton_model_name
model_folder = export_dir + "/trtis_models/" + model_name
version_folder = model_folder + "/" + str(FLAGS.trtis_model_version)
model_folder = export_dir + "/triton_models/" + model_name
version_folder = model_folder + "/" + str(FLAGS.triton_model_version)
final_model_folder = version_folder + "/model.savedmodel"
if not os.path.exists(version_folder):
@ -819,19 +834,19 @@ def export_model(estimator, export_dir, init_checkpoint):
os.rename(saved_dir, final_model_folder)
print("Model saved to dir", final_model_folder)
else:
if (FLAGS.trtis_model_overwrite):
if (FLAGS.triton_model_overwrite):
shutil.rmtree(final_model_folder)
os.rename(saved_dir, final_model_folder)
print("WARNING: Existing model was overwritten. Model dir: {}".format(final_model_folder))
else:
print("ERROR: Could not save TRTIS model. Folder already exists. Use '--trtis_model_overwrite=True' if you would like to overwrite an existing model. Model dir: {}".format(final_model_folder))
print("ERROR: Could not save Triton model. Folder already exists. Use '--triton_model_overwrite=True' if you would like to overwrite an existing model. Model dir: {}".format(final_model_folder))
return
# Now build the config for TRTIS. Check to make sure we can overwrite it, if it exists
# Now build the config for Triton. Check to make sure we can overwrite it, if it exists
config_filename = os.path.join(model_folder, "config.pbtxt")
if (os.path.exists(config_filename) and not FLAGS.trtis_model_overwrite):
print("ERROR: Could not save TRTIS model config. Config file already exists. Use '--trtis_model_overwrite=True' if you would like to overwrite an existing model config. Model config: {}".format(config_filename))
if (os.path.exists(config_filename) and not FLAGS.triton_model_overwrite):
print("ERROR: Could not save Triton model config. Config file already exists. Use '--triton_model_overwrite=True' if you would like to overwrite an existing model config. Model config: {}".format(config_filename))
return
config_template = r"""
@ -883,9 +898,9 @@ instance_group [
]"""
batching_str = ""
max_batch_size = FLAGS.trtis_max_batch_size
max_batch_size = FLAGS.triton_max_batch_size
if (FLAGS.trtis_dyn_batching_delay > 0):
if (FLAGS.triton_dyn_batching_delay > 0):
# Use only full and half full batches
pref_batch_size = [int(max_batch_size / 2.0), max_batch_size]
@ -894,7 +909,7 @@ instance_group [
dynamic_batching {{
preferred_batch_size: [{0}]
max_queue_delay_microseconds: {1}
}}""".format(", ".join([str(x) for x in pref_batch_size]), int(FLAGS.trtis_dyn_batching_delay * 1000.0))
}}""".format(", ".join([str(x) for x in pref_batch_size]), int(FLAGS.triton_dyn_batching_delay * 1000.0))
config_values = {
"model_name": model_name,
@ -902,7 +917,7 @@ dynamic_batching {{
"seq_length": FLAGS.max_seq_length,
"dynamic_batching": batching_str,
"gpu_list": ", ".join([x.name.split(":")[-1] for x in device_lib.list_local_devices() if x.device_type == "GPU"]),
"engine_count": FLAGS.trtis_engine_count
"engine_count": FLAGS.triton_engine_count
}
with open(model_folder + "/config.pbtxt", "w") as file:
@ -911,13 +926,13 @@ dynamic_batching {{
file.write(final_config_str)
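For reference, the dynamic-batching fragment that `export_model()` substitutes into `config.pbtxt` renders as shown below; the sketch re-runs the diff's formula with illustrative values (`--triton_max_batch_size=8`, `--triton_dyn_batching_delay=10` ms), so the preferred batch sizes are the half-full and full batches and the delay is converted to microseconds.

```python
# Toy rendering of the dynamic_batching block written into config.pbtxt (illustrative values).
triton_max_batch_size = 8
triton_dyn_batching_delay = 10.0  # milliseconds, as passed on the command line

pref_batch_size = [int(triton_max_batch_size / 2.0), triton_max_batch_size]  # [4, 8]
batching_str = """
dynamic_batching {{
    preferred_batch_size: [{0}]
    max_queue_delay_microseconds: {1}
}}""".format(", ".join(str(x) for x in pref_batch_size),
             int(triton_dyn_batching_delay * 1000.0))

print(batching_str)
# dynamic_batching {
#     preferred_batch_size: [4, 8]
#     max_queue_delay_microseconds: 10000
# }
```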
def main(_):
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
dllogging = utils.dllogger_class.dllogger_class(FLAGS.dllog_path)
if FLAGS.horovod:
hvd.init()
if FLAGS.use_fp16:
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
@ -1061,7 +1076,7 @@ def main(_):
tf.compat.v1.logging.info("-----------------------------")
if FLAGS.export_trtis and master_process:
if FLAGS.export_triton and master_process:
export_model(estimator, FLAGS.output_dir, FLAGS.init_checkpoint)
if FLAGS.do_predict and master_process:
@ -1163,7 +1178,8 @@ def main(_):
write_predictions(eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file)
output_nbest_file, output_null_log_odds_file,
FLAGS.version_2_with_negative, FLAGS.verbose_logging)
if FLAGS.eval_script:
import sys
@ -1179,7 +1195,5 @@ def main(_):
if __name__ == "__main__":
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
tf.compat.v1.app.run()
FLAGS = extract_run_squad_flags()
tf.app.run()

View file

@ -68,7 +68,14 @@ printf "Logs written to %s\n" "$LOGFILE"
INPUT_FILES="$DATA_DIR/training"
EVAL_FILES="$DATA_DIR/test"
CMD="python3 /workspace/bert/run_pretraining.py"
horovod_str=""
mpi=""
if [ $num_gpus -gt 1 ] ; then
mpi="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket"
horovod_str="--horovod"
fi
CMD="$mpi python3 /workspace/bert/run_pretraining.py"
CMD+=" --input_files_dir=$INPUT_FILES"
CMD+=" --eval_files_dir=$EVAL_FILES"
CMD+=" --output_dir=$RESULTS_DIR"
@ -85,7 +92,7 @@ CMD+=" --num_accumulation_steps=$num_accumulation_steps"
CMD+=" --save_checkpoints_steps=$save_checkpoints_steps"
CMD+=" --learning_rate=$learning_rate"
CMD+=" --optimizer_type=adam"
CMD+=" --horovod $PREC"
CMD+=" $horovod_str $PREC"
CMD+=" --allreduce_post_accumulation=True"
#Check if all necessary files are available before training
@ -96,10 +103,6 @@ for DIR_or_file in $DATA_DIR $BERT_CONFIG $RESULTS_DIR; do
fi
done
if [ $num_gpus -gt 1 ] ; then
CMD="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket $CMD"
fi
set -x
if [ -z "$LOGFILE" ] ; then
$CMD

View file

@ -59,8 +59,10 @@ if [ "$use_xla" = "true" ] ; then
fi
mpi=""
horovod_str=""
if [ $num_gpus -gt 1 ] ; then
mpi="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket"
horovod_str="--horovod"
fi
#PHASE 1
@ -99,5 +101,5 @@ done
--num_warmup_steps=$warmup_steps_phase1 \
--save_checkpoints_steps=$save_checkpoints_steps \
--learning_rate=$learning_rate_phase1 \
--horovod $PREC \
$horovod_str $PREC \
--allreduce_post_accumulation=True

View file

@ -61,8 +61,10 @@ if [ "$use_xla" = "true" ] ; then
fi
mpi=""
horovod_str=""
if [ $num_gpus -gt 1 ] ; then
mpi="mpiexec --allow-run-as-root -np $num_gpus --bind-to socket"
horovod_str="--horovod"
fi
#PHASE 1 Config
@ -110,6 +112,6 @@ $mpi python /workspace/bert/run_pretraining.py \
--num_warmup_steps=$warmup_steps_phase2 \
--save_checkpoints_steps=$save_checkpoints_steps \
--learning_rate=$learning_rate_phase2 \
--horovod $PREC \
$horovod_str $PREC \
--allreduce_post_accumulation=True

View file

@ -76,6 +76,7 @@ python run_squad.py \
--init_checkpoint=$init_checkpoint \
--do_predict=True \
--predict_file=$SQUAD_DIR/dev-v${squad_version}.json \
--eval_script=$SQUAD_DIR/evaluate-v${squad_version}.py \
--predict_batch_size=$batch_size \
--max_seq_length=$seq_length \
--doc_stride=$doc_stride \
@ -83,5 +84,3 @@ python run_squad.py \
--output_dir=$RESULTS_DIR \
"$use_fp16" \
$use_xla_tag --version_2_with_negative=${version_2_with_negative}
python $SQUAD_DIR/evaluate-v${squad_version}.py $SQUAD_DIR/dev-v${squad_version}.json $RESULTS_DIR/predictions.json

View file

@ -1,18 +1,18 @@
# Deploying the BERT model using TensorRT Inference Server
# Deploying the BERT model using Triton Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains detailed performance analysis as well as scripts to run SQuAD fine-tuning on BERT model using TensorRT Inference Server.
The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
This folder contains detailed performance analysis as well as scripts to run SQuAD fine-tuning on a BERT model using Triton Inference Server.
## Table Of Contents
- [TensorRT Inference Server Overview](#tensorrt-inference-server-overview)
- [Performance analysis for TensorRT Inference Server](#performance-analysis-for-tensorrt-inference-server)
- [Triton Inference Server Overview](#triton-inference-server-overview)
- [Running the Triton Inference Server and client](#running-the-triton-inference-server-and-client)
- [Performance analysis for Triton Inference Server](#performance-analysis-for-triton-inference-server)
* [Advanced Details](#advanced-details)
- [Running the TensorRT Inference Server and client](#running-the-tensorrt-inference-server-and-client)
## TensorRT Inference Server Overview
## Triton Inference Server Overview
A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
A typical Triton Inference Server pipeline can be broken down into the following 8 steps:
1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
@ -23,30 +23,40 @@ A typical TensorRT Inference Server pipeline can be broken down into the followi
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
Generally, for local clients, steps 1-4 and 6-8 occupy only a small fraction of the total time compared to step 5. Since backend deep learning systems like BERT are rarely exposed directly to end users and instead interface only with local front-end servers, for the purposes of BERT we can consider all clients to be local.
In this section, we will go over how to launch TensorRT Inference Server and client and get the best performant solution that fits your specific application needs.
In this section, we will go over how to launch the Triton Inference Server and client, and how to arrive at the most performant solution that fits your specific application needs.
Note: The following instructions are run from outside the container and call `docker run` commands as required.
## Performance analysis for TensorRT Inference Server
## Running the Triton Inference Server and client
The `run_triton.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that Triton Inference Server accepts, builds a matching [Triton Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state, runs the client on the SQuAD v1.1 dataset, and then evaluates the validity of the predictions based on exact match and F1 score, all in one step.
```bash
bash triton/scripts/run_triton.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <triton_version_name> <triton_model_name> <triton_export_model> <triton_dyn_batching_delay> <triton_engine_count> <triton_model_overwrite>
```
You can also run inference on a single sample by passing `--question` and `--context` arguments to the client, as shown in the sketch below.
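For example, here is a minimal sketch of a single-sample request issued from inside the container (the model name `bert`, version `1`, and the question/context strings are illustrative; the model is assumed to have already been exported and the server to be running):

```bash
# Assumes $BERT_DIR points at the pretrained weights (for vocab.txt and bert_config.json)
# and that a model named "bert" (version 1) is already loaded by the running server.
python triton/run_squad_triton_client.py \
  --triton_model_name=bert \
  --triton_model_version=1 \
  --vocab_file=$BERT_DIR/vocab.txt \
  --bert_config_file=$BERT_DIR/bert_config.json \
  --question="Who deployed the model?" \
  --context="The BERT model was exported and deployed with the Triton Inference Server."
```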
## Performance analysis for Triton Inference Server
Based on Figures 1 and 2 below, we recommend using the Dynamic Batcher with `max_batch_size = 8`, `max_queue_delay_microseconds` as large as possible while still fitting within your latency window (the values used below are extremely large to exaggerate their effect), and only 1 instance of the engine. The largest improvements to both throughput and latency come from increasing the batch size, due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of simultaneous requests while also keeping latency down for infrequent requests. We recommend only 1 instance of the engine because additional instances bring a negligible improvement in throughput at the cost of a significant increase in latency. Many models can benefit from multiple engine instances, but as the figures below show, that is not the case for this model. A config sketch matching this recommendation follows Figure 2.
![](../data/images/trtis_base_summary.png?raw=true)
![](../data/images/triton_base_summary.png?raw=true)
Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in TensorRT Inference Server
Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in Triton Inference Server
![](../data/images/trtis_large_summary.png?raw=true)
![](../data/images/triton_large_summary.png?raw=true)
Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in TensorRT Inference Server
Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in Triton Inference Server
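The configuration below is only a sketch of what a `config.pbtxt` matching this recommendation might look like; `export_model.sh` generates the real one, the model name and repository path are assumptions, and because `launch_server.sh` starts the server with `--strict-model-config=false`, the input/output sections can be left out and inferred from the SavedModel:

```bash
# Hypothetical hand-written config for a model named "bert" in results/triton_models;
# the 5000-microsecond delay mirrors the exaggerated value used in the figures below.
mkdir -p results/triton_models/bert
cat > results/triton_models/bert/config.pbtxt <<'EOF'
name: "bert"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
EOF
```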
### Advanced Details
This section digs deeper into the performance numbers and configurations corresponding to running TensorRT Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
This section digs deeper into the performance numbers and configurations for running Triton Inference Server with BERT fine-tuned for Question Answering. It explains the tradeoffs in selecting the maximum batch size, the batching technique, and the number of inference engines on the same GPU, and shows how we arrived at the optimal configuration specified previously.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that Triton Inference Server accepts, builds a matching [Triton Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
```bash
bash trtis/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
bash triton/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
```
All results below are obtained on a single DGX-1 V100 32GB GPU for BERT Base, Sequence Length = 128 and FP16 precision, running against a local server. Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. The X-axis indicates the concurrency: the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1). A high number of concurrent requests can reduce the impact of network latency on overall throughput; the sketch below shows how such a sweep is driven.
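A single point of this sweep can be reproduced through the `run_perf_client.sh` wrapper; the positional arguments below mirror the calls made by `generate_figures.sh` (the seventh argument is the maximum concurrency and the eighth the server hostname; the meaning of the other numeric arguments is defined inside the wrapper, so treat the values as illustrative):

```bash
# Sweep concurrency up to 64 for a model named "bert" (version 1, fp16, client batch size 1)
# against a server on localhost; the 1000/10 values follow the defaults in generate_figures.sh.
bash triton/scripts/run_perf_client.sh bert 1 fp16 1 1000 10 64 localhost
```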
@ -59,11 +69,11 @@ Note: We compare BS=1, Client Concurrent Requests = 64 to BS=8, Client Concurren
Increasing the batch size from 1 to 8 increases the compute time by only 1.8x (8.38 ms to 15.46 ms), showing that computation is more efficient at higher batch sizes: the compute time per sequence drops from 8.38 ms to roughly 1.9 ms. Hence, an optimal batch size would be the largest batch size that both fits in memory and stays within the preferred latency threshold.
![](../data/images/trtis_bs_1.png?raw=true)
![](../data/images/triton_bs_1.png?raw=true)
Figure 3: Latency & Throughput vs Concurrency at Batch size = 1
![](../data/images/trtis_bs_8.png?raw=true)
![](../data/images/triton_bs_8.png?raw=true)
Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
@ -71,38 +81,30 @@ Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
Static batching is a feature of the inference server that allows inference requests to be served as they are received. It is preferred in scenarios where low latency is desired, at the cost of throughput when the GPU is underutilized.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in an increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and preferred_batchsize to indicate your optimal batch sizes in the TensorRT Inference Server model config.
Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server so that a batch is created dynamically, resulting in increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. In the Triton Inference Server model config, you can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds`, the maximum amount of time you are willing to wait for a batch to fill, and `preferred_batchsize`, your optimal batch sizes.
Figures 5 and 6 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increasing latency as the requests are queued up to `max_queue_delay_microseconds`. The effect of `preferred_batchsize` for dynamic batching is visually depicted by the dip in Server Queue time at integer multiples of the preferred batch sizes. At higher numbers of concurrent requests, the throughput approaches a maximum limit as GPU utilization saturates.
![](../data/images/trtis_static.png?raw=true)
![](../data/images/triton_static.png?raw=true)
Figure 5: Latency & Throughput vs Concurrency using Static Batching at `Batch size` = 1
![](../data/images/trtis_dynamic.png?raw=true)
![](../data/images/triton_dynamic.png?raw=true)
Figure 6: Latency & Throughput vs Concurrency using Dynamic Batching at `Batch size` = 1, `preferred_batchsize` = [4, 8] and `max_queue_delay_microseconds` = 5000
#### Model execution instance count
TensorRT Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesnt saturate the GPU allowing the GPU to run multiple instances of the model in parallel.
Triton Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesn't saturate the GPU, allowing the GPU to run multiple instances of the model in parallel.
Figures 7 and 8 show a drop in queue time as more models are available to serve an inference request. However, this is countered by an increase in compute time as multiple models compete for resources. Since BERT is a large model that utilizes the majority of the GPU, the benefit of running multiple engines is not seen; an export example with a larger engine count follows Figure 8.
![](../data/images/trtis_ec_1.png?raw=true)
![](../data/images/triton_ec_1.png?raw=true)
Figure 7: Latency & Throughput vs Concurrency at Batch size = 1, Engine Count = 1
(One copy of the model loaded in GPU memory)
![](../data/images/trtis_ec_4.png?raw=true)
![](../data/images/triton_ec_4.png?raw=true)
Figure 8: Latency & Throughput vs Concurrency at Batch size = 1, Engine count = 4
(Four copies of the model loaded in GPU memory)
## Running the TensorRT Inference Server and client
The `run_trtis.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on local host in a detached state, runs client and then evaluates the validity of predictions on the basis of exact match and F1 score all in one step.
```bash
bash trtis/scripts/run_trtis.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <trtis_version_name> <trtis_model_name> <trtis_export_model> <trtis_dyn_batching_delay> <trtis_engine_count> <trtis_model_overwrite>
```
(Four copies of the model loaded in GPU memory)
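For reference, a larger engine count is requested at export time; the sketch below mirrors the engine-count comparison in `generate_figures.sh` (the checkpoint path and most values are illustrative, and the engine count is the 11th positional argument of `export_model.sh`):

```bash
# Re-export the "bert" model with 4 engine instances, then restart the server
# so the updated model config is picked up.
bash triton/scripts/export_model.sh /results/model.ckpt-5474 1 fp16 true 384 128 \
  data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16 1 bert 0 4 True
docker kill triton_server_cont
bash triton/scripts/launch_server.sh fp16
```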

View file

@ -0,0 +1,325 @@
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Explicit imports for names used directly below (os, time, tf) rather than relying on wildcard imports
import os
import time

import tensorflow as tf

import modeling
import tokenization
from tensorrtserver.api import ProtocolType, InferContext, ServerStatusContext, grpc_service_pb2_grpc, grpc_service_pb2, model_config_pb2
from utils.create_squad_data import *
import grpc
from run_squad import write_predictions, get_predictions, RawResult
import numpy as np
import tqdm
from functools import partial
import sys
if sys.version_info >= (3, 0):
import queue
else:
import Queue as queue
flags = tf.flags
FLAGS = flags.FLAGS
## Required parameters
flags.DEFINE_string(
"bert_config_file", None,
"The config json file corresponding to the pre-trained BERT model. "
"This specifies the model architecture.")
flags.DEFINE_string("vocab_file", None,
"The vocabulary file that the BERT model was trained on.")
flags.DEFINE_string(
"output_dir", None,
"The output directory where the model checkpoints will be written.")
flags.DEFINE_bool(
"do_lower_case", True,
"Whether to lower case the input text. Should be True for uncased "
"models and False for cased models.")
flags.DEFINE_integer(
"max_seq_length", 384,
"The maximum total input sequence length after WordPiece tokenization. "
"Sequences longer than this will be truncated, and sequences shorter "
"than this will be padded.")
flags.DEFINE_integer(
"doc_stride", 128,
"When splitting up a long document into chunks, how much stride to "
"take between chunks.")
flags.DEFINE_integer(
"max_query_length", 64,
"The maximum number of tokens for the question. Questions longer than "
"this will be truncated to this length.")
flags.DEFINE_integer("predict_batch_size", 8,
"Total batch size for predictions.")
flags.DEFINE_integer(
"n_best_size", 20,
"The total number of n-best predictions to generate in the "
"nbest_predictions.json output file.")
flags.DEFINE_integer(
"max_answer_length", 30,
"The maximum length of an answer that can be generated. This is needed "
"because the start and end predictions are not conditioned on one another.")
flags.DEFINE_bool(
"version_2_with_negative", False,
"If true, the SQuAD examples contain some that do not have an answer.")
flags.DEFINE_bool(
"verbose_logging", False,
"If true, all of the warnings related to data processing will be printed. "
"A number of warnings are expected for a normal SQuAD evaluation.")
# Triton Specific flags
flags.DEFINE_string("triton_model_name", "bert", "exports to appropriate directory for Triton")
flags.DEFINE_integer("triton_model_version", 1, "exports to appropriate directory for Triton")
flags.DEFINE_string("triton_server_url", "localhost:8001", "exports to appropriate directory for Triton")
# Input Text for Inference
flags.DEFINE_string("question", None, "Question for Inference")
flags.DEFINE_string("context", None, "Context for Inference")
flags.DEFINE_string(
"predict_file", None,
"SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
# Set this to either 'label_ids' for Google bert or 'unique_ids' for JoC
label_id_key = "unique_ids"
# User defined class to store infer_ctx and request id
# from callback function and let main thread to handle them
class UserData:
def __init__(self):
self._completed_requests = queue.Queue()
# Callback function used for async_run(), it can capture
# additional information using functools.partial as long as the last
# two arguments are reserved for InferContext and request id
def completion_callback(user_data, idx, start_time, inputs, infer_ctx, request_id):
user_data._completed_requests.put((infer_ctx, request_id, idx, start_time, inputs))
# Group eval features into batches of up to n, yielding per-field tuples of numpy arrays
def batch(iterable, n=1):
l = len(iterable)
for ndx in range(0, l, n):
label_ids_data = ()
input_ids_data = ()
input_mask_data = ()
segment_ids_data = ()
for i in range(0, min(n, l-ndx)):
label_ids_data = label_ids_data + (np.array([iterable[ndx + i].unique_id], dtype=np.int32),)
input_ids_data = input_ids_data+ (np.array(iterable[ndx + i].input_ids, dtype=np.int32),)
input_mask_data = input_mask_data+ (np.array(iterable[ndx + i].input_mask, dtype=np.int32),)
segment_ids_data = segment_ids_data+ (np.array(iterable[ndx + i].segment_ids, dtype=np.int32),)
inputs_dict = {label_id_key: label_ids_data,
'input_ids': input_ids_data,
'input_mask': input_mask_data,
'segment_ids': segment_ids_data}
yield inputs_dict
def main(_):
"""
Ask a question of context on Triton.
:param context: str
:param question: str
:param question_id: int
:return:
"""
os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_lazy_compilation=false" #causes memory fragmentation for bert leading to OOM
tokenizer = tokenization.FullTokenizer(vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
# Get the Data
if FLAGS.predict_file:
eval_examples = read_squad_examples(
input_file=FLAGS.predict_file, is_training=False,
version_2_with_negative=FLAGS.version_2_with_negative)
elif FLAGS.question and FLAGS.context:  # single question/context pair passed on the command line
input_data = [{"paragraphs":[{"context":FLAGS.context,
"qas":[{"id":0, "question":FLAGS.question}]}]}]
eval_examples = read_squad_examples(input_file=None, is_training=False,
version_2_with_negative=FLAGS.version_2_with_negative, input_data=input_data)
else:
raise ValueError("Either predict_file or question+context needs to be defined")
# Get Eval Features = Preprocessing
eval_features = []
def append_feature(feature):
eval_features.append(feature)
convert_examples_to_features(
examples=eval_examples[0:],
tokenizer=tokenizer,
max_seq_length=FLAGS.max_seq_length,
doc_stride=FLAGS.doc_stride,
max_query_length=FLAGS.max_query_length,
is_training=False,
output_fn=append_feature)
protocol_str = 'grpc' # http or grpc
url = FLAGS.triton_server_url
verbose = True
model_name = FLAGS.triton_model_name
model_version = FLAGS.triton_model_version
batch_size = FLAGS.predict_batch_size
protocol = ProtocolType.from_str(protocol_str) # or 'grpc'
ctx = InferContext(url, protocol, model_name, model_version, verbose)
status_ctx = ServerStatusContext(url, protocol, model_name=model_name, verbose=verbose)
model_config_pb2.ModelConfig()
status_result = status_ctx.get_server_status()
user_data = UserData()
max_outstanding = 20
# Number of outstanding requests
outstanding = 0
sent_prog = tqdm.tqdm(desc="Send Requests", total=len(eval_features))
recv_prog = tqdm.tqdm(desc="Recv Requests", total=len(eval_features))
def process_outstanding(do_wait, outstanding):
if (outstanding == 0 or do_wait is False):
return outstanding
# Wait for deferred items from callback functions
(infer_ctx, ready_id, idx, start_time, inputs) = user_data._completed_requests.get()
if (ready_id is None):
return outstanding
# If we are here, we got an id
result = ctx.get_async_run_results(ready_id)
stop = time.time()
if (result is None):
raise ValueError("Context returned null for async id marked as done")
outstanding -= 1
time_list.append(stop - start_time)
batch_count = len(inputs[label_id_key])
for i in range(batch_count):
unique_id = int(inputs[label_id_key][i][0])
start_logits = [float(x) for x in result["start_logits"][i].flat]
end_logits = [float(x) for x in result["end_logits"][i].flat]
all_results.append(
RawResult(
unique_id=unique_id,
start_logits=start_logits,
end_logits=end_logits))
recv_prog.update(n=batch_count)
return outstanding
all_results = []
time_list = []
print("Starting Sending Requests....\n")
all_results_start = time.time()
idx = 0
for inputs_dict in batch(eval_features, batch_size):
present_batch_size = len(inputs_dict[label_id_key])
outputs_dict = {'start_logits': InferContext.ResultFormat.RAW,
'end_logits': InferContext.ResultFormat.RAW}
start_time = time.time()
ctx.async_run(partial(completion_callback, user_data, idx, start_time, inputs_dict),
inputs_dict, outputs_dict, batch_size=present_batch_size)
outstanding += 1
idx += 1
sent_prog.update(n=present_batch_size)
# Try to process at least one response per request
outstanding = process_outstanding(outstanding >= max_outstanding, outstanding)
tqdm.tqdm.write("All Requests Sent! Waiting for responses. Outstanding: {}.\n".format(outstanding))
# Now process all outstanding requests
while (outstanding > 0):
outstanding = process_outstanding(True, outstanding)
all_results_end = time.time()
all_results_total = (all_results_end - all_results_start) * 1000.0
print("-----------------------------")
print("Total Time: {} ms".format(all_results_total))
print("-----------------------------")
print("-----------------------------")
print("Total Inference Time = %0.2f for"
"Sentences processed = %d" % (sum(time_list), len(eval_features)))
print("Throughput Average (sentences/sec) = %0.2f" % (len(eval_features) / all_results_total * 1000.0))
print("-----------------------------")
if FLAGS.output_dir and FLAGS.predict_file:
# When inferencing on a dataset, get inference statistics and write results to json file
time_list.sort()
avg = np.mean(time_list)
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
print("-----------------------------")
print("Summary Statistics")
print("Batch size =", FLAGS.predict_batch_size)
print("Sequence Length =", FLAGS.max_seq_length)
print("Latency Confidence Level 95 (ms) =", cf_95 * 1000)
print("Latency Confidence Level 99 (ms) =", cf_99 * 1000)
print("Latency Confidence Level 100 (ms) =", cf_100 * 1000)
print("Latency Average (ms) =", avg * 1000)
print("-----------------------------")
output_prediction_file = os.path.join(FLAGS.output_dir, "predictions.json")
output_nbest_file = os.path.join(FLAGS.output_dir, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(FLAGS.output_dir, "null_odds.json")
write_predictions(eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file,
FLAGS.version_2_with_negative, FLAGS.verbose_logging)
else:
# When inferencing on a single example, write best answer to stdout
all_predictions, all_nbest_json, scores_diff_json = get_predictions(
eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, FLAGS.version_2_with_negative,
FLAGS.verbose_logging)
print("Context is: %s \n\nQuestion is: %s \n\nPredicted Answer is: %s" %(FLAGS.context, FLAGS.question, all_predictions[0]))
if __name__ == "__main__":
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
tf.compat.v1.app.run()

View file

@ -20,15 +20,15 @@ use_xla=${4:-"true"}
seq_length=${5:-"384"}
doc_stride=${6:-"128"}
BERT_DIR=${7:-"data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16"}
trtis_model_version=${8:-1}
trtis_model_name=${9:-"bert"}
trtis_dyn_batching_delay=${10:-0}
trtis_engine_count=${11:-1}
trtis_model_overwrite=${12:-"False"}
triton_model_version=${8:-1}
triton_model_name=${9:-"bert"}
triton_dyn_batching_delay=${10:-0}
triton_engine_count=${11:-1}
triton_model_overwrite=${12:-"False"}
additional_args="--trtis_model_version=$trtis_model_version --trtis_model_name=$trtis_model_name --trtis_max_batch_size=$batch_size \
--trtis_model_overwrite=$trtis_model_overwrite --trtis_dyn_batching_delay=$trtis_dyn_batching_delay \
--trtis_engine_count=$trtis_engine_count"
additional_args="--triton_model_version=$triton_model_version --triton_model_name=$triton_model_name --triton_max_batch_size=$batch_size \
--triton_model_overwrite=$triton_model_overwrite --triton_dyn_batching_delay=$triton_dyn_batching_delay \
--triton_engine_count=$triton_engine_count"
if [ "$precision" = "fp16" ] ; then
echo "fp16 activated!"
@ -51,7 +51,7 @@ bash scripts/docker/launch.sh \
--doc_stride=${doc_stride} \
--predict_batch_size=${batch_size} \
--output_dir=/results \
--export_trtis=True \
--export_triton=True \
${additional_args}

View file

@ -0,0 +1,146 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set the number of devices to use
export NVIDIA_VISIBLE_DEVICES=0
# Always need to be overwriting models to keep memory use low
export TRITON_MODEL_OVERWRITE=True
bert_model=${1:-small}
seq_length=${2:-128}
precision=${3:-fp16}
init_checkpoint=${4:-"/results/models/bert_${bert_model}_${precision}_${seq_length}_v1/model.ckpt-5474"}
MODEL_NAME="bert_${bert_model}_${seq_length}_${precision}"
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
else
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
fi
doc_stride=128
use_xla=true
EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DIR} 1 ${MODEL_NAME}"
PERF_CLIENT_ARGS="1000 10 20 localhost"
# Start Server
bash triton/scripts/launch_server.sh $precision
# Restart Server
restart_server() {
docker kill triton_server_cont
bash triton/scripts/launch_server.sh $precision
}
############## Dynamic Batching Comparison ##############
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=1
TRITON_ENGINE_COUNT=1
# Dynamic batching 10 ms
TRITON_DYN_BATCHING_DELAY=10
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 5 ms
TRITON_DYN_BATCHING_DELAY=5
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 2 ms
TRITON_DYN_BATCHING_DELAY=2
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Static Batching (i.e. Dynamic batching 0 ms)
TRITON_DYN_BATCHING_DELAY=0
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# ############## Engine Count Comparison ##############
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRITON_DYN_BATCHING_DELAY=0
# Engine Count = 4
TRITON_ENGINE_COUNT=4
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 2
TRITON_ENGINE_COUNT=2
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 1
TRITON_ENGINE_COUNT=1
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
############## Batch Size Comparison ##############
# BATCH=1 Generate model and perf
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRITON_ENGINE_COUNT=1
TRITON_DYN_BATCHING_DELAY=0
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
# BATCH=2 Generate model and perf
SERVER_BATCH_SIZE=2
CLIENT_BATCH_SIZE=2
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
# BATCH=4 Generate model and perf
SERVER_BATCH_SIZE=4
CLIENT_BATCH_SIZE=4
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
# BATCH=8 Generate model and perf
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=8
bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
restart_server
sleep 15
bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost

View file

@ -9,16 +9,16 @@ else
export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=0
fi
# Start TRTIS server in detached state
nvidia-docker run -d --rm \
# Start TRITON server in detached state
docker run --gpus all -d --rm \
--shm-size=1g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p8000:8000 \
-p8001:8001 \
-p8002:8002 \
--name trt_server_cont \
--name triton_server_cont \
-e NVIDIA_VISIBLE_DEVICES=$NV_VISIBLE_DEVICES \
-e TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE \
-v $PWD/results/trtis_models:/models \
nvcr.io/nvidia/tensorrtserver:20.02-py3 trtserver --model-store=/models --strict-model-config=false
-v $PWD/results/triton_models:/models \
nvcr.io/nvidia/tritonserver:20.03-py3 trtserver --model-store=/models --strict-model-config=false

View file

@ -16,36 +16,18 @@
batch_size=${1:-"8"}
seq_length=${2:-"384"}
doc_stride=${3:-"128"}
trtis_version_name=${4:-"1"}
trtis_model_name=${5:-"bert"}
triton_version_name=${4:-"1"}
triton_model_name=${5:-"bert"}
BERT_DIR=${6:-"data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16"}
squad_version=${7:-"1.1"}
export SQUAD_DIR=data/download/squad/v${squad_version}
if [ "$squad_version" = "1.1" ] ; then
version_2_with_negative="False"
else
version_2_with_negative="True"
fi
echo "Squad directory set as " $SQUAD_DIR
if [ ! -d "$SQUAD_DIR" ] ; then
echo "Error! $SQUAD_DIR directory missing. Please mount SQuAD dataset."
exit -1
fi
bash scripts/docker/launch.sh \
"python trtis/run_squad_trtis_client.py \
--trtis_model_name=$trtis_model_name \
--trtis_model_version=$trtis_version_name \
"python triton/run_squad_triton_client.py \
--triton_model_name=$triton_model_name \
--triton_model_version=$triton_version_name \
--vocab_file=$BERT_DIR/vocab.txt \
--bert_config_file=$BERT_DIR/bert_config.json \
--predict_file=$SQUAD_DIR/dev-v${squad_version}.json \
--predict_batch_size=$batch_size \
--max_seq_length=${seq_length} \
--doc_stride=${doc_stride} \
--output_dir=/results \
--version_2_with_negative=${version_2_with_negative}"
bash scripts/docker/launch.sh "python $SQUAD_DIR/evaluate-v${squad_version}.py \
$SQUAD_DIR/dev-v${squad_version}.json /results/predictions.json"
${@:7}"

View file

@ -23,21 +23,21 @@ MAX_CONCURRENCY=${7:-50}
SERVER_HOSTNAME=${8:-"localhost"}
if [[ $SERVER_HOSTNAME == *":"* ]]; then
echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the TRTIS HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the TRITON HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
exit 1
fi
if [ "$SERVER_HOSTNAME" = "localhost" ]
then
if [ ! "$(docker inspect -f "{{.State.Running}}" trt_server_cont)" = "true" ] ; then
if [ ! "$(docker inspect -f "{{.State.Running}}" triton_server_cont)" = "true" ] ; then
echo "Launching TRTIS server"
bash trtis/scripts/launch_server.sh $precision
echo "Launching TRITON server"
bash triton/scripts/launch_server.sh $precision
SERVER_LAUNCHED=true
function cleanup_server {
echo "Killing TRTIS server"
docker kill trt_server_cont
echo "Killing TRITON server"
docker kill triton_server_cont
}
# Ensure we cleanup the server on exit
@ -47,7 +47,7 @@ then
fi
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash trtis/scripts/wait_for_trtis_server.sh $SERVER_HOSTNAME
bash triton/scripts/wait_for_triton_server.sh $SERVER_HOSTNAME
TIMESTAMP=$(date "+%y%m%d_%H%M")

View file

@ -21,12 +21,13 @@ seq_length=${5:-"384"}
doc_stride=${6:-"128"}
bert_model=${7:-"large"}
squad_version=${8:-"1.1"}
trtis_version_name=${9:-1}
trtis_model_name=${10:-"bert"}
trtis_export_model=${11:-"false"}
trtis_dyn_batching_delay=${12:-0}
trtis_engine_count=${13:-1}
trtis_model_overwrite=${14:-"False"}
triton_version_name=${9:-1}
triton_model_name=${10:-"bert"}
triton_export_model=${11:-"true"}
triton_dyn_batching_delay=${12:-0}
triton_engine_count=${13:-1}
triton_model_overwrite=${14:-"False"}
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
@ -39,8 +40,21 @@ if [ ! -d "$BERT_DIR" ] ; then
exit -1
fi
export SQUAD_DIR=data/download/squad/v${squad_version}
if [ "$squad_version" = "1.1" ] ; then
version_2_with_negative="False"
else
version_2_with_negative="True"
fi
echo "Squad directory set as " $SQUAD_DIR
if [ ! -d "$SQUAD_DIR" ] ; then
echo "Error! $SQUAD_DIR directory missing. Please mount SQuAD dataset."
exit -1
fi
# Need to ignore case on some variables
trtis_export_model=$(echo "$trtis_export_model" | tr '[:upper:]' '[:lower:]')
triton_export_model=$(echo "$triton_export_model" | tr '[:upper:]' '[:lower:]')
# Explicitly save this variable to pass down to new containers
NV_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-"all"}
@ -56,33 +70,36 @@ echo " seq_length = $seq_length"
echo " doc_stride = $doc_stride"
echo " bert_model = $bert_model"
echo " squad_version = $squad_version"
echo " version_name = $trtis_version_name"
echo " model_name = $trtis_model_name"
echo " export_model = $trtis_export_model"
echo " version_name = $triton_version_name"
echo " model_name = $triton_model_name"
echo " export_model = $triton_export_model"
echo
echo "Env: "
echo " NVIDIA_VISIBLE_DEVICES = $NV_VISIBLE_DEVICES"
echo
# Export Model in SavedModel format if enabled
if [ "$trtis_export_model" = "true" ] ; then
echo "Exporting model as: Name - $trtis_model_name Version - $trtis_version_name"
if [ "$triton_export_model" = "true" ] ; then
echo "Exporting model as: Name - $triton_model_name Version - $triton_version_name"
bash trtis/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
$doc_stride $BERT_DIR $RESULTS_DIR $trtis_version_name $trtis_model_name \
$trtis_dyn_batching_delay $trtis_engine_count $trtis_model_overwrite
bash triton/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
$doc_stride $BERT_DIR $triton_version_name $triton_model_name \
$triton_dyn_batching_delay $triton_engine_count $triton_model_overwrite
fi
# Start TRTIS server in detached state
bash trtis/scripts/launch_server.sh $precision
bash triton/scripts/launch_server.sh $precision
# Wait until server is up. curl on the health of the server and sleep until it's ready
bash trtis/scripts/wait_for_trtis_server.sh localhost
bash triton/scripts/wait_for_triton_server.sh localhost
# Start TRTIS client for inference and evaluate results
bash trtis/scripts/run_client.sh $batch_size $seq_length $doc_stride $trtis_version_name $trtis_model_name \
$BERT_DIR $squad_version
# Start Triton client for inference on SQuAD dataset
bash triton/scripts/run_client.sh $batch_size $seq_length $doc_stride $triton_version_name $triton_model_name \
$BERT_DIR --predict_file=$SQUAD_DIR/dev-v${squad_version}.json --version_2_with_negative=${version_2_with_negative}
# Evaluate SQuAD results
bash scripts/docker/launch.sh "python $SQUAD_DIR/evaluate-v${squad_version}.py \
$SQUAD_DIR/dev-v${squad_version}.json /results/predictions.json"
#Kill the TRTIS Server
docker kill trt_server_cont
docker kill triton_server_cont

View file

@ -15,7 +15,7 @@
SERVER_URI=${1:-"localhost"}
echo "Waiting for TRTIS Server to be ready at http://$SERVER_URI:8000..."
echo "Waiting for TRITON Server to be ready at http://$SERVER_URI:8000..."
live_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/live"
ready_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/ready"
@ -30,4 +30,4 @@ while [[ ${current_status} != "200" ]] || [[ $($ready_command) != "200" ]]; do
current_status=$($live_command)
done
echo "TRTIS Server is ready!"
echo "TRITON Server is ready!"

View file

@ -1,222 +0,0 @@
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import modeling
import tokenization
from tensorrtserver.api import ProtocolType, InferContext, ServerStatusContext, grpc_service_pb2_grpc, grpc_service_pb2, model_config_pb2
from utils.create_squad_data import *
import grpc
from run_squad import *
import numpy as np
import tqdm
# Set this to either 'label_ids' for Google bert or 'unique_ids' for JoC
label_id_key = "unique_ids"
PendingResult = collections.namedtuple("PendingResult",
["async_id", "start_time", "inputs"])
def batch(iterable, n=1):
l = len(iterable)
for ndx in range(0, l, n):
label_ids_data = ()
input_ids_data = ()
input_mask_data = ()
segment_ids_data = ()
for i in range(0, min(n, l-ndx)):
label_ids_data = label_ids_data + (np.array([iterable[ndx + i].unique_id], dtype=np.int32),)
input_ids_data = input_ids_data+ (np.array(iterable[ndx + i].input_ids, dtype=np.int32),)
input_mask_data = input_mask_data+ (np.array(iterable[ndx + i].input_mask, dtype=np.int32),)
segment_ids_data = segment_ids_data+ (np.array(iterable[ndx + i].segment_ids, dtype=np.int32),)
inputs_dict = {label_id_key: label_ids_data,
'input_ids': input_ids_data,
'input_mask': input_mask_data,
'segment_ids': segment_ids_data}
yield inputs_dict
def run_client():
"""
Ask a question of context on TRTIS.
:param context: str
:param question: str
:param question_id: int
:return:
"""
tokenizer = tokenization.FullTokenizer(vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
eval_examples = read_squad_examples(
input_file=FLAGS.predict_file, is_training=False,
version_2_with_negative=FLAGS.version_2_with_negative)
eval_features = []
def append_feature(feature):
eval_features.append(feature)
convert_examples_to_features(
examples=eval_examples[0:],
tokenizer=tokenizer,
max_seq_length=FLAGS.max_seq_length,
doc_stride=FLAGS.doc_stride,
max_query_length=FLAGS.max_query_length,
is_training=False,
output_fn=append_feature)
protocol_str = 'grpc' # http or grpc
url = FLAGS.trtis_server_url
verbose = True
model_name = FLAGS.trtis_model_name
model_version = FLAGS.trtis_model_version
batch_size = FLAGS.predict_batch_size
protocol = ProtocolType.from_str(protocol_str) # or 'grpc'
ctx = InferContext(url, protocol, model_name, model_version, verbose)
channel = grpc.insecure_channel(url)
stub = grpc_service_pb2_grpc.GRPCServiceStub(channel)
prof_request = grpc_service_pb2.server__status__pb2.model__config__pb2.ModelConfig()
prof_response = stub.Profile(prof_request)
status_ctx = ServerStatusContext(url, protocol, model_name=model_name, verbose=verbose)
model_config_pb2.ModelConfig()
status_result = status_ctx.get_server_status()
outstanding = {}
max_outstanding = 20
sent_prog = tqdm.tqdm(desc="Send Requests", total=len(eval_features))
recv_prog = tqdm.tqdm(desc="Recv Requests", total=len(eval_features))
def process_outstanding(do_wait):
if (len(outstanding) == 0):
return
ready_id = ctx.get_ready_async_request(do_wait)
if (ready_id is None):
return
# If we are here, we got an id
result = ctx.get_async_run_results(ready_id, False)
stop = time.time()
if (result is None):
raise ValueError("Context returned null for async id marked as done")
outResult = outstanding.pop(ready_id)
time_list.append(stop - outResult.start_time)
batch_count = len(outResult.inputs[label_id_key])
for i in range(batch_count):
unique_id = int(outResult.inputs[label_id_key][i][0])
start_logits = [float(x) for x in result["start_logits"][i].flat]
end_logits = [float(x) for x in result["end_logits"][i].flat]
all_results.append(
RawResult(
unique_id=unique_id,
start_logits=start_logits,
end_logits=end_logits))
recv_prog.update(n=batch_count)
all_results = []
time_list = []
print("Starting Sending Requests....\n")
all_results_start = time.time()
for inputs_dict in batch(eval_features, batch_size):
present_batch_size = len(inputs_dict[label_id_key])
outputs_dict = {'start_logits': InferContext.ResultFormat.RAW,
'end_logits': InferContext.ResultFormat.RAW}
start = time.time()
async_id = ctx.async_run(inputs_dict, outputs_dict, batch_size=present_batch_size)
outstanding[async_id] = PendingResult(async_id=async_id, start_time=start, inputs=inputs_dict)
sent_prog.update(n=present_batch_size)
# Try to process at least one response per request
process_outstanding(len(outstanding) >= max_outstanding)
tqdm.tqdm.write("All Requests Sent! Waiting for responses. Outstanding: {}.\n".format(len(outstanding)))
# Now process all outstanding requests
while (len(outstanding) > 0):
process_outstanding(True)
all_results_end = time.time()
all_results_total = (all_results_end - all_results_start) * 1000.0
print("-----------------------------")
print("Individual Time Runs - Ignoring first two iterations")
print("Total Time: {} ms".format(all_results_total))
print("-----------------------------")
print("-----------------------------")
print("Total Inference Time = %0.2f for"
"Sentences processed = %d" % (sum(time_list), len(eval_features)))
print("Throughput Average (sentences/sec) = %0.2f" % (len(eval_features) / all_results_total * 1000.0))
print("-----------------------------")
time_list.sort()
avg = np.mean(time_list)
cf_95 = max(time_list[:int(len(time_list) * 0.95)])
cf_99 = max(time_list[:int(len(time_list) * 0.99)])
cf_100 = max(time_list[:int(len(time_list) * 1)])
print("-----------------------------")
print("Summary Statistics")
print("Batch size =", FLAGS.predict_batch_size)
print("Sequence Length =", FLAGS.max_seq_length)
print("Latency Confidence Level 95 (ms) =", cf_95 * 1000)
print("Latency Confidence Level 99 (ms) =", cf_99 * 1000)
print("Latency Confidence Level 100 (ms) =", cf_100 * 1000)
print("Latency Average (ms) =", avg * 1000)
print("-----------------------------")
output_prediction_file = os.path.join(FLAGS.output_dir, "predictions.json")
output_nbest_file = os.path.join(FLAGS.output_dir, "nbest_predictions.json")
output_null_log_odds_file = os.path.join(FLAGS.output_dir, "null_odds.json")
write_predictions(eval_examples, eval_features, all_results,
FLAGS.n_best_size, FLAGS.max_answer_length,
FLAGS.do_lower_case, output_prediction_file,
output_nbest_file, output_null_log_odds_file)
if __name__ == "__main__":
flags.mark_flag_as_required("vocab_file")
flags.mark_flag_as_required("bert_config_file")
flags.mark_flag_as_required("output_dir")
run_client()

View file

@ -1,146 +0,0 @@
#!/bin/bash
# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set the number of devices to use
export NVIDIA_VISIBLE_DEVICES=0
# Always need to be overwriting models to keep memory use low
export TRTIS_MODEL_OVERWRITE=True
bert_model=${1:-small}
seq_length=${2:-128}
precision=${3:-fp16}
init_checkpoint=${4:-"/results/models/bert_tf_${bert_model}_${precision}_${seq_length}_v1/model.ckpt-5474"}
MODEL_NAME="bert_${bert_model}_${seq_length}_${precision}"
if [ "$bert_model" = "large" ] ; then
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
else
export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
fi
doc_stride=128
use_xla=true
EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DIR} 1 ${MODEL_NAME}"
PERF_CLIENT_ARGS="1000 10 20 localhost"
# Start Server
bash trtis/scripts/launch_server.sh $precision
# Restart Server
restart_server() {
docker kill trt_server_cont
bash trtis/scripts/launch_server.sh $precision
}
############## Dynamic Batching Comparison ##############
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=1
TRTIS_ENGINE_COUNT=1
# Dynamic batching 10 ms
TRTIS_DYN_BATCHING_DELAY=10
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 5 ms
TRTIS_DYN_BATCHING_DELAY=5
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Dynamic batching 2 ms
TRTIS_DYN_BATCHING_DELAY=2
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Static Batching (i.e. Dynamic batching 0 ms)
TRTIS_DYN_BATCHING_DELAY=0
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# ############## Engine Count Comparison ##############
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRTIS_DYN_BATCHING_DELAY=0
# Engine Count = 4
TRTIS_ENGINE_COUNT=4
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 2
TRTIS_ENGINE_COUNT=2
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
# Engine Count = 1
TRTIS_ENGINE_COUNT=1
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
############## Batch Size Comparison ##############
# BATCH=1 Generate model and perf
SERVER_BATCH_SIZE=1
CLIENT_BATCH_SIZE=1
TRTIS_ENGINE_COUNT=1
TRTIS_DYN_BATCHING_DELAY=0
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
# BATCH=2 Generate model and perf
SERVER_BATCH_SIZE=2
CLIENT_BATCH_SIZE=2
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
# BATCH=4 Generate model and perf
SERVER_BATCH_SIZE=4
CLIENT_BATCH_SIZE=4
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
# BATCH=8 Generate model and perf
SERVER_BATCH_SIZE=8
CLIENT_BATCH_SIZE=8
bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
restart_server
sleep 15
bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost

View file

@ -149,10 +149,11 @@ class InputFeatures(object):
self.end_position = end_position
self.is_impossible = is_impossible
def read_squad_examples(input_file, is_training, version_2_with_negative=False):
"""Read a SQuAD json file into a list of SquadExample."""
with tf.gfile.Open(input_file, "r") as reader:
input_data = json.load(reader)["data"]
def read_squad_examples(input_file, is_training, version_2_with_negative=False, input_data=None):
"""Return list of SquadExample from input_data or input_file (SQuAD json file)"""
if input_data is None:
with tf.gfile.Open(input_file, "r") as reader:
input_data = json.load(reader)["data"]
def is_whitespace(c):
if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:

View file

@ -0,0 +1,55 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dllogger import Logger, StdOutBackend, JSONStreamBackend, Verbosity
import numpy
class dllogger_class():
def format_step(self, step):
if isinstance(step, str):
return step
elif isinstance(step, int):
return "Iteration: {} ".format(step)
elif len(step) > 0:
return "Iteration: {} ".format(step[0])
else:
return ""
def __init__(self, log_path="bert_dllog.json"):
self.logger = Logger([
StdOutBackend(Verbosity.DEFAULT, step_format=self.format_step),
JSONStreamBackend(Verbosity.VERBOSE, log_path),
])
self.logger.metadata("mlm_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("nsp_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("avg_loss_step", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("total_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
self.logger.metadata("f1", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("precision", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("recall", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("mcc", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata("exact_match", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
self.logger.metadata(
"throughput_train",
{"unit": "seq/s", "format": ":.3f", "GOAL": "MAXIMIZE", "STAGE": "TRAIN"},
)
self.logger.metadata(
"throughput_inf",
{"unit": "seq/s", "format": ":.3f", "GOAL": "MAXIMIZE", "STAGE": "VAL"},
)