{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# http://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License.\n",
"# =============================================================================="
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
"\n",
"# BERT Question Answering Inference with Mixed Precision\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Overview\n",
"\n",
"Bidirectional Embedding Representations from Transformers (BERT), is a method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. \n",
"\n",
"The original paper can be found here: https://arxiv.org/abs/1810.04805.\n",
"\n",
"NVIDIA's BERT 19.10 is an optimized version of Google's official implementation, leveraging mixed precision arithmetic and tensor cores on V100 GPUS for faster training times while maintaining target accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.a Learning objectives\n",
"\n",
"This notebook demonstrates:\n",
"- Inference on QA task with BERT Large model\n",
"- The use/download of fine-tuned NVIDIA BERT models\n",
"- Use of Mixed Precision for Inference"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Requirements\n",
"\n",
"Please refer to the ReadMe file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. BERT Inference: Question Answering\n",
"\n",
"We can run inference on a fine-tuned BERT model for tasks like Question Answering.\n",
"\n",
"Here we use a BERT model fine-tuned on a [SQuaD 2.0 Dataset](https://rajpurkar.github.io/SQuAD-explorer/) which contains 100,000+ question-answer pairs on 500+ articles combined with over 50,000 new, unanswerable questions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.a Paragraph and Queries\n",
"\n",
"In this example we will ask our BERT model questions related to the following paragraph:\n",
"\n",
"**The Apollo Program**\n",
"_\"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.\"_\n",
"\n",
"The questions and relative answers expected are shown below:\n",
"\n",
" - **Q1:** \"What project put the first Americans into space?\" \n",
" - **A1:** \"Project Mercury\"\n",
" - **Q2:** \"What program was created to carry out these projects and missions?\"\n",
" - **A2:** \"The Apollo program\"\n",
" - **Q3:** \"What year did the first manned Apollo flight occur?\"\n",
" - **A3:** \"1968\"\n",
" - **Q4:** \"What President is credited with the original notion of putting Americans in space?\"\n",
" - **A4:** \"John F. Kennedy\"\n",
" - **Q5:** \"Who did the U.S. collaborate with on an Earth orbit mission in 1975?\"\n",
" - **A5:** \"Soviet Union\"\n",
" - **Q6:** \"How long did Project Apollo run?\"\n",
" - **A6:** \"1961 to 1972\"\n",
" - **Q7:** \"What program helped develop space travel techniques that Project Apollo used?\"\n",
" - **A7:** \"Gemini Mission\"\n",
" - **Q8:** \"What space station supported three manned missions in 1973-1974?\"\n",
" - **A8:** \"Skylab\"\n",
" \n",
"---\n",
"\n",
"The paragraph and the questions can be easily customized by changing the code below:\n",
"\n",
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile input.json\n",
"{\"data\": \n",
" [\n",
" {\"title\": \"Project Apollo\",\n",
" \"paragraphs\": [\n",
" {\"context\":\"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.\", \n",
" \"qas\": [\n",
" { \"question\": \"What project put the first Americans into space?\", \n",
" \"id\": \"Q1\"\n",
" },\n",
" { \"question\": \"What program was created to carry out these projects and missions?\",\n",
" \"id\": \"Q2\"\n",
" },\n",
" { \"question\": \"What year did the first manned Apollo flight occur?\",\n",
" \"id\": \"Q3\"\n",
" }, \n",
" { \"question\": \"What President is credited with the original notion of putting Americans in space?\",\n",
" \"id\": \"Q4\"\n",
" },\n",
" { \"question\": \"Who did the U.S. collaborate with on an Earth orbit mission in 1975?\",\n",
" \"id\": \"Q5\"\n",
" },\n",
" { \"question\": \"How long did Project Apollo run?\",\n",
" \"id\": \"Q6\"\n",
" }, \n",
" { \"question\": \"What program helped develop space travel techniques that Project Apollo used?\",\n",
" \"id\": \"Q7\"\n",
" }, \n",
" {\"question\": \"What space station supported three manned missions in 1973-1974?\",\n",
" \"id\": \"Q8\"\n",
" } \n",
"]}]}]}"
]
},
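{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the cell below parses the `input.json` file we just wrote and lists the question ids and texts. It assumes the notebook's working directory is the one `%%writefile` wrote to."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: parse the SQuAD-style file written above and list its questions.\n",
"# Assumes the current working directory is where %%writefile placed input.json.\n",
"import json\n",
"\n",
"with open('input.json') as f:\n",
"    squad_data = json.load(f)\n",
"\n",
"for article in squad_data['data']:\n",
"    for paragraph in article['paragraphs']:\n",
"        for qa in paragraph['qas']:\n",
"            print(qa['id'], '-', qa['question'])"
]
},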
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"\n",
"notebooks_dir = '/workspace/bert/notebooks'\n",
"data_dir = '/workspace/bert/data/download'\n",
"\n",
"working_dir = '/workspace/bert'\n",
"if working_dir not in sys.path:\n",
" sys.path.append(working_dir)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_file = os.path.join(notebooks_dir, 'input.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.b Mixed Precision\n",
"\n",
"Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of tensor cores in the Volta and Turing architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures.\n",
"\n",
"For information about:\n",
"- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.\n",
"- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.\n",
"- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we control mixed precision execution with the environmental variable:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"TF_ENABLE_AUTO_MIXED_PRECISION\"] = \"1\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can choose the mixed precision model (which takes much less time to train than the fp32 version) without losing accuracy, with the following flag: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"use_mixed_precision_model = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To effectively evaluate the speedup of mixed precision try a bigger workload by uncommenting the following line:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#input_file = '/workspace/bert/data/download/squad/v2.0/dev-v2.0.json'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Fine-Tuned NVIDIA BERT TF Models\n",
"\n",
"Based on the model size, we have the following two default configurations of BERT.\n",
"\n",
"| **Model** | **Hidden layers** | **Hidden unit size** | **Attention heads** | **Feedforward filter size** | **Max sequence length** | **Parameters** |\n",
"|:---------:|:----------:|:----:|:---:|:--------:|:---:|:----:|\n",
"|BERTBASE |12 encoder| 768| 12|4 x 768|512|110M|\n",
"|BERTLARGE|24 encoder|1024| 16|4 x 1024|512|330M|\n",
"\n",
"We will take advantage of the fine-tuned models available on NGC (NVIDIA GPU Cluster, https://ngc.nvidia.com).\n",
"Among the many configurations available we will download these two:\n",
"\n",
" - **bert_tf_v2_large_fp32_384**\n",
"\n",
" - **bert_tf_v2_large_fp16_384**\n",
"\n",
"Which are trained on the SQuaD 2.0 Dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# bert_tf_v2_large_fp32_384\n",
"DATA_DIR_FP32='/workspace/bert/data/download/finetuned_model_fp32'\n",
"!mkdir -p $DATA_DIR_FP32\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP32/bert_tf_v2_large_fp32_384.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v2_large_fp32_384/versions/1/zip\n",
"!unzip -n -d $DATA_DIR_FP32/ $DATA_DIR_FP32/bert_tf_v2_large_fp32_384.zip \n",
" \n",
"# bert_tf_v2_large_fp16_384\n",
"DATA_DIR_FP16='/workspace/bert/data/download/finetuned_model_fp16'\n",
"!mkdir -p $DATA_DIR_FP16\n",
"!wget -nc -q --show-progress -O $DATA_DIR_FP16/bert_tf_v2_large_fp16_384.zip \\\n",
"https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_v2_large_fp16_384/versions/1/zip\n",
"!unzip -n -d $DATA_DIR_FP16/ $DATA_DIR_FP16/bert_tf_v2_large_fp16_384.zip "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the code that follows we will refer to these models."
]
},
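{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, the cell below lists the files extracted for both checkpoints. The exact file names may vary between model versions on NGC; the checkpoint prefix `model.ckpt-8144` is the one referenced later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm that both fine-tuned checkpoints were extracted as expected.\n",
"# The checkpoint prefix referenced later in this notebook is model.ckpt-8144.\n",
"for model_dir in ['/workspace/bert/data/download/finetuned_model_fp32',\n",
"                  '/workspace/bert/data/download/finetuned_model_fp16']:\n",
"    print(model_dir)\n",
"    for name in sorted(os.listdir(model_dir)):\n",
"        print('   ', name)"
]
},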
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Running QA task inference\n",
"\n",
"In order to run QA inference we will follow step-by-step the flow implemented in run_squad.py.\n",
"\n",
"Configuration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import run_squad\n",
"import json\n",
"import tensorflow as tf\n",
"import modeling\n",
"import tokenization\n",
"import time\n",
"import random\n",
"\n",
"tf.logging.set_verbosity(tf.logging.INFO)\n",
"\n",
"# Create the output directory where all the results are saved.\n",
"output_dir = os.path.join(working_dir, 'results')\n",
"tf.gfile.MakeDirs(output_dir)\n",
"\n",
"# The config json file corresponding to the pre-trained BERT model.\n",
"# This specifies the model architecture.\n",
"bert_config_file = os.path.join(data_dir, 'google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_config.json')\n",
"\n",
"# The vocabulary file that the BERT model was trained on.\n",
"vocab_file = os.path.join(data_dir, 'google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt')\n",
"\n",
"# Depending on the mixed precision flag we use different fine-tuned model\n",
"if use_mixed_precision_model:\n",
" init_checkpoint = os.path.join(data_dir, 'finetuned_model_fp16/model.ckpt-8144')\n",
"else:\n",
" init_checkpoint = os.path.join(data_dir, 'finetuned_model_fp32/model.ckpt-8144')\n",
"\n",
"# Whether to lower case the input text. \n",
"# Should be True for uncased models and False for cased models.\n",
"do_lower_case = True\n",
" \n",
"# Total batch size for predictions\n",
"predict_batch_size = 1\n",
"params = dict([('batch_size', predict_batch_size)])\n",
"\n",
"# The maximum total input sequence length after WordPiece tokenization. \n",
"# Sequences longer than this will be truncated, and sequences shorter than this will be padded.\n",
"max_seq_length = 384\n",
"\n",
"# When splitting up a long document into chunks, how much stride to take between chunks.\n",
"doc_stride = 128\n",
"\n",
"# The maximum number of tokens for the question. \n",
"# Questions longer than this will be truncated to this length.\n",
"max_query_length = 64\n",
"\n",
"# This is a WA to use flags from here:\n",
"flags = tf.flags\n",
"\n",
"if 'f' not in tf.flags.FLAGS: \n",
" tf.app.flags.DEFINE_string('f', '', 'kernel')\n",
"FLAGS = flags.FLAGS\n",
"\n",
"# The total number of n-best predictions to generate in the nbest_predictions.json output file.\n",
"n_best_size = 20\n",
"\n",
"# The maximum length of an answer that can be generated. \n",
"# This is needed because the start and end predictions are not conditioned on one another.\n",
"max_answer_length = 30"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define the tokenizer and create the model for the estimator:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Validate the casing config consistency with the checkpoint name.\n",
"tokenization.validate_case_matches_checkpoint(do_lower_case, init_checkpoint)\n",
"\n",
"# Create the tokenizer.\n",
"tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)\n",
"\n",
"# Load the configuration from file\n",
"bert_config = modeling.BertConfig.from_json_file(bert_config_file)\n",
"\n",
"def model_fn(features, labels, mode, params): # pylint: disable=unused-argument\n",
" unique_ids = features[\"unique_ids\"]\n",
" input_ids = features[\"input_ids\"]\n",
" input_mask = features[\"input_mask\"]\n",
" segment_ids = features[\"segment_ids\"]\n",
"\n",
" (start_logits, end_logits) = run_squad.create_model(\n",
" bert_config=bert_config,\n",
" is_training=False,\n",
" input_ids=input_ids,\n",
" input_mask=input_mask,\n",
" segment_ids=segment_ids,\n",
" use_one_hot_embeddings=False)\n",
"\n",
" tvars = tf.trainable_variables()\n",
"\n",
" initialized_variable_names = {}\n",
" (assignment_map, initialized_variable_names) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)\n",
" tf.train.init_from_checkpoint(init_checkpoint, assignment_map)\n",
" output_spec = None\n",
" predictions = {\"unique_ids\": unique_ids,\n",
" \"start_logits\": start_logits,\n",
" \"end_logits\": end_logits}\n",
" output_spec = tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)\n",
" return output_spec\n",
"\n",
"config = tf.ConfigProto(log_device_placement=True) \n",
"\n",
"run_config = tf.estimator.RunConfig(\n",
" model_dir=None,\n",
" session_config=config,\n",
" save_checkpoints_steps=1000,\n",
" keep_checkpoint_max=1)\n",
"\n",
"estimator = tf.estimator.Estimator(\n",
" model_fn=model_fn,\n",
" config=run_config,\n",
" params=params)"
]
},
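{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running inference, we can optionally preview how the WordPiece tokenizer splits a question. This is purely illustrative: the same tokenizer is applied automatically when the examples are converted to features in the next section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: show the WordPiece tokens and vocabulary ids for one of our questions.\n",
"sample_question = \"Who did the U.S. collaborate with on an Earth orbit mission in 1975?\"\n",
"sample_tokens = tokenizer.tokenize(sample_question)\n",
"print(sample_tokens)\n",
"print(tokenizer.convert_tokens_to_ids(sample_tokens))"
]
},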
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5.a Inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"eval_examples = run_squad.read_squad_examples(\n",
" input_file=input_file, is_training=False)\n",
"\n",
"eval_writer = run_squad.FeatureWriter(\n",
" filename=os.path.join(output_dir, \"eval.tf_record\"),\n",
" is_training=False)\n",
"\n",
"eval_features = []\n",
"def append_feature(feature):\n",
" eval_features.append(feature)\n",
" eval_writer.process_feature(feature)\n",
"\n",
"\n",
"# Loads a data file into a list of InputBatch's\n",
"run_squad.convert_examples_to_features(\n",
" examples=eval_examples,\n",
" tokenizer=tokenizer,\n",
" max_seq_length=max_seq_length,\n",
" doc_stride=doc_stride,\n",
" max_query_length=max_query_length,\n",
" is_training=False,\n",
" output_fn=append_feature)\n",
"\n",
"eval_writer.close()\n",
"\n",
"tf.logging.info(\"***** Running predictions *****\")\n",
"tf.logging.info(\" Num orig examples = %d\", len(eval_examples))\n",
"tf.logging.info(\" Num split examples = %d\", len(eval_features))\n",
"tf.logging.info(\" Batch size = %d\", predict_batch_size)\n",
"\n",
"predict_input_fn = run_squad.input_fn_builder(\n",
" input_file=eval_writer.filename,\n",
" batch_size=predict_batch_size,\n",
" seq_length=max_seq_length,\n",
" is_training=False,\n",
" drop_remainder=False)\n",
"\n",
"all_results = []\n",
"eval_hooks = [run_squad.LogEvalRunHook(predict_batch_size)]\n",
"eval_start_time = time.time()\n",
"for result in estimator.predict(\n",
" predict_input_fn, yield_single_examples=True, hooks=eval_hooks, checkpoint_path=init_checkpoint):\n",
" unique_id = int(result[\"unique_ids\"])\n",
" start_logits = [float(x) for x in result[\"start_logits\"].flat]\n",
" end_logits = [float(x) for x in result[\"end_logits\"].flat]\n",
" all_results.append(\n",
" run_squad.RawResult(\n",
" unique_id=unique_id,\n",
" start_logits=start_logits,\n",
" end_logits=end_logits))\n",
"\n",
"eval_time_elapsed = time.time() - eval_start_time\n",
"\n",
"eval_time_wo_startup = eval_hooks[-1].total_time\n",
"num_sentences = eval_hooks[-1].count * predict_batch_size\n",
"avg_sentences_per_second = num_sentences * 1.0 / eval_time_wo_startup\n",
"\n",
"tf.logging.info(\"-----------------------------\")\n",
"tf.logging.info(\"Total Inference Time = %0.2f Inference Time W/O start up overhead = %0.2f \"\n",
" \"Sentences processed = %d\", eval_time_elapsed, eval_time_wo_startup,\n",
" num_sentences)\n",
"tf.logging.info(\"Inference Performance = %0.4f sentences/sec\", avg_sentences_per_second)\n",
"tf.logging.info(\"-----------------------------\")\n",
"\n",
"output_prediction_file = os.path.join(output_dir, \"predictions.json\")\n",
"output_nbest_file = os.path.join(output_dir, \"nbest_predictions.json\")\n",
"output_null_log_odds_file = os.path.join(output_dir, \"null_odds.json\")\n",
"\n",
"run_squad.write_predictions(eval_examples, eval_features, all_results,\n",
" n_best_size, max_answer_length,\n",
" do_lower_case, output_prediction_file,\n",
" output_nbest_file, output_null_log_odds_file)\n",
"\n",
"tf.logging.info(\"Inference Results:\")\n",
"\n",
"# Here we show only the prediction results, nbest prediction is also available in the output directory\n",
"results = \"\"\n",
"with open(output_prediction_file, 'r') as json_file:\n",
" data = json.load(json_file)\n",
" for question in eval_examples:\n",
" results += \"<tr><td>{}</td><td>{}</td><td>{}</td></tr>\".format(question.qas_id, question.question_text, data[question.qas_id])\n",
"\n",
"\n",
"from IPython.display import display, HTML\n",
"display(HTML(\"<table><tr><th>Id</th><th>Question</th><th>Answer</th></tr>{}</table>\".format(results))) "
]
},
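{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the best answer per question, `run_squad.write_predictions` also wrote an n-best file. The optional cell below prints the top candidates for one question; it assumes the standard format produced by `write_predictions`, where each candidate carries a `text` and a `probability` field."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the n-best candidates for one question.\n",
"# Assumes the nbest_predictions.json format written by run_squad.write_predictions,\n",
"# where each candidate entry carries the answer text and its probability.\n",
"with open(output_nbest_file, 'r') as json_file:\n",
"    nbest = json.load(json_file)\n",
"\n",
"first_id = next(iter(nbest))\n",
"print('Question id:', first_id)\n",
"for candidate in nbest[first_id][:5]:\n",
"    print('{:6.3f}  {}'.format(candidate.get('probability', 0.0), candidate.get('text', '')))"
]
},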
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. What's next"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you are familiar with running QA Inference on BERT, using mixed precision, you may want to try\n",
"your own paragraphs and queries. \n",
"\n",
"You may also want to take a look to the notebook __bert_squad_tf_finetuning.ipynb__ on how to run fine-tuning on BERT, available in the same directory."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}