
FasterTransformer

This repository provides a script and recipe to run the highly optimized transformer for inference, and it is tested and maintained by NVIDIA.

Table Of Contents

Model overview

FasterTransformer v1

FasterTransformer v1 provides a highly optimized BERT-equivalent Transformer layer for inference, including a C++ API, TensorFlow op and TensorRT plugin. The experiments show that FasterTransformer v1 can provide a 1.3x ~ 2x speedup on NVIDIA Tesla T4 and NVIDIA Tesla V100 for inference.

FasterTransformer v2

FasterTransformer v2 adds a highly optimized OpenNMT-tf based decoder and decoding for inference to FasterTransformer v1, including a C++ API and TensorFlow op. The experiments show that FasterTransformer v2 can provide a 1.5x ~ 11x speedup on NVIDIA Tesla T4 and NVIDIA Tesla V100 for inference.

FasterTransformer v2.1

FasterTransformer v2.1 optimizes some encoder and decoder kernels, and adds support for PyTorch, for removing the padding of the encoder input, and for sampling algorithms in decoding.
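
The padding-removal feature builds on the Effective Transformer idea: only the valid tokens are packed into a dense buffer before the heavy GEMMs and scattered back afterwards. A minimal NumPy sketch of that idea, using illustrative names rather than the library's API:

```python
import numpy as np

def remove_padding(hidden_states, mask):
    """Pack only the valid (non-padded) tokens into a dense 2-D buffer.

    hidden_states: [batch, seq_len, hidden] padded activations
    mask:          [batch, seq_len] with 1 for real tokens, 0 for padding
    Returns the packed tokens and the flat indices needed to restore them.
    """
    batch, seq_len, hidden = hidden_states.shape
    flat = hidden_states.reshape(batch * seq_len, hidden)
    valid_idx = np.flatnonzero(mask.reshape(-1))   # positions of real tokens
    return flat[valid_idx], valid_idx              # [num_valid, hidden]

def restore_padding(packed, valid_idx, batch, seq_len):
    """Scatter the packed tokens back to the padded [batch, seq_len, hidden] layout."""
    flat = np.zeros((batch * seq_len, packed.shape[1]), dtype=packed.dtype)
    flat[valid_idx] = packed
    return flat.reshape(batch, seq_len, -1)

# Example: 2 sentences of lengths 3 and 1, padded to seq_len = 4.
x = np.random.randn(2, 4, 8).astype(np.float32)
mask = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]])
packed, idx = remove_padding(x, mask)   # only 4 of 8 token slots enter the heavy compute
restored = restore_padding(packed, idx, 2, 4)
```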

FasterTransformer v3.0

FasterTransformer v3.0 adds support for INT8 quantization of the encoder model in the C++ and TensorFlow implementations on NVIDIA Turing and Ampere GPUs.
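
The INT8 path relies on calibrated quantization of weights and activations. As a rough illustration of the numerics only, and not of the library's fused INT8 kernels, symmetric per-tensor quantization can be sketched as:

```python
import numpy as np

def quantize_int8(x, amax):
    """Symmetric per-tensor quantization: map [-amax, amax] to [-127, 127]."""
    scale = 127.0 / amax
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) / scale

w = np.random.randn(4, 4).astype(np.float32)
amax = np.abs(w).max()           # in practice obtained from a calibration run
q, scale = quantize_int8(w, amax)
w_hat = dequantize(q, scale)     # approximation used by an INT8 GEMM path
print(np.abs(w - w_hat).max())   # quantization error is bounded by roughly amax / 127
```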

Architecture matrix

The following matrix shows the architecture differences between the versions.

| Architecture            | Encoder | Encoder INT8 quantization | Decoder | Decoding with beam search | Decoding with sampling |
|-------------------------|---------|---------------------------|---------|---------------------------|------------------------|
| FasterTransformer v1    | Yes     | No                        | No      | No                        | No                     |
| FasterTransformer v2    | Yes     | No                        | Yes     | Yes                       | No                     |
| FasterTransformer v2.1  | Yes     | No                        | Yes     | Yes                       | Yes                    |
| FasterTransformer v3.0  | Yes     | Yes                       | Yes     | Yes                       | Yes                    |

Release notes

FasterTransformer v1 was deprecated in July 2020.

FasterTransformer v2 will be deprecated in December 2020.

FasterTransformer v2.1 will be deprecated in July 2021.

Changelog

Sep 2020

  • Release the FasterTransformer 3.0
    • Support INT8 quantization of the encoder for C++ and the TensorFlow op.
    • Add bert-tf-quantization tool.
    • Fix the issue that CMake 15 or CMake 16 fails to build this project.

Aug 2020

  • Fix the bug of the TensorRT plugin.

June 2020

  • Release the FasterTransformer 2.1
  • Add Effective Transformer support to the encoder.
  • Optimize the beam search kernels.
  • Add PyTorch op support.

May 2020

  • Fix the bug that seq_len of encoder must be larger than 3.
  • Add the position_encoding of decoding as an input of FasterTransformer decoding. This makes it convenient to use different types of position encoding. FasterTransformer does not compute the position encoding values, but only looks them up in the table (see the sketch after this list).
  • Modify the method of loading the model in translate_sample.py.
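
As an illustration of that design, the caller builds whatever position-encoding table it wants inside the framework, and decoding only indexes into it at each step. A minimal NumPy sketch of a sinusoidal table and the per-step lookup (illustrative only, not the exact op interface):

```python
import numpy as np

def sinusoidal_table(max_len, hidden):
    """Standard sinusoidal position encoding table, shape [max_len, hidden]."""
    pos = np.arange(max_len)[:, None].astype(np.float32)
    i = np.arange(hidden)[None, :].astype(np.float32)
    angle = pos / np.power(10000.0, (2.0 * (i // 2)) / hidden)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle)).astype(np.float32)

table = sinusoidal_table(max_len=128, hidden=512)

# At decoding step t, the position encoding is a pure table lookup:
t = 7
pos_enc = table[t]   # shape [hidden]; no trigonometry is computed at decode time
```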

April 2020

  • Rename decoding_opennmt.h to decoding_beamsearch.h
  • Add DiverseSiblingsSearch for decoding.
  • Add sampling into Decoding
    • The implementation is in decoding_sampling.h.
    • Add top_k sampling and top_p sampling for decoding (see the sketch after this list).
  • Refactor the TensorFlow custom op code.
    • Merge bert_transformer_op.h, bert_transformer_op.cu.cc into bert_transformer_op.cc
    • Merge decoder.h, decoder.cu.cc into decoder.cc
    • Merge decoding_beamsearch.h, decoding_beamsearch.cu.cc into decoding_beamsearch.cc
  • Fix bugs in the finalize function of decoding.py.
  • Fix a bug in the TensorFlow DiverseSiblingsSearch.
  • Add the BLEU scorer bleu_score.py into utils. Note that the BLEU scorer requires Python 3.
  • Fuse the QKV GEMM of the encoder and the masked_multi_head_attention of the decoder.
  • Add dynamic batch size and dynamic sequence length features into all ops.
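
top_k and top_p (nucleus) sampling are standard decoding strategies: the former samples from the k most probable tokens, the latter from the smallest set of tokens whose cumulative probability reaches p. A minimal NumPy sketch of one sampling step, illustrating the algorithm rather than the CUDA kernels:

```python
import numpy as np

def sample(logits, top_k=0, top_p=1.0, rng=np.random.default_rng()):
    """Sample one token id from logits after top-k and/or top-p filtering."""
    logits = logits.astype(np.float64).copy()
    if top_k > 0:
        # Keep only the k largest logits, mask out the rest.
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative probability >= top_p.
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        keep = order[:np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return rng.choice(len(probs), p=probs)

token = sample(np.random.randn(32000), top_k=40, top_p=0.9)
```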

March 2020

  • Add features to FasterTransformer 2.0
    • Fix the bug that the maximum sequence length of the decoder cannot be larger than 128.
    • Add translate_sample.py to demonstrate how to translate a sentence by restoring the pretrained model of OpenNMT-tf.
    • Fix the bug that decoding does not check whether it has finished after each step.
    • Fix a bug in the decoder related to max_seq_len.
    • Modify the decoding model structure to fit the OpenNMT-tf decoding model.
      • Add a layer normalization layer after the decoder.
      • Add a normalization for the inputs of the decoder (the layer normalization follows the standard formula; see the sketch after this list).
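
For reference, the layer normalization used in these places is the usual normalize-over-the-hidden-dimension, then scale and shift. A one-function NumPy sketch:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize the last (hidden) dimension, then apply a learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

h = np.random.randn(2, 4, 8).astype(np.float32)
out = layer_norm(h, gamma=np.ones(8, dtype=np.float32), beta=np.zeros(8, dtype=np.float32))
```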

February 2020

  • Release the FasterTransformer 2.0
    • Provide a highly optimized OpenNMT-tf based decoder and decoding, including a C++ API and TensorFlow op.
    • Refine the sample codes of encoder.
    • Add dynamic batch size feature into encoder op.

July 2019

  • Release the FasterTransformer 1.0
    • Provide a highly optimized BERT-equivalent Transformer layer, including a C++ API, TensorFlow op and TensorRT plugin.

Known issues

There are no known issues with this model.