Squashed 'PyTorch/Translation/GNMT/' changes from 4dc145a..51a90b1

51a90b1 Feb 14, 2019 update

git-subtree-dir: PyTorch/Translation/GNMT
git-subtree-split: 51a90b1667e7c1d45bd68bf719f8bf5d4c4521e3
Szymon Migacz 2019-02-14 12:40:30 +01:00
parent 8f95a78af2
commit 8efbea403f
35 changed files with 1939 additions and 1340 deletions

.gitignore vendored (1 line changed)

@ -4,3 +4,4 @@ tags
/results
/data
.DS_Store
.rsyncignore


@ -1,4 +1,4 @@
FROM nvcr.io/nvidia/pytorch:18.06-py3
FROM nvcr.io/nvidia/pytorch:19.01-py3
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

README.md (240 lines changed)

@ -28,20 +28,37 @@ and
* 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest are
unidirectional
* with residual connections starting from 3rd layer
* uses LSTM layer accelerated by cuDNN
* uses standard pytorch nn.LSTM layer
* dropout is applied on input to all LSTM layers, probability of dropout is
set to 0.2
* hidden state of LSTM layers is initialized with zeros
* weights and biases of LSTM layers are initialized with the uniform(-0.1, 0.1)
distribution
* decoder:
* 4-layer unidirectional LSTM with hidden size 1024 and fully-connected
classifier
* with residual connections starting from 3rd layer
* uses LSTM layer accelerated by cuDNN
* uses standard pytorch nn.LSTM layer
* dropout is applied on input to all LSTM layers, probability of dropout is
set to 0.2
* hidden state of LSTM layers is initialized with zeros
* weights and biases of LSTM layers are initialized with the uniform(-0.1, 0.1)
distribution
* weights and biases of the fully-connected classifier are initialized with
the uniform(-0.1, 0.1) distribution
* attention:
* normalized Bahdanau attention
* output from first LSTM layer of decoder goes into attention,
then re-weighted context is concatenated with the input to all subsequent
LSTM layers of the decoder at the current timestep
* linear transform of keys and queries is initialized with uniform(-0.1, 0.1),
normalization scalar is initialized with 1.0 / sqrt(1024),
normalization bias is initialized with zero (see the attention sketch after
this list)
* inference:
* beam search with default beam size of 5
* with coverage penalty and length normalization terms
* with coverage penalty and length normalization, coverage penalty factor is
set to 0.1, length normalization factor is set to 0.6 and length
normalization constant is set to 5.0 (see the scoring sketch after this list)
* detokenized BLEU computed by [SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu)
* [motivation](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu#motivation) for choosing SacreBLEU
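The attention bullets above correspond to a normalized additive (Bahdanau)
score. Below is a minimal sketch, assuming the formulation from the GNMT paper
and the initializations listed above; the parameter and function names are
illustrative and are not taken from this repository's `seq2seq` code.
```
import torch

hidden = 1024
# illustrative parameters, initialized as described in the bullets above
W_q = torch.empty(hidden, hidden).uniform_(-0.1, 0.1)  # linear transform of queries
W_k = torch.empty(hidden, hidden).uniform_(-0.1, 0.1)  # linear transform of keys
v = torch.empty(hidden).uniform_(-0.1, 0.1)            # attention vector
g = torch.tensor(1.0 / hidden ** 0.5)                  # normalization scalar
b = torch.zeros(hidden)                                # normalization bias

def attention_scores(query, keys):
    # normalized Bahdanau score: e_j = (g / ||v||) * v . tanh(W_q q + W_k k_j + b)
    v_hat = g * v / v.norm()
    return torch.tanh(query @ W_q.T + keys @ W_k.T + b) @ v_hat

def reweighted_context(query, keys, values):
    # query: output of the first decoder LSTM layer at the current timestep, (hidden,)
    # keys, values: encoder outputs, (src_len, hidden)
    weights = torch.softmax(attention_scores(query, keys), dim=0)
    return weights @ values
```
Scaling `v` by `g / ||v||` and adding the bias `b` is what distinguishes this
normalized variant from plain Bahdanau attention.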
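The coverage penalty and length normalization used to rank beam candidates can
be sketched in the same spirit. The formulas follow the GNMT paper with the
factors quoted above (0.1, 0.6 and 5.0); the function names are illustrative
and are not taken from `translate.py`.
```
import math

LEN_NORM_FACTOR = 0.6     # length normalization factor
LEN_NORM_CONST = 5.0      # length normalization constant
COV_PENALTY_FACTOR = 0.1  # coverage penalty factor

def length_penalty(tgt_len):
    # lp(Y) = ((c + |Y|) / (c + 1)) ** alpha
    return ((LEN_NORM_CONST + tgt_len) / (LEN_NORM_CONST + 1.0)) ** LEN_NORM_FACTOR

def coverage_penalty(attn):
    # attn: one list of attention weights over source tokens per decoded token
    # cp(X, Y) = beta * sum_i log(min(sum_j p_ij, 1.0))
    num_src = len(attn[0])
    coverage = [sum(step[i] for step in attn) for i in range(num_src)]
    return COV_PENALTY_FACTOR * sum(math.log(min(c, 1.0)) for c in coverage)

def candidate_score(log_prob, tgt_len, attn):
    # higher is better; used to rank finished beam candidates
    return log_prob / length_penalty(tgt_len) + coverage_penalty(attn)
```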
@ -53,12 +70,11 @@ Our experiments show that a 4-layer model is significantly faster to train and
yields comparable accuracy on the public
[WMT16 English-German](http://www.statmt.org/wmt16/translation-task.html)
dataset. The number of LSTM layers is controlled by the `num_layers` parameter
in the `scripts/train.sh` training script.
in the `train.py` training script.
# Setup
## Requirements
* [PyTorch 18.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
(or newer)
* [PyTorch 19.01-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
* [SacreBLEU 1.2.10](https://pypi.org/project/sacrebleu/1.2.10/)
This repository contains `Dockerfile` which extends the PyTorch NGC container
@ -76,7 +92,7 @@ and
Before you can train using mixed precision with Tensor Cores, ensure that you
have a
[NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
based GPU.
based GPU. Other platforms may work but are not officially supported.
For information about how to train using mixed precision, see the
[Mixed Precision Training paper](https://arxiv.org/abs/1710.03740)
and
@ -109,15 +125,33 @@ By default, the training script will use all available GPUs. The training script
saves only one checkpoint with the lowest value of the loss function on the
validation dataset. All results and logs are saved to the `results` directory
(on the host) or to the `/workspace/gnmt/results` directory (in the container).
By default, the `scripts/train.sh` script will launch mixed precision training
By default, the `train.py` script will launch mixed precision training
with Tensor Cores. You can change this behaviour by setting the `--math fp32`
flag in the `scripts/train.sh` script.
flag for the `train.py` training script.
Launching training on 1, 4 or 8 GPUs:
```
bash scripts/train.sh
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
```
Launching training on 16 GPUs:
```
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
```
By default, the training script launches training with a batch size of 128 per
GPU. If the specified `--train-global-batch-size` is larger than 128 times the
number of GPUs available for training, then the training script accumulates
gradients over consecutive iterations and then performs the weight update.
For example, training on 1 GPU with `--train-global-batch-size 1024`
accumulates gradients over 8 iterations before performing the weight update
with the accumulated gradients.
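The number of accumulation iterations follows directly from this rule; a
minimal sketch of the arithmetic (variable names are illustrative, the actual
logic lives in `train.py`):
```
# illustrative arithmetic only; variable names are not taken from train.py
train_global_batch_size = 1024
batch_size_per_gpu = 128
num_gpus = 1

samples_per_iteration = batch_size_per_gpu * num_gpus
assert train_global_batch_size % samples_per_iteration == 0
iters_to_accumulate = train_global_batch_size // samples_per_iteration
print(iters_to_accumulate)  # 8: gradients from 8 iterations are summed per update
```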
The training script automatically runs the validation and testing after each
training epoch. The results from the validation and testing are printed to
the standard error (stderr) and saved to log files.
the standard output (stdout) and saved to log files.
The summary after each training epoch is printed in the following format:
```
@ -145,21 +179,24 @@ Our download script is very similar to the `wmt16_en_de.sh` script from the
[tensorflow/nmt](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh)
repository. Our download script contains an extra preprocessing step, which
discards all pairs of sentences which can't be decoded by *latin-1* encoder.
The `scripts/wmt16_en_de.sh` script uses the
[subword-nmt](https://github.com/rsennrich/subword-nmt)
package to segment text into subword units (BPE). By default, the script builds
the shared vocabulary of 32,000 tokens.
In order to test with other datasets, scripts need to be customized accordingly.
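As a rough guide for such customization, segmenting a different corpus with the
same package could look like the sketch below. It assumes the Python API of
`subword-nmt` (`learn_bpe` and `apply_bpe.BPE`) and made-up file names; the
`scripts/wmt16_en_de.sh` script drives the equivalent command-line tools
instead, so treat this only as an illustration.
```
import codecs

from subword_nmt.apply_bpe import BPE
from subword_nmt.learn_bpe import learn_bpe

# learn a joint BPE model with 32,000 merge symbols (file names are made up)
with codecs.open('train.tok.clean.en-de', encoding='utf-8') as corpus, \
     codecs.open('bpe.32000', 'w', encoding='utf-8') as codes:
    learn_bpe(corpus, codes, num_symbols=32000)

# apply the learned codes to tokenized text
with codecs.open('bpe.32000', encoding='utf-8') as codes:
    bpe = BPE(codes)
print(bpe.process_line('a tokenized sentence to segment'))
```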
## Running training
The default training configuration can be launched by running the
`scripts/train.sh` training script.
`train.py` training script.
By default, the training script saves only one checkpoint with the lowest value
of the loss function on the validation dataset; an evaluation is performed after
each training epoch. Results are stored in the `results/gnmt_wmt16` directory.
The training script launches data-parallel training with batch size 128 per GPU
on all available GPUs. After each training epoch, the script runs an evaluation
on all available GPUs. We have tested training on up to 16 GPUs on a single
node.
After each training epoch, the script runs an evaluation
on the validation dataset and outputs a BLEU score on the test dataset
(*newstest2014*). BLEU is computed by the
[SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu)
@ -171,12 +208,11 @@ behavior by setting the `CUDA_VISIBLE_DEVICES` variable in your environment or
by setting the `NV_GPU` variable at the Docker container launch
([see section "GPU isolation"](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation)).
By default, the `scripts/train.sh` script will launch mixed precision training
By default, the `train.py` script will launch mixed precision training
with Tensor Cores. You can change this behaviour by setting the `--math fp32`
flag in the `scripts/train.sh` script.
flag for the `train.py` script.
Internally, the `scripts/train.sh` script uses `train.py`. To view all available
options for training, run `python3 train.py --help`.
To view all available options for training, run `python3 train.py --help`.
## Running inference
Inference can be run by launching the `translate.py` inference script, although,
@ -188,93 +224,129 @@ normalization term. Greedy decoding can be enabled by setting the beam size to 1
To view all available options for inference, run `python3 translate.py --help`.
## Benchmarking scripts
### Training performance benchmark
The `scripts/benchmark_training.sh` benchmarking script runs a few, relatively
short training sessions and automatically collects performance numbers. The
benchmarking script assumes that the `scripts/wmt16_en_de.sh` data download
script was launched and the datasets are available in the default location
(`data` directory).
Results from the benchmark are stored in the `results` directory. After the
benchmark is done, you can launch the `scripts/parse_train_benchmark.sh` script
to generate a short summary which will contain launch configuration, performance
(in tokens per second), and estimated training time needed for one epoch (in
seconds).
### Inference performance and accuracy benchmark
The `scripts/benchmark_inference.sh` benchmarking script launches a number of
inference runs with different hyperparameters (beam size, batch size, arithmetic
type) on sorted and unsorted *newstest2014* test dataset. Performance and
accuracy results are stored in the `results/inference_benchmark` directory.
BLEU score is computed by the SacreBLEU package.
The `scripts/benchmark_inference.sh` script assumes that the
`scripts/wmt16_en_de.sh` data download script was
launched and the datasets are available in the default location (`data`
directory).
The `scripts/benchmark_inference.sh` script requires a pre-trained
model checkpoint. By default, the script is loading a checkpoint from the
`results/gnmt_wmt16/model_best.pth` location.
## Training Accuracy Results
All results were obtained by running the `scripts/train.sh` script in
the pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
Results were obtained by running the `train.py` script with the default
batch size = 128 per GPU in the pytorch-19.01-py3 Docker container.
### NVIDIA DGX-1 (8x Tesla V100 16G)
Command used to launch the training:
| **number of GPUs** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
| ------------------ | ------------------------ | ------------- | --------------------------------- | ---------------------- |
| 1 | 22.54 | 22.25 | 412 minutes | 948 minutes |
| 4 | 22.45 | 22.46 | 118 minutes | 264 minutes |
| 8 | 22.41 | 22.43 | 64 minutes | 139 minutes |
```
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
```
| **number of GPUs** | **batch size/GPU** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
| --- | --- | ----- | ----- | ------------- | ------------- |
| 1 | 128 | 24.59 | 24.71 | 264.4 minutes | 824.4 minutes |
| 4 | 128 | 24.30 | 24.45 | 89.5 minutes | 230.8 minutes |
| 8 | 128 | 24.45 | 24.48 | 46.2 minutes | 116.6 minutes |
### NVIDIA DGX-2 (16x Tesla V100 32G)
Commands used to launch the training:
```
for 1,4,8 GPUs:
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
for 16 GPUs:
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
```
| **number of GPUs** | **batch size/GPU** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
| --- | --- | ----- | ----- | ------------- | ------------- |
| 1 | 128 | 24.59 | 24.71 | 265.0 minutes | 825.1 minutes |
| 4 | 128 | 24.69 | 24.33 | 87.4 minutes | 216.3 minutes |
| 8 | 128 | 24.50 | 24.47 | 49.6 minutes | 113.5 minutes |
| 16 | 128 | 24.22 | 24.16 | 26.3 minutes | 58.6 minutes |
![TrainingLoss](./img/training_loss.png)
### Training Stability Test
The GNMT v2 model was trained for 10 epochs, starting from 96 different initial
The GNMT v2 model was trained for 6 epochs, starting from 50 different initial
random seeds. After each training epoch the model was evaluated on the test
dataset and the BLEU score was recorded. The training was performed in the
pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs. The
following table summarizes results of the stability test.
pytorch-19.01-py3 Docker container on NVIDIA DGX-1 with 8 Tesla V100 16G GPUs.
The following table summarizes results of the stability test.
![TrainingAccuracy](./img/training_accuracy.png)
## Training Performance Results
All results were obtained by running the `scripts/train.sh` training script in
the pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
Performance numbers (in tokens per second) were averaged over an entire training
epoch.
#### BLEU scores after each training epoch for different initial random seeds
| **epoch** | **average** | **stdev** | **minimum** | **maximum** | **median** |
| --- | ------ | ----- | ------ | ------ | ------ |
| 1 | 19.954 | 0.326 | 18.710 | 20.490 | 20.020 |
| 2 | 21.734 | 0.222 | 21.220 | 22.120 | 21.765 |
| 3 | 22.502 | 0.223 | 21.960 | 22.970 | 22.485 |
| 4 | 23.004 | 0.221 | 22.350 | 23.430 | 23.020 |
| 5 | 24.201 | 0.146 | 23.900 | 24.480 | 24.215 |
| 6 | 24.423 | 0.159 | 24.070 | 24.820 | 24.395 |
| **number of GPUs** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu weak scaling** | **fp32 multi-gpu weak scaling** |
| -------- | ------------- | ------------- | ------------ | --------------------------- | --------------------------- |
| 1 | 42337 | 18581 | 2.279 | 1.000 | 1.000 |
| 4 | 153433 | 67586 | 2.270 | 3.624 | 3.637 |
| 8 | 300181 | 132734 | 2.262 | 7.090 | 7.144 |
## Training Performance Results
All results were obtained by running the `train.py` training script in the
pytorch-19.01-py3 Docker container. Performance numbers (in tokens per second)
were averaged over an entire training epoch.
### NVIDIA DGX-1 (8x Tesla V100 16G)
| **number of GPUs** | **batch size/GPU** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu strong scaling** | **fp32 multi-gpu strong scaling** |
| --- | --- | ------ | ------ | ----- | ----- | ----- |
| 1 | 128 | 66050 | 21346 | 3.094 | 1.000 | 1.000|
| 4 | 128 | 196174 | 76083 | 2.578 | 2.970 | 3.564|
| 8 | 128 | 387282 | 153697 | 2.520 | 5.863 | 7.200|
### NVIDIA DGX-2 (16x Tesla V100 32G)
| **number of GPUs** | **batch size/GPU** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu strong scaling** | **fp32 multi-gpu strong scaling** |
| --- | --- | ------ | ------- | ----- | ------ | ------ |
| 1 | 128 | 65830 | 22695 | 2.901 | 1.000 | 1.000 |
| 4 | 128 | 200886 | 81224 | 2.473 | 3.052 | 3.579 |
| 8 | 128 | 362612 | 156536 | 2.316 | 5.508 | 6.897 |
| 16 | 128 | 738521 | 314831 | 2.346 | 11.219 | 13.872 |
## Inference Performance Results
All results were obtained by running the `scripts/benchmark_inference.sh`
benchmarking script in the pytorch-18.06-py3 Docker container on NVIDIA DGX-1.
Inference was run on a single V100 16G GPU.
All results were obtained by running the `translate.py` script in the
pytorch-19.01-py3 Docker container on NVIDIA DGX-1. Inference benchmark was run
on a single Tesla V100 16G GPU. The benchmark requires a checkpoint from a fully
trained model.
Command to launch the inference benchmark:
```
python3 translate.py --input data/wmt16_de_en/newstest2014.tok.bpe.32000.en \
--reference data/wmt16_de_en/newstest2014.de --output /tmp/output \
--model results/gnmt/model_best.pth --batch-size 32 128 512 \
--beam-size 1 2 5 10 --math fp16 fp32
```
| **batch size** | **beam size** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision tokens/s** | **fp32 tokens/s** |
| -------------- | ------------- | ------------- | ------------- | ----------------- | ------------ |
| 512 | 1 | 20.63 | 20.63 | 62009 | 31229 |
| 512 | 2 | 21.55 | 21.60 | 32669 | 16454 |
| 512 | 5 | 22.34 | 22.36 | 21105 | 8562 |
| 512 | 10 | 22.34 | 22.40 | 12967 | 4720 |
| 128 | 1 | 20.62 | 20.63 | 27095 | 19505 |
| 128 | 2 | 21.56 | 21.60 | 13224 | 9718 |
| 128 | 5 | 22.38 | 22.36 | 10987 | 6575 |
| 128 | 10 | 22.35 | 22.40 | 8603 | 4103 |
| 32 | 1 | 20.62 | 20.63 | 9451 | 8483 |
| 32 | 2 | 21.56 | 21.60 | 4818 | 4333 |
| 32 | 5 | 22.34 | 22.36 | 4505 | 3655 |
| 32 | 10 | 22.37 | 22.40 | 4086 | 2822 |
| ---- | ----- | ------- | ------- | ---------|-------- |
| 32 | 1 | 23.18 | 23.18 | 23571 | 19462 |
| 32 | 2 | 24.09 | 24.12 | 15303 | 12345 |
| 32 | 5 | 24.63 | 24.62 | 13644 | 7725 |
| 32 | 10 | 24.50 | 24.48 | 11049 | 5359 |
| 128 | 1 | 23.17 | 23.18 | 73429 | 42272 |
| 128 | 2 | 24.07 | 24.12 | 43373 | 23131 |
| 128 | 5 | 24.69 | 24.63 | 29646 | 12525 |
| 128 | 10 | 24.45 | 24.48 | 19100 | 6886 |
| 512 | 1 | 23.17 | 23.18 | 135333 | 48962 |
| 512 | 2 | 24.08 | 24.12 | 74367 | 27308 |
| 512 | 5 | 24.60 | 24.63 | 39217 | 12674 |
| 512 | 10 | 24.54 | 24.48 | 21433 | 6640 |
# Changelog
1. Aug 7, 2018
* Initial release
2. Dec 4, 2018
* Added exponential warm-up and step learning rate decay
* Multi-GPU (distributed) inference and validation
* Default container updated to PyTorch 18.11-py3
* General performance improvements
3. Feb 14, 2019
* Different batching algorithm (bucketing with 5 equal-width buckets)
* Additional dropouts before first LSTM layer in encoder and in decoder
* Weight initialization changed to uniform (-0.1, 0.1)
* Switched order of dropout and concatenation with attention in decoder
* Default container updated to PyTorch 19.01-py3
# Known issues
None

(two binary image files changed, not shown; sizes 121 KiB -> 115 KiB and 205 KiB -> 295 KiB)

launch.py (new file, 235 lines)

@ -0,0 +1,235 @@
r"""
`torch.distributed.launch` is a module that spawns up multiple distributed
training processes on each of the training nodes.
The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. The utility can be used for either
CPU training or GPU training. If the utility is used for GPU training,
each distributed process will be operating on a single GPU. This can achieve
well-improved single-node training performance. It can also be used in
multi-node distributed training, by spawning up multiple processes on each node
for well-improved multi-node distributed training performance as well.
This will especially be beneficial for systems with multiple InfiniBand
interfaces that have direct-GPU support, since all of them can be utilized for
aggregated communication bandwidth.
In both cases of single-node distributed training or multi-node distributed
training, this utility will launch the given number of processes per node
(``--nproc_per_node``). If used for GPU training, this number needs to be less
than or equal to the number of GPUs on the current system (``nproc_per_node``),
and each process will be operating on a single GPU from *GPU 0 to
GPU (nproc_per_node - 1)*.
**How to use this module:**
1. Single-Node multi-process distributed training
::
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
arguments of your training script)
2. Multi-Node multi-process distributed training: (e.g. two nodes)
Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*
::
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
and all other arguments of your training script)
Node 2:
::
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
--nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
and all other arguments of your training script)
3. To look up what optional arguments this module offers:
::
>>> python -m torch.distributed.launch --help
**Important Notices:**
1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently only achieves the best performance using
the NCCL distributed backend. Thus NCCL backend is the recommended backend to
use for GPU training.
2. In your training program, you must parse the command-line argument:
``--local_rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:
Parsing the local_rank argument
::
>>> import argparse
>>> parser = argparse.ArgumentParser()
>>> parser.add_argument("--local_rank", type=int)
>>> args = parser.parse_args()
Set your device to local rank using either
::
>>> torch.cuda.set_device(args.local_rank) # before your code runs
or
::
>>> with torch.cuda.device(args.local_rank):
>>> # your code to run
3. In your training program, you are supposed to call the following function
at the beginning to start the distributed backend. You need to make sure that
the init_method uses ``env://``, which is the only supported ``init_method``
by this module.
::
torch.distributed.init_process_group(backend='YOUR BACKEND',
init_method='env://')
4. In your training program, you can either use regular distributed functions
or use :func:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use
:func:`torch.nn.parallel.DistributedDataParallel` module,
here is how to configure it.
::
model = torch.nn.parallel.DistributedDataParallel(model,
device_ids=[args.local_rank],
output_device=args.local_rank)
Please ensure that ``device_ids`` argument is set to be the only GPU device id
that your code will be operating on. This is generally the local rank of the
process. In other words, the ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility.
.. warning::
``local_rank`` is NOT globally unique: it is only unique per process
on a machine. Thus, don't use it to decide if you should, e.g.,
write to a networked filesystem. See
https://github.com/pytorch/pytorch/issues/12042 for an example of
how things can go wrong if you don't do this correctly.
"""
import sys
import subprocess
import os
import socket
from argparse import ArgumentParser, REMAINDER
import torch
def parse_args():
"""
Helper function parsing the command line options
@retval ArgumentParser
"""
parser = ArgumentParser(description="PyTorch distributed training launch "
"helper utilty that will spawn up "
"multiple distributed processes")
# Optional arguments for the launch helper
parser.add_argument("--nnodes", type=int, default=1,
help="The number of nodes to use for distributed "
"training")
parser.add_argument("--node_rank", type=int, default=0,
help="The rank of the node for multi-node distributed "
"training")
parser.add_argument("--nproc_per_node", type=int, default=None,
help="The number of processes to launch on each node, "
"for GPU training, this is recommended to be set "
"to the number of GPUs in your system so that "
"each process can be bound to a single GPU.")
parser.add_argument("--master_addr", default="127.0.0.1", type=str,
help="Master node (rank 0)'s address, should be either "
"the IP address or the hostname of node 0, for "
"single node multi-proc training, the "
"--master_addr can simply be 127.0.0.1")
parser.add_argument("--master_port", default=29500, type=int,
help="Master node (rank 0)'s free port that needs to "
"be used for communciation during distributed "
"training")
# positional
parser.add_argument("training_script", type=str,
help="The full path to the single GPU training "
"program/script to be launched in parallel, "
"followed by all the arguments for the "
"training script")
# rest from the training program
parser.add_argument('training_script_args', nargs=REMAINDER)
return parser.parse_args()
def main():
args = parse_args()
if args.nproc_per_node is None:
args.nproc_per_node = torch.cuda.device_count()
# world size in terms of number of processes
dist_world_size = args.nproc_per_node * args.nnodes
# set PyTorch distributed related environmental variables
current_env = os.environ.copy()
current_env["MASTER_ADDR"] = args.master_addr
current_env["MASTER_PORT"] = str(args.master_port)
current_env["WORLD_SIZE"] = str(dist_world_size)
processes = []
for local_rank in range(0, args.nproc_per_node):
# each process's rank
dist_rank = args.nproc_per_node * args.node_rank + local_rank
current_env["RANK"] = str(dist_rank)
# spawn the processes
cmd = [sys.executable,
"-u",
args.training_script,
"--local_rank={}".format(local_rank)] + args.training_script_args
process = subprocess.Popen(cmd, env=current_env)
processes.append(process)
returncode = 0
try:
for process in processes:
process_returncode = process.wait()
if process_returncode != 0:
returncode = 1
except KeyboardInterrupt:
print('CTRL-C, TERMINATING WORKERS ...')
for process in processes:
process.terminate()
for process in processes:
process.wait()
raise
sys.exit(returncode)
if __name__ == "__main__":
main()
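To make the contract described in the docstring concrete, a minimal,
hypothetical training script that this launcher could start might look as
follows; the model and the omitted training loop are placeholders, not code
from this repository.
```
import argparse

import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # --local_rank is appended by the launcher for every spawned process
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()

    # bind this process to its GPU and join the process group through the
    # env:// variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) set by the launcher
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')

    model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank)
    # ... data loading and the training loop would go here ...

if __name__ == '__main__':
    main()
```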


@ -1,46 +0,0 @@
import sys
import subprocess
import torch
def main():
argslist = list(sys.argv)[1:]
world_size = torch.cuda.device_count()
if '--world-size' in argslist:
argslist[argslist.index('--world-size') + 1] = str(world_size)
else:
argslist.append('--world-size')
argslist.append(str(world_size))
workers = []
for i in range(world_size):
if '--rank' in argslist:
argslist[argslist.index('--rank') + 1] = str(i)
else:
argslist.append('--rank')
argslist.append(str(i))
stdout = None if i == 0 else subprocess.DEVNULL
worker = subprocess.Popen([str(sys.executable)] + argslist, stdout=stdout)
workers.append(worker)
returncode = 0
try:
for worker in workers:
worker_returncode = worker.wait()
if worker_returncode != 0:
returncode = 1
except KeyboardInterrupt:
print('Pressed CTRL-C, TERMINATING')
for worker in workers:
worker.terminate()
for worker in workers:
worker.wait()
raise
sys.exit(returncode)
if __name__ == "__main__":
main()


@ -1 +1,2 @@
sacrebleu==1.2.10
git+git://github.com/NVIDIA/apex.git#egg=apex


@ -1,101 +0,0 @@
#!/bin/bash
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16'
# sort by length (ascending)
cat ${DATASET_DIR}/newstest2014.tok.bpe.32000.en \
| awk '{ print length, $0 }' \
| sort -n -s \
| cut -d" " -f2- > /tmp/newstest2014.tok.bpe.32000.en.sorted
batches=(512 256 128 64 32)
beams=(1 2 5 10)
maths=(fp16 fp32)
model=results/${RESULTS_DIR}/model_best.pth
odir=results/inference_benchmark
mkdir -p $odir
echo RUNNING on unsorted dataset
rm -rf $odir/fp16_perf_unsorted.log
rm -rf $odir/fp32_perf_unsorted.log
rm -rf $odir/fp16_bleu.log
rm -rf $odir/fp32_bleu.log
ifile=${DATASET_DIR}/newstest2014.tok.bpe.32000.en
rfile=${DATASET_DIR}/newstest2014.de
for math in "${maths[@]}"
do
for batch in "${batches[@]}"
do
for beam in "${beams[@]}"
do
echo RUNNING: batch_size: $batch, beam_size: $beam, math: $math
# run translation
python3 translate.py \
-i $ifile \
-r $rfile \
-m $model \
--math $math \
--print-freq 1 \
--beam-size $beam \
--batch-size $batch \
-o /tmp/output.tok &> /tmp/log.log
tok_per_sec=`cat /tmp/log.log \
|grep "Avg total tokens" \
|cut -f 2 \
|cut -d ':' -f 2 |tr -d ' '`
bleu=`cat /tmp/log.log \
|grep BLEU \
|cut -d ':' -f 2 |tr -d ' '`
echo -e $tok_per_sec '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_perf_unsorted.log
echo -e $bleu '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_bleu.log
echo Tokens per second: $tok_per_sec, BLEU: $bleu
done
done
done
echo RUNNING on sorted dataset
rm -rf $odir/fp16_perf_sorted.log
rm -rf $odir/fp32_perf_sorted.log
ifile=/tmp/newstest2014.tok.bpe.32000.en.sorted
for math in "${maths[@]}"
do
for batch in "${batches[@]}"
do
for beam in "${beams[@]}"
do
echo RUNNING: batch_size: $batch, beam_size: $beam, math: $math
# run translation
python3 translate.py \
-i $ifile \
-m $model \
--math $math \
--print-freq 1 \
--beam-size $beam \
--batch-size $batch \
--no-bleu \
-o /tmp/output.tok &> /tmp/log.log
tok_per_sec=`cat /tmp/log.log \
|grep "Avg total tokens" \
|cut -f 2 \
|cut -d ':' -f 2 |tr -d ' '`
echo -e $tok_per_sec '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_perf_sorted.log
echo Tokens per second: $tok_per_sec
done
done
done


@ -1,28 +0,0 @@
#!/bin/bash
DATASET_DIR='data/wmt16_de_en'
batches=(128)
maths=(fp16 fp32)
gpus=(1 2 4 8)
for math in "${maths[@]}"
do
for batch in "${batches[@]}"
do
for gpu in "${gpus[@]}"
do
export CUDA_VISIBLE_DEVICES=`seq -s "," 0 $((gpu - 1))`
python3 -m multiproc train.py \
--save benchmark_gpu_${gpu}_math_${math}_batch_${batch} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 1 \
--math ${math} \
--print-freq 1 \
--batch-size ${batch} \
--disable-eval \
--max-size $((512 * ${batch} * ${gpu}))
done
done
done


@ -1,6 +1,7 @@
import argparse
from collections import Counter
def parse_args():
parser = argparse.ArgumentParser(description='Clean dataset')
parser.add_argument('-f1', '--file1', help='file1')
@ -12,6 +13,7 @@ def save_output(fname, data):
with open(fname, 'w') as f:
f.writelines(data)
def main():
"""
Discards all pairs of sentences which can't be decoded by latin-1 encoder.


@ -1,46 +0,0 @@
#!/bin/bash
batches=(128)
maths=(fp16 fp32)
gpus=(1 2 4 8)
sentences=3498161
echo -e [parameters] "\t\t\t" [tokens / s] [second per epoch]
for batch in "${batches[@]}"
do
for math in "${maths[@]}"
do
for gpu in "${gpus[@]}"
do
dir=results/benchmark_gpu_${gpu}_math_${math}_batch_${batch}/
if [ ! -d $dir ]; then
echo Directory $dir does not exist
continue
fi
total_tokens_per_s=0
for gpu_id in `seq 0 $((gpu - 1))`
do
tokens_per_s=`cat ${dir}/log_gpu_${gpu_id}.log \
|grep TRAIN \
|cut -f 4 \
|sed -E -n 's/.*\(([0-9]+)\).*/\1/p' \
|tail -n 1`
total_tokens_per_s=$((total_tokens_per_s + tokens_per_s))
done
batch_time=`cat ${dir}/log_gpu_0.log \
|grep TRAIN \
|cut -f 2 \
|sed -E -n 's/.*\(([.0-9]+)\).*/\1/p' \
|tail -n 1`
n_batches=$(( $sentences / ($batch * $gpu)))
epoch_time=`awk "BEGIN {print $n_batches * $batch_time}"`
echo -e math: $math batch: $batch gpus: $gpu "\t\t" $total_tokens_per_s "\t" $epoch_time
done
done
done


@ -1,6 +1,14 @@
fp16,1,Tesla V100-SXM2-16GB,42337
fp16,4,Tesla V100-SXM2-16GB,153433
fp16,8,Tesla V100-SXM2-16GB,300181
fp32,1,Tesla V100-SXM2-16GB,18581
fp32,4,Tesla V100-SXM2-16GB,67586
fp32,8,Tesla V100-SXM2-16GB,132734
fp16,1,Tesla V100-SXM2-16GB,66050
fp16,4,Tesla V100-SXM2-16GB,196174
fp16,8,Tesla V100-SXM2-16GB,387282
fp32,1,Tesla V100-SXM2-16GB,21346
fp32,4,Tesla V100-SXM2-16GB,76083
fp32,8,Tesla V100-SXM2-16GB,153697
fp16,1,Tesla V100-SXM3-32GB,65830
fp16,4,Tesla V100-SXM3-32GB,200886
fp16,8,Tesla V100-SXM3-32GB,362612
fp16,16,Tesla V100-SXM3-32GB,738521
fp32,1,Tesla V100-SXM3-32GB,22695
fp32,4,Tesla V100-SXM3-32GB,81224
fp32,8,Tesla V100-SXM3-32GB,156536
fp32,16,Tesla V100-SXM3-32GB,314831


@ -3,81 +3,35 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=17.2
MATH='fp16'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 1 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 1 \
--remain-steps 1.0 \
--target-bleu 17.20 \
--math ${MATH} \
${TARGET_PERF}


@ -3,81 +3,35 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=17.2
MATH='fp32'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 1 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 1 \
--remain-steps 1.0 \
--target-bleu 17.20 \
--math ${MATH} \
${TARGET_PERF}


@ -3,81 +3,34 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=22.0
MATH='fp16'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 6 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 6 \
--target-bleu 22.00 \
--math ${MATH} \
${TARGET_PERF}


@ -3,81 +3,34 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16_test'
REFERENCE_FILE=scripts/tests/reference_performance
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
REPO_DIR='/workspace/gnmt'
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
REFERENCE_ACCURACY=22.0
MATH='fp32'
PERFORMANCE_TOLERANCE=0.9
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 6 \
--math ${MATH} \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
PERF_TOLERANCE=0.9
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
echo 'GPU_NAME:' ${GPU_NAME}
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
echo 'GPU_COUNT:' ${GPU_COUNT}
# Accuracy test
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|grep Summary \
|tail -n 1 \
|cut -f 4 \
|egrep -o [0-9.]+`
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
${REFERENCE_FILE} | \cut -f 4 -d ','`
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
if (( ${ACCURACY_TEST_RESULT} )); then
echo "&&&& ACCURACY TEST PASSED"
else
echo "&&&& ACCURACY TEST FAILED"
fi
# Performance test
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|grep Performance \
|tail -n 1 \
|cut -f 2 \
|egrep -o [0-9.]+`
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
| \cut -f 4 -d ','`
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
PERFORMANCE_TEST_RESULT=1
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
if [ -z "${REFERENCE_PERF}" ]; then
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
echo "&&&& PERFORMANCE TEST WAIVED"
TARGET_PERF=''
else
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
if (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PERFORMANCE TEST PASSED"
else
echo "&&&& PERFORMANCE TEST FAILED"
fi
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
fi
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
echo "&&&& PASSED"
exit 0
else
echo "&&&& FAILED"
exit 1
fi
cd $REPO_DIR
python3 -m launch train.py \
--dataset-dir $DATASET_DIR \
--seed 1 \
--epochs 6 \
--target-bleu 22.00 \
--math ${MATH} \
${TARGET_PERF}


@ -2,17 +2,4 @@
set -e
DATASET_DIR='data/wmt16_de_en'
RESULTS_DIR='gnmt_wmt16'
# run training
python3 -m multiproc train.py \
--save ${RESULTS_DIR} \
--dataset-dir ${DATASET_DIR} \
--seed 1 \
--epochs 6 \
--math fp16 \
--print-freq 10 \
--batch-size 128 \
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
python3 -m launch train.py


@ -1,20 +1,23 @@
import logging
from operator import itemgetter
import torch
from torch.utils.data import Dataset
from torch.utils.data.sampler import SequentialSampler
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import seq2seq.data.config as config
from seq2seq.data.sampler import BucketingSampler
from seq2seq.data.sampler import DistributedSampler
from seq2seq.data.sampler import ShardingSampler
from seq2seq.data.sampler import StaticDistributedSampler
def build_collate_fn(batch_first=False, parallel=True, sort=False):
"""
Factor for collate_fn functions.
Factory for collate_fn functions.
:param batch_first: if True returns batches in (batch, seq) format, if not
returns in (seq, batch) format
:param batch_first: if True returns batches in (batch, seq) format, if
False returns in (seq, batch) format
:param parallel: if True builds batches from parallel corpus (src, tgt)
:param sort: if True sorts by src sequence length within each batch
"""
@ -49,8 +52,8 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
"""
src_seqs, tgt_seqs = zip(*seqs)
if sort:
key = lambda item: len(item[1])
indices, src_seqs = zip(*sorted(enumerate(src_seqs), key=key,
indices, src_seqs = zip(*sorted(enumerate(src_seqs),
key=lambda item: len(item[1]),
reverse=True))
tgt_seqs = [tgt_seqs[idx] for idx in indices]
@ -64,8 +67,8 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
:param src_seqs: source sequences
"""
if sort:
key = lambda item: len(item[1])
indices, src_seqs = zip(*sorted(enumerate(src_seqs), key=key,
indices, src_seqs = zip(*sorted(enumerate(src_seqs),
key=lambda item: len(item[1]),
reverse=True))
else:
indices = range(len(src_seqs))
@ -81,10 +84,22 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
class TextDataset(Dataset):
def __init__(self, src_fname, tokenizer, min_len=None, max_len=None,
sort=False, max_size=None):
"""
Constructor for the TextDataset. Builds monolingual dataset.
:param src_fname: path to the file with data
:param tokenizer: tokenizer
:param min_len: minimum sequence length
:param max_len: maximum sequence length
:param sort: sorts dataset by sequence length
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
self.min_len = min_len
self.max_len = max_len
self.parallel = False
self.sorted = False
self.src = self.process_data(src_fname, tokenizer, max_size)
@ -98,11 +113,35 @@ class TextDataset(Dataset):
self.sort_by_length()
def sort_by_length(self):
"""
Sorts dataset by the sequence length.
"""
self.lengths, indices = self.lengths.sort(descending=True)
self.src = [self.src[idx] for idx in indices]
self.indices = indices.tolist()
self.sorted = True
def unsort(self, array):
"""
"Unsorts" given array (restores original order of elements before
dataset was sorted by sequence length).
:param array: array to be "unsorted"
"""
if self.sorted:
inverse = sorted(enumerate(self.indices), key=itemgetter(1))
array = [array[i[0]] for i in inverse]
return array
def filter_data(self, min_len, max_len):
"""
Preserves only samples which satisfy the following inequality:
min_len <= sample sequence length <= max_len
:param min_len: minimum sequence length
:param max_len: maximum sequence length
"""
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
initial_len = len(self.src)
@ -116,6 +155,14 @@ class TextDataset(Dataset):
logging.info(f'Pairs before: {initial_len}, after: {filtered_len}')
def process_data(self, fname, tokenizer, max_size):
"""
Loads data from the input file.
:param fname: input file name
:param tokenizer: tokenizer
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
logging.info(f'Processing data from {fname}')
data = []
with open(fname) as dfile:
@ -133,33 +180,57 @@ class TextDataset(Dataset):
def __getitem__(self, idx):
return self.src[idx]
def get_loader(self, batch_size=1, shuffle=False, num_workers=0,
batch_first=False, drop_last=False, bucketing=True):
def get_loader(self, batch_size=1, seeds=None, shuffle=False,
num_workers=0, batch_first=False, pad=False,
batching=None, batching_opt={}):
collate_fn = build_collate_fn(batch_first, parallel=self.parallel,
sort=True)
if shuffle:
sampler = BucketingSampler(self, batch_size, bucketing)
if batching == 'random':
sampler = DistributedSampler(self, batch_size, seeds)
elif batching == 'sharding':
sampler = ShardingSampler(self, batch_size, seeds,
batching_opt['shard_size'])
elif batching == 'bucketing':
sampler = BucketingSampler(self, batch_size, seeds,
batching_opt['num_buckets'])
else:
raise NotImplementedError
else:
sampler = SequentialSampler(self)
sampler = StaticDistributedSampler(self, batch_size, pad)
return DataLoader(self,
batch_size=batch_size,
collate_fn=collate_fn,
sampler=sampler,
num_workers=num_workers,
pin_memory=False,
drop_last=drop_last)
pin_memory=True,
drop_last=False)
class ParallelDataset(TextDataset):
def __init__(self, src_fname, tgt_fname, tokenizer,
min_len, max_len, sort=False, max_size=None):
"""
Constructor for the ParallelDataset.
Tokenization is done when the data is loaded from the disk.
:param src_fname: path to the file with src language data
:param tgt_fname: path to the file with tgt language data
:param tokenizer: tokenizer
:param min_len: minimum sequence length
:param max_len: maximum sequence length
:param sort: sorts dataset by sequence length
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
self.min_len = min_len
self.max_len = max_len
self.parallel = True
self.sorted = False
self.src = self.process_data(src_fname, tokenizer, max_size)
self.tgt = self.process_data(tgt_fname, tokenizer, max_size)
@ -168,19 +239,37 @@ class ParallelDataset(TextDataset):
self.filter_data(min_len, max_len)
assert len(self.src) == len(self.tgt)
lengths = [len(s) + len(t) for (s, t) in zip(self.src, self.tgt)]
self.lengths = torch.tensor(lengths)
src_lengths = [len(s) for s in self.src]
tgt_lengths = [len(t) for t in self.tgt]
self.src_lengths = torch.tensor(src_lengths)
self.tgt_lengths = torch.tensor(tgt_lengths)
self.lengths = self.src_lengths + self.tgt_lengths
if sort:
self.sort_by_length()
def sort_by_length(self):
"""
Sorts dataset by the sequence length.
"""
self.lengths, indices = self.lengths.sort(descending=True)
self.src = [self.src[idx] for idx in indices]
self.tgt = [self.tgt[idx] for idx in indices]
self.src_lengths = [self.src_lengths[idx] for idx in indices]
self.tgt_lengths = [self.tgt_lengths[idx] for idx in indices]
self.indices = indices.tolist()
self.sorted = True
def filter_data(self, min_len, max_len):
"""
Preserves only samples which satisfy the following inequality:
min_len <= src sample sequence length <= max_len AND
min_len <= tgt sample sequence length <= max_len
:param min_len: minimum sequence length
:param max_len: maximum sequence length
"""
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
initial_len = len(self.src)
@ -199,3 +288,98 @@ class ParallelDataset(TextDataset):
def __getitem__(self, idx):
return self.src[idx], self.tgt[idx]
class LazyParallelDataset(TextDataset):
def __init__(self, src_fname, tgt_fname, tokenizer,
min_len, max_len, sort=False, max_size=None):
"""
Constructor for the LazyParallelDataset.
Tokenization is done on the fly.
:param src_fname: path to the file with src language data
:param tgt_fname: path to the file with tgt language data
:param tokenizer: tokenizer
:param min_len: minimum sequence length
:param max_len: maximum sequence length
:param sort: sorts dataset by sequence length
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
self.min_len = min_len
self.max_len = max_len
self.parallel = True
self.sorted = False
self.tokenizer = tokenizer
self.raw_src = self.process_raw_data(src_fname, max_size)
self.raw_tgt = self.process_raw_data(tgt_fname, max_size)
assert len(self.raw_src) == len(self.raw_tgt)
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
# Subtracting 2 because EOS and BOS are added later during tokenization
self.filter_raw_data(min_len - 2, max_len - 2)
assert len(self.raw_src) == len(self.raw_tgt)
# Adding 2 because EOS and BOS are added later during tokenization
src_lengths = [i + 2 for i in self.src_len]
tgt_lengths = [i + 2 for i in self.tgt_len]
self.src_lengths = torch.tensor(src_lengths)
self.tgt_lengths = torch.tensor(tgt_lengths)
self.lengths = self.src_lengths + self.tgt_lengths
def process_raw_data(self, fname, max_size):
"""
Loads data from the input file.
:param fname: input file name
:param max_size: loads at most 'max_size' samples from the input file,
if None loads the entire dataset
"""
logging.info(f'Processing data from {fname}')
data = []
with open(fname) as dfile:
for idx, line in enumerate(dfile):
if max_size and idx == max_size:
break
data.append(line)
return data
def filter_raw_data(self, min_len, max_len):
"""
Preserves only samples which satisfy the following inequality:
min_len <= src sample sequence length <= max_len AND
min_len <= tgt sample sequence length <= max_len
:param min_len: minimum sequence length
:param max_len: maximum sequence length
"""
initial_len = len(self.raw_src)
filtered_src = []
filtered_tgt = []
filtered_src_len = []
filtered_tgt_len = []
for src, tgt in zip(self.raw_src, self.raw_tgt):
src_len = src.count(' ') + 1
tgt_len = tgt.count(' ') + 1
if min_len <= src_len <= max_len and \
min_len <= tgt_len <= max_len:
filtered_src.append(src)
filtered_tgt.append(tgt)
filtered_src_len.append(src_len)
filtered_tgt_len.append(tgt_len)
self.raw_src = filtered_src
self.raw_tgt = filtered_tgt
self.src_len = filtered_src_len
self.tgt_len = filtered_tgt_len
filtered_len = len(self.raw_src)
logging.info(f'Pairs before: {initial_len}, after: {filtered_len}')
def __getitem__(self, idx):
src = torch.tensor(self.tokenizer.segment(self.raw_src[idx]))
tgt = torch.tensor(self.tokenizer.segment(self.raw_tgt[idx]))
return src, tgt
def __len__(self):
return len(self.raw_src)
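A hypothetical usage sketch for the new loader interface, based only on the
signatures visible in this diff: the import path, file names and the toy
tokenizer are assumptions (the real tokenizer is provided elsewhere in the
repository).
```
# assumption: this module is importable as seq2seq.data.dataset
from seq2seq.data.dataset import LazyParallelDataset

class ToyTokenizer:
    # stand-in exposing the segment() method that __getitem__ calls
    def segment(self, line):
        return [hash(token) % 32000 for token in line.split()]

dataset = LazyParallelDataset('train.tok.bpe.32000.en',  # made-up file names
                              'train.tok.bpe.32000.de',
                              tokenizer=ToyTokenizer(),
                              min_len=0,
                              max_len=50)

loader = dataset.get_loader(batch_size=128,
                            seeds=list(range(6)),        # one seed per epoch
                            shuffle=True,
                            batching='bucketing',        # or 'random' / 'sharding'
                            batching_opt={'num_buckets': 5},
                            batch_first=False)
```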


@ -1,23 +1,22 @@
import logging
import torch
from torch.utils.data.sampler import Sampler
from seq2seq.utils import get_world_size, get_rank
from seq2seq.utils import get_rank
from seq2seq.utils import get_world_size
class BucketingSampler(Sampler):
"""
Distributed data sampler supporting bucketing by sequence length.
"""
def __init__(self, dataset, batch_size, bucketing=True, world_size=None,
rank=None):
class DistributedSampler(Sampler):
def __init__(self, dataset, batch_size, seeds, world_size=None, rank=None):
"""
Constructor for the BucketingSampler.
Constructor for the DistributedSampler.
:param dataset: dataset
:param batch_size: batch size
:param bucketing: if True enables bucketing by sequence length
:param world_size: number of processes participating in distributed
training
:param rank: rank of the current process within world_size
:param batch_size: local batch size
:param seeds: list of seeds, one seed for each training epoch
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
if world_size is None:
world_size = get_world_size()
@ -28,75 +27,251 @@ class BucketingSampler(Sampler):
self.world_size = world_size
self.rank = rank
self.epoch = 0
self.bucketing = bucketing
self.seeds = seeds
self.batch_size = batch_size
self.global_batch_size = batch_size * world_size
self.data_len = len(self.dataset)
self.num_samples = self.data_len // self.global_batch_size \
* self.global_batch_size
def __iter__(self):
# deterministically shuffle based on epoch
g = torch.Generator()
g.manual_seed(self.epoch)
def init_rng(self):
"""
Creates new RNG, seed depends on current epoch idx.
"""
rng = torch.Generator()
seed = self.seeds[self.epoch]
logging.info(f'Sampler for epoch {self.epoch} uses seed {seed}')
rng.manual_seed(seed)
return rng
# generate permutation
indices = torch.randperm(self.data_len, generator=g)
# make indices evenly divisible by (batch_size * world_size)
indices = indices[:self.num_samples]
# splits the dataset into chunks of 'batches_in_shard' global batches
# each, sorts by (src + tgt) sequence length within each chunk,
# reshuffles all global batches
if self.bucketing:
batches_in_shard = 80
shard_size = self.global_batch_size * batches_in_shard
nshards = (self.num_samples + shard_size - 1) // shard_size
lengths = self.dataset.lengths[indices]
shards = [indices[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
len_shards = [lengths[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
indices = []
for len_shard in len_shards:
_, ind = len_shard.sort()
indices.append(ind)
output = tuple(shard[idx] for shard, idx in zip(shards, indices))
indices = torch.cat(output)
# global reshuffle
indices = indices.view(-1, self.global_batch_size)
order = torch.randperm(indices.shape[0], generator=g)
indices = indices[order, :]
indices = indices.view(-1)
def distribute_batches(self, indices):
"""
Assigns batches to workers.
Consecutive ranks are getting consecutive batches.
:param indices: torch.tensor with batch indices
"""
assert len(indices) == self.num_samples
# build indices for each individual worker
# consecutive ranks are getting consecutive batches,
# default pytorch DistributedSampler assigns strided batches
# with offset = length / world_size
indices = indices.view(-1, self.batch_size)
indices = indices[self.rank::self.world_size].contiguous()
indices = indices.view(-1)
indices = indices.tolist()
assert len(indices) == self.num_samples // self.world_size
return indices
def reshuffle_batches(self, indices, rng):
"""
Permutes global batches
:param indices: torch.tensor with batch indices
:param rng: instance of torch.Generator
"""
indices = indices.view(-1, self.global_batch_size)
num_batches = indices.shape[0]
order = torch.randperm(num_batches, generator=rng)
indices = indices[order, :]
indices = indices.view(-1)
return indices
def __iter__(self):
rng = self.init_rng()
# generate permutation
indices = torch.randperm(self.data_len, generator=rng)
# make indices evenly divisible by (batch_size * world_size)
indices = indices[:self.num_samples]
# assign batches to workers
indices = self.distribute_batches(indices)
return iter(indices)
def __len__(self):
return self.num_samples // self.world_size
def set_epoch(self, epoch):
"""
Sets current epoch index. This value is used to seed RNGs in __iter__()
function.
Sets current epoch index.
Epoch index is used to seed RNG in __iter__() function.
:param epoch: index of current epoch
"""
self.epoch = epoch
def __len__(self):
return self.num_samples // self.world_size
class ShardingSampler(DistributedSampler):
def __init__(self, dataset, batch_size, seeds, shard_size,
world_size=None, rank=None):
"""
Constructor for the ShardingSampler.
:param dataset: dataset
:param batch_size: local batch size
:param seeds: list of seeds, one seed for each training epoch
:param shard_size: number of global batches within one shard
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
super().__init__(dataset, batch_size, seeds, world_size, rank)
self.shard_size = shard_size
self.num_samples = self.data_len // self.global_batch_size \
* self.global_batch_size
def __iter__(self):
rng = self.init_rng()
# generate permutation
indices = torch.randperm(self.data_len, generator=rng)
# make indices evenly divisible by (batch_size * world_size)
indices = indices[:self.num_samples]
# splits the dataset into chunks of 'self.shard_size' global batches
# each, sorts by (src + tgt) sequence length within each chunk,
# reshuffles all global batches
shard_size = self.global_batch_size * self.shard_size
nshards = (self.num_samples + shard_size - 1) // shard_size
lengths = self.dataset.lengths[indices]
shards = [indices[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
len_shards = [lengths[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
# sort by (src + tgt) sequence length within each shard
indices = []
for len_shard in len_shards:
_, ind = len_shard.sort()
indices.append(ind)
output = tuple(shard[idx] for shard, idx in zip(shards, indices))
# build batches
indices = torch.cat(output)
# perform global reshuffle of all global batches
indices = self.reshuffle_batches(indices, rng)
# distribute batches to individual workers
indices = self.distribute_batches(indices)
return iter(indices)
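A toy, self-contained sketch of the sharding idea above (shard a shuffled index list, sort each shard by sequence length, then concatenate); the sizes and tensors here are made up for illustration:
import torch
lengths = torch.tensor([7, 2, 9, 4, 6, 3, 8, 5])   # per-sample (src + tgt) lengths
perm = torch.randperm(8)
shard_size = 4                                      # e.g. 2 global batches of 2 samples
shards = [perm[i:i + shard_size] for i in range(0, 8, shard_size)]
sorted_shards = []
for shard in shards:
    _, order = lengths[shard].sort()                # sort within the shard only
    sorted_shards.append(shard[order])
indices = torch.cat(sorted_shards)                  # batches now group similar lengths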
class BucketingSampler(DistributedSampler):
def __init__(self, dataset, batch_size, seeds, num_buckets,
world_size=None, rank=None):
"""
Constructor for the BucketingSampler.
:param dataset: dataset
:param batch_size: local batch size
:param seeds: list of seeds, one seed for each training epoch
:param num_buckets: number of buckets
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
super().__init__(dataset, batch_size, seeds, world_size, rank)
self.num_buckets = num_buckets
bucket_width = (dataset.max_len + num_buckets - 1) // num_buckets
# assign sentences to buckets based on src and tgt sequence lengths
bucket_ids = torch.max(dataset.src_lengths // bucket_width,
dataset.tgt_lengths // bucket_width)
bucket_ids.clamp_(0, num_buckets - 1)
# build buckets
all_indices = torch.tensor(range(self.data_len))
self.buckets = []
self.num_samples = 0
global_bs = self.global_batch_size
for bid in range(num_buckets):
# gather indices for current bucket
indices = all_indices[bucket_ids == bid]
self.buckets.append(indices)
# count number of samples in current bucket
samples = len(indices) // global_bs * global_bs
self.num_samples += samples
def __iter__(self):
rng = self.init_rng()
global_bs = self.global_batch_size
indices = []
for bid in range(self.num_buckets):
# random shuffle within current bucket
perm = torch.randperm(len(self.buckets[bid]), generator=rng)
bucket_indices = self.buckets[bid][perm]
# make bucket_indices evenly divisible by global batch size
length = len(bucket_indices) // global_bs * global_bs
bucket_indices = bucket_indices[:length]
assert len(bucket_indices) % self.global_batch_size == 0
# add samples from current bucket to indices for current epoch
indices.append(bucket_indices)
indices = torch.cat(indices)
assert len(indices) % self.global_batch_size == 0
# perform global reshuffle of all global batches
indices = self.reshuffle_batches(indices, rng)
# distribute batches to individual workers
indices = self.distribute_batches(indices)
return iter(indices)
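A quick numeric illustration of the bucket assignment used above, with made-up lengths, max_len=50 and num_buckets=5:
import torch
max_len, num_buckets = 50, 5
bucket_width = (max_len + num_buckets - 1) // num_buckets        # 10
src_lengths = torch.tensor([3, 12, 27, 49])
tgt_lengths = torch.tensor([8, 15, 22, 44])
bucket_ids = torch.max(src_lengths // bucket_width,
                       tgt_lengths // bucket_width).clamp(0, num_buckets - 1)
print(bucket_ids.tolist())                                       # [0, 1, 2, 4]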
class StaticDistributedSampler(Sampler):
def __init__(self, dataset, batch_size, pad, world_size=None, rank=None):
"""
Constructor for the StaticDistributedSampler.
:param dataset: dataset
:param batch_size: local batch size
:param pad: if True: pads dataset to a multiple of global_batch_size
samples
:param world_size: number of distributed workers
:param rank: rank of the current process
"""
if world_size is None:
world_size = get_world_size()
if rank is None:
rank = get_rank()
self.world_size = world_size
global_batch_size = batch_size * world_size
data_len = len(dataset)
num_samples = (data_len + global_batch_size - 1) \
// global_batch_size * global_batch_size
self.num_samples = num_samples
indices = list(range(data_len))
if pad:
# pad dataset to a multiple of global_batch_size samples, uses
# sample with idx 0 as pad
indices += [0] * (num_samples - len(indices))
else:
# temporary pad to a multiple of global batch size, pads with "-1"
# which is later removed from the list of indices
indices += [-1] * (num_samples - len(indices))
indices = torch.tensor(indices)
indices = indices.view(-1, batch_size)
indices = indices[rank::world_size].contiguous()
indices = indices.view(-1)
# remove temporary pad
indices = indices[indices != -1]
indices = indices.tolist()
self.indices = indices
def __iter__(self):
return iter(self.indices)
def __len__(self):
return len(self.indices)
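A worked example of the padding behaviour, assuming a hypothetical dataset of 5 samples, batch_size=2 and world_size=2 (num_samples is rounded up to 8 and sample 0 is reused as padding):
import torch
data_len, batch_size, world_size = 5, 2, 2
global_bs = batch_size * world_size
num_samples = (data_len + global_bs - 1) // global_bs * global_bs   # 8
indices = list(range(data_len)) + [0] * (num_samples - data_len)    # pad with sample 0
indices = torch.tensor(indices).view(-1, batch_size)
print(indices[0::world_size].reshape(-1).tolist())                  # rank 0: [0, 1, 4, 0]
print(indices[1::world_size].reshape(-1).tolist())                  # rank 1: [2, 3, 0, 0]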

View file

@ -1,43 +1,76 @@
import logging
from collections import defaultdict
from functools import partial
import seq2seq.data.config as config
def default():
return config.UNK
class Tokenizer:
"""
Tokenizer class.
"""
def __init__(self, vocab_fname, separator='@@'):
def __init__(self, vocab_fname=None, pad=1, separator='@@'):
"""
Constructor for the Tokenizer class.
:param vocab_fname: path to the file with vocabulary
:param pad: pads vocabulary to a multiple of 'pad' tokens
:param separator: tokenization separator
"""
self.separator = separator
if vocab_fname:
self.separator = separator
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
logging.info(f'Building vocabulary from {vocab_fname}')
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
config.BOS_TOKEN, config.EOS_TOKEN]
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
with open(vocab_fname) as vfile:
for line in vfile:
vocab.append(line.strip())
logging.info(f'Size of vocabulary: {len(vocab)}')
self.vocab_size = len(vocab)
self.pad_vocabulary(vocab, pad)
self.vocab_size = len(vocab)
logging.info(f'Size of vocabulary: {self.vocab_size}')
self.tok2idx = defaultdict(default)
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.tok2idx = defaultdict(partial(int, config.UNK))
for idx, token in enumerate(vocab):
self.tok2idx[token] = idx
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
self.idx2tok = {}
for key, value in self.tok2idx.items():
self.idx2tok[value] = key
def pad_vocabulary(self, vocab, pad):
"""
Pads vocabulary to a multiple of 'pad' tokens.
:param vocab: list with vocabulary
:param pad: integer
"""
vocab_size = len(vocab)
padded_vocab_size = (vocab_size + pad - 1) // pad * pad
for i in range(0, padded_vocab_size - vocab_size):
token = f'madeupword{i:04d}'
vocab.append(token)
assert len(vocab) % pad == 0
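For example, a hypothetical vocabulary of 31,794 BPE tokens padded to a multiple of 8 gains six 'madeupwordNNNN' filler entries:
vocab_size, pad = 31794, 8
padded_vocab_size = (vocab_size + pad - 1) // pad * pad
print(padded_vocab_size, padded_vocab_size - vocab_size)   # 31800 6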
def get_state(self):
logging.info(f'Saving state of the tokenizer')
state = {
'separator': self.separator,
'vocab_size': self.vocab_size,
'tok2idx': self.tok2idx,
'idx2tok': self.idx2tok,
}
return state
def set_state(self, state):
logging.info(f'Restoring state of the tokenizer')
self.separator = state['separator']
self.vocab_size = state['vocab_size']
self.tok2idx = state['tok2idx']
self.idx2tok = state['idx2tok']
def segment(self, line):
"""
@ -62,6 +95,11 @@ class Tokenizer:
returns: string representing detokenized sentence
"""
detok = delim.join([self.idx2tok[idx] for idx in inputs])
detok = detok.replace(
self.separator+ ' ', '').replace(self.separator, '')
detok = detok.replace(self.separator + ' ', '')
detok = detok.replace(self.separator, '')
detok = detok.replace(config.BOS_TOKEN, '')
detok = detok.replace(config.EOS_TOKEN, '')
detok = detok.replace(config.PAD_TOKEN, '')
detok = detok.strip()
return detok
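A toy illustration of the subword merging done above, assuming the default '@@' separator and '<s>'/'</s>'-style BOS/EOS markers (the concrete marker strings live in seq2seq.data.config and are only assumed here):
detok = ' '.join(['<s>', 'new@@', 'comer', '</s>'])
detok = detok.replace('@@ ', '').replace('@@', '')
detok = detok.replace('<s>', '').replace('</s>', '').strip()
print(detok)   # 'newcomer'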

View file

@ -81,7 +81,8 @@ class SequenceGenerator:
counter += 1
words = words.view(word_view)
words, logprobs, attn, context = self.model.generate(words, context, 1)
output = self.model.generate(words, context, 1)
words, logprobs, attn, context = output
words = words.view(-1)
translation[active, idx] = words
@ -123,13 +124,15 @@ class SequenceGenerator:
max_seq_len = self.max_seq_len
cov_penalty_factor = self.cov_penalty_factor
translation = torch.zeros(batch_size * beam_size, max_seq_len, dtype=torch.int64)
translation = torch.zeros(batch_size * beam_size, max_seq_len,
dtype=torch.int64)
lengths = torch.ones(batch_size * beam_size, dtype=torch.int64)
scores = torch.zeros(batch_size * beam_size, dtype=torch.float32)
active = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
base_mask = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
global_offset = torch.arange(0, batch_size * beam_size, beam_size, dtype=torch.int64)
global_offset = torch.arange(0, batch_size * beam_size, beam_size,
dtype=torch.int64)
eos_beam_fill = torch.tensor([0] + (beam_size - 1) * [float('-inf')])
@ -161,21 +164,23 @@ class SequenceGenerator:
_, seq, feature = context[0].shape
context[0].unsqueeze_(1)
context[0] = context[0].expand(-1, beam_size, -1, -1)
context[0] = context[0].contiguous().view(batch_size * beam_size, seq, feature)
context[0] = context[0].contiguous().view(batch_size * beam_size,
seq, feature)
# context[0]: (batch * beam, seq, feature)
else:
# context[0] (encoder state): (seq, batch, feature)
seq, _, feature = context[0].shape
context[0].unsqueeze_(2)
context[0] = context[0].expand(-1, -1, beam_size, -1)
context[0] = context[0].contiguous().view(seq, batch_size * beam_size, feature)
context[0] = context[0].contiguous().view(seq, batch_size *
beam_size, feature)
# context[0]: (seq, batch * beam, feature)
#context[1] (encoder seq length): (batch)
# context[1] (encoder seq length): (batch)
context[1].unsqueeze_(1)
context[1] = context[1].expand(-1, beam_size)
context[1] = context[1].contiguous().view(batch_size * beam_size)
#context[1]: (batch * beam)
# context[1]: (batch * beam)
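A toy shape check of the beam expansion applied to the encoder state above, with made-up sizes seq=7, batch=2, beam=5, feature=16:
import torch
enc = torch.zeros(7, 2, 16)                  # (seq, batch, feature)
enc = enc.unsqueeze(2).expand(-1, -1, 5, -1)
enc = enc.contiguous().view(7, 2 * 5, 16)    # (seq, batch * beam, feature)
print(enc.shape)                             # torch.Size([7, 10, 16])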
accu_attn_scores = torch.zeros(batch_size * beam_size, seq)
if self.cuda:
@ -194,7 +199,8 @@ class SequenceGenerator:
lengths[active[~eos_mask.view(-1)]] += 1
words, logprobs, attn, context = self.model.generate(words, context, beam_size)
output = self.model.generate(words, context, beam_size)
words, logprobs, attn, context = output
attn = attn.float().squeeze(attn_query_dim)
attn = attn.masked_fill(eos_mask.view(-1).unsqueeze(1), 0)

View file

@ -1,7 +1,8 @@
import contextlib
import logging
import os
import subprocess
import time
import os
import torch
import torch.distributed as dist
@ -9,7 +10,18 @@ import torch.distributed as dist
import seq2seq.data.config as config
from seq2seq.inference.beam_search import SequenceGenerator
from seq2seq.utils import AverageMeter
from seq2seq.utils import get_rank, get_world_size
from seq2seq.utils import barrier
from seq2seq.utils import get_rank
from seq2seq.utils import get_world_size
def gather_predictions(preds):
world_size = get_world_size()
if world_size > 1:
all_preds = [preds.new(preds.size(0), preds.size(1)) for i in range(world_size)]
dist.all_gather(all_preds, preds)
preds = torch.cat(all_preds)
return preds
class Translator:
@ -96,17 +108,25 @@ class Translator:
eval_path = self.build_eval_path(epoch, iteration)
detok_eval_path = eval_path + '.detok'
rank = get_rank()
if rank == 0:
logging.info(f'Running evaluation on test set')
self.model.eval()
torch.cuda.empty_cache()
with contextlib.suppress(FileNotFoundError):
os.remove(eval_path)
os.remove(detok_eval_path)
self.evaluate(epoch, iteration, eval_path, summary)
rank = get_rank()
logging.info(f'Running evaluation on test set')
self.model.eval()
torch.cuda.empty_cache()
output = self.evaluate(epoch, iteration, summary)
output = output[:len(self.loader.dataset)]
output = self.loader.dataset.unsort(output)
if rank == 0:
with open(eval_path, 'a') as eval_file:
eval_file.writelines(output)
if calc_bleu:
self.run_detokenizer(eval_path)
test_bleu[0] = self.run_sacrebleu(detok_eval_path,
reference_path)
test_bleu[0] = self.run_sacrebleu(detok_eval_path, reference_path)
if summary:
logging.info(f'BLEU on test dataset: {test_bleu[0]:.2f}')
@ -114,8 +134,9 @@ class Translator:
logging.info(f'Target accuracy reached')
break_training[0] = 1
torch.cuda.empty_cache()
logging.info(f'Finished evaluation on test set')
barrier()
torch.cuda.empty_cache()
logging.info(f'Finished evaluation on test set')
if self.distributed:
dist.broadcast(break_training, 0)
@ -123,35 +144,29 @@ class Translator:
return test_bleu[0].item(), break_training[0].item()
def evaluate(self, epoch, iteration, eval_path, summary):
def evaluate(self, epoch, iteration, summary):
"""
Runs evaluation on test dataset.
:param epoch: index of the current epoch
:param iteration: index of the current iteration
:param eval_path: path to the file for saving results
:param summary: if True prints summary
"""
eval_file = open(eval_path, 'w')
batch_time = AverageMeter(False)
tot_tok_per_sec = AverageMeter(False)
iterations = AverageMeter(False)
enc_seq_len = AverageMeter(False)
dec_seq_len = AverageMeter(False)
total_iters = 0
total_lines = 0
stats = {}
output = []
for i, (src, indices) in enumerate(self.loader):
translate_timer = time.time()
src, src_length = src
if self.batch_first:
batch_size = src.size(0)
else:
batch_size = src.size(1)
total_lines += batch_size
batch_size = self.loader.batch_size
global_batch_size = batch_size * get_world_size()
beam_size = self.beam_size
bos = [self.insert_target_start] * (batch_size * beam_size)
@ -179,20 +194,18 @@ class Translator:
generator = self.generator.beam_search
preds, lengths, counter = generator(batch_size, bos, context)
preds = preds.cpu()
lengths = lengths.cpu()
stats['total_dec_len'] = int(lengths.sum())
stats['total_dec_len'] = lengths.sum().item()
stats['iters'] = counter
total_iters += stats['iters']
output = []
for idx, pred in enumerate(preds):
end = lengths[idx] - 1
pred = pred[1:end].tolist()
out = self.tokenizer.detokenize(pred)
output.append(out)
indices = torch.tensor(indices).to(preds)
preds = preds.scatter(0, indices.unsqueeze(1).expand_as(preds), preds)
output = [output[indices.index(i)] for i in range(len(output))]
preds = gather_predictions(preds).cpu()
for pred in preds:
pred = pred.tolist()
detok = self.tokenizer.detokenize(pred)
output.append(detok + '\n')
elapsed = time.time() - translate_timer
batch_time.update(elapsed, batch_size)
@ -219,25 +232,28 @@ class Translator:
log = ''.join(log)
logging.info(log)
for line in output:
eval_file.write(line)
eval_file.write('\n')
tot_tok_per_sec.reduce('sum')
enc_seq_len.reduce('mean')
dec_seq_len.reduce('mean')
batch_time.reduce('mean')
iterations.reduce('sum')
eval_file.close()
if summary:
time_per_sentence = (batch_time.avg / self.loader.batch_size)
if summary and get_rank() == 0:
time_per_sentence = (batch_time.avg / global_batch_size)
log = []
log += f'TEST SUMMARY:\n'
log += f'Lines translated: {total_lines}\t'
log += f'Lines translated: {len(self.loader.dataset)}\t'
log += f'Avg total tokens/s: {tot_tok_per_sec.avg:.0f}\n'
log += f'Avg time per batch: {batch_time.avg:.3f} s\t'
log += f'Avg time per sentence: {1000*time_per_sentence:.3f} ms\n'
log += f'Avg encoder seq len: {enc_seq_len.avg:.2f}\t'
log += f'Avg decoder seq len: {dec_seq_len.avg:.2f}\t'
log += f'Total decoder iterations: {total_iters}'
log += f'Total decoder iterations: {int(iterations.sum)}'
log = ''.join(log)
logging.info(log)
return output
def run_detokenizer(self, eval_path):
"""
Executes moses detokenizer on eval_path file and saves result to

View file

@ -12,7 +12,7 @@ class BahdanauAttention(nn.Module):
Implementation is very similar to tf.contrib.seq2seq.BahdanauAttention
"""
def __init__(self, query_size, key_size, num_units, normalize=False,
dropout=0, batch_first=False):
batch_first=False, init_weight=0.1):
"""
Constructor for the BahdanauAttention.
@ -20,9 +20,10 @@ class BahdanauAttention(nn.Module):
:param key_size: feature dimension for keys
:param num_units: internal feature dimension
:param normalize: whether to normalize energy term
:param dropout: probability of the dropout (between softmax and bmm)
:param batch_first: if True batch size is the 1st dimension, if False
the sequence is first and batch size is second
:param init_weight: range for uniform initializer used to initialize
Linear key and query transform layers and linear_att vector
"""
super(BahdanauAttention, self).__init__()
@ -32,10 +33,11 @@ class BahdanauAttention(nn.Module):
self.linear_q = nn.Linear(query_size, num_units, bias=False)
self.linear_k = nn.Linear(key_size, num_units, bias=False)
nn.init.uniform_(self.linear_q.weight.data, -init_weight, init_weight)
nn.init.uniform_(self.linear_k.weight.data, -init_weight, init_weight)
self.linear_att = Parameter(torch.Tensor(num_units))
self.dropout = nn.Dropout(dropout)
self.mask = None
if self.normalize:
@ -45,14 +47,14 @@ class BahdanauAttention(nn.Module):
self.register_parameter('normalize_scalar', None)
self.register_parameter('normalize_bias', None)
self.reset_parameters()
self.reset_parameters(init_weight)
def reset_parameters(self):
def reset_parameters(self, init_weight):
"""
Sets initial random values for trainable parameters.
"""
stdv = 1. / math.sqrt(self.num_units)
self.linear_att.data.uniform_(-stdv, stdv)
self.linear_att.data.uniform_(-init_weight, init_weight)
if self.normalize:
self.normalize_scalar.data.fill_(stdv)
@ -74,7 +76,8 @@ class BahdanauAttention(nn.Module):
else:
max_len = context.size(0)
indices = torch.arange(0, max_len, dtype=torch.int64, device=context.device)
indices = torch.arange(0, max_len, dtype=torch.int64,
device=context.device)
self.mask = indices >= (context_len.unsqueeze(1))
def calc_score(self, att_query, att_keys):
@ -96,16 +99,12 @@ class BahdanauAttention(nn.Module):
if self.normalize:
sum_qk = sum_qk + self.normalize_bias
tmp = self.linear_att.to(torch.float32)
linear_att = tmp / tmp.norm()
linear_att = linear_att.to(self.normalize_scalar)
linear_att = self.linear_att / self.linear_att.norm()
linear_att = linear_att * self.normalize_scalar
else:
linear_att = self.linear_att
out = F.tanh(sum_qk).matmul(linear_att)
out = torch.tanh(sum_qk).matmul(linear_att)
return out
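A shape-only sketch of the normalized score computed above, reducing query/key sums to attention scores of shape (batch, t_q, t_k); all sizes and tensors below are toy stand-ins, not the model's parameters:
import torch
b, t_q, t_k, n = 2, 3, 4, 8
sum_qk = torch.randn(b, t_q, 1, n) + torch.randn(b, 1, t_k, n)
v = torch.randn(n)                            # stands in for linear_att
scalar = torch.tensor(1.0 / n ** 0.5)         # stands in for normalize_scalar
scores = torch.tanh(sum_qk).matmul(v / v.norm() * scalar)
print(scores.shape)                           # torch.Size([2, 3, 4])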
def forward(self, query, keys):
@ -152,7 +151,6 @@ class BahdanauAttention(nn.Module):
# Calculate the weighted average of the attention inputs according to
# the scores
scores_normalized = self.dropout(scores_normalized)
# context: (b x t_q x n)
context = torch.bmm(scores_normalized, keys)

View file

@ -3,16 +3,18 @@ import itertools
import torch
import torch.nn as nn
from seq2seq.models.attention import BahdanauAttention
import seq2seq.data.config as config
from seq2seq.models.attention import BahdanauAttention
from seq2seq.utils import init_lstm_
class RecurrentAttention(nn.Module):
"""
LSTM with an attention module.
LSTM wrapped with an attention module.
"""
def __init__(self, input_size, context_size, hidden_size, num_layers=1,
bias=True, batch_first=False, dropout=0):
def __init__(self, input_size=1024, context_size=1024, hidden_size=1024,
num_layers=1, batch_first=False, dropout=0.2,
init_weight=0.1):
"""
Constructor for the RecurrentAttention.
@ -20,16 +22,17 @@ class RecurrentAttention(nn.Module):
:param context_size: number of features in output from encoder
:param hidden_size: internal hidden size
:param num_layers: number of layers in LSTM
:param bias: enables bias in LSTM layers
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param dropout: probability of dropout
:param dropout: probability of dropout (on input to LSTM layer)
:param init_weight: range for the uniform initializer
"""
super(RecurrentAttention, self).__init__()
self.rnn = nn.LSTM(input_size, hidden_size, num_layers, bias,
batch_first)
self.rnn = nn.LSTM(input_size, hidden_size, num_layers, bias=True,
batch_first=batch_first)
init_lstm_(self.rnn, init_weight)
self.attn = BahdanauAttention(hidden_size, context_size, context_size,
normalize=True, batch_first=batch_first)
@ -52,9 +55,9 @@ class RecurrentAttention(nn.Module):
# softmax
self.attn.set_mask(context_len, context)
inputs = self.dropout(inputs)
rnn_outputs, hidden = self.rnn(inputs, hidden)
attn_outputs, scores = self.attn(rnn_outputs, context)
rnn_outputs = self.dropout(rnn_outputs)
return rnn_outputs, hidden, attn_outputs, scores
@ -63,23 +66,18 @@ class Classifier(nn.Module):
"""
Fully-connected classifier
"""
def __init__(self, in_features, out_features, math='fp32'):
def __init__(self, in_features, out_features, init_weight=0.1):
"""
Constructor for the Classifier.
:param in_features: number of input features
:param out_features: number of output features (size of vocabulary)
:param math: arithmetic type, 'fp32' or 'fp16'
:param init_weight: range for the uniform initializer
"""
super(Classifier, self).__init__()
self.out_features = out_features
# padding required to trigger HMMA kernels
if math == 'fp16':
out_features = (out_features + 7) // 8 * 8
self.classifier = nn.Linear(in_features, out_features)
nn.init.uniform_(self.classifier.weight.data, -init_weight, init_weight)
nn.init.uniform_(self.classifier.bias.data, -init_weight, init_weight)
def forward(self, x):
"""
@ -88,7 +86,6 @@ class Classifier(nn.Module):
:param x: output from decoder
"""
out = self.classifier(x)
out = out[..., :self.out_features]
return out
@ -102,22 +99,24 @@ class ResidualRecurrentDecoder(nn.Module):
LSTM layer of the decoder goes into the attention module, then the
re-weighted context is concatenated with inputs to all subsequent LSTM
layers in the decoder at the current timestep.
Residual connections are enabled after 3rd LSTM layer, dropout is applied
on inputs to LSTM layers.
"""
def __init__(self, vocab_size, hidden_size=128, num_layers=8, bias=True,
dropout=0, batch_first=False, math='fp32', embedder=None):
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
batch_first=False, embedder=None, init_weight=0.1):
"""
Constructor of the ResidualRecurrentDecoder.
:param vocab_size: size of vocabulary
:param hidden_size: hidden size for LSTM layers
:param num_layers: number of LSTM layers
:param bias: enables bias in LSTM layers
:param dropout: probability of dropout (between LSTM layers)
:param dropout: probability of dropout (on input to LSTM layers)
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param math: arithmetic type, 'fp32' or 'fp16'
:param embedder: embedding module, if None constructor will create new
embedding layer
:param embedder: instance of nn.Embedding, if None constructor will
create new embedding layer
:param init_weight: range for the uniform initializer
"""
super(ResidualRecurrentDecoder, self).__init__()
@ -125,21 +124,26 @@ class ResidualRecurrentDecoder(nn.Module):
self.att_rnn = RecurrentAttention(hidden_size, hidden_size,
hidden_size, num_layers=1,
batch_first=batch_first)
batch_first=batch_first,
dropout=dropout)
self.rnn_layers = nn.ModuleList()
for _ in range(num_layers - 1):
self.rnn_layers.append(
nn.LSTM(2 * hidden_size, hidden_size, num_layers=1, bias=bias,
nn.LSTM(2 * hidden_size, hidden_size, num_layers=1, bias=True,
batch_first=batch_first))
for lstm in self.rnn_layers:
init_lstm_(lstm, init_weight)
if embedder is not None:
self.embedder = embedder
else:
self.embedder = nn.Embedding(vocab_size, hidden_size,
padding_idx=config.PAD)
nn.init.uniform_(embedder.weight.data, -init_weight, init_weight)
self.classifier = Classifier(hidden_size, vocab_size, math)
self.classifier = Classifier(hidden_size, vocab_size)
self.dropout = nn.Dropout(p=dropout)
def init_hidden(self, hidden):
@ -199,15 +203,15 @@ class ResidualRecurrentDecoder(nn.Module):
x, h, attn, scores = self.att_rnn(x, hidden[0], enc_context, enc_len)
self.append_hidden(h)
x = self.dropout(x)
x = torch.cat((x, attn), dim=2)
x = self.dropout(x)
x, h = self.rnn_layers[0](x, hidden[1])
self.append_hidden(h)
for i in range(1, len(self.rnn_layers)):
residual = x
x = self.dropout(x)
x = torch.cat((x, attn), dim=2)
x = self.dropout(x)
x, h = self.rnn_layers[i](x, hidden[i + 1])
self.append_hidden(h)
x = x + residual

View file

@ -3,6 +3,7 @@ from torch.nn.utils.rnn import pack_padded_sequence
from torch.nn.utils.rnn import pad_packed_sequence
import seq2seq.data.config as config
from seq2seq.utils import init_lstm_
class ResidualRecurrentEncoder(nn.Module):
@ -10,42 +11,48 @@ class ResidualRecurrentEncoder(nn.Module):
Encoder with Embedding, LSTM layers, residual connections and optional
dropout.
The first LSTM layer is bidirectional and uses variable sequence length API,
the remaining (num_layers-1) layers are unidirectional. Residual
connections are enabled after third LSTM layer, dropout is applied between
LSTM layers.
The first LSTM layer is bidirectional and uses variable sequence length
API, the remaining (num_layers-1) layers are unidirectional. Residual
connections are enabled after third LSTM layer, dropout is applied on
inputs to LSTM layers.
"""
def __init__(self, vocab_size, hidden_size=128, num_layers=8, bias=True,
dropout=0, batch_first=False, embedder=None):
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
batch_first=False, embedder=None, init_weight=0.1):
"""
Constructor for the ResidualRecurrentEncoder.
:param vocab_size: size of vocabulary
:param hidden_size: hidden size for LSTM layers
:param num_layers: number of LSTM layers, 1st layer is bidirectional
:param bias: enables bias in LSTM layers
:param dropout: probability of dropout (between LSTM layers)
:param dropout: probability of dropout (on input to LSTM layers)
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param embedder: embedding module, if None constructor will create new
embedding layer
:param embedder: instance of nn.Embedding, if None constructor will
create new embedding layer
:param init_weight: range for the uniform initializer
"""
super(ResidualRecurrentEncoder, self).__init__()
self.batch_first = batch_first
self.rnn_layers = nn.ModuleList()
# 1st LSTM layer, bidirectional
self.rnn_layers.append(
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=bias,
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=True,
batch_first=batch_first, bidirectional=True))
# 2nd LSTM layer, with 2x larger input_size
self.rnn_layers.append(
nn.LSTM((2 * hidden_size), hidden_size, num_layers=1, bias=bias,
nn.LSTM((2 * hidden_size), hidden_size, num_layers=1, bias=True,
batch_first=batch_first))
# Remaining LSTM layers
for _ in range(num_layers - 2):
self.rnn_layers.append(
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=bias,
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=True,
batch_first=batch_first))
for lstm in self.rnn_layers:
init_lstm_(lstm, init_weight)
self.dropout = nn.Dropout(p=dropout)
if embedder is not None:
@ -53,6 +60,7 @@ class ResidualRecurrentEncoder(nn.Module):
else:
self.embedder = nn.Embedding(vocab_size, hidden_size,
padding_idx=config.PAD)
nn.init.uniform_(embedder.weight.data, -init_weight, init_weight)
def forward(self, inputs, lengths):
"""
@ -66,6 +74,7 @@ class ResidualRecurrentEncoder(nn.Module):
x = self.embedder(inputs)
# bidirectional layer
x = self.dropout(x)
x = pack_padded_sequence(x, lengths.cpu().numpy(),
batch_first=self.batch_first)
x, _ = self.rnn_layers[0](x)

View file

@ -1,18 +1,17 @@
import torch.nn as nn
import seq2seq.data.config as config
from seq2seq.models.seq2seq_base import Seq2Seq
from seq2seq.models.encoder import ResidualRecurrentEncoder
from seq2seq.models.decoder import ResidualRecurrentDecoder
from seq2seq.models.encoder import ResidualRecurrentEncoder
from seq2seq.models.seq2seq_base import Seq2Seq
class GNMT(Seq2Seq):
"""
GNMT v2 model
"""
def __init__(self, vocab_size, hidden_size=512, num_layers=8, bias=True,
dropout=0.2, batch_first=False, math='fp32',
share_embedding=False):
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
batch_first=False, share_embedding=True):
"""
Constructor for the GNMT v2 model.
@ -20,11 +19,9 @@ class GNMT(Seq2Seq):
:param hidden_size: internal hidden size of the model
:param num_layers: number of layers, applies to both encoder and
decoder
:param bias: globally enables or disables bias in encoder and decoder
:param dropout: probability of dropout (in encoder and decoder)
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param math: arithmetic type, 'fp32' or 'fp16'
:param share_embedding: if True embeddings are shared between encoder
and decoder
"""
@ -32,17 +29,19 @@ class GNMT(Seq2Seq):
super(GNMT, self).__init__(batch_first=batch_first)
if share_embedding:
embedder = nn.Embedding(vocab_size, hidden_size, padding_idx=config.PAD)
embedder = nn.Embedding(vocab_size, hidden_size,
padding_idx=config.PAD)
nn.init.uniform_(embedder.weight.data, -0.1, 0.1)
else:
embedder = None
self.encoder = ResidualRecurrentEncoder(vocab_size, hidden_size,
num_layers, bias, dropout,
num_layers, dropout,
batch_first, embedder)
self.decoder = ResidualRecurrentDecoder(vocab_size, hidden_size,
num_layers, bias, dropout,
batch_first, math, embedder)
num_layers, dropout,
batch_first, embedder)
def forward(self, input_encoder, input_enc_len, input_decoder):
context = self.encode(input_encoder, input_enc_len)

View file

@ -1,222 +0,0 @@
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
import torch.distributed as dist
from torch.nn.modules import Module
from torch.autograd import Variable
from collections import OrderedDict
def flat_dist_call(tensors, call, extra_args=None):
flat_dist_call.warn_on_half = True
buckets = OrderedDict()
for tensor in tensors:
tp = tensor.type()
if tp not in buckets:
buckets[tp] = []
buckets[tp].append(tensor)
if flat_dist_call.warn_on_half:
if torch.cuda.HalfTensor in buckets:
print("WARNING: gloo dist backend for half parameters may be extremely slow." +
" It is recommended to use the NCCL backend in this case.")
flat_dist_call.warn_on_half = False
for tp in buckets:
bucket = buckets[tp]
coalesced = _flatten_dense_tensors(bucket)
if extra_args is not None:
call(coalesced, *extra_args)
else:
call(coalesced)
if call is dist.all_reduce:
coalesced /= dist.get_world_size()
for buf, synced in zip(bucket, _unflatten_dense_tensors(coalesced, bucket)):
buf.copy_(synced)
class DistributedDataParallel(Module):
"""
:class:`apex.parallel.DistributedDataParallel` is a module wrapper that enables
easy multiprocess distributed data parallel training, similar to ``torch.nn.parallel.DistributedDataParallel``.
:class:`DistributedDataParallel` is designed to work with
the launch utility script ``apex.parallel.multiproc.py``.
When used with ``multiproc.py``, :class:`DistributedDataParallel`
assigns 1 process to each of the available (visible) GPUs on the node.
Parameters are broadcast across participating processes on initialization, and gradients are
allreduced and averaged over processes during ``backward()``.
:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
transfers to reduce the total number of transfers required.
:class:`DistributedDataParallel` assumes that your script accepts the command line
arguments "rank" and "world-size." It also assumes that your script calls
``torch.cuda.set_device(args.rank)`` before creating the model.
https://github.com/NVIDIA/apex/tree/master/examples/distributed shows detailed usage.
https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows another example
that combines :class:`DistributedDataParallel` with mixed precision training.
Args:
module: Network definition to be run in multi-gpu/distributed mode.
message_size (Default = 1e7): Minimum number of elements in a communication bucket.
shared_param (Default = False): If your model uses shared parameters this must be True. It will disable bucketing of parameters to avoid race conditions.
"""
def __init__(self, module, message_size=10000000, shared_param=False):
super(DistributedDataParallel, self).__init__()
self.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False
self.shared_param = shared_param
self.message_size = message_size
#reference to last iterations parameters to see if anything has changed
self.param_refs = []
self.reduction_stream = torch.cuda.Stream()
self.module = module
self.param_list = list(self.module.parameters())
if dist._backend == dist.dist_backend.NCCL:
for param in self.param_list:
assert param.is_cuda, "NCCL backend only supports model parameters to be on GPU."
self.record = []
self.create_hooks()
flat_dist_call([param.data for param in self.module.parameters()], dist.broadcast, (0,) )
def create_hooks(self):
#all reduce gradient hook
def allreduce_params():
if not self.needs_reduction:
return
self.needs_reduction = False
#parameter ordering refresh
if self.needs_refresh and not self.shared_param:
t_record = torch.cuda.IntTensor(self.record)
dist.broadcast(t_record, 0)
self.record = [int(entry) for entry in t_record]
self.needs_refresh = False
grads = [param.grad.data for param in self.module.parameters() if param.grad is not None]
flat_dist_call(grads, dist.all_reduce)
def flush_buckets():
if not self.needs_reduction:
return
self.needs_reduction = False
grads = []
for i in range(self.ready_end, len(self.param_state)):
param = self.param_refs[self.record[i]]
if param.grad is not None:
grads.append(param.grad.data)
grads = [param.grad.data for param in self.ready_params] + grads
if(len(grads)>0):
orig_stream = torch.cuda.current_stream()
with torch.cuda.stream(self.reduction_stream):
self.reduction_stream.wait_stream(orig_stream)
flat_dist_call(grads, dist.all_reduce)
torch.cuda.current_stream().wait_stream(self.reduction_stream)
for param_i, param in enumerate(list(self.module.parameters())):
def wrapper(param_i):
def allreduce_hook(*unused):
if self.needs_refresh:
self.record.append(param_i)
Variable._execution_engine.queue_callback(allreduce_params)
else:
Variable._execution_engine.queue_callback(flush_buckets)
self.comm_ready_buckets(self.record.index(param_i))
if param.requires_grad:
param.register_hook(allreduce_hook)
wrapper(param_i)
def comm_ready_buckets(self, param_ind):
if self.param_state[param_ind] != 0:
raise RuntimeError("Error: Your model uses shared parameters, DDP flag shared_params must be set to True in initialization.")
if self.param_state[self.ready_end] == 0:
self.param_state[param_ind] = 1
return
while self.ready_end < len(self.param_state) and self.param_state[self.ready_end] == 1:
self.ready_params.append(self.param_refs[self.record[self.ready_end]])
self.ready_numel += self.ready_params[-1].numel()
self.ready_end += 1
if self.ready_numel < self.message_size:
self.param_state[param_ind] = 1
return
grads = [param.grad.data for param in self.ready_params]
bucket = []
bucket_inds = []
while grads:
bucket.append(grads.pop(0))
cumm_size = 0
for ten in bucket:
cumm_size += ten.numel()
if cumm_size < self.message_size:
continue
evt = torch.cuda.Event()
evt.record(torch.cuda.current_stream())
evt.wait(stream=self.reduction_stream)
with torch.cuda.stream(self.reduction_stream):
flat_dist_call(bucket, dist.all_reduce)
for i in range(self.ready_start, self.ready_start+len(bucket)):
self.param_state[i] = 2
self.ready_params.pop(0)
self.param_state[param_ind] = 1
def forward(self, *inputs, **kwargs):
param_list = [param for param in list(self.module.parameters()) if param.requires_grad]
#Force needs_refresh to True if there are shared params
#this will force it to always, only call flush_buckets which is safe
#for shared parameters in the model.
#Parentheses are not necessary for correct order of operations, but make the intent clearer.
if (not self.param_refs) or self.shared_param:
self.needs_refresh = True
else:
self.needs_refresh = (
(len(param_list) != len(self.param_refs)) or any(
[param1 is not param2 for param1, param2 in zip(param_list, self.param_refs)]))
if self.needs_refresh:
self.record = []
self.param_state = [0 for i in range(len(param_list))]
self.param_refs = param_list
self.needs_reduction = True
self.ready_start = 0
self.ready_end = 0
self.ready_params = []
self.ready_numel = 0
return self.module(*inputs, **kwargs)

View file

@ -35,7 +35,21 @@ class Fp16Optimizer:
param.data.copy_(new_param.data)
def __init__(self, fp16_model, grad_clip=float('inf'), loss_scale=8192,
dls_downscale=2, dls_upscale=2, dls_upscale_interval=2048):
dls_downscale=2, dls_upscale=2, dls_upscale_interval=128):
"""
Constructor for the Fp16Optimizer.
:param fp16_model: model (previously casted to half)
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
:param loss_scale: initial loss scale
:param dls_downscale: loss downscale factor, loss scale is divided by
this factor when NaN/INF occurs in the gradients
:param dls_upscale: loss upscale factor, loss scale is multiplied by
this factor if previous dls_upscale_interval batches finished
successfully
:param dls_upscale_interval: interval for loss scale upscaling
"""
logging.info('Initializing fp16 optimizer')
self.initialize_model(fp16_model)
@ -61,7 +75,7 @@ class Fp16Optimizer:
for param in self.fp32_params:
param.requires_grad = True
def step(self, loss, optimizer, update=True):
def step(self, loss, optimizer, scheduler, update=True):
"""
Performs one step of the optimizer.
Applies loss scaling, computes gradients in fp16, converts gradients to
@ -76,21 +90,21 @@ class Fp16Optimizer:
:param update: if True executes weight update
"""
loss *= self.loss_scale
self.fp16_model.zero_grad()
loss.backward()
self.set_grads(self.fp32_params, self.fp16_model.parameters())
if self.loss_scale != 1.0:
for param in self.fp32_params:
param.grad.data /= self.loss_scale
norm = clip_grad_norm_(self.fp32_params, self.grad_clip)
if update:
self.set_grads(self.fp32_params, self.fp16_model.parameters())
if self.loss_scale != 1.0:
for param in self.fp32_params:
param.grad.data /= self.loss_scale
norm = clip_grad_norm_(self.fp32_params, self.grad_clip)
if math.isfinite(norm):
scheduler.step()
optimizer.step()
self.set_weights(self.fp16_model.parameters(), self.fp32_params)
self.set_weights(self.fp16_model.parameters(),
self.fp32_params)
self.since_last_invalid += 1
else:
self.loss_scale /= self.dls_downscale
@ -104,6 +118,8 @@ class Fp16Optimizer:
logging.info(f'Upscaling, new scale: {self.loss_scale}')
self.since_last_invalid = 0
self.fp16_model.zero_grad()
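For orientation, a minimal sketch of the dynamic loss scaling policy described in the constructor docstring (halve the scale after a non-finite gradient norm, grow it again after dls_upscale_interval consecutive good updates); this is a simplification, not the exact control flow of this file:
import math
def update_scale(scale, grad_norm, since_last_invalid,
                 downscale=2, upscale=2, upscale_interval=128):
    # skip the weight update and shrink the scale on overflow
    if not math.isfinite(grad_norm):
        return scale / downscale, 0
    since_last_invalid += 1
    # after enough clean updates, try a larger scale again
    if since_last_invalid >= upscale_interval:
        return scale * upscale, 0
    return scale, since_last_invalid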
class Fp32Optimizer:
"""
@ -114,7 +130,8 @@ class Fp32Optimizer:
Constructor for the Fp32Optimizer
:param model: model
:param grad_clip: max value of gradient norm
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
gradients
"""
logging.info('Initializing fp32 optimizer')
self.initialize_model(model)
@ -129,7 +146,7 @@ class Fp32Optimizer:
self.model = model
self.model.zero_grad()
def step(self, loss, optimizer, update=True):
def step(self, loss, optimizer, scheduler, update=True):
"""
Performs one step of the optimizer.
@ -138,8 +155,9 @@ class Fp32Optimizer:
:param update: if True executes weight update
"""
loss.backward()
if self.grad_clip != float('inf'):
clip_grad_norm_(self.model.parameters(), self.grad_clip)
if update:
if self.grad_clip != float('inf'):
clip_grad_norm_(self.model.parameters(), self.grad_clip)
scheduler.step()
optimizer.step()
self.model.zero_grad()
self.model.zero_grad()

View file

@ -0,0 +1,98 @@
import logging
import math
import torch
def perhaps_convert_float(param, total):
if isinstance(param, float):
param = int(param * total)
return param
class WarmupMultiStepLR(torch.optim.lr_scheduler._LRScheduler):
"""
Learning rate scheduler with exponential warmup and step decay.
"""
def __init__(self, optimizer, iterations, warmup_steps=0,
remain_steps=1.0, decay_interval=None, decay_steps=4,
decay_factor=0.5, last_epoch=-1):
"""
Constructor of WarmupMultiStepLR.
Parameters: warmup_steps, remain_steps and decay_interval accept both
integers and floats as an input. Integer input is interpreted as
absolute index of iteration, float input is interpreted as a fraction
of total training iterations (epochs * steps_per_epoch).
If decay_interval is None then the decay will happen at regularly spaced
intervals ('decay_steps' decays between iteration indices
'remain_steps' and 'iterations').
:param optimizer: instance of optimizer
:param iterations: total number of training iterations
:param warmup_steps: number of warmup iterations
:param remain_steps: start decay at 'remain_steps' iteration
:param decay_interval: interval between LR decay steps
:param decay_steps: max number of decay steps
:param decay_factor: decay factor
:param last_epoch: the index of last iteration
"""
# iterations before learning rate reaches base LR
self.warmup_steps = perhaps_convert_float(warmup_steps, iterations)
logging.info(f'Scheduler warmup steps: {self.warmup_steps}')
# iteration at which decay starts
self.remain_steps = perhaps_convert_float(remain_steps, iterations)
logging.info(f'Scheduler remain steps: {self.remain_steps}')
# number of steps between each decay
if decay_interval is None:
# decay at regularly spaced intervals
decay_iterations = iterations - self.remain_steps
self.decay_interval = decay_iterations // (decay_steps)
self.decay_interval = max(self.decay_interval, 1)
else:
self.decay_interval = perhaps_convert_float(decay_interval,
iterations)
logging.info(f'Scheduler decay interval: {self.decay_interval}')
# multiplicative decay factor
self.decay_factor = decay_factor
logging.info(f'Scheduler decay factor: {self.decay_factor}')
# max number of decay steps
self.decay_steps = decay_steps
logging.info(f'Scheduler max decay steps: {self.decay_steps}')
if self.warmup_steps > self.remain_steps:
logging.warning(f'warmup_steps should not be larger than '
f'remain_steps, setting warmup_steps=remain_steps')
self.warmup_steps = self.remain_steps
super(WarmupMultiStepLR, self).__init__(optimizer, last_epoch)
def get_lr(self):
if self.last_epoch <= self.warmup_steps:
# exponential lr warmup
if self.warmup_steps != 0:
warmup_factor = math.exp(math.log(0.01) / self.warmup_steps)
else:
warmup_factor = 1.0
inv_decay = warmup_factor ** (self.warmup_steps - self.last_epoch)
lr = [base_lr * inv_decay for base_lr in self.base_lrs]
elif self.last_epoch >= self.remain_steps:
# step decay
decay_iter = self.last_epoch - self.remain_steps
num_decay_steps = decay_iter // self.decay_interval + 1
num_decay_steps = min(num_decay_steps, self.decay_steps)
lr = [
base_lr * (self.decay_factor ** num_decay_steps)
for base_lr in self.base_lrs
]
else:
# base lr
lr = [base_lr for base_lr in self.base_lrs]
return lr
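A small usage sketch of the scheduler defined above on a throwaway model (100 total iterations, 10 warmup steps, decay starting at iteration 60); the import path is taken from trainer.py, everything else is illustrative:
import torch
from seq2seq.train.lr_scheduler import WarmupMultiStepLR
model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=1.0)
sched = WarmupMultiStepLR(opt, iterations=100, warmup_steps=10,
                          remain_steps=60, decay_steps=4, decay_factor=0.5)
lrs = []
for _ in range(100):
    lrs.append(opt.param_groups[0]['lr'])
    opt.step()
    sched.step()
# lrs ramps exponentially from 0.01 * base_lr to base_lr over the first 10
# steps, holds base_lr until iteration 60, then halves every 10 iterations
# (at most decay_steps=4 times)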

View file

@ -1,6 +1,7 @@
import torch
import torch.nn as nn
class LabelSmoothing(nn.Module):
"""
NLL loss with label smoothing.
@ -18,7 +19,8 @@ class LabelSmoothing(nn.Module):
self.smoothing = smoothing
def forward(self, x, target):
logprobs = torch.nn.functional.log_softmax(x, dim=-1)
logprobs = torch.nn.functional.log_softmax(x, dim=-1,
dtype=torch.float32)
non_pad_mask = (target != self.padding_idx)
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
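The rest of this forward pass is cut off by the hunk boundary; for orientation only, one common formulation of label-smoothed NLL over non-padding tokens looks like the sketch below (it is not guaranteed to match this file line for line):
import torch
import torch.nn.functional as F
def smoothed_nll(logits, target, padding_idx=0, smoothing=0.1):
    logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
    nll = -logprobs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)
    smooth = -logprobs.mean(dim=-1)                  # uniform part of the target
    loss = (1.0 - smoothing) * nll + smoothing * smooth
    return loss[target != padding_idx].sum()         # ignore padding positions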

View file

@ -1,15 +1,17 @@
import logging
import time
import os
import time
from itertools import cycle
import numpy as np
import torch
import torch.optim
import torch.utils.data
import numpy as np
from apex.parallel import DistributedDataParallel as DDP
from seq2seq.train.distributed import DistributedDataParallel as DDP
from seq2seq.train.fp_optimizers import Fp16Optimizer, Fp32Optimizer
from seq2seq.train.fp_optimizers import Fp16Optimizer
from seq2seq.train.fp_optimizers import Fp32Optimizer
from seq2seq.train.lr_scheduler import WarmupMultiStepLR
from seq2seq.utils import AverageMeter
from seq2seq.utils import sync_workers
@ -18,21 +20,54 @@ class Seq2SeqTrainer:
"""
Seq2SeqTrainer
"""
def __init__(self, model, criterion, opt_config,
def __init__(self,
model,
criterion,
opt_config,
scheduler_config,
print_freq=10,
save_freq=1000,
grad_clip=float('inf'),
batch_first=False,
save_info={},
save_path='.',
train_iterations=0,
checkpoint_filename='checkpoint%s.pth',
keep_checkpoints=5,
math='fp32',
cuda=True,
distributed=False,
intra_epoch_eval=0,
iter_size=1,
translator=None,
verbose=False):
"""
Constructor for the Seq2SeqTrainer.
:param model: model to train
:param criterion: criterion (loss function)
:param opt_config: dictionary with options for the optimizer
:param scheduler_config: dictionary with options for the learning rate
scheduler
:param print_freq: prints short summary every 'print_freq' iterations
:param save_freq: saves checkpoint every 'save_freq' iterations
:param grad_clip: coefficient for gradient clipping
:param batch_first: if True the model uses (batch,seq,feature) tensors,
if false the model uses (seq, batch, feature)
:param save_info: dict with additional state stored in each checkpoint
:param save_path: path to the directory for checkpoints
:param train_iterations: total number of training iterations to execute
:param checkpoint_filename: name of files with checkpoints
:param keep_checkpoints: max number of checkpoints to keep
:param math: arithmetic type
:param cuda: if True use cuda, if False train on cpu
:param distributed: if True run distributed training
:param intra_epoch_eval: number of additional eval runs within each
training epoch
:param iter_size: number of iterations between weight updates
:param translator: instance of Translator, runs inference on test set
:param verbose: enables verbose logging
"""
super(Seq2SeqTrainer, self).__init__()
self.model = model
self.criterion = criterion
@ -52,16 +87,19 @@ class Seq2SeqTrainer:
self.loss = None
self.translator = translator
self.intra_epoch_eval = intra_epoch_eval
self.iter_size = iter_size
if cuda:
self.model = self.model.cuda()
self.criterion = self.criterion.cuda()
if math == 'fp16':
self.model = self.model.half()
if distributed:
self.model = DDP(self.model)
if math == 'fp16':
self.model = self.model.half()
self.fp_optimizer = Fp16Optimizer(self.model, grad_clip)
params = self.fp_optimizer.fp32_params
elif math == 'fp32':
@ -72,7 +110,18 @@ class Seq2SeqTrainer:
self.optimizer = torch.optim.__dict__[opt_name](params, **opt_config)
logging.info(f'Using optimizer: {self.optimizer}')
self.scheduler = WarmupMultiStepLR(self.optimizer, train_iterations,
**scheduler_config)
def iterate(self, src, tgt, update=True, training=True):
"""
Performs one iteration of the training/validation.
:param src: batch of examples from the source language
:param tgt: batch of examples from the target language
:param update: if True: optimizer does update of the weights
:param training: if True: executes optimizer
"""
src, src_length = src
tgt, tgt_length = tgt
src_length = torch.LongTensor(src_length)
@ -96,14 +145,15 @@ class Seq2SeqTrainer:
tgt_labels = tgt[1:]
T, B = output.size(0), output.size(1)
loss = self.criterion(output.view(T * B, -1).float(),
loss = self.criterion(output.view(T * B, -1),
tgt_labels.contiguous().view(-1))
loss_per_batch = loss.item()
loss /= B
loss /= (B * self.iter_size)
if training:
self.fp_optimizer.step(loss, self.optimizer, update)
self.fp_optimizer.step(loss, self.optimizer, self.scheduler,
update)
loss_per_token = loss_per_batch / num_toks['tgt']
loss_per_sentence = loss_per_batch / B
@ -120,13 +170,15 @@ class Seq2SeqTrainer:
if training:
assert self.optimizer is not None
eval_fractions = np.linspace(0, 1, self.intra_epoch_eval+2)[1:-1]
eval_iters = (eval_fractions * len(data_loader)).astype(int)
iters_with_update = len(data_loader) // self.iter_size
eval_iters = (eval_fractions * iters_with_update).astype(int)
eval_iters = eval_iters * self.iter_size
eval_iters = set(eval_iters)
batch_time = AverageMeter()
data_time = AverageMeter()
losses_per_token = AverageMeter()
losses_per_sentence = AverageMeter()
losses_per_token = AverageMeter(skip_first=False)
losses_per_sentence = AverageMeter(skip_first=False)
tot_tok_time = AverageMeter()
src_tok_time = AverageMeter()
@ -140,8 +192,12 @@ class Seq2SeqTrainer:
# measure data loading time
data_time.update(time.time() - end)
update = False
if i % self.iter_size == self.iter_size - 1:
update = True
# do a train/evaluate iteration
stats = self.iterate(src, tgt, training=training)
stats = self.iterate(src, tgt, update, training=training)
loss_per_token, loss_per_sentence, num_toks = stats
# measure accuracy and record loss
@ -176,13 +232,16 @@ class Seq2SeqTrainer:
log = []
log += [f'{phase} [{self.epoch}][{i}/{len(data_loader)}]']
log += [f'Time {batch_time.val:.3f} ({batch_time.avg:.3f})']
log += [f'Data {data_time.val:.3f} ({data_time.avg:.3f})']
log += [f'Data {data_time.val:.2e} ({data_time.avg:.2e})']
log += [f'Tok/s {tot_tok_time.val:.0f} ({tot_tok_time.avg:.0f})']
if self.verbose:
log += [f'Src tok/s {src_tok_time.val:.0f} ({src_tok_time.avg:.0f})']
log += [f'Tgt tok/s {tgt_tok_time.val:.0f} ({tgt_tok_time.avg:.0f})']
log += [f'Loss/sentence {losses_per_sentence.val:.1f} ({losses_per_sentence.avg:.1f})']
log += [f'Loss/tok {losses_per_token.val:.4f} ({losses_per_token.avg:.4f})']
if training:
lr = self.optimizer.param_groups[0]['lr']
log += [f'LR {lr:.3e}']
log = '\t'.join(log)
logging.info(log)
@ -198,9 +257,8 @@ class Seq2SeqTrainer:
end = time.time()
if training:
tot_tok_time.reduce('sum')
losses_per_token.reduce('mean')
tot_tok_time.reduce('sum')
losses_per_token.reduce('mean')
return losses_per_token.avg, tot_tok_time.avg
@ -228,6 +286,7 @@ class Seq2SeqTrainer:
src = src, src_length
tgt = tgt, tgt_length
self.iterate(src, tgt, update=False, training=training)
self.model.zero_grad()
def optimize(self, data_loader):
"""
@ -241,6 +300,7 @@ class Seq2SeqTrainer:
torch.cuda.empty_cache()
self.preallocate(data_loader, training=True)
output = self.feed_data(data_loader, training=True)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
@ -256,6 +316,7 @@ class Seq2SeqTrainer:
torch.cuda.empty_cache()
self.preallocate(data_loader, training=False)
output = self.feed_data(data_loader, training=False)
self.model.zero_grad()
torch.cuda.empty_cache()
return output
@ -267,9 +328,13 @@ class Seq2SeqTrainer:
"""
if os.path.isfile(filename):
checkpoint = torch.load(filename, map_location={'cuda:0': 'cpu'})
self.model.load_state_dict(checkpoint['state_dict'])
if self.distributed:
self.model.module.load_state_dict(checkpoint['state_dict'])
else:
self.model.load_state_dict(checkpoint['state_dict'])
self.fp_optimizer.initialize_model(self.model)
self.optimizer.load_state_dict(checkpoint['optimizer'])
self.scheduler.load_state_dict(checkpoint['scheduler'])
self.epoch = checkpoint['epoch']
self.loss = checkpoint['loss']
logging.info(f'Loaded checkpoint {filename} (epoch {self.epoch})')
@ -291,10 +356,16 @@ class Seq2SeqTrainer:
logging.info(f'Saving model to {filename}')
torch.save(state, filename)
if self.distributed:
model_state = self.model.module.state_dict()
else:
model_state = self.model.state_dict()
state = {
'epoch': self.epoch,
'state_dict': self.model.state_dict(),
'state_dict': model_state,
'optimizer': self.optimizer.state_dict(),
'scheduler': self.scheduler.state_dict(),
'loss': getattr(self, 'loss', None),
}
state = dict(list(state.items()) + list(self.save_info.items()))

View file

@ -1,9 +1,111 @@
from contextlib import contextmanager
import logging.config
import os
import random
import sys
import time
from contextlib import contextmanager
import numpy as np
import torch
import torch.distributed as dist
import torch.nn.init as init
import torch.utils.collect_env
def init_lstm_(lstm, init_weight=0.1):
"""
Initializes weights of LSTM layer.
Weights and biases are initialized with uniform(-init_weight, init_weight)
distribution.
:param lstm: instance of torch.nn.LSTM
:param init_weight: range for the uniform initializer
"""
# Initialize hidden-hidden weights
init.uniform_(lstm.weight_hh_l0.data, -init_weight, init_weight)
# Initialize input-hidden weights:
init.uniform_(lstm.weight_ih_l0.data, -init_weight, init_weight)
# Initialize bias. PyTorch LSTM has two biases, one for input-hidden GEMM
# and the other for hidden-hidden GEMM. Here input-hidden bias is
# initialized with uniform distribution and hidden-hidden bias is
# initialized with zeros.
init.uniform_(lstm.bias_ih_l0.data, -init_weight, init_weight)
init.zeros_(lstm.bias_hh_l0.data)
if lstm.bidirectional:
init.uniform_(lstm.weight_hh_l0_reverse.data, -init_weight, init_weight)
init.uniform_(lstm.weight_ih_l0_reverse.data, -init_weight, init_weight)
init.uniform_(lstm.bias_ih_l0_reverse.data, -init_weight, init_weight)
init.zeros_(lstm.bias_hh_l0_reverse.data)
def generate_seeds(rng, size):
"""
Generate list of random seeds
:param rng: random number generator
:param size: length of the returned list
"""
seeds = [rng.randint(0, 2**32 - 1) for _ in range(size)]
return seeds
def broadcast_seeds(seeds, device):
"""
Broadcasts random seeds to all distributed workers.
Returns list of random seeds (broadcasted from workers with rank 0).
:param seeds: list of seeds (integers)
:param device: torch.device
"""
if torch.distributed.is_available() and torch.distributed.is_initialized():
seeds_tensor = torch.LongTensor(seeds).to(device)
torch.distributed.broadcast(seeds_tensor, 0)
seeds = seeds_tensor.tolist()
return seeds
def setup_seeds(master_seed, epochs, device):
"""
Generates seeds from one master_seed.
Function returns (worker_seeds, shuffling_seeds), worker_seeds are later
used to initialize per-worker random number generators (mostly for
dropouts), shuffling_seeds are for RNGs responsible for reshuffling the
dataset before each epoch.
Seeds are generated on worker with rank 0 and broadcasted to all other
workers.
:param master_seed: master RNG seed used to initialize other generators
:param epochs: number of epochs
:param device: torch.device (used for distributed.broadcast)
"""
if master_seed is None:
# random master seed, random.SystemRandom() uses /dev/urandom on Unix
master_seed = random.SystemRandom().randint(0, 2**32 - 1)
if get_rank() == 0:
# master seed is reported only from rank=0 worker, it's to avoid
# confusion, seeds from rank=0 are later broadcasted to other
# workers
logging.info(f'Using random master seed: {master_seed}')
else:
# master seed was specified from command line
logging.info(f'Using master seed from command line: {master_seed}')
# initialize seeding RNG
seeding_rng = random.Random(master_seed)
# generate worker seeds, one seed for every distributed worker
worker_seeds = generate_seeds(seeding_rng, get_world_size())
# generate seeds for data shuffling, one seed for every epoch
shuffling_seeds = generate_seeds(seeding_rng, epochs)
# broadcast seeds from rank=0 to other workers
worker_seeds = broadcast_seeds(worker_seeds, device)
shuffling_seeds = broadcast_seeds(shuffling_seeds, device)
return worker_seeds, shuffling_seeds
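A single-process usage sketch of the seeding helpers above (with distributed not initialized, get_rank() is 0 and the broadcast is a no-op); the seed value is arbitrary:
import torch
device = torch.device('cpu')
worker_seeds, shuffling_seeds = setup_seeds(master_seed=2, epochs=8,
                                            device=device)
torch.manual_seed(worker_seeds[get_rank()])   # seed this worker's RNGs
# shuffling_seeds[epoch] is what later feeds the samplers' 'seeds' argument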
def barrier():
@ -12,7 +114,7 @@ def barrier():
doesn't implement barrier for NCCL backend.
Calls all_reduce on dummy tensor and synchronizes with GPU.
"""
if torch.distributed.is_initialized():
if torch.distributed.is_available() and torch.distributed.is_initialized():
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
torch.cuda.synchronize()
@ -21,7 +123,7 @@ def get_rank():
"""
Gets distributed rank or returns zero if distributed is not initialized.
"""
if torch.distributed.is_initialized():
if torch.distributed.is_available() and torch.distributed.is_initialized():
rank = torch.distributed.get_rank()
else:
rank = 0
@ -33,7 +135,7 @@ def get_world_size():
Gets total number of distributed workers or returns one if distributed is
not initialized.
"""
if torch.distributed.is_initialized():
if torch.distributed.is_available() and torch.distributed.is_initialized():
world_size = torch.distributed.get_world_size()
else:
world_size = 1
@ -50,7 +152,20 @@ def sync_workers():
barrier()
def setup_logging(log_file='log.log'):
@contextmanager
def timer(name, ndigits=2, sync_gpu=True):
if sync_gpu:
torch.cuda.synchronize()
start = time.time()
yield
if sync_gpu:
torch.cuda.synchronize()
stop = time.time()
elapsed = round(stop - start, ndigits)
logging.info(f'TIMER {name} {elapsed}')
def setup_logging(log_file=os.devnull):
"""
Configures logging.
By default logs from all workers are printed to the console, entries are
@ -69,12 +184,13 @@ def setup_logging(log_file='log.log'):
rank = get_rank()
rank_filter = RankFilter(rank)
logging_format = "%(asctime)s - %(levelname)s - %(rank)s - %(message)s"
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s - %(levelname)s - %(rank)s - %(message)s",
format=logging_format,
datefmt="%Y-%m-%d %H:%M:%S",
filename=log_file,
filemode='w')
console = logging.StreamHandler()
console = logging.StreamHandler(sys.stdout)
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(rank)s: %(message)s')
console.setFormatter(formatter)
@ -82,6 +198,73 @@ def setup_logging(log_file='log.log'):
logging.getLogger('').addFilter(rank_filter)
def set_device(cuda, local_rank):
"""
Sets device based on local_rank and returns instance of torch.device.
:param cuda: if True: use cuda
:param local_rank: local rank of the worker
"""
if cuda:
torch.cuda.set_device(local_rank)
device = torch.device('cuda')
else:
device = torch.device('cpu')
return device
def init_distributed(cuda):
"""
Initializes distributed backend.
:param cuda: (bool) if True initializes nccl backend, if False initializes
gloo backend
"""
world_size = int(os.environ.get('WORLD_SIZE', 1))
distributed = (world_size > 1)
if distributed:
backend = 'nccl' if cuda else 'gloo'
dist.init_process_group(backend=backend,
init_method='env://')
assert dist.is_initialized()
return distributed
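A minimal sketch of how set_device and init_distributed are meant to be combined (this mirrors the calls added to train.py and translate.py below); WORLD_SIZE, RANK, MASTER_ADDR and MASTER_PORT are assumed to be exported by the launcher, e.g. torch.distributed.launch, which also supplies --local_rank:

device = set_device(cuda=True, local_rank=args.local_rank)  # --local_rank comes from the launcher
distributed = init_distributed(cuda=True)                   # nccl backend if cuda, gloo otherwise
if distributed:
    barrier()   # wait until every worker reaches this point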
def log_env_info():
"""
Prints information about execution environment.
"""
logging.info('Collecting environment information...')
env_info = torch.utils.collect_env.get_pretty_env_info()
logging.info(f'{env_info}')
def pad_vocabulary(math):
if math == 'fp16':
pad_vocab = 8
elif math == 'fp32':
pad_vocab = 1
return pad_vocab
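pad_vocabulary only picks the padding multiple (8 for fp16, so embedding and classifier GEMM dimensions stay Tensor Core friendly, 1 for fp32); the actual rounding happens inside Tokenizer, which is not shown in this hunk. An illustrative round-up, assuming an example 31794-entry vocabulary:

pad = pad_vocabulary('fp16')                             # -> 8
vocab_size = 31794                                       # example value only
padded_vocab_size = (vocab_size + pad - 1) // pad * pad  # -> 31800
assert padded_vocab_size % pad == 0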
def benchmark(test_acc, target_acc, test_perf, target_perf):
def test(achieved, target, name):
passed = True
if target is not None and achieved is not None:
logging.info(f'{name} achieved: {achieved:.2f} '
f'target: {target:.2f}')
if achieved >= target:
logging.info(f'{name} test passed')
else:
logging.info(f'{name} test failed')
passed = False
return passed
passed = True
passed &= test(test_acc, target_acc, 'Accuracy')
passed &= test(test_perf, target_perf, 'Performance')
return passed
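A hedged example of the benchmark helper above, matching how main() in train.py (further down in this diff) turns the result into the process exit code; a test is skipped, and counted as passed, when either the target or the achieved value is None:

passed = benchmark(test_acc=24.1, target_acc=24.0,    # accuracy test passes
                   test_perf=None, target_perf=None)  # performance test skipped
if not passed:
    sys.exit(1)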
class AverageMeter:
"""
Computes and stores the average and current value
@ -117,18 +300,42 @@ class AverageMeter:
distributed = (get_world_size() > 1)
if distributed:
cuda = (dist._backend == dist.dist_backend.NCCL)
# Backward/forward compatibility around
# https://github.com/pytorch/pytorch/commit/540ef9b1fc5506369a48491af8a285a686689b36 and
# https://github.com/pytorch/pytorch/commit/044d00516ccd6572c0d6ab6d54587155b02a3b86
# To accommodate changes in PyTorch's distributed API
if hasattr(dist, "get_backend"):
_backend = dist.get_backend()
if hasattr(dist, "DistBackend"):
backend_enum_holder = dist.DistBackend
else:
backend_enum_holder = dist.Backend
else:
_backend = dist._backend
backend_enum_holder = dist.dist_backend
cuda = _backend == backend_enum_holder.NCCL
if cuda:
avg = torch.cuda.FloatTensor([self.avg])
_sum = torch.cuda.FloatTensor([self.sum])
else:
avg = torch.FloatTensor([self.avg])
_sum = torch.FloatTensor([self.sum])
dist.all_reduce(avg, op=dist.reduce_op.SUM)
try:
_reduce_op = dist.ReduceOp
except AttributeError:
_reduce_op = dist.reduce_op
dist.all_reduce(avg, op=_reduce_op.SUM)
dist.all_reduce(_sum, op=_reduce_op.SUM)
self.avg = avg.item()
self.sum = _sum.item()
if op == 'mean':
self.avg /= get_world_size()
self.sum /= get_world_size()
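For context, a sketch of how AverageMeter is typically driven; the constructor and update() signatures are not visible in this hunk and are assumed to follow the usual update(value, n) pattern, while reduce('sum'/'mean') is the method patched above:

losses = AverageMeter()                  # assumed default constructor
for loss, batch_size in [(2.3, 128), (2.1, 128)]:
    losses.update(loss, batch_size)      # assumed signature
losses.reduce('mean')                    # all-reduce avg/sum across workers
logging.info(f'average loss: {losses.avg:.4f}')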
def debug_tensor(tensor, name):

417
train.py
@ -1,130 +1,187 @@
#!/usr/bin/env python
import argparse
import os
import logging
import os
import sys
from ast import literal_eval
import torch.nn as nn
import torch.nn.parallel
import torch.utils.data.distributed
import torch.distributed as dist
import torch.optim
import torch.utils.data.distributed
from seq2seq.models.gnmt import GNMT
from seq2seq.train.smoothing import LabelSmoothing
from seq2seq.data.dataset import TextDataset
from seq2seq.data.dataset import ParallelDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.utils import setup_logging
import seq2seq.data.config as config
import seq2seq.train.trainer as trainers
import seq2seq.utils as utils
from seq2seq.data.dataset import LazyParallelDataset
from seq2seq.data.dataset import ParallelDataset
from seq2seq.data.dataset import TextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.train.smoothing import LabelSmoothing
def parse_args():
"""
Parse commandline arguments.
"""
parser = argparse.ArgumentParser(description='GNMT training',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
def exclusive_group(group, name, default, help):
destname = name.replace('-', '_')
subgroup = group.add_mutually_exclusive_group(required=False)
subgroup.add_argument(f'--{name}', dest=f'{destname}',
action='store_true',
help=f'{help} (use \'--no-{name}\' to disable)')
subgroup.add_argument(f'--no-{name}', dest=f'{destname}',
action='store_false', help=argparse.SUPPRESS)
subgroup.set_defaults(**{destname: default})
parser = argparse.ArgumentParser(
description='GNMT training',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# dataset
dataset = parser.add_argument_group('dataset setup')
dataset.add_argument('--dataset-dir', default=None, required=True,
help='path to directory with training/validation data')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en',
help='path to the directory with training/test data')
dataset.add_argument('--max-size', default=None, type=int,
help='use at most MAX_SIZE elements from training \
dataset (useful for benchmarking), by default \
uses entire dataset')
dataset (useful for benchmarking), by default \
uses entire dataset')
# results
results = parser.add_argument_group('results setup')
results.add_argument('--results-dir', default='results',
help='path to directory with results, it it will be \
automatically created if does not exist')
results.add_argument('--save', default='gnmt_wmt16',
help='path to directory with results, it will be \
automatically created if it does not exist')
results.add_argument('--save', default='gnmt',
help='defines subdirectory within RESULTS_DIR for \
results from this training run')
results from this training run')
results.add_argument('--print-freq', default=10, type=int,
help='print log every PRINT_FREQ batches')
# model
model = parser.add_argument_group('model setup')
model.add_argument('--model-config',
default="{'hidden_size': 1024,'num_layers': 4, \
'dropout': 0.2, 'share_embedding': True}",
help='GNMT architecture configuration')
model.add_argument('--hidden-size', default=1024, type=int,
help='model hidden size')
model.add_argument('--num-layers', default=4, type=int,
help='number of RNN layers in encoder and in decoder')
model.add_argument('--dropout', default=0.2, type=float,
help='dropout applied to input of RNN cells')
exclusive_group(group=model, name='share-embedding', default=True,
help='use shared embeddings for encoder and decoder')
model.add_argument('--smoothing', default=0.1, type=float,
help='label smoothing, if equal to zero model will use \
CrossEntropyLoss, if not zero model will be trained \
with label smoothing loss')
CrossEntropyLoss, if not zero model will be trained \
with label smoothing loss')
# setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', default='fp16', choices=['fp32', 'fp16'],
general.add_argument('--math', default='fp16', choices=['fp16', 'fp32'],
help='arithmetic type')
general.add_argument('--seed', default=None, type=int,
help='set random number generator seed')
general.add_argument('--disable-eval', action='store_true', default=False,
help='disables validation after every epoch')
general.add_argument('--workers', default=0, type=int,
help='number of workers for data loading')
help='master seed for random number generators, if \
"seed" is undefined then the master seed will be \
sampled from random.SystemRandom()')
cuda_parser = general.add_mutually_exclusive_group(required=False)
cuda_parser.add_argument('--cuda', dest='cuda', action='store_true',
help='enables cuda (use \'--no-cuda\' to disable)')
cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false',
help=argparse.SUPPRESS)
cuda_parser.set_defaults(cuda=True)
cudnn_parser = general.add_mutually_exclusive_group(required=False)
cudnn_parser.add_argument('--cudnn', dest='cudnn', action='store_true',
help='enables cudnn (use \'--no-cudnn\' to disable)')
cudnn_parser.add_argument('--no-cudnn', dest='cudnn', action='store_false',
help=argparse.SUPPRESS)
cudnn_parser.set_defaults(cudnn=True)
exclusive_group(group=general, name='eval', default=True,
help='run validation and test after every epoch')
exclusive_group(group=general, name='env', default=True,
help='print info about execution env')
exclusive_group(group=general, name='cuda', default=True,
help='enables cuda')
exclusive_group(group=general, name='cudnn', default=True,
help='enables cudnn')
# training
training = parser.add_argument_group('training setup')
training.add_argument('--batch-size', default=128, type=int,
help='batch size for training')
training.add_argument('--epochs', default=8, type=int,
help='number of total epochs to run')
training.add_argument('--optimization-config',
default="{'optimizer': 'Adam', 'lr': 5e-4}", type=str,
help='optimizer config')
training.add_argument('--grad-clip', default=5.0, type=float,
help='enabled gradient clipping and sets maximum \
gradient norm value')
training.add_argument('--max-length-train', default=50, type=int,
help='maximum sequence length for training')
training.add_argument('--min-length-train', default=0, type=int,
help='minimum sequence length for training')
training.add_argument('--train-batch-size', default=128, type=int,
help='training batch size per worker')
training.add_argument('--train-global-batch-size', default=None, type=int,
help='global training batch size, optional; if \
defined, it will be used to automatically compute \
train_iter_size with the equation: \
train_iter_size = train_global_batch_size // \
(train_batch_size * world_size)')
training.add_argument('--train-iter-size', metavar='N', default=1,
type=int,
help='training iter size, training loop will \
accumulate gradients over N iterations and execute \
optimizer every N steps')
training.add_argument('--epochs', default=6, type=int,
help='max number of training epochs')
bucketing_parser = training.add_mutually_exclusive_group(required=False)
bucketing_parser.add_argument('--bucketing', dest='bucketing', action='store_true',
help='enables bucketing (use \'--no-bucketing\' to disable)')
bucketing_parser.add_argument('--no-bucketing', dest='bucketing', action='store_false',
help=argparse.SUPPRESS)
bucketing_parser.set_defaults(bucketing=True)
training.add_argument('--grad-clip', default=5.0, type=float,
help='enables gradient clipping and sets maximum \
norm of gradients')
training.add_argument('--max-length-train', default=50, type=int,
help='maximum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--min-length-train', default=0, type=int,
help='minimum sequence length for training \
(including special BOS and EOS tokens)')
training.add_argument('--train-loader-workers', default=2, type=int,
help='number of workers for training data loading')
training.add_argument('--batching', default='bucketing', type=str,
choices=['random', 'sharding', 'bucketing'],
help='select batching algorithm')
training.add_argument('--shard-size', default=80, type=int,
help='shard size for "sharding" batching algorithm, \
in multiples of global batch size')
training.add_argument('--num-buckets', default=5, type=int,
help='number of buckets for "bucketing" batching \
algorithm')
# optimizer
optimizer = parser.add_argument_group('optimizer setup')
optimizer.add_argument('--optimizer', type=str, default='Adam',
help='training optimizer')
optimizer.add_argument('--lr', type=float, default=2.00e-3,
help='learning rate')
optimizer.add_argument('--optimizer-extra', type=str,
default="{}",
help='extra options for the optimizer')
# scheduler
scheduler = parser.add_argument_group('learning rate scheduler setup')
scheduler.add_argument('--warmup-steps', type=str, default='200',
help='number of learning rate warmup iterations')
scheduler.add_argument('--remain-steps', type=str, default='0.666',
help='starting iteration for learning rate decay')
scheduler.add_argument('--decay-interval', type=str, default='None',
help='interval between learning rate decay steps')
scheduler.add_argument('--decay-steps', type=int, default=4,
help='max number of learning rate decay steps')
scheduler.add_argument('--decay-factor', type=float, default=0.5,
help='learning rate decay factor')
# validation
validation = parser.add_argument_group('validation setup')
validation.add_argument('--val-batch-size', default=128, type=int,
help='batch size for validation')
validation.add_argument('--max-length-val', default=80, type=int,
help='maximum sequence length for validation')
validation.add_argument('--min-length-val', default=0, type=int,
help='minimum sequence length for validation')
val = parser.add_argument_group('validation setup')
val.add_argument('--val-batch-size', default=64, type=int,
help='batch size for validation')
val.add_argument('--max-length-val', default=125, type=int,
help='maximum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--min-length-val', default=0, type=int,
help='minimum sequence length for validation \
(including special BOS and EOS tokens)')
val.add_argument('--val-loader-workers', default=0, type=int,
help='number of workers for validation data loading')
# test
test = parser.add_argument_group('test setup')
test.add_argument('--test-batch-size', default=128, type=int,
help='batch size for test')
test.add_argument('--max-length-test', default=150, type=int,
help='maximum sequence length for test')
help='maximum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--min-length-test', default=0, type=int,
help='minimum sequence length for test')
help='minimum sequence length for test \
(including special BOS and EOS tokens)')
test.add_argument('--beam-size', default=5, type=int,
help='beam size')
test.add_argument('--len-norm-factor', default=0.6, type=float,
@ -133,36 +190,50 @@ def parse_args():
help='coverage penalty factor')
test.add_argument('--len-norm-const', default=5.0, type=float,
help='length normalization constant')
test.add_argument('--target-bleu', default=None, type=float,
help='target accuracy')
test.add_argument('--intra-epoch-eval', default=0, type=int,
help='evaluate within epoch')
test.add_argument('--intra-epoch-eval', metavar='N', default=0, type=int,
help='evaluate within training epoch, this option will \
enable extra N equally spaced evaluations executed \
during each training epoch')
test.add_argument('--test-loader-workers', default=0, type=int,
help='number of workers for test data loading')
# checkpointing
checkpoint = parser.add_argument_group('checkpointing setup')
checkpoint.add_argument('--start-epoch', default=0, type=int,
help='manually set initial epoch counter')
checkpoint.add_argument('--resume', default=None, type=str, metavar='PATH',
help='resumes training from checkpoint from PATH')
checkpoint.add_argument('--save-all', action='store_true', default=False,
help='saves checkpoint after every epoch')
checkpoint.add_argument('--save-freq', default=5000, type=int,
help='save checkpoint every SAVE_FREQ batches')
checkpoint.add_argument('--keep-checkpoints', default=0, type=int,
help='keep only last KEEP_CHECKPOINTS checkpoints, \
affects only checkpoints controlled by --save-freq \
option')
chkpt = parser.add_argument_group('checkpointing setup')
chkpt.add_argument('--start-epoch', default=0, type=int,
help='manually set initial epoch counter')
chkpt.add_argument('--resume', default=None, type=str, metavar='PATH',
help='resumes training from checkpoint from PATH')
chkpt.add_argument('--save-all', action='store_true', default=False,
help='saves checkpoint after every epoch')
chkpt.add_argument('--save-freq', default=5000, type=int,
help='save checkpoint every SAVE_FREQ batches')
chkpt.add_argument('--keep-checkpoints', default=0, type=int,
help='keep only last KEEP_CHECKPOINTS checkpoints, \
affects only checkpoints controlled by --save-freq \
option')
# distributed support
# benchmarking
benchmark = parser.add_argument_group('benchmark setup')
benchmark.add_argument('--target-perf', default=None, type=float,
help='target training performance (in tokens \
per second)')
benchmark.add_argument('--target-bleu', default=None, type=float,
help='target accuracy')
# distributed
distributed = parser.add_argument_group('distributed setup')
distributed.add_argument('--rank', default=0, type=int,
help='rank of the process, do not set! Done by multiproc module')
distributed.add_argument('--world-size', default=1, type=int,
help='number of processes, do not set! Done by multiproc module')
distributed.add_argument('--dist-url', default='tcp://localhost:23456', type=str,
help='url used to set up distributed training')
help='global rank of the process, do not set!')
distributed.add_argument('--local_rank', default=0, type=int,
help='local rank of the process, do not set!')
return parser.parse_args()
args = parser.parse_args()
args.warmup_steps = literal_eval(args.warmup_steps)
args.remain_steps = literal_eval(args.remain_steps)
args.decay_interval = literal_eval(args.decay_interval)
return args
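As a usage note on the parser above: every exclusive_group() call produces a --<name>/--no-<name> pair that lands in a single boolean attribute, and the scheduler options are parsed as strings and converted with literal_eval so they can carry ints, floats, or None. A sketch, assuming the defaults shown above:

import sys
sys.argv = ['train.py', '--no-share-embedding', '--math', 'fp32',
            '--remain-steps', '0.666', '--decay-interval', 'None']
args = parse_args()
assert args.share_embedding is False   # flipped by the --no-* variant
assert args.remain_steps == 0.666      # '0.666' -> float via literal_eval
assert args.decay_interval is None     # 'None'  -> None via literal_eval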
def build_criterion(vocab_size, padding_idx, smoothing):
@ -178,24 +249,18 @@ def build_criterion(vocab_size, padding_idx, smoothing):
return criterion
@utils.timer('TOTAL RUNTIME', sync_gpu=False)
def main():
"""
Launches data-parallel multi-gpu training.
"""
args = parse_args()
device = utils.set_device(args.cuda, args.local_rank)
distributed = utils.init_distributed(args.cuda)
args.rank = utils.get_rank()
if not args.cudnn:
torch.backends.cudnn.enabled = False
if args.seed is not None:
torch.manual_seed(args.seed + args.rank)
# initialize distributed backend
distributed = args.world_size > 1
if distributed:
backend = 'nccl' if args.cuda else 'gloo'
dist.init_process_group(backend=backend, rank=args.rank,
init_method=args.dist_url,
world_size=args.world_size)
# create directory for results
save_path = os.path.join(args.results_dir, args.save)
@ -203,20 +268,39 @@ def main():
os.makedirs(save_path, exist_ok=True)
# setup logging
log_filename = f'log_gpu_{args.rank}.log'
setup_logging(os.path.join(save_path, log_filename))
log_filename = f'log_rank_{utils.get_rank()}.log'
utils.setup_logging(os.path.join(save_path, log_filename))
if args.env:
utils.log_env_info()
logging.info(f'Saving results to: {save_path}')
logging.info(f'Run arguments: {args}')
if args.cuda:
torch.cuda.set_device(args.rank)
# automatically set train_iter_size based on train_global_batch_size,
# world_size and per-worker train_batch_size
if args.train_global_batch_size is not None:
global_bs = args.train_global_batch_size
bs = args.train_batch_size
world_size = utils.get_world_size()
assert global_bs % (bs * world_size) == 0
args.train_iter_size = global_bs // (bs * world_size)
logging.info(f'Global batch size was set in the config, '
f'setting train_iter_size to {args.train_iter_size}')
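# Worked example (assumed values): train_global_batch_size=1024,
# train_batch_size=128, world_size=4 -> train_iter_size = 1024 // (128 * 4) = 2,
# so gradients are accumulated over 2 iterations per optimizer step.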
worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed, args.epochs,
device)
worker_seed = worker_seeds[args.rank]
logging.info(f'Worker {args.rank} is using worker seed: {worker_seed}')
torch.manual_seed(worker_seed)
# build tokenizer
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME))
pad_vocab = utils.pad_vocabulary(args.math)
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME),
pad_vocab)
# build datasets
train_data = ParallelDataset(
train_data = LazyParallelDataset(
src_fname=os.path.join(args.dataset_dir, config.SRC_TRAIN_FNAME),
tgt_fname=os.path.join(args.dataset_dir, config.TGT_TRAIN_FNAME),
tokenizer=tokenizer,
@ -238,45 +322,59 @@ def main():
tokenizer=tokenizer,
min_len=args.min_length_test,
max_len=args.max_length_test,
sort=False)
sort=True)
vocab_size = tokenizer.vocab_size
# build GNMT model
model_config = dict(vocab_size=vocab_size, math=args.math,
**literal_eval(args.model_config))
model = GNMT(**model_config)
model_config = {'hidden_size': args.hidden_size,
'num_layers': args.num_layers,
'dropout': args.dropout, 'batch_first': False,
'share_embedding': args.share_embedding}
model = GNMT(vocab_size=vocab_size, **model_config)
logging.info(model)
batch_first = model.batch_first
# define loss function (criterion) and optimizer
criterion = build_criterion(vocab_size, config.PAD, args.smoothing)
opt_config = literal_eval(args.optimization_config)
logging.info(f'Training optimizer: {opt_config}')
opt_config = {'optimizer': args.optimizer, 'lr': args.lr}
opt_config.update(literal_eval(args.optimizer_extra))
logging.info(f'Training optimizer config: {opt_config}')
scheduler_config = {'warmup_steps': args.warmup_steps,
'remain_steps': args.remain_steps,
'decay_interval': args.decay_interval,
'decay_steps': args.decay_steps,
'decay_factor': args.decay_factor}
logging.info(f'Training LR schedule config: {scheduler_config}')
num_parameters = sum([l.nelement() for l in model.parameters()])
logging.info(f'Number of parameters: {num_parameters}')
batching_opt = {'shard_size': args.shard_size,
'num_buckets': args.num_buckets}
# get data loaders
train_loader = train_data.get_loader(batch_size=args.batch_size,
train_loader = train_data.get_loader(batch_size=args.train_batch_size,
seeds=shuffling_seeds,
batch_first=batch_first,
shuffle=True,
bucketing=args.bucketing,
num_workers=args.workers,
drop_last=True)
batching=args.batching,
batching_opt=batching_opt,
num_workers=args.train_loader_workers)
val_loader = val_data.get_loader(batch_size=args.val_batch_size,
batch_first=batch_first,
shuffle=False,
num_workers=args.workers,
drop_last=False)
num_workers=args.val_loader_workers)
test_loader = test_data.get_loader(batch_size=args.test_batch_size,
batch_first=batch_first,
shuffle=False,
num_workers=args.workers,
drop_last=False)
pad=True,
num_workers=args.test_loader_workers)
translator = Translator(model=model,
tokenizer=tokenizer,
@ -293,13 +391,19 @@ def main():
save_path=args.save_path)
# create trainer
total_train_iters = len(train_loader) // args.train_iter_size * args.epochs
save_info = {'model_config': model_config, 'config': args, 'tokenizer':
tokenizer.get_state()}
trainer_options = dict(
criterion=criterion,
grad_clip=args.grad_clip,
iter_size=args.train_iter_size,
save_path=save_path,
save_freq=args.save_freq,
save_info={'config': args, 'tokenizer': tokenizer},
save_info=save_info,
opt_config=opt_config,
scheduler_config=scheduler_config,
train_iterations=total_train_iters,
batch_first=batch_first,
keep_checkpoints=args.keep_checkpoints,
math=args.math,
@ -325,47 +429,60 @@ def main():
# training loop
best_loss = float('inf')
break_training = False
test_bleu = None
for epoch in range(args.start_epoch, args.epochs):
logging.info(f'Starting epoch {epoch}')
if distributed:
train_loader.sampler.set_epoch(epoch)
train_loader.sampler.set_epoch(epoch)
trainer.epoch = epoch
train_loss, train_perf = trainer.optimize(train_loader)
# evaluate on validation set
if args.rank == 0 and not args.disable_eval:
if args.eval:
logging.info(f'Running validation on dev set')
val_loss, val_perf = trainer.evaluate(val_loader)
# remember best prec@1 and save checkpoint
is_best = val_loss < best_loss
best_loss = min(val_loss, best_loss)
trainer.save(save_all=args.save_all, is_best=is_best)
if args.rank == 0:
is_best = val_loss < best_loss
best_loss = min(val_loss, best_loss)
trainer.save(save_all=args.save_all, is_best=is_best)
break_training = False
if not args.disable_eval:
if args.eval:
utils.barrier()
test_bleu, break_training = translator.run(calc_bleu=True,
epoch=epoch)
if args.rank == 0 and not args.disable_eval:
logging.info(f'Summary: Epoch: {epoch}\t'
f'Training Loss: {train_loss:.4f}\t'
f'Validation Loss: {val_loss:.4f}\t'
f'Test BLEU: {test_bleu:.2f}')
logging.info(f'Performance: Epoch: {epoch}\t'
f'Training: {train_perf:.0f} Tok/s\t'
f'Validation: {val_perf:.0f} Tok/s')
else:
logging.info(f'Summary: Epoch: {epoch}\t'
f'Training Loss {train_loss:.4f}')
logging.info(f'Performance: Epoch: {epoch}\t'
f'Training: {train_perf:.0f} Tok/s')
acc_log = []
acc_log += [f'Summary: Epoch: {epoch}']
acc_log += [f'Training Loss: {train_loss:.4f}']
if args.eval:
acc_log += [f'Validation Loss: {val_loss:.4f}']
acc_log += [f'Test BLEU: {test_bleu:.2f}']
perf_log = []
perf_log += [f'Performance: Epoch: {epoch}']
perf_log += [f'Training: {train_perf:.0f} Tok/s']
if args.eval:
perf_log += [f'Validation: {val_perf:.0f} Tok/s']
if args.rank == 0:
logging.info('\t'.join(acc_log))
logging.info('\t'.join(perf_log))
logging.info(f'Finished epoch {epoch}')
if break_training:
break
utils.barrier()
passed = utils.benchmark(test_bleu, args.target_bleu,
train_perf, args.target_perf)
return passed
if __name__ == '__main__':
main()
passed = main()
if not passed:
sys.exit(1)


@ -1,39 +1,61 @@
#!/usr/bin/env python
import logging
import argparse
import logging
import os
import warnings
from ast import literal_eval
from itertools import product
import torch
import torch.distributed as dist
from seq2seq.models.gnmt import GNMT
from seq2seq.inference.inference import Translator
import seq2seq.utils as utils
from seq2seq.data.dataset import TextDataset
from seq2seq.data.tokenizer import Tokenizer
from seq2seq.inference.inference import Translator
from seq2seq.models.gnmt import GNMT
from seq2seq.utils import setup_logging
def parse_args():
"""
Parse commandline arguments.
"""
parser = argparse.ArgumentParser(description='GNMT Translate',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# data
def exclusive_group(group, name, default, help):
destname = name.replace('-', '_')
subgroup = group.add_mutually_exclusive_group(required=False)
subgroup.add_argument(f'--{name}', dest=f'{destname}',
action='store_true',
help=f'{help} (use \'--no-{name}\' to disable)')
subgroup.add_argument(f'--no-{name}', dest=f'{destname}',
action='store_false', help=argparse.SUPPRESS)
subgroup.set_defaults(**{destname: default})
parser = argparse.ArgumentParser(
description='GNMT Translate',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# dataset
dataset = parser.add_argument_group('data setup')
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en/',
help='path to directory with training/validation data')
help='path to directory with training/test data')
dataset.add_argument('-i', '--input', required=True,
help='full path to the input file (tokenized)')
dataset.add_argument('-o', '--output', required=True,
help='full path to the output file (tokenized)')
dataset.add_argument('-r', '--reference', default=None,
help='full path to the reference file (for sacrebleu)')
help='full path to the file with reference \
translations (for sacrebleu)')
dataset.add_argument('-m', '--model', required=True,
help='full path to the model checkpoint file')
exclusive_group(group=dataset, name='sort', default=True,
help='sorts dataset by sequence length')
# parameters
params = parser.add_argument_group('inference setup')
params.add_argument('--batch-size', default=128, type=int,
help='batch size')
params.add_argument('--beam-size', default=5, type=int,
params.add_argument('--batch-size', nargs='+', default=[128], type=int,
help='batch size per GPU')
params.add_argument('--beam-size', nargs='+', default=[5], type=int,
help='beam size')
params.add_argument('--max-seq-len', default=80, type=int,
help='maximum generated sequence length')
@ -45,16 +67,18 @@ def parse_args():
help='length normalization constant')
# general setup
general = parser.add_argument_group('general setup')
general.add_argument('--math', default='fp16', choices=['fp32', 'fp16'],
help='arithmetic type')
general.add_argument('--math', nargs='+', default=['fp16'],
choices=['fp16', 'fp32'], help='arithmetic type')
bleu_parser = general.add_mutually_exclusive_group(required=False)
bleu_parser.add_argument('--bleu', dest='bleu', action='store_true',
help='compares with reference and computes BLEU \
(use \'--no-bleu\' to disable)')
bleu_parser.add_argument('--no-bleu', dest='bleu', action='store_false',
help=argparse.SUPPRESS)
bleu_parser.set_defaults(bleu=True)
exclusive_group(group=general, name='env', default=True,
help='print info about execution env')
exclusive_group(group=general, name='bleu', default=True,
help='compares with reference translation and computes \
BLEU')
exclusive_group(group=general, name='cuda', default=True,
help='enables cuda')
exclusive_group(group=general, name='cudnn', default=True,
help='enables cudnn')
batch_first_parser = general.add_mutually_exclusive_group(required=False)
batch_first_parser.add_argument('--batch-first', dest='batch_first',
@ -67,62 +91,27 @@ def parse_args():
format for RNNs')
batch_first_parser.set_defaults(batch_first=True)
cuda_parser = general.add_mutually_exclusive_group(required=False)
cuda_parser.add_argument('--cuda', dest='cuda', action='store_true',
help='enables cuda (use \'--no-cuda\' to disable)')
cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false',
help=argparse.SUPPRESS)
cuda_parser.set_defaults(cuda=True)
cudnn_parser = general.add_mutually_exclusive_group(required=False)
cudnn_parser.add_argument('--cudnn', dest='cudnn', action='store_true',
help='enables cudnn (use \'--no-cudnn\' to disable)')
cudnn_parser.add_argument('--no-cudnn', dest='cudnn', action='store_false',
help=argparse.SUPPRESS)
cudnn_parser.set_defaults(cudnn=True)
general.add_argument('--print-freq', '-p', default=1, type=int,
help='print log every PRINT_FREQ batches')
# distributed
distributed = parser.add_argument_group('distributed setup')
distributed.add_argument('--rank', default=0, type=int,
help='global rank of the process, do not set!')
distributed.add_argument('--local_rank', default=0, type=int,
help='local rank of the process, do not set!')
args = parser.parse_args()
if args.bleu and args.reference is None:
parser.error('--bleu requires --reference')
if 'fp16' in args.math and not args.cuda:
parser.error('--math fp16 requires --cuda')
return args
def checkpoint_from_distributed(state_dict):
"""
Checks whether the checkpoint was generated by DistributedDataParallel. DDP
wraps the model in an additional "module." prefix, which has to be removed
for single-GPU inference.
:param state_dict: model's state dict
"""
ret = False
for key, _ in state_dict.items():
if key.find('module.') != -1:
ret = True
break
return ret
def unwrap_distributed(state_dict):
"""
Unwraps the model from DistributedDataParallel.
DDP wraps the model in an additional "module." prefix, which has to be
removed for single-GPU inference.
:param state_dict: model's state dict
"""
new_state_dict = {}
for key, value in state_dict.items():
new_key = key.replace('module.', '')
new_state_dict[new_key] = value
return new_state_dict
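A short sketch of the intended use of the two helpers above (main() below does the same): strip the "module." prefix before loading weights saved from a DDP-wrapped model into a plain GNMT instance; the checkpoint path and the model object are illustrative:

checkpoint = torch.load('checkpoint.pth', map_location='cpu')  # illustrative path
state_dict = checkpoint['state_dict']
if checkpoint_from_distributed(state_dict):
    state_dict = unwrap_distributed(state_dict)
model.load_state_dict(state_dict)   # 'model' is a plain (non-DDP) GNMT instance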
def main():
"""
Launches translation (inference).
@ -130,26 +119,17 @@ def main():
with length normalization and coverage penalty.
"""
args = parse_args()
utils.set_device(args.cuda, args.local_rank)
utils.init_distributed(args.cuda)
setup_logging()
logging.basicConfig(level=logging.DEBUG,
format="%(asctime)s - %(levelname)s - %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
filename='log.log',
filemode='w')
console = logging.StreamHandler()
console.setLevel(logging.INFO)
formatter = logging.Formatter('%(message)s')
console.setFormatter(formatter)
logging.getLogger('').addHandler(console)
if args.env:
utils.log_env_info()
logging.info(args)
logging.info(f'Run arguments: {args}')
if args.cuda:
torch.cuda.set_device(0)
if not args.cuda and torch.cuda.is_available():
warnings.warn('cuda is available but not enabled')
if args.math == 'fp16' and not args.cuda:
raise RuntimeError('fp16 requires cuda')
if not args.cudnn:
torch.backends.cudnn.enabled = False
@ -157,57 +137,57 @@ def main():
checkpoint = torch.load(args.model, map_location={'cuda:0': 'cpu'})
# build GNMT model
tokenizer = checkpoint['tokenizer']
tokenizer = Tokenizer()
tokenizer.set_state(checkpoint['tokenizer'])
vocab_size = tokenizer.vocab_size
model_config = dict(vocab_size=vocab_size, math=checkpoint['config'].math,
**literal_eval(checkpoint['config'].model_config))
model_config = checkpoint['model_config']
model_config['batch_first'] = args.batch_first
model = GNMT(**model_config)
model = GNMT(vocab_size=vocab_size, **model_config)
model.load_state_dict(checkpoint['state_dict'])
state_dict = checkpoint['state_dict']
if checkpoint_from_distributed(state_dict):
state_dict = unwrap_distributed(state_dict)
for (math, batch_size, beam_size) in product(args.math, args.batch_size,
args.beam_size):
logging.info(f'math: {math}, batch size: {batch_size}, '
f'beam size: {beam_size}')
if math == 'fp32':
dtype = torch.FloatTensor
if math == 'fp16':
dtype = torch.HalfTensor
model.type(dtype)
model.load_state_dict(state_dict)
if args.cuda:
model = model.cuda()
model.eval()
if args.math == 'fp32':
dtype = torch.FloatTensor
if args.math == 'fp16':
dtype = torch.HalfTensor
# construct the dataset
test_data = TextDataset(src_fname=args.input,
tokenizer=tokenizer,
sort=args.sort)
model.type(dtype)
if args.cuda:
model = model.cuda()
model.eval()
# build the data loader
test_loader = test_data.get_loader(batch_size=batch_size,
batch_first=args.batch_first,
shuffle=False,
pad=True,
num_workers=0)
# construct the dataset
test_data = TextDataset(src_fname=args.input,
tokenizer=tokenizer,
sort=False)
# build the translator object
translator = Translator(model=model,
tokenizer=tokenizer,
loader=test_loader,
beam_size=beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir)
# build the data loader
test_loader = test_data.get_loader(batch_size=args.batch_size,
batch_first=args.batch_first,
shuffle=False,
num_workers=0,
drop_last=False)
# execute the inference
translator.run(calc_bleu=args.bleu, eval_path=args.output,
reference_path=args.reference, summary=True)
# build the translator object
translator = Translator(model=model,
tokenizer=tokenizer,
loader=test_loader,
beam_size=args.beam_size,
max_seq_len=args.max_seq_len,
len_norm_factor=args.len_norm_factor,
len_norm_const=args.len_norm_const,
cov_penalty_factor=args.cov_penalty_factor,
cuda=args.cuda,
print_freq=args.print_freq,
dataset_dir=args.dataset_dir)
# execute the inference
translator.run(calc_bleu=args.bleu, eval_path=args.output,
reference_path=args.reference, summary=True)
if __name__ == '__main__':
main()