Squashed 'PyTorch/Translation/GNMT/' changes from 4dc145a..51a90b1
51a90b1 Feb 14, 2019 update

git-subtree-dir: PyTorch/Translation/GNMT
git-subtree-split: 51a90b1667e7c1d45bd68bf719f8bf5d4c4521e3
parent 8f95a78af2
commit 8efbea403f
1
.gitignore
vendored
|
@ -4,3 +4,4 @@ tags
|
|||
/results
|
||||
/data
|
||||
.DS_Store
|
||||
.rsyncignore
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
FROM nvcr.io/nvidia/pytorch:18.06-py3
|
||||
FROM nvcr.io/nvidia/pytorch:19.01-py3
|
||||
|
||||
ENV LANG C.UTF-8
|
||||
ENV LC_ALL C.UTF-8
|
||||
|
|
240
README.md
|
@ -28,20 +28,37 @@ and
|
|||
* 4-layer LSTM, hidden size 1024, first layer is bidirectional, the rest are
|
||||
unidirectional
|
||||
* with residual connections starting from 3rd layer
|
||||
* uses LSTM layer accelerated by cuDNN
|
||||
* uses standard pytorch nn.LSTM layer
|
||||
* dropout is applied on input to all LSTM layers, probability of dropout is
|
||||
set to 0.2
|
||||
* hidden state of LSTM layers is initialized with zeros
|
||||
* weights and bias of LSTM layers are initialized with uniform(-0.1, 0.1)
|
||||
distribution
|
||||
* decoder:
|
||||
* 4-layer unidirectional LSTM with hidden size 1024 and fully-connected
|
||||
classifier
|
||||
* with residual connections starting from 3rd layer
|
||||
* uses LSTM layer accelerated by cuDNN
|
||||
* uses standard pytorch nn.LSTM layer
|
||||
* dropout is applied on input to all LSTM layers, probability of dropout is
|
||||
set to 0.2
|
||||
* hidden state of LSTM layers is initialized with zeros
|
||||
* weights and bias of LSTM layers are initialized with uniform(-0.1, 0.1)
|
||||
distribution
|
||||
* weights and bias of fully-connected classifier are initialized with
|
||||
uniform(-0.1, 0.1) distribution
|
||||
* attention:
|
||||
* normalized Bahdanau attention
|
||||
* output from first LSTM layer of decoder goes into attention,
|
||||
then re-weighted context is concatenated with the input to all subsequent
|
||||
LSTM layers of the decoder at the current timestep
|
||||
* linear transform of keys and queries is initialized with uniform(-0.1, 0.1),
|
||||
normalization scalar is initialized with 1.0 / sqrt(1024),
|
||||
normalization bias is initialized with zero
|
||||
* inference:
|
||||
* beam search with default beam size of 5
|
||||
* with coverage penalty and length normalization terms
|
||||
* with coverage penalty and length normalization, coverage penalty factor is
|
||||
set to 0.1, length normalization factor is set to 0.6 and length
|
||||
normalization constant is set to 5.0
|
||||
* detokenized BLEU computed by [SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu)
|
||||
* [motivation](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu#motivation) for choosing SacreBLEU
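
The length normalization and coverage penalty settings listed above can be illustrated with a minimal sketch. It assumes the scoring formulas from the GNMT paper (Wu et al., 2016); the function names, the `eps` guard, and the toy attention matrix are hypothetical, and the beam search in this repository may differ in detail.

```python
import math


def length_penalty(length, factor=0.6, const=5.0):
    """GNMT-style length normalization: ((const + length) / (const + 1)) ** factor."""
    return ((const + length) / (const + 1.0)) ** factor


def coverage_penalty(attention, factor=0.1, eps=1e-10):
    """Penalize source positions that received little total attention.

    attention[j][i] is the attention weight on source position i at
    target step j. The eps guard (an assumption of this sketch) avoids
    log(0) for positions that were never attended to.
    """
    num_src = len(attention[0])
    total = 0.0
    for i in range(num_src):
        covered = sum(step[i] for step in attention)
        total += math.log(max(min(covered, 1.0), eps))
    return factor * total


def hypothesis_score(log_prob, length, attention):
    """Length-normalized log-probability plus the coverage penalty."""
    return log_prob / length_penalty(length) + coverage_penalty(attention)


# Toy example: two target steps attending uniformly over two source tokens.
toy_attention = [[0.5, 0.5], [0.5, 0.5]]
print(hypothesis_score(-6.0, 2, toy_attention))
```

With the defaults above, a hypothesis of length 1 has a length penalty of exactly 1.0, and longer hypotheses are divided by a larger value, which offsets the tendency of beam search to prefer short outputs.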
|
||||
|
||||
|
@ -53,12 +70,11 @@ Our experiments show that a 4-layer model is significantly faster to train and
|
|||
yields comparable accuracy on the public
|
||||
[WMT16 English-German](http://www.statmt.org/wmt16/translation-task.html)
|
||||
dataset. The number of LSTM layers is controlled by the `num_layers` parameter
|
||||
in the `scripts/train.sh` training script.
|
||||
in the `train.py` training script.
|
||||
|
||||
# Setup
|
||||
## Requirements
|
||||
* [PyTorch 18.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
|
||||
(or newer)
|
||||
* [PyTorch 19.01-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
|
||||
* [SacreBLEU 1.2.10](https://pypi.org/project/sacrebleu/1.2.10/)
|
||||
|
||||
This repository contains `Dockerfile` which extends the PyTorch NGC container
|
||||
|
@ -76,7 +92,7 @@ and
|
|||
Before you can train using mixed precision with Tensor Cores, ensure that you
|
||||
have a
|
||||
[NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
|
||||
based GPU.
|
||||
based GPU. Other platforms will likely work but are not officially supported.
|
||||
For information about how to train using mixed precision, see the
|
||||
[Mixed Precision Training paper](https://arxiv.org/abs/1710.03740)
|
||||
and
|
||||
|
@ -109,15 +125,33 @@ By default, the training script will use all available GPUs. The training script
|
|||
saves only one checkpoint with the lowest value of the loss function on the
|
||||
validation dataset. All results and logs are saved to the `results` directory
|
||||
(on the host) or to the `/workspace/gnmt/results` directory (in the container).
|
||||
By default, the `scripts/train.sh` script will launch mixed precision training
|
||||
By default, the `train.py` script will launch mixed precision training
|
||||
with Tensor Cores. You can change this behaviour by setting the `--math fp32`
|
||||
flag in the `scripts/train.sh` script.
|
||||
flag for the `train.py` training script.
|
||||
|
||||
Launching training on 1, 4 or 8 GPUs:
|
||||
|
||||
```
|
||||
bash scripts/train.sh
|
||||
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
|
||||
```
|
||||
|
||||
Launching training on 16 GPUs:
|
||||
|
||||
```
|
||||
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
|
||||
```
|
||||
|
||||
By default the training script will launch training with batch size 128 per GPU.
|
||||
If the specified `--train-global-batch-size` is larger than 128 times the number of
|
||||
GPUs available for the training then the training script will accumulate
|
||||
gradients over consecutive iterations and then perform the weight update.
|
||||
For example, 1-GPU training with `--train-global-batch-size 1024` will accumulate
|
||||
gradients over 8 iterations before doing the weight update with accumulated
|
||||
gradients.
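
The relationship between the global batch size and gradient accumulation described above can be sketched as follows. This is an illustration only: the helper name `accumulation_steps` is hypothetical, and the divisibility check is an assumption of this sketch rather than documented behavior of `train.py`.

```python
def accumulation_steps(global_batch_size, num_gpus, per_gpu_batch_size=128):
    """Number of iterations over which gradients are accumulated
    before a single weight update."""
    per_iteration = per_gpu_batch_size * num_gpus
    if global_batch_size <= per_iteration:
        # One iteration already covers the requested global batch.
        return 1
    if global_batch_size % per_iteration != 0:
        # Assumption of this sketch: reject sizes that do not divide evenly.
        raise ValueError("global batch size must be a multiple of "
                         "per-GPU batch size times the number of GPUs")
    return global_batch_size // per_iteration


for gpus in (1, 4, 8):
    print(gpus, accumulation_steps(1024, gpus))
```

For the 1-GPU case with `--train-global-batch-size 1024` this gives 8 iterations per weight update, matching the example in the text.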
|
||||
|
||||
The training script automatically runs the validation and testing after each
|
||||
training epoch. The results from the validation and testing are printed to
|
||||
the standard error (stderr) and saved to log files.
|
||||
the standard output (stdout) and saved to log files.
|
||||
|
||||
The summary after each training epoch is printed in the following format:
|
||||
```
|
||||
|
@ -145,21 +179,24 @@ Our download script is very similar to the `wmt16_en_de.sh` script from the
|
|||
[tensorflow/nmt](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh)
|
||||
repository. Our download script contains an extra preprocessing step, which
|
||||
discards all pairs of sentences which can't be decoded by *latin-1* encoder.
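
A minimal sketch of such a filtering step is shown below. It assumes that the check amounts to testing whether both sentences of a pair can be represented in *latin-1*; the helper names are hypothetical and the actual download script may implement the check differently.

```python
def is_latin1(text):
    """Return True if every character of text is representable in latin-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


def filter_pairs(pairs):
    """Keep only (source, target) pairs in which both sides pass the check."""
    return [(src, tgt) for src, tgt in pairs
            if is_latin1(src) and is_latin1(tgt)]


pairs = [
    ("Greetings", "Grüße"),          # kept: ü and ß exist in latin-1
    ("The price is 5 €", "Preis"),   # dropped: the euro sign is not in latin-1
]
print(filter_pairs(pairs))
```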
|
||||
|
||||
The `scripts/wmt16_en_de.sh` script uses the
|
||||
[subword-nmt](https://github.com/rsennrich/subword-nmt)
|
||||
package to segment text into subword units (BPE). By default, the script builds
|
||||
the shared vocabulary of 32,000 tokens.
|
||||
|
||||
In order to test with other datasets, scripts need to be customized accordingly.
|
||||
|
||||
## Running training
|
||||
The default training configuration can be launched by running the
|
||||
`scripts/train.sh` training script.
|
||||
`train.py` training script.
|
||||
By default, the training script saves only one checkpoint with the lowest value
|
||||
of the loss function on the validation dataset; an evaluation is performed after
|
||||
each training epoch. Results are stored in the `results/gnmt_wmt16` directory.
|
||||
|
||||
The training script launches data-parallel training with batch size 128 per GPU
|
||||
on all available GPUs. After each training epoch, the script runs an evaluation
|
||||
on all available GPUs. We have tested training on up to 16 GPUs on a single
|
||||
node.
|
||||
After each training epoch, the script runs an evaluation
|
||||
on the validation dataset and outputs a BLEU score on the test dataset
|
||||
(*newstest2014*). BLEU is computed by the
|
||||
[SacreBLEU](https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu)
|
||||
|
@ -171,12 +208,11 @@ behavior by setting the `CUDA_VISIBLE_DEVICES` variable in your environment or
|
|||
by setting the `NV_GPU` variable at the Docker container launch
|
||||
([see section "GPU isolation"](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation)).
|
||||
|
||||
By default, the `scripts/train.sh` script will launch mixed precision training
|
||||
By default, the `train.py` script will launch mixed precision training
|
||||
with Tensor Cores. You can change this behaviour by setting the `--math fp32`
|
||||
flag in the `scripts/train.sh` script.
|
||||
flag for the `train.py` script.
|
||||
|
||||
Internally, the `scripts/train.sh` script uses `train.py`. To view all available
|
||||
options for training, run `python3 train.py --help`.
|
||||
To view all available options for training, run `python3 train.py --help`.
|
||||
|
||||
## Running inference
|
||||
Inference can be run by launching the `translate.py` inference script, although,
|
||||
|
@ -188,93 +224,129 @@ normalization term. Greedy decoding can be enabled by setting the beam size to 1
|
|||
|
||||
To view all available options for inference, run `python3 translate.py --help`.
|
||||
|
||||
## Benchmarking scripts
|
||||
### Training performance benchmark
|
||||
The `scripts/benchmark_training.sh` benchmarking script runs a few, relatively
|
||||
short training sessions and automatically collects performance numbers. The
|
||||
benchmarking script assumes that the `scripts/wmt16_en_de.sh` data download
|
||||
script was launched and the datasets are available in the default location
|
||||
(`data` directory).
|
||||
|
||||
Results from the benchmark are stored in the `results` directory. After the
|
||||
benchmark is done, you can launch the `scripts/parse_train_benchmark.sh` script
|
||||
to generate a short summary which will contain launch configuration, performance
|
||||
(in tokens per second), and estimated training time needed for one epoch (in
|
||||
seconds).
|
||||
|
||||
### Inference performance and accuracy benchmark
|
||||
The `scripts/benchmark_inference.sh` benchmarking script launches a number of
|
||||
inference runs with different hyperparameters (beam size, batch size, arithmetic
|
||||
type) on sorted and unsorted *newstest2014* test dataset. Performance and
|
||||
accuracy results are stored in the `results/inference_benchmark` directory.
|
||||
BLEU score is computed by the SacreBLEU package.
|
||||
|
||||
The `scripts/benchmark_inference.sh` script assumes that the
|
||||
`scripts/wmt16_en_de.sh` data download script was
|
||||
launched and the datasets are available in the default location (`data`
|
||||
directory).
|
||||
|
||||
The `scripts/benchmark_inference.sh` script requires a pre-trained
|
||||
model checkpoint. By default, the script is loading a checkpoint from the
|
||||
`results/gnmt_wmt16/model_best.pth` location.
|
||||
|
||||
## Training Accuracy Results
|
||||
All results were obtained by running the `scripts/train.sh` script in
|
||||
the pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
|
||||
Results were obtained by running the `train.py` script with the default
|
||||
batch size = 128 per GPU in the pytorch-19.01-py3 Docker container.
|
||||
|
||||
### NVIDIA DGX-1 (8x Tesla V100 16G)
|
||||
Command used to launch the training:
|
||||
|
||||
| **number of GPUs** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
|
||||
| ------------------ | ------------------------ | ------------- | --------------------------------- | ---------------------- |
|
||||
| 1 | 22.54 | 22.25 | 412 minutes | 948 minutes |
|
||||
| 4 | 22.45 | 22.46 | 118 minutes | 264 minutes |
|
||||
| 8 | 22.41 | 22.43 | 64 minutes | 139 minutes |
|
||||
```
|
||||
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
|
||||
```
|
||||
|
||||
| **number of GPUs** | **batch size/GPU** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
|
||||
| --- | --- | ----- | ----- | ------------- | ------------- |
|
||||
| 1 | 128 | 24.59 | 24.71 | 264.4 minutes | 824.4 minutes |
|
||||
| 4 | 128 | 24.30 | 24.45 | 89.5 minutes | 230.8 minutes |
|
||||
| 8 | 128 | 24.45 | 24.48 | 46.2 minutes | 116.6 minutes |
|
||||
|
||||
### NVIDIA DGX-2 (16x Tesla V100 32G)
|
||||
Commands used to launch the training:
|
||||
|
||||
```
|
||||
for 1,4,8 GPUs:
|
||||
python3 -m launch train.py --seed 2 --train-global-batch-size 1024
|
||||
for 16 GPUs:
|
||||
python3 -m launch train.py --seed 2 --train-global-batch-size 2048
|
||||
```
|
||||
|
||||
| **number of GPUs** | **batch size/GPU** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision training time** | **fp32 training time** |
|
||||
| --- | --- | ----- | ----- | ------------- | ------------- |
|
||||
| 1 | 128 | 24.59 | 24.71 | 265.0 minutes | 825.1 minutes |
|
||||
| 4 | 128 | 24.69 | 24.33 | 87.4 minutes | 216.3 minutes |
|
||||
| 8 | 128 | 24.50 | 24.47 | 49.6 minutes | 113.5 minutes |
|
||||
| 16 | 128 | 24.22 | 24.16 | 26.3 minutes | 58.6 minutes |
|
||||
|
||||
![TrainingLoss](./img/training_loss.png)
|
||||
|
||||
### Training Stability Test
|
||||
The GNMT v2 model was trained for 10 epochs, starting from 96 different initial
|
||||
The GNMT v2 model was trained for 6 epochs, starting from 50 different initial
|
||||
random seeds. After each training epoch the model was evaluated on the test
|
||||
dataset and the BLEU score was recorded. The training was performed in the
|
||||
pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs. The
|
||||
following table summarizes results of the stability test.
|
||||
pytorch-19.01-py3 Docker container on NVIDIA DGX-1 with 8 Tesla V100 16G GPUs.
|
||||
The following table summarizes results of the stability test.
|
||||
|
||||
![TrainingAccuracy](./img/training_accuracy.png)
|
||||
|
||||
## Training Performance Results
|
||||
All results were obtained by running the `scripts/train.sh` training script in
|
||||
the pytorch-18.06-py3 Docker container on NVIDIA DGX-1 with 8 V100 16G GPUs.
|
||||
Performance numbers (in tokens per second) were averaged over an entire training
|
||||
epoch.
|
||||
#### BLEU scores after each training epoch for different initial random seeds
|
||||
| **epoch** | **average** | **stdev** | **minimum** | **maximum** | **median** |
|
||||
| --- | ------ | ----- | ------ | ------ | ------ |
|
||||
| 1 | 19.954 | 0.326 | 18.710 | 20.490 | 20.020 |
|
||||
| 2 | 21.734 | 0.222 | 21.220 | 22.120 | 21.765 |
|
||||
| 3 | 22.502 | 0.223 | 21.960 | 22.970 | 22.485 |
|
||||
| 4 | 23.004 | 0.221 | 22.350 | 23.430 | 23.020 |
|
||||
| 5 | 24.201 | 0.146 | 23.900 | 24.480 | 24.215 |
|
||||
| 6 | 24.423 | 0.159 | 24.070 | 24.820 | 24.395 |
|
||||
|
||||
| **number of GPUs** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu weak scaling** | **fp32 multi-gpu weak scaling** |
|
||||
| -------- | ------------- | ------------- | ------------ | --------------------------- | --------------------------- |
|
||||
| 1 | 42337 | 18581 | 2.279 | 1.000 | 1.000 |
|
||||
| 4 | 153433 | 67586 | 2.270 | 3.624 | 3.637 |
|
||||
| 8 | 300181 | 132734 | 2.262 | 7.090 | 7.144 |
|
||||
|
||||
## Training Performance Results
|
||||
All results were obtained by running the `train.py` training script in the
|
||||
pytorch-19.01-py3 Docker container. Performance numbers (in tokens per second)
|
||||
were averaged over an entire training epoch.
|
||||
|
||||
### NVIDIA DGX-1 (8x Tesla V100 16G)
|
||||
|
||||
| **number of GPUs** | **batch size/GPU** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu strong scaling** | **fp32 multi-gpu strong scaling** |
|
||||
| --- | --- | ------ | ------ | ----- | ----- | ----- |
|
||||
| 1 | 128 | 66050 | 21346 | 3.094 | 1.000 | 1.000 |
|
||||
| 4 | 128 | 196174 | 76083 | 2.578 | 2.970 | 3.564 |
|
||||
| 8 | 128 | 387282 | 153697 | 2.520 | 5.863 | 7.200 |
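
The speedup and scaling columns in the DGX-1 table can be derived from the throughput columns. The sketch below assumes that speedup is the ratio of mixed precision to fp32 throughput and that scaling is throughput relative to the 1-GPU row; the function name is hypothetical.

```python
def speedup_and_scaling(rows):
    """rows: list of (num_gpus, mixed_tokens_per_s, fp32_tokens_per_s).

    Returns (num_gpus, mixed precision speedup, mixed precision scaling,
    fp32 scaling), each rounded to 3 decimal places.
    """
    base_mixed, base_fp32 = rows[0][1], rows[0][2]
    result = []
    for gpus, mixed, fp32 in rows:
        result.append((
            gpus,
            round(mixed / fp32, 3),        # speedup of mixed precision over fp32
            round(mixed / base_mixed, 3),  # scaling relative to 1 GPU (mixed)
            round(fp32 / base_fp32, 3),    # scaling relative to 1 GPU (fp32)
        ))
    return result


# Throughput numbers copied from the DGX-1 table above.
dgx1 = [(1, 66050, 21346), (4, 196174, 76083), (8, 387282, 153697)]
for row in speedup_and_scaling(dgx1):
    print(row)
```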
|
||||
|
||||
|
||||
### NVIDIA DGX-2 (16x Tesla V100 32G)
|
||||
|
||||
| **number of GPUs** | **batch size/GPU** | **mixed precision tokens/s** | **fp32 tokens/s** | **mixed precision speedup** | **mixed precision multi-gpu strong scaling** | **fp32 multi-gpu strong scaling** |
|
||||
| --- | --- | ------ | ------- | ----- | ------ | ------ |
|
||||
| 1 | 128 | 65830 | 22695 | 2.901 | 1.000 | 1.000 |
|
||||
| 4 | 128 | 200886 | 81224 | 2.473 | 3.052 | 3.579 |
|
||||
| 8 | 128 | 362612 | 156536 | 2.316 | 5.508 | 6.897 |
|
||||
| 16 | 128 | 738521 | 314831 | 2.346 | 11.219 | 13.872 |
|
||||
|
||||
## Inference Performance Results
|
||||
All results were obtained by running the `scripts/benchmark_inference.sh`
|
||||
benchmarking script in the pytorch-18.06-py3 Docker container on NVIDIA DGX-1.
|
||||
Inference was run on a single V100 16G GPU.
|
||||
All results were obtained by running the `translate.py` script in the
|
||||
pytorch-19.01-py3 Docker container on NVIDIA DGX-1. Inference benchmark was run
|
||||
on a single Tesla V100 16G GPU. The benchmark requires a checkpoint from a fully
|
||||
trained model.
|
||||
|
||||
Command to launch the inference benchmark:
|
||||
```
|
||||
python3 translate.py --input data/wmt16_de_en/newstest2014.tok.bpe.32000.en \
|
||||
--reference data/wmt16_de_en/newstest2014.de --output /tmp/output \
|
||||
--model results/gnmt/model_best.pth --batch-size 32 128 512 \
|
||||
--beam-size 1 2 5 10 --math fp16 fp32
|
||||
```
|
||||
|
||||
| **batch size** | **beam size** | **mixed precision BLEU** | **fp32 BLEU** | **mixed precision tokens/s** | **fp32 tokens/s** |
|
||||
| -------------- | ------------- | ------------- | ------------- | ----------------- | ------------ |
|
||||
| 512 | 1 | 20.63 | 20.63 | 62009 | 31229 |
|
||||
| 512 | 2 | 21.55 | 21.60 | 32669 | 16454 |
|
||||
| 512 | 5 | 22.34 | 22.36 | 21105 | 8562 |
|
||||
| 512 | 10 | 22.34 | 22.40 | 12967 | 4720 |
|
||||
| 128 | 1 | 20.62 | 20.63 | 27095 | 19505 |
|
||||
| 128 | 2 | 21.56 | 21.60 | 13224 | 9718 |
|
||||
| 128 | 5 | 22.38 | 22.36 | 10987 | 6575 |
|
||||
| 128 | 10 | 22.35 | 22.40 | 8603 | 4103 |
|
||||
| 32 | 1 | 20.62 | 20.63 | 9451 | 8483 |
|
||||
| 32 | 2 | 21.56 | 21.60 | 4818 | 4333 |
|
||||
| 32 | 5 | 22.34 | 22.36 | 4505 | 3655 |
|
||||
| 32 | 10 | 22.37 | 22.40 | 4086 | 2822 |
|
||||
| ---- | ----- | ------- | ------- | ---------|-------- |
|
||||
| 32 | 1 | 23.18 | 23.18 | 23571 | 19462 |
|
||||
| 32 | 2 | 24.09 | 24.12 | 15303 | 12345 |
|
||||
| 32 | 5 | 24.63 | 24.62 | 13644 | 7725 |
|
||||
| 32 | 10 | 24.50 | 24.48 | 11049 | 5359 |
|
||||
| 128 | 1 | 23.17 | 23.18 | 73429 | 42272 |
|
||||
| 128 | 2 | 24.07 | 24.12 | 43373 | 23131 |
|
||||
| 128 | 5 | 24.69 | 24.63 | 29646 | 12525 |
|
||||
| 128 | 10 | 24.45 | 24.48 | 19100 | 6886 |
|
||||
| 512 | 1 | 23.17 | 23.18 | 135333 | 48962 |
|
||||
| 512 | 2 | 24.08 | 24.12 | 74367 | 27308 |
|
||||
| 512 | 5 | 24.60 | 24.63 | 39217 | 12674 |
|
||||
| 512 | 10 | 24.54 | 24.48 | 21433 | 6640 |
|
||||
|
||||
|
||||
# Changelog
|
||||
1. Aug 7, 2018
|
||||
* Initial release
|
||||
2. Dec 4, 2018
|
||||
* Added exponential warm-up and step learning rate decay
|
||||
* Multi-GPU (distributed) inference and validation
|
||||
* Default container updated to PyTorch 18.11-py3
|
||||
* General performance improvements
|
||||
3. Feb 14, 2019
|
||||
* Different batching algorithm (bucketing with 5 equal-width buckets)
|
||||
* Additional dropouts before first LSTM layer in encoder and in decoder
|
||||
* Weight initialization changed to uniform (-0.1, 0.1)
|
||||
* Switched order of dropout and concatenation with attention in decoder
|
||||
* Default container updated to PyTorch 19.01-py3
|
||||
|
||||
# Known issues
|
||||
None
|
||||
|
|
Binary file not shown.
Before Width: | Height: | Size: 121 KiB After Width: | Height: | Size: 115 KiB |
Binary file not shown.
Before Width: | Height: | Size: 205 KiB After Width: | Height: | Size: 295 KiB |
235
launch.py
Normal file
|
@ -0,0 +1,235 @@
|
|||
r"""
|
||||
`torch.distributed.launch` is a module that spawns up multiple distributed
|
||||
training processes on each of the training nodes.
|
||||
|
||||
The utility can be used for single-node distributed training, in which one or
|
||||
more processes per node will be spawned. The utility can be used for either
|
||||
CPU training or GPU training. If the utility is used for GPU training,
|
||||
each distributed process will be operating on a single GPU. This can achieve
|
||||
well-improved single-node training performance. It can also be used in
|
||||
multi-node distributed training, by spawning up multiple processes on each node
|
||||
for well-improved multi-node distributed training performance as well.
|
||||
This will especially be beneficial for systems with multiple Infiniband
|
||||
interfaces that have direct-GPU support, since all of them can be utilized for
|
||||
aggregated communication bandwidth.
|
||||
|
||||
In both cases of single-node distributed training or multi-node distributed
|
||||
training, this utility will launch the given number of processes per node
|
||||
(``--nproc_per_node``). If used for GPU training, this number needs to be less
|
||||
than or equal to the number of GPUs on the current system (``nproc_per_node``),
|
||||
and each process will be operating on a single GPU from *GPU 0 to
|
||||
GPU (nproc_per_node - 1)*.
|
||||
|
||||
**How to use this module:**
|
||||
|
||||
1. Single-Node multi-process distributed training
|
||||
|
||||
::
|
||||
|
||||
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
|
||||
YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
|
||||
arguments of your training script)
|
||||
|
||||
2. Multi-Node multi-process distributed training: (e.g. two nodes)
|
||||
|
||||
|
||||
Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*
|
||||
|
||||
::
|
||||
|
||||
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
|
||||
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1"
|
||||
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
|
||||
and all other arguments of your training script)
|
||||
|
||||
Node 2:
|
||||
|
||||
::
|
||||
|
||||
>>> python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
|
||||
--nnodes=2 --node_rank=1 --master_addr="192.168.1.1"
|
||||
--master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
|
||||
and all other arguments of your training script)
|
||||
|
||||
3. To look up what optional arguments this module offers:
|
||||
|
||||
::
|
||||
|
||||
>>> python -m torch.distributed.launch --help
|
||||
|
||||
|
||||
**Important Notices:**
|
||||
|
||||
1. This utility and multi-process distributed (single-node or
|
||||
multi-node) GPU training currently only achieves the best performance using
|
||||
the NCCL distributed backend. Thus NCCL backend is the recommended backend to
|
||||
use for GPU training.
|
||||
|
||||
2. In your training program, you must parse the command-line argument:
|
||||
``--local_rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
|
||||
If your training program uses GPUs, you should ensure that your code only
|
||||
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:
|
||||
|
||||
Parsing the local_rank argument
|
||||
|
||||
::
|
||||
|
||||
>>> import argparse
|
||||
>>> parser = argparse.ArgumentParser()
|
||||
>>> parser.add_argument("--local_rank", type=int)
|
||||
>>> args = parser.parse_args()
|
||||
|
||||
Set your device to local rank using either
|
||||
|
||||
::
|
||||
|
||||
>>> torch.cuda.set_device(args.local_rank) # before your code runs
|
||||
|
||||
or
|
||||
|
||||
::
|
||||
|
||||
>>> with torch.cuda.device(args.local_rank):
|
||||
>>> # your code to run
|
||||
|
||||
3. In your training program, you are supposed to call the following function
|
||||
at the beginning to start the distributed backend. You need to make sure that
|
||||
the init_method uses ``env://``, which is the only supported ``init_method``
|
||||
by this module.
|
||||
|
||||
::
|
||||
|
||||
torch.distributed.init_process_group(backend='YOUR BACKEND',
|
||||
init_method='env://')
|
||||
|
||||
4. In your training program, you can either use regular distributed functions
|
||||
or use :func:`torch.nn.parallel.DistributedDataParallel` module. If your
|
||||
training program uses GPUs for training and you would like to use
|
||||
:func:`torch.nn.parallel.DistributedDataParallel` module,
|
||||
here is how to configure it.
|
||||
|
||||
::
|
||||
|
||||
model = torch.nn.parallel.DistributedDataParallel(model,
|
||||
device_ids=[args.local_rank],
|
||||
output_device=args.local_rank)
|
||||
|
||||
Please ensure that ``device_ids`` argument is set to be the only GPU device id
|
||||
that your code will be operating on. This is generally the local rank of the
|
||||
process. In other words, the ``device_ids`` needs to be ``[args.local_rank]``,
|
||||
and ``output_device`` needs to be ``args.local_rank`` in order to use this
|
||||
utility
|
||||
|
||||
.. warning::
|
||||
|
||||
``local_rank`` is NOT globally unique: it is only unique per process
|
||||
on a machine. Thus, don't use it to decide if you should, e.g.,
|
||||
write to a networked filesystem. See
|
||||
https://github.com/pytorch/pytorch/issues/12042 for an example of
|
||||
how things can go wrong if you don't do this correctly.
|
||||
|
||||
"""
|
||||
|
||||
|
||||
import sys
|
||||
import subprocess
|
||||
import os
|
||||
import socket
|
||||
from argparse import ArgumentParser, REMAINDER
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""
|
||||
Helper function parsing the command line options
|
||||
@retval argparse.Namespace with the parsed command line options
|
||||
"""
|
||||
parser = ArgumentParser(description="PyTorch distributed training launch "
|
||||
"helper utility that will spawn up "
|
||||
"multiple distributed processes")
|
||||
|
||||
# Optional arguments for the launch helper
|
||||
parser.add_argument("--nnodes", type=int, default=1,
|
||||
help="The number of nodes to use for distributed "
|
||||
"training")
|
||||
parser.add_argument("--node_rank", type=int, default=0,
|
||||
help="The rank of the node for multi-node distributed "
|
||||
"training")
|
||||
parser.add_argument("--nproc_per_node", type=int, default=None,
|
||||
help="The number of processes to launch on each node, "
|
||||
"for GPU training, this is recommended to be set "
|
||||
"to the number of GPUs in your system so that "
|
||||
"each process can be bound to a single GPU.")
|
||||
parser.add_argument("--master_addr", default="127.0.0.1", type=str,
|
||||
help="Master node (rank 0)'s address, should be either "
|
||||
"the IP address or the hostname of node 0, for "
|
||||
"single node multi-proc training, the "
|
||||
"--master_addr can simply be 127.0.0.1")
|
||||
parser.add_argument("--master_port", default=29500, type=int,
|
||||
help="Master node (rank 0)'s free port that needs to "
|
||||
"be used for communication during distributed "
|
||||
"training")
|
||||
|
||||
# positional
|
||||
parser.add_argument("training_script", type=str,
|
||||
help="The full path to the single GPU training "
|
||||
"program/script to be launched in parallel, "
|
||||
"followed by all the arguments for the "
|
||||
"training script")
|
||||
|
||||
# rest from the training program
|
||||
parser.add_argument('training_script_args', nargs=REMAINDER)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def main():
|
||||
args = parse_args()
|
||||
|
||||
if args.nproc_per_node is None:
|
||||
args.nproc_per_node = torch.cuda.device_count()
|
||||
|
||||
# world size in terms of number of processes
|
||||
dist_world_size = args.nproc_per_node * args.nnodes
|
||||
|
||||
# set PyTorch distributed related environmental variables
|
||||
current_env = os.environ.copy()
|
||||
current_env["MASTER_ADDR"] = args.master_addr
|
||||
current_env["MASTER_PORT"] = str(args.master_port)
|
||||
current_env["WORLD_SIZE"] = str(dist_world_size)
|
||||
|
||||
processes = []
|
||||
|
||||
for local_rank in range(0, args.nproc_per_node):
|
||||
# each process's rank
|
||||
dist_rank = args.nproc_per_node * args.node_rank + local_rank
|
||||
current_env["RANK"] = str(dist_rank)
|
||||
|
||||
# spawn the processes
|
||||
cmd = [sys.executable,
|
||||
"-u",
|
||||
args.training_script,
|
||||
"--local_rank={}".format(local_rank)] + args.training_script_args
|
||||
|
||||
process = subprocess.Popen(cmd, env=current_env)
|
||||
processes.append(process)
|
||||
|
||||
returncode = 0
|
||||
try:
|
||||
for process in processes:
|
||||
process_returncode = process.wait()
|
||||
if process_returncode != 0:
|
||||
returncode = 1
|
||||
except KeyboardInterrupt:
|
||||
print('CTRL-C, TERMINATING WORKERS ...')
|
||||
for process in processes:
|
||||
process.terminate()
|
||||
for process in processes:
|
||||
process.wait()
|
||||
raise
|
||||
|
||||
sys.exit(returncode)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
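
The rank and environment bookkeeping performed in `main()` above can be sketched in isolation. The helper name `worker_environments` is hypothetical; it only mirrors the arithmetic of the launcher and does not spawn any processes.

```python
def worker_environments(nnodes, node_rank, nproc_per_node,
                        master_addr="127.0.0.1", master_port=29500):
    """Return the distributed-related environment variables that each
    local process on this node would receive."""
    # World size is counted in processes, not in nodes.
    world_size = nproc_per_node * nnodes
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            # Global rank: offset of this node plus the local index.
            "RANK": str(nproc_per_node * node_rank + local_rank),
        })
    return envs


# Second node (node_rank=1) of a two-node job with 8 processes per node.
for env in worker_environments(nnodes=2, node_rank=1, nproc_per_node=8):
    print(env["RANK"], env["WORLD_SIZE"])
```

Note that `local_rank` is passed to each worker as a command line argument (`--local_rank`), while the global rank is passed through the `RANK` environment variable.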
|
46
multiproc.py
|
@ -1,46 +0,0 @@
|
|||
import sys
|
||||
import subprocess
|
||||
|
||||
import torch
|
||||
|
||||
def main():
|
||||
argslist = list(sys.argv)[1:]
|
||||
world_size = torch.cuda.device_count()
|
||||
|
||||
if '--world-size' in argslist:
|
||||
argslist[argslist.index('--world-size') + 1] = str(world_size)
|
||||
else:
|
||||
argslist.append('--world-size')
|
||||
argslist.append(str(world_size))
|
||||
|
||||
workers = []
|
||||
|
||||
for i in range(world_size):
|
||||
if '--rank' in argslist:
|
||||
argslist[argslist.index('--rank') + 1] = str(i)
|
||||
else:
|
||||
argslist.append('--rank')
|
||||
argslist.append(str(i))
|
||||
stdout = None if i == 0 else subprocess.DEVNULL
|
||||
worker = subprocess.Popen([str(sys.executable)] + argslist, stdout=stdout)
|
||||
workers.append(worker)
|
||||
|
||||
returncode = 0
|
||||
try:
|
||||
for worker in workers:
|
||||
worker_returncode = worker.wait()
|
||||
if worker_returncode != 0:
|
||||
returncode = 1
|
||||
except KeyboardInterrupt:
|
||||
print('Pressed CTRL-C, TERMINATING')
|
||||
for worker in workers:
|
||||
worker.terminate()
|
||||
for worker in workers:
|
||||
worker.wait()
|
||||
raise
|
||||
|
||||
sys.exit(returncode)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
|
@ -1 +1,2 @@
|
|||
sacrebleu==1.2.10
|
||||
git+git://github.com/NVIDIA/apex.git#egg=apex
|
||||
|
|
|
@ -1,101 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
set -e
|
||||
|
||||
DATASET_DIR='data/wmt16_de_en'
|
||||
RESULTS_DIR='gnmt_wmt16'
|
||||
|
||||
# sort by length (ascending)
|
||||
cat ${DATASET_DIR}/newstest2014.tok.bpe.32000.en \
|
||||
| awk '{ print length, $0 }' \
|
||||
| sort -n -s \
|
||||
| cut -d" " -f2- > /tmp/newstest2014.tok.bpe.32000.en.sorted
|
||||
|
||||
batches=(512 256 128 64 32)
|
||||
beams=(1 2 5 10)
|
||||
maths=(fp16 fp32)
|
||||
|
||||
model=results/${RESULTS_DIR}/model_best.pth
|
||||
|
||||
odir=results/inference_benchmark
|
||||
mkdir -p $odir
|
||||
|
||||
echo RUNNING on unsorted dataset
|
||||
rm -rf $odir/fp16_perf_unsorted.log
|
||||
rm -rf $odir/fp32_perf_unsorted.log
|
||||
rm -rf $odir/fp16_bleu.log
|
||||
rm -rf $odir/fp32_bleu.log
|
||||
ifile=${DATASET_DIR}/newstest2014.tok.bpe.32000.en
|
||||
rfile=${DATASET_DIR}/newstest2014.de
|
||||
|
||||
for math in "${maths[@]}"
|
||||
do
|
||||
for batch in "${batches[@]}"
|
||||
do
|
||||
for beam in "${beams[@]}"
|
||||
do
|
||||
echo RUNNING: batch_size: $batch, beam_size: $beam, math: $math
|
||||
|
||||
# run translation
|
||||
python3 translate.py \
|
||||
-i $ifile \
|
||||
-r $rfile \
|
||||
-m $model \
|
||||
--math $math \
|
||||
--print-freq 1 \
|
||||
--beam-size $beam \
|
||||
--batch-size $batch \
|
||||
-o /tmp/output.tok &> /tmp/log.log
|
||||
|
||||
tok_per_sec=`cat /tmp/log.log \
|
||||
|grep "Avg total tokens" \
|
||||
|cut -f 2 \
|
||||
|cut -d ':' -f 2 |tr -d ' '`
|
||||
|
||||
bleu=`cat /tmp/log.log \
|
||||
|grep BLEU \
|
||||
|cut -d ':' -f 2 |tr -d ' '`
|
||||
|
||||
echo -e $tok_per_sec '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_perf_unsorted.log
|
||||
echo -e $bleu '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_bleu.log
|
||||
echo Tokens per second: $tok_per_sec, BLEU: $bleu
|
||||
done
|
||||
done
|
||||
done
|
||||
|
||||
|
||||
echo RUNNING on sorted dataset
|
||||
rm -rf $odir/fp16_perf_sorted.log
|
||||
rm -rf $odir/fp32_perf_sorted.log
|
||||
ifile=/tmp/newstest2014.tok.bpe.32000.en.sorted
|
||||
|
||||
|
||||
for math in "${maths[@]}"
|
||||
do
|
||||
for batch in "${batches[@]}"
|
||||
do
|
||||
for beam in "${beams[@]}"
|
||||
do
|
||||
echo RUNNING: batch_size: $batch, beam_size: $beam, math: $math
|
||||
|
||||
# run translation
|
||||
python3 translate.py \
|
||||
-i $ifile \
|
||||
-m $model \
|
||||
--math $math \
|
||||
--print-freq 1 \
|
||||
--beam-size $beam \
|
||||
--batch-size $batch \
|
||||
--no-bleu \
|
||||
-o /tmp/output.tok &> /tmp/log.log
|
||||
|
||||
tok_per_sec=`cat /tmp/log.log \
|
||||
|grep "Avg total tokens" \
|
||||
|cut -f 2 \
|
||||
|cut -d ':' -f 2 |tr -d ' '`
|
||||
|
||||
echo -e $tok_per_sec '\t\t batch: '$batch 'beam: ' $beam >> $odir/${math}_perf_sorted.log
|
||||
echo Tokens per second: $tok_per_sec
|
||||
done
|
||||
done
|
||||
done
|
|
@ -1,28 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
DATASET_DIR='data/wmt16_de_en'
|
||||
|
||||
batches=(128)
|
||||
maths=(fp16 fp32)
|
||||
gpus=(1 2 4 8)
|
||||
|
||||
for math in "${maths[@]}"
|
||||
do
|
||||
for batch in "${batches[@]}"
|
||||
do
|
||||
for gpu in "${gpus[@]}"
|
||||
do
|
||||
export CUDA_VISIBLE_DEVICES=`seq -s "," 0 $((gpu - 1))`
|
||||
python3 -m multiproc train.py \
|
||||
--save benchmark_gpu_${gpu}_math_${math}_batch_${batch} \
|
||||
--dataset-dir ${DATASET_DIR} \
|
||||
--seed 1 \
|
||||
--epochs 1 \
|
||||
--math ${math} \
|
||||
--print-freq 1 \
|
||||
--batch-size ${batch} \
|
||||
--disable-eval \
|
||||
--max-size $((512 * ${batch} * ${gpu}))
|
||||
done
|
||||
done
|
||||
done
|
|
@ -1,6 +1,7 @@
|
|||
import argparse
|
||||
from collections import Counter
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser(description='Clean dataset')
|
||||
parser.add_argument('-f1', '--file1', help='file1')
|
||||
|
@ -12,6 +13,7 @@ def save_output(fname, data):
|
|||
with open(fname, 'w') as f:
|
||||
f.writelines(data)
|
||||
|
||||
|
||||
def main():
|
||||
"""
|
||||
Discards all pairs of sentences which can't be decoded by latin-1 encoder.
|
||||
|
|
|
@ -1,46 +0,0 @@
|
|||
#!/bin/bash
|
||||
|
||||
batches=(128)
|
||||
maths=(fp16 fp32)
|
||||
gpus=(1 2 4 8)
|
||||
|
||||
sentences=3498161
|
||||
|
||||
echo -e [parameters] "\t\t\t" [tokens / s] [second per epoch]
|
||||
|
||||
for batch in "${batches[@]}"
|
||||
do
|
||||
for math in "${maths[@]}"
|
||||
do
|
||||
for gpu in "${gpus[@]}"
|
||||
do
|
||||
dir=results/benchmark_gpu_${gpu}_math_${math}_batch_${batch}/
|
||||
if [ ! -d $dir ]; then
|
||||
echo Directory $dir does not exist
|
||||
continue
|
||||
fi
|
||||
|
||||
total_tokens_per_s=0
|
||||
for gpu_id in `seq 0 $((gpu - 1))`
|
||||
do
|
||||
tokens_per_s=`cat ${dir}/log_gpu_${gpu_id}.log \
|
||||
|grep TRAIN \
|
||||
|cut -f 4 \
|
||||
|sed -E -n 's/.*\(([0-9]+)\).*/\1/p' \
|
||||
|tail -n 1`
|
||||
total_tokens_per_s=$((total_tokens_per_s + tokens_per_s))
|
||||
done
|
||||
|
||||
batch_time=`cat ${dir}/log_gpu_0.log \
|
||||
|grep TRAIN \
|
||||
|cut -f 2 \
|
||||
|sed -E -n 's/.*\(([.0-9]+)\).*/\1/p' \
|
||||
|tail -n 1`
|
||||
|
||||
n_batches=$(( $sentences / ($batch * $gpu)))
|
||||
epoch_time=`awk "BEGIN {print $n_batches * $batch_time}"`
|
||||
|
||||
echo -e math: $math batch: $batch gpus: $gpu "\t\t" $total_tokens_per_s "\t" $epoch_time
|
||||
done
|
||||
done
|
||||
done
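The epoch-time estimate computed by the removed benchmark script above can be restated as a small Python sketch. The function name `estimate_epoch_time` and the sample `batch_time` value are illustrative assumptions; only the arithmetic (integer division of the sentence count by the global batch, multiplied by the last reported batch time) is taken from the script.

```python
def estimate_epoch_time(sentences, batch_size, num_gpus, batch_time):
    """Estimate seconds per epoch from the time of a single training step.

    Mirrors the shell arithmetic: n_batches = sentences / (batch * gpu),
    using integer division, then epoch_time = n_batches * batch_time.
    """
    n_batches = sentences // (batch_size * num_gpus)
    return n_batches * batch_time


if __name__ == "__main__":
    # 3498161 is the sentence count hard-coded in the script;
    # batch_time=0.35 is a made-up value used only for illustration.
    print(estimate_epoch_time(3498161, 128, 8, batch_time=0.35))
```

Because of the integer division, the final partial batch is ignored, so the estimate is slightly optimistic.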
|
|
@ -1,6 +1,14 @@
|
|||
fp16,1,Tesla V100-SXM2-16GB,42337
|
||||
fp16,4,Tesla V100-SXM2-16GB,153433
|
||||
fp16,8,Tesla V100-SXM2-16GB,300181
|
||||
fp32,1,Tesla V100-SXM2-16GB,18581
|
||||
fp32,4,Tesla V100-SXM2-16GB,67586
|
||||
fp32,8,Tesla V100-SXM2-16GB,132734
|
||||
fp16,1,Tesla V100-SXM2-16GB,66050
|
||||
fp16,4,Tesla V100-SXM2-16GB,196174
|
||||
fp16,8,Tesla V100-SXM2-16GB,387282
|
||||
fp32,1,Tesla V100-SXM2-16GB,21346
|
||||
fp32,4,Tesla V100-SXM2-16GB,76083
|
||||
fp32,8,Tesla V100-SXM2-16GB,153697
|
||||
fp16,1,Tesla V100-SXM3-32GB,65830
|
||||
fp16,4,Tesla V100-SXM3-32GB,200886
|
||||
fp16,8,Tesla V100-SXM3-32GB,362612
|
||||
fp16,16,Tesla V100-SXM3-32GB,738521
|
||||
fp32,1,Tesla V100-SXM3-32GB,22695
|
||||
fp32,4,Tesla V100-SXM3-32GB,81224
|
||||
fp32,8,Tesla V100-SXM3-32GB,156536
|
||||
fp32,16,Tesla V100-SXM3-32GB,314831
|
||||
|
|
|
@ -3,81 +3,35 @@
|
|||
set -e
|
||||
|
||||
DATASET_DIR='data/wmt16_de_en'
|
||||
RESULTS_DIR='gnmt_wmt16_test'
|
||||
REFERENCE_FILE=scripts/tests/reference_performance
|
||||
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
|
||||
REPO_DIR='/workspace/gnmt'
|
||||
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
|
||||
|
||||
REFERENCE_ACCURACY=17.2
|
||||
MATH='fp16'
|
||||
PERFORMANCE_TOLERANCE=0.9
|
||||
|
||||
python3 -m multiproc train.py \
|
||||
--save ${RESULTS_DIR} \
|
||||
--dataset-dir ${DATASET_DIR} \
|
||||
--seed 1 \
|
||||
--epochs 1 \
|
||||
--math ${MATH} \
|
||||
--print-freq 10 \
|
||||
--batch-size 128 \
|
||||
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
|
||||
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
|
||||
PERF_TOLERANCE=0.9
|
||||
|
||||
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
|
||||
echo 'GPU_NAME:' ${GPU_NAME}
|
||||
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
|
||||
echo 'GPU_COUNT:' ${GPU_COUNT}
|
||||
|
||||
# Accuracy test
|
||||
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|
||||
|grep Summary \
|
||||
|tail -n 1 \
|
||||
|cut -f 4 \
|
||||
|egrep -o [0-9.]+`
|
||||
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
|
||||
${REFERENCE_FILE} | \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
|
||||
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
|
||||
|
||||
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )); then
|
||||
echo "&&&& ACCURACY TEST PASSED"
|
||||
else
|
||||
echo "&&&& ACCURACY TEST FAILED"
|
||||
fi
|
||||
|
||||
# Performance test
|
||||
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|
||||
|grep Performance \
|
||||
|tail -n 1 \
|
||||
|cut -f 2 \
|
||||
|egrep -o [0-9.]+`
|
||||
|
||||
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
|
||||
| \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
|
||||
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
|
||||
|
||||
PERFORMANCE_TEST_RESULT=1
|
||||
|
||||
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
|
||||
if [ -z "${REFERENCE_PERF}" ]; then
|
||||
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
|
||||
echo "&&&& PERFORMANCE TEST WAIVED"
|
||||
TARGET_PERF=''
|
||||
else
|
||||
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
|
||||
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
|
||||
|
||||
if (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PERFORMANCE TEST PASSED"
|
||||
else
|
||||
echo "&&&& PERFORMANCE TEST FAILED"
|
||||
fi
|
||||
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
|
||||
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
|
||||
fi
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PASSED"
|
||||
exit 0
|
||||
else
|
||||
echo "&&&& FAILED"
|
||||
exit 1
|
||||
fi
|
||||
cd $REPO_DIR
|
||||
|
||||
python3 -m launch train.py \
|
||||
--dataset-dir $DATASET_DIR \
|
||||
--seed 1 \
|
||||
--epochs 1 \
|
||||
--remain-steps 1.0 \
|
||||
--target-bleu 17.20 \
|
||||
--math ${MATH} \
|
||||
${TARGET_PERF}
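The new test flow looks up a reference throughput by `math,gpu_count,gpu_name`, scales it by `PERF_TOLERANCE`, and passes the result to training as `--target-perf`. A minimal Python sketch of that lookup follows, under stated assumptions: the helper names `find_reference_perf` and `target_perf_args` are hypothetical, and the match is exact per field, whereas the script uses a substring `grep`.

```python
import csv
import io

# Two rows copied from the reference_performance data shown in this diff.
REFERENCE = (
    "fp16,8,Tesla V100-SXM2-16GB,387282\n"
    "fp32,8,Tesla V100-SXM2-16GB,153697\n"
)


def find_reference_perf(reference_csv, math, gpu_count, gpu_name):
    """Return the reference throughput for a config, or None if it is absent."""
    for row in csv.reader(io.StringIO(reference_csv)):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        row_math, row_gpus, row_name, row_perf = row
        if (row_math, row_gpus, row_name) == (math, str(gpu_count), gpu_name):
            return float(row_perf)
    return None


def target_perf_args(reference_perf, tolerance=0.9):
    """Build the optional '--target-perf' arguments, mirroring TARGET_PERF."""
    if reference_perf is None:
        return []  # no reference for this config: the perf check is skipped
    return ['--target-perf', str(reference_perf * tolerance)]


if __name__ == "__main__":
    ref = find_reference_perf(REFERENCE, 'fp16', 8, 'Tesla V100-SXM2-16GB')
    print(target_perf_args(ref))
```

Returning an empty argument list when no reference row exists corresponds to the `TARGET_PERF=''` branch of the script.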
|
||||
|
|
|
@ -3,81 +3,35 @@
|
|||
set -e
|
||||
|
||||
DATASET_DIR='data/wmt16_de_en'
|
||||
RESULTS_DIR='gnmt_wmt16_test'
|
||||
REFERENCE_FILE=scripts/tests/reference_performance
|
||||
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
|
||||
REPO_DIR='/workspace/gnmt'
|
||||
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
|
||||
|
||||
REFERENCE_ACCURACY=17.2
|
||||
MATH='fp32'
|
||||
PERFORMANCE_TOLERANCE=0.9
|
||||
|
||||
python3 -m multiproc train.py \
|
||||
--save ${RESULTS_DIR} \
|
||||
--dataset-dir ${DATASET_DIR} \
|
||||
--seed 1 \
|
||||
--epochs 1 \
|
||||
--math ${MATH} \
|
||||
--print-freq 10 \
|
||||
--batch-size 128 \
|
||||
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
|
||||
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
|
||||
PERF_TOLERANCE=0.9
|
||||
|
||||
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
|
||||
echo 'GPU_NAME:' ${GPU_NAME}
|
||||
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
|
||||
echo 'GPU_COUNT:' ${GPU_COUNT}
|
||||
|
||||
# Accuracy test
|
||||
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|
||||
|grep Summary \
|
||||
|tail -n 1 \
|
||||
|cut -f 4 \
|
||||
|egrep -o [0-9.]+`
|
||||
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
|
||||
${REFERENCE_FILE} | \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
|
||||
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
|
||||
|
||||
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )); then
|
||||
echo "&&&& ACCURACY TEST PASSED"
|
||||
else
|
||||
echo "&&&& ACCURACY TEST FAILED"
|
||||
fi
|
||||
|
||||
# Performance test
|
||||
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|
||||
|grep Performance \
|
||||
|tail -n 1 \
|
||||
|cut -f 2 \
|
||||
|egrep -o [0-9.]+`
|
||||
|
||||
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
|
||||
| \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
|
||||
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
|
||||
|
||||
PERFORMANCE_TEST_RESULT=1
|
||||
|
||||
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
|
||||
if [ -z "${REFERENCE_PERF}" ]; then
|
||||
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
|
||||
echo "&&&& PERFORMANCE TEST WAIVED"
|
||||
TARGET_PERF=''
|
||||
else
|
||||
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
|
||||
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
|
||||
|
||||
if (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PERFORMANCE TEST PASSED"
|
||||
else
|
||||
echo "&&&& PERFORMANCE TEST FAILED"
|
||||
fi
|
||||
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
|
||||
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
|
||||
fi
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PASSED"
|
||||
exit 0
|
||||
else
|
||||
echo "&&&& FAILED"
|
||||
exit 1
|
||||
fi
|
||||
cd $REPO_DIR
|
||||
|
||||
python3 -m launch train.py \
|
||||
--dataset-dir $DATASET_DIR \
|
||||
--seed 1 \
|
||||
--epochs 1 \
|
||||
--remain-steps 1.0 \
|
||||
--target-bleu 17.20 \
|
||||
--math ${MATH} \
|
||||
${TARGET_PERF}
|
||||
|
|
|
@ -3,81 +3,34 @@
|
|||
set -e
|
||||
|
||||
DATASET_DIR='data/wmt16_de_en'
|
||||
RESULTS_DIR='gnmt_wmt16_test'
|
||||
REFERENCE_FILE=scripts/tests/reference_performance
|
||||
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
|
||||
REPO_DIR='/workspace/gnmt'
|
||||
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
|
||||
|
||||
REFERENCE_ACCURACY=22.0
|
||||
MATH='fp16'
|
||||
PERFORMANCE_TOLERANCE=0.9
|
||||
|
||||
python3 -m multiproc train.py \
|
||||
--save ${RESULTS_DIR} \
|
||||
--dataset-dir ${DATASET_DIR} \
|
||||
--seed 1 \
|
||||
--epochs 6 \
|
||||
--math ${MATH} \
|
||||
--print-freq 10 \
|
||||
--batch-size 128 \
|
||||
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
|
||||
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
|
||||
PERF_TOLERANCE=0.9
|
||||
|
||||
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
|
||||
echo 'GPU_NAME:' ${GPU_NAME}
|
||||
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
|
||||
echo 'GPU_COUNT:' ${GPU_COUNT}
|
||||
|
||||
# Accuracy test
|
||||
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|
||||
|grep Summary \
|
||||
|tail -n 1 \
|
||||
|cut -f 4 \
|
||||
|egrep -o [0-9.]+`
|
||||
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
|
||||
${REFERENCE_FILE} | \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
|
||||
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
|
||||
|
||||
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )); then
|
||||
echo "&&&& ACCURACY TEST PASSED"
|
||||
else
|
||||
echo "&&&& ACCURACY TEST FAILED"
|
||||
fi
|
||||
|
||||
# Performance test
|
||||
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|
||||
|grep Performance \
|
||||
|tail -n 1 \
|
||||
|cut -f 2 \
|
||||
|egrep -o [0-9.]+`
|
||||
|
||||
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
|
||||
| \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
|
||||
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
|
||||
|
||||
PERFORMANCE_TEST_RESULT=1
|
||||
|
||||
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
|
||||
if [ -z "${REFERENCE_PERF}" ]; then
|
||||
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
|
||||
echo "&&&& PERFORMANCE TEST WAIVED"
|
||||
TARGET_PERF=''
|
||||
else
|
||||
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
|
||||
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
|
||||
|
||||
if (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PERFORMANCE TEST PASSED"
|
||||
else
|
||||
echo "&&&& PERFORMANCE TEST FAILED"
|
||||
fi
|
||||
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
|
||||
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
|
||||
fi
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PASSED"
|
||||
exit 0
|
||||
else
|
||||
echo "&&&& FAILED"
|
||||
exit 1
|
||||
fi
|
||||
cd $REPO_DIR
|
||||
|
||||
python3 -m launch train.py \
|
||||
--dataset-dir $DATASET_DIR \
|
||||
--seed 1 \
|
||||
--epochs 6 \
|
||||
--target-bleu 22.00 \
|
||||
--math ${MATH} \
|
||||
${TARGET_PERF}
|
||||
|
|
|
@ -3,81 +3,34 @@
|
|||
set -e
|
||||
|
||||
DATASET_DIR='data/wmt16_de_en'
|
||||
RESULTS_DIR='gnmt_wmt16_test'
|
||||
REFERENCE_FILE=scripts/tests/reference_performance
|
||||
LOGFILE=results/${RESULTS_DIR}/log_gpu_0.log
|
||||
REPO_DIR='/workspace/gnmt'
|
||||
REFERENCE_FILE=$REPO_DIR/scripts/tests/reference_performance
|
||||
|
||||
REFERENCE_ACCURACY=22.0
|
||||
MATH='fp32'
|
||||
PERFORMANCE_TOLERANCE=0.9
|
||||
|
||||
python3 -m multiproc train.py \
|
||||
--save ${RESULTS_DIR} \
|
||||
--dataset-dir ${DATASET_DIR} \
|
||||
--seed 1 \
|
||||
--epochs 6 \
|
||||
--math ${MATH} \
|
||||
--print-freq 10 \
|
||||
--batch-size 128 \
|
||||
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
|
||||
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
|
||||
PERF_TOLERANCE=0.9
|
||||
|
||||
GPU_NAME=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |uniq`
|
||||
echo 'GPU_NAME:' ${GPU_NAME}
|
||||
GPU_COUNT=`nvidia-smi --query-gpu=gpu_name --format=csv,noheader |wc -l`
|
||||
echo 'GPU_COUNT:' ${GPU_COUNT}
|
||||
|
||||
# Accuracy test
|
||||
ACHIEVED_ACCURACY=`cat ${LOGFILE} \
|
||||
|grep Summary \
|
||||
|tail -n 1 \
|
||||
|cut -f 4 \
|
||||
|egrep -o [0-9.]+`
|
||||
REFERENCE_PERF=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" \
|
||||
${REFERENCE_FILE} | \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_ACCURACY:' ${REFERENCE_ACCURACY}
|
||||
echo 'ACHIEVED_ACCURACY:' ${ACHIEVED_ACCURACY}
|
||||
|
||||
ACCURACY_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_ACCURACY}' >= '${REFERENCE_ACCURACY}')}')
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )); then
|
||||
echo "&&&& ACCURACY TEST PASSED"
|
||||
else
|
||||
echo "&&&& ACCURACY TEST FAILED"
|
||||
fi
|
||||
|
||||
# Performance test
|
||||
ACHIEVED_PERFORMANCE=`cat ${LOGFILE} \
|
||||
|grep Performance \
|
||||
|tail -n 1 \
|
||||
|cut -f 2 \
|
||||
|egrep -o [0-9.]+`
|
||||
|
||||
REFERENCE_PERFORMANCE=`grep "${MATH},${GPU_COUNT},${GPU_NAME}" ${REFERENCE_FILE} \
|
||||
| \cut -f 4 -d ','`
|
||||
|
||||
echo 'REFERENCE_PERFORMANCE:' ${REFERENCE_PERFORMANCE}
|
||||
echo 'ACHIEVED_PERFORMANCE:' ${ACHIEVED_PERFORMANCE}
|
||||
|
||||
PERFORMANCE_TEST_RESULT=1
|
||||
|
||||
if [ -z "${REFERENCE_PERFORMANCE}" ]; then
|
||||
if [ -z "${REFERENCE_PERF}" ]; then
|
||||
echo "WARNING: COULD NOT FIND REFERENCE PERFORMANCE FOR EXECUTED CONFIG"
|
||||
echo "&&&& PERFORMANCE TEST WAIVED"
|
||||
TARGET_PERF=''
|
||||
else
|
||||
PERFORMANCE_TEST_RESULT=$(awk 'BEGIN {print ('${ACHIEVED_PERFORMANCE}' >= \
|
||||
('${REFERENCE_PERFORMANCE}' * '${PERFORMANCE_TOLERANCE}'))}')
|
||||
|
||||
if (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PERFORMANCE TEST PASSED"
|
||||
else
|
||||
echo "&&&& PERFORMANCE TEST FAILED"
|
||||
fi
|
||||
PERF_THRESHOLD=$(awk 'BEGIN {print ('${REFERENCE_PERF}' * '${PERF_TOLERANCE}')}')
|
||||
TARGET_PERF='--target-perf '${PERF_THRESHOLD}
|
||||
fi
|
||||
|
||||
if (( ${ACCURACY_TEST_RESULT} )) && (( ${PERFORMANCE_TEST_RESULT} )); then
|
||||
echo "&&&& PASSED"
|
||||
exit 0
|
||||
else
|
||||
echo "&&&& FAILED"
|
||||
exit 1
|
||||
fi
|
||||
cd $REPO_DIR
|
||||
|
||||
python3 -m launch train.py \
|
||||
--dataset-dir $DATASET_DIR \
|
||||
--seed 1 \
|
||||
--epochs 6 \
|
||||
--target-bleu 22.00 \
|
||||
--math ${MATH} \
|
||||
${TARGET_PERF}
|
||||
|
|
|
@ -2,17 +2,4 @@
|
|||
|
||||
set -e
|
||||
|
||||
DATASET_DIR='data/wmt16_de_en'
|
||||
RESULTS_DIR='gnmt_wmt16'
|
||||
|
||||
# run training
|
||||
python3 -m multiproc train.py \
|
||||
--save ${RESULTS_DIR} \
|
||||
--dataset-dir ${DATASET_DIR} \
|
||||
--seed 1 \
|
||||
--epochs 6 \
|
||||
--math fp16 \
|
||||
--print-freq 10 \
|
||||
--batch-size 128 \
|
||||
--model-config "{'num_layers': 4, 'hidden_size': 1024, 'dropout':0.2, 'share_embedding': True}" \
|
||||
--optimization-config "{'optimizer': 'Adam', 'lr': 5e-4}"
|
||||
python3 -m launch train.py
|
||||
|
|
|
@ -1,20 +1,23 @@
|
|||
import logging
|
||||
from operator import itemgetter
|
||||
|
||||
import torch
|
||||
from torch.utils.data import Dataset
|
||||
from torch.utils.data.sampler import SequentialSampler
|
||||
from torch.utils.data import DataLoader
|
||||
from torch.utils.data import Dataset
|
||||
|
||||
import seq2seq.data.config as config
|
||||
from seq2seq.data.sampler import BucketingSampler
|
||||
from seq2seq.data.sampler import DistributedSampler
|
||||
from seq2seq.data.sampler import ShardingSampler
|
||||
from seq2seq.data.sampler import StaticDistributedSampler
|
||||
|
||||
|
||||
def build_collate_fn(batch_first=False, parallel=True, sort=False):
|
||||
"""
|
||||
Factor for collate_fn functions.
|
||||
Factory for collate_fn functions.
|
||||
|
||||
:param batch_first: if True returns batches in (batch, seq) format, if not
|
||||
returns in (seq, batch) format
|
||||
:param batch_first: if True returns batches in (batch, seq) format, if
|
||||
False returns in (seq, batch) format
|
||||
:param parallel: if True builds batches from parallel corpus (src, tgt)
|
||||
:param sort: if True sorts by src sequence length within each batch
|
||||
"""
|
||||
|
@ -49,8 +52,8 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
|
|||
"""
|
||||
src_seqs, tgt_seqs = zip(*seqs)
|
||||
if sort:
|
||||
key = lambda item: len(item[1])
|
||||
indices, src_seqs = zip(*sorted(enumerate(src_seqs), key=key,
|
||||
indices, src_seqs = zip(*sorted(enumerate(src_seqs),
|
||||
key=lambda item: len(item[1]),
|
||||
reverse=True))
|
||||
tgt_seqs = [tgt_seqs[idx] for idx in indices]
|
||||
|
||||
|
@ -64,8 +67,8 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
|
|||
:param src_seqs: source sequences
|
||||
"""
|
||||
if sort:
|
||||
key = lambda item: len(item[1])
|
||||
indices, src_seqs = zip(*sorted(enumerate(src_seqs), key=key,
|
||||
indices, src_seqs = zip(*sorted(enumerate(src_seqs),
|
||||
key=lambda item: len(item[1]),
|
||||
reverse=True))
|
||||
else:
|
||||
indices = range(len(src_seqs))
|
||||
|
@ -81,10 +84,22 @@ def build_collate_fn(batch_first=False, parallel=True, sort=False):
|
|||
class TextDataset(Dataset):
|
||||
def __init__(self, src_fname, tokenizer, min_len=None, max_len=None,
|
||||
sort=False, max_size=None):
|
||||
"""
|
||||
Constructor for the TextDataset. Builds monolingual dataset.
|
||||
|
||||
:param src_fname: path to the file with data
|
||||
:param tokenizer: tokenizer
|
||||
:param min_len: minimum sequence length
|
||||
:param max_len: maximum sequence length
|
||||
:param sort: sorts dataset by sequence length
|
||||
:param max_size: loads at most 'max_size' samples from the input file,
|
||||
if None loads the entire dataset
|
||||
"""
|
||||
|
||||
self.min_len = min_len
|
||||
self.max_len = max_len
|
||||
self.parallel = False
|
||||
self.sorted = False
|
||||
|
||||
self.src = self.process_data(src_fname, tokenizer, max_size)
|
||||
|
||||
|
@ -98,11 +113,35 @@ class TextDataset(Dataset):
|
|||
self.sort_by_length()
|
||||
|
||||
def sort_by_length(self):
|
||||
"""
|
||||
Sorts dataset by the sequence length.
|
||||
"""
|
||||
self.lengths, indices = self.lengths.sort(descending=True)
|
||||
|
||||
self.src = [self.src[idx] for idx in indices]
|
||||
self.indices = indices.tolist()
|
||||
self.sorted = True
|
||||
|
||||
def unsort(self, array):
|
||||
"""
|
||||
"Unsorts" given array (restores original order of elements before
|
||||
dataset was sorted by sequence length).
|
||||
|
||||
:param array: array to be "unsorted"
|
||||
"""
|
||||
if self.sorted:
|
||||
inverse = sorted(enumerate(self.indices), key=itemgetter(1))
|
||||
array = [array[i[0]] for i in inverse]
|
||||
return array
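The `unsort` method above inverts the permutation recorded by `sort_by_length`. The standalone sketch below shows the same idea on plain Python lists; the free functions and the sample data are illustrative assumptions, not part of the commit.

```python
from operator import itemgetter


def sort_by_length(seqs):
    """Sort sequences by descending length and record the permutation.

    indices[k] is the original position of the k-th element after sorting.
    """
    indices = sorted(range(len(seqs)), key=lambda i: len(seqs[i]),
                     reverse=True)
    return [seqs[i] for i in indices], indices


def unsort(array, indices):
    """Restore the original order of 'array' given the sort permutation."""
    # Pair each sorted position with its original index, then order the
    # pairs by original index to obtain the inverse permutation.
    inverse = sorted(enumerate(indices), key=itemgetter(1))
    return [array[pos] for pos, _ in inverse]


if __name__ == "__main__":
    data = ['aa', 'a', 'aaaa', 'aaa']
    sorted_data, perm = sort_by_length(data)
    print(sorted_data, perm)
    print(unsort(sorted_data, perm))
```

This is what allows translations produced from a length-sorted dataset to be written out in the order of the input file.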
|
||||
|
||||
def filter_data(self, min_len, max_len):
|
||||
"""
|
||||
Preserves only samples which satisfy the following inequality:
|
||||
min_len <= sample sequence length <= max_len
|
||||
|
||||
:param min_len: minimum sequence length
|
||||
:param max_len: maximum sequence length
|
||||
"""
|
||||
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
|
||||
|
||||
initial_len = len(self.src)
|
||||
|
@ -116,6 +155,14 @@ class TextDataset(Dataset):
|
|||
logging.info(f'Pairs before: {initial_len}, after: {filtered_len}')
|
||||
|
||||
def process_data(self, fname, tokenizer, max_size):
|
||||
"""
|
||||
Loads data from the input file.
|
||||
|
||||
:param fname: input file name
|
||||
:param tokenizer: tokenizer
|
||||
:param max_size: loads at most 'max_size' samples from the input file,
|
||||
if None loads the entire dataset
|
||||
"""
|
||||
logging.info(f'Processing data from {fname}')
|
||||
data = []
|
||||
with open(fname) as dfile:
|
||||
|
@ -133,33 +180,57 @@ class TextDataset(Dataset):
|
|||
def __getitem__(self, idx):
|
||||
return self.src[idx]
|
||||
|
||||
def get_loader(self, batch_size=1, shuffle=False, num_workers=0,
|
||||
batch_first=False, drop_last=False, bucketing=True):
|
||||
def get_loader(self, batch_size=1, seeds=None, shuffle=False,
|
||||
num_workers=0, batch_first=False, pad=False,
|
||||
batching=None, batching_opt={}):
|
||||
|
||||
collate_fn = build_collate_fn(batch_first, parallel=self.parallel,
|
||||
sort=True)
|
||||
|
||||
if shuffle:
|
||||
sampler = BucketingSampler(self, batch_size, bucketing)
|
||||
if batching == 'random':
|
||||
sampler = DistributedSampler(self, batch_size, seeds)
|
||||
elif batching == 'sharding':
|
||||
sampler = ShardingSampler(self, batch_size, seeds,
|
||||
batching_opt['shard_size'])
|
||||
elif batching == 'bucketing':
|
||||
sampler = BucketingSampler(self, batch_size, seeds,
|
||||
batching_opt['num_buckets'])
|
||||
else:
|
||||
raise NotImplementedError
|
||||
else:
|
||||
sampler = SequentialSampler(self)
|
||||
sampler = StaticDistributedSampler(self, batch_size, pad)
|
||||
|
||||
return DataLoader(self,
|
||||
batch_size=batch_size,
|
||||
collate_fn=collate_fn,
|
||||
sampler=sampler,
|
||||
num_workers=num_workers,
|
||||
pin_memory=False,
|
||||
drop_last=drop_last)
|
||||
pin_memory=True,
|
||||
drop_last=False)
|
||||
|
||||
|
||||
class ParallelDataset(TextDataset):
|
||||
def __init__(self, src_fname, tgt_fname, tokenizer,
|
||||
min_len, max_len, sort=False, max_size=None):
|
||||
"""
|
||||
Constructor for the ParallelDataset.
|
||||
Tokenization is done when the data is loaded from the disk.
|
||||
|
||||
:param src_fname: path to the file with src language data
|
||||
:param tgt_fname: path to the file with tgt language data
|
||||
:param tokenizer: tokenizer
|
||||
:param min_len: minimum sequence length
|
||||
:param max_len: maximum sequence length
|
||||
:param sort: sorts dataset by sequence length
|
||||
:param max_size: loads at most 'max_size' samples from the input file,
|
||||
if None loads the entire dataset
|
||||
"""
|
||||
|
||||
self.min_len = min_len
|
||||
self.max_len = max_len
|
||||
self.parallel = True
|
||||
self.sorted = False
|
||||
|
||||
self.src = self.process_data(src_fname, tokenizer, max_size)
|
||||
self.tgt = self.process_data(tgt_fname, tokenizer, max_size)
|
||||
|
@ -168,19 +239,37 @@ class ParallelDataset(TextDataset):
|
|||
self.filter_data(min_len, max_len)
|
||||
assert len(self.src) == len(self.tgt)
|
||||
|
||||
lengths = [len(s) + len(t) for (s, t) in zip(self.src, self.tgt)]
|
||||
self.lengths = torch.tensor(lengths)
|
||||
src_lengths = [len(s) for s in self.src]
|
||||
tgt_lengths = [len(t) for t in self.tgt]
|
||||
self.src_lengths = torch.tensor(src_lengths)
|
||||
self.tgt_lengths = torch.tensor(tgt_lengths)
|
||||
self.lengths = self.src_lengths + self.tgt_lengths
|
||||
|
||||
if sort:
|
||||
self.sort_by_length()
|
||||
|
||||
def sort_by_length(self):
|
||||
"""
|
||||
Sorts dataset by the sequence length.
|
||||
"""
|
||||
self.lengths, indices = self.lengths.sort(descending=True)
|
||||
|
||||
self.src = [self.src[idx] for idx in indices]
|
||||
self.tgt = [self.tgt[idx] for idx in indices]
|
||||
self.src_lengths = [self.src_lengths[idx] for idx in indices]
|
||||
self.tgt_lengths = [self.tgt_lengths[idx] for idx in indices]
|
||||
self.indices = indices.tolist()
|
||||
self.sorted = True
|
||||
|
||||
def filter_data(self, min_len, max_len):
|
||||
"""
|
||||
Preserves only samples which satisfy the following inequality:
|
||||
min_len <= src sample sequence length <= max_len AND
|
||||
min_len <= tgt sample sequence length <= max_len
|
||||
|
||||
:param min_len: minimum sequence length
|
||||
:param max_len: maximum sequence length
|
||||
"""
|
||||
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
|
||||
|
||||
initial_len = len(self.src)
|
||||
|
@ -199,3 +288,98 @@ class ParallelDataset(TextDataset):
|
|||
|
||||
def __getitem__(self, idx):
|
||||
return self.src[idx], self.tgt[idx]
|
||||
|
||||
|
||||
class LazyParallelDataset(TextDataset):
|
||||
def __init__(self, src_fname, tgt_fname, tokenizer,
|
||||
min_len, max_len, sort=False, max_size=None):
|
||||
"""
|
||||
Constructor for the LazyParallelDataset.
|
||||
Tokenization is done on the fly.
|
||||
|
||||
:param src_fname: path to the file with src language data
|
||||
:param tgt_fname: path to the file with tgt language data
|
||||
:param tokenizer: tokenizer
|
||||
:param min_len: minimum sequence length
|
||||
:param max_len: maximum sequence length
|
||||
:param sort: sorts dataset by sequence length
|
||||
:param max_size: loads at most 'max_size' samples from the input file,
|
||||
if None loads the entire dataset
|
||||
"""
|
||||
self.min_len = min_len
|
||||
self.max_len = max_len
|
||||
self.parallel = True
|
||||
self.sorted = False
|
||||
self.tokenizer = tokenizer
|
||||
|
||||
self.raw_src = self.process_raw_data(src_fname, max_size)
|
||||
self.raw_tgt = self.process_raw_data(tgt_fname, max_size)
|
||||
assert len(self.raw_src) == len(self.raw_tgt)
|
||||
|
||||
logging.info(f'Filtering data, min len: {min_len}, max len: {max_len}')
|
||||
# Subtracting 2 because EOS and BOS are added later during tokenization
|
||||
self.filter_raw_data(min_len - 2, max_len - 2)
|
||||
assert len(self.raw_src) == len(self.raw_tgt)
|
||||
|
||||
# Adding 2 because EOS and BOS are added later during tokenization
|
||||
src_lengths = [i + 2 for i in self.src_len]
|
||||
tgt_lengths = [i + 2 for i in self.tgt_len]
|
||||
self.src_lengths = torch.tensor(src_lengths)
|
||||
self.tgt_lengths = torch.tensor(tgt_lengths)
|
||||
self.lengths = self.src_lengths + self.tgt_lengths
|
||||
|
||||
def process_raw_data(self, fname, max_size):
|
||||
"""
|
||||
Loads data from the input file.
|
||||
|
||||
:param fname: input file name
|
||||
:param max_size: loads at most 'max_size' samples from the input file,
|
||||
if None loads the entire dataset
|
||||
"""
|
||||
logging.info(f'Processing data from {fname}')
|
||||
data = []
|
||||
with open(fname) as dfile:
|
||||
for idx, line in enumerate(dfile):
|
||||
if max_size and idx == max_size:
|
||||
break
|
||||
data.append(line)
|
||||
return data
|
||||
|
||||
def filter_raw_data(self, min_len, max_len):
|
||||
"""
|
||||
Preserves only samples which satisfy the following inequality:
|
||||
min_len <= src sample sequence length <= max_len AND
|
||||
min_len <= tgt sample sequence length <= max_len
|
||||
|
||||
:param min_len: minimum sequence length
|
||||
:param max_len: maximum sequence length
|
||||
"""
|
||||
initial_len = len(self.raw_src)
|
||||
filtered_src = []
|
||||
filtered_tgt = []
|
||||
filtered_src_len = []
|
||||
filtered_tgt_len = []
|
||||
for src, tgt in zip(self.raw_src, self.raw_tgt):
|
||||
src_len = src.count(' ') + 1
|
||||
tgt_len = tgt.count(' ') + 1
|
||||
if min_len <= src_len <= max_len and \
|
||||
min_len <= tgt_len <= max_len:
|
||||
filtered_src.append(src)
|
||||
filtered_tgt.append(tgt)
|
||||
filtered_src_len.append(src_len)
|
||||
filtered_tgt_len.append(tgt_len)
|
||||
|
||||
self.raw_src = filtered_src
|
||||
self.raw_tgt = filtered_tgt
|
||||
self.src_len = filtered_src_len
|
||||
self.tgt_len = filtered_tgt_len
|
||||
filtered_len = len(self.raw_src)
|
||||
logging.info(f'Pairs before: {initial_len}, after: {filtered_len}')
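`filter_raw_data` estimates sequence length from raw text by counting spaces, and the constructor shifts the bounds by 2 to account for the BOS and EOS tokens added later. A minimal sketch of that filtering, assuming tokens are separated by single spaces; the function names and sample sentences are hypothetical:

```python
def raw_length(line):
    """Number of space-separated tokens in a pre-tokenized line."""
    return line.count(' ') + 1


def filter_pairs(raw_src, raw_tgt, min_len, max_len, special_tokens=2):
    """Keep pairs whose final lengths fall within [min_len, max_len].

    Bounds are lowered by 'special_tokens' before comparing raw lengths,
    and the reported lengths are raised by the same amount, so they match
    the lengths after BOS and EOS are attached during tokenization.
    """
    lo, hi = min_len - special_tokens, max_len - special_tokens
    kept = []
    for src, tgt in zip(raw_src, raw_tgt):
        src_len, tgt_len = raw_length(src), raw_length(tgt)
        if lo <= src_len <= hi and lo <= tgt_len <= hi:
            kept.append((src, tgt,
                         src_len + special_tokens,
                         tgt_len + special_tokens))
    return kept


if __name__ == "__main__":
    src = ["a b c\n", "a\n"]
    tgt = ["x y\n", "x y z w v\n"]
    print(filter_pairs(src, tgt, min_len=4, max_len=6))
```

Note that an empty line still counts as length 1 under this estimate, since it contains no spaces.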
|
||||
|
||||
def __getitem__(self, idx):
|
||||
src = torch.tensor(self.tokenizer.segment(self.raw_src[idx]))
|
||||
tgt = torch.tensor(self.tokenizer.segment(self.raw_tgt[idx]))
|
||||
return src, tgt
|
||||
|
||||
def __len__(self):
|
||||
return len(self.raw_src)
|
||||
|
|
|
@ -1,23 +1,22 @@
|
|||
import logging
|
||||
|
||||
import torch
|
||||
from torch.utils.data.sampler import Sampler
|
||||
from seq2seq.utils import get_world_size, get_rank
|
||||
|
||||
from seq2seq.utils import get_rank
|
||||
from seq2seq.utils import get_world_size
|
||||
|
||||
|
||||
class BucketingSampler(Sampler):
|
||||
"""
|
||||
Distributed data sampler supporting bucketing by sequence length.
|
||||
"""
|
||||
def __init__(self, dataset, batch_size, bucketing=True, world_size=None,
|
||||
rank=None):
|
||||
class DistributedSampler(Sampler):
|
||||
def __init__(self, dataset, batch_size, seeds, world_size=None, rank=None):
|
||||
"""
|
||||
Constructor for the BucketingSampler.
|
||||
Constructor for the DistributedSampler.
|
||||
|
||||
:param dataset: dataset
|
||||
:param batch_size: batch size
|
||||
:param bucketing: if True enables bucketing by sequence length
|
||||
:param world_size: number of processes participating in distributed
|
||||
training
|
||||
:param rank: rank of the current process within world_size
|
||||
:param batch_size: local batch size
|
||||
:param seeds: list of seeds, one seed for each training epoch
|
||||
:param world_size: number of distributed workers
|
||||
:param rank: rank of the current process
|
||||
"""
|
||||
if world_size is None:
|
||||
world_size = get_world_size()
|
||||
|
@ -28,75 +27,251 @@ class BucketingSampler(Sampler):
|
|||
self.world_size = world_size
|
||||
self.rank = rank
|
||||
self.epoch = 0
|
||||
self.bucketing = bucketing
|
||||
self.seeds = seeds
|
||||
|
||||
self.batch_size = batch_size
|
||||
self.global_batch_size = batch_size * world_size
|
||||
|
||||
self.data_len = len(self.dataset)
|
||||
|
||||
self.num_samples = self.data_len // self.global_batch_size \
|
||||
* self.global_batch_size
|
||||
|
||||
def __iter__(self):
|
||||
# deterministically shuffle based on epoch
|
||||
g = torch.Generator()
|
||||
g.manual_seed(self.epoch)
|
||||
def init_rng(self):
|
||||
"""
|
||||
Creates new RNG, seed depends on current epoch idx.
|
||||
"""
|
||||
rng = torch.Generator()
|
||||
seed = self.seeds[self.epoch]
|
||||
logging.info(f'Sampler for epoch {self.epoch} uses seed {seed}')
|
||||
rng.manual_seed(seed)
|
||||
return rng
|
||||
|
||||
# generate permutation
|
||||
indices = torch.randperm(self.data_len, generator=g)
|
||||
# make indices evenly divisible by (batch_size * world_size)
|
||||
indices = indices[:self.num_samples]
|
||||
|
||||
# splits the dataset into chunks of 'batches_in_shard' global batches
|
||||
# each, sorts by (src + tgt) sequence length within each chunk,
|
||||
# reshuffles all global batches
|
||||
if self.bucketing:
|
||||
batches_in_shard = 80
|
||||
shard_size = self.global_batch_size * batches_in_shard
|
||||
nshards = (self.num_samples + shard_size - 1) // shard_size
|
||||
|
||||
lengths = self.dataset.lengths[indices]
|
||||
|
||||
shards = [indices[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
|
||||
len_shards = [lengths[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
|
||||
|
||||
indices = []
|
||||
for len_shard in len_shards:
|
||||
_, ind = len_shard.sort()
|
||||
indices.append(ind)
|
||||
|
||||
output = tuple(shard[idx] for shard, idx in zip(shards, indices))
|
||||
indices = torch.cat(output)
|
||||
|
||||
# global reshuffle
|
||||
indices = indices.view(-1, self.global_batch_size)
|
||||
order = torch.randperm(indices.shape[0], generator=g)
|
||||
indices = indices[order, :]
|
||||
indices = indices.view(-1)
|
||||
def distribute_batches(self, indices):
|
||||
"""
|
||||
Assigns batches to workers.
|
||||
Consecutive ranks are getting consecutive batches.
|
||||
|
||||
:param indices: torch.tensor with batch indices
|
||||
"""
|
||||
assert len(indices) == self.num_samples
|
||||
|
||||
# build indices for each individual worker
|
||||
# consecutive ranks are getting consecutive batches,
|
||||
# default pytorch DistributedSampler assigns strided batches
|
||||
# with offset = length / world_size
|
||||
indices = indices.view(-1, self.batch_size)
|
||||
indices = indices[self.rank::self.world_size].contiguous()
|
||||
indices = indices.view(-1)
|
||||
indices = indices.tolist()
|
||||
|
||||
assert len(indices) == self.num_samples // self.world_size
|
||||
return indices
|
||||
|
||||
def reshuffle_batches(self, indices, rng):
|
||||
"""
|
||||
Permutes global batches
|
||||
|
||||
:param indices: torch.tensor with batch indices
|
||||
:param rng: instance of torch.Generator
|
||||
"""
|
||||
indices = indices.view(-1, self.global_batch_size)
|
||||
num_batches = indices.shape[0]
|
||||
order = torch.randperm(num_batches, generator=rng)
|
||||
indices = indices[order, :]
|
||||
indices = indices.view(-1)
|
||||
return indices
|
||||
|
||||
def __iter__(self):
|
||||
rng = self.init_rng()
|
||||
# generate permutation
|
||||
indices = torch.randperm(self.data_len, generator=rng)
|
||||
|
||||
# make indices evenly divisible by (batch_size * world_size)
|
||||
indices = indices[:self.num_samples]
|
||||
|
||||
# assign batches to workers
|
||||
indices = self.distribute_batches(indices)
|
||||
return iter(indices)
|
||||
|
||||
def __len__(self):
|
||||
return self.num_samples // self.world_size
|
||||
|
||||
def set_epoch(self, epoch):
|
||||
"""
|
||||
Sets current epoch index. This value is used to seed RNGs in __iter__()
|
||||
function.
|
||||
Sets current epoch index.
|
||||
Epoch index is used to seed RNG in __iter__() function.
|
||||
|
||||
:param epoch: index of current epoch
|
||||
"""
|
||||
self.epoch = epoch
|
||||
|
||||
def __len__(self):
|
||||
return self.num_samples // self.world_size
|
||||
|
||||
|
||||
class ShardingSampler(DistributedSampler):
|
||||
def __init__(self, dataset, batch_size, seeds, shard_size,
|
||||
world_size=None, rank=None):
|
||||
"""
|
||||
Constructor for the ShardingSampler.
|
||||
|
||||
:param dataset: dataset
|
||||
:param batch_size: local batch size
|
||||
:param seeds: list of seeds, one seed for each training epoch
|
||||
:param shard_size: number of global batches within one shard
|
||||
:param world_size: number of distributed workers
|
||||
:param rank: rank of the current process
|
||||
"""
|
||||
|
||||
super().__init__(dataset, batch_size, seeds, world_size, rank)
|
||||
|
||||
self.shard_size = shard_size
|
||||
self.num_samples = self.data_len // self.global_batch_size \
|
||||
* self.global_batch_size
|
||||
|
||||
def __iter__(self):
|
||||
rng = self.init_rng()
|
||||
# generate permutation
|
||||
indices = torch.randperm(self.data_len, generator=rng)
|
||||
# make indices evenly divisible by (batch_size * world_size)
|
||||
indices = indices[:self.num_samples]
|
||||
|
||||
# splits the dataset into chunks of 'self.shard_size' global batches
|
||||
# each, sorts by (src + tgt) sequence length within each chunk,
|
||||
# reshuffles all global batches
|
||||
shard_size = self.global_batch_size * self.shard_size
|
||||
nshards = (self.num_samples + shard_size - 1) // shard_size
|
||||
|
||||
lengths = self.dataset.lengths[indices]
|
||||
|
||||
shards = [indices[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
|
||||
len_shards = [lengths[i * shard_size:(i+1) * shard_size] for i in range(nshards)]
|
||||
|
||||
# sort by (src + tgt) sequence length within each shard
|
||||
indices = []
|
||||
for len_shard in len_shards:
|
||||
_, ind = len_shard.sort()
|
||||
indices.append(ind)
|
||||
|
||||
output = tuple(shard[idx] for shard, idx in zip(shards, indices))
|
||||
|
||||
# build batches
|
||||
indices = torch.cat(output)
|
||||
# perform global reshuffle of all global batches
|
||||
indices = self.reshuffle_batches(indices, rng)
|
||||
# distribute batches to individual workers
|
||||
indices = self.distribute_batches(indices)
|
||||
return iter(indices)
|
||||
|
||||
|
||||
class BucketingSampler(DistributedSampler):
|
||||
def __init__(self, dataset, batch_size, seeds, num_buckets,
|
||||
world_size=None, rank=None):
|
||||
"""
|
||||
Constructor for the BucketingSampler.
|
||||
|
||||
:param dataset: dataset
|
||||
:param batch_size: local batch size
|
||||
:param seeds: list of seeds, one seed for each training epoch
|
||||
:param num_buckets: number of buckets
|
||||
:param world_size: number of distributed workers
|
||||
:param rank: rank of the current process
|
||||
"""
|
||||
|
||||
super().__init__(dataset, batch_size, seeds, world_size, rank)
|
||||
|
||||
self.num_buckets = num_buckets
|
||||
bucket_width = (dataset.max_len + num_buckets - 1) // num_buckets
|
||||
|
||||
# assign sentences to buckets based on src and tgt sequence lengths
|
||||
bucket_ids = torch.max(dataset.src_lengths // bucket_width,
|
||||
dataset.tgt_lengths // bucket_width)
|
||||
bucket_ids.clamp_(0, num_buckets - 1)
|
||||
|
||||
# build buckets
|
||||
all_indices = torch.tensor(range(self.data_len))
|
||||
self.buckets = []
|
||||
self.num_samples = 0
|
||||
global_bs = self.global_batch_size
|
||||
|
||||
for bid in range(num_buckets):
|
||||
# gather indices for current bucket
|
||||
indices = all_indices[bucket_ids == bid]
|
||||
self.buckets.append(indices)
|
||||
|
||||
# count number of samples in current bucket
|
||||
samples = len(indices) // global_bs * global_bs
|
||||
self.num_samples += samples
|
||||
|
||||
def __iter__(self):
|
||||
rng = self.init_rng()
|
||||
global_bs = self.global_batch_size
|
||||
|
||||
indices = []
|
||||
for bid in range(self.num_buckets):
|
||||
# random shuffle within current bucket
|
||||
perm = torch.randperm(len(self.buckets[bid]), generator=rng)
|
||||
bucket_indices = self.buckets[bid][perm]
|
||||
|
||||
# make bucket_indices evenly divisible by global batch size
|
||||
length = len(bucket_indices) // global_bs * global_bs
|
||||
bucket_indices = bucket_indices[:length]
|
||||
assert len(bucket_indices) % self.global_batch_size == 0
|
||||
|
||||
# add samples from current bucket to indices for current epoch
|
||||
indices.append(bucket_indices)
|
||||
|
||||
indices = torch.cat(indices)
|
||||
assert len(indices) % self.global_batch_size == 0
|
||||
|
||||
# perform global reshuffle of all global batches
|
||||
indices = self.reshuffle_batches(indices, rng)
|
||||
# distribute batches to individual workers
|
||||
indices = self.distribute_batches(indices)
|
||||
return iter(indices)
|
||||
|
||||
|
||||
class StaticDistributedSampler(Sampler):
|
||||
def __init__(self, dataset, batch_size, pad, world_size=None, rank=None):
|
||||
"""
|
||||
Constructor for the StaticDistributedSampler.
|
||||
|
||||
:param dataset: dataset
|
||||
:param batch_size: local batch size
|
||||
:param pad: if True: pads dataset to a multiple of global_batch_size
|
||||
samples
|
||||
:param world_size: number of distributed workers
|
||||
:param rank: rank of the current process
|
||||
"""
|
||||
if world_size is None:
|
||||
world_size = get_world_size()
|
||||
if rank is None:
|
||||
rank = get_rank()
|
||||
|
||||
self.world_size = world_size
|
||||
|
||||
global_batch_size = batch_size * world_size
|
||||
|
||||
data_len = len(dataset)
|
||||
num_samples = (data_len + global_batch_size - 1) \
|
||||
// global_batch_size * global_batch_size
|
||||
self.num_samples = num_samples
|
||||
|
||||
indices = list(range(data_len))
|
||||
if pad:
|
||||
# pad dataset to a multiple of global_batch_size samples, uses
|
||||
# sample with idx 0 as pad
|
||||
indices += [0] * (num_samples - len(indices))
|
||||
else:
|
||||
# temporary pad to a multiple of global batch size, pads with "-1"
|
||||
# which is later removed from the list of indices
|
||||
indices += [-1] * (num_samples - len(indices))
|
||||
indices = torch.tensor(indices)
|
||||
|
||||
indices = indices.view(-1, batch_size)
|
||||
indices = indices[rank::world_size].contiguous()
|
||||
indices = indices.view(-1)
|
||||
# remove temporary pad
|
||||
indices = indices[indices != -1]
|
||||
indices = indices.tolist()
|
||||
self.indices = indices
|
||||
|
||||
def __iter__(self):
|
||||
return iter(self.indices)
|
||||
|
||||
def __len__(self):
|
||||
return len(self.indices)
|
||||
|
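The bucket assignment in `BucketingSampler.__init__` can be illustrated without tensors. This is a sketch using plain lists; `assign_buckets` is a hypothetical helper, and the arithmetic (ceil-divided bucket width, max of source and target bucket, clamp to the last bucket) follows the diff above.

```python
def assign_buckets(src_lengths, tgt_lengths, max_len, num_buckets):
    """Group sample indices into buckets by the longer of (src, tgt) length."""
    # Ceil division, so that num_buckets buckets cover lengths up to max_len.
    bucket_width = (max_len + num_buckets - 1) // num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for idx, (src_len, tgt_len) in enumerate(zip(src_lengths, tgt_lengths)):
        bid = max(src_len // bucket_width, tgt_len // bucket_width)
        # Clamp, because a length equal to max_len may map past the last bucket.
        bid = min(max(bid, 0), num_buckets - 1)
        buckets[bid].append(idx)
    return buckets


buckets = assign_buckets(src_lengths=[3, 12, 49, 50],
                         tgt_lengths=[5, 8, 20, 10],
                         max_len=50, num_buckets=5)
print(buckets)
```

Each bucket is later truncated to a multiple of the global batch size, so every global batch draws all of its samples from a single bucket.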
|
|
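The rank assignment in `distribute_batches` (reshape to local batches, then take every `world_size`-th batch starting at `rank`) can be sketched with lists. The function below mirrors the method name from the diff but is a simplified stand-alone version, not the class method itself.

```python
def distribute_batches(indices, batch_size, world_size, rank):
    """Return the sample indices that one rank processes in an epoch."""
    assert len(indices) % (batch_size * world_size) == 0
    # Split the flat index list into local batches of batch_size samples.
    batches = [indices[i:i + batch_size]
               for i in range(0, len(indices), batch_size)]
    # Consecutive ranks receive consecutive local batches.
    mine = batches[rank::world_size]
    return [idx for batch in mine for idx in batch]


indices = list(range(8))
for rank in range(2):
    print(rank, distribute_batches(indices, batch_size=2, world_size=2, rank=rank))
```

With this layout the local batches of all ranks at a given step together form one contiguous global batch, which is what makes the per-global-batch reshuffle in `reshuffle_batches` meaningful.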
@ -1,43 +1,76 @@
|
|||
import logging
|
||||
from collections import defaultdict
|
||||
from functools import partial
|
||||
|
||||
import seq2seq.data.config as config
|
||||
|
||||
def default():
|
||||
return config.UNK
|
||||
|
||||
class Tokenizer:
|
||||
"""
|
||||
Tokenizer class.
|
||||
"""
|
||||
def __init__(self, vocab_fname, separator='@@'):
|
||||
def __init__(self, vocab_fname=None, pad=1, separator='@@'):
|
||||
"""
|
||||
Constructor for the Tokenizer class.
|
||||
|
||||
:param vocab_fname: path to the file with vocabulary
|
||||
:param pad: pads vocabulary to a multiple of 'pad' tokens
|
||||
:param separator: tokenization separator
|
||||
"""
|
||||
self.separator = separator
|
||||
if vocab_fname:
|
||||
self.separator = separator
|
||||
|
||||
logging.info(f'Building vocabulary from {vocab_fname}')
|
||||
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
|
||||
config.BOS_TOKEN, config.EOS_TOKEN]
|
||||
logging.info(f'Building vocabulary from {vocab_fname}')
|
||||
vocab = [config.PAD_TOKEN, config.UNK_TOKEN,
|
||||
config.BOS_TOKEN, config.EOS_TOKEN]
|
||||
|
||||
with open(vocab_fname) as vfile:
|
||||
for line in vfile:
|
||||
vocab.append(line.strip())
|
||||
with open(vocab_fname) as vfile:
|
||||
for line in vfile:
|
||||
vocab.append(line.strip())
|
||||
|
||||
logging.info(f'Size of vocabulary: {len(vocab)}')
|
||||
self.vocab_size = len(vocab)
|
||||
self.pad_vocabulary(vocab, pad)
|
||||
|
||||
self.vocab_size = len(vocab)
|
||||
logging.info(f'Size of vocabulary: {self.vocab_size}')
|
||||
|
||||
self.tok2idx = defaultdict(default)
|
||||
for idx, token in enumerate(vocab):
|
||||
self.tok2idx[token] = idx
|
||||
self.tok2idx = defaultdict(partial(int, config.UNK))
|
||||
for idx, token in enumerate(vocab):
|
||||
self.tok2idx[token] = idx
|
||||
|
||||
self.idx2tok = {}
|
||||
for key, value in self.tok2idx.items():
|
||||
self.idx2tok[value] = key
|
||||
self.idx2tok = {}
|
||||
for key, value in self.tok2idx.items():
|
||||
self.idx2tok[value] = key
|
||||
|
||||
def pad_vocabulary(self, vocab, pad):
|
||||
"""
|
||||
Pads vocabulary to a multiple of 'pad' tokens.
|
||||
|
||||
:param vocab: list with vocabulary
|
||||
:param pad: integer
|
||||
"""
|
||||
vocab_size = len(vocab)
|
||||
padded_vocab_size = (vocab_size + pad - 1) // pad * pad
|
||||
for i in range(0, padded_vocab_size - vocab_size):
|
||||
token = f'madeupword{i:04d}'
|
||||
vocab.append(token)
|
||||
assert len(vocab) % pad == 0
|
||||
|
||||
def get_state(self):
|
||||
logging.info(f'Saving state of the tokenizer')
|
||||
state = {
|
||||
'separator': self.separator,
|
||||
'vocab_size': self.vocab_size,
|
||||
'tok2idx': self.tok2idx,
|
||||
'idx2tok': self.idx2tok,
|
||||
}
|
||||
return state
|
||||
|
||||
def set_state(self, state):
|
||||
logging.info(f'Restoring state of the tokenizer')
|
||||
self.separator = state['separator']
|
||||
self.vocab_size = state['vocab_size']
|
||||
self.tok2idx = state['tok2idx']
|
||||
self.idx2tok = state['idx2tok']
|
||||
|
||||
def segment(self, line):
|
||||
"""
|
||||
|
@ -62,6 +95,11 @@ class Tokenizer:
|
|||
returns: string representing detokenized sentence
|
||||
"""
|
||||
detok = delim.join([self.idx2tok[idx] for idx in inputs])
|
||||
detok = detok.replace(
|
||||
self.separator+ ' ', '').replace(self.separator, '')
|
||||
detok = detok.replace(self.separator + ' ', '')
|
||||
detok = detok.replace(self.separator, '')
|
||||
|
||||
detok = detok.replace(config.BOS_TOKEN, '')
|
||||
detok = detok.replace(config.EOS_TOKEN, '')
|
||||
detok = detok.replace(config.PAD_TOKEN, '')
|
||||
detok = detok.strip()
|
||||
return detok
|
||||
|
|
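The vocabulary padding added in `pad_vocabulary` can be sketched as a pure function. Unlike the method in the diff, which appends to the list in place, this hypothetical variant returns a new list; the filler token format `madeupword{i:04d}` is taken from the diff.

```python
def pad_vocabulary(vocab, pad):
    """Return vocab extended with filler tokens to a multiple of pad entries."""
    vocab_size = len(vocab)
    # Round up to the nearest multiple of pad.
    padded_vocab_size = (vocab_size + pad - 1) // pad * pad
    fillers = [f'madeupword{i:04d}'
               for i in range(padded_vocab_size - vocab_size)]
    return vocab + fillers


vocab = ['<pad>', '<unk>', '<s>', '</s>', 'hello']
padded = pad_vocabulary(vocab, pad=8)
print(len(padded), padded[-3:])
```

The special token strings in the example are placeholders; the real values come from `seq2seq.data.config`, which is not shown in this section.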
|
@ -81,7 +81,8 @@ class SequenceGenerator:
|
|||
counter += 1
|
||||
|
||||
words = words.view(word_view)
|
||||
words, logprobs, attn, context = self.model.generate(words, context, 1)
|
||||
output = self.model.generate(words, context, 1)
|
||||
words, logprobs, attn, context = output
|
||||
words = words.view(-1)
|
||||
|
||||
translation[active, idx] = words
|
||||
|
@ -123,13 +124,15 @@ class SequenceGenerator:
|
|||
max_seq_len = self.max_seq_len
|
||||
cov_penalty_factor = self.cov_penalty_factor
|
||||
|
||||
translation = torch.zeros(batch_size * beam_size, max_seq_len, dtype=torch.int64)
|
||||
translation = torch.zeros(batch_size * beam_size, max_seq_len,
|
||||
dtype=torch.int64)
|
||||
lengths = torch.ones(batch_size * beam_size, dtype=torch.int64)
|
||||
scores = torch.zeros(batch_size * beam_size, dtype=torch.float32)
|
||||
|
||||
active = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
|
||||
base_mask = torch.arange(0, batch_size * beam_size, dtype=torch.int64)
|
||||
global_offset = torch.arange(0, batch_size * beam_size, beam_size, dtype=torch.int64)
|
||||
global_offset = torch.arange(0, batch_size * beam_size, beam_size,
|
||||
dtype=torch.int64)
|
||||
|
||||
eos_beam_fill = torch.tensor([0] + (beam_size - 1) * [float('-inf')])
|
||||
|
||||
|
@ -161,21 +164,23 @@ class SequenceGenerator:
|
|||
_, seq, feature = context[0].shape
|
||||
context[0].unsqueeze_(1)
|
||||
context[0] = context[0].expand(-1, beam_size, -1, -1)
|
||||
context[0] = context[0].contiguous().view(batch_size * beam_size, seq, feature)
|
||||
context[0] = context[0].contiguous().view(batch_size * beam_size,
|
||||
seq, feature)
|
||||
# context[0]: (batch * beam, seq, feature)
|
||||
else:
|
||||
# context[0] (encoder state): (seq, batch, feature)
|
||||
seq, _, feature = context[0].shape
|
||||
context[0].unsqueeze_(2)
|
||||
context[0] = context[0].expand(-1, -1, beam_size, -1)
|
||||
context[0] = context[0].contiguous().view(seq, batch_size * beam_size, feature)
|
||||
context[0] = context[0].contiguous().view(seq, batch_size *
|
||||
beam_size, feature)
|
||||
# context[0]: (seq, batch * beam, feature)
|
||||
|
||||
#context[1] (encoder seq length): (batch)
|
||||
# context[1] (encoder seq length): (batch)
|
||||
context[1].unsqueeze_(1)
|
||||
context[1] = context[1].expand(-1, beam_size)
|
||||
context[1] = context[1].contiguous().view(batch_size * beam_size)
|
||||
#context[1]: (batch * beam)
|
||||
# context[1]: (batch * beam)
|
||||
|
||||
accu_attn_scores = torch.zeros(batch_size * beam_size, seq)
|
||||
if self.cuda:
|
||||
|
@ -194,7 +199,8 @@ class SequenceGenerator:
|
|||
|
||||
lengths[active[~eos_mask.view(-1)]] += 1
|
||||
|
||||
words, logprobs, attn, context = self.model.generate(words, context, beam_size)
|
||||
output = self.model.generate(words, context, beam_size)
|
||||
words, logprobs, attn, context = output
|
||||
|
||||
attn = attn.float().squeeze(attn_query_dim)
|
||||
attn = attn.masked_fill(eos_mask.view(-1).unsqueeze(1), 0)
|
||||
|
|
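The `unsqueeze` / `expand` / `view` sequence used in beam search to replicate encoder state lays out the flattened `batch * beam` dimension so that each sentence's beams are contiguous. A list-based sketch of that ordering follows; `expand_for_beam` and `beam_offsets` are hypothetical names, and lists stand in for tensors.

```python
def expand_for_beam(per_sentence, beam_size):
    """Repeat every per-sentence entry beam_size times, keeping beams adjacent."""
    return [item for item in per_sentence for _ in range(beam_size)]


def beam_offsets(batch_size, beam_size):
    """Index of the first beam of each sentence in the flattened layout."""
    return list(range(0, batch_size * beam_size, beam_size))


lengths = [7, 9]  # encoder sequence lengths for a batch of two sentences
print(expand_for_beam(lengths, beam_size=3))
print(beam_offsets(batch_size=2, beam_size=3))
```

The offsets correspond to `global_offset` in the diff: adding a beam-local index to the offset of a sentence gives its row in the flattened tensors.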
|
@ -1,7 +1,8 @@
|
|||
import contextlib
|
||||
import logging
|
||||
import os
|
||||
import subprocess
|
||||
import time
|
||||
import os
|
||||
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
|
@ -9,7 +10,18 @@ import torch.distributed as dist
|
|||
import seq2seq.data.config as config
|
||||
from seq2seq.inference.beam_search import SequenceGenerator
|
||||
from seq2seq.utils import AverageMeter
|
||||
from seq2seq.utils import get_rank, get_world_size
|
||||
from seq2seq.utils import barrier
|
||||
from seq2seq.utils import get_rank
|
||||
from seq2seq.utils import get_world_size
|
||||
|
||||
|
||||
def gather_predictions(preds):
|
||||
world_size = get_world_size()
|
||||
if world_size > 1:
|
||||
all_preds = [preds.new(preds.size(0), preds.size(1)) for i in range(world_size)]
|
||||
dist.all_gather(all_preds, preds)
|
||||
preds = torch.cat(all_preds)
|
||||
return preds
|
||||
|
||||
|
||||
class Translator:
|
||||
|
@ -96,17 +108,25 @@ class Translator:
|
|||
eval_path = self.build_eval_path(epoch, iteration)
|
||||
detok_eval_path = eval_path + '.detok'
|
||||
|
||||
rank = get_rank()
|
||||
if rank == 0:
|
||||
logging.info(f'Running evaluation on test set')
|
||||
self.model.eval()
|
||||
torch.cuda.empty_cache()
|
||||
with contextlib.suppress(FileNotFoundError):
|
||||
os.remove(eval_path)
|
||||
os.remove(detok_eval_path)
|
||||
|
||||
self.evaluate(epoch, iteration, eval_path, summary)
|
||||
rank = get_rank()
|
||||
logging.info(f'Running evaluation on test set')
|
||||
self.model.eval()
|
||||
torch.cuda.empty_cache()
|
||||
|
||||
output = self.evaluate(epoch, iteration, summary)
|
||||
output = output[:len(self.loader.dataset)]
|
||||
output = self.loader.dataset.unsort(output)
|
||||
|
||||
if rank == 0:
|
||||
with open(eval_path, 'a') as eval_file:
|
||||
eval_file.writelines(output)
|
||||
if calc_bleu:
|
||||
self.run_detokenizer(eval_path)
|
||||
test_bleu[0] = self.run_sacrebleu(detok_eval_path,
|
||||
reference_path)
|
||||
test_bleu[0] = self.run_sacrebleu(detok_eval_path, reference_path)
|
||||
if summary:
|
||||
logging.info(f'BLEU on test dataset: {test_bleu[0]:.2f}')
|
||||
|
||||
|
@ -114,8 +134,9 @@ class Translator:
|
|||
logging.info(f'Target accuracy reached')
|
||||
break_training[0] = 1
|
||||
|
||||
torch.cuda.empty_cache()
|
||||
logging.info(f'Finished evaluation on test set')
|
||||
barrier()
|
||||
torch.cuda.empty_cache()
|
||||
logging.info(f'Finished evaluation on test set')
|
||||
|
||||
if self.distributed:
|
||||
dist.broadcast(break_training, 0)
|
||||
|
@ -123,35 +144,29 @@ class Translator:
|
|||
|
||||
return test_bleu[0].item(), break_training[0].item()
|
||||
|
||||
def evaluate(self, epoch, iteration, eval_path, summary):
|
||||
def evaluate(self, epoch, iteration, summary):
|
||||
"""
|
||||
Runs evaluation on test dataset.
|
||||
|
||||
:param epoch: index of the current epoch
|
||||
:param iteration: index of the current iteration
|
||||
:param eval_path: path to the file for saving results
|
||||
:param summary: if True prints summary
|
||||
"""
|
||||
eval_file = open(eval_path, 'w')
|
||||
|
||||
batch_time = AverageMeter(False)
|
||||
tot_tok_per_sec = AverageMeter(False)
|
||||
iterations = AverageMeter(False)
|
||||
enc_seq_len = AverageMeter(False)
|
||||
dec_seq_len = AverageMeter(False)
|
||||
total_iters = 0
|
||||
total_lines = 0
|
||||
stats = {}
|
||||
|
||||
output = []
|
||||
|
||||
for i, (src, indices) in enumerate(self.loader):
|
||||
translate_timer = time.time()
|
||||
src, src_length = src
|
||||
|
||||
if self.batch_first:
|
||||
batch_size = src.size(0)
|
||||
else:
|
||||
batch_size = src.size(1)
|
||||
total_lines += batch_size
|
||||
batch_size = self.loader.batch_size
|
||||
global_batch_size = batch_size * get_world_size()
|
||||
beam_size = self.beam_size
|
||||
|
||||
bos = [self.insert_target_start] * (batch_size * beam_size)
|
||||
|
@ -179,20 +194,18 @@ class Translator:
|
|||
generator = self.generator.beam_search
|
||||
preds, lengths, counter = generator(batch_size, bos, context)
|
||||
|
||||
preds = preds.cpu()
|
||||
lengths = lengths.cpu()
|
||||
stats['total_dec_len'] = int(lengths.sum())
|
||||
stats['total_dec_len'] = lengths.sum().item()
|
||||
stats['iters'] = counter
|
||||
total_iters += stats['iters']
|
||||
|
||||
output = []
|
||||
for idx, pred in enumerate(preds):
|
||||
end = lengths[idx] - 1
|
||||
pred = pred[1:end].tolist()
|
||||
out = self.tokenizer.detokenize(pred)
|
||||
output.append(out)
|
||||
indices = torch.tensor(indices).to(preds)
|
||||
preds = preds.scatter(0, indices.unsqueeze(1).expand_as(preds), preds)
|
||||
|
||||
output = [output[indices.index(i)] for i in range(len(output))]
|
||||
preds = gather_predictions(preds).cpu()
|
||||
|
||||
for pred in preds:
|
||||
pred = pred.tolist()
|
||||
detok = self.tokenizer.detokenize(pred)
|
||||
output.append(detok + '\n')
|
||||
|
||||
elapsed = time.time() - translate_timer
|
||||
batch_time.update(elapsed, batch_size)
|
||||
|
@ -219,25 +232,28 @@ class Translator:
|
|||
log = ''.join(log)
|
||||
logging.info(log)
|
||||
|
||||
for line in output:
|
||||
eval_file.write(line)
|
||||
eval_file.write('\n')
|
||||
tot_tok_per_sec.reduce('sum')
|
||||
enc_seq_len.reduce('mean')
|
||||
dec_seq_len.reduce('mean')
|
||||
batch_time.reduce('mean')
|
||||
iterations.reduce('sum')
|
||||
|
||||
eval_file.close()
|
||||
if summary:
|
||||
time_per_sentence = (batch_time.avg / self.loader.batch_size)
|
||||
if summary and get_rank() == 0:
|
||||
time_per_sentence = (batch_time.avg / global_batch_size)
|
||||
log = []
|
||||
log += f'TEST SUMMARY:\n'
|
||||
log += f'Lines translated: {total_lines}\t'
|
||||
log += f'Lines translated: {len(self.loader.dataset)}\t'
|
||||
log += f'Avg total tokens/s: {tot_tok_per_sec.avg:.0f}\n'
|
||||
log += f'Avg time per batch: {batch_time.avg:.3f} s\t'
|
||||
log += f'Avg time per sentence: {1000*time_per_sentence:.3f} ms\n'
|
||||
log += f'Avg encoder seq len: {enc_seq_len.avg:.2f}\t'
|
||||
log += f'Avg decoder seq len: {dec_seq_len.avg:.2f}\t'
|
||||
log += f'Total decoder iterations: {total_iters}'
|
||||
log += f'Total decoder iterations: {int(iterations.sum)}'
|
||||
log = ''.join(log)
|
||||
logging.info(log)
|
||||
|
||||
return output
|
||||
|
||||
def run_detokenizer(self, eval_path):
|
||||
"""
|
||||
Executes moses detokenizer on eval_path file and saves result to
|
||||
|
|
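The new evaluation path gathers predictions from all ranks and then trims the result with `output[:len(self.loader.dataset)]`. The simulation below sketches why trimming is enough, assuming the loader is driven by `StaticDistributedSampler` with `pad=True` (this section does not show how the loader is built, so treat that as an assumption). `static_shard` and `gather_in_order` are hypothetical helpers working on plain index lists.

```python
def static_shard(data_len, batch_size, world_size, rank):
    """Local batches for one rank, padded with index 0 as in the pad=True branch."""
    global_bs = batch_size * world_size
    num_samples = (data_len + global_bs - 1) // global_bs * global_bs
    indices = list(range(data_len)) + [0] * (num_samples - data_len)
    batches = [indices[i:i + batch_size]
               for i in range(0, num_samples, batch_size)]
    return batches[rank::world_size]


def gather_in_order(data_len, batch_size, world_size):
    """Simulate per-step all_gather over ranks, then trim the padding."""
    shards = [static_shard(data_len, batch_size, world_size, rank)
              for rank in range(world_size)]
    output = []
    for step in range(len(shards[0])):
        # all_gather concatenates the per-rank results in rank order.
        for rank in range(world_size):
            output.extend(shards[rank][step])
    return output[:data_len]


print(gather_in_order(data_len=5, batch_size=2, world_size=2))
```

Because rank `r` receives local batches `r`, `r + world_size`, and so on, concatenating in rank order at every step reproduces the dataset order, and the padded entries all land at the tail.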
|
@ -12,7 +12,7 @@ class BahdanauAttention(nn.Module):
|
|||
Implementation is very similar to tf.contrib.seq2seq.BahdanauAttention
|
||||
"""
|
||||
def __init__(self, query_size, key_size, num_units, normalize=False,
|
||||
dropout=0, batch_first=False):
|
||||
batch_first=False, init_weight=0.1):
|
||||
"""
|
||||
Constructor for the BahdanauAttention.
|
||||
|
||||
|
@ -20,9 +20,10 @@ class BahdanauAttention(nn.Module):
|
|||
:param key_size: feature dimension for keys
|
||||
:param num_units: internal feature dimension
|
||||
:param normalize: whether to normalize energy term
|
||||
:param dropout: probability of the dropout (between softmax and bmm)
|
||||
:param batch_first: if True batch size is the 1st dimension, if False
|
||||
the sequence is first and batch size is second
|
||||
:param init_weight: range for uniform initializer used to initialize
|
||||
Linear key and query transform layers and linear_att vector
|
||||
"""
|
||||
super(BahdanauAttention, self).__init__()
|
||||
|
||||
|
@ -32,10 +33,11 @@ class BahdanauAttention(nn.Module):
|
|||
|
||||
self.linear_q = nn.Linear(query_size, num_units, bias=False)
|
||||
self.linear_k = nn.Linear(key_size, num_units, bias=False)
|
||||
nn.init.uniform_(self.linear_q.weight.data, -init_weight, init_weight)
|
||||
nn.init.uniform_(self.linear_k.weight.data, -init_weight, init_weight)
|
||||
|
||||
self.linear_att = Parameter(torch.Tensor(num_units))
|
||||
|
||||
self.dropout = nn.Dropout(dropout)
|
||||
self.mask = None
|
||||
|
||||
if self.normalize:
|
||||
|
@ -45,14 +47,14 @@ class BahdanauAttention(nn.Module):
|
|||
self.register_parameter('normalize_scalar', None)
|
||||
self.register_parameter('normalize_bias', None)
|
||||
|
||||
self.reset_parameters()
|
||||
self.reset_parameters(init_weight)
|
||||
|
||||
def reset_parameters(self):
|
||||
def reset_parameters(self, init_weight):
|
||||
"""
|
||||
Sets initial random values for trainable parameters.
|
||||
"""
|
||||
stdv = 1. / math.sqrt(self.num_units)
|
||||
self.linear_att.data.uniform_(-stdv, stdv)
|
||||
self.linear_att.data.uniform_(-init_weight, init_weight)
|
||||
|
||||
if self.normalize:
|
||||
self.normalize_scalar.data.fill_(stdv)
|
||||
|
@ -74,7 +76,8 @@ class BahdanauAttention(nn.Module):
|
|||
else:
|
||||
max_len = context.size(0)
|
||||
|
||||
indices = torch.arange(0, max_len, dtype=torch.int64, device=context.device)
|
||||
indices = torch.arange(0, max_len, dtype=torch.int64,
|
||||
device=context.device)
|
||||
self.mask = indices >= (context_len.unsqueeze(1))
|
||||
|
||||
def calc_score(self, att_query, att_keys):
|
||||
|
@ -96,16 +99,12 @@ class BahdanauAttention(nn.Module):
|
|||
|
||||
if self.normalize:
|
||||
sum_qk = sum_qk + self.normalize_bias
|
||||
|
||||
tmp = self.linear_att.to(torch.float32)
|
||||
linear_att = tmp / tmp.norm()
|
||||
linear_att = linear_att.to(self.normalize_scalar)
|
||||
|
||||
linear_att = self.linear_att / self.linear_att.norm()
|
||||
linear_att = linear_att * self.normalize_scalar
|
||||
else:
|
||||
linear_att = self.linear_att
|
||||
|
||||
out = F.tanh(sum_qk).matmul(linear_att)
|
||||
out = torch.tanh(sum_qk).matmul(linear_att)
|
||||
return out
|
||||
|
||||
def forward(self, query, keys):
|
||||
|
@ -152,7 +151,6 @@ class BahdanauAttention(nn.Module):
|
|||
|
||||
# Calculate the weighted average of the attention inputs according to
|
||||
# the scores
|
||||
scores_normalized = self.dropout(scores_normalized)
|
||||
# context: (b x t_q x n)
|
||||
context = torch.bmm(scores_normalized, keys)
|
||||
|
||||
|
|
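The normalized scoring path in `calc_score` scales `linear_att` to unit norm, multiplies it by the learned `normalize_scalar`, and applies it to `tanh(sum_qk + normalize_bias)`. The scalar sketch below works on a single query/key pair with plain lists in place of tensors; `normalized_score` is a hypothetical name and the inputs are made up for illustration.

```python
import math


def normalized_score(sum_qk, linear_att, normalize_scalar, normalize_bias):
    """Attention energy for one (query, key) pair with a normalized weight vector."""
    # Rescale linear_att to unit L2 norm, then apply the learned scalar.
    norm = math.sqrt(sum(v * v for v in linear_att))
    scaled = [v / norm * normalize_scalar for v in linear_att]
    # Dot product of tanh(projected query + projected key + bias) with the vector.
    return sum(math.tanh(x + b) * w
               for x, b, w in zip(sum_qk, normalize_bias, scaled))


score = normalized_score(sum_qk=[1.0, -1.0],
                         linear_att=[3.0, 4.0],
                         normalize_scalar=5.0,
                         normalize_bias=[0.0, 0.0])
print(score)
```

With normalization enabled, the magnitude of the energy is controlled by `normalize_scalar` alone and not by the norm of `linear_att`.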
|
@ -3,16 +3,18 @@ import itertools
|
|||
import torch
|
||||
import torch.nn as nn
|
||||
|
||||
from seq2seq.models.attention import BahdanauAttention
|
||||
import seq2seq.data.config as config
|
||||
from seq2seq.models.attention import BahdanauAttention
|
||||
from seq2seq.utils import init_lstm_
|
||||
|
||||
|
||||
class RecurrentAttention(nn.Module):
|
||||
"""
|
||||
LSTM with an attention module.
|
||||
LSTM wrapped with an attention module.
|
||||
"""
|
||||
def __init__(self, input_size, context_size, hidden_size, num_layers=1,
|
||||
bias=True, batch_first=False, dropout=0):
|
||||
def __init__(self, input_size=1024, context_size=1024, hidden_size=1024,
|
||||
num_layers=1, batch_first=False, dropout=0.2,
|
||||
init_weight=0.1):
|
||||
"""
|
||||
Constructor for the RecurrentAttention.
|
||||
|
||||
|
@ -20,16 +22,17 @@ class RecurrentAttention(nn.Module):
|
|||
:param context_size: number of features in output from encoder
|
||||
:param hidden_size: internal hidden size
|
||||
:param num_layers: number of layers in LSTM
|
||||
:param bias: enables bias in LSTM layers
|
||||
:param batch_first: if True the model uses (batch,seq,feature) tensors,
|
||||
if false the model uses (seq, batch, feature)
|
||||
:param dropout: probability of dropout
|
||||
:param dropout: probability of dropout (on input to LSTM layer)
|
||||
:param init_weight: range for the uniform initializer
|
||||
"""
|
||||
|
||||
super(RecurrentAttention, self).__init__()
|
||||
|
||||
self.rnn = nn.LSTM(input_size, hidden_size, num_layers, bias,
|
||||
batch_first)
|
||||
self.rnn = nn.LSTM(input_size, hidden_size, num_layers, bias=True,
|
||||
batch_first=batch_first)
|
||||
init_lstm_(self.rnn, init_weight)
|
||||
|
||||
self.attn = BahdanauAttention(hidden_size, context_size, context_size,
|
||||
normalize=True, batch_first=batch_first)
|
||||
|
@ -52,9 +55,9 @@ class RecurrentAttention(nn.Module):
|
|||
# softmax
|
||||
self.attn.set_mask(context_len, context)
|
||||
|
||||
inputs = self.dropout(inputs)
|
||||
rnn_outputs, hidden = self.rnn(inputs, hidden)
|
||||
attn_outputs, scores = self.attn(rnn_outputs, context)
|
||||
rnn_outputs = self.dropout(rnn_outputs)
|
||||
|
||||
return rnn_outputs, hidden, attn_outputs, scores
|
||||
|
||||
|
@ -63,23 +66,18 @@ class Classifier(nn.Module):
|
|||
"""
|
||||
Fully-connected classifier
|
||||
"""
|
||||
def __init__(self, in_features, out_features, math='fp32'):
|
||||
def __init__(self, in_features, out_features, init_weight=0.1):
|
||||
"""
|
||||
Constructor for the Classifier.
|
||||
|
||||
:param in_features: number of input features
|
||||
:param out_features: number of output features (size of vocabulary)
|
||||
:param math: arithmetic type, 'fp32' or 'fp16'
|
||||
:param init_weight: range for the uniform initializer
|
||||
"""
|
||||
super(Classifier, self).__init__()
|
||||
|
||||
self.out_features = out_features
|
||||
|
||||
# padding required to trigger HMMA kernels
|
||||
if math == 'fp16':
|
||||
out_features = (out_features + 7) // 8 * 8
|
||||
|
||||
self.classifier = nn.Linear(in_features, out_features)
|
||||
nn.init.uniform_(self.classifier.weight.data, -init_weight, init_weight)
|
||||
nn.init.uniform_(self.classifier.bias.data, -init_weight, init_weight)
|
||||
|
||||
def forward(self, x):
|
||||
"""
|
||||
|
@ -88,7 +86,6 @@ class Classifier(nn.Module):
|
|||
:param x: output from decoder
|
||||
"""
|
||||
out = self.classifier(x)
|
||||
out = out[..., :self.out_features]
|
||||
return out
|
||||
|
||||
|
||||
|
@ -102,22 +99,24 @@ class ResidualRecurrentDecoder(nn.Module):
|
|||
LSTM layer of the decoder goes into the attention module, then the
|
||||
re-weighted context is concatenated with inputs to all subsequent LSTM
|
||||
layers in the decoder at the current timestep.
|
||||
|
||||
Residual connections are enabled after 3rd LSTM layer, dropout is applied
|
||||
on inputs to LSTM layers.
|
||||
"""
|
||||
def __init__(self, vocab_size, hidden_size=128, num_layers=8, bias=True,
|
||||
dropout=0, batch_first=False, math='fp32', embedder=None):
|
||||
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
|
||||
batch_first=False, embedder=None, init_weight=0.1):
|
||||
"""
|
||||
Constructor of the ResidualRecurrentDecoder.
|
||||
|
||||
:param vocab_size: size of vocabulary
|
||||
:param hidden_size: hidden size for LSTM layers
|
||||
:param num_layers: number of LSTM layers
|
||||
:param bias: enables bias in LSTM layers
|
||||
:param dropout: probability of dropout (between LSTM layers)
|
||||
:param dropout: probability of dropout (on input to LSTM layers)
|
||||
:param batch_first: if True the model uses (batch,seq,feature) tensors,
|
||||
if false the model uses (seq, batch, feature)
|
||||
:param math: arithmetic type, 'fp32' or 'fp16'
|
||||
:param embedder: embedding module, if None constructor will create new
|
||||
embedding layer
|
||||
:param embedder: instance of nn.Embedding, if None constructor will
|
||||
create new embedding layer
|
||||
:param init_weight: range for the uniform initializer
|
||||
"""
|
||||
super(ResidualRecurrentDecoder, self).__init__()
|
||||
|
||||
|
@ -125,21 +124,26 @@ class ResidualRecurrentDecoder(nn.Module):
|
|||
|
||||
self.att_rnn = RecurrentAttention(hidden_size, hidden_size,
|
||||
hidden_size, num_layers=1,
|
||||
batch_first=batch_first)
|
||||
batch_first=batch_first,
|
||||
dropout=dropout)
|
||||
|
||||
self.rnn_layers = nn.ModuleList()
|
||||
for _ in range(num_layers - 1):
|
||||
self.rnn_layers.append(
|
||||
nn.LSTM(2 * hidden_size, hidden_size, num_layers=1, bias=bias,
|
||||
nn.LSTM(2 * hidden_size, hidden_size, num_layers=1, bias=True,
|
||||
batch_first=batch_first))
|
||||
|
||||
for lstm in self.rnn_layers:
|
||||
init_lstm_(lstm, init_weight)
|
||||
|
||||
if embedder is not None:
|
||||
self.embedder = embedder
|
||||
else:
|
||||
self.embedder = nn.Embedding(vocab_size, hidden_size,
|
||||
padding_idx=config.PAD)
|
||||
nn.init.uniform_(self.embedder.weight.data, -init_weight, init_weight)
|
||||
|
||||
self.classifier = Classifier(hidden_size, vocab_size, math)
|
||||
self.classifier = Classifier(hidden_size, vocab_size)
|
||||
self.dropout = nn.Dropout(p=dropout)
|
||||
|
||||
def init_hidden(self, hidden):
|
||||
|
@ -199,15 +203,15 @@ class ResidualRecurrentDecoder(nn.Module):
|
|||
x, h, attn, scores = self.att_rnn(x, hidden[0], enc_context, enc_len)
|
||||
self.append_hidden(h)
|
||||
|
||||
x = self.dropout(x)
|
||||
x = torch.cat((x, attn), dim=2)
|
||||
x = self.dropout(x)
|
||||
x, h = self.rnn_layers[0](x, hidden[1])
|
||||
self.append_hidden(h)
|
||||
|
||||
for i in range(1, len(self.rnn_layers)):
|
||||
residual = x
|
||||
x = self.dropout(x)
|
||||
x = torch.cat((x, attn), dim=2)
|
||||
x = self.dropout(x)
|
||||
x, h = self.rnn_layers[i](x, hidden[i + 1])
|
||||
self.append_hidden(h)
|
||||
x = x + residual
|
||||
|
|
|
@ -3,6 +3,7 @@ from torch.nn.utils.rnn import pack_padded_sequence
|
|||
from torch.nn.utils.rnn import pad_packed_sequence
|
||||
|
||||
import seq2seq.data.config as config
|
||||
from seq2seq.utils import init_lstm_
|
||||
|
||||
|
||||
class ResidualRecurrentEncoder(nn.Module):
|
||||
|
@ -10,42 +11,48 @@ class ResidualRecurrentEncoder(nn.Module):
|
|||
Encoder with Embedding, LSTM layers, residual connections and optional
|
||||
dropout.
|
||||
|
||||
The first LSTM layer is bidirectional and uses variable sequence length API,
|
||||
the remaining (num_layers-1) layers are unidirectional. Residual
|
||||
connections are enabled after third LSTM layer, dropout is applied between
|
||||
LSTM layers.
|
||||
The first LSTM layer is bidirectional and uses variable sequence length
|
||||
API, the remaining (num_layers-1) layers are unidirectional. Residual
|
||||
connections are enabled after third LSTM layer, dropout is applied on
|
||||
inputs to LSTM layers.
|
||||
"""
|
||||
def __init__(self, vocab_size, hidden_size=128, num_layers=8, bias=True,
|
||||
dropout=0, batch_first=False, embedder=None):
|
||||
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
|
||||
batch_first=False, embedder=None, init_weight=0.1):
|
||||
"""
|
||||
Constructor for the ResidualRecurrentEncoder.
|
||||
|
||||
:param vocab_size: size of vocabulary
|
||||
:param hidden_size: hidden size for LSTM layers
|
||||
:param num_layers: number of LSTM layers, 1st layer is bidirectional
|
||||
:param bias: enables bias in LSTM layers
|
||||
:param dropout: probability of dropout (between LSTM layers)
|
||||
:param dropout: probability of dropout (on input to LSTM layers)
|
||||
:param batch_first: if True the model uses (batch,seq,feature) tensors,
|
||||
if false the model uses (seq, batch, feature)
|
||||
:param embedder: embedding module, if None constructor will create new
|
||||
embedding layer
|
||||
:param embedder: instance of nn.Embedding, if None constructor will
|
||||
create new embedding layer
|
||||
:param init_weight: range for the uniform initializer
|
||||
"""
|
||||
super(ResidualRecurrentEncoder, self).__init__()
|
||||
self.batch_first = batch_first
|
||||
self.rnn_layers = nn.ModuleList()
|
||||
# 1st LSTM layer, bidirectional
|
||||
self.rnn_layers.append(
|
||||
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=bias,
|
||||
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=True,
|
||||
batch_first=batch_first, bidirectional=True))
|
||||
|
||||
# 2nd LSTM layer, with 2x larger input_size
|
||||
self.rnn_layers.append(
|
||||
nn.LSTM((2 * hidden_size), hidden_size, num_layers=1, bias=bias,
|
||||
nn.LSTM((2 * hidden_size), hidden_size, num_layers=1, bias=True,
|
||||
batch_first=batch_first))
|
||||
|
||||
# Remaining LSTM layers
|
||||
for _ in range(num_layers - 2):
|
||||
self.rnn_layers.append(
|
||||
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=bias,
|
||||
nn.LSTM(hidden_size, hidden_size, num_layers=1, bias=True,
|
||||
batch_first=batch_first))
|
||||
|
||||
for lstm in self.rnn_layers:
|
||||
init_lstm_(lstm, init_weight)
|
||||
|
||||
self.dropout = nn.Dropout(p=dropout)
|
||||
|
||||
if embedder is not None:
|
||||
|
@ -53,6 +60,7 @@ class ResidualRecurrentEncoder(nn.Module):
|
|||
else:
|
||||
self.embedder = nn.Embedding(vocab_size, hidden_size,
|
||||
padding_idx=config.PAD)
|
||||
nn.init.uniform_(self.embedder.weight.data, -init_weight, init_weight)
|
||||
|
||||
def forward(self, inputs, lengths):
|
||||
"""
|
||||
|
@ -66,6 +74,7 @@ class ResidualRecurrentEncoder(nn.Module):
|
|||
x = self.embedder(inputs)
|
||||
|
||||
# bidirectional layer
|
||||
x = self.dropout(x)
|
||||
x = pack_padded_sequence(x, lengths.cpu().numpy(),
|
||||
batch_first=self.batch_first)
|
||||
x, _ = self.rnn_layers[0](x)
|
||||
|
|
|
@ -1,18 +1,17 @@
|
|||
import torch.nn as nn
|
||||
|
||||
import seq2seq.data.config as config
|
||||
from seq2seq.models.seq2seq_base import Seq2Seq
|
||||
from seq2seq.models.encoder import ResidualRecurrentEncoder
|
||||
from seq2seq.models.decoder import ResidualRecurrentDecoder
|
||||
from seq2seq.models.encoder import ResidualRecurrentEncoder
|
||||
from seq2seq.models.seq2seq_base import Seq2Seq
|
||||
|
||||
|
||||
class GNMT(Seq2Seq):
|
||||
"""
|
||||
GNMT v2 model
|
||||
"""
|
||||
def __init__(self, vocab_size, hidden_size=512, num_layers=8, bias=True,
|
||||
dropout=0.2, batch_first=False, math='fp32',
|
||||
share_embedding=False):
|
||||
def __init__(self, vocab_size, hidden_size=1024, num_layers=4, dropout=0.2,
|
||||
batch_first=False, share_embedding=True):
|
||||
"""
|
||||
Constructor for the GNMT v2 model.
|
||||
|
||||
|
@ -20,11 +19,9 @@ class GNMT(Seq2Seq):
|
|||
:param hidden_size: internal hidden size of the model
|
||||
:param num_layers: number of layers, applies to both encoder and
|
||||
decoder
|
||||
:param bias: globally enables or disables bias in encoder and decoder
|
||||
:param dropout: probability of dropout (in encoder and decoder)
|
||||
:param batch_first: if True the model uses (batch,seq,feature) tensors,
|
||||
if false the model uses (seq, batch, feature)
|
||||
:param math: arithmetic type, 'fp32' or 'fp16'
|
||||
:param share_embedding: if True embeddings are shared between encoder
|
||||
and decoder
|
||||
"""
|
||||
|
@ -32,17 +29,19 @@ class GNMT(Seq2Seq):
|
|||
super(GNMT, self).__init__(batch_first=batch_first)
|
||||
|
||||
if share_embedding:
|
||||
embedder = nn.Embedding(vocab_size, hidden_size, padding_idx=config.PAD)
|
||||
embedder = nn.Embedding(vocab_size, hidden_size,
|
||||
padding_idx=config.PAD)
|
||||
nn.init.uniform_(embedder.weight.data, -0.1, 0.1)
|
||||
else:
|
||||
embedder = None
|
||||
|
||||
self.encoder = ResidualRecurrentEncoder(vocab_size, hidden_size,
|
||||
num_layers, bias, dropout,
|
||||
num_layers, dropout,
|
||||
batch_first, embedder)
|
||||
|
||||
self.decoder = ResidualRecurrentDecoder(vocab_size, hidden_size,
|
||||
num_layers, bias, dropout,
|
||||
batch_first, math, embedder)
|
||||
num_layers, dropout,
|
||||
batch_first, embedder)
|
||||
|
||||
def forward(self, input_encoder, input_enc_len, input_decoder):
|
||||
context = self.encode(input_encoder, input_enc_len)
|
||||
|
|
|
@ -1,222 +0,0 @@
|
|||
import torch
|
||||
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
|
||||
import torch.distributed as dist
|
||||
from torch.nn.modules import Module
|
||||
from torch.autograd import Variable
|
||||
from collections import OrderedDict
|
||||
|
||||
|
||||
def flat_dist_call(tensors, call, extra_args=None):
|
||||
flat_dist_call.warn_on_half = True
|
||||
buckets = OrderedDict()
|
||||
for tensor in tensors:
|
||||
tp = tensor.type()
|
||||
if tp not in buckets:
|
||||
buckets[tp] = []
|
||||
buckets[tp].append(tensor)
|
||||
|
||||
if flat_dist_call.warn_on_half:
|
||||
if torch.cuda.HalfTensor in buckets:
|
||||
print("WARNING: gloo dist backend for half parameters may be extremely slow." +
|
||||
" It is recommended to use the NCCL backend in this case.")
|
||||
flat_dist_call.warn_on_half = False
|
||||
|
||||
for tp in buckets:
|
||||
bucket = buckets[tp]
|
||||
coalesced = _flatten_dense_tensors(bucket)
|
||||
if extra_args is not None:
|
||||
call(coalesced, *extra_args)
|
||||
else:
|
||||
call(coalesced)
|
||||
if call is dist.all_reduce:
|
||||
coalesced /= dist.get_world_size()
|
||||
|
||||
for buf, synced in zip(bucket, _unflatten_dense_tensors(coalesced, bucket)):
|
||||
buf.copy_(synced)
|
||||
|
||||
class DistributedDataParallel(Module):
|
||||
"""
|
||||
:class:`apex.parallel.DistributedDataParallel` is a module wrapper that enables
|
||||
easy multiprocess distributed data parallel training, similar to ``torch.nn.parallel.DistributedDataParallel``.
|
||||
|
||||
:class:`DistributedDataParallel` is designed to work with
|
||||
the launch utility script ``apex.parallel.multiproc.py``.
|
||||
When used with ``multiproc.py``, :class:`DistributedDataParallel`
|
||||
assigns 1 process to each of the available (visible) GPUs on the node.
|
||||
Parameters are broadcast across participating processes on initialization, and gradients are
|
||||
allreduced and averaged over processes during ``backward()``.
|
||||
|
||||
:class:`DistributedDataParallel` is optimized for use with NCCL. It achieves high performance by
|
||||
overlapping communication with computation during ``backward()`` and bucketing smaller gradient
|
||||
transfers to reduce the total number of transfers required.
|
||||
|
||||
:class:`DistributedDataParallel` assumes that your script accepts the command line
|
||||
arguments "rank" and "world-size." It also assumes that your script calls
|
||||
``torch.cuda.set_device(args.rank)`` before creating the model.
|
||||
|
||||
https://github.com/NVIDIA/apex/tree/master/examples/distributed shows detailed usage.
|
||||
https://github.com/NVIDIA/apex/tree/master/examples/imagenet shows another example
|
||||
that combines :class:`DistributedDataParallel` with mixed precision training.
|
||||
|
||||
Args:
|
||||
module: Network definition to be run in multi-gpu/distributed mode.
|
||||
message_size (Default = 1e7): Minimum number of elements in a communication bucket.
|
||||
shared_param (Default = False): If your model uses shared parameters this must be True. It will disable bucketing of parameters to avoid race conditions.
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self, module, message_size=10000000, shared_param=False):
|
||||
super(DistributedDataParallel, self).__init__()
|
||||
self.warn_on_half = True if dist._backend == dist.dist_backend.GLOO else False
|
||||
self.shared_param = shared_param
|
||||
self.message_size = message_size
|
||||
|
||||
#reference to the last iteration's parameters to see if anything has changed
|
||||
self.param_refs = []
|
||||
|
||||
self.reduction_stream = torch.cuda.Stream()
|
||||
|
||||
self.module = module
|
||||
self.param_list = list(self.module.parameters())
|
||||
|
||||
if dist._backend == dist.dist_backend.NCCL:
|
||||
for param in self.param_list:
|
||||
assert param.is_cuda, "NCCL backend only supports model parameters to be on GPU."
|
||||
|
||||
self.record = []
|
||||
self.create_hooks()
|
||||
|
||||
flat_dist_call([param.data for param in self.module.parameters()], dist.broadcast, (0,) )
|
||||
|
||||
def create_hooks(self):
|
||||
#all reduce gradient hook
|
||||
def allreduce_params():
|
||||
if not self.needs_reduction:
|
||||
return
|
||||
self.needs_reduction = False
|
||||
|
||||
#parameter ordering refresh
|
||||
if self.needs_refresh and not self.shared_param:
|
||||
t_record = torch.cuda.IntTensor(self.record)
|
||||
dist.broadcast(t_record, 0)
|
||||
self.record = [int(entry) for entry in t_record]
|
||||
self.needs_refresh = False
|
||||
|
||||
grads = [param.grad.data for param in self.module.parameters() if param.grad is not None]
|
||||
flat_dist_call(grads, dist.all_reduce)
|
||||
|
||||
def flush_buckets():
|
||||
if not self.needs_reduction:
|
||||
return
|
||||
self.needs_reduction = False
|
||||
|
||||
grads = []
|
||||
for i in range(self.ready_end, len(self.param_state)):
|
||||
param = self.param_refs[self.record[i]]
|
||||
if param.grad is not None:
|
||||
grads.append(param.grad.data)
|
||||
grads = [param.grad.data for param in self.ready_params] + grads
|
||||
|
||||
if(len(grads)>0):
|
||||
orig_stream = torch.cuda.current_stream()
|
||||
with torch.cuda.stream(self.reduction_stream):
|
||||
self.reduction_stream.wait_stream(orig_stream)
|
||||
flat_dist_call(grads, dist.all_reduce)
|
||||
|
||||
torch.cuda.current_stream().wait_stream(self.reduction_stream)
|
||||
|
||||
for param_i, param in enumerate(list(self.module.parameters())):
|
||||
def wrapper(param_i):
|
||||
|
||||
def allreduce_hook(*unused):
|
||||
if self.needs_refresh:
|
||||
self.record.append(param_i)
|
||||
Variable._execution_engine.queue_callback(allreduce_params)
|
||||
else:
|
||||
Variable._execution_engine.queue_callback(flush_buckets)
|
||||
self.comm_ready_buckets(self.record.index(param_i))
|
||||
|
||||
|
||||
if param.requires_grad:
|
||||
param.register_hook(allreduce_hook)
|
||||
wrapper(param_i)
|
||||
|
||||
|
||||
def comm_ready_buckets(self, param_ind):
|
||||
|
||||
if self.param_state[param_ind] != 0:
|
||||
raise RuntimeError("Error: Your model uses shared parameters, DDP flag shared_param must be set to True in initialization.")
|
||||
|
||||
|
||||
if self.param_state[self.ready_end] == 0:
|
||||
self.param_state[param_ind] = 1
|
||||
return
|
||||
|
||||
|
||||
while self.ready_end < len(self.param_state) and self.param_state[self.ready_end] == 1:
|
||||
self.ready_params.append(self.param_refs[self.record[self.ready_end]])
|
||||
self.ready_numel += self.ready_params[-1].numel()
|
||||
self.ready_end += 1
|
||||
|
||||
|
||||
if self.ready_numel < self.message_size:
|
||||
self.param_state[param_ind] = 1
|
||||
return
|
||||
|
||||
grads = [param.grad.data for param in self.ready_params]
|
||||
|
||||
bucket = []
|
||||
bucket_inds = []
|
||||
while grads:
|
||||
bucket.append(grads.pop(0))
|
||||
|
||||
cumm_size = 0
|
||||
for ten in bucket:
|
||||
cumm_size += ten.numel()
|
||||
|
||||
if cumm_size < self.message_size:
|
||||
continue
|
||||
|
||||
evt = torch.cuda.Event()
|
||||
evt.record(torch.cuda.current_stream())
|
||||
evt.wait(stream=self.reduction_stream)
|
||||
|
||||
with torch.cuda.stream(self.reduction_stream):
|
||||
flat_dist_call(bucket, dist.all_reduce)
|
||||
|
||||
for i in range(self.ready_start, self.ready_start+len(bucket)):
|
||||
self.param_state[i] = 2
|
||||
self.ready_params.pop(0)
|
||||
|
||||
self.param_state[param_ind] = 1
|
||||
|
||||
def forward(self, *inputs, **kwargs):
|
||||
|
||||
param_list = [param for param in list(self.module.parameters()) if param.requires_grad]
|
||||
|
||||
|
||||
#Force needs_refresh to True if there are shared params
|
||||
#this will force it to always call only flush_buckets, which is safe
|
||||
#for shared parameters in the model.
|
||||
#Parentheses are not necessary for correct order of operations, but make the intent clearer.
|
||||
if (not self.param_refs) or self.shared_param:
|
||||
self.needs_refresh = True
|
||||
else:
|
||||
self.needs_refresh = (
|
||||
(len(param_list) != len(self.param_refs)) or any(
|
||||
[param1 is not param2 for param1, param2 in zip(param_list, self.param_refs)]))
|
||||
|
||||
if self.needs_refresh:
|
||||
self.record = []
|
||||
|
||||
|
||||
self.param_state = [0 for i in range(len(param_list))]
|
||||
self.param_refs = param_list
|
||||
self.needs_reduction = True
|
||||
|
||||
self.ready_start = 0
|
||||
self.ready_end = 0
|
||||
self.ready_params = []
|
||||
self.ready_numel = 0
|
||||
|
||||
return self.module(*inputs, **kwargs)
|
|
@ -35,7 +35,21 @@ class Fp16Optimizer:
|
|||
param.data.copy_(new_param.data)
|
||||
|
||||
def __init__(self, fp16_model, grad_clip=float('inf'), loss_scale=8192,
|
||||
dls_downscale=2, dls_upscale=2, dls_upscale_interval=2048):
|
||||
dls_downscale=2, dls_upscale=2, dls_upscale_interval=128):
|
||||
"""
|
||||
Constructor for the Fp16Optimizer.
|
||||
|
||||
:param fp16_model: model (previously cast to half)
|
||||
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
|
||||
gradients
|
||||
:param loss_scale: initial loss scale
|
||||
:param dls_downscale: loss downscale factor, loss scale is divided by
|
||||
this factor when NaN/INF occurs in the gradients
|
||||
:param dls_upscale: loss upscale factor, loss scale is multiplied by
|
||||
this factor if previous dls_upscale_interval batches finished
|
||||
successfully
|
||||
:param dls_upscale_interval: interval for loss scale upscaling
|
||||
"""
|
||||
logging.info('Initializing fp16 optimizer')
|
||||
self.initialize_model(fp16_model)
|
||||
|
||||
|
@ -61,7 +75,7 @@ class Fp16Optimizer:
|
|||
for param in self.fp32_params:
|
||||
param.requires_grad = True
|
||||
|
||||
def step(self, loss, optimizer, update=True):
|
||||
def step(self, loss, optimizer, scheduler, update=True):
|
||||
"""
|
||||
Performs one step of the optimizer.
|
||||
Applies loss scaling, computes gradients in fp16, converts gradients to
|
||||
|
@ -76,21 +90,21 @@ class Fp16Optimizer:
|
|||
:param update: if True executes weight update
|
||||
"""
|
||||
loss *= self.loss_scale
|
||||
|
||||
self.fp16_model.zero_grad()
|
||||
loss.backward()
|
||||
|
||||
self.set_grads(self.fp32_params, self.fp16_model.parameters())
|
||||
if self.loss_scale != 1.0:
|
||||
for param in self.fp32_params:
|
||||
param.grad.data /= self.loss_scale
|
||||
|
||||
norm = clip_grad_norm_(self.fp32_params, self.grad_clip)
|
||||
|
||||
if update:
|
||||
self.set_grads(self.fp32_params, self.fp16_model.parameters())
|
||||
if self.loss_scale != 1.0:
|
||||
for param in self.fp32_params:
|
||||
param.grad.data /= self.loss_scale
|
||||
|
||||
norm = clip_grad_norm_(self.fp32_params, self.grad_clip)
|
||||
|
||||
if math.isfinite(norm):
|
||||
scheduler.step()
|
||||
optimizer.step()
|
||||
self.set_weights(self.fp16_model.parameters(), self.fp32_params)
|
||||
self.set_weights(self.fp16_model.parameters(),
|
||||
self.fp32_params)
|
||||
self.since_last_invalid += 1
|
||||
else:
|
||||
self.loss_scale /= self.dls_downscale
|
||||
|
@ -104,6 +118,8 @@ class Fp16Optimizer:
|
|||
logging.info(f'Upscaling, new scale: {self.loss_scale}')
|
||||
self.since_last_invalid = 0
|
||||
|
||||
self.fp16_model.zero_grad()
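The dynamic loss scaling described by the `dls_*` parameters can be sketched without PyTorch. This is a minimal illustration, not the repository's implementation: the class name `LossScaler` and the method `update` are hypothetical, and the exact upscaling condition is an assumption, since the relevant lines are elided from this diff.

```python
import math


class LossScaler:
    """Toy model of dynamic loss scaling (hypothetical class, for illustration)."""

    def __init__(self, loss_scale=8192.0, downscale=2.0, upscale=2.0,
                 upscale_interval=128):
        self.loss_scale = loss_scale
        self.downscale = downscale
        self.upscale = upscale
        self.upscale_interval = upscale_interval
        self.since_last_invalid = 0

    def update(self, grad_norm):
        """Return True if the weight update should be applied."""
        if math.isfinite(grad_norm):
            # Gradients are usable: count one more successful batch.
            self.since_last_invalid += 1
            applied = True
        else:
            # NaN/INF in the gradients: skip the update and shrink the scale.
            self.loss_scale /= self.downscale
            self.since_last_invalid = 0
            applied = False

        # Assumption: the scale grows again after a full interval of
        # successful batches, then the counter restarts.
        if self.since_last_invalid >= self.upscale_interval:
            self.loss_scale *= self.upscale
            self.since_last_invalid = 0
        return applied


scaler = LossScaler(loss_scale=8192.0, upscale_interval=3)
history = [scaler.update(norm) for norm in (float('inf'), 1.0, 2.0, 0.5)]
print(history, scaler.loss_scale)
```

A shorter `dls_upscale_interval` (128 instead of 2048 in this commit) lets the scale recover sooner after an overflow, at the cost of probing for overflows more often.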
|
||||
|
||||
|
||||
class Fp32Optimizer:
|
||||
"""
|
||||
|
@ -114,7 +130,8 @@ class Fp32Optimizer:
|
|||
Constructor for the Fp32Optimizer
|
||||
|
||||
:param model: model
|
||||
:param grad_clip: max value of gradient norm
|
||||
:param grad_clip: coefficient for gradient clipping, max L2 norm of the
|
||||
gradients
|
||||
"""
|
||||
logging.info('Initializing fp32 optimizer')
|
||||
self.initialize_model(model)
|
||||
|
@ -129,7 +146,7 @@ class Fp32Optimizer:
|
|||
self.model = model
|
||||
self.model.zero_grad()
|
||||
|
||||
def step(self, loss, optimizer, update=True):
|
||||
def step(self, loss, optimizer, scheduler, update=True):
|
||||
"""
|
||||
Performs one step of the optimizer.
|
||||
|
||||
|
@ -138,8 +155,9 @@ class Fp32Optimizer:
|
|||
:param update: if True executes weight update
|
||||
"""
|
||||
loss.backward()
|
||||
if self.grad_clip != float('inf'):
|
||||
clip_grad_norm_(self.model.parameters(), self.grad_clip)
|
||||
if update:
|
||||
if self.grad_clip != float('inf'):
|
||||
clip_grad_norm_(self.model.parameters(), self.grad_clip)
|
||||
scheduler.step()
|
||||
optimizer.step()
|
||||
self.model.zero_grad()
|
||||
self.model.zero_grad()
|
||||
|
|
98
seq2seq/train/lr_scheduler.py
Normal file
|
@ -0,0 +1,98 @@
|
|||
import logging
|
||||
import math
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
def perhaps_convert_float(param, total):
|
||||
if isinstance(param, float):
|
||||
param = int(param * total)
|
||||
return param
|
||||
|
||||
|
||||
class WarmupMultiStepLR(torch.optim.lr_scheduler._LRScheduler):
|
||||
"""
|
||||
Learning rate scheduler with exponential warmup and step decay.
|
||||
"""
|
||||
def __init__(self, optimizer, iterations, warmup_steps=0,
|
||||
remain_steps=1.0, decay_interval=None, decay_steps=4,
|
||||
decay_factor=0.5, last_epoch=-1):
|
||||
"""
|
||||
Constructor of WarmupMultiStepLR.
|
||||
|
||||
Parameters: warmup_steps, remain_steps and decay_interval accept both
|
||||
integers and floats as an input. Integer input is interpreted as
|
||||
absolute index of iteration, float input is interpreted as a fraction
|
||||
of total training iterations (epochs * steps_per_epoch).
|
||||
|
||||
If decay_interval is None then the decay will happen at regularly spaced
|
||||
intervals ('decay_steps' decays between iteration indices
|
||||
'remain_steps' and 'iterations').
|
||||
|
||||
:param optimizer: instance of optimizer
|
||||
:param iterations: total number of training iterations
|
||||
:param warmup_steps: number of warmup iterations
|
||||
:param remain_steps: start decay at 'remain_steps' iteration
|
||||
:param decay_interval: interval between LR decay steps
|
||||
:param decay_steps: max number of decay steps
|
||||
:param decay_factor: decay factor
|
||||
:param last_epoch: the index of last iteration
|
||||
"""
|
||||
|
||||
# iterations before learning rate reaches base LR
|
||||
self.warmup_steps = perhaps_convert_float(warmup_steps, iterations)
|
||||
logging.info(f'Scheduler warmup steps: {self.warmup_steps}')
|
||||
|
||||
# iteration at which decay starts
|
||||
self.remain_steps = perhaps_convert_float(remain_steps, iterations)
|
||||
logging.info(f'Scheduler remain steps: {self.remain_steps}')
|
||||
|
||||
# number of steps between each decay
|
||||
if decay_interval is None:
|
||||
# decay at regularly spaced intervals
|
||||
decay_iterations = iterations - self.remain_steps
|
||||
self.decay_interval = decay_iterations // (decay_steps)
|
||||
self.decay_interval = max(self.decay_interval, 1)
|
||||
else:
|
||||
self.decay_interval = perhaps_convert_float(decay_interval,
|
||||
iterations)
|
||||
logging.info(f'Scheduler decay interval: {self.decay_interval}')
|
||||
|
||||
# multiplicative decay factor
|
||||
self.decay_factor = decay_factor
|
||||
logging.info(f'Scheduler decay factor: {self.decay_factor}')
|
||||
|
||||
# max number of decay steps
|
||||
self.decay_steps = decay_steps
|
||||
logging.info(f'Scheduler max decay steps: {self.decay_steps}')
|
||||
|
||||
if self.warmup_steps > self.remain_steps:
|
||||
logging.warning(f'warmup_steps should not be larger than '
|
||||
f'remain_steps, setting warmup_steps=remain_steps')
|
||||
self.warmup_steps = self.remain_steps
|
||||
|
||||
super(WarmupMultiStepLR, self).__init__(optimizer, last_epoch)
|
||||
|
||||
def get_lr(self):
|
||||
if self.last_epoch <= self.warmup_steps:
|
||||
# exponential lr warmup
|
||||
if self.warmup_steps != 0:
|
||||
warmup_factor = math.exp(math.log(0.01) / self.warmup_steps)
|
||||
else:
|
||||
warmup_factor = 1.0
|
||||
inv_decay = warmup_factor ** (self.warmup_steps - self.last_epoch)
|
||||
lr = [base_lr * inv_decay for base_lr in self.base_lrs]
|
||||
|
||||
elif self.last_epoch >= self.remain_steps:
|
||||
# step decay
|
||||
decay_iter = self.last_epoch - self.remain_steps
|
||||
num_decay_steps = decay_iter // self.decay_interval + 1
|
||||
num_decay_steps = min(num_decay_steps, self.decay_steps)
|
||||
lr = [
|
||||
base_lr * (self.decay_factor ** num_decay_steps)
|
||||
for base_lr in self.base_lrs
|
||||
]
|
||||
else:
|
||||
# base lr
|
||||
lr = [base_lr for base_lr in self.base_lrs]
|
||||
return lr
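The schedule computed by `get_lr` can be restated as a pure function of the iteration index, which makes the three phases (exponential warmup, constant base LR, step decay) easier to check by hand. This is a sketch under stated assumptions: the function name `warmup_multistep_lr` and the example values are illustrative, and all step arguments are taken as already-resolved integers.

```python
import math


def warmup_multistep_lr(step, base_lr, warmup_steps, remain_steps,
                        decay_interval, decay_steps=4, decay_factor=0.5):
    """Learning rate at iteration `step` (hypothetical standalone helper)."""
    if step <= warmup_steps:
        # Exponential warmup: starts at 0.01 * base_lr, reaches base_lr
        # exactly at step == warmup_steps.
        if warmup_steps != 0:
            warmup_factor = math.exp(math.log(0.01) / warmup_steps)
        else:
            warmup_factor = 1.0
        return base_lr * warmup_factor ** (warmup_steps - step)
    if step >= remain_steps:
        # Step decay: the first decay is applied at remain_steps itself,
        # then once every decay_interval, capped at decay_steps decays.
        num_decay_steps = (step - remain_steps) // decay_interval + 1
        num_decay_steps = min(num_decay_steps, decay_steps)
        return base_lr * decay_factor ** num_decay_steps
    # Plateau between warmup and decay.
    return base_lr


config = dict(base_lr=1.0, warmup_steps=200, remain_steps=600,
              decay_interval=100)
for step in (0, 200, 400, 600, 700, 10000):
    print(step, warmup_multistep_lr(step, **config))
```

Note that the `+ 1` in the decay branch means the learning rate is already multiplied by `decay_factor` at iteration `remain_steps`, not one interval later.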
|
|
@ -1,6 +1,7 @@
|
|||
import torch
|
||||
import torch.nn as nn
|
||||
|
||||
|
||||
class LabelSmoothing(nn.Module):
|
||||
"""
|
||||
NLL loss with label smoothing.
|
||||
|
@ -18,7 +19,8 @@ class LabelSmoothing(nn.Module):
|
|||
self.smoothing = smoothing
|
||||
|
||||
def forward(self, x, target):
|
||||
logprobs = torch.nn.functional.log_softmax(x, dim=-1)
|
||||
logprobs = torch.nn.functional.log_softmax(x, dim=-1,
|
||||
dtype=torch.float32)
|
||||
|
||||
non_pad_mask = (target != self.padding_idx)
|
||||
nll_loss = -logprobs.gather(dim=-1, index=target.unsqueeze(1))
|
||||
|
|
|
@ -1,15 +1,17 @@
|
|||
import logging
|
||||
import time
|
||||
import os
|
||||
import time
|
||||
from itertools import cycle
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.optim
|
||||
import torch.utils.data
|
||||
import numpy as np
|
||||
from apex.parallel import DistributedDataParallel as DDP
|
||||
|
||||
from seq2seq.train.distributed import DistributedDataParallel as DDP
|
||||
from seq2seq.train.fp_optimizers import Fp16Optimizer, Fp32Optimizer
|
||||
from seq2seq.train.fp_optimizers import Fp16Optimizer
|
||||
from seq2seq.train.fp_optimizers import Fp32Optimizer
|
||||
from seq2seq.train.lr_scheduler import WarmupMultiStepLR
|
||||
from seq2seq.utils import AverageMeter
|
||||
from seq2seq.utils import sync_workers
|
||||
|
||||
|
@ -18,21 +20,54 @@ class Seq2SeqTrainer:
|
|||
"""
|
||||
Seq2SeqTrainer
|
||||
"""
|
||||
def __init__(self, model, criterion, opt_config,
|
||||
def __init__(self,
|
||||
model,
|
||||
criterion,
|
||||
opt_config,
|
||||
scheduler_config,
|
||||
print_freq=10,
|
||||
save_freq=1000,
|
||||
grad_clip=float('inf'),
|
||||
batch_first=False,
|
||||
save_info={},
|
||||
save_path='.',
|
||||
train_iterations=0,
|
||||
checkpoint_filename='checkpoint%s.pth',
|
||||
keep_checkpoints=5,
|
||||
math='fp32',
|
||||
cuda=True,
|
||||
distributed=False,
|
||||
intra_epoch_eval=0,
|
||||
iter_size=1,
|
||||
translator=None,
|
||||
verbose=False):
|
||||
"""
|
||||
Constructor for the Seq2SeqTrainer.
|
||||
|
||||
:param model: model to train
|
||||
:param criterion: criterion (loss function)
|
||||
:param opt_config: dictionary with options for the optimizer
|
||||
:param scheduler_config: dictionary with options for the learning rate
|
||||
scheduler
|
||||
:param print_freq: prints short summary every 'print_freq' iterations
|
||||
:param save_freq: saves checkpoint every 'save_freq' iterations
|
||||
:param grad_clip: coefficient for gradient clipping
|
||||
:param batch_first: if True the model uses (batch,seq,feature) tensors,
|
||||
if false the model uses (seq, batch, feature)
|
||||
:param save_info: dict with additional state stored in each checkpoint
|
||||
:param save_path: path to the directory for checkpoints
|
||||
:param train_iterations: total number of training iterations to execute
|
||||
:param checkpoint_filename: name of files with checkpoints
|
||||
:param keep_checkpoints: max number of checkpoints to keep
|
||||
:param math: arithmetic type
|
||||
:param cuda: if True use cuda, if False train on cpu
|
||||
:param distributed: if True run distributed training
|
||||
:param intra_epoch_eval: number of additional eval runs within each
|
||||
training epoch
|
||||
:param iter_size: number of iterations between weight updates
|
||||
:param translator: instance of Translator, runs inference on test set
|
||||
:param verbose: enables verbose logging
|
||||
"""
|
||||
super(Seq2SeqTrainer, self).__init__()
|
||||
self.model = model
|
||||
self.criterion = criterion
|
||||
|
@ -52,16 +87,19 @@ class Seq2SeqTrainer:
|
|||
self.loss = None
|
||||
self.translator = translator
|
||||
self.intra_epoch_eval = intra_epoch_eval
|
||||
self.iter_size = iter_size
|
||||
|
||||
if cuda:
|
||||
self.model = self.model.cuda()
|
||||
self.criterion = self.criterion.cuda()
|
||||
|
||||
if math == 'fp16':
|
||||
self.model = self.model.half()
|
||||
|
||||
if distributed:
|
||||
self.model = DDP(self.model)
|
||||
|
||||
if math == 'fp16':
|
||||
self.model = self.model.half()
|
||||
self.fp_optimizer = Fp16Optimizer(self.model, grad_clip)
|
||||
params = self.fp_optimizer.fp32_params
|
||||
elif math == 'fp32':
|
||||
|
@ -72,7 +110,18 @@ class Seq2SeqTrainer:
|
|||
self.optimizer = torch.optim.__dict__[opt_name](params, **opt_config)
|
||||
logging.info(f'Using optimizer: {self.optimizer}')
|
||||
|
||||
self.scheduler = WarmupMultiStepLR(self.optimizer, train_iterations,
|
||||
**scheduler_config)
|
||||
|
||||
def iterate(self, src, tgt, update=True, training=True):
|
||||
"""
|
||||
Performs one iteration of the training/validation.
|
||||
|
||||
:param src: batch of examples from the source language
|
||||
:param tgt: batch of examples from the target language
|
||||
:param update: if True, the optimizer updates the weights
|
||||
:param training: if True, executes the optimizer
|
||||
"""
|
||||
src, src_length = src
|
||||
tgt, tgt_length = tgt
|
||||
src_length = torch.LongTensor(src_length)
|
||||
|
@ -96,14 +145,15 @@ class Seq2SeqTrainer:
|
|||
tgt_labels = tgt[1:]
|
||||
T, B = output.size(0), output.size(1)
|
||||
|
||||
loss = self.criterion(output.view(T * B, -1).float(),
|
||||
loss = self.criterion(output.view(T * B, -1),
|
||||
tgt_labels.contiguous().view(-1))
|
||||
|
||||
loss_per_batch = loss.item()
|
||||
loss /= B
|
||||
loss /= (B * self.iter_size)
|
||||
|
||||
if training:
|
||||
self.fp_optimizer.step(loss, self.optimizer, update)
|
||||
self.fp_optimizer.step(loss, self.optimizer, self.scheduler,
|
||||
update)
|
||||
|
||||
loss_per_token = loss_per_batch / num_toks['tgt']
|
||||
loss_per_sentence = loss_per_batch / B
|
||||
|
@ -120,13 +170,15 @@ class Seq2SeqTrainer:
|
|||
if training:
|
||||
assert self.optimizer is not None
|
||||
eval_fractions = np.linspace(0, 1, self.intra_epoch_eval+2)[1:-1]
|
||||
eval_iters = (eval_fractions * len(data_loader)).astype(int)
|
||||
iters_with_update = len(data_loader) // self.iter_size
|
||||
eval_iters = (eval_fractions * iters_with_update).astype(int)
|
||||
eval_iters = eval_iters * self.iter_size
|
||||
eval_iters = set(eval_iters)
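The effect of rescaling `eval_iters` by `iter_size` is that intra-epoch evaluations land on multiples of `iter_size`. The following sketch reproduces the arithmetic without NumPy; the function name `intra_epoch_eval_iters` is hypothetical, and the plain-Python fractions are an assumed stand-in for the interior points of `np.linspace(0, 1, n + 2)`, which could differ in the last floating-point bit.

```python
def intra_epoch_eval_iters(num_batches, intra_epoch_eval, iter_size):
    """Iteration indices at which an intra-epoch evaluation is triggered."""
    # Only every iter_size-th batch performs a weight update.
    iters_with_update = num_batches // iter_size
    # Interior points of linspace(0, 1, n + 2): k / (n + 1) for k = 1..n.
    fractions = [k / (intra_epoch_eval + 1)
                 for k in range(1, intra_epoch_eval + 1)]
    # Truncate to an update index, then map back to a batch index.
    return {int(f * iters_with_update) * iter_size for f in fractions}


print(sorted(intra_epoch_eval_iters(1000, 3, 4)))
```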
|
||||
|
||||
batch_time = AverageMeter()
|
||||
data_time = AverageMeter()
|
||||
losses_per_token = AverageMeter()
|
||||
losses_per_sentence = AverageMeter()
|
||||
losses_per_token = AverageMeter(skip_first=False)
|
||||
losses_per_sentence = AverageMeter(skip_first=False)
|
||||
|
||||
tot_tok_time = AverageMeter()
|
||||
src_tok_time = AverageMeter()
|
||||
|
@ -140,8 +192,12 @@ class Seq2SeqTrainer:
|
|||
# measure data loading time
|
||||
data_time.update(time.time() - end)
|
||||
|
||||
update = False
|
||||
if i % self.iter_size == self.iter_size - 1:
|
||||
update = True
|
||||
|
||||
# do a train/evaluate iteration
|
||||
stats = self.iterate(src, tgt, training=training)
|
||||
stats = self.iterate(src, tgt, update, training=training)
|
||||
loss_per_token, loss_per_sentence, num_toks = stats
|
||||
|
||||
# measure accuracy and record loss
|
||||
|
@ -176,13 +232,16 @@ class Seq2SeqTrainer:
|
|||
log = []
|
||||
log += [f'{phase} [{self.epoch}][{i}/{len(data_loader)}]']
|
||||
log += [f'Time {batch_time.val:.3f} ({batch_time.avg:.3f})']
|
||||
log += [f'Data {data_time.val:.3f} ({data_time.avg:.3f})']
|
||||
log += [f'Data {data_time.val:.2e} ({data_time.avg:.2e})']
|
||||
log += [f'Tok/s {tot_tok_time.val:.0f} ({tot_tok_time.avg:.0f})']
|
||||
if self.verbose:
|
||||
log += [f'Src tok/s {src_tok_time.val:.0f} ({src_tok_time.avg:.0f})']
|
||||
log += [f'Tgt tok/s {tgt_tok_time.val:.0f} ({tgt_tok_time.avg:.0f})']
|
||||
log += [f'Loss/sentence {losses_per_sentence.val:.1f} ({losses_per_sentence.avg:.1f})']
|
||||
log += [f'Loss/tok {losses_per_token.val:.4f} ({losses_per_token.avg:.4f})']
|
||||
if training:
|
||||
lr = self.optimizer.param_groups[0]['lr']
|
||||
log += [f'LR {lr:.3e}']
|
||||
log = '\t'.join(log)
|
||||
logging.info(log)
|
||||
|
||||
|
@ -198,9 +257,8 @@ class Seq2SeqTrainer:
|
|||
|
||||
end = time.time()
|
||||
|
||||
if training:
|
||||
tot_tok_time.reduce('sum')
|
||||
losses_per_token.reduce('mean')
|
||||
tot_tok_time.reduce('sum')
|
||||
losses_per_token.reduce('mean')
|
||||
|
||||
return losses_per_token.avg, tot_tok_time.avg
|
||||
|
||||
|
@ -228,6 +286,7 @@ class Seq2SeqTrainer:
|
|||
src = src, src_length
|
||||
tgt = tgt, tgt_length
|
||||
self.iterate(src, tgt, update=False, training=training)
|
||||
self.model.zero_grad()
|
||||
|
||||
def optimize(self, data_loader):
|
||||
"""
|
||||
|
@ -241,6 +300,7 @@ class Seq2SeqTrainer:
|
|||
torch.cuda.empty_cache()
|
||||
self.preallocate(data_loader, training=True)
|
||||
output = self.feed_data(data_loader, training=True)
|
||||
self.model.zero_grad()
|
||||
torch.cuda.empty_cache()
|
||||
return output
|
||||
|
||||
|
@ -256,6 +316,7 @@ class Seq2SeqTrainer:
|
|||
torch.cuda.empty_cache()
|
||||
self.preallocate(data_loader, training=False)
|
||||
output = self.feed_data(data_loader, training=False)
|
||||
self.model.zero_grad()
|
||||
torch.cuda.empty_cache()
|
||||
return output
|
||||
|
||||
|
@ -267,9 +328,13 @@ class Seq2SeqTrainer:
|
|||
"""
|
||||
if os.path.isfile(filename):
|
||||
checkpoint = torch.load(filename, map_location={'cuda:0': 'cpu'})
|
||||
self.model.load_state_dict(checkpoint['state_dict'])
|
||||
if self.distributed:
|
||||
self.model.module.load_state_dict(checkpoint['state_dict'])
|
||||
else:
|
||||
self.model.load_state_dict(checkpoint['state_dict'])
|
||||
self.fp_optimizer.initialize_model(self.model)
|
||||
self.optimizer.load_state_dict(checkpoint['optimizer'])
|
||||
self.scheduler.load_state_dict(checkpoint['scheduler'])
|
||||
self.epoch = checkpoint['epoch']
|
||||
self.loss = checkpoint['loss']
|
||||
logging.info(f'Loaded checkpoint {filename} (epoch {self.epoch})')
|
||||
|
@ -291,10 +356,16 @@ class Seq2SeqTrainer:
|
|||
logging.info(f'Saving model to {filename}')
|
||||
torch.save(state, filename)
|
||||
|
||||
if self.distributed:
|
||||
model_state = self.model.module.state_dict()
|
||||
else:
|
||||
model_state = self.model.state_dict()
|
||||
|
||||
state = {
|
||||
'epoch': self.epoch,
|
||||
'state_dict': self.model.state_dict(),
|
||||
'state_dict': model_state,
|
||||
'optimizer': self.optimizer.state_dict(),
|
||||
'scheduler': self.scheduler.state_dict(),
|
||||
'loss': getattr(self, 'loss', None),
|
||||
}
|
||||
state = dict(list(state.items()) + list(self.save_info.items()))
|
||||
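Note on the trainer.py hunks above: they add gradient accumulation. The optimizer updates only on the last of every `iter_size` batches, the loss is divided by `B * iter_size`, and intra-epoch evaluation points are computed over optimizer updates rather than raw batches. The index arithmetic can be sketched in plain Python as below. This is a minimal illustration, assuming hypothetical helpers `accumulation_schedule` and `scaled_loss` that are not part of the commit, and it uses pure-Python fractions in place of `np.linspace`, so rounding may differ slightly for some inputs.

```python
def accumulation_schedule(num_batches, iter_size, intra_epoch_eval=0):
    """Return (update_iters, eval_iters) for one epoch.

    Hypothetical helper: it mirrors the index arithmetic shown in the diff,
    it is not the trainer's actual API.
    """
    # The optimizer steps on the last micro-batch of every group of iter_size.
    update_iters = [i for i in range(num_batches)
                    if i % iter_size == iter_size - 1]

    # Evaluation points are spaced over optimizer updates, then mapped back
    # to batch indices by multiplying with iter_size.
    iters_with_update = num_batches // iter_size
    fractions = [k / (intra_epoch_eval + 1)
                 for k in range(1, intra_epoch_eval + 1)]
    eval_iters = {int(f * iters_with_update) * iter_size for f in fractions}
    return update_iters, eval_iters


def scaled_loss(summed_loss, batch_size, iter_size):
    # Dividing by iter_size as well as batch_size makes the accumulated
    # gradient an average over the micro-batches instead of a sum.
    return summed_loss / (batch_size * iter_size)
```

With 100 batches, `iter_size=4` and three intra-epoch evaluations, the optimizer steps 25 times and the evaluation indices are multiples of 4, so an evaluation never lands in the middle of an accumulation group.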
|
|
225
seq2seq/utils.py
225
seq2seq/utils.py
|
@ -1,9 +1,111 @@
|
|||
from contextlib import contextmanager
|
||||
import logging.config
|
||||
import os
|
||||
import random
|
||||
import sys
|
||||
import time
|
||||
from contextlib import contextmanager
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
import torch.nn.init as init
|
||||
import torch.utils.collect_env
|
||||
|
||||
|
||||
def init_lstm_(lstm, init_weight=0.1):
|
||||
"""
|
||||
Initializes weights of LSTM layer.
|
||||
Weights and biases are initialized with uniform(-init_weight, init_weight)
|
||||
distribution.
|
||||
|
||||
:param lstm: instance of torch.nn.LSTM
|
||||
:param init_weight: range for the uniform initializer
|
||||
"""
|
||||
# Initialize hidden-hidden weights
|
||||
init.uniform_(lstm.weight_hh_l0.data, -init_weight, init_weight)
|
||||
# Initialize input-hidden weights:
|
||||
init.uniform_(lstm.weight_ih_l0.data, -init_weight, init_weight)
|
||||
|
||||
# Initialize bias. PyTorch LSTM has two biases, one for input-hidden GEMM
|
||||
# and the other for hidden-hidden GEMM. Here input-hidden bias is
|
||||
# initialized with uniform distribution and hidden-hidden bias is
|
||||
# initialized with zeros.
|
||||
init.uniform_(lstm.bias_ih_l0.data, -init_weight, init_weight)
|
||||
init.zeros_(lstm.bias_hh_l0.data)
|
||||
|
||||
if lstm.bidirectional:
|
||||
init.uniform_(lstm.weight_hh_l0_reverse.data, -init_weight, init_weight)
|
||||
init.uniform_(lstm.weight_ih_l0_reverse.data, -init_weight, init_weight)
|
||||
|
||||
init.uniform_(lstm.bias_ih_l0_reverse.data, -init_weight, init_weight)
|
||||
init.zeros_(lstm.bias_hh_l0_reverse.data)
|
||||
|
||||
|
||||
def generate_seeds(rng, size):
|
||||
"""
|
||||
Generate list of random seeds
|
||||
|
||||
:param rng: random number generator
|
||||
:param size: length of the returned list
|
||||
"""
|
||||
seeds = [rng.randint(0, 2**32 - 1) for _ in range(size)]
|
||||
return seeds
|
||||
|
||||
|
||||
def broadcast_seeds(seeds, device):
|
||||
"""
|
||||
Broadcasts random seeds to all distributed workers.
|
||||
Returns list of random seeds (broadcast from the worker with rank 0).
|
||||
|
||||
:param seeds: list of seeds (integers)
|
||||
:param device: torch.device
|
||||
"""
|
||||
if torch.distributed.is_available() and torch.distributed.is_initialized():
|
||||
seeds_tensor = torch.LongTensor(seeds).to(device)
|
||||
torch.distributed.broadcast(seeds_tensor, 0)
|
||||
seeds = seeds_tensor.tolist()
|
||||
return seeds
|
||||
|
||||
|
||||
def setup_seeds(master_seed, epochs, device):
|
||||
"""
|
||||
Generates seeds from one master_seed.
|
||||
Function returns (worker_seeds, shuffling_seeds), worker_seeds are later
|
||||
used to initialize per-worker random number generators (mostly for
|
||||
dropouts), shuffling_seeds are for RNGs responsible for reshuffling the
|
||||
dataset before each epoch.
|
||||
Seeds are generated on worker with rank 0 and broadcasted to all other
|
||||
workers.
|
||||
|
||||
:param master_seed: master RNG seed used to initialize other generators
|
||||
:param epochs: number of epochs
|
||||
:param device: torch.device (used for distributed.broadcast)
|
||||
"""
|
||||
if master_seed is None:
|
||||
# random master seed, random.SystemRandom() uses /dev/urandom on Unix
|
||||
master_seed = random.SystemRandom().randint(0, 2**32 - 1)
|
||||
if get_rank() == 0:
|
||||
# master seed is reported only from rank=0 worker, it's to avoid
|
||||
# confusion, seeds from rank=0 are later broadcasted to other
|
||||
# workers
|
||||
logging.info(f'Using random master seed: {master_seed}')
|
||||
else:
|
||||
# master seed was specified from command line
|
||||
logging.info(f'Using master seed from command line: {master_seed}')
|
||||
|
||||
# initialize seeding RNG
|
||||
seeding_rng = random.Random(master_seed)
|
||||
|
||||
# generate worker seeds, one seed for every distributed worker
|
||||
worker_seeds = generate_seeds(seeding_rng, get_world_size())
|
||||
|
||||
# generate seeds for data shuffling, one seed for every epoch
|
||||
shuffling_seeds = generate_seeds(seeding_rng, epochs)
|
||||
|
||||
# broadcast seeds from rank=0 to other workers
|
||||
worker_seeds = broadcast_seeds(worker_seeds, device)
|
||||
shuffling_seeds = broadcast_seeds(shuffling_seeds, device)
|
||||
return worker_seeds, shuffling_seeds
|
||||
|
||||
|
||||
def barrier():
|
||||
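Note on `setup_seeds` above: every seed comes from a single seeding RNG, first one seed per worker and then one per epoch, so a fixed master seed reproduces the whole run. The sketch below shows only that derivation step in a single process. It assumes a hypothetical wrapper `derive_seeds` and leaves out the `torch.distributed.broadcast` step entirely.

```python
import random


def generate_seeds(rng, size):
    # One 32-bit seed per requested slot, drawn from the seeding RNG.
    return [rng.randint(0, 2**32 - 1) for _ in range(size)]


def derive_seeds(master_seed, world_size, epochs):
    """Single-process stand-in for setup_seeds (no broadcast step).

    Hypothetical name; the commit's function also handles a missing
    master seed and broadcasts the result from rank 0.
    """
    seeding_rng = random.Random(master_seed)
    # Order matters: worker seeds are drawn before shuffling seeds.
    worker_seeds = generate_seeds(seeding_rng, world_size)
    shuffling_seeds = generate_seeds(seeding_rng, epochs)
    return worker_seeds, shuffling_seeds
```

Because the worker seeds are drawn first, changing the number of workers also changes the shuffling seeds for the same master seed.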
|
@ -12,7 +114,7 @@ def barrier():
|
|||
doesn't implement barrier for NCCL backend.
|
||||
Calls all_reduce on dummy tensor and synchronizes with GPU.
|
||||
"""
|
||||
if torch.distributed.is_initialized():
|
||||
if torch.distributed.is_available() and torch.distributed.is_initialized():
|
||||
torch.distributed.all_reduce(torch.cuda.FloatTensor(1))
|
||||
torch.cuda.synchronize()
|
||||
|
||||
|
@ -21,7 +123,7 @@ def get_rank():
|
|||
"""
|
||||
Gets distributed rank or returns zero if distributed is not initialized.
|
||||
"""
|
||||
if torch.distributed.is_initialized():
|
||||
if torch.distributed.is_available() and torch.distributed.is_initialized():
|
||||
rank = torch.distributed.get_rank()
|
||||
else:
|
||||
rank = 0
|
||||
|
@ -33,7 +135,7 @@ def get_world_size():
|
|||
Gets total number of distributed workers or returns one if distributed is
|
||||
not initialized.
|
||||
"""
|
||||
if torch.distributed.is_initialized():
|
||||
if torch.distributed.is_available() and torch.distributed.is_initialized():
|
||||
world_size = torch.distributed.get_world_size()
|
||||
else:
|
||||
world_size = 1
|
||||
|
@ -50,7 +152,20 @@ def sync_workers():
|
|||
barrier()
|
||||
|
||||
|
||||
def setup_logging(log_file='log.log'):
|
||||
@contextmanager
|
||||
def timer(name, ndigits=2, sync_gpu=True):
|
||||
if sync_gpu:
|
||||
torch.cuda.synchronize()
|
||||
start = time.time()
|
||||
yield
|
||||
if sync_gpu:
|
||||
torch.cuda.synchronize()
|
||||
stop = time.time()
|
||||
elapsed = round(stop - start, ndigits)
|
||||
logging.info(f'TIMER {name} {elapsed}')
|
||||
|
||||
|
||||
def setup_logging(log_file=os.devnull):
|
||||
"""
|
||||
Configures logging.
|
||||
By default logs from all workers are printed to the console, entries are
|
||||
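Note on the `timer` helper added in this hunk: a function wrapped with `contextlib.contextmanager` can be used both in a `with` statement and as a decorator, which is how train.py applies it to `main()`. The variant below is a sketch rather than the commit's code. It replaces `torch.cuda.synchronize` with a hypothetical `sync` callable so it runs without a GPU, and it adds a `try`/`finally` so the elapsed time is also logged when the wrapped block raises.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timer(name, ndigits=2, sync=None):
    """Log elapsed wall-clock time for the wrapped block.

    `sync` is a hypothetical hook standing in for torch.cuda.synchronize.
    """
    if sync is not None:
        sync()
    start = time.time()
    try:
        yield
    finally:
        # Unlike the version in the diff, this runs even if the block raises.
        if sync is not None:
            sync()
        elapsed = round(time.time() - start, ndigits)
        logging.info(f'TIMER {name} {elapsed}')


@timer('square')  # decorator form, as used on main() in train.py
def square(x):
    return x * x
```

The same object works as `with timer('epoch'): ...` around an arbitrary block.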
|
@ -69,12 +184,13 @@ def setup_logging(log_file='log.log'):
|
|||
rank = get_rank()
|
||||
rank_filter = RankFilter(rank)
|
||||
|
||||
logging_format = "%(asctime)s - %(levelname)s - %(rank)s - %(message)s"
|
||||
logging.basicConfig(level=logging.DEBUG,
|
||||
format="%(asctime)s - %(levelname)s - %(rank)s - %(message)s",
|
||||
format=logging_format,
|
||||
datefmt="%Y-%m-%d %H:%M:%S",
|
||||
filename=log_file,
|
||||
filemode='w')
|
||||
console = logging.StreamHandler()
|
||||
console = logging.StreamHandler(sys.stdout)
|
||||
console.setLevel(logging.INFO)
|
||||
formatter = logging.Formatter('%(rank)s: %(message)s')
|
||||
console.setFormatter(formatter)
|
||||
|
@ -82,6 +198,73 @@ def setup_logging(log_file='log.log'):
|
|||
logging.getLogger('').addFilter(rank_filter)
|
||||
|
||||
|
||||
def set_device(cuda, local_rank):
|
||||
"""
|
||||
Sets device based on local_rank and returns instance of torch.device.
|
||||
|
||||
:param cuda: if True: use cuda
|
||||
:param local_rank: local rank of the worker
|
||||
"""
|
||||
if cuda:
|
||||
torch.cuda.set_device(local_rank)
|
||||
device = torch.device('cuda')
|
||||
else:
|
||||
device = torch.device('cpu')
|
||||
return device
|
||||
|
||||
|
||||
def init_distributed(cuda):
|
||||
"""
|
||||
Initializes distributed backend.
|
||||
|
||||
:param cuda: (bool) if True initializes nccl backend, if False initializes
|
||||
gloo backend
|
||||
"""
|
||||
world_size = int(os.environ.get('WORLD_SIZE', 1))
|
||||
distributed = (world_size > 1)
|
||||
if distributed:
|
||||
backend = 'nccl' if cuda else 'gloo'
|
||||
dist.init_process_group(backend=backend,
|
||||
init_method='env://')
|
||||
assert dist.is_initialized()
|
||||
return distributed
|
||||
|
||||
|
||||
def log_env_info():
|
||||
"""
|
||||
Prints information about execution environment.
|
||||
"""
|
||||
logging.info('Collecting environment information...')
|
||||
env_info = torch.utils.collect_env.get_pretty_env_info()
|
||||
logging.info(f'{env_info}')
|
||||
|
||||
|
||||
def pad_vocabulary(math):
|
||||
if math == 'fp16':
|
||||
pad_vocab = 8
|
||||
elif math == 'fp32':
|
||||
pad_vocab = 1
|
||||
return pad_vocab
|
||||
|
||||
|
||||
def benchmark(test_acc, target_acc, test_perf, target_perf):
|
||||
def test(achieved, target, name):
|
||||
passed = True
|
||||
if target is not None and achieved is not None:
|
||||
logging.info(f'{name} achieved: {achieved:.2f} '
|
||||
f'target: {target:.2f}')
|
||||
if achieved >= target:
|
||||
logging.info(f'{name} test passed')
|
||||
else:
|
||||
logging.info(f'{name} test failed')
|
||||
passed = False
|
||||
return passed
|
||||
|
||||
passed = True
|
||||
passed &= test(test_acc, target_acc, 'Accuracy')
|
||||
passed &= test(test_perf, target_perf, 'Performance')
|
||||
return passed
|
||||
|
||||
class AverageMeter:
|
||||
"""
|
||||
Computes and stores the average and current value
|
||||
|
@ -117,18 +300,42 @@ class AverageMeter:
|
|||
|
||||
distributed = (get_world_size() > 1)
|
||||
if distributed:
|
||||
cuda = (dist._backend == dist.dist_backend.NCCL)
|
||||
# Backward/forward compatibility around
|
||||
# https://github.com/pytorch/pytorch/commit/540ef9b1fc5506369a48491af8a285a686689b36 and
|
||||
# https://github.com/pytorch/pytorch/commit/044d00516ccd6572c0d6ab6d54587155b02a3b86
|
||||
# To accommodate change in PyTorch's distributed API
|
||||
if hasattr(dist, "get_backend"):
|
||||
_backend = dist.get_backend()
|
||||
if hasattr(dist, "DistBackend"):
|
||||
backend_enum_holder = dist.DistBackend
|
||||
else:
|
||||
backend_enum_holder = dist.Backend
|
||||
else:
|
||||
_backend = dist._backend
|
||||
backend_enum_holder = dist.dist_backend
|
||||
|
||||
cuda = _backend == backend_enum_holder.NCCL
|
||||
|
||||
if cuda:
|
||||
avg = torch.cuda.FloatTensor([self.avg])
|
||||
_sum = torch.cuda.FloatTensor([self.sum])
|
||||
else:
|
||||
avg = torch.FloatTensor([self.avg])
|
||||
_sum = torch.FloatTensor([self.sum])
|
||||
|
||||
dist.all_reduce(avg, op=dist.reduce_op.SUM)
|
||||
try:
|
||||
_reduce_op = dist.ReduceOp
|
||||
except AttributeError:
|
||||
_reduce_op = dist.reduce_op
|
||||
|
||||
dist.all_reduce(avg, op=_reduce_op.SUM)
|
||||
dist.all_reduce(_sum, op=_reduce_op.SUM)
|
||||
self.avg = avg.item()
|
||||
self.sum = _sum.item()
|
||||
|
||||
if op == 'mean':
|
||||
self.avg /= get_world_size()
|
||||
self.sum /= get_world_size()
|
||||
|
||||
|
||||
def debug_tensor(tensor, name):
|
||||
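Note on the `AverageMeter.reduce` hunk above: it picks whichever attribute name the installed PyTorch exposes (`ReduceOp` or the older `reduce_op`) instead of pinning one version. The pattern is shown below on stand-in objects. `new_api` and `old_api` are illustrative `SimpleNamespace` values, not real `torch.distributed` modules, and `resolve_reduce_op` is a hypothetical name.

```python
from types import SimpleNamespace


def resolve_reduce_op(dist):
    """Return the reduce-op holder from a torch.distributed-like object."""
    try:
        return dist.ReduceOp    # newer attribute name
    except AttributeError:
        return dist.reduce_op   # older attribute name


# Stand-in objects for illustration only; these are NOT real torch modules.
new_api = SimpleNamespace(ReduceOp=SimpleNamespace(SUM='sum-new'))
old_api = SimpleNamespace(reduce_op=SimpleNamespace(SUM='sum-old'))
```

Catching `AttributeError` around the lookup keeps the call site (`dist.all_reduce(..., op=_reduce_op.SUM)`) identical for both API generations.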
|
|
417
train.py
417
train.py
|
@ -1,130 +1,187 @@
|
|||
#!/usr/bin/env python
|
||||
import argparse
|
||||
import os
|
||||
import logging
|
||||
import os
|
||||
import sys
|
||||
from ast import literal_eval
|
||||
|
||||
import torch.nn as nn
|
||||
import torch.nn.parallel
|
||||
import torch.utils.data.distributed
|
||||
import torch.distributed as dist
|
||||
import torch.optim
|
||||
import torch.utils.data.distributed
|
||||
|
||||
from seq2seq.models.gnmt import GNMT
|
||||
from seq2seq.train.smoothing import LabelSmoothing
|
||||
from seq2seq.data.dataset import TextDataset
|
||||
from seq2seq.data.dataset import ParallelDataset
|
||||
from seq2seq.data.tokenizer import Tokenizer
|
||||
from seq2seq.utils import setup_logging
|
||||
import seq2seq.data.config as config
|
||||
import seq2seq.train.trainer as trainers
|
||||
import seq2seq.utils as utils
|
||||
from seq2seq.data.dataset import LazyParallelDataset
|
||||
from seq2seq.data.dataset import ParallelDataset
|
||||
from seq2seq.data.dataset import TextDataset
|
||||
from seq2seq.data.tokenizer import Tokenizer
|
||||
from seq2seq.inference.inference import Translator
|
||||
from seq2seq.models.gnmt import GNMT
|
||||
from seq2seq.train.smoothing import LabelSmoothing
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""
|
||||
Parse commandline arguments.
|
||||
"""
|
||||
parser = argparse.ArgumentParser(description='GNMT training',
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
def exclusive_group(group, name, default, help):
|
||||
destname = name.replace('-', '_')
|
||||
subgroup = group.add_mutually_exclusive_group(required=False)
|
||||
subgroup.add_argument(f'--{name}', dest=f'{destname}',
|
||||
action='store_true',
|
||||
help=f'{help} (use \'--no-{name}\' to disable)')
|
||||
subgroup.add_argument(f'--no-{name}', dest=f'{destname}',
|
||||
action='store_false', help=argparse.SUPPRESS)
|
||||
subgroup.set_defaults(**{destname: default})
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description='GNMT training',
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
|
||||
# dataset
|
||||
dataset = parser.add_argument_group('dataset setup')
|
||||
dataset.add_argument('--dataset-dir', default=None, required=True,
|
||||
help='path to directory with training/validation data')
|
||||
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en',
|
||||
help='path to the directory with training/test data')
|
||||
dataset.add_argument('--max-size', default=None, type=int,
|
||||
help='use at most MAX_SIZE elements from training \
|
||||
dataset (useful for benchmarking), by default \
|
||||
uses entire dataset')
|
||||
dataset (useful for benchmarking), by default \
|
||||
uses entire dataset')
|
||||
|
||||
# results
|
||||
results = parser.add_argument_group('results setup')
|
||||
results.add_argument('--results-dir', default='results',
|
||||
help='path to directory with results, it it will be \
|
||||
automatically created if does not exist')
|
||||
results.add_argument('--save', default='gnmt_wmt16',
|
||||
help='path to directory with results, it will be \
|
||||
automatically created if it does not exist')
|
||||
results.add_argument('--save', default='gnmt',
|
||||
help='defines subdirectory within RESULTS_DIR for \
|
||||
results from this training run')
|
||||
results from this training run')
|
||||
results.add_argument('--print-freq', default=10, type=int,
|
||||
help='print log every PRINT_FREQ batches')
|
||||
|
||||
# model
|
||||
model = parser.add_argument_group('model setup')
|
||||
model.add_argument('--model-config',
|
||||
default="{'hidden_size': 1024,'num_layers': 4, \
|
||||
'dropout': 0.2, 'share_embedding': True}",
|
||||
help='GNMT architecture configuration')
|
||||
model.add_argument('--hidden-size', default=1024, type=int,
|
||||
help='model hidden size')
|
||||
model.add_argument('--num-layers', default=4, type=int,
|
||||
help='number of RNN layers in encoder and in decoder')
|
||||
model.add_argument('--dropout', default=0.2, type=float,
|
||||
help='dropout applied to input of RNN cells')
|
||||
|
||||
exclusive_group(group=model, name='share-embedding', default=True,
|
||||
help='use shared embeddings for encoder and decoder')
|
||||
|
||||
model.add_argument('--smoothing', default=0.1, type=float,
|
||||
help='label smoothing, if equal to zero model will use \
|
||||
CrossEntropyLoss, if not zero model will be trained \
|
||||
with label smoothing loss')
|
||||
CrossEntropyLoss, if not zero model will be trained \
|
||||
with label smoothing loss')
|
||||
|
||||
# setup
|
||||
general = parser.add_argument_group('general setup')
|
||||
general.add_argument('--math', default='fp16', choices=['fp32', 'fp16'],
|
||||
general.add_argument('--math', default='fp16', choices=['fp16', 'fp32'],
|
||||
help='arithmetic type')
|
||||
general.add_argument('--seed', default=None, type=int,
|
||||
help='set random number generator seed')
|
||||
general.add_argument('--disable-eval', action='store_true', default=False,
|
||||
help='disables validation after every epoch')
|
||||
general.add_argument('--workers', default=0, type=int,
|
||||
help='number of workers for data loading')
|
||||
help='master seed for random number generators, if \
|
||||
"seed" is undefined then the master seed will be \
|
||||
sampled from random.SystemRandom()')
|
||||
|
||||
cuda_parser = general.add_mutually_exclusive_group(required=False)
|
||||
cuda_parser.add_argument('--cuda', dest='cuda', action='store_true',
|
||||
help='enables cuda (use \'--no-cuda\' to disable)')
|
||||
cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false',
|
||||
help=argparse.SUPPRESS)
|
||||
cuda_parser.set_defaults(cuda=True)
|
||||
|
||||
cudnn_parser = general.add_mutually_exclusive_group(required=False)
|
||||
cudnn_parser.add_argument('--cudnn', dest='cudnn', action='store_true',
|
||||
help='enables cudnn (use \'--no-cudnn\' to disable)')
|
||||
cudnn_parser.add_argument('--no-cudnn', dest='cudnn', action='store_false',
|
||||
help=argparse.SUPPRESS)
|
||||
cudnn_parser.set_defaults(cudnn=True)
|
||||
exclusive_group(group=general, name='eval', default=True,
|
||||
help='run validation and test after every epoch')
|
||||
exclusive_group(group=general, name='env', default=True,
|
||||
help='print info about execution env')
|
||||
exclusive_group(group=general, name='cuda', default=True,
|
||||
help='enables cuda')
|
||||
exclusive_group(group=general, name='cudnn', default=True,
|
||||
help='enables cudnn')
|
||||
|
||||
# training
|
||||
training = parser.add_argument_group('training setup')
|
||||
training.add_argument('--batch-size', default=128, type=int,
|
||||
help='batch size for training')
|
||||
training.add_argument('--epochs', default=8, type=int,
|
||||
help='number of total epochs to run')
|
||||
training.add_argument('--optimization-config',
|
||||
default="{'optimizer': 'Adam', 'lr': 5e-4}", type=str,
|
||||
help='optimizer config')
|
||||
training.add_argument('--grad-clip', default=5.0, type=float,
|
||||
help='enabled gradient clipping and sets maximum \
|
||||
gradient norm value')
|
||||
training.add_argument('--max-length-train', default=50, type=int,
|
||||
help='maximum sequence length for training')
|
||||
training.add_argument('--min-length-train', default=0, type=int,
|
||||
help='minimum sequence length for training')
|
||||
training.add_argument('--train-batch-size', default=128, type=int,
|
||||
help='training batch size per worker')
|
||||
training.add_argument('--train-global-batch-size', default=None, type=int,
|
||||
help='global training batch size, this argument \
|
||||
does not have to be defined, if it is defined it \
|
||||
will be used to automatically \
|
||||
compute train_iter_size \
|
||||
using the equation: train_iter_size = \
|
||||
train_global_batch_size // (train_batch_size * \
|
||||
world_size)')
|
||||
training.add_argument('--train-iter-size', metavar='N', default=1,
|
||||
type=int,
|
||||
help='training iter size, training loop will \
|
||||
accumulate gradients over N iterations and execute \
|
||||
optimizer every N steps')
|
||||
training.add_argument('--epochs', default=6, type=int,
|
||||
help='max number of training epochs')
|
||||
|
||||
bucketing_parser = training.add_mutually_exclusive_group(required=False)
|
||||
bucketing_parser.add_argument('--bucketing', dest='bucketing', action='store_true',
|
||||
help='enables bucketing (use \'--no-bucketing\' to disable)')
|
||||
bucketing_parser.add_argument('--no-bucketing', dest='bucketing', action='store_false',
|
||||
help=argparse.SUPPRESS)
|
||||
bucketing_parser.set_defaults(bucketing=True)
|
||||
training.add_argument('--grad-clip', default=5.0, type=float,
|
||||
help='enables gradient clipping and sets maximum \
|
||||
norm of gradients')
|
||||
training.add_argument('--max-length-train', default=50, type=int,
|
||||
help='maximum sequence length for training \
|
||||
(including special BOS and EOS tokens)')
|
||||
training.add_argument('--min-length-train', default=0, type=int,
|
||||
help='minimum sequence length for training \
|
||||
(including special BOS and EOS tokens)')
|
||||
training.add_argument('--train-loader-workers', default=2, type=int,
|
||||
help='number of workers for training data loading')
|
||||
training.add_argument('--batching', default='bucketing', type=str,
|
||||
choices=['random', 'sharding', 'bucketing'],
|
||||
help='select batching algorithm')
|
||||
training.add_argument('--shard-size', default=80, type=int,
|
||||
help='shard size for "sharding" batching algorithm, \
|
||||
in multiples of global batch size')
|
||||
training.add_argument('--num-buckets', default=5, type=int,
|
||||
help='number of buckets for "bucketing" batching \
|
||||
algorithm')
|
||||
|
||||
# optimizer
|
||||
optimizer = parser.add_argument_group('optimizer setup')
|
||||
optimizer.add_argument('--optimizer', type=str, default='Adam',
|
||||
help='training optimizer')
|
||||
optimizer.add_argument('--lr', type=float, default=2.00e-3,
|
||||
help='learning rate')
|
||||
optimizer.add_argument('--optimizer-extra', type=str,
|
||||
default="{}",
|
||||
help='extra options for the optimizer')
|
||||
|
||||
# scheduler
|
||||
scheduler = parser.add_argument_group('learning rate scheduler setup')
|
||||
scheduler.add_argument('--warmup-steps', type=str, default='200',
|
||||
help='number of learning rate warmup iterations')
|
||||
scheduler.add_argument('--remain-steps', type=str, default='0.666',
|
||||
help='starting iteration for learning rate decay')
|
||||
scheduler.add_argument('--decay-interval', type=str, default='None',
|
||||
help='interval between learning rate decay steps')
|
||||
scheduler.add_argument('--decay-steps', type=int, default=4,
|
||||
help='max number of learning rate decay steps')
|
||||
scheduler.add_argument('--decay-factor', type=float, default=0.5,
|
||||
help='learning rate decay factor')
|
||||
|
||||
# validation
|
||||
validation = parser.add_argument_group('validation setup')
|
||||
validation.add_argument('--val-batch-size', default=128, type=int,
|
||||
help='batch size for validation')
|
||||
validation.add_argument('--max-length-val', default=80, type=int,
|
||||
help='maximum sequence length for validation')
|
||||
validation.add_argument('--min-length-val', default=0, type=int,
|
||||
help='minimum sequence length for validation')
|
||||
val = parser.add_argument_group('validation setup')
|
||||
val.add_argument('--val-batch-size', default=64, type=int,
|
||||
help='batch size for validation')
|
||||
val.add_argument('--max-length-val', default=125, type=int,
|
||||
help='maximum sequence length for validation \
|
||||
(including special BOS and EOS tokens)')
|
||||
val.add_argument('--min-length-val', default=0, type=int,
|
||||
help='minimum sequence length for validation \
|
||||
(including special BOS and EOS tokens)')
|
||||
val.add_argument('--val-loader-workers', default=0, type=int,
|
||||
help='number of workers for validation data loading')
|
||||
|
||||
# test
|
||||
test = parser.add_argument_group('test setup')
|
||||
test.add_argument('--test-batch-size', default=128, type=int,
|
||||
help='batch size for test')
|
||||
test.add_argument('--max-length-test', default=150, type=int,
|
||||
help='maximum sequence length for test')
|
||||
help='maximum sequence length for test \
|
||||
(including special BOS and EOS tokens)')
|
||||
test.add_argument('--min-length-test', default=0, type=int,
|
||||
help='minimum sequence length for test')
|
||||
help='minimum sequence length for test \
|
||||
(including special BOS and EOS tokens)')
|
||||
test.add_argument('--beam-size', default=5, type=int,
|
||||
help='beam size')
|
||||
test.add_argument('--len-norm-factor', default=0.6, type=float,
|
||||
|
@ -133,36 +190,50 @@ def parse_args():
|
|||
help='coverage penalty factor')
|
||||
test.add_argument('--len-norm-const', default=5.0, type=float,
|
||||
help='length normalization constant')
|
||||
test.add_argument('--target-bleu', default=None, type=float,
|
||||
help='target accuracy')
|
||||
test.add_argument('--intra-epoch-eval', default=0, type=int,
|
||||
help='evaluate within epoch')
|
||||
test.add_argument('--intra-epoch-eval', metavar='N', default=0, type=int,
|
||||
help='evaluate within training epoch, this option will \
|
||||
enable extra N equally spaced evaluations executed \
|
||||
during each training epoch')
|
||||
test.add_argument('--test-loader-workers', default=0, type=int,
|
||||
help='number of workers for test data loading')
|
||||
|
||||
# checkpointing
|
||||
checkpoint = parser.add_argument_group('checkpointing setup')
|
||||
checkpoint.add_argument('--start-epoch', default=0, type=int,
|
||||
help='manually set initial epoch counter')
|
||||
checkpoint.add_argument('--resume', default=None, type=str, metavar='PATH',
|
||||
help='resumes training from checkpoint from PATH')
|
||||
checkpoint.add_argument('--save-all', action='store_true', default=False,
|
||||
help='saves checkpoint after every epoch')
|
||||
checkpoint.add_argument('--save-freq', default=5000, type=int,
|
||||
help='save checkpoint every SAVE_FREQ batches')
|
||||
checkpoint.add_argument('--keep-checkpoints', default=0, type=int,
|
||||
help='keep only last KEEP_CHECKPOINTS checkpoints, \
|
||||
affects only checkpoints controlled by --save-freq \
|
||||
option')
|
||||
chkpt = parser.add_argument_group('checkpointing setup')
|
||||
chkpt.add_argument('--start-epoch', default=0, type=int,
|
||||
help='manually set initial epoch counter')
|
||||
chkpt.add_argument('--resume', default=None, type=str, metavar='PATH',
|
||||
help='resumes training from checkpoint from PATH')
|
||||
chkpt.add_argument('--save-all', action='store_true', default=False,
|
||||
help='saves checkpoint after every epoch')
|
||||
chkpt.add_argument('--save-freq', default=5000, type=int,
|
||||
help='save checkpoint every SAVE_FREQ batches')
|
||||
chkpt.add_argument('--keep-checkpoints', default=0, type=int,
|
||||
help='keep only last KEEP_CHECKPOINTS checkpoints, \
|
||||
affects only checkpoints controlled by --save-freq \
|
||||
option')
|
||||
|
||||
# distributed support
|
||||
# benchmarking
|
||||
benchmark = parser.add_argument_group('benchmark setup')
|
||||
benchmark.add_argument('--target-perf', default=None, type=float,
|
||||
help='target training performance (in tokens \
|
||||
per second)')
|
||||
benchmark.add_argument('--target-bleu', default=None, type=float,
|
||||
help='target accuracy')
|
||||
|
||||
# distributed
|
||||
distributed = parser.add_argument_group('distributed setup')
|
||||
distributed.add_argument('--rank', default=0, type=int,
|
||||
help='rank of the process, do not set! Done by multiproc module')
|
||||
distributed.add_argument('--world-size', default=1, type=int,
|
||||
help='number of processes, do not set! Done by multiproc module')
|
||||
distributed.add_argument('--dist-url', default='tcp://localhost:23456', type=str,
|
||||
help='url used to set up distributed training')
|
||||
help='global rank of the process, do not set!')
|
||||
distributed.add_argument('--local_rank', default=0, type=int,
|
||||
help='local rank of the process, do not set!')
|
||||
|
||||
return parser.parse_args()
|
||||
args = parser.parse_args()
|
||||
|
||||
args.warmup_steps = literal_eval(args.warmup_steps)
|
||||
args.remain_steps = literal_eval(args.remain_steps)
|
||||
args.decay_interval = literal_eval(args.decay_interval)
|
||||
|
||||
return args
|
||||
|
||||
|
||||
def build_criterion(vocab_size, padding_idx, smoothing):
|
||||
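Note on the `parse_args` changes above: paired `--flag` / `--no-flag` options are now generated by the `exclusive_group` helper, and the scheduler options are read as strings and converted with `ast.literal_eval`, so one option can carry an int, a float or `None`. The reduced parser below is a sketch of both ideas. `build_parser` and `parse` are hypothetical names, and only three of the options from the diff are reproduced.

```python
import argparse
from ast import literal_eval


def exclusive_group(group, name, default, help):
    # Creates --name / --no-name writing to the same destination.
    destname = name.replace('-', '_')
    subgroup = group.add_mutually_exclusive_group(required=False)
    subgroup.add_argument(f'--{name}', dest=destname, action='store_true',
                          help=f"{help} (use '--no-{name}' to disable)")
    subgroup.add_argument(f'--no-{name}', dest=destname,
                          action='store_false', help=argparse.SUPPRESS)
    subgroup.set_defaults(**{destname: default})


def build_parser():
    parser = argparse.ArgumentParser()
    general = parser.add_argument_group('general setup')
    exclusive_group(general, 'cuda', True, 'enables cuda')
    parser.add_argument('--remain-steps', type=str, default='0.666')
    parser.add_argument('--decay-interval', type=str, default='None')
    return parser


def parse(argv):
    args = build_parser().parse_args(argv)
    # literal_eval turns '0.666' into a float, '1000' into an int
    # and 'None' into None.
    args.remain_steps = literal_eval(args.remain_steps)
    args.decay_interval = literal_eval(args.decay_interval)
    return args
```

For example, `parse(['--no-cuda', '--remain-steps', '1000'])` yields `cuda=False` and an integer `remain_steps`, while the defaults give `cuda=True`, `remain_steps=0.666` and `decay_interval=None`.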
|
@ -178,24 +249,18 @@ def build_criterion(vocab_size, padding_idx, smoothing):
|
|||
return criterion
|
||||
|
||||
|
||||
@utils.timer('TOTAL RUNTIME', sync_gpu=False)
|
||||
def main():
|
||||
"""
|
||||
Launches data-parallel multi-gpu training.
|
||||
"""
|
||||
args = parse_args()
|
||||
device = utils.set_device(args.cuda, args.local_rank)
|
||||
distributed = utils.init_distributed(args.cuda)
|
||||
args.rank = utils.get_rank()
|
||||
|
||||
if not args.cudnn:
|
||||
torch.backends.cudnn.enabled = False
|
||||
if args.seed is not None:
|
||||
torch.manual_seed(args.seed + args.rank)
|
||||
|
||||
# initialize distributed backend
|
||||
distributed = args.world_size > 1
|
||||
if distributed:
|
||||
backend = 'nccl' if args.cuda else 'gloo'
|
||||
dist.init_process_group(backend=backend, rank=args.rank,
|
||||
init_method=args.dist_url,
|
||||
world_size=args.world_size)
|
||||
|
||||
# create directory for results
|
||||
save_path = os.path.join(args.results_dir, args.save)
|
||||
|
@@ -203,20 +268,39 @@ def main():
|
|||
os.makedirs(save_path, exist_ok=True)
|
||||
|
||||
# setup logging
|
||||
log_filename = f'log_gpu_{args.rank}.log'
|
||||
setup_logging(os.path.join(save_path, log_filename))
|
||||
log_filename = f'log_rank_{utils.get_rank()}.log'
|
||||
utils.setup_logging(os.path.join(save_path, log_filename))
|
||||
|
||||
if args.env:
|
||||
utils.log_env_info()
|
||||
|
||||
logging.info(f'Saving results to: {save_path}')
|
||||
logging.info(f'Run arguments: {args}')
|
||||
|
||||
if args.cuda:
|
||||
torch.cuda.set_device(args.rank)
|
||||
# automatically set train_iter_size based on train_global_batch_size,
|
||||
# world_size and per-worker train_batch_size
|
||||
if args.train_global_batch_size is not None:
|
||||
global_bs = args.train_global_batch_size
|
||||
bs = args.train_batch_size
|
||||
world_size = utils.get_world_size()
|
||||
assert global_bs % (bs * world_size) == 0
|
||||
args.train_iter_size = global_bs // (bs * world_size)
|
||||
logging.info(f'Global batch size was set in the config, '
|
||||
f'Setting train_iter_size to {args.train_iter_size}')
|
||||
|
||||
worker_seeds, shuffling_seeds = utils.setup_seeds(args.seed, args.epochs,
|
||||
device)
|
||||
worker_seed = worker_seeds[args.rank]
|
||||
logging.info(f'Worker {args.rank} is using worker seed: {worker_seed}')
|
||||
torch.manual_seed(worker_seed)
|
||||
|
||||
# build tokenizer
|
||||
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME))
|
||||
pad_vocab = utils.pad_vocabulary(args.math)
|
||||
tokenizer = Tokenizer(os.path.join(args.dataset_dir, config.VOCAB_FNAME),
|
||||
pad_vocab)
|
||||
|
||||
# build datasets
|
||||
train_data = ParallelDataset(
|
||||
train_data = LazyParallelDataset(
|
||||
src_fname=os.path.join(args.dataset_dir, config.SRC_TRAIN_FNAME),
|
||||
tgt_fname=os.path.join(args.dataset_dir, config.TGT_TRAIN_FNAME),
|
||||
tokenizer=tokenizer,
|
||||
|
@@ -238,45 +322,59 @@ def main():
|
|||
tokenizer=tokenizer,
|
||||
min_len=args.min_length_test,
|
||||
max_len=args.max_length_test,
|
||||
sort=False)
|
||||
sort=True)
|
||||
|
||||
vocab_size = tokenizer.vocab_size
|
||||
|
||||
# build GNMT model
|
||||
model_config = dict(vocab_size=vocab_size, math=args.math,
|
||||
**literal_eval(args.model_config))
|
||||
model = GNMT(**model_config)
|
||||
model_config = {'hidden_size': args.hidden_size,
|
||||
'num_layers': args.num_layers,
|
||||
'dropout': args.dropout, 'batch_first': False,
|
||||
'share_embedding': args.share_embedding}
|
||||
model = GNMT(vocab_size=vocab_size, **model_config)
|
||||
logging.info(model)
|
||||
|
||||
batch_first = model.batch_first
|
||||
|
||||
# define loss function (criterion) and optimizer
|
||||
criterion = build_criterion(vocab_size, config.PAD, args.smoothing)
|
||||
opt_config = literal_eval(args.optimization_config)
|
||||
logging.info(f'Training optimizer: {opt_config}')
|
||||
|
||||
opt_config = {'optimizer': args.optimizer, 'lr': args.lr}
|
||||
opt_config.update(literal_eval(args.optimizer_extra))
|
||||
logging.info(f'Training optimizer config: {opt_config}')
|
||||
|
||||
scheduler_config = {'warmup_steps': args.warmup_steps,
|
||||
'remain_steps': args.remain_steps,
|
||||
'decay_interval': args.decay_interval,
|
||||
'decay_steps': args.decay_steps,
|
||||
'decay_factor': args.decay_factor}
|
||||
|
||||
logging.info(f'Training LR schedule config: {scheduler_config}')
|
||||
|
||||
num_parameters = sum([l.nelement() for l in model.parameters()])
|
||||
logging.info(f'Number of parameters: {num_parameters}')
|
||||
|
||||
batching_opt = {'shard_size': args.shard_size,
|
||||
'num_buckets': args.num_buckets}
|
||||
# get data loaders
|
||||
train_loader = train_data.get_loader(batch_size=args.batch_size,
|
||||
train_loader = train_data.get_loader(batch_size=args.train_batch_size,
|
||||
seeds=shuffling_seeds,
|
||||
batch_first=batch_first,
|
||||
shuffle=True,
|
||||
bucketing=args.bucketing,
|
||||
num_workers=args.workers,
|
||||
drop_last=True)
|
||||
batching=args.batching,
|
||||
batching_opt=batching_opt,
|
||||
num_workers=args.train_loader_workers)
|
||||
|
||||
val_loader = val_data.get_loader(batch_size=args.val_batch_size,
|
||||
batch_first=batch_first,
|
||||
shuffle=False,
|
||||
num_workers=args.workers,
|
||||
drop_last=False)
|
||||
num_workers=args.val_loader_workers)
|
||||
|
||||
test_loader = test_data.get_loader(batch_size=args.test_batch_size,
|
||||
batch_first=batch_first,
|
||||
shuffle=False,
|
||||
num_workers=args.workers,
|
||||
drop_last=False)
|
||||
pad=True,
|
||||
num_workers=args.test_loader_workers)
|
||||
|
||||
translator = Translator(model=model,
|
||||
tokenizer=tokenizer,
|
||||
|
@@ -293,13 +391,19 @@ def main():
|
|||
save_path=args.save_path)
|
||||
|
||||
# create trainer
|
||||
total_train_iters = len(train_loader) // args.train_iter_size * args.epochs
|
||||
save_info = {'model_config': model_config, 'config': args, 'tokenizer':
|
||||
tokenizer.get_state()}
|
||||
trainer_options = dict(
|
||||
criterion=criterion,
|
||||
grad_clip=args.grad_clip,
|
||||
iter_size=args.train_iter_size,
|
||||
save_path=save_path,
|
||||
save_freq=args.save_freq,
|
||||
save_info={'config': args, 'tokenizer': tokenizer},
|
||||
save_info=save_info,
|
||||
opt_config=opt_config,
|
||||
scheduler_config=scheduler_config,
|
||||
train_iterations=total_train_iters,
|
||||
batch_first=batch_first,
|
||||
keep_checkpoints=args.keep_checkpoints,
|
||||
math=args.math,
|
||||
|
@@ -325,47 +429,60 @@ def main():
|
|||
|
||||
# training loop
|
||||
best_loss = float('inf')
|
||||
break_training = False
|
||||
test_bleu = None
|
||||
for epoch in range(args.start_epoch, args.epochs):
|
||||
logging.info(f'Starting epoch {epoch}')
|
||||
|
||||
if distributed:
|
||||
train_loader.sampler.set_epoch(epoch)
|
||||
train_loader.sampler.set_epoch(epoch)
|
||||
|
||||
trainer.epoch = epoch
|
||||
train_loss, train_perf = trainer.optimize(train_loader)
|
||||
|
||||
# evaluate on validation set
|
||||
if args.rank == 0 and not args.disable_eval:
|
||||
if args.eval:
|
||||
logging.info(f'Running validation on dev set')
|
||||
val_loss, val_perf = trainer.evaluate(val_loader)
|
||||
|
||||
# remember best prec@1 and save checkpoint
|
||||
is_best = val_loss < best_loss
|
||||
best_loss = min(val_loss, best_loss)
|
||||
trainer.save(save_all=args.save_all, is_best=is_best)
|
||||
if args.rank == 0:
|
||||
is_best = val_loss < best_loss
|
||||
best_loss = min(val_loss, best_loss)
|
||||
trainer.save(save_all=args.save_all, is_best=is_best)
|
||||
|
||||
break_training = False
|
||||
if not args.disable_eval:
|
||||
if args.eval:
|
||||
utils.barrier()
|
||||
test_bleu, break_training = translator.run(calc_bleu=True,
|
||||
epoch=epoch)
|
||||
|
||||
if args.rank == 0 and not args.disable_eval:
|
||||
logging.info(f'Summary: Epoch: {epoch}\t'
|
||||
f'Training Loss: {train_loss:.4f}\t'
|
||||
f'Validation Loss: {val_loss:.4f}\t'
|
||||
f'Test BLEU: {test_bleu:.2f}')
|
||||
logging.info(f'Performance: Epoch: {epoch}\t'
|
||||
f'Training: {train_perf:.0f} Tok/s\t'
|
||||
f'Validation: {val_perf:.0f} Tok/s')
|
||||
else:
|
||||
logging.info(f'Summary: Epoch: {epoch}\t'
|
||||
f'Training Loss {train_loss:.4f}')
|
||||
logging.info(f'Performance: Epoch: {epoch}\t'
|
||||
f'Training: {train_perf:.0f} Tok/s')
|
||||
acc_log = []
|
||||
acc_log += [f'Summary: Epoch: {epoch}']
|
||||
acc_log += [f'Training Loss: {train_loss:.4f}']
|
||||
if args.eval:
|
||||
acc_log += [f'Validation Loss: {val_loss:.4f}']
|
||||
acc_log += [f'Test BLEU: {test_bleu:.2f}']
|
||||
|
||||
perf_log = []
|
||||
perf_log += [f'Performance: Epoch: {epoch}']
|
||||
perf_log += [f'Training: {train_perf:.0f} Tok/s']
|
||||
if args.eval:
|
||||
perf_log += [f'Validation: {val_perf:.0f} Tok/s']
|
||||
|
||||
if args.rank == 0:
|
||||
logging.info('\t'.join(acc_log))
|
||||
logging.info('\t'.join(perf_log))
|
||||
|
||||
logging.info(f'Finished epoch {epoch}')
|
||||
if break_training:
|
||||
break
|
||||
|
||||
utils.barrier()
|
||||
passed = utils.benchmark(test_bleu, args.target_bleu,
|
||||
train_perf, args.target_perf)
|
||||
return passed
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
passed = main()
|
||||
if not passed:
|
||||
sys.exit(1)
|
||||
|
|
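Note on the train.py changes above: when `--train-global-batch-size` is set, the new code derives `train_iter_size` (the number of per-worker batches accumulated per optimizer step) from the global batch size, the per-worker batch size and the world size. A minimal sketch of that arithmetic follows. The helper name `compute_train_iter_size` is hypothetical and does not exist in the repository, and it raises `ValueError` where the diff uses a bare `assert`.

```python
def compute_train_iter_size(global_bs, per_worker_bs, world_size):
    """Hypothetical helper mirroring the arithmetic in the train.py hunk.

    Returns how many per-worker batches have to be accumulated so that
    one optimizer step covers `global_bs` samples across all workers.
    """
    samples_per_iteration = per_worker_bs * world_size
    # The diff asserts divisibility; this sketch raises an explicit error.
    if global_bs % samples_per_iteration != 0:
        raise ValueError(
            f'global batch size {global_bs} is not a multiple of '
            f'{per_worker_bs} * {world_size}')
    return global_bs // samples_per_iteration


if __name__ == '__main__':
    # 1024 samples per step, 128 per worker, 4 workers: 2 accumulation steps
    print(compute_train_iter_size(1024, 128, 4))
```

For example, a global batch of 1024 with 128 samples per worker on 4 workers gives 1024 // (128 * 4) = 2 accumulated iterations per optimizer step.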
224
translate.py
|
@@ -1,39 +1,61 @@
|
|||
#!/usr/bin/env python
|
||||
import logging
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import warnings
|
||||
from ast import literal_eval
|
||||
from itertools import product
|
||||
|
||||
import torch
|
||||
import torch.distributed as dist
|
||||
|
||||
from seq2seq.models.gnmt import GNMT
|
||||
from seq2seq.inference.inference import Translator
|
||||
import seq2seq.utils as utils
|
||||
from seq2seq.data.dataset import TextDataset
|
||||
from seq2seq.data.tokenizer import Tokenizer
|
||||
from seq2seq.inference.inference import Translator
|
||||
from seq2seq.models.gnmt import GNMT
|
||||
from seq2seq.utils import setup_logging
|
||||
|
||||
|
||||
def parse_args():
|
||||
"""
|
||||
Parse commandline arguments.
|
||||
"""
|
||||
parser = argparse.ArgumentParser(description='GNMT Translate',
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
# data
|
||||
def exclusive_group(group, name, default, help):
|
||||
destname = name.replace('-', '_')
|
||||
subgroup = group.add_mutually_exclusive_group(required=False)
|
||||
subgroup.add_argument(f'--{name}', dest=f'{destname}',
|
||||
action='store_true',
|
||||
help=f'{help} (use \'--no-{name}\' to disable)')
|
||||
subgroup.add_argument(f'--no-{name}', dest=f'{destname}',
|
||||
action='store_false', help=argparse.SUPPRESS)
|
||||
subgroup.set_defaults(**{destname: default})
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description='GNMT Translate',
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||
|
||||
# dataset
|
||||
dataset = parser.add_argument_group('data setup')
|
||||
dataset.add_argument('--dataset-dir', default='data/wmt16_de_en/',
|
||||
help='path to directory with training/validation data')
|
||||
help='path to directory with training/test data')
|
||||
dataset.add_argument('-i', '--input', required=True,
|
||||
help='full path to the input file (tokenized)')
|
||||
dataset.add_argument('-o', '--output', required=True,
|
||||
help='full path to the output file (tokenized)')
|
||||
dataset.add_argument('-r', '--reference', default=None,
|
||||
help='full path to the reference file (for sacrebleu)')
|
||||
help='full path to the file with reference \
|
||||
translations (for sacrebleu)')
|
||||
dataset.add_argument('-m', '--model', required=True,
|
||||
help='full path to the model checkpoint file')
|
||||
exclusive_group(group=dataset, name='sort', default=True,
|
||||
help='sorts dataset by sequence length')
|
||||
|
||||
# parameters
|
||||
params = parser.add_argument_group('inference setup')
|
||||
params.add_argument('--batch-size', default=128, type=int,
|
||||
help='batch size')
|
||||
params.add_argument('--beam-size', default=5, type=int,
|
||||
params.add_argument('--batch-size', nargs='+', default=[128], type=int,
|
||||
help='batch size per GPU')
|
||||
params.add_argument('--beam-size', nargs='+', default=[5], type=int,
|
||||
help='beam size')
|
||||
params.add_argument('--max-seq-len', default=80, type=int,
|
||||
help='maximum generated sequence length')
|
||||
|
@@ -45,16 +67,18 @@ def parse_args():
|
|||
help='length normalization constant')
|
||||
# general setup
|
||||
general = parser.add_argument_group('general setup')
|
||||
general.add_argument('--math', default='fp16', choices=['fp32', 'fp16'],
|
||||
help='arithmetic type')
|
||||
general.add_argument('--math', nargs='+', default=['fp16'],
|
||||
choices=['fp16', 'fp32'], help='arithmetic type')
|
||||
|
||||
bleu_parser = general.add_mutually_exclusive_group(required=False)
|
||||
bleu_parser.add_argument('--bleu', dest='bleu', action='store_true',
|
||||
help='compares with reference and computes BLEU \
|
||||
(use \'--no-bleu\' to disable)')
|
||||
bleu_parser.add_argument('--no-bleu', dest='bleu', action='store_false',
|
||||
help=argparse.SUPPRESS)
|
||||
bleu_parser.set_defaults(bleu=True)
|
||||
exclusive_group(group=general, name='env', default=True,
|
||||
help='print info about execution env')
|
||||
exclusive_group(group=general, name='bleu', default=True,
|
||||
help='compares with reference translation and computes \
|
||||
BLEU')
|
||||
exclusive_group(group=general, name='cuda', default=True,
|
||||
help='enables cuda')
|
||||
exclusive_group(group=general, name='cudnn', default=True,
|
||||
help='enables cudnn')
|
||||
|
||||
batch_first_parser = general.add_mutually_exclusive_group(required=False)
|
||||
batch_first_parser.add_argument('--batch-first', dest='batch_first',
|
||||
|
@@ -67,62 +91,27 @@ def parse_args():
|
|||
format for RNNs')
|
||||
batch_first_parser.set_defaults(batch_first=True)
|
||||
|
||||
cuda_parser = general.add_mutually_exclusive_group(required=False)
|
||||
cuda_parser.add_argument('--cuda', dest='cuda', action='store_true',
|
||||
help='enables cuda (use \'--no-cuda\' to disable)')
|
||||
cuda_parser.add_argument('--no-cuda', dest='cuda', action='store_false',
|
||||
help=argparse.SUPPRESS)
|
||||
cuda_parser.set_defaults(cuda=True)
|
||||
|
||||
cudnn_parser = general.add_mutually_exclusive_group(required=False)
|
||||
cudnn_parser.add_argument('--cudnn', dest='cudnn', action='store_true',
|
||||
help='enables cudnn (use \'--no-cudnn\' to disable)')
|
||||
cudnn_parser.add_argument('--no-cudnn', dest='cudnn', action='store_false',
|
||||
help=argparse.SUPPRESS)
|
||||
cudnn_parser.set_defaults(cudnn=True)
|
||||
|
||||
general.add_argument('--print-freq', '-p', default=1, type=int,
|
||||
help='print log every PRINT_FREQ batches')
|
||||
|
||||
# distributed
|
||||
distributed = parser.add_argument_group('distributed setup')
|
||||
distributed.add_argument('--rank', default=0, type=int,
|
||||
help='global rank of the process, do not set!')
|
||||
distributed.add_argument('--local_rank', default=0, type=int,
|
||||
help='local rank of the process, do not set!')
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.bleu and args.reference is None:
|
||||
parser.error('--bleu requires --reference')
|
||||
|
||||
if 'fp16' in args.math and not args.cuda:
|
||||
parser.error('--math fp16 requires --cuda')
|
||||
|
||||
return args
|
||||
|
||||
|
||||
def checkpoint_from_distributed(state_dict):
|
||||
"""
|
||||
Checks whether checkpoint was generated by DistributedDataParallel. DDP
|
||||
wraps model in additional "module.", it needs to be unwrapped for single
|
||||
GPU inference.
|
||||
|
||||
:param state_dict: model's state dict
|
||||
"""
|
||||
ret = False
|
||||
for key, _ in state_dict.items():
|
||||
if key.find('module.') != -1:
|
||||
ret = True
|
||||
break
|
||||
return ret
|
||||
|
||||
|
||||
def unwrap_distributed(state_dict):
|
||||
"""
|
||||
Unwraps model from DistributedDataParallel.
|
||||
DDP wraps model in additional "module.", it needs to be removed for single
|
||||
GPU inference.
|
||||
|
||||
:param state_dict: model's state dict
|
||||
"""
|
||||
new_state_dict = {}
|
||||
for key, value in state_dict.items():
|
||||
new_key = key.replace('module.', '')
|
||||
new_state_dict[new_key] = value
|
||||
return new_state_dict
|
||||
|
||||
|
||||
def main():
|
||||
"""
|
||||
Launches translation (inference).
|
||||
|
@@ -130,26 +119,17 @@ def main():
|
|||
with length normalization and coverage penalty.
|
||||
"""
|
||||
args = parse_args()
|
||||
utils.set_device(args.cuda, args.local_rank)
|
||||
utils.init_distributed(args.cuda)
|
||||
setup_logging()
|
||||
|
||||
logging.basicConfig(level=logging.DEBUG,
|
||||
format="%(asctime)s - %(levelname)s - %(message)s",
|
||||
datefmt="%Y-%m-%d %H:%M:%S",
|
||||
filename='log.log',
|
||||
filemode='w')
|
||||
console = logging.StreamHandler()
|
||||
console.setLevel(logging.INFO)
|
||||
formatter = logging.Formatter('%(message)s')
|
||||
console.setFormatter(formatter)
|
||||
logging.getLogger('').addHandler(console)
|
||||
if args.env:
|
||||
utils.log_env_info()
|
||||
|
||||
logging.info(args)
|
||||
logging.info(f'Run arguments: {args}')
|
||||
|
||||
if args.cuda:
|
||||
torch.cuda.set_device(0)
|
||||
if not args.cuda and torch.cuda.is_available():
|
||||
warnings.warn('cuda is available but not enabled')
|
||||
if args.math == 'fp16' and not args.cuda:
|
||||
raise RuntimeError('fp16 requires cuda')
|
||||
if not args.cudnn:
|
||||
torch.backends.cudnn.enabled = False
|
||||
|
||||
|
@@ -157,57 +137,57 @@ def main():
|
|||
checkpoint = torch.load(args.model, map_location={'cuda:0': 'cpu'})
|
||||
|
||||
# build GNMT model
|
||||
tokenizer = checkpoint['tokenizer']
|
||||
tokenizer = Tokenizer()
|
||||
tokenizer.set_state(checkpoint['tokenizer'])
|
||||
vocab_size = tokenizer.vocab_size
|
||||
model_config = dict(vocab_size=vocab_size, math=checkpoint['config'].math,
|
||||
**literal_eval(checkpoint['config'].model_config))
|
||||
model_config = checkpoint['model_config']
|
||||
model_config['batch_first'] = args.batch_first
|
||||
model = GNMT(**model_config)
|
||||
model = GNMT(vocab_size=vocab_size, **model_config)
|
||||
model.load_state_dict(checkpoint['state_dict'])
|
||||
|
||||
state_dict = checkpoint['state_dict']
|
||||
if checkpoint_from_distributed(state_dict):
|
||||
state_dict = unwrap_distributed(state_dict)
|
||||
for (math, batch_size, beam_size) in product(args.math, args.batch_size,
|
||||
args.beam_size):
|
||||
logging.info(f'math: {math}, batch size: {batch_size}, '
|
||||
f'beam size: {beam_size}')
|
||||
if math == 'fp32':
|
||||
dtype = torch.FloatTensor
|
||||
if math == 'fp16':
|
||||
dtype = torch.HalfTensor
|
||||
model.type(dtype)
|
||||
|
||||
model.load_state_dict(state_dict)
|
||||
if args.cuda:
|
||||
model = model.cuda()
|
||||
model.eval()
|
||||
|
||||
if args.math == 'fp32':
|
||||
dtype = torch.FloatTensor
|
||||
if args.math == 'fp16':
|
||||
dtype = torch.HalfTensor
|
||||
# construct the dataset
|
||||
test_data = TextDataset(src_fname=args.input,
|
||||
tokenizer=tokenizer,
|
||||
sort=args.sort)
|
||||
|
||||
model.type(dtype)
|
||||
if args.cuda:
|
||||
model = model.cuda()
|
||||
model.eval()
|
||||
# build the data loader
|
||||
test_loader = test_data.get_loader(batch_size=batch_size,
|
||||
batch_first=args.batch_first,
|
||||
shuffle=False,
|
||||
pad=True,
|
||||
num_workers=0)
|
||||
|
||||
# construct the dataset
|
||||
test_data = TextDataset(src_fname=args.input,
|
||||
tokenizer=tokenizer,
|
||||
sort=False)
|
||||
# build the translator object
|
||||
translator = Translator(model=model,
|
||||
tokenizer=tokenizer,
|
||||
loader=test_loader,
|
||||
beam_size=beam_size,
|
||||
max_seq_len=args.max_seq_len,
|
||||
len_norm_factor=args.len_norm_factor,
|
||||
len_norm_const=args.len_norm_const,
|
||||
cov_penalty_factor=args.cov_penalty_factor,
|
||||
cuda=args.cuda,
|
||||
print_freq=args.print_freq,
|
||||
dataset_dir=args.dataset_dir)
|
||||
|
||||
# build the data loader
|
||||
test_loader = test_data.get_loader(batch_size=args.batch_size,
|
||||
batch_first=args.batch_first,
|
||||
shuffle=False,
|
||||
num_workers=0,
|
||||
drop_last=False)
|
||||
# execute the inference
|
||||
translator.run(calc_bleu=args.bleu, eval_path=args.output,
|
||||
reference_path=args.reference, summary=True)
|
||||
|
||||
# build the translator object
|
||||
translator = Translator(model=model,
|
||||
tokenizer=tokenizer,
|
||||
loader=test_loader,
|
||||
beam_size=args.beam_size,
|
||||
max_seq_len=args.max_seq_len,
|
||||
len_norm_factor=args.len_norm_factor,
|
||||
len_norm_const=args.len_norm_const,
|
||||
cov_penalty_factor=args.cov_penalty_factor,
|
||||
cuda=args.cuda,
|
||||
print_freq=args.print_freq,
|
||||
dataset_dir=args.dataset_dir)
|
||||
|
||||
# execute the inference
|
||||
translator.run(calc_bleu=args.bleu, eval_path=args.output,
|
||||
reference_path=args.reference, summary=True)
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
||||
|
|
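Note on the translate.py changes above: the `checkpoint_from_distributed` and `unwrap_distributed` helpers shown in this diff detect and strip the `module.` prefix that DistributedDataParallel adds to parameter names, using `key.find('module.')` and `key.replace('module.', '')`, which match the substring anywhere in the key. The sketch below is a stricter variant, not the code from this commit: it only looks at a leading `module.` prefix. Plain integers stand in for tensors, and the example key names are made up for illustration.

```python
PREFIX = 'module.'


def checkpoint_from_distributed(state_dict):
    """Returns True if any key carries the leading DDP 'module.' prefix."""
    return any(key.startswith(PREFIX) for key in state_dict)


def unwrap_distributed(state_dict):
    """Strips only the leading 'module.' prefix from every key."""
    return {
        (key[len(PREFIX):] if key.startswith(PREFIX) else key): value
        for key, value in state_dict.items()
    }


if __name__ == '__main__':
    # Hypothetical key names; integers stand in for parameter tensors.
    wrapped = {'module.encoder.weight': 1, 'module.submodule.bias': 2}
    if checkpoint_from_distributed(wrapped):
        wrapped = unwrap_distributed(wrapped)
    print(sorted(wrapped))
```

The difference only matters for keys that contain `module.` somewhere other than at the start: with `str.replace`, a key such as `module.submodule.bias` would become `subbias`, whereas the prefix-only variant yields `submodule.bias`. Whether such keys occur in GNMT checkpoints is not stated in the diff.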